====== Brazilian population pyramids ======
January 28, 2015
===== Locating the data source =====
The main source of data on Brazil is the [[http://www.ibge.gov.br/|Brazilian Institute of Geography and Statistics]] (IBGE). The population pyramids data are likely contained in the results of [[http://www.ibge.gov.br/english/estatistica/populacao/censo2010/default.shtm|Census 2010]] and/or [[http://www.ibge.gov.br/english/estatistica/populacao/censo2000/default.shtm|Census 2000]].
We would like to get the pyramids at the level of [[http://en.wikipedia.org/wiki/Municipalities_of_Brazil|municipalities]], of which there are currently 5570. After some searching we arrive at the pages describing the population characteristics of municipalities within a selected region. Here is the page for the region of Sao Paulo [[http://www.cidades.ibge.gov.br/xtras/temas.php?lang=&codmun=355030&idtema=90&search=sao-paulo|sao-paulo|censo-demografico-2010:-resultados-da-amostra-caracteristicas-da-populacao-]] and here the page for its municipality Alambari [[http://www.cidades.ibge.gov.br/xtras/temas.php?lang=&codmun=350075&idtema=90&search=sao-paulo|alambari|censo-demografico-2010:-resultados-da-amostra-caracteristicas-da-populacao-]].
We were not able to find a way to download the complete data set, or even the data set for a single region. Here are the URL for displaying the data on a selected municipality (broken into two lines) and the URL for downloading the corresponding CSV file:
http://www.cidades.ibge.gov.br/xtras/temas.php?lang=&codmun=350010&idtema=90&search=
sao-paulo|adamantina|censo-demografico-2010:-resultados-da-amostra-caracteristicas-da-populacao-
http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=350010
To download the data about all municipalities we need their ''codmun'' numbers.
===== Downloading =====
Inspecting the region's page source http://www.cidades.ibge.gov.br/xtras/uf.php?lang=&coduf=12&search=acre (example for AC) for different regions AC, AL, ..., SP, TO we notice that the line 220 following the line ''
m1200013;AC;120001;acre|acrelandia;Acrelândia
m1200054;AC;120005;acre|assis-brasil;Assis Brasil
m1200104;AC;120010;acre|brasileia;Brasiléia
...
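As an illustration, lines of this shape can be read into a data frame directly from text. This is only a sketch: the column labels ''id'' and ''search'' are hypothetical, while ''FUnit'', ''codmuni'' and ''name'' follow the way ''brazilMunic.csv'' is used in the programs on this page; the actual header of that file may differ.

```r
# Three sample lines copied from the list above
lines <- c(
  "m1200013;AC;120001;acre|acrelandia;Acrelândia",
  "m1200054;AC;120005;acre|assis-brasil;Assis Brasil",
  "m1200104;AC;120010;acre|brasileia;Brasiléia")

# Column names are assumptions; 'id' and 'search' are hypothetical labels
munic <- read.csv(text = lines, sep = ";", header = FALSE, quote = "",
                  col.names = c("id", "FUnit", "codmuni", "search", "name"),
                  stringsAsFactors = FALSE)

# The six-digit code in the third field is the one used in the URLs
munic$codmuni
```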
The lists obtained in this way were joined into a single CSV file ''brazilMunic.csv''. Using the ''codmun'' values from this file, it is easy to generate the download URLs for the municipalities in R:
> part <- 'http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun='
> code <- 270010
> data <- paste(part,code,sep='')
> data
[1] "http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=270010"
We combine this into a downloading program:
setwd('C:/Users/batagelj/Downloads/data/brazil/DL')
# municipality codes from the joined list
L <- as.vector(read.csv('brazilMunic.csv',header=TRUE,sep=';')$codmuni)
part <- 'http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun='
for(fn in L){
  # print a progress line with the code and the current time
  fname <- paste(part,fn,sep=""); cat("---",fn,date(),"\n")
  save <- paste('./muni/',fn,'.csv',sep='')
  # catch a failed download so that the loop continues with the next code
  test <- tryCatch(download.file(fname,save,method="internal"),error=function(e) e)
}
date()
Here is a part of the printed trace of its execution:
--- 432149 Wed Jan 28 03:19:53 2015
trying URL 'http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=432149'
Content type 'application/csv' length unknown
opened URL
downloaded 20 Kb
...
--- 172210 Wed Jan 28 03:39:34 2015
trying URL 'http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=172210'
Content type 'application/csv' length unknown
opened URL
downloaded 20 Kb
Warning messages:
1: In download.file(fname, save, method = "internal") :
cannot open: HTTP status was '502 Proxy Error'
2: In download.file(fname, save, method = "internal") :
cannot open: HTTP status was '502 Proxy Error'
3: In download.file(fname, save, method = "internal") :
cannot open: HTTP status was '502 Proxy Error'
4: In download.file(fname, save, method = "internal") :
cannot open: HTTP status was '502 Proxy Error'
5: In download.file(fname, save, method = "internal") :
cannot open: HTTP status was '502 Proxy Error'
6: In download.file(fname, save, method = "internal") :
cannot open: HTTP status was '502 Proxy Error'
7: In download.file(fname, save, method = "internal") :
cannot open: HTTP status was '502 Proxy Error'
> date()
[1] "Wed Jan 28 03:39:35 2015"
There were some problems with the download. We inspected the directory listing sorted by file size:
{{notes:pics:files.png?450}}
The empty files were downloaded again manually from the corresponding URLs:
http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=261020
http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=261030
http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=261040
http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=261050
http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=261060
http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=261070
http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=261080
This time the downloads succeeded, and we replaced the empty files with the new ones.
The timestamps in the directory listing also show that the download started at 2:10 and finished at 3:40.
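The manual step could probably also be scripted. The following is a minimal sketch, not the procedure actually used here: the function name ''redownload_empty'' and its ''fetch'' argument are hypothetical, and it assumes that every file in the directory is named ''<codmun>.csv''.

```r
# Re-download every zero-length CSV file in a directory.
# 'fetch' defaults to download.file; it is a parameter only so that the
# function can be tried out without network access.
redownload_empty <- function(dir, part, fetch = download.file) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  empty <- files[file.info(files)$size == 0]
  for (f in empty) {
    code <- sub("\\.csv$", "", basename(f))   # file name is <codmun>.csv
    tryCatch(fetch(paste(part, code, sep = ""), f),
             error = function(e) message("failed again: ", code))
  }
  # return the files that are still empty after this pass
  files[file.info(files)$size == 0]
}

# Possible use, with 'part' defined as in the download program:
# redownload_empty('./muni', part)
```

Because the function returns the files that are still empty, it can be called repeatedly until the result has length zero.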
The directory ''muni'' containing all data files is available as {{data:zip:muni.zip|muni.zip}}.
===== Making a data.frame =====
The difficulty in transforming the CSV files into a summary data.frame is that the integers in them are written with a dot as the thousands separator, which is interpreted as a decimal point when reading. We bypass this by reading the numbers as strings, removing the dots and converting the strings to integers.
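The conversion can be illustrated in isolation. The helper name ''to_int'' and the sample values are made up for this example and are not taken from the IBGE files.

```r
# Read as a number, the dot is taken for a decimal point
as.numeric("1.234")              # 1.234, not the intended 1234

# Remove the dots from the strings first, then convert to integer
to_int <- function(x) as.integer(gsub(".", "", x, fixed = TRUE))
to_int(c("20", "1.234", "11.253.503"))   # gives 20, 1234 and 11253503
```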
setwd('C:/Users/batagelj/Downloads/data/brazil/DL')
cat('Brazilian municipalities\n',date(),'\n')
cn <- c(
"M 0-4 ", "M 5-9 ", "M10-14", "M15-19", "M20-24", "M25-29", "M30-39", "M40-49",
"M50-59", "M60-69", "M70- ", "F 0-4 ", "F 5-9 ", "F10-14", "F15-19", "F20-24",
"F25-29", "F30-39", "F40-49", "F50-59", "F60-69", "F70- " )
D <- read.csv('brazilMunic.csv',header=TRUE,sep=';',encoding='UTF-8')
L <- as.vector(D$codmuni); nam <- as.vector(D$name); reg <- as.vector(D$FUnit)
M <- matrix(nrow = 5570, ncol = 22)
colnames(M) <- cn
code <- vector("integer",5570)
region <- vector("character",5570)
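k <- 0
for(fn in L){
  k <- k+1
  code[k] <- fn; region[k] <- reg[k]
  fname <- paste('./muni/',fn,'.csv',sep="")
  # read everything as strings so that "1.234" is not parsed as a decimal number
  A <- read.csv(fname,sep=';',skip=3,header=FALSE,colClasses="character")
  # rows 34:44 fill the 11 male columns, rows 67:77 the 11 female columns
  M[k,1:11] <- as.integer(gsub('\\.','',A$V2[34:44]))
  M[k,12:22] <- as.integer(gsub('\\.','',A$V2[67:77]))
}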
cat(date(),'\n')
F <- data.frame(nam,region,M)
rownames(F) <- L
save(F,file="Brazil.Rdata")
Running the program prints:
Brazilian municipalities
Wed Jan 28 19:24:21 2015
Warning messages:
1: NAs introduced by coercion
2: NAs introduced by coercion
3: NAs introduced by coercion
4: NAs introduced by coercion
5: NAs introduced by coercion
6: NAs introduced by coercion
7: NAs introduced by coercion
8: NAs introduced by coercion
9: NAs introduced by coercion
10: NAs introduced by coercion
Wed Jan 28 19:24:38 2015
Some of the data are problematic:
> ok <- complete.cases(F)
> sum(ok)
[1] 5565
> F[!ok,]
nam region M.0.4. M.5.9. M10.14 M15.19 M20.24 M25.29 M30.39
500627 Paraíso das Águas MS NA NA NA NA NA NA NA
150475 Mojuí dos Campos PA NA NA NA NA NA NA NA
431454 Pinto Bandeira RS NA NA NA NA NA NA NA
422000 Balneário Rincao SC NA NA NA NA NA NA NA
421265 Pescaria Brava SC NA NA NA NA NA NA NA
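We see that the data are missing for the five listed municipalities; the corresponding CSV files contain fields without values. Each of these files triggers one coercion warning for the male and one for the female age groups, which would account for the ten warnings above.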
The saved data.frame is available in {{data:zip:brazil.zip}}.