January 28, 2015
The main source of data on Brazil is the Brazilian Institute of Geography and Statistics (IBGE). The population pyramids data are likely contained in the results of Census 2010 and/or Census 2000.
We would like to get the pyramids at the level of municipalities. Currently there are 5570 municipalities. After some searching we arrive at the pages on the population characteristics of municipalities within a selected region. Here is the page for the region of Sao Paulo (sao-paulo|censo-demografico-2010:-resultados-da-amostra-caracteristicas-da-populacao-) and here the page for its municipality Alambari (alambari|censo-demografico-2010:-resultados-da-amostra-caracteristicas-da-populacao-).
We were not able to find a way to download the complete data set, or even the data set for a single region. Here are the URL for displaying the data on a selected municipality (broken into two lines) and the URL for downloading the corresponding CSV file.
http://www.cidades.ibge.gov.br/xtras/temas.php?lang=&codmun=350010&idtema=90&search=
sao-paulo|adamantina|censo-demografico-2010:-resultados-da-amostra-caracteristicas-da-populacao-
http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=350010
To download the data about all municipalities we need their codmun numbers.
Inspecting the page source of a region, http://www.cidades.ibge.gov.br/xtras/uf.php?lang=&coduf=12&search=acre (example for AC), for the different regions AC, AL, …, SP, TO, we notice that line 220, following the line <ul id="lista_municipios">, contains a complete list of the codmuns of the municipalities in the selected region. Using TextPad we extracted line 220 for each region and transformed it into a cleaned list.
m1200013;AC;120001;acre|acrelandia;Acrelândia
m1200054;AC;120005;acre|assis-brasil;Assis Brasil
m1200104;AC;120010;acre|brasileia;Brasiléia
...
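Records of this form can be read directly in R. The following is a minimal sketch, assuming the five fields are separated by semicolons as in the sample above. The column names id and path are hypothetical; FUnit, codmuni, and name are the names used by the scripts further down, but the actual header of the file is not shown here.

```r
# Three sample records copied from the cleaned list.
records <- c(
  "m1200013;AC;120001;acre|acrelandia;Acrelândia",
  "m1200054;AC;120005;acre|assis-brasil;Assis Brasil",
  "m1200104;AC;120010;acre|brasileia;Brasiléia"
)

# Column names are assumptions; the real header of the file is not shown.
cols <- c("id", "FUnit", "codmuni", "path", "name")

# Read everything as character so that codes are not altered on input.
munic <- read.csv(text = records, sep = ";", header = FALSE,
                  col.names = cols, colClasses = "character",
                  encoding = "UTF-8")

# In these samples the codmun equals the id without the leading "m"
# and without its last digit.
derived <- substr(munic$id, 2, 7)
print(data.frame(munic[, c("FUnit", "codmuni")], derived))
```

In the three sample records the derived value agrees with the third field, which gives a cheap consistency check on the hand-cleaned list.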
The lists obtained in this way were joined into a single CSV file brazilMunic.csv. Using the codmuns from this file it is easy to generate the download URLs for municipalities in R:
> part <- 'http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun='
> data = paste(part,code,sep='')
> data
[1] "http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=270010"
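Because paste works element-wise, the same expression yields all URLs at once when it is given a vector of codes. A small sketch, using three codes that appear on this page as stand-ins for the full column of codmuns (the variable names codes and urls are illustrative):

```r
part <- 'http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun='

# Example codes taken from this page; integer literals keep the
# conversion to text free of scientific notation.
codes <- c(350010L, 270010L, 432149L)

# paste0(x, y) is shorthand for paste(x, y, sep = '').
urls <- paste0(part, codes)
print(urls)
```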
We combine this into a downloading program:
setwd('C:/Users/batagelj/Downloads/data/brazil/DL')
# municipality codes from the joined list
L <- as.vector(read.csv('brazilMunic.csv',header=TRUE,sep=';')$codmuni)
part <- 'http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun='
for(fn in L){
  fname <- paste(part,fn,sep=""); cat("---",fn,date(),"\n")
  save <- paste('./muni/',fn,'.csv',sep='')
  # catch errors so that a failed download does not stop the loop
  test <- tryCatch(download.file(fname,save,method="internal"),error=function(e) e)
}
date()
Here is a part of the printed trace of its execution:
--- 432149 Wed Jan 28 03:19:53 2015
trying URL 'http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=432149'
Content type 'application/csv' length unknown
opened URL
downloaded 20 Kb
...
--- 172210 Wed Jan 28 03:39:34 2015
trying URL 'http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=172210'
Content type 'application/csv' length unknown
opened URL
downloaded 20 Kb
Warning messages:
1: In download.file(fname, save, method = "internal") :
  cannot open: HTTP status was '502 Proxy Error'
2: In download.file(fname, save, method = "internal") :
  cannot open: HTTP status was '502 Proxy Error'
3: In download.file(fname, save, method = "internal") :
  cannot open: HTTP status was '502 Proxy Error'
4: In download.file(fname, save, method = "internal") :
  cannot open: HTTP status was '502 Proxy Error'
5: In download.file(fname, save, method = "internal") :
  cannot open: HTTP status was '502 Proxy Error'
6: In download.file(fname, save, method = "internal") :
  cannot open: HTTP status was '502 Proxy Error'
7: In download.file(fname, save, method = "internal") :
  cannot open: HTTP status was '502 Proxy Error'
> date()
[1] "Wed Jan 28 03:39:35 2015"
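The 502 responses in the trace look like transient server-side failures, so one possible refinement is to retry a failed download a few times before giving up. The following is only a sketch of that idea, not part of the program that was actually run; the function download_with_retry, its parameters tries, wait, and downloader, and the stand-in flaky are all hypothetical names introduced for illustration.

```r
# Try a download up to `tries` times, pausing `wait` seconds between attempts.
# `downloader` defaults to download.file but can be replaced for testing.
download_with_retry <- function(url, dest, tries = 3, wait = 5,
                                downloader = download.file) {
  for (attempt in seq_len(tries)) {
    status <- tryCatch(
      downloader(url, dest),
      error   = function(e) 1L,  # an error counts as a failed attempt
      warning = function(w) 1L   # so does a warning, as in the trace
    )
    # Success means status 0 and a non-empty file on disk.
    if (identical(as.integer(status), 0L) &&
        file.exists(dest) && file.size(dest) > 0) {
      return(TRUE)
    }
    if (attempt < tries) Sys.sleep(wait)
  }
  FALSE
}

# Demonstration without network access: a stand-in downloader that
# fails twice and then writes a small file.
calls <- 0
flaky <- function(url, dest) {
  calls <<- calls + 1
  if (calls < 3) stop("simulated 502 Proxy Error")
  writeLines("ok", dest)
  0L
}
tmp <- tempfile(fileext = ".csv")
ok <- download_with_retry("http://example.invalid/x.csv", tmp,
                          wait = 0, downloader = flaky)
print(ok)
```

Passing the downloader as an argument keeps the retry logic testable offline; in real use the default download.file would be called with the municipality URL and the target file name.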
There were some problems with the downloads. We inspected the directory listing with the files sorted by size.
The files that had come out empty were downloaded again manually from
http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=261020
http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=261030
http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=261040
http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=261050
http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=261060
http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=261070
http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=261080
This time there were no problems, and we replaced the empty files with the newly downloaded ones.
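Looking for empty files can also be done from R instead of the directory listing. A minimal sketch, assuming the files are named by their code with a .csv extension as in the download program; the function name empty_codes and the demonstration directory are hypothetical.

```r
# Return the codes (file names without ".csv") of zero-byte files in `dir`.
empty_codes <- function(dir) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  sizes <- file.info(files)$size
  sub("\\.csv$", "", basename(files[sizes == 0]))
}

# Demonstration on a temporary directory with one empty
# and one non-empty file.
d <- file.path(tempdir(), "muni_demo")
dir.create(d, showWarnings = FALSE)
file.create(file.path(d, "261020.csv"))        # empty, as after a failed download
writeLines("a;b", file.path(d, "350010.csv"))  # non-empty
print(empty_codes(d))
```

The returned codes can be pasted onto the CSV URL prefix to obtain the list of files to fetch again.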
The directory listing also shows that the download started at 2:10 and finished at 3:40.
The directory muni containing all data files is available as muni.zip.
The problem in transforming the CSV files into a summary data.frame is that in the CSV files the integers are written with a dot as thousands separator, which is interpreted as a decimal point when reading. We work around this by reading the numbers as strings, removing the dots, and converting the strings to integers.
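The effect is easy to reproduce in isolation. In the sketch below the values in raw are made up for illustration and do not come from the data files.

```r
# Values written with dots grouping the thousands (made-up examples).
raw <- c("1.234", "567", "12.345.678")

# Converted directly, the dot is taken as a decimal point.
direct <- as.numeric("1.234")
print(direct)

# Remove the dots first, then convert to integer.
counts <- as.integer(gsub(".", "", raw, fixed = TRUE))
print(counts)
```

With fixed = TRUE the dot is matched literally, which is equivalent to the escaped pattern '\\.' in a regular expression.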
setwd('C:/Users/batagelj/Downloads/data/brazil/DL')
cat('Brazilian municipalities\n',date(),'\n')
# column names: 11 age groups for males, then 11 for females
cn <- c(
  "M 0-4 ", "M 5-9 ", "M10-14", "M15-19", "M20-24", "M25-29", "M30-39", "M40-49", "M50-59", "M60-69", "M77- ",
  "F 0-4 ", "F 5-9 ", "F10-14", "F15-19", "F20-24", "F25-29", "F30-39", "F40-49", "F50-59", "F60-69", "F77- " )
D <- read.csv('brazilMunic.csv',header=TRUE,sep=';',encoding='UTF-8')
L <- as.vector(D$codmuni); nam <- as.vector(D$name); reg <- as.vector(D$FUnit)
M <- matrix(nrow = 5570, ncol = 22)
colnames(M) <- cn
code <- vector("integer",5570)
region <- vector("character",5570)
k <- 0
for(fn in L){
  k <- k+1
  code[k] <- fn; region[k] <- reg[k]
  fname <- paste('./muni/',fn,'.csv',sep="")
  # read all fields as strings so that the dots are preserved
  A <- read.csv(fname,sep=';',skip=3,header=FALSE,colClasses="character")
  # remove the dots and convert to integers
  M[k,1:11] <- as.integer(gsub('\\.','',A$V2[34:44]))
  M[k,12:22] <- as.integer(gsub('\\.','',A$V2[67:77]))
}
cat(date(),'\n')
F <- data.frame(nam,region,M)
rownames(F) <- L
save(F,file="Brazil.Rdata")
We get
Brazilian municipalities
Wed Jan 28 19:24:21 2015
Warning messages:
1: NAs introduced by coercion
2: NAs introduced by coercion
3: NAs introduced by coercion
4: NAs introduced by coercion
5: NAs introduced by coercion
6: NAs introduced by coercion
7: NAs introduced by coercion
8: NAs introduced by coercion
9: NAs introduced by coercion
10: NAs introduced by coercion
Wed Jan 28 19:24:38 2015
Some of the data are problematic.
> ok <- complete.cases(F)
> sum(ok)
[1] 5565
> F[!ok,]
                     nam region M.0.4. M.5.9. M10.14 M15.19 M20.24 M25.29 M30.39
500627 Paraíso das Águas     MS     NA     NA     NA     NA     NA     NA     NA
150475  Mojuí dos Campos     PA     NA     NA     NA     NA     NA     NA     NA
431454    Pinto Bandeira     RS     NA     NA     NA     NA     NA     NA     NA
422000  Balneário Rincao     SC     NA     NA     NA     NA     NA     NA     NA
421265    Pescaria Brava     SC     NA     NA     NA     NA     NA     NA     NA
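We see that the data are missing for the five listed municipalities; the corresponding CSV files contain fields with missing data. This is consistent with the ten coercion warnings, since each municipality involves two conversions (one for males, one for females).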
The saved data.frame is available in brazil.zip.