Brazilian population pyramids

January 28, 2015

Locating the data source

The main source of data on Brazil is the Brazilian Institute of Geography and Statistics (IBGE). The population pyramids data are likely contained in the results of Census 2010 and/or Census 2000.

We would like to get the pyramids at the level of municipalities; currently there are 5570 of them. After some searching we reach the pages on the population characteristics of the municipalities within a selected region. Here is the page for the region of Sao Paulo (sao-paulo|censo-demografico-2010:-resultados-da-amostra-caracteristicas-da-populacao-) and here is the page for its municipality Alambari (alambari|censo-demografico-2010:-resultados-da-amostra-caracteristicas-da-populacao-).

We were not able to find a way to download the complete data set, or at least the data set for a whole region. Here are the URLs for displaying the data on a selected municipality (broken into two lines) and for downloading the corresponding CSV file.

http://www.cidades.ibge.gov.br/xtras/temas.php?lang=&codmun=350010&idtema=90&search=
  sao-paulo|adamantina|censo-demografico-2010:-resultados-da-amostra-caracteristicas-da-populacao-
http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=350010

To download the data about all municipalities we need their codmun numbers.

Downloading

Inspecting the page source of a region, http://www.cidades.ibge.gov.br/xtras/uf.php?lang=&coduf=12&search=acre (the example is for AC), for the different regions AC, AL, …, SP, TO, we notice that line 220, which follows the line <ul id="lista_municipios">, contains the complete list of codmuns of the municipalities of the selected region. Using TextPad we extracted line 220 for each region and transformed it into a cleaned list.

m1200013;AC;120001;acre|acrelandia;Acrelândia
m1200054;AC;120005;acre|assis-brasil;Assis Brasil
m1200104;AC;120010;acre|brasileia;Brasiléia
...
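
The two ends of this step can be sketched in R. This is only an illustration under stated assumptions: the helper municipality_line and the column names id and search are hypothetical, while FUnit, codmuni and name are the column names used by the scripts later on this page. The transformation of the raw HTML line into cleaned records is not shown, because the raw format is not reproduced here.

```r
# Hypothetical helper: return the line that follows the list marker in a
# page source given as a character vector (e.g. the result of readLines(url)).
municipality_line <- function(src) {
  i <- grep('lista_municipios', src, fixed = TRUE)
  if (length(i) == 0) stop('marker not found')
  src[i[1] + 1]
}

# Three cleaned records copied from the sample above.
records <- c(
  "m1200013;AC;120001;acre|acrelandia;Acrelândia",
  "m1200054;AC;120005;acre|assis-brasil;Assis Brasil",
  "m1200104;AC;120010;acre|brasileia;Brasiléia"
)

# 'id' and 'search' are placeholder column names (assumptions).
munic <- read.csv(text = records, sep = ';', header = FALSE,
                  col.names = c('id', 'FUnit', 'codmuni', 'search', 'name'),
                  stringsAsFactors = FALSE)
print(munic[, c('FUnit', 'codmuni', 'name')])

# In these three samples codmuni equals the first six digits after the "m".
print(all(as.integer(substr(munic$id, 2, 7)) == munic$codmuni))
```

Searching for the marker instead of relying on the fixed line number 220 is a design choice of this sketch; it would keep working if the page layout shifted by a few lines.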

The lists obtained in this way were joined into a single CSV file, brazilMunic.csv. Using the codmuns from this file it is easy to generate the download URLs for the municipalities in R:

> code <- 270010
> part <- 'http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun='
> data <- paste(part,code,sep='')
> data
[1] "http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=270010"

We combine this into a downloading program:

setwd('C:/Users/batagelj/Downloads/data/brazil/DL')
# municipality codes from the list prepared above
L <- as.vector(read.csv('brazilMunic.csv',header=TRUE,sep=';')$codmuni)
part <- 'http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun='
for(fn in L){
  # build the download URL and print a time-stamped trace line
  fname <- paste(part,fn,sep=""); cat("---",fn,date(),"\n")
  save <- paste('./muni/',fn,'.csv',sep='')
  # catch errors so that a failed download does not stop the loop
  test <- tryCatch(download.file(fname,save,method="internal"),error=function(e) e)
}
date()

Here is a part of the printed trace of its execution:

--- 432149 Wed Jan 28 03:19:53 2015
trying URL 'http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=432149'
Content type 'application/csv' length unknown
opened URL
downloaded 20 Kb

...


--- 172210 Wed Jan 28 03:39:34 2015
trying URL 'http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=172210'
Content type 'application/csv' length unknown
opened URL
downloaded 20 Kb

Warning messages:
1: In download.file(fname, save, method = "internal") :
  cannot open: HTTP status was '502 Proxy Error'
2: In download.file(fname, save, method = "internal") :
  cannot open: HTTP status was '502 Proxy Error'
3: In download.file(fname, save, method = "internal") :
  cannot open: HTTP status was '502 Proxy Error'
4: In download.file(fname, save, method = "internal") :
  cannot open: HTTP status was '502 Proxy Error'
5: In download.file(fname, save, method = "internal") :
  cannot open: HTTP status was '502 Proxy Error'
6: In download.file(fname, save, method = "internal") :
  cannot open: HTTP status was '502 Proxy Error'
7: In download.file(fname, save, method = "internal") :
  cannot open: HTTP status was '502 Proxy Error'
> date()
[1] "Wed Jan 28 03:39:35 2015"

There were some problems with the downloading, as the warnings show. Listing the files in the directory by size revealed some empty files.

The empty files were re-downloaded manually:

http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=261020
http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=261030
http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=261040
http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=261050
http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=261060
http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=261070
http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun=261080

This time the downloads succeeded, and we replaced the empty files with the new ones.

The directory file list also shows that the downloading started at 2:10 and finished at 3:40.

The directory muni containing all data files is available as muni.zip.
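
The manual repair described above (find the empty files, download them again) could also be scripted. The following is a minimal sketch, not the procedure actually used: the helpers empty_codes and redownload, the number of attempts and the waiting time are all assumptions, and the fetch argument exists only so that the logic can be tried without network access.

```r
# Base URL as used in the downloading program.
part <- 'http://www.cidades.ibge.gov.br/xtras/csv.php?lang=&idtema=90&codmun='

# Hypothetical helper: codes of the municipalities whose CSV file is empty.
empty_codes <- function(dir = './muni') {
  files <- list.files(dir, pattern = '\\.csv$', full.names = TRUE)
  empty <- files[file.info(files)$size == 0]
  sub('\\.csv$', '', basename(empty))
}

# Hypothetical helper: try each code up to 'tries' times, pausing 'wait'
# seconds between attempts; returns the codes that are still empty.
redownload <- function(codes, dir = './muni', tries = 3, wait = 5,
                       fetch = download.file) {
  for (fn in codes) {
    dest <- file.path(dir, paste(fn, '.csv', sep = ''))
    for (attempt in seq_len(tries)) {
      # an error counts as a failed attempt and must not stop the loop
      tryCatch(fetch(paste(part, fn, sep = ''), dest),
               error = function(e) NULL)
      if (file.exists(dest) && file.info(dest)$size > 0) break
      if (attempt < tries) Sys.sleep(wait)
    }
  }
  empty_codes(dir)
}
```

With the working directory set as in the downloading program, a call such as redownload(empty_codes()) would retry only the failed municipalities and report any that remain empty.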

Making a data.frame

The problem in transforming the CSV files into a summary data.frame is that the integers in the CSV files are written with a dot as the thousands separator, which is interpreted as a decimal point when reading. We bypass this by reading the numbers as strings, removing the dots and converting the strings to integers.

setwd('C:/Users/batagelj/Downloads/data/brazil/DL')
cat('Brazilian municipalities\n',date(),'\n')
cn <- c(
   "M 0-4 ", "M 5-9 ", "M10-14", "M15-19", "M20-24", "M25-29", "M30-39", "M40-49",
   "M50-59", "M60-69", "M70-  ", "F 0-4 ", "F 5-9 ", "F10-14", "F15-19", "F20-24",
   "F25-29", "F30-39", "F40-49", "F50-59", "F60-69", "F70-  " )
D <- read.csv('brazilMunic.csv',header=TRUE,sep=';',encoding='UTF-8')
L <- as.vector(D$codmuni); nam <- as.vector(D$name); reg <- as.vector(D$FUnit)
M <- matrix(nrow = 5570, ncol = 22)
colnames(M) <- cn
code <- vector("integer",5570)
region <- vector("character",5570)
k <- 0
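for(fn in L){
  k <- k+1
  code[k] <- fn; region[k] <- reg[k]
  fname <- paste('./muni/',fn,'.csv',sep="")
  # read all fields as strings so that dots are not taken as decimal points
  A <- read.csv(fname,sep=';',skip=3,header=FALSE,colClasses="character")
  # rows 34-44 and 67-77 of column V2 hold the eleven M and F age groups
  M[k,1:11] <- as.integer(gsub('\\.','',A$V2[34:44]))
  M[k,12:22] <- as.integer(gsub('\\.','',A$V2[67:77]))
}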
cat(date(),'\n')
F <- data.frame(nam,region,M)
rownames(F) <- L
save(F,file="Brazil.Rdata")

We get

Brazilian municipalities
 Wed Jan 28 19:24:21 2015 

Warning messages:
1: NAs introduced by coercion 
2: NAs introduced by coercion 
3: NAs introduced by coercion 
4: NAs introduced by coercion 
5: NAs introduced by coercion 
6: NAs introduced by coercion 
7: NAs introduced by coercion 
8: NAs introduced by coercion 
9: NAs introduced by coercion 
10: NAs introduced by coercion 

Wed Jan 28 19:24:38 2015 
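
The warnings arise in the string-to-integer conversion. As a minimal sketch, the conversion can be tried in isolation; the helper name to_int and the '-' placeholder are illustrative assumptions and are not taken from the IBGE files.

```r
# Reading a dotted integer as a number treats the dot as a decimal point.
as.numeric('1.234')

# Hypothetical helper wrapping the conversion used in the loop:
# remove the thousands-separator dots, then convert to integer.
to_int <- function(x) as.integer(gsub('\\.', '', x))
to_int(c('1.234', '567', '12.345.678'))

# A field that is not a number after removing the dots becomes NA and
# raises the "NAs introduced by coercion" warning (suppressed here).
suppressWarnings(to_int('-'))
```

Each as.integer call on a vector raises at most one such warning, so ten warnings would be consistent with five affected files and two calls per file.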

Some of the data are problematic:

> ok <- complete.cases(F)
> sum(ok)
[1] 5565
> F[!ok,]
                     nam region M.0.4. M.5.9. M10.14 M15.19 M20.24 M25.29 M30.39
500627 Paraíso das Águas     MS     NA     NA     NA     NA     NA     NA     NA
150475  Mojuí dos Campos     PA     NA     NA     NA     NA     NA     NA     NA
431454    Pinto Bandeira     RS     NA     NA     NA     NA     NA     NA     NA
422000  Balneário Rincao     SC     NA     NA     NA     NA     NA     NA     NA
421265    Pescaria Brava     SC     NA     NA     NA     NA     NA     NA     NA

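We see that the data are missing for the five listed municipalities: the corresponding CSV files have fields with missing data. These appear to be the five municipalities installed in 2013, after Census 2010, which would explain the gap.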

The saved data.frame is available in brazil.zip.

notes/data/br.txt · Last modified: 2015/07/13 14:37 by vlado
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported