====== Downloading genealogies ======

[[http://www.genealogyforum.com/gedcom/|The Genealogy Forum Surname Center GEDCOM File Library]] offers a large collection of (surname) genealogies. They are split into 18 subcollections: 1, 1a, 1b, 2a, 2b, 3a, ..., 7b, 8a, 8b, 9a.

In every subcollection except 1, the files are numbered with 4-digit numbers: the first digit identifies the subcollection, and the other three digits are the sequence number of the file. For example:

  http://www.genealogyforum.com/gedcom/gedcom5b/gedr5301.htm
  http://www.genealogyforum.com/gedcom/gedcom5b/gedr5301.ged

The "a" subcollections contain files with sequence numbers 0-299, and the "b" subcollections contain files with sequence numbers 300-500 (or fewer).

Each genealogy is described by an HTML file and a data file in GED format. Some of the GED files are stored as ZIP files.

R provides the function ''download.file'' for downloading files. With it, it is easy to write a program that downloads a subcollection:

  # Subcollection 5b
  cd <- c()   # sequence numbers that have an HTML page but no GED or ZIP file
  pa <- "http://www.genealogyforum.com/gedcom/gedcom5b/"
  for(i in 300:497){
    # file name stem: "gedr5" followed by the zero-padded 3-digit sequence number
    fna <- paste("gedr5",sprintf("%03d",i),sep="")
    fname <- paste(fna,".htm",sep=""); page <- paste(pa,fname,sep="")
    test <- tryCatch(download.file(page,fname,method="auto"),error=function(e) e)
    if(is.integer(test)) {
      # the description page was downloaded; try the GED file next
      fname <- paste(fna,".ged",sep=""); page <- paste(pa,fname,sep="")
      test <- tryCatch(download.file(page,fname,method="auto"),error=function(e) e)
      if(!is.integer(test)) {
        # no GED file; fall back to the ZIP file (mode="wb" keeps binary data intact)
        fname <- paste(fna,".zip",sep=""); page <- paste(pa,fname,sep="")
        test <- tryCatch(download.file(page,fname,method="auto",mode="wb"),error=function(e) e)
        if(!is.integer(test)) cd <- c(cd,i)
      }
    }
  }

Some data sets were removed from the collection and are not available. Each row below lists the subcollection, its range of file numbers, and the numbers that are missing:

  1   1-166   : 28 116 141 144 155
  1a  165-299 : 207 243
  1b  300-487 : 340 375 417 420 421 424 436 437 438 470
  2a  0-299   : 13 23 27 28 29 30 40 86 87 119 120 133 155 156 159 172 173 174 176 178 185 186 194 243 245 247 256 257 283 290
  2b  300-498 : 300 301 309 332 333 334 335 336 344 349 359 386 399 400 403 436 465
  3a  0-299   : 29 48 49 91 98 149 165 170 175 182 199 231 239 253
  3b  300-497 : 331 332 397 405 426 449 461
  4a  0-299   : 39 41 42 44 58 70 81 99 100 105 108 117 134 135 146 179 180 181 182 183 264 273 276 294
  4b  300-496 : 301 302 310 320 335 339 368 392 404 409 478
  5a  0-299   : 11 43 57 58 138 166 168 170 171 172 185 187 188 198 203 257 269 299
  5b  300-497 : 309 348 392 486
  6a  0-299   : 23 31 41 61 65 97 114 122 140 174 192 205 242 279 288 291 295
  6b  300-493 : 307 336 339 341 363 386 414 422 431 444 456 458 466 469 476 483
  7a  0-299   : 22 26 28 42 47 71 98 112 113 119 149 150 155 188 190 232
  7b  300-496 : 334 392 409 445 461 485
  8a  1-299   : 23 30 35 42 49 79 85 97 98 99 108 116 152 221 255 288
  8b  300-452 : 361 410 429
  9a  0-7     : NULL

Roman Luštrik wrote a solution to [[https://github.com/romunov/GEDCOM-archives/blob/master/gedcom-archive.R|download all the GEDCOM files]]. The regular expressions in that solution can be shortened:

  "ged[[:digit:]]+\\.htm|gedr[[:digit:]]+\\.htm" -> "gedr?[[:digit:]]+\\.htm"
  "gedcom[[:digit:]]+|gedcom[[:digit:]]+a|gedcom[[:digit:]]+b" -> "gedcom[[:digit:]]+(a|b)?"
  "http://.+\\.ged|http://.+\\.zip" -> "http://.+\\.(ged|zip)"
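
As a quick sanity check, the following sketch compares each original pattern with its shortened form using ''grepl''. The sample strings are made up for illustration only. The first shortened pattern is written here with the dot escaped (''\\.''), because an unescaped dot matches any character and would make the pattern accept strings that the original rejects.

```r
# Pairs of (original, shortened) patterns. The dot in the first shortened
# pattern is escaped so that it matches only a literal ".".
patterns <- list(
  c("ged[[:digit:]]+\\.htm|gedr[[:digit:]]+\\.htm", "gedr?[[:digit:]]+\\.htm"),
  c("gedcom[[:digit:]]+|gedcom[[:digit:]]+a|gedcom[[:digit:]]+b",
    "gedcom[[:digit:]]+(a|b)?"),
  c("http://.+\\.ged|http://.+\\.zip", "http://.+\\.(ged|zip)")
)

# Hypothetical sample strings, chosen only for illustration.
samples <- c(
  "ged0123.htm", "gedr5301.htm", "gedx5301.htm", "gedr5301xhtm",
  "gedcom5", "gedcom5a", "gedcom5b", "gedcom",
  "http://www.genealogyforum.com/gedcom/gedcom5b/gedr5301.ged",
  "http://www.genealogyforum.com/gedcom/gedcom5b/gedr5301.zip",
  "http://www.genealogyforum.com/gedcom/gedcom5b/gedr5301.htm"
)

# For each pair, check that both patterns match exactly the same samples.
same_matches <- sapply(patterns, function(p) {
  identical(grepl(p[1], samples), grepl(p[2], samples))
})
print(same_matches)

# With an unescaped dot, the pattern also accepts a string without ".htm".
unescaped_hit <- grepl("gedr?[[:digit:]]+.htm", "gedr5301xhtm")
print(unescaped_hit)
```

This only tests whether a string matches. It does not compare which substring each pattern would extract, so it is a plausibility check on these samples, not a proof of equivalence.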
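
The file-naming rule described earlier (the subcollection digit followed by a three-digit sequence number) can be sketched as a small helper that builds the URLs to try for one genealogy. The function name ''ged_urls'' and its arguments are hypothetical. The directory pattern and the ''gedr'' prefix are assumptions generalised from the single 5b example, so other subcollections may use different names, and subcollection 1 does not follow this numbering at all.

```r
# Hypothetical helper: candidate URLs (.htm, .ged, .zip) for one genealogy.
# Assumes the directory is "gedcom<digit><letter>/" and the file stem is
# "gedr<digit><3-digit sequence number>", as in the 5b example.
ged_urls <- function(digit, letter, i,
                     base = "http://www.genealogyforum.com/gedcom/") {
  stem <- sprintf("gedr%d%03d", digit, i)        # e.g. "gedr5301"
  dir  <- sprintf("gedcom%d%s/", digit, letter)  # e.g. "gedcom5b/"
  paste0(base, dir, stem, c(".htm", ".ged", ".zip"))
}

urls <- ged_urls(5, "b", 301)
print(urls)
```

Each returned URL could then be passed to ''download.file'' inside ''tryCatch'', in the same way as in the 5b program.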