Downloading genealogies

The Genealogy Forum Surname Center GEDCOM File Library offers a large collection of surname genealogies. They are split into 18 subcollections: 1, 1a, 1b, 2a, 2b, 3a, …, 7b, 8a, 8b, 9a. In every subcollection except 1, the files are numbered with four-digit numbers: the first digit identifies the subcollection and the other three digits are the sequence number of the file. For example, file 301 of subcollection 5b is available at

http://www.genealogyforum.com/gedcom/gedcom5b/gedr5301.htm
http://www.genealogyforum.com/gedcom/gedcom5b/gedr5301.ged

The a subcollections contain files with numbers 0-299, and the b subcollections contain files with numbers from 300 up to at most 500. Each genealogy is described by an HTML file and a data file in GED format. Some of the GED files are stored as ZIP files.
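The naming scheme can be sketched as a small R function. The helper name ged_url and its arguments are hypothetical, and the sketch assumes that the pattern of the 5b example (directory gedcom plus subcollection name, file name gedr plus four-digit number) also holds for the other numbered subcollections, which may not be true for every file.

```r
# Hypothetical helper: build the URL of one file in a numbered subcollection.
# Assumption: all subcollections follow the pattern seen in the 5b example.
ged_url <- function(sub, num, ext = "htm",
                    base = "http://www.genealogyforum.com/gedcom/") {
  digit <- substr(sub, 1, 1)   # first digit of the four-digit file number
  file  <- sprintf("gedr%s%03d.%s", digit, as.integer(num), ext)
  paste0(base, "gedcom", sub, "/", file)
}

ged_url("5b", 301)          # URL of the HTML description
ged_url("5b", 301, "ged")   # URL of the GED data file
```

Both calls reproduce the two example URLs given above.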

R provides the function download.file for downloading files. With it, it is easy to write a program that downloads a subcollection:

# Subcollection 5b: files gedr5300 to gedr5497
cd <- c()  # numbers that have an HTML page but neither a .ged nor a .zip file
pa <- "http://www.genealogyforum.com/gedcom/gedcom5b/"
for(i in 300:497){
  # base file name: "gedr5" plus the zero-padded three-digit sequence number
  fna <- paste("gedr5",sprintf("%03d",i),sep="")
  # the HTML description comes first; download.file returns an integer status,
  # so a non-integer result means that tryCatch caught an error
  fname <- paste(fna,".htm",sep=""); page <- paste(pa,fname,sep="")
  test <- tryCatch(download.file(page,fname,method="auto"),error=function(e) e)
  if(is.integer(test)) {
    # then the data file: try .ged and fall back to .zip
    fname <- paste(fna,".ged",sep=""); page <- paste(pa,fname,sep="")
    test <- tryCatch(download.file(page,fname,method="auto"),error=function(e) e)
    if(!is.integer(test)) {
      fname <- paste(fna,".zip",sep=""); page <- paste(pa,fname,sep="")
      test <- tryCatch(download.file(page,fname,method="auto"),error=function(e) e)
      if(!is.integer(test)) cd <- c(cd,i)
    }
  }
}
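The fallback from .ged to .zip can also be written as a reusable function. This is only a sketch: the names fetch_first and fake are hypothetical, and unlike the loop above it additionally requires the returned status to be 0. The downloader is passed as an argument so that the logic can be tried without network access.

```r
# Hypothetical helper: try the candidate URLs in order and return the local
# file name of the first one that downloads successfully, or NA if none does.
fetch_first <- function(urls, downloader = download.file) {
  for (u in urls) {
    dest <- basename(u)
    status <- tryCatch(downloader(u, dest, method = "auto"),
                       error = function(e) e)
    # download.file returns integer 0 on success; an error object means failure
    if (is.integer(status) && status == 0L) return(dest)
  }
  NA_character_
}

# Offline stand-in for download.file: only .zip URLs "exist"
fake <- function(url, destfile, ...) {
  if (grepl("\\.zip$", url)) 0L else stop("cannot open URL")
}

found <- fetch_first(c("http://example.org/gedr5301.ged",
                       "http://example.org/gedr5301.zip"), fake)
found
```

With the real download.file as downloader, the same call would fetch the .ged file if it exists and the .zip file otherwise.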

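Some data sets were removed from the collection and are not available. For each subcollection, the list gives the range of file numbers and the numbers that are missing (NULL means that none are missing):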

1     1-166 :   28 116 141 144 155
1a  165-299 :  207 243 
1b  300-487 :  340 375 417 420 421 424 436 437 438 470
2a    0-299 :   13  23  27  28  29  30  40  86  87 119 120 133 155 156 159 172 173 174 176 178 185 186 194 243
               245 247 256 257 283 290
2b  300-498 :  300 301 309 332 333 334 335 336 344 349 359 386 399 400 403 436 465
3a    0-299 :   29  48  49  91  98 149 165 170 175 182 199 231 239 253
3b  300-497 :  331 332 397 405 426 449 461
4a    0-299 :   39  41  42  44  58  70  81  99 100 105 108 117 134 135 146 179 180 181 182 183 264 273 276 294 
4b  300-496 :  301 302 310 320 335 339 368 392 404 409 478
5a    0-299 :   11  43  57  58 138 166 168 170 171 172 185 187 188 198 203 257 269 299 
5b  300-497 :  309 348 392 486
6a    0-299 :   23  31  41  61  65  97 114 122 140 174 192 205 242 279 288 291 295
6b  300-493 :  307 336 339 341 363 386 414 422 431 444 456 458 466 469 476 483
7a    0-299 :   22  26  28  42  47  71  98 112 113 119 149 150 155 188 190 232
7b  300-496 :  334 392 409 445 461 485
8a    1-299 :   23  30  35  42  49  79  85  97  98  99 108 116 152 221 255 288
8b  300-452 :  361 410 429
9a    0-7   :  NULL
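Such a list can be used to skip the missing files on a later run. A minimal sketch for subcollection 5b, assuming the numbers listed above are exactly the ones to skip (the variable names are illustrative):

```r
# Numbers reported above as missing from subcollection 5b
missing_5b <- c(309, 348, 392, 486)

# Sequence numbers that remain to be fetched
available_5b <- setdiff(300:497, missing_5b)
length(available_5b)
```

The loop over 300:497 could then run over available_5b instead.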

A solution for downloading all the GEDCOM files was written by Roman Luštrik.

The regular expressions in this solution can be shortened:

"ged[[:digit:]]+\\.htm|gedr[[:digit:]]+\\.htm" -> "gedr?[[:digit:]]+\\.htm"
"gedcom[[:digit:]]+|gedcom[[:digit:]]+a|gedcom[[:digit:]]+b" -> "gedcom[[:digit:]]+(a|b)?"
"http://.+\\.ged|http://.+\\.zip" -> "http://.+\\.(ged|zip)"
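
That each shortened pattern matches the same strings as the long one can be checked with grepl on a few sample strings. The samples and the helper same_matches below are made up for illustration, so this is a spot check and not a proof. The dot before htm has to stay escaped as \\. in the short form, because an unescaped dot matches any character.

```r
# Hypothetical helper: do both patterns give the same matches on the samples?
same_matches <- function(long, short, samples) {
  identical(grepl(long, samples), grepl(short, samples))
}

pages <- c("ged123.htm", "gedr5301.htm", "gedx12.htm", "gedr5301xhtm", "index.htm")
dirs  <- c("gedcom5", "gedcom5a", "gedcom5b", "gedcom", "gedcomx")
files <- c("http://x.org/a.ged", "http://x.org/a.zip",
           "http://x.org/a.htm", "ftp://x.org/a.ged")

ok_pages <- same_matches("ged[[:digit:]]+\\.htm|gedr[[:digit:]]+\\.htm",
                         "gedr?[[:digit:]]+\\.htm", pages)
ok_dirs  <- same_matches("gedcom[[:digit:]]+|gedcom[[:digit:]]+a|gedcom[[:digit:]]+b",
                         "gedcom[[:digit:]]+(a|b)?", dirs)
ok_files <- same_matches("http://.+\\.ged|http://.+\\.zip",
                         "http://.+\\.(ged|zip)", files)

# An unescaped dot is not equivalent: it also accepts "gedr5301xhtm"
loose <- grepl("gedr?[[:digit:]]+.htm", "gedr5301xhtm")
c(ok_pages, ok_dirs, ok_files, loose)
```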
notes/gendl.txt · Last modified: 2015/07/13 14:52 by vlado
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported