notes:net:gendl [2015/07/16 21:34] (current)
vlado created
 +====== Downloading genealogies ======
 +
 +In [[http://www.genealogyforum.com/gedcom/|The Genealogy Forum  Surname Center  GEDCOM File Library]] a large collection of (surname) genealogies is available. It is split into 18 subcollections: 1, 1a, 1b, 2a, 2b, 3a, ..., 7b, 8a, 8b, 9a. In every subcollection except 1, the files are numbered with 4-digit numbers: the first digit is the digit of the subcollection and the other three digits are the sequence number of the file. For example, file 301 of subcollection 5b consists of:
 +<code>
 +http://www.genealogyforum.com/gedcom/gedcom5b/gedr5301.htm
 +http://www.genealogyforum.com/gedcom/gedcom5b/gedr5301.ged
 +</code>
 +The a subcollections contain files numbered 0-299 and the b subcollections contain files numbered from 300 up to 500 or fewer. Each genealogy is described by an HTML file and a data file in GED format. Some of the GED files are stored as ZIP files.
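The naming scheme described above can be sketched as a small R helper. This is an illustration only: the function name ''ged_url'' is hypothetical, and the URL pattern is assumed to generalize from the single 5b example shown. Subcollection 1 is left out because its files are numbered differently.

```r
# Hypothetical helper: build the URL of a file from its subcollection,
# sequence number and extension, following the pattern of the 5b example.
ged_url <- function(sub, i, ext = "htm") {
  stopifnot(grepl("^[1-9][ab]$", sub))   # e.g. "5b"; subcollection "1" is not covered
  digit <- substr(sub, 1, 1)             # first digit of the 4-digit file number
  base  <- "http://www.genealogyforum.com/gedcom/"
  sprintf("%sgedcom%s/gedr%s%03d.%s", base, sub, digit, as.integer(i), ext)
}

# These two calls reproduce the example addresses given above.
ged_url("5b", 301)
ged_url("5b", 301, "ged")
```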
 +
 +R provides the function ''download.file'' for downloading files. With it, a program for downloading a subcollection is easy to write:
 +<code>
 +# Subcollection 5b: files gedr5300 ... gedr5497
 +cd <- c()   # numbers whose page exists but whose .ged and .zip both failed
 +pa <- "http://www.genealogyforum.com/gedcom/gedcom5b/"
 +for(i in 300:497){
 +  fna <- sprintf("gedr5%03d", i)   # file name without extension, e.g. "gedr5301"
 +  # 1. description page; download.file returns an integer status unless it raises an error
 +  fname <- paste(fna,".htm",sep=""); page <- paste(pa,fname,sep="")
 +  test <- tryCatch(download.file(page,fname,method="auto"),error=function(e) e)
 +  if(is.integer(test)) {
 +    # 2. data file in GED format
 +    fname <- paste(fna,".ged",sep=""); page <- paste(pa,fname,sep="")
 +    test <- tryCatch(download.file(page,fname,method="auto"),error=function(e) e)
 +    if(!is.integer(test)) {
 +      # 3. fall back to the zipped data file; mode="wb" keeps binary data intact on Windows
 +      fname <- paste(fna,".zip",sep=""); page <- paste(pa,fname,sep="")
 +      test <- tryCatch(download.file(page,fname,method="auto",mode="wb"),error=function(e) e)
 +      if(!is.integer(test)) cd <- c(cd,i)
 +    }
 +  }
 +}
 +</code>
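The fallback order used in the loop (HTML page first, then the GED file, then the ZIP file) can also be isolated in one function. The following is a minimal sketch, not the author's code: ''fetch_one'' and its ''downloader'' argument are hypothetical names, and passing the download function as an argument is a choice made here so that the logic can be tried without network access. Unlike the loop above, it also requires the returned status to be 0.

```r
# Hypothetical sketch: try the .htm page, then .ged, then .zip for one file stem.
# Returns the extension of the data file that was fetched, or NA if the page
# or both data files could not be downloaded.
fetch_one <- function(pa, fna, downloader = download.file) {
  try_get <- function(ext) {
    fname <- paste(fna, ext, sep = "")
    res <- tryCatch(downloader(paste(pa, fname, sep = ""), fname),
                    error = function(e) e)
    is.integer(res) && res == 0L          # download.file returns 0 on success
  }
  if (!try_get(".htm")) return(NA_character_)
  for (ext in c(".ged", ".zip")) {
    if (try_get(ext)) return(ext)
  }
  NA_character_
}

# Offline illustration with a stand-in downloader that only "has" .htm and .zip files.
# The address http://example.org/ is a placeholder, not part of the collection.
fake <- function(url, destfile) {
  if (grepl("\\.(htm|zip)$", url)) 0L else stop("cannot open URL")
}
fetch_one("http://example.org/", "gedr5301", downloader = fake)
```

With the real ''download.file'' the call would be ''fetch_one(pa, fna)'', using ''pa'' and ''fna'' as in the loop above.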
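 +Some data sets were removed from the collection and are not available. For each subcollection, the list gives the range of sequence numbers, followed by the numbers that are missing: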
 +<code>
 +1     1-166 :   28 116 141 144 155
 +1a  165-299 :  207 243 
 +1b  300-487 :  340 375 417 420 421 424 436 437 438 470
 +2a    0-299 :   13  23  27  28  29  30  40  86  87 119 120 133 155 156 159 172 173 174 176 178 185 186 194 243
 +               245 247 256 257 283 290
 +2b  300-498 :  300 301 309 332 333 334 335 336 344 349 359 386 399 400 403 436 465
 +3a    0-299 :   29  48  49  91  98 149 165 170 175 182 199 231 239 253
 +3b  300-497 :  331 332 397 405 426 449 461
 +4a    0-299 :   39  41  42  44  58  70  81  99 100 105 108 117 134 135 146 179 180 181 182 183 264 273 276 294 
 +4b  300-496 :  301 302 310 320 335 339 368 392 404 409 478
 +5a    0-299 :   11  43  57  58 138 166 168 170 171 172 185 187 188 198 203 257 269 299 
 +5b  300-497 :  309 348 392 486
 +6a    0-299 :   23  31  41  61  65  97 114 122 140 174 192 205 242 279 288 291 295
 +6b  300-493 :  307 336 339 341 363 386 414 422 431 444 456 458 466 469 476 483
 +7a    0-299 :   22  26  28  42  47  71  98 112 113 119 149 150 155 188 190 232
 +7b  300-496 :  334 392 409 445 461 485
 +8a    1-299 :   23  30  35  42  49  79  85  97  98  99 108 116 152 221 255 288
 +8b  300-452 :  361 410 429
 +9a    0-7   :  NULL
 +</code>
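Given this list, the sequence numbers that should still be downloadable can be obtained with ''setdiff''. A small sketch for subcollection 5b, using the range and the missing numbers copied from the list above; the variable names are illustrative only.

```r
# Range and missing numbers for subcollection 5b, copied from the list above.
all_5b     <- 300:497
missing_5b <- c(309, 348, 392, 486)

# Sequence numbers that are expected to be available.
available_5b <- setdiff(all_5b, missing_5b)
length(available_5b)   # 198 numbers in the range minus 4 missing = 194
```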
 +
 +Roman Luštrik wrote a solution to [[https://github.com/romunov/GEDCOM-archives/blob/master/gedcom-archive.R|download all the GEDCOM files]].
 +
 +The regular expressions in this solution can be shortened:
 +<code>
 +"ged[[:digit:]]+\\.htm|gedr[[:digit:]]+\\.htm" -> "gedr?[[:digit:]]+\\.htm"
 +"gedcom[[:digit:]]+|gedcom[[:digit:]]+a|gedcom[[:digit:]]+b" -> "gedcom[[:digit:]]+(a|b)?"
 +"http://.+\\.ged|http://.+\\.zip" -> "http://.+\\.(ged|zip)"
 +</code>
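Whether a shortened pattern behaves like its original can be spot-checked with ''grepl''. The sketch below compares each pair on a few made-up test strings, with the dot before ''htm'' escaped in the shortened pattern as it is in the original. Agreement on these samples is only a sanity check, not a proof of equivalence.

```r
# Pairs of (original, shortened) patterns.
pairs <- list(
  c("ged[[:digit:]]+\\.htm|gedr[[:digit:]]+\\.htm", "gedr?[[:digit:]]+\\.htm"),
  c("gedcom[[:digit:]]+|gedcom[[:digit:]]+a|gedcom[[:digit:]]+b", "gedcom[[:digit:]]+(a|b)?"),
  c("http://.+\\.ged|http://.+\\.zip", "http://.+\\.(ged|zip)")
)

# Made-up test strings, some matching and some not (example.org is a placeholder).
samples <- c("ged123.htm", "gedr5301.htm", "gedr5301xhtm", "gedcom5", "gedcom5b",
             "gedcom", "http://www.genealogyforum.com/gedcom/gedcom5b/gedr5301.ged",
             "http://example.org/x.zip", "http://example.org/x.txt")

# TRUE for a pair when both patterns match exactly the same samples.
same <- sapply(pairs, function(p) identical(grepl(p[1], samples), grepl(p[2], samples)))
all(same)
```

An unescaped dot would make the first shortened pattern looser than the original: ''"gedr?[[:digit:]]+.htm"'' also matches ''"gedr5301xhtm"''.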
  
notes/net/gendl.txt · Last modified: 2015/07/16 21:34 by vlado