Reconstructing historical populations from genealogical data: an overview of methods used for aggregating data from GEDCOM files
Publication date
2014
Document Type
Working paper
Abstract
The GEDCOM file format is by far the most widely used means of exchanging genealogical data and extensive collections of these files are available online. There is a huge potential benefit for historians and other academics who are able to make use of the data contained in available GEDCOM files, as these effectively represent hundreds of thousands of hours of crowd-sourced work and a considerable source of knowledge about individual families. This paper details a number of methods that are being used to clean and aggregate such genealogical data; this includes a series of steps for screening out substantially flawed files, as well as for cleaning date and place information. A group-linking method is described for identifying duplicates / linkages within a genealogical database based on comparison of family structures. This is tested alongside conventional methods (i.e. comparison of name and birth date) and an estimation of the power of the differing methods is provided. It is proposed that use of the group-linking method provides advantages over conventional methods, because this provides a way of increasing the size and timespan of datasets that may be extracted from a genealogical database with confidence that they do not contain duplicates. The method will be further improved by incorporating probabilistic record linkage techniques, which take into account the frequencies of values in the linkage arrays.
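To make the contrast between the two approaches concrete, the following is a minimal sketch and not the paper's implementation. It assumes a hypothetical `Person` record with a name, a birth year and a list of relative identifiers; the function names and the `min_shared` threshold are likewise illustrative assumptions. The conventional comparison pairs records that agree on name and birth year, while the group-based comparison keeps only those pairs whose relatives also agree.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List, Optional


@dataclass
class Person:
    # Hypothetical record layout; a real GEDCOM import would carry more fields.
    pid: str
    name: str
    birth_year: Optional[int]
    relatives: List[str] = field(default_factory=list)  # ids of parents, spouses, children


def key(p):
    """Conventional comparison key: whitespace- and case-normalised name plus birth year."""
    return (" ".join(p.name.lower().split()), p.birth_year)


def conventional_matches(people):
    """Pairs of records that agree on name and birth year."""
    return {
        (a.pid, b.pid)
        for a, b in combinations(people, 2)
        if a.birth_year is not None and key(a) == key(b)
    }


def relative_keys(p, by_id):
    """Comparison keys of a person's known relatives (those with a birth year)."""
    return {
        key(by_id[r])
        for r in p.relatives
        if r in by_id and by_id[r].birth_year is not None
    }


def group_matches(people, min_shared=2):
    """Conventional pairs whose family structures also agree.

    A pair is kept only if at least `min_shared` relatives match on
    name and birth year (the threshold is an assumption for illustration).
    """
    by_id = {p.pid: p for p in people}
    return {
        (a, b)
        for a, b in conventional_matches(people)
        if len(relative_keys(by_id[a], by_id) & relative_keys(by_id[b], by_id)) >= min_shared
    }


# Invented example: the same family entered in two files (a*, b*),
# plus an unrelated namesake born in the same year (c1).
people = [
    Person("a1", "John Smith", 1850, ["a2", "a3"]),
    Person("a2", "Mary Jones", 1852, ["a1", "a3"]),
    Person("a3", "Thomas Smith", 1875, ["a1", "a2"]),
    Person("b1", "john  smith", 1850, ["b2", "b3"]),
    Person("b2", "Mary Jones", 1852, ["b1", "b3"]),
    Person("b3", "Thomas Smith", 1875, ["b1", "b2"]),
    Person("c1", "John Smith", 1850, ["c2"]),
    Person("c2", "Ann Brown", 1851, ["c1"]),
]
```

In this invented data the conventional comparison pairs the unrelated namesake `c1` with both `a1` and `b1`, whereas the group-based comparison retains only the pairs whose spouse and child also match. A probabilistic variant would weight each agreement by how common the matching values are rather than using a fixed count.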
Citation
Gellatly, C 2014 'Reconstructing historical populations from genealogical data: an overview of methods used for aggregating data from GEDCOM files'. <http://socialhistory.org/sites/default/files/docs/gellatly_-_reconstructing_historical_populations.pdf>