The potential of using crowd-sourced data to re-explore the demography of Victorian Britain

Duke-Williams, Oliver William, University of Leeds, UK, o.duke-williams@ucl.ac.uk

This paper describes a new project which aims to explore the potential of a set of crowd-sourced data based on the returns from decennial censuses in nineteenth-century Britain. Using data created by an existing volunteer-based effort, the project hopes to extract and make available sets of historical demographic data.

Crowd-sourcing is a term generally used to refer to the generation and collation of data by a group of people, sometimes paid, sometimes interested volunteers (Howe 2007). Whilst not dependent on Web technologies, the ease of communication and ability to both gather and re-distribute digital data in standard formats mean that the Web is a very significant enabling technology for distributed data generation tasks. In some cases – the most obvious being Wikipedia – crowd-sourcing has involved the direct production of original material, whilst in other cases, crowd-sourcing has been applied to the transcription (or proof-reading) of existing non-digitised material. An early example of this – predating the Web – was Project Gutenberg (Hart 1992), which was established in 1971 and continues to digitize and make available texts for which copyright has expired. A more recent example in the domain of Digital Humanities is the Transcribe Bentham project (Terras 2010), which harnesses international volunteer efforts to digitize the manuscripts of Jeremy Bentham, including many previously unpublished papers; in contrast to Project Gutenberg, the sources are hand-written rather than printed, and thus might require considerable human interpretative effort as part of the transcription process. Furthermore, the Transcribe Bentham project aims to produce TEI-encoded outputs rather than generic ASCII text, potentially imposing greater barriers to entry for novice transcribers.

FreeCEN¹ is a project which aims to deliver a crowd-sourced set of records from the decennial British censuses of 1841 to 1891. The data are being assembled through a distributed transcription project, based on previously assembled volumes of enumerators’ returns, which exist in physical form and on microfiche. FreeCEN is part of FreeUKGEN, an initiative aiming to create freely accessible databases of genealogical records for use by family historians; the other member projects are FreeREG, a collection of parish registers, and FreeBMD, a collection of vital events registers. The database has grown steadily since its inception in 1999, but coverage is variable, with differing levels of completeness between censuses and for individual counties within each census. The current delivery of data from the FreeCEN project is targeted explicitly at genealogists, and is already of benefit to family historians by delivering freely accessible data. However, the data (and associated interfaces) are not designed for demographic analysis of the content. Two main approaches may be of interest to demographers, historical geographers and historians in this regard: a series of cross-sectional observations, and a linked sample (such that an individual and his/her circumstances at decennial intervals might be observed).

The first aim of the project reported in this paper is to extract aggregate counts of individuals grouped by their county of enumeration (that is, their place of residence at the time of the census) and their county of birth. Together, these form a ‘lifetime net migration’ matrix, which can be disaggregated by age and sex to allow exploration of differential patterns of mobility for a variety of sub-populations in nineteenth-century Britain. An associated requirement will be the aggregation of area-level population counts (disaggregated by the same age and sex categories to be used in the migration analysis) in order to permit systematic calculation of migration rates (as opposed to absolute volumes). A second phase of analysis will use individual and household-level records to classify households into ‘household types’ (e.g. persons living alone, married couples with and without dependent or co-resident children, households with domestic staff present, households with lodgers present, etc.).
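The aggregation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (county of enumeration, county of birth, sex, age) and the sample records are invented for the example and do not reflect the actual FreeCEN database structure, and the rate shown uses the transcribed records themselves as the population denominator rather than separately aggregated area counts.

```python
from collections import Counter

# Invented records for illustration only:
# (county_of_enumeration, county_of_birth, sex, age)
records = [
    ("Cornwall", "Cornwall", "F", 34),
    ("Cornwall", "Devon", "M", 41),
    ("Devon", "Cornwall", "M", 23),
    ("Devon", "Devon", "F", 8),
    ("Devon", "Cornwall", "F", 52),
]


def lifetime_matrix(records, sex=None, age_range=None):
    """Count persons by (county of birth, county of enumeration).

    Optional filters restrict the count to one sex and/or an
    inclusive (min_age, max_age) range.
    """
    counts = Counter()
    for enum_county, birth_county, rec_sex, age in records:
        if sex is not None and rec_sex != sex:
            continue
        if age_range is not None and not (age_range[0] <= age <= age_range[1]):
            continue
        counts[(birth_county, enum_county)] += 1
    return counts


def born_elsewhere_rates(matrix):
    """Share of each county's enumerated population born in another county."""
    enumerated = Counter()
    born_elsewhere = Counter()
    for (birth_county, enum_county), n in matrix.items():
        enumerated[enum_county] += n
        if birth_county != enum_county:
            born_elsewhere[enum_county] += n
    return {c: born_elsewhere[c] / enumerated[c] for c in enumerated}


matrix = lifetime_matrix(records)
rates = born_elsewhere_rates(matrix)
women = lifetime_matrix(records, sex="F")
```

Keeping the filters as arguments means the same function yields the overall matrix and each age or sex sub-matrix, so the numerators and denominators used for rates share identical category definitions.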

As usable lifetime migration data sets are extracted, they will be made available via CIDER², the Centre for Interaction Data Estimation and Research (Stillwell & Duke-Williams 2003). CIDER is funded as part of the ESRC Census Programme 2006-2011, and currently provides access to interaction (migration and commuting) data sets from the 1981, 1991 and 2001 Censuses, together with an increasing number of flow data sets from a variety of administrative sources. The addition of data from historic censuses will require the modification and extension of existing metadata structures, in order to effectively document the data.

The FreeCEN project aims to transcribe data from the 1841, 1851, 1861, 1871, 1881 and 1891 Censuses of Britain. The 1841 Census can be seen as a transitional census. Whereas the first British censuses – from 1801 onwards – had been completed at an area level by local officials and members of the clergy, the 1841 Census was based around a household schedule which listed each individual together with basic demographic details (rounded age, sex, profession and details of place of birth). However, a variety of organisational weaknesses (Higgs 1989) were reflected in idiosyncratic results. A much more robust administrative structure was developed from the 1851 Census onwards, with improved instructions given to both enumerators and householders. The number and range of questions asked of householders changed over the course of the nineteenth century, with new questions introduced regarding marital status, relationship to head of family, whether the individual was blind, deaf and dumb, an ‘imbecile or idiot’, or ‘lunatic’, and whether the individual was an employer or employee. The aggregation work will concentrate in the first instance on results from the 1891 Census, as these are expected to be of higher quality than the earlier returns. A second strand will then examine data from the 1841 Census, precisely because the quality of the enumerators’ returns was poorer, and thus systemic problems are likely to be exposed. How do transcribers cope with poorly recorded or ambiguous original data? How widely do transcribers vary in their notation style?

As noted above, the data are far from complete, and so comprehensive national analysis is not currently possible. However, there are some counties such as Cornwall for which enumeration is complete or close to completion for all censuses in the period, and county or region based analyses should be possible. The overriding aim of the work reported in this paper is to explore the potential for use of these data, and to ascertain the ease with which aggregate observations can be extracted from the FreeCEN database as it grows in the future.

More significantly, there are subtler issues with the records contained within the FreeCEN data that need to be explored. Primarily, it is necessary to examine the accuracy of transcriptions. Over the course of the nineteenth century the administration of census taking improved, and better instructions to enumerators (probably in combination with improved literacy rates in the general population) led to an overall improvement in the quality of the original data; however, the accuracy of current transcriptions must also be assessed. It must also be noted that both the quality and comprehensiveness of the records transcribed will vary by area and transcriber: the initial impetus for many will be transcription of their own ancestors, and the sample may not therefore be complete or representative. The degree to which the sample is representative can be gauged by comparison with published totals, although regional and small-area totals are less readily available than national totals, constraining the ability to make local-level assessments of quality. Mapping of the data will demonstrate the extent to which there are spatial biases (at both local and national levels) in the records selected for transcription.
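The comparison with published totals can be expressed as a simple coverage ratio per area. The sketch below is a minimal illustration, not the project's method: the area names and all counts are invented placeholders, and areas with no published total are returned as None so that they are not mistaken for areas with zero coverage.

```python
def coverage(transcribed, published):
    """Return transcribed / published count per area.

    Areas lacking a published total map to None, since coverage
    cannot be assessed for them.
    """
    result = {}
    for area, n in transcribed.items():
        total = published.get(area)
        result[area] = n / total if total else None
    return result


# Invented placeholder counts, for illustration only.
transcribed_counts = {"Area A": 950, "Area B": 400, "Area C": 120}
published_totals = {"Area A": 1000, "Area B": 800}

ratios = coverage(transcribed_counts, published_totals)
```

Ratios well below 1 flag areas where the transcribed sample is partial, and the None entries make explicit where the lack of small-area published totals prevents any local assessment.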

The lifetime net migration tables to be produced from this project will permit new analysis of mobility in Victorian Britain. Working with the results of the 1871 and 1881 Censuses, Ravenstein (1885) derived a number of frequently cited and debated ‘laws of migration’, including that migrants tend to move in a series of small steps towards a ‘centre of absorption’, that counter-flows exist, and that migration propensities reduce with distance. The actual ‘laws’ are not enumerated as a clear list, but described at separate locations within more than one paper; for a fuller discussion see Grigg (1977). In a discussion of Ravenstein’s work, Tobler (1995) states that the most interesting of a set of maps included in the 1885 paper is not referenced or described in the text, noting that such an omission would cause most modern editors to remove the map. Such a map could perhaps be re-created if migration matrices from the period were made available. Many subsequent studies have been made using the originally published results of these censuses (see for example, Lawton (1968) and Friedlander and Roshier (1966)). However, source data are hard to find, and it is now difficult for the general researcher to re-examine such work: thus, an aim of the work reported in this paper is to facilitate future research in this area.

A second approach to the study of these data is to develop a life course analysis: to find individuals linked across censuses and examine the ways in which their lives have developed at decennial intervals. Whilst appealing, this approach is significantly harder due to difficulties in linking individuals between censuses. Anderson (1972), for example, studying a sample of 475 people in consecutive censuses (1851, 1861), found that 14% of the sample had inconsistent birthplace recording. Some inconsistencies are down to minor misspellings, but this underlines the difficulty of matching individuals.
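One way to tolerate minor misspellings when proposing links is approximate string comparison. The following sketch is an assumption-laden illustration rather than a method taken from the project: the field names, the similarity threshold of 0.8, and the two-year age tolerance are all hypothetical choices, and the example persons are invented. It uses the standard-library difflib.SequenceMatcher to score similarity.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """True if two strings are approximately equal, ignoring case and padding."""
    ratio = SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
    return ratio >= threshold


def is_candidate_link(earlier, later, gap=10, age_tolerance=2):
    """Decide whether two census records might describe the same person.

    `gap` is the number of years between the two censuses; recorded ages
    are allowed to drift from it by up to `age_tolerance` years.
    """
    same_name = (similar(earlier["forename"], later["forename"])
                 and similar(earlier["surname"], later["surname"]))
    same_sex = earlier["sex"] == later["sex"]
    age_ok = abs((later["age"] - earlier["age"]) - gap) <= age_tolerance
    birthplace_ok = similar(earlier["birthplace"], later["birthplace"])
    return same_name and same_sex and age_ok and birthplace_ok


# Invented example records, for illustration only.
person_1851 = {"forename": "Mary", "surname": "Trevithick", "sex": "F",
               "age": 22, "birthplace": "Penzance"}
person_1861 = {"forename": "Mary", "surname": "Trevithick", "sex": "F",
               "age": 33, "birthplace": "Pensance"}
elsewhere_1861 = dict(person_1861, birthplace="Truro")

linked = is_candidate_link(person_1851, person_1861)
not_linked = is_candidate_link(person_1851, elsewhere_1861)
```

A rule of this kind only proposes candidates: common names within a single parish would still yield multiple matches, so any real linkage would need further evidence such as household composition.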

This project is currently at its outset, and the paper will report on progress and demonstrate how the data can be accessed and used.

References

Anderson, M. (1972). The study of family structure. In E. Wrigley (ed.), Nineteenth-century society. Cambridge: Cambridge UP.

Friedlander, D., and R. Roshier (1966). A Study of Internal Migration in England and Wales: Part I. Population Studies 19(3): 239-279.

Grigg, D. (1977). E. G. Ravenstein and the “laws of migration”. Journal of Historical Geography 3(1): 41-54.

Hart, M. (1992). Gutenberg: the history and philosophy of Project Gutenberg. http://www.gutenberg.org/wiki/Gutenberg:The_History_and_Philosophy_of_Project_Gutenberg_by_Michael_Hart, retrieved 31 October 2011.

Higgs, E. (1989). Making Sense of the Census, Public Record Office Handbooks No 23. London: HMSO.

Howe, J. (2007). The Rise of Crowdsourcing. Wired 14(6). http://www.wired.com/wired/archive/14.06/crowds.html, retrieved 31 October 2011.

Lawton, R. (1968). Population Changes in England and Wales in the Later Nineteenth Century: an Analysis of Trends by Registration Districts. Transactions of the Institute of British Geographers 44: 55-74.

Ravenstein, E. (1885). The Laws of Migration. Journal of the Statistical Society 48(2): 167-235.

Stillwell, J., and O. Duke-Williams (2003). A new web-based interface to British census of population origin-destination statistics. Environment and Planning A, 35(1): 113-132.

Terras, M. (2010). Crowdsourcing cultural heritage: UCL’s Transcribe Bentham project. Presented at: Seeing Is Believing: New Technologies For Cultural Heritage. International Society for Knowledge Organization, UCL (University College London).

Tobler, W. (1995). Ravenstein, Thornthwaite, and beyond. Urban Geography 16(4): 327-343.

Notes

1. http://www.freecen.org.uk

2. http://cider.census.ac.uk