Establishing large reference corpora for the historical periods of German, i.e. Old and Middle High German, Early New High German (ENHG), New High German and Contemporary German (NHG), has been and remains one of the core aims of German Digital Humanities. This panel focuses on NHG and ENHG. A reference corpus of 17th- to 19th-century German is currently being compiled by the Deutsches Textarchiv (DTA) project at the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW). The Herzog August Bibliothek Wolfenbüttel (HAB) is building text corpora covering the period between the 15th and the 18th century, with a focus on ENHG. Apart from core activities like these, which are usually carried out by institutions devoted to long-term research such as academies, research centers and research libraries, many digital resources are created by individual scholars or small project teams. Often, however, such resources never show up in the pool of publicly available research corpora.
Building an integrated corpus infrastructure for large corpora of ENHG and NHG faces three main problems: resources of heterogeneous origin and encoding must be made interoperable, the producers of resources must receive appropriate credit, and public access to the resources must be ensured in a persistent way.
Well-established infrastructure projects such as TextGrid, CLARIN or DARIAH can contribute enormously to this process of integration by establishing methods of interoperation, a system of quality assurance and credit, and a set of technical practices that address precisely these problems: integrating resources of different origin, crediting the producers of resources appropriately, ensuring persistent public access, and thereby involving a greater scholarly community.
The proposed panel, organized by the Deutsches Textarchiv at the BBAW, the Special Interest Group ‘Deutsche Philologie’ (‘German Philology’) in CLARIN-D and the HAB Wolfenbüttel, will demonstrate how efforts at community building, text aggregation, the technical realization of interoperability and the collaboration of two large infrastructure centers (BBAW and HAB) are put to work in the digital publication platforms of the Deutsches Textarchiv (BBAW) and AEDit (HAB).
Technical requirements include tools for long-term archiving as well as the implementation and documentation of reference formats. Interfaces need to be standardized, easy to handle and well documented in order to minimize the effort required of users to upload their resources. Transparent criteria have to be developed regarding the expected quality, the encoding level and the interoperability requirements for texts in the repository. The DTA provides a large corpus of historical texts of various genres: a heterogeneous text base encoded in one consistent format, the DTA ‘base format’. Facilities for quality assurance (DTAQ) and for the integration of external text resources (DTAE) have been developed.
The DTA ‘base format’ consists of about 80 TEI P5 elements for the <text> region, which are needed for the basic formal and semantic structuring of the DTA reference corpus. The purpose of developing the ‘base format’ was to achieve coherence at the annotation level, given the heterogeneity of the DTA text material over time (1650-1900) and across text types (fiction, functional and scientific texts). The restrictive selection of ‘base format’ elements, together with their corresponding attributes and values, is intended to cover all annotation requirements for a comparable level of structuring of historical texts. We will illustrate this by means of characteristic annotation examples taken from the DTA reference corpus.
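To give a concrete impression of this level of structuring, here is a minimal sketch of a ‘base format’-style transcription. The elements shown (<div>, <head>, <pb/>, <p>, <lb/>) are standard TEI P5 and typical of the format; the sample text and attribute values, however, are invented for illustration and do not reproduce an actual DTA document.

    <text>
      <body>
        <!-- formal structuring: a chapter division with its heading -->
        <div n="1" type="chapter">
          <head>Erstes Capitel.</head>
          <!-- page break, linked to the facsimile of the source page -->
          <pb facs="#f0012" n="5"/>
          <!-- line breaks preserve the lineation of the printed source -->
          <p>Es war ein nebliger Morgen, als wir die Stadt<lb/>
            verließen und uns auf den Weg machten.</p>
        </div>
      </body>
    </text>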
We will compare the DTA ‘base format’ to other existing base formats with regard to their different purposes and discuss the usability of the DTA ‘base format’ in a broader context (DTAE, CLARIN-D). An adaptation of the oXygen TEI framework which supports work with the DTA ‘base format’ will also be presented.
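One typical ingredient of such editor support is associating a document with the ‘base format’ schema, so that oXygen can validate and suggest elements against it. The xml-model processing instruction below is the standard W3C mechanism for such an association; the schema address shown is an assumed placeholder, not the documented location of the DTA schema.

    <?xml-model href="http://www.deutschestextarchiv.de/basisformat.rng"
                type="application/xml"
                schematypens="http://relaxng.org/ns/structure/1.0"?>
    <!-- the href above is illustrative; consult the DTA documentation
         (see link below) for the authoritative schema location -->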
DTAE is a software module provided by the DTA for external projects interested in integrating their historical text collections into the DTA reference corpus. DTAE provides routines for uploading metadata, text and images, as well as semi-automatic conversion tools from different source formats into the DTA ‘base format’. Text and metadata are indexed for lemma-based full-text search and processed for presentation, offering parallel views of the source image, the XML/TEI-encoded text and a rendered HTML presentation layer. In addition, external contributors can integrate the processed text into their own websites via <iframe> and use the DTA query API for full-text queries. DTAE demonstrates how interchange and interoperability among projects can work on a large scale. The paper illustrates these issues with examples from five selected cooperation projects. Possibilities (and limits) of the exchange of TEI documents will be discussed, as well as the difficult but worthwhile task of converting other or older text formats into the DTA ‘base format’.
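The embedding mechanism itself is plain HTML. The sketch below shows how a partner site might frame a DTA-hosted presentation view; the URL pattern is invented for illustration and does not reproduce the actual DTAE interface.

    <!-- hedged sketch: embedding a DTA presentation view in a partner
         website; the src pattern is an assumption, not the documented
         DTAE URL scheme -->
    <iframe src="http://www.deutschestextarchiv.de/book/view/EXAMPLE_ID"
            width="800" height="600">
    </iframe>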
DTAQ is a browser-based tool for finding, categorizing, and correcting various kinds of errors or inconsistencies that occur during the XML/TEI transcription or the optical character recognition (OCR) of texts. Using a simple authentication system combined with fine-grained access control, new users can easily be added to this quality assurance system. The GUI of the tool is highly customizable and provides various views of source images, XML transcriptions, and HTML presentations. The backend of DTAQ is built upon a number of open-source packages: using Perl as a glue language, the system runs on Catalyst, connects to a PostgreSQL database via the DBIx::Class ORM and builds its web pages with Template Toolkit. The frontend makes heavy use of jQuery and Highcharts JS to create a highly interactive and responsive user interface.
In the field of digital humanities there is a great demand for trustworthy platforms or repositories that aggregate, archive, make available and disseminate scholarly texts or databases. AEDit (Archiving, Editing, Disseminating) aims at establishing such a platform at the HAB Wolfenbüttel for documents with a focus on ENHG. Partners and contributors of the initial phase are four outstanding editorial projects from the German academy program. Central issues to be addressed include: creating persistent identifiers at word level by means of IDs or XPointer; determining the relation of community-specific markup to basic text encoding by means of stand-off markup; defining a basic format for texts and databases; integrating research data originating from biographic, bibliographic, subject-related and lexical databases; examining ways of interconnecting with existing editorial tools such as TextGrid; and, last but not least, setting up an institutional framework to support scholars who are willing to participate or publish online.
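To make the combination of word-level identifiers and stand-off markup concrete, the following is a minimal sketch in TEI-style XML. The mechanisms shown (<w> tokens carrying xml:id, a <spanGrp> layer pointing back into the text) are standard TEI devices used here as illustrative assumptions, not the AEDit specification.

    <!-- base text: every token carries a persistent identifier -->
    <p>
      <w xml:id="w101">Von</w>
      <w xml:id="w102">Gottes</w>
      <w xml:id="w103">Gnaden</w>
    </p>

    <!-- stand-off layer: community-specific annotation references the
         tokens by ID instead of modifying the base encoding -->
    <spanGrp type="rhetoric">
      <span from="#w101" to="#w103">formulaic invocation</span>
    </spanGrp>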
The task of the Special Interest Group ‘Deutsche Philologie’ in CLARIN-D is to support the infrastructure centers (IDS, BBAW, MPI, HAB) in building up, enriching and integrating German language resources (e.g. corpora, tools, dictionaries, best practices in research methodology). One goal in the field of historical corpora is to create an infrastructure that allows users (i) to integrate their own documents into historical reference corpora; (ii) to gain proper credit and reputation in doing so; and (iii) to reuse the documents in local corpora together with expert corpus technology. In this talk I will demonstrate usage scenarios that show how the DTA extension infrastructure (DTAE) can be used to create specialized corpora for specific research topics. These usage scenarios are intended to serve as prototypes for collaboration and community building in ENHG and NHG corpus research.
CLARIN-D http://clarin-d.net/
DTA http://www.deutschestextarchiv.de/
DTA ‘base format’, description [in German]: http://www.deutschestextarchiv.de/doku/basisformat
DTAE http://www.deutschestextarchiv.de/dtae
DTAQ http://www.deutschestextarchiv.de/dtaq
Herzog August Bibliothek Wolfenbüttel http://www.hab.de/
Wolfenbütteler Digitale Bibliothek (WDB) http://www.hab.de/bibliothek/wdb/