No source: created in electronic format.
The DFG-funded project Deutsches Textarchiv (German Text Archive, DTA) is based at the Berlin-Brandenburgische Akademie der Wissenschaften (Berlin-Brandenburg Academy of Sciences and Humanities, BBAW) and digitizes a large cross-genre corpus of historical German texts.
The DTA provides linguistic applications for its corpus, e.g. tokenization, lemmatization, lemma-based and phonetic search, and rewrite rules to handle historical spelling variation.
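Such rewrite rules can be pictured as ordered string replacements that map a historical spelling to a candidate modern form. A minimal sketch in Python (the rules shown are invented examples for illustration, not the DTA's actual rule set):

```python
import re

# Illustrative rewrite rules for historical German spelling variants;
# simplified examples only, not the DTA's actual rules.
REWRITE_RULES = [
    (re.compile(r"th"), "t"),   # "thun"   -> "tun"
    (re.compile(r"ey"), "ei"),  # "seyn"   -> "sein"
    (re.compile(r"uo"), "u"),   # "Bluome" -> "Blume"
]

def normalize(token: str) -> str:
    """Apply each rewrite rule in order to map a historical
    spelling to a candidate modern form."""
    for pattern, replacement in REWRITE_RULES:
        token = pattern.sub(replacement, token)
    return token

print(normalize("seyn"))  # -> "sein"
print(normalize("thun"))  # -> "tun"
```

In practice such rules feed a search index, so a query for a modern lemma also retrieves its historical variants.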
Each text in the DTA is encoded in the XML/TEI P5 format. The markup describes textual structures (headings, paragraphs, speakers, verse lines, index items, etc.) as well as the physical layout of the text, down to the position of each character on a page.
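To make this concrete, here is a tiny, much simplified TEI fragment and how its structural markup can be accessed programmatically, sketched with Python's standard library (the real DTA documents are far richer and also record facsimile coordinates):

```python
import xml.etree.ElementTree as ET

# A tiny, simplified TEI fragment; real DTA documents carry much
# more structural and layout markup than this.
TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <pb n="1"/>
    <head>Erstes Capitel</head>
    <p>Es war einmal ...</p>
    <pb n="2"/>
    <p>Fortsetzung des Textes.</p>
  </body></text>
</TEI>"""

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
root = ET.fromstring(TEI)

# <pb/> milestones delimit the individual facsimile pages.
pages = [pb.get("n") for pb in root.iter("{http://www.tei-c.org/ns/1.0}pb")]
print(pages)                               # ['1', '2']
print(len(root.findall(".//tei:p", NS)))   # 2
```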
Although our corpus of historical texts is of very good quality overall, many errors still occur in the transcription, in the markup, or even at the presentation level. Due to the heterogeneity of the corpus, e.g. in terms of text genres (novels, prose, scientific essays, linguistic reference works, cookbooks, etc.), there is a strong demand for a collaborative, easy-to-use quality assurance environment.
As of October 2011, the corpus comprises more than 260,000 pages (half a billion characters), amounting to several gigabytes of XML. Even though our digitization providers guarantee an accuracy rate of 99.95 %, many errors remain undetected, not to mention problems in the presentation layer of the DTA or mistakes in the workflow.
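The scale of the problem follows directly from these numbers. A quick back-of-the-envelope estimate (assuming the 99.95 % rate applies uniformly per character):

```python
# Even at the guaranteed character accuracy of 99.95 %, a corpus of
# half a billion characters can still contain on the order of
# 250,000 erroneous characters.
corpus_chars = 500_000_000
accuracy = 0.9995

expected_errors = corpus_chars * (1 - accuracy)
print(round(expected_errors))  # 250000
```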
There are many kinds of possible errors in our transcribed texts: transcription errors (e.g. due to illegible passages or text written in foreign scripts such as Hebrew, Greek, or runic) sometimes require specialized background knowledge. We therefore created various tools to help users find potentially problematic spots in our texts and to help transcribers obtain better results faster.
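One simple heuristic from this family of tools can be sketched as follows; the script check below is an invented illustration of the kind of detector such a tool might use, not one of the DTA's actual components:

```python
import unicodedata

def suspicious_tokens(text):
    """Flag tokens containing characters from scripts that are
    unexpected in a German text, so a reviewer with the relevant
    background knowledge can inspect them."""
    flagged = []
    for token in text.split():
        for ch in token:
            if ch.isalpha():
                name = unicodedata.name(ch, "")
                if name.startswith(("HEBREW", "GREEK", "RUNIC")):
                    flagged.append(token)
                    break
    return flagged

print(suspicious_tokens("Das Wort λόγος stammt aus dem Griechischen."))
# -> ['λόγος']
```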
In addition, the DTA provides an interface, DTAE ("DTA-Erweiterungen"), for the integration of texts contributed by external projects.
Quality assurance (QA) also has to take into account other levels of error-prone representations and tasks, namely metadata, XML/TEI annotation, HTML presentation (and other media), and the robustness of the workflow. DTAQ is our QA system dealing with all these potential errors: they need to be reported, stored, and fixed.
DTAQ is accessible via a web browser; registered users can proofread the texts page by page and report any errors they find.
Our linguistic tools (CAB; cf. Jurish 2010, 2012) are also integrated into the platform.
The backend of DTAQ is built upon many open source packages, using Perl as a glue language. The system runs on the Catalyst web framework, accesses a PostgreSQL database via the DBIx::Class ORM, and builds its web pages with the Template Toolkit. The frontend makes heavy use of jQuery and Highcharts JS to create a highly interactive and responsive user interface.
Our XML/TEI files are automatically split up into individual pages and stored in a git repository.
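The page-splitting step can be sketched roughly like this; the filenames and the `<pb/>`-based splitting are assumptions for illustration, not the DTA's actual scheme:

```python
import re

def split_pages(tei_body: str):
    """Split a TEI body into per-page chunks at <pb/> milestones.
    Schematic only: in a real pipeline each chunk would be written
    to its own file and committed to the git repository, so every
    correction becomes a traceable commit."""
    parts = re.split(r'<pb n="(\d+)"/>', tei_body)
    # re.split yields: [before-first-pb, n1, text1, n2, text2, ...]
    pages = {}
    for i in range(1, len(parts) - 1, 2):
        pages[f"page-{parts[i]}.xml"] = parts[i + 1].strip()
    return pages

body = '<pb n="1"/><p>Seite eins.</p><pb n="2"/><p>Seite zwei.</p>'
print(sorted(split_pages(body)))  # ['page-1.xml', 'page-2.xml']
```

Versioning each page separately keeps diffs small and makes it easy to see exactly which corrections were applied to which page.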
Our poster will present the DTAQ workflow patterns, along with a live demonstration of the various views, tools, and features of the quality assurance platform.
Geyken, A., et al. (2011). Das Deutsche
Textarchiv: Vom historischen Korpus zum aktiven Archiv. In S. Schomburg,
C. Leggewie, H. Lobin, and C. Puschmann (eds.), Digitale Wissenschaft. Stand und Entwicklung digital vernetzter
Forschung in Deutschland, 20./21. September 2010. Beiträge der
Tagung. 2., ergänzte Fassung. Köln: HBZ, pp. 157-161.
Jurish, B. (2010). More than Words: Using Token
Context to Improve Canonicalization of Historical German. Journal for Language Technology and Computational
Linguistics (JLCL) 25(1).
Jurish, B. (2012). Finite-state
Canonicalization Techniques for Historical German. PhD thesis,
Universität Potsdam 2012 (urn:nbn:de:kobv:517-opus-55789).
Unsworth, J. (2011). Computational Work with Very
Large Text Collections. Interoperability, Sustainability, and the TEI.
Journal of the Text Encoding Initiative 1
(http://jtei.revues.org/215, 2011-08-29).