<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="../schema/xmod_web.rnc" type="compact"?>
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0"
           xmlns:xmt="http://www.cch.kcl.ac.uk/xmod/tei/1.0"
           xml:id="ab-141">
    <teiHeader>
        <fileDesc>
            <titleStmt>
                <title>Topic Modeling the Past</title>
                <author>
                    <name>Nelson, Robert K.</name>
                    <affiliation>University of Richmond, USA</affiliation>
                    <email>rnelson2@richmond.edu</email>
                </author>
                <author>
                    <name>Mimno, David</name>
                    <affiliation>Princeton University, USA</affiliation>
                    <email>david.mimno@gmail.com</email>
                </author>
                <author>
                    <name>Brown, Travis</name>
                    <affiliation>University of Maryland, College Park, USA</affiliation>
                    <email>travisrobertbrown@gmail.com</email>
                </author>
            </titleStmt>
            <publicationStmt>
                <publisher>Jan Christoph Meister, Universität Hamburg</publisher>
                <address>
                    <addrLine>Von-Melle-Park 6, 20146 Hamburg, Tel. +4940 428 38 2972</addrLine>
                    <addrLine>www.dh2012.uni-hamburg.de/</addrLine></address>
            </publicationStmt>
            <sourceDesc>
                <p>No source: created in electronic format.</p>
            </sourceDesc>
        </fileDesc>
        <revisionDesc>
            <change>
                <date>2012-04-15</date>
                <name>DH</name>
                <desc>generate TEI-template with data from ConfTool-Export</desc>
            </change>
            <change>
                <date>2012-04-13</date>
                <name>LS</name>
                <desc>provide metadata for publicationStmt</desc>
            </change>
        </revisionDesc>
    </teiHeader>
    <TEI>
        <teiHeader>
            <fileDesc>
                <titleStmt>
                    <title>Introduction</title>
                </titleStmt>
                <publicationStmt>
                    <publisher>Jan Christoph Meister, Universität Hamburg</publisher>
                    <address>
                        <addrLine>Von-Melle-Park 6, 20146 Hamburg, Tel. +4940 428 38 2972</addrLine>
                        <addrLine>www.dh2012.uni-hamburg.de/</addrLine>
                    </address>
                </publicationStmt>
                <sourceDesc>
                    <p>No source: created in electronic format.</p>
                </sourceDesc>
            </fileDesc>
        </teiHeader>
        <text>
            <body>
                <div>
                <p>The enormous digitized archives of books, journals, and newspapers produced
                        during the past two decades present scholars with new opportunities – and
                        new challenges as well. The possibility of analyzing increasingly large
                        portions of the historical, literary, and cultural record is incredibly
                        exciting, but it cannot be done with conventional methods that involve close
                        reading or even not-so-close skimming. These huge new text archives
                        challenge us to apply new methods. This panel will explore one such method:
                        topic modeling.</p>
                    <p>Topic modeling is a probabilistic, statistical technique that uncovers themes
                        and topics and can reveal patterns in otherwise unwieldy amounts of text. In
                        topic modeling, a ‘topic’ is a probability distribution over words or, put
                        more simply, a group of words that often co-occur with each other in the
                        same documents. Generally these groups of words are semantically related and
                        interpretable; in other words, a theme, issue, or genre can often be
                        identified simply by examining the most common words in a topic. Beyond
                        identifying these words, a topic model estimates the proportion of each
                        document devoted to each topic, providing quantitative data that can be used to
                        locate documents on a particular topic or theme (or that combine multiple
                        topics) and to produce a variety of revealing visualizations about the
                        corpus as a whole.</p>
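                    <p>As a minimal sketch of how per-document topic proportions might be used to
                        locate documents, the Python fragment below ranks documents by the share of
                        their text assigned to one topic. The document identifiers, the proportions,
                        and the function name rank_documents are hypothetical illustrations, not
                        output from any model discussed by the panel.</p>

```python
def rank_documents(doc_topic, topic, threshold=0.0):
    """Return (doc_id, proportion) pairs for one topic, highest proportion first."""
    hits = [
        (doc_id, proportions[topic])
        for doc_id, proportions in doc_topic.items()
        if proportions[topic] >= threshold
    ]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical topic proportions for three documents and three topics.
DOC_TOPIC = {
    "dispatch-1861-04-15": [0.70, 0.20, 0.10],
    "dispatch-1862-04-20": [0.15, 0.60, 0.25],
    "times-1861-05-10": [0.40, 0.05, 0.55],
}

if __name__ == "__main__":
    # Documents in which topic 0 accounts for at least 30 percent of the text.
    for doc_id, share in rank_documents(DOC_TOPIC, topic=0, threshold=0.3):
        print(doc_id, share)
```

                    <p>The same table of proportions, summed by publication date, is the kind of
                        data from which a graph of a topic over time could be drawn.</p>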
                    <p>This panel will, first and foremost, illustrate the interpretative potential
                        of topic modeling for research in the humanities. Robert K. Nelson will
                        analyze the similarities and differences between Confederate and Union
                        nationalism and patriotism during the American Civil War using topic models
                        of two historic newspapers. Travis Brown will explore techniques to tailor
                        topic model generation using historical data external to a corpus to produce
                        more nuanced topics directly relevant to particular research
                        questions. David Mimno (chief maintainer of the most widely used topic
                        modeling software, MALLET) will describe his work using topic modeling to
                        generate – while respecting copyright – a new scholarly resource in the
                        field of Classics that derives from and organizes a substantial amount of
                        the twentieth-century scholarly literature.</p>
                    <p>The panel will also address methodological issues and demonstrate new
                        applications of topic modeling, including the challenge of topic modeling
                        across multi-lingual corpora, the integration of spatial analysis with topic
                        modeling (revealing the constructedness of space, on the one hand, and the
                        spatiality of culture, on the other), and the use of topic modeling to
                        generate visualizations useful for ‘distant reading.’ The panel thus addresses
                        issues of multilingualism, spatial history, data mining, and humanistic
                        research through computation.</p>
                </div>               
                </body>
            
        </text>
    </TEI>
    <TEI>
        <teiHeader>
            <fileDesc>
                <titleStmt>
                    <title>Modeling Nationalism and Patriotism in Civil War America</title>
                    <author>
                        <name>Nelson, Robert K.</name>
                        <affiliation>University of Richmond, USA</affiliation>
                        <email>rnelson2@richmond.edu</email>
                    </author>
                </titleStmt>
                <publicationStmt>
                    <publisher>Jan Christoph Meister, Universität Hamburg</publisher>
                    <address>
                    <addrLine>Von-Melle-Park 6, 20146 Hamburg, Tel. +4940 428 38 2972</addrLine>
                    <addrLine>www.dh2012.uni-hamburg.de/</addrLine></address>
                </publicationStmt>
                <sourceDesc>
                    <p>No source: created in electronic format.</p>
                </sourceDesc>
            </fileDesc>
        </teiHeader>
        <text>
            <body>
                <p>Scholars of the American Civil War have productively attended to particular
                    keywords in their analyses of the conflict’s causes and its participants’
                    motivations. Arguing that some words carried extraordinary political and
                    cultural weight at that moment, they have sought to unpack the deep connotations
                    of terms that are especially revealing and meaningful.  To take a couple of
                    recent examples, Elizabeth R. Varon frames <hi rend="italic">Disunion!: The
                        Coming of the American Civil War, 1789-1859</hi> (unsurprisingly) around the
                    term ‘disunion’: ‘This book argues that ‘disunion’ was once the most provocative
                    and potent word in the political vocabulary of Americans’ (Varon 2008:
                    1). Similarly, in <hi rend="italic">The Union War</hi> Gary W. Gallagher
                    emphasizes the importance of ‘Union,’ arguing that ‘No single word in our
                    contemporary political vocabulary shoulders so much historical, political, and
                    ideological meaning; none can stir deep emotional currents so easily’ (Gallagher
                    2011: 46). Other studies have used terms like ‘duty,’ ‘honor,’ ‘manliness,’
                    ‘freedom,’ ‘liberty,’ ‘nation,’ ‘republic,’ ‘civilization,’ ‘country,’ and
                    ‘patriotism’ to analyze the ideological perspectives and cultural pressures that
                    shaped the actions and perspectives of soldiers and civilians during the Civil
                    War (Linderman 1987; Prior 2010; Gallagher 1997: 73).</p>
                    <p>Together, the production of enormous digital archives of Civil War-era
                    documents in the past decade and the development of new sophisticated
                    text-mining techniques present us with an opportunity to build upon the
                    strengths of this approach while transcending some of its limitations. While
                    unquestionably insightful, arguments that have relied heavily upon keyword
                    analyses are open to a number of critiques. How do we know that the chosen
                    keywords are the best window through which to examine the issues under
                    investigation?  How can we know – especially in studies which rely upon keyword
                    searches in databases – that we have not missed significant evidence on the
                    topic that does not happen to use the exact terms we look for and analyze? Does
                    the selection of those words skew our evidence and predetermine what we
                    discover? Topic modeling addresses these critiques of the keyword approach while
                    offering us potentially even greater insights into the politics and culture of
                    the era. First, as a ‘distant reading’ approach it is comprehensive, allowing us
                    to analyze models that are drawn not from a selection but from the entirety of
                    massive corpora. Second, as it identifies word distributions (i.e. ‘topics’),
                    topic modeling encourages – even forces – us to examine larger groups of related
                    words, and it surfaces resonant terms that we might not have expected. Finally,
                    and perhaps most importantly, the topics identified by this technique are all
                    the more revealing because they are based on statistical relationships rather
                    than <hi rend="italic">a priori</hi> assumptions and preoccupations of a
                    researcher.</p>
                    <p>This presentation will showcase research into Union and Confederate
                    nationalism and patriotism that uses topic modeling to analyze the full runs of
                    the Richmond <hi rend="italic">Daily Dispatch</hi> and the <hi rend="italic">New
                        York Times</hi> during the war – taken together a corpus consisting of
                    approximately 90 million words. It will make three interrelated arguments drawn
                    from a combination of distant and close readings of topic models of the <hi
                        rend="italic">Dispatch</hi> and the <hi rend="italic">Times</hi>.</p>
                    <p>First, I will argue that Confederates and Yankees used the same patriotic
                    language to move men to risk their lives by fighting for their
                    countries. Distinct topic models for the <hi rend="italic">Dispatch</hi> and the
                        <hi rend="italic">Times</hi> each contain topics with substantially
                    overlapping terms (see table below) – terms saturated with patriotism. Typically
                    celebratory of the sacrifices men made in battle, the patriotic pieces (often
                    poems) in these similar topics from each paper aimed to accomplish the same
                    thing: to evoke a love of country, God, home, and family necessary to move men
                    to risk their lives and believe that it was glorious to die for their
                    country.</p>
                <p><figure>
                    <graphic url="tabl141-1.jpg" rend="left" height="256px" width="341px"
                            mimeType="image/jpeg"/>
                        <head>Table 1: The top 24 predictive words for two topics from the
                            Dispatch and the Times, with the words they shared in bold</head>
                    </figure></p>
              
                       
                <p>Second, I will suggest that southerners (or at least the editor of the <hi
                        rend="italic">Dispatch</hi>) developed a particularly vitriolic version of
                    Confederate nationalism to convince southern men to kill northerners. The <hi
                        rend="italic">Dispatch</hi> was full of articles that insisted that
                    northerners were a foreign, unchristian, and uncivilized people wholly unlike
                    southerners; in vicious editorials the <hi rend="italic">Dispatch’s </hi>editor
                    maintained that northerners were infidels and beasts whom it was not only acceptable
                    but righteous to kill. The remarkably similar signatures evident in a graph of
                    two topics (Figure 1) – one consisting of patriotic poetry aimed at moving men
                    to die, the other of vicious nationalistic editorials aimed at moving them to
                    kill – from a model of the <hi rend="italic">Dispatch</hi> suggest, first, the
                    close relationship between these two topics and, second, the particular moments
                    when these appeals needed to be made: during the secession crisis and the early
                    months of the war in the spring and summer of 1861 when the army was being
                    built, immediately following the implementation of the draft in April 1862, and
                    at the end of the war in early 1865 as Confederates struggled to rally the cause
                    as they faced imminent defeat.</p>
                
                <p><figure><graphic url="img141-1.jpg" rend="left" height="256px" width="341px" mimeType="image/jpeg"/><head>Figure 1</head></figure></p>
                
                    <p>Finally, I will argue that the kind of nationalism evident in the <hi
                        rend="italic">Dispatch</hi> was not and could not be used by northerners for
                    the same purpose. While northerners and southerners used the same language of
                    patriotism, there is no analog to the vicious nationalistic topic from the <hi
                        rend="italic">Dispatch</hi> in the topic model for the <hi rend="italic">New
                        York Times</hi>. Unionists insisted that the South was part of the United
                    States and southerners had been and continued to be Americans – though
                    traitorous Americans, to be sure. As a title of one article in the <hi
                        rend="italic">Times</hi> proclaimed, this was ‘Not a War against the
                    South.’ It was a war against traitors, the <hi rend="italic">Times</hi>
                    insisted, and the ‘swords [of Union soldiers] would as readily seek a Northern
                    heart that was false to the country as a Southern bosom’ (‘Not a War against the
                    South,’ 1861).  Northern nationalism is evident in the model for the <hi
                        rend="italic">Times </hi>in a more politically inflected topic on Unionism
                    and a second topic consisting of patriotic articles and poems.  The graphs of
                    these two topics (Figure 2) with spikes during election seasons suggest the
                    instrumental purpose of nationalistic and patriotic rhetoric in the <hi
                        rend="italic">Times</hi>: not to draw men into the army but rather to drive
                    them to the polls.  The editor of the <hi rend="italic">Times</hi> (correctly, I
                    think) perceived not military defeat but flagging popular will as the greatest
                    threat to the Union war effort, and victory by copperhead Democrats who supported
                    peace negotiations would have been the most potent expression of such a lack of
                    will.</p>
                    <p>In briefly making these historical and historiographic arguments
                        about nationalism and patriotism and about dying and killing in the American
                        Civil War, this presentation aims to demonstrate the interpretative
                        potential of topic modeling as a research methodology.</p>
                
                
                <p><figure>
                        <graphic url="img141-2.jpg" rend="left" height="256px" width="341px"
                            mimeType="image/jpeg"/><head>Figure 2</head></figure></p>
                                    
                
            </body>
            <back>
                <div>
                    <head>References</head> 
                        <p><hi rend="bold">Gallagher, G. W.</hi> (1997). <hi rend="italic">The
                            Confederate War</hi>. Cambridge: Harvard UP.</p>
                        <p><hi rend="bold">Gallagher, G. W.</hi> (2011). <hi rend="italic">The Union
                            War</hi>. Cambridge: Harvard UP.</p>
                        <p><hi rend="bold">Linderman, G. F.</hi> (1987). <hi rend="italic">Embattled Courage: The Experience of Combat in the American Civil War</hi>. New
                        York: Free Press.</p>
                        <p><hi rend="bold">Not a War Against the South.</hi> (1861). <hi
                            rend="italic">New York Times</hi>. 10 May. Available at: <ref
                            target="http://www.nytimes.com/1861/05/10/news/not-a-war-against-the-south.html"
                            type="external"
                            >http://www.nytimes.com/1861/05/10/news/not-a-war-against-the-south.html</ref>
                        [Accessed on 13 March 2012].</p>
                        <p><hi rend="bold">Prior, D.</hi>  (2010).  Civilization, Republic, Nation: Contested Keywords, Northern Republicans, and the Forgotten Reconstruction
                        of Mormon Utah. <hi rend="italic">Civil War History</hi> 56(3): 283-310.</p>
                        <p><hi rend="bold">Varon, E. R.</hi> (2008). <hi rend="italic">Disunion!: The
                            Coming of the American Civil War, 1789-1859</hi>. Chapel Hill: U of
                        North Carolina P.</p>
                    
                    
                </div>
            </back>
        </text>
    </TEI>
    <TEI>
        <teiHeader>
            <fileDesc>
                <titleStmt>
                    <title>Telling New Stories about our Texts: Next Steps for Topic Modeling in the Humanities</title>
                    <author>
                        <name>Brown, Travis</name>
                        <affiliation>University of Maryland, College Park, USA</affiliation>
                        <email>travisrobertbrown@gmail.com</email>
                    </author>
                </titleStmt>
                <publicationStmt>
                    <publisher>Jan Christoph Meister, Universität Hamburg</publisher>
                    <address>
                    <addrLine>Von-Melle-Park 6, 20146 Hamburg, Tel. +4940 428 38 2972</addrLine>
                    <addrLine>www.dh2012.uni-hamburg.de/</addrLine></address>
                </publicationStmt>
                <sourceDesc>
                    <p>No source: created in electronic format.</p>
                </sourceDesc>
            </fileDesc>
        </teiHeader>
        <text>
            <body>
                <div>
                   <p>Latent Dirichlet Allocation (LDA) topic modeling has quickly become one of
                        the most prominent methods for text analysis in the humanities, with
                        projects such as the work by Yang et al. (2011) on Texas newspapers and
                        Robert Nelson’s <hi rend="italic">Mining the Dispatch</hi> (Nelson 2010)
                        demonstrating its value for characterizing large text collections. As an
                        unsupervised machine learning technique, LDA topic modeling does not require
                        manually annotated training corpora, which are often unavailable (and
                        prohibitively expensive to produce) for specific literary or historical
                        domains, and it has the additional benefit of handling transcription errors
                        more robustly than many other natural language processing methods. The fact
                        that it accepts unannotated (and possibly uncorrected) text as input makes
                        it an ideal tool for exploring the massive text collections being digitized
                        and made available by projects such as Google Books and the HathiTrust
                        Digital Library.</p>
                    <p>LDA is an example of a generative model, and as such it has at its heart a
                        ‘generative story,’ which is a hypothetical narrative about how observable
                        data are generated given some non-observable parameters. In the LDA story,
                        we begin with a set of topics, which are simply probability distributions
                        over the vocabulary. The story then describes the process by which new
                        documents are created using these topics. This process (which has been
                        described many times in the topic modeling literature; see for example the
                        original presentation by Blei et al. (2003)) is clearly not a realistic
                        model of the way that humans compose documents, but when we apply LDA topic
                        modeling to a set of documents we assume that it is a useful simplification.
                        After making this assumption, we can essentially ‘play the story in
                        reverse,’ using an inference technique such as Gibbs sampling to learn a set
                        of topic distributions from our observed documents. Despite the simplicity
                        of the generative story, the method can produce coherent, provocative, and
                        sometimes uncannily  ‘insightful’ characterizations of collections of
                        documents.</p>
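                    <p>The generative story can be sketched in a few lines of Python. This is a
                        minimal illustration under stated assumptions, not the implementation used
                        in any of the projects discussed here: the toy vocabulary, the two
                        hand-written topics, the symmetric Dirichlet parameter of 0.5, and the
                        function names are all hypothetical, and the sketch covers only the forward
                        (generating) direction, not inference.</p>

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw one point from a symmetric Dirichlet(alpha) of the given size."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, vocabulary, length, alpha, rng):
    """Follow the generative story for a single document.

    topics is a list of probability distributions over the vocabulary.
    Returns the document's topic proportions and its list of words.
    """
    # 1. Choose how much of the document is devoted to each topic.
    proportions = sample_dirichlet(alpha, len(topics), rng)
    words = []
    for _ in range(length):
        # 2. For each word slot, pick a topic according to those proportions...
        k = rng.choices(range(len(topics)), weights=proportions)[0]
        # 3. ...then pick a word according to that topic's distribution.
        words.append(rng.choices(vocabulary, weights=topics[k])[0])
    return proportions, words


# Hypothetical toy vocabulary and two hand-written topics (illustration only).
VOCABULARY = ["flag", "glory", "country", "cotton", "price", "market"]
TOPICS = [
    [0.35, 0.30, 0.30, 0.02, 0.02, 0.01],  # a 'patriotism' topic
    [0.01, 0.02, 0.02, 0.35, 0.30, 0.30],  # a 'commerce' topic
]

if __name__ == "__main__":
    rng = random.Random(0)
    proportions, words = generate_document(TOPICS, VOCABULARY, 12, 0.5, rng)
    print([round(p, 2) for p in proportions])
    print(" ".join(words))
```

                    <p>Inference runs this process in the other direction: given only the words of
                        many documents, it estimates the topics and the per-document proportions
                        that would make those words likely.</p>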
                    <p>While LDA topic modeling has a clear value for many applications, some
                        researchers in the fields of information retrieval and natural language
                        processing have described it as ‘something of a fad’ (Boyd-Graber 2011), and
                        suggest that more attention should be paid to the broader context of
                        generative and latent variable modeling. Despite the relatively widespread
                        use of LDA as a technique for textual analysis in the humanities, there has
                        been little work on extending the model in projects with a literary or
                        historical focus. In this paper I argue that extending LDA – to incorporate
                        non-textual sources of information, for example – can result in models that
                        better support specific humanities research questions, and I discuss as
                        examples two projects (both of which are joint work by the author and
                        others) that add elements to the generative story specified by LDA in order
                        to perform more directed analysis of nineteenth-century corpora.</p>
                    <p>The first of these projects extends LDA to incorporate geographical
                        information in the form of a gazetteer that maps place names to geographical
                            coordinates.<note>Joint work by the author with Jason Baldridge, Katrin
                            Erk, Taesun Moon, and Michael Speriosu. Aspects of this work were
                            presented by Speriosu et al. (2010) and will appear in an article in an
                            upcoming special issue of <hi rend="italic">Texas Studies in Literature
                                and Language</hi>.</note> We propose a <hi rend="italic">region
                            topic model</hi> that identifies topics with regions on the surface of
                        the Earth, and constrains the generative story by requiring each toponym to
                        be generated by a topic whose corresponding region contains a place with
                        that name, according to the gazetteer. This approach provides a distribution
                        over the vocabulary for each geographical region, and a distribution over
                        the surface of the Earth for each word in the vocabulary. These
                        distributions can support a wide range of text analysis tasks related to
                        geography; we have used this system to perform toponym disambiguation on
                        Walt Whitman’s <hi rend="italic">Memoranda During the War</hi> and a
                        collection of nineteenth-century American and British travel guides and
                        narratives, for example.</p>
                    <p>The second project applies a supervised extension of LDA (Boyd-Graber &amp;
                        Resnik 2010) to a collection of Civil War-era newspapers.<note>Joint work by the author with Jordan Boyd-Graber and Thomas Clay Templeton. We have also presented results from experiments using casualty rates as the response variable at the 2011 Chicago Colloquium on Digital Humanities and Computer Science.</note> In this extension the model predicts an observed response
                        variable associated with a document – in our case contemporaneous historical
                        data such as casualty rates or the consumer price index – on the basis of that
                        document’s topics. We show that this approach can produce more coherent
                        topics than standard LDA, and it also captures correlations between the
                        topics discovered in the corpus and the historical data external to the
                        corpus.</p>
                    <p>Both of these projects preserve the key advantages that the unsupervised
                        nature of LDA topic modeling entails – the ability to operate on large,
                        unstructured, and imperfectly transcribed text collections, for example –
                        while adding elements of supervision that improve the generated topics and
                        support additional kinds of analysis. While we believe that our results in
                        these experiments are interesting in their own right, they are presented
                        here primarily as examples of the value of tailoring topic modeling
                        approaches to the available contextual data for a domain and to specific
                        threads of scholarly investigation.</p>
                <p>This work was supported in part
                        by grants from the New York Community Trust and the Institute of Museum and
                        Library Services.</p>
                    
                </div>
            </body>
            <back>
                <div>
                    <head>References</head>
                        <p><hi rend="bold">Blei, D. M., A. Ng, and M. Jordan</hi>
                                (2003). Latent Dirichlet Allocation. <hi rend="italic">Journal of
                                    Machine Learning Research</hi> 3: 993–1022.</p>
                    <p><hi rend="bold">Boyd-Graber, J.</hi> (2011). Frequently Asked Questions. <ref
                            target="http://www.umiacs.umd.edu/~jbg/static/faq.html" type="external"
                            >http://www.umiacs.umd.edu/~jbg/static/faq.html</ref> (accessed 23 March
                        2012).</p>
                    <p><hi rend="bold">Boyd-Graber, J., and P. Resnik</hi> (2010). Holistic
                        Sentiment Analysis Across Languages: Multilingual Supervised Latent
                        Dirichlet Allocation. In <hi rend="italic">Proceedings of Empirical Methods
                            in Natural Language Processing</hi>. Cambridge, MA, October 2010.</p>
                    <p><hi rend="bold">Nelson, R. K.</hi> (2010). <hi rend="italic">Mining
                        the Dispatch</hi>. <ref target="http://dsl.richmond.edu/dispatch/" type="external">http://dsl.richmond.edu/dispatch/</ref> (accessed 23 March 2012).</p>
                    <p><hi rend="bold">Speriosu, M., T. Brown, T. Moon, J. Baldridge, and K.
                            Erk</hi> (2010). Connecting Language and Geography with Region-Topic
                        Models. In <hi rend="italic">Proceedings of the 1st Workshop on
                            Computational Models of Spatial Language Interpretation</hi>. Portland,
                        OR, August 2010.</p>
                    <p><hi rend="bold">Yang, T., A. Torget, and R. Mihalcea</hi> (2011). Topic
                        Modeling on Historical Newspapers. In <hi rend="italic">Proceedings of the
                            5th ACL-HLT Workshop on Language Technology for Cultural Heritage,
                            Social Sciences, and Humanities</hi>. Portland, OR, June 2011. <ref
                            target="http://www.aclweb.org/anthology/W11-1513" type="external"
                            >http://www.aclweb.org/anthology/W11-1513</ref> (accessed 23 March
                        2012).</p>


                    
                </div>
            </back>
        </text>
    </TEI>
    <TEI>
        <teiHeader>
            <fileDesc>
                <titleStmt>
                    <title>The Open Encyclopedia of Classical Sites: Non-consumptive Analysis from 20th Century Books</title>
                    <author>
                        <name>Mimno, David</name>
                        <affiliation>Princeton University, USA</affiliation>
                        <email>david.mimno@gmail.com</email>
                    </author>
                </titleStmt>
                <publicationStmt>
                    <publisher>Jan Christoph Meister, Universität Hamburg</publisher>
                    <address>
                    <addrLine>Von-Melle-Park 6, 20146 Hamburg, Tel. +4940 428 38 2972</addrLine>
                    <addrLine>www.dh2012.uni-hamburg.de/</addrLine></address>
                </publicationStmt>
                <sourceDesc>
                    <p>No source: created in electronic format.</p>
                </sourceDesc>
            </fileDesc>
        </teiHeader>
        <text>
            <body>
                <div>
                    <p>Traditional scholarship is limited by the quantity of text that a researcher
                        can read. Advances in large-scale digitization and data analysis have
                        enabled new paradigms, such as ‘distant reading’ (Moretti 2000). These
                        data-driven approaches, though they cannot match the subtlety of human
                        readers, offer the ability to make arguments about entire intellectual
                        fields, from collections spanning hundreds of years and thousands of
                        volumes. Unfortunately, although such corpora exist, the current legal
                        environment effectively prohibits direct access to material published after
                        1922, even for the great majority of works that are not commercially
                        available (Boyle 2008). This paper explores the feasibility of scholarly
                        analysis on the limited, indirect view of texts that Google Books can
                        legally provide.</p>
                    <p>The proposed Google Books settlement (Google, Inc. 2011) presents the concept
                        of ‘non-consumptive’ use, in which a researcher does not read or display
                        ‘substantial portions of a Book to understand the intellectual content
                        presented within the Book.’ The most common mode of access supported by
                        archives such as JSTOR and Google Books is keyword search. When a user
                        provides a query, the search engine ranks all documents by their relevance
                        to that query and then displays short text ‘snippets’
                        showing query words in context. This interface, though useful, is not
                        adequate for scholarship. Even if researchers have a specific query in mind,
                        there is no guarantee that they are not missing related words that are also
                        relevant. Word count histograms (Michel et al. 2011) suffer from similar
                        problems and are also vulnerable to ambiguous words, because they do not
                        account for context.</p>
                    <p>Another option is the application of statistical latent variable models. A
                        common example of such a method for text analysis is a statistical topic
                        model (Blei, Ng &amp; Jordan 2003). Topic models represent documents as
                        combinations of a discrete set of topics, or themes. Documents may be
                        combinations of multiple topics; each topic consists of a probability
                        distribution over words.</p>
                    <p>Statistical topic models have several advantages over query-based information
                        retrieval systems. They organize entire collections into interpretable,
                        contextually related topics. Semantically related words are grouped
                        together, reducing the chance of
                        missing relevant documents. Instances of ambiguous words can be assigned to
                        different topics in different documents, depending on the context of the
                        document. For example, if the word ‘relief’ occurs in a document with words
                        such as ‘sculpture’ or ‘frieze,’ it is likely to refer to an artwork and
                        not an emotion.</p>
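                    <p>The disambiguation described above can be sketched numerically. The
                        following is a minimal illustration, not the model used for this project:
                        the two topics, their word probabilities, and the document topic
                        proportions are invented values, and the function name is hypothetical.
                        It applies Bayes' rule, p(topic | word, document) ∝ p(word | topic) ×
                        p(topic | document), to the word ‘relief’ in two different contexts.</p>
                    <eg xml:space="preserve"><![CDATA[
```python
# Illustrative values only: two hypothetical topics and their word probabilities.
TOPIC_WORD = {
    "art":     {"relief": 0.04, "sculpture": 0.06, "frieze": 0.05, "joy": 0.001},
    "emotion": {"relief": 0.03, "sculpture": 0.001, "frieze": 0.001, "joy": 0.05},
}


def topic_posterior(word, doc_topics):
    """Return p(topic | word, document) by Bayes' rule.

    doc_topics maps each topic name to its proportion in the document.
    """
    # Unnormalized weight: p(word | topic) * p(topic | document).
    weights = {
        topic: TOPIC_WORD[topic].get(word, 0.0) * share
        for topic, share in doc_topics.items()
    }
    total = sum(weights.values())
    if total == 0:
        raise ValueError(f"no topic in this document can generate {word!r}")
    return {topic: w / total for topic, w in weights.items()}


# Assumed topic proportions: one document mostly about art, one mostly about feelings.
art_doc = {"art": 0.9, "emotion": 0.1}
mood_doc = {"art": 0.1, "emotion": 0.9}

print(topic_posterior("relief", art_doc))
print(topic_posterior("relief", mood_doc))
```
]]></eg>
                    <p>With these made-up numbers, ‘relief’ is attributed mainly to the art topic
                        in the first document and mainly to the emotion topic in the second, even
                        though the word itself is almost equally probable under both topics.</p>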
                    <p>Topic modeling has been used to analyze large-scale book collections
                        published before 1922 and therefore available in full-text form (Mimno &amp;
                        McCallum 2007). In this work I present a case study on the use of topic
                        modeling with copyright-protected digitized corpora that cannot be accessed
                        in their entirety, in this case books on Greco-Roman and Near-Eastern
                        archeology that have been digitized by Google. The resulting resource, the
                        Open Encyclopedia of Classical Sites, is based on a collection of
                        240-character search result snippets provided by Google. These short
                        segments of text represent a small fraction of the overall corpus, but can
                        nevertheless be used to build a representation of the contents of the
                        books.</p>
                    <p>The construction of the corpus involved first selecting a
                        subset of the entire book collection that appeared relevant to Greco-Roman
                        and Near-Eastern archeology. I then defined a set of query terms related to
                        specific archeological sites. These terms were then used to construct the
                        corpus of search result snippets. In this way I was able to use a
                        non-consumptive interface (the search engine) to create a usable sub-corpus
                        without recreating large sections of the original books.</p>
                    <p>Preprocessing was a substantial challenge. I faced problems such as
                        identifying language in highly multilingual text, recognizing improperly
                        split words, and detecting multi-word terms. This process required access to
                        words in their original sequence, and therefore could not be accomplished on
                        unigram word count data.</p>
                    <p>Finally, I trained a topic model on the
                        corpus and integrated the resulting model with the Pleiades geographic
                        database. This alignment between concepts and geography reveals many
                        patterns. Some themes are geographically or culturally based, such as Egypt,
                        Homeric Greece, or Southern Italy. Other themes cut across many regions,
                        such as descriptions of fortifications or research on trade patterns.</p>
                    <p>The resulting resource, the Open Encyclopedia of Classical Sites, links
                        geography to research literature in a way that has not previously been
                        available. Users can browse the collection along three major axes. First,
                        they can scan through a list of topics, which succinctly represent the major
                        themes of the book collection, and the specific sites and volumes that refer
                        to each theme. Second, they can select a particular site or geographic
                        region and list the themes associated with that site. For example, the city
                        of Pylos in Greece is the site of a major Mycenaean palace that contained
                        many Linear B tablets, is associated with a character in the Homeric epics,
                        and was the site of a battle in the Peloponnesian War. The topics associated
                        with the site distinguish words related to Linear B tablets, Mycenaean
                        palaces, characters in Homer, and Athens and Sparta. Finally, users can
                        select a specific book and find the themes and sites contained in that
                        volume.</p>
                    <p>This project provides a model both for what is possible
                        given large digital book collections and for what is feasible given current
                        copyright law. Realistically, we cannot expect to analyze the full text of
                        books published after 1922. But we should also not be satisfied with search
                        and simple keyword histograms. Short snippets provide sufficient lexical
                        context to fix many OCR-related problems and support semantically enriched
                        searching, browsing, and analysis.</p>
                    </div>
                
                    <div>
                    <p><hi rend="bold">Funding</hi></p>
                        <p>This work was
                        supported by a Google Digital Humanities Research grant.</p>
                    
                    </div>             
                
            </body>
            <back>
                <div>
                    <head>References</head>
                    
                        <p><hi rend="bold">Blei, D. M., A. Ng, and M. I. Jordan</hi> (2003). Latent
                        Dirichlet Allocation. <hi rend="italic">Journal of Machine Learning
                            Research</hi> 3: 993-1022.</p>
                    <p><hi rend="bold">Boyle, J.</hi>
                            (2008). <hi rend="italic">The Public Domain</hi>. New Haven: Yale
                            UP.</p>
                    <p><hi rend="bold">Google, Inc.</hi> (2011). Amended Settlement Agreement. <ref
                            target="http://www.googlebooksettlement.com" type="external"
                            >http://www.googlebooksettlement.com</ref> (accessed 24 March 2012).</p>
                    
                    <p><hi rend="bold">Michel, J., Y. Shen, A. Aiden, A. Veres, M. Gray, J. Pickett,
                            D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, S. Pinker, M. Nowak, and E.
                            Aiden</hi> (2011). Quantitative Analysis of Culture Using Millions of
                            Digitized Books. <hi rend="italic">Science</hi> 331(6014): 176-82.</p>
                    <p><hi rend="bold">Mimno, D., and A. McCallum</hi> (2007). Organizing the OCA.
                            In <hi rend="italic">Proceedings of the Joint
                            Conference on Digital Libraries</hi>. Vancouver, BC, June 2007.</p>
                    <p><hi rend="bold">Moretti, F.</hi> (2000). Conjectures on
                            World Literature. <hi rend="italic">New Left Review</hi> (Jan/Feb):
                            54-68.</p>
                  
                </div>
            </back>
        </text>
    </TEI>
</teiCorpus>