<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="../schema/xmod_web.rnc" type="compact"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:xmt="http://www.cch.kcl.ac.uk/xmod/tei/1.0" 
     xml:id="ab-262">
    <teiHeader>
        <fileDesc>
            <titleStmt>
                <title>Automatic Mining of Valence Compounds for German: A Corpus-Based Approach</title>
                <author>
                    <name>Brock, Anne</name>
                    <affiliation>University of Tuebingen, Germany</affiliation>
                    <email>anne.brock@uni-tuebingen.de</email>
                </author>
                <author>
                    <name>Henrich, Verena</name>
                    <affiliation>University of Tuebingen, Germany</affiliation>
                    <email>verena.henrich@uni-tuebingen.de</email>
                </author>
                <author>
                    <name>Hinrichs, Erhard</name>
                    <affiliation>University of Tuebingen, Germany</affiliation>
                    <email>erhard.hinrichs@uni-tuebingen.de</email>
                </author>
                <author>
                    <name>Versley, Yannick</name>
                    <affiliation>University of Tuebingen, Germany</affiliation>
                    <email>yannick.versley@uni-tuebingen.de</email>
                </author>
            </titleStmt>
            <publicationStmt>
                <publisher>Jan Christoph Meister, Universität Hamburg</publisher>
                <address>
                   <addrLine>Von-Melle-Park 6, 20146 Hamburg, Tel. +4940 428 38 2972</addrLine>
                   <addrLine>www.dh2012.uni-hamburg.de</addrLine>
              </address>
            </publicationStmt>
            <sourceDesc>
                <p>No source: created in electronic format.</p>
            </sourceDesc>
        </fileDesc>
        <revisionDesc>
            <change>
                <date>2012-04-15</date>
                <name>DH</name>
                <desc>generate TEI-template with data from ConfTool-Export</desc>
            </change>
            <change>
                <date>2012-04-13</date>
                <name>LS</name>
                <desc>provide metadata for publicationStmt</desc>
            </change>
        </revisionDesc>
    </teiHeader>
    <text type="paper">
        <body>
            
                <div>
                    <head>Introduction</head>
                    <p>The availability of large-scale text corpora in digital form and the
                        availability of sophisticated analysis and querying tools have profoundly
                        influenced linguistic research over the past three decades. The present
                        paper uses this eHumanities methodology in order to automatically detect and
                        analyze valence compounds for German. Valence compounds (in German: <hi
                            rend="italic">Rektionskomposita</hi>) such as <hi rend="italic"
                            >Autofahrer</hi> ‘car driver’ have been the subject of extensive
                        research in German linguistics. They are composed of a deverbal head (<hi
                            rend="italic">Fahrer</hi> ‘driver’) and a nominal non-head (<hi
                            rend="italic">Auto</hi> ‘car’). As the corresponding verb <hi
                            rend="italic">fahren</hi> ‘to drive’, from which <hi rend="italic"
                            >Fahrer</hi> is derived, governs its accusative object <hi rend="italic"
                            >Auto</hi>, the compound <hi rend="italic">Autofahrer</hi> is considered
                        a valence compound.</p>
                    <p>The automatic detection and semantic interpretation of compounds constitute
                        an important aspect of text understanding for a language like German where
                        compounding is a particularly productive means of word formation and
                        accordingly occurs with high frequency. Baroni et al. (2002) report that
                        almost half (47%) of the word types in the APA German news corpus, which
                        they used as training material for a word prediction model for German, are
                        compounds.</p>
                    <p>Due to their productivity, compounds in German do not form a closed class of
                        words that can be listed in its entirety in a lexicon. Rather, as Lemnitzer
                        (2007) has shown, new German compounds are coined daily, and some of them
                        attain sufficient frequency to be eventually included in print dictionaries
                        such as the Duden. Novel compounds that are not yet listed in a dictionary
                        pose a particular challenge for natural language processing systems that
                        rely exclusively on dictionaries as their underlying knowledge source for
                        word recognition.</p>
                    <p>Since the analysis of compounds constitutes a major challenge for the
                        understanding of natural language text, the structural analysis and the
                        semantic interpretation of compounds have received considerable attention in
                        both theoretical and computational linguistics. Syntactic analysis of
                        compounds focuses on the correct (left- vs. right- branching) bracketing of
                        the constituent parts of a given compound, e.g., [[rock music] singer] vs.
                        [deputy [music director]]. Research on the semantic interpretation of
                        compounds has focused on the semantic relations that hold between the
                        constituent parts of a compound. The present paper focuses entirely on the
                        semantic interpretation of compounds; however, see Henrich and Hinrichs
                        (2011) for previous research on the syntactic analysis of nominal compounds
                        in German.</p>
                </div>
            <div>
                    <head>Corpus-Based Experiments</head>
                    <p>The aim is to determine whether corpus evidence can form the basis for
                        reliably predicting whether a given complex noun is a valence compound or
                        not. For example, if we want to determine whether the complex noun <hi
                            rend="italic">Taxifahrer</hi> is a valence compound, we inspect a large
                        corpus of German and investigate whether there is sufficient evidence in the
                        corpus that the noun <hi rend="italic">Taxi</hi> can be the accusative
                        object of the verb <hi rend="italic">fahren</hi> ‘to drive’. The question
                        of what exactly constitutes sufficient corpus evidence is
                        of crucial importance. Three different measures were applied to answer this
                        question:</p>
                    <xmt:oList rend="arabic">
                        <item>Relative frequency: the percentage of verb-object pairs in the
                                corpus among all co-occurrences of the two words in the same
                                sentence, regardless of the dependency relation between them,
                        </item>
                        <item>the mutual information association score for the verb-object
                                pairs, and</item>
                        <item>the log-likelihood ratio for the verb-object pairs.
                        </item>
                    </xmt:oList>
                
                    <p>The measure in (1) above constitutes a simplified variant of the data-driven
                        approach that Lapata (2002) applied for the purposes of automatically
                        retrieving English valence compounds from the British National Corpus.</p>
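                    <p>As a minimal sketch of measure (1), the following Python fragment divides
                        the number of times a noun occurs as the object of a verb by the number of
                        sentences in which the two words co-occur at all. The function name and
                        the counts are hypothetical and serve only as an illustration; they are
                        not taken from the corpus.</p>
<eg xml:space="preserve"><![CDATA[
```python
def relative_frequency(verb_object_count, cooccurrence_count):
    """Share of same-sentence co-occurrences in which the noun is the verb's object."""
    if cooccurrence_count == 0:
        return 0.0
    return verb_object_count / cooccurrence_count


# Hypothetical counts for illustration only (not taken from the TüPP-D/Z corpus).
counts = {
    ("Auto", "fahren"): {"verb_object": 30, "same_sentence": 40},
    ("Abend", "fahren"): {"verb_object": 1, "same_sentence": 20},
}

# One score per (noun, verb) pair; higher values favor a valence reading.
scores = {
    pair: relative_frequency(c["verb_object"], c["same_sentence"])
    for pair, c in counts.items()
}
```
]]></eg>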
                    <p>The starting point of the corpus-based experiments was a list of 22,897
                        German complex nouns and the Tübingen Partially Parsed Corpus of Written
                        German (TüPP-D/Z).<note>See <ref target="http://www.sfs.uni-tuebingen.de/en/tuepp.shtml" type="external">http://www.sfs.uni-tuebingen.de/en/tuepp.shtml</ref></note> This corpus consists of 200
                        million words of newspaper articles taken from the <hi rend="italic">taz</hi>
                        (‘die tageszeitung’) and is thus sufficiently large to provide a reliable
                        data source for the experiments to be conducted. The TüPP corpus was
                        automatically parsed by the dependency parser MaltParser (Hall et al.
                        2006).</p>
                    <p>Each of the 22,897 potential valence compounds was split into its
                        deverbal head and its nominal modifier with the help of the morphological
                        analyzer SMOR (Schmid et al. 2004). For example, the compound <hi
                            rend="italic">Autofahrer</hi> receives the analysis
                        Auto&lt;NN&gt;fahren&lt;V&gt;er&lt;SUFF&gt;&lt;+NN&gt; in SMOR. From the
                        TüPP corpus, all verb-object pairs are extracted from those sentences in
                        which either the verb matches the verb underlying the deverbal head of the
                        complex noun (e.g., <hi rend="italic">fahren</hi>) or the accusative object
                        matches the nominal modifier of the compound (e.g., <hi rend="italic"
                            >Auto</hi>).</p>
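                    <p>The splitting step can be sketched as follows. The sketch assumes that
                        every SMOR analysis of interest has the same shape as the single example
                        given above, with one segment tagged NN and one segment tagged V; real SMOR
                        output is richer and may be ambiguous. The names <hi rend="italic"
                            >SEGMENT</hi>, <hi rend="italic">EXAMPLE</hi> and <hi rend="italic"
                            >split_valence_candidate</hi> are hypothetical.</p>
<eg xml:space="preserve"><![CDATA[
```python
import re

# Matches one "form<TAG>" segment; the tag may carry a leading "+", as in <+NN>.
SEGMENT = re.compile(r"([^<>]*)<(\+?[A-Za-z]+)>")

# The analysis string quoted in the text above.
EXAMPLE = "Auto<NN>fahren<V>er<SUFF><+NN>"


def split_valence_candidate(analysis):
    """Return (nominal modifier, verb lemma), or None if the shape does not fit."""
    segments = SEGMENT.findall(analysis)
    nouns = [form for form, tag in segments if tag == "NN" and form]
    verbs = [form for form, tag in segments if tag == "V" and form]
    # Accept only the simple case: exactly one noun segment and one verb segment.
    if len(nouns) == 1 and len(verbs) == 1:
        return nouns[0], verbs[0]
    return None


modifier_and_verb = split_valence_candidate(EXAMPLE)
```
]]></eg>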
                    <p>Figure 1 gives an example of the type of MaltParser dependency analysis
                    from which the verb-object pairs are extracted. The dependency analysis
                    represents the lexical tokens of the input sentence as nodes in a graph and
                    connects them with edges that are labeled with dependency relations. Note
                    that the MaltParser annotation is performed automatically and is thus not
                    100% accurate. In the case of the sentence <hi rend="italic">Aber dann würde
                        doch niemand mehr Auto fahren.</hi> (‘But then, no one would drive cars
                    anymore.’) shown in Fig. 1, <hi rend="italic">mehr</hi> is erroneously attached
                    to the noun <hi rend="italic">Auto</hi> instead of to the pronoun <hi
                        rend="italic">niemand</hi>.</p>
                <p><figure>
                        <graphic url="img262-1.jpg" rend="left" height="256px" width="341px"
                            mimeType="image/jpeg"/>
                        <head>Figure 1: A MaltParser dependency graph for a TüPP corpus sentence
                            Aber dann würde doch niemand mehr Auto fahren</head>
                    </figure></p>
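                    <p>The extraction of verb-object pairs from such a dependency graph might
                        look like the sketch below. The token representation, the dependency
                        labels (including OBJA for accusative objects) and the head assignments
                        are assumptions made for illustration; they are not a transcription of
                        Fig. 1 or of the label set actually used.</p>
<eg xml:space="preserve"><![CDATA[
```python
from collections import Counter, namedtuple

# idx: position in the sentence; head: idx of the governing token (0 = root).
Token = namedtuple("Token", "idx lemma head deprel")

OBJECT_LABEL = "OBJA"  # assumed label for accusative objects


def verb_object_pairs(sentence):
    """Yield (object lemma, governor lemma) for every accusative-object edge."""
    by_idx = {tok.idx: tok for tok in sentence}
    for tok in sentence:
        if tok.deprel == OBJECT_LABEL and tok.head in by_idx:
            yield tok.lemma, by_idx[tok.head].lemma


# Hypothetical parse of "Aber dann würde doch niemand mehr Auto fahren".
sentence = [
    Token(1, "aber", 3, "KON"),
    Token(2, "dann", 3, "ADV"),
    Token(3, "werden", 0, "S"),
    Token(4, "doch", 3, "ADV"),
    Token(5, "niemand", 3, "SUBJ"),
    Token(6, "mehr", 7, "ADV"),  # the erroneous attachment to "Auto"
    Token(7, "Auto", 8, "OBJA"),
    Token(8, "fahren", 3, "AUX"),
]

pair_counts = Counter(verb_object_pairs(sentence))
```
]]></eg>
                    <p>Summing such counters over all corpus sentences would give the pair
                        frequencies that the three measures operate on.</p>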
                    
                    
                    <p>Both the mutual information and the log-likelihood measures determine the
                        association strength between two words by considering the relative
                        co-occurrences shown in the contingency table (Table 1).</p>
                
                <table>
                    <head>Table 1: Contingency table for verb-object pairs</head>
                    <row>
                        <cell>   </cell>
                        <cell> Accusative object Auto </cell>
                        <cell> Accusative object ¬Auto </cell>
                    </row>
                    <row>
                        <cell> Verb fahren </cell>
                        <cell>
                            <hi rend="italic">Auto fahren</hi>
                        </cell>
                        <cell>
                            <hi rend="italic">Fahrrad fahren</hi>
                        </cell>
                    </row>
                    <row>
                        <cell> Verb ¬fahren </cell>
                        <cell>
                            <hi rend="italic">Auto waschen</hi>
                        </cell>
                        <cell>
                            <hi rend="italic">Wäsche waschen</hi>
                        </cell>
                    </row>
                </table>
                                    
                    <p>For both measures, the association strength increases the more the number
                        of co-occurrences in the upper left cell of the contingency table outweighs
                        the counts in the remaining cells.</p>
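                    <p>The exact formulas are not spelled out here. As a sketch, assuming the
                        standard pointwise mutual information and the log-likelihood ratio (G2) for
                        a 2x2 contingency table, both scores can be computed from the four cell
                        counts. The parameter names a, b, c and d are introduced for this
                        illustration and stand for the upper left, upper right, lower left and
                        lower right cells of Table 1.</p>
<eg xml:space="preserve"><![CDATA[
```python
import math


def pointwise_mutual_information(a, b, c, d):
    """log2 of the observed over the expected count of the upper left cell."""
    if a == 0:
        return float("-inf")
    n = a + b + c + d
    expected = (a + b) * (a + c) / n
    return math.log2(a / expected)


def log_likelihood_ratio(a, b, c, d):
    """G2 statistic: 2 * sum(observed * ln(observed / expected)) over all cells."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                g2 += o * math.log(o * n / (rows[i] * cols[j]))
    return 2.0 * g2
```
]]></eg>
                    <p>With a uniform table (all four counts equal) both functions return zero,
                        whereas a table whose mass sits on the diagonal yields positive scores.</p>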
                
                </div>
            <div>
                    <head>Evaluation</head>
                    <p>From the list of 22,897 putative valence compounds, a balanced sample of 100
                        valence compounds and 100 non-valence compounds was randomly selected in
                        order to be able to evaluate the effectiveness of the methods described
                        above. Each entry in this sample was manually annotated as to whether it
                        represents a valence compound or not. The sample as a whole serves as a gold
                        standard for evaluation.<note>Precision measures the fraction of retrieved valence compounds that are correctly analyzed. Recall measures the fraction of actual valence compounds that are retrieved.</note></p>
                    <p>For all three measures described above, recall, precision, and
                        F-measure were computed. The results are shown in Figs. 2, 3, and 4 for
                        log-likelihood, mutual information, and relative frequency, respectively.
                        The first two measures yield a continuous scale of association strength
                        values. The graphs in Figs. 2 and 3 plot, on the x-axis, association strength
                        thresholds that correspond to a quantile between 100% and 10% of the values
                        observed for these measures. The y-axis shows the corresponding effect on
                        precision and recall for each measure.</p>
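                    <p>The threshold-based evaluation can be sketched as follows. The sketch
                        takes a raw score threshold rather than a quantile, and the scores and gold
                        labels are hypothetical values chosen for illustration, not results from
                        the experiments.</p>
<eg xml:space="preserve"><![CDATA[
```python
def evaluate(scores, gold, threshold):
    """Precision, recall and F1 when every item scoring >= threshold is accepted."""
    predicted = {item for item, score in scores.items() if score >= threshold}
    true_positives = len(predicted.intersection(gold))
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical association scores and gold-standard labels, for illustration only.
scores = {
    "Autofahrer": 9.0,
    "Taxifahrer": 7.5,
    "Sonntagsfahrer": 6.0,
    "Geisterfahrer": 1.0,
}
gold = {"Autofahrer", "Taxifahrer"}

# Lowering the threshold accepts more candidates: recall can only rise,
# while precision may drop.
strict = evaluate(scores, gold, 8.0)
lenient = evaluate(scores, gold, 5.0)
```
]]></eg>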
                
                <p><figure><graphic url="img262-2.jpg" rend="left" height="256px" width="341px" mimeType="image/jpeg"/><head>Figure 2: Precision, Recall, and F1 for Log-Likelihood</head></figure></p>
                <p><figure><graphic url="img262-3.jpg" rend="left" height="256px" width="341px" mimeType="image/jpeg"/><head>Figure 3: Precision, Recall, and F1 for Mutual Information</head></figure></p>
                <p><figure><graphic url="img262-4.jpg" rend="left" height="256px" width="341px" mimeType="image/jpeg"/><head>Figure 4: Precision, Recall, and F1 for Relative Frequency</head></figure></p>
                    <p>For the relative frequency approach (Fig. 4), the decision to accept or
                        reject a candidate pair is made by weighting occurrences as a verb-object
                        pair against occurrences in other contexts. The weight can take any value
                        between zero and positive infinity. Unlike the association measures
                        (log-likelihood and mutual information), this approach does not yield a
                        ranking of candidates; in consequence, the precision (shown in Fig. 4) does
                        not decrease monotonically but shows an optimal parameter setting for values
                        between 1.0 and 2.0.</p>
                    <p>The results show that all three measures are independently valuable in the
                    corpus-based identification of valence compounds. Relative frequency and
                    log-likelihood yield the best recall (up to 81%), while mutual information
                    affords the best precision (up to 100%). Future research will address
                    effective methods for combining the complementary strengths of all three
                    measures into an optimized classification approach.</p>
                    <p>In sum, the eHumanities method presented in this paper for the identification
                        of valence compounds in German has proven effective and can thus nicely
                        complement traditional methods of analysis which focus on the internal
                        structure of valence compounds as such.</p>
                  
                
            </div>
        </body>
        <back>
            <div>
                <head>References</head>
                      
            <p><hi rend="bold">Baroni, M., J. Matiasek, and H. Trost</hi> (2002). Predicting the
                Components of German Nominal Compounds. In F. van Harmelen (ed.), <hi rend="italic"
                    >Proceedings of the 15th European Conference on Artificial Intelligence
                    (ECAI)</hi>. Amsterdam: IOS Press, pp. 470-474.</p>
            <p><hi rend="bold">Hall, J., J. Nivre, and J. Nilsson</hi> (2006). Discriminative
                Classifiers for Deterministic Dependency Parsing. In <hi rend="italic">Proceedings
                    of the 21st International Conference on Computational Linguistics and 44th
                    Annual Meeting of the Association for Computational Linguistics (COLING-ACL)
                    Main Conference Poster Sessions</hi>, pp. 316-323.</p>
            <p><hi rend="bold">Henrich, V., and E. Hinrichs</hi> (2011). Determining Immediate
                    Constituents of Compounds in GermaNet. In <hi rend="italic">Proceedings of
                        Recent Advances in Natural Language Processing (RANLP 2011)</hi>, Hissar,
                    Bulgaria, pp. 420-426.</p>
            <p><hi rend="bold">Lapata, M.</hi> (2002). The disambiguation of nominalizations. <hi
                    rend="italic">Computational Linguistics</hi> 28(3): 357-388.</p>
            <p><hi rend="bold">Lemnitzer, L.</hi> (2007). <hi rend="italic">Von Aldianer bis
                    Zauselquote: Neue deutsche Wörter, woher sie kommen und wofür wir sie
                    brauchen</hi>. Tübingen: Narr.</p>
            <p><hi rend="bold">Schmid, H., A. Fitschen, and U. Heid</hi> (2004). SMOR: A German
                Computational Morphology Covering Derivation, Composition, and Inflection. In <hi
                    rend="italic">Proceedings of the 4th International Conference on Language
                    Resources and Evaluation (LREC 2004)</hi>. Lisbon, Portugal, pp. 1263-1266.</p>
        </div>
        </back>
    </text>
</TEI>