The efficiency of search engines rests on the principle that the information sought can be retrieved by ‘looking for words’ that convey it, and that these words can be identified by the string of characters they are made of. This view takes for granted that words are always spelt the same way and that they comply with orthographic rules.
Such is not the situation that prevails in texts produced during the French Renaissance. The availability of older texts for archiving and disseminating the cultural heritage therefore raises a particular problem. In texts printed in French before the 18th century, spellings are not consistent, as standardised spelling had not been ‘invented’ yet. One and the same word may therefore be spelt in a variety of forms. This is not only time-related variation, as would be expected from the evolution of the language between the 15th and the 17th centuries. Within one and the same book, many different spellings may occur for one and the same word: for côté, either coté, cotté, cote, costé or couste could be used; the verb savoir may be spelt scavoir or sçavoir; ‘je sais’ may be spelt ‘ie sçay’; and its past participle ‘su’ may appear as ‘sceu’.
Search engines based on word-form identification must therefore be adapted if they are to render the service expected. Several strategies can be envisaged, and the purpose of this paper is to focus on those that resort to linguistic expertise, either included in the documents themselves (by annotation) or in the search engine (by query extension). The solutions considered are produced in the context of the Virtual Humanistic Library project and its evolution (www.bvh.univ-tours.fr). This part of the project, called VariaLog, is funded by a Google Digital Humanities Research Award.
The BVH/VHL context considered here is that of a highly expert environment of relatively moderate size, aiming at complete editorial treatment and the dissemination of annotated and validated resources. Within this context, two solutions have been designed: annotating the documents themselves, and extending queries within the search engine.
To solve the problem of spelling variation, one has to go back to observational evidence. Two directions may be taken in this respect: either observe the texts or observe the variants attested for a given form.
By comparing the searched forms with their spellings in the texts, a typology of the situations encountered can be offered. The searched form may be identical (raisons/raisons), or the link may be very weak (impératrice/empériere). Between these two extremes, a whole gradation of situations can be organised on a linguistic basis: relations between sounds and their different spellings in modern French (c=ss; n=nn; r=rr; s=z; t=th; ai=ei,ai,ey,ay,oi,oy; [uv]=u,v; u=eu), inflectional history (serais/seray/serois) and morphological history (hôpital/hospitalier; forêt/forestier; advis/avis). Owing to the structural instability of this linguistic data, equivalences between character strings are difficult to track statistically, and no model-based approach can be developed. But linguistic knowledge helps recognise regular replacement patterns, which can be turned into rules.
To test the first results, a small corpus of seven words (vices/une/face/fesse/lu/vu/souverain) was transformed by the substitution rules mentioned above. The results do contain all the relevant forms, but the seven words were extended to 118,445 forms. There is an obvious correlation between the length of a word and the number of generated forms, due to the combinatorial process.
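This combinatorial growth can be illustrated with a minimal Python sketch (not the VariaLog implementation). Assuming the single-character substitutions listed above may each apply or not apply independently at every position, a word with k substitutable letters yields 2^k forms:

```python
from itertools import product

# Simple, uncontextualised substitutions (single-character pairs
# from the list above, assumed to apply anywhere in the word).
SUBS = {"c": "ss", "n": "nn", "r": "rr", "s": "z", "u": "v", "v": "u"}

def naive_variants(word):
    """Apply each substitution optionally at every position:
    a word with k substitutable letters yields 2**k forms."""
    options = [(ch, SUBS[ch]) if ch in SUBS else (ch,) for ch in word]
    return {"".join(chars) for chars in product(*options)}

print(len(naive_variants("souverain")))  # 32 forms (5 substitutable letters)
```

With the full rule set and longer words, this exponential blow-up is what produces six-figure form counts from a handful of query words.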
The solution chosen to fix this problem is to describe, for each rule, the context in which the substitution is allowed. This strongly constrains the rules’ application and limits their productivity. The contextualisation relies on a good knowledge of the linguistic processes involved. In the example given below, 8 simple rules are transformed into 9 more complex rules. Most of the time, one simple rule is derived into 5 to 15 contextualised rules.
(?<=[aeiouy])c(?=[eiy]) = ss
^s(?=[eiy]) = c
(?<=[aeiouy])ss(?=[eiy]) = c
(?!^.+)v = u
(?!^.+)n(?<!.+$) = nn
^u = v
(?!^.+)r(?<!.+$) = rr
s$ = z
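Since these contextualised rules are plain regular-expression substitutions, they can be exercised directly. The following Python sketch (not the VariaLog code itself) applies a subset of the rules above and closes a word under single-site rewrites; rules with variable-length lookbehind such as (?!^.+)n(?<!.+$) are omitted because Python’s re module only accepts fixed-width lookbehind:

```python
import re

# A subset of the contextualised rules listed above,
# as (pattern, replacement) pairs.
RULES = [
    (r"(?<=[aeiouy])c(?=[eiy])", "ss"),   # face  -> fasse
    (r"(?<=[aeiouy])ss(?=[eiy])", "c"),   # fasse -> face
    (r"^s(?=[eiy])", "c"),
    (r"s$", "z"),                         # vices -> vicez
    (r"^u", "v"),                         # une   -> vne
]

def variants(word):
    """Closure of the word under single-site rule applications:
    every form reachable by rewriting one match at a time."""
    seen, frontier = {word}, [word]
    while frontier:
        nxt = []
        for form in frontier:
            for pat, repl in RULES:
                for m in re.finditer(pat, form):
                    cand = form[:m.start()] + repl + form[m.end():]
                    if cand not in seen:
                        seen.add(cand)
                        nxt.append(cand)
        frontier = nxt
    return sorted(seen)

print(variants("une"))  # ['une', 'vne']
```

The seen-set makes the closure terminate even though some rule pairs (c=ss and ss=c) are mutually inverse.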
The results achieved are satisfactory: the rules produce all the linguistically permissible variants, and the number of variants is much lower. The 7 words generate 37 forms.
The tool itself is designed to be user-friendly, especially for tuning the rules and evaluating their consequences (efficiency and non-regression tests). It is a freely available Java program which first transforms a list of words into an extended list of forms, applying a given rule set. The next step is to locate, in a text, the different forms attested in the old spelling that correspond to the requested form. The output of this last step is an HTML file in which the identified variants are graphically highlighted (or set in bold). Moreover, each form is connected to a bubble showing the rules used to derive the variant. A table summarising the rules used for the text is also available, which makes the human validation process quite comfortable. The tool is being put forward for integration into an XTF platform.
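The highlighting stage can be sketched as follows. This is a hypothetical Python illustration of the kind of HTML report described, not the actual Java tool; the variant-to-query mapping and the use of a title attribute as the explanatory bubble are assumptions:

```python
import html
import re

def highlight(text, variant_to_query):
    """Wrap each attested variant in a <b> element whose title
    attribute plays the role of the explanatory bubble."""
    escaped = html.escape(text)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, variant_to_query)) + r")\b")
    def bubble(match):
        form = match.group(1)
        return '<b title="variant of {}">{}</b>'.format(
            variant_to_query[form], form)
    return pattern.sub(bubble, escaped)

print(highlight("ie scay bien", {"scay": "sais"}))
# ie <b title="variant of sais">scay</b> bien
```

A validator hovering over the bold form would then see which modern query (and, in the real tool, which rules) produced the match.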
Using a rule-based approach, VariaLog is designed to identify all the written forms that are likely to correspond to a query, since it is insensitive to variations in spelling. The recall rate (tested on 5000 forms) provides evidence that all the linguistically permissible variants in French are produced by the rules, so long as the problem is purely one of spelling (nuit/nuyct) and not a morphological one (e.g. impératrice/empérière). As far as precision is concerned, the rules may sometimes generate more ambiguity than anticipated. If ‘o’ becomes ‘ou’, then école becomes écoule, which is not an acceptable variant, but volant becomes voulant, which is; as a result, volant will also match queries for vouloir (‘want’), increasing the ambiguity of a form which already means ‘flying, robbing, wheel, flounce, shuttlecock’. The generated ambiguity is no different from standard ambiguities, even in an orthographic environment.
Already used to search several French dialects, VariaLog can process any kind of spelling variation, in any language: one just needs to plug in one’s own specific spelling rules or dictionary. Our aim is to help locate spelling variation efficiently. User feedback is most welcome.