XSLT has often been criticized for its verbosity.
The rules governing the validity of particular XML instances are usually set forth in
one of several standard forms of specification (such as RELAX
NG,
In the trivial case, where the target schema describes a proper subset of the collections in question, Abbot operates more or less automatically, but more complex transformations are also possible. One can, for example, give the system two collections and have it generate a stylesheet that makes one collection conform to the schema of the other. One can also make several collections target an entirely different schema. In these latter cases, it becomes necessary to describe particular mappings in a configuration file, but that configuration uses a simple syntax unrelated to that of either a schema language or XSLT.
The key step here is the automatic generation of an XSLT stylesheet. Our choice of XSLT as the language that generates that stylesheet might at first seem slightly perverse, but because XSLT is a homoiconic language – a language in which programs are represented as a data structure that the language itself can manipulate – code generation can be undertaken through the use of metaprogramming (in which code is passed into another, more abstract layer and evaluated).
The Abbot system begins by running a ‘meta-stylesheet’ (analogous to a higher-order function in a traditional functional language) on both a target schema and a configuration file. The configuration file, while not written in XSLT, is nonetheless converted into XSLT by the surrounding runtime (using a translation method we discuss below). By default, that target schema describes TEI Analytics – a TEI subset that provides an encoding scheme optimized for text analysis. This meta-stylesheet generates, as its only output, a conversion stylesheet used for the actual transformation of the documents. This latter transformation yields files that will, in the majority of cases, validate against the target schema.
When Abbot reads the target schema, it accounts for all elements and associated attributes and generates a default XSLT template for each element. These default templates reflect the general assumption that elements and attributes in the input files resemble their counterparts in the target schema. If, for example, <foo n="001"/> exists in the input file and is specifically allowed in the target schema, then Abbot will pass the element through unaltered under the assumption that the element is fully valid. Anything beyond that needs to be articulated in the configuration file.
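As a rough illustration of this default-template step, the following sketch shows how a meta-stylesheet might emit one pass-through template per element named in a RELAX NG schema. It is not Abbot's actual meta-stylesheet. The out prefix, its placeholder namespace URI, and the assumption that the schema names its elements through name attributes (rather than nested name elements) are choices made for this example only. The sketch also copies every attribute unchecked, whereas Abbot accounts for the attributes the schema allows.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:out="urn:example:generated-xslt"
    xmlns:rng="http://relaxng.org/ns/structure/1.0"
    exclude-result-prefixes="rng">

  <!-- Literal out:* elements are written to the result in the XSLT namespace,
       so the output of this stylesheet is itself a stylesheet. -->
  <xsl:namespace-alias stylesheet-prefix="out" result-prefix="xsl"/>
  <xsl:output method="xml" indent="yes"/>

  <!-- Input: the target schema (assumed here to be RELAX NG in XML syntax). -->
  <xsl:template match="/">
    <out:stylesheet version="2.0">
      <!-- One default template per distinct element name in the schema. -->
      <xsl:for-each select="distinct-values(//rng:element/@name)">
        <!-- The braces are evaluated now, while generating; the generated
             template matches the element by local name and copies it. -->
        <out:template match="*[local-name()='{.}']">
          <out:copy>
            <out:copy-of select="@*"/>
            <out:apply-templates/>
          </out:copy>
        </out:template>
      </xsl:for-each>
    </out:stylesheet>
  </xsl:template>

</xsl:stylesheet>
```

Because the generated templates carry no explicit priority, a hand-written template with priority="1" for the same element takes precedence over them in the conversion stylesheet.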
Ultimately, the custom transformations set forth in the configuration file need to be instantiated as XSLT templates and included in the conversion stylesheet at runtime. For example, to replace the <temphead> element with <teiHeader> (its TEI P5 counterpart), the system would need to generate the following:
<transformation type="xslt" activate="yes">
  <desc>substitute 'temphead' with 'teiHeader'</desc>
  <xsl:template match="*[lower-case(name())='temphead']" priority="1">
    <teiHeader>
      <xsl:apply-templates/>
    </teiHeader>
  </xsl:template>
</transformation>
Replacing spaces with underscores in the extent attribute of the <gap> element (a considerably more complex operation) requires substantially more code:
<transformation type="xslt" activate="yes">
  <desc>add underscore to 'gap' @extents containing a space</desc>
  <xsl:template match="*[lower-case(name())='gap']" priority="1">
    <xsl:choose>
      <xsl:when test="contains(@extent,' ')">
        <xsl:element name="gap">
          <xsl:for-each select="@extent">
            <xsl:choose>
              <xsl:when test="contains(.,' ')">
                <xsl:attribute name="extent">
                  <xsl:value-of select="replace(.,' ','_')"/>
                </xsl:attribute>
              </xsl:when>
              <xsl:otherwise>
                <xsl:copy-of select="."/>
              </xsl:otherwise>
            </xsl:choose>
          </xsl:for-each>
          <xsl:apply-templates/>
        </xsl:element>
      </xsl:when>
      <xsl:otherwise>
        <xsl:copy-of select="."/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</transformation>
Depending on the particular situation, the configuration file might have to contain dozens of hand-built templates for performing subtle transformations that cannot be deduced from the schema. To spare users from writing such templates by hand, we undertake a second code-generation step using Clojure – a dialect of Lisp that runs on the Java Virtual Machine.
Because Lisp is also a homoiconic language, it too is well suited to code that reads and writes code. Moreover, Clojure treats XML as a first-class data structure that can be easily (and lazily) transformed into a map object in which descendant nodes are represented as nested vectors. The problem of parsing a configuration file (in which complicated XSLT transformations are rendered in the form of a radically simplified DSL) thus becomes a matter of parsing the file into a map structure. Clojure can then trivially transform that map directly into XML (XSLT), which can be inserted at runtime into the conversion stylesheet. The first XSLT example above becomes something like:
temphead -> teiHeader
The second, more complicated example might be expressed as:
gap[@extent='/ /'] -> gap[@extent='/_/']
In this way, Abbot becomes not merely a framework for effecting interoperability of
XML document collections, but a general purpose XML transformation framework that
avoids the need for XSLT itself.
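To make the compilation step concrete, the sketch below turns the simplest kind of rule (an element rename such as temphead -> teiHeader) into an XSLT template. It is written in XSLT 2.0 for consistency with the examples above and is not the Clojure implementation described here. The rules-uri parameter, the rules.txt file name, the out prefix with its placeholder namespace URI, and the one-rule-per-line layout are all assumptions of this example. Lines that do not have the form name -> name, including attribute rules like the gap example, are simply ignored.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:out="urn:example:generated-xslt">

  <!-- Literal out:* elements are written to the result in the XSLT namespace. -->
  <xsl:namespace-alias stylesheet-prefix="out" result-prefix="xsl"/>
  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical parameter: URI of a plain-text file with one rule per line. -->
  <xsl:param name="rules-uri" select="'rules.txt'"/>

  <!-- Entry point: run with "main" as the initial named template. -->
  <xsl:template name="main">
    <out:stylesheet version="2.0">
      <xsl:for-each select="tokenize(unparsed-text($rules-uri), '\r?\n')">
        <!-- Group 1 is the source element name, group 2 the target name. -->
        <xsl:analyze-string select="."
            regex="^\s*([\w.\-]+)\s*->\s*([\w.\-]+)\s*$">
          <xsl:matching-substring>
            <out:template
                match="*[lower-case(name())='{lower-case(regex-group(1))}']"
                priority="1">
              <!-- xsl:element runs now, so the generated template contains
                   a literal element with the target name. -->
              <xsl:element name="{regex-group(2)}">
                <out:apply-templates/>
              </xsl:element>
            </out:template>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:for-each>
    </out:stylesheet>
  </xsl:template>

</xsl:stylesheet>
```

For a rules file containing the single line temphead -> teiHeader, the generated template has the same shape as the hand-written temphead template shown earlier, without the surrounding transformation wrapper.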
Thinking of XSLT as an intermediate form – a language that is targeted much as a compiler might target assembly – allows us to imagine radically simplified document transformation languages that can (potentially) exploit the full range of XSLT itself. In the case of Abbot, radical simplification is possible, in part, because the problem domain is itself highly constrained. But such constraints constitute precisely the rationale for domain-specific languages that try to map a user’s domain knowledge to a simplified syntax. Such languages, while smaller and simpler than more general-purpose languages, often still require the full range of language design tools (lexers, parser generators, the specification of a grammar, and so forth). Exploiting the homoiconicity of languages that possess this feature – including XSLT itself – makes the process of designing a ‘mini-language’ considerably easier.
Adler, S. (1997). A Proposal for XSL. World Wide Web Consortium (W3C). http://www.w3.org/TR/NOTE-XSL.html (accessed 31 October 2011).
McIlroy, D. (1960). Macro Instruction Extensions of Compiler Languages. Communications of the ACM 3(4): 214-220.
Pytlik-Zillig, B. (2009). TEI Analytics: Converting Documents into a TEI Format for Cross-Collection Text Analysis. Literary and Linguistic Computing 24(2): 187-192.
Pytlik-Zillig, B. (2011). TEI Texts that Play Nicely: Lessons from the MONK Project. Journal of the Chicago Colloquium on Digital Humanities and Computer Science 1(3): 1-5.
Unsworth, J. (2011). Computational Work with Very Large Text Collections: Interoperability, Sustainability, and the TEI. Journal of the Text Encoding Initiative 1. http://jtei.revues.org/215 (accessed 21 October 2011).