LREC 2012 Workshop 'Best Practices for Speech Corpora in Linguistic Research'

December 12, 2011

Please note that the DEADLINE for submitting papers has been EXTENDED to 19 FEBRUARY 2012.

This half-day workshop addresses the question of best practices for the design, creation and dissemination of speech corpora in linguistic disciplines like conversation analysis, dialectology, sociolinguistics, pragmatics and discourse analysis. The aim is to take stock of current initiatives, see how their approaches to speech data processing differ or overlap, and identify where and how there is potential for coordinating efforts and for standardisation.

Largely in parallel to the speech technology community, linguists from such diverse fields as conversation analysis, dialectology, sociolinguistics, pragmatics and discourse analysis have, in the last ten years or so, intensified their efforts to build up (or curate) larger collections of spoken language data. Undoubtedly, methods, tools, standards and workflows developed for corpora used in speech technology often serve as a starting point and a source of inspiration for the practices evolving in the linguistic research community. Conversely, the spoken language corpora developed for linguistic research can certainly also be valuable for the development or evaluation of speech technology. Yet it would be an oversimplification to say that speech technology data and spoken language data in linguistic research are merely two variants of the same category of language resources. Too distinct are the scholarly traditions, the research interests and the institutional circumstances that determine the designs of the respective corpora and the practices chosen to build, use and disseminate the resulting data.

The aim of this workshop is therefore to look at speech corpora from a decidedly linguistic perspective. We want to bring together linguists, tool developers and corpus specialists who develop and work with authentic spoken language corpora and discuss their different approaches to corpus design, transcription and annotation, metadata management and data dissemination. A desirable outcome of the workshop would be a better understanding of

