A Computer-Based Approach for Predicting the Translation Time Period of Early Chinese Buddhism Translation

Hung, Jen-Jou, Dharma Drum Buddhist College, Taiwan, jenjou.hung@gmail.com

Bingenheimer, Marcus, Temple University, USA, m.bingenheimer@gmail.com

Kwok, Jieli, Dharma Drum Buddhist College, Taiwan, guo.jieli@ddbc.edu.tw

Buddhism is a world religion that has managed to take root in cultures vastly different from that of its origin. Its transmission from India to China between the 2nd and the 10th centuries happened against all odds. The ‘Buddhist conquest of China’ can be partly attributed to the successful translation of a great number of texts from Indian languages into Chinese. The current standard edition of the Chinese Buddhist canon (Taishō shinshū daizōkyō (Abbr.: T.) 大正新修大蔵經, edited 1924-1934) contains 3053 works in 85 volumes, including about 1000 texts of Indian (or allegedly Indian) provenance. However, ca. 150 of these texts are marked as shiyi 失譯, indicating that the name(s) of the translator(s) are unknown. Furthermore, for the texts that were translated between the 2nd and the late 6th century, many attributions are uncertain, problematic or simply incorrect. The issue of doubtful and wrong attributions has been debated in the field of Buddhist studies over the last few decades, e.g., by Zürcher (1991), Harrison (1993), and Nattier (2008).

Over the years, Buddhist scholars have used traditional text-critical methods to corroborate or dispute traditional attributions, yet like every method, philology has its limits. Faced with a large number of texts in ‘Buddhist Hybrid Chinese’ of unknown provenance, the long-established practice of note-taking on the usage of characters and words quickly runs into problems. As with European languages, computational linguistics might offer new avenues of data collection and verification. The corpus of Buddhist Hybrid Chinese has been available in a reliable digital format (XML/TEI) since the first 55 volumes of the Taishō edition were published freely by the Chinese Buddhist Electronic Text Association (CBETA).

We are now able to apply statistical methods and artificial intelligence algorithms to the analysis of this corpus. This enables us to obtain new evidence bearing on translatorship attribution problems. The major advantage of quantitative methods for translatorship attribution is the ability to analyze large amounts of data and to discover patterns that are not evident to the human reader.

Quantitative translatorship attribution is often treated as a classification problem: a text of uncertain or problematic authorship is analyzed and compared with a corpus of texts by possible authors, and then attributed to the author with whose works it shares the most ‘characteristics.’ Recent years have seen renewed interest in many issues involved in optimizing quantitative authorship attribution. One of them is the effect of the number of candidate authors. As Luyckx and Daelemans (2010) have shown, the accuracy of authorship analysis decreases as the number of possible authors increases. It is therefore advisable to limit the number of possible authors in order to obtain highly accurate results. In our case, however, many of the early Chinese Buddhist translations are only rarely mentioned in historical records and canonical catalogues, and few have attracted the attention of philologists. For these translations, it is difficult to reduce the range of possible translators.

Therefore, as part of our attempt to establish a foundation for quantitative translatorship attribution for early Chinese Buddhist translations, we propose a classification mechanism based on predicting the translation time or period of a text. The advantage of this mechanism is twofold. First, within a given time bracket for the translation, the number of possible authors is limited, thereby improving the performance of the translatorship attribution. Second, by examining the resulting classification mechanism, we are able to identify possible and probable stylistic features of translations for different periods.

The time periods we focus on in the present study include three early Chinese dynasties: the Eastern Han (C.E. 25-220), the Three Kingdoms (C.E. 220-280) and the Western Jin (C.E. 266-316). These three dynasties constitute the earliest phase of Buddhist translation history and most of the translations from these periods present attribution problems. In this research, we build a classification mechanism for each of the three dynasties. These can be used to test whether the translation style of a text is similar to the one prevalent during a certain period. We are aware that translation styles in Buddhist Hybrid Chinese can vary greatly within a given period.

For the Eastern Han (C.E. 25-220) and the Three Kingdoms periods we build on recent philological scholarship (Nattier 2008), which has ascertained a number of attributions for this period. For the Western Jin textual corpus, we rely on contemporary research on traditional Buddhist sūtra catalogs, from which we exclude those texts for which current scholarship has not reached a consensus (Lancaster 2008; Lü 1981; Ren 1985; Yu 1993; Xu 1987). We then adopt the Variant Length N-gram algorithm (Hung et al. 2009) to extract the stylometric features from the three corpora of ascertained texts. Variant Length N-gram is an extended form of the traditional n-gram algorithm. In the traditional n-gram algorithm, the gram length n is fixed. Although the choice of n has a great impact on the performance of the subsequent analysis, deciding on the best value of n is not straightforward. The Variant Length N-gram algorithm generates grams of all possible lengths, then removes those which are not significant. Thus, the importance of stylometric features is measured across grams of different lengths. This is crucial as there are no word boundaries in Buddhist Hybrid Chinese: gram-based analysis must therefore include grams of any length.
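
The generate-then-prune idea can be illustrated with a minimal Python sketch. It is not the implementation of Hung et al.: the abstract does not state how significance is measured, so the sketch assumes a simple stand-in (a gram is kept if it occurs in at least min_docs texts). The function names, the max_n cap and the toy strings are likewise hypothetical.

```python
from collections import Counter

def gram_counts(text, max_n):
    """Count every character gram of length 1..max_n (overlaps included)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def select_grams(texts, max_n=4, min_docs=2):
    """Keep grams attested in at least min_docs texts.

    ASSUMPTION: document frequency stands in for the significance test,
    which the abstract does not specify.
    """
    doc_freq = Counter()
    for text in texts:
        # set(...) so that each text contributes at most once per gram
        doc_freq.update(set(gram_counts(text, max_n)))
    return {g: df for g, df in doc_freq.items() if df >= min_docs}

def feature_vector(text, grams, max_n=4):
    """Relative frequency of each retained gram, in the order given."""
    counts = gram_counts(text, max_n)
    length = max(len(text), 1)  # avoid division by zero on empty input
    return [counts[g] / length for g in grams]

# Toy strings for illustration only.
texts = ["如是我聞一時佛在", "聞如是一時佛遊", "如是我聞一時婆伽婆"]
grams = select_grams(texts)
print(sorted(grams, key=len, reverse=True)[:3])
```

Because grams of every length compete in the same pruning step, no single value of n has to be chosen in advance; the cap max_n only bounds the cost of enumeration.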

In the final stage, we use Fisher Linear Discriminant Analysis (FLDA) to analyze the stylometric features that have been extracted from the translations and to build up the classification mechanisms. The FLDA is a well-known dimension-reducing and classification algorithm. It returns a linear function that maps the high-dimensional source data of different groups onto one-dimensional points such that the ratio of the variance between groups to the variance within groups of the projected points is maximized. Since the FLDA’s transformation is based on assigning weights to n-grams, the analysis is capable of yielding distinctive features, i.e. strings of Chinese characters, that are characteristic of the dynasties in question.
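
For two groups, Fisher's criterion has a closed-form solution: the weight vector is proportional to S_W^{-1}(m_1 - m_0), where m_0 and m_1 are the group means and S_W is the pooled within-group scatter matrix. The following Python sketch shows that textbook form under stated assumptions: the two-feature toy data, the small ridge term added to keep S_W invertible, and the midpoint threshold are illustrative choices, not details taken from the study.

```python
def mean(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def scatter(rows, mu):
    """Sum of outer products of deviations from the group mean."""
    d = len(mu)
    s = [[0.0] * d for _ in range(d)]
    for row in rows:
        diff = [x - m for x, m in zip(row, mu)]
        for i in range(d):
            for j in range(d):
                s[i][j] += diff[i] * diff[j]
    return s

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_flda(class0, class1, ridge=1e-6):
    """Return (weights, threshold) for a two-class Fisher discriminant.

    ASSUMPTION: the ridge term and the midpoint threshold are
    illustrative; the abstract does not describe either.
    """
    m0, m1 = mean(class0), mean(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    d = len(m0)
    sw = [[s0[i][j] + s1[i][j] + (ridge if i == j else 0.0)
           for j in range(d)] for i in range(d)]
    w = solve(sw, [b - a for a, b in zip(m0, m1)])
    threshold = sum(wi * (a + b) / 2 for wi, a, b in zip(w, m0, m1))
    return w, threshold

def project(w, x):
    """One-dimensional projection of a feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Toy relative frequencies of two hypothetical grams.
other_periods = [[0.10, 0.02], [0.12, 0.01], [0.09, 0.03]]
target_period = [[0.02, 0.11], [0.01, 0.09], [0.03, 0.12]]
w, threshold = fit_flda(other_periods, target_period)
print(w, threshold)
```

Each entry of w is the weight of one gram, so grams with large positive weights pull a text toward the target period and grams with large negative weights pull it away, which is how a list of period-specific strings can be read off the fitted function.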

According to our experiments, the classification mechanisms for the three dynasties have all reached an accuracy rate higher than 90%. Moreover, when the three classification mechanisms are combined and used to predict the translation time of an unknown translation, we can achieve an accuracy rate and a recall rate both above 80%. In addition, we are able to identify characteristic translation terms for different time periods.
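
The abstract does not say how the three classifiers are combined or how the two rates are computed. One plausible reading, sketched below in Python purely as an assumption, is that each per-dynasty classifier returns a signed score, the text is assigned to the period with the highest positive score, and it is left unassigned if no classifier accepts it. Under that reading, accuracy is taken here as the share of assigned texts that are correct, and recall as the share of all texts that are correctly assigned. The function names and the scores are hypothetical.

```python
def predict_period(scores):
    """Pick the period with the highest positive score, else None.

    ASSUMPTION: scores are signed margins, one per dynasty classifier,
    where a positive value means the classifier accepts the text.
    """
    period, best = max(scores.items(), key=lambda item: item[1])
    return period if best > 0 else None

def accuracy_and_recall(predictions, truths):
    """Accuracy over assigned texts; recall over all texts (assumed definitions)."""
    assigned = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    correct = sum(1 for p, t in assigned if p == t)
    accuracy = correct / len(assigned) if assigned else 0.0
    recall = correct / len(truths) if truths else 0.0
    return accuracy, recall

# Made-up scores for four test texts, not experimental data.
score_table = [
    {"Eastern Han": 1.2, "Three Kingdoms": -0.3, "Western Jin": -0.8},
    {"Eastern Han": -0.5, "Three Kingdoms": 0.4, "Western Jin": 0.9},
    {"Eastern Han": -0.2, "Three Kingdoms": -0.1, "Western Jin": -0.6},
    {"Eastern Han": 0.1, "Three Kingdoms": 0.7, "Western Jin": -0.4},
]
truths = ["Eastern Han", "Three Kingdoms", "Western Jin", "Three Kingdoms"]
predictions = [predict_period(s) for s in score_table]
print(predictions, accuracy_and_recall(predictions, truths))
```

Allowing an unassigned outcome is what makes the two rates differ: a text rejected by all three classifiers lowers recall without counting as a wrong prediction.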

References

Harrison, P. (1993). The Earliest Chinese Translations of Mahāyāna Buddhist Sūtras: Some Notes on the Works of Lokakṣema. Buddhist Studies Review 10(2): 135-177.

Hung, J., M. Bingenheimer, and S. Wiles (2009). Quantitative evidence for a hypothesis regarding the attribution of early Buddhist translations. Literary and Linguistic Computing 25(1): 119-134.

Lancaster, L. (2008). Catalogues in the Electronic Era: CBETA and The Korean Buddhist Canon: A Descriptive Catalogue. CBETA, Taipei, 2008 (electronic publication). Retrieved from http://jinglu.cbeta.org/lancaster.htm.

Lü Cheng 呂澂 (1981). Xinbian hanwen dazangjing mulu 新編漢文大藏經目錄. Jinan: Qilu shushe 齊魯書社.

Luyckx, K., and W. Daelemans (2010). The effect of author set size and data size in authorship attribution. Literary and Linguistic Computing 26(1): 35-55.

Nattier, J. (2008). A Guide to the Earliest Chinese Buddhist Translations: Texts from the Eastern Han 東漢 and Three Kingdoms 三國 Periods. Tokyo: The International Research Institute for Advanced Buddhology, Soka University.

Ren Jiyu 任繼愈 (1985). Zhongguo fojiao shi 中國佛教史. Vol. 1. Beijing: Zhongguo shehui kexue chubanshe 中国社会科学出版社.

Xu Lihe 許理和 (1987). Zui zao de fojing yiwen zhong de donghan kouyu chengfen 最早的佛經譯文中的東漢口語成分, Yu yan xue lun cong 語言學論叢, Vol. 14. Beijing: Shangwu yinshuguan 商務印書館, pp. 197-225.

Yu Liming 俞理明 (1993). Fojing wenxian yuyan 佛経文献語言 [The Language of the Buddhist Scriptures]. Chengdu: Bashu shushe 巴蜀書社, p. 206.

Zürcher, E. (1991). A New Look at the Earliest Chinese Buddhist Texts. In K. Shinohara et al. (eds.), From Benares to Beijing: Essays on Buddhism and Chinese Religion in Honour of Prof. Jan Yün-hua. Oakville, Ontario: Mosaic Press, pp. 277-304.