This paper reports on the development of a system that visualizes the co-occurrence patterns of words in the Kokinshu (ca. 905). Few studies have applied visualization technology to ancient languages; in many cases researchers still analyze limited material carefully by visual inspection. Against this background, the study aims at a longitudinal analysis of ancient languages and at the development of an effective visualization system tailored to them. To that end, we will first prepare a vocabulary database of the ancient language, and then analyze the language through a visualization that provides a bird's-eye view of the classical vocabulary. Unlike modern languages, ancient languages do not offer massive amounts of data, and research must be carried out within that limitation. In addition, because the extant texts express a limited range of meaning and content, the analysis of ancient languages differs in many ways from that of modern languages, and no uniform solution to these problems is straightforward.

In this study, we limit the database to the Kokinshu, the first imperially commissioned anthology of classical Japanese poetry (ca. 905), which consists of 1,111 poems. We will use the poem texts and their contemporary-language translations to examine whether the analytical process is proper and valid. Using the poetry data, we will 1) calculate the similarity of each word, and 2) generate word relationship data between any two words (Mikolov et al. 2013). We have previously studied the extraction of relational pairs of ‘orange,’ ‘plum,’ and ‘cherry’ flowers in poetic Japanese (Hodošček and Yamamoto 2017). In that study, we statically generated 2-gram patterns of poem words without calculating each word’s similarity in advance. In the present study, by contrast, we perform an exhaustive investigation of related pairs rather than investigating particular words one at a time. We will examine the differences between calculating and not calculating the similarity of each word, and consider whether dynamic word relationships can be displayed without a thesaurus that shows the vocabulary categories explicitly.
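
As a rough illustration of step 1), the sketch below computes word similarity as the cosine between context-count vectors. It is only a count-based stand-in under stated assumptions, not the neural model of Mikolov et al. (2013) and not necessarily the method used in the system; the function names, the window size, and the romanized toy tokens are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_vectors(poems, window=2):
    """Build a context-count vector for every word.

    poems:  list of token lists (one list per poem); toy data here.
    window: number of tokens on each side counted as context (assumed value).
    """
    vectors = defaultdict(Counter)
    for tokens in poems:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy input, not taken from the Kokinshu database.
toy_poems = [
    ["ume", "no", "hana", "saku"],
    ["sakura", "no", "hana", "saku"],
]
vecs = context_vectors(toy_poems)
print(cosine(vecs["ume"], vecs["sakura"]))  # same contexts: high similarity
print(cosine(vecs["ume"], vecs["no"]))      # different contexts: lower
```

In this toy input ‘ume’ and ‘sakura’ occur in identical contexts, so they come out as highly similar, which is the kind of signal that distinguishes similarity from mere co-occurrence.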

We will apply the following procedure: first, a machine-learned processor evaluates the appropriateness of the word-unit length for the ancient language; second, we evaluate the classification of the meanings of poetic words; finally, according to the classification categories of poetic words, we generate co-occurrence patterns and attempt to visualize them. To determine the word units, we set up the following four cases of co-occurrence patterns and examine whether the patterns can be generated for each case, namely: 1) the words of a pair are adjacent, and a. their order is probabilistically constrained, or b. their order is not probabilistically constrained; 2) the words of a pair are not adjacent, and a. their order is probabilistically constrained, or b. their order is not probabilistically constrained.
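
The four cases can be sketched as a single pair-counting routine with two switches. This is a minimal illustration under a simplifying assumption: "order is constrained" is read here as counting ordered pairs (so that directional frequencies can be estimated), and "not constrained" as counting unordered pairs. The function name and the toy tokens are hypothetical and are not part of the system described above.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_patterns(poems, adjacent, ordered):
    """Count word pairs that co-occur within the same poem.

    adjacent: True  -> only neighbouring tokens (cases 1a, 1b)
              False -> only non-neighbouring tokens (cases 2a, 2b)
    ordered:  True  -> keep (earlier, later) order (cases 1a, 2a)
              False -> ignore order by sorting the pair (cases 1b, 2b)
    """
    counts = Counter()
    for tokens in poems:
        for i, j in combinations(range(len(tokens)), 2):
            if (j - i == 1) != adjacent:
                continue
            pair = (tokens[i], tokens[j])
            if not ordered:
                pair = tuple(sorted(pair))
            counts[pair] += 1
    return counts


# Hypothetical toy input, not taken from the Kokinshu database.
toy_poems = [
    ["ume", "no", "hana"],
    ["hana", "no", "ume"],
]
for adjacent in (True, False):
    for ordered in (True, False):
        label = f"adjacent={adjacent}, ordered={ordered}"
        print(label, dict(cooccurrence_patterns(toy_poems, adjacent, ordered)))
```

Comparing the ordered and unordered counts for the same pair gives a simple indication of how strongly the order of the two words is constrained in the corpus.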

An unsolved problem in applying machine learning to natural language processing tasks is that it cannot satisfactorily distinguish lexical similarity from lexical relevance (Hill et al. 2015). The two relations are often disjoint: ‘plum’ and ‘cherry’ are similar to each other, but there is no relevance between them other than that both are flowers. On the other hand, ‘plum’ and ‘branch’ are associated with each other as parts of plants, but are otherwise lexically dissimilar. If machine processing could recognize similarity and relevance separately, the modeling of vocabulary would come closer to that of a human expert, which would also contribute to the processing of classical vocabulary. The analysis and visualization system will use the Kokinshu database of Nakamura et al. (1999) as its corpus.

The system described here is a work in progress, and the present paper is an interim report. This work was supported by KAKENHI (18K00528), Grant-in-Aid for Scientific Research (C).

References

  1. Hill, F., Reichart, R., and Korhonen, A. (2015). SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation. Computational Linguistics, 41(4), 665–695. https://doi.org/10.1162/COLI_a_00237
  2. Hodošček, Bor, and Hilofumi Yamamoto (2017) A study on the extraction of relational pairs of ‘orange,’ ‘plum,’ and ‘cherry’ flowers in poetic Japanese. Symposium of Humanities and Computer, Proceedings of Humanities and Computer, Vol. 2017, No. 2, pp. 207–212, 12.
  3. Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean (2013) "Efficient Estimation of Word Representations in Vector Space" CoRR, URL: http://arxiv.org/abs/1301.3781
  4. Nakamura, Yasuo, Yoshihiko Tachikawa, and Mayuko Sugita (1999) Kokubungaku kenkyu shiryokan detabesu koten korekushon "Nijuichidaishu" Shohobanbon CD-ROM (Database Collection by National Institute of Japanese Literature "Nijuichidaishu" the Shoho edition CD-ROM). Iwanami Shoten.