The digitisation of official government documents is important from the viewpoint of history, archives, and digital humanities. However, little digitisation and publication of historical government documents has been carried out in Japan. There are several reasons for this (Koga, 2005). First, Japan’s frequent natural disasters make it difficult to preserve documents. Second, many documents were lost during the Second World War through air raids, evacuations, and disposal (National Archives of Japan, 2006).

Two of the few exceptions are the government documents of the governors general of Taiwan and Korea, which have been preserved in exceptionally good condition (Kato, 2002). The former include administrative documents depicting public organisations in pre-war Japan. Unlike most official Japanese documents, this collection has been published as image data and partially digitised in text format (Higashiyama, 2017).

The goal of this study is to develop a tool that gives a visual overview of the enormous number of administrative documents of Taiwan. A further aim is to investigate the usefulness of digitally archived text data of official documents for historical and social studies. For this purpose, an application that automatically constructs KML data for GIS visualisation systems has been developed.

Target contents

This study examines data from the ‘Catalog Database of the Government-General of Taiwan’s Administrative Documents’ (ISSCU, 2015), a database outlining the documents found in the ‘Catalog of the Government-General of Taiwan’s Administrative Documents’ (台湾総督府文書目録, CCCADGT, 1994). These text data include the titles of official documents from 1895 to 1914.

In addition to the catalogue documents, three types of datasets were created for automatic KML data construction. The first is a dictionary for old and historical words used in the target documents. The second is a word category for historical topics (Murai and Kawashima, 2018), and the third is a dataset for old place names and positions in Taiwan.

The dictionary of old and historical words in the Catalog Database of the Government-General of Taiwan’s Administrative Documents includes 136 words related to places and 61 historical or cultural words. These words were added to a standard Japanese dictionary for morphological analysis, which was then used to divide the texts into words. The word category for historical topics includes 785 words from 26 topics (see Table 1). This category was used to detect topics within each official document title (Murai and Kawashima, 2018).
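The topic detection step can be illustrated with a minimal sketch. It assumes plain substring matching in place of the morphological analysis described above, and the word lists, the topic assignments of those words, and the example title are hypothetical illustrations rather than entries from the actual word category.

```python
# Hypothetical excerpt of a topic word category (the real one has 785 words).
# The assignment of these words to topics is an assumption for illustration.
TOPIC_WORDS = {
    "Hygiene": ["衛生", "病院"],
    "Police": ["警察"],
    "Personnel shift": ["任命", "免官"],
}


def detect_topics(title, topic_words=TOPIC_WORDS):
    """Return a dict mapping each matched topic to its number of word hits.

    Substring counting is a stand-in for matching against the output of a
    morphological analyser with the extended dictionary.
    """
    counts = {}
    for topic, words in topic_words.items():
        hits = sum(title.count(word) for word in words)
        if hits:
            counts[topic] = hits
    return counts


# Hypothetical document title, not taken from the catalogue.
print(detect_topics("台北病院医員任命ノ件"))
```

A title may match several topics at once, so the function returns every matched topic rather than a single label.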

The dataset for old place names in Taiwan includes 274 pairs (see Table 2) of place names and geographic information (latitude and longitude).

Table 1. Number of words in topic categories

Topic category        Number of words    Topic category                Number of words
Personnel shift       119                Monopoly                      16
Publicity             29                 Establishment and closure     13
Office work           20                 Disaster and welfare          22
Permission            8                  Mining and engineering        12
Hygiene               91                 Rebellion and indigenous      22
Economy and tax       36                 Agriculture                   33
Police                16                 Imperial family and rituals   28

Table 2. Number of place names with geographic positions by area

Area                  Number of place names    Area                Number of place names
台北 (Taipei)         32                       南投 (Nantou)       11
台中 (Taichung)       26                       雲林 (Yunlin)       15
台南 (Tainan)         29                       嘉義 (Chiayi)       18
桃園 (Taoyuan)        7                        屏東 (Pingtung)     16
高雄 (Kaohsiung)      30                       宜蘭 (Yilan)        9
基隆 (Keelung)        3                        花蓮 (Hualien)      10
新竹 (Hsinchu)        8                        台東 (Taitung)      15
苗栗 (Miaoli)         13                       澎湖 (Penghu)       7
彰化 (Changhua)       25

These names were extracted manually from the ‘Catalog of the Government-General of Taiwan’s Administrative Documents’.
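As a rough illustration of how such a place name dataset could be used, the sketch below looks up place names in a document title and returns their coordinates. The gazetteer entries use approximate modern coordinates chosen for illustration, not values from the dataset described here, and the function name and example title are hypothetical.

```python
# Hypothetical gazetteer excerpt: place name -> (latitude, longitude).
# Coordinates are approximate modern values, assumed for illustration only.
GAZETTEER = {
    "台北": (25.03, 121.56),
    "台南": (23.00, 120.21),
    "基隆": (25.13, 121.74),
}


def find_places(title, gazetteer=GAZETTEER):
    """Return (name, lat, lon) for each gazetteer name found in the title,
    ordered by the position of its first occurrence."""
    found = []
    for name, (lat, lon) in gazetteer.items():
        position = title.find(name)
        if position != -1:
            found.append((position, name, lat, lon))
    return [(name, lat, lon) for _, name, lat, lon in sorted(found)]


# Hypothetical document title, not taken from the catalogue.
print(find_places("基隆台北間鉄道敷設ノ件"))
```

Old place names that are substrings of longer names would need a longest-match rule, which this sketch leaves out.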

KML construction

The method for extracting topics and geographic information based on the three types of datasets is depicted in Figure 1. The target titles also carry chronological information, since official documents include promulgation dates. Therefore, the extracted topic categories and geographical information were combined with the chronological information into the KML data format. As a result, users can visually recognise chronological and spatial shifts in historical topics by using a GIS application (Figure 2).

Fig. 1: Method for extracting topics and geographic information
Fig. 2: Example of extracted KML on Google Earth
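The combination of topic, place, and date into KML can be sketched as follows. This is a minimal example using Python's standard `xml.etree.ElementTree` module, not the application developed in this study. The record fields (`topic`, `title`, `date`, `lat`, `lon`) and the sample record are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def _tag(name):
    """Qualify an element name with the KML namespace."""
    return f"{{{KML_NS}}}{name}"


def build_kml(records):
    """Build a KML string with one time-stamped Placemark per record.

    Each record is a dict with the hypothetical keys
    'topic', 'title', 'date' (ISO format), 'lat', and 'lon'.
    """
    ET.register_namespace("", KML_NS)
    kml = ET.Element(_tag("kml"))
    document = ET.SubElement(kml, _tag("Document"))
    for rec in records:
        placemark = ET.SubElement(document, _tag("Placemark"))
        ET.SubElement(placemark, _tag("name")).text = rec["topic"]
        ET.SubElement(placemark, _tag("description")).text = rec["title"]
        timestamp = ET.SubElement(placemark, _tag("TimeStamp"))
        ET.SubElement(timestamp, _tag("when")).text = rec["date"]
        point = ET.SubElement(placemark, _tag("Point"))
        # KML expects coordinates in longitude,latitude order.
        ET.SubElement(point, _tag("coordinates")).text = f"{rec['lon']},{rec['lat']}"
    return ET.tostring(kml, encoding="unicode")


# Hypothetical record; the title, date, and coordinates are illustrative.
sample = [{
    "topic": "Hygiene",
    "title": "台北病院医員任命ノ件",
    "date": "1898-05-01",
    "lat": 25.03,
    "lon": 121.56,
}]
print(build_kml(sample))
```

The `TimeStamp` element is what lets a viewer such as Google Earth filter placemarks by date, which corresponds to the chronological shifts described above.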

Conclusions and future work

The datasets including the topic word categories and place names enable the extraction of chronological and spatial characteristics for historical documents. Moreover, KML data enable users to visually recognise the geographical characteristics as well as the chronological and spatial shifts in historical topics.

Although a catalogue-only analysis has various limitations, its application to the Catalog of the Government-General of Taiwan’s Administrative Documents demonstrated the utility of these datasets.

One possible direction for future work is the use of named entity recognition based on natural language processing techniques. However, its accuracy is not yet sufficient at this stage, so this must await further development of the field.

If appropriate datasets of old place names, geographical information, and words that indicate historical topics are constructed, this method and software would be applicable to other historical documents. In addition, since there are other valuable data formats for GIS, supporting multiple formats may be useful for generalisation.


