AUTOMATION OF LEXICAL SEARCH IN THE NATIONAL CORPUS OF THE CHUVASH LANGUAGE: METHODS FOR RESEARCHING THE SPACE OF LITERARY TEXTS

Research article

Zheltov Pavel Valerianovich

Zheltov Valerian Pavlovich

Gubanov Aleksei Rafailovich

DOI:

https://doi.org/10.18454/RULB.7.37

Issue: No. 3 (7), 2016


Zheltov Pavel Valerianovich

I.N. Ulyanov Chuvash State University

Zheltov Valerian Pavlovich

I.N. Ulyanov Chuvash State University

Gubanov Aleksei Rafailovich

I.N. Ulyanov Chuvash State University

Abstract

The article considers the automation of lexical search in the national corpus of the Chuvash language. A conceptual model of the space of literary texts is defined, and on its basis a set of methods for analyzing literary texts is outlined. The following methods for researching the space of literary texts are considered: text tokenization, text normalization, morphological analysis, named entity recognition, text classification, text search, and determination of the topic of texts.

Keywords:

search engine, text corpus, text markup, query, indexing.

Introduction

In the modern era of information and telecommunication technologies, very large amounts of literary text are available for analysis (old texts have been digitized, and many new texts already exist in digital form), but the capacity of an individual philologist or linguist (hereinafter referred to as the “researcher”) to analyze them remains very modest. The constraint is not only the researcher’s limited time, but also his or her limited cognitive resources.

National corpora are now being created for this purpose. Such text corpora have already been established for many languages of the Russian Federation; they are large structured repositories of texts that support fast search at several language levels: morphemic, morphological, syntactic, textual and semantic.

The authors are working on the creation of the Chuvash National Corpus. This publication was prepared within the scientific project No. 15-04-00532, supported by the Russian Foundation for Humanities (RFH).

Automating the analysis of a national corpus gives linguists wide opportunities for scientific research.

Our constructive approach to the automatic analysis of national corpora consists of two stages: first, we define a conceptual model of the space of literary texts; second, we outline a minimal set of methods for analyzing literary texts.

The space of literary texts and the stages of its analysis

The space of literary texts consists of the following levels:

Level of authors of literary texts.
Level of literary texts.
Level of meanings.

At the level of authors we consider a set of literary authors related to each other in various ways (e.g., relations of borrowing, relations of literary heritage). At the level of literary texts we consider both whole literary works (novels, stories, tales etc.) and their parts (chapters, paragraphs, sentences, phrases etc.), related to each other in various ways (e.g., inclusion, belonging to the same author, sequence). At the level of meanings we consider the set of semantic concepts underlying literary texts.

Thus, the basic elements of the model of the space of literary texts are the following kinds of objects:

Text author.
Informational (literary) text: work of fiction, paragraph, sentence, phrase.
Meaning, described by a semantic descriptor (a set of keywords connected by logical connectives that determines the meaning).
Informational object: an entity, event, person etc. (informational objects can be complex and consist of other informational objects).
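
The object kinds listed above could be represented in code roughly as follows. This is only a minimal sketch, not the authors' implementation: the class and field names are hypothetical, the semantic descriptor is assumed to be stored as an OR of AND-groups of keywords, and lowercased word forms stand in for real lemmas.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticDescriptor:
    """Keywords joined by logical connectives, stored as an OR of AND-groups."""
    any_of: list[frozenset[str]]  # each inner set is one AND-group of keywords

    def matches(self, lemmas: set[str]) -> bool:
        # True if at least one AND-group is fully present among the lemmas.
        return any(group <= lemmas for group in self.any_of)


@dataclass
class Author:
    name: str
    influenced_by: list["Author"] = field(default_factory=list)  # author-level relation


@dataclass
class InformationalObject:
    """An entity, event or person; may consist of other informational objects."""
    label: str
    components: list["InformationalObject"] = field(default_factory=list)


@dataclass
class TextUnit:
    """A literary work or one of its parts (chapter, paragraph, sentence, phrase)."""
    kind: str
    content: str
    author: Author
    parts: list["TextUnit"] = field(default_factory=list)  # inclusion relation

    def lemmas(self) -> set[str]:
        # Placeholder: lowercased word forms instead of real lemmas.
        return {w.strip(".,;:!?").lower() for w in self.content.split()}


author = Author("Hypothetical Author")
sentence = TextUnit("sentence", "The river froze in winter.", author)
winter_nature = SemanticDescriptor([frozenset({"river", "winter"}), frozenset({"snow"})])
print(winter_nature.matches(sentence.lemmas()))
```

Keeping the descriptor as a list of keyword groups makes the "logical connectives" explicit without a query parser; a richer design could add negation or nesting.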

The researcher should be able to work at any level of the space: the level of authors, the level of literary texts and the level of meanings.

Research methods of literary texts space

Let us consider a set of formal methods for researching the space of literary texts that are required to automate the researcher’s work. We consider the main methods, the requirements for them and examples of their use (the requirements should be defined depending on the substantive task being addressed). The methods are as follows: tokenization and normalization of text, morphological analysis, named entity recognition, text classification, text search, and determination of the topic of a text or a set of texts.

Tokenization and normalization of text

First, the text under study should undergo primary processing, or «tokenization». During this process the boundaries of words (each resulting element is called a «token») and the boundaries of sentences are determined. Missing punctuation marks (full stops, i.e. sentence separators) also need to be restored. This processing is required for the subsequent morphological analysis of the text (see below), which is why tokenization is frequently built into morphological and syntactic analyzers.

Required functions:

Determination of the boundaries of words and sentences, with correct handling of abbreviations such as RF (Russian Federation), ChR (Chuvash Republic) etc.
High processing speed (ability to process large text arrays).
Preferably: determination of sentence boundaries when punctuation and capital letters are absent (this happens when errors occur in text digitization).
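
A rule-based tokenizer that meets the first requirement could look roughly like the sketch below. It is an illustration under stated assumptions, not the method used in the corpus: the abbreviation list is a small hypothetical sample, and sentences are assumed to end at «.», «!» or «?».

```python
import re

# Assumed sample list; a real system would load a much larger one.
ABBREVIATIONS = {"etc.", "e.g.", "i.e."}
# A word (optionally with inner hyphens or apostrophes) or one punctuation mark.
WORD_OR_PUNCT = re.compile(r"\w+(?:[-'’]\w+)*|[^\w\s]")
SENTENCE_END = {".", "!", "?"}


def tokenize(text):
    """Split text into tokens, keeping known abbreviations in one piece."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)  # the dot belongs to the abbreviation
        else:
            tokens.extend(WORD_OR_PUNCT.findall(chunk))
    return tokens


def split_sentences(tokens):
    """Group tokens into sentences at sentence-final punctuation."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without final punctuation
        sentences.append(current)
    return sentences


tokens = tokenize("The ChR is part of the RF. Many poets, e.g. lyricists, lived there.")
sentences = split_sentences(tokens)
print(len(sentences), sentences[1])
```

Because `\w` is Unicode-aware in Python 3, the same pattern also covers Cyrillic text, including the additional Chuvash letters. Recovering sentence boundaries when punctuation and capitalization are missing is beyond such rules and would need a statistical model.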

Morphologic analysis

Literary texts, both Russian and Chuvash, are on the whole quite correct, but variations in writing style, typos and digitization errors are possible. Orthographic, grammatical and stylistic errors, as well as missing punctuation and capitalization, can also be present. Meanwhile, most publicly available standard tools for text analysis (morphological and syntactic analyzers etc.) are designed for grammatically correct texts, so the text has to be normalized before it is analyzed.

Required functions:

Correction of orthographic errors (errors in the spelling of words).
Preferably: conversion of non-standard words to the standard lexicon.
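
One simple way to approach the first requirement is to map each unknown word to its closest entry in a lexicon. The sketch below uses `difflib.get_close_matches` from the Python standard library. The lexicon and the similarity cutoff of 0.8 are assumptions chosen for illustration, and a production system would more likely use a dedicated spelling model.

```python
import difflib

# Assumed toy lexicon; in practice this would be the dictionary of the corpus.
LEXICON = ["corpus", "language", "search", "text", "author"]


def normalize_word(word, lexicon=LEXICON, cutoff=0.8):
    """Return the closest lexicon entry, or the lowercased word if none is close."""
    lowered = word.lower()
    if lowered in lexicon:
        return lowered
    candidates = difflib.get_close_matches(lowered, lexicon, n=1, cutoff=cutoff)
    return candidates[0] if candidates else lowered


print([normalize_word(w) for w in ["Langauge", "corpsu", "xyz"]])
```

Words with no sufficiently close match are left unchanged, so that rare but valid vocabulary is not silently rewritten.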

It is worth noting that morphological analysis is an essential part of most text processing methods. This kind of analysis yields, for every word form, its lemma (base word form) and a set of morphological categories (part of speech, gender, number, case etc.).

Example of the morphological parsing of a noun:

Base form (nominative case in both Russian and Chuvash; singular number);
Proper or common;
Animate or inanimate; in Russian: gender, in Chuvash: aspect; declension; number; case;
Role in sentence.

Required functions:

High-quality support for the Russian and Chuvash languages.
High processing speed (ability to process large text arrays).
Preferably: resolution of homonymy.
Preferably: correct handling of short noisy texts with errors and non-standard vocabulary.
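
The interface of such an analyzer can be sketched as a function that returns every candidate analysis of a word form, so that homonymy can be resolved at a later step. The example below is a toy under explicit assumptions: the lexicon holds two Chuvash nouns (хула «city», кӗнеке «book»), only the plural marker -сем is handled, and all names are hypothetical. The data should be checked against a real Chuvash lexicon before any use.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    lemma: str
    pos: str
    number: str
    case: str


# Assumed toy lexicon: lemma -> part of speech (keys normalized to NFC).
LEMMAS = {unicodedata.normalize("NFC", k): v
          for k, v in {"хула": "noun", "кӗнеке": "noun"}.items()}
# Assumed ending table: ending -> (number, case); only two entries for illustration.
ENDINGS = {"": ("singular", "nominative"), "сем": ("plural", "nominative")}


def analyze(word_form):
    """Return all analyses whose stem is a known lemma (may be empty)."""
    form = unicodedata.normalize("NFC", word_form).lower()
    analyses = []
    for ending, (number, case) in ENDINGS.items():
        stem = form[: len(form) - len(ending)]
        if form.endswith(ending) and stem in LEMMAS:
            analyses.append(Analysis(stem, LEMMAS[stem], number, case))
    return analyses


print(analyze("хуласем"))
```

Returning a list rather than a single result keeps ambiguous forms visible; an empty list marks a word the analyzer does not know, which is where normalization or a guesser would take over.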

Named entity recognition

During named entity recognition (NER), an algorithm automatically marks names of companies, persons, geographic names etc. Such markup can be useful for solving a variety of tasks in literary text analysis.

Required functions:

Marking names of companies and persons, geographic names, time expressions, numbers, sums and percentages.
Preferably: handling of texts with errors or without punctuation or capitalization.
Preferably: adaptation to literary texts.
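
A baseline for this kind of markup can be built from a name list (gazetteer) and a few regular expressions. The sketch below is an assumption-laden illustration rather than the recognizer planned for the corpus: the gazetteer entries, the label names and the patterns are all hypothetical, and the `ignore_case` option only partly addresses texts without capitalization.

```python
import re

# Assumed sample gazetteer: surface form -> entity label.
GAZETTEER = {"Cheboksary": "LOCATION", "Volga": "LOCATION"}
# Assumed patterns for numeric expressions.
PATTERNS = [
    ("PERCENT", re.compile(r"\d+(?:\.\d+)?\s?%")),
    ("YEAR", re.compile(r"\b(?:1[0-9]|20)\d{2}\b")),
]


def recognize(text, ignore_case=False):
    """Return (start, end, label, surface) tuples sorted by position."""
    flags = re.IGNORECASE if ignore_case else 0
    entities = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text, flags):
            entities.append((match.start(), match.end(), label, match.group()))
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            entities.append((match.start(), match.end(), label, match.group()))
    return sorted(entities)


found = recognize("In 1905 the author moved to Cheboksary, where 40% of the story is set.")
print([(label, surface) for _, _, label, surface in found])
```

Character offsets are kept so that the marks can be stored as stand-off annotation next to the text; person and company names would in practice require a trained model rather than a list.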

Application examples:

Use of extracted proper names (names in Chuvash literary texts differ from Russian ones) to improve the classification and clustering of texts and their authors.

Texts classification

Classification of texts (and their parts) with respect to a certain set of categories is a key task in a large number of applications of literary texts analysis. Examples of such applications are given below.

Required functions:

Construction of classifiers based on modern machine learning algorithms, such as logistic regression, support vector machines and decision trees.
Extraction of different types of features from text pre-processed with a morphological analyzer:
- «bag-of-words» (see http://en.wikipedia.org/wiki/Bag-of-words_model)
- character n-grams
- matches with words from a given vocabulary
- n-grams of lemmas
- n-grams of parts of speech (POS)
- preferably: syntactic features (dependency parsing)
High processing speed (ability to train a model on tens of thousands to millions of texts).
Ability to integrate into the classifier additional features not extracted from the text (topic, author’s nationality etc.).
Preferably: automatic feature selection.
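
The requirements above can be illustrated with a small pure-Python sketch that extracts bag-of-words, word bigram and character n-gram features and fits a binary logistic regression by stochastic gradient ascent. Everything here is an assumption made for illustration: the function names, the toy training sentences, the learning rate and the number of epochs. Lowercased word forms stand in for the lemmas a morphological analyzer would supply, and a real system would more likely rely on an established machine learning library.

```python
import math
from collections import Counter, defaultdict


def extract_features(text, n=3, extra=None):
    """Bag-of-words, word bigrams and character n-grams, plus optional metadata."""
    words = text.lower().split()  # stand-in for lemmas
    features = Counter({"bias": 1})
    for word in words:
        features["w=" + word] += 1
    for first, second in zip(words, words[1:]):
        features["w2=" + first + "_" + second] += 1
    joined = " ".join(words)
    for i in range(len(joined) - n + 1):
        features["c=" + joined[i:i + n]] += 1
    for key, value in (extra or {}).items():  # features not taken from the text
        features["meta=" + key + ":" + str(value)] += 1
    return features


def probability(weights, features):
    z = sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, epochs=300, rate=0.05):
    """Binary logistic regression fitted by stochastic gradient ascent."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in samples:
            error = label - probability(weights, features)
            for name, value in features.items():
                weights[name] += rate * error * value
    return weights


# Toy training data invented for illustration: 1 = lyrics, 0 = prose.
TRAINING = [
    ("the moon sings over the silent river", 1),
    ("my heart sings of spring", 1),
    ("the committee approved the annual report", 0),
    ("he walked to the office and signed the report", 0),
]
samples = [(extract_features(text), label) for text, label in TRAINING]
weights = train(samples)
predictions = [int(probability(weights, f) > 0.5) for f, _ in samples]
print(predictions)
```

The `extra` argument shows one way to add features that do not come from the text itself, such as a topic label or information about the author. Four sentences are far too few to generalize; the example only demonstrates the data flow from text to features to a trained model.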

Text search

To provide quick access to the collected texts, it is necessary to develop a text search method and, beyond that, a text search engine. Such a system finds the documents that contain given keywords, as well as documents similar to a given one.

Required functions:

Indexing of large sets of texts.
Support for the Russian and Chuvash languages («lemmatization» or «stemming»).
Finding similar texts by a keyword query or by a given document.
Ability to work with large amounts of data (up to tens of millions of text documents).
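
Both kinds of search named above can be sketched with an inverted index and TF-IDF vectors compared by cosine similarity. The code below is a minimal in-memory illustration, not the engine being built for the corpus: the class and method names are hypothetical, `str.lower` stands in for lemmatization or stemming, and the smoothed IDF formula is one common choice among several.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


class SearchIndex:
    def __init__(self, normalize=str.lower):
        self.normalize = normalize        # stand-in for a lemmatizer or stemmer
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.term_counts = {}             # document id -> Counter of its terms

    def add(self, doc_id, text):
        terms = Counter(self.normalize(word) for word in text.split())
        self.term_counts[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def search(self, query):
        """Ids of documents that contain every keyword of the query."""
        found = [self.postings.get(self.normalize(w), set()) for w in query.split()]
        return set.intersection(*found) if found else set()

    def _vector(self, doc_id):
        total = len(self.term_counts)
        return {
            term: count * (math.log((1 + total) / (1 + len(self.postings[term]))) + 1)
            for term, count in self.term_counts[doc_id].items()
        }

    def similar(self, doc_id, top=3):
        """Ids of the documents most similar to the given one, best first."""
        target = self._vector(doc_id)
        scored = []
        for other in self.term_counts:
            if other != doc_id:
                score = cosine(target, self._vector(other))
                if score > 0:
                    scored.append((score, other))
        return [other for _, other in sorted(scored, reverse=True)[:top]]


index = SearchIndex()
index.add(1, "River winter snow")
index.add(2, "river spring flowers")
index.add(3, "city market trade")
print(index.search("river snow"), index.similar(1))
```

An in-memory index of this kind does not scale to tens of millions of documents; at that size the same ideas are usually delegated to a dedicated search server with on-disk indexes.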

Determination of the topic of a text or a set of texts

Required functions:

Determination of a text’s topic in terms of a pre-defined set of categories, such as «prose», «lyrics», «romanticism» etc. This requires constructing a topic rubricator (classifier).
Preferably: determination of a text’s topic without a pre-defined set of categories. This requires constructing a topic model.
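
The simplest form of a rubricator with pre-defined categories is a keyword vote, sketched below. The category names come from the list above, while the keyword sets, the function name and the `min_hits` threshold are assumptions made for illustration. A trained classifier would normally replace the hand-written keyword sets.

```python
import re

# Assumed keyword sets per category, invented for illustration.
RUBRICS = {
    "lyrics": {"heart", "song", "moon"},
    "prose": {"chapter", "narrator", "village"},
}


def rubricate(text, rubrics=RUBRICS, min_hits=1):
    """Return the category with the most keyword hits, or None if none match."""
    words = set(re.findall(r"\w+", text.lower()))
    hits = {name: len(words & keywords) for name, keywords in rubrics.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] >= min_hits else None


print(rubricate("The moon heard my song."))
```

A `None` result marks texts that fit no pre-defined category, which is the case a topic model built without preset categories is meant to cover.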

Application examples:

Topical categorization of texts. Identification of authors’ key interests.

Conclusion

As can be seen from the above, the methods for researching the space of texts that are planned for the national corpus of the Chuvash language have been identified and described. These methods are to be implemented in the search engine being created for the national corpus. The main task of the search engine is to enable researchers to collect literary texts in an automated information repository, to study these texts from different analytical angles (examining the levels and objects of the space of literary texts described above by means of the analysis methods described above) and to use the texts and the results of their analysis in scientific papers.
