TEXT ANALYSIS SUBSYSTEM IN A SEARCH ENGINE FOR THE NATIONAL CORPUS OF THE CHUVASH LANGUAGE

Research article

Zheltov Pavel Valerianovich

Zheltov Valerian Pavlovich

Gubanov Aleksei Rafailovich

DOI: https://doi.org/10.18454/RULB.7.36

Issue: No. 3 (7), 2016

Zheltov Pavel Valerianovich

I.N. Ulyanov Chuvash State University

Zheltov Valerian Pavlovich

I.N. Ulyanov Chuvash State University

Gubanov Aleksei Rafailovich

I.N. Ulyanov Chuvash State University

Abstract

The article examines the text analysis subsystem of a search engine. At the current stage, the subsystem consists of the following components: a text tokenization component, a component that separates the text into sentences, and a component for morphological analysis of sentences. Storing the linguistic data requires special data structures in the form of a set of classes, which are described in the article. The text tokenization component converts text into a set of tokens. A configuration file is used to define the tokenization rules.

Keywords:

search engine, text corpus, text markup, query, indexing.

The development of national electronic corpora is currently one of the pressing tasks in computational linguistics. Such a corpus is an electronic library of annotated texts that supports fast search at several language levels: morphemic, morphological, syntactic, and semantic. Text corpora of this kind have already been created for many languages of the Russian Federation (Russian, Tatar, Bashkir, Kalmyk, Mari, Mordvin, Udmurt, Komi, and Khakas). The authors are currently working on the National Corpus of the Chuvash Language. This publication was prepared within scientific project No. 15-04-00532, supported by the Russian Foundation for Humanities (RFH).

National language corpora are served by a large number of software products that process them and execute various user queries aimed at studying the texts and selecting relevant data.

One of the main software products in a national corpus is the search engine. The search engine, in turn, can comprise several modules, one of which is the text analysis subsystem.

Let us consider the text analysis subsystem of the search engine. At this stage, it consists of the following components: 1) a text tokenization component; 2) a component that separates the text into sentences; 3) a component for morphological analysis of sentences.

The following special data structures (classes) are needed to store the linguistic data produced by the search engine components:

Word: a word form together with a list of Analysis objects (its possible analysis results).

Word class variables:

InDict	Flag that indicates that the word form was found in the dictionary
Form	Word Form
Start	Offset of the Token Beginning in the Source Message
Finish	Offset of the Token End in the Source Message
User	User Data
Analyses	List of Objects – Analysis Results

Analysis: the result of morphological analysis of a word. Each word may have several alternative analyses because of the uncertainty and ambiguity present in the text; see [1] for details.

Analysis class variables:

Lemma	Nested Word
Tag	Part-of-speech (PoS) tag
Descriptor	Descriptor
Probability	Probability that a Word form (Word object) has indeed such Lemma/Tag.
User	User Data

Sentence: a class containing the list of words that make up a complete sentence.
Document: a class containing the list of sentences that make up the expert's message.
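
The four classes above can be sketched in Python as follows. This is only an illustration of the described structure, not the authors' implementation: the field types, the default values, and the helper method best_analysis are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Analysis:
    """One possible morphological analysis of a word form."""
    lemma: str
    tag: str                      # part-of-speech tag
    descriptor: str = ""          # assumed to be a plain string
    probability: float = 0.0      # probability that the word has this lemma/tag
    user: Any = None              # user data


@dataclass
class Word:
    """A word form with its offsets and the list of its analyses."""
    form: str
    start: int                    # offset of the token beginning in the source
    finish: int                   # offset of the token end in the source
    in_dict: bool = False         # True if the form was found in the dictionary
    user: Any = None              # user data
    analyses: List[Analysis] = field(default_factory=list)

    def best_analysis(self) -> Optional[Analysis]:
        # Hypothetical helper: the most probable analysis, if any.
        return max(self.analyses, key=lambda a: a.probability, default=None)


@dataclass
class Sentence:
    words: List[Word] = field(default_factory=list)


@dataclass
class Document:
    sentences: List[Sentence] = field(default_factory=list)
```

Keeping the alternatives as a list on Word, rather than choosing one analysis early, lets later components re-rank them.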

The text tokenization component converts text into a set of tokens (words, abbreviations, etc.). The tokenization rules are defined in a configuration file that contains regular expressions and a list of abbreviations.

A tokenization rule consists of two parts: a rule name and a regular expression used to match a token. Examples of tokenization rules:

Rule name	Regular expression	Description
WORD	{[[:alnum:]º°]}+[\+]*	Rule for separating words
TIMES	(([01]?[0-9]|2[0-4]):[0-5][0-9])	Rule for extracting times from the text

Sample abbreviations (each interpreted as a single token):

Abbreviation	Meaning
arithm.	arithmetical
Dr. Sci. in P. M.	Doctor of Physical and Mathematical Sciences
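
A rule-driven tokenizer of this kind might look like the sketch below. It is an illustration under several assumptions, not the component itself. Python's re module has no POSIX classes such as [[:alnum:]], so the WORD pattern is an approximate adaptation. The OTHER fallback rule, the rule order, and the tuple layout of a token are hypothetical.

```python
import re
from typing import List, Tuple

# (rule name, compiled pattern); rules are tried in this order.
RULES = [
    ("TIMES", re.compile(r"(?:[01]?[0-9]|2[0-4]):[0-5][0-9]")),
    # Approximation of [[:alnum:]º°]+[\+]* for Python's re module.
    ("WORD", re.compile(r"(?:[^\W_]|[º°])+\+*")),
    # Hypothetical fallback: any other non-space character is its own token.
    ("OTHER", re.compile(r"\S")),
]

# Abbreviations are matched as single tokens, longest first.
ABBREVIATIONS = sorted(["arithm.", "Dr. Sci. in P. M."], key=len, reverse=True)

Token = Tuple[str, str, int, int]  # (rule name, text, start offset, end offset)


def tokenize(text: str) -> List[Token]:
    tokens: List[Token] = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for abbr in ABBREVIATIONS:
            if text.startswith(abbr, pos):
                tokens.append(("ABBR", abbr, pos, pos + len(abbr)))
                pos += len(abbr)
                break
        else:
            # No abbreviation here: apply the first rule that matches.
            for name, pattern in RULES:
                match = pattern.match(text, pos)
                if match:
                    tokens.append((name, match.group(), match.start(), match.end()))
                    pos = match.end()
                    break
    return tokens
```

The start and end offsets correspond to the Start and Finish fields of the Word class. Because abbreviations are checked only at token boundaries, "arithm." is not matched inside a longer word such as "logarithm.".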

The sentence separation component takes a list of tokens as input and returns a list of sentences.

Sentence separation is controlled by a configuration file that specifies: whether text located between two markers (a start marker and a finish marker) should be split into sentences; the list of start-finish marker pairs (pairs of characters or of character groups, such as "[" and "]", "{" and "}"); the list of possible sentence-starting and sentence-ending symbols (for example, ".", "!", "?"); and, for each ending symbol, whether the following character must be checked for being a capital letter or a sentence-start character.
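
The splitting logic described above can be sketched as follows. The configuration is represented here by four module-level constants whose names and values are assumptions; the real configuration file format is not shown in this article.

```python
from typing import List

# Hypothetical configuration values.
MARKERS = {"[": "]", "{": "}"}        # start marker -> finish marker
SENTENCE_END = {".", "!", "?"}        # symbols that may end a sentence
SPLIT_INSIDE_MARKERS = False          # do not split text between markers
REQUIRE_CAPITAL_AFTER_END = True      # next token must start with a capital


def split_sentences(tokens: List[str]) -> List[List[str]]:
    sentences: List[List[str]] = []
    current: List[str] = []
    open_markers: List[str] = []      # stack of expected finish markers
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok in MARKERS:
            open_markers.append(MARKERS[tok])
        elif open_markers and tok == open_markers[-1]:
            open_markers.pop()
        if tok not in SENTENCE_END:
            continue
        if open_markers and not SPLIT_INSIDE_MARKERS:
            continue                  # inside a marked span: keep going
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if (nxt is None or not REQUIRE_CAPITAL_AFTER_END
                or nxt[0].isupper() or nxt in MARKERS):
            sentences.append(current)
            current = []
    if current:                       # trailing tokens without an end symbol
        sentences.append(current)
    return sentences
```

With these settings, a full stop inside square brackets does not end the sentence, and a full stop followed by a lowercase token is treated as part of the current sentence.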

The dictionary search module looks up a given word form and returns the corresponding lemmas and PoS tags. The dictionary of forms is a plain text file in which each line has the form "form lemma1 part of speech1 | lemma2 part of speech2 | ...".

The abbreviations that denote parts of speech are based on the existing tag set of the National Corpus of the Chuvash Language, partly described in [4].
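
A minimal sketch of reading such a file is given below. It assumes that the form, each lemma, and each tag contain no spaces; the function names and the placeholder entry in the usage note are hypothetical and are not taken from the actual dictionary.

```python
from typing import Dict, Iterable, List, Tuple

Entry = Tuple[str, str]  # (lemma, part-of-speech tag)


def parse_dictionary_line(line: str) -> Tuple[str, List[Entry]]:
    """Parse one line of the form 'form lemma1 tag1 | lemma2 tag2 | ...'."""
    form, _, rest = line.strip().partition(" ")
    analyses: List[Entry] = []
    for chunk in rest.split("|"):
        fields = chunk.split()
        if len(fields) == 2:          # skip empty or malformed chunks
            analyses.append((fields[0], fields[1]))
    return form, analyses


def load_dictionary(lines: Iterable[str]) -> Dict[str, List[Entry]]:
    dictionary: Dict[str, List[Entry]] = {}
    for line in lines:
        if line.strip():
            form, analyses = parse_dictionary_line(line)
            dictionary[form] = analyses
    return dictionary


def lookup(dictionary: Dict[str, List[Entry]], form: str) -> List[Entry]:
    # An empty list means the form is not in the dictionary.
    return dictionary.get(form, [])
```

For a placeholder line such as "form1 lemma1 N | lemma2 V", lookup returns two (lemma, tag) pairs for "form1", which corresponds to a Word object with two Analysis objects.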

The dictionary was built on the basis of a reverse grammatical dictionary, the Reverse Dictionary of Chuvash [5]. Dictionaries of this type are of practical value because they group words by shared endings; for Chuvash this principle is particularly important, since affixes follow the root. The words in a reverse dictionary can then be grouped by morphological characteristics (part of speech, presence or absence of an affix). In particular, analysis of the existing reverse dictionaries has allowed us to describe the variety of nominal affixes in Chuvash and their productivity. The Reverse Dictionary contains arrays of words (more than a thousand in each) that share a particular affix.

Using this dictionary, the dictionary search module determines the a priori probability of each possible analysis of each word form in a sentence (only when a word form has several analyses, of course). If no analysis is found for a word, the module tries to guess its possible PoS tags from the word ending.

The modules that recognize numbers and dates in the text are finite state machines.
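
Guessing tags from the word ending can be sketched as a longest-ending lookup. How the module actually stores endings and derives probabilities is not described here, so the table, its weights, and the tag names below are purely illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical table: ending -> {tag: weight}. The weights are invented
# for illustration and are not taken from the dictionary.
ENDING_TAGS: Dict[str, Dict[str, float]] = {
    "сем": {"N": 9.0, "V": 1.0},
    "рӗ": {"V": 8.0, "N": 2.0},
}


def guess_tags(form: str, ending_tags: Dict[str, Dict[str, float]]
               ) -> List[Tuple[str, float]]:
    """Return (tag, probability) pairs for the longest known ending."""
    # Try the longest ending first; never treat the whole word as an ending.
    for length in range(len(form) - 1, 0, -1):
        ending = form[-length:]
        if ending in ending_tags:
            weights = ending_tags[ending]
            total = sum(weights.values())
            pairs = [(tag, w / total) for tag, w in weights.items()]
            return sorted(pairs, key=lambda p: p[1], reverse=True)
    return []                         # no known ending: nothing to guess
```

Normalizing the weights gives probabilities that can be stored in the Probability field of each guessed analysis.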
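
As an illustration of the finite state machine approach, the sketch below recognizes integers and decimal numbers character by character. The states, the character classes, and the accepted number format are assumptions chosen for the example; the actual machines, including the one for dates, are not specified in this article.

```python
from typing import List, Tuple

# Transition table: state -> {character class: next state}.
TRANSITIONS = {
    "START": {"digit": "INT"},
    "INT": {"digit": "INT", "sep": "SEP"},
    "SEP": {"digit": "FRAC"},
    "FRAC": {"digit": "FRAC"},
}
ACCEPTING = {"INT", "FRAC"}           # states in which a number is complete


def char_class(ch: str) -> str:
    if ch.isdigit():
        return "digit"
    if ch in ".,":
        return "sep"                  # decimal separator
    return "other"


def find_numbers(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, text) for each longest number found."""
    spans: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(text):
        state, j, last_accept = "START", i, None
        while j < len(text):
            nxt = TRANSITIONS[state].get(char_class(text[j]))
            if nxt is None:
                break                 # no transition: the machine stops
            state, j = nxt, j + 1
            if state in ACCEPTING:
                last_accept = j       # remember the longest accepted prefix
        if last_accept is not None:
            spans.append((i, last_accept, text[i:last_accept]))
            i = last_accept
        else:
            i += 1
    return spans
```

Remembering the last accepting position means a trailing separator, as in "12,", is left out of the number.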

Based on the above modules, we developed a separate information system, "Lexical Search Engine", for finding and analyzing sentences of literary works that contain user-specified keywords. The system consists of the following modules:

• User interface control module. The module accepts user queries, passes them to the other modules, and outputs the query results to the user.

• Text indexing and search module. Given the keywords selected by the user, the module finds all relevant sentences in the index of literary texts and shows them to the user, using the structured database (along with each sentence, the user is shown the author and title of the literary work, etc.).

• Text analysis module, used by all other modules. The module performs lexical, morphological, and syntactic analysis of texts.
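
The indexing and search module can be pictured as an inverted index from keywords to sentences that carry their meta-information. The sketch below is an assumption about one possible design, not a description of the actual system; the class names, the whitespace-based word splitting, and the AND-only search are simplifications.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set


@dataclass(frozen=True)
class IndexedSentence:
    text: str
    author: str
    title: str


class SentenceIndex:
    """Inverted index: lower-cased word -> ids of sentences containing it."""

    def __init__(self) -> None:
        self._sentences: List[IndexedSentence] = []
        self._postings: Dict[str, Set[int]] = defaultdict(set)

    def add(self, sentence: IndexedSentence) -> None:
        sentence_id = len(self._sentences)
        self._sentences.append(sentence)
        for word in sentence.text.lower().split():
            word = word.strip('.,!?;:"()')   # crude punctuation removal
            if word:
                self._postings[word].add(sentence_id)

    def search(self, keywords: Iterable[str]) -> List[IndexedSentence]:
        """Return sentences that contain all of the given keywords."""
        ids: Optional[Set[int]] = None
        for keyword in keywords:
            hits = self._postings.get(keyword.lower(), set())
            ids = hits if ids is None else ids & hits
        return [self._sentences[i] for i in sorted(ids or [])]
```

In a fuller version, the words would come from the tokenizer of the text analysis module rather than from whitespace splitting, and the search would also filter by author and title.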

When the system starts, it loads a start form consisting of several areas. The left pane, "Search", contains the text input fields "Author of work" (sentences will be taken only from works by the stated authors), "Title of work" (sentences will be taken only from the stated works), and "Keywords" (sentences containing the keywords will be found). The user can use the logical connectives AND/OR/NOT, nested parentheses, and the special metacharacters * (replaces any number of letters) and ? (replaces one letter). After filling in the required fields, the user clicks the "Find" button in the same pane; this starts the text indexing and search module (and, indirectly, the text analysis module). The sentences relevant to the query are displayed, with additional meta-information, in the upper right area, "Artistic works".

The user can select the desired work (the works can be classified by meta-information fields: author, title, date of publication, etc.) and double-click it. The middle right area, "Artwork", then loads the list of sentences from that work that contain the user-specified keywords, and the textual context of each sentence can be viewed.
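
The two metacharacters can be implemented by translating a keyword pattern into a regular expression, as in the sketch below. The function names are hypothetical, and case-insensitive matching is an assumption; the handling of AND/OR/NOT and parentheses is not covered.

```python
import re

LETTER = r"[^\W\d_]"                  # any single letter, Unicode-aware


def wildcard_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate * (any number of letters) and ? (one letter) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(LETTER + "*")
        elif ch == "?":
            parts.append(LETTER)
        else:
            parts.append(re.escape(ch))   # every other character is literal
    return re.compile("".join(parts), re.IGNORECASE)


def keyword_matches(pattern: str, word: str) -> bool:
    # The whole word must match, not just a part of it.
    return wildcard_to_regex(pattern).fullmatch(word) is not None
```

Under these assumptions, the pattern "кил*" matches both "кил" and "килсем", while "к?л" matches "кил" but not a word with two letters between "к" and "л".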

Conclusions

One of the main problems complicating the work of the search engine is the non-standardized Chuvash orthography, namely the ongoing controversy over whether Chuvash analytical constructions should be written as one word or separately. These include izafet (postpositional attributive group) constructions, from which new concepts are built in Chuvash, as in other Turkic languages.

These features were taken into account when creating the lexicographic basis of the reverse grammatical dictionary: analytical constructions are given in two forms, joined and separate.

In the course of work on the search engine, we identified some problems of grammatical classification that reflect characteristics of the Turkic languages in general and of Chuvash in particular. While developing algorithms for morphological tagging, we found that the boundaries between inflectional classes are blurred not only among nouns but also within other parts of speech. For example, when markers of the category of separation are added to a Chuvash noun, it starts to function as a predicate; and an adjective acting as an actant in a sentence can take nominal markers.

On the whole, Chuvash morphology fits into the overall scheme of categories and forms common to the Turkic languages, and the Chuvash analytical constructions are typically Turkic.

References

1. Zheltov P.V. Linguistic processors, formal models and methods: theory and practice [in Russian] / P.V. Zheltov. – Cheboksary: Chuvash University Press, 2006. – 208 p.
2. Zheltov P.V. Formal methods in comparative and contrastive linguistics [in Russian] / P.V. Zheltov. – Cheboksary: Chuvash University Press, 2006. – 252 p.
3. Zheltov P.V. Linguistic processors in artificial intelligence systems [in Russian] / P.V. Zheltov. – Cheboksary: Chuvash University Press, 2007. – 100 p.
4. Zheltov, Pavel. Morphological markup system for the National Corpora of the Chuvash language / Pavel Zheltov // Proceedings of the International conference “Turkic Languages Processing: TurkLang 2015”. – Kazan: Academy of Sciences of the Republic of Tatarstan Press, 2015. – pp. 328-330.
5. Zheltov, Pavel. Reverse Dictionary of Chuvash / Pavel Zheltov, Eduard Fomin, Jorma Luutonen // Société Finno-Ougrienne. – Helsinki. 2009. – 344 p.