SEMANTIC MARKUP OF THE NATIONAL CORPUS OF THE CHUVASH LANGUAGE

Research article

Zheltov Pavel Valerianovich

DOI:

https://doi.org/10.18454/RULB.7.35

Issue: No. 3 (7), 2016

I.N. Ulyanov Chuvash State University

Abstract

The article describes a system of semantic tags ready for use in the national corpus of the Chuvash language. The approach is based on a semantic classification of the lexicon; it is universal and applicable to any language. The practical benefit of tagging a dictionary and a text corpus lies in improving search quality and extending the capabilities available to users. The tagging and the semantic classification should be oriented towards some programming paradigm. We chose the functional paradigm.

Keywords:

national corpus, semantic tags, semantic classification, thesaurus, lexicon.

In this paper we develop approaches to semantic classification for the National Corpus of the Chuvash language. The publication was prepared within scientific project No. 15-04-00532, supported by the Russian Foundation for Humanities (RFH).

Semantic tagging of a national corpus greatly improves the quality of search and extends the options available to users when they request linguistic information. The semantic information about each lexeme that forms an entry is represented as a set of semantic markups, or tags, and usually reflects the lexeme’s place in the semantic classification of the language’s lexicon.

The problem is to create an axiomatic basis for such a classification, i.e. a minimal set of semantic tags through which all other semantic tags can be defined. No one can devise a priori a classification for a language that is universal and lacks no semantic classes or subgroups.

Usually, when creating a semantic classification for a dictionary or a thesaurus, one divides the lexicon into topics, called semantic classes, and, if needed, creates subgroups within each class. These subgroups are tagged as well, and the resulting system of tags can be applied to a dictionary or a thesaurus. If a lexeme fits no existing semantic subgroup of a class but does fit the parent class, one can either create a new subgroup with a new semantic tag for it, or use the semantic tag of the class that the lexeme fits best.
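
The choice described above (use an existing subgroup, create a new one, or fall back to the tag of the parent class) can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name `choose_tag`, the class and subgroup names, and the `class/subgroup` tag format are hypothetical and are not taken from the corpus itself.

```python
def choose_tag(classification, class_tag, subgroup=None, create_subgroup=False):
    """Return the tag to attach to a lexeme that fits the class `class_tag`.

    `classification` maps a class tag to the set of its subgroup tags.
    """
    subgroups = classification[class_tag]
    if subgroup is not None and subgroup in subgroups:
        # The lexeme fits an existing subgroup.
        return f"{class_tag}/{subgroup}"
    if subgroup is not None and create_subgroup:
        # Option 1: extend the classification with a new subgroup.
        subgroups.add(subgroup)
        return f"{class_tag}/{subgroup}"
    # Option 2: fall back to the tag of the parent class.
    return class_tag


# Hypothetical class with two subgroups.
classification = {"animal": {"bird", "fish"}}

existing = choose_tag(classification, "animal", "bird")
fallback = choose_tag(classification, "animal", "insect")
created = choose_tag(classification, "animal", "insect", create_subgroup=True)
```

Here `fallback` is the plain class tag, while `created` also adds the new subgroup to the classification, so later lexemes of the same kind find it.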

A semantic classification that has an axiomatic basis of semantic classes through which other classes can be defined is called logically extendable, or axiomatic-oriented. Its main logical operation is logical and semantic recursion. One can say that it represents a semantic space.

Otherwise, when a semantic classification is not axiomatic-oriented, it can be called an enumerable set.

Its two main logical operations are inclusion and exclusion.

In our corpus we have chosen an axiomatic-oriented classification for the lexicon of the Chuvash language.

Its axiomatic basis is formed by the following semantic elements:

1) < space >;

2) < time >;

3) < object >;

4) < subject >;

5) < action >;

6) < state >;

7) < notion >;

8) < signal >.

These semantic elements, being axes of a sort, must be measured. That is why measure is the basic element of the axiomatic set: it is hierarchically higher (more basic, so to speak) than the other elements, yet it does not exist by itself and is a quality of the axiomatic elements.

This classification is abstract and universal, as it is oriented towards such philosophical categories as object (material entity) and subject (nonmaterial entity) and towards the physical basics of the material world. Any entity described by the language can also be defined using this axiomatic classification.
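
As a rough illustration of how other tags might be defined through the eight basis elements by recursion, consider the following Python sketch. The eight basis names come from the list above; the derived tags `motion` and `journey`, their definitions, and the function `expand` are hypothetical examples of ours, not part of the corpus tag set.

```python
from enum import Enum


class Basis(Enum):
    """The eight elements of the axiomatic basis listed above."""
    SPACE = "space"
    TIME = "time"
    OBJECT = "object"
    SUBJECT = "subject"
    ACTION = "action"
    STATE = "state"
    NOTION = "notion"
    SIGNAL = "signal"


BASIS_NAMES = {b.value for b in Basis}

# Hypothetical derived tags, each defined through basis or other derived tags.
DEFINITIONS = {
    "motion": ["action", "space"],
    "journey": ["motion", "time"],
}


def expand(tag, definitions=DEFINITIONS, _seen=frozenset()):
    """Recursively reduce a tag to the set of basis elements that define it."""
    if tag in BASIS_NAMES:
        return frozenset({tag})
    if tag in _seen:
        raise ValueError(f"circular definition involving {tag!r}")
    if tag not in definitions:
        raise KeyError(f"tag {tag!r} is not defined")
    result = frozenset()
    for part in definitions[tag]:
        result |= expand(part, definitions, _seen | {tag})
    return result


journey_basis = expand("journey")
```

Under these assumed definitions, `journey_basis` contains `action`, `space` and `time`: the derived tag is fully reducible to the basis, which is the property that makes the classification logically extendable.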

The logical operators used for definitions under this classification are:

‘=’ – ‘is equal’;

‘ $\subset$ ’ – ‘belongs to a set’;

‘→’, ‘->’ – ‘inclusion of an element into a set or its exclusion from the set, depending on the order of the operands’, i.e. a → A means inclusion of the element a into the set A, while A → a means exclusion of the element a from the set A;

‘ $\cap$ ’ – ‘intersection of sets’, the result of this operation is a set that may be empty;

‘ $\cup$ ’ – ‘union (addition) of sets’;

‘:’ – ‘consists of’;

‘+’ – ‘assembling of elements’, the result of this operation is a set;

‘=>’ – ‘logical deduction’;

‘<=’ – ‘logical induction’;

if … then … – ‘logical conclusion’;

‘ $\exists$ ’ – ‘existential quantifier’;

‘ $\forall$ ’ – ‘universal quantifier (for any)’;

‘AND’ – ‘logical AND’;

‘OR’ – ‘logical OR’.
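
Several of these operators have direct counterparts among Python's built-in set operations. The sketch below shows one possible mapping, written without mutation in keeping with the functional paradigm mentioned in the abstract. The function names and the sample sets are our own assumptions for illustration only.

```python
def assemble(*elements):
    """'+' : assemble elements into a set."""
    return frozenset(elements)


def include(a, A):
    """a → A : inclusion of the element a into the set A (returns a new set)."""
    return A | {a}


def exclude(A, a):
    """A → a : exclusion of the element a from the set A (returns a new set)."""
    return A - {a}


def belongs(a, A):
    """'⊂' as glossed above: the element a belongs to the set A."""
    return a in A


def exists(predicate, A):
    """'∃' : at least one element of A satisfies the predicate."""
    return any(predicate(x) for x in A)


def for_all(predicate, A):
    """'∀' : every element of A satisfies the predicate."""
    return all(predicate(x) for x in A)


# Hypothetical sets of lexemes.
spatial = assemble("here", "there")
temporal = assemble("now", "then")

extended = include("near", spatial)   # a → A
reduced = exclude(spatial, "there")   # A → a
common = spatial & temporal           # ∩, may be empty
merged = spatial | temporal           # ∪
```

Because `include` and `exclude` return new frozensets, the original `spatial` set is left unchanged after both calls.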

Another type of classification introduced for the national corpus under development provides a substantive and adequate description of a language’s lexicon and, together with the morphological and syntactic markup, gives the researcher sufficient information about the behavioral patterns of all lexical and semantic classes in the texts of a language.

The tags of this classification divide the lexicon from an encyclopedic point of view and to some extent reflect the human image of the world.

The semantic tags of this classification are as follows:

<Nature> – describes natural objects and phenomena.

<Human> – describes humans.

<Artificial/human world> – describes the artificial/human world and human activities.

<Human perception> – describes human perception.

<Human qualities> – describes human qualities.

<Human emotions> – describes human emotions.

<Human measurements> – describes human measurements.

<Human topology> – describes human topology.

At present there are two main approaches to implementing the semantic classification of a lexicon in a national corpus by computer.

The first is based on a semantic classification made a priori, which a human operator applies to the lexicon.

This approach is used in most text corpora and electronic dictionaries, including thesauri. Although simple in operation, it is tedious in application and takes the human operator months to complete, as one has to go through all the words of a thesaurus step by step and classify them according to the classification used.

Automating this process encounters great difficulties, because: 1) there exists no computer software able to classify a new word ex nihilo, without reference to previously defined axiomatic (basic) words; 2) any a priori classification, when applied to a lexicon, as a rule proves incomplete and becomes complete only when one reaches the end of the lexicon, as the person who creates the classification cannot foresee the whole spectrum of semantic classes in a language.

The second approach to the semantic classification is oriented towards a practical or complete automation of the process.

The advantages of this approach are clear: 1) one does not have to spend months at the computer classifying and tagging the whole lexicon of a dictionary or a thesaurus; everything is done automatically, and the process may be completed in several days on ordinary personal computers of average performance; 2) the classification is complemented automatically during the process, using logical conclusions and previous definitions of basic semantic classes.

The main problem of automation, as pointed out above, is that there exists no computer software able to classify a new word ex nihilo: it is necessary to have an explanatory dictionary that meets the criteria for applying logical conclusions, and such a dictionary can be compiled only by humans. Many minority languages, Chuvash among them, do not have one yet. In fact, creating such a dictionary takes as much time as classifying and tagging a lexicon, or even more. The solution is to use a bilingual dictionary together with an explanatory dictionary of its second language, for example a Chuvash-Russian dictionary and Ozhegov’s explanatory dictionary of the Russian language, whose entries meet the criteria for applying logical conclusions.

When we have a representative text corpus that includes a huge number of texts, we can, however, automate the process of creating an explanatory dictionary as well.

Since creating software for automating semantic classification is very complicated, we have chosen a compromise strategy that implements this approach only partly but benefits from it as much as possible.

Our strategy is based on creating a basic semantic classification and applying it to a basic list of words and roots/stems (which in a language number no more than 500-1000).

The result is a minimal semantic dictionary. This strategy can be applied to the bilingual solution to the automation problem described above for semantic tagging.

In this case one must create a basic semantic dictionary for the Russian language, which can then be put into the scheme Chuvash-Russian dictionary ↔ basic Russian semantic dictionary ↔ Russian explanatory dictionary. The output is a kind of Chuvash explanatory dictionary with Chuvash entries but Russian explanatory articles, an expanded semantic classification, and a complemented Chuvash semantic dictionary.
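
The three-dictionary scheme can be illustrated with a toy Python sketch. Everything in it is an assumption made for the example: the function `tag_via_russian`, the tiny dictionaries, and in particular the "explanatory" entries, which are simplified lemma lists invented here and are not quotations from Ozhegov's or any other dictionary.

```python
# Toy Chuvash-Russian dictionary (two entries for illustration).
CHV_RU = {"шыв": "вода", "вăрман": "лес"}

# Toy basic Russian semantic dictionary: Russian word -> semantic tags.
RU_SEMANTIC = {"вода": {"object"}, "дерево": {"object"}, "место": {"space"}}

# Toy Russian explanatory dictionary: word -> lemmas of its definition.
# These lemma lists are invented for the example.
RU_EXPLANATORY = {"лес": ["место", "дерево"]}


def tag_via_russian(chv_word, chv_ru, ru_semantic, ru_explanatory):
    """Derive semantic tags for a Chuvash word through its Russian equivalent."""
    ru_word = chv_ru.get(chv_word)
    if ru_word is None:
        return None  # no translation, nothing can be inferred
    if ru_word in ru_semantic:
        # The Russian word is itself a basic word with known tags.
        tags = set(ru_semantic[ru_word])
    else:
        # Otherwise collect the tags of basic words found in its definition.
        tags = set()
        for lemma in ru_explanatory.get(ru_word, []):
            tags |= ru_semantic.get(lemma, set())
    return {
        "russian": ru_word,
        "definition": ru_explanatory.get(ru_word),
        "tags": tags,
    }


water = tag_via_russian("шыв", CHV_RU, RU_SEMANTIC, RU_EXPLANATORY)
forest = tag_via_russian("вăрман", CHV_RU, RU_SEMANTIC, RU_EXPLANATORY)
```

Each returned record pairs a Chuvash entry with a Russian explanatory article and a set of tags, which corresponds to the first and third outputs named above. A real implementation would also need lemmatization of the definitions and some way to resolve ambiguous translations, both of which are omitted here.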

The other possibility is to create a complemented Chuvash semantic dictionary by analyzing the word list of a bigger Chuvash dictionary or thesaurus: each word is checked for direct equality or a derivational relation to the words of the basic semantic dictionary of the Chuvash language and, if the check is positive, is tagged with the same semantic tags as the word it is derived from.
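
A minimal sketch of this tag propagation follows. It assumes that a derivational relation can be approximated by a prefix match against a basic stem, which is a crude stand-in for real morphological analysis. The function name `propagate_tags` and the sample words and tags are illustrative only.

```python
def propagate_tags(words, basic_dictionary):
    """Copy tags from basic stems to words that equal or start with them.

    `basic_dictionary` maps a basic word or stem to its set of semantic tags.
    Longer stems are tried first so that the most specific stem wins.
    """
    stems = sorted(basic_dictionary, key=len, reverse=True)
    tagged = {}
    for word in words:
        for stem in stems:
            # Direct equality is the special case of a full-length prefix.
            if word.startswith(stem):
                tagged[word] = set(basic_dictionary[stem])
                break
    return tagged


# Illustrative basic dictionary with a single stem.
basic = {"шыв": {"object"}}
result = propagate_tags(["шыв", "шывлă", "кил"], basic)
```

Words with no matching stem, such as the third item here, stay untagged and would have to be classified by other means.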

Conclusions

The strategy of semantic tagging of the Chuvash National Corpus presented in the article is an optimal one, as it accommodates both a thesaurus-oriented non-axiomatic classification and an axiomatic logic-oriented one. The latter opens space for promising research in the field of artificial intelligence.

The result of semantic tagging of a language’s national corpus can differ in the predicate part of the lexicon. The predicate part of the lexicon forms ontologies that reflect links between notions and show the people’s image of the world.
