OPTIONS FOR OPTIMIZING THE MORPHEMIC ANALYSIS OF WORD FORMS BASED ON STATISTICAL DATA

Research article

Fadeev Sergey Georgievich

Zheltov Pavel Valerianovich

DOI: https://doi.org/10.18454/RULB.7.33

Issue: No. 3 (7), 2016

Fadeev Sergey Georgievich

Chuvash State University named after I. N. Ulyanov

Zheltov Pavel Valerianovich

Chuvash State University named after I. N. Ulyanov

Abstract

The article proposes an approach to optimizing the morphemic analysis of word forms. A word form is represented as a sequence of three morpheme groups: the prefix group, the stem group, and the postfix group. Each of these groups has its own features, which are taken into account during morphemic analysis. The statistical peculiarities of natural languages that make it possible to optimize the morphemic analysis of the prefix and postfix groups are listed. The problem of homonyms, which hinders the introduction of statistics-based optimization into morphemic analysis, is considered. To solve this problem, a parameter called “the depth of the morphemic analysis” is introduced, which makes it possible to find a compromise and to control the speed and accuracy of morphemic analysis based on statistical information.

Keywords:

morphemic analysis, word form, morph, optimization.

Morphemic analysis of word forms

Word forms are composed of two types of morphs: roots and affixes. Some affixes occur only before roots, some only after roots, and some within roots. Within this framework, a word form can be represented as a sequence of three groups of morphs:

The prefix group consists of a sequence of affixes that occur only before roots (prefixes, the left parts of confixes).
The postfix group consists of a sequence of affixes that occur only after roots (suffixes, the right parts of confixes).
The stem group consists of a sequence of stems and, possibly, affixes that can be interspersed with stems (infixes, interfixes, etc.).

Let the set of morphs of the prefix group be denoted by $\Delta = \left\{ \delta_{1}, \delta_{2}, \ldots, \delta_{N} \right\}$, the set of morphs of the stem group by $\Psi = \left\{ \psi_{1}, \psi_{2}, \ldots, \psi_{M} \right\}$, and the set of morphs of the postfix group by $\Omega = \left\{ \omega_{1}, \omega_{2}, \ldots, \omega_{K} \right\}$, where $N, M, K$ are, respectively, the numbers of morphs in the prefix group, the stem group, and the postfix group for a given natural language. Each natural language has its own sets $\Delta, \Psi, \Omega$ and its own values of $N, M, K$, determined by its morphology.

Let us consider the analysis using the prefix group as an example. A word may contain more than one prefix morph, so even after the first successful check one should move to the next step, proceed to analyze the rest of the word form, and repeat these steps until no further matches are detected. In general, each morph of the set $\Delta$ can occur at the beginning, in the middle, or at the end of the prefix group. Therefore, every step must check the word form for each morph of the set $\Delta$.

The disadvantage of this approach is that all elements of the set $\Delta$ must be examined sequentially at each step. While the first step requires $N$ verifications, each subsequent step requires $k \cdot N$ verifications, where $k$ is the number of analysis paths found at the previous step.
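The exhaustive procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `naive_prefix_analyses`, the toy prefix set, and the sample word are all hypothetical. The counter makes the cost visible: every analysis path kept from the previous step triggers another full pass over the set.

```python
def naive_prefix_analyses(word, prefixes):
    """Enumerate every way to split a sequence of prefix morphs off `word`.

    Returns (paths, checks): `paths` is a list of (morph_sequence, remainder)
    pairs, `checks` is the total number of morph verifications performed.
    All names here are illustrative assumptions.
    """
    checks = 0
    paths = []
    frontier = [((), word)]              # analysis paths found at the previous step
    while frontier:
        next_frontier = []
        for morphs, rest in frontier:
            for prefix in prefixes:      # full scan of the set for every path
                checks += 1
                # require a non-empty remainder so that something is left for the stem
                if rest.startswith(prefix) and len(rest) > len(prefix):
                    next_frontier.append((morphs + (prefix,), rest[len(prefix):]))
        paths.extend(next_frontier)
        frontier = next_frontier
    return paths, checks


paths, checks = naive_prefix_analyses("unredo", ["de", "in", "re", "un"])
print(paths)
print(checks)
```

With four prefixes and one surviving path per step, the toy word costs three full passes over the set, including the final pass that finds nothing.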

The postfix group is analyzed in a similar way. The only differences are that it is more convenient to start the analysis from the end of the word form and that the analysis uses morphs from the set $\Omega$.

Repeatedly checking every element of the set slows down the analysis of text.

Optimization of the analysis

Natural languages have several peculiarities that can help accelerate the analysis.

The 1st peculiarity. The probabilities $P()$ of encountering different morphs in a group differ from one another.

The 2nd peculiarity. The probability $P()$ of encountering a morph in a word form depends on its position in the group.

The 3rd peculiarity. The probabilities $P()$ of encountering different combinations of morphs differ from one another.

The 4th peculiarity. The probabilities $P()$ of encountering different combinations of morphs depend on their position in the group.

Taking these statistical features into account, it is possible to accelerate the analysis. However, this requires collecting statistics on the specific natural language, so that the morphs and combinations that are most common at the current point of the analysis can be verified first.
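The statistics corresponding to the four peculiarities could be gathered roughly as in the sketch below. It assumes a corpus whose prefix (or postfix) groups have already been segmented into morph sequences; that input format, the function name `collect_statistics`, and the restriction of "combinations" to adjacent pairs are assumptions made for illustration.

```python
from collections import Counter


def collect_statistics(segmented_groups):
    """Count morphs and adjacent morph pairs, overall and per position.

    `segmented_groups` is an iterable of morph tuples, one per word form
    (a hypothetical, pre-segmented input).
    """
    morph_counts = Counter()            # 1st peculiarity: morph frequency
    positioned_morphs = Counter()       # 2nd: (step, morph) frequency
    pair_counts = Counter()             # 3rd: adjacent combination frequency
    positioned_pairs = Counter()        # 4th: (step, combination) frequency
    for group in segmented_groups:
        for step, morph in enumerate(group):
            morph_counts[morph] += 1
            positioned_morphs[(step, morph)] += 1
        for step in range(len(group) - 1):
            pair = (group[step], group[step + 1])
            pair_counts[pair] += 1
            positioned_pairs[(step, pair)] += 1
    return morph_counts, positioned_morphs, pair_counts, positioned_pairs


groups = [("un", "re"), ("re",), ("un",), ("re",)]
morphs, by_step, pairs, pairs_by_step = collect_statistics(groups)
print(morphs.most_common())
```

Dividing each count by the relevant total gives an estimate of the corresponding probability $P()$. For ordering purposes the raw counts are sufficient.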

Possible optimization solutions based on statistics

Solution for the 1st peculiarity. Instead of an unordered set of morphs, a one-dimensional ordered array can be used, in which a morph with a higher probability of occurrence has a lower index than a morph with a lower probability. Morphs are checked for presence in the word form in ascending order of index, so the morphs with the greatest probability are checked first.
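A minimal sketch of this ordering is given below. The counts, the function names `order_by_frequency` and `first_match`, and the sample word are illustrative assumptions. The search stops at the first match, which is where the saving comes from.

```python
def order_by_frequency(counts):
    """Return morphs sorted by descending count (ties broken alphabetically)."""
    return sorted(counts, key=lambda morph: (-counts[morph], morph))


def first_match(rest, ordered_morphs):
    """Return (morph, number_of_checks) for the first morph that starts `rest`."""
    for checks, morph in enumerate(ordered_morphs, start=1):
        if rest.startswith(morph):
            return morph, checks
    return None, len(ordered_morphs)


counts = {"de": 5, "in": 20, "re": 40, "un": 35}   # hypothetical corpus counts
ordered = order_by_frequency(counts)
print(ordered)
print(first_match("redo", ordered))
print(first_match("redo", sorted(counts)))          # alphabetical order for comparison
```

In this toy case the frequent prefix is found on the first check with the ordered array and on the third check with the alphabetical one.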

Solution for the 2nd peculiarity. A two-dimensional ordered array can be used. The 1st row of the array contains the morphs in descending order of their probability of occurrence at the 1st step, the 2nd row contains the morphs in descending order of their probability of occurrence at the 2nd step, and so on. Each step of the analysis has its own row of the array. As a result, at each step the morphs most probable at that step are checked first.
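The row-per-step layout might look like the following sketch. The positional counts, the helper names `build_rows` and `greedy_prefix_split`, and the decision to take only the first match at each step are assumptions for illustration. In particular, stopping at the first match ignores homonymy, which the article discusses separately.

```python
def build_rows(morphs, positioned_counts, steps):
    """Build one ordered row per analysis step.

    `positioned_counts` maps (step, morph) to a count; morphs never seen
    at a step keep a count of 0 and end up at the tail of that row.
    """
    rows = []
    for step in range(steps):
        row = sorted(
            morphs,
            key=lambda morph: (-positioned_counts.get((step, morph), 0), morph),
        )
        rows.append(row)
    return rows


def greedy_prefix_split(word, rows):
    """Strip one prefix per step, taking the first match in that step's row."""
    found, rest = [], word
    for row in rows:
        match = next(
            (m for m in row if rest.startswith(m) and len(rest) > len(m)), None
        )
        if match is None:
            break
        found.append(match)
        rest = rest[len(match):]
    return found, rest


# hypothetical positional statistics: "un" dominates step 0, "re" dominates step 1
positioned = {(0, "un"): 8, (0, "re"): 2, (1, "re"): 9, (1, "un"): 1}
rows = build_rows(["re", "un"], positioned, steps=2)
print(rows)
print(greedy_prefix_split("unredo", rows))
```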

Solution for the 3rd peculiarity. The one-dimensional ordered array used as the solution for the 1st peculiarity can be supplemented with frequently occurring combinations of morphs. The higher the probability of an element, the lower its index in the array. Verification is performed in ascending order of index, so the morphs and combinations with the highest probability are checked first.
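One way to picture the supplemented array is to store every entry as a tuple of morphs, so that single morphs and combinations share one ordering. The sketch below assumes hypothetical counts and helper names (`build_entries`, `first_entry_match`). A combination is matched as the concatenation of its morphs.

```python
def build_entries(morph_counts, combination_counts):
    """Merge single morphs and morph combinations into one ordered array.

    Each entry is a tuple of morphs; entries are sorted by descending count.
    """
    entries = {(morph,): count for morph, count in morph_counts.items()}
    entries.update(combination_counts)      # keys are already tuples of morphs
    return sorted(entries, key=lambda entry: (-entries[entry], entry))


def first_entry_match(rest, ordered_entries):
    """Return the first entry whose concatenated morphs start `rest`."""
    for entry in ordered_entries:
        if rest.startswith("".join(entry)):
            return entry
    return None


morph_counts = {"un": 30, "re": 40, "de": 5}          # hypothetical counts
combination_counts = {("un", "re"): 35}               # hypothetical counts
ordered = build_entries(morph_counts, combination_counts)
print(ordered)
print(first_entry_match("unredo", ordered))
```

Here a single successful check on the combination recognizes two morphs at once, which would otherwise take two separate steps.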

Solution for the 4th peculiarity. The two-dimensional ordered array used as the solution for the 2nd peculiarity can be supplemented with frequently occurring combinations of morphs. The morphs and combinations in each row are sorted in descending order of their probability of occurrence, so the morphs and combinations with the highest probability are checked first.

With the corresponding changes to the analysis algorithms, these solutions make it possible to reach the final result much faster. Once a result is obtained, the remaining verifications can be discarded, thereby reducing the analysis time.

These solutions are practical for the prefix and postfix groups, since the number of elements in them is relatively small and the arrays will not be too cumbersome.

The problem of analysis based on statistics

The main problem is the existence of homonyms. Homonyms are identical in spelling but different in meaning, so the analysis of a homonym should yield several results instead of one. Therefore the analysis should not terminate at the first result found. It should continue, since there may be another analysis option, and possibly more than one.

The requirement to continue the analysis after the first option is found implies a complete search over all morphs and their combinations. If so, there is no point in ordering them: they all have to be examined in any case, so the time spent on analysis is the same either way.

To resolve this, we consider three situations:

1. The element $\delta_{i}$ has already been encountered at this position while statistics for the language were being collected $(P(\delta_{i}) > 0)$.
2. The element $\delta_{i}$ cannot occur at this position, since the morphology of the language does not allow it at the current location (for example, verb endings cannot occur at a position far from the end of the postfix group).
3. All other situations not covered by items 1 and 2. Here $P(\delta_{i}) = 0$.

In the 1st situation the search through the elements should continue, since a situation common for the given language has occurred.

In the 2nd situation the search through the elements should stop and the analysis should shift to the next step.

In the 3rd situation it is not clear whether the search should continue or the shift should be made. If it continues, another question arises: how many elements of the set with $P(\delta_{i}) = 0$ should be examined before stopping and moving to the next step?

To implement a flexible solution that supports both stopping the search and continuing it, an additional numerical parameter $D$ is introduced, which determines the actions in the 3rd situation. Let us call it “the depth of the morphemic analysis”. It determines how many elements of the ordered set are processed once the elements with $P(\delta_{i}) = 0$ are reached. When $D = 0$ such elements are not processed (with the exception of cases described below); when $D = 1$, one element is processed, after which the analysis shifts to the next step. Accordingly, when $D = 2$, two elements are processed, and so on.
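The effect of $D$ on a single step can be sketched as follows. The function name `matches_at_step`, the probability table, and the returned resume index are assumptions for illustration. Elements excluded by the morphology (the 2nd situation) are assumed to have been removed from the array beforehand.

```python
def matches_at_step(rest, ordered, prob, depth):
    """Scan one ordered row, processing at most `depth` elements with P = 0.

    Returns (matches, resume_index). `resume_index` is the position at which
    the scan was interrupted, or None if the whole row was processed.
    """
    matches = []
    zero_processed = 0
    for index, morph in enumerate(ordered):
        if prob.get(morph, 0.0) == 0.0:
            if zero_processed == depth:      # depth budget exhausted
                return matches, index
            zero_processed += 1
        if rest.startswith(morph):
            matches.append(morph)
    return matches, None


ordered = ["re", "un", "de", "in"]           # elements with P = 0 sit at the tail
prob = {"re": 0.6, "un": 0.4}                # hypothetical statistics
for depth in (0, 1, 2):
    print(depth, matches_at_step("indo", ordered, prob, depth))
```

In this toy run the prefix that was never seen in the statistics is found only when $D = 2$. With smaller values the scan is interrupted before reaching it.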

It should be noted that even when $D = 0$ there may be cases where processing elements with $P(\delta_{i}) = 0$ is still necessary, for example if the analysis is over but no final alternative has been found. In such cases it is necessary to go back to the step at which the analysis was interrupted and continue from the point of interruption. To make this possible, all the states in which the analysis was interrupted must be remembered.
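One possible way to remember the interrupted states and return to them is sketched below. It is an assumption-laden illustration rather than the authors' algorithm: the function name `analyse`, the use of a stem dictionary as the test for a final alternative, and the choice to scan the rest of a row exhaustively after resuming are all hypothetical.

```python
def analyse(word, ordered, prob, stems, depth=0):
    """Prefix analysis with depth-limited scans and resumable interruptions.

    A state is (morphs, rest, start_index, limited). Interrupted states are
    remembered and revisited only if no result has been found otherwise.
    """
    results = []
    interrupted = []                          # remembered interruption points
    pending = [((), word, 0, True)]
    while pending:
        morphs, rest, start, limited = pending.pop()
        if start == 0 and rest in stems:      # a final alternative was found
            results.append((morphs, rest))
        zero_processed = 0
        for index in range(start, len(ordered)):
            morph = ordered[index]
            if limited and prob.get(morph, 0.0) == 0.0:
                if zero_processed == depth:
                    # remember where the scan stopped; resume later without a limit
                    interrupted.append((morphs, rest, index, False))
                    break
                zero_processed += 1
            if rest.startswith(morph) and len(rest) > len(morph):
                pending.append((morphs + (morph,), rest[len(morph):], 0, True))
        if not pending and not results and interrupted:
            pending, interrupted = interrupted, []   # go back to interrupted states
    return results


ordered = ["re", "un", "de", "in"]
prob = {"re": 0.6, "un": 0.4}                 # hypothetical statistics
stems = {"do", "put"}                         # hypothetical stem dictionary
print(analyse("redo", ordered, prob, stems))
print(analyse("input", ordered, prob, stems))
```

For the first word the result is found without touching any element with $P(\delta_{i}) = 0$. For the second word the first pass finds nothing, so the remembered state is resumed and the unseen prefix is then recognized.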

It is more sensible to define the parameter $D$ outside the decomposition algorithm, so that the user can choose the behavior of the analysis. During the initial set-up of the model a larger value of $D$ is more appropriate, since statistics have not yet been collected and almost all elements have $P(\delta_{i}) = 0$. As the work continues, the value of $D$ can be gradually reduced in search of a compromise between the performance and the accuracy of the analysis.

Conclusion

The proposed optimization options can reduce the time spent on the morphemic analysis of word forms. This requires a preliminary collection of statistics from a language corpus. Dictionaries of the suffixes and roots of the language are also a necessary condition.
