Research article
Issue: № 3 (7), 2016


Speech is known to be a source of information about an individual and their mental state, and thus a valuable diagnostic tool. One of the crucial tasks facing modern psychodiagnostics is predicting suicidal tendencies, and one promising approach is to analyze the speech of individuals who committed suicide. However, compiling such text corpora is a daunting scientific task. The article describes the text corpora employed in studies of the speech of individuals who committed suicide (mostly in English), introduces the first Russian corpus, RusSuiCorpus, and outlines the prospects for further studies.

Introduction. According to the World Health Organization, over 800,000 people die by suicide annually, i.e. there is a suicide every 40 seconds, and only 30% of these individuals declared their intention of committing one [5]. There is therefore a pressing need to develop methods for predicting suicidal tendencies and preventing possible suicide attempts. One of the most valuable diagnostic tools for monitoring an individual’s mental condition, including suicidal tendencies, is the analysis of speech production, including its formal grammatical level, which cannot be consciously controlled.

1. Personality profiling based on texts. Personality profiling using texts has been performed for decades. Interest in the problem has recently grown worldwide because of the increasing volume of Internet communication and the need for methods that reconstruct an author’s profile (gender, age, education level, native language, psychological traits, etc.) through quantitative analysis of anonymous and pseudonymous texts [2, p. 12].

Psychologists and linguists have traditionally taken the lead in personality profiling using texts. In the 1990s, however, mathematicians and information technology specialists joined the effort, applying mathematical statistics, computational linguistics and natural language processing (NLP) in particular to process large volumes of textual data quickly. Based on the correlations identified between the numerical values of quantifiable linguistic text parameters and authors’ characteristics, mathematical models are developed for automatic personality profiling using texts. Formal grammatical text parameters that authors cannot control and thus cannot consciously distort (function words, POS bigrams and trigrams, etc.) are of particular importance [2]. Note that most research has dealt with English texts.
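The formal parameters mentioned above can be illustrated with a minimal Python sketch. It assumes that the text has already been tokenized and POS-tagged by an external tool; the function-word list, the tag names and the function names are hypothetical and serve only as an illustration, not as the method used in the cited studies.

```python
from collections import Counter

# Hypothetical, deliberately tiny function-word list; a real study would use a
# full, language-specific inventory.
FUNCTION_WORDS = {"the", "a", "of", "and", "to", "in", "i", "it", "that", "not"}


def function_word_ratio(tokens):
    """Share of tokens that are function words (0.0 for an empty text)."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.lower() in FUNCTION_WORDS)
    return hits / len(tokens)


def pos_ngram_frequencies(pos_tags, n=2):
    """Relative frequency of each POS n-gram in a sequence of POS tags."""
    ngrams = [tuple(pos_tags[i:i + n]) for i in range(len(pos_tags) - n + 1)]
    if not ngrams:
        return {}
    counts = Counter(ngrams)
    return {gram: count / len(ngrams) for gram, count in counts.items()}


# The tagged input is assumed to come from an external POS tagger.
tagged = [("I", "PRON"), ("do", "AUX"), ("not", "PART"),
          ("know", "VERB"), ("the", "DET"), ("answer", "NOUN")]
tokens = [token for token, _ in tagged]
tags = [tag for _, tag in tagged]

print(function_word_ratio(tokens))
print(pos_ngram_frequencies(tags, n=2))
print(pos_ngram_frequencies(tags, n=3))
```

Relative rather than absolute frequencies are used so that texts of different lengths remain comparable.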

There have also been attempts to identify mental disorders (depression, schizophrenia, bipolar disorder, etc.) in the authors of written texts. As the literature review suggests, content analysis of speech production does not provide complete information about the psychological status of the individual. For instance, Baddeley [6], who analyzed emails, showed that depressed individuals used more words describing positive emotions than the control group did, probably trying to mask their real feelings in this way. To identify a psychological state from texts, levels other than vocabulary, which is easy to imitate, therefore need to be analyzed.

Identifying psychological characteristics, including suicidal tendencies, thus calls for a comprehensive psycholinguistic analysis of individual speech production that draws on neuropsychological data, the neuropsychology of individual differences and the neurobiology of suicidal behaviour, and that relies on modern methods of automatic language processing and corpus technologies [4]. Such an analysis would study written texts of different genres produced by suicidal individuals at different points in their lives in order to identify linguistic cues of suicidal behaviour, i.e. changes at different levels in the parameters of texts, regarded as cognitive tools, as suicidal tendencies progress. Comparing these texts with the speech production of a control group of individuals who have almost identical education levels and other characteristics but did not commit suicide would make it possible to develop diagnostic tools that predict suicidal tendencies from the quantitative parameters of texts.

2. Text corpora in studies of speech of suicidal individuals. Researchers studying the features of texts by suicidal individuals have mainly analyzed suicide notes. Corpora of such notes exist [18], and both their formal grammatical characteristics (average sentence length, proportions of different parts of speech, etc.) and their content (proportions of words describing positive and negative emotions, time, place, etc.) have been analyzed [13; 17; 19]. Mathematical models are also being designed to distinguish between genuine and fake suicide notes using quantitative text parameters [9].

Despite the pressing need to investigate suicide notes, they are short and therefore do not make it possible to examine all the features of the speech production of suicidal individuals. Researchers have therefore become aware of the importance of analyzing texts of different genres written by individuals who committed suicide, comparing them with texts by those who did not (taking genre, demographic characteristics, etc. into account), and tracing their dynamics in order to identify changes in idiostyle as the suicide approaches. There are few such studies because of the difficulties of compiling research text corpora.

Research commonly uses texts by famous individuals such as writers, poets and musicians, and it is their literary texts rather than their natural written speech that are investigated.

The literature we reviewed makes the point that qualitative methods of analyzing suicide texts, employed by psychologists and psychiatrists in particular, should be combined with quantitative, software-based methods. For instance, the study [21] used the LIWC software, which computes the proportions of different parts of speech, certain vocabulary groups, etc. in a text [15], and found that poetic texts written by suicidal individuals at different periods used the pronoun “I” more frequently than texts by the control group. Over time, suicidal individuals used fewer “we” pronouns and fewer interaction verbs (e.g., talk, share, listen) but, contrary to common belief, not more words describing negative emotions (there were no statistically significant differences between the suicidal individuals and the control group in this parameter). The authors argue that the results are consistent with a theory of suicide genesis that connects suicidal behaviour with growing alienation from other people.
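The kind of dictionary-based counting described above can be sketched as follows. The sketch is not the LIWC software: the category word lists, the tokenizer and the function names are hypothetical stand-ins chosen for illustration, and the real LIWC dictionaries are much larger.

```python
import re

# Hypothetical mini-dictionary; these word lists are illustrative and are NOT
# the actual LIWC categories.
CATEGORIES = {
    "i": {"i", "me", "my", "mine", "myself"},
    "we": {"we", "us", "our", "ours", "ourselves"},
    "interaction": {"talk", "share", "listen"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def category_proportions(text, categories=CATEGORIES):
    """Proportion of tokens in the text that fall into each category."""
    tokens = tokenize(text)
    total = len(tokens)
    proportions = {}
    for name, words in categories.items():
        hits = sum(1 for token in tokens if token in words)
        proportions[name] = hits / total if total else 0.0
    return proportions


print(category_proportions("I talk to myself when we are alone"))
```

Computing these proportions separately for texts from different periods gives the time series on which comparisons of the kind reported in [21] rely.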

This study, which involved about 300 poetic texts by 18 American and Russian suicidal and non-suicidal poets, attracted wide attention in the foreign media even though its methodology is controversial. In particular, the Russian texts were analyzed in English translation rather than in the original. Besides, poetic texts are usually carefully edited, with the work on them sometimes taking years or even decades, which undermines their value for research.

Original Russian poetic texts by six Russian poets (three suicidal and three non-suicidal) were examined in a dedicated study by Ch. Davidson [7]. The parameters analyzed (nothing is said about how the texts were annotated) were those that, according to S. W. Stirman and J. W. Pennebaker [21], differentiate texts by suicidal individuals from those by the control group. The texts by suicidal individuals were found to contain fewer words describing human interactions. However, the proportion of “I” pronouns (including their oblique forms) in the texts by suicidal individuals increased over time instead of remaining high, and the opposite held for the control group. These differences from the results of S. W. Stirman and J. W. Pennebaker suggest that original rather than translated texts should be analyzed in this context. In addition, unlike S. W. Stirman and J. W. Pennebaker, the author analyzed the number of negations (not, no) and established that over time their proportion increases in the texts by suicidal individuals, as opposed to those by the control group. He therefore suggests that results obtained on texts by authors of different nationalities should be compared.
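A change in the proportion of negations over time, as described above, can be quantified by fitting a straight line to the per-text proportions. The sketch below is an assumption about how this could be done and is not taken from [7]: the toy texts, the time scale and the function names are hypothetical, and an ordinary least-squares slope is only one possible trend measure.

```python
def negation_share(tokens, negations=("not", "no")):
    """Proportion of tokens that are negation words."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.lower() in negations)
    return hits / len(tokens)


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with >= 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical toy data: (period of writing, tokenized text).
dated_texts = [
    (1, "i do not know what the new day will bring".split()),
    (2, "no rain today and we did not take the umbrella".split()),
    (3, "no bus no train and i could not get home".split()),
]

periods = [period for period, _ in dated_texts]
shares = [negation_share(tokens) for _, tokens in dated_texts]

# A positive slope means the share of negations grows over time.
print(shares, ols_slope(periods, shares))
```

For Russian texts the negation list would have to be replaced with the corresponding Russian forms.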

In [11], texts by individuals who committed suicide were analyzed with the LIWC and Coh-Metrix software and compared with those of the control group. Suicidal individuals were found to use more abstract words, fewer words overall, more verbs and fewer words relating to “Death”. However, this research has a limitation: song lyrics, the material analyzed, are a product of collective work and are not always relevant for studies of individual idiostyles.

The paper by M. Mulholland and J. Quinn [14], also based on song lyrics, aimed to develop a mathematical model that classifies texts as belonging to suicidal individuals or to the control group, and it used many more text parameters (type-token ratio (TTR), proportions of some parts of speech, some semantic groups, n-grams). The texts were annotated with modern automatic language processing tools, and a classifier with an accuracy of 70.6% was built using machine learning. The results show the importance of suicide risk evaluation based on quantitative text analysis by means of NLP and mathematical statistics. As the authors rightly note, improving the accuracy of the model requires expanding both the text corpora and the set of parameters analyzed.
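Two of the parameters listed above, the type-token ratio and word n-grams, are simple enough to sketch directly. The code below is a minimal illustration under the assumption of whitespace tokenization; the function names are hypothetical and the sketch does not reproduce the feature extraction of [14].

```python
from collections import Counter


def type_token_ratio(tokens):
    """Number of distinct word forms divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def word_ngrams(tokens, n):
    """Counter of word n-grams (as tuples) found in the token sequence."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


tokens = "the night is long and the night is cold".split()

print(type_token_ratio(tokens))
print(word_ngrams(tokens, 2).most_common(2))
```

Note that TTR depends on text length (longer texts tend to have lower values), so it is best compared across texts of similar size or computed over fixed-size windows.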

Research corpora should be expanded with non-edited (unlike literary) texts that are samples of the natural written speech of individuals who committed suicide.

Diaries, letters, Internet communication and interviews of individuals who ended up committing suicide have been analyzed recently (e.g., see [16]; for a review of similar research see [10]).

Note that the above studies did not aim to design methods for predicting suicidal tendencies based on quantitative analysis of speech production; they merely identified statistically significant differences between texts by suicidal and non-suicidal individuals. The major tool for analyzing texts was the LIWC software [8]. Only English texts were analyzed. In addition, such studies investigate individual texts and do not employ corpus linguistics methods, which calls into question how universal their conclusions are.

3. Research methods. As the literature review suggests, most studies that identify typical features of the speech of suicidal individuals using statistical methods and automatic text processing tools have been conducted on English texts. Other languages, Russian in particular, need to be explored as well. To investigate the features of texts by Russian suicidal individuals, the following should be done first:

  • designing corpora of texts by individuals who committed suicide. There are presently no such corpora of Russian texts;
  • developing the principles for selecting the text parameters to study. Note that in doing so researchers abroad prioritize parameters whose numerical values can be extracted automatically with available software and, in some cases, existing psychological theories accounting for suicidal behaviour, while the data on the neurobiological mechanisms of suicidal behaviour obtained by scientists at home and abroad are neglected [1; 8];
  • choosing available software, or developing new software, to analyze research text corpora;
  • formulating the mathematical statement of the problem.

The corpus of Russian texts written by individuals who committed suicide, RusSuiCorpus, is being compiled. It currently contains texts by 45 individuals aged 14 to 25, with a total volume of 200,000 words. All the texts were collected manually from the Internet: they were found on the social media platforms “Vkontakte” and “Zhivoj Zhurnal”, and since most posts on “Vkontakte” contain a lot of non-original material, the corpus is mainly made up of texts from “Zhivoj Zhurnal”, i.e. so-called “death diaries”. The fact that each suicide had actually been committed was checked by analyzing friends’ comments, media texts, etc.
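A corpus of this kind can be represented as a list of text records with author metadata, from which the summary figures quoted above (number of authors, age range, total word count) are derived. The sketch below is purely illustrative: the record fields, the class and function names and the placeholder records are hypothetical and do not describe the actual storage format of RusSuiCorpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusText:
    """One text with minimal author metadata (hypothetical schema)."""
    author_id: str
    age: int
    source: str
    text: str


def corpus_summary(records):
    """Number of distinct authors, age range and total whitespace-token count."""
    if not records:
        return {"authors": 0, "min_age": None, "max_age": None, "words": 0}
    ages = [record.age for record in records]
    return {
        "authors": len({record.author_id for record in records}),
        "min_age": min(ages),
        "max_age": max(ages),
        "words": sum(len(record.text.split()) for record in records),
    }


# Placeholder records; they contain no real corpus data.
records = [
    CorpusText("a1", 19, "blog", "one two three"),
    CorpusText("a1", 19, "blog", "four five"),
    CorpusText("a2", 23, "blog", "six"),
]

print(corpus_summary(records))
```

Keeping the author identifier separate from the text makes it possible to count authors rather than texts and to select control texts by age.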

All the texts are processed using the Russian version of LIWC as well as morphological and syntactic tagging tools [20].

Statistically significant differences between the texts by individuals who committed suicide and those by the control group are being studied, and models that distinguish between texts by suicidal and non-suicidal individuals are being designed using machine learning methods. The control texts are drawn from the RusPersonality corpus [3], whose metadata contain information about the authors, so that texts by individuals of a certain age can be selected. RusPersonality contains texts of natural written speech, which makes it suitable for comparison with blog texts.
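One way to check whether a text parameter differs significantly between two groups of texts is a permutation test on the difference of group means. The sketch below is an assumption made for illustration, since the article does not name the statistical test used; the function name, the default settings and the feature values are hypothetical.

```python
import random
from statistics import fmean


def permutation_p_value(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns the estimated probability of seeing a difference at least as
    large as the observed one if group labels were assigned at random.
    """
    rng = random.Random(seed)
    observed = abs(fmean(group_a) - fmean(group_b))
    pooled = list(group_a) + list(group_b)
    size_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:size_a]) - fmean(pooled[size_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical per-text values of one parameter (e.g. share of "I" pronouns).
group_1 = [0.10, 0.11, 0.12, 0.13, 0.14]
group_2 = [0.01, 0.02, 0.03, 0.04, 0.05]

print(permutation_p_value(group_1, group_2))
```

A permutation test makes no assumption about the distribution of the parameter, which is convenient for small corpora; when many parameters are tested at once, a correction for multiple comparisons would also be needed.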

Conclusions. The methods of corpus linguistics are essential for investigating the features of the speech of individuals who committed suicide. Studies that identify these typical features would allow us to develop diagnostic tools for evaluating suicidal tendencies based on linguistic analysis of speech production.

RusSuiCorpus, which is currently being compiled, and studies that apply modern software, statistical and machine learning methods to it would make it possible for the first time to obtain data on the features of the Russian written speech of individuals who committed suicide and to compare the results with those for other languages.