Making and using concordances

I’ve been using digital concordances to analyse word frequency in literary texts for quite some time. While I was working on my PhD thesis on Dostoevsky’s The Idiot in the late nineties, I regularly used the Petrozavodsk State University on-line Dostoevsky concordance, as mentioned in a previous post. The site was pretty new at the time, and although it provided me with some very useful material, the process could best be described as arduous. They’ve made huge improvements since then, but as I’ve been researching other authors recently, I’ve started building my own concordances to do this sort of work.

I’ve been using TextSTAT, a nice, free, open source application (thanks to John for helping me get started with all this). It’s easy to use and doesn’t have a problem with Cyrillic. The first concordances I built were of Shalamov’s works — nine corpora, containing his six collections of short stories about his seventeen years in the Stalinist Gulag (Kolyma Tales, The Left Bank, Artist of the Spade, Sketches on the Criminal World, Resurrection of the Larch, and Glove, or KT-2), and three memoir-type works (The Fourth Vologda, Vishera: an anti-novel, and Memoirs). I didn’t really have a clear set of questions I wanted to ask when I started, which was a useful lesson as I very quickly realized that such tools have little intrinsic value and what really matters is your reason for using them. Nevertheless, even doing some quick comparisons of the different collections and general sweeps of word-forms in the corpora threw up some interesting facts and provided a couple of potential answers to questions that had been on my mind.

I discovered how little dramatic and sensationalist vocabulary Shalamov uses, which is perhaps surprising in the context of the violence, brutality and misery his stories depict.  For example, words denoting ‘horror’ (ужас, ужасный, etc.) appear very rarely — seven times in the first collection, four in the second. ‘Suffering’ words (страдание, страдать, etc.) are also remarkably absent (two occurrences in the second collection; eight in the third). What is particularly telling here is the fact that when they are used, they generally refer not to people but to inanimate objects; in the story ‘The Weismannist’, in which Shalamov’s alter-ego Andreev is studying on a medical assistants’ course — a classic route out of general hard labour in the camps — his anatomy notebook is described as ‘suffering more than the others’. Given that Shalamov’s stories are about the most appalling suffering, this seems curious, to say the least.

Which brings me to the second lesson of using concordances: they can provide you with evidence, but you still have to interpret it. In relation to the question of suffering, and in particular the ‘suffering’ notebook, one could view this as a process of transference, the author and his characters avoiding thinking about their own situation, particularly at a point when the worst aspects of the Gulag seem to be behind them, and instead projecting that suffering onto things that cannot suffer in the way that human beings can and do. On the wider question of the unsensational vocabulary Shalamov employs, I think this relates to the question of readability. Shalamov’s stories are profoundly shocking, not only because of their brutality but also because, unlike the work of many Gulag writers, they offer so little sense of the triumph of the human spirit; on the contrary, he repeatedly insists that moral life cannot survive under such conditions. This often makes for very bleak reading, and I’ve wondered on many occasions what saves the stories from absolute unreadability. There are undoubtedly many answers to this, but I’ve started to think that the ordinariness of the language may be part of it. This definitely needs further thought and investigation, but I now have specific questions to direct the next stage of my analysis.

For the moment, however, I’ve moved on to another (related) topic and different texts, as I’m currently working on a paper on narratives about pre-revolutionary exile and hard labour. I’m focusing on forms of knowledge (or rather, lack of knowledge) of the convicts, how these relate to the position of the author/narrator as an outsider, and what they suggest about prison/exile as a form of colonization. I’m comparing three texts, Dostoevsky’s Notes from the House of the Dead (1861), George Kennan’s Siberia and the Exile System (1891), and Chekhov’s Sakhalin Island (1895). Unfortunately, no usable digital version of Kennan’s book is available (a limited preview of vol. 1 is available on Google books, but that’s it), but the Russian texts are both easily accessible on, a wonderful (if slightly badly organized and presented) resource for Russian literature. As there was a problem with the Cyrillic when I tried to make corpora straight from the website, which I had done for Shalamov, I copied and pasted the works into text documents, and created concordance corpora of both (my initial plan was to use the Petrozavodsk Dostoevsky concordance, but given how easy it now is to make one’s own, I thought I might as well, to make doing comparisons as simple as possible — I also discovered other good reasons for creating my own Dostoevsky concordance, of which more below).

My initial investigations focused on the verb ‘to know’ (знать), and I discovered that the two authors’ usage of the verb, and in particular of its negation, is very different. Dostoevsky’s novel, based on his experiences in a Siberian labour camp after his arrest for political crimes, has a very strong predominance of negated first-person forms and positive third-person forms (thus: I do not know, but they know). In Chekhov’s travelogue/sociological treatise on prisoner conditions, in contrast, the verb ‘to know’ is very rarely used at all, and when it is, it is overwhelmingly negated. I also checked the verb ‘to understand’ (понимать/понять) and found comparable results: in Dostoevsky, the narrator generally does not understand, or indeed is misunderstood, but ‘they’ (the other prisoners, from the peasantry) do understand; in Chekhov again there is little or no understanding by anyone. Finally I searched terms relating to the ‘known’ (известно), and was largely unsurprised by this stage to find that in Dostoevsky things are known but in Chekhov, they are almost always unknown (неизвестно).
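For anyone who wants to try this kind of count outside TextSTAT, here is a minimal Python sketch of the negation check. It is not how I did the analysis above, and everything in it is an assumption made for illustration: the function name negation_profile, the tiny sample text, and the rule that a form counts as negated only when the word immediately before it is ‘не’. It also ignores sentence boundaries, so the results would still need checking by eye.

```python
import re

# Runs of letters (Latin or Cyrillic); digits, punctuation and underscores are skipped.
WORD_RE = re.compile(r"[^\W\d_]+")

def negation_profile(text, forms):
    """For each word form, count occurrences directly preceded by 'не' and all others."""
    tokens = [t.lower() for t in WORD_RE.findall(text)]
    profile = {form: {"negated": 0, "plain": 0} for form in forms}
    for i, token in enumerate(tokens):
        if token in profile:
            negated = i > 0 and tokens[i - 1] == "не"
            profile[token]["negated" if negated else "plain"] += 1
    return profile

# Invented sample text, not a quotation from either author.
sample = "Я не знаю. Они знают. Я не знаю, но они знают всё. Я знаю."
profile = negation_profile(sample, ["знаю", "знают"])
for form, counts in profile.items():
    print(form, counts)
```

The list of forms has to be supplied by hand here (first-person знаю, third-person знают, and so on), which is exactly where a frequency list of all the forms in the corpus comes in useful.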

I’m not going to write about my interpretation of these features here, as I’m saving that for a forthcoming conference. For the moment I’m more interested in the process, and the advantages I’ve found of using TextSTAT. The on-line Dostoevsky concordance only allows you to search by initial letter or by whole word. This creates several problems. First, Russian is very rich in prefixes, and there is no way of combining all the possible prefixes in a single search. Secondly, Russian is inflected, and the search function does not recognize different endings as the same word, so if, say, you search for ‘author’ (автор) you get one hit, but you have to type in the dative (автору — to the author) and instrumental (автором — by or with the author), etc., to get results for those. TextSTAT, on the other hand, will produce a list of all words including a particular string or root, no matter where it occurs in the word, so you can see all the variations of both prefix and ending at a glance. Lemmatized concordances are problematic, but at least having a frequency list of all the forms used gives you an immediate overview, even if you then have to check them separately.
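The substring search is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, not TextSTAT’s own code: the function name forms_containing and the sample sentence are invented, and the tokenizer simply treats any run of letters as a word.

```python
import re
from collections import Counter

# Runs of letters (Latin or Cyrillic); digits, punctuation and underscores are skipped.
WORD_RE = re.compile(r"[^\W\d_]+")

def forms_containing(text, stem):
    """Count every word form in `text` that contains `stem` anywhere in the word."""
    stem = stem.lower()
    words = (w.lower() for w in WORD_RE.findall(text))
    return Counter(w for w in words if stem in w)

# Invented sample sentence, used only to show prefixed and inflected forms together.
sample = "Автор знает. Автору неизвестно, что автором узнано; соавторы не знали."
counts = forms_containing(sample, "автор")
for form, n in counts.most_common():
    print(form, n)
```

Because the match is on a string anywhere in the word, prefixed forms (соавторы) and inflected forms (автору, автором) come back in one list, and nothing is merged into a lemma: each form keeps its own count and still has to be checked separately.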

By clicking on a word form, TextSTAT’s concordance gives you a context of roughly ten words, with the search term in the middle (you can then click on each occurrence to see a larger chunk of text, and go from that to your original corpus). This is probably not quite enough, but it does allow you to see immediate collocations — I discovered that in Dostoevsky’s novel, use of ‘they know’ or ‘they knew’ very frequently emphasizes the collective, inclusive nature of the knowledge with the addition of ‘all’ (все). I probably wouldn’t have seen this in the much larger extracts provided in the on-line concordance. I also rather like the export function on TextSTAT, which allows you to create text documents with the immediate context of all the forms — very useful for doing comparisons.
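A keyword-in-context view of this kind can be approximated in Python as follows. Again, this is a rough sketch under my own assumptions rather than a description of TextSTAT: the names kwic and neighbours, the window measured in words on each side, and the sample text are all made up for the example.

```python
import re
from collections import Counter

# Runs of letters (Latin or Cyrillic); digits, punctuation and underscores are skipped.
WORD_RE = re.compile(r"[^\W\d_]+")

def kwic(text, keyword, window=5):
    """Return (left context, keyword, right context) for each occurrence of `keyword`."""
    tokens = WORD_RE.findall(text)
    keyword = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, token, right))
    return hits

def neighbours(hits):
    """Count the words that appear inside the context windows of all hits."""
    counts = Counter()
    for left, _, right in hits:
        counts.update(w.lower() for w in left + right)
    return counts

# Invented sample text, not a quotation from the novel.
sample = "Они все знали это. Все знали, что он не знал ничего."
hits = kwic(sample, "знали", window=2)
for left, word, right in hits:
    print(" ".join(left), "[" + word + "]", " ".join(right))
print(neighbours(hits).most_common(3))
```

With window=5 the context comes to roughly ten words, search term in the middle. Counting the neighbours is a crude way of spotting a recurring collocate such as все, but a narrow window is what makes it visible at all.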

The final problem with the on-line concordance is that typing in your own search terms is complicated considerably by its use of old orthography (the pre-revolutionary alphabet). I’ve always been dubious about the decision to use old orthography for other reasons, but in relation to the search function it creates a real difficulty: I know what the spelling changes were, and can read old orthography, but my ability to type it accurately is pretty limited. The search function is a new feature, but other decisions made long ago mean its usefulness is questionable.
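If you are building your own corpus from an old-orthography text, one possible workaround is to normalize the spelling before making the concordance, so that search terms can be typed in modern Russian. The sketch below is my own assumption about how that might be done, not anything the on-line concordance offers. It handles only the letter-level changes of the 1918 reform (ѣ, і, ѳ and ѵ replaced, word-final ъ dropped); changes to endings such as -аго are not covered, and the name modernize is invented.

```python
import re

# Letters abolished by the 1918 reform, mapped to their modern replacements.
# Escapes are used so that look-alike Latin letters cannot creep in.
LETTER_MAP = str.maketrans({
    "\u0463": "е", "\u0462": "Е",  # yat
    "\u0456": "и", "\u0406": "И",  # decimal i
    "\u0473": "ф", "\u0472": "Ф",  # fita
    "\u0475": "и", "\u0474": "И",  # izhitsa
})

# A hard sign at the very end of a word; hard signs inside a word are kept.
FINAL_HARD_SIGN = re.compile(r"[ъЪ]\b")

def modernize(text):
    """Apply letter-level 1918 spelling changes; endings are left untouched."""
    text = text.translate(LETTER_MAP)
    return FINAL_HARD_SIGN.sub("", text)

print(modernize("міръ, хлѣбъ, Ѳедоръ, объявленіе"))
```

The original text should of course be kept alongside the normalized copy, since the normalization throws away exactly the information that the old orthography preserves.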

This sounds like an attack on the on-line Dostoevsky concordance, which I don’t mean it to be. When I first used it I was amazed at what it could do and it was a very helpful tool in conducting my research. It’s just a question of how quickly digital technology has developed. Fifteen or so years ago, the project to digitize Dostoevsky’s works and create a concordance was a huge undertaking for a large group of scholars. Now, with digital texts readily available for so many works, I can build a concordance with greater functionality in a couple of minutes.

