Friday, September 27, 2013


Lately I have been thinking about using topic modeling for a project on the commemoration of WWII using a large database of documents.* Along similar lines of working with large bodies of text, I have been impressed by Google's Ngrams, a tool that searches the Google Books corpus to chart the frequency of a word's use in each year. The Russian-language corpus is quite good, and it can provide some broad insights for historians. In particular, after a day working with stenographic records from the Stalin era, I looked up an Ngram for "applause" (see chart here):

Graph uses three-year smoothing, meaning that results are averaged over the three surrounding years on each side. For example, the 1937 value is actually the average frequency for 1934-1940, inclusive.
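The smoothing above can be sketched as a centered moving average. This is a minimal sketch with invented numbers; exactly how the Ngram Viewer handles the endpoints of the series is an assumption here:

```python
def smooth(freqs, s=3):
    """Centered moving average, mimicking Ngram-style smoothing.

    Each year's value becomes the mean of itself and up to `s` years
    on each side; the window is simply truncated at the endpoints
    (an assumption about how the edges are handled).
    """
    out = []
    for i in range(len(freqs)):
        window = freqs[max(0, i - s): i + s + 1]
        out.append(sum(window) / len(window))
    return out

# With s=3, the 1937 value is the mean of the raw 1934-1940 values.
raw = [0.0010, 0.0012, 0.0015, 0.0020, 0.0018, 0.0014, 0.0011]  # invented
print(smooth(raw, s=3)[3])  # mean of all seven values
```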

Anyone who has worked with Soviet-era documents or seen Soviet propaganda films knows that whenever important figures--above all Stalin--walked into a room, they would invariably be greeted with "applause," "thunderous applause," or (for the biggest names) "thunderous, continuous applause." Sometimes it seemed to me that the life of a party leader in the 1930s must have been a nightmare, surrounded constantly by deafening cheers. And the graph for applause does show a peak in the word's use in the second half of the 1930s. But it also shows that applause spikes again under Khrushchev, falls off, and picks up once more in the late Brezhnev era.

So what does this mean? The graph registers the percentage of the total corpus that "applause" makes up. At the apex of each Soviet leader's authority, the frequency of applause peaks. To me, this speaks to the way applause reflected the certainty of authority. At times when the political hierarchy in the Soviet Union was unclear, who should be applauded was also unclear. Thus, during the post-Stalin and post-Khrushchev collective leaderships, exactly who should get how much applause was an open question. In transitional times you didn't want to applaud someone (or, probably more accurately, to describe that person being applauded in an account of a meeting) who might fall out of power. What this graph reflects is not approval but agreement about the state of the political hierarchy. If we think about it this way, the question becomes why there was so little applause in the late 1940s, when Stalin was firmly in charge. My theory is that as Stalin aged he made fewer official appearances at meetings, and so there were fewer records of applause.
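To make the y-axis concrete, the figure a reader sees is the word's yearly count divided by the total number of words printed that year. The counts below are invented purely for illustration:

```python
# Hypothetical yearly counts for "applause" and hypothetical corpus sizes;
# the Ngram y-axis is the word's share of all tokens printed that year.
applause_counts = {1937: 1_500, 1952: 400}               # invented numbers
corpus_totals = {1937: 120_000_000, 1952: 95_000_000}    # invented numbers

freq_pct = {year: 100 * applause_counts[year] / corpus_totals[year]
            for year in applause_counts}
print(freq_pct[1937])  # share of the 1937 corpus, as a percentage
```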

This example shows that Ngrams on its own isn't enough; it requires knowledge of the period to provide an explanatory framework. It also is not especially useful for registering the comparative relevance of ideas over time. "Applause" tops out at a little more than .001 percent of the total corpus in the 1930s. This seems insignificant compared to words like "love" (ranging from about .012 percent before the revolution to .045 percent lows in the 1930s and 1970s) or "death" (~.009 pre-revolution to .003 in the 1970s, with a big spike during World War II). But there is no way to register the relative importance of concepts related to "love," "death," or "applause" from these numbers alone. Even with a single word, we have to be wary of changes in discursive practices. The fact that "applause" does not appear very frequently until the late nineteenth century does not mean that the tsars had no authority; it simply signifies that the words symbolizing authority in written language differed from those of the Soviet period. (Look perhaps at "solemnly"?) Nonetheless, when usage is consistent over a period of time, Ngrams seems like a useful tool for teaching or even for conceptualizing research.

*Topic modeling uses computer algorithms to sort through large databases of documents and generate sets of words (topics) that tend to appear in the same documents. The allure of the method is that, given a large enough database (hundreds or thousands of documents), it seems possible to find broad connections between documents in that pool and identify avenues for further research. The most popular tool for topic modeling is currently MALLET. It works well out of the box, and because a graphical user interface is available, it has a lot of potential as an introduction to topic modeling for humanities students. There is an easy-to-follow tutorial by Shawn Graham, Scott Weingart, and Ian Milligan on installing and getting started with MALLET at The Programming Historian. Another tool is Paper Machines, an extension for Zotero. I am trying to use Gensim, a module with the advantage of being written in the programming language I like best (Python) and of handling Cyrillic in Unicode. For more on topic modeling, see the Journal of Digital Humanities; in particular, see Ben Schmidt's article, an excellent examination of topic modeling's limitations.