Tuesday, October 27, 2015

How to Map the Gulag (the data)

I had a few people ask how I made the gulag videos--what kind of tools and time were involved. So this writeup is not about the gulag itself but about how I made the last map video. The last post was intended primarily as something I might show my students to illustrate the development of the camp system. What I want to do here is present something that scholars or digital history students could use to think about how one might make a map like this. This isn't a full walkthrough, since I doubt anyone is willing to put in the time I did to generate the data, and I have included code but far from all of it. Still, for people interested in doing digital history, it may be useful to see the process and to get a sense of the kind of coding needed to get usable data out of a set of web pages.

Any project like this can be divided broadly into two steps with some substeps:
  1. Getting data
    • Downloading
    • Sorting
    • Cleaning
    • Geocoding
    • Extrapolating
    • Outputting
  2. Visualizing (for next post)

Getting Data

A dynamic map needs a set of time-staggered geographical data. These data could be nuclear detonations (Isao Hashimoto), eighteenth and nineteenth century shipping routes (Ben Schmidt) or the shifting borders in Europe since 4000 BC (some YouTube user named MrOwnerandPwner who makes a lot of these maps). The last two maps are a little more complicated to make because they use lines and polygons to visualize data. Points are less problematic since the minimum you need for each data point is a timestamp and coordinates. If the data have a population figure or other categories, it is possible to weight the points or otherwise differentiate the visualization by category, for example using colors to show what kind of work prisoners did. The gulag is a good example since it has points and also because all the data are available on Memorial's website (click on the Лагуправление header).

The easiest way to import a point map into a GIS program is as a delimited file, often called comma separated values or csv. This is just a text file that duplicates the functions of a spreadsheet by using a comma (or tab, or semi-colon, or whatever else) to separate the cells. The first row usually contains the category names. Ultimately, the csv for a map like this will consist of many thousands of rows like the first row of the csv for the gulag map (available here), which looks like this:

date;name;y;x;size
1924-01;SOLOVETsKII ITL OGPU;65.028257;35.717330;3531


So how do we get this nice-looking (technically speaking) data? Memorial's gulag site was written in HTML4 and uses frames to keep the header and sidebar in place while loading the camps in the central frame. If you click on the side link for the Solovetskii ITL, it will load Memorial's information on that camp. Although you can't tell from the url, each of the camps is a different page. Here is the page for the Solovetskii camp:



There are a few ways to get the data out of this collection. It would be possible to go through all the camps and copy and paste the information manually into a spreadsheet. But there are 475 camps and copying every entry would be very time consuming. A better way of getting the data is with programming, by finding patterns in the data structure and using a scripting language to format the information in a way that is usable. This is also known as data scraping.

Learning how to scrape data with Python or another language takes some time and each page is different. I am going to post code here and it will be comprehensible if you are familiar with programming, even if you don't know Python. If you are just beginning, The Programming Historian has several lessons under the Data Manipulation heading (including mine) that go through the Python syntax needed for scraping. That site also has information about installation. I also like Codecademy lessons if you want practice with Python.


Downloading

The url for each camp follows the pattern: http://www.memo.ru/history/nkvd/gulag/r3/r3-X.htm with X being a number between 1 and 475. If you know this, you can use Python's urllib2 module to download all the camp pages very quickly. The third-party BeautifulSoup module parses the HTML, making it possible to find a part of the page by its tag. If you wanted to download and store a parsed version of the HTML for all the pages in a Python list, you would use this code (note that the last line will take a few minutes to run):

import urllib2
from bs4 import BeautifulSoup

# download and parse each of the 475 camp pages (r3-1.htm through r3-475.htm)
pages=[BeautifulSoup(urllib2.urlopen('http://www.memo.ru/history/nkvd/gulag/r3/r3-'+str(number)+'.htm').read()) for number in range(1,476)]


Sorting

The name of the camp is the first text on the page, and it is also the first bold (<b>) text. It's possible to create a dictionary, an "associative array" that stores data under keys, where each camp is itself a dictionary of information. Here is the code for that:
camps={camp.b.text:{} for camp in pages}

Now if you entered camps['СОЛОВЕЦКИЙ ИТЛ ОГПУ'], it would pull up an empty entry from that dictionary. Memorial gives the data in a set of tables where the first table cell <td> in each row is the category heading and the data themselves are in the second cell. (For Python, the first cell is actually cell zero.) The data needed for this map relate to time, location and size: Time of Operation (Время существования), Location (Дислокация) and Size (Численность). On my computer--and my understanding is that this is true for English-language Windows in general--Python does not handle Cyrillic well. I often transliterate the entire text (see my Programming Historian lesson). But in this case I needed to keep the Cyrillic because Yandex likes it better for geocoding. Instead I used unicode escape codes to create the variables. Each of the funny \u0441 codes stands for a Russian letter. The variable names should let you know what they mean:

timeofoperation=u'\u0412\u0440\u0435\u043c\u044f \u0441\u0443\u0449\u0435\u0441\u0442\u0432\u043e\u0432\u0430\u043d\u0438\u044f:'
size=u'\u0427\u0438\u0441\u043b\u0435\u043d\u043d\u043e\u0441\u0442\u044c:'
location=u'\u0414\u0438\u0441\u043b\u043e\u043a\u0430\u0446\u0438\u044f:'

for camp in pages:
    campname=camp.b.text
    for datapoint in camp.find_all('tr'):
        cells=datapoint.find_all('td')
        if len(cells)<2:          # skip rows without a category/entry pair
            continue
        category=cells[0].text    # first cell holds the category heading
        entry=cells[1].text       # second cell holds the data themselves
        if category in [timeofoperation,location,size]:
            camps[campname][category]=entry


Cleaning

All the data are now in the Python variable camps. However, the data need to be cleaned, meaning that we need to make the information uniform so that a computer can read it into a program. Cleaning data is in many cases, including this one, the most time consuming and tedious part of putting together a visualization.

The Memorial data are very thorough but very messy. I get the feeling someone typed them out by hand and occasionally made little mistakes that are difficult to catch with a program. At this point, it might be easier (although slower) to print out the data we have, put them into a spreadsheet and edit the 475 camps in Excel, either by hand or with regular expressions (patterns for identifying text), copying the coordinates from a website. What I did instead was look for patterns in the text with Python, removing what I didn't need and regularizing the rest. I am not going to go through the code here because it would take too long. Instead, I will identify the challenges of cleaning the data in broad terms:
  1. The size of the camp:
    • The entries for a camp's size at a given date usually follow a regular pattern that looks like this: "(xx.)xx.xx — xx xxx" (e.g., 01.01.30 — 53 123). Python's regular expression module can help find these patterns and capture the data points. However, there are some camps where the data are formatted differently, and we need to get that information out as well.
    • I searched the data with the "xx.xx — xx xxx" pattern to put most of the camp size information into date-size pairs that are easier for Python to read (a sketch of this kind of pattern matching appears after this list). For camps where the formatting follows a different style, I searched with a different pattern. When only a handful of entries remained, I looped through and entered the numbers manually.
  2. The date the camp opened and when it closed:
    • Most entries include the phrases "Organized xx.xx.xx" and "Closed xx.xx.xx", but some camps were reopened and closed again. How can we deal with that problem?
    • I searched with the "Organized xx.xx.xx" and "Closed xx.xx.xx" patterns to get the majority of the camps' operational dates. For camps that reopened at some point, I added an entry to the camp's size data setting its population to zero while it was temporarily closed.
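
To give a concrete sense of what that pattern matching looks like, here is a minimal sketch (not my actual code). It assumes the raw size entry is still stored under the size key created above and stashes the results in a hypothetical 'sizes' sub-dictionary keyed by zero-padded 'yyyy-mm' strings:

import re

# a sketch only: match "(dd.)mm.yy — nn nnn" style entries; the day is optional
# and the dash varies, so the pattern is deliberately loose
sizepattern=re.compile(ur'(?:\d{2}\.)?(\d{2})\.(\d{2})\s*[—–-]\s*(\d[\d ]*)')

for camp in camps:
    entry=camps[camp].get(size,'')                    # the raw Численность text, if any
    for month,year,number in sizepattern.findall(entry):
        prisoners=int(number.replace(' ',''))         # "53 123" -> 53123
        # all the years in the Memorial data fall in the 1900s
        camps[camp].setdefault('sizes',{})['19'+year+'-'+month]=prisoners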

Geocoding

The locational information is dirty but good enough for Yandex, which will make mistakes no matter what it is given. In an ideal world we would feed Yandex locational information written the way you are supposed to address mail. For example, my work address: Russia, Moscow, Ulitsa Petrovka, 12. Yandex knows exactly what to do with this. This is not an ideal world, though, so we are feeding Yandex the addresses we have. For example, the first camp alphabetically, the Automobile-Transport Camp of Dalstroi, has the following address with a source citation: Magadanskaia oblast, pos. Miakit {21. l. 667, 733}. Yandex can sometimes work miracles, and if you enter that address, it returns the exact village we need. Other times Yandex will provide an address that is wildly mistaken (see my previous post). What I did was semi-automate the geocoding with the geocoder module. I looped through the camps, getting Yandex's geocodes. Then I printed the proposed latitude and longitude and plugged the coordinates into Yandex to check them. If they were okay, I moved on to the next camp. If they weren't, I tried different locations in Yandex until I was satisfied. The code looks something like this:

import geocoder

for camp in camps:
    address=camps[camp][location]       # use a new name so the location key itself is not overwritten
    okay='no'
    while okay=='no':
        lat,lng=geocoder.yandex(address).latlng
        print str(lat)+','+str(lng)
        okay=raw_input('Is this okay?')
        if okay=='no':
            address=raw_input('What location to try?')
    camps[camp]['lat']=lat
    camps[camp]['lng']=lng


Going through all the camps and checking the coordinates was time consuming. Historical addresses are difficult because towns disappear and streets are renamed--especially in the former USSR. If a project is too large, it may be impractical to go through coordinates semi-manually as I did. But of course, a project like this is only as good as the geographical data it uses, so checking the coordinates is not time wasted.

Extrapolating

Some statistical sets may have a data point for each time increment you want to map. That is not the case for the gulag. The data are sporadic: some camps have a few points per year and others have only one for their whole existence. I had to think about how to handle the time increments. Did I want to render the map yearly? Monthly? Weekly? Daily? Ultimately, it made the most sense to go monthly, since there was too little change from frame to frame on a weekly or daily basis and too much change with yearly increments.

Choosing a monthly increment raised the problem of extrapolating for camps that existed during a given month but for which Memorial offered no specific data. I explained in the previous post that my approach was to take the average of the nearest points before and after where there were data. For example, if a camp existed from 01.1930 to 01.1932 and I had data points for 01.1930 (500 prisoners) and 01.1931 (1000 prisoners), then for 02.1930 to 12.1930 I would take the average of the two data points (750 prisoners). In the months after 01.1931, I would keep the same number of prisoners as in 01.1931 until the camp closed. There are probably more sophisticated ways of doing this. I could have weighted the average so that 12.1930 was closer to the figure from 01.1931. Or I could have been more of an interventionist, adjusting the formula so that it favored lower figures for periods when I knew the gulag population was lower (e.g., during the war and after Stalin's death). I would be more comfortable with the former approach (more objective) than with the latter (subjective). In any event, there is no way the Memorial data would ever give the real number of gulag prisoners, and I was satisfied that the data I got approximated the general movement of the gulag population.

The important question becomes how to get these data, whether with a script or manually. I wrote a function in Python that returned a camp's population where there was a data point, or an extrapolation where the Memorial data had none (something like the sketch below). But it would also be possible to pull the data into a spreadsheet at this point and use Excel formulas to fill in the rest.
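
Here is a minimal sketch of that kind of lookup-and-extrapolation function. It assumes the hypothetical 'sizes' sub-dictionary from the cleaning sketch above, keyed by zero-padded 'yyyy-mm' strings so that the keys sort chronologically (my actual function looked somewhat different):

# a sketch only: assumes camps[camp]['sizes'] maps 'yyyy-mm' strings to prisoner counts
def getsize(campdata,date):
    sizes=campdata.get('sizes',{})
    if date in sizes:                      # an exact data point for this month
        return sizes[date]
    earlier=[d for d in sizes if d<date]   # zero-padded strings sort chronologically
    later=[d for d in sizes if d>date]
    if earlier and later:                  # between two points: average them
        return (sizes[max(earlier)]+sizes[min(later)])/2
    if earlier:                            # after the last point: carry it forward
        return sizes[max(earlier)]
    return 0                               # before the first point: treat as empty

For the example above, this would return (500 + 1000) / 2 = 750 prisoners for every month from 02.1930 through 12.1930, and 1,000 for every month after 01.1931 until the camp closed.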


Outputting

Once the data are cleaned and geocoded, you can export from Python to a .csv file. Or if you are using Excel, you can save to a .csv or .txt file. Excel can do some funky things with Cyrillic, so it could make sense to copy and paste the entire spreadsheet from Excel into a programming text editor like Notepad++ (with the encoding set to a common standard like UTF-8), which creates a tab-delimited format where tabs separate the categories. What I did was loop through all the camps in each month of each year, running a function that checks whether the camp was open. If it was open, the script gets the size of the camp. If the camp had any prisoners, the script writes the date, the camp, its latitude, longitude and size to a file using semi-colons as delimiters. That code looks something like this:

[Edit: I forgot that you will also need the total prisoner population that Memorial's data give and the total number of camps for each month. That csv will be needed for the next post. I have added the code below.]

with open('C:\\Your Path...\\gulag.csv','w') as f:
    f.write('date;name;y;x;size')

#ADDED IN EDIT
with open('C:\\Your Path...\\gulagtotals.csv','w') as f:
    f.write('date;camps;total;x;y')
#END EDIT

for year in range(1924,1960):
    for month in range(1,13):
        #ADDED IN EDIT
        campcount=0
        totalsize=0
        #END EDIT
        date=str(year)+'-'+str(month).zfill(2)   # zero-padded, e.g. '1924-01'
        for camp in camps:
            campdata=camps[camp]
            if campopen(campdata):
                size=getsize(campdata,date)
                if size>0:
                    with open('C:\\Your Path...\\gulag.csv','a') as f:
                        f.write('\n'+';'.join([date,camp,str(campdata['lat']),str(campdata['lng']),str(size)]).encode('utf8'))
                    #ADDED IN EDIT
                    campcount+=1
                    totalsize+=size
        with open('C:\\Your Path...\\gulagtotals.csv','a') as f:
            f.write('\n'+date+';'+str(campcount)+';'+str(totalsize)+';'+'15;76')
            #END EDIT


This post should have given a taste of how useful programming can be in generating data for historical research and visualizations. With about fifteen lines of Python code you can download an entire set of web pages and extract the (very, very dirty) data into a spreadsheet. All in all, it took about ten hours to get, clean and export the data using Python, including trial and error and tea breaks. Cleaning the data was what took the most time by far. If you generated the data manually in a spreadsheet after downloading with Python, you are probably looking at about forty hours of work (five minutes on average per camp × 475 camps), which might be reasonable if the data are important for your project, but I would rather put the extra hours into learning Python. Various projects by Memorial provide especially good sources for scraping this kind of raw data (here, for example, is a list of several million victims of Stalinist repression). In the next post, I'll write about how to take this csv and turn it into a map.

Monday, October 19, 2015

Mapping the Gulag over Time

From the 1920s to the end of the 1950s, the Soviet government ran a brutal system of camps that came to be known by its acronym, gulag (Chief Administration of Camps). The gulag has been on my mind lately because I picked up Alan Barenberg's Gulag Town, Company Town (see a series of commentaries on the book by its author and specialists here) and also because the latest issue of the journal Kritika carried a series of articles about the camp system. With all this new information coming out about Soviet prison camps, it struck me that there is an opportunity to produce some digital content as well. I have also been thinking of data sets to use with QGIS, a powerful, open source mapping program, and Soviet forced labor provides a good one in many ways. While there was an entire project dedicated to producing gulag maps, it doesn't really take advantage of all the possibilities the data and technology present. Instead, I created a couple video maps from gulag data.

I'll explain what these visualizations are before analyzing what they mean. Using Python, I took the data from the Russian human rights organization Memorial's project on the gulag. I used only the data from the Camp Administrations (Lagupravleniia) tab, since these include individual camps rather than entire camp systems. These camps were only part of the gulag administration, which was itself just part of the Soviet policing apparatus (OGPU-NKVD-MVD). The gulag administration ran a vast prison empire that included colonies for juvenile delinquents, ordinary jails and "special settlements" for exiled dekulakized peasants and supposedly hostile national groups. My visualizations only include what might be considered the "classic gulag," the "correctional labor camps" and transit camps of Solzhenitsyn's Ivan Denisovich or Shalamov's Kolyma Tales. From the entries for 475 individual camps, I pulled the dates of operation and the number of prisoners, and geocoded the locations. This allowed me to create 432 maps like this:




Then using QGIS and a plug-in called Time Manager, I plotted the points over time on a map, creating an image for each month between January 1924 and December 1959.

The first video contains heat maps showing the density of labor camp prisoners:




The second video contains point maps showing the size of camps:





These videos crystallize much of the new research on the gulag. The works of Barenberg, Wilson Bell (here as well) and others show that prisoners had far more contact with the outside world than we previously thought. Prison camps were often near towns, prisoners associated with guards, and former prisoners were effectively forced to take up residence in the camp town. Although no one is trying to diminish the brutality of the forced labor system, new research suggests that the archipelago metaphor is not accurate. As economist Tatiana Mikhailova shows, cities formed around the gulag itself. And these maps allow us to see that very dense populations of prisoners were relatively close to cities--even Moscow. Moreover, during the final years of Stalin's reign, up to March 1953, camps were everywhere. At the same time, it is worth pointing out that the rather dull first ten seconds of the visualizations give a clue as to why Solzhenitsyn's description endured. The main camp for political prisoners until 1929 was the Solovetskii Special Purpose Camp, the island-bound prison north of Leningrad. Although the gulag system changed dramatically after 1929, it is worth remembering that this iconic image of Soviet forced labor--like that of many other aspects of the USSR--comes from the 1920s.

My data do a good job of approximating the total number of gulag prisoners at a given time. The data set isn't perfect, of course. Memorial's camp data give the number of prisoners on a non-uniform basis. For example, the Birskii camp in Bashkortostan existed from April 1939 to January 1942, yet it has only five data points for that period. Clearly its population changed more than five times, and rather than try to guess what it was in unlisted months, I averaged the number of prisoners for the months in between. For February through June 1940, my map gives its population as 12,063, the average of its January 1940 population (12,866) and its July 1940 population (11,261). This approximation is problematic during WWII, when the map displays some camps as existing very close to German-occupied territory. My guess is that the NKVD created or reformed camp administrations in advance of the creation of camps on these territories and that my calculation picked up the first data point after the camp population returned. It also means that some of the camp data from February 1953 reflect the amnesty of prisoners in March 1953.

Rather than presenting the gulag's own summary statistics like Getty, Rittersporn and Zemskov did in this article, I tabulated the total number my camp data gave, warts and all. Nonetheless, it comes awfully close to the summary statistics from that article. More importantly, it captures the trajectory of the major expansions and contractions of the gulag over the period:

1924-1929: Limited camp system
1929-1933: Expansion based on wave of collectivization repression
1935-1939: Expansion based on political and social repression in the Great Terror
1941-1945: Contraction during war as prisoners join army or die during famine conditions
1948-1953: Late Stalinist expansion to largest camp system
March 1953-1956: Contraction during post-Stalin amnesty and destalinization

A final point that these maps hammer home is that the gulag system sent people to the far reaches of the USSR, but it also had a huge footprint in European Russia. Punishment and proactive incarceration of "anti-Soviet elements" were the main motives behind mass repression under Stalin. However, the Soviet Union had a labor-hungry economy, and construction sites and factories throughout the country demanded laborers. Research like Nick Baron's on forced labor in Karelia or James Harris's on the Urals shows the large role that forced labor played in the planned economy. Prisoners were famously the key force in building the Moscow-Volga canal. From this data set, it is clear that the gulag increasingly became a tool of settlement in territories far from the populated European territories. However, it should be equally visible that the gulag remained a labor source for the territories that were already relatively developed.

Those are my thoughts on the videos. I will probably write up a little explanation of how I made the maps because it is not difficult to do if you have time-staggered geographical data. If anyone wants to play with the numbers, the totals are available in csv form here and the month-by-month, camp-by-camp csv here. Comments are welcome--especially suggestions for music as backing tracks!