Tuesday, October 27, 2015

How to Map the Gulag (the data)

I had a few people ask how I made the gulag videos--what kind of tools and time were involved. So this writeup is not about the gulag itself but about how I made the last map video. The last post was intended primarily as something I might show my students to illustrate the development of the camp system. What I want to do here is present something that scholars or digital history students could use to think about how one might make a map like this. This isn't a walkthrough, since I doubt anyone is willing to put in the time I did to generate the data, and while some code is included, it is far from all of it. But for people interested in doing digital history, it may be useful to see the process and to get a sense of the kind of coding that is necessary to extract usable data from a set of web pages.

Any project like this can be divided broadly into two steps with some substeps:
  1. Getting data
    • Downloading
    • Sorting
    • Cleaning
    • Geocoding
    • Extrapolating
    • Outputting
  2. Visualizing (for next post)

Getting Data

A dynamic map needs a set of time-staggered geographical data. These data could be nuclear detonations (Isao Hashimoto), eighteenth- and nineteenth-century shipping routes (Ben Schmidt) or the shifting borders in Europe since 4000 BC (some YouTube user named MrOwnerandPwner who makes a lot of these maps). The last two maps are more complicated to make because they use lines and polygons to visualize data. Points are less problematic, since the minimum you need for each data point is a timestamp and coordinates. If the data have a population or other categories, it is possible to weight the points or otherwise differentiate the visualization by category, for example with colors to show what kind of work prisoners did. The gulag is a good example since it has points, and also because all the data are available on Memorial's website (click on the Лагуправление header).

The easiest way to import a point map into a GIS program is as a delimited file, often called comma-separated values or csv. This is just a text file that duplicates the functions of a spreadsheet by using a comma (or tab, or semicolon, or whatever else) to separate the cells. The first row usually contains the category names. Ultimately, the csv for a map like this will consist of many thousands of rows. The first row of the gulag map's csv looks like this (available here):

date;name;y;x;size
1924-01;SOLOVETsKII ITL OGPU;65.028257;35.717330;3531
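Once the file is in this shape, it can be read back with Python's csv module by setting the delimiter to a semicolon. Here is a minimal sketch using the sample row above (written for Python 3, where the csv module reads text directly; the scripts later in this post are Python 2):

```python
import csv
import io

# Read the semicolon-delimited sample back into dictionaries keyed by the header row.
data = 'date;name;y;x;size\n1924-01;SOLOVETsKII ITL OGPU;65.028257;35.717330;3531'
reader = csv.DictReader(io.StringIO(data), delimiter=';')
rows = list(reader)
first = rows[0]  # {'date': '1924-01', 'name': 'SOLOVETsKII ITL OGPU', ...}
```

Most GIS programs similarly let you choose the delimiter when importing a delimited text file.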


So how do we get these nice-looking (technically speaking) data? Memorial's gulag site was written in HTML4 and uses frames to keep the header and sidebar in place while loading the camps in the central frame. If you click on the side link for the Solovetskii ITL, it will load Memorial's information on that camp. Although you can't tell from the URL, each of the camps is a different page. Here is the page for the Solovetskii camp:



There are a few ways to get the data out of this collection. It would be possible to go through all the camps and copy and paste the information manually into a spreadsheet. But there are 475 camps, and copying every entry would be very time-consuming. A better way of getting the data is with programming: finding patterns in the data structure and using a scripting language to format the information in a usable way. This is also known as data scraping.

Learning how to scrape data with Python or another language takes some time and each page is different. I am going to post code here and it will be comprehensible if you are familiar with programming, even if you don't know Python. If you are just beginning, The Programming Historian has several lessons under the Data Manipulation heading (including mine) that go through the Python syntax needed for scraping. That site also has information about installation. I also like Codecademy lessons if you want practice with Python.


Downloading

The URL for each camp follows the pattern: http://www.memo.ru/history/nkvd/gulag/r3/r3-X.htm with X being a number between 1 and 475. Knowing this, you can use Python's urllib2 module to download all the camp pages very quickly. Python's BeautifulSoup module parses the data, making it possible to find a part of the page by its HTML tag. If you wanted to download and store a parsed version of the HTML for all the sites in a Python list, you would use this code (note: the last line will take a few minutes to run):

import urllib2
from bs4 import BeautifulSoup

pages=[BeautifulSoup(urllib2.urlopen('http://www.memo.ru/history/nkvd/gulag/r3/r3-'+str(number)+'.htm').read())
       for number in range(1,476)]
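As a quick check on the pattern, the full list of URLs can be generated without downloading anything; the snippet below is a sketch of that step on its own:

```python
# Build the 475 camp-page URLs from the pattern described above.
base = 'http://www.memo.ru/history/nkvd/gulag/r3/r3-%d.htm'
urls = [base % number for number in range(1, 476)]
```

If you are on Python 3, note that urllib2 was split into urllib.request and urllib.error, so the download line would call urllib.request.urlopen instead.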


Sorting

The name of the camp is the first text on the page, and it is also the first bold (<b>) text. It's possible to create a dictionary--an "associative array" that stores data under keys--where each camp name is a key and each camp's data is itself a dictionary. Here is the code for that:
camps={camp.b.text:{} for camp in pages}

Now if you entered camps['СОЛОВЕЦКИЙ ИТЛ ОГПУ'], it would pull up an empty entry from that dictionary. Memorial gives data in a set of tables where the first table cell <td> is the category heading and the data themselves are in the second cell. For Python, the first cell is actually cell zero. The data needed for this map relate to time, location and size: Time of Operation (Время существования), Location (Дислокация) and Size (Численность). On my computer--and my understanding is that this is true for English-language Windows in general--Python does not play well with Cyrillic. I often transliterate the entire text (see my Programming Historian lesson). But in this case I needed to keep the Cyrillic because Yandex handles it better for geocoding. Instead I used Unicode escape sequences to create the variables. Each of the funny \u0441 codes represents a Russian letter. The variable names should let you know what they mean:

timeofoperation=u'\u0412\u0440\u0435\u043c\u044f \u0441\u0443\u0449\u0435\u0441\u0442\u0432\u043e\u0432\u0430\u043d\u0438\u044f:'
size=u'\u0427\u0438\u0441\u043b\u0435\u043d\u043d\u043e\u0441\u0442\u044c:'
location=u'\u0414\u0438\u0441\u043b\u043e\u043a\u0430\u0446\u0438\u044f:'
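As a sanity check, the escape sequences decode to the Cyrillic headings themselves; in Python 3 every string is Unicode, so the same comparison holds there too:

```python
# The \uXXXX escapes and the literal Cyrillic headings are the same strings.
timeofoperation = u'\u0412\u0440\u0435\u043c\u044f \u0441\u0443\u0449\u0435\u0441\u0442\u0432\u043e\u0432\u0430\u043d\u0438\u044f:'
size = u'\u0427\u0438\u0441\u043b\u0435\u043d\u043d\u043e\u0441\u0442\u044c:'
location = u'\u0414\u0438\u0441\u043b\u043e\u043a\u0430\u0446\u0438\u044f:'
```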

for camp in pages:
    campname=camp.b.text
    for datapoint in camp.find_all('tr'):
        cells=datapoint.find_all('td')
        if len(cells)<2:
            continue #skip rows without a category/entry pair
        category=cells[0].text
        entry=cells[1].text
        if category in [timeofoperation,location,size]:
            camps[campname][category]=entry


Cleaning

All the data are now in the Python variable camps. However, the data need to be cleaned, meaning that we need to make the information uniform enough for a computer to read it into a program. Cleaning data is in many cases, including this one, the most time-consuming and tedious part of putting together a visualization.

The Memorial data are very thorough but very messy. I get the feeling someone typed them out by hand and occasionally made little mistakes that are difficult to catch with a program. At this point, it might be easier (although slower) to print out the data we have, put it into a spreadsheet and edit the 475 camps in Excel or with regular expressions (patterns for identifying text), copying the coordinates from a website. What I did was look for patterns in the text with Python, removing what I didn't need and regularizing the rest. I am not going to go through the code here because it would take too long. Instead, I will identify the challenges of cleaning the data in broad terms:
  1. The size of the camp:
    • The entries for a camp's size at a given date usually have a regular pattern that looks like this "(xx.)xx.xx — xx xxx" (e.g., 01.01.30 — 53 123). Python has a regular expression module that can help find these patterns and capture these data points. However, there are some camps where the data are formatted differently, and we need to get that information out as well.
    • I searched with the "xx.xx — xx xxx" pattern on the data to put most of the camp size information into groups of date-size that are easier for Python to read. For camps where the formatting follows a different style, I searched with a different pattern. When only a handful of entries remained, I looped through and entered the numbers manually.
  2. The date the camp opened and when it closed:
    • Most entries include a phrase "Organized xx.xx.xx" and "Closed xx.xx.xx" but some were reopened and closed again. How can we deal with that problem? 
    • I searched with the "Organized xx.xx.xx" and "Closed xx.xx.xx" patterns to get the majority of the operational dates of camps. For camps that reopened at some point, I included an entry in the data for the camps' size that made the size of the camp zero when it was temporarily closed.
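To make the pattern matching above concrete, here is a sketch using Python's re module. The entry string is an invented example in the "(xx.)xx.xx — xx xxx" format; the real Memorial entries are messier than this, which is why several passes with different patterns were needed:

```python
import re

# An invented sample entry in the "(xx.)xx.xx — xx xxx" format described above.
entry = u'01.01.30 \u2014 53 123, 01.01.31 \u2014 71 800'

# An optional day, a captured month.year, an em-dash, and a number with
# spaces as thousands separators.
pattern = re.compile(u'(?:\\d{2}\\.)?(\\d{2}\\.\\d{2})\\s*\u2014\\s*(\\d[\\d ]*\\d)')
sizes = {date: int(number.replace(' ', '')) for date, number in pattern.findall(entry)}
# sizes maps month.year to an integer: {'01.30': 53123, '01.31': 71800}
```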

Geocoding

The locational information is dirty but good enough for Yandex, which will make mistakes no matter what it is given. In an ideal world, we would feed Yandex locational information formatted the way you are supposed to address mail. For example, my work address: Russia, Moscow, Ulitsa Petrovka, 12. Yandex knows exactly what to do with this. This is not an ideal world, though, so we are feeding Yandex the addresses we have. For example, the first camp alphabetically, the Automobile-Transport Camp of Dalstroi, has the following address with a source citation: Magadanskaia oblast, pos. Miakit {21. l. 667, 733}. Yandex can sometimes work miracles, and if you enter that address, it returns the exact village we need. Other times Yandex will provide an address that is wildly mistaken (see my previous post). What I did was semi-automate the geocoding with the Geocoder module. I looped through the camps, getting Yandex's geocodes. Then I printed the proposed latitude and longitude and plugged the coordinates into Yandex to check them. If they were okay, I could go on to the next camp. If they weren't, I could try different locations to feed Yandex until I was satisfied. The code looks something like this:

import geocoder

for camp in camps:
    address=camps[camp][location] #a new name, so the location key variable is not overwritten
    okay='no'
    while okay=='no':
        lat,lng=geocoder.yandex(address).latlng
        print str(lat)+','+str(lng)
        okay=raw_input('Is this okay?')
        if okay=='no':
            address=raw_input('What location to try?')
    camps[camp]['lat']=lat
    camps[camp]['lng']=lng


Going through all the camps and checking the coordinates was time-consuming. Historical addresses are difficult because towns disappear and streets are renamed--especially in the former USSR. If a project is too large, it may be impractical to go through the coordinates semi-manually as I did. But of course, a project like this is only as good as the geographical data it uses, so checking the coordinates is not time wasted.

Extrapolating

Some statistical sets may have a data point for each time increment you want to map. That is not the case for the gulag. The data are sporadic. Some camps have a few points per year and others have one for their whole existence. I had to think about how to handle the time increments. Did I want to print the map yearly? Monthly? Weekly? Daily? Ultimately, it made the most sense to print monthly since there was too little change if I incremented on a weekly or daily basis and there was too much change in a data set with yearly increments.

Choosing a monthly increment raised the problem of extrapolating for camps that existed during a given month but for which Memorial offered no specific data. I explained in the previous post that my approach was to take the average of the nearest data points before and after. For example, if a camp existed from 01.1930 to 01.1932 and I had data points for 01.1930 (500 prisoners) and 01.1931 (1000 prisoners), then for 02.1930 to 12.1930 I would take the average of the two data points (750 prisoners). In the months after 01.1931, I would keep the same number of prisoners as in 01.1931 until the camp closed. There are probably more sophisticated ways of doing this. I could have weighted the average so that 12.1930 was closer to the figure from 01.1931. Or I could have been more of an interventionist, adjusting my formula so that it favored lower figures for periods when I knew the gulag population was lower (e.g., during the war and after Stalin's death). I would be more comfortable with the former approach (more objective) than the latter (subjective). In any event, there is no way the Memorial data could ever give the real number of gulag prisoners, and I was satisfied that my data approximated the general movement of the gulag population.

The important question becomes how to get these data, whether with a script or manually. I wrote a function in Python that returns a camp's population where there is a data point, or an extrapolation where the Memorial data have none. But it would also be possible to pull the data into a spreadsheet at this point and use Excel formulas to fill in the rest.
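A sketch of such a function follows. The date representation ((year, month) tuples) and the helper names are my own choices for illustration, not the code from the original script:

```python
def month_index(year, month):
    # Flatten a (year, month) pair into a single comparable number.
    return year * 12 + month

def estimate_size(known, year, month):
    """Return the camp size for a month: the recorded figure if one exists,
    the average of the nearest earlier and later figures if the month falls
    between data points, and the last known figure after the final point."""
    target = month_index(year, month)
    points = sorted((month_index(y, m), s) for (y, m), s in known.items())
    earlier = [(i, s) for i, s in points if i <= target]
    later = [(i, s) for i, s in points if i > target]
    if not earlier:
        return later[0][1]              # before the first data point
    if earlier[-1][0] == target or not later:
        return earlier[-1][1]           # exact point, or carry the last figure forward
    return (earlier[-1][1] + later[0][1]) // 2  # average of the neighbors

# The example from the text: data points at 01.1930 and 01.1931.
known = {(1930, 1): 500, (1931, 1): 1000}
```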


Outputting

Once the data are cleaned and geocoded, you can export from Python to a .csv file. Or, if you are using Excel, you can save to a .csv or .txt file. Excel can do some funky things with Cyrillic, so it could make sense to copy and paste the entire spreadsheet from Excel into a programming text editor like Notepad++ (with encoding set to a common standard like UTF-8), which would create a tab-delimited format where tabs separate the categories. What I did was loop through all the camps in each month of each year, running a function that checks whether the camp was open. If it was open, the script gets the size of the camp. If the camp had any prisoners, the script writes the date, the camp, its latitude, longitude and size to a file using semicolons as delimiters. That code looks something like this:

[Edit: I forgot that you will also need the entire prisoner population that Memorial's data gives and the total number of camps. This csv will be needed for the next post. I have added the code below.]

with open('C:\\Your Path...\\gulag.csv','w') as f:
    f.write('date;name;y;x;size')

#ADDED IN EDIT
with open('C:\\Your Path...\\gulagtotals.csv','w') as f:
    f.write('date;camps;total;x;y')
#END EDIT

for year in range(1924,1960):
    for month in range(1,13):
        #ADDED IN EDIT
        campcount=0
        totalsize=0
        #END EDIT
        date=str(year)+'-'+str(month).zfill(2)
        for camp in camps:
            campdata=camps[camp]
            if campopen(campdata,date):
                size=getsize(campdata,date)
                if size>0:
                    with open('C:\\Your Path...\\gulag.csv','a') as f:
                        f.write('\n'+';'.join([date,camp,str(campdata['lat']),str(campdata['lng']),str(size)]).encode('utf8'))
                    #ADDED IN EDIT
                    campcount+=1
                    totalsize+=size
        with open('C:\\Your Path...\\gulagtotals.csv','a') as f:
            f.write('\n'+date+';'+str(campcount)+';'+str(totalsize)+';'+'15;76')
            #END EDIT
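As an aside, Python's csv module can handle the delimiters and any quoting for you instead of joining strings by hand. Here is a minimal sketch in the same format, written for Python 3 and using an in-memory buffer in place of a real file:

```python
import csv
import io

# Write the header and the sample row from the top of the post with csv.writer.
out = io.StringIO()
writer = csv.writer(out, delimiter=';')
writer.writerow(['date', 'name', 'y', 'x', 'size'])
writer.writerow(['1924-01', 'SOLOVETsKII ITL OGPU', '65.028257', '35.717330', '3531'])
content = out.getvalue()
```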


This post should have given you a taste of how useful programming can be in generating data for historical research and visualizations. With about fifteen lines of Python code you can download an entire set of web pages and extract the (very, very dirty) data into a spreadsheet. All in all, it took about ten hours to get, clean and export the data using Python, including trial and error and tea breaks. Cleaning the data took the most time by far. If you generated the data manually in a spreadsheet after downloading with Python, you would probably be looking at about forty hours of work (an average of 5 minutes per camp * 475 camps). That might be reasonable if the data are important for your project, but I would rather put the extra hours into learning Python. Various projects by Memorial provide especially good sources for scraping this kind of raw data (here, for example, a list of several million victims of Stalinist repression). In the next post, I'll write about how to take this csv and turn it into a map.
