Sunday, October 13, 2013

Slavic-specific resources for digital scholarship

I've just had a lesson come out on automatic transliteration of Cyrillic sources in The Programming Historian, so I thought that I would devote this post to shameless self-promotion. Then I decided I should also write a little about some of the tools I use to build databases from web information and create visualizations. I'll pay particular attention to resources that I have found useful for Russian/Eurasian history.

I do the bulk of my programming in a language called Python. Compared to other languages like Java or C, Python's syntax is closer to natural language, which makes it easier to read. The open-source community has put together many modules for Python, some of which are quite powerful and useful. There is a great free, full-length Python course at Udacity (Computer Science 101), and Codecademy has good interactive lessons for mastering syntax (and not only for Python, by the way). The Programming Historian has lessons on Python and other tools geared toward humanities scholars who would like to learn specific skills (e.g., counting the words in a set of documents or downloading a set of web pages) without learning an entire language.

What makes Python indispensable for me is its ability to extract data easily from web pages. Using a parsing module called Beautiful Soup, you can, for example, go through all three million names in Memorial's list of gulag victims and generate a table (and then a map and a blog post) of the sources of the entries in about a dozen lines of code. The Programming Historian has two lessons that deal with Beautiful Soup.
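To give a sense of what "extract and tally" means here, below is a minimal sketch. It deliberately swaps Beautiful Soup for Python's built-in html.parser module so that it runs without installing anything; with Beautiful Soup the same job would be shorter. The sample page, the "entry" and "source" class names and the source names are all invented for illustration and do not reflect how Memorial's pages are actually marked up.

```python
from collections import Counter
from html.parser import HTMLParser

# Invented sample markup. A real page will be structured differently,
# so the tag and class names below are assumptions.
SAMPLE_PAGE = """
<ul>
  <li class="entry">Ivanov I. I. <span class="source">Source A</span></li>
  <li class="entry">Petrov P. P. <span class="source">Source B</span></li>
  <li class="entry">Sidorov S. S. <span class="source">Source A</span></li>
</ul>
"""


class SourceCounter(HTMLParser):
    """Tally the text found inside every <span class="source"> element."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self._in_source = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "source") in attrs:
            self._in_source = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_source:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_source:
            self.counts["".join(self._buffer).strip()] += 1
            self._in_source = False


def count_sources(html):
    """Return a Counter mapping each source name to its number of entries."""
    parser = SourceCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


source_table = count_sources(SAMPLE_PAGE)
for source, n in source_table.most_common():
    print(f"{source}\t{n}")
```

For a real site you would download each page first and feed its text to count_sources, adding the results together as you go.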

These are general tools for generating and manipulating data--so what is there that is field specific? There are a couple of tools I have found especially useful for my work with sources in Russian. The first is a transliteration module that I wrote for Python and wrote up for The Programming Historian. One of the challenges of doing digital scholarship in Russian or other languages that use Cyrillic characters is that computers prefer the American Standard Code for Information Interchange (ASCII), a character set based on the English alphabet. For this reason, I developed code that takes a block of text and transliterates whatever characters are in Cyrillic into Latin characters using the modified Library of Congress standard that historians write with. A program like this can also be adapted for other alphabets. I've found this module useful in my own work, especially for transliterating large numbers of names for a non-Russian-reading audience or for a non-Russian-reading computer.
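The core idea of such a transliterator can be sketched in a few lines: a lookup table from Cyrillic to Latin letters, applied character by character. This is an illustration and not the module from the lesson. The table follows a simplified, diacritic-free version of the modified Library of Congress scheme, and rendering the soft and hard signs as ' and " is an assumption (some authors drop them entirely).

```python
# Lowercase Russian letters mapped to a simplified, diacritic-free
# modified Library of Congress transliteration.
# Assumption: the soft sign becomes ' and the hard sign becomes ".
LOWERCASE = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "i", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": '"', "ы": "y", "ь": "'", "э": "e", "ю": "iu", "я": "ia",
}

# Add uppercase letters, capitalizing only the first Latin letter
# (so "Х" becomes "Kh", not "KH").
TABLE = dict(LOWERCASE)
for cyrillic, latin in LOWERCASE.items():
    TABLE[cyrillic.upper()] = latin.capitalize()

TRANSLATION = str.maketrans(TABLE)


def transliterate(text):
    """Replace Cyrillic letters with Latin ones; leave everything else alone."""
    return text.translate(TRANSLATION)


print(transliterate("Никита Хрущёв, Москва"))  # Nikita Khrushchev, Moskva
```

Because anything not in the table passes through untouched, mixed Russian and English text is safe to run through it. One limitation of this sketch is that words written entirely in capitals come out with mixed case in the multi-letter combinations (for example "Kh" rather than "KH").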

Another tool I use often is geocoding through both Google's and Yandex's application programming interfaces (APIs). Using JavaScript, you can create dynamic webpages with maps by coding the locations into the page itself, by having users input the location or by using a database (e.g., Google Fusion Tables) [update 6/2021: Fusion Tables was discontinued. ArcGIS is what I am using now for easy mapping of lots of points]. With either service, you can use a Python module to look up the latitudes and longitudes of locations. In general, I have found that Yandex (which I use from Python with the Yandex-Maps module) is more reliable and provides coordinates for more locations within the former Soviet Union. However, Google (which I use from Python with the PyGeocoder module) is better elsewhere. Once I have these locations, it is easy to upload them to a Fusion Table and place them on a webpage through the Google Maps API.
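One way to combine two geocoding services is to try one first and fall back to the other, then save the results as a CSV file ready for upload to a mapping tool. The sketch below shows only that logic. The two lookup functions are hypothetical stand-ins that return approximate, hard-coded coordinates; they are not the actual interfaces of the Yandex-Maps or PyGeocoder modules, which you would call in their place.

```python
import csv
import io


def geocode_with_fallback(place, geocoders):
    """Return (lat, lon, service) from the first service that finds the place.

    geocoders is a list of (name, function) pairs; each function takes a
    place name and returns (lat, lon) or None. Returns None if all fail.
    """
    for name, lookup in geocoders:
        coords = lookup(place)
        if coords is not None:
            lat, lon = coords
            return lat, lon, name
    return None


# Hypothetical stand-ins for real geocoding calls, with approximate
# coordinates hard-coded for illustration.
def fake_yandex(place):
    return {"Магадан": (59.56, 150.8)}.get(place)


def fake_google(place):
    return {"Berlin": (52.52, 13.4)}.get(place)


# Yandex first, Google second: a reasonable order for a data set that is
# mostly about the former Soviet Union.
GEOCODERS = [("yandex", fake_yandex), ("google", fake_google)]


def places_to_csv(places, geocoders):
    """Geocode each place and return the results as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["place", "latitude", "longitude", "service"])
    for place in places:
        hit = geocode_with_fallback(place, geocoders)
        if hit is None:
            writer.writerow([place, "", "", "not found"])
        else:
            writer.writerow([place, *hit])
    return out.getvalue()


csv_text = places_to_csv(["Магадан", "Berlin", "Nowhere"], GEOCODERS)
print(csv_text)
```

Recording which service answered, and which places nobody could find, makes it much easier to check the results by hand afterwards.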

The last tool--related to the previous--is Google's Geocharts, one of the many charts available through Google Charts. Again, using it involves JavaScript. It is mostly a matter of cutting and pasting code from Google's tutorials, but you can do some more interesting things if you know how to read JavaScript. And JavaScript is a nice thing to know anyway and can be learned quickly at Codecademy. Geocharts generates either a density map, where the magnitude of each value is shown by color, or a marker map, where it is shown by the size of the marker. What makes it especially useful for Russian/Eurasian studies (and area studies in general) is that it can draw a map of the whole world, of a region (e.g., Eastern Europe) or of a single country. The Russia map includes the former Soviet Union and at least parts of all the former Eastern European satellites, which makes it quite a useful tool for displaying regional data. Moreover, it can break down the map by province. The problem with Geocharts is that it is quite inflexible. Getting a chart that includes, for example, both Russia and Germany means losing the ability to display province-level data for either country. All the same, what makes Geocharts an amazing tool is that it requires just a little more than basic HTML. The GIS software I am familiar with requires a lot more effort to get something as useful as Geocharts.
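Since the rest of my examples are in Python, here is a rough sketch of how a script could write out a complete Geocharts page from a table of counts. The JavaScript inside the template follows the general pattern in Google's GeoChart tutorial as I understand it (load the geochart package, build a data table, draw with region and resolution options); check it against the current documentation, which may also ask for an API key. The province codes are ISO 3166-2 codes, the counts are invented, and the function and variable names are my own.

```python
import json

# HTML page with placeholders for the data rows and the chart options.
# Plain placeholders are used instead of str.format so that the braces
# in the JavaScript do not need escaping.
PAGE_TEMPLATE = """<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <script src="https://www.gstatic.com/charts/loader.js"></script>
  <script>
    google.charts.load('current', {packages: ['geochart']});
    google.charts.setOnLoadCallback(drawMap);
    function drawMap() {
      var data = google.visualization.arrayToDataTable(__ROWS__);
      var options = __OPTIONS__;
      var chart = new google.visualization.GeoChart(
          document.getElementById('map'));
      chart.draw(data, options);
    }
  </script>
</head>
<body><div id="map" style="width: 900px; height: 500px;"></div></body>
</html>
"""


def geochart_page(counts, region, resolution, value_label="Entries"):
    """Build an HTML page that draws counts (code -> number) as a GeoChart."""
    rows = [["Region", value_label]]
    rows += [[code, n] for code, n in sorted(counts.items())]
    options = {"region": region, "resolution": resolution}
    return (PAGE_TEMPLATE
            .replace("__ROWS__", json.dumps(rows))
            .replace("__OPTIONS__", json.dumps(options)))


# Invented counts keyed by ISO 3166-2 province codes.
province_counts = {"RU-MOW": 120, "RU-SPE": 85, "RU-SVE": 40}
page = geochart_page(province_counts, region="RU", resolution="provinces")

with open("geochart.html", "w", encoding="utf-8") as f:
    f.write(page)
```

Opening geochart.html in a browser should then show the map. The single region setting is also where the inflexibility shows up: a page like this draws one country at province level or a wider area at country level, not both at once.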

In sum, there are some great tools out there for learning how to put your computer to better use. Some of these tools are built for English, and it can be frustrating trying to do digital scholarship in another language. However, there are ways to get around Cyrillic difficulties, and in the case of Geocharts the geography of Eurasia actually makes the tool more useful than it is for other regions. The tools I have described here are just the ones I have been using, so I'd be interested to hear if anyone has found anything else out there for dealing with Russia/Eurasia data.