Wednesday, July 17, 2013

Gene Palma and Kevin Bacon's wife (a brief overview of network theory)

In the coming weeks I am going to post some data about professional networks in Soviet cinema.  But before I do that--and while my oldish computer chokes on what is a relatively small data set--I thought I would write up some of the ideas that are informing this project.

The main inspiration for my "Soviet Kevin Bacon" data is the excellent Oracle of Bacon, a site built originally by mathematician Brett Tjaden in the 1990s.  The project was similar to the Erdos Number Project, a site about the prolific Paul Erdos, who published more papers than any other mathematician (many of them about graph theory, which is the field of mathematics I am drawing on for this project more generally).  Because mathematicians often publish papers with colleagues, many of Erdos's papers were co-written, allowing the creators of the Erdos Number Project to make a network of scholars based on their degree of separation from Erdos--their Erdos Number.  For example, Dr. X was a co-author of Erdos and therefore has an Erdos Number of 1.  Dr. Y never published with Erdos but did write an article with Dr. X, and so has an Erdos Number of 2.  The basic concept should be familiar to anyone who knows the "Six Degrees of Kevin Bacon" game that became popular in the mid-1990s.  The main idea behind the game is that every figure in cinema has a "Bacon Number" of 6 or lower.  Using data from the Internet Movie Database, Tjaden created a web application that finds the shortest path from any figure in IMDB to Kevin Bacon.
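To make the idea of a separation number concrete, here is a minimal sketch in Python.  It assumes the collaboration data is stored as a dictionary of sets and uses a breadth-first search, which is one standard way to find the fewest steps between people; I do not know how the Oracle of Bacon itself is implemented.  The function name `separation_numbers` and the tiny data set are my own inventions, mirroring the Dr. X / Dr. Y example.

```python
from collections import deque


def separation_numbers(collaborations, source):
    """Return the fewest collaboration steps from `source` to each reachable person."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        person = queue.popleft()
        for colleague in collaborations.get(person, ()):
            # The first time we reach someone is along a shortest path.
            if colleague not in distances:
                distances[colleague] = distances[person] + 1
                queue.append(colleague)
    return distances


# Hypothetical co-authorship data: Dr. X wrote with Erdos, Dr. Y wrote with Dr. X.
collaborations = {
    "Erdos": {"Dr. X"},
    "Dr. X": {"Erdos", "Dr. Y"},
    "Dr. Y": {"Dr. X"},
}

erdos_numbers = separation_numbers(collaborations, "Erdos")
print(erdos_numbers["Dr. X"], erdos_numbers["Dr. Y"])
```

Swapping in actors and shared films instead of scholars and shared papers would give Bacon Numbers in the same way.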

Being able to connect anyone to Kevin Bacon is a neat trick but it doesn't tell you very much about the structure of the world of Hollywood.  We know from the site that Kevin Bacon has been in a lot of movies with a lot of people who have also been in a lot of movies with a lot of people.  In fact, it was so unusual to find people with Bacon numbers higher than 6 that the site created a hall of fame for people who found figures with Bacon numbers of 7 or higher.  But what does that tell us about the place of Kevin Bacon (or other people) in the network of cinema?  And what can creating a network from IMDB tell us about the structure of professional networks in cinema in general?

This is where measuring centrality is important.  Centrality is just what it sounds like--a measure of how central a person is in a network.  A crude but useful way of finding centrality is to take a simple average of the number of steps it takes to get from one person to every other person.  For example, Kevin Bacon has been in films with 2,769 people (Bacon Number 1). They have been in films with 305,215 people who haven't worked with Kevin Bacon (Bacon Number 2). Another 1,021,901 people have Bacon Number 3. And so on.  The average Bacon Number across all of these people is just under three (2.994), meaning that the average name on IMDB that is in Kevin Bacon's network (and that is almost all of IMDB) can be connected to him in about three steps.  (See this explanation for more on Kevin Bacon's centrality.)  As it turns out, Kevin Bacon is not the most central actor in Hollywood but is actually closer to number 400. (Harvey Keitel is currently the most central actor.)
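The averaging step can be sketched in a few lines of Python.  Since I have only listed the first three Bacon Number counts above, the numbers in this example are a made-up toy distribution and not IMDB data; the function name `average_separation` is also just my own label.

```python
def average_separation(counts_by_number):
    """Average separation number, weighting each number by how many people have it."""
    total_people = sum(counts_by_number.values())
    total_steps = sum(number * count for number, count in counts_by_number.items())
    return total_steps / total_people


# Hypothetical toy distribution: 2 people at distance 1, 5 at distance 2, 3 at distance 3.
toy_counts = {1: 2, 2: 5, 3: 3}
toy_average = average_separation(toy_counts)
print(toy_average)  # (2*1 + 5*2 + 3*3) / 10 = 2.1
```

Feeding in the complete list of counts for every Bacon Number would be the way to arrive at a figure like the 2.994 quoted above.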

This measure of centrality is useful.  It gives us a basic understanding of how connected a person is in the network. And if you can get the measure of centrality of every person in the network, you can test to see if the network is a "small world," to borrow the phrase of social psychologist Stanley Milgram.  If a large proportion of the people in a network had a low average number of steps to everyone else, that would be a pretty good indication that it was a tight-knit community, or at least one where there were many highly connected persons.
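As a rough sketch of what measuring everyone might look like, the Python below computes the average number of steps from each person to all the others and then picks out the most central one.  The five-person network is entirely hypothetical, and the code assumes the network is connected (every person can reach every other person).

```python
from collections import deque


def distances_from(graph, source):
    """Breadth-first search: fewest steps from `source` to each reachable person."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        person = queue.popleft()
        for neighbor in graph[person]:
            if neighbor not in distances:
                distances[neighbor] = distances[person] + 1
                queue.append(neighbor)
    return distances


def average_distance(graph, source):
    """Average steps from `source` to everyone else (assumes a connected network)."""
    distances = distances_from(graph, source)
    others = [steps for person, steps in distances.items() if person != source]
    return sum(others) / len(others)


# Hypothetical network: A has worked with B, C and D; D has also worked with E.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A"},
    "D": {"A", "E"},
    "E": {"D"},
}

averages = {person: average_distance(graph, person) for person in graph}
most_central = min(averages, key=averages.get)  # lowest average = most central
network_average = sum(averages.values()) / len(averages)
print(most_central, averages[most_central], network_average)
```

A low network-wide average relative to the size of the network is the kind of signal one would look for in a "small world," though deciding what counts as low is a judgment call this sketch does not make.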

There are other, perhaps more accurate, ways of measuring centrality.  What if we believed--as I think is indisputably the case--that some professional connections are more important than others?  Kevin Bacon has worked with his wife, Kyra Sedgwick, on multiple films.  It would be difficult to factor marriage into a centrality formula but working on several films together could be taken into account.  Their connection should be stronger than the one between Kevin Bacon and Gene Palma, who had an uncredited role as a street drummer in 1980's Hero at Large, where Kevin Bacon played 2nd Teenager.

Ideally, the network created represents how the lived network works in practice.  John Travolta is a good example of where this representation fails.  John Travolta has a Bacon Number of 2, which would indicate that he is less connected to Kevin Bacon than Gene Palma is.  But John Travolta has a Sedgwick Number of 1, because he and Kyra Sedgwick were in Phenomenon together.  It must be easier for John Travolta than for Gene Palma to reach out to Kevin Bacon but the crude measure of centrality does not represent this reality.  (Also, so I don't come off as anti-Gene Palma, here is a scene with Gene Palma from Taxi Driver, his only other credit in IMDB.  His two films alone earn him a centrality rating of 3.515!)

To get a more accurate representation, the connections in the network need to be weighted.  I won't get too mathy but there are various ways to weight a connection, with a lower number indicating a better connection.  For example, Wikipedia says that Kevin Bacon and Kyra Sedgwick have been in four movies together.  If we wanted, we could count each of those movies and give their connection a weight of 1 divided by the number of movies, so 0.25.  As far as I know, Kyra Sedgwick has only been in one movie with John Travolta, so their connection is still 1.  But now instead of the path from Kevin Bacon to John Travolta having a weight of 2, it has a weight of 1.25--the sum of the weights along the path from John Travolta to Kyra Sedgwick to Kevin Bacon.  This still isn't a great representation of the reality we know, since Gene Palma (with a weight of 1) is still closer to Kevin Bacon than John Travolta.  But it's possible to get a weighted path from John Travolta to Kevin Bacon in under 1 if someone has been in three movies with each of them, since 1/3 + 1/3 is about 0.67. (I'm looking at you, Harvey Keitel.)  Finding the right formula for weighting is tricky but clearly using weighted numbers can give a more accurate representation of how this network works.
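Here is a minimal sketch of the weighted version in Python.  It uses Dijkstra's algorithm, a standard way to find the lowest-weight path, with each connection weighted as 1 divided by the number of shared films.  The film counts are the ones mentioned in this post; the function name `weighted_distances` and the data layout are just my assumptions for illustration.

```python
import heapq


def weighted_distances(shared_films, source):
    """Lowest total weight from `source` to each person, with weight = 1 / shared films."""
    graph = {}
    for (a, b), films in shared_films.items():
        weight = 1 / films
        graph.setdefault(a, []).append((b, weight))
        graph.setdefault(b, []).append((a, weight))

    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, person = heapq.heappop(heap)
        if dist > best.get(person, float("inf")):
            continue  # stale entry: a shorter path was already found
        for neighbor, weight in graph.get(person, []):
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return best


# Shared-film counts as described in the post.
shared_films = {
    ("Kevin Bacon", "Kyra Sedgwick"): 4,
    ("Kyra Sedgwick", "John Travolta"): 1,
    ("Kevin Bacon", "Gene Palma"): 1,
}

distances = weighted_distances(shared_films, "Kevin Bacon")
print(distances["John Travolta"], distances["Gene Palma"])
```

With these inputs John Travolta comes out at 1.25 and Gene Palma at 1.0, matching the arithmetic above.  Adding an intermediary with three shared films on each side would bring John Travolta's weighted distance down to about 0.67.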

So that is a brief introduction (albeit kind of long for a blog post) to some ideas and applications in graph theory, using Oracle of Bacon as the example.  As a non-mathematician, I found mathematician/sociologist Duncan Watts's Six Degrees to be a useful and relatively non-technical introduction to developments in graph theory, especially those involving social networks.  But for now this should be enough for me to post information about the network of Soviet cinema in the upcoming weeks.

Monday, July 15, 2013

Beginnings: Using Big Data in Russian/Eurasian History

This blog is a new venture for me based on some of the work I have been trying out in digital history in the Russian/Soviet/Eurasian history field.  My interest in using computational methods in history first grew out of courses I was taking to reconnect with a longtime interest in computer science that was waylaid by the small matter of my dissertation.  But I began to see more and more applications for processing historical sources with computers.  In particular, I think there is a place for computers in helping historians collect and process "big data" and in creating new ways of visualizing the past.

What are my goals for this blog?  At first, I will just post data I'm processing here.  There are a few projects I have in mind but the first one will probably be what I am considering calling "Who Was the Soviet Kevin Bacon?"  Using roughly 8,000 entries from, I am applying graph theory to figure out who the most central figures in Soviet cinema were from a quantitative standpoint.  I'll also try to find out how connected the world of Soviet cinema was as a system.

At some point, I would like to expand this project to include web applications.  To me, this is the most exciting and new aspect of this site.  Instead of just publishing processed data, web applications can allow users themselves to query dynamically for visualizations and information.  The sub-project here that is closest to completion is a database of marriage data for each region from the Russian/Soviet censuses from 1897 to 1989.

The last use of this blog will be as a forum for me to post some broader thoughts on digital history.  Here I will let readers under the hood a little bit and talk about some of the tools I am using (mostly the programming language Python).  But I also would like to examine the digital history field, both as a whole and within Russian/Eurasian history.