Abstractualized: networks


Monday, November 23, 2015

Flying the USSR: Summer of 1948

In one of the last posts, I linked to a visualization I like a lot--Ben Schmidt's 100 Years of Ships based on ship records digitized for climate data research. As it happened, I was thinking about this visualization when I came across KGB records of ships arriving in Soviet Ukraine from abroad. They are great sources, but only a handful of ships arrived each month, the coordinates of their journey were not given and the reports weren't regular. I thought about trying to put together a visualization from them, but they can't produce a shipping map like the shipping records from the 18th and 19th centuries. Nonetheless, the records led me to think about mobility within the USSR and the socialist bloc. Most people who have been to the former USSR have done some train travel, which has a long history (part of which, the cheapo platskart section, is coming to an end). However, it struck me that I didn't know that much about how the Soviet Union flew. It also seemed obvious that flights would make an interesting visualization, and one that could be revealing for thinking about how technology and social geography influenced travel.

I dug around a bit and found that plane enthusiasts are digitizing old Aeroflot flight schedules. This site is full of scanned timetables, like this one for Moscow from 1964:

What really interested me was not just looking at one city's schedule but putting together the whole air network, at least for a limited period of time. The owner of a Russian site called Avia History put up a large number of spreadsheets with timetables. Even better, it has timetables from seven cities for the summer of 1948: Adler (Sochi), Kiev, Leningrad, Moscow, Novosibirsk, Simferopol and Sverdlovsk (Ekaterinburg). Using these schedules, I came up with a visualization of the flight routes from that year, represented as beginning on one day (but lasting three days in the case of routes with multiple stops):

And here is a longer video (~12 minutes but with music, so maybe people will stick with it for a few minutes) covering May 5 to September 14. It's worth watching at least a little of this video, since it gives a sense of the daily rhythms of the network. If you watch a month or two, you can see the flights shift somewhat to the south, to the Caucasus and Crimea, as the Soviet Union went on vacation.

Here is how I made these videos: I geocoded the start and end location of each flight, figured out the time between takeoff city and destination, and orientation of the plane. Then, for each flight I generated the location and orientation of the plane every five minutes as if it was traveling in a direct line to the destination at an equal speed the entire time. I exported this flight path plus color coding for flight type to a csv (see here). For the long flight video, I exported every flight for each day the flight was scheduled from May 5 to September 14. Then, I took the spreadsheet for use as a time layer using QGIS and TimeManager, following the same steps from the previous post about mapping the gulag.
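The interpolation step described above can be sketched in Python. This is only an illustration, not the script used for the videos: the function names, the approximate coordinates and the departure and arrival times are made up for the example, and it interpolates linearly in latitude and longitude rather than along a great circle.

```python
import math
from datetime import datetime, timedelta


def initial_bearing(lat1, lon1, lat2, lon2):
    """Compass bearing in degrees (0 = north) from point 1 toward point 2."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    x = math.sin(dlon) * math.cos(phi2)
    y = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return (math.degrees(math.atan2(x, y)) + 360.0) % 360.0


def flight_track(origin, dest, depart, arrive, step_minutes=5):
    """Yield (timestamp, lat, lon, bearing) rows at a fixed time step.

    The plane is assumed to move in a straight line at constant speed,
    so the position is a linear blend of the two endpoints.
    """
    (lat1, lon1), (lat2, lon2) = origin, dest
    total = (arrive - depart).total_seconds()
    if total <= 0:
        raise ValueError("arrival must be later than departure")
    bearing = initial_bearing(lat1, lon1, lat2, lon2)
    t = depart
    while t <= arrive:
        frac = (t - depart).total_seconds() / total
        yield (t, lat1 + frac * (lat2 - lat1), lon1 + frac * (lon2 - lon1), bearing)
        t += timedelta(minutes=step_minutes)


# Hypothetical leg: approximate coordinates, invented times.
MOSCOW = (55.7558, 37.6173)
LENINGRAD = (59.9343, 30.3351)
rows = list(flight_track(MOSCOW, LENINGRAD,
                         datetime(1948, 5, 5, 8, 0),
                         datetime(1948, 5, 5, 10, 30)))
```

Each row could then be written out with the standard `csv` module, one line per five-minute position, which is the shape a time-layer tool expects.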

Before I start analyzing what these clips and the data in general reveal about flight in the early postwar USSR, I'll mention caveats. The person who put up the spreadsheets does not give the source of the data for the 1948 spreadsheets. Spreadsheets from other years give a source--an official timetable or a regional newspaper. I am assuming these timetables come from a similar source, although my attempts to contact the site's owner did not produce any results.

These are Aeroflot flights, that is, civil aviation (passenger, mail and cargo). They passed through, began or ended in one of the cities that I was able to find data for. This simulation probably captures 95+ percent of non-military flights. Obviously, having the schedule of every city would have been ideal. However, these flight schedules carried the entire route a plane took. So, for instance, both the Adler and Moscow timetables give Flight 208 as Adler-Moscow but the flight route went Tbilisi-Adler-Krasnodar-Rostov-Voronezh-Moscow. Those other flights (e.g., Tbilisi-Adler) are all listed in the schedule as well. For this reason, I don't think that having the Kazan' or Kuibyshev (Samara) schedules would add many, or any, flights, even though they are two of the biggest nodes in the network. How many flight routes would have originated in those cities that didn't also go through Moscow, Sverdlovsk or Leningrad? The two schedules that are most lacking are probably Khabarovsk and Tashkent, since some flights to distant locales may have originated there.

This is a simulation of the flight schedule, not the actual flights. Weather put flights off course, there were mechanical errors, accidents and so on. The planes take a linear route from their city of departure to their destination, unlike in reality. The visualization 100 Years of Shipping used captain's logs from each trip, which plotted the course of ships with a high level of detail. I don't even have both the exact departure and arrival times. Instead, the schedules show when a plane arrives in one city and when it arrives in the next and do not provide for layovers. This is especially visible on the long-haul Moscow-Siberia-Far East flight routes that took two days. In the simulation, it seems like the plane is flying the whole night. Obviously, that is impossible. However, I didn't want to guess what the layover times were, so readers will have to interpret very slowly moving nighttime planes as Soviets enjoying Siberian layovers. A similar problem is that one flight (Gorkii-Leningrad) gives what may be too fast a time and zooms by once every couple of days.

Caveats aside, these visualizations reveal interesting aspects of how Soviet transport networks operated, the relationship of the provinces to the center and how technology and distance affected travel in the USSR. Putting together this visualization was made easier because the state-run Soviet civil air system was very centralized, unlike in the United States where there were many commercial airlines, private planes and so on.

The big thing to note about air travel in this period was that its structure was more like long-haul train travel. Of just over a thousand different flights, 822 were part of a route with multiple stops. Of the 284 multiple-stop routes, 107 were two-hops like Moscow-Leningrad-Helsinki and another 72 were three-hops, like Moscow-Voronezh-Rostov-Krasnodar. There were 105 multi-stop routes that had four or more hops, including a route (and its reverse) with ten stops: Moscow-Kazan'-Sverdlovsk-Omsk-Novosibirsk-Krasnoiarsk-Irkutsk-Chita-Takhtamygda-Tygda-Khabarovsk. Many of them also combined passenger service with cargo and mail delivery. Here is a map of the flight network that illustrates this point about the multi-leggedness of Soviet air travel in 1948:

Looking at this map (where thicker lines reflect greater frequency of the route), it would be easy to confuse it with a rail map, except in the few cases where flights go over water. Part of this outcome depended on technological limitations. Among the three planes Aeroflot was using, the Lisunov Li-2 had the best maximum range at 2,500 kilometers. That will not go from Moscow to Vladivostok, but it will get pretty far. Here is a video showing the maximum range of the Li-2 (as well as the city names--a little busy for the visualization, but useful in this case) from Moscow:

The range is actually not so bad, even in a country the size of the USSR. It would have tested the limits of the technology at the time to run a Moscow-Omsk flight but it's likely that distance was only one factor. My guess is that the planes operated like trains from Moscow. Most passengers going to the end of the line probably came from the capital rather than from an intermediate stop. It seems plausible that limited numbers of planes and the lack of people/things to drop off in each regional city made it attractive to run a route from Moscow through many cities to save resources.
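To make the range argument concrete, here is a minimal sketch that compares great-circle distances from Moscow with the 2,500 kilometer figure quoted above. The city coordinates are approximate modern values that I am assuming for illustration, the haversine formula treats the Earth as a sphere, and a real flight would need a fuel reserve on top of the raw distance.

```python
import math

EARTH_RADIUS_KM = 6371.0
LI2_MAX_RANGE_KM = 2500.0  # maximum range quoted in the post

# Approximate coordinates (latitude, longitude), assumed for illustration.
CITIES = {
    "Moscow": (55.7558, 37.6173),
    "Omsk": (54.9885, 73.3242),
    "Vladivostok": (43.1155, 131.8855),
}


def great_circle_km(a, b):
    """Haversine distance in kilometers between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def within_range(origin, dest, max_range_km=LI2_MAX_RANGE_KM):
    """True if a nonstop leg fits inside the given maximum range."""
    return great_circle_km(CITIES[origin], CITIES[dest]) <= max_range_km


for city in ("Omsk", "Vladivostok"):
    km = great_circle_km(CITIES["Moscow"], CITIES[city])
    print(f"Moscow-{city}: {km:.0f} km, nonstop possible: {within_range('Moscow', city)}")
```

With these coordinates Moscow-Omsk comes out a little under the limit and Moscow-Vladivostok far beyond it, which matches the reasoning in the text.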

This pattern, with Moscow being THE center, is somewhat different from what was going on in commercial air travel in the US at the time. The comparison is difficult because it was possible to hire a plane in the US in a way that was impossible in the USSR. An ordinary passenger in the US could travel outside the main routes and Soviets could not. I am also not sure how it would compare to US air cargo transit. But looking at passenger routes between major cities is reasonable. For example, to get from Washington DC to Los Angeles in 1951 on American Airlines, you didn't have to fly DC-Nashville-Dallas-Phoenix-LA. It would make more sense to go DC-Tulsa/Chicago-LA, skipping all the smaller cities.

From Airways News

That Moscow was important in the air network of Stalin's USSR is not all that surprising. It's possible to show the centrality of Moscow mathematically in a couple ways using network theory. All the cities in the network are connected to one another through the network but it takes a varying number of steps. For instance, the longest distance between two points is eleven legs between Vladivostok and Vorkuta. The average number of hops between one city and all the other cities in the network is that city's centrality. (There are ways to weight the distance a flight takes as well, taking into account, for example, the actual distance or the number of flights per week. I talked about this a lot in a previous post about Soviet film networks.)
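The hop-count centrality defined here can be sketched with a breadth-first search. This is a minimal illustration under my own assumptions: the function names are hypothetical, every leg is treated as flyable in both directions, and the toy network uses only a few legs from routes mentioned earlier in the post rather than the real 1948 data.

```python
from collections import deque


def build_graph(legs):
    """Turn (city, city) legs into an undirected adjacency map."""
    graph = {}
    for a, b in legs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def hop_counts(graph, start):
    """Breadth-first search: fewest legs from start to each reachable city."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        city = queue.popleft()
        for neighbor in graph[city]:
            if neighbor not in dist:
                dist[neighbor] = dist[city] + 1
                queue.append(neighbor)
    return dist


def average_hops(graph, city):
    """Mean number of legs from one city to every other reachable city."""
    others = [d for c, d in hop_counts(graph, city).items() if c != city]
    return sum(others) / len(others) if others else float("inf")


# Toy network for illustration only.
TOY_LEGS = [
    ("Moscow", "Leningrad"), ("Leningrad", "Helsinki"),
    ("Moscow", "Kazan'"), ("Kazan'", "Sverdlovsk"), ("Sverdlovsk", "Omsk"),
]
toy = build_graph(TOY_LEGS)
for name in sorted(toy, key=lambda c: average_hops(toy, c)):
    print(f"{name}: {average_hops(toy, name):.2f}")
```

A lower average means a more central city. Weighting by distance or flight frequency would require replacing the breadth-first search with a weighted shortest-path search.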

Moscow is easily the most central city. Here is a spreadsheet with the centralities. From Moscow, it took an average of 2.3 flights to get anywhere in the network and a maximum of seven flights. Sverdlovsk and Leningrad were not too far behind, both at about two and a half hops. On the whole it took an average of about four flights to get from any random city in the network to any other random city. Here is another good test that shows Moscow's centrality: Close a city's airports, meaning no flights go to or from it, effectively taking it out of the network. What happens to the average number of steps it takes to get anywhere in the network? How many places become inaccessible to the main network? This test really shows how important Moscow was in the network since the average number of flights it took to get from any random point to any other random point went up a half flight (data here). The only other city that has a significant uptick by this measure was Aktiubinsk (Aktobe), which was a shortcut in the network to cities in southern Central Asia. Here is what it looks like when you take Moscow out of the network:

Moscow was so central that removing all the routes that originate or end in Moscow eliminates half of the cities in the network. The disappearing cities are mostly from foreign flights and the long-haul routes to the Far East. All the cargo-only flights disappear, although mixed cargo/mail/passenger flights were run, so it is hard to tell if the lack of cargo flights actually says something about the use of aviation for deliveries between Moscow and the provinces. Here is what the network looks like with these Moscow routes removed:

When Moscow is removed, though, the remaining forty-seven cities serviced by non-Moscow routes became more central. Removing routes at the very end of the network means that the average number of flights between random cities goes down (e.g., not possible to go from Leningrad to Vladivostok without the Moscow route, thus the large number of stops it takes are not included). It is also a sign that the Soviet air network existed outside of Moscow. I expected that this interconnectedness between regional capitals was because the summer schedule meant that vacationers from central Russia, Siberia or the Urals were going to the south. However, the cities that should have become more central if that was the case did not gain as much as I expected. Mineralnye Vody is the sixth most central and Adler is the tenth. Simferopol and Krasnodar are even lower.
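The "close a city's airports" test can be sketched as follows. This is a hedged illustration and not the calculation behind the linked data: the function names and the toy network are my own assumptions, legs are treated as two-way, and "main network" is taken to mean the largest connected group of cities.

```python
from collections import deque


def hops_from(graph, start):
    """Breadth-first search: fewest legs from start to each reachable city."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        city = queue.popleft()
        for neighbor in graph[city]:
            if neighbor not in dist:
                dist[neighbor] = dist[city] + 1
                queue.append(neighbor)
    return dist


def remove_city(graph, closed):
    """Copy of the network with one city and all of its legs deleted."""
    return {c: nbs - {closed} for c, nbs in graph.items() if c != closed}


def network_summary(graph):
    """Return (mean hops inside the largest component, cities cut off from it)."""
    seen, components = set(), []
    for city in graph:
        if city not in seen:
            component = set(hops_from(graph, city))
            seen |= component
            components.append(component)
    if not components:
        return 0.0, 0
    main = max(components, key=len)
    total = pairs = 0
    for city in main:
        for other, d in hops_from(graph, city).items():
            if other != city:
                total += d
                pairs += 1
    mean = total / pairs if pairs else 0.0
    return mean, len(graph) - len(main)


# Toy network for illustration only (a single chain of cities).
TOY_NETWORK = {
    "Helsinki": {"Leningrad"},
    "Leningrad": {"Helsinki", "Moscow"},
    "Moscow": {"Leningrad", "Kazan'"},
    "Kazan'": {"Moscow", "Sverdlovsk"},
    "Sverdlovsk": {"Kazan'", "Omsk"},
    "Omsk": {"Sverdlovsk"},
}
print("full network:", network_summary(TOY_NETWORK))
print("Moscow closed:", network_summary(remove_city(TOY_NETWORK, "Moscow")))
```

In this toy chain, closing Moscow strands two cities and the mean hop count inside what remains goes down, because the longest trips are no longer possible and so are no longer counted. That is the same effect described above for the non-Moscow routes.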

Much like the US flight network in the 1950s, the Soviet network gravitated toward big cities and stopover points that led to large populations (Tulsa). Thus, the two capitals Moscow and Leningrad were major hubs. But the network also had technical and geographical limitations that partially remain today. The best example is Sverdlovsk. Novosibirsk was about the same size or even a little larger than Sverdlovsk in 1948. With the planes of the time, though, there was no way to reach Novosibirsk without stopping. Moreover, Sverdlovsk was in the middle of the dense, interconnected USSR network and could reach the more populated areas of the country. Novosibirsk was at the very edge of that network. Here is a map of the flight network on top of a population density map (it did not georeference perfectly, unfortunately) of the USSR from the late Soviet period (but still basically reflecting the population density of the 1940s):

So geographic centrality, technology and population density equalled centrality in the flight network. That really hasn't changed a whole lot, although new technology has made previously hard-to-reach cities like Novosibirsk into mini-hubs like Sverdlovsk was (and Ekaterinburg still is today). What has changed is the train-like route system Aeroflot used. No one flies from Moscow to Vladivostok now with ten stops in between. From anywhere within the former USSR, it shouldn't take more than one or two stops to fly to Vladivostok (e.g., Murmansk-Moscow(-Novosibirsk)-Vladivostok). Perhaps mail or cargo may still take these long routes but the passenger experience has become more centralized. To use one example from this network, Aktiubinsk acted as an important node connecting Central Asia with other parts of the Soviet Union. Today, Aktobe's airport has one flight to Moscow and a few to cities in Kazakhstan.

What happened to Aktobe and other cities as mini-hubs? Advances in technology probably obviated the need to use them as a refueling point. But in later flight schedules, too, regional airports tended to have more flights than they do now. The airport at Elista in Kalmykia in 1975, for example, flew to nineteen or so major cities and maybe a dozen minor cities. Now you can only fly commercially from Elista to Moscow. My guess is that when the state was running the country's aviation industry, it was possible to run flights to the corners of the Soviet Union without worrying about making a profit. It was also possible to use the flights for shipping and mail. As post-Soviet airlines commercialized, running a route to Elista or Aktobe stopped being viable. And, of course, now the wealthy can charter jets to provincial cities whereas the USSR's privileged did not have that luxury.

After playing with plane visualizations and network theory, the social-political historian who began this post from the archives has begun to wonder what this all means for the history of Soviet politics, economics or living people who flew in the USSR. Here is where an in-depth analysis of the plane network would benefit from some archival or memoir sources. A big question is--how connected were the Soviet regions to one another without the capital? A pure mathematical answer says they were pretty well connected. But I suspect the experience of Soviet passengers and airline workers would echo Chekhov's Three Sisters: To Moscow! To Moscow! To Moscow! And of course this experiment in mapping the Soviet airline network can say little about how Soviet people and politicians viewed air travel. Was it a revolution in transportation that could take people to entirely new places? This network seems to say that it was more of an evolution, a kind of express train. However, commentary from people at the time would be important to answering that question.

What I am pointing to here is both the promise and limitations of digital humanities projects that work with big data like this one. The problem with just studying an air transport network is that it has no humans. In some ways we are entering the realm of historical transportation geography and graph theory. On the other hand, this network gives an idea of the broader possibilities that ordinary people and authorities encountered, as well as their technical or geographical (social, political and physical) limitations. It can give a sense of the ways that people and things could move in the USSR.

In short, I would love to see someone write a good archival/memoir-based history of Soviet air travel that might serve as a nice companion to Lewis Siegelbaum's book on automobiles in the USSR. But I would also hope that work considers broader structural issues that I have explored here.

Reader, this has been a long post. If you have made it this far, you deserve to enjoy my favorite Soviet song about flying:

Wednesday, August 14, 2013

Six Degrees of Soviet Cinema

I have my version of the Oracle of Bacon for Soviet cinema close to finished on my computer and will post a link to it here for other Soviet cinema nerds to play with once I can figure out remote hosting. [Edit: It is now available here: Six Degrees of Soviet Cinema. Second Edit: It was throwing a server error and it was too much work to fix so I deleted it.] In the meantime, I have all the data for Soviet cinema ready. At the end I will reveal the Soviet Kevin Bacon, which is surely why most are reading this.

Here are the basics about my sources: I derived the database with Python from the website kino-teatr.ru. It lists 7,866 films, and in these films it comes up with 52,703 people who were actors, directors, screenwriters, composers, art directors, camera operators or producers. (It seems like a handful (~10) of these entries are the English language translations of the film title itself that my parser picked up as being actors. This isn't ideal but not enough to throw the centrality measures off significantly.) The collection seems like a surprisingly complete set of Soviet films. If there are missing films I had trouble finding them. For example, I looked for and found the obscure, lost 1932 film The Guy from the Missouri River, a film about a fictional agricultural commune based on the Seattle (American) Commune that I have done research on. A large number of the entries are for films that IMDB doesn't have.

There are ways that the database is incomplete, though. IMDB does a great job of including every person who worked on a film and kino-teatr.ru does not, so many people--especially crew--are not included. On the flip side, it could be argued that some films and actors should not be in the database at all. The Blue Bird, a Lenfilm/Twentieth Century Fox fantasy film co-production from 1976, is one example (and the reason that Elizabeth Taylor (Elizabet Teilor) and Jane Fonda (Dzhein Fonda) appear in the database). It also might make sense to have banned films count differently, although they are included for technical and intellectually justifiable reasons because they were part of the Soviet film industry. The database includes some television shows. (Only mini-series programs, it seems, but I could be wrong.) It does not include documentary films, which is a hard pill to swallow because it necessarily meant the exclusion of significant figures like Dziga Vertov. I included a video below as a form of apology. However, the exclusion was necessary because including documentaries would have introduced a large number of "actors" into the database who were actually the subjects of these films and not involved in their production. But all in all, it seems like the listings for fiction films are as complete as can be found without digitizing Soviet film catalogs by hand. (Of course, if there are any corrections for individual films or if there is a more complete data set I would appreciate a heads up.)

Dziga Vertov's Soviet Toys (1924)

I had two goals when putting this database together. The first was to make something similar to Oracle of Bacon that would be a little toy for Soviet cinema buffs to test out. But I also wanted to use graph theory to assess the structure of Soviet cinema as a professional network from a quantitative perspective. The main metric I think is helpful for understanding the network is centrality, the average of the distances from any one person to any other person in their network. I calculated the centrality two ways. The gold standard classic of centrality calculations (the Coca-Cola if you will) is degrees of separation between people. In this network representation it doesn't matter if two people have done one movie together or a hundred movies together; any connection is a connection. The other calculation is the centrality based on weighted distance (Pepsi?). This representation makes connections count more or less based on the number of films two people worked on together (representing a stronger connection). The first I included because it is more intuitive (if someone's centrality is two, it means that they can get to a random figure in the network in two steps). The second I included because I think it more accurately represents how professional networks operate (take a look at my last post for information on this). And remember that a lower number is more central because it means less distance from that person to get to any other person on average.

Even before I did any of these calculations, though, the main thing that jumped out at me was just how connected the world of Soviet cinema was. Of the 52,703 entries in the database, only ten cannot be connected to the main network. (The two casts from films from the interwar period Zasukha (1932) and Beloe zoloto (1929) apparently only worked on those films, respectively.) The rest--more than 52,000 people--can all be connected to each other through one film or another in six or fewer steps. In fact, the majority of figures (37,141) can reach any other figure in that network in four steps and most of the rest (15,467) can reach any other figure in five steps. There are thirty-five super connected figures who can get to anyone in three or fewer steps. In other words--and this may seem obvious--the world of Soviet cinema was very small.

There are other things this graph can tell us that are less obvious. For example, how connected was the average Soviet film worker and what can that tell us about Soviet professional networks more generally? I made a chart that aggregated the centrality of the people in the network into quarter steps.

The average was 2.76, meaning the average person in the Soviet cinema network could be connected to any other figure in between two and (more likely) three steps. A majority of people in the network fall right into that mean. Then there is a group of about 15 percent of the network who could reach anyone else in between two and two and a half steps. And then there is a very small elite group who can get to anyone in about two steps. (If I broke it down further, it would show that those under two steps are really averaging about 1.97 steps.) If this was an accurate representation of the professional network of Soviet cinema, it would seem to suggest that there were some very well connected people in Soviet cinema, many people who had average connections and only a few who were very poorly connected.
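Aggregating centralities into quarter steps amounts to a simple binning, which can be sketched like this. The function name is hypothetical and the sample values are invented for the example; they are not taken from the database.

```python
import math
from collections import Counter


def quarter_step_histogram(centralities, step=0.25):
    """Count values per bin, labelling each bin by its lower edge."""
    counts = Counter(math.floor(value / step) * step for value in centralities)
    return dict(sorted(counts.items()))


# Invented sample values, for illustration only.
sample = [1.97, 2.3, 2.6, 2.76, 2.8, 2.9, 3.1]
for lower_edge, count in quarter_step_histogram(sample).items():
    print(f"{lower_edge:.2f} to {lower_edge + 0.25:.2f}: {count}")
```

A value of 2.76 lands in the bin that runs from 2.75 up to (but not including) 3.00. Dividing each count by the total gives the percentages used when comparing groups of different sizes.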

There is another way of measuring centrality, which is to weight the network. It makes sense to account for the strength of someone's connection to others in networks like film where someone (Tim Burton) might work dozens of times with one or two highly central people (Johnny Depp) and that person's centrality to the network does not register as much as it should.

I don't think this picture even has all the Burton/Depp collaborations.

This strength of connection can be measured mathematically in various ways. The basic way is to count the number of times two people from a network are connected and make the cost of their connection equal to one/count of connections. So if person1 was in four movies with person2 who was in two movies with person3 the path from person1 to person3 would be .25 + .50 = .75. In a second calculation of centrality, I used this way of calculating shortest weighted path. Here is the chart of the aggregated weighted path centralities by quarter steps:

It turns out that a couple differences from the unweighted chart are interesting and, based on what I know about Soviet cinema, are probably more representative of the professional network of Soviet cinema. This data divides film personnel more clearly into two groups. There is a group of film personnel who are quite central (to about 1.6 on this scale), a dip and then the majority of personnel clumped around the mean. (And the dip around 1.6 would be bigger if I broke the data into tenths rather than quarters.) Compared to the unweighted data, it would suggest that there was a two-tiered hierarchy in Soviet cinema. The comparison of the two data sets (using their standard deviation, since the average weighted path is shorter) makes it clearer:

Why do I think the weighted calculation represents the professional world of Soviet cinema better than the unweighted calculation? I analyzed the issue in my last post, but in general I think that a weighted representation in a social or professional network makes more sense, since collaboration on a larger number of projects usually reflects stronger personal/professional connections than collaboration on fewer. But in the specific case of this cinema database and the reality of Soviet cinema, I also think there are conditions that make a weighted calculation more appropriate. My anecdotal observation from looking through the database is that many people in Soviet cinema did one or two films (as represented by the big bulge of people in the middle of the weighted data) and then moved on to whatever else. (And some of those people are like "Dzhein Fonda," who obviously has done many films but only one that registers in this database.) But a small-medium size group of people in the database (12,000-15,000) did many, many films and therefore was far more plugged into the Soviet film network. My interpretation based on these figures is that these people represent the core of the film industry in the USSR. (But only represent, since the entire crew of films is not included in the database.)
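The weighted distance used in this comparison, where the cost of a connection is one divided by the number of shared films, can be sketched as below. The post does not say which shortest-path routine was used, so Dijkstra's algorithm here is my assumption, as are the function names and the made-up film list, which reproduces the person1/person2/person3 example from earlier.

```python
import heapq
from collections import Counter
from itertools import combinations


def collaboration_graph(films):
    """Map each pair of collaborators to a cost of 1 / (films shared)."""
    shared = Counter()
    for people in films.values():
        for a, b in combinations(sorted(set(people)), 2):
            shared[(a, b)] += 1
    graph = {}
    for (a, b), count in shared.items():
        graph.setdefault(a, {})[b] = 1.0 / count
        graph.setdefault(b, {})[a] = 1.0 / count
    return graph


def weighted_distances(graph, start):
    """Dijkstra's algorithm: cheapest total cost from start to each person."""
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, person = heapq.heappop(heap)
        if d > dist.get(person, float("inf")):
            continue  # stale queue entry
        for neighbor, cost in graph.get(person, {}).items():
            candidate = d + cost
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


# Made-up credits: person1 and person2 share four films,
# person2 and person3 share two.
FILMS = {
    "film_a": ["person1", "person2"],
    "film_b": ["person1", "person2"],
    "film_c": ["person1", "person2"],
    "film_d": ["person1", "person2"],
    "film_e": ["person2", "person3"],
    "film_f": ["person2", "person3"],
}
graph = collaboration_graph(FILMS)
print(weighted_distances(graph, "person1")["person3"])  # 0.25 + 0.50 = 0.75
```

Averaging a person's weighted distances to everyone else gives the weighted centrality. Frequent collaborators sit very close together under this measure, which is why it rewards people who repeatedly work with the same well-connected colleagues.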

There are other interpretations that I think also might explain (or contribute to) this distribution of centrality. One that struck me as very plausible was that the explosion of media in the post-Stalin period ("Moscow Prime-Time" as Kristen Roth-Ey puts it in her book) meant that figures who worked in that period were naturally going to rate as more central overall. I had problems thinking of a really famous Stalin-era actor but we can take Mikheil (Mikhail) Gelovani--the actor who played Stalin and was therefore in many films until 1953. Gelovani ranks just 2,501st overall in the weighted calculation (about fifth percentile) and 4,873rd in the unweighted calculation (about tenth percentile). In contrast, Aleksandr Dem'ianenko--the star of Leonid Gaidai's Shurik movies and a fixture of post-war film--ranks 27th and 15th, in the first percentile for both.

Gelovani as Stalin in The Fall of Berlin, one of his best known roles

Similarly, Sergei Eisenstein ranks lower than Gaidai, even though they worked on a similar number of films. Eisenstein is in the top third and top half of directors by weighted and unweighted centrality whereas Gaidai is in the top percentile and top eighth percentile in the same rankings. Even though Eisenstein and Gaidai worked on similar numbers of films, Gaidai measures as way more central because of the comparative size of the industry during his era. That said, I think that it just takes figures from before the 1950s a hop or two to get out of their era. That just pushes them to the back of the core professional cinema group but not out of it entirely. More on the difference in the film industry in different eras in another post because my computer is currently chewing on the networks from each of the different eras.

(I'd also like to note here that the weighted-Gaidai metric proves that Leonid Gaidai is basically the Soviet Judd Apatow. Directors/producers who always work with the same people don't register as being as central in the unweighted calculation but the strong connections to collaborators show up in the weighted stats. Here is a video of the best of Gaidai. How many times do Dem'ianenko or Iurii Nikulin or Georgii Vitsin or other familiar faces appear in it? Answer: a million times.)

Best of Gaidai according to someone on YouTube

Besides the different eras, I had considered the possibility that the distribution was being affected by the different jobs of film workers. I assumed that actors would be more central on average (a lower number) because they can work on many projects while directors and other non-acting personnel have to invest more time in individual projects. This result would have suggested that the director was not the center of the Soviet film universe. And, as every book about the history of cinema that uses qualitative measures shows, it would have been inaccurate. Directors were clearly the people at the reins of the Soviet film industry. However, the data actually back up the traditional interpretation that puts the director at the center of Soviet film. Take a look at the average centrality by job:

I had to think a little about why directors came out on top, because all the top figures are actors (except Gaidai, but even he is only in the top 200 in the weighted calculation). So what is going on? According to my database there were a small number of actors who ranked very highly (less than .01% in either calculation) but the majority of actors rated as much less central than workers in other tasks in Soviet film. For example, about 43 percent of directors have weighted centrality ratings between .75 and 1.5 but only about 15 percent of actors are in those categories. In fact, actors on average have the worst centrality rankings, except for producers--which is a small category that seems to have been imported in the late 1980s. Take a look at the distributions of the different professions' centrality in the network by percentage (I just included the weighted. If anyone wants the unweighted centrality I can post that as well but it is similar.):

Of course, normalizing the data by using percentages belies the overwhelming number of actors (40,000+) in the database versus the other types of workers. The reason for this large number is that almost every movie has three times or more the credited cast as it has credited crew (according to kino-teatr.ru). I don't know that this contributed to the higher average centrality of the crew positions. It might decrease their average centrality if every crew member was included for every film. But including more credits might also make some crew who are listed but undercredited more central. What I think it really shows is that you didn't need to be a professional actor to be in a movie. Even architectural historian Vladimir Papernyi is in the database for a bit part he had! (He was in the film Leap Year (1961). His rank: somewhere in the 29,000s, not part of the core professional cinema group.) Check out his profile here. So while many people could do some acting, not every person off the street could be a director, art director (artist) or camera operator.

There is also some overlap between the categories, especially director and actor, and that seems to have added a few highly central people to the ranks of the directors. The database counts anyone who has been in any of the positions toward each of those positions. That means that someone like Aleksei Alekseev, who did a lot of work both on screen and as a voice actor for dubbed films, gets to be a director because he was the sound director on the Soviet-Italian film Life is Beautiful. He is the second most central director but based almost exclusively on his acting career rather than his scant directing credentials. There are more legitimate borderline cases, like Aleksei Batalov, who was one of the most famous Soviet actors but who also directed three films and is credited as the screenwriter of four Soviet-era films. I felt uneasy arbitrarily deciding who had done what and just used kino-teatr.ru's classification. (Who am I to say if Clint Eastwood is an actor, director or lunatic chair-shaman?)

Aleksei Batalov as the ultimate specimen of Soviet masculinity in Moscow Doesn't Believe in Tears

Finally, if you were wondering--Aleksei Batalov was not the Soviet Kevin Bacon. And really, I should say the Soviet Harvey Keitel, since Kevin Bacon is not the number one central person in IMDB; that is actually Harvey Keitel. I think the answer will be quite obscure. According to the unweighted graph, it is Iurii Sarantsev:

Iurii Sarantsev

Sarantsev fits the bill, though, in a lot of ways. He came up as a young actor just at the right time in Soviet cinema, in the early to mid 1950s, so that he could take part in the post-Stalin media explosion. He had a few starring roles but mostly was a character actor. Wikipedia credits him with being in seventy-nine (!) Soviet-era films and another fifty-two voice-acting roles for Soviet and foreign films. (The latter the database doesn't register.) And he had some pipes. Here he is as a singing taxi driver:

The weighted candidate for the honor of being the Soviet Kevin Bacon/Harvey Keitel is Nikolai Grabbe:

Grabbe is in a lot of ways similar to Sarantsev. Lots of small or medium parts, a few starring roles and maybe as many films overall as Sarantsev. But Grabbe probably gains in the weighted calculation from having started his career earlier (as a young actor during World War II in We're from the Urals) and then having been in a few bigger films (including a small role in Andrei Rublev) that connected him with other highly connected people. Here are the other top ten that the database came up with:

Top Ten Centrality Rankings of Soviet Film Personnel
Rank	Unweighted	Weighted
1	Iurii Sarantsev	Nikolai Grabbe
2	Artem Karapetian	Artem Karapetian
3	Ivan Ryzhov	Iurii Sarantsev
4	Mikhail Gluzskii	Ivan Ryzhov
5	Nikolai Grabbe	Vladimir Ferapontov
6	Mariia Vinogradova	Konstantin Tyrtov
7	Igor' Efimov	Mikhail Gluzskii
8	Aleksandr Beliavskii	Viktor Filippov
9	Daniil Netrebin	Viktor Ural'skii
10	Konstantin Tyrtov	Nikolai Smorchkov

I don't think I am being especially ignorant to say that I don't recognize any of these names. (A search in Russian on YouTube for most of these actors picks up movies they have been in but very few of the clips created to showcase the work of the truly famous. There's no Grabbe clip, for example.) From reading over their biographies, it seems that most were like Sarantsev and Grabbe: born in the 1920s or 1930s (acting from the 1940s or 1950s onward), professional acting education, tons of acting work but nothing that would have made them huge names (feel free to correct me in the comments if I am wrong about their fame). So my first reaction to this list of names was surprise: Where are all the big names I am familiar with? What about Sergei Bondarchuk or El'dar Riazanov (or Evgenii Leonov or Iurii Yakovlev, like Jared guessed)? Surely they were more important?

I was thinking about it all wrong. What this calculation of network centrality measures is not necessarily fame but rather position. The relative obscurity of these figures highlights the value of network analysis for understanding Soviet cinema. The data reveal a different kind of influence (maybe banal) lost with qualitative sources. Those sources tend to focus on the brightest figures but don't register those ubiquitous people who stood out less. In the same way you would probably not name Harvey Keitel as the most influential or important actor in Hollywood (maybe in the 1990s you might think Kevin Bacon, right?) but it makes a lot of sense to say that Keitel is one of the better positioned and networked people in film. In the same way those ubiquitous people from Soviet films were around all the time for a reason, and it probably was because, in their own way, they made Soviet cinema run.

To sum up, I think the data makes a pretty good case for a few conclusions about the Soviet film industry as a network:

It was highly interconnected.
There was a small-to-medium sized group of professional filmmakers (more than 10,000 but fewer than 15,000) that was the most connected in this network.
In that connected group, there was a small group of mega-connected actors who did lots of work but there were larger groups of professional filmmakers--especially directors, art directors and camera operators--who were at the center of Soviet cinema.

I also think I make a convincing argument that the weighted representation of the Soviet cinema network is probably more historically accurate (if less intuitive) than the unweighted network. But it would also probably be less fun to try to guess what the weighted path between Kevin Bacon and Aleksei Batalov would be. (My guess, 2.458.)

Wednesday, July 17, 2013

Gene Palma and Kevin Bacon's wife (a brief overview of network theory)

In the next weeks I am going to post some data about professional networks in Soviet cinema. But before I do that--and while my oldish computer chokes on what is a relatively small data set--I thought I would write up some of the ideas that are informing this project.

The main inspiration for my "Soviet Kevin Bacon" data is the excellent Oracle of Bacon, a site built originally by mathematician Brett Tjaden in the 1990s. The project was similar to the Erdos Number Project, a site about the prolific Paul Erdos, who published more papers than any other mathematician (and many about graph theory, which is the field of mathematics I am drawing on for this project more generally). Because mathematicians often publish papers with colleagues, many of Erdos's papers were co-written, allowing the creators of the Erdos Number Project to make a network of scholars based on their degree of separation from Erdos--their Erdos Number. For example, Dr. X was a co-author of Erdos's and therefore has an Erdos Number of 1. Dr. Y never published with Erdos but did write an article with Dr. X, and so has an Erdos Number of 2. The basic concept should be familiar to anyone who knows the "Six Degrees of Kevin Bacon" game that became popular in the mid-1990s. The main idea behind the game is that every figure in cinema has a "Bacon Number" of 6 or less. Using data from the Internet Movie Database, Tjaden created a web application that produces the shortest distance from any figure in IMDB to Kevin Bacon.

Being able to connect anyone to Kevin Bacon is a neat trick but it doesn't tell you very much about the structure of the world of Hollywood. We know from the site that Kevin Bacon has been in a lot of movies with a lot of people who have also been in a lot of movies with a lot of people. In fact, it was so unusual to find people with Bacon numbers higher than 6 that the site created a hall of fame for people who found figures with Bacon numbers of 7 or higher. But what does that tell us about the place of Kevin Bacon (or other people) in the network of cinema? And what can creating a network from IMDB tell us about the structure of professional networks in cinema in general?

This is where measuring centrality is important. Centrality is just what it sounds like--a measure of how central a person is in a network. A crude but useful way of finding centrality is to take a simple average of the number of steps it takes to get from one person to any other person. For example, Kevin Bacon has been in films with 2,769 people (Bacon Number 1). They have been in films with 305,215 people who haven't worked with Kevin Bacon (Bacon Number 2). Another 1,021,901 people have Bacon Number 3. And so on. The average Bacon Number across all of these people is just under three (2.994), meaning that the average name on IMDB who is in Kevin Bacon's network (and that is almost all of IMDB) can be connected to him in about three steps. (See this explanation for more on Kevin Bacon's centrality.) As it turns out, Kevin Bacon is not the most central actor in Hollywood but is actually closer to number 400. (Harvey Keitel is currently the most central actor.)
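For anyone who wants to see the arithmetic behind this crude measure, here is a minimal Python sketch. It assumes a tiny made-up network (the people "A" through "E" are hypothetical, not from IMDB or kino-teatr.ru), and it leaves the starting person out of the average, which is my assumption rather than something the Oracle of Bacon documents. The function names are illustrative only.

```python
from collections import deque

def distances_from(graph, source):
    """Breadth-first search: steps from source to every reachable person."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        person = queue.popleft()
        for coworker in graph[person]:
            if coworker not in dist:
                # First time we reach this person, so this is the shortest route.
                dist[coworker] = dist[person] + 1
                queue.append(coworker)
    return dist

def average_distance(graph, source):
    """Crude centrality: mean number of steps from source to everyone else."""
    dist = distances_from(graph, source)
    others = [d for person, d in dist.items() if person != source]
    return sum(others) / len(others)

# Hypothetical toy network: each person maps to the people they shared a film with.
graph = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A"},
    "D": {"B", "E"},
    "E": {"D"},
}

for person in sorted(graph):
    print(person, average_distance(graph, person))
```

In this toy network "B" comes out as the most central person (average 1.5) and "E" as the least (average 2.5). A lower average means a more central person, which is why Kevin Bacon's 2.994 can be compared directly with anyone else's score.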

This measure of centrality is useful. It gives us a basic understanding of how connected a person is in the network. And if you can get the measure of centrality of every person in the network, you can test to see if the network is a "small world," to borrow the phrase of social psychologist Stanley Milgram. If a large proportion of the people in a network had a low centrality, that would be a pretty good indication that it was a tight knit community, or at least one where there were many highly connected persons.

There are other, perhaps more accurate, ways of measuring centrality. What if we believed--as I think is indisputably the case--that some professional connections are more important than others? Kevin Bacon has worked with his wife, Kyra Sedgwick, on multiple films. It would be difficult to factor marriage into a centrality formula but working on several films together could be taken into account. Their connection should be stronger than the one between Kevin Bacon and Gene Palma, who had an uncredited role as a street drummer in 1980's Hero at Large, where Kevin Bacon played 2nd Teenager.

Ideally, the network created represents how the lived network works in practice. John Travolta is a good example of where this representation fails. John Travolta has a Bacon Number of 2, which would indicate that he was less connected to Kevin Bacon than Gene Palma. But John Travolta has a Sedgwick Number of 1, because he and Kyra Sedgwick were in Phenomenon together. It must be easier for John Travolta than for Gene Palma to reach out to Kevin Bacon but the crude measure of centrality does not represent this reality. (Also, so I don't come off as anti-Gene Palma, here is a scene with Gene Palma from Taxi Driver, his only other credit in IMDB. His two films alone earn him a centrality rating of 3.515!)

To get a more accurate representation, the connections in the network need to be weighted. I won't get too mathy but there are various ways to weight a connection, with a lower number indicating a better connection. For example, Wikipedia says that Kevin Bacon and Kyra Sedgwick have been in four movies together. If we wanted, we could count each of those movies and give the strength of their connection as 1/number of movies--so .25. As far as I know, Kyra Sedgwick has only been in one movie with John Travolta, so their connection is still 1. But now instead of the path from Kevin Bacon to John Travolta having a weight of 2, it has a weight of 1.25--the sum of the weights along the path from John Travolta to Kyra Sedgwick to Kevin Bacon. This still isn't a great representation of the reality we know, since Gene Palma (at a weight of 1) is still closer to Kevin Bacon than John Travolta. But it's possible to get a weighted path from John Travolta to Kevin Bacon in under 1 if someone has been in three movies with each of Kevin Bacon and John Travolta (1/3 + 1/3, or about .67). (I'm looking at you, Harvey Keitel.) Finding the right formula for weighting is tricky but clearly using weighted numbers can give a more accurate representation of how this network works.
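The weighting above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the 1/number-of-movies weight described in the paragraph, the film counts mentioned in this post (four for Bacon and Sedgwick, one for Sedgwick and Travolta, one for Bacon and Palma), and Dijkstra's algorithm as one standard way to find the lowest-weight path. The names `build_graph` and `weighted_distance` are made up for the example.

```python
import heapq
from collections import defaultdict

def build_graph(shared_films):
    """Turn (person, person, films together) rows into 1/count edge weights."""
    graph = defaultdict(dict)
    for a, b, count in shared_films:
        weight = 1 / count  # more films together means a shorter, stronger link
        graph[a][b] = weight
        graph[b][a] = weight
    return graph

def weighted_distance(graph, start, goal):
    """Dijkstra's algorithm: smallest total weight from start to goal."""
    best = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        dist, person = heapq.heappop(heap)
        if person == goal:
            return dist
        if dist > best.get(person, float("inf")):
            continue  # stale heap entry; a shorter route was already found
        for coworker, weight in graph[person].items():
            candidate = dist + weight
            if candidate < best.get(coworker, float("inf")):
                best[coworker] = candidate
                heapq.heappush(heap, (candidate, coworker))
    return None  # no connection at all

# Film counts as given in the post; everything else about the network is omitted.
graph = build_graph([
    ("Kevin Bacon", "Kyra Sedgwick", 4),
    ("Kyra Sedgwick", "John Travolta", 1),
    ("Kevin Bacon", "Gene Palma", 1),
])

print(weighted_distance(graph, "John Travolta", "Kevin Bacon"))  # 1.25
print(weighted_distance(graph, "Gene Palma", "Kevin Bacon"))     # 1.0
```

With only these three connections, Travolta sits at 1.25 and Palma at 1.0, matching the numbers worked out by hand. A different weighting formula would only require changing the single line that computes `weight`.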

So that is a brief introduction (albeit kind of long for a blog post) to some ideas and applications in graph theory, using Oracle of Bacon as the example. As a non-mathematician, I found mathematician/sociologist Duncan Watts's Six Degrees to be a useful and relatively non-technical introduction to developments in graph theory, especially those involving social networks. But for now this should be enough for me to post information about the network of Soviet cinema in the upcoming weeks.