February 13, 2014

No, movies haven't gotten (much) longer

data

Have movies gotten longer?

It often feels like they have, right? I mean, The Hobbit was just under three hours (and felt like just over four!), The Avengers was 142 minutes, and the sheer volume of people complaining about how long movies are lends some credence to the theory that yes, Hollywood has decided to drown its consumers in running time.

But I just caught The Good, The Bad, and The Ugly on Netflix 1 and here's the thing: despite being a very old movie -- and a very good movie -- it is also a very long movie. In fact, the English dub clocks in at 179 minutes -- literally one minute less than three hours. Is it possible that movies haven't gotten longer, and that our attention spans have just gotten shorter?

I decided to answer this question in the best way possible -- with large amounts of data.

Preprocessing

Getting the data was actually incredibly simple. I was so ready to write a scraper or two, but then IMDB had to go and release all of their data for public consumption. In particular, they have a file called running-times.list, which contains exactly what you might think it contains.

The format, though, leaves something to be desired. Take this snippet from near the beginning of the file:

"#OscarTheOuch" (2013)                  30
"#Yaprava" (2013)                   60  (with commercials)
"#lovemilla" (2013)                 6   (approx.)
"#waszurwahl" (2013)                    25
"$#*! My Dad Says" (2010)               USA:30
"$#*! My Dad Says" (2010) {Code Ed (#1.4)}      30
"$#*! My Dad Says" (2010) {Goodson Goes Deep (#1.12)}   30
"$#*! My Dad Says" (2010) {Make a Wish (#1.9)}      30
"$#*! My Dad Says" (2010) {Pilot (#1.1)}        30
"$#*! My Dad Says" (2010) {Wi-Fight (#1.2)}     30
"$100 Makeover" (2010)                  USA:30

A few issues jump out immediately:

  • The data dump includes television shows.
  • The data's not standardized at all: running times are prepended with countries, or appended with annotations.
  • They turned @shitmydadsays into a TV show? Seriously?

There are a few approaches we can take to clean this data, considering all we care about is release date and run time. First, let's load the thing into memory and grab the relevant lines:

# Drop the first 15 lines and the last 2 (presumably the file's header and footer).
all_data = open('running-times.list').readlines()[15:-2]
>>> len(all_data)
876277

Jeez, that's a lot of (mainly awful) data. Let's try and munge it a little, shall we?

First, the two columns (name + release date and running time) are separated by tabs, so let's split those up and do some basic cleanup.

parsed_data = []
for line in all_data:
    split_line = line.split('\t')
    try:
        # Get the first column.
        release_date = split_line[0]
        # First column looks like MOVIE_NAME (RELEASE_YEAR), so split at "(" and grab everything after.
        release_date = release_date.split("(")[1]
        # Years are four digits (and numerical), right?
        release_date = int(release_date[:4])

        # Get the last column if it's got what we want; otherwise grab the one before it, for cases like:
        # > Zubin and the I.P.O. (1983) (TV)            USA:59  (with commercials)
        run_time = split_line[-1] if "(" not in split_line[-1] else split_line[-2]
        # Some of the lines are in the form of COUNTRY:TIME, so split at the colon.
        run_time = run_time.split(":")[-1]
        # Strip the newline character and convert to integer.
        run_time = int(run_time.strip())

        parsed_data.append([release_date, run_time])

    except (IndexError, ValueError):
        # No "(" in the first column, or a year/run time that isn't a number: skip the line.
        print("Ugh, edge case.")

And if we were to print the first ten lines now, they look something like this:

[[2013, 7], [2013, 7], [2013, 7], [2013, 6], [2013, 6], [2013, 30], [2013, 60], [2013, 6], [2013, 25], [2010, 30]]

Excellent!
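As a cross-check on the split-based parsing above, here is a minimal sketch of the same extraction done with a single regular expression. The pattern and the `parse_line` helper are my own illustration, not something from the IMDb dump's documentation, and the sample lines assume the columns really are tab-separated:

```python
import re

# Hypothetical pattern for one line of running-times.list.
LINE_RE = re.compile(r"""
    ^.*?\((?P<year>\d{4})   # first opening parenthesis followed by four digits
    [^\t]*\t+               # rest of the first column, then one or more tabs
    (?:[^\t:]+:)?           # optional country prefix such as USA:
    (?P<minutes>\d+)        # run time in minutes
    """, re.VERBOSE)


def parse_line(line):
    """Return [year, minutes] for one line, or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return [int(match.group("year")), int(match.group("minutes"))]


# Sample lines rebuilt from the snippet near the top of the file (tabs assumed).
samples = [
    '"#OscarTheOuch" (2013)\t\t\t30\n',
    '"#Yaprava" (2013)\t\t\t60\t(with commercials)\n',
    '"$#*! My Dad Says" (2010)\t\t\tUSA:30\n',
    '"$#*! My Dad Says" (2010) {Code Ed (#1.4)}\t\t30\n',
]
parsed = [parse_line(s) for s in samples]
print(parsed)
```

Returning None for lines that don't match plays the same role as the "edge case" branch: anything without a four-digit year or a numeric run time is simply skipped.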

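Now, we need to make sure that we aren't catching any TV shows in this list, so let's filter out anything whose run time is an exact multiple of thirty minutes. That catches the thirty- and sixty-minute slots (and, as a side effect, anything listed at exactly 90 or 120 minutes):

parsed_films = list(filter(lambda x: x[1] % 30, parsed_data))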
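It's worth spelling out what that lambda actually keeps. `x[1] % 30` is zero (and therefore falsy) for every multiple of thirty, so the filter drops 90- and 120-minute entries along with the 30s and 60s. A small sketch on made-up rows, with a stricter variant for comparison (the stricter variant is not what produced the numbers in this post):

```python
# Hypothetical [year, run_time] rows, not taken from the IMDb dump.
rows = [[2010, 30], [2010, 60], [1966, 179], [2012, 142], [1999, 90], [2001, 120]]

# Same test as the lambda: keep a row only when the remainder is non-zero.
kept = [row for row in rows if row[1] % 30]

# Stricter variant: drop only run times of exactly 30 or 60 minutes.
kept_strict = [row for row in rows if row[1] not in (30, 60)]

print(kept)
print(kept_strict)
```

The list comprehension is used here because in Python 3 `filter` returns a lazy iterator rather than a list.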

The Fun Stuff

Excellent! Now we're ready to do the fun stuff. Let's load this bad boy into pandas.

import pandas as pd
films = pd.DataFrame(parsed_films)
films.describe()

And the output:

                   0              1
count  682406.000000  682406.000000
mean     1993.398848      50.930266
std        22.807275      47.485743
min       100.000000       1.000000
25%      1984.000000      14.000000
50%      2003.000000      43.000000
75%      2010.000000      85.000000
max      2019.000000    3608.000000

Well, that's strange, right? Some things look normal -- the average release date is 1993 -- but the average run-time is fifty minutes? The latest movie is from five years in the future? Something's gotta be off -- we're not done cleaning things up yet.

The answer to the second question is obvious -- IMDB's going to have some inaccurate and irrelevant data, and with around six hundred thousand items a few outliers aren't going to harm the integrity of our analysis. Still, the first question is a bit more concerning, and the answer is probably obvious to someone with experience in the field 2 -- television shows aren't actually thirty or sixty minutes long. Without commercials, they tend to be 21 and 42 minutes respectively. Since there's some variance, I'm going to be lazy and assume that anything 45 minutes or shorter is on the small screen. While we're at it, we can drop anything dated 2015 or later. Pandas makes filtering easy:

filtered_films = films[(films[0] < 2015) & (films[1] > 45)]
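If the boolean indexing looks opaque, here is a minimal sketch of the same idea on a toy frame. The rows and the column names `year` and `run_time` are made up for illustration (the real frame uses the default columns 0 and 1), and the run-time mask keeps values above 45 minutes, which is what the minimum of 46 in the summary that follows implies:

```python
import pandas as pd

# Hypothetical rows standing in for parsed_films: [release year, run time in minutes].
rows = [[2013, 21], [2019, 95], [1966, 179], [2012, 142], [2010, 42], [1999, 45]]
films = pd.DataFrame(rows, columns=["year", "run_time"])

# Each comparison yields a boolean Series with one entry per row.
is_released = films["year"] < 2015
is_feature = films["run_time"] > 45

# & combines the two masks element-wise; indexing keeps the rows where both hold.
filtered = films[is_released & is_feature]
print(filtered)
```

Naming the masks also sidesteps the parentheses that the inline form needs, since `&` binds more tightly than `<` and `>`.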

And the new summary statistics:

                   0              1
count  318319.000000  318319.000000
mean     1988.178550      89.202935
std        23.106662      43.874200
min      1725.000000      46.000000
25%      1974.000000      70.000000
50%      1996.000000      87.000000
75%      2007.000000     100.000000
max      2014.000000    3608.000000

Oh man, that looks great, doesn't it? 3 We have a bona fide collection of x and y values, and now all we need to do is group them by release date to have a genuine time-series dataframe on our hands. And, of course, that's super easy:

films_by_year = filtered_films.groupby(0).mean()
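For anyone who hasn't used `groupby` before, here is a small sketch of what that one-liner does, on made-up rows with named columns. It also computes a per-year median next to the mean; that part is my own addition rather than something used for the chart, included only because a single entry like the 3608-minute one can drag a yearly mean around:

```python
import pandas as pd

# Hypothetical rows: two films in 1973, three in 1993 (one of them a huge outlier).
rows = [[1973, 80], [1973, 90], [1993, 88], [1993, 92], [1993, 3608]]
films = pd.DataFrame(rows, columns=["year", "run_time"])

# Group the run times by release year, then reduce each group to one number.
by_year = films.groupby("year")["run_time"]
mean_by_year = by_year.mean()
median_by_year = by_year.median()

print(mean_by_year)
print(median_by_year)
```

Both results are Series indexed by year, so something like `mean_by_year[1973]` pulls out a single year's figure.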

And then we just throw that into a line graph:

import vincent
line = vincent.Line(films_by_year)
line.axis_titles(x='Year', y='Run time')
line.to_json('movies.json', html_out=True, html_path='movies_template.html')

And we get our result!

Conclusion

Average running time of 1973 movies: 87.47 minutes.

Average running time of 1993 movies: 89.9 minutes.

Average running time of 2013 movies: 88.1 minutes.

So, yeah, unless you spend enough money at the cinema to notice a difference of two or three minutes per movie across several decades, chances are it's all in your head. There are some definite Hollywood trends -- the reglorification of the blockbuster superhero movie, the phoenix-esque death and rebirth of Matthew McConaughey 4 -- but longer run times are not one of them. An upward trend exists, but -- especially if you ignore the pre-1960 trend, which I feel doesn't apply to the majority of today's moviegoers -- it is incredibly exaggerated.
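To put a rough number on "slight", here is a back-of-the-envelope sketch that fits a straight line through just the three yearly averages quoted above. Three points are far too few for a serious fit, so treat this as an illustration of scale rather than an estimate from the full dataset:

```python
import numpy as np

# The three yearly averages quoted in the conclusion.
years = np.array([1973, 1993, 2013])
minutes = np.array([87.47, 89.9, 88.1])

# Least-squares line through the points; centering the years keeps the fit well conditioned.
slope, intercept = np.polyfit(years - years.mean(), minutes, 1)

# Net change between the first and last of the three years.
net_change = minutes[-1] - minutes[0]

print("slope: %.5f minutes per year" % slope)
print("net change 1973 to 2013: %.2f minutes" % net_change)
```

On these three points the slope works out to roughly 0.016 minutes per year, or well under a minute across forty years.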


  1. If you haven't seen The Good, The Bad, and The Ugly, close this tab, watch it immediately, and then reopen this tab. Trust me, it's worth it. 

  2. Or, more likely, someone with experience binging Netflix. 

  3. If you're concerned about that movie with a run-time of sixty hours, have a look for yourself. 

  4. A McConaughssance, if you will. 
