No, movies haven't gotten (much) longer
Have movies gotten longer?
It often feels like they have, right? I mean, The Hobbit was just under three hours (and felt like just over four!) – The Avengers was 142 minutes, and the sheer volume of people complaining about how long movies are lends some credence to the theory that yes, Hollywood has decided to drown their consumers in running time.
But I just caught The Good, The Bad, and The Ugly on Netflix 1 and here’s the thing: despite being a very old movie – and a very good movie – it is also a very long movie. In fact, the English dub clocks in at 179 minutes – literally one minute less than three hours. Is it possible that movies haven’t gotten longer, and that our attention spans have just gotten shorter?
I decided to answer this question in the best way possible – with large amounts of data.
Preprocessing
Getting the data was actually incredibly simple. I was so ready to write a scraper or two, but then IMDB had to go and release all of their data for public consumption. In particular, they have a file called running-times.list
, which contains exactly what you might think it contains.
The format, though, leaves something to be desired. Take this snippet from near the beginning of the file:
"#OscarTheOuch" (2013) 30
"#Yaprava" (2013) 60 (with commercials)
"#lovemilla" (2013) 6 (approx.)
"#waszurwahl" (2013) 25
"$#*! My Dad Says" (2010) USA:30
"$#*! My Dad Says" (2010) {Code Ed (#1.4)} 30
"$#*! My Dad Says" (2010) {Goodson Goes Deep (#1.12)} 30
"$#*! My Dad Says" (2010) {Make a Wish (#1.9)} 30
"$#*! My Dad Says" (2010) {Pilot (#1.1)} 30
"$#*! My Dad Says" (2010) {Wi-Fight (#1.2)} 30
"$100 Makeover" (2010) USA:30
A few issues jump out immediately:
- The data dump includes television shows.
- The data’s not standardized at all: running times are prepended with countries, or appended with annotations.
- They turned @shitmydadsays into a TV show? Seriously?
There are a few approaches we can take to clean this data, considering all we care about is release date and run time. First, lets load the thing into memory and grab the relevant lines:
all_data = open('running-times.list').readlines()[15:-2]
>>> len(all_data)
876277
Jeez, that’s a lot of (mainly awful) data. Let’s try and munge it a little, shall we?
First, the two columns (name + release date and running time) are separated by tabs, so lets split those up and do some basic cleanup.
for line in all_data:
split_line = line.split('\t')
try:
# Get the first column.
release_date = split_line[0]
# First column looks like MOVIE_NAME (RELEASE_YEAR), so split at "(" and grab everything after.
release_date = release_date.split("(")[1]
# Years are four digits (and numerical), right?
release_date = int(release_date[:4])
# Get the last column if it's got what we want; otherwise grab the one before it, for cases like:
# > Zubin and the I.P.O. (1983) (TV) USA:59 (with commercials)
run_time = split_line[-1] if "(" not in split_line[-1] else split_line[-2]
# Some of the lines are in the form of COUNTRY:TIME, so split at the colon.
run_time = run_time.split(":")[-1]
# Strip the newline character and convert to integer.
run_time = int(run_time.strip())
parsed_data.append([release_date, run_time])
except:
print "Ugh, edge case."
And if we were to print the first ten lines now, they look something like this:
[[2013, 7], [2013, 7], [2013, 7], [2013, 6], [2013, 6], [2013, 30], [2013, 60], [2013, 6], [2013, 25], [2010, 30]]
Excellent!
Now, we need to make sure that we aren’t catching any TV shows in this list, so lets filter out anything with a run time of exactly thirty or sixty minutes:
parsed_films = filter(lambda x: x[1] % 30, parsed_data)
The Fun Stuff
Excellent! Now we’re ready to do the fun stuff. Let’s load this bad boy into pandas
.
import pandas as pd
films = pd.DataFrame(parsed_films)
films.describe()
And the output:
0 1
count 682406.000000 682406.000000
mean 1993.398848 50.930266
std 22.807275 47.485743
min 100.000000 1.000000
25% 1984.000000 14.000000
50% 2003.000000 43.000000
75% 2010.000000 85.000000
max 2019.000000 3608.000000
Well, that’s strange, right? Some things look normal – the average release date is 1993 – but the average run-time is fifty minutes? The latest movie is from five years in the future? Something’s gotta be off – we’re not done cleaning things up yet.
The second answer is obvious – IMDB’s going to have some inaccurate and irrelevant data, and with around six hundred thousand items a few outliers aren’t going to harm the integrity of our analysis. Still, the first question is a bit more concerning, and the answer is probably obvious for someone with experience in the field 2 – television shows aren’t actually thirty or sixty minutes long. Without commercials, they tend to be 21 and 42 respectively: since there’s some variance, I’m going to be lazy and assume that anything shorter than 45 minutes is on the small screen. Pandas makes filtering easy:
filtered_films = films[(films[0] < 2015) & (films[1] < 45)]
And the new summary statistics:
0 1
count 318319.000000 318319.000000
mean 1988.178550 89.202935
std 23.106662 43.874200
min 1725.000000 46.000000
25% 1974.000000 70.000000
50% 1996.000000 87.000000
75% 2007.000000 100.000000
max 2014.000000 3608.000000
Oh man, that looks great, doesn’t it? 3 We have a bonified collection of x and y values, and now all we need to do is group them by release date to have a genuine time-series dataframe on our hands. And, of course, that’s super easy:
films_by_year = filtered_films.groupby(0).mean()
And then we just throw that into a line graph:
import vincent
line = vincent.Line(films_by_year)
line.axis_titles(x='Year', y='Run time')
line.to_json('movies.json', html_out=True, html_path='movies_template.html')
And we get our result!:
Conclusion
Average running time of 1973 movies: 87.47 minutes.
Average running time of 1993 movies 89.9 minutes.
Average running time of 2013 movies: 88.1 minutes.
So, yeah, unless you spend enough money at the cinema to be able to deduce a slight upward trend of three extra minutes per movie per year, chances are it’s all in your head. There are some definite Hollywood trends – the reglorification of the blockbuster superhero movie, the phoenix-esque death and rebirth of Matthew McConaughey 4 – but longer run times are not one of them. It exists, but – especially ignoring the pre-1960 trend, which I feel like doesn’t apply to the majority of today’s moviegoers – is incredibly exaggerated.
- If you haven’t seen The Good, The Bad, and The Ugly, close this tab, watch it immediately, and then reopen this tab. Trust me, it’s worth it. [return]
- Or, more likely, someone with experience binging Netflix. [return]
- If you’re concerned about that movie with a run-time of sixty hours, have a look for yourself. [return]
- A McConaughssance, if you will. [return]