Yesterday, the New England Patriots posted a 43-22 victory over the Indianapolis Colts, which is the first time that final score has ever occurred in NFL history.

This got me thinking about the distribution of various final point differentials over the years – and, since Pro Football Reference has an ‘export as CSV’ option, I decided to spend an evening tinkering around with the data in pandas, matplotlib, and the newly discovered vincent.

I’ll give you the good stuff up front, then explain how I did it:

[Bar chart: number of games by final point differential, with the x-axis running from 0 to 73 and the y-axis ("Games") from 0 to 1,800]

So how did I do it?

Parsing the data was relatively easy:

# source: http://www.pro-football-reference.com/boxscores/game_scores.cgi
import pandas as pd

SOURCE_FILE = "./nflscores.csv"

data = pd.read_csv(SOURCE_FILE, header=0)
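(A quick aside for anyone following along without the CSV handy: the gotcha shows up even on a toy version of the file. The rows and values below are made up, but the column names mirror the real export – the repeated header row forces pandas to read every column as text.)

```python
import io
import pandas as pd

# A toy version of the Pro Football Reference export: the site repeats
# the header row partway through the file, which is what trips up pandas.
toy_csv = io.StringIO(
    "Rk,PtsW,PtsL,PtDif,Count\n"
    "1,43,22,21,1\n"
    "Rk,PtsW,PtsL,PtDif,Count\n"
    "2,20,17,3,250\n"
)

toy = pd.read_csv(toy_csv, header=0)

# Because the string 'PtDif' appears as a value, the whole column
# comes back as object (text) rather than int64:
print(toy.dtypes['PtDif'])  # object
```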

Though I had to get rid of annoying interstitial headers that were causing pandas to interpret the columns as text:

header_rows = data.apply(lambda row: row['Rk'] == 'Rk', axis=1)
data = data[~header_rows]

data[['PtDif', 'Count']] = data[['PtDif', 'Count']].astype('int')
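(The same two-step cleanup can be sanity-checked on a toy frame – the column names here mirror the real export, but the rows are made up.)

```python
import pandas as pd

# Made-up rows in the shape of the PFR export, including one
# interstitial header row that snuck into the data.
toy = pd.DataFrame(
    {'Rk': ['1', 'Rk', '2'],
     'PtDif': ['21', 'PtDif', '3'],
     'Count': ['1', 'Count', '250']}
)

# Drop any row that is really a repeated header...
header_rows = toy.apply(lambda row: row['Rk'] == 'Rk', axis=1)
toy = toy[~header_rows].copy()

# ...and only then can the numeric columns be cast.
toy[['PtDif', 'Count']] = toy[['PtDif', 'Count']].astype('int')
print(toy['Count'].sum())  # 251
```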

Since the data source included the differential as a column, it was easy to group all of the final scores. Then we had to fill in zeroes for any differential that never occurred:

score_differentials = data.groupby('PtDif').sum()['Count']

populate_histogram = lambda diff: score_differentials[diff] if diff in score_differentials else 0
histogram = [populate_histogram(i) for i in range(74)]
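(Incidentally, pandas can do the zero-filling itself via Series.reindex, which should be equivalent to the list comprehension above – toy counts here, since the real ones come from the CSV.)

```python
import pandas as pd

# Toy counts: pretend only differentials 3, 7, and 21 ever occurred.
score_differentials = pd.Series({3: 250, 7: 180, 21: 1}, name='Count')

# reindex inserts every missing differential with a count of 0.
histogram = score_differentials.reindex(range(74), fill_value=0).tolist()

print(histogram[3], histogram[4], len(histogram))  # 250 0 74
```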

Lastly, we create the bar graph itself (it’s called line here because I had it originally as a line graph, because I am a dummy):

import vincent

line = vincent.Bar(histogram)
line.axis_titles(x='Point differential', y='Games')
line.height = 300
line.width = 900
ax = vincent.AxisProperties(labels = vincent.PropertySet(angle=vincent.ValueRef(value=90)))
line.axes[0].properties = ax
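(vincent hasn't been maintained in a while, so for anyone following along today, roughly the same bar chart can be drawn straight in matplotlib – the sizing and label calls below are my approximations of the vincent settings above, and the histogram is a stand-in for the real 74-element list.)

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; drop this for interactive use
import matplotlib.pyplot as plt

# Stand-in histogram; in the post this is the 74-element list built above.
histogram = [0, 5, 120, 250, 0, 90] + [0] * 68

fig, ax = plt.subplots(figsize=(9, 3))  # roughly the 900x300 vincent sizing
ax.bar(range(len(histogram)), histogram)
ax.set_xlabel('Point differential')
ax.set_ylabel('Games')
ax.tick_params(axis='x', labelrotation=90)  # the AxisProperties angle=90 bit
fig.savefig('differentials.png')
```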

Still, this is literally collapsing two-dimensional data into one dimension. After puttering around for a little bit, I decided it would be interesting to convert this into a heatmap, with the x-axis representing points scored by the winning team and the y-axis representing points scored by the losers. Since vincent doesn’t seem to have heatmap functionality, I turned back to matplotlib:

data[['PtsW', 'PtsL']] = data[['PtsW', 'PtsL']].astype('int')
pivoted_data = data.pivot(index='PtsW', columns='PtsL', values='Count')
# This is devastatingly ugly code.
populate_heatmap = lambda x, y: pivoted_data[x][y] if x in pivoted_data and y in pivoted_data[x] else 0
heatmap_data = pd.DataFrame([[populate_heatmap(x, y) for y in range(73)] for x in range(73)])

import matplotlib.pyplot as plt

plt.pcolor(heatmap_data, cmap=plt.cm.Blues, alpha=0.8, vmin=0, vmax=50)

Some points here:

  • There’s gotta be an easier way to fill in the dataframe than what I did, but I was feeling lazy: I’d imagine something using np.zeros would do the trick.
  • If, like me, you are not readily equipped with encyclopedic knowledge of the pcolor() method, which does the vast majority of the work here: you specify the color palette via cmap, and the range of values it maps via vmin and vmax. (You’ll notice that I top off at 50, which means that a final score with 52 occurrences will appear the same as one with 520.)
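(On the first point: the np.zeros route would work, but DataFrame.reindex can also pad out both axes directly, which should do the same job as the nested comprehension – toy data below, since the real pivot comes from the CSV.)

```python
import pandas as pd

# Toy pivot: rows are winning scores, columns losing scores, values game counts.
pivoted = pd.DataFrame(
    {'PtsW': [43, 20, 20], 'PtsL': [22, 17, 3], 'Count': [1, 250, 9]}
).pivot(index='PtsW', columns='PtsL', values='Count')

# Pad both axes out to the full 0-72 range, then turn the NaNs into zeroes.
heatmap_data = pivoted.reindex(index=range(73), columns=range(73)).fillna(0)

print(heatmap_data.shape)        # (73, 73)
print(heatmap_data.loc[20, 17])  # 250.0
```

Depending on which score you want on which axis, a final .T may be needed before handing this to pcolor().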

The result:

Some immediate observations:

  • matplotlib is incredibly ugly.
  • There are some pretty cool patterns, namely the absence of certain scores like 8, 11, and 18 – which would involve some really strange play-calling.
  • We get a rather neat triangular formation from the simple reality that the losing team never scores more points than the winning team (and thus there are no points above the line y = x).

We can try to recreate it in Google Charts to make it a little prettier:

Anyway, that’s all I have – if you found it interesting, either from a programming or a football perspective, please share it! I’ve uploaded the entire script (warts and all) to GitHub so feel free to play around with it. If you have any questions or ideas for the data, definitely let me know either via email or in the comments.
