Justin Duke

Better beer snobbery through data

Ever since moving to Seattle, I’ve felt that I needed to step my beer game up in a very serious way: gone are the magical, innocent days of my youth where I could feel dignified drinking a PBR and waxing poetic about the differences between Natural and Keystone Light. I’ve been very quickly introduced to the rabbit-hole of microbreweries, which has resulted in some very happy, very hoppy experiences 1 – but the dizzying array of selections in your average Seattle beer shop left me feeling as though there should be an easy way to streamline the entire process.

For fun, I decided to hop 2 over to my new favorite site, BeerAdvocate, and see if they had anything that would make my life easier.

Thankfully, they’ve got a list of the top 250 beers as rated by users, and I thought it would be interesting to tinker with the data and see if I could get any decent heuristics to make my ale-sipping lifestyle any easier.

Grabbing the data proved to be an absolute slog, due to – how do I put this kindly? – less than semantic markup in the tables themselves. Thankfully, Kimono – sort of a WYSIWYG web scraper – condensed four hours of flailing around in Python into five minutes of clicking around. One CSV later, and I was off to the races!

The first thing I wanted to do was identify top-notch breweries to keep my eye out for. Using pandas, this was a cinch:

import pandas

filename = 'beer.csv'
data = pandas.read_csv(filename)

# Count how many top-250 beers each brewery has, sorted from fewest to most.
breweries = data.groupby('brewery/_text').size().sort_values()

The results are graphed below, with Hill Farmstead Brewery, The Bruery, and Brasserie Cantillon leading the pack:
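If you want to poke at the counting step without the scraped file, here is a minimal sketch on a toy frame. The six rows are made up purely for illustration (only the column name and the brewery names come from this post), and `value_counts` is swapped in as a shorthand that should give the same tallies as the group-by:

```python
import pandas as pd

# Toy rows standing in for the scraped CSV; the counts here are invented.
toy = pd.DataFrame({
    "brewery/_text": [
        "Hill Farmstead Brewery", "The Bruery", "Hill Farmstead Brewery",
        "Brasserie Cantillon", "Hill Farmstead Brewery", "The Bruery",
    ],
})

# value_counts tallies each distinct value and sorts from most to fewest.
brewery_counts = toy["brewery/_text"].value_counts()

# The group-by route gives the same tallies once sorted the same way.
via_groupby = toy.groupby("brewery/_text").size().sort_values(ascending=False)

# Keep just the breweries worth watching for.
top_two = brewery_counts.head(2)
print(top_two)
```

Swapping `head(2)` for `head(10)` on the real data would give a shortlist to take to the beer shop.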

I thought it’d also be interesting to see where all of these amazing breweries were located. Since BeerAdvocate has profile pages for each brewery, it was relatively easy to grab the data:

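from collections import Counter

import panther
import us

# Deduplicate the breweries before looking them up.
breweries = list(set(data['brewery']))
# Matches links whose href points into the US state directory.
selector = "a[href*='/place/directory/9/US']"
locations = panther.pounce(breweries, selector)
# Keep the text of the first match for each brewery; drop breweries with no match.
locations = [x[0].text for x in locations if x]
locations = Counter(locations)

# Start every state at zero so states with no breweries still appear.
all_locations = {state: 0 for state in map(str, us.states.STATES)}
all_locations.update(locations)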
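The tally-then-zero-fill step is the part most likely to trip someone up, so here is a rough standard-library sketch of it. The scraped values and the four-state list are hypothetical stand-ins (the real code gets state names from `us.states.STATES` and the lookups from the scraper):

```python
from collections import Counter

# Hypothetical scrape output: a state name per brewery, or None when the lookup failed.
scraped = ["California", None, "Oregon", "California", "Vermont", None]

# Drop the failed lookups, then tally breweries per state.
tally = Counter(state for state in scraped if state is not None)

# Short stand-in for the full list of state names.
state_names = ["California", "Oregon", "Vermont", "Washington"]

# Start every state at zero, then overwrite with the real tallies.
all_locations = {state: 0 for state in state_names}
all_locations.update(tally)

print(all_locations)
```

Because `all_locations` is a plain dict, `update` overwrites the zeros rather than adding to them, which is what you want when each state should end up with exactly its tallied count.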

The winners 3, somewhat unsurprisingly, are Oregon, Colorado, Michigan, and California. Only one Washington-based brewery made it, which is something of a travesty.

As a matter of personal curiosity, I wanted to see what styles were most popular as well.

styles = data.groupby('style/_text').size().sort_values()

Unsurprisingly, stouts and IPAs lead the pack, though two styles had surprisingly high (at least for me) representation: Russian Imperial Stouts and what BeerAdvocate classifies as Lambic - Fruit, which is described as:

In the case of Fruit Lambics, whole fruits are traditionally added after spontaneous fermentation has started. Kriek (cherries), Framboise (raspberries), Pêche (peach) and Cassis (black currant) are common fruits, all producing subtle to intense fruit characters respectively. Once the fruit is added, the beer is subjected to additional maturation before bottling. Malt and hop characters are generally low to allow the fruit to consume the palate. Alcohol content tends to be low.

(Also, a quick aside about why IPAs are called IPAs – back in the day when England had colonies in India, their typical beer wouldn’t last the long ocean voyages to the Indian colonies: as a result, they added more hops to act as something of a preservative, which of course resulted in a much stronger – and tastier – beverage. 4)

Lastly, I thought it’d be interesting to look at the actual alcohol content of these prize beers. One of the biggest surprises to me – coming from the collegiate world where a Natty Ice would pack a whopping 5.9% ABV – is how strong a good beer could be (which required me to untrain myself from binging through a six-pack on Friday nights, as I would quickly realize on Saturday mornings). Turns out most great beers hover around the 7-9% mark, with some absolutely insane outliers like Sam Adams Utopias at 29% ABV:

import numpy as np
# One-percent-wide ABV bins with edges from 0 to 29; returns the counts and the bin edges.
vals, keys = np.histogram(list(data['abv']), bins=range(30))
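One detail worth knowing about that binning: every bin is half-open except the last, which also includes its right edge, so a 29% beer still gets counted. A small sketch, using only the two ABV figures mentioned above plus two made-up values in the 7-9% range:

```python
import numpy as np

# 5.9 and 29.0 come from the post; 7.0 and 8.5 are illustrative placeholders.
abv = [5.9, 7.0, 8.5, 29.0]

# Edges 0, 1, ..., 29 give 29 bins; bin i covers [i, i + 1), except the
# last bin, which is closed on both ends: [28, 29].
counts, edges = np.histogram(abv, bins=range(30))

# Report only the bins that actually contain a beer.
occupied = {int(edges[i]): int(n) for i, n in enumerate(counts) if n > 0}
print(occupied)
```

Anything above 29% would fall outside the last edge and be silently dropped, so the upper bound needs bumping if a stronger beer ever makes the list.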

So, the obvious insight to take away from this analysis is that if you want to drink good beer, make sure it’s a 7% IPA from California. Happy drinking and let me know if you have any questions!


  1. Personal favorites so far: the Lagunitas IPA and Elysian Split Shot. But I am still a neophyte.
  2. See what I did there? It’s a pun. Because beer has hops. Why aren’t you laughing?
  3. This analysis is kinda fundamentally flawed because we’re looking at absolute occurrences instead of relative frequencies of great breweries in a given state, but who cares?
  4. I’m only like 65% sure this is accurate. The person who told me had spent the past four hours becoming well-acquainted with the subject matter, so to speak.