April 10, 2014

Analyzing the price of Legos with Python

Legos are perfect. I liked them as a kid, I liked them as a slightly older kid (even while pretending that I was too old for such things), and I like them now (having graduated from ninja temples to recreations of Fallingwater). There's something about playing with them that brings out creativity and imagination: even though I might not tinker with them as much as I did many years ago, they evoke a very specific sort of joy that I can't resist.

Which brings me to the downside of having such a childlike habit: as an adult, you have to buy your own toys.

And jeez, these little bricks can get expensive: that Fallingwater set runs for $99.99, which is low compared to some of the "grown-up" sets dedicated to adult Lego fans like myself, who have more disposable income than self-control.

But have they always been so expensive?

To answer this question, I needed to get as much data about Legos as I could find. Thankfully, I found BrickSet, a sort of Lego database, which comes with its own handy export feature so I didn't even have to bother with scraping. (I uploaded a .csv of their entire data as a Gist to avoid hammering their servers, but I recommend checking out the site — pretty neat stuff.)

Assumptions and Notes

Worth noting a few things here:

  • Obviously the integrity of my analysis is limited by the integrity of the data. While I couldn't find a better source than BrickSet, I'm sure it's still missing a few sets and details.

  • I'm filtering out any sets with fewer than two pieces or a price of one dollar or less. Notably, this excludes items listed as 'sets' that don't have any actual pieces.

  • I'm not taking inflation into account.

  • These graphs are drawn with the wonderful python-nvd3 library.

  • Prices are in USD.
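For anyone who does want to adjust for inflation, here is a minimal sketch of one way it could be done. It is not part of this analysis: the `price_index` values and the toy sets are placeholders invented for illustration, not real CPI figures, and `RealPrice` and `base_year` are hypothetical names.

```python
import pandas as pd

# PLACEHOLDER index values, invented for illustration; NOT real CPI figures.
price_index = pd.Series({1990: 100.0, 2000: 150.0, 2014: 200.0})
base_year = 2014

# Toy sets, also invented for illustration.
toy = pd.DataFrame({
    "Year": [1990, 2000, 2014],
    "USPrice": [50.0, 60.0, 80.0],
})

# Rescale each nominal price into base-year dollars:
# real = nominal * index[base_year] / index[year of release].
factor = price_index.loc[base_year] / toy["Year"].map(price_index)
toy["RealPrice"] = toy["USPrice"] * factor
print(toy)
```

With a real index in place of the placeholder, `RealPrice` could be swapped in for `USPrice` in any of the charts.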

Alright, that's all the boring stuff. Let's get started!

import pandas as pd
import numpy as np
from nvd3 import lineChart, discreteBarChart

filename = 'legos.csv'
outfile = open('legos.html', 'w')

# Main charting function.  Pretty ugly.
# Takes a pandas Series, builds an NVD3 chart, and appends its HTML to outfile.
def print_chart(name, df, bar=False):
    width = 800
    height = 300

    xdata = list(df.keys())
    try:
        # Numeric keys (years, piece counts) become floats for the x axis.
        xdata = [float(x) for x in xdata]
    except (TypeError, ValueError):
        # Non-numeric keys (theme names) are left untouched.
        pass

    ydata = list(df.values)

    chart_name = name.replace(" ", "")
    if bar:
        chart = discreteBarChart(name=chart_name, height=height, width=width)
    else:
        chart = lineChart(name=chart_name, height=height, width=width, x_axis_format='')
    chart.add_serie(name=name, y=ydata, x=xdata)
    chart.buildcontent()
    outfile.write(chart.htmlcontent)

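# Some boring parsing
data = pd.read_csv(filename)
# fillna returns a new object rather than modifying in place, so assign it back.
# Only the two columns used by the filter below are filled.
data[['Pieces', 'USPrice']] = data[['Pieces', 'USPrice']].fillna(0)
data = data[(data['Pieces'] > 1) & (data['USPrice'] > 1)]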

Price over time

The first thing I wanted to check was the obvious hypothesis: your average Lego set has gotten more expensive over time.

Let's take a look:

print_chart("Price over time", data.groupby('Year')['USPrice'].mean())

At first glance, it looks like, yes, price is increasing steadily over time [1]. However, raw price per set is something of a flawed metric: after all, a $200 set with one thousand pieces is a better deal than a $10 set with twenty. What happens if we look at price per brick, instead of just price?
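To make that comparison concrete, here is the arithmetic for the two hypothetical sets mentioned above (the set names are just labels for this sketch):

```python
# The two hypothetical sets from the paragraph above: (price in USD, piece count).
sets = {
    "big set": (200.00, 1000),
    "small set": (10.00, 20),
}

# Price per piece is simply price divided by piece count.
price_per_piece = {name: price / pieces for name, (price, pieces) in sets.items()}

for name, value in price_per_piece.items():
    print(f"{name}: ${value:.2f} per piece")
# big set: $0.20 per piece
# small set: $0.50 per piece
```

The $200 set costs twenty times as much, yet each brick in it costs less than half as much.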

data['$/p'] = data['USPrice'] / data['Pieces']
print_chart("Price per brick over time", data.groupby('Year')['$/p'].mean())

Well, that's certainly a different picture. Price per brick appears to have peaked in the early nineties and slowly come back to earth since then, where it hovers around a quarter.
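One caveat about that chart: it averages the per-set ratios, so a twenty-piece set counts just as much as a thousand-piece one. A different (and equally defensible) summary is total dollars over total pieces per year. The sketch below contrasts the two on toy data invented for illustration, not taken from the BrickSet export; the names `mean_of_ratios` and `pooled` are my own.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Year": [2000, 2000, 2001, 2001],
    "USPrice": [10.0, 200.0, 5.0, 100.0],
    "Pieces": [20, 1000, 10, 1000],
})
toy["$/p"] = toy["USPrice"] / toy["Pieces"]

# Mean of per-set ratios: every set counts equally (the approach used above).
mean_of_ratios = toy.groupby("Year")["$/p"].mean()

# Pooled ratio: total dollars over total pieces, so large sets dominate.
totals = toy.groupby("Year")[["USPrice", "Pieces"]].sum()
pooled = totals["USPrice"] / totals["Pieces"]

print(pd.DataFrame({"mean_of_ratios": mean_of_ratios, "pooled": pooled}))
```

On this toy data the mean of ratios for 2000 is 0.35 while the pooled figure is roughly 0.21, which shows how much small sets can pull the per-set average upward.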

Price per brand

So, price hasn't particularly increased over time. Another popular hypothesis is that Lego's profits have come from the premium they charge for licensed sets: from the perennial Star Wars [2] sets to more adventurous collaborations like The Avengers and even Minecraft, it's easy to imagine that the company imposes something of a 'brand tax' on these sets. Let's group the sets by theme and see for ourselves:

print_chart("Average price per brick grouped by theme", data.groupby('Theme')['$/p'].mean().sort_values(), bar=True)

Huh -- not exactly what I was expecting. While it's definitely true that the more generic Legos are cheaper -- "Homemaker", "Freestyle", and "Classic" being amongst the cheapest themes -- The Lego Movie and Minecraft [3] are also on the cheaper end of the spectrum. Instead, it looks like the real heavy hitters are the niche sets: those for children ("Education", "Baby") and for programmers and tinkerers ("Galidor", "Mindstorms").

Should you buy in bulk?

print_chart("Price per brick grouped by pieces", data[(data['Pieces'] > 15)].groupby('Pieces')['$/p'].mean())

Short answer: yes.

Long answer: well, yeah, but after a certain point it won't make too big of a difference. Those tiny accessory and minifig-heavy sets will kill your wallet in the long run, though.
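Grouping by exact piece count gives one point per distinct count, which can get noisy. A rough sketch of an alternative is to bucket sets into size ranges with `pd.cut` first. The bucket edges, labels, and toy data here are hypothetical choices for illustration, not something the charts above use.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "USPrice": [4.0, 6.0, 20.0, 30.0, 100.0, 150.0],
    "Pieces": [20, 40, 200, 400, 1200, 2000],
})
toy["$/p"] = toy["USPrice"] / toy["Pieces"]

# Hypothetical bucket edges; pd.cut intervals are right-closed, e.g. (100, 1000].
edges = [15, 100, 1000, 10000]
labels = ["16-100", "101-1000", "1001-10000"]
toy["size"] = pd.cut(toy["Pieces"], bins=edges, labels=labels)

# observed=True keeps only the buckets that actually contain sets.
by_size = toy.groupby("size", observed=True)["$/p"].mean()
print(by_size)
```

The resulting Series has one value per bucket, so it could be passed to `print_chart` with `bar=True`.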

So, what's changed?

For my money, the biggest thing that's changed is my perception -- and my tastes, I suppose, as the sets that catch my eye nowadays tend to be much more expensive than the simple (and ninja-filled) sets of my childhood.

Which is relevant in and of itself: Lego's push to appeal to the adults who grew up with the blocks is a relatively recent one, as they continue to segment their audience. This can be visualized (albeit roughly) by graphing standard deviation of prices over time, showing that the range of prices itself has increased:

print_chart("Standard deviation of price over time", data.groupby('Year')['USPrice'].std())

  1. Remember, this is without taking inflation into account. 

  2. I maintain that the dual-bladed lightsaber that I got with my Darth Maul lego is objectively the coolest thing ever. 

  3. Though in Minecraft's case, it's because the thing is just a bunch of tiny 1x1 bricks.
