Linear regressions are a great tool for any level of data exploration: chances are, if you’re looking to investigate the relationship between two variables, somewhere along the line you’re going to want to conjure a regression. So how do you accomplish that in Python?

First, let’s grab some test data from John Burkardt at FSU: specifically, some toy housing data which contains – amongst other things – the area of the site in thousands of square feet and the final selling price. We’ll investigate a pretty simple hypothesis: as the area of the site increases, the selling price increases as well.

Creating the regression itself is pretty simple if you go the route of NumPy:

import pandas as pd
import numpy as np

housing_data = pd.DataFrame.from_csv('housing.csv', sep=COLUMN_SEPARATOR, header=None)

x = housing_data[AREA_INDEX]
y = housing_data[SELLING_PRICE_INDEX]
regression = np.polyfit(x, y, 1)

regression prints as [ 3.87739108 13.10587299], which is a list of the coefficients in descending power. That is, this is the same as the following equation:

y = 3.87739108 * x + 13.10587299.

So, our hypothesis turns out to be sane! For every thousand square feet in the housing site, our selling price increases by approximately $380 dollars. 1

And, since we’re feeling fancy, let’s graph this lil’ guy! I decided to go with Bokeh, which is very much still a work in progress on binding matplotlib-esque interfaces onto D3 – it doesn’t have much embedding functionality, but it’s easy to get off the ground and looks pretty dang neat.

from bokeh.plotting import *

# We need to generate actual values for the regression line.
r_x, r_y = zip(*((i, i*regression[0] + regression[1]) for i in range(15)))

line(r_x, r_y, color="red")
# Specify that the two graphs should be on the same plot.
scatter(x, y, marker="square", color="blue")


And that is fine and dandy. However, we’re left with the cold hard truth: using NumPy is lame when you can implement the algorithm yourself. 2

In fact, creating a basic one-dimensional regression takes less than a dozen lines of Python! Specifically, we’re going to be co-opting this simple linear regression equation from Wikipedia:

(This is a little different in practice than the standard linear regression equation, which involves pre-computing the mean and subtracting it from values as we multiply and sum them. The trade-off here is a slight performance increase and a slighter readability increase in exchange for theoretic precision issues due to floating point arithmetic. That being said, this is totally cool for the purposes of didactics!)

Anyhow, this equation becomes:

def basic_linear_regression(x, y):
    # Basic computations to save a little time.
    length = len(x)
    sum_x = sum(x)
    sum_y = sum(y)

    # Σx^2, and Σxy respectively.
    sum_x_squared = sum(map(lambda a: a * a, x))
    sum_of_products = sum([x[i] * y[i] for i in range(length)])

    # Magic formulae!
    a = (sum_of_products - (sum_x * sum_y) / length) / (sum_x_squared - ((sum_x ** 2) / length))
    b = (sum_y - a * sum_x) / length
    return a, b

And if we call our new method using the same arrays as before, we get:

(3.8773910821388129, 13.105872988455717)

Which is accurate within eight degrees of freedom. 3 Yay!

  1. Remember, x is in thousands of square feet and y is in hundreds of dollars. Yes, those are weird units. [return]
  2. Full disclosure: using NumPy is never lame. Installing NumPy, on the other hand… [return]
  3. And, if this was more than toy data, would likely take orders of magnitude longer – but this is optimizing for readability, not agility [return]
Liked this post? Follow me!