Linear regressions are a great tool for any level of data exploration: chances are, if you’re looking to investigate the relationship between two variables, somewhere along the line you’re going to want to conjure a regression. So how do you accomplish that in Python?

First, let’s grab some test data from John Burkardt at FSU: specifically, some toy housing data which contains – amongst other things – the area of the site in thousands of square feet and the final selling price. We’ll investigate a pretty simple hypothesis: as the area of the site increases, the selling price increases as well.

Creating the regression itself is pretty simple if you go the route of NumPy:

```python
import pandas as pd
import numpy as np

COLUMN_SEPARATOR = '  '

# The filename here depends on where you saved the dataset; the file has no
# column headers, so we index columns by position. (engine='python' lets
# pandas handle the multi-character separator.)
housing_data = pd.read_csv('housing.txt', sep=COLUMN_SEPARATOR,
                           header=None, engine='python')

AREA_INDEX = 4
SELLING_PRICE_INDEX = 13
x = housing_data[AREA_INDEX]
y = housing_data[SELLING_PRICE_INDEX]
regression = np.polyfit(x, y, 1)
```
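With the coefficients in hand, NumPy can also evaluate the fitted line for us: `np.polyval` (and the `np.poly1d` wrapper) take the same descending-power coefficient array that `np.polyfit` returns. A quick sketch with made-up coefficients:

```python
import numpy as np

# Coefficients in descending power, as np.polyfit(x, y, 1) would return them.
coefficients = np.array([3.9, 13.1])

# np.polyval evaluates slope * x + intercept for each x.
predictions = np.polyval(coefficients, [1.0, 2.0, 3.0])

# np.poly1d wraps the coefficients in a callable polynomial object.
fit = np.poly1d(coefficients)
assert np.allclose(predictions, fit([1.0, 2.0, 3.0]))
```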

`regression` prints as `[ 3.87739108 13.10587299]`, which is an array of the coefficients in descending order of power. In other words, it represents the following equation:

```
y = 3.87739108 * x + 13.10587299
```

So, our hypothesis turns out to be sane! For every additional thousand square feet in the housing site, our selling price increases by approximately \$388. 1
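To make those units concrete (x in thousands of square feet, y in hundreds of dollars — see footnote 1), plugging a site area into the fitted line looks like this. The helper function is my own, just for illustration:

```python
slope, intercept = 3.87739108, 13.10587299

def predicted_price_dollars(area_thousands_sq_ft):
    # y is in hundreds of dollars, so scale by 100 to get a dollar figure.
    return (slope * area_thousands_sq_ft + intercept) * 100

# A 5,000 square foot site (x = 5):
price = predicted_price_dollars(5)
```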

And, since we’re feeling fancy, let’s graph this lil’ guy! I decided to go with Bokeh, which is very much still a work in progress at binding `matplotlib`-esque interfaces onto D3 – it doesn’t have much embedding functionality yet, but it’s easy to get off the ground and looks pretty dang neat.

```python
from bokeh.plotting import *

# We need to generate actual (x, y) values along the regression line.
slope, intercept = regression
r_x, r_y = zip(*((i, i * slope + intercept) for i in range(15)))

output_file("regression.html")
line(r_x, r_y, color="red")
# Specify that the two graphs should be on the same plot.
hold(True)
scatter(x, y, marker="square", color="blue")
show()
```

(Plot: the red regression line over the blue scatter of site area vs. selling price.)

And that is fine and dandy. However, we’re left with the cold hard truth: using NumPy is lame when you can implement the algorithm yourself. 2

In fact, creating a basic one-dimensional regression takes less than a dozen lines of Python! Specifically, we’re going to co-opt the simple linear regression equations from Wikipedia. (These differ a little in practice from the standard presentation, which pre-computes the means and subtracts them from each value as we multiply and sum. The trade-off here is a slight performance gain and a slighter readability gain in exchange for potential precision issues from floating-point arithmetic. That being said, this is totally cool for the purposes of didactics!)
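Written out, the slope a and intercept b for n points come out to:

```
a = (Σxy − (Σx · Σy) / n) / (Σx² − (Σx)² / n)
b = (Σy − a · Σx) / n
```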

Anyhow, this equation becomes:

```python
def basic_linear_regression(x, y):
    # Basic computations to save a little time.
    length = len(x)
    sum_x = sum(x)
    sum_y = sum(y)

    # Σx^2 and Σxy respectively.
    sum_x_squared = sum(map(lambda a: a * a, x))
    sum_of_products = sum(x[i] * y[i] for i in range(length))

    # Magic formulae!
    a = (sum_of_products - (sum_x * sum_y) / length) / (sum_x_squared - ((sum_x ** 2) / length))
    b = (sum_y - a * sum_x) / length
    return a, b
```

And if we call our new function using the same columns as before, we get:

```
(3.8773910821388129, 13.105872988455717)
```

Which matches NumPy’s answer to all eight of the decimal places it printed. 3 Yay!
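For comparison, here’s a sketch of the mean-centered textbook formulation mentioned earlier: compute the means first, then sum the centered products. (The function name is my own, not from the code above.)

```python
def centered_linear_regression(x, y):
    # Compute the means up front.
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Σ(x - x̄)(y - ȳ) and Σ(x - x̄)² respectively.
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    variance = sum((xi - mean_x) ** 2 for xi in x)

    a = covariance / variance
    b = mean_y - a * mean_x
    return a, b
```

On well-conditioned data the two versions agree; the centered form simply avoids subtracting two large, nearly equal sums.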

1. Remember, x is in thousands of square feet and y is in hundreds of dollars. Yes, those are weird units.
2. Full disclosure: using NumPy is never lame. Installing NumPy, on the other hand…
3. And, if this were more than toy data, the handwritten version would likely take orders of magnitude longer – but we’re optimizing for readability here, not speed.