Tue Mar 11, 2014
Tags: python development
Linear regressions are a great tool for any level of data exploration: chances are, if you’re looking to investigate the relationship between two variables, somewhere along the line you’re going to want to conjure a regression. So how do you accomplish that in Python?
First, let’s grab some test data from John Burkardt at FSU: specifically, some toy housing data which contains – amongst other things – the area of the site in thousands of square feet and the final selling price. We’ll investigate a pretty simple hypothesis: as the area of the site increases, the selling price increases as well.
Creating the regression itself is pretty simple if you go the route of NumPy:
```python
import pandas as pd
import numpy as np

COLUMN_SEPARATOR = ' '
# The first column of the file is a row index.
housing_data = pd.read_csv('housing.csv', sep=COLUMN_SEPARATOR,
                           header=None, index_col=0)

AREA_INDEX = 4
SELLING_PRICE_INDEX = 13
x = housing_data[AREA_INDEX]
y = housing_data[SELLING_PRICE_INDEX]

regression = np.polyfit(x, y, 1)
```
`regression` prints as `[ 3.87739108 13.10587299]`: an array of the coefficients in descending power. That is, this is the same as the following equation:

y = 3.87739108 * x + 13.10587299
So, our hypothesis turns out to be sane! For every additional thousand square feet in the housing site, our selling price increases by approximately $388. 1
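As a quick aside, NumPy's `poly1d` can wrap those coefficients into a callable line, which is handy for making predictions. (The 5,000 sq ft site below is a made-up example, not a row from the dataset.)

```python
import numpy as np

# Coefficients from the fit above, in descending power.
regression = [3.87739108, 13.10587299]

# poly1d turns the coefficient list into a callable polynomial.
predict = np.poly1d(regression)

# Predicted selling price (in hundreds of dollars) for a
# hypothetical 5,000 sq ft site (x is in thousands of sq ft).
price = predict(5)
```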
And, since we’re feeling fancy, let’s graph this lil’ guy! I decided to go with Bokeh, which is very much still a work in progress on binding
matplotlib-esque interfaces onto D3 – it doesn’t have much embedding functionality, but it’s easy to get off the ground and looks pretty dang neat.
```python
from bokeh.plotting import *

# We need to generate actual values for the regression line.
r_x, r_y = zip(*((i, i * regression[0] + regression[1]) for i in range(15)))

output_file("regression.html")
line(r_x, r_y, color="red")

# Specify that the two graphs should be on the same plot.
hold(True)

scatter(x, y, marker="square", color="blue")
show()
```
And that is fine and dandy. However, we’re left with the cold hard truth: using NumPy is lame when you can implement the algorithm yourself. 2
In fact, creating a basic one-dimensional regression takes less than a dozen lines of Python! Specifically, we’re going to be co-opting this simple linear regression equation from Wikipedia:
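Written out (with n points, slope a, and intercept b), that formula is:

```latex
a = \frac{\sum x_i y_i - \frac{\sum x_i \sum y_i}{n}}
         {\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}},
\qquad
b = \frac{\sum y_i - a \sum x_i}{n}
```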
(This is a little different in practice than the standard linear regression equation, which involves pre-computing the mean and subtracting it from values as we multiply and sum them. The trade-off here is a slight performance increase and a slighter readability increase in exchange for theoretical precision issues due to floating point arithmetic. That being said, this is totally cool for the purposes of didactics!)
Anyhow, this equation becomes:
```python
def basic_linear_regression(x, y):
    # Basic computations to save a little time.
    length = len(x)
    sum_x = sum(x)
    sum_y = sum(y)

    # Σx^2, and Σxy respectively.
    sum_x_squared = sum(map(lambda a: a * a, x))
    sum_of_products = sum([x[i] * y[i] for i in range(length)])

    # Magic formulae!
    a = (sum_of_products - (sum_x * sum_y) / length) / \
        (sum_x_squared - ((sum_x ** 2) / length))
    b = (sum_y - a * sum_x) / length
    return a, b
```
And if we call our new method using the same arrays as before, we get (3.87739108, 13.10587299) – which matches NumPy's answer to eight decimal places. 3 Yay!
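As a sanity check, the hand-rolled version should agree with `np.polyfit` on any input. Here's a quick demonstration on some made-up points that lie roughly on y = 2x (the data below is invented for illustration, not from the housing file):

```python
import numpy as np

def basic_linear_regression(x, y):
    # Same implementation as above.
    length = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_x_squared = sum(a * a for a in x)
    sum_of_products = sum(x[i] * y[i] for i in range(length))
    a = (sum_of_products - (sum_x * sum_y) / length) / \
        (sum_x_squared - ((sum_x ** 2) / length))
    b = (sum_y - a * sum_x) / length
    return a, b

# Made-up points lying roughly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

a, b = basic_linear_regression(xs, ys)
np_a, np_b = np.polyfit(xs, ys, 1)
```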
- Remember, x is in thousands of square feet and y is in hundreds of dollars. Yes, those are weird units. [return]
- Full disclosure: using NumPy is never lame. Installing NumPy, on the other hand… [return]
- And, if this were more than toy data, it would likely take orders of magnitude longer – but this is optimizing for readability, not agility. [return]