### Tue Mar 11, 2014

## Basic linear regressions in Python

### Tags: python development

Linear regressions are a great tool for any level of data exploration: chances are, if you’re looking to investigate the relationship between two variables, somewhere along the line you’re going to want to conjure a regression. So how do you accomplish that in Python?

First, let’s grab some test data from John Burkardt at FSU: specifically, some toy housing data which contains – amongst other things – the area of the site in thousands of square feet and the final selling price. We’ll investigate a pretty simple hypothesis: as the area of the site increases, the selling price increases as well.

Creating the regression itself is pretty simple if you go the route of NumPy:

```
import pandas as pd
import numpy as np
COLUMN_SEPARATOR = ' '
# header=None, since the file is just raw separator-delimited columns.
housing_data = pd.read_csv('housing.csv', sep=COLUMN_SEPARATOR, header=None)
AREA_INDEX = 4
SELLING_PRICE_INDEX = 13
x = housing_data[AREA_INDEX]
y = housing_data[SELLING_PRICE_INDEX]
# Fit a degree-1 polynomial -- that is, a line.
regression = np.polyfit(x, y, 1)
```

Printing `regression` gives `[ 3.87739108  13.10587299]`, which is an array of the coefficients in descending power. That is, it's equivalent to the following equation:

```
y = 3.87739108 * x + 13.10587299
```

So, our hypothesis turns out to be sane! For every additional thousand square feet of site area, the selling price increases by approximately $388. ^{1}
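As a quick sanity check, we can evaluate that line ourselves for a hypothetical site (the coefficients below are hard-coded from the fit above, and the 5,000 square foot figure is just an illustrative input):

```
import numpy as np

# Coefficients from the fit above, hard-coded for illustration.
regression = [3.87739108, 13.10587299]

# np.polyval evaluates the polynomial at x; here, a 5,000 sq ft site.
predicted = np.polyval(regression, 5)
print(predicted)  # roughly 32.49, i.e. a selling price around $3,249
```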

And, since we’re feeling fancy, let’s graph this lil’ guy! I decided to go with Bokeh, which is very much still a work in progress on binding `matplotlib`-esque interfaces onto D3 – it doesn’t have much embedding functionality yet, but it’s easy to get off the ground and looks pretty dang neat.

```
from bokeh.plotting import *
# We need to generate actual values for the regression line.
r_x, r_y = zip(*((i, i*regression[0] + regression[1]) for i in range(15)))
output_file("regression.html")
line(r_x, r_y, color="red")
# Specify that the two graphs should be on the same plot.
hold(True)
scatter(x, y, marker="square", color="blue")
show()
```

And that is fine and dandy. However, we’re left with the cold hard truth: using NumPy is lame when you can implement the algorithm yourself. ^{2}

In fact, creating a basic one-dimensional regression takes less than a dozen lines of Python! Specifically, we’re going to be co-opting this simple linear regression equation from Wikipedia:
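For reference, the estimator in question – the ordinary least-squares fit for a line y = a·x + b – works out to:

```
a = (Σxy − (Σx)(Σy)/n) / (Σx² − (Σx)²/n)
b = (Σy − a·Σx)/n
```

where n is the number of observations and each sum runs over all the data points.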

(This is a little different in practice from the standard linear regression equation, which pre-computes the means and subtracts them from the values as they are multiplied and summed. The trade-off here is a slight performance gain and a slight readability gain, in exchange for potential precision issues due to floating-point arithmetic. That said, this is totally fine for the purposes of didactics!)
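For comparison, the mean-subtracting form mentioned above might look like this – a sketch of the textbook version, not the one we'll use below:

```
def mean_centered_regression(x, y):
    # The textbook form: center both variables on their means first,
    # which is gentler on floating-point precision.
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    a = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    b = mean_y - a * mean_x
    return a, b
```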

Anyhow, this equation becomes:

```
def basic_linear_regression(x, y):
    # Basic computations to save a little time.
    length = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    # Σx^2 and Σxy, respectively.
    sum_x_squared = sum(map(lambda a: a * a, x))
    sum_of_products = sum(x[i] * y[i] for i in range(length))
    # Magic formulae!
    a = (sum_of_products - (sum_x * sum_y) / length) / (sum_x_squared - ((sum_x ** 2) / length))
    b = (sum_y - a * sum_x) / length
    return a, b
```

And if we call our new function on the same columns as before, we get:

```
(3.8773910821388129, 13.105872988455717)
```

Which matches NumPy's answer to about eight decimal places. ^{3} Yay!

1. Remember, x is in thousands of square feet and y is in hundreds of dollars. Yes, those are weird units. ^{[return]}
2. Full disclosure: using NumPy is never lame. Installing NumPy, on the other hand… ^{[return]}
3. And, if this were more than toy data, it would likely take orders of magnitude longer – but this is optimizing for readability, not agility. ^{[return]}