Introduction to Machine Learning: Linear Regression

Linear regression is a technique for fitting a linear equation to a dataset. In the simple case, the model is y = a*x + b, and the regression finds the slope a and intercept b that minimize the squared error between the line and the data. We use this when we expect a roughly linear relationship, perhaps something like the square footage of an apartment compared to its rent price.

First, I’m going to show you an example of how linear regression works via sklearn, and then we’ll build a project that runs linear regression on a .csv file. For this project, our example data will be the square footage of apartments compared to their rent prices. After we go through simple linear regression, we’ll cover an example of linear regression with multiple independent variables, commonly referred to as multiple linear regression.
Before we begin, if you do not have the numpy, pandas, scikit-learn, and matplotlib libraries installed, you’ll need to install them. I use pip, but you can also use conda if you are using an Anaconda Python installation. Note that sklearn is installed under the package name scikit-learn.

pip install numpy pandas scikit-learn matplotlib

We’ll begin by importing the libraries that we need.

import numpy as np
import math
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d  # registers matplotlib's 3D projection
import random

Linear Regression with a Randomized Dataset

Next, we’ll randomly generate a set of y values from x values. We’ll use the function y = 2x plus uniform random noise in the range +/- 0.1.

arr = [1, 3, 5, 7, 9]
def _func(x):
    # y = 2x plus uniform random noise in [-0.1, 0.1]
    return 2 * x + random.uniform(-.1, .1)
x_arr = np.array(arr).reshape(-1, 1)  # column vector: one row per sample
y_1 = [_func(a) for a in arr]
y_arr = np.array(y_1)

Notice that I reshaped the x values into a column vector: one row per sample, one column per feature. sklearn’s LinearRegression expects its inputs in this 2D shape.
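If you’re curious what reshape(-1, 1) changes, here’s a quick shape check:

print(np.array(arr).shape)                 # (5,)   -- 1D; fit() raises an error on this
print(np.array(arr).reshape(-1, 1).shape)  # (5, 1) -- 5 samples, 1 feature
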
Now let’s take a look at our x and y values.

print(x_arr)
print(y_arr)

# expected output
[[1]
 [3]
 [5]
 [7]
 [9]]
[ 2.07448972  5.9836436  10.02023471 13.99454233 17.97974717]

We’re going to call LinearRegression() to fit a model to our points, and then we’ll plot the generated line against the original points.

model = LinearRegression().fit(x_arr, y_arr)
plt.scatter(x_arr, y_arr)
plt.plot(x_arr, model.predict(x_arr))
plt.show()
sklearn.LinearRegression Basic Example Graph

Looks like it fit really well. Let’s check our model’s coefficient and intercept to make sure: we expect an intercept close to 0 and a single coefficient near 2.

print(model.coef_)
print(model.intercept_)

# expected output
[1.99107068]
0.05517809795209949

Great, we’ve verified that the linear regression returns values close to the ones we expected. Now let’s check the average deviation per prediction. To do this, we’ll take the sum of squared errors, divide it by the number of entries (giving the mean squared error, MSE), and then take the square root. This quantity is the root mean squared error (RMSE).

sse = sum((y_arr - model.predict(x_arr))**2)  # sum of squared errors
avg_err = math.sqrt(sse/5)  # divide by n = 5 entries, then take the square root
avg_err

# expected output
0.02417347631277344

Because we’re going to compute the sum of squared errors and the average error again later, I’m going to define them as functions here.

def sse(i, j):
    # sum of squared errors between actual and predicted values
    return sum((i - j)**2)
def avg_err(sum_err, _len):
    # RMSE: mean of the squared errors, then the square root
    return math.sqrt(sum_err/_len)
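
As a quick sanity check, these helpers reproduce the number we just computed by hand:

print(avg_err(sse(y_arr, model.predict(x_arr)), len(arr)))  # ~0.024, same as above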

We can see that our average error is within 0.1. When we created the data, we added random noise of +/- 0.1, so an average prediction error per entry of less than 0.1 confirms that our linear regression model gives a good prediction.

Linear Regression with Real Data from a CSV

Now that we’ve done a small example, let’s move on to a more applicable use of linear regression. First we’ll read in a .csv file and form our x and y arrays from it; then we’ll build and examine the new model.
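
Here, lin_reg_data.csv is assumed to be a two-column file with a header row: square footage in the first column, rent price in the second. The header names and values below are hypothetical, just to show the expected layout:

sqft,price
475,1750
600,2200
...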

arrs = np.genfromtxt("lin_reg_data.csv", delimiter=",", skip_header=1)
x = arrs[:, 0].reshape(-1, 1)
y = arrs[:, 1]
model = LinearRegression().fit(x, y)

Let’s check the average error per entry (we expect this to be under $100), the coefficient and intercept, and see what our line looks like plotted against the original points.

# average error
_len = y.size
_avg_err = avg_err(sse(y, model.predict(x)), _len)
print("The average price error per apartment from this model is", _avg_err)

# coefficient and intercept
print("The linear scaling coefficient for square feet in this model is", model.coef_[0])
print("The offsetting price/intercept of this model is", model.intercept_)

# plot
plt.scatter(x, y)
plt.plot(x, model.predict(x), color="black")
plt.show()

# expected output
The average price error per apartment from this model is 62.58835009983644
The linear scaling coefficient for square feet in this model is 3.6575960237228298
The offsetting price/intercept of this model is 1.3200558525759334
Plot of Linear Regression

Cool! The model looks pretty good, and our average price error is within $100 as expected. The rent in this city (Seattle) is very expensive at $3.66 per square foot. The intercept is 1.32, essentially zero, which fits our model’s simplifying assumption that apartment price varies directly with square footage and nothing else.
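With the model in hand, predicting the rent for a new apartment takes one call (a quick sketch; 800 square feet is just an example input):

print(model.predict([[800]]))  # ~2927, i.e. 3.6576 * 800 + 1.32
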
Now, for our final walkthrough project, we’re going to try to fit a linear model to the weight of hardwood trees (in tons) based on their height and radius.
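
Here, trees.csv is assumed to have a header row and three columns: height, radius, and weight in tons. The header names and values below are hypothetical, just to show the expected layout:

height,radius,weight
60,1.4,4.2
75,1.9,8.8
...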

arrs = np.genfromtxt("trees.csv", delimiter=",", skip_header=1)
x = arrs[:, 0:2].reshape(-1, 2)
y = arrs[:, 2]
model = LinearRegression().fit(x, y)

Now let’s plot the model (in 3D!) and get the R^2 score.

x1 = x[:, 0]  # heights
x2 = x[:, 1]  # radii
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.scatter(x1, x2, y, alpha=0.3)
# evaluate the model over a grid so we can draw the fitted plane, not just a line
g1, g2 = np.meshgrid(np.linspace(x1.min(), x1.max(), 20),
                     np.linspace(x2.min(), x2.max(), 20))
preds = model.predict(np.c_[g1.ravel(), g2.ravel()]).reshape(g1.shape)
ax.plot_surface(g1, g2, preds, color='black', alpha=0.5)
plt.show()
Plotted 3D linear regression from sklearn

With the points plotted in blue and the plane from our linear regression plotted in black, we can see visually that the model predicts the actual values quite well. Now it’s time to verify this with the .score() function from sklearn’s LinearRegression.

print("Our R squared value is", model.score(x, y))

# expected output
Our R squared value is 0.9824677872384431
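
Under the hood, .score() returns the coefficient of determination, R^2 = 1 - SSE / SStot, where SStot is the total sum of squares around the mean of y. As a quick sketch, we can reproduce it with the sse() helper we defined earlier:

# R^2 by hand: should match model.score(x, y)
y_pred = model.predict(x)
print(1 - sse(y, y_pred) / sum((y - y.mean())**2))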

Wow, that’s a really good R^2 value; the closer R^2 is to 1, the better the linear model fits the data. Let’s also take a look at the coefficients and intercept.

print(model.coef_)
print(model.intercept_)

# expected output
[0.12629947 0.05877378]
-3.575930068798057

Conclusion

Looking at our model’s coefficients and intercept tells us that, despite how well the plane fits at these values, this model doesn’t entirely make sense. The value we’re predicting is the weight of the tree in tons, so an intercept of -3.58 is physically meaningless: a tree of zero height and radius should weigh nothing, so we’d expect an intercept of 0. Examining the trees.csv data by eye, we also see that weight increases faster and faster as height and radius go up, and the plot shows the points clearly forming some sort of curve. The linear fit still scores well likely because it approximates that curve closely over the range of values in our data set.
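
If you want to experiment, here is one hedged sketch of a physically motivated fix (assuming arrs and y from above are still in scope): since a tree’s weight should scale roughly with its volume, we can fit on a single volume-like feature, height times radius squared, instead.

# hypothetical follow-up: fit weight against a volume-like feature
vol = (arrs[:, 0] * arrs[:, 1]**2).reshape(-1, 1)
vol_model = LinearRegression().fit(vol, y)
print(vol_model.score(vol, y), vol_model.intercept_)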

This wraps up our linear regression module. In this module we started with a small, one-dimensional example of linear regression, then moved on to a larger example read in from a .csv file, and finally to an example of multiple linear regression verified with sklearn’s .score() function. In the next module, we’ll be covering logistic regression.

I run this site to help you and others like you find cool projects and practice software skills. If this is helpful for you and you enjoy your ad-free site, please help fund this site by donating below! If you can’t donate right now, please think of us next time.

Yujian Tang
