Linear Regression is a technique to create a linear equation given a dataset. We use this when we expect to have a linear correlation, perhaps something like square footage of an apartment compared to rent price.
First, I’m going to show you an example of how linear regression works via sklearn and then we’ll build a project that runs linear regression on a .csv file. For this project, our example data will be square footage of apartment compared to rent price. After we go through simple linear regression, we’ll cover an example of linear regression with multiple independent variables, commonly also referred to as multiple linear regression.
Before we begin, make sure the numpy, pandas, scikit-learn, and matplotlib libraries are installed. I use pip, but you can also use conda if you are using an Anaconda Python installation. Note that the package is installed under the name scikit-learn but imported as sklearn.
pip install numpy pandas scikit-learn matplotlib
We’ll begin by importing the libraries that we need.
import numpy as np
import math
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
import random
Linear Regression with a Randomized Dataset
Next, we’ll randomly generate a set of y values from x values. We’ll use the function y = 2x plus a random offset of up to ±0.1.
arr = [1, 3, 5, 7, 9]

def _func(x):
    return 2 * x + random.uniform(-.1, .1)

x_arr = np.array(arr).reshape(-1, 1)
y_1 = [_func(a) for a in arr]
y_arr = np.array(y_1)
Notice that I turned the set of x values into a vertical array (one column, with one row per sample). This is important because sklearn’s LinearRegression expects its input features as a 2D array.
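To make the shape change concrete, here is a minimal sketch using the same x values. The variable names flat and column are just illustrative.

```python
import numpy as np

# Same x values as in the example above
flat = np.array([1, 3, 5, 7, 9])

# reshape(-1, 1) keeps all the values but arranges them as one column;
# the -1 tells numpy to work out the number of rows on its own
column = flat.reshape(-1, 1)

print(flat.shape)    # (5,)   one-dimensional: 5 values
print(column.shape)  # (5, 1) two-dimensional: 5 samples, 1 feature
```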
Now let’s take a look at our x and y values.
print(x_arr)
print(y_arr)
# expected output
[[1]
 [3]
 [5]
 [7]
 [9]]
[ 2.07448972 5.9836436 10.02023471 13.99454233 17.97974717]
We’re going to call LinearRegression() to fit our points to a model, and then we’ll plot the generated line against our original points.
model = LinearRegression().fit(x_arr, y_arr)
plt.scatter(x_arr, y_arr)
plt.plot(x_arr, model.predict(x_arr))
plt.show()
Looks like it fit really well. Let’s check out our model’s coefficients and intercept to make sure. We expect an intercept close to 0 and a single coefficient near 2.
print(model.coef_)
print(model.intercept_)
# expected output
[1.99107068]
0.05517809795209949
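For a single feature, the least-squares line can also be computed by hand, which is a handy sanity check on the values sklearn reports. The sketch below is not sklearn’s internal implementation; it uses the textbook formulas (slope = covariance of x and y divided by the variance of x, intercept = mean of y minus slope times mean of x) on noise-free y = 2x data, and the helper name fit_line is made up for this illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance of x and y)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sum of squared deviations of x (proportional to the variance of x)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 3, 5, 7, 9]
ys = [2 * x for x in xs]  # exact y = 2x, no random offset

slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 0.0
```

With the random offset added back in, the slope and intercept drift slightly away from 2 and 0, which is what the sklearn output above shows.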
Great, we’ve verified that the linear regression returns values close to the values we expected. Now let’s check the average deviation per prediction. To do this, we’ll take the sum of squared errors, divide it by the number of entries to get the Mean Squared Error (MSE), and then take the square root, which gives the root mean squared error (RMSE).
# despite the name, this is the sum of squared errors; dividing by 5 below gives the MSE
mse = sum((y_arr - model.predict(x_arr))**2)
avg_err = math.sqrt(mse/5)
avg_err
# expected output
0.02417347631277344
Because we’re going to be checking for mse and average error again later, I’m going to define them as functions here.
def mse(i, j):
    return sum((i - j)**2)

def avg_err(sum_err, _len):
    return math.sqrt(sum_err/_len)
We can see that our average error is also within 0.1. When we were creating our function, we added an offset of +/- 0.1 for randomization, so an average prediction error per entry below 0.1 is what we’d expect from a linear regression model that gives a good prediction.
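As a quick worked example of the error calculation, here is the same arithmetic on four made-up predictions whose errors are easy to follow by hand. The function name rmse and the numbers are illustrative, not taken from the dataset.

```python
import math


def rmse(actual, predicted):
    """Square each error, average the squares, then take the square root."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual = [10, 20, 30, 40]
predicted = [13, 16, 30, 40]  # errors of -3, 4, 0 and 0

# squared errors: 9 + 16 + 0 + 0 = 25, mean: 25 / 4 = 6.25, square root: 2.5
print(rmse(actual, predicted))  # 2.5
```

Because the errors are squared before averaging, a few large misses raise this number more than many small ones would.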
Linear Regression with Real Data from a CSV
Now that we’ve done a small example, let’s move on to some more applicable uses of linear regression. First we’ll read in a .csv and form our x and y arrays from that, then we’ll build and examine this new model.
arrs = np.genfromtxt("lin_reg_data.csv", delimiter=",", skip_header=1)
x = arrs[:, 0].reshape(-1, 1)
y = arrs[:, 1]
model = LinearRegression().fit(x, y)
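The column slicing above is easier to see on a tiny input. This sketch feeds np.genfromtxt an in-memory stand-in for the file; the header names and the three rows are invented for illustration and are not the contents of lin_reg_data.csv.

```python
import io

import numpy as np

# Hypothetical CSV text: a header row, then square footage and rent per line
csv_text = "sqft,rent\n500,1800\n750,2700\n1000,3600\n"

# skip_header=1 drops the header row so only the numbers are parsed
arrs = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

x = arrs[:, 0].reshape(-1, 1)  # first column as a (n_samples, 1) feature array
y = arrs[:, 1]                 # second column as a flat target array

print(arrs.shape)  # (3, 2)
print(x.shape)     # (3, 1)
print(y.shape)     # (3,)
```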
Let’s check out the average error per entry (we expect this to be under 100), look at the coefficient and intercept, and see what our line looks like plotted against the original points.
# average error
_len = y.size
_avg_err = avg_err(mse(y, model.predict(x)), _len)
print("The average price error per apartment from this model is", _avg_err)
# coefficients and intercepts
print("The linear scaling coefficient for square feet in this model is", model.coef_)
print("The offsetting price/intercept of this model is", model.intercept_)
# plot
plt.scatter(x, y)
plt.plot(x, model.predict(x), color="black")
plt.show()
# expected output
The average price error per apartment from this model is 62.58835009983644
The linear scaling coefficient for square feet in this model is 3.6575960237228298
The offsetting price/intercept of this model is 1.3200558525759334
Cool! The model looks pretty good AND our average price error is within 100 as expected. The rent in this city (Seattle) is very expensive at about $3.66 per square foot. We see that the intercept is 1.32, and that looks about right too: we’ve made the totally reasonable assumption in our model that apartment price varies only directly with square footage, so the price at zero square feet should be close to 0.
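To read the fitted line as a pricing rule, plug a size into price = coefficient × square feet + intercept. The sketch below uses the coefficient and intercept printed above; the 700 square foot apartment is a hypothetical input.

```python
# Values copied from the model output above
coef = 3.6575960237228298       # dollars per square foot
intercept = 1.3200558525759334  # dollars at zero square feet

sqft = 700  # hypothetical apartment size

predicted_rent = coef * sqft + intercept
print(round(predicted_rent, 2))  # 2561.64
```

Assuming the fitted model matches the output shown, this is the same value model.predict would give for an input of [[700]].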
Now for our final walkthrough project, we’re going to try and fit a linear model to the weight of hardwood trees in tons based on their height and radius.
arrs = np.genfromtxt("trees.csv", delimiter=",", skip_header=1)
x = arrs[:, 0:2].reshape(-1, 2)
y = arrs[:, 2]
model = LinearRegression().fit(x, y)
Now let’s plot the model (in 3D!) and get the R^2 value.
# split the two feature columns into separate lists for plotting
x1 = []
x2 = []
for entry in x:
    x1.append(entry[0])
    x2.append(entry[1])
fig = plt.figure()
ax = plt.axes(projection = '3d')
ax.scatter(x1, x2, y, alpha=0.3)
ax.plot(x1, x2, model.predict(x), color='black')
With the points plotted in blue and the plane from our linear regression plotted in black, we can see visually that our linear regression model predicts the actual values quite well. Now it’s time to verify this with the .score() function from sklearn’s LinearRegression.
print("Our R squared value is", model.score(x, y))
# expected output
Our R squared value is 0.9824677872384431
Wow, that’s a really good R^2 value. The closer the R^2 value is to 1, the better the linear model fits the data. Let’s also take a look at the coefficients and intercept.
print(model.coef_)
print(model.intercept_)
# expected output
[0.12629947 0.05877378]
-3.575930068798057
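With two features, a prediction is each coefficient times its feature, summed, plus the intercept. The sketch below applies the values printed above to one made-up tree; the inputs 10 and 5 are hypothetical, and which coefficient belongs to height versus radius depends on the column order in trees.csv, which is assumed here to be height first.

```python
# Values copied from the model output above
coefs = [0.12629947, 0.05877378]
intercept = -3.575930068798057

# Hypothetical tree, in the same units and column order as the data
# (assumed here: height first, then radius)
tree = [10, 5]

predicted_weight = sum(c * v for c, v in zip(coefs, tree)) + intercept
print(predicted_weight < 0)  # True: the model predicts a negative weight here
```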
Looking at our model’s coefficients and intercept tells us that despite how well our plane fit these values, this model doesn’t really make sense. The value that we’re predicting is the weight of the tree in tons, so an intercept of -3.58 doesn’t make sense; we should expect an intercept of 0. Also, examining the ‘trees.csv’ data by eye, we’ll see that the weight increases faster and faster as the height and radius go up. Examining our plot too, we can see that the points are clearly forming some sort of curve, and the reason this linear fit works well is the particular range of values in our data set.
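One way to see how a plane can score well on curved data is to fit one to a synthetic curve. The sketch below is an illustration under stated assumptions, not the trees.csv data: it invents weights proportional to radius squared times height (roughly the volume of a cylinder), samples them over a narrow range, fits a plane with numpy’s least-squares solver instead of sklearn, and computes R^2 by hand as 1 minus the residual sum of squares over the total sum of squares.

```python
import numpy as np

# Synthetic, made-up data: a narrow grid of heights and radii
heights = np.array([16.0, 18.0, 20.0, 22.0, 24.0])
radii = np.array([0.4, 0.5, 0.6, 0.7, 0.8])
h, r = np.meshgrid(heights, radii)
h, r = h.ravel(), r.ravel()

# Assumed curved relationship: weight grows with radius squared times height
weight = r ** 2 * h

# Fit weight ~ a*h + b*r + c by least squares (the column of ones gives the intercept c)
A = np.column_stack([h, r, np.ones_like(h)])
(a, b, c), *_ = np.linalg.lstsq(A, weight, rcond=None)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
predicted = A @ np.array([a, b, c])
ss_res = np.sum((weight - predicted) ** 2)
ss_tot = np.sum((weight - weight.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("R^2:", r2)
print("intercept:", c)
```

On this grid the R^2 comes out above 0.95 while the fitted intercept is negative, even though the assumed true weight at zero height and radius is 0. The plane only looks good because the sampled range is narrow enough for the curve to be nearly flat.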
This wraps up our linear regression module. In this module we started by seeing a small, one-dimensional example of linear regression, then moved on to a larger example read in from a csv file, and finally to an example of multiple linear regression evaluated with its R^2 score. In the next module, we’ll be covering logistic regression.
To learn more, feel free to reach out to me @yujian_tang on Twitter, connect with me on LinkedIn, and join our Discord. Remember to follow the blog to stay updated with cool Python projects and ways to level up your Software and Python skills! If you liked this article, please Tweet it, share it on LinkedIn, or tell your friends!