Linear regression#
We have now built up all of the knowledge necessary to do some data analysis. We’re going to start with something that you’re probably very familiar with: finding a line of best fit.
In all of the following examples, we will be dealing with abstract data (i.e. \(x\) and \(y\) rather than volume and pressure or time and concentration etc), so as to focus on the methodology behind what we are doing rather than the details of the data itself. There will be frequent references to various statistical concepts, but don’t stress too much if you are unfamiliar with any of the terminology: this is not a stats course.
Let’s imagine that we’ve been given some experimental data that is evidently linear:
import matplotlib.pyplot as plt
import numpy as np
x, y = np.loadtxt('linear.csv', delimiter=',', unpack=True)
plt.plot(x, y, 'o')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
You can download linear.csv here.
In order to find a line of best fit, we would like to do linear regression. In short, we want to find the line:

\(y = mx + c\)

(where \(m\) is the slope and \(c\) is the intercept) that is as close as possible to all the experimental data points. There are many ways to do linear regression in Python, but perhaps the simplest is the linregress function from the scipy library:
from scipy.stats import linregress
regression_data = linregress(x, y)
print(regression_data)
LinregressResult(slope=np.float64(3.434454545454545), intercept=np.float64(11.204090909090908), rvalue=np.float64(0.9961041753472445), pvalue=np.float64(8.369268779950266e-11), stderr=np.float64(0.10134983826003473), intercept_stderr=np.float64(0.5995937291506074))
The first thing you might notice about the code above is that the import statement looks a little different to those we have seen before. The from keyword allows us to import individual functions or constants from a package rather than the whole thing. In this example, we have imported just the linregress function, which means we can call that function by typing linregress(x, y) rather than running import scipy.stats and then having to type out scipy.stats.linregress(x, y). To further clarify this, you may recall that we have made frequent use of scipy to access important physical constants such as \(h\) and \(k_{\mathrm{B}}\) by typing scipy.constants.h or scipy.constants.k. Using the from keyword, we could instead type:
from scipy.constants import h, k
print(h, k)
6.62607015e-34 1.380649e-23
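For comparison, the equivalent without the from keyword would be to import the scipy.constants submodule and then spell out the full path to each constant every time:

import scipy.constants
print(scipy.constants.h, scipy.constants.k)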
Coming back to the rest of our linear regression code, we called the linregress function with two arguments: x and y:
regression_data = linregress(x, y)
These variables contain the experimental \(x\)- and \(y\)-data as plotted previously. We stored the output of the linregress function in a variable called regression_data and then passed this to the print function:
print(regression_data)
The information returned from the linregress function contains the slope \(m\) and the intercept \(c\) of the line of best fit, alongside other useful information which we will discuss further shortly.
regression_data
LinregressResult(slope=np.float64(3.434454545454545), intercept=np.float64(11.204090909090908), rvalue=np.float64(0.9961041753472445), pvalue=np.float64(8.369268779950266e-11), stderr=np.float64(0.10134983826003473), intercept_stderr=np.float64(0.5995937291506074))
We can access the slope \(m\) and the intercept \(c\) as regression_data.slope and regression_data.intercept respectively:
print(f'm = {regression_data.slope}')
print(f'c = {regression_data.intercept}')
m = 3.434454545454545
c = 11.204090909090908
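The other quantities reported by linregress can be accessed in exactly the same way. For example, as a quick indication of how well the line describes the data, we could square the correlation coefficient (rvalue) to get \(R^2\):

print(f'R^2 = {regression_data.rvalue ** 2}')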
We now have all the information we need to construct and plot our line of best fit:
y_best_fit = regression_data.slope * x + regression_data.intercept # y = mx + c
plt.plot(x, y, 'o', label='Experimental data')
plt.plot(x, y_best_fit, label='Line of best fit')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
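As an aside, this is far from the only way to fit a straight line in Python. For example, numpy's polyfit function fits a polynomial of a given degree using the same ordinary least-squares approach, and a degree-1 polynomial is just a straight line; the values it returns should match those from linregress:

m_fit, c_fit = np.polyfit(x, y, 1) # fit a degree-1 polynomial: y = mx + c
print(m_fit, c_fit)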
It’s worth emphasising at this stage that whilst the example is wholly abstract (\(x\) and \(y\) have no physical meaning and have no units), the same procedure could easily be applied to any linear dataset you might wish to analyse.
In addition to the slope and intercept of the line of best fit, the linregress function also gives us access to the standard error associated with both values via regression_data.stderr and regression_data.intercept_stderr:
print(f'm = {regression_data.slope} +/- {regression_data.stderr}')
print(f'c = {regression_data.intercept} +/- {regression_data.intercept_stderr}')
m = 3.434454545454545 +/- 0.10134983826003473
c = 11.204090909090908 +/- 0.5995937291506074
The linregress function uses the ordinary least-squares method to optimise \(m\) and \(c\), which assumes that the uncertainty (more specifically the variance) of every experimental measurement is equal, i.e. all the error bars are the same size. In the example we have worked through above, this assumption is valid (there are no error bars on our plot), but what if we want to account for data with some variance associated with each data point?
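For reference, "ordinary least squares" simply means choosing \(m\) and \(c\) to minimise the sum of the squared residuals, with every data point treated equally:

\(\sum_i \left( y_i - (m x_i + c) \right)^2\)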
Accounting for uncertainties#
Let’s add some uncertainty to the experimental \(y\)-data from our previous example:
y_err = [1.55, 1.55, 1.53, 1.53, 1.33, 1.33, 1.33, 1.12, 1.02, 0.9, 0.9] # uncertainty in each data point
plt.errorbar(x, y, y_err, fmt='o')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
As you can see above, we can plot error bars using plt.errorbar, passing the size of each error bar as a third argument (here passed as the variable y_err). The fmt keyword argument is used to prevent matplotlib from joining each point with a line (this is roughly equivalent to how we can pass strings such as 'o' or 'o-' to the plt.plot function to change how the data is rendered).
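The fmt argument accepts the same kind of format strings as plt.plot, and errorbar has a few extra styling options of its own. For example (purely illustrative choices here), we could use square markers joined by a dashed line and add small caps to the ends of the error bars:

plt.errorbar(x, y, y_err, fmt='s--', capsize=3)
plt.xlabel('x')
plt.ylabel('y')
plt.show()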
To perform linear regression in a way that explicitly accounts for the error bars on this dataset, we are going to use the curve_fit function:
from scipy.optimize import curve_fit
Given that we're trying to fit a straight line, you might be wondering why on earth we're using a function called curve_fit. Simply put, curve_fit is a much more robust general-use function than linregress (it is not limited to linear regression) and it has a keyword argument sigma that can be used to explicitly account for uncertainties in the experimental \(y\)-data using the weighted least-squares method.
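Loosely speaking, weighted least squares minimises the same sum of squared residuals as before, but with each residual scaled by the uncertainty \(\sigma_i\) of that data point, so that noisier points contribute less to the fit:

\(\sum_i \left( \dfrac{y_i - (m x_i + c)}{\sigma_i} \right)^2\)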
As mentioned above, curve_fit is not limited to fitting straight lines (hence the curve in the name). As a result, we have to specify the model we would like to try to fit to our dataset. In this case, our data is linear, so the model we are going to try and fit is the equation of a straight line:

\(y = mx + c\)

where \(m\) is the slope and \(c\) is the intercept. We can define this model in Python as a function:
def straight_line(x, m, c):
    """Calculate y = mx + c (the equation of a straight line).

    Args:
        x (np.ndarray): A numpy array containing all of the x-values.
        m (float): The slope of the line.
        c (float): The y-intercept of the line.

    Returns:
        np.ndarray: The y-values.
    """
    return m * x + c
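As a quick sanity check (with made-up numbers), calling the function with a small array of \(x\)-values, a slope of 2 and an intercept of 1 should give back \(y = 2x + 1\) at each point:

print(straight_line(np.array([0, 1, 2]), 2, 1)) # expect [1 3 5]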
We can now pass our function as an argument to curve_fit, also passing the experimental \(x\)- and \(y\)-data as well as the uncertainties using the sigma keyword argument:
regression_data = curve_fit(straight_line, x, y, sigma=y_err)
curve_fit returns a tuple of two elements:
regression_data
(array([ 3.45325015, 11.13753085]),
array([[ 0.01099572, -0.06900366],
[-0.06900366, 0.53969832]]))
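Because curve_fit returns a tuple, you will often see its two elements unpacked straight into separate variables; the names popt and pcov below are just a common convention (short for "optimised parameters" and "parameter covariance"), not a requirement:

popt, pcov = curve_fit(straight_line, x, y, sigma=y_err)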
The first of these is a numpy array containing the optimised parameters for our model, in the order they are specified in the function definition. In this case, our straight_line function takes the arguments x, m and c:
def straight_line(x, m, c):
m comes before c, so the same order will be reflected in the optimised parameters:
regression_data[0]
array([ 3.45325015, 11.13753085])
The second element of regression_data is the covariance matrix, again in the form of a numpy array:
regression_data[1]
array([[ 0.01099572, -0.06900366],
[-0.06900366, 0.53969832]])
Without getting into the underlying statistics, the important data here is on the diagonal of this matrix. Again in the same order as specified in the function definition, each element on the diagonal is the variance associated with that parameter.
slope, intercept = regression_data[0]
covariance_matrix = regression_data[1]
slope_variance = covariance_matrix[0,0] # First element of the diagonal, row 1 column 1
intercept_variance = covariance_matrix[1,1] # Second element of the diagonal, row 2 column 2
print(f'm = {slope} +/- {np.sqrt(slope_variance)}')
print(f'c = {intercept} +/- {np.sqrt(intercept_variance)}')
m = 3.453250150061519 +/- 0.10486045758833921
c = 11.137530849160365 +/- 0.7346416256182807
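As a shorthand, the square roots of both diagonal elements can be pulled out in one step with np.diag; this is just a more compact version of the code above and should print the same values:

slope_err, intercept_err = np.sqrt(np.diag(covariance_matrix))
print(f'm = {slope} +/- {slope_err}')
print(f'c = {intercept} +/- {intercept_err}')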
Unlike when we used the linregress function, these errors actually account for the uncertainties in the experimental data. We take the square root to get the standard error rather than the variance associated with each parameter. We can visualise our fitted model (our line of best fit) just like we did before:
y_model = straight_line(x, slope, intercept) # y = mx + c
plt.errorbar(x, y, y_err, fmt='o', label='Experimental data')
plt.plot(x, y_model, label='Line of best fit')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
In this case, our final plot and our values of \(m\) and \(c\) are pretty similar whether we use the ordinary least-squares (linregress) or weighted least-squares (curve_fit with sigma) method, but this is certainly not guaranteed to be the case, and it is therefore very important to propagate experimental uncertainties through to our fitted model parameters whenever possible.