NumPy#

In the section on lists you learned how list collection types can be used to store multiple individual objects, such as individual numerical values that together correspond to an experimental data set.

While lists are convenient for storing and working with relatively small amounts of data, they are not ideal for working with large numerical data sets. Creating large lists and iterating over them—for example, to perform some statistical analysis on our dataset—is relatively slow, and can require writing non-trivial amounts of “boilerplate” code, which can make writing code for data analysis cumbersome and prone to bugs.

For example, imagine we have recorded a set of experimental measurements and we want to compute the mean, \(\bar{x}\), and standard error, \(\delta \bar{x}\), which are given by:

\[ \bar{x} = \frac{1}{N}\sum_i x_i \]
\[ \sigma_N = \sqrt{\frac{1}{N}\sum_i\left(x_i - \bar{x}\right)^2} \]
\[ \delta \bar{x} = \frac{\sigma_{N-1}}{\sqrt{N}} \]

Using lists this might look like:

import math

data = [1.0026, 1.0019, 0.9972, 0.9986, 1.0009]

def mean(data):
    """Calcualte the mean of a dataset.
    
    Args:
        data (list(float)): The numerical data.
        
    Returns:
        float
        
    """
    return sum(data)/len(data)

def std_dev(data, ddof=0):
    """Calculate the standard deviation of a dataset.
    
    Args:
        data (list(float)): The numerical data.
        ddof (int): to adjust for bias in limited samples
                    relative to the population estimate of variance.
                    Default is 0.
                    
    Returns:
        float
        
    """
    x_mean = mean(data)
    return math.sqrt(sum([(x - x_mean)**2 for x in data])/(len(data)-ddof))

x_bar = mean(data)
delta_x_bar = std_dev(data, ddof=1) / math.sqrt(len(data))

print(f'mean = {x_bar} ± {delta_x_bar}')
mean = 1.00024 ± 0.0010171528891961034

Performing numerical analyses of this type on large datasets is a very common computational workflow. To do this efficiently using compact code we can use the NumPy package, which provides a large number of numerical computing tools that are fast and efficient.

NumPy is not part of the Python standard library, which means that it must be imported before use.

import numpy as np

We can now reimplement our analysis above using the following, much more compact, numpy code:

data = np.array([1.0026, 1.0019, 0.9972, 0.9986, 1.0009])

x_bar = data.mean()
delta_x_bar = data.std(ddof=1) / np.sqrt(len(data))

print(f'mean = {x_bar} ± {delta_x_bar}')
mean = 1.00024 ± 0.0010171528891961034

The numpy version is also significantly faster if we have large datasets to process.

data = np.random.random(10000) # generate 10,000 random numbers between 0 and 1.
%%timeit

x_bar = data.mean()
delta_x_bar = data.std(ddof=1) / np.sqrt(len(data))
35.8 µs ± 1.7 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit

x_bar = mean(data)
delta_x_bar = std_dev(data, ddof=1) / math.sqrt(len(data))
3.55 ms ± 20.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

NumPy Arrays#

NumPy provides its own data type, called the numpy array. We create a NumPy array similarly to how we create a list, using the np.array() function:

data = np.array([1.0026, 1.0019, 0.9972, 0.9986, 1.0009])

This can then be indexed in the same way as a list:

data[2:5]
array([0.9972, 0.9986, 1.0009])
data[::-1]
array([1.0009, 0.9986, 0.9972, 1.0019, 1.0026])

NumPy arrays can only contain data of the same type, in contrast to lists, which can contain data of different types. Trying to create a NumPy array that contains data of different types will either cause an error, or your data will be converted to a single consistent type (which probably is not what you wanted to happen).

my_list = ['yellow', 43.0, False]
print(my_list)
['yellow', 43.0, False]
my_array = np.array(['yellow', 43.0, False])
print(my_array)
['yellow' '43.0' 'False']

NumPy arrays can be created from lists:

data = [1.0026, 1.0019, 0.9972, 0.9986, 1.0009]
data_array = np.array(data)
data_array
array([1.0026, 1.0019, 0.9972, 0.9986, 1.0009])

NumPy arrays can also have more than one dimension, which makes them useful for representing multidimensional datasets, or mathematical objects like matrices.

A = np.array([[1, 2, 0], [1, 3, 4], [-2, 0.5, 1]])
A
array([[ 1. ,  2. ,  0. ],
       [ 1. ,  3. ,  4. ],
       [-2. ,  0.5,  1. ]])

Individual elements, or slices, of a N-dimensional NumPy array can be referenced using the same syntax as for nested lists. For example, A[1] gives the second “row” of this matrix:

A[1] # second row
array([1., 3., 4.])

which we can then also slice:

A[1][2] # second row, third element
4.0

NumPy also allows a more compact N-dimensional index notation, using comma-separated values:

A[1,2] # second row, third column
4.0
A[-1,1:] # last row, everything from the second column to the end
array([0.5, 1. ])

Arithmetic with NumPy arrays#

The fact that NumPy arrays are all of a same type means that it is possible to perform arthmetic on them. For example, all of the items in an array can be multipled by a single value.

np.array([0.1, 0.2, 0.3]) * 10.0
array([1., 2., 3.])

Or two NumPy arrays can operator on each other.

np.array([5, 10, 15]) + np.array([5.0, 0.0, -5.0])
array([10., 10., 10.])

These operations on every element of a NumPy array are called vector operations.

Using NumPy arrays for value-wise operations such as those shown where are very efficient and run much faster than explicitly looping over elements in a list.

NumPy functions#

In addition to the power of the NumPy array (on which some of Python’s most impressive libraries are built), the NumPy library also enables access to a broad range of useful functions. For example, the np.log function that was introduced at the start, differs from the math.log function introduced earlier As the former can operate on NumPy array when the latter cannot.

K = np.array([1.06, 3.8, 15.0, 45.44, 150.6])

np.log(K)
array([0.05826891, 1.33500107, 2.7080502 , 3.81639277, 5.01462732])
from math import log

log(K)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[21], line 3
      1 from math import log
----> 3 log(K)

TypeError: only length-1 arrays can be converted to Python scalars

The function from the math module will result in an error.

Alongside these mathematical operations, the NumPy library also enables statistical operations on the NumPy arrays. For example, sum, mean and standard deviations are easy to find. These functions can be used using the np.function_name() syntax, or used as methods of a numpy array (using dot syntax).

mass_numbers = np.array([112, 114, 115, 116, 117, 118, 119, 120, 122, 124])

print(np.sum(mass_numbers))
print(np.mean(mass_numbers))
print(np.std(mass_numbers))
1177
117.7
3.4942810419312296
print(mass_numbers.sum())
print(mass_numbers.mean())
print(mass_numbers.std())
1177
117.7
3.4942810419312296

Exercises:#

  1. The percentage abundances of the natural isotopes of tin are:
    112: 0.0097,
    114: 0.0066,
    115: 0.0034,
    116: 0.1454,
    117: 0.0768,
    118: 0.2422,
    119: 0.0859,
    120: 0.3258,
    122: 0.0463,
    124: 0.0579
    Calculate the average (mean) mass of naturally occurring tin.

  2. Rewrite or modify your code from the Loops Exercise to calculate the distances between each pair of atoms, using numpy arrays to store the atom positions, and vector arithmetic to calculate the vectors between pairs of atoms.