NumPy#
NumPy (Numerical Python) is a fundamental package for scientific computing in Python. NumPy provides support for large, multi-dimensional arrays and matrices, along with a vast collection of high-level mathematical functions to operate on these arrays efficiently.
Using NumPy makes working with numerical data much easier and quicker than performing the same operations using only with the Python standard library (e.g., storing and manipulating data using lists).
As an example of how NumPy can make data analysis much simpler, let us consider the following example data analysis problem:
Imagine we have performed a series of experimental measurements and want to report our results as (mean ± standard error).
For data \(x\), the mean, \(\bar{x}\), and standard error, \(\delta \bar{x}\), are given by:
If we are only using the Python standard library, we might store our data as a list, and then write two functions mean()
and std_dev()
to calculate the mean and standard deviation, respectively. For example,
import math
def mean(data):
"""Calculate the mean of a dataset.
Args:
data (list(float)): The numerical data.
Returns:
float
"""
return sum(data)/len(data)
def std_dev(data, ddof=0):
"""Calculate the standard deviation of a dataset.
Args:
data (list(float)): The numerical data.
ddof (int): to adjust for bias in limited samples
relative to the population estimate of variance.
Default is 0.
Returns:
float
"""
x_mean = mean(data)
return math.sqrt(sum([(x - x_mean)**2 for x in data])/(len(data)-ddof))
data = [1.0026, 1.0019, 0.9972, 0.9986, 1.0009]
x_bar = mean(data)
delta_x_bar = std_dev(data, ddof=1) / math.sqrt(len(data))
print(f'mean = {x_bar} ± {delta_x_bar}')
mean = 1.00024 ± 0.0010171528891961034
Alternatively, the same calculation using NumPy looks like
import numpy as np
data = np.array([1.0026, 1.0019, 0.9972, 0.9986, 1.0009])
x_bar = data.mean()
delta_x_bar = data.std(ddof=1) / np.sqrt(len(data))
print(f'mean = {x_bar} ± {delta_x_bar}')
mean = 1.00024 ± 0.0010171528891961034
Now our code is much shorter, which makes it is quicker to write and easier read. We have avoided having to write our own mean()
and std_dev()
functions, which means less opportunity for bugs. And, if we are working with large datasets, the NumPy version runs significantly faster.
data = np.random.random(10000) # generate 10,000 random numbers between 0 and 1.
%%timeit
x_bar = mean(data)
delta_x_bar = std_dev(data, ddof=1) / math.sqrt(len(data))
3.77 ms ± 17.7 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
x_bar = data.mean()
delta_x_bar = data.std(ddof=1) / np.sqrt(len(data))
32.8 μs ± 50.4 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Let us return to our NumPy code:
import numpy as np
data = np.array([1.0026, 1.0019, 0.9972, 0.9986, 1.0009])
x_bar = data.mean()
delta_x_bar = data.std(ddof=1) / np.sqrt(len(data))
print(f'mean = {x_bar} ± {delta_x_bar}')
NumPy is not part of the Python standard library, which means it must be imported before use.
The import
statement in the code above is slightly different to those we have encountered before. Rather than just writing import numpy
we have import numpy as np
. The as
keyword allows us to assign numpy
and all of its functions to the variable name np
.
These two pieces of code are equivalent:
import numpy
data = numpy.array([1.0026, 1.0019, 0.9972, 0.9986, 1.0009])
print(data.mean())
1.00024
import numpy as np
data = np.array([1.0026, 1.0019, 0.9972, 0.9986, 1.0009])
print(data.mean())
1.00024
NumPy Arrays#
NumPy provides its own data type, called the NumPy array. We create a NumPy array similarly to how we create a list
, using the np.array()
function:
data = np.array([1.0026, 1.0019, 0.9972, 0.9986, 1.0009])
This can then be indexed in the same way as a list
:
data[2:5]
array([0.9972, 0.9986, 1.0009])
data[::-1]
array([1.0009, 0.9986, 0.9972, 1.0019, 1.0026])
NumPy arrays can only contain data of the same type, in contrast to lists
, which can contain data of different types.
Trying to create a NumPy array that contains data of different types will either cause an error, or your data will be converted to a single consistent type (which probably is not what you wanted to happen).
my_list = ['yellow', 43.0, False]
print(my_list)
['yellow', 43.0, False]
my_array = np.array(['yellow', 43.0, False])
print(my_array)
['yellow' '43.0' 'False']
NumPy arrays can be created from lists
:
data = [1.0026, 1.0019, 0.9972, 0.9986, 1.0009] # this is a list
data_array = np.array(data) # create a NumPy array from our list
data_array
array([1.0026, 1.0019, 0.9972, 0.9986, 1.0009])
NumPy arrays can also have more than one dimension, which makes them useful for representing multidimensional datasets, or mathematical objects like matrices.
A = np.array([[1, 2, 0], [1, 3, 4], [-2, 0.5, 1]])
A
array([[ 1. , 2. , 0. ],
[ 1. , 3. , 4. ],
[-2. , 0.5, 1. ]])
Individual elements, or slices, of a N-dimensional NumPy array can be referenced using the same syntax as for nested lists. For example, A[1]
gives the second “row” of this two-dimensional array:
A[1] # second row
array([1., 3., 4.])
which we can then also slice:
A[1][2] # second row, third element
np.float64(4.0)
NumPy also allows a more compact N-dimensional index notation, using comma-separated values:
A[1,2] # second row, third column
np.float64(4.0)
A[-1,1:] # last row, everything from the second column to the end
array([0.5, 1. ])
Vector Arithmetic#
The fact that NumPy arrays are all of a same type means that it is possible to perform arthmetic on them. For example, all of the items in an array can be multipled by a single value.
list_of_numbers = [1, 9, 17, 22, 45, 68]
array_of_numbers = np.array(list_of_numbers)
print(array_of_numbers)
[ 1 9 17 22 45 68]
If we print
a NumPy array, it looks much list a list. NumPy arrays, however, allow us to perform vector arithmetic where we perform the same operation on every element of an array, or on every pair of elements from two arrays.
array_of_numbers
array([ 1, 9, 17, 22, 45, 68])
array_of_numbers * 2
array([ 2, 18, 34, 44, 90, 136])
array_of_numbers + 10
array([11, 19, 27, 32, 55, 78])
Using the standard mathematical operators such as *
and +
allows us to operate simultaneously on all elements of the array. If we run array_of_numbers * 2
, we multiply every element of array_of_numbers by 2.
We can also add, subtract, divide and multiply arrays together. Each of these operations is carried out element-wise:
array_1 = np.array([1, 5, 7])
array_2 = np.array([3, 1, 2])
print(f'We can add arrays together: {array_1 + array_2}')
print(f'Take them away from one another: {array_1 - array_2}')
print(f'Multiply them together: {array_1 * array_2}')
print(f'Divide one by the other: {array_1 / array_2}')
We can add arrays together: [4 6 9]
Take them away from one another: [-2 4 5]
Multiply them together: [ 3 5 14]
Divide one by the other: [0.33333333 5. 3.5 ]
array_3 = array_1 + array_2
print(array_1)
print(array_2)
print('-------')
print(array_3)
[1 5 7]
[3 1 2]
-------
[4 6 9]
Here, the first element of array_3
is the sum of the first element of array_1
and the first element of array_2
. The second element of array_3
is the sum of the second element of array_1
and the second element of array_2
. And the third element of array_3
is the sum of the third element of array_1
and the third element of array_3
.
This is precisely how vector addition works, hence the term vector arithmetic.
Note how much simpler elementwise arithmetic is using NumPy arrays than with lists:
array_1 = np.array([1, 5, 7])
array_2 = np.array([3, 1, 2])
array_1 + array_2
array([4, 6, 9])
list_1 = [1, 5, 7]
list_2 = [3, 1, 2]
[a + b for a, b in zip(list_1, list_2)]
[4, 6, 9]
More numpy
Functions#
Aside from the array function, numpy also provides us with many more useful functions that are well worth being aware of.
Previously, we learnt about the math
library, which gives us access to various mathematical functions that go beyond the simple operators available as part of the Python standard library, such as +
or /
:
import math
math.log(2)
0.6931471805599453
The math
functions are designed for int
and float
objects; they do not work with data structures such as a list
:
numbers = [1, 2, 3, 4, 5]
math.log(numbers)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[28], line 3
1 numbers = [1, 2, 3, 4, 5]
----> 3 math.log(numbers)
TypeError: must be real number, not list
So, if we want to calculate the natural logarithm of every value in a list, we have to use a loop:
log_numbers = []
for number in numbers:
log_numbers.append(math.log(number))
log_numbers
or a list comprehension:
log_numbers = [math.log(n) for n in numbers]
log_numbers
Using NumPy we can calculate these logarithms as a single vector operation using the NumPy natural logarithm function, np.log()
:
numbers_array = np.array([1, 2, 3, 4, 5])
np.log(numbers_array)
To give another concrete example, let us say we want to take the exponential of a single number. We can do this with the math
library:
math.exp(5)
If we want to calculate the exponentials of multiple numbers, we have to write a loop (or a list comprehension), or we can use the NumPy version of the exp()
function:
numbers = [1, 2, 3, 4, 5]
np.exp(numbers)
You may have noticed that in this example we passed a **list** into `np.exp()`. This is completely fine. NumPy functions take **array-like** objects as argument, i.e., lists or tuples, as well as NumPy arrays. The only requirement is that a list or tuple contains data of a single type. Note that `np.exp()` returns an array, irrespective of the type of the input data.
NumPy has an enormous number of useful functions and methods. Some that are particularly helpful to know about are:
np.mean()
, np.std()
, np.sum()
#
example_array = np.array([200, 220, 198, 181, 201, 156])
mean_of_example = np.mean(example_array)
sum_of_example = np.sum(example_array)
std_dev_of_example = np.std(example_array)
print(f'The mean = {mean_of_example}')
print(f'The sum = {sum_of_example}')
print(f'The standard deviation = {std_dev_of_example}')
The mean = 192.66666666666666
The sum = 1156
The standard deviation = 19.913702708325125
These operations can also be written as methods of arrays:
Note
You can think of methods as functions that belong to specific variables. Calling the method for a particular object is equivalent to calling a function with that object as the first argument. So my_array.mean()
is analogous to np.mean(my_array)
.
Like standard functions, methods can take zero or more arguments. For example, in our earlier calculation of standard error we passed the optional argument ddof=1
to np.std()
to account for us wanting \(\sigma_{N-1}\) instead of \(\sigma_N\).
std_error = data.std(ddof=1) / np.sqrt(len(data))
Using conventional functions, we would write this as
std_error = np.std(data, ddof=1) / np.sqrt(len(data))
example_array = np.array([200, 220, 198, 181, 201, 156])
mean_of_example = example_array.mean()
sum_of_example = example_array.sum()
std_dev_of_example = example_array.std()
print(f'The mean = {mean_of_example}')
print(f'The sum = {sum_of_example}')
print(f'The standard deviation = {std_dev_of_example}')
The mean = 192.66666666666666
The sum = 1156
The standard deviation = 19.913702708325125
Exercises:#
Exercise 1#
The percentage abundances of the natural isotopes of tin are:
Mass | % Abundance |
---|---|
112 | 0.0097 |
114 | 0.0066 |
115 | 0.0034 |
116 | 0.1454 |
117 | 0.0768 |
118 | 0.2422 |
119 | 0.0859 |
120 | 0.3258 |
122 | 0.0463 |
124 | 0.0579 |
Create two numpy arrays to store the mass and abundance data. Then use vector arithmetic to calculate the average (mean) mass of naturally occurring tin.
Exercise 2#
Rewrite or modify your code from the Loops Exercise to calculate the distances between each pair of atoms, using numpy arrays to store the atom positions, and vector arithmetic to calculate the vectors between pairs of atoms.
Hints for Exercise 2#
If you are representing atom positions by vectors, \(r_i\) and \(r_j\) then the position of one atom relative to the other is given by \(r_{ij} = r_j - r_i\) (or, by \(r_{ji} = r_i - r_j\), depending on which relative position we want).
The distance between atoms \(i\) and \(j\) is then the magnitude of this vector: \(d_{ij} = |r_{ij}| = |r_{ji}|\).
We can calculate this distance using Pythagoras’s theorem:
If we were storing our atom positions in length-3 lists, we might code this as
distance = math.sqrt((r_j[0] - r_i[0])**2 + (r_j[1] - r_i[1])**2 + (r_j[2] - r_i[2])**2)
or (using a list comprehension)
distance = math.sqrt(sum([(j - i)**2 for i, j in zip(r_i, r_j)]))
(check that you understand why and how these two lines of code perform the same calculation).
But, remember that NumPy arrays allow us to perform operations on whole arrays or on pairs of arrays, without having to explicitly specify which elements we are operating on.
Exercise 3#
NumPy provides a range of useful functions for common algebraic operations, including an entire submodule for linear algebra (working with vectors and matrices).
The length of a vector can be computed directly using np.linalg.norm()
.
i.e., these two calculations are equivalent
v = np.array([3.0,4.0])
print(math.sqrt(v[0]**2 + v[1]**2))
5.0
v = np.array([3.0,4.0])
print(np.linalg.norm(v))
5.0
Rewrite your code to calculate distances between pairs of atoms using np.linalg.norm()
.