NumPy

As of lab 4, we have covered most of the Python fundamentals. From this point onwards, we will be working through concepts that are particularly useful for data analysis.

We’re going to start by learning about a new library: numpy.

import numpy as np

temperatures = np.array([200, 220, 240, 260, 280, 300, 320, 340, 360, 380, 400])

temperatures ** 2
array([ 40000,  48400,  57600,  67600,  78400,  90000, 102400, 115600,
       129600, 144400, 160000])

You may have noticed that the import statement in the code above is slightly different to those we have encountered before. Rather than just typing import numpy, we have import numpy as np. The as keyword is used here to effectively “nickname” the numpy library, so that rather than typing numpy.name_of_function, we can type np.name_of_function.

In this example, we have passed a list of temperatures to the np.array function and then raised this new object to the power of \(2\). As you can see, the final output is just the original temperatures squared. As a reminder, this would not work with a standard list:

temperatures = [200, 220, 240, 260, 280, 300, 320, 340, 360, 380, 400]

temperatures ** 2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 3
      1 temperatures = [200, 220, 240, 260, 280, 300, 320, 340, 360, 380, 400]
----> 3 temperatures ** 2

TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

We can therefore conclude that it is the object created by the np.array function, rather than the list itself, that allows us to calculate the square of each temperature in such a compact way.
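For comparison, here is a minimal sketch of what squaring every element looks like with a plain list versus a numpy array. The variable names (temperatures_list, squared_list, squared_array) are made up for this illustration, and a shorter list is used to keep the output small:

```python
import numpy as np

temperatures_list = [200, 220, 240, 260, 280]

# With a list, each element has to be squared one at a time.
squared_list = [temperature ** 2 for temperature in temperatures_list]

# With a numpy array, the ** operator is applied to every element at once.
squared_array = np.array(temperatures_list) ** 2

print(squared_list)
print(squared_array)
```

Both approaches produce the same numbers; the numpy version simply avoids writing the loop ourselves.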

The np.array function, perhaps unsurprisingly, creates a numpy array:

temperatures = np.array([200, 220, 240, 260, 280, 300, 320, 340, 360, 380, 400])

type(temperatures)
numpy.ndarray

numpy arrays behave very similarly to vectors. If you are not overly familiar with vectors or linear algebra, do not panic. For the purposes of this course we won’t be worrying too much about the mathematical background describing why numpy arrays work like they do, but rather how we can use them to our advantage.

Vector arithmetic

Okay, so we have a new data structure to play with: numpy arrays. Unlike the other data structures we have looked at so far (lists, tuples and dictionaries), numpy arrays are not built in to Python, which is why we had to import numpy explicitly. To create a numpy array, we use the np.array function:

list_of_numbers = [1, 9, 17, 22, 45, 68]

array_of_numbers = np.array(list_of_numbers)

print(array_of_numbers)
[ 1  9 17 22 45 68]

Here we can see that, if we just print one out, a numpy array looks much like a list. Things get much more interesting when we start doing mathematical operations with them:

array_of_numbers
array([ 1,  9, 17, 22, 45, 68])
array_of_numbers * 2
array([  2,  18,  34,  44,  90, 136])
array_of_numbers + 10
array([11, 19, 27, 32, 55, 78])

Here we see that using the standard mathematical operators such as * and + allows us to operate simultaneously on all elements of the array. If we run array_of_numbers * 2, we multiply every element of array_of_numbers by \(2\).
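The same idea should carry over to the other arithmetic operators. As a small sketch, reusing the array_of_numbers values from above:

```python
import numpy as np

array_of_numbers = np.array([1, 9, 17, 22, 45, 68])

print(array_of_numbers - 1)   # subtract 1 from every element
print(array_of_numbers / 2)   # divide every element by 2 (the result contains floats)
print(array_of_numbers ** 2)  # square every element
```

Note that dividing an array of integers with / gives an array of floats, just as dividing two int objects does in regular Python.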

array_1 = np.array([1, 5, 7])
array_2 = np.array([3, 1, 2])

print(f'We can add arrays together: {array_1 + array_2}')
print(f'Take them away from one another: {array_1 - array_2}')
print(f'Multiply them together: {array_1 * array_2}')
print(f'Divide one by the other: {array_1 / array_2}')
We can add arrays together: [4 6 9]
Take them away from one another: [-2  4  5]
Multiply them together: [ 3  5 14]
Divide one by the other: [0.33333333 5.         3.5       ]

As shown above, we can also add, subtract, divide and multiply arrays together. Each of these operations is carried out element-wise:

array_3 = array_1 + array_2

print(array_1)
print(array_2)
print(array_3)
[1 5 7]
[3 1 2]
[4 6 9]

Using the example of addition shown above, we see that the first element of array_3 is the sum of the first element of array_1 and the first element of array_2. Similarly, the second element of array_3 is the sum of the second element of array_1 and the second element of array_2 and so on and so forth:

\[\begin{split}\begin{bmatrix} a_{1} \\ a_{2} \\ \vdots \\ a_{N} \end{bmatrix} + \begin{bmatrix} b_{1} \\ b_{2} \\ \vdots \\ b_{N} \end{bmatrix} = \begin{bmatrix} a_{1} + b_{1} \\ a_{2} + b_{2} \\ \vdots \\ a_{N} + b_{N} \end{bmatrix},\end{split}\]

this is precisely how vector addition works, hence the term vector arithmetic.
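To make the element-wise picture concrete, the sketch below writes out the same addition by hand using plain lists and a loop, and compares it with the numpy version. The names list_1, list_2, sum_by_loop and sum_by_numpy are illustrative only:

```python
import numpy as np

list_1 = [1, 5, 7]
list_2 = [3, 1, 2]

# Element-wise addition written out by hand: zip pairs up the
# first elements, then the second elements, and so on.
sum_by_loop = []
for a, b in zip(list_1, list_2):
    sum_by_loop.append(a + b)

# The same calculation using vector arithmetic on numpy arrays.
sum_by_numpy = np.array(list_1) + np.array(list_2)

print(sum_by_loop)
print(sum_by_numpy)
```

Both approaches give 4, 6 and 9; the numpy version just expresses it in a single line.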


A quick sidenote: numpy arrays, unlike the other data structures we have looked at, must contain homogeneous data. A list can have any data that you would like in it:

example_list = [1, 'a', True]

print(example_list)
[1, 'a', True]

But if we try to create a numpy array out of this list:

example_array = np.array(example_list)

print(example_array)
['1' 'a' 'True']

You can see that each element has been turned into a string (notice the quotation marks around '1' and 'True'). This is usually not what we want, so if you need a sequence of different data types, use a list, not a numpy array.
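One way to check what kind of data an array holds is its dtype attribute. This attribute has not been introduced in the text above, so treat the following as an optional sketch; the array names are invented for the example:

```python
import numpy as np

integer_array = np.array([1, 2, 3])
mixed_number_array = np.array([1, 2.5, 3])
string_array = np.array([1, 'a', True])

print(integer_array.dtype)       # an integer type (the exact name depends on your platform)
print(mixed_number_array.dtype)  # float64: the integers were converted to floats
print(string_array.dtype)        # a string type: every element was converted to text
```

In each case numpy picks a single type that can represent every element, which is why mixing numbers and strings leaves us with an array of strings.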

Slicing and dicing

We’ve now seen how numpy arrays behave quite differently from other data structures with respect to mathematical operations, but in many other ways they are no different to lists.

We can access any element of a numpy array by indexing it just like a list:

example_array = np.array([1, 2, 3, 4, 5])
example_array[2]
3
example_array[0]
1
example_array[-1]
5

Similarly, we can slice a numpy array to retrieve only certain elements:

example_array[1:4]
array([2, 3, 4])
example_array[3:0:-1]
array([4, 3, 2])
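The usual list slicing shortcuts (leaving out the start or stop, or supplying a step) should work in the same way. A short sketch using the same five-element array:

```python
import numpy as np

example_array = np.array([1, 2, 3, 4, 5])

print(example_array[:3])    # the first three elements
print(example_array[::2])   # every second element
print(example_array[::-1])  # the whole array in reverse order
```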

Just like we can have nested lists, we can have multidimensional numpy arrays:

nested_list = [[1, 2, 3], 
               [4, 5, 6], 
               [7, 8, 9]]

example_array = np.array(nested_list)

print(example_array)
[[1 2 3]
 [4 5 6]
 [7 8 9]]

These can be indexed again just like regular list objects:

example_array[0]
array([1, 2, 3])
example_array[0][1]
2
example_array[1][2]
6

One difference from lists is that numpy allows us to use a more compact syntax for indexing multidimensional arrays:

example_array[2,1]
8

As opposed to:

example_array[2][1]
8

Just like one-dimensional numpy arrays behave like vectors, two-dimensional arrays behave like matrices. You can therefore think of the array indexing syntax as:

name_of_array[row_index, column_index]

For example:

example_array
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
example_array[2, 0]
7

Here our row index is \(2\), therefore we are accessing the third row of example_array (remember Python starts counting from zero!). The column index is \(0\), so we are looking in the first column of example_array. Sure enough, we find that the number \(7\) is located in the first column of the third row of example_array.
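The row and column positions can also take slices rather than single indices, which lets us pull out a whole row, a whole column or a smaller block. This goes slightly beyond the examples above, so take it as an illustrative sketch (first_row, last_column and top_left are made-up names):

```python
import numpy as np

example_array = np.array([[1, 2, 3],
                          [4, 5, 6],
                          [7, 8, 9]])

first_row = example_array[0, :]    # every column of the first row
last_column = example_array[:, 2]  # every row of the third column
top_left = example_array[:2, :2]   # first two rows and first two columns

print(first_row)
print(last_column)
print(top_left)
```

A bare colon means "everything along this axis", exactly as it does when slicing a list.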

If you are not particularly keen on vectors or matrices, you do not have to use this mental model all of the time. For the most part, you can think of numpy arrays as fancy lists that allow you to perform certain mathematical operations on sequences of numbers very quickly and efficiently.

More numpy functions

Aside from the array function, numpy also provides us with many more useful functions that are well worth being aware of.

Previously, we learnt about the math library, which gives us access to various mathematical functions that go beyond the simple operators available in Python automatically such as + or /:

import math

math.log(2)
0.6931471805599453

The math functions are designed for int and float objects. They do not work with data structures such as a list:

numbers = [1, 2, 3, 4, 5]

math.log(numbers)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[30], line 3
      1 numbers = [1, 2, 3, 4, 5]
----> 3 math.log(numbers)

TypeError: must be real number, not list

So, if we want to take the natural logarithm of each value in numbers, we have to use a loop:

log_numbers = []
for number in numbers:
    log_numbers.append(math.log(number))

log_numbers
[0.0,
 0.6931471805599453,
 1.0986122886681098,
 1.3862943611198906,
 1.6094379124341003]

This same problem occurs for numpy arrays as well as lists:

example_array = np.array([1, 2, 3, 4, 5])

math.log(example_array)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[32], line 3
      1 example_array = np.array([1, 2, 3, 4, 5])
----> 3 math.log(example_array)

TypeError: only length-1 arrays can be converted to Python scalars

Rather than using a loop like we had to with a list, we can get around this problem in a different way:

np.log(example_array)
array([0.        , 0.69314718, 1.09861229, 1.38629436, 1.60943791])

Notice that we’re not calling math.log here, but np.log: the numpy version of the log function. Just like the element-wise operations we talked about before:

example_array + 1
array([2, 3, 4, 5, 6])

numpy also allows us to apply more complex functions such as the natural logarithm element-wise to an array.

Warning

This is a major point of confusion for a lot of Python beginners, so let’s be very clear.

If you want to use the math functions, these will only work on single numbers, i.e. int and float objects. To perform similar calculations on sequences of numbers, you can use the numpy versions of these functions (which have the same name) on a numpy array.

To give another concrete example, let’s say we want to take the exponential of a single number. We can do this with the math library:

math.exp(5)
148.4131591025766

If we want to do the same operation on multiple numbers, we use the numpy version of the exp function. Here we pass it a plain list, which numpy converts to an array for us:

numbers = [1, 2, 3, 4, 5]

np.exp(numbers)
array([  2.71828183,   7.3890561 ,  20.08553692,  54.59815003,
       148.4131591 ])

To summarise: for individual numbers use math, and for multiple numbers use numpy. It is not only more convenient, but very often much faster too.
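As one more side-by-side illustration of this rule, here is a sketch using the square root function, which exists as both math.sqrt and np.sqrt. The variable names are chosen purely for the example:

```python
import math
import numpy as np

numbers = [1, 4, 9, 16]

# math.sqrt handles one number at a time, so a list needs a loop
# (written here as a list comprehension).
roots_with_math = [math.sqrt(number) for number in numbers]

# np.sqrt handles the whole sequence in a single call.
roots_with_numpy = np.sqrt(np.array(numbers))

print(roots_with_math)
print(roots_with_numpy)
```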


A few more numpy functions to be aware of include the mean, sum and std functions:

example_array = np.array([200, 220, 198, 181, 201, 156])

mean_of_example = np.mean(example_array)
sum_of_example = np.sum(example_array)
std_dev_of_example = np.std(example_array)

print(f'The mean = {mean_of_example}')
print(f'The sum = {sum_of_example}')
print(f'The standard deviation = {std_dev_of_example}')
The mean = 192.66666666666666
The sum = 1156
The standard deviation = 19.913702708325125
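If it helps to see what these functions are doing, the mean and standard deviation can be rebuilt from the element-wise operations covered earlier. This is only a sketch for intuition. It relies on the fact that np.std, with its default settings, computes the population standard deviation (dividing by the number of values rather than by one less than it):

```python
import numpy as np

example_array = np.array([200, 220, 198, 181, 201, 156])

# Mean: add everything up and divide by the number of values.
mean_by_hand = np.sum(example_array) / len(example_array)

# Standard deviation: square root of the mean squared deviation from the mean.
std_by_hand = np.sqrt(np.mean((example_array - mean_by_hand) ** 2))

print(mean_by_hand, np.mean(example_array))
print(std_by_hand, np.std(example_array))
```

In practice there is no reason to write this out by hand; np.mean and np.std are shorter and less error-prone.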

You might be wondering why there is a numpy version of the sum function, given that the built-in version works just fine on numpy arrays:

sum(example_array)
1156

The numpy version of the sum function calculates the sum over a given axis. In our one-dimensional example, this makes no difference, but if we have a two-dimensional array:

example_array = np.array([[1, 2, 3, 4], 
                          [5, 6, 7, 8]])

sum(example_array)
array([ 6,  8, 10, 12])

The built-in sum function is only able to add up the rows, whereas the numpy version:

option_1 = np.sum(example_array, axis=0)
option_2 = np.sum(example_array, axis=1)
option_3 = np.sum(example_array)

print(f'Sum along axis 0 = {option_1}')
print(f'Sum along axis 1 = {option_2}')
print(f'Sum along all values = {option_3}')
Sum along axis 0 = [ 6  8 10 12]
Sum along axis 1 = [10 26]
Sum along all values = 36

Gives us control over the summation via the axis keyword argument. In this course, you almost certainly won’t have to worry about the different ways in which you can sum over a numpy array, but for your information if you’re interested:

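  • Axis \(0\) corresponds to summing down the rows, i.e. adding the rows together to give one total per column (which is what the built-in sum function does).

  • Axis \(1\) corresponds to summing along each row, i.e. adding the columns together to give one total per row.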

  • Providing no axis argument at all allows us to sum all values.
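The axis argument is not specific to np.sum. As a hedged sketch, the example below uses np.mean on the same two-dimensional array, along with the shape attribute (not covered above) to show how many rows and columns there are:

```python
import numpy as np

example_array = np.array([[1, 2, 3, 4],
                          [5, 6, 7, 8]])

print(example_array.shape)  # (2, 4): 2 rows and 4 columns

column_means = np.mean(example_array, axis=0)  # one value per column
row_means = np.mean(example_array, axis=1)     # one value per row

print(column_means)
print(row_means)
```

With axis=0 we get four values (one for each column), and with axis=1 we get two values (one for each row).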


We were introduced to the range function back in lab 3. This allows us to generate sequences of integers:

for number in range(0, 22, 2):
    print(number)
0
2
4
6
8
10
12
14
16
18
20

Here we’ve printed all of the even numbers between \(0\) and \(20\) by passing the appropriate arguments to range (remember that the optional third argument is the step size between values).

numpy provides its own version of the range function which returns a numpy array:

np.arange(0, 22, 2)
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20])

The arguments passed to np.arange work in exactly the same way as those of the built-in range function, except that rather than producing a sequence of int objects, we now get a numpy array.
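A small sketch comparing the two functions directly is given below. It also shows one difference not mentioned above: np.arange accepts a non-integer step, which range does not. The variable names are illustrative:

```python
import numpy as np

from_range = list(range(0, 22, 2))  # a list of int objects
from_arange = np.arange(0, 22, 2)   # a numpy array holding the same values

print(from_range)
print(from_arange)

# Unlike range, np.arange also accepts a non-integer step.
quarter_steps = np.arange(0, 1, 0.25)
print(quarter_steps)
```

As with range, the end point (here 22 and 1) is not included in the output.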

To go along with the np.arange function, we also have the linspace function:

np.linspace(0, 10, 100)
array([ 0.        ,  0.1010101 ,  0.2020202 ,  0.3030303 ,  0.4040404 ,
        0.50505051,  0.60606061,  0.70707071,  0.80808081,  0.90909091,
        1.01010101,  1.11111111,  1.21212121,  1.31313131,  1.41414141,
        1.51515152,  1.61616162,  1.71717172,  1.81818182,  1.91919192,
        2.02020202,  2.12121212,  2.22222222,  2.32323232,  2.42424242,
        2.52525253,  2.62626263,  2.72727273,  2.82828283,  2.92929293,
        3.03030303,  3.13131313,  3.23232323,  3.33333333,  3.43434343,
        3.53535354,  3.63636364,  3.73737374,  3.83838384,  3.93939394,
        4.04040404,  4.14141414,  4.24242424,  4.34343434,  4.44444444,
        4.54545455,  4.64646465,  4.74747475,  4.84848485,  4.94949495,
        5.05050505,  5.15151515,  5.25252525,  5.35353535,  5.45454545,
        5.55555556,  5.65656566,  5.75757576,  5.85858586,  5.95959596,
        6.06060606,  6.16161616,  6.26262626,  6.36363636,  6.46464646,
        6.56565657,  6.66666667,  6.76767677,  6.86868687,  6.96969697,
        7.07070707,  7.17171717,  7.27272727,  7.37373737,  7.47474747,
        7.57575758,  7.67676768,  7.77777778,  7.87878788,  7.97979798,
        8.08080808,  8.18181818,  8.28282828,  8.38383838,  8.48484848,
        8.58585859,  8.68686869,  8.78787879,  8.88888889,  8.98989899,
        9.09090909,  9.19191919,  9.29292929,  9.39393939,  9.49494949,
        9.5959596 ,  9.6969697 ,  9.7979798 ,  9.8989899 , 10.        ])

linspace means linearly-spaced. This function returns a number of values between a given start and end point, each one equally spaced apart from each other. Here we have passed the arguments \(0\), \(10\) and \(100\):

  • The first argument \(0\) is the start point: we would like a range of values starting from \(0\) (just like the range function).

  • The second argument \(10\) is the end point, but unlike the range or arange functions, this end point is included in the output.

  • The third argument \(100\) is not the step size, but rather the total number of values we would like.

So to put it concisely, this example computes \(100\) evenly-spaced values between \(0\) and \(10\).
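A smaller example may make the behaviour easier to see. The sketch below asks for just \(5\) values between \(0\) and \(10\); the names values and spacing are made up for this illustration:

```python
import numpy as np

values = np.linspace(0, 10, 5)
print(values)

# Because both end points are included, the gap between neighbouring
# values is (end - start) / (number of values - 1).
spacing = (10 - 0) / (5 - 1)
print(spacing)
```

Here the values are 0, 2.5, 5, 7.5 and 10, so the spacing is 2.5 rather than 10 / 5 = 2. The same reasoning explains why the spacing in the example above is 10 / 99 (roughly 0.101) rather than 0.1.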