NumPy#
As of lab 4, we have covered most of the Python fundamentals. From this point onwards, we will be working through concepts that are particularly useful for data analysis.
We’re going to start by learning about a new library: numpy.
import numpy as np
temperatures = np.array([200, 220, 240, 260, 280, 300, 320, 340, 360, 380, 400])
temperatures ** 2
array([ 40000, 48400, 57600, 67600, 78400, 90000, 102400, 115600,
129600, 144400, 160000])
You may have noticed that the import statement in the code above is slightly different to those we have encountered before. Rather than just typing import numpy, we have import numpy as np. The as keyword is used here to effectively “nickname” the numpy library, so that rather than typing numpy.name_of_function, we can type np.name_of_function.
In this example, we have passed a list of temperatures to the np.array function and then raised this new object to the power of \(2\). As you can see, the final output is just the original temperatures squared. As a reminder, this would not work with a standard list:
temperatures = [200, 220, 240, 260, 280, 300, 320, 340, 360, 380, 400]
temperatures ** 2
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[2], line 3
1 temperatures = [200, 220, 240, 260, 280, 300, 320, 340, 360, 380, 400]
----> 3 temperatures ** 2
TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'
We can therefore conclude that it is the np.array function that allows us to calculate the square of each temperature in such a compact way.
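For comparison, here is a rough sketch of what the same calculation looks like with a plain list. The variable names are made up for this illustration. The list version needs an explicit loop, whereas the array version squares every element in one step:

```python
import numpy as np

temperature_list = [200, 220, 240, 260]

# With a plain list, each element has to be squared one at a time.
squared_list = []
for temperature in temperature_list:
    squared_list.append(temperature ** 2)

# With a numpy array, ** is applied to every element at once.
squared_array = np.array(temperature_list) ** 2

print(squared_list)
print(squared_array)
```

Both approaches give the same numbers; the array version is simply shorter to write.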
The np.array function, perhaps unsurprisingly, creates a numpy array:
temperatures = np.array([200, 220, 240, 260, 280, 300, 320, 340, 360, 380, 400])
type(temperatures)
numpy.ndarray
numpy arrays behave very similarly to vectors. If you are not overly familiar with vectors or linear algebra, do not panic. For the purposes of this course we won’t be worrying too much about the mathematical background describing why numpy arrays work like they do, but rather how we can use them to our advantage.
Vector arithmetic#
Okay so we have a new data structure to play with: numpy arrays. Unlike all of the other data structures we have looked at before (lists, tuples and dictionaries), numpy arrays are not built in to Python. They are not included by default, hence we had to import the numpy library explicitly. To create a numpy array, we use the np.array function:
list_of_numbers = [1, 9, 17, 22, 45, 68]
array_of_numbers = np.array(list_of_numbers)
print(array_of_numbers)
[ 1 9 17 22 45 68]
Here we can see that, if we just print one out, a numpy array looks much like a list. Things get much more interesting when we start doing mathematical operations with them:
array_of_numbers
array([ 1, 9, 17, 22, 45, 68])
array_of_numbers * 2
array([ 2, 18, 34, 44, 90, 136])
array_of_numbers + 10
array([11, 19, 27, 32, 55, 78])
Here we see that using the standard mathematical operators such as * and + allows us to operate simultaneously on all elements of the array. If we run array_of_numbers * 2, we multiply every element of array_of_numbers by \(2\).
array_1 = np.array([1, 5, 7])
array_2 = np.array([3, 1, 2])
print(f'We can add arrays together: {array_1 + array_2}')
print(f'Take them away from one another: {array_1 - array_2}')
print(f'Multiply them together: {array_1 * array_2}')
print(f'Divide one by the other: {array_1 / array_2}')
We can add arrays together: [4 6 9]
Take them away from one another: [-2 4 5]
Multiply them together: [ 3 5 14]
Divide one by the other: [0.33333333 5. 3.5 ]
As shown above, we can also add, subtract, divide and multiply arrays together. Each of these operations is carried out element-wise:
array_3 = array_1 + array_2
print(array_1)
print(array_2)
print(array_3)
[1 5 7]
[3 1 2]
[4 6 9]
Using the example of addition shown above, we see that the first element of array_3 is the sum of the first element of array_1 and the first element of array_2. Similarly, the second element of array_3 is the sum of the second element of array_1 and the second element of array_2, and so on. This is precisely how vector addition works, hence the term vector arithmetic.
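To make the contrast with lists concrete, the following small sketch (with illustrative variable names) applies the + operator to two lists and then to the equivalent arrays. For lists, + joins them end to end; for numpy arrays, + adds them element by element:

```python
import numpy as np

list_1 = [1, 5, 7]
list_2 = [3, 1, 2]

# For lists, + joins the two lists end to end (concatenation).
joined = list_1 + list_2

# For numpy arrays, + adds the elements pair by pair.
added = np.array(list_1) + np.array(list_2)

print(joined)  # [1, 5, 7, 3, 1, 2]
print(added)   # [4 6 9]
```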
A quick sidenote: numpy arrays, unlike the other data structures we have looked at, must contain homogeneous data. A list can have any data that you would like in it:
example_list = [1, 'a', True]
print(example_list)
[1, 'a', True]
But if we try to create a numpy array out of this list:
example_array = np.array(example_list)
print(example_array)
['1' 'a' 'True']
You can see that each element has been turned into a string (notice the quotation marks around '1' and 'True'). This is usually not what we want, so if you need a sequence of different data types, use a list, not a numpy array.
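If you are curious which single type numpy has settled on, every array has a dtype attribute. This goes slightly beyond what the lab requires, so treat the sketch below as optional. It uses dtype.kind, a one-letter code for the general category of data, and the variable names are illustrative:

```python
import numpy as np

integers = np.array([1, 2, 3])
mixed_numbers = np.array([1, 2.5, 3])
mixed_types = np.array([1, 'a', True])

# dtype.kind is a one-letter code:
# 'i' = integer, 'f' = floating point, 'U' = text (unicode string)
print(integers.dtype.kind)
print(mixed_numbers.dtype.kind)
print(mixed_types.dtype.kind)

# Mixing int and float values converts everything to float.
print(mixed_numbers)
```

Notice that mixing integers and floats quietly converts every element to a float, in the same way that mixing numbers and text converts every element to a string.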
Slicing and dicing#
We’ve now seen how numpy arrays behave quite differently from other data structures with respect to mathematical operations, but in many other ways they are no different to lists.
We can access any element of a numpy array by indexing it just like a list:
example_array = np.array([1, 2, 3, 4, 5])
example_array[2]
3
example_array[0]
1
example_array[-1]
5
Similarly, we can slice a numpy array to retrieve only certain elements:
example_array[1:4]
array([2, 3, 4])
example_array[3:0:-1]
array([4, 3, 2])
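The full slice syntax is start:stop:step, exactly as for lists, and any of the three parts can be left out. A few more variations are sketched below (the variable names are illustrative):

```python
import numpy as np

example_array = np.array([1, 2, 3, 4, 5])

# Leaving out start and stop means "from the beginning to the end".
every_other = example_array[::2]       # every second element
reversed_array = example_array[::-1]   # the whole array, back to front

# Leaving out start means "from the beginning".
first_three = example_array[:3]

print(every_other)     # [1 3 5]
print(reversed_array)  # [5 4 3 2 1]
print(first_three)     # [1 2 3]
```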
Just like we can have nested lists, we can have multidimensional numpy arrays:
nested_list = [[1, 2, 3],
[4, 5, 6],
[7, 8, 9]]
example_array = np.array(nested_list)
print(example_array)
[[1 2 3]
[4 5 6]
[7 8 9]]
These can be indexed again just like regular list objects:
example_array[0]
array([1, 2, 3])
example_array[0][1]
2
example_array[1][2]
6
One difference from lists is that numpy allows us to use a more compact syntax for indexing multidimensional arrays:
example_array[2,1]
8
As opposed to:
example_array[2][1]
8
Just like one-dimensional numpy arrays behave like vectors, two-dimensional arrays behave like matrices. You can therefore think of the indexing syntax for a two-dimensional array as:
name_of_array[row_index, column_index]
For example:
example_array
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
example_array[2, 0]
7
Here our row index is \(2\), therefore we are accessing the third row of example_array (remember Python starts counting from zero!). The column index is \(0\), so we are looking in the first column of example_array. Sure enough, we find that the number \(7\) is located in the first column of the third row of example_array.
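The row and column positions can each hold a slice as well as a single index. As a brief sketch (the variable names are illustrative), a lone colon means "everything along this direction", which gives a convenient way to pull out a whole row or a whole column:

```python
import numpy as np

example_array = np.array([[1, 2, 3],
                          [4, 5, 6],
                          [7, 8, 9]])

# Row index 2, every column: the whole third row.
third_row = example_array[2, :]

# Every row, column index 0: the whole first column.
first_column = example_array[:, 0]

# Rows 0 and 1, columns 0 and 1: the top-left 2 x 2 block.
top_left_block = example_array[0:2, 0:2]

print(third_row)       # [7 8 9]
print(first_column)    # [1 4 7]
print(top_left_block)
```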
If you are not particularly keen on vectors or matrices, you do not have to use this mental model all of the time. For the most part, you can think of numpy arrays as fancy lists that allow you to perform certain mathematical operations on sequences of numbers very quickly and efficiently.
More numpy functions#
Aside from the array function, numpy also provides us with many more useful functions that are well worth being aware of.
Previously, we learnt about the math library, which gives us access to various mathematical functions that go beyond the simple operators available in Python automatically such as + or /:
import math
math.log(2)
0.6931471805599453
The math functions are designed for int and float objects. They do not work with data structures such as a list:
numbers = [1, 2, 3, 4, 5]
math.log(numbers)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[30], line 3
1 numbers = [1, 2, 3, 4, 5]
----> 3 math.log(numbers)
TypeError: must be real number, not list
So, if we want to take the natural logarithm of each value in numbers, we have to use a loop:
log_numbers = []
for number in numbers:
    log_numbers.append(math.log(number))
log_numbers
[0.0,
0.6931471805599453,
1.0986122886681098,
1.3862943611198906,
1.6094379124341003]
This same problem occurs for numpy arrays as well as lists:
example_array = np.array([1, 2, 3, 4, 5])
math.log(example_array)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[32], line 3
1 example_array = np.array([1, 2, 3, 4, 5])
----> 3 math.log(example_array)
TypeError: only length-1 arrays can be converted to Python scalars
Rather than using a loop like we had to with a list, we can get around this problem in a different way:
np.log(example_array)
array([0. , 0.69314718, 1.09861229, 1.38629436, 1.60943791])
Notice that we’re not calling math.log here, but np.log: the numpy version of the log function. Just like the element-wise operations we talked about before:
example_array + 1
array([2, 3, 4, 5, 6])
numpy also allows us to apply more complex functions such as the natural logarithm element-wise to an array.
Warning
This is a major point of confusion for a lot of Python beginners, so let’s be very clear.
If you want to use the math functions, these will only work on single numbers, i.e. int and float objects. To perform similar calculations on sequences of numbers, you can use the numpy versions of these functions (which have the same name) on a numpy array.
To give another concrete example, let’s say we want to take the exponential of a single number. We can do this with the math library:
math.exp(5)
148.4131591025766
If we want to do the same operation on multiple numbers, we use the numpy version of the exp function. It accepts a numpy array, or a list as in this example, and returns a numpy array:
numbers = [1, 2, 3, 4, 5]
np.exp(numbers)
array([ 2.71828183, 7.3890561 , 20.08553692, 54.59815003,
148.4131591 ])
To summarise, for individual numbers use math, for multiple numbers use numpy: it is not only more convenient, but very often it is much faster too.
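As a quick sanity check that the two routes really do agree, the sketch below computes the same exponentials with a math loop and with np.exp, then compares them using np.allclose (a numpy function that tests whether two sequences of numbers are equal to within a small tolerance). The variable names are illustrative:

```python
import math
import numpy as np

numbers = [1, 2, 3, 4, 5]

# Route 1: the math library, one number at a time.
exp_with_loop = []
for number in numbers:
    exp_with_loop.append(math.exp(number))

# Route 2: the numpy version, applied to the whole array in one call.
exp_with_numpy = np.exp(np.array(numbers))

# Both routes should give the same values (up to tiny rounding differences).
print(np.allclose(exp_with_loop, exp_with_numpy))
```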
A few more numpy functions to be aware of include the mean, sum and std functions:
example_array = np.array([200, 220, 198, 181, 201, 156])
mean_of_example = np.mean(example_array)
sum_of_example = np.sum(example_array)
std_dev_of_example = np.std(example_array)
print(f'The mean = {mean_of_example}')
print(f'The sum = {sum_of_example}')
print(f'The standard deviation = {std_dev_of_example}')
The mean = 192.66666666666666
The sum = 1156
The standard deviation = 19.913702708325125
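To see what np.std is doing under the hood, we can reproduce it using the vector arithmetic from earlier. The sketch below assumes the default behaviour of np.std, which is the population standard deviation (dividing by the number of values rather than by one less than it); the variable names are illustrative:

```python
import numpy as np

example_array = np.array([200, 220, 198, 181, 201, 156])

# Mean: add everything up and divide by the number of values.
mean_by_hand = np.sum(example_array) / len(example_array)

# Standard deviation: the square root of the mean squared
# difference between each value and the mean.
squared_differences = (example_array - mean_by_hand) ** 2
std_by_hand = np.sqrt(np.mean(squared_differences))

print(f'Mean by hand = {mean_by_hand}')
print(f'Standard deviation by hand = {std_by_hand}')
print(np.isclose(std_by_hand, np.std(example_array)))
```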
You might be wondering why there is a numpy version of the sum function, given that the built-in version works just fine on numpy arrays:
sum(example_array)
1156
The numpy version of the sum function calculates the sum over a given axis. In our one-dimensional example, this makes no difference, but if we have a two-dimensional array:
example_array = np.array([[1, 2, 3, 4],
[5, 6, 7, 8]])
sum(example_array)
array([ 6, 8, 10, 12])
The built-in sum function is only able to add the rows together, whereas the numpy version:
option_1 = np.sum(example_array, axis=0)
option_2 = np.sum(example_array, axis=1)
option_3 = np.sum(example_array)
print(f'Sum along axis 0 = {option_1}')
print(f'Sum along axis 1 = {option_2}')
print(f'Sum along all values = {option_3}')
Sum along axis 0 = [ 6 8 10 12]
Sum along axis 1 = [10 26]
Sum along all values = 36
gives us control over the summation via the axis keyword argument. In this course, you almost certainly won’t have to worry about the different ways in which you can sum over a numpy array, but for your information if you’re interested:
Axis \(0\) corresponds to summing down the rows, giving one total per column (which is what the built-in sum function does).
Axis \(1\) corresponds to summing across the columns, giving one total per row.
Providing no axis argument at all allows us to sum all values.
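One way to keep the axes straight is to look at the shape of the result. The sketch below uses the shape attribute, which has not been covered in this lab and reports the number of rows and columns in an array. It also shows that np.mean accepts the same axis argument. The variable names are illustrative:

```python
import numpy as np

example_array = np.array([[1, 2, 3, 4],
                          [5, 6, 7, 8]])

# shape is (number of rows, number of columns)
print(example_array.shape)  # (2, 4)

# axis=0 collapses the 2 rows, leaving one total for each of the 4 columns.
column_totals = np.sum(example_array, axis=0)

# axis=1 collapses the 4 columns, leaving one total for each of the 2 rows.
row_totals = np.sum(example_array, axis=1)

# Other functions such as np.mean take the same keyword argument.
row_means = np.mean(example_array, axis=1)

print(column_totals.shape)  # (4,)
print(row_totals.shape)     # (2,)
print(row_means)            # [2.5 6.5]
```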
We were introduced to the range function back in lab 3. It allows us to generate sequences of integers:
for number in range(0, 22, 2):
    print(number)
0
2
4
6
8
10
12
14
16
18
20
Here we’ve printed all of the even numbers between \(0\) and \(20\) by passing the appropriate arguments to range (remember that the optional third argument is the step size between values).
numpy provides its own version of the range function which returns a numpy array:
np.arange(0, 22, 2)
array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
The arguments passed to np.arange work in exactly the same way as those of the built-in range function, except that rather than producing a sequence of int objects, we now get a numpy array.
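One further difference worth knowing about is that np.arange also accepts a non-integer step size, which the built-in range function does not. The sketch below illustrates this with an arbitrary step of 0.25; the try and except lines are only there so that the error from range is printed rather than stopping the program:

```python
import numpy as np

# np.arange is happy with a non-integer step size.
quarter_steps = np.arange(0, 1, 0.25)
print(quarter_steps)

# The built-in range function only accepts integers.
try:
    range(0, 1, 0.25)
except TypeError as error:
    print(f'range raised a TypeError: {error}')
```

As with range, the end point (here \(1\)) is not included in the output.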
To go along with the np.arange function, we also have the linspace function:
np.linspace(0, 10, 100)
array([ 0. , 0.1010101 , 0.2020202 , 0.3030303 , 0.4040404 ,
0.50505051, 0.60606061, 0.70707071, 0.80808081, 0.90909091,
1.01010101, 1.11111111, 1.21212121, 1.31313131, 1.41414141,
1.51515152, 1.61616162, 1.71717172, 1.81818182, 1.91919192,
2.02020202, 2.12121212, 2.22222222, 2.32323232, 2.42424242,
2.52525253, 2.62626263, 2.72727273, 2.82828283, 2.92929293,
3.03030303, 3.13131313, 3.23232323, 3.33333333, 3.43434343,
3.53535354, 3.63636364, 3.73737374, 3.83838384, 3.93939394,
4.04040404, 4.14141414, 4.24242424, 4.34343434, 4.44444444,
4.54545455, 4.64646465, 4.74747475, 4.84848485, 4.94949495,
5.05050505, 5.15151515, 5.25252525, 5.35353535, 5.45454545,
5.55555556, 5.65656566, 5.75757576, 5.85858586, 5.95959596,
6.06060606, 6.16161616, 6.26262626, 6.36363636, 6.46464646,
6.56565657, 6.66666667, 6.76767677, 6.86868687, 6.96969697,
7.07070707, 7.17171717, 7.27272727, 7.37373737, 7.47474747,
7.57575758, 7.67676768, 7.77777778, 7.87878788, 7.97979798,
8.08080808, 8.18181818, 8.28282828, 8.38383838, 8.48484848,
8.58585859, 8.68686869, 8.78787879, 8.88888889, 8.98989899,
9.09090909, 9.19191919, 9.29292929, 9.39393939, 9.49494949,
9.5959596 , 9.6969697 , 9.7979798 , 9.8989899 , 10. ])
linspace means linearly-spaced. This function returns a number of values between a given start and end point, each one equally spaced apart from each other. Here we have passed the arguments \(0\), \(10\) and \(100\):
The first argument \(0\) is the start point: we would like a range of values starting from \(0\) (just like the range function).
The second argument \(10\) is the end point, but unlike the range or arange functions, this end point is included in the output.
The third argument \(100\) is not the step size, but rather the total number of values we would like.
So to put it concisely, this example computes \(100\) evenly-spaced values between \(0\) and \(10\).
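A smaller example makes the spacing easier to check by eye. In the sketch below (with illustrative variable names), we ask for \(5\) values between \(0\) and \(10\). Because both end points are included, there are \(4\) gaps between the \(5\) values, so each gap is \(10 / 4 = 2.5\):

```python
import numpy as np

# 5 evenly-spaced values from 0 to 10, with both end points included.
values = np.linspace(0, 10, 5)
print(values)

# The gap between neighbouring values is
# (end - start) / (number of values - 1) = (10 - 0) / (5 - 1) = 2.5
spacing = values[1] - values[0]
print(spacing)
```

The same reasoning explains the spacing of roughly \(0.101\) in the larger example: \(100\) values leave \(99\) gaps, and \(10 / 99 \approx 0.101\).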