File I/O#
In all of the exercises you have completed thus far, any data that you need has been provided to you directly as Python objects, e.g.:
concentration_data = [0.001, 0.002, 0.004, 0.006, 0.008, 0.010, 0.020]
This is not really representative of a typical data analysis workflow. Most of the time, if you want to analyse some data, you will have already collected it and stored it in some kind of file, perhaps a csv
for example. It seems prudent therefore to learn how we can read files into Python and write files back out of Python: file I/O (input/output).
The general case: reading in files#
The most general way to read data from a file in Python is to use the built-in open
function. Let’s look at a simple example: reading in a file that contains some simple text. We’re going to look at example.txt, which looks like this:
EXAMPLE TEXT FILE

Here is some text.
Here is some more text.

Here is even more text.
Here’s how we can read in this file in Python:
with open('example.txt', 'r') as stream:
    lines = stream.readlines()

for line in lines:
    print(line, end='')
EXAMPLE TEXT FILE

Here is some text.
Here is some more text.

Here is even more text.
To try this example yourself, download example.txt.
Important
When you download files from this course book, they will open up in your web browser as raw text. To actually get the file, you will have to create a blank text file on your Noteable instance (New
-> Text File
) and copy the text into there (naming the file appropriately).
Let’s take this example in sections. First up, we have the open
function:
with open('example.txt', 'r') as stream:
Here we provide two arguments: the path to the file we want to open (example.txt
) and what we would like to do with that file ('r'
for read).
As you will already have noticed, we have also used a new keyword: with
. Here we are doing something quite similar to import
statements such as:
import numpy as np
We are effectively “nicknaming” the output of the open
function and calling it stream
instead. You could of course call it something else; here we use stream
as shorthand for a file stream: a stream of data read from a file.
We end the first line with a colon :
much like function definitions and loops, after which we indent all of the code which needs to access stream
. Once the indented block ends, the with
statement automatically closes the file for us.
lines = stream.readlines()
The next line actually manipulates the content of the file. The open
function returns an object which contains all of the data associated with the file, but not necessarily in a human-readable or immediately useful way. By calling the readlines
method, we store a list
containing each line of the file. The remaining code simply prints these lines for our inspection:
for line in lines:
    print(line, end='')
We have used the end
keyword argument here just to prevent the print
function from adding needless whitespace to the output (by default each line passed to print
will be followed by a newline).
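As a side note, the object returned by open
can also be iterated over directly, yielding one line at a time without first loading the whole file into memory. Here is a minimal self-contained sketch: it first creates its own small example file (the name example_sketch.txt is just an illustrative choice), then iterates over it.

```python
# Create a small example file so this sketch can be run anywhere.
with open('example_sketch.txt', 'w') as stream:
    stream.write('EXAMPLE TEXT FILE\n\nHere is some text.\n')

# Iterating over the file object yields one line at a time,
# without first reading the entire file into memory.
with open('example_sketch.txt', 'r') as stream:
    for line in stream:
        print(line, end='')
```

For large files this line-by-line approach is often preferable to readlines
, which reads everything in one go.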
Writing files#
Now that we’ve read in our simple text file, let’s make some modifications to it and write it back out again.
We now have the contents of the text file available to us in the form of a list
of strings:
lines
['EXAMPLE TEXT FILE\n',
'\n',
'Here is some text.\n',
'Here is some more text.\n',
'\n',
'Here is even more text.\n']
Note that each \n
is a newline character which will actually become a newline when passed to the print
function:
print('Line 1\nLine 2')
Line 1
Line 2
Let’s make some changes to lines
, starting by removing the last two:
lines = lines[:-2]
lines
['EXAMPLE TEXT FILE\n',
'\n',
'Here is some text.\n',
'Here is some more text.\n']
Now let’s add a new line:
lines.append('This new text was added in Python!')
lines
['EXAMPLE TEXT FILE\n',
'\n',
'Here is some text.\n',
'Here is some more text.\n',
'This new text was added in Python!']
And finally, to write our lines
to a new file:
with open('modified_example.txt', 'w') as stream:
    for line in lines:
        stream.write(line)
If you followed along with this entire example, you should see that a new file modified_example.txt
has now been created in the same directory as your Jupyter notebook - take a look.
As you can see in the code above, writing a file in Python looks much like reading a file: we use the open
function for both use cases. The difference is that here we specify that we want to write a file by passing 'w'
as the second argument, and we use the write
method rather than the readlines
method. The write
method takes a single string as an argument, which is why we have used a for
loop to write each individual line to the file. We could also have combined all of the lines into a single string beforehand and then passed that to the write
method; either way will work just fine.
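To show that the two approaches really are equivalent, here is a small self-contained sketch (the file name modified_sketch.txt is just an illustrative choice):

```python
lines = ['EXAMPLE TEXT FILE\n', '\n', 'Here is some text.\n']

# Option 1: write each line individually in a loop.
with open('modified_sketch.txt', 'w') as stream:
    for line in lines:
        stream.write(line)

# Option 2: join the lines into a single string and write it in one go.
# This overwrites the file with identical contents.
with open('modified_sketch.txt', 'w') as stream:
    stream.write(''.join(lines))

# Read the file back to confirm the result.
with open('modified_sketch.txt', 'r') as stream:
    print(stream.read())
```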
Reading in scientific data with numpy
#
What we have just been through is the most general case: how to read in any file, regardless of what type of data is contained within. For our purposes, we are primarily interested in scientific data: numbers that have been collected during some series of experiments. There are many Python packages that can be used to read such data; here we’re going to rely on one that we’ve already been introduced to: numpy
.
Time for another example file, this time some experimental data looking at the temperature dependence of the equilibrium constant for a reaction:
# T / K | K
100 2.38e38
120 2.15e30
140 3.86e24
160 1.89e20
180 8.42e16
200 1.75e14
220 1.12e12
240 1.67e10
260 4.73e08
280 2.24e07
300 1.59e06
320 1.57e05
340 2.03e04
360 3.30e03
380 6.50e02
400 1.51e02
420 4.01e01
440 1.21e01
460 4.02e00
480 1.47e00
500 5.82e-01
Follow this link to download this file.
This data is formatted as simple text in a table of sorts, with values on each line being separated by whitespace. Simple tabular data can be read from files like this using the loadtxt
function:
import numpy as np
data = np.loadtxt('thermodynamic_data.dat')
print(data)
type(data)
[[1.00e+02 2.38e+38]
[1.20e+02 2.15e+30]
[1.40e+02 3.86e+24]
[1.60e+02 1.89e+20]
[1.80e+02 8.42e+16]
[2.00e+02 1.75e+14]
[2.20e+02 1.12e+12]
[2.40e+02 1.67e+10]
[2.60e+02 4.73e+08]
[2.80e+02 2.24e+07]
[3.00e+02 1.59e+06]
[3.20e+02 1.57e+05]
[3.40e+02 2.03e+04]
[3.60e+02 3.30e+03]
[3.80e+02 6.50e+02]
[4.00e+02 1.51e+02]
[4.20e+02 4.01e+01]
[4.40e+02 1.21e+01]
[4.60e+02 4.02e+00]
[4.80e+02 1.47e+00]
[5.00e+02 5.82e-01]]
numpy.ndarray
We end up with a numpy
array containing all of the data in the file. We can tell just from counting square brackets that this array is two-dimensional, in other words it’s an array of arrays: a matrix.
data[0]
array([1.00e+02, 2.38e+38])
As you can see from the code above, the first row of data
is [100, 2.38e38]
, which is indeed the first row of data in the original file. This makes sense, but in all likelihood what we actually want is all of the temperature values in one array, and all of the equilibrium constant data in another array. We can achieve this by transposing the array:
transposed_data = data.T
print(transposed_data)
[[1.00e+02 1.20e+02 1.40e+02 1.60e+02 1.80e+02 2.00e+02 2.20e+02 2.40e+02
2.60e+02 2.80e+02 3.00e+02 3.20e+02 3.40e+02 3.60e+02 3.80e+02 4.00e+02
4.20e+02 4.40e+02 4.60e+02 4.80e+02 5.00e+02]
[2.38e+38 2.15e+30 3.86e+24 1.89e+20 8.42e+16 1.75e+14 1.12e+12 1.67e+10
4.73e+08 2.24e+07 1.59e+06 1.57e+05 2.03e+04 3.30e+03 6.50e+02 1.51e+02
4.01e+01 1.21e+01 4.02e+00 1.47e+00 5.82e-01]]
Now we have an array that contains all the same data as before, but the rows are now columns and vice versa. This allows us to very easily assign the temperature and equilibrium constant values to separate variables:
temperature, K = transposed_data
print(temperature)
print(K)
[100. 120. 140. 160. 180. 200. 220. 240. 260. 280. 300. 320. 340. 360.
380. 400. 420. 440. 460. 480. 500.]
[2.38e+38 2.15e+30 3.86e+24 1.89e+20 8.42e+16 1.75e+14 1.12e+12 1.67e+10
4.73e+08 2.24e+07 1.59e+06 1.57e+05 2.03e+04 3.30e+03 6.50e+02 1.51e+02
4.01e+01 1.21e+01 4.02e+00 1.47e+00 5.82e-01]
We could also achieve this by changing how we originally read in the file:
temperature, K = np.loadtxt('thermodynamic_data.dat', unpack=True)
print(temperature)
print(K)
[100. 120. 140. 160. 180. 200. 220. 240. 260. 280. 300. 320. 340. 360.
380. 400. 420. 440. 460. 480. 500.]
[2.38e+38 2.15e+30 3.86e+24 1.89e+20 8.42e+16 1.75e+14 1.12e+12 1.67e+10
4.73e+08 2.24e+07 1.59e+06 1.57e+05 2.03e+04 3.30e+03 6.50e+02 1.51e+02
4.01e+01 1.21e+01 4.02e+00 1.47e+00 5.82e-01]
Here we have added the unpack
keyword argument, which automatically transposes the output array for us; we then use multiple assignment to immediately split the data into separate variables.
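To convince ourselves that the two approaches agree, here is a small self-contained sketch that writes a tiny two-column data file (the name sketch_data.dat is just an illustrative choice) and reads it back both ways:

```python
import numpy as np

# Write a tiny two-column data file so this sketch can be run anywhere.
with open('sketch_data.dat', 'w') as stream:
    stream.write('100 2.38e38\n120 2.15e30\n')

# Approach 1: load the data, then transpose it ourselves.
temperature_a, K_a = np.loadtxt('sketch_data.dat').T

# Approach 2: let loadtxt transpose the output for us.
temperature_b, K_b = np.loadtxt('sketch_data.dat', unpack=True)

# Both approaches yield identical temperature and K arrays.
print(temperature_a, temperature_b)
```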
For one final example, let’s look at reading in a csv
file:
# T / K | K
100, 2.38e38
120, 2.15e30
140, 3.86e24
160, 1.89e20
180, 8.42e16
200, 1.75e14
220, 1.12e12
240, 1.67e10
260, 4.73e08
280, 2.24e07
300, 1.59e06
320, 1.57e05
340, 2.03e04
360, 3.30e03
380, 6.50e02
400, 1.51e02
420, 4.01e01
440, 1.21e01
460, 4.02e00
480, 1.47e00
500, 5.82e-01
Follow this link to download this file.
This example is actually the same data as the previous case, but now with commas separating the values rather than solely whitespace. This seemingly minor difference is actually quite important, as we can see if we try to read in this file in the same way as the previous example:
temperature, K = np.loadtxt('thermodynamic_data.csv', unpack=True)
print(temperature)
print(K)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[12], line 1
----> 1 temperature, K = np.loadtxt('thermodynamic_data.csv', unpack=True)
3 print(temperature)
4 print(K)
File /opt/hostedtoolcache/Python/3.12.9/x64/lib/python3.12/site-packages/numpy/lib/_npyio_impl.py:1395, in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin, encoding, max_rows, quotechar, like)
1392 if isinstance(delimiter, bytes):
1393 delimiter = delimiter.decode('latin1')
-> 1395 arr = _read(fname, dtype=dtype, comment=comment, delimiter=delimiter,
1396 converters=converters, skiplines=skiprows, usecols=usecols,
1397 unpack=unpack, ndmin=ndmin, encoding=encoding,
1398 max_rows=max_rows, quote=quotechar)
1400 return arr
File /opt/hostedtoolcache/Python/3.12.9/x64/lib/python3.12/site-packages/numpy/lib/_npyio_impl.py:1046, in _read(fname, delimiter, comment, quote, imaginary_unit, usecols, skiplines, max_rows, converters, ndmin, unpack, dtype, encoding)
1043 data = _preprocess_comments(data, comments, encoding)
1045 if read_dtype_via_object_chunks is None:
-> 1046 arr = _load_from_filelike(
1047 data, delimiter=delimiter, comment=comment, quote=quote,
1048 imaginary_unit=imaginary_unit,
1049 usecols=usecols, skiplines=skiplines, max_rows=max_rows,
1050 converters=converters, dtype=dtype,
1051 encoding=encoding, filelike=filelike,
1052 byte_converters=byte_converters)
1054 else:
1055 # This branch reads the file into chunks of object arrays and then
1056 # casts them to the desired actual dtype. This ensures correct
1057 # string-length and datetime-unit discovery (like `arr.astype()`).
1058 # Due to chunking, certain error reports are less clear, currently.
1059 if filelike:
ValueError: could not convert string '100,' to float64 at row 0, column 1.
The ValueError
here gives us a good clue as to what’s going wrong: could not convert string '100,' to float64 at row 0, column 1.
This tells us that numpy
is including the comma with each value, which we obviously do not want. This happens because the loadtxt
function expects values to be separated by whitespace by default, not commas. We can change this with another keyword argument:
temperature, K = np.loadtxt('thermodynamic_data.csv', unpack=True, delimiter=',')
print(temperature)
print(K)
[100. 120. 140. 160. 180. 200. 220. 240. 260. 280. 300. 320. 340. 360.
380. 400. 420. 440. 460. 480. 500.]
[2.38e+38 2.15e+30 3.86e+24 1.89e+20 8.42e+16 1.75e+14 1.12e+12 1.67e+10
4.73e+08 2.24e+07 1.59e+06 1.57e+05 2.03e+04 3.30e+03 6.50e+02 1.51e+02
4.01e+01 1.21e+01 4.02e+00 1.47e+00 5.82e-01]
By setting the delimiter
to a comma ,
, the loadtxt
function is able to successfully parse the data and we end up with the same result as our previous example.
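As a final aside, numpy
can also write arrays back out to disk: the savetxt
function is the counterpart of loadtxt
, and it accepts the same delimiter
keyword argument. Here is a minimal round-trip sketch (the file name roundtrip.csv is just an illustrative choice):

```python
import numpy as np

temperature = np.array([100.0, 120.0, 140.0])
K = np.array([2.38e38, 2.15e30, 3.86e24])

# savetxt writes one row per line, so we stack the two arrays as
# columns first to match the layout of the file we read in above.
np.savetxt('roundtrip.csv', np.column_stack((temperature, K)), delimiter=',')

# Reading the file back recovers the original values.
T_check, K_check = np.loadtxt('roundtrip.csv', unpack=True, delimiter=',')
print(T_check)
```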