Files

Learning outcomes

Prerequisites

Reading files

Python has functionality to read any kind of computer file. However, making sense of a text, image, or sound file represents three very different tasks. In this lesson, we will focus on text files, since their manipulation is so common in science.

There are several ways to open files in Python, and you may encounter the others, especially in older programs. Here, we will focus only on the with open() pattern, which is the recommended approach in Python 3 because the file is closed automatically when the indented block ends.

Let’s try it with an example. Download this text file (molecule_names.txt) and make sure it is saved as molecule_names.txt. Then copy the following example into a Python script (a file ending in .py) and place the script and molecule_names.txt in the same folder.

with open("molecule_names.txt") as file_in:
    lines_in_file = file_in.readlines()

print(lines_in_file)

The list lines_in_file holds every line of the file as a string, so we can manipulate all the information from the file. However, when working with very large data sets, we may not want to keep the entire contents of the file in memory. To read a file one line at a time, a less intuitive but more general syntax exists. Here, we iterate over the file variable using a for loop, which yields each line as an individual string:

with open("molecule_names.txt") as file_in:
    for line in file_in:
        print(line)

Try adding more molecules to molecule_names.txt and ensure that they are printed by the code above.
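
Each line yielded by the loop keeps its trailing newline character, which is why print(line) shows an empty line after every molecule. The sketch below shows one possible way to tidy this up with strip(). To stay self-contained it first writes a small stand-in file; the name example_molecules.txt and its contents are assumptions made for illustration, not the lesson's file.

```python
import os
import tempfile

# Hypothetical stand-in for molecule_names.txt, written to a temporary folder
example_path = os.path.join(tempfile.mkdtemp(), "example_molecules.txt")
with open(example_path, "w") as file_out:
    file_out.write("water\nethanol\nbenzene\n")

molecule_names = []
with open(example_path) as file_in:
    for line in file_in:
        # strip() removes the trailing newline and any surrounding whitespace
        molecule_names.append(line.strip())

print(molecule_names)
```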

Solution
Experimental text files often start with information about the instrument used to record the data (so-called **metadata**). In this file, those lines all start with `#`.
data_lines = []

with open("acetone.jdx") as acetone_file:
    # loop over every line
    for line in acetone_file:
        # check that the line doesn't start with #
        if line[0] != "#":
            data_lines.append(line)

print(data_lines[0])
print(data_lines[1])

Usually, it takes some effort to extract exactly the data that we want from a file. Here, we wanted to remove the metadata, so we added a condition based on the # character. This operation is prone to error, so it’s good to run a sanity check by printing the first few remaining lines, ensuring that we didn’t accidentally remove a line we cared about.
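
The filter-then-check pattern can be tried on a tiny made-up file. In the sketch below, the file contents are invented purely for illustration (they are not taken from acetone.jdx), and str.startswith is used as an equivalent way of testing the first character. Printing the last kept line and the number of kept lines are extra checks one might add.

```python
import os
import tempfile

# Invented example data: two metadata lines followed by three data lines
example_path = os.path.join(tempfile.mkdtemp(), "example_spectrum.txt")
with open(example_path, "w") as file_out:
    file_out.write("# instrument: hypothetical\n# units: hypothetical\n")
    file_out.write("100 0.9\n200 0.8\n300 0.7\n")

data_lines = []
with open(example_path) as spectrum_file:
    for line in spectrum_file:
        # keep only the lines that are not metadata
        if not line.startswith("#"):
            data_lines.append(line)

# Sanity checks: first kept line, last kept line, and how many lines survived
print(data_lines[0])
print(data_lines[-1])
print(len(data_lines))
```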

Solution
This task has many possible solutions. In every case, at least one `for` loop is required, and the numbers, which are read in as strings, must be converted to floats.
wavenumbers = []
transmittances = []
with open("acetone.jdx") as acetone_file:
    # loop over every line
    for line in acetone_file:
        # check that the line doesn't start with #
        if line[0] != "#":
            # separate the line into substrings
            split_line = line.split()
            # the first number is a wavenumber
            wavenumbers.append(float(split_line[0]))
            # average all the other numbers
            current_transmittance = 0
            for number in split_line[1:]:
                current_transmittance = current_transmittance + float(number)
            # find the average of five measurements
            current_transmittance = current_transmittance / 5
            transmittances.append(current_transmittance)

# plot
import matplotlib.pyplot as plt
plt.plot(wavenumbers, transmittances)
plt.show()
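
The parsing step can also be isolated in a small function, which makes it easy to test on a single line before running it on a whole file. This is only one possible arrangement, not the lesson's solution: the function name parse_line and the sample line are hypothetical, and the function divides by the number of values it finds rather than a fixed 5, on the assumption that some lines might hold fewer values.

```python
def parse_line(line):
    """Return (wavenumber, mean of the remaining values) for one data line."""
    split_line = line.split()
    # the first number is the wavenumber
    wavenumber = float(split_line[0])
    # convert every remaining substring to a float
    values = [float(number) for number in split_line[1:]]
    # divide by the number of values actually present on this line
    return wavenumber, sum(values) / len(values)


# Made-up line: one wavenumber followed by five measurements
wavenumber, transmittance = parse_line("450.0 0.5 0.5 0.5 0.5 0.5\n")
print(wavenumber, transmittance)
```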

Writing files

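The syntax for writing files is very similar to the one for reading them, but you should exercise extra caution: opening a file in "w" mode creates it if it does not exist, and silently erases its previous contents if it does.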

Here is the syntax:

with open("new_molecule_names.txt", "w") as file_out:
    file_out.write("paracetamol\n")
    file_out.write("butadiene")

Check that new_molecule_names.txt was indeed created and contains what you expect.
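
The difference between overwriting and appending can be seen in a short experiment. The sketch below writes to a temporary file (the name example_names.txt is an assumption chosen for illustration): opening in "w" mode twice leaves only the second write, while "a" mode adds to the end of the existing contents.

```python
import os
import tempfile

example_path = os.path.join(tempfile.mkdtemp(), "example_names.txt")

# "w" creates the file, or erases it first if it already exists
with open(example_path, "w") as file_out:
    file_out.write("paracetamol\n")
with open(example_path, "w") as file_out:
    file_out.write("butadiene\n")

# "a" keeps the existing contents and adds to the end
with open(example_path, "a") as file_out:
    file_out.write("acetone\n")

with open(example_path) as file_in:
    contents = file_in.read()

# paracetamol is gone because the second "w" replaced the file
print(contents)
```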

Solution
We will start identically to before.
wavenumbers = []
transmittances = []
with open("acetone.jdx") as acetone_file:
    # loop over every line
    for line in acetone_file:
        # check that the line doesn't start with #
        if line[0] != "#":
            # separate the line into substrings
            split_line = line.split()
            # the first number is a wavenumber
            wavenumbers.append(float(split_line[0]))
            # average all the other numbers
            current_transmittance = 0
            for number in split_line[1:]:
                current_transmittance = current_transmittance + float(number)
            # find the average of five measurements
            current_transmittance = current_transmittance / 5
            transmittances.append(current_transmittance)

with open("acetone_ir.csv", "w") as file_out:
    # write the column headers
    file_out.write("wavenumber,transmittance\n")
    for wavenumber, transmittance in zip(wavenumbers, transmittances):
        file_out.write(f"{wavenumber},{transmittance}\n")

This solution is slightly inefficient because we loop through every line of the input file, store the results in lists, and then loop over those lists again to write them out. A less legible but more efficient solution would be to nest both with open() blocks, so that as each line is read from the input, a new line is immediately written to the output.
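
One way to arrange this is to open both files in a single with statement, which behaves like two nested with blocks. The sketch below runs on a made-up input file rather than acetone.jdx; the file names and values are assumptions, and the average divides by the number of values on each line instead of a fixed 5.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
input_path = os.path.join(folder, "example_spectrum.txt")
output_path = os.path.join(folder, "example_ir.csv")

# Made-up input: one metadata line and two data lines
with open(input_path, "w") as file_out:
    file_out.write("# metadata line\n")
    file_out.write("100.0 1.0 2.0 3.0\n")
    file_out.write("200.0 4.0 4.0 4.0\n")

# Both files are open at once, so each line is written as soon as it is read
with open(input_path) as file_in, open(output_path, "w") as file_out:
    file_out.write("wavenumber,transmittance\n")
    for line in file_in:
        # skip metadata lines
        if line.startswith("#"):
            continue
        split_line = line.split()
        values = [float(number) for number in split_line[1:]]
        average = sum(values) / len(values)
        file_out.write(f"{float(split_line[0])},{average}\n")

with open(output_path) as file_in:
    csv_text = file_in.read()
print(csv_text)
```

Nothing is stored in lists here, so memory use stays small however long the input file is.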

Summary

References
  1. Linstrom, P. (1997). NIST Chemistry WebBook, NIST Standard Reference Database 69. National Institute of Standards.