Chapter 8. Input and Output

In this chapter we will cover how to read input from a user or a file and how to write output to a file.

We will also cover a topic that is rarely discussed in introductory programming books, but should be understood by every working programmer - encodings.

More on Strings#

When working with input and output, you will usually need to perform operations on strings. While we introduced the string data type in chapter 1, we haven’t actually done too much with strings so far. Let us fix that and talk about strings in more detail.

Important String Methods#

Just like lists, strings come with a number of operators and methods that you will use quite often. We already discussed the concatenation operator + and the repetition operator *. Another important operator is the index access operator, which allows us to take the character at some position of a string. This operator works the same way as it does for lists - in this context you can think of a string as a list of characters:

my_str = "Monty Python's Flying Circus"

my_str[4]

'y'

my_str[-2]

'u'

However, there is an important difference between strings and lists - unlike lists, strings are immutable, i.e. you cannot change a character in place:

my_str[1] = "u"

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 my_str[1] = "u"

TypeError: 'str' object does not support item assignment

Just like with lists, you can slice strings. This is how you get the substring of a string in Python:

my_str[6:12]

'Python'
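Slicing also offers a way around immutability: instead of changing a string in place, we can build a new string from pieces of the old one. The following is a minimal sketch of that idea; the variable name changed_str is only illustrative.

```python
my_str = "Monty Python's Flying Circus"

# Strings are immutable, so we assemble a new string from slices of the old one:
# everything before index 1, the replacement character, everything from index 2 on.
changed_str = my_str[:1] + "u" + my_str[2:]

print(changed_str)  # Munty Python's Flying Circus
print(my_str)       # Monty Python's Flying Circus (the original is untouched)
```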

Strings also ship with a lot of methods. It’s useful to know the most important ones, since you will be using them quite regularly.

For example you can lowercase strings using the lower() method and uppercase them using the upper() method:

"Monty Python's Flying Circus".lower()

"monty python's flying circus"

"Monty Python's Flying Circus".upper()

"MONTY PYTHON'S FLYING CIRCUS"

You can split a string into “words” using the split() method:

"Monty Python's Flying Circus".split()

['Monty', "Python's", 'Flying', 'Circus']

If you want to split on a custom delimiter, you can do that as well:

"Monty Python's Flying Circus".split("'")

['Monty Python', 's Flying Circus']

The opposite of the split() method is the join() method. This method is a bit confusing because it is not called on the list of strings it joins, but on the delimiter (and takes the list of strings as a parameter):

" ".join(["Monty", "Python's", "Flying", "Circus"])

"Monty Python's Flying Circus"

We can also check whether a string starts with a certain prefix or ends with a certain suffix:

"Monty Python's Flying Circus".startswith("Monty Py")

True

"Monty Python's Flying Circus".endswith("us")

True
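In practice these methods are often chained, because each of them returns a new string (or list) that the next method can work on. The sketch below is only an illustration; the names words and slug are made up for this example.

```python
title = "Monty Python's Flying Circus"

# split() followed by join() on a single space gives back the original title,
# because the words in the title are separated by single spaces.
words = title.split()
print(" ".join(words) == title)  # True

# Lowercase the title, split it into words and join the words with dashes.
slug = "-".join(title.lower().split())
print(slug)  # monty-python's-flying-circus

# startswith() is case-sensitive, so lowercase first for a case-insensitive check.
print(title.startswith("monty"))          # False
print(title.lower().startswith("monty"))  # True
```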

Escape Characters#

Some characters don’t denote letters, digits or special characters like ., , or ; and need special treatment. The most important example of this is the newline character.

The newline character is a non-printable character, because it’s - well - not really printed to the screen. Instead it describes an action to take (namely to go to the next line). Python represents newline characters using \n:

my_str = "This is a line.\nThis is another line."

If we print my_str, we will see the newline:

print(my_str)

This is a line.
This is another line.

However, if we get the unambiguous representation of my_str, we will see the \n representation of the newline:

my_str

'This is a line.\nThis is another line.'

This is actually more important than you might think at first. The reason is that both \n and the sequence \r\n can represent a line break: Unix-like systems use \n, while Windows traditionally uses \r\n. Consider the following two strings:

my_str1 = "This a line.\nThis is another line."
my_str2 = "This a line.\r\nThis is another line."

These two strings look equal, if we print them:

print(my_str1)

This a line.
This is another line.

print(my_str2)

This a line.
This is another line.

However, they are two different strings, which the equality operator confirms:

my_str1 == my_str2

False

This can be extremely confusing, if you only look at the output of print. Luckily, the unambiguous representation of the strings immediately clears up the confusion:

my_str1

'This a line.\nThis is another line.'

my_str2

'This a line.\r\nThis is another line.'

Remember that if you are not inside a REPL, you will need to call the repr function manually to get the unambiguous representation of the strings.
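For example, a script (as opposed to a REPL session) could make the difference visible like this. This is just a small sketch using the built-in repr function on two strings similar to the ones above.

```python
my_str1 = "This is a line.\nThis is another line."
my_str2 = "This is a line.\r\nThis is another line."

# print() alone would show both strings the same way.
# repr() returns the unambiguous representation, including the escape sequences.
print(repr(my_str1))  # 'This is a line.\nThis is another line.'
print(repr(my_str2))  # 'This is a line.\r\nThis is another line.'
```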

Another important character that gets special treatment is the tab character which is represented using \t:

print("\tfirst column\tsecond column")

	first column	second column

Sequences like \t and \n are called escape sequences, because the backslash "escapes" the character that follows it, i.e. gives it a special meaning.

Because backslashes have this special meaning inside strings, a literal backslash usually needs to be escaped as well, which results in \\:

print("\\t means tab")

\t means tab
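It may help to see that an escape sequence takes two characters to type, but stands for a single character in the resulting string. The following sketch checks this with len; the variable names are only illustrative.

```python
tab = "\t"            # one character: a tab
backslash_t = "\\t"   # two characters: a backslash followed by the letter t

print(len(tab))           # 1
print(len(backslash_t))   # 2
print(repr(backslash_t))  # '\\t'
```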

Escape characters also allow you to use quotes inside a string:

print(""this will not work"")

  Cell In[24], line 1
    print(""this will not work"")
            ^
SyntaxError: invalid syntax

print("\"this will work\"")

"this will work"

The `input` Function#

We can read a string from the command line using the input function:

user_value = input()

This will read the input of the user into the variable user_value.

It’s usually a good idea to supply the prompt argument to the input() function which will display a prompt telling the user what to do:

user_value = input("Supply a value:")
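Keep in mind that input always returns a string, even if the user types a number. If you need a number, you have to convert the string yourself. Here is a minimal sketch of such a conversion; the helper parse_whole_number is a hypothetical name, and the sketch uses fixed strings instead of calling input so that it runs without a keyboard.

```python
def parse_whole_number(raw_value):
    """Turn text as returned by input() into an int, or None if it is not a whole number."""
    try:
        return int(raw_value.strip())
    except ValueError:
        return None

# In an interactive program the text would come from the user:
#     raw_value = input("Supply a whole number: ")
print(parse_whole_number("42"))     # 42
print(parse_whole_number(" 7 \n"))  # 7
print(parse_whole_number("seven"))  # None
```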

Working with Text Files#

Reading Text Files#

Let’s create a text file example.txt with the following content:

Roses are red.
Violets aren't blue.
It's literally in the name.
They're called violets.

We can read the file using the open function. This function returns a file object which allows access to the underlying file:

file = open("example.txt")

file

<_io.TextIOWrapper name='example.txt' mode='r' encoding='UTF-8'>

We can read the content of the file using the read method of the file object. The read method returns the entire content of the file as a regular string:

content = file.read()

print(content)

Roses are red.
Violets aren't blue.
It's literally in the name.
They're called violets.

type(content)

str

We also need to close the file to free up the resources consumed by the file object:

file.close()

Let’s check that the file is really closed by inspecting the closed attribute:

file.closed

True

Trying to call read on a closed file object will result in an error:

file.read()

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[33], line 1
----> 1 file.read()

ValueError: I/O operation on closed file.

To summarize, we first need to open a file resulting in a file object. We can then perform operations on that file object (for example we can read the content of the file). Finally we need to close the file object.

The `with` Statement#

To simplify working with files, we can use the with statement which will automatically close the file:

with open("example.txt") as file:
    content = file.read()

# Leaving the with block closes the file automatically, i.e. there is no need to call file.close()

file.closed

True

print(content)

Roses are red.
Violets aren't blue.
It's literally in the name.
They're called violets.
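A further benefit of the with statement is that the file is closed even if an error occurs inside the block. The sketch below tries to demonstrate this. It creates its own throwaway file in a temporary directory (an assumption made only so that the example runs on its own) and raises an artificial error in the middle of processing.

```python
import os
import tempfile

# Create a throwaway file so that the sketch does not depend on example.txt.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as file:
    file.write("Roses are red.\n")

try:
    with open(path) as file:
        first_line = file.readline()
        # Simulate a failure that happens while we are working with the file.
        raise RuntimeError("something went wrong while processing")
except RuntimeError:
    pass

# The with statement closed the file although the block was left via an exception.
print(file.closed)  # True
```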

The `mode` Argument#

Writing to a file works similarly to reading from a file. We can write to a file by passing the "w" (“write”) mode as the second argument to open and calling the write method on the file object:

with open("somefile.txt", "w") as file:
    file.write("Some content")

This will create a file somefile.txt with the content "Some content". Note that if that file already exists, its content will be completely overwritten by the new content. If we want to avoid that and instead append the new content to the existing content, we need to use the "a" (append) mode:

with open("somefile.txt", "a") as file:
    file.write("Some content")

If you omit the mode, the mode will be set to "r" (read) by default.
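To see the three modes side by side, consider the following sketch. It writes to a file in a temporary directory (an assumption made so that no existing somefile.txt gets overwritten) and reads the result back at the end.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "somefile.txt")

with open(path, "w") as file:   # "w" creates the file
    file.write("first\n")

with open(path, "w") as file:   # "w" again: the old content is overwritten
    file.write("second\n")

with open(path, "a") as file:   # "a": the new content is appended
    file.write("third\n")

with open(path) as file:        # no mode given, so the default "r" is used
    content = file.read()

print(content)
# second
# third
```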

Encodings#

Now that we have the basics out of the way, we need to have a look at how the content of a file is actually stored on disk. To accomplish that we will pass yet another mode argument to open, namely rb. The rb mode means “read the file as a binary file” (r = read and b = binary):

with open("example.txt", "rb") as file:
    content = file.read()

First of all we note that the content is no longer a string, but a bytes object:

content

b"Roses are red.\nViolets aren't blue.\nIt's literally in the name.\nThey're called violets.\n"

type(content)

bytes

This object contains the actual bytes of the file. A byte is the smallest addressable unit of storage on a computer. On practically all modern machines it consists of eight bits and can therefore hold values from 0 to 255.

For example, we can access the first byte of the file like this:

content[0]

82

Wait, why do we suddenly have numbers when we know that a file contains characters? The answer to this question is that the computer has been lying to us.

You see, computers can’t really store characters. They can only store bytes which represent numbers. This means that the file actually contains a sequence of numbers.

However computers maintain mappings from those numbers to characters, so that they can interpret those numbers as characters. The simplest such mapping is the ASCII table. Here is an excerpt from that table:

| Byte value | Character |
| ---------- | --------- |
| ...        | ...       |
| 80         | P         |
| 81         | Q         |
| 82         | R         |
| 83         | S         |
| 84         | T         |
| 85         | U         |
| 86         | V         |
| 87         | W         |
| 88         | X         |
| 89         | Y         |
| 90         | Z         |
| ...        | ...       |
| 97         | a         |
| 98         | b         |
| 99         | c         |
| 100        | d         |
| 101        | e         |
| ...        | ...       |
| 110        | n         |
| 111        | o         |
| 112        | p         |
| 113        | q         |
| 114        | r         |
| 115        | s         |
| ...        | ...       |

Now we can make some sense of the values in the content variable:

content[0]

82

If we look at the ASCII table, we can see that the number 82 corresponds to the character R. Therefore the first byte of the file contains the number 82 which represents the character R. The next few characters should be o, s, e and s, i.e. the following bytes should be 111, 115, 101 and 115:

content[1]

111

content[2]

115

content[3]

101

content[4]

115

Characters like space, newline etc. are also simply stored as bytes. For example, the number corresponding to the space character is 32:

content[5]

32
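We don't have to look up every value in the table by hand. Python's built-in functions ord and chr convert between a character and its number, and list turns a bytes object into a list of numbers. The sketch below uses only the first line of example.txt, typed in as a bytes literal.

```python
content = b"Roses are red.\n"  # the first line of example.txt as bytes

# The first six bytes: R, o, s, e, s and the space character.
print(list(content[:6]))  # [82, 111, 115, 101, 115, 32]

print(ord("R"))     # 82
print(chr(82))      # R
print(content[-1])  # 10, the number of the newline character
```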

The ASCII table worked fine for a while until programmers suddenly noticed that there are languages that are not English. This was a truly shocking discovery that fundamentally changed the way programmers thought about the world. The Unicode standard was born.

This is a really oversimplified history of the Unicode standard. The reality was much more complicated.

The most important concept of the Unicode standard is the code point. A code point is a numerical value for a specific character. This is very similar to the ASCII table, except that Unicode is much, much bigger and contains such characters as:

the German umlaut ä which has the code point 228
the checkmark ✅ which has the code point 9989
the emoji 😀 which has the code point 128512

You can think of Unicode as a giant extension of the ASCII table.
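The built-in functions chr and ord work with code points, so we can check the values above ourselves. This is a small sketch; the list code_points simply repeats the numbers mentioned in the text, plus 82 for the character R from the ASCII table.

```python
code_points = [82, 228, 9989, 128512]

# chr() turns a code point into the corresponding character.
for code_point in code_points:
    print(code_point, chr(code_point))

# ord() goes the other way: from a character to its code point.
print(ord(chr(228)))  # 228
```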

In reality, Unicode is more complicated than that, but we don’t care about this right now.

The fact that Unicode is so large means we can no longer store every character in a single byte. If every character were to take up the same number of bytes, we would need up to four bytes per character (this is what the UTF-32 encoding does). However, this would be extremely wasteful for e.g. English texts, since we would rarely actually need all four bytes in this case.

Therefore there are multiple encodings which govern how code points are encoded (i.e. converted) to bytes. For example, an encoding can decide to represent some characters as a single byte, some characters as two bytes etc. We will not dive into the nitty-gritty details of encodings in this chapter since this is not essential to understand (at least not for now). But it is essential to realize that the same code point can be converted to a different sequence of bytes depending on the encoding.

Consider the German umlaut ä for example. The code point of ä is (always) 228, because that is the code point that was assigned to ä. However different encodings will represent this code point using different byte sequences.

For example, the encoding UTF-8 (which is the most popular encoding on the internet) will represent that code point using the following sequence of two bytes:

utf8_umlaut = "ä".encode("utf-8")

len(utf8_umlaut)

2

utf8_umlaut[0]

195

utf8_umlaut[1]

164

However, the Windows-1252 encoding (called cp1252 for short) which is commonly used on Windows systems represents the same code point completely differently:

cp1252_umlaut = "ä".encode("cp1252")

len(cp1252_umlaut)

1

cp1252_umlaut[0]

228
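The same comparison can be written as a loop over several encodings. The sketch below adds UTF-16 and UTF-32 (in their little-endian variants), which are not discussed in this chapter, purely to show that the number of bytes differs from encoding to encoding.

```python
umlaut = chr(228)  # the German umlaut ä

for encoding in ["utf-8", "cp1252", "utf-16-le", "utf-32-le"]:
    encoded = umlaut.encode(encoding)
    print(encoding, list(encoded))

# utf-8 [195, 164]
# cp1252 [228]
# utf-16-le [228, 0]
# utf-32-le [228, 0, 0, 0]
```

In every case the code point is 228. Only the bytes that represent it change.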

All of this has an extremely important practical consequence:

If you want to know what string a sequence of bytes represents, you need to know the encoding of that sequence. The sequence of bytes by itself is (generally speaking) useless without the encoding.

Consider the following sequence of bytes:

b = bytes([195, 164])

If that sequence of bytes has the encoding utf-8 it represents the German umlaut ä:

b.decode("utf-8")

'ä'

However, if that sequence of bytes has the encoding cp1252, it suddenly represents a completely different string:

b.decode("cp1252")

'Ã¤'

It should be noted that if you don’t know the encoding of a sequence of bytes, there are certain statistical methods that can be used to guess that encoding using common patterns. In addition there are some encodings that are much more common than the rest (like UTF-8 or Windows-1252). However, such guesses are not always accurate. This means that it is a really bad idea to rely on such guesses when writing production code.

This also means that if you write a file using one encoding and then try to read it using a different encoding, you will either get scrambled content or maybe even fail to read the file completely. This is actually a fairly common occurrence if a file was created on an operating system that uses one encoding by default and then read on another operating system that uses another encoding by default.

Let’s see this in action. Create a text file german.txt with the following content:

A file with umlauts: ÄÖÜäöü

The encoding of the file should be UTF-8:

with open("german.txt", "w", encoding="utf-8") as german_file:
    german_file.write("A file with umlauts: ÄÖÜäöü")

Let’s try to read the same file using a different encoding:

with open("german.txt", "r", encoding="cp1252") as german_file:
    content = german_file.read()

content

'A file with umlauts: Ã„Ã–ÃœÃ¤Ã¶Ã¼'

Uh-oh! The content of this file is completely scrambled! This is because we tried to read it in an encoding that is different from the original encoding it was written in.
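For comparison, here is a sketch of the same round trip where the file is read with the encoding it was written in. The file is placed in a temporary directory, which is an assumption made only so that the example does not touch any existing german.txt.

```python
import os
import tempfile

text = "A file with umlauts: ÄÖÜäöü"
path = os.path.join(tempfile.mkdtemp(), "german.txt")

# Write and read with the same, explicitly stated encoding.
with open(path, "w", encoding="utf-8") as german_file:
    german_file.write(text)

with open(path, "r", encoding="utf-8") as german_file:
    content = german_file.read()

print(content == text)  # True
```

Passing the encoding explicitly in both places means the result no longer depends on the default encoding of the operating system.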

Depending on the encoding, the read may even fail completely:

with open("german.txt", "r", encoding="utf-16") as german_file:
    content = german_file.read()

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[61], line 2
      1 with open("german.txt", "r", encoding="utf-16") as german_file:
----> 2     content = german_file.read()

File /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/codecs.py:322, in BufferedIncrementalDecoder.decode(self, input, final)
    319 def decode(self, input, final=False):
    320     # decode input (taking the buffer into account)
    321     data = self.buffer + input
--> 322     (result, consumed) = self._buffer_decode(data, self.errors, final)
    323     # keep undecoded input until the next call
    324     self.buffer = data[consumed:]

File /opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/encodings/utf_16.py:61, in IncrementalDecoder._buffer_decode(self, input, errors, final)
     58 def _buffer_decode(self, input, errors, final):
     59     if self.decoder is None:
     60         (output, consumed, byteorder) = \
---> 61             codecs.utf_16_ex_decode(input, errors, 0, final)
     62         if byteorder == -1:
     63             self.decoder = codecs.utf_16_le_decode

UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0xbc in position 32: truncated data

This is actually better than scrambled content because of the following general programming principle:

It’s better to crash than to proceed with invalid data.

The reason for this is simple: If you crash, then at least you know you have an error. If you proceed with invalid data, then you may never know that you have an error until something really bad happens much later.

Consider a file that contains bank transactions. If you fail to read this file, then you know that your software has an error and you may try to fix it. But if you proceed with scrambled content, you may process completely wrong transactions resulting in a lot of headaches for a lot of people (including you).
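The two failure modes can be contrasted directly on a bytes object. In the sketch below the first wrong decoding "succeeds" silently and produces scrambled text, while the second one raises an exception that we can notice and react to. The variable names are illustrative only.

```python
text = "A file with umlauts: ÄÖÜäöü"
data = text.encode("utf-8")

# Wrong encoding, but no error: we get scrambled text and nothing warns us.
scrambled = data.decode("cp1252")
print(scrambled == text)  # False

# Wrong encoding that cannot make sense of the bytes: Python raises an error.
try:
    data.decode("utf-16")
    failed = False
except UnicodeDecodeError as error:
    failed = True
    print("Refusing to continue:", error)
```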