Effective Python: Strings and Unicode

What the �—±?

Ever cut and paste text between applications and gotten unexpected results? Particularly when the source is a webpage? For example, the sentence[1]

◆ News: Doña Ana’s slump—lawsuit filed ☹

can paste as

  1. ◆ News: Doña Ana’s slump—lawsuit filed ☹
  2. News: Doa Anas slumplawsuit filed
  3. ◆ News: Do�a Ana’s slump—lawsuit filed ☹
  4. ◆ News: Doña Ana’s slump—lawsuit filed ☹
  5. ? News: Do�a Ana?s slump?lawsuit filed ?
  6. News: Do�a Anas slumplawsuit filed
  7. ◆ News: Doa Ana’s slump—lawsuit filed ☹
  8. ? News: Doa Ana?s slump?lawsuit filed ?

Only two of the eight are rendered correctly, and row 7 is correct only in HTML, since it is actually the string &#9670; News: Doa Ana&#8217;s slump&#8212;lawsuit filed &#9785;. It is particularly annoying for the data scientist that the lost characters are often the irrelevant ones: the apostrophe, the dash, the emoji. Accents, on the other hand, matter and can change the meaning of a word. What is going on? In a word: Unicode.

Unicode

Unicode is a universally agreed assignment of a code point number to each printable character or glyph[2]. It is a vast extension of the 128-character ASCII table. The current version at the time of writing, Unicode 14.0, contains 144,697 glyphs.

A string is a sequence of glyphs. Displayed on the screen, a string corresponds to a sequence of code points, rendered by looking up each code point in a font. How does the computer store a string? To have capacity for 150,000 different glyphs requires 18 bits, since \(2^{17}=131072<150000<2^{18}=262144\). Computers prefer even numbers of bytes[3], so one glyph is stored using 4 bytes in a long integer type. That gives space for \(2^{32}\approx 4.3\) billion glyphs, plenty for new emojis.
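In practice CPython is cleverer than a flat four bytes per character: since PEP 393 it stores each string with one, two, or four bytes per character, depending on the widest code point present. A quick check with sys.getsizeof (the exact byte counts below are indicative and vary by Python version):

import sys

ascii_s = 'a' * 1000              # all code points < 256: 1 byte each
bmp_s = '☹' * 1000                # 9785 < 2**16: 2 bytes each
astral_s = '\U0001f600' * 1000    # 128512 >= 2**16: 4 bytes each
sys.getsizeof(ascii_s), sys.getsizeof(bmp_s), sys.getsizeof(astral_s)
# (1049, 2074, 4076) on a typical CPython 3.x build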

Using longs is wasteful. In English, one byte, the old ASCII standard, suffices for most glyphs. For a small document the waste is manageable, but for data storage or transmission an unnecessary \(4\times\) increase in file size is unacceptable. As a result, encoding mechanisms are used to store strings efficiently. Encodings can be customized to a particular language, tailored to the specific characters it uses most frequently, or be general purpose. Common encodings include Latin[4], for the Latin alphabet plus standard accented Latin letters, and UTF-8[5], which is general purpose but still compact. Many encodings map the lower-ASCII characters, including a-z, A-Z and 0-9, to their ASCII codes using one byte. Thereafter, they diverge.
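The size difference is easy to see by encoding the same word several ways (UTF-32 spends four bytes per character plus a four-byte byte-order mark):

w = 'Doña'
len(w.encode('utf-32')), len(w.encode('utf-8')), len(w.encode('latin'))
# (20, 5, 4)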

When a text file is saved, or transmitted over the Internet, it is encoded into a series of bytes to save space. To recreate it, you must know the encoding. The top of a webpage often includes a line

<meta charset="utf-8"/>

that specifies how it is encoded.
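Over HTTP the encoding is usually declared in the Content-Type header too. A minimal sketch with the standard library (the URL here is just an illustration):

from urllib.request import urlopen

with urlopen('https://en.wikipedia.org/wiki/UTF-8') as resp:
    charset = resp.headers.get_content_charset()  # 'utf-8' for this page
    text = resp.read().decode(charset or 'utf-8')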

Unicode in Python

A string in Python[6] can include any Unicode character. The function ord returns a character’s code point:

s = '◆ News: Doña Ana’s slump—lawsuit filed ☹'
[ord(c) for c in s]
# [9670, 32, 78, 101, 119, 115, 58, 32, 68, 111, 241, 97, 32,
# 65, 110, 97, 8217, 115, 32, 115, 108, 117, 109, 112, 8212,
# 108, 97, 119, 115, 117, 105, 116, 32, 102, 105, 108, 101, 100,
# 32, 9785]

ord is universal and follows Unicode. You may already know that ord('A') is 65, as in ASCII; it appears at the start of the second row of output, from “Ana”. 65 is octal 101, which may explain the original choice.

In Excel, if you type =CHAR(65) it returns A. The analogous Python function is chr. It is the inverse of ord, returning the glyph for a code point. Depending on the font and output device you are using, that may translate into a recognizable symbol.
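For example:

chr(65), chr(241), chr(9670)
# ('A', 'ñ', '◆')

Composing chr with ord gives back the original string: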

''.join([chr(ord(c)) for c in s])==s
# True

To save s to a file we can try[7]

Path('dm.txt').write_text(s)
# UnicodeEncodeError: 'charmap' codec can't encode character '\u25c6'
# in position 0: character maps to <undefined>

Python is complaining about the first (non-ASCII) character: \u25c6 means hexadecimal 0x25c6, which equals 9670, as shown by ord above. For write_text to work we must specify an encoding. The emerging default is UTF-8.

Path('dm.txt').write_text(s, encoding='utf-8')
# 40

The return value shows how many characters are written; s has 40 characters. If we want to see what Python wrote we can open the file in binary mode[8] with no decoding:

Path('dm.txt').read_bytes()
# b'\xe2\x97\x86 News: Do\xc3\xb1a Ana\xe2\x80\x99s
# slump\xe2\x80\x94lawsuit filed \xe2\x98\xb9'

As expected, ASCII characters are unchanged. Non-ASCII characters have been converted into two or three bytes[9]. \xe2 means the hex number e2 \(=14\times 16+2=226\). The b prefix indicates the result is an array of bytes.
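Reading the file back in text mode with the matching encoding recovers s exactly:

Path('dm.txt').read_text(encoding='utf-8') == s
# True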

We can also see the encoded string directly using s.encode('utf-8'). And starting from the encoded bytes, we can recover the string with

b = s.encode('utf-8')
s2 = b.decode('utf-8')
s == s2
# True

I created the list of eight Unicode manglings of s in the Introduction by using different encodings and decodings, and handling errors in different ways (see code in the Appendix). For example, the rows full of question marks come from encoding with Latin, which fails for s since it contains non-Latin characters. Replacing the offending characters with errors='replace' gives

b = s.encode('Latin', errors='replace')
b
# b'? News: Do\xf1a Ana?s slump?lawsuit filed ?'
b.decode('Latin')
# '? News: Doña Ana?s slump?lawsuit filed ?'

The ñ is a Latin character and survives the round trip. The apostrophe, diamond, dash, and emoji do not.
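The HTML-only row 7 in the introduction comes from the xmlcharrefreplace error handler, which swaps each unencodable character for its HTML numeric entity:

s.encode('Latin', errors='xmlcharrefreplace')
# b'&#9670; News: Do\xf1a Ana&#8217;s slump&#8212;lawsuit filed &#9785;'

Decoding those bytes as UTF-8 with errors='ignore' then drops the stray \xf1, which is how row 7 lost its ñ.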

Ways to represent non-ASCII characters

Special characters in a UTF-8 or Latin encoded string are not human readable. There are several alternatives when readability is important: some just show the code point more clearly, others give a description of the character. In HTML, it is common to see characters written like &ntilde;, which (no surprise) renders as ñ. The forms &#nnnn; (decimal) and &#xhhhh; (hexadecimal, with upper or lower case a-f) are also allowed, where the number equals the code point; it can have any number of digits and leading zeros. Online converters give the HTML codes for any glyph, and many list the accented Latin characters. Or, of course, you can use Python, as explained below.
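Python’s html module converts entities back to glyphs; all three forms below name the same code point:

import html

html.unescape('&ntilde; &#241; &#xf1;')
# 'ñ ñ ñ'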

The next table gives, for each glyph in s, the code point in decimal, hex, and octal (base 8), the Unicode category (Lu = letter uppercase, Ll = letter lowercase, Pd = punctuation dash, etc.), the official Unicode description of the glyph, and its representation as a named HTML entity and as a hex entity. (The code to create the table is in the Appendix.)

Details of the glyphs in s.

glyph  code  hex     octal    category  description                      html    html2
       32    0x20    0o40     Zs        SPACE                                    &#x20;
:      58    0x3a    0o72     Po        COLON                                    &#x3a;
A      65    0x41    0o101    Lu        LATIN CAPITAL LETTER A                   &#x41;
D      68    0x44    0o104    Lu        LATIN CAPITAL LETTER D                   &#x44;
N      78    0x4e    0o116    Lu        LATIN CAPITAL LETTER N                   &#x4e;
a      97    0x61    0o141    Ll        LATIN SMALL LETTER A                     &#x61;
d      100   0x64    0o144    Ll        LATIN SMALL LETTER D                     &#x64;
e      101   0x65    0o145    Ll        LATIN SMALL LETTER E                     &#x65;
f      102   0x66    0o146    Ll        LATIN SMALL LETTER F                     &#x66;
i      105   0x69    0o151    Ll        LATIN SMALL LETTER I                     &#x69;
l      108   0x6c    0o154    Ll        LATIN SMALL LETTER L                     &#x6c;
m      109   0x6d    0o155    Ll        LATIN SMALL LETTER M                     &#x6d;
n      110   0x6e    0o156    Ll        LATIN SMALL LETTER N                     &#x6e;
o      111   0x6f    0o157    Ll        LATIN SMALL LETTER O                     &#x6f;
p      112   0x70    0o160    Ll        LATIN SMALL LETTER P                     &#x70;
s      115   0x73    0o163    Ll        LATIN SMALL LETTER S                     &#x73;
t      116   0x74    0o164    Ll        LATIN SMALL LETTER T                     &#x74;
u      117   0x75    0o165    Ll        LATIN SMALL LETTER U                     &#x75;
w      119   0x77    0o167    Ll        LATIN SMALL LETTER W                     &#x77;
ñ      241   0xf1    0o361    Ll        LATIN SMALL LETTER N WITH TILDE  ntilde  &#xf1;
—      8212  0x2014  0o20024  Pd        EM DASH                          mdash   &#x2014;
’      8217  0x2019  0o20031  Pf        RIGHT SINGLE QUOTATION MARK      rsquo   &#x2019;
⌨      9000  0x2328  0o21450  So        KEYBOARD                                 &#x2328;
       9259  0x242b  0o22053  Cn        n/a                                      &#x242b;
◆      9670  0x25c6  0o22706  So        BLACK DIAMOND                            &#x25c6;
☹      9785  0x2639  0o23071  So        WHITE FROWNING FACE                      &#x2639;

Data in the wild

It is a sad fact that YOU CANNOT GUESS ENCODINGS. Go ahead, Google it or try: there is no reliable way to infer an encoding if it is not known. If you don’t know the encoding, start with UTF-8. If that fails, try Latin, if it makes sense linguistically. For data generated by Windows (e.g., older CSV files produced by Excel) try UTF-16 or UTF-16-LE. If the data comes from a particular language, you can look at encodings specific to it. After that, you’re likely tolerating the �—±?.
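Here is a minimal sketch of that try-in-order advice (the candidate list is an assumption to adjust to your data’s provenance; note that Latin decodes any byte sequence whatsoever, so it always "succeeds" and belongs last):

from pathlib import Path

def read_with_fallback(path, candidates=('utf-8', 'utf-16', 'latin')):
    raw = Path(path).read_bytes()
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError('no candidate encoding worked')

Remember that a successful decode is not proof of the right encoding; check that the result makes sense linguistically.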

Python 3.7 includes nearly 100 different encodings[10]. Given a character and a list of encodings, it is easy to check whether it encodes validly:

import pandas as pd

enc_list = ['ascii', 'latin', 'iso8859_2', 'utf_32', 'utf_16_le', 'utf_8']
x = 'ñ'
fails = []
success = []
for e in enc_list:
    try:
        success.append([e, x.encode(e)])
    except UnicodeEncodeError:
        fails.append([e, 'fails'])
df1 = pd.DataFrame(success + fails, columns=['encoding', 'code'])

Conversely, given a code value we can see if and how it decodes. Note the different except clause.

x = b'\xf1'
fails = []
success = []
for e in enc_list:
    try:
        success.append([e, x.decode(e)])
    except UnicodeDecodeError:
        fails.append([e, 'fails'])
df2 = pd.DataFrame(success + fails, columns=['encoding', 'decode'])

df = pd.concat((df1.set_index('encoding'), df2.set_index('encoding')), axis=1)
Attempts to encode ñ and decode b'\xf1' with different encodings. latin and iso8859_1 are the same encoding, for Western European languages; iso8859_2 is tailored to Central and Eastern European languages.

encoding   code                                 decode
latin      b'\xf1'                              ñ
utf_32     b'\xff\xfe\x00\x00\xf1\x00\x00\x00'  fails
utf_16_le  b'\xf1\x00'                          fails
utf_8      b'\xc3\xb1'                          fails
ascii      fails                                fails
iso8859_2  fails                                ń

One final issue can occur: the code point and decoding are correct, but the glyph is not in the font you are using. In that case there is no decoding error; the glyph typically renders as a placeholder box or question mark.

What is an encoded string?

The Python object b = 'string'.encode('utf-8') has type bytes. In many ways it is interchangeable with a string: for example, b.upper(), b.find(b'g'), and b[3:] all work as expected. bytes objects can be converted into integers using int.from_bytes. You have to specify whether the most significant byte comes first (byteorder='big') or last (byteorder='little'). Thus

g = b'\x01\x00\x00\x00\x00'
int.from_bytes(g, 'big'), 256**4, int.from_bytes(g, 'little')
# (4294967296, 4294967296, 1)
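int.to_bytes is the inverse:

(4294967296).to_bytes(5, 'big') == g
# True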

Appendix

Here is the code used to create the eight Unicode strings in the introduction.

import pandas as pd

def example(s, encode, decode):
    ans = []
    for eh in ('ignore', 'replace', 'xmlcharrefreplace'):
        b = s.encode(encode, errors=eh)
        for eh2 in ('ignore', 'replace'):
            ans.append([eh, eh2, b, b.decode(decode, eh2)])

    df = pd.DataFrame(ans, columns=['encode', 'decode', 'bytes', 'decoded_bytes'])
    df = df.set_index(['encode', 'decode'])
    return df

s = "◆ News: Doña Ana’s slump—lawsuit filed ☹"

eg1 = example(s, 'Latin', 'utf-8')
eg2 = example(s, 'utf-8', 'Latin')

options = set(eg1.decoded_bytes)
options = options.union(eg2.decoded_bytes)
print('* ' + '\n* '.join([s] + list(options)))

You can cut and paste this code into JupyterLab to run it.

Here is the code used to create the table. I added two extra characters to s, with code points 9000 and 9259. Code point 9000 is a keyboard symbol. Code point 9259 is unassigned (not every number has been assigned a glyph) and it does not print in any font.

import unicodedata
import html

ans = []
s = s + chr(9259) + chr(9000)
sls = sorted(set(s))
for c in sls:
    o = ord(c)
    # named HTML entity, if one exists for this code point
    h = html.entities.codepoint2name.get(o, '')
    cn = unicodedata.category(c)
    if cn != 'Cn':
        n = unicodedata.name(c)
    else:
        # Cn = unassigned code point; it has no Unicode name
        n = 'n/a'
    ans.append([c, o, hex(o), oct(o), cn, n, h, f'&#x{hex(o)[2:]};'])


df = pd.DataFrame(ans, columns=['glyph', 'code', 'hex', 'octal',
                                'category', 'description', 'html', 'html2'])
df

The function html.unescape('&#9670;') converts an HTML code back to a glyph. And the function unicodedata.decomposition('ñ') explodes compound characters into their parts, in this case an n and a combining tilde, with code points 0x6e and 0x303 respectively.
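For example:

html.unescape('&#9670;')
# '◆'
unicodedata.decomposition('ñ')
# '006E 0303'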


  1. Doña Ana is a county in New Mexico. It is the only county in the US whose name contains an accented character. Download county data from a website and, odds are, it’ll cause you problems.↩︎

  2. OED glyph n. mark or symbol. Can include letters, characters, emojis, mathematical symbols, dividers, and punctuation.↩︎

  3. One byte is eight bits and can store a number between 0 and \(2^8-1=255\).↩︎

  4. Latin is also called ISO-8859-1.↩︎

  5. Unicode Transformation Format, 8 bit. According to Wikipedia (https://en.wikipedia.org/wiki/UTF-8), as of January 2022 UTF-8 accounts for 97.7% of all web pages and 986 of the top 1,000 highest ranked web pages.↩︎

  6. Since Python 3.0; Python 2 included Unicode and non-Unicode strings.↩︎

  7. See my post on Path and files.↩︎

  8. If we open in text mode we need to specify the encoding, but we will see s recreated, not the raw bytes in the file.↩︎

  9. UTF-8 can encode all valid code points using one to four bytes. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.↩︎

  10. See a StackOverflow post for a script to extract the list.↩︎

posted 2022-02-15 | tags: Effective Python, Python, strings
