Effective Python: Strings and Unicode

What the �—±?

Ever cut and paste text between applications and gotten unexpected results? Particularly when the source is a webpage? For example, the sentence[1]

◆ News: Doña Ana’s slump—lawsuit filed ☹

can paste as

  1. ◆ News: Doña Ana’s slump—lawsuit filed ☹
  2. News: Doa Anas slumplawsuit filed
  3. ◆ News: Do�a Ana’s slump—lawsuit filed ☹
  4. ◆ News: Doña Ana’s slump—lawsuit filed ☹
  5. ? News: Do�a Ana?s slump?lawsuit filed ?
  6. News: Do�a Anas slumplawsuit filed
  7. ◆ News: Doa Ana’s slump—lawsuit filed ☹
  8. ? News: Doa Ana?s slump?lawsuit filed ?

Only two of the eight are rendered correctly, and row 7 is correct only in HTML, since it is actually the string &#9670; News: Doa Ana&#8217;s slump&#8212;lawsuit filed &#9785;. It is particularly annoying for the data scientist that the lost characters are often the irrelevant ones: the apostrophe, the dash, the emoji. Accents, on the other hand, matter and can change the meaning of a word. What is going on? In a word: Unicode.

Unicode

Unicode is a universally agreed assignment of a code point number to each printable character or glyph[2]. It is a vast extension of the 128-character ASCII table. The current version at the time of writing, Unicode 14.0, contains 144,697 glyphs.

A string is a sequence of glyphs. Displayed on the screen, a string corresponds to a sequence of code points, rendered by looking up each code point in a font. How does the computer store a string? To have capacity for 150,000 different glyphs requires 18 bits, since \(2^{17}=131072<150000<2^{18}=262144\). Computers prefer even numbers of bytes[3], so one glyph is stored using 4 bytes in a long integer type. That gives space for \(2^{32}\approx 4.3\) billion glyphs, plenty for new emojis.
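In practice CPython is cleverer than a flat four bytes per character: since PEP 393 it stores each string with one, two, or four bytes per character, depending on the widest code point present. A quick check with sys.getsizeof (the exact byte counts below are indicative and vary by Python version):

import sys

ascii_s = 'a' * 1000              # all code points < 256: 1 byte each
bmp_s = '☹' * 1000                # 9785 < 2**16: 2 bytes each
astral_s = '\U0001f600' * 1000    # 128512 >= 2**16: 4 bytes each
sys.getsizeof(ascii_s), sys.getsizeof(bmp_s), sys.getsizeof(astral_s)
# (1049, 2074, 4076) on a typical CPython 3.x build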

Using longs is wasteful. In English, one byte, the old ASCII standard, suffices for most glyphs. For a small document the waste is manageable, but for data storage or transmission an unnecessary \(4\times\) increase in file size is unacceptable. As a result, encoding mechanisms are used to store strings efficiently. Encodings can be customized to a particular language, tailored to the specific characters it uses most frequently, or be general purpose. Common encodings include Latin[4], for the Latin alphabet plus standard accented Latin letters, and UTF-8[5], which is general purpose but still compact. Many encodings map the lower-ASCII characters, including a-z, A-Z and 0-9, to their ASCII codes using one byte. Thereafter, they diverge.
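The size difference is easy to see by encoding the same word several ways (UTF-32 spends four bytes per character plus a four-byte byte-order mark):

w = 'Doña'
len(w.encode('utf-32')), len(w.encode('utf-8')), len(w.encode('latin'))
# (20, 5, 4)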

When a text file is saved, or transmitted over the Internet, it is encoded into a series of bytes to save space. To recreate it, you must know the encoding. The top of a webpage often includes a line

<meta charset="utf-8"/>

that specifies how it is encoded.
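Over HTTP the encoding is usually declared in the Content-Type header too. A minimal sketch with the standard library (the URL here is just an illustration):

from urllib.request import urlopen

with urlopen('https://en.wikipedia.org/wiki/UTF-8') as resp:
    charset = resp.headers.get_content_charset()  # 'utf-8' for this page
    text = resp.read().decode(charset or 'utf-8')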

Unicode in Python

A string in Python[6] can include any Unicode character. The function ord returns a character’s code point:

s = '◆ News: Doña Ana’s slump—lawsuit filed ☹'
[ord(c) for c in s]
# [9670, 32, 78, 101, 119, 115, 58, 32, 68, 111, 241, 97, 32,
# 65, 110, 97, 8217, 115, 32, 115, 108, 117, 109, 112, 8212,
# 108, 97, 119, 115, 117, 105, 116, 32, 102, 105, 108, 101, 100,
# 32, 9785]

ord is universal and follows Unicode. You may already know that ord('A') is 65, as in ASCII; it appears at the start of the second row of output, from “Ana”. 65 is octal 101, which may explain the original choice.

In Excel, if you type =CHAR(65) it returns A. The analogous Python function is chr. It is the inverse of ord, returning the glyph for a code point. Depending on the font and output device you are using, that may translate into a recognizable symbol.
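For example:

chr(65), chr(241), chr(9670)
# ('A', 'ñ', '◆')

Composing chr with ord gives back the original string: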

''.join([chr(ord(c)) for c in s])==s
# True

To save s to a file we can try[7]

Path('dm.txt').write_text(s)
# UnicodeEncodeError: 'charmap' codec can't encode character '\u25c6'
# in position 0: character maps to <undefined>

Python is complaining about the first (non-ASCII) character: \u25c6 means hexadecimal 0x25c6, which equals 9670, as shown by ord above. For write_text to work we must specify an encoding. The emerging default is UTF-8.

Path('dm.txt').write_text(s, encoding='utf-8')
# 40

The return value shows how many characters are written; s has 40 characters. If we want to see what Python wrote we can open the file in binary mode[8] with no decoding:

Path('dm.txt').read_bytes()
# b'\xe2\x97\x86 News: Do\xc3\xb1a Ana\xe2\x80\x99s
# slump\xe2\x80\x94lawsuit filed \xe2\x98\xb9'

As expected, ASCII characters are unchanged. Non-ASCII characters have been converted into two or three bytes[9]. \xe2 means the hex number e2 \(=14\times 16+2=226\). The b prefix indicates the result is an array of bytes.
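Reading the file back in text mode with the matching encoding recovers s exactly:

Path('dm.txt').read_text(encoding='utf-8') == s
# True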

We can also see the encoded string directly using s.encode('utf-8'). And starting from the encoded bytes, we can recover the string with

b = s.encode('utf-8')
s2 = b.decode('utf-8')
s == s2
# True

I created the list of eight Unicode manglings of s in the Introduction by using different encodings and decodings, and handling errors in different ways (see code in the Appendix). For example, the rows full of question marks come from encoding with Latin, which fails for s since it contains non-Latin characters. Replacing the offending characters with errors='replace' gives

b = s.encode('Latin', errors='replace')
b
# b'? News: Do\xf1a Ana?s slump?lawsuit filed ?'
b.decode('Latin')
# '? News: Doña Ana?s slump?lawsuit filed ?'

The ñ is a Latin character and survives the round trip. The apostrophe, diamond, dash, and emoji do not.
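The HTML-only row 7 in the introduction comes from the xmlcharrefreplace error handler, which swaps each unencodable character for its HTML numeric entity:

s.encode('Latin', errors='xmlcharrefreplace')
# b'&#9670; News: Do\xf1a Ana&#8217;s slump&#8212;lawsuit filed &#9785;'

Decoding those bytes as UTF-8 with errors='ignore' then drops the stray \xf1, which is how row 7 lost its ñ.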

Ways to represent non-ASCII characters

Special characters in a UTF-8 or Latin encoded string are not human readable. There are several alternatives when readability is important: some just show the code point more clearly, others give a description of the character. In HTML, it is common to see characters written like &ntilde;, which (no surprise) renders as ñ. The forms &#nnnn; (decimal) and &#xhhhh; (hexadecimal, with upper or lower case a-f) are also allowed, where the number equals the code point; it can have any number of digits and leading zeros. Online converters give the HTML codes for any glyph, and many list the accented Latin characters. Or, of course, you can use Python, as explained below.
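Python’s html module converts entities back to glyphs; all three forms below name the same code point:

import html

html.unescape('&ntilde; &#241; &#xf1;')
# 'ñ ñ ñ'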

The next table gives, for each glyph in s, the code point in decimal, hex, and octal (base 8), the Unicode category (Lu = letter uppercase, Ll = letter lowercase, Pd = punctuation dash, etc.), the official Unicode description of the glyph, and its representation as a named HTML entity and as a hex entity. (The code to create the table is in the Appendix.)

Details of the glyphs in s.

glyph  code  hex     octal    category  description                      html    html2
       32    0x20    0o40     Zs        SPACE                                    &#x20;
:      58    0x3a    0o72     Po        COLON                                    &#x3a;
A      65    0x41    0o101    Lu        LATIN CAPITAL LETTER A                   &#x41;
D      68    0x44    0o104    Lu        LATIN CAPITAL LETTER D                   &#x44;
N      78    0x4e    0o116    Lu        LATIN CAPITAL LETTER N                   &#x4e;
a      97    0x61    0o141    Ll        LATIN SMALL LETTER A                     &#x61;
d      100   0x64    0o144    Ll        LATIN SMALL LETTER D                     &#x64;
e      101   0x65    0o145    Ll        LATIN SMALL LETTER E                     &#x65;
f      102   0x66    0o146    Ll        LATIN SMALL LETTER F                     &#x66;
i      105   0x69    0o151    Ll        LATIN SMALL LETTER I                     &#x69;
l      108   0x6c    0o154    Ll        LATIN SMALL LETTER L                     &#x6c;
m      109   0x6d    0o155    Ll        LATIN SMALL LETTER M                     &#x6d;
n      110   0x6e    0o156    Ll        LATIN SMALL LETTER N                     &#x6e;
o      111   0x6f    0o157    Ll        LATIN SMALL LETTER O                     &#x6f;
p      112   0x70    0o160    Ll        LATIN SMALL LETTER P                     &#x70;
s      115   0x73    0o163    Ll        LATIN SMALL LETTER S                     &#x73;
t      116   0x74    0o164    Ll        LATIN SMALL LETTER T                     &#x74;
u      117   0x75    0o165    Ll        LATIN SMALL LETTER U                     &#x75;
w      119   0x77    0o167    Ll        LATIN SMALL LETTER W                     &#x77;
ñ      241   0xf1    0o361    Ll        LATIN SMALL LETTER N WITH TILDE  ntilde  &#xf1;
—      8212  0x2014  0o20024  Pd        EM DASH                          mdash   &#x2014;
’      8217  0x2019  0o20031  Pf        RIGHT SINGLE QUOTATION MARK      rsquo   &#x2019;
⌨      9000  0x2328  0o21450  So        KEYBOARD                                 &#x2328;
       9259  0x242b  0o22053  Cn        n/a                                      &#x242b;
◆      9670  0x25c6  0o22706  So        BLACK DIAMOND                            &#x25c6;
☹      9785  0x2639  0o23071  So        WHITE FROWNING FACE                      &#x2639;

Data in the wild

It is a sad fact that YOU CANNOT GUESS ENCODINGS. Go ahead, Google it or try: there is no reliable way to infer an encoding if it is not known. If you don’t know the encoding, start with UTF-8. If that fails, try Latin, if it makes sense linguistically. For data generated by Windows (e.g., older CSV files produced by Excel) try UTF-16 or UTF-16-LE. If the data comes from a particular language, you can look at encodings specific to it. After that, you’re likely tolerating the �—±?.
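Here is a minimal sketch of that try-in-order advice (the candidate list is an assumption to adjust to your data’s provenance; note that Latin decodes any byte sequence whatsoever, so it always "succeeds" and belongs last):

from pathlib import Path

def read_with_fallback(path, candidates=('utf-8', 'utf-16', 'latin')):
    raw = Path(path).read_bytes()
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError('no candidate encoding worked')

Remember that a successful decode is not proof of the right encoding; check that the result makes sense linguistically.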

Python 3.7 includes nearly 100 different encodings[10]. Given a character and a list of encodings, it is easy to check whether it encodes validly:

import pandas as pd

enc_list = ['ascii', 'latin', 'iso8859_2', 'utf_32', 'utf_16_le', 'utf_8']
x = 'ñ'
fails = []
success = []
for e in enc_list:
    try:
        success.append([e, x.encode(e)])
    except UnicodeEncodeError:
        fails.append([e, 'fails'])
df1 = pd.DataFrame(success + fails, columns=['encoding', 'code'])

Conversely, given a code value we can see if and how it decodes. Note the different except clause.

x = b'\xf1'
fails = []
success = []
for e in enc_list:
    try:
        success.append([e, x.decode(e)])
    except UnicodeDecodeError:
        fails.append([e, 'fails'])
df2 = pd.DataFrame(success + fails, columns=['encoding', 'decode'])

df = pd.concat((df1.set_index('encoding'), df2.set_index('encoding')), axis=1)
Attempts to encode ñ and decode b'\xf1' with different encodings. latin and iso8859_1 are the same encoding, for Western European languages; iso8859_2 is tailored to Central and Eastern European languages.

encoding   code                                 decode
latin      b'\xf1'                              ñ
utf_32     b'\xff\xfe\x00\x00\xf1\x00\x00\x00'  fails
utf_16_le  b'\xf1\x00'                          fails
utf_8      b'\xc3\xb1'                          fails
ascii      fails                                fails
iso8859_2  fails                                ń

One final issue can occur: the code point and decoding are correct, but the glyph is not in the font you are using. In that case there is no decoding error; the glyph typically renders as a placeholder box or question mark.

What is an encoded string?

The Python object b = 'string'.encode('utf-8') has type bytes. In many ways it is interchangeable with a string: for example, b.upper(), b.find(b'g'), and b[3:] all work as expected. bytes objects can be converted into integers using int.from_bytes. You have to specify whether the most significant byte comes first (byteorder='big') or last (byteorder='little'). Thus

g = b'\x01\x00\x00\x00\x00'
int.from_bytes(g, 'big'), 256**4, int.from_bytes(g, 'little')
# (4294967296, 4294967296, 1)
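int.to_bytes is the inverse:

(4294967296).to_bytes(5, 'big') == g
# True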

Appendix

Here is the code used to create the eight Unicode strings in the introduction.

import pandas as pd

def example(s, encode, decode):
    ans = []
    for eh in ('ignore', 'replace', 'xmlcharrefreplace'):
        b = s.encode(encode, errors=eh)
        for eh2 in ('ignore', 'replace'):
            ans.append([eh, eh2, b, b.decode(decode, eh2)])

    df = pd.DataFrame(ans, columns=['encode', 'decode', 'bytes', 'decoded_bytes'])
    df = df.set_index(['encode', 'decode'])
    return df

s = "◆ News: Doña Ana’s slump—lawsuit filed ☹"

eg1 = example(s, 'Latin', 'utf-8')
eg2 = example(s, 'utf-8', 'Latin')

options = set(eg1.decoded_bytes)
options = options.union(eg2.decoded_bytes)
print('* ' + '\n* '.join([s] + list(options)))

You can cut and paste this code into JupyterLab to run it.

Here is the code used to create the table. I added two extra characters to s, with code points 9000 and 9259. Code point 9000 is a keyboard symbol. Code point 9259 is unassigned (not every number has been assigned a glyph) and it does not print in any font.

import unicodedata
import html

ans = []
s = s + chr(9259) + chr(9000)
sls = sorted(set(s))
for c in sls:
    o = ord(c)
    # named HTML entity, if one exists for this code point
    h = html.entities.codepoint2name.get(o, '')
    cn = unicodedata.category(c)
    if cn != 'Cn':
        n = unicodedata.name(c)
    else:
        # Cn = unassigned code point; it has no Unicode name
        n = 'n/a'
    ans.append([c, o, hex(o), oct(o), cn, n, h, f'&#x{hex(o)[2:]};'])


df = pd.DataFrame(ans, columns=['glyph', 'code', 'hex', 'octal',
                                'category', 'description', 'html', 'html2'])
df

The function html.unescape('&#9670;') converts an HTML code back to a glyph. And the function unicodedata.decomposition('ñ') explodes compound characters into their parts, in this case an n and a combining tilde, with code points 0x6e and 0x303 respectively.
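For example:

html.unescape('&#9670;')
# '◆'
unicodedata.decomposition('ñ')
# '006E 0303'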


  1. Doña Ana is a county in New Mexico. It is the only county in the US whose name contains an accented character. Download county data from a website and, odds are, it’ll cause you problems.↩︎

  2. OED glyph n. mark or symbol. Can include letters, characters, emojis, mathematical symbols, dividers, and punctuation.↩︎

  3. One byte is eight bits and can store a number between 0 and \(2^8-1=255\).↩︎

  4. Latin is also called ISO-8859-1.↩︎

  5. Unicode Transformation Format, 8 bit. According to Wikipedia (https://en.wikipedia.org/wiki/UTF-8), as of January 2022 UTF-8 accounts for 97.7% of all web pages and 986 of the top 1,000 highest ranked web pages.↩︎

  6. Since Python 3.0; Python 2 included Unicode and non-Unicode strings.↩︎

  7. See my post on Path and files.↩︎

  8. If we open in text mode we need to specify the encoding, but we will see s recreated, not the raw bytes in the file.↩︎

  9. UTF-8 can encode all valid code points using one to four bytes. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.↩︎

  10. See a StackOverflow post for a script to extract the list.↩︎

posted 2022-02-15 | tags: Effective Python, Python, strings
