Ever cut and paste text between applications and get unexpected results? Particularly when the source is a webpage? For example, the sentence[^1]
◆ News: Doña Ana’s slump—lawsuit filed ☹
can paste as any of eight mangled variants, depending on the encoding and decoding applied (the code in the Appendix generates the full list). Only two of the eight are rendered correctly, and row 7 is correct only in HTML, since it is actually the string `&#9670; News: Doa Ana&#8217;s slump&#8212;lawsuit filed &#9785;`. It's particularly annoying for the data scientist that the lost data is often irrelevant (the apostrophe, the dash, the emoji), though not always: accents are important and can change the meaning of a word. What is going on? In a word: Unicode.
Unicode is a universally agreed assignment of a code point number to each printable character or glyph[^2]. It is a vast extension of the 128-character ASCII table. A recent list of Unicode characters contains 144,697 glyphs.
A string is a sequence of glyphs. Displayed on the screen, a string corresponds to a sequence of code points, rendered by looking up each code point in a font. How does the computer store a string? Capacity for 150,000 different glyphs requires 18 bits, since \(2^{17}=131072<150000<2^{18}=262144\). Computers prefer even numbers of bytes[^3], so one glyph is stored using 4 bytes in a `long` integer type. That gives space for \(2^{32}\approx 4.3\) billion glyphs, plenty for new emojis.
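You can see the four-bytes-per-glyph layout directly in Python with the UTF-32 encoding (encodings are discussed next); a minimal sketch:

```python
s = 'Doña'
# UTF-32 stores one 4-byte integer per glyph,
# plus a 4-byte byte-order mark at the front
len(s), len(s.encode('utf-32'))
# (4, 20)
```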
Using `long`s is wasteful. In English, one byte, the old ASCII standard, suffices for most glyphs. For a small document, the waste is manageable, but for data storage or transmission an unnecessary \(4\times\) increase in file size is unacceptable. As a result, encoding mechanisms are used to store strings efficiently. Encodings can be customized to a particular language, tailored to the specific characters it uses most frequently, or be general purpose. Common encodings include Latin[^4], for the Latin alphabet plus standard accented Latin letters, and UTF-8[^5], which is general purpose but still compact. Many encodings map the lower-ASCII characters, including a-z, A-Z, and 0-9, to their ASCII codes using one byte. Thereafter, they diverge.
When a text file is saved, or transmitted over the Internet, it is encoded into a series of bytes to save space. To recreate it, you must know the encoding. The top of a webpage often includes a line

```html
<meta charset="utf-8"/>
```

that specifies how it is encoded.
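As a sketch of what goes wrong when the encoding is not known, here is UTF-8 encoded text decoded with the wrong (Latin) table; the `encode` and `decode` calls are explained below:

```python
raw = 'Doña'.encode('utf-8')   # the bytes b'Do\xc3\xb1a'
raw.decode('latin')            # decoding with the wrong table
# 'DoÃ±a'
```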
A string in Python[^6] can include any Unicode character. The function `ord` returns a character's code point:

```python
s = '◆ News: Doña Ana’s slump—lawsuit filed ☹'
[ord(c) for c in s]
# [9670, 32, 78, 101, 119, 115, 58, 32, 68, 111, 241, 97, 32,
#  65, 110, 97, 8217, 115, 32, 115, 108, 117, 109, 112, 8212,
#  108, 97, 119, 115, 117, 105, 116, 32, 102, 105, 108, 101, 100,
#  32, 9785]
```
`ord` is universal and follows Unicode. You may already know `ord('A') = 65` from ASCII; the 65 appears in the second row of output, from "Ana". 65 is octal 101, which may explain the choice.
In Excel, if you type `=CHAR(65)` it returns A. The analogous Python function is `chr`. It is the inverse of `ord`, returning a glyph from a code point. Depending on the font and output device you are using, that may translate into a recognizable symbol.

```python
''.join([chr(ord(c)) for c in s]) == s
# True
```
To save `s` to a file we can try

```python
from pathlib import Path

Path('dm.txt').write_text(s)
# UnicodeEncodeError: 'charmap' codec can't encode character '\u25c6'
# in position 0: character maps to <undefined>
```
Python is complaining about the first (non-ASCII) character: `\u25c6` means hexadecimal `0x25c6`, which equals 9670, as shown by `ord` above. For `write_text` to work we must specify an encoding. The emerging default is UTF-8.

```python
Path('dm.txt').write_text(s, encoding='utf-8')
# 40
```
The return value shows how many characters were written; `s` has 40 characters. If we want to see what Python wrote we can open the file in binary mode[^7] with no decoding:

```python
Path('dm.txt').read_bytes()
# b'\xe2\x97\x86 News: Do\xc3\xb1a Ana\xe2\x80\x99s
# slump\xe2\x80\x94lawsuit filed \xe2\x98\xb9'
```
As expected, ASCII characters are unchanged. Non-ASCII characters have been converted into two or three bytes[^8]. `\xe2` means the hex number e2 \(=14\times 16+2=226\). The `b` prefix indicates the result is an array of `bytes`.
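Those multi-byte sequences follow the UTF-8 bit-packing scheme: a three-byte character like ◆ stores its 16 code point bits in the pattern `1110xxxx 10xxxxxx 10xxxxxx`. A minimal sketch reassembling the code point from the bytes:

```python
b3 = '◆'.encode('utf-8')
list(b3)
# [226, 151, 134]
# strip the marker bits and splice the three payloads back together:
# 1110xxxx 10xxxxxx 10xxxxxx
((b3[0] & 0x0f) << 12) | ((b3[1] & 0x3f) << 6) | (b3[2] & 0x3f)
# 9670
```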
We can also see the encoded string directly using `s.encode('utf-8')`. And starting from the encoded bytes, we can recover the string with

```python
b = s.encode('utf-8')
s2 = b.decode('utf-8')
s == s2
# True
```
I created the list of eight Unicode manglings of `s` in the Introduction by using different encodings and decodings, and handling errors in different ways (see the code in the Appendix). For example, the last row uses Latin to encode and decode. Encoding fails for `s` since it contains non-Latin characters. Replacing the offending characters gives

```python
b = s.encode('Latin', errors='replace')
b
# b'? News: Do\xf1a Ana?s slump?lawsuit filed ?'
b.decode('Latin')
# '? News: Doña Ana?s slump?lawsuit filed ?'
```
The ñ is a Latin character and survives the round trip. The apostrophe, diamond, dash, and emoji do not.
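A one-line check, using the same `replace` error handler, confirms exactly which characters are lost:

```python
# characters that do not survive the Latin round trip
sorted(c for c in s if c.encode('Latin', errors='replace').decode('Latin') != c)
# ['—', '’', '◆', '☹']
```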
Special characters in a UTF-8 or Latin encoded string are not human readable. There are several alternatives that can be used when readability is important. Some just show the code point more clearly, but others provide a description of the character. In HTML, it is common to see characters written like `&ntilde;`, which (no surprise) is rendered ñ. `&#nnnn;` (decimal) or `&#xhhhh;` (hexadecimal, with upper or lower case a-f) are also allowed, where the number equals the code point; it can include any number of digits or leading zeros. On-line converters give HTML codes for any glyph; others list accented Latin characters. Or, of course, you can use Python, as explained below.
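Python can generate these numeric references directly, via the `xmlcharrefreplace` error handler used in the Appendix code:

```python
# characters that will not encode are replaced by decimal references
'Doña Ana’s'.encode('ascii', errors='xmlcharrefreplace')
# b'Do&#241;a Ana&#8217;s'
```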
The next table gives, for each glyph in `s`: the code point in decimal, hex, and octal (base 8); the category (Lu = letter, uppercase; Ll = letter, lowercase; Pd = punctuation, dash; etc.); the official Unicode description of the glyph; and its representation as an HTML entity name (html) and as a hexadecimal character reference (html2). (The code to create the table is in the Appendix.)
| glyph | code | hex | octal | category | description | html | html2 |
|---|---|---|---|---|---|---|---|
|   | 32 | 0x20 | 0o40 | Zs | SPACE | | `&#x20;` |
| : | 58 | 0x3a | 0o72 | Po | COLON | | `&#x3a;` |
| A | 65 | 0x41 | 0o101 | Lu | LATIN CAPITAL LETTER A | | `&#x41;` |
| D | 68 | 0x44 | 0o104 | Lu | LATIN CAPITAL LETTER D | | `&#x44;` |
| N | 78 | 0x4e | 0o116 | Lu | LATIN CAPITAL LETTER N | | `&#x4e;` |
| a | 97 | 0x61 | 0o141 | Ll | LATIN SMALL LETTER A | | `&#x61;` |
| d | 100 | 0x64 | 0o144 | Ll | LATIN SMALL LETTER D | | `&#x64;` |
| e | 101 | 0x65 | 0o145 | Ll | LATIN SMALL LETTER E | | `&#x65;` |
| f | 102 | 0x66 | 0o146 | Ll | LATIN SMALL LETTER F | | `&#x66;` |
| i | 105 | 0x69 | 0o151 | Ll | LATIN SMALL LETTER I | | `&#x69;` |
| l | 108 | 0x6c | 0o154 | Ll | LATIN SMALL LETTER L | | `&#x6c;` |
| m | 109 | 0x6d | 0o155 | Ll | LATIN SMALL LETTER M | | `&#x6d;` |
| n | 110 | 0x6e | 0o156 | Ll | LATIN SMALL LETTER N | | `&#x6e;` |
| o | 111 | 0x6f | 0o157 | Ll | LATIN SMALL LETTER O | | `&#x6f;` |
| p | 112 | 0x70 | 0o160 | Ll | LATIN SMALL LETTER P | | `&#x70;` |
| s | 115 | 0x73 | 0o163 | Ll | LATIN SMALL LETTER S | | `&#x73;` |
| t | 116 | 0x74 | 0o164 | Ll | LATIN SMALL LETTER T | | `&#x74;` |
| u | 117 | 0x75 | 0o165 | Ll | LATIN SMALL LETTER U | | `&#x75;` |
| w | 119 | 0x77 | 0o167 | Ll | LATIN SMALL LETTER W | | `&#x77;` |
| ñ | 241 | 0xf1 | 0o361 | Ll | LATIN SMALL LETTER N WITH TILDE | ntilde | `&#xf1;` |
| — | 8212 | 0x2014 | 0o20024 | Pd | EM DASH | mdash | `&#x2014;` |
| ’ | 8217 | 0x2019 | 0o20031 | Pf | RIGHT SINGLE QUOTATION MARK | rsquo | `&#x2019;` |
| ⌨ | 9000 | 0x2328 | 0o21450 | So | KEYBOARD | | `&#x2328;` |
|   | 9259 | 0x242b | 0o22053 | Cn | n/a | | `&#x242b;` |
| ◆ | 9670 | 0x25c6 | 0o22706 | So | BLACK DIAMOND | | `&#x25c6;` |
| ☹ | 9785 | 0x2639 | 0o23071 | So | WHITE FROWNING FACE | | `&#x2639;` |
It is a sad fact that YOU CANNOT GUESS ENCODINGS. Go ahead, Google it or try it. There is no reliable way to infer an encoding if it is not known. If you don't know the encoding, start with UTF-8. If that fails, try Latin, provided the result makes sense linguistically. For data generated by Windows (e.g., older CSV files produced by Excel) try UTF-16 or UTF-16-LE. If the data comes from a particular language, you can look at encodings specific to it. After that, you're likely tolerating the �±?.
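That advice is easy to codify. Here is a minimal sketch; the helper name `try_decode` and the default list of encodings are illustrative, not from the original post:

```python
def try_decode(raw: bytes, encodings=('utf-8', 'utf-16', 'latin')):
    """Try each encoding in turn and return the first that decodes.

    Note: Latin decodes *any* byte string, so it belongs last, and
    "decodes" does not mean "is correct", only that no error was raised.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # last resort: tolerate the replacement characters
    return raw.decode('utf-8', errors='replace'), 'utf-8-replace'
```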
Python 3.7 includes nearly 100 different encodings[^9]. Given a character and a list of encodings, it is easy to check whether it encodes validly:

```python
import pandas as pd

enc_list = ['ascii', 'latin', 'iso8859_2', 'utf_32', 'utf_16_le', 'utf_8']
x = 'ñ'
fails = []
success = []
for e in enc_list:
    try:
        success.append([e, x.encode(e)])
    except UnicodeEncodeError:
        fails.append([e, 'fails'])
df1 = pd.DataFrame(success + fails, columns=['encoding', 'code'])
```
Conversely, given an encoded byte value we can see if and how it decodes. Note the different `except` clause.

```python
x = b'\xf1'
fails = []
success = []
for e in enc_list:
    try:
        success.append([e, x.decode(e)])
    except UnicodeDecodeError:
        fails.append([e, 'fails'])
df2 = pd.DataFrame(success + fails, columns=['encoding', 'decode'])

df = pd.concat((df1.set_index('encoding'), df2.set_index('encoding')), axis=1)
```
| encoding | code | decode |
|---|---|---|
| latin | b'\xf1' | ñ |
| utf_32 | b'\xff\xfe\x00\x00\xf1\x00\x00\x00' | fails |
| utf_16_le | b'\xf1\x00' | fails |
| utf_8 | b'\xc3\xb1' | fails |
| ascii | fails | fails |
| iso8859_2 | fails | ń |
One final issue can occur. The code point and decoding are correct, but the glyph is not in the font you are using. That typically generates a different type of warning.
The Python object `b = 'string'.encode('utf-8')` has type `bytes`. In many ways it is interchangeable with a string. For example, `b.upper()`, `b.find(b'g')`, and `b[3:]` all work as expected. Byte objects can be converted into integers using `int.from_bytes`. You have to specify whether the most significant byte comes first (`byteorder='big'`) or last (`byteorder='little'`). Thus

```python
g = b'\x01\x00\x00\x00\x00'
int.from_bytes(g, 'big'), 256**4, int.from_bytes(g, 'little')
# (4294967296, 4294967296, 1)
```
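The inverse, `int.to_bytes`, needs the length and the byte order as well:

```python
(2**32).to_bytes(5, 'big')
# b'\x01\x00\x00\x00\x00'
(1).to_bytes(5, 'little')
# b'\x01\x00\x00\x00\x00'
```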
Here is the code used to create the eight Unicode strings in the introduction.
```python
import pandas as pd

def example(s, encode, decode):
    ans = []
    for eh in ('ignore', 'replace', 'xmlcharrefreplace'):
        b = s.encode(encode, errors=eh)
        for eh2 in ('ignore', 'replace'):
            ans.append([eh, eh2, b, b.decode(decode, eh2)])
    df = pd.DataFrame(ans, columns=['encode', 'decode', 'bytes', 'decoded_bytes'])
    df = df.set_index(['encode', 'decode'])
    return df

s = "◆ News: Doña Ana’s slump—lawsuit filed ☹"

eg1 = example(s, 'Latin', 'utf-8')
eg2 = example(s, 'utf-8', 'Latin')

options = set(eg1.decoded_bytes)
options = options.union(eg2.decoded_bytes)
print('* ' + '\n* '.join([s] + list(options)))
```
You can cut and paste this code into Jupyter Lab to run it.
Here is the code used to create the table. I added two extra characters to `s`, with code points 9000 and 9259. Code point 9000 is a keyboard symbol. Code point 9259 is unassigned (not all numbers have been assigned to glyphs) and it does not print in any font.
```python
import unicodedata
import html

ans = []
s = s + chr(9259) + chr(9000)
sls = sorted(list(set(s)))
for c in sls:
    o = ord(c)
    h = html.entities.codepoint2name.get(o, '')
    cn = unicodedata.category(c)
    if cn != 'Cn':   # category Cn means the code point is not assigned
        n = unicodedata.name(c)
    else:
        n = 'n/a'
    ans.append([c, o, hex(o), oct(o), cn, n, h, f'&#x{hex(o)[2:]};'])

df = pd.DataFrame(ans, columns=['glyph', 'code', 'hex', 'octal',
                                'category', 'description', 'html', 'html2'])
df
```
The function `html.unescape('&#x25c6;')` converts an HTML code back to a glyph. And the function `unicodedata.decomposition('ñ')` explodes compound characters into their parts, in this case an n and a tilde, with code points `0x6E` and `0x0303` respectively.
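For example, a minimal check (the NFD normal form applies the decomposition; NFC recombines it):

```python
import unicodedata

unicodedata.decomposition('ñ')
# '006E 0303'
decomposed = unicodedata.normalize('NFD', 'ñ')
len(decomposed), [hex(ord(c)) for c in decomposed]
# (2, ['0x6e', '0x303'])
unicodedata.normalize('NFC', decomposed) == 'ñ'
# True
```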
[^1]: Doña Ana is a county in New Mexico. It is the only county in the US whose name contains an accented character. Download county data from a website and, odds are, it'll cause you problems.

[^2]: OED: glyph, n., a mark or symbol. Can include letters, characters, emojis, mathematical symbols, dividers, and punctuation.

[^3]: One byte is eight bits and can store a number between 0 and \(2^8-1=255\).

[^4]: Latin is also called ISO-8859-1.

[^5]: Unicode Transformation Format, 8 bit. According to [Wikipedia](https://en.wikipedia.org/wiki/UTF-8), as of January 2022, UTF-8 accounts for 97.7% of all web pages and 986 of the top 1,000 highest ranked web pages.

[^6]: Since Python 3.0; Python 2 included Unicode and non-Unicode strings.

[^7]: If we open in text mode we need to specify the encoding, but we will see `s` recreated, not the raw bytes in the file.

[^8]: UTF-8 can encode all valid code points using one to four bytes. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.

[^9]: See a StackOverflow post for a script to extract the list.
posted 2022-02-15 | tags: Effective Python, Python, strings