Python best practices: What every Pythonista should know about Unicode

Szymon Pyzalski - Backend Engineer

31 January 2019, 9 min read

What's inside

  1. Let’s start with a bit of Unicode history
  2. The basic idea behind Unicode and UTFs
  3. Grokking the encode and decode in Python
  4. Working with files
  5. Sorting Unicode data
  6. Unicode normalization
  7. Encoding in Python

In my experience as a Python developer, I've found that understanding the difference between Unicode and UTFs is essential to avoid confusion about how Python handles Unicode data.

In this article about Python encoding, I take a closer look at Unicode itself to show you:

  • what it is,
  • where it came from,
  • how it’s different from UTFs,
  • and how to make it work in Python.

Let’s start with a bit of Unicode history

In the beginning, there was the telegraph. We invented text encoding to send text through telegraph lines. The first such code - Morse code - encoded characters into a ternary sequence of dashes, dots, and pauses. This text encoding system was meant to be used by humans trained to send and receive such messages. Later on, we got automated teleprinters that required a system better suited to machines than to humans. That's how genuinely binary systems were born. Among them was the ASCII code, which became the sole standard outside the world of mainframe computers.

ASCII is a 7-bit system designed to encompass the English language with numbers, punctuation, and several control codes. The 7-bit system was chosen as the narrowest option that could fit all the required codepoints.

But on computers with 8-bit bytes, one bit was always left unused. It could come in handy as a parity bit or as a flag for some text alteration. Soon, however, 8-bit encodings called "code pages" appeared on the scene. Each was dedicated to a specific language or a handful of languages at a time. Most of them were backward compatible with ASCII, extending the English character set with characters carrying diacritics or squeezing an entirely new alphabet into the upper 128 code points.
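To see why this was a problem, consider what happens when the same bytes are read with two different code pages. In the sketch below, the Polish word 'łódź' is stored using the Latin-2 code page and then misread as Latin-1 (both codecs ship with Python):

>>> 'łódź'.encode('iso-8859-2')        # store Polish text with the Latin-2 code page
b'\xb3\xf3d\xbc'
>>> b'\xb3\xf3d\xbc'.decode('iso-8859-1')   # read the same bytes with the Latin-1 code page
'³ód¼'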

Needless to say, it was a painful time for internationalization.

The change didn't hit anglophones that much, but those who use diacritics in their languages still remember the strange characters showing on their screen when their application chose the wrong code page. These pains prompted a few smart people to create a character encoding system to end all character encodings.

Today, we know it as Unicode.

The basic idea behind Unicode and UTFs

Unicode is an ambitious project that aims to encompass every existing and historical script known to humans.

Here’s how it works:

Every character gets a numerical code (a codepoint) assigned to it. These codes don't have any limit on bit width, which is why Unicode itself isn't a binary encoding. That's the job of the Unicode Transformation Formats (UTFs), which transform Unicode data into a sequence of bytes.
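You can inspect codepoints yourself with Python's built-in ord and chr functions:

>>> ord('ż')        # the codepoint of a character, as an integer
380
>>> hex(ord('ż'))   # codepoints are usually written in hex, here U+017C
'0x17c'
>>> chr(0x17c)      # and back from a codepoint to a character
'ż'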

The most common UTF is UTF-8. It's the dominant encoding on the World Wide Web and considered "mandatory for all things" by the WHATWG.

The format is backwards compatible with ASCII: it offers a one-character-to-one-byte correspondence for English text. For other languages, it may require more bytes per character. Let's compare two UTFs on a few sample strings:

def utf_efficiency(txt):
    """Compare the bytes-per-character ratio of UTF-8 and UTF-16 for a given string"""
    utf8 = len(txt.encode('utf-8')) / len(txt)
    utf16 = len(txt.encode('utf-16')) / len(txt)
    return f'UTF-8: {utf8}, UTF-16: {utf16}'

>>> utf_efficiency("I can eat glass and it doesn't hurt me")
'UTF-8: 1.0, UTF-16: 2.0526315789473686'
>>> utf_efficiency('Mogę jeść szkło i mi nie szkodzi')

'UTF-8: 1.125, UTF-16: 2.0625'
>>> utf_efficiency("Я могу есть стекло, это мне не вредит")
'UTF-8: 1.7837837837837838, UTF-16: 2.054054054054054'
>>> utf_efficiency('შემიძლია მინა ვჭამო და არაფერი მეტკინება')

'UTF-8: 2.75, UTF-16: 2.05'

>>> utf_efficiency("私はガラスを食べられます。それは私を傷つけません。")
'UTF-8: 3.0, UTF-16: 2.08'


As you can see, UTF-8 is more efficient for Latin-based scripts and Cyrillic (it offers similar efficiency to other UTFs but is better for punctuation and spaces).

For other scripts, you may find that other UTFs store the text more efficiently. Is it worth taking this into account when creating applications that handle such scripts? Probably not: text is likely to be only a fraction of the data you'll be dealing with, so the gain would be negligible.

Note that UTFs that aren’t ASCII-compatible can create some security loopholes if they allow injecting special characters that won't get properly escaped. So unless you’re tasked with building software that would store a corpus of Thai literature, you should probably stick to UTF-8.

Grokking the encode and decode in Python

Python developers need to understand the difference between Unicode and UTFs. In Python 3, the str type represents Unicode data as an abstract string of codepoints. Naturally, it has an in-memory binary representation, but that should be transparent from the developer's point of view.

By calling the encode method, we can convert this data into a binary representation of type bytes.

Here's how you can do that:

>>> 'Hello'.encode('utf-8')
b'Hello'
>>> 'Cześć'.encode('utf-8')
b'Cze\xc5\x9b\xc4\x87'
>>> 'Привет'.encode('utf-8')
b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'

You can also use an encoding that isn’t a UTF. But in that case, the conversion might fail.

Here’s an example:

>>> 'Привет'.encode('iso-8859-2')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-5: character maps to <undefined>
>>> 'Привет'.encode('koi8-r')
b'\xf0\xd2\xc9\xd7\xc5\xd4'
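If a target encoding can't represent some characters, the encode method also accepts an errors argument that decides what to do with them, for example replacing them or turning them into XML character references:

>>> 'Привет'.encode('iso-8859-2', errors='replace')
b'??????'
>>> 'Привет'.encode('iso-8859-2', errors='xmlcharrefreplace')
b'&#1055;&#1088;&#1080;&#1074;&#1077;&#1090;'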

The decode method works the other way around: it's called on bytes and returns a str:

>>> b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'.decode('utf-8')
'Привет'

Working with files

It's best to decode all textual data into Unicode strings as soon as possible, and render it back into a binary format as late as possible.

Python can handle file objects in a way that lets developers avoid handling binary data directly. When we open a file in text mode (without the 'b' flag), Python automatically decodes and encodes the data for us, so we only need to work with str objects.

We can specify the encoding explicitly; otherwise it defaults to our OS locale settings (most modern Linux distros use UTF-8).
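If you're not sure what that default is on a given machine, you can check it, for example with locale.getpreferredencoding:

>>> import locale
>>> locale.getpreferredencoding(False)   # the result depends on your system settings
'UTF-8'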

Here’s an example:

with open('Конек-горбунок-win.html', encoding='WINDOWS-1251') as fin:
    with open('Конек-горбунок-utf.html', 'w') as fout:
        while True:
            data = fin.read(4096)
            if not data:
                break
            fout.write(data)

This code reads a file stored in the WINDOWS-1251 encoding and writes its contents to another file using the system default encoding.

The pattern of presenting you with already-decoded data also shows up in web frameworks. For example, Django's request object exposes the raw, undecoded data as its body attribute.

However, it’s best not to access it directly, unless you have a really good reason to do so. Instead, read the POST or GET attributes that contain decoded Unicode data.
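As a minimal sketch of the difference (the greet view and the name form field here are made up for illustration), a Django view would typically look like this:

from django.http import HttpResponse

def greet(request):
    # request.body is raw, undecoded bytes - avoid it unless you really need it
    # request.POST contains already-decoded str values
    name = request.POST.get('name', '')
    return HttpResponse(f'Hello, {name}!')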

Sorting Unicode data

By default, Unicode strings are sorted by their codepoints. That works fine for English, but try it with another language and you might get something unexpected:

>>> sorted(['lis', 'łabędź', 'marabut'])
['lis', 'marabut', 'łabędź']

The order we got here doesn't comply with the rules of the Polish language. To sort the data correctly, we need locale-aware collation:

>>> import locale
>>> locale.setlocale(locale.LC_COLLATE, ('pl_PL', 'utf-8'))
'pl_PL.UTF-8'
>>> sorted(['lis', 'łabędź', 'marabut'], key=locale.strxfrm)
['lis', 'łabędź', 'marabut']

Note that there’s no universally correct method for sorting words. That's why the way we do that is locale-dependent (it’s called collation):

>>> locale.setlocale(locale.LC_COLLATE, ('pl_PL', 'utf-8'))
'pl_PL.UTF-8'
>>> sorted(['bob', 'bób', 'boc', 'bóc'], key=locale.strxfrm)
['bob', 'boc', 'bób', 'bóc']
>>> locale.setlocale(locale.LC_COLLATE, ('cz_CZ', 'utf-8'))
'cs_CZ.UTF-8'
>>> sorted(['bob', 'bób', 'boc', 'bóc'], key=locale.strxfrm)
['bob', 'bób', 'boc', 'bóc']

Unicode normalization

The two strings below should look identical on your screen. But they don't compare as identical.

>>> 'Mohu jíst sklo, neublíží mi.' == 'Mohu jíst sklo, neublíží mi.'
False

The problem here is that we can represent a combined character (a base character plus one or more diacritics) in two different ways.

We can encode a combined character as either a single character (composed) or a sequence of characters (decomposed).

import unicodedata

def list_chars(s):
    """List the Unicode character names in a string"""
    for c in s:
        print(unicodedata.name(c))
>>> list_chars(unicodedata.normalize('NFC', 'łódź'))
LATIN SMALL LETTER L WITH STROKE
LATIN SMALL LETTER O WITH ACUTE
LATIN SMALL LETTER D
LATIN SMALL LETTER Z WITH ACUTE
>>> list_chars(unicodedata.normalize('NFD', 'łódź'))
LATIN SMALL LETTER L WITH STROKE
LATIN SMALL LETTER O
COMBINING ACUTE ACCENT
LATIN SMALL LETTER D
LATIN SMALL LETTER Z
COMBINING ACUTE ACCENT

The W3C recommends NFC for all web purposes. Still, some users in some languages may enter NFD data. Note that this problem isn’t addressed in many web frameworks, so you might need to normalize your strings if you find any oddities.
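Normalizing both sides before comparing makes such strings equal again. Here's a minimal example with a composed and a decomposed 'é':

>>> import unicodedata
>>> composed = '\u00e9'      # 'é' as a single codepoint
>>> decomposed = 'e\u0301'   # 'e' followed by a COMBINING ACUTE ACCENT
>>> composed == decomposed
False
>>> unicodedata.normalize('NFC', decomposed) == composed
True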

If you have a good eye for detail, you might notice that with some fonts and on some systems, the NFD and NFC versions actually don't look identical. That's because with NFC we're using complete glyphs created by a human designer, while with NFD the job of graphically combining the base character with its diacritics falls to the rendering software. That's why the result might not be as aesthetically pleasing as the NFC one.

Apart from the NFC and NFD normalizations, there's another pair, NFKC and NFKD, that additionally unifies compatibility characters. These normalizations are lossy:

>>> s = 'ℵ𝛼ﬀ'
>>> list_chars(s)
ALEF SYMBOL
MATHEMATICAL ITALIC SMALL ALPHA
LATIN SMALL LIGATURE FF
>>> list_chars(unicodedata.normalize('NFKD', s))
HEBREW LETTER ALEF
GREEK SMALL LETTER ALPHA
LATIN SMALL LETTER F
LATIN SMALL LETTER F

You might sometimes see Unicode normalization abused to force text into ASCII.

Here's an example of code that does that:

def strip_accents(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

That's just a bad idea. First, not all characters with diacritics have a canonical decomposition ('ł', for example, doesn't decompose at all). Second, the solution does nothing for non-Latin-based alphabets.

>>> strip_accents('łódź')
'łodz'
>>> strip_accents('გამარჯობა')
'გამარჯობა'

But don't worry: the Unidecode package will provide you with a nice ASCII approximation of any string. That's the smarter way to handle it.

Here’s an example:

>>> from unidecode import unidecode
>>> unidecode('łódź')
'lodz'
>>> unidecode('გამარჯობა')
'gamarjoba'

Encoding in Python

Encoding issues are one of the many problems developers encounter when trying to internationalize their applications. That's why it's key to understand them.

It’s easy to test your code with lorem ipsum and other ASCII-compliant data. But the languages we use all over the world are way more complex than that.

Python 3 offers wonderful, intuitive Unicode support, but only as long as you use it correctly. So always pay attention to the data you're handling and test your applications with non-English and non-Latin inputs. That way your code will be ready to go places.

Szymon Pyzalski - Backend Engineer

Szymon is a backend developer. He’s an enthusiastic Pythonista with the main area of expertise in Django. After hours, you can find him deep in thought in front of a Go board.
