To sort a list in Python, we would normally use the sorted function. Let’s try to run the simplest case possible:
~$ python3
>>> sorted(["a", "A"])
['A', 'a']
Ok, that looks fine. Uppercase letters come before lowercase ones, right?
Or do they? Let’s take a step back and check how sorting works in Linux.
You’ve probably heard about something called locale. A locale defines rules for different languages and regions, including, among other things, how sorting must be performed. The most basic locale is called C; it compares strings byte by byte, so the resulting order follows the ASCII table.
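The ASCII ordering can be made concrete with a minimal sketch that uses only Python built-ins. Plain string comparison in Python goes by code point, and for these two letters the code points are the same as their ASCII values:

```python
# Plain string comparison in Python goes by Unicode code point,
# which matches ASCII for these characters.
letters = ["a", "A"]
code_points = {ch: ord(ch) for ch in letters}
print(code_points)      # {'a': 97, 'A': 65}
print(sorted(letters))  # ['A', 'a'] because 65 < 97
```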
Let’s try to perform the same sorting exercise on my Debian Linux machine:
~$ sort <<< $'a\nA'
a
A
That’s not the same result that the Python sorted function returned. Why is that? I didn’t change my locale. Let’s see what locale my Linux machine is currently using:
~$ echo $LC_ALL
en_US.UTF-8
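LC_ALL is only one of several variables that can influence collation. As a rough sketch, assuming the usual POSIX precedence (LC_ALL first, then LC_COLLATE, then LANG), the helper below reports which setting would apply to sorting. The function name effective_collate_setting and the fallback to C are assumptions made for illustration, not part of any library:

```python
import os

def effective_collate_setting(env=None):
    """Return (variable, value) for the first non-empty collation setting.

    Hypothetical helper: follows the POSIX precedence
    LC_ALL > LC_COLLATE > LANG and falls back to "C" if none is set.
    """
    env = os.environ if env is None else env
    for name in ("LC_ALL", "LC_COLLATE", "LANG"):
        value = env.get(name)
        if value:  # unset or empty variables are skipped
            return name, value
    return None, "C"

# Example with an explicit mapping instead of the real environment
print(effective_collate_setting({"LANG": "en_US.UTF-8"}))
# ('LANG', 'en_US.UTF-8')
```

Calling effective_collate_setting() with no argument inspects the real environment of the current process.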
Could it be that Python doesn’t use the system locale by default? Of course, I should have checked the documentation beforehand.
There is nothing mentioned about the locale in the documentation for sorted. After looking into the linked “Sorting HOW TO” tutorial in the “Odds & Ends” section, I found that for locale-aware sorting, one should use locale.strcoll() as the comparison function. Ok, let’s try that out.
After reading some more documentation, I came up with the following:
~$ python3
>>> import locale
>>> from functools import cmp_to_key
>>> sorted(['a', 'A'], key=cmp_to_key(locale.strcoll))
['A', 'a']
Hmm, but it’s still not what I had expected. Let’s check if Python is even picking up my system locale:
>>> locale.getlocale()
('en_US', 'UTF-8')
Yep, that’s the one. Python knows what my system locale is, but doesn’t apply it automatically. Let me try to “force” it and try again:
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> sorted(['a', 'A'], key=cmp_to_key(locale.strcoll))
['a', 'A']
Ok, now it works as I expected. But why?
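One possible lead, offered as a sketch rather than a definitive answer: the locale module documentation says that a program starts in the C locale, with the exception that Python sets the LC_CTYPE category from the environment at startup. locale.getlocale() reports LC_CTYPE by default, so it can show en_US while LC_COLLATE, the category that strcoll relies on, is still C. The snippet below queries the collation category directly and then sorts under the C locale, which should be available on any system. It uses locale.strxfrm as the key function, an alternative to cmp_to_key(locale.strcoll):

```python
import locale

# Query the current collation locale without changing it
# (passing None as the second argument only reads the setting).
print(locale.setlocale(locale.LC_COLLATE, None))

# Explicitly select the C locale for collation.
locale.setlocale(locale.LC_COLLATE, "C")
c_sorted = sorted(["a", "A"], key=locale.strxfrm)
print(c_sorted)  # ['A', 'a'] under the C locale

# To adopt the user's locale from the environment instead, pass "":
# locale.setlocale(locale.LC_COLLATE, "")
```

If the first print shows C in a fresh interpreter while locale.getlocale() shows en_US, that would be consistent with the behavior seen above.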
If you know the answer, please write to me! We’re also hiring!
Author
This article was written by Mārtiņš Grunskis, Head of Engineering for the Order & Campaign Product Area.