Sorting Unicode Strings Across Languages and Writing Systems in Python

Sometimes, when you put together a particular list of character strings, a particular use case, a particular audience, and the default behaviors of your software, you don't get what you need. Consider an arbitrarily ordered list of Unicode strings:

>>> titles = [
... u'Alétheia - Revista de estudos sobre Antigüidade e Medievo',
... u'Archaeology Times',
... u'ákoue',
... u'Journal of Ancient Fish',
... u'Zeitschrift für Numismatik',
... u'Antípoda',
... u'Antipodes',
... u'Alecto',
... u'Ägyptische Residenzen und Tempel',
... u'Ακάμας, Όμιλος Ανάδειξης Μνημείων Σαλαμίνος, Ενημερωτικό Δελτίο',
... u'Античный мир и археология',
... u'ACME',
... u'Ávila',
... u'Άβιλα',
... u'Araştırma Sonuçları Toplantıları',
... u'Archäologische Informationen',
... u'Académie des Inscriptions et Belles-Lettres: Lettre d’information',
... u'Àvila',
... u'‘Atiqot',
... u'Aleppo'
... ]

Sort the list:

>>> for title in sorted(titles): print(title)
...
ACME
Académie des Inscriptions et Belles-Lettres: Lettre d’information
Alecto
Aleppo
Alétheia - Revista de estudos sobre Antigüidade e Medievo
Antipodes
Antípoda
Araştırma Sonuçları Toplantıları
Archaeology Times
Archäologische Informationen
Journal of Ancient Fish
Zeitschrift für Numismatik
Àvila
Ávila
Ägyptische Residenzen und Tempel
ákoue
Άβιλα
Ακάμας, Όμιλος Ανάδειξης Μνημείων Σαλαμίνος, Ενημερωτικό Δελτίο
Античный мир и археология
‘Atiqot

You and your users may not be satisfied with this result. Perhaps you'd prefer to see a list sorted across languages and scripts in a way that considers Roman characters (i.e., A-Z, a-z) as equivalent for purposes of sorting regardless of whether they bear diacritics or not (e.g., A == Á == À). Perhaps you'd like to go even further and consider characters equivalent across writing systems on the basis of a Romanization scheme (e.g., Greek α == Russian а == Latin/English a).

One way is to write a function that gives us an alternative sort key for each string; that is, a derivative string that, when sorted against other such strings, gives the desired result.

If we're comfortable with some amount of naiveté in the results, we can get this done pretty quickly for some languages and scripts by taking advantage of existing packages in the Python open-source ecosystem.
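
In Python terms, "an alternative sort key" just means passing a key function to sorted(). A toy demonstration of the mechanism, with a trivial lowercasing key standing in for the more capable one we'll assemble below:

>>> for s in sorted([u'b', u'A', u'a'], key=lambda s: s.lower()): print(s)
...
A
a
b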

How to ignore diacritics

The venerable ASCII character encoding scheme (see also the "Basic Latin" Unicode code block) provides for only the baseline Roman characters, plus Arabic numerals, English-standard punctuation, and some ancillary things that don't concern us here.

Tomaz Solc's unidecode package (a port of Sean M. Burke's Text::Unidecode Perl module) provides a quick and easy way to:
[take] Unicode data and [try] to represent it in ASCII characters ... where the compromises taken when mapping between two character sets are chosen to be near what a human with a US keyboard would choose.
How does that work out for our example list of strings?

>>> from unidecode import unidecode
>>> print(titles[0])
Alétheia - Revista de estudos sobre Antigüidade e Medievo
>>> print(unidecode(titles[0]))
Aletheia - Revista de estudos sobre Antiguidade e Medievo
>>> for title in titles: print(unidecode(title))
...
Aletheia - Revista de estudos sobre Antiguidade e Medievo
Archaeology Times
akoue
Journal of Ancient Fish
Zeitschrift fur Numismatik
Antipoda
Antipodes
Alecto
Agyptische Residenzen und Tempel
Akamas, Omilos Anadeixes Mnemeion Salaminos, Enemerotiko Deltio
Antichnyi mir i arkheologiia
ACME
Avila
Abila
Arastirma Sonuclari Toplantilari
Archaologische Informationen
Academie des Inscriptions et Belles-Lettres: Lettre d'information
Avila
'Atiqot
Aleppo

You'll have noted that unidecode.unidecode() does more than just ignore diacritics. It attempts to transliterate non-Roman characters (i.e., "Romanize" them) as well. Indeed, note what the docs say:
The quality of resulting ASCII representation varies. For languages of western origin it should be between perfect and good. On the other hand transliteration (i.e., conveying, in Roman letters, the pronunciation expressed by the text in some other writing system) of languages like Chinese, Japanese or Korean is a very complex issue and this library does not even attempt to address it. It draws the line at context-free character-by-character mapping. So a good rule of thumb is that the further the script you are transliterating is from Latin alphabet, the worse the transliteration will be.
Note that this module generally produces better results than simply stripping accents from characters (which can be done in Python with built-in functions). It is based on hand-tuned character mappings that for example also contain ASCII approximations for symbols and non-Latin alphabets.
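
The accent-stripping alternative the docs mention can be done with the standard library's unicodedata module: decompose each character (NFKD), then drop the combining marks. A minimal sketch, just to see the difference:

>>> import unicodedata
>>> def strip_accents(s):
...     return u''.join(c for c in unicodedata.normalize('NFKD', s)
...                     if not unicodedata.combining(c))
...
>>> print(strip_accents(u'Ávila'))
Avila
>>> print(strip_accents(u'Άβιλα'))
Αβιλα

Accent-stripping leaves the Greek as Greek (minus its tonos), whereas unidecode carried it clear across writing systems to 'Abila' above.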

Alternative Romanization techniques

I have no doubt that there are a wide variety of good techniques and packages for Romanizing character strings available in Python. I have not done a comprehensive search for these, and would welcome relevant, collegial comments with links on this post.

I did notice Artur Barseghyan's transliterate package. It is a:
Bi-directional transliterator for Python [that] transliterates (unicode) strings
according to the rules specified in the language packs (source script <->
target script).
At the time of this writing, the package provided Romanization for strings identifiable as written in the standard scripts (as cataloged in the IANA language subtag registry) for the following languages:

>>> from transliterate import get_available_language_codes as get_langs
>>> get_langs()
['el', 'ka', 'hy', 'ru', 'bg', 'uk']
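
The Romanization itself is a one-liner once you know the language code. Per the package's documentation, translit() transliterates a string into the script of the named language, and passing reversed=True maps back toward Latin:

>>> from transliterate import translit
>>> romanized = translit(u'Античный мир и археология', 'ru', reversed=True)

I won't transcribe the result here; the exact Latin rendering depends on the rules in the package's Russian language pack, and it won't necessarily match unidecode's output above.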

Side note: getting language names for IANA codes

There's open source for that too: Matthew Caruana Galizia's IANA Language Tags project, about which:
IANA's official repository is in record-jar format and is hard to parse. This project provides neatly organized JSON files representing that data.
It also provides a JavaScript API. But I don't need that for this purpose. I can just grab the JSON version of the IANA repository data from Python, with an assist from the requests package. I can then use it to make human-readable names for the languages the transliterate package supports:

>>> import requests
>>> r = requests.get('https://raw.githubusercontent.com/mattcg/language-subtag-registry/master/data/json/registry.json')
>>> r.status_code
200
>>> lang_registry = r.json()
>>> languages = {}
>>> for lang in lang_registry:
...     if lang['Type'] == 'language':
...         languages[lang['Subtag']] = lang['Description'][0]
...
>>> len(languages)
8094
>>> romanizable_languages = [languages[code] for code in get_langs()]
>>> for l in romanizable_languages: print(l)
...
Modern Greek (1453-)
Georgian
Armenian
Russian
Bulgarian
Ukrainian

Of course, there's a lot more one can do with that IANA JSON file... but let's get back to Romanization, by way of ...

Language and script detection

The transliterate package demands that you be able to identify the language (and implicitly the writing system) of the string you want to Romanize. The package provides a "very basic" language detection method to help us out:

>>> from transliterate import detect_language
>>> for title in titles: print(u'{0}: "{1}"'.format(detect_language(title), title))
...
ru: "Alétheia - Revista de estudos sobre Antigüidade e Medievo"
ru: "Archaeology Times"
None: "ákoue"
ru: "Journal of Ancient Fish"
ru: "Zeitschrift für Numismatik"
ru: "Antípoda"
ru: "Antipodes"
ru: "Alecto"
ru: "Ägyptische Residenzen und Tempel"
el: "Ακάμας, Όμιλος Ανάδειξης Μνημείων Σαλαμίνος, Ενημερωτικό Δελτίο"
ru: "Античный мир и археология"
ru: "ACME"
None: "Ávila"
el: "Άβιλα"
ru: "Araştırma Sonuçları Toplantıları"
ru: "Archäologische Informationen"
ru: "Académie des Inscriptions et Belles-Lettres: Lettre d’information"
None: "Àvila"
ru: "‘Atiqot"
ru: "Aleppo"

The apparent default value of 'ru' (Russian) for any pure-ASCII string is problematic.

There is another package designed specifically for language detection: Marco Lui's langid package:
langid.py is a standalone Language Identification (LangID) tool.
The design principles are as follows:
  1. Fast
  2. Pre-trained over a large number of languages (currently 97)
  3. Not sensitive to domain-specific features (e.g. HTML/XML markup)
  4. Single .py file with minimal dependencies
  5. Deployable as a web service
Let's give it a whirl:

>>> import langid
>>> print(titles[0])
Alétheia - Revista de estudos sobre Antigüidade e Medievo
>>> langid.classify(titles[0])
('pt', 0.9997639656878511)
>>> for title in titles: print(u'{0}: "{1}"'.format(repr(langid.classify(title)), title))
...
('pt', 0.9997639656878511): "Alétheia - Revista de estudos sobre Antigüidade e Medievo"
('hu', 0.4951981167657506): "Archaeology Times"
('cs', 0.6559598835005537): "ákoue"
('en', 0.9999542989722191): "Journal of Ancient Fish"
('de', 1.0): "Zeitschrift für Numismatik"
('cs', 0.9327142178388013): "Antípoda"
('pt', 0.35121448605116784): "Antipodes"
('en', 0.16946150595865334): "Alecto"
('de', 0.9999999632942209): "Ägyptische Residenzen und Tempel"
('el', 1.0): "Ακάμας, Όμιλος Ανάδειξης Μνημείων Σαλαμίνος, Ενημερωτικό Δελτίο"
('ru', 0.9999999999665641): "Античный мир и археология"
('en', 0.16946150595865334): "ACME"
('lv', 0.3049662840719183): "Ávila"
('el', 1.0): "Άβιλα"
('tr', 0.9999345038597317): "Araştırma Sonuçları Toplantıları"
('de', 0.9999982379021314): "Archäologische Informationen"
('fr', 0.9999999999999973): "Académie des Inscriptions et Belles-Lettres: Lettre d’information"
('en', 0.16946150595865334): "Àvila"
('fr', 0.9511801373660571): "‘Atiqot"
('en', 0.31773663282480374): "Aleppo"
>>> for title in titles: print(u'{0}: "{1}"'.format([None, languages[langid.classify(title)[0]]][langid.classify(title)[1] > 0.9], title))
...
Portuguese: "Alétheia - Revista de estudos sobre Antigüidade e Medievo"
None: "Archaeology Times"
None: "ákoue"
English: "Journal of Ancient Fish"
German: "Zeitschrift für Numismatik"
Czech: "Antípoda"
None: "Antipodes"
None: "Alecto"
German: "Ägyptische Residenzen und Tempel"
Modern Greek (1453-): "Ακάμας, Όμιλος Ανάδειξης Μνημείων Σαλαμίνος, Ενημερωτικό Δελτίο"
Russian: "Античный мир и археология"
None: "ACME"
None: "Ávila"
Modern Greek (1453-): "Άβιλα"
Turkish: "Araştırma Sonuçları Toplantıları"
German: "Archäologische Informationen"
French: "Académie des Inscriptions et Belles-Lettres: Lettre d’information"
None: "Àvila"
French: "‘Atiqot"
None: "Aleppo"

That's better, especially if we pay attention to the probability measures attached to each result.
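
That final one-liner is admittedly cryptic, and it calls langid.classify() twice per title. Here's the same thresholding logic unrolled into a small helper (the 0.9 cutoff is an arbitrary choice, worth tuning against your own data):

>>> def detect_if_confident(s, threshold=0.9):
...     code, probability = langid.classify(s)
...     if probability > threshold:
...         return languages.get(code)
...     return None
...
>>> print(detect_if_confident(u'Zeitschrift für Numismatik'))
German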

The missing pieces

So, I think we can now imagine some process that follows this outline:

  • for each string:
    • if not every character in the string is ASCII
      • try to use langid.classify() to determine language
        • if language is successfully determined:
          • if transliterate.translit() supports the language, get the transliteration
      • remove remaining non-ASCII characters by brute force
      • if the result is a zero-length string, fall back to the original string (what else can you do?)
    • else: just use the original string
    • strip all the punctuation
    • convert everything to lowercase
    • normalize or remove spaces (depending on how you want to deal with word breaks in sorting)

ASCII detection and stripping

It's pretty easy to strip non-ASCII characters from a Unicode string in Python:

>>> t = u"Antípoda"
>>> t.encode('ascii', 'ignore')
'Antpoda'

We can exploit this to do quick and dirty "all ASCII" detection:

>>> t = u"Antípoda"
>>> t == unicode(t.encode('ascii', 'ignore'))
False
>>> u = u'Chicken'
>>> u == unicode(u.encode('ascii', 'ignore'))
True

This approach depends upon the assumption that the strings you're starting with really are Unicode strings. If you were a huge regular-expressions fan, you could use the Python re module to perform a similar test, but I'm betting it would run slower.
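
For the record, the re version might look something like this (untimed):

>>> import re
>>> all_ascii = re.compile(r'^[\x00-\x7f]*$')
>>> bool(all_ascii.match(u'Chicken'))
True
>>> bool(all_ascii.match(u'Antípoda'))
False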

If you weren't so confident in the consistency of your source list, you'd need to do some preprocessing. Unicode normalization might be necessary. You might have to resort to the chardet module in order to guess at encodings.
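
For example (with a hypothetical file of unknown encoding standing in for your real source), chardet.detect() guesses an encoding and reports its confidence, and unicodedata.normalize() folds combining sequences into their precomposed equivalents:

>>> import chardet
>>> import unicodedata
>>> raw = open('titles.txt', 'rb').read()  # hypothetical source file
>>> guess = chardet.detect(raw)  # a dict with 'encoding' and 'confidence' keys
>>> text = raw.decode(guess['encoding'])
>>> text = unicodedata.normalize('NFC', text)  # fold combining marks into precomposed characters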

Stripping punctuation

Now here's a job for regular expressions. We can make short work of this task, especially if we take advantage of the newer, alternative regex package for Python, which is intended eventually to replace the standard library's re module.

>>> import regex as re
>>> rx = re.compile(ur'[\p{P}_\d]+')
>>> t = u'Ακάμας, Όμιλος Ανάδειξης Μνημείων Σαλαμίνος, Ενημερωτικό Δελτίο'
>>> print(rx.sub(u'', t))
Ακάμας Όμιλος Ανάδειξης Μνημείων Σαλαμίνος Ενημερωτικό Δελτίο

Putting it all together

Consider the following script: