Re: [Tutor] sort() method and non-ASCII

eryk sun Sat, 04 Feb 2017 23:56:22 -0800

On Sun, Feb 5, 2017 at 3:52 AM, boB Stepp <robertvst...@gmail.com> wrote:
> Does the list sort() method (and other sort methods in Python) just go
> by the hex value assigned to each symbol to determine sort order in
> whichever Unicode encoding chart is being implemented?


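list.sort uses a less-than comparison, so what you really want to know
is how Python compares strings. They're compared by ordinal at
corresponding indexes, i.e. ord(s1[i]) vs ord(s2[i]) for i less than
min(len(s1), len(s2)). The first index at which the ordinals differ
decides the result. If no index differs, the shorter string is the
lesser one.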
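
As a rough sketch of that rule (CPython's real implementation is in C,
so this is only an illustration), the comparison can be written in pure
Python. The helper name str_less_than is made up for this example. It
also handles the case where one string is a prefix of the other, in
which the shorter string sorts first.

```python
def str_less_than(s1, s2):
    """Hypothetical helper: mimic s1 < s2 by comparing code points."""
    # Walk both strings in step; zip() stops at the shorter one.
    for c1, c2 in zip(s1, s2):
        if c1 != c2:
            # The first differing position decides the result.
            return ord(c1) < ord(c2)
    # No difference in the common prefix: the shorter string is less.
    return len(s1) < len(s2)


# Check the sketch against the built-in operator on a few pairs.
pairs = [("abc", "abd"), ("Z", "a"), ("ab", "abc"), ("C\u0327", "\xc7")]
for a, b in pairs:
    assert str_less_than(a, b) == (a < b)
```

Note that "Z" < "a" is True under this rule, since ord("Z") is 90 and
ord("a") is 97.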

This gets a bit interesting when you're comparing characters that have
composed and decomposed Unicode forms, i.e. a single code point vs a
base character followed by one or more combining code points. For
example:

    >>> s1 = '\xc7'
    >>> s2 = 'C' + '\u0327'
    >>> print(s1, s2)
    Ç Ç
    >>> s2 < s1
    True

where U+0327 is a combining cedilla. As characters, s1 and s2 are the
same. However, by code point s2 is less than s1 because 0x43 ("C") is
less than 0xc7 ("Ç"). To work around this, you can first normalize the
strings to either composed or decomposed form [1]. For example:

    >>> strings = ['\xc7', 'C\u0327', 'D']
    >>> sorted(strings)
    ['Ç', 'D', 'Ç']

    >>> import functools, unicodedata
    >>> norm_nfc = functools.partial(unicodedata.normalize, 'NFC')
    >>> sorted(strings, key=norm_nfc)
    ['D', 'Ç', 'Ç']
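
To see why the normalized key groups the two spellings together, it can
help to look at the code points directly. The following is a small
sketch that uses only unicodedata.normalize from the standard library.
The helper name code_points is an assumption made for illustration.

```python
import unicodedata


def code_points(s):
    """Hypothetical helper: list the code points of s as U+XXXX strings."""
    return ["U+%04X" % ord(ch) for ch in s]


composed = "\xc7"        # single code point
decomposed = "C\u0327"   # "C" followed by a combining cedilla

print(code_points(composed))    # ['U+00C7']
print(code_points(decomposed))  # ['U+0043', 'U+0327']

# The raw strings differ, but they agree after normalizing to either form.
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```

Normalizing only in the key function leaves the list items themselves
unchanged, so the sorted result can still hold a mix of composed and
decomposed strings.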

[1]: https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms
