How to sort unicode properly?

2019-09-25 Thread Peng Yu
Hi,

It seems that "café" should be sorted before "caff" in Unicode.

https://github.com/jtauber/pyuca

But `sort` does not do so.

$ printf '%s\n' cafe caff café | LC_ALL=UTF8 sort
cafe
caff
café
$ printf '%s\n' cafe caff café | LC_ALL=en_US.UTF-8 sort
cafe
caff
café

How to make `sort` sort acc

Re: How to sort unicode properly?

2019-09-25 Thread Eric Blake
On 9/25/19 10:20 AM, Peng Yu wrote:
> Hi,
>
> It seems that "café" should be sorted before "caff" in Unicode.
>
> https://github.com/jtauber/pyuca
>
> But `sort` does not do so.
>
> $ printf '%s\n' cafe caff café | LC_ALL=UTF8 sort
> cafe
> caff
> café
> $ printf '%s\n' cafe caff café | LC_ALL=en_US.UTF-8 sort
> cafe

Re: How to sort unicode properly?

2019-09-25 Thread Peng Yu
I want my `sort` to be machine-independent and always use the correct Unicode sort order. Is there a way to do so?

I don't know how to check where en_US.UTF-8 comes from. Do you know how to check it? (I use Mac OS X.)

On 9/25/19, Eric Blake wrote:
> On 9/25/19 10:20 AM, Peng Yu wrote:
>>

Re: How to sort unicode properly?

2019-09-25 Thread Eric Fischer
Unfortunately, multibyte collation is simply unimplemented in MacOS X, so there is no alternate locale definition that will fix it. As far as I can tell this is documented only in the BUGS section of `man wcscoll`:

BUGS
The current implementation of wcscoll() only works in single-byte LC

Re: How to sort unicode properly?

2019-09-25 Thread Eric Blake
On 9/25/19 10:56 AM, Peng Yu wrote:
> I want my `sort` to be machine-independent and always use the correct Unicode sort order. Is there a way to do so?

Those two goals are somewhat at odds. The only truly portable machine-independent sorting is the one guaranteed by POSIX when you use
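
One ordering that should behave the same on any machine is plain byte order in the C locale. Whether that is what the truncated sentence above goes on to name is an assumption; the message is cut off before it says. The sketch below reuses the three words from the original post, with the input order shuffled and the `sorted` variable introduced only for illustration. Note that byte order still puts "caff" before "café", so it is machine-independent but not the Unicode collation order the original poster asked for.

```shell
# Byte-order sort: in the C locale, sort compares raw bytes, so the result
# does not depend on the platform's locale definitions.
# "é" is encoded in UTF-8 as bytes 0xC3 0xA9, which are greater than "f" (0x66),
# so "café" lands after "caff".
sorted=$(printf '%s\n' café caff cafe | LC_ALL=C sort)
printf '%s\n' "$sorted"
```

This should print cafe, caff, café, one per line, on any system whose input is UTF-8.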

Re: How to sort unicode properly?

2019-09-25 Thread Peng Yu
If Python can have pyuca, which works across platforms, why can't such a thing exist at the C level?

On Wed, Sep 25, 2019 at 12:24 PM Eric Blake wrote:
> On 9/25/19 10:56 AM, Peng Yu wrote:
> > I want my `sort` to be machine-independent and always use the
> > correct Unicode sort order. Is there

Re: How to sort unicode properly?

2019-09-25 Thread Eric Blake
On 9/25/19 2:46 PM, Peng Yu wrote:
> If Python can have pyuca, which works across platforms, why can't such a thing exist at the C level?

Please don't top-post on technical lists.

It _can_ happen, but only if someone takes the time to contribute a patch (in this case, I already suggested that a gnulib

Re: How to sort unicode properly?

2019-09-25 Thread Lion Yang
libicu works that way. There is ucol_strcoll.

http://userguide.icu-project.org/collation/api
https://github.com/unicode-org/icu

But think twice if you want to add libicu as a mandatory dependency of coreutils. It does work at the C level and is widely used, but it's also quite heavy.

2019-09-26