Re: clean up LC_COLLATE and LC_CTYPE in sort(1)

Theo Buehler Tue, 14 May 2019 10:40:04 -0700

On Tue, May 14, 2019 at 05:55:21PM +0200, Ingo Schwarze wrote:
> Hi,
> 
> after my LC_NUMERIC cleanup for sort(1) went in (thanks to tb@ for
> the review), i'd like to address the rest of the locale dependency.
> 
> Large amounts of extremely ugly code in sort(1) - many hundreds of
> lines - deal with LC_COLLATE, which we don't support now and have
> no intention to support in the future.
> 
> The code is very repetitive and currently written to handle three cases:
> 
>  1. byte_sort == true && sort_mb_cur_max == 1
> 
>     That is the only mode currently supported on OpenBSD.
>     It means everything uses the POSIX locale and ASCII.
> 
>  2. byte_sort == false && sort_mb_cur_max == 1
> 
>     That will never be supported on OpenBSD.
>     It handles 8-bit single byte character encodings which are
>     incompatible with UTF-8, for example ISO-LATIN-1.
> 
>  3. byte_sort == false && sort_mb_cur_max > 1
> 
>     Even though i doubt we will ever do it, that could theoretically
>     happen on OpenBSD in the remote future, if we ever choose to
>     implement collation support for UTF-8 locales.
> 
> Handling case 3 would be a massive undertaking - not just a matter
> of improving Unicode support, but also forcing us to maintain many
> different UTF-8 locales for many different languages, which means
> extremely messy stuff invading the C library.  During the Belgrade
> EuroBSDCon a few years ago, i talked to Baptiste Daroussin who had
> just implemented LC_COLLATE in FreeBSD libc and who was utterly
> scared by the complexity.  Knowing ourselves, we would be scared
> even more once we got there.  So it will definitely not happen
> quickly.  Then again, ruling that out for good is maybe not a
> decision to make in this particular patch.
> 
> Consequently, the byte_sort variable can be deleted immediately,
> killing case 2 for good, but i'm keeping the sort_mb_cur_max variable
> as a global constant for now, even though more than half of the
> code it controls is currently dead code.
> 
> Since none of our single-byte character and string functions are
> locale dependent, we can also zap LC_CTYPE while here.
> 
> After committing this patch, i shall re-indent bwscoll() properly
> in a separate commit, but i'm not including that in the patch sent
> out here because it would make the patch unreadable.
> 
> OK?

ok
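
For readers skimming the archive, the three cases in the quoted message
can be summarized as a small decision function. This is only an
illustrative sketch, not code from sort(1) or from the patch: the enum,
its labels, and the classify_before()/classify_after() helpers are
hypothetical names, and only byte_sort and sort_mb_cur_max come from the
message itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the cases listed in the quoted message. */
enum sort_mode {
	MODE_BYTE,			/* case 1: POSIX locale, plain bytes */
	MODE_SINGLE_BYTE_LOCALE,	/* case 2: 8-bit single-byte encodings */
	MODE_MULTIBYTE_LOCALE,		/* case 3: collation in UTF-8 locales */
	MODE_OTHER			/* combination not listed in the message */
};

/* Before the cleanup: two variables select among the three cases. */
enum sort_mode
classify_before(bool byte_sort, size_t sort_mb_cur_max)
{
	if (sort_mb_cur_max == 1)
		return byte_sort ? MODE_BYTE : MODE_SINGLE_BYTE_LOCALE;
	if (sort_mb_cur_max > 1 && !byte_sort)
		return MODE_MULTIBYTE_LOCALE;
	return MODE_OTHER;
}

/*
 * After the cleanup: byte_sort is gone, so case 2 can no longer be
 * expressed.  Only sort_mb_cur_max remains to separate case 1 from
 * the (currently dead) case 3.
 */
enum sort_mode
classify_after(size_t sort_mb_cur_max)
{
	return sort_mb_cur_max > 1 ? MODE_MULTIBYTE_LOCALE : MODE_BYTE;
}
```

With sort_mb_cur_max kept as a global constant equal to 1, as the
message describes for OpenBSD today, classify_after() always yields the
byte-wise mode, which is why the multibyte branches are dead code for
now.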
