Re: r4943 - in glibc-package/trunk/debian: . patches/localedata
On Tue, Sep 13, 2011 at 10:03:01PM +0200, Aurelien Jarno wrote:
> On Tue, Sep 13, 2011 at 05:07:46PM +0100, Colin Watson wrote:
> > On Tue, Sep 13, 2011 at 05:33:19PM +0200, Aurelien Jarno wrote:
> > > Yes similar problems have already been reported. This change has been
> > > done as a C locale should not have a collation order.
> >
> > Why not? Codepoint order collation is perfectly reasonable for a C
> > locale. Lots of people use LC_COLLATE=C when all they want is for
> > things like [a-z] to work reasonably.
> >
>
> Because it is supposed to replace the C locale, so to follow POSIX
> rules like the C locale. I am personally not convinced that we should go
> that way, but people who have pushed for this locale (some of them
> Cc:ed) have made clear in bugs #522776 and #609306 that it should handle
> collation like a C locale.
>
> Maybe they could follow-up this mail with their arguments.

OK, here goes ;-)

The "C.UTF-8" locale /is/ the "C" locale, extended to support UTF-8.
That is, it must support the *standard* behaviour mandated in the C,
POSIX and SUS standards, or else conforming applications will break.

This is the reference for the forthcoming SUSv4 locale definition:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html

This standard defines in detail exactly how various aspects of the
C/POSIX locale must behave. Conforming applications can expect this
behaviour to be guaranteed by a conforming C library. Some aspects are
strictly defined, while others offer the possibility for extension.
Examples:

LC_CTYPE

upper
    Define characters to be classified as uppercase letters.
    In the POSIX locale, the 26 uppercase letters shall be included:
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

lower
    Define characters to be classified as lowercase letters.
    In the POSIX locale, the 26 lowercase letters shall be included:
    a b c d e f g h i j k l m n o p q r s t u v w x y z

digit
    Define the characters to be classified as numeric digits.
    In the POSIX locale, only:
    0 1 2 3 4 5 6 7 8 9
    shall be included.

space
    Define characters to be classified as white-space characters.
    In the POSIX locale, exactly <space>, <form-feed>, <newline>,
    <carriage-return>, <tab>, and <vertical-tab> shall be included.

cntrl
    Define characters to be classified as control characters.
    In the POSIX locale, no characters in classes alpha or print shall
    be included.

xdigit
    Define the characters to be classified as hexadecimal digits.
    In the POSIX locale, only:
    0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f
    shall be included.

blank
    Define characters to be classified as <blank> characters.
    In the POSIX locale, only the <space> and <tab> shall be included.

toupper
    Define the mapping of lowercase letters to uppercase letters.
    In the POSIX locale, at a minimum, the 26 lowercase characters:
    a b c d e f g h i j k l m n o p q r s t u v w x y z
    shall be mapped to the corresponding 26 uppercase characters:
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

tolower
    Define the mapping of uppercase letters to lowercase letters.
    In the POSIX locale, at a minimum, the 26 uppercase characters:
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    shall be mapped to the corresponding 26 lowercase characters:
    a b c d e f g h i j k l m n o p q r s t u v w x y z

Summary:

• space, cntrl, xdigit, blank are specified exactly. The C locale
  must only use the specified characters. It can't be extended to
  support other characters, since the standard explicitly states that
  this is not allowed.
• upper, lower, toupper and tolower specify minimum requirements.
  It's permitted to extend these to support other characters.

LC_COLLATE

The standard specifies a linear incremental sort order from U+0000 to
U+007F. That's strictly required by the standard. There's a lot of
software out there which explicitly switches to the C locale (or just
setlocale(LC_COLLATE, "C")) to get a locale-independent, guaranteed,
known sort order. If this were to be changed, a lot of software would
break.

My take on this is that a UTF-8 C locale should extend the ordering so
that it just sorts any UCS codepoint by value (i.e. from U+0000
upwards). This extends the existing order cleanly, and I think matches
expectations of what the C locale provides.

Regarding handling of non-UTF-8 input, I've not tested how it's handled
for regular locales. AFAICT it sorts on UCS codepoints, so it would
probably have already discarded them during conversion?

While in an ideal world it would be great if the "C" locale could
provide the same level of UTF-8/UCS support as other "real" UTF-8
locales, the main issue is ensuring that we comply with the letter of
the standards here--unlike every other locale, this one is explicitly
defined to provide certain things.

The other consideration is that the "C" locale is by definition a
"minimal" locale that provides a bare minimum of functionality; if you
want to use it to do advanced text processing, I think that's probably
outside its scope. If we do want a universally available locale that
does provide this level of service, then
Re: r4943 - in glibc-package/trunk/debian: . patches/localedata
Aurelien Jarno dixit:

>Because it is supposed to replace the C locale, so to follow POSIX
>rules like the C locale. I am personally not convinced that we should go

It’s supposed to offer a POSIX/C locale but with UTF-8 as character
set instead of 7-bit US ASCII, like the “proper” POSIX/C locale, the
latter even with questionable properties for octets with high-bit7 –
to achieve better overall usability of UTF-8 as standard encoding,
for example.

bye,
//mirabilos
--
"Using Lynx is like wearing a really good pair of shades: cuts out
 the glare and harmful UV (ultra-vanity), and you feel so-o-o COOL."
                                         -- Henry Nelson, March 1999

--
To UNSUBSCRIBE, email to debian-glibc-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/pine.bsm.4.64l.1109132015160.27...@herc.mirbsd.org
Re: r4943 - in glibc-package/trunk/debian: . patches/localedata
On Tue, Sep 13, 2011 at 05:07:46PM +0100, Colin Watson wrote:
> On Tue, Sep 13, 2011 at 05:33:19PM +0200, Aurelien Jarno wrote:
> > Yes similar problems have already been reported. This change has been
> > done as a C locale should not have a collation order.
>
> Why not? Codepoint order collation is perfectly reasonable for a C
> locale. Lots of people use LC_COLLATE=C when all they want is for
> things like [a-z] to work reasonably.
>

Because it is supposed to replace the C locale, so to follow POSIX
rules like the C locale. I am personally not convinced that we should go
that way, but people who have pushed for this locale (some of them
Cc:ed) have made clear in bugs #522776 and #609306 that it should handle
collation like a C locale.

Maybe they could follow-up this mail with their arguments.

--
Aurelien Jarno                          GPG: 1024D/F1BCDB73
aurel...@aurel32.net                 http://www.aurel32.net

Archive: http://lists.debian.org/20110913200300.ga31...@hall.aurel32.net
Re: r4943 - in glibc-package/trunk/debian: . patches/localedata
On Tue, Sep 13, 2011 at 05:33:19PM +0200, Aurelien Jarno wrote:
> Yes similar problems have already been reported. This change has been
> done as a C locale should not have a collation order.

Why not? Codepoint order collation is perfectly reasonable for a C
locale. Lots of people use LC_COLLATE=C when all they want is for
things like [a-z] to work reasonably.

--
Colin Watson                                       [cjwat...@debian.org]

Archive: http://lists.debian.org/20110913160746.gc...@riva.dynamic.greenend.org.uk
Re: r4943 - in glibc-package/trunk/debian: . patches/localedata
On Tue, Sep 13, 2011 at 03:53:23PM +0100, Colin Watson wrote:
> On Sun, Sep 04, 2011 at 05:01:07PM +0000, Aurelien Jarno wrote:
> > Modified:
> >    glibc-package/trunk/debian/changelog
> >    glibc-package/trunk/debian/patches/localedata/locale-C.diff
> > Log:
> >   * debian/patches/localedata/locale-C.diff: Don't include ISO14651
> >     collation rules in C.UTF-8 locale.
>
> I'm curious what the reason for this was. It seems to be implicated in
> this apt crash in Ubuntu:
>
>   https://bugs.launchpad.net/bugs/848907
>
> (apt didn't change in the relevant time period; eglibc seems to be the
> only other reasonable suspect.)
>
> I can reproduce the same crash in Debian unstable, with:
>
>   sudo LC_ALL=C.UTF-8 apt-get update
>
> Now, Michael thinks that this is probably an apt bug too, and he's
> working on fixing it; but I'm curious as to the rationale for this
> change, since I don't know how many other packages might be affected by
> similar problems, and what would go wrong if we backed it out?

In particular, this test program fails:

  $ cat regcomp.c
  #include <locale.h>
  #include <regex.h>
  #include <stdio.h>

  int
  main (int argc, char **argv)
  {
    regex_t reg;

    /* Pick up the locale from the environment (LC_ALL, LANG, ...).  */
    setlocale (LC_ALL, "");
    if (regcomp (&reg, "[a-z]", 0) != 0)
      {
        fprintf (stderr, "regcomp failed!\n");
        return 1;
      }
    return 0;
  }
  $ make CFLAGS='-O2 -g -Wall' regcomp
  cc -O2 -g -Wall    regcomp.c   -o regcomp
  $ LC_ALL=C.UTF-8 ./regcomp; echo $?
  regcomp failed!
  1

This seems to be in conflict with the goal of having a UTF-8-capable
but language-agnostic locale; and it's different from how the C.UTF-8
locale in d-i behaves.

--
Colin Watson                                       [cjwat...@debian.org]

Archive: http://lists.debian.org/20110913153917.gb...@riva.dynamic.greenend.org.uk
Re: r4943 - in glibc-package/trunk/debian: . patches/localedata
On 13/09/2011 16:53, Colin Watson wrote:
> On Sun, Sep 04, 2011 at 05:01:07PM +0000, Aurelien Jarno wrote:
>> Modified:
>>    glibc-package/trunk/debian/changelog
>>    glibc-package/trunk/debian/patches/localedata/locale-C.diff
>> Log:
>>   * debian/patches/localedata/locale-C.diff: Don't include ISO14651
>>     collation rules in C.UTF-8 locale.
>
> I'm curious what the reason for this was. It seems to be implicated in
> this apt crash in Ubuntu:
>
>   https://bugs.launchpad.net/bugs/848907
>
> (apt didn't change in the relevant time period; eglibc seems to be the
> only other reasonable suspect.)
>
> I can reproduce the same crash in Debian unstable, with:
>
>   sudo LC_ALL=C.UTF-8 apt-get update
>
> Now, Michael thinks that this is probably an apt bug too, and he's
> working on fixing it; but I'm curious as to the rationale for this
> change, since I don't know how many other packages might be affected by
> similar problems, and what would go wrong if we backed it out?

Yes, similar problems have already been reported. This change was made
because a C locale should not have a collation order. Unfortunately it
does not seem easy to create such a locale, so the current plan is to
drop the C.UTF-8 locale until a solution is found.

--
Aurelien Jarno                          GPG: 1024D/F1BCDB73
aurel...@aurel32.net                 http://www.aurel32.net

Archive: http://lists.debian.org/4e6f77bf.8010...@aurel32.net
Re: r4943 - in glibc-package/trunk/debian: . patches/localedata
On Sun, Sep 04, 2011 at 05:01:07PM +0000, Aurelien Jarno wrote:
> Modified:
>    glibc-package/trunk/debian/changelog
>    glibc-package/trunk/debian/patches/localedata/locale-C.diff
> Log:
>   * debian/patches/localedata/locale-C.diff: Don't include ISO14651
>     collation rules in C.UTF-8 locale.

I'm curious what the reason for this was. It seems to be implicated in
this apt crash in Ubuntu:

  https://bugs.launchpad.net/bugs/848907

(apt didn't change in the relevant time period; eglibc seems to be the
only other reasonable suspect.)

I can reproduce the same crash in Debian unstable, with:

  sudo LC_ALL=C.UTF-8 apt-get update

Now, Michael thinks that this is probably an apt bug too, and he's
working on fixing it; but I'm curious as to the rationale for this
change, since I don't know how many other packages might be affected by
similar problems, and what would go wrong if we backed it out?

Thanks,

--
Colin Watson                                       [cjwat...@debian.org]

Archive: http://lists.debian.org/20110913145323.ga...@riva.dynamic.greenend.org.uk
r4943 - in glibc-package/trunk/debian: . patches/localedata
Author: aurel32
Date: 2011-09-04 17:01:06 +0000 (Sun, 04 Sep 2011)
New Revision: 4943

Modified:
   glibc-package/trunk/debian/changelog
   glibc-package/trunk/debian/patches/localedata/locale-C.diff
Log:
  * debian/patches/localedata/locale-C.diff: Don't include ISO14651
    collation rules in C.UTF-8 locale.

Modified: glibc-package/trunk/debian/changelog
===================================================================
--- glibc-package/trunk/debian/changelog	2011-09-04 16:39:12 UTC (rev 4942)
+++ glibc-package/trunk/debian/changelog	2011-09-04 17:01:06 UTC (rev 4943)
@@ -7,6 +7,8 @@
   * debian/sysdeps/sparc64.mk: re-enable multiarch similarly to what has
     been done on sparc.
   * debian/control.in/libc: remove Breaks: on perl. Closes: #640300.
+  * debian/patches/localedata/locale-C.diff: Don't include ISO14651
+    collation rules in C.UTF-8 locale.
 
   [ Jeremie Koenig ]
   * New patches to improve the signal code on Hurd:

Modified: glibc-package/trunk/debian/patches/localedata/locale-C.diff
===================================================================
--- glibc-package/trunk/debian/patches/localedata/locale-C.diff	2011-09-04 16:39:12 UTC (rev 4942)
+++ glibc-package/trunk/debian/patches/localedata/locale-C.diff	2011-09-04 17:01:06 UTC (rev 4943)
@@ -4,7 +4,7 @@
 --- /dev/null
 +++ b/localedata/locales/C
-@@ -0,0 +1,34 @@
+@@ -0,0 +1,35 @@
 +escape_char /
 +comment_char %
 +% Locale for C locale in UTF-8
@@ -20,8 +20,8 @@
 +fax       ""
 +language  "C"
 +territory ""
-+revision  "1.0"
-+date      "2011-02-08"
++revision  "1.1"
++date      "2011-09-04"
 +%
 +category  "C:2011";LC_IDENTIFICATION
 +category  "C:2011";LC_CTYPE
@@ -36,6 +36,7 @@
 +END LC_CTYPE
 +
 +LC_COLLATE
-+% Copy the template from ISO/IEC 14651
-+copy "iso14651_t1"
++order_start forward
++UNDEFINED
++order_end
 +END LC_COLLATE

Archive: http://lists.debian.org/e1r0g4b-0002l4...@vasks.debian.org