Re: r4943 - in glibc-package/trunk/debian: . patches/localedata

2011-09-13 Thread Roger Leigh
On Tue, Sep 13, 2011 at 10:03:01PM +0200, Aurelien Jarno wrote:
> On Tue, Sep 13, 2011 at 05:07:46PM +0100, Colin Watson wrote:
> > On Tue, Sep 13, 2011 at 05:33:19PM +0200, Aurelien Jarno wrote:
> > > Yes similar problems have already been reported. This change has been
> > > done as a C locale should not have a collation order.
> > 
> > Why not?  Codepoint order collation is perfectly reasonable for a C
> > locale.  Lots of people use LC_COLLATE=C when all they want is for
> > things like [a-z] to work reasonably.
> > 
> 
> Because it is supposed to replace the C locale, so to follow POSIX
> rules like the C locale. I am personally not convinced that we should go
> that way, but people who have pushed for this locale (some of them
> Cc:ed) have made clear in bugs #522776 and #609306 that it should handle
> collation like a C locale.
> 
> Maybe they could follow-up this mail with their arguments.

OK, here goes ;-)

The "C.UTF-8" locale /is/ the "C" locale, extended to support UTF-8.
That is, it must support the *standard* behaviour mandated in the
C, POSIX and SUS standards, or else conforming applications will break.

This is the reference for the forthcoming SUSv4 locale definition:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html

This standard defines in detail exactly how various aspects of the
C/POSIX locale must behave.  Conforming applications can expect this
behaviour to be guaranteed by a conforming C library.  Some aspects
are strictly defined, while others offer the possibility for
extension.  Examples:

LC_CTYPE

upper
Define characters to be classified as uppercase letters. 
In the POSIX locale, the 26 uppercase letters shall be included:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

lower
Define characters to be classified as lowercase letters. 
In the POSIX locale, the 26 lowercase letters shall be included:
a b c d e f g h i j k l m n o p q r s t u v w x y z

digit
Define the characters to be classified as numeric digits. 
In the POSIX locale, only:
0 1 2 3 4 5 6 7 8 9 shall be included.

space
Define characters to be classified as white-space characters. 
In the POSIX locale, exactly <space>, <form-feed>, <newline>,
<carriage-return>, <tab>, and <vertical-tab> shall be included.

cntrl
Define characters to be classified as control characters. 
In the POSIX locale, no characters in classes alpha or print shall be
included.

xdigit
Define the characters to be classified as hexadecimal digits. 
In the POSIX locale, only:
0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f

blank
Define characters to be classified as <blank> characters. 
In the POSIX locale, only the <space> and <tab> shall be included.

toupper
Define the mapping of lowercase letters to uppercase letters. 
In the POSIX locale, at a minimum, the 26 lowercase characters:
a b c d e f g h i j k l m n o p q r s t u v w x y z
shall be mapped to the corresponding 26 uppercase characters:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

tolower
Define the mapping of uppercase letters to lowercase letters. 
In the POSIX locale, at a minimum, the 26 uppercase characters:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
shall be mapped to the corresponding 26 lowercase characters:
a b c d e f g h i j k l m n o p q r s t u v w x y z

Summary:
• digit, space, cntrl, xdigit and blank are specified exactly.  The C
  locale must only use the specified characters.  It can't be extended
  to support other characters, since the standard explicitly states
  that this is not allowed.
• upper, lower, toupper and tolower specify minimum requirements.
  It's permitted to extend these to support other characters.

LC_COLLATE
The standard specifies a linear incremental sort order from U+0000 to
U+007F.  That's strictly required by the standard.  There's a lot of
software out there which explicitly switches to the C locale (or just
setlocale(LC_COLLATE, "C")) to get a locale-independent guaranteed
known sort order.  If this was to be changed, a lot of software would
break.

My take on this is that a UTF-8 C locale should extend the ordering
so that it just sorts any UCS codepoint by value (i.e. U+ to
U+).  This extends the existing order cleanly, and I think
matches expectations of what the C locale provides.  Regarding
handling of non-UTF-8 input, I've not tested how it's handled for
regular locales.  AFAICT it sorts on UCS codepoints, so it would
probably have already discarded them during conversion?

While in an ideal world it would be great if the "C" locale could
provide the same level of UTF-8/UCS support as other "real" UTF-8
locales, the main issue is ensuring that we comply with the letter
of the standards here--unlike every other locale, this one is
explicitly defined to provide certain things.  The other
consideration is that the "C" locale is by definition a "minimal"
locale that provides a bare minimum of functionality; if you want to
use it to do advanced text processing, I think that's probably outside
its scope.  If we do want a universally available locale that does
provide this level of service, then

Re: r4943 - in glibc-package/trunk/debian: . patches/localedata

2011-09-13 Thread Thorsten Glaser
Aurelien Jarno wrote:

>Because it is supposed to replace the C locale, so to follow POSIX
>rules like the C locale. I am personally not convinced that we should go

It’s supposed to offer a POSIX/C locale but with UTF-8 as
character set instead of 7-bit US ASCII, like the “proper”
POSIX/C locale, the latter even with questionable properties
for octets with high-bit7 – to achieve better overall usability
of UTF-8 as standard encoding, for example.

bye,
//mirabilos
-- 
  "Using Lynx is like wearing a really good pair of shades: cuts out
   the glare and harmful UV (ultra-vanity), and you feel so-o-o COOL."
 -- Henry Nelson, March 1999


--
To UNSUBSCRIBE, email to debian-glibc-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: 
http://lists.debian.org/pine.bsm.4.64l.1109132015160.27...@herc.mirbsd.org



Re: r4943 - in glibc-package/trunk/debian: . patches/localedata

2011-09-13 Thread Aurelien Jarno
On Tue, Sep 13, 2011 at 05:07:46PM +0100, Colin Watson wrote:
> On Tue, Sep 13, 2011 at 05:33:19PM +0200, Aurelien Jarno wrote:
> > Yes similar problems have already been reported. This change has been
> > done as a C locale should not have a collation order.
> 
> Why not?  Codepoint order collation is perfectly reasonable for a C
> locale.  Lots of people use LC_COLLATE=C when all they want is for
> things like [a-z] to work reasonably.
> 

Because it is supposed to replace the C locale, so to follow POSIX
rules like the C locale. I am personally not convinced that we should go
that way, but people who have pushed for this locale (some of them
Cc:ed) have made clear in bugs #522776 and #609306 that it should handle
collation like a C locale.

Maybe they could follow-up this mail with their arguments.

-- 
Aurelien Jarno  GPG: 1024D/F1BCDB73
aurel...@aurel32.net http://www.aurel32.net





Re: r4943 - in glibc-package/trunk/debian: . patches/localedata

2011-09-13 Thread Colin Watson
On Tue, Sep 13, 2011 at 05:33:19PM +0200, Aurelien Jarno wrote:
> Yes similar problems have already been reported. This change has been
> done as a C locale should not have a collation order.

Why not?  Codepoint order collation is perfectly reasonable for a C
locale.  Lots of people use LC_COLLATE=C when all they want is for
things like [a-z] to work reasonably.

-- 
Colin Watson   [cjwat...@debian.org]





Re: r4943 - in glibc-package/trunk/debian: . patches/localedata

2011-09-13 Thread Colin Watson
On Tue, Sep 13, 2011 at 03:53:23PM +0100, Colin Watson wrote:
> On Sun, Sep 04, 2011 at 05:01:07PM +0000, Aurelien Jarno wrote:
> > Modified:
> >    glibc-package/trunk/debian/changelog
> >    glibc-package/trunk/debian/patches/localedata/locale-C.diff
> > Log:
> >   * debian/patches/localedata/locale-C.diff: Don't include ISO14651
> >     collation rules in C.UTF-8 locale.
> 
> I'm curious what the reason for this was.  It seems to be implicated in
> this apt crash in Ubuntu:
> 
>   https://bugs.launchpad.net/bugs/848907
> 
> (apt didn't change in the relevant time period; eglibc seems to be the
> only other reasonable suspect.)
> 
> I can reproduce the same crash in Debian unstable, with:
> 
>   sudo LC_ALL=C.UTF-8 apt-get update
> 
> Now, Michael thinks that this is probably an apt bug too, and he's
> working on fixing it; but I'm curious as to the rationale for this
> change, since I don't know how many other packages might be affected by
> similar problems, and what would go wrong if we backed it out?

In particular, this test program fails:

  $ cat regcomp.c
  #include <sys/types.h>
  #include <regex.h>
  #include <locale.h>
  #include <stdio.h>
  
  int main (int argc, char **argv)
  {
      regex_t reg;
  
      setlocale (LC_ALL, "");
      if (regcomp (&reg, "[a-z]", 0) != 0) {
          fprintf (stderr, "regcomp failed!\n");
          return 1;
      }
      return 0;
  }
  $ make CFLAGS='-O2 -g -Wall' regcomp
  cc -O2 -g -Wall    regcomp.c   -o regcomp
  $ LC_ALL=C.UTF-8 ./regcomp; echo $?
  regcomp failed!
  1

This seems to be in conflict with the goal of having a UTF-8-capable but
language-agnostic locale; and it's different from how the C.UTF-8 locale
in d-i behaves.

-- 
Colin Watson   [cjwat...@debian.org]





Re: r4943 - in glibc-package/trunk/debian: . patches/localedata

2011-09-13 Thread Aurelien Jarno
On 13/09/2011 16:53, Colin Watson wrote:
> On Sun, Sep 04, 2011 at 05:01:07PM +0000, Aurelien Jarno wrote:
>> Modified:
>>    glibc-package/trunk/debian/changelog
>>    glibc-package/trunk/debian/patches/localedata/locale-C.diff
>> Log:
>>   * debian/patches/localedata/locale-C.diff: Don't include ISO14651
>>     collation rules in C.UTF-8 locale.
> 
> I'm curious what the reason for this was.  It seems to be implicated in
> this apt crash in Ubuntu:
> 
>   https://bugs.launchpad.net/bugs/848907
> 
> (apt didn't change in the relevant time period; eglibc seems to be the
> only other reasonable suspect.)
> 
> I can reproduce the same crash in Debian unstable, with:
> 
>   sudo LC_ALL=C.UTF-8 apt-get update
> 
> Now, Michael thinks that this is probably an apt bug too, and he's
> working on fixing it; but I'm curious as to the rationale for this
> change, since I don't know how many other packages might be affected by
> similar problems, and what would go wrong if we backed it out?

Yes similar problems have already been reported. This change has been
done as a C locale should not have a collation order. Unfortunately it
seems not easy to create such a locale, so the current plan is to drop
the C.UTF-8 locale until a solution is found.

-- 
Aurelien Jarno  GPG: 1024D/F1BCDB73
aurel...@aurel32.net http://www.aurel32.net





Re: r4943 - in glibc-package/trunk/debian: . patches/localedata

2011-09-13 Thread Colin Watson
On Sun, Sep 04, 2011 at 05:01:07PM +0000, Aurelien Jarno wrote:
> Modified:
>    glibc-package/trunk/debian/changelog
>    glibc-package/trunk/debian/patches/localedata/locale-C.diff
> Log:
>   * debian/patches/localedata/locale-C.diff: Don't include ISO14651
>     collation rules in C.UTF-8 locale.

I'm curious what the reason for this was.  It seems to be implicated in
this apt crash in Ubuntu:

  https://bugs.launchpad.net/bugs/848907

(apt didn't change in the relevant time period; eglibc seems to be the
only other reasonable suspect.)

I can reproduce the same crash in Debian unstable, with:

  sudo LC_ALL=C.UTF-8 apt-get update

Now, Michael thinks that this is probably an apt bug too, and he's
working on fixing it; but I'm curious as to the rationale for this
change, since I don't know how many other packages might be affected by
similar problems, and what would go wrong if we backed it out?

Thanks,

-- 
Colin Watson   [cjwat...@debian.org]





r4943 - in glibc-package/trunk/debian: . patches/localedata

2011-09-04 Thread Aurelien Jarno
Author: aurel32
Date: 2011-09-04 17:01:06 +0000 (Sun, 04 Sep 2011)
New Revision: 4943

Modified:
   glibc-package/trunk/debian/changelog
   glibc-package/trunk/debian/patches/localedata/locale-C.diff
Log:
  * debian/patches/localedata/locale-C.diff: Don't include ISO14651
    collation rules in C.UTF-8 locale.



Modified: glibc-package/trunk/debian/changelog
===================================================================
--- glibc-package/trunk/debian/changelog	2011-09-04 16:39:12 UTC (rev 4942)
+++ glibc-package/trunk/debian/changelog	2011-09-04 17:01:06 UTC (rev 4943)
@@ -7,6 +7,8 @@
   * debian/sysdeps/sparc64.mk: re-enable multiarch similarly to what 
     has been done on sparc.
   * debian/control.in/libc: remove Breaks: on perl.  Closes: #640300.
+  * debian/patches/localedata/locale-C.diff: Don't include ISO14651
+    collation rules in C.UTF-8 locale.
 
   [ Jeremie Koenig ]
   * New patches to improve the signal code on Hurd:

Modified: glibc-package/trunk/debian/patches/localedata/locale-C.diff
===================================================================
--- glibc-package/trunk/debian/patches/localedata/locale-C.diff	2011-09-04 16:39:12 UTC (rev 4942)
+++ glibc-package/trunk/debian/patches/localedata/locale-C.diff	2011-09-04 17:01:06 UTC (rev 4943)
@@ -4,7 +4,7 @@
 
 --- /dev/null
 +++ b/localedata/locales/C
-@@ -0,0 +1,34 @@
+@@ -0,0 +1,35 @@
 +escape_char /
 +comment_char %
 +% Locale for C locale in UTF-8
@@ -20,8 +20,8 @@
 +fax        ""
 +language   "C"
 +territory  ""
-+revision   "1.0"
-+date   "2011-02-08"
++revision   "1.1"
++date   "2011-09-04"
 +%
 +category  "C:2011";LC_IDENTIFICATION
 +category  "C:2011";LC_CTYPE
@@ -36,6 +36,7 @@
 +END LC_CTYPE
 +
 +LC_COLLATE
-+% Copy the template from ISO/IEC 14651
-+copy "iso14651_t1"
++order_start forward
++UNDEFINED
++order_end
 +END LC_COLLATE

