Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale

2009-04-07 Thread Giacomo A. Catenazzi

Roger Leigh wrote:

On Mon, Apr 06, 2009 at 11:09:17AM -0700, Steve Langasek wrote:

On Mon, Apr 06, 2009 at 05:33:35PM +, Thorsten Glaser wrote:

If you need a specific locale (as seems from mksh, not
sure if it is a bug in that program), you need to set it.

You can only set a locale on a glibc-based system if it’s
installed beforehand, which root needs to do.

You can build-depend on the locales package and generate the locales you
want locally, using LOCPATH to reference them.  There's no need for Debian
to guarantee the presence of a particular locale ahead of time -
particularly one that isn't actually useful to end users, as C.UTF-8 would
be.


I think that it would be very useful, I'll detail why below.

The GCC toolchain has, for some time now, been using UTF-8 as the
internal representation for narrow strings (-fexec-charset).  It has
also been using UTF-8 as the default input encoding for C source code
(-finput-charset).  This means that unless you take any special
measures, your program will be outputting UTF-8 strings for all file
and terminal I/O.  Of course, this is backward compatible with ASCII,
and is also transcoded automatically when in a non-UTF-8 locale.  I've
attached a trivial example.  Just to be clear: this handling is
completely built into GCC and libc, and is completely transparent.


Hmm. Warning, you are confusing some terms.
- input charset is the source charset (used to parse C code)
- exec charset is the charset of the target machine (which runs the program).
- C99 must support Unicode identifiers (written with \u or in some other
  non-portable, implementation-defined way)
- standard libraries can use locales (but only if you initialized the locale),
  but not in all the functions, and not for all uses.
- wide characters are yet another thing (as you note in your example,
  the wide string is not in UTF-8, but I think UTF-32)
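
A small sketch of the narrow/wide distinction in the last point, written in Python so that it runs without any particular compiler setup. It assumes the wide string really is UTF-32, as guessed above; the sample text is hypothetical and not taken from the attached example.

```python
# Illustration only: compare the byte layout of one string in UTF-8
# (the narrow encoding under discussion) and in UTF-32 (assumed here
# to correspond to a 32-bit wide character type).
text = "na\u00efve"  # "naive" with a diaeresis: five characters, one non-ASCII

narrow = text.encode("utf-8")     # variable width: 1 to 4 bytes per character
wide = text.encode("utf-32-le")   # fixed width: 4 bytes per character, no BOM

print(len(text), len(narrow), len(wide))
```

The ASCII characters keep their single-byte values in the narrow form, which is why the narrow form stays backward compatible with ASCII while the wide form does not.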

Same input and exec charset really means: don't translate strings
(e.g. in
   if (c == 'a') printf("bcde\n");
 'a' and "bcde\n" will have the same values as in the input file; otherwise
 the binary will contain their representation in the exec charset)

I expect that your program will run fine (i.e. really no changes: the
same binary output) if you tell GCC that you use any other
7-bit-ASCII-derived 8-bit encoding (both for input and exec charset).

printf/wprintf use the locale only for numeric representation.

Usually the interpretation of bytes is done by the terminal, not by the compiler.



Now, this will work fine in all locales *except for C/POSIX*.
Obviously the charsets of some locales can't represent all the
characters used in this example, but the C library will actually
transcode (iconv) to the locale codeset as best it can.  Except for
C/POSIX.

Now, why is this needed?

If I write a program, I might want to use non-ASCII UTF-8 characters
in the sources.  We have been doing this for years without realising
since GCC switched to UTF-8 as the default internal encoding, but
simply for portability when using the C locale we are restricted to
using ASCII only in the sources,


Really, the minimal C charset is smaller than ASCII (a portable program
must contain neither $ nor @, and C also supports even smaller charsets
through trigraphs [preprocessor] and/or the newer digraphs [compiler])


and then a translation library such
as libintl/gettext to get translated strings with the extended
characters in them.  This is workable, but it imposes a big burden on
translators because I might want to use symbols and other characters
which are not part of a /language/ translation, but need adding by
each and every translator through explicit translator comments in the
sources.  This is tedious and error-prone.  If the sources were UTF-8
encoded, this would work perfectly since I could just use the
necessary UTF-8 characters directly in the source rather than abusing
the translation machinery to avoid non-ASCII codes.  A UTF-8 C locale
thus cuts out a big pile of cruft and complexity in sources which only
exists to cater for people who want to run your code in a C locale!
And the translators can completely ignore the now unnecessary
job of translating special characters on top of the actual
translation work, so the symbol usage is identical in all
translations, and their job is much easier.


Yes, in a perfect world we would need only one charset (and maybe only
one language and one locale). Of all the proposals to reach this
target, Unicode and UTF-8 seem the best solution.
But... for now, take care with locales and don't assume UTF-8,
or you will cause trouble for a lot of non-UTF-8 users.
Converting a locale (from non-UTF-8 to UTF-8) is simple for
English and a few European languages, but it is tedious work
for many users: it needs a flag day, on which I would have to convert
all my files to UTF-8 or annotate every file with the right
encoding (most editors and tools understand such annotations).

So for now we support UTF-8, we try to make UTF-8 the default for
new users, and UTF-8 is the encoding for debian files in 

Bug#501930: Bug#501927: debian_bundle fails with empty lines containing a space

2009-04-07 Thread Stefano Zacchiroli
On Sat, Oct 11, 2008 at 07:05:31PM +0200, Stefano Zacchiroli wrote:
 Interestingly enough, the Debian policy is ambiguous about what are
 the paragraph separators in debian/control. Section 5.1 first states
 that blank lines are separators (which is usually interpreted as \n
 alone):
 
  A control file consists of one or more paragraphs of fields[1].
  The paragraphs are separated by blank lines.
 
 Then, later on, it seems to allow for other blank characters,
 mentioning spaces and tabs:
 
  Blank lines, or lines consisting only of spaces and tabs, are not
  allowed within field values or between fields - that would mean a
  new paragraph.
 
 If generic blanks (space, tabs, ...) are the intended separators I've
 no objection in fixing the bug as you propose. Cloning/reassigning
 this bug to policy, as it needs to be discussed there as well.

Heya, policy maintainers, can you please give us a bit of feedback on
your draft stance on this issue?

I _think_ I'm going to apply the proposed patch in python-debian,
i.e. allow for liberal blank lines as stanza separators, on the
basis that:

- it is my feeling that that interpretation is the one the policy
  intended (according to the second paragraph I quoted)

- apparently there are Packages out there using liberal separators

Still, I would have preferred to have at least an idea of how you plan
to clarify this, even if the clarification will come later on.
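
A minimal sketch of the liberal interpretation described above, in which any line consisting only of spaces and tabs separates stanzas and runs of such lines count as one separator. `split_paragraphs` is a hypothetical helper written for illustration; it is not python-debian's API and not the proposed patch.

```python
def split_paragraphs(text):
    """Split control-file text into paragraphs (lists of lines).

    Assumption for this sketch: a separator is any line that is empty
    or made up only of spaces and tabs, and consecutive separators
    collapse into one.
    """
    paragraphs = []
    current = []
    for line in text.splitlines():
        if line.strip(" \t") == "":
            # Blank or whitespace-only line: close the current paragraph.
            if current:
                paragraphs.append(current)
                current = []
        else:
            current.append(line)
    if current:
        paragraphs.append(current)
    return paragraphs


sample = "Package: a\nVersion: 1\n \nPackage: b\nVersion: 2\n\n\n\t\nPackage: c\n"
for paragraph in split_paragraphs(sample):
    print(paragraph)
```

A strict reading that accepts only a bare newline as separator would instead treat the " " line as a malformed field line, which is the failure reported in the bug.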

Many thanks in advance.
Cheers.

-- 
Stefano Zacchiroli -o- PhD in Computer Science \ PostDoc @ Univ. Paris 7
z...@{upsilon.cc,pps.jussieu.fr,debian.org} -- http://upsilon.cc/zack/
Dietro un grande uomo c'è ..|  .  |. Et ne m'en veux pas si je te tutoie
sempre uno zaino ...| ..: | Je dis tu à tous ceux que j'aime


signature.asc
Description: Digital signature


Re: does /var/games have to be deleted on purge? (if it's empty..)

2009-04-07 Thread Holger Levsen
Hi Russ,

On Monday, 6 April 2009, Russ Allbery wrote:
 We'd then have a similar problem with any other /var directory that holds
 files mostly created at runtime and only deleted on purge, such as
 /var/log, except that the rest are always in existence.

According to the FHS the other 4 directories in the same category 
as /var/games are /var/account, /var/crash, /var/mail and /var/yp.

 I don't see much real benefit in going out of our way to remove /var/games

less cruft on disk, better overview when doing ls /var, but yeah... that's 
not really much :)

 and it looks like it would be a bit annoying (at the least, require adding
 purge code to all games that put files in /var/games that would usually
 never be triggered).

Sounds like a job for debhelper/cdbs to me...

 My inclination would be to say that this behavior is 
 fine and perhaps we should officially bless it somewhere.

I'd appreciate that, I like to follow documented procedures when adding 
checks/ignores to piuparts and then run it on the archive ;)


regards,
Holger




signature.asc
Description: This is a digitally signed message part.


Re: does /var/games have to be deleted on purge? (if it's empty..)

2009-04-07 Thread Paul Wise
On Tue, Apr 7, 2009 at 2:33 AM, Russ Allbery r...@debian.org wrote:

 I don't see much real benefit in going out of our way to remove /var/games
 and it looks like it would be a bit annoying (at the least, require adding
 purge code to all games that put files in /var/games that would usually
 never be triggered).  My inclination would be to say that this behavior is
 fine and perhaps we should officially bless it somewhere.

A single rmdir in every game using /var/games isn't that hard,
especially since they have to remove the files from there.
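
The semantics of that single rmdir can be sketched as follows. Maintainer scripts are normally shell, so this Python version only illustrates the behaviour being relied on: the directory goes away when it is empty and is left alone otherwise. `remove_if_empty` is a hypothetical name.

```python
import errno
import os


def remove_if_empty(path):
    """Remove `path` if it is an empty directory; return True if removed."""
    try:
        os.rmdir(path)  # rmdir refuses to remove a non-empty directory
        return True
    except FileNotFoundError:
        return False    # already gone, nothing to do
    except OSError as exc:
        if exc.errno in (errno.ENOTEMPTY, errno.EEXIST):
            return False  # another package still keeps files there
        raise
```

Because a non-empty directory is simply left in place, each game could run this on purge without coordinating with the other packages that use the directory.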

-- 
bye,
pabs

http://wiki.debian.org/PaulWise


--
To UNSUBSCRIBE, email to debian-policy-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale

2009-04-07 Thread Bill Allombert
On Mon, Apr 06, 2009 at 10:56:25PM +0100, Roger Leigh wrote:
 On Mon, Apr 06, 2009 at 04:18:59PM +0200, Bill Allombert wrote:
  On Mon, Apr 06, 2009 at 02:06:55PM +0200, Thorsten Glaser wrote:
   Package: debian-policy
   Version: 3.8.1.0
   Severity: wishlist
   
   For the mksh regression tests, I need a UTF-8 locale working; most
   systems either provide “en_US.UTF-8” or “en_US.utf8” with the former
   being recommended.
  
  Hello Thorsten,
  I have some sympathy with your proposal because dgettext does not work
  in the C locale but there are too many open questions.
 
 Is there any hope of fixing this?  I consider this hardcoded
 gettext behaviour in a C locale a severe misfeature, which has caused
 me (as a programmer) no end of problems.

None: I discussed this issue extensively with Bruno Haible, and while he
was sympathetic to my cause, he said there was no chance that upstream
glibc would accept such a change.

On the other hand, technically it is a one-line patch to remove that
restriction. I even considered shipping menu with a patched gettext to
avoid that issue. Fortunately, since Sarge, debian-installer sets LANG in
/etc/environment so programs almost never run under the C locale anymore.

Cheers,
-- 
Bill. ballo...@debian.org

Imagine a large red swirl here. 






Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale

2009-04-07 Thread Adeodato Simó
+ Thorsten Glaser (Tue, 07 Apr 2009 18:54:59 +):

 Except the ton which sets LC_ALL=C to get sane (parsable,
 dependable, historically compatible) output.

 These would then unset all other LC_* and LANG and LANGUAGE,
 and only set LC_CTYPE to C.UTF-8 to get old behaviour but
 with UTF-8 (and mbrtowc and iswctype and and and) available.

Isn’t setting LC_ALL=C.UTF-8 going to be about the same and less work?
I’m genuinely interested if that would behave any different to what you
said (unsetting all, setting LC_CTYPE).

Cheers,

-- 
- Are you sure we're good?
- Always.
-- Rory and Lorelai







Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale

2009-04-07 Thread Roger Leigh
On Tue, Apr 07, 2009 at 06:54:59PM +, Thorsten Glaser wrote:
 Bill Allombert dixit:
 
 Fortunately, since Sarge, debian-installer sets LANG in
 /etc/environment so programs almost never run under the C locale anymore.
 
 Except the ton which sets LC_ALL=C to get sane (parsable,
 dependable, historically compatible) output.

The gettext bug itself won't cause any change in typical behaviour
with gettext().

As an optimisation, it's OK to skip translating if running in a C
locale.  However, if we use dgettext/dcgettext etc., we are
explicitly asking for a given text domain and want translation
even in a C locale.

As Bill said, the change is trivial (I've also looked at libintl
and libc to look at fixing it).  One use case I need this for is
the generation of PPD files in gutenprint; we generate single files
containing multiple languages and so use dgettext, but this totally
breaks in the C locale due to the C locale special casing in gettext.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?   http://gutenprint.sourceforge.net/
   `-GPG Public Key: 0x25BFB848   Please GPG sign your mail.




Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale

2009-04-07 Thread Thorsten Glaser
Adeodato Simó dixit:

+ Thorsten Glaser (Tue, 07 Apr 2009 18:54:59 +):

 Except the ton which sets LC_ALL=C to get sane (parsable,
 dependable, historically compatible) output.

 These would then unset all other LC_* and LANG and LANGUAGE,
 and only set LC_CTYPE to C.UTF-8 to get old behaviour but
 with UTF-8 (and mbrtowc and iswctype and and and) available.

Isn’t setting LC_ALL=C.UTF-8 going to be about the same and less work?

Indeed.

I’m genuinely interested if that would behave any different to what you
said (unsetting all, setting LC_CTYPE).

For my proposed C.UTF-8 locale it would be exactly zero, nada,
difference. (For en_US.UTF-8 it is a lot of difference, for example
sorting order.)

Unfortunately, GNU libc needs a locale to even enable UTF-8 support.
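
The equivalence claimed above follows from the POSIX precedence among the locale environment variables: LC_ALL overrides everything, then the per-category LC_* variable, then LANG. A minimal sketch of that lookup is given below; `effective_locale` is a hypothetical helper, and the fallback to "C" when nothing is set is an assumption (POSIX leaves that default implementation-defined). LANGUAGE, a GNU gettext extension, is not modelled.

```python
def effective_locale(env, category):
    """Return the locale name selected for `category` by the environment.

    Precedence: LC_ALL, then the category variable (e.g. LC_CTYPE),
    then LANG. Unset and empty variables are both treated as not set.
    """
    for var in ("LC_ALL", category, "LANG"):
        value = env.get(var)
        if value:
            return value
    return "C"  # assumed default for this sketch


# Setting only LC_ALL versus unsetting everything and setting LC_CTYPE:
via_lc_all = {"LC_ALL": "C.UTF-8", "LANG": "de_DE.UTF-8"}
via_lc_ctype = {"LC_CTYPE": "C.UTF-8"}

for category in ("LC_CTYPE", "LC_COLLATE"):
    print(category,
          effective_locale(via_lc_all, category),
          effective_locale(via_lc_ctype, category))
```

With LC_ALL every category becomes C.UTF-8, while the second environment leaves LC_COLLATE at C. That only amounts to no difference if C.UTF-8 matches C in everything except the character set, which is the property the proposed locale is meant to have.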

bye,
//mirabilos
-- 
“It is inappropriate to require that a time represented as
 seconds since the Epoch precisely represent the number of
 seconds between the referenced time and the Epoch.”
-- IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2





Bug#501930: Bug#501927: debian_bundle fails with empty lines containing a space

2009-04-07 Thread Bill Allombert
On Sun, Oct 12, 2008 at 12:10:54AM +0200, Stefano Zacchiroli wrote:
 On Sat, Oct 11, 2008 at 07:26:59PM +0200, Siegfried Gevatter (RainCT) wrote:
  Further, PackageFile fails if there is more than one empty line. Eg.,
  http://revu.ubuntuwire.com/revu1-incoming/ampache-0708220100/ampache-3.3.3.5-dfsg/debian/control
  
  IMHO those cases should be handled well even if debian-policy didn't
  allow it, as I've found *hundreds* of files that can't be parsed, and
  that only on REVU... And robustness can't hurt :).
 
 Nope, I'm against such an argument.

Agreed: Robustness means rejecting malformed input with an error.

 In fact, in this specific case, I was very surprised not to
 find in the policy an explicit reference to RFC822 [1]. A lot of the
 implementations of Packages/Sources file parsing I've seen around use legacy
 RFC822 libraries; having that practice written in policy would be
 helpful (note that at that point, whether spaces are accepted or not
 will depend entirely on the RFC822 standard).

For what it is worth: RFC822 only allows a single empty line as the delimiter
between the header and the body of a message: this allows the body to start
with an empty line.

RFC822 does not document the mbox format.
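
The single-empty-line rule can be observed with Python's standard email parser, used here only as one readily available RFC822-style implementation; the message text is made up for the illustration.

```python
from email import message_from_string

# One header, the delimiter line, then a body that itself starts
# with an empty line.
raw = "Subject: test\n\n\nbody text\n"
msg = message_from_string(raw)

print(repr(msg["Subject"]))
print(repr(msg.get_payload()))
```

Only the first empty line is consumed as the delimiter; the second one stays in the body. A parser built on such a library would therefore not treat a run of empty lines as a single separator.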

Cheers,
-- 
Bill. ballo...@debian.org

Imagine a large red swirl here. 






Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale

2009-04-07 Thread Thorsten Glaser
Bill Allombert dixit:

Fortunately, since Sarge, debian-installer sets LANG in
/etc/environment so programs almost never run under the C locale anymore.

Except the ton which sets LC_ALL=C to get sane (parsable,
dependable, historically compatible) output.

These would then unset all other LC_* and LANG and LANGUAGE,
and only set LC_CTYPE to C.UTF-8 to get old behaviour but
with UTF-8 (and mbrtowc and iswctype and and and) available.


For what it's worth: vorlon gave me the means to change the
mksh regression test (LOCPATH), so that this will no longer
block it on the HURD. However, I'm still in favour of a
default UTF-8 locale (be it C.UTF-8 or en_US.UTF-8) installed
plus, maybe, one binary package per locale? Aurelien - if I
remember correctly - said something along these lines too.

bye,
//mirabilos
-- 
[...] if maybe ext3fs wasn't a better pick, or jfs, or maybe reiserfs, oh but
what about xfs, and if only i had waited until reiser4 was ready... in the be-
ginning, there was ffs, and in the middle, there was ffs, and at the end, there
was still ffs, and the sys admins knew it was good. :)  -- Ted Unangst über *fs





Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale

2009-04-07 Thread Roger Leigh
On Tue, Apr 07, 2009 at 09:24:38PM +0200, Adeodato Simó wrote:
 + Thorsten Glaser (Tue, 07 Apr 2009 18:54:59 +):
 
  Except the ton which sets LC_ALL=C to get sane (parsable,
  dependable, historically compatible) output.
 
  These would then unset all other LC_* and LANG and LANGUAGE,
  and only set LC_CTYPE to C.UTF-8 to get old behaviour but
  with UTF-8 (and mbrtowc and iswctype and and and) available.
 
 Isn’t setting LC_ALL=C.UTF-8 going to be about the same and less work?
 I’m genuinely interested if that would behave any different to what you
 said (unsetting all, setting LC_CTYPE).

% sudo localedef -c -i POSIX -f UTF-8 C.UTF-8

% LANG=C.UTF8 locale charmap
UTF-8

% LANG=C locale charmap
ANSI_X3.4-1968

This appears to work correctly at first glance.

However, I would ideally like the C/POSIX locales to be UTF-8
by default as on other systems (with a C.ASCII variant if required).


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?   http://gutenprint.sourceforge.net/
   `-GPG Public Key: 0x25BFB848   Please GPG sign your mail.




Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale

2009-04-07 Thread Adeodato Simó
+ Steve Langasek (Mon, 06 Apr 2009 11:09:17 -0700):

 On Mon, Apr 06, 2009 at 05:33:35PM +, Thorsten Glaser wrote:
   If you need a specific locale (as seems from mksh, not
   sure if it is a bug in that program), you need to set it.

  You can only set a locale on a glibc-based system if it’s
  installed beforehand, which root needs to do.

 You can build-depend on the locales package and generate the locales you
 want locally, using LOCPATH to reference them.  There's no need for Debian
 to guarantee the presence of a particular locale ahead of time -

It is my impression that more packages than mksh could use a UTF-8
locale at build time (I’m afraid I don’t have pointers, but I’m sure
I’ve come across at least a couple).

Wouldn’t it be just better to change Debian’s default to make a UTF-8
locale available by default, rather than to force all those packages to
play tricks with LOCPATH?

I would go as far as suggesting that some package like libc6 itself
ships the locale, both as a way of ensuring it’ll always be there, and
of not forcing the locales package on every system (not sure if this was
part of your concerns).

Unfortunately, and from my limited knowledge and recent poking of this,
it seems the supported locales for a running system are kept in a single
file (/usr/lib/locale/locale-archive), so I’m unsure how the above could
work out, if at all.

 particularly one that isn't actually useful to end users, as C.UTF-8 would
 be.

Is that point really important? It is useful for building some packages,
plus I’m sure we have pedantic enough users that would prefer C.UTF-8 over
en_US.UTF-8. :-P

Finally, this stuff that Roger proposes about making “C” be UTF-8, and
creating some C.ASCII for people needing that, sounds shocking and
appealing at the same time.

Cheers,

-- 
- Are you sure we're good?
- Always.
-- Rory and Lorelai







Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale

2009-04-07 Thread Thorsten Glaser
Roger Leigh dixit:

However, I would ideally like the C/POSIX locales to be UTF-8
by default as on other systems (with a C.ASCII variant if required).

No, this has the potential to break, for example, tr(1).
I lived through that on MirBSD.

//mirabilos
-- 
“It is inappropriate to require that a time represented as
 seconds since the Epoch precisely represent the number of
 seconds between the referenced time and the Epoch.”
-- IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2






Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale

2009-04-07 Thread Thorsten Glaser
Adeodato Simó dixit:

I would go as far as suggesting that some package like libc6 itself

FWIW:

-rw-r--r-- 1 tg tg 238336 Apr  7 22:59 en_US.UTF-8/LC_CTYPE

It's not *that* much...

Finally, this stuff that Roger proposes about making “C” be UTF-8, and
create some C.ASCII for people needing that, sounds shocking at the same
time as appealing.

It won't work, because in a UTF-8 locale, for example stdio
functions must reject invalid (not valid UTF-8) input, so
it would not be 8-bit clean/transparent any more.
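
The difference between an 8-bit clean path and a strictly validating UTF-8 path can be sketched with Python's codecs. This only illustrates the concept; it does not show what glibc's stdio functions do, and the sample bytes are made up.

```python
# "cafe" with an acute accent, encoded as ISO-8859-1. The byte 0xE9
# followed by a newline is not a valid UTF-8 sequence.
data = b"caf\xe9\n"

# An 8-bit clean path: every byte value is accepted and passed through.
latin1_roundtrip = data.decode("latin-1").encode("latin-1")

# A strict UTF-8 path: the same bytes are refused.
try:
    data.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True

print(latin1_roundtrip == data, rejected)
```

A tool that used to copy arbitrary bytes unchanged would start failing on such input if its locale began enforcing UTF-8 validity, which is the concern raised here.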

//mirabilos
-- 
“It is inappropriate to require that a time represented as
 seconds since the Epoch precisely represent the number of
 seconds between the referenced time and the Epoch.”
-- IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2





Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale

2009-04-07 Thread Roger Leigh
On Tue, Apr 07, 2009 at 09:00:50PM +, Thorsten Glaser wrote:
 Adeodato Simó dixit:
 
 I would go as far as suggesting that some package like libc6 itself
 
 FWIW:
 
 -rw-r--r-- 1 tg tg 238336 Apr  7 22:59 en_US.UTF-8/LC_CTYPE
 
 It's not *that* much...
 
 Finally, this stuff that Roger proposes about making “C” be UTF-8, and
 create some C.ASCII for people needing that, sounds shocking at the same
 time as appealing.
 
 It won't work, because in a UTF-8 locale, for example stdio
 functions must reject invalid (not valid UTF-8) input, so
 it would not be 8-bit clean/transparent any more.

I wasn't aware that this level of checking was performed, though
it does make sense.  But, does it not reject non 7-bit input in the C
locale for completeness?

Should tools doing raw I/O not be using lower level interfaces
such as fread() and fwrite() rather than the formatted print
functions which are specified to behave in a locale-dependent
manner?  This strikes me as bugs in the form of assumptions in the
code which should be fixed, rather than a fundamental problem with
the locale itself using a non-7-bit-ASCII codeset.


Thanks,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?   http://gutenprint.sourceforge.net/
   `-GPG Public Key: 0x25BFB848   Please GPG sign your mail.




Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale

2009-04-07 Thread Roger Leigh
On Tue, Apr 07, 2009 at 10:36:20AM +0200, Giacomo A. Catenazzi wrote:

I can't help but feel that your reply completely missed the
purpose of what I want to do, and why.  I hope the following
response clears things up.

 Roger Leigh wrote:
 On Mon, Apr 06, 2009 at 11:09:17AM -0700, Steve Langasek wrote:
 On Mon, Apr 06, 2009 at 05:33:35PM +, Thorsten Glaser wrote:
 If you need a specific locale (as seems from mksh, not
 sure if it is a bug in that program), you need to set it.
 You can only set a locale on a glibc-based system if it’s
 installed beforehand, which root needs to do.
 You can build-depend on the locales package and generate the locales you
 want locally, using LOCPATH to reference them.  There's no need for Debian
 to guarantee the presence of a particular locale ahead of time -
 particularly one that isn't actually useful to end users, as C.UTF-8 would
 be.

 I think that it would be very useful, I'll detail why below.

 The GCC toolchain has, for some time now, been using UTF-8 as the
 internal representation for narrow strings (-fexec-charset).  It has
 also been using UTF-8 as the default input encoding for C source code
 (-finput-charset).  This means that unless you take any special
 measures, your program will be outputting UTF-8 strings for all file
 and terminal I/O.  Of course, this is backward compatible with ASCII,
 and is also transcoded automatically when in a non-UTF-8 locale.  I've
 attached a trivial example.  Just to be clear: this handling is
 completely built into GCC and libc, and is completely transparent.

 Hmm. Warning, you confuse some terms.

I'm not really sure how relevant these minor points are to the general
point that I was trying to make.

 - input charset is the source charset (used to parse C code)
 - exec charset is the charset of the target machine (which runs the program).

That's pretty much what I said.

 - C99 must support unicode identifier (written with \u or in other
   non portable implementation defined way)

OK.  But that really has nothing to do with the fact that you can use
UTF-8 sources directly.  It's akin to having to support trigraphs,
but we don't use trigraphs because they are bloody annoying and nowadays
completely unnecessary.  But mainly, it doesn't affect the exec charset
whether you use UTF-8 encoded sources or \u.

 - standard libraries can use locales (but only if you initialized the locale),
   but not in all the functions, and not for all uses.
 - wide characters are yet another thing (as you note in your example,
   the wide string is not in UTF-8, but I think UTF-32)

 Same input and exec charset really means: don't translate strings
 (e.g. in
    if (c == 'a') printf("bcde\n");
  'a' and "bcde\n" will have the same values as in the input file; otherwise
  the binary will contain their representation in the exec charset)

Of course.  However, the test program I posted showed that if the
locale has been appropriately initialised, there is an additional
translation between the exec charset and the output charset specified
by the locale (see the Latin characters correctly preserved and output
as ISO-8859-1 in an ISO-8859-1 locale).

 I expect that your program will run fine (i.e. really no changes: the
 same binary output) if you tell GCC that you use any other
 7-bit-ASCII-derived 8-bit encoding (both for input and exec charset).

Of course.

 Usually the interpretation of bytes is done by terminal, not by compiler.

It's done at several points:
compiler: source -> exec
runtime: locale-dependent exec -> output (and optional use of gettext)
terminal: output -> display
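
The three stages can be sketched with Python's codecs standing in for the compiler, the C library and the terminal. All function names are hypothetical, and the codecs only approximate what GCC and libc do; in particular, this sketch raises an error where a real C library might transliterate or substitute.

```python
literal = "\u00e9t\u00e9"  # a hypothetical string literal with two accented letters


def compile_literal(text, exec_charset="utf-8"):
    """Stage 1 (compiler): source text becomes bytes in the exec charset."""
    return text.encode(exec_charset)


def write_in_locale(exec_bytes, locale_codeset, exec_charset="utf-8"):
    """Stage 2 (runtime): exec-charset bytes are converted to the locale's codeset."""
    return exec_bytes.decode(exec_charset).encode(locale_codeset)


def display(output_bytes, terminal_codeset):
    """Stage 3 (terminal): output bytes are interpreted for display."""
    return output_bytes.decode(terminal_codeset)


exec_bytes = compile_literal(literal)
latin1_bytes = write_in_locale(exec_bytes, "iso-8859-1")
shown = display(latin1_bytes, "iso-8859-1")
print(exec_bytes, latin1_bytes, shown == literal)
```

Calling `write_in_locale(exec_bytes, "ascii")` raises `UnicodeEncodeError`, which mirrors the case where the locale's codeset cannot represent the characters at all.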

 Now, this will work fine in all locales *except for C/POSIX*.
 Obviously the charsets of some locales can't represent all the
 characters used in this example, but the C library will actually
 transcode (iconv) to the locale codeset as best it can.  Except for
 C/POSIX.

 Now, why is this needed?

 If I write a program, I might want to use non-ASCII UTF-8 characters
 in the sources.  We have been doing this for years without realising
 since GCC switched to UTF-8 as the default internal encoding, but
 simply for portability when using the C locale we are restricted to
 using ASCII only in the sources,

 Really, the minimal C charset is smaller than ASCII (a portable program
 must contain neither $ nor @, and C also supports even smaller charsets
 through trigraphs [preprocessor] and/or the newer digraphs [compiler])

I'm not sure how relevant this is.  This is specified as the minimum
requirement by the *C standard*.  But, it's the *minimum* requirement.
GCC supports full use of UTF-8 (or whatever) encoded sources, and I
want to make better use of it, while still remaining in compliance
with the standard (which it is--I've read the ISO C standard relating
to source and execution character sets, and you're allowed to do better
than 7 bit ASCII!).

 and then a translation library such
 as libintl/gettext to get translated strings with the extended
 characters in them.  This is workable, but it imposes a big 

Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale

2009-04-07 Thread Andrew McMillan
On Tue, 2009-04-07 at 22:32 +0200, Adeodato Simó wrote:
 
 It is my impression that more packages than mksh could use a UTF-8
 locale at build time (I’m afraid I don’t have pointers, but I’m sure
 I’ve come across at least a couple).
 
 Wouldn’t it be just better to change Debian’s default to make a UTF-8
 locale available by default, rather than to force all those packages to
 play tricks with LOCPATH?

I too would really like to see a UTF-8 locale available by default, and
would prefer to see this be the C.UTF-8 locale, which doesn't screw with
the collation / character type settings like any other UTF-8 locale
would.

It seems to me that the consensus here is that having a UTF-8 locale
available is a good idea and I don't hear any very strong argument
against such a change.

Consequently I think we should move on from the discussion and start
working out a patch to resolve this in policy.

Regards,
Andrew.


andrew (AT) morphoss (DOT) com            +64(272)DEBIAN
   Time to be aggressive.  Go after a tattooed Virgo.









Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale

2009-04-07 Thread Thorsten Glaser
Roger Leigh dixit:

But, does it not reject non 7-bit input in the C
locale for completeness?

No, it doesn't - we (before my time though, I think) fought
hard for eight-bit transparency and eight-bit cleanliness.

Should tools doing raw I/O not be using lower level interfaces
such as fread() and fwrite()

These too are affected.

//mirabilos
-- 
“It is inappropriate to require that a time represented as
 seconds since the Epoch precisely represent the number of
 seconds between the referenced time and the Epoch.”
-- IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2





Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale

2009-04-07 Thread Roger Leigh
On Tue, Apr 07, 2009 at 10:01:16PM +, Thorsten Glaser wrote:
 Roger Leigh dixit:
 
 But, does it not reject non 7-bit input in the C
 locale for completeness?
 
 No, it doesn't - we (before my time though, I think) fought
 hard for eight-bit transparency and eight-bit cleanliness.
 
 Should tools doing raw I/O not be using lower level interfaces
 such as fread() and fwrite()
 
 These too are affected.

Are you sure?  The documentation does not suggest they are affected
by locale.  These functions are operating on binary objects, and
should not be affected by the locale.  From SUSv3:

fwrite - binary output
The fwrite() function shall write, from the array pointed to by ptr, up to
nitems elements whose size is specified by size, to the stream pointed to by
stream. For each object, size calls shall be made to the fputc() function,
taking the values (in order) from an array of unsigned char exactly overlaying
the object.

And for fputc

fputc - put a byte on a stream
The fputc() function shall write the byte specified by c (converted to an
unsigned char) to the output stream pointed to by stream, at the position
indicated by the associated file-position indicator for the stream (if
defined), and shall advance the indicator appropriately. If the file cannot
support positioning requests, or if the stream was opened with append mode, the
byte shall be appended to the output stream.
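
The byte-for-byte model in the quoted text can be sketched as follows. This is an illustration of the specification wording in Python, with an in-memory buffer standing in for a binary stream; it says nothing about how any particular C library behaves in a UTF-8 locale.

```python
import io

payload = bytes(range(256))  # every possible byte value, valid UTF-8 or not

stream = io.BytesIO()        # stands in for a stream opened in binary mode
for value in payload:
    # One single-byte write per byte, mirroring "size calls shall be
    # made to the fputc() function" for each object.
    stream.write(bytes([value]))

written = stream.getvalue()
print(written == payload)
```

Nothing in this model looks at character boundaries, so under the quoted wording there is no point at which a locale's codeset could come into play.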


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?   http://gutenprint.sourceforge.net/
   `-GPG Public Key: 0x25BFB848   Please GPG sign your mail.




Re: Bug#522776: debian-policy: mandate existence of a standardised UTF-8 locale

2009-04-07 Thread Thorsten Glaser
Roger Leigh dixit:

Are you sure?

Not entirely, but I recall fgetc (or was it fgetwc?)
being affected.

//mirabilos
-- 
“It is inappropriate to require that a time represented as
 seconds since the Epoch precisely represent the number of
 seconds between the referenced time and the Epoch.”
-- IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2






Re: does /var/games have to be deleted on purge? (if it's empty..)

2009-04-07 Thread Holger Levsen
Hi,

On Tuesday, 7 April 2009, Paul Wise wrote:
 A single rmdir in every game using /var/games isn't that hard,
 especially since they have to remove the files from there.

I agree and plan to file RC bugs on this. 

(So far 24781 binary packages have been successfully tested in sid and
squeeze, and 369 have failures. Of those, eleven packages keep /var/games
around, and 4 of the eleven also keep other files in /var/games/*, so seven
more RC bugs sound reasonable to me. Plus potentially a few more in packages
not tested.)

Or is RC too much? Or fine now? 


regards,
Holger

