Re: LC_CTYPE=UTF-8

2020-12-12 Thread k...@keldix.com via austin-group-l at The Open Group
ISO 30112 has an ISO 10646 i18n locale, which is implemented in glibc and which
should be used.

keld

On Sun, Dec 13, 2020 at 04:35:08AM +, Thorsten Glaser via austin-group-l at 
The Open Group wrote:
> Martijn Dekker dixit:
> 
> > Datapoint: AT&T ksh93 has a C.UTF-8 locale built in which it handles
> > internally. I suppose that could be considered a precedent of sorts.
> 
> Datapoint:
> 
> I proposed to add a C.UTF-8 locale to eglibc via Debian, almost a
> decade ago.
> 
> Debian’s GNU libc packages have been shipping C.UTF-8 since about
> 2013 or so, and there have been numerous problems with its precise
> implementation (as it turns out that it’s *extremely* tricky to
> implement correctly, FSVO correct).
> 
> Please contact Aurélien Jarno, who has been maintaining this, for
> any detailed information on this, how this has evolved and why, and
> consider standardising this existing variant (possibly bugfixing
> it where, if at all, necessary).
> 
> bye,
> //mirabilos
> -- 
> 22:20⎜ The crazy that persists in his craziness becomes a master
> 22:21⎜ And the distance between the craziness and geniality is
> only measured by the success 18:35⎜ "Psychotics are consistently
> inconsistent. The essence of sanity is to be inconsistently inconsistent



Re: LC_CTYPE=UTF-8

2020-12-12 Thread Thorsten Glaser via austin-group-l at The Open Group
Martijn Dekker dixit:

> Datapoint: AT&T ksh93 has a C.UTF-8 locale built in which it handles
> internally. I suppose that could be considered a precedent of sorts.

Datapoint:

I proposed to add a C.UTF-8 locale to eglibc via Debian, almost a
decade ago.

Debian’s GNU libc packages have been shipping C.UTF-8 since about
2013 or so, and there have been numerous problems with its precise
implementation (as it turns out that it’s *extremely* tricky to
implement correctly, FSVO correct).

Please contact Aurélien Jarno, who has been maintaining this, for
any detailed information on this, how this has evolved and why, and
consider standardising this existing variant (possibly bugfixing
it where, if at all, necessary).

bye,
//mirabilos
-- 
22:20⎜ The crazy that persists in his craziness becomes a master
22:21⎜ And the distance between the craziness and geniality is
only measured by the success 18:35⎜ "Psychotics are consistently
inconsistent. The essence of sanity is to be inconsistently inconsistent



RE: LC_CTYPE=UTF-8

2020-06-26 Thread Schwarz, Konrad
> -----Original Message-----
> From: Ingo Schwarze 
> Sent: Thursday, June 25, 2020 21:25
> To: Alan Coopersmith 
> Cc: Hans Åberg ; Austin Group 
> 
> Subject: Re: LC_CTYPE=UTF-8
> 
> Hi Alan,
> 
> Alan Coopersmith wrote on Thu, Jun 25, 2020 at 12:13:33PM -0700:
> > On 6/25/20 8:31 AM, Ingo Schwarze wrote:
> 
> >> Whether to standardize only C.UTF-8 or both C.UTF-8 and POSIX.UTF-8
> >> as synonyms looks a bit like asking for the best colour of a bikeshed.
> >> Given that the standard already contains the redundancy of requiring
> >> both "C" and "POSIX", maybe it is more consistent to also require
> >> both "C.UTF-8" and "POSIX.UTF-8", but i don't think that matters
> >> greatly.
> 
> > The only thought I had along those lines was that I thought the "C"
> > locale came from the C standard, and might be best left to the C
> > committee to standardize, while this group controls the "POSIX"
> > locale definition.  I suspect those following the POSIX standards
> > would end up implementing both, regardless of which specification
> > defines each.

My impression is that the C standard shied away from all
concrete character-encoding issues, at least originally, when
alternatives such as EBCDIC were still quite relevant.
Although support for multibyte and wide characters was introduced,
this was done in a very abstract way;
I don't recall any mention of explicit encodings such as ASCII.

As such, I think it would be fine for POSIX to standardize
both POSIX.UTF-8 and C.UTF-8; I'd expect little
opposition from the C standard committee to such a move.

(Honestly, I don't know if the Microsoft Visual C library
supports a C.UTF-8 locale at the moment -- I'm pretty
sure their system call level is still UTF-16).

TL;DR: for consistency, I'd prefer POSIX to define C.UTF-8
as well as POSIX.UTF-8, even without explicit blessing by
the C committee.  I don't think they reserved parts
of the locale namespace for themselves.

--
Konrad Schwarz



Re: LC_CTYPE=UTF-8

2020-06-25 Thread shwaresyst

There are plans for this, having a POSIX.UTF-8 locale as an XSI base 
requirement. There may be POSIX.UTF-E and UTF-I locales too; same features, 
simply the different charmaps. As options there may even be, albeit this is 
unlikely as no platform I'm aware of fully supports ISO-6429 now, a POSIX.ISO-7 
and POSIX.ISO-8 specification as well. Because C11 and C17 are fundamentally 
broken, with only a minimal partial fix slated for C2x, there are no viable 
plans for a C.UTF-8 or C.UTF-E proposal that I've ever seen. 

However, the way the standard is written now, only the repertoire that 
transforms to a single byte encoding may be used, and that is what the C2x fix 
limits itself to. This is effectively normative support only of ASCII-68, not 
ISO-646 or 10646. Expanding support to include some of the 2 byte graphic 
repertoire is already permitted by POSIX, but not required. Making allowances 
for most of the UCS2 repertoire is fairly easy, including its 3 byte UTF-8 
representations, but the text for this, and the significant changes for the 4 
byte form needed for full UCS-4 and UTF-16 support, is still to be proposed.
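
To make the byte counts mentioned above concrete, here is a minimal Python
sketch that reports how many bytes UTF-8 needs for one sample code point in
each range. The helper name utf8_len and the sample code points are chosen
purely for illustration; none of this is taken from the standard's text.

```python
def utf8_len(code_point):
    """Number of bytes UTF-8 uses to encode one Unicode code point."""
    return len(chr(code_point).encode("utf-8"))


# Illustrative samples only (an assumption of this sketch, not from the thread).
SAMPLES = [
    (0x0041, "ASCII letter A"),
    (0x00C5, "Latin capital A with ring above"),
    (0x20AC, "euro sign, still inside the UCS-2 range"),
    (0x1F600, "emoji outside UCS-2; needs UCS-4 or a UTF-16 surrogate pair"),
]

if __name__ == "__main__":
    for cp, label in SAMPLES:
        print("U+%04X: %d byte(s), %s" % (cp, utf8_len(cp), label))
```

The single-byte case is the only one the current text supports normatively;
the 2, 3 and 4 byte cases correspond to the successive expansions discussed
above.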

The point is it is still too early, in my opinion, to say what additional 
capabilities these locales will provide to applications to ease multi-lingual 
portability. Of the four choices I see the second or third as the minimum 
desirable. The industry as a whole needs to communicate how much of Unicode 
they want to be supported in Issue 8 or they will be stuck with the minimal 
represented by ASCII-68. Whatever is decided upon, bug fixes and breaking 
changes to non-portable aspects of existing implementations to be conforming to 
the final formal specification of the locale are to be expected.

On Thursday, June 25, 2020 Ingo Schwarze wrote:
Hi Alan,

Alan Coopersmith wrote on Thu, Jun 25, 2020 at 07:59:39AM -0700:
> On 6/25/20 6:33 AM, Hans Aberg wrote:

>> Perhaps there should be a default UTF-8 locale: It seems that the
>> current construct does not apply so well to it.

> If the goal is to standardize existing behavior the standard could define
> the C.UTF-8 locale (or perhaps a POSIX.UTF-8 locale) that a number of
> systems already have, which is the standard C/POSIX locale with just the
> character set changed to UTF-8 instead.

This idea makes a lot of sense to me.

If the Austin Group decides that it wants to go into that direction,
i would make sure that both OpenBSD and the software i publish use
that name for a locale with these properties and consistently
recommend using that name.  Both already support a locale with these
properties and select it if the user asks for C.UTF-8 or POSIX.UTF-8,
but so far, they recommend that users specify en_US.UTF-8 (for
historical reasons), which is a bit unfortunate because it looks
like requesting cultural conventions for a particular country, which
is not the intention.

Whether to standardize only C.UTF-8 or both C.UTF-8 and POSIX.UTF-8
as synonyms looks a bit like asking for the best colour of a bikeshed.
Given that the standard already contains the redundancy of requiring
both "C" and "POSIX", maybe it is more consistent to also require
both "C.UTF-8" and "POSIX.UTF-8", but i don't think that matters
greatly.

Yours,
  Ingo



Re: LC_CTYPE=UTF-8

2020-06-25 Thread shwaresyst

The locale requirements specified in the C standard are what is applicable for 
implementations that limit their character encoding to the basic source and 
execution character sets. POSIX requires implementations to support, in at 
least one provided charmap, the superset of the basic sets represented by the 
portable character set. The C standard makes allowance for this, with extended 
character sets, as also being conforming so the use of "C" as synonym for 
"POSIX" is permitted. The use is required only when a platform is configured to 
operate in a POSIX conforming mode, as well, so an implementation electing to 
have separate C and POSIX definitions is plausible still. If the C standard 
ever does make a new requirement that conflicts with what is specified now for 
the POSIX locale then the likelihood is 2 separate locales will be a new 
requirement in a future Issue, no longer an election, to retain backwards 
compatibility. 

On Thursday, June 25, 2020 Martijn Dekker wrote:
On 25-06-20 at 21:13, Alan Coopersmith wrote:
> The only thought I had along those lines was that I thought the "C" 
> locale came from the C standard, and might be best left to the C 
> committee to standardize, while this group controls the "POSIX" 
> locale definition.

Actually, as far as POSIX is concerned, the two are synonymous.

XBD 7.2 "POSIX Locale":
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_02
| Conforming systems shall provide a POSIX locale, also known as the C
| locale. In POSIX.1 the requirements for the POSIX locale are more
| extensive than the requirements for the C locale as specified in the ISO
| C standard. However, in a conforming POSIX implementation, the POSIX
| locale and the C locale are identical.


-- 
||    modernish -- harness the shell
||    https://github.com/modernish/modernish
||
||    KornShell lives!
||    https://github.com/ksh93/ksh



Re: LC_CTYPE=UTF-8

2020-06-25 Thread Martijn Dekker

On 25-06-20 at 21:13, Alan Coopersmith wrote:
> The only thought I had along those lines was that I thought the "C"
> locale came from the C standard, and might be best left to the C
> committee to standardize, while this group controls the "POSIX"
> locale definition.


Actually, as far as POSIX is concerned, the two are synonymous.

XBD 7.2 "POSIX Locale":
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_02
| Conforming systems shall provide a POSIX locale, also known as the C
| locale. In POSIX.1 the requirements for the POSIX locale are more
| extensive than the requirements for the C locale as specified in the ISO
| C standard. However, in a conforming POSIX implementation, the POSIX
| locale and the C locale are identical.


--
||  modernish -- harness the shell
||  https://github.com/modernish/modernish
||
||  KornShell lives!
||  https://github.com/ksh93/ksh



Re: LC_CTYPE=UTF-8

2020-06-25 Thread Ingo Schwarze
Hi Alan,

Alan Coopersmith wrote on Thu, Jun 25, 2020 at 12:13:33PM -0700:
> On 6/25/20 8:31 AM, Ingo Schwarze wrote:

>> Whether to standardize only C.UTF-8 or both C.UTF-8 and POSIX.UTF-8
>> as synonyms looks a bit like asking for the best colour of a bikeshed.
>> Given that the standard already contains the redundancy of requiring
>> both "C" and "POSIX", maybe it is more consistent to also require
>> both "C.UTF-8" and "POSIX.UTF-8", but i don't think that matters
>> greatly.

> The only thought I had along those lines was that I thought the "C"
> locale came from the C standard, and might be best left to the C
> committee to standardize, while this group controls the "POSIX"
> locale definition.  I suspect those following the POSIX standards
> would end up implementing both, regardless of which specification
> defines each.

That sounds quite reasonable to me.

Yours,
  Ingo



Re: LC_CTYPE=UTF-8

2020-06-25 Thread Alan Coopersmith

On 6/25/20 8:31 AM, Ingo Schwarze wrote:

> Whether to standardize only C.UTF-8 or both C.UTF-8 and POSIX.UTF-8
> as synonyms looks a bit like asking for the best colour of a bikeshed.
> Given that the standard already contains the redundancy of requiring
> both "C" and "POSIX", maybe it is more consistent to also require
> both "C.UTF-8" and "POSIX.UTF-8", but i don't think that matters
> greatly.


The only thought I had along those lines was that I thought the "C"
locale came from the C standard, and might be best left to the C
committee to standardize, while this group controls the "POSIX"
locale definition.  I suspect those following the POSIX standards
would end up implementing both, regardless of which specification
defines each.

--
-Alan Coopersmith-   alan.coopersm...@oracle.com
 Oracle Solaris Engineering - https://blogs.oracle.com/alanc



Re: LC_CTYPE=UTF-8

2020-06-25 Thread Martijn Dekker

On 25-06-20 at 16:59, Alan Coopersmith wrote:
> On 6/25/20 6:33 AM, Hans Åberg wrote:
>> Perhaps there should be a default UTF-8 locale: It seems that the
>> current construct does not apply so well to it.
>
> If the goal is to standardize existing behavior the standard could define
> the C.UTF-8 locale (or perhaps a POSIX.UTF-8 locale) that a number of
> systems already have, which is the standard C/POSIX locale with just the
> character set changed to UTF-8 instead.


Datapoint: AT&T ksh93 has a C.UTF-8 locale built in which it handles 
internally. I suppose that could be considered a precedent of sorts.


- M.

--
||  modernish -- harness the shell
||  https://github.com/modernish/modernish
||
||  KornShell lives!
||  https://github.com/ksh93/ksh



Re: LC_CTYPE=UTF-8

2020-06-25 Thread Ingo Schwarze
Hi Alan,

Alan Coopersmith wrote on Thu, Jun 25, 2020 at 07:59:39AM -0700:
> On 6/25/20 6:33 AM, Hans Aberg wrote:

>> Perhaps there should be a default UTF-8 locale: It seems that the
>> current construct does not apply so well to it.

> If the goal is to standardize existing behavior the standard could define
> the C.UTF-8 locale (or perhaps a POSIX.UTF-8 locale) that a number of
> systems already have, which is the standard C/POSIX locale with just the
> character set changed to UTF-8 instead.

This idea makes a lot of sense to me.

If the Austin Group decides that it wants to go into that direction,
i would make sure that both OpenBSD and the software i publish use
that name for a locale with these properties and consistently
recommend using that name.  Both already support a locale with these
properties and select it if the user asks for C.UTF-8 or POSIX.UTF-8,
but so far, they recommend that users specify en_US.UTF-8 (for
historical reasons), which is a bit unfortunate because it looks
like requesting cultural conventions for a particular country, which
is not the intention.

Whether to standardize only C.UTF-8 or both C.UTF-8 and POSIX.UTF-8
as synonyms looks a bit like asking for the best colour of a bikeshed.
Given that the standard already contains the redundancy of requiring
both "C" and "POSIX", maybe it is more consistent to also require
both "C.UTF-8" and "POSIX.UTF-8", but i don't think that matters
greatly.

Yours,
  Ingo



Re: LC_CTYPE=UTF-8

2020-06-25 Thread Alan Coopersmith

On 6/25/20 6:33 AM, Hans Åberg wrote:

> Perhaps there should be a default UTF-8 locale: It seems that the current
> construct does not apply so well to it.


If the goal is to standardize existing behavior the standard could define
the C.UTF-8 locale (or perhaps a POSIX.UTF-8 locale) that a number of
systems already have, which is the standard C/POSIX locale with just the
character set changed to UTF-8 instead.

--
-Alan Coopersmith-   alan.coopersm...@oracle.com
 Oracle Solaris Engineering - https://blogs.oracle.com/alanc



Re: LC_CTYPE=UTF-8

2020-06-25 Thread Hans Åberg


> On 25 Jun 2020, at 15:19, Ingo Schwarze  wrote:
> 
> Hans Aberg wrote on Thu, Jun 25, 2020 at 10:15:03AM +0200:
> 
>> MacOS sets as default LC_CTYPE=UTF-8, not appearing in the 'locale
>> -a' list. Then some software interprets this as though the locale
>> is C/POSIX, disregards the UTF-8 encoding, and converts all non-ASCII
>> (high bit set) char's into octal escape sequences. What is the
>> correct interpretation here?
> 
> The correct interpretation of "LC_CTYPE=UTF-8" is whatever the
> documentation of the respective operating system says.
> All POSIX says is:
> 
>  https://pubs.opengroup.org/onlinepubs/9699919799/functions/setlocale.html
> 
>  The locale argument is a pointer to a character string containing
>  the required setting of category.  The contents of this string are
>  implementation-defined.
> 
> POSIX only specifies the meaning of the strings "C" and "POSIX";
> any others are implementation-defined.

This is also what I thought. As for your other comments: rather than 
checking for a particular syntax, the software in question seems to check 
the 'locale -a' list and, if the name is not there, apply the C/POSIX locale.
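
The behaviour described above might be modelled roughly as follows. This is
only a guess at what such software does, written in Python for brevity; the
function names effective_locale and octal_escape are hypothetical, the list
of available names stands in for real 'locale -a' output, and real programs
may also escape control characters, which this sketch does not.

```python
def effective_locale(requested, available):
    """Fall back to "C" when the requested name is not in the available list."""
    return requested if requested in available else "C"


def octal_escape(data):
    """Render bytes the way a strict C-locale program might:
    bytes with the high bit set become backslash-octal escapes."""
    return "".join(chr(b) if b < 0x80 else "\\%03o" % b for b in data)


if __name__ == "__main__":
    # Hypothetical stand-in for the output of 'locale -a'.
    available = ["C", "POSIX", "en_US.UTF-8"]
    chosen = effective_locale("UTF-8", available)
    text = "\u00c5berg".encode("utf-8")
    # With "UTF-8" missing from the list, the text is treated as C-locale bytes.
    print(chosen, octal_escape(text) if chosen == "C" else text.decode("utf-8"))
```

Under this model LC_CTYPE=UTF-8 silently degrades to "C", and the two UTF-8
bytes of the first letter are shown as octal escapes.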

Perhaps there should be a default UTF-8 locale: It seems that the current 
construct does not apply so well to it.





Re: LC_CTYPE=UTF-8

2020-06-25 Thread Ingo Schwarze
Hi Hans,

Hans Aberg wrote on Thu, Jun 25, 2020 at 10:15:03AM +0200:

> MacOS sets as default LC_CTYPE=UTF-8, not appearing in the 'locale
> -a' list. Then some software interprets this as though the locale
> is C/POSIX, disregards the UTF-8 encoding, and converts all non-ASCII
> (high bit set) char's into octal escape sequences. What is the
> correct interpretation here?

The correct interpretation of "LC_CTYPE=UTF-8" is whatever the
documentation of the respective operating system says.
All POSIX says is:

  https://pubs.opengroup.org/onlinepubs/9699919799/functions/setlocale.html

  The locale argument is a pointer to a character string containing
  the required setting of category.  The contents of this string are
  implementation-defined.

POSIX only specifies the meaning of the strings "C" and "POSIX";
any others are implementation-defined.

For example, the OpenBSD manual page says:

  https://man.openbsd.org/setlocale.3

  The syntax and semantics of the locale argument are not standardized
  and vary among operating systems.  On OpenBSD, if the locale string
  ends with ".UTF-8", the UTF-8 locale is selected; otherwise, the
  "C" locale is selected, which uses the ASCII character set.  If
  the locale contains a dot but does not end with ".UTF-8", setlocale()
  fails.

Which is indeed true here:

   $ uname -a
  OpenBSD isnote.usta.de 6.7 GENERIC.MP#224 amd64
   $ LC_CTYPE=FOOBAR.UTF-8 locale charmap
  UTF-8
   $ LC_CTYPE=UTF-8 locale charmap  
  US-ASCII
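
The rule quoted from the manual page can be written down as a small model.
This is a Python sketch of the documented rule only, not OpenBSD's actual C
implementation; the name openbsd_charmap is made up for illustration, and
the empty string (which asks setlocale() to consult the environment) is
deliberately not modelled.

```python
def openbsd_charmap(locale_name):
    """Model of the documented rule: return the resulting charmap name,
    or None where the manual page says setlocale() fails."""
    if locale_name.endswith(".UTF-8"):
        return "UTF-8"
    if "." in locale_name:
        # Contains a dot but does not end with ".UTF-8": setlocale() fails.
        return None
    # Anything else selects the "C" locale, which uses ASCII.
    return "US-ASCII"


if __name__ == "__main__":
    for name in ("FOOBAR.UTF-8", "UTF-8", "C", "en_US.ISO8859-1"):
        print(name, "->", openbsd_charmap(name))
```

Note that the bare string "UTF-8" contains no dot, so under this rule it
falls through to the "C" locale, matching the transcript above.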

To the best of my knowledge, we are POSIX-compliant in this respect.
Other systems are of course free to make different choices.

Even though POSIX says this is implementation-defined, which implies
that operating systems are expected to document their specific rules,
some fail to do so, for example:

  https://man.bsd.lv/FreeBSD-12.0/setlocale.3
  https://man.bsd.lv/NetBSD-8.1/setlocale.3

Some do specify it.  For example, according to

  https://man.bsd.lv/Linux-5.06/setlocale.3

the string "UTF-8" would be invalid because it lacks the "language"
part which is mandatory on Linux.

For example, on a very old Linux system i have access to:

   $ uname -a
  Linux donnerwolke.asta.kit.edu 4.9.0-0.bpo.3-686 #1 SMP \
Debian 4.9.30-2+deb9u5~bpo8+1 (2017-09-28) i686 GNU/Linux
   $ LC_CTYPE=en_US.UTF-8 locale charmap
  UTF-8
   $ LC_CTYPE=UTF-8 locale charmap
  locale: Cannot set LC_CTYPE to default locale: No such file or directory
  locale: Cannot set LC_ALL to default locale: No such file or directory
  ANSI_X3.4-1968

Yours,
  Ingo