Re: LC_CTYPE=UTF-8
iso 30112 has a 10646 i18n locale which is implemented in glibc, and which should be used.

keld

On Sun, Dec 13, 2020 at 04:35:08AM +, Thorsten Glaser via austin-group-l at The Open Group wrote:
> Martijn Dekker dixit:
>
> > Datapoint: AT&T ksh93 has a C.UTF-8 locale built in which it handles
> > internally. I suppose that could be considered a precedent of sorts.
>
> Datapoint:
>
> I proposed to add a C.UTF-8 locale to eglibc via Debian, almost a
> decade ago.
>
> Debian’s GNU libc packages have been shipping C.UTF-8 since about
> 2013 or so, and there have been numerous problems with its precise
> implementation (as it turns out that it’s *extremely* tricky to
> implement correctly, FSVO correct).
>
> Please contact Aurélien Jarno, who has been maintaining this, for
> any detail information on this, how this has evolved and why and
> consider to standardise this existing variant (possibly bugfixing
> it where, if at all, necessary).
>
> bye,
> //mirabilos
> --
> 22:20⎜ The crazy that persists in his craziness becomes a master
> 22:21⎜ And the distance between the craziness and geniality is
> only measured by the success 18:35⎜ "Psychotics are consistently
> inconsistent. The essence of sanity is to be inconsistently inconsistent
Re: LC_CTYPE=UTF-8
Martijn Dekker dixit:

> Datapoint: AT&T ksh93 has a C.UTF-8 locale built in which it handles
> internally. I suppose that could be considered a precedent of sorts.

Datapoint:

I proposed to add a C.UTF-8 locale to eglibc via Debian, almost a decade ago.

Debian’s GNU libc packages have been shipping C.UTF-8 since about 2013 or so, and there have been numerous problems with its precise implementation (as it turns out that it’s *extremely* tricky to implement correctly, FSVO correct).

Please contact Aurélien Jarno, who has been maintaining this, for any detail information on this, how this has evolved and why and consider to standardise this existing variant (possibly bugfixing it where, if at all, necessary).

bye,
//mirabilos
--
22:20⎜ The crazy that persists in his craziness becomes a master
22:21⎜ And the distance between the craziness and geniality is only measured by the success
18:35⎜ "Psychotics are consistently inconsistent. The essence of sanity is to be inconsistently inconsistent
RE: LC_CTYPE=UTF-8
> -----Original Message-----
> From: Ingo Schwarze
> Sent: Thursday, June 25, 2020 21:25
> To: Alan Coopersmith
> Cc: Hans Åberg; Austin Group
> Subject: Re: LC_CTYPE=UTF-8
>
> Hi Alan,
>
> Alan Coopersmith wrote on Thu, Jun 25, 2020 at 12:13:33PM -0700:
> > On 6/25/20 8:31 AM, Ingo Schwarze wrote:
>
> >> Whether to standardize only C.UTF-8 or both C.UTF-8 and POSIX.UTF-8
> >> as synonyms looks a bit like asking for the best colour of a bikeshed.
> >> Given that the standard already contains the redundancy of requiring
> >> both "C" and "POSIX", maybe it is more consistent to also require
> >> both "C.UTF-8" and "POSIX.UTF-8", but i don't think that matters
> >> greatly.
>
> > The only thought I had along those lines was that I thought the "C"
> > locale came from the C standard, and might be best left to the C
> > committee to standardize, while this group controls the "POSIX"
> > locale definition. I suspect those following the POSIX standards
> > would end up implementing both, regardless of which specification
> > defines each.

My impression is that the C standard shied away from all concrete character-encoding issues, at least originally, when alternatives such as EBCDIC were still quite relevant. Although support for multibyte and wide characters was introduced, this was done in a very abstract way; I don't recall any mention of explicit encodings such as ASCII.

As such, I think it would be fine for POSIX to standardize both POSIX.UTF-8 and C.UTF-8; I'd expect little opposition from the C standard committee to such a move. (Honestly, I don't know if the Microsoft Visual C library supports a C.UTF-8 locale at the moment -- I'm pretty sure their system call level is still UTF-16).

TL;DR: for consistency, I'd prefer POSIX to define C.UTF-8 as well as POSIX.UTF-8, even without explicit blessing by the C committee. I don't think they reserved parts of the locale namespace for themselves.

--
Konrad Schwarz
Re: LC_CTYPE=UTF-8
There are plans for this, having a POSIX.UTF-8 locale as an XSI base requirement. There may be POSIX.UTF-E and UTF-I locales too; same features, simply different charmaps. As options there may even be, albeit this is unlikely as no platform I'm aware of fully supports ISO-6429 now, a POSIX.ISO-7 and POSIX.ISO-8 specification as well.

Because C11 and C17 are fundamentally broken, with only a minimal partial fix slated for C2x, there are no viable plans for a C.UTF-8 or C.UTF-E proposal that I've ever seen. However, the way the standard is written now only the repertoire that transforms to a single byte encoding may be used, and is what the C2x fix limits itself to. This is effectively normative support only of ASCII-68, not ISO-646 or 10646.

Expanding support to include some of the 2 byte graphic repertoire is already permitted by POSIX, but not required. Making allowances for most of the UCS2 repertoire is fairly easy, including its 3 byte UTF-8 representations, but the text for this, and the significant changes for the 4 byte form needed for full UCS-4 and UTF-16 support, is still to be proposed.

The point is it is still too early, in my opinion, to say what additional capabilities these locales will provide to applications to ease multi-lingual portability. Of the four choices I see the second or third as the minimum desirable. The industry as a whole needs to communicate how much of Unicode they want to be supported in Issue 8 or they will be stuck with the minimum represented by ASCII-68. Whatever is decided upon, bug fixes and breaking changes to non-portable aspects of existing implementations to be conforming to the final formal specification of the locale are to be expected.

On Thursday, June 25, 2020 Ingo Schwarze wrote:

Hi Alan,

Alan Coopersmith wrote on Thu, Jun 25, 2020 at 07:59:39AM -0700:
> On 6/25/20 6:33 AM, Hans Aberg wrote:

>> Perhaps there should be a default UTF-8 locale: It seems that the
>> current construct does not apply so well to it.

> If the goal is to standardize existing behavior the standard could define
> the C.UTF-8 locale (or perhaps a POSIX.UTF-8 locale) that a number of
> systems already have, which is the standard C/POSIX locale with just the
> character set changed to UTF-8 instead.

This idea makes a lot of sense to me.

If the Austin Group decides that it wants to go into that direction, i would make sure that both OpenBSD and the software i publish use that name for a locale with these properties and consistently recommend using that name. Both already support a locale with these properties and select it if the user asks for C.UTF-8 or POSIX.UTF-8, but so far, they recommend that users specify en_US.UTF-8 (for historical reasons), which is a bit unfortunate because it looks like requesting cultural conventions for a particular country, which is not the intention.

Whether to standardize only C.UTF-8 or both C.UTF-8 and POSIX.UTF-8 as synonyms looks a bit like asking for the best colour of a bikeshed. Given that the standard already contains the redundancy of requiring both "C" and "POSIX", maybe it is more consistent to also require both "C.UTF-8" and "POSIX.UTF-8", but i don't think that matters greatly.

Yours,
Ingo
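
For readers following the byte-length argument in the message above (3-byte UTF-8 forms for most of the UCS-2 repertoire, 4-byte forms for the rest of UCS-4), here is a small Python sketch of how the encoded length depends on the code point. The function name utf8_length is made up for illustration and does not come from any of the standards under discussion; the thresholds are the usual UTF-8 boundaries.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for a single code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if code_point <= 0x7F:      # ASCII range
        return 1
    if code_point <= 0x7FF:     # e.g. Latin-1 supplement, Greek, Cyrillic
        return 2
    if code_point <= 0xFFFF:    # rest of the Basic Multilingual Plane
        return 3
    return 4                    # supplementary planes


for cp in (0x41, 0xC5, 0x20AC, 0x1F600):
    # Cross-check the table against Python's own UTF-8 encoder.
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")
```

A locale limited to single-byte characters only covers the first branch; everything else needs multibyte handling in the LC_CTYPE machinery.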
Re: LC_CTYPE=UTF-8
The locale requirements specified in the C standard are what is applicable for implementations that limit their character encoding to the basic source and execution character sets. POSIX requires implementations to support, in at least one provided charmap, the superset of the basic sets represented by the portable character set. The C standard makes allowance for this, with extended character sets, as also being conforming so the use of "C" as a synonym for "POSIX" is permitted. The use is required only when a platform is configured to operate in a POSIX conforming mode, as well, so an implementation electing to have separate C and POSIX definitions is plausible still.

If the C standard ever does make a new requirement that conflicts with what is specified now for the POSIX locale then the likelihood is 2 separate locales will be a new requirement in a future Issue, no longer an election, to retain backwards compatibility.

On Thursday, June 25, 2020 Martijn Dekker wrote:

On 25-06-20 at 21:13, Alan Coopersmith wrote:
> The only thought I had along those lines was that I thought the "C"
> locale came from the C standard, and might be best left to the C
> committee to standardize, while this group controls the "POSIX"
> locale definition.

Actually, as far as POSIX is concerned, the two are synonymous.

XBD 7.2 "POSIX Locale":
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_02
| Conforming systems shall provide a POSIX locale, also known as the C
| locale. In POSIX.1 the requirements for the POSIX locale are more
| extensive than the requirements for the C locale as specified in the ISO
| C standard. However, in a conforming POSIX implementation, the POSIX
| locale and the C locale are identical.

--
|| modernish -- harness the shell
|| https://github.com/modernish/modernish
||
|| KornShell lives!
|| https://github.com/ksh93/ksh
Re: LC_CTYPE=UTF-8
On 25-06-20 at 21:13, Alan Coopersmith wrote:
> The only thought I had along those lines was that I thought the "C"
> locale came from the C standard, and might be best left to the C
> committee to standardize, while this group controls the "POSIX"
> locale definition.

Actually, as far as POSIX is concerned, the two are synonymous.

XBD 7.2 "POSIX Locale":
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_02
| Conforming systems shall provide a POSIX locale, also known as the C
| locale. In POSIX.1 the requirements for the POSIX locale are more
| extensive than the requirements for the C locale as specified in the ISO
| C standard. However, in a conforming POSIX implementation, the POSIX
| locale and the C locale are identical.

--
|| modernish -- harness the shell
|| https://github.com/modernish/modernish
||
|| KornShell lives!
|| https://github.com/ksh93/ksh
Re: LC_CTYPE=UTF-8
Hi Alan,

Alan Coopersmith wrote on Thu, Jun 25, 2020 at 12:13:33PM -0700:
> On 6/25/20 8:31 AM, Ingo Schwarze wrote:

>> Whether to standardize only C.UTF-8 or both C.UTF-8 and POSIX.UTF-8
>> as synonyms looks a bit like asking for the best colour of a bikeshed.
>> Given that the standard already contains the redundancy of requiring
>> both "C" and "POSIX", maybe it is more consistent to also require
>> both "C.UTF-8" and "POSIX.UTF-8", but i don't think that matters
>> greatly.

> The only thought I had along those lines was that I thought the "C"
> locale came from the C standard, and might be best left to the C
> committee to standardize, while this group controls the "POSIX"
> locale definition. I suspect those following the POSIX standards
> would end up implementing both, regardless of which specification
> defines each.

That sounds quite reasonable to me.

Yours,
Ingo
Re: LC_CTYPE=UTF-8
On 6/25/20 8:31 AM, Ingo Schwarze wrote:
> Whether to standardize only C.UTF-8 or both C.UTF-8 and POSIX.UTF-8
> as synonyms looks a bit like asking for the best colour of a bikeshed.
> Given that the standard already contains the redundancy of requiring
> both "C" and "POSIX", maybe it is more consistent to also require
> both "C.UTF-8" and "POSIX.UTF-8", but i don't think that matters
> greatly.

The only thought I had along those lines was that I thought the "C" locale came from the C standard, and might be best left to the C committee to standardize, while this group controls the "POSIX" locale definition. I suspect those following the POSIX standards would end up implementing both, regardless of which specification defines each.

--
-Alan Coopersmith- alan.coopersm...@oracle.com
Oracle Solaris Engineering - https://blogs.oracle.com/alanc
Re: LC_CTYPE=UTF-8
On 25-06-20 at 16:59, Alan Coopersmith wrote:
> On 6/25/20 6:33 AM, Hans Åberg wrote:
>> Perhaps there should be a default UTF-8 locale: It seems that the
>> current construct does not apply so well to it.
>
> If the goal is to standardize existing behavior the standard could define
> the C.UTF-8 locale (or perhaps a POSIX.UTF-8 locale) that a number of
> systems already have, which is the standard C/POSIX locale with just the
> character set changed to UTF-8 instead.

Datapoint: AT&T ksh93 has a C.UTF-8 locale built in which it handles internally. I suppose that could be considered a precedent of sorts.

- M.

--
|| modernish -- harness the shell
|| https://github.com/modernish/modernish
||
|| KornShell lives!
|| https://github.com/ksh93/ksh
Re: LC_CTYPE=UTF-8
Hi Alan,

Alan Coopersmith wrote on Thu, Jun 25, 2020 at 07:59:39AM -0700:
> On 6/25/20 6:33 AM, Hans Aberg wrote:

>> Perhaps there should be a default UTF-8 locale: It seems that the
>> current construct does not apply so well to it.

> If the goal is to standardize existing behavior the standard could define
> the C.UTF-8 locale (or perhaps a POSIX.UTF-8 locale) that a number of
> systems already have, which is the standard C/POSIX locale with just the
> character set changed to UTF-8 instead.

This idea makes a lot of sense to me.

If the Austin Group decides that it wants to go into that direction, i would make sure that both OpenBSD and the software i publish use that name for a locale with these properties and consistently recommend using that name. Both already support a locale with these properties and select it if the user asks for C.UTF-8 or POSIX.UTF-8, but so far, they recommend that users specify en_US.UTF-8 (for historical reasons), which is a bit unfortunate because it looks like requesting cultural conventions for a particular country, which is not the intention.

Whether to standardize only C.UTF-8 or both C.UTF-8 and POSIX.UTF-8 as synonyms looks a bit like asking for the best colour of a bikeshed. Given that the standard already contains the redundancy of requiring both "C" and "POSIX", maybe it is more consistent to also require both "C.UTF-8" and "POSIX.UTF-8", but i don't think that matters greatly.

Yours,
Ingo
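
Until a name such as C.UTF-8 is standardized, portable software that wants "the C locale, but with UTF-8" has to probe. Below is a hedged Python sketch of that probing, using the standard locale module as a thin wrapper around setlocale(). The helper name first_available_locale and the candidate order are assumptions made for illustration; which names succeed is implementation-defined, so no particular result is promised.

```python
import locale


def first_available_locale(candidates):
    """Return the first name setlocale() accepts for LC_CTYPE, else None."""
    saved = locale.setlocale(locale.LC_CTYPE, None)  # query, do not change
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue  # this system does not accept the name
            return name
        return None
    finally:
        # Put the process back the way we found it.
        locale.setlocale(locale.LC_CTYPE, saved)


# Prefer the culture-neutral names, fall back to en_US.UTF-8, then plain "C".
print(first_available_locale(["C.UTF-8", "POSIX.UTF-8", "en_US.UTF-8", "C"]))
```

The fallback to en_US.UTF-8 is exactly the historical wart described above: it works on many systems but reads as a request for US cultural conventions.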
Re: LC_CTYPE=UTF-8
On 6/25/20 6:33 AM, Hans Åberg wrote:
> Perhaps there should be a default UTF-8 locale: It seems that the
> current construct does not apply so well to it.

If the goal is to standardize existing behavior the standard could define the C.UTF-8 locale (or perhaps a POSIX.UTF-8 locale) that a number of systems already have, which is the standard C/POSIX locale with just the character set changed to UTF-8 instead.

--
-Alan Coopersmith- alan.coopersm...@oracle.com
Oracle Solaris Engineering - https://blogs.oracle.com/alanc
Re: LC_CTYPE=UTF-8
> On 25 Jun 2020, at 15:19, Ingo Schwarze wrote:
>
> Hans Aberg wrote on Thu, Jun 25, 2020 at 10:15:03AM +0200:
>
>> MacOS sets as default LC_CTYPE=UTF-8, not appearing in the 'locale
>> -a' list. Then some software interprets this as though the locale
>> is C/POSIX, disregards the UTF-8 encoding, and converts all non-ASCII
>> (high bit set) char's into octal escape sequences. What is the
>> correct interpretation here?
>
> The correct interpretation of "LC_CTYPE=UTF-8" is whatever the
> documentation of the respective operating system says.
> All POSIX says is:
>
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/setlocale.html
>
> The locale argument is a pointer to a character string containing
> the required setting of category. The contents of this string are
> implementation-defined.
>
> POSIX only specifies the meaning of the strings "C" and "POSIX";
> any others are implementation-defined.

This is also what I thought. As for your other comments, rather than checking for a particular syntax, it seems that the particular software checks the 'locale -a' list, and if the name is not there, applies the C/POSIX locale.

Perhaps there should be a default UTF-8 locale: It seems that the current construct does not apply so well to it.
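
To make the symptom concrete, here is a minimal Python sketch of what such software effectively does once it has fallen back to the C/POSIX locale: every byte with the high bit set is replaced by an octal escape. The function name escape_high_bytes is hypothetical and this is not the code of any particular program, only an illustration of the behaviour described above.

```python
def escape_high_bytes(data: bytes) -> str:
    """Render bytes as text, octal-escaping every byte >= 0x80."""
    out = []
    for byte in data:
        if byte < 0x80:
            out.append(chr(byte))          # plain ASCII passes through
        else:
            out.append(f"\\{byte:03o}")    # high bit set: \ooo escape
    return "".join(out)


name = "Åberg".encode("utf-8")   # "Å" is the two bytes 0xC3 0x85 in UTF-8
print(escape_high_bytes(name))   # prints \303\205berg
print(name.decode("utf-8"))      # what a UTF-8 aware program would show
```

The data itself is valid UTF-8 in both cases; only the program's belief about LC_CTYPE differs.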
Re: LC_CTYPE=UTF-8
Hi Hans,

Hans Aberg wrote on Thu, Jun 25, 2020 at 10:15:03AM +0200:

> MacOS sets as default LC_CTYPE=UTF-8, not appearing in the 'locale
> -a' list. Then some software interprets this as though the locale
> is C/POSIX, disregards the UTF-8 encoding, and converts all non-ASCII
> (high bit set) char's into octal escape sequences. What is the
> correct interpretation here?

The correct interpretation of "LC_CTYPE=UTF-8" is whatever the documentation of the respective operating system says. All POSIX says is:

  https://pubs.opengroup.org/onlinepubs/9699919799/functions/setlocale.html

  The locale argument is a pointer to a character string containing
  the required setting of category. The contents of this string are
  implementation-defined.

POSIX only specifies the meaning of the strings "C" and "POSIX"; any others are implementation-defined.

For example, the OpenBSD manual page says:

  https://man.openbsd.org/setlocale.3

  The syntax and semantics of the locale argument are not standardized
  and vary among operating systems. On OpenBSD, if the locale string
  ends with ".UTF-8", the UTF-8 locale is selected; otherwise, the "C"
  locale is selected, which uses the ASCII character set. If the locale
  contains a dot but does not end with ".UTF-8", setlocale() fails.

Which is indeed true here:

   $ uname -a
  OpenBSD isnote.usta.de 6.7 GENERIC.MP#224 amd64
   $ LC_CTYPE=FOOBAR.UTF-8 locale charmap
  UTF-8
   $ LC_CTYPE=UTF-8 locale charmap
  US-ASCII

To the best of my knowledge, we are POSIX-compliant in this respect. Other systems are of course free to make different choices.

Even though POSIX says this is implementation-defined, which implies that operating systems are expected to document their specific rules, some fail to do so, for example:

  https://man.bsd.lv/FreeBSD-12.0/setlocale.3
  https://man.bsd.lv/NetBSD-8.1/setlocale.3

Some do specify it. For example, according to

  https://man.bsd.lv/Linux-5.06/setlocale.3

the string "UTF-8" would be invalid because it lacks the "language" part which is mandatory on Linux. For example, on a very old Linux system i have access to:

   $ uname -a
  Linux donnerwolke.asta.kit.edu 4.9.0-0.bpo.3-686 #1 SMP \
    Debian 4.9.30-2+deb9u5~bpo8+1 (2017-09-28) i686 GNU/Linux
   $ LC_CTYPE=en_US.UTF-8 locale charmap
  UTF-8
   $ LC_CTYPE=UTF-8 locale charmap
  locale: Cannot set LC_CTYPE to default locale: No such file or directory
  locale: Cannot set LC_ALL to default locale: No such file or directory
  ANSI_X3.4-1968

Yours,
Ingo
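
The OpenBSD rule quoted from the manual page in this message is simple enough to write down as a tiny decision function. The Python sketch below models only that quoted sentence; it is not OpenBSD's actual implementation, and the function name openbsd_ctype_charmap is invented for illustration.

```python
def openbsd_ctype_charmap(locale_name: str):
    """Model of the documented OpenBSD rule for LC_CTYPE locale strings.

    Returns the resulting charmap name, or None where the manual says
    setlocale() fails.
    """
    if locale_name.endswith(".UTF-8"):
        return "UTF-8"        # anything ending in ".UTF-8" selects UTF-8
    if "." in locale_name:
        return None           # a dot, but not ".UTF-8": setlocale() fails
    return "US-ASCII"         # everything else falls back to the "C" locale


for name in ("FOOBAR.UTF-8", "UTF-8", "C", "en_US.ISO8859-1"):
    print(f"{name!r:20} -> {openbsd_ctype_charmap(name)}")
```

The first two inputs reproduce the OpenBSD transcript: FOOBAR.UTF-8 maps to UTF-8, while the bare string UTF-8 contains no dot and therefore lands in the "C" locale with US-ASCII.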