Re: UTF-8 locale & POSIX text model

2017-11-26 Thread Hans Åberg

> On 26 Nov 2017, at 13:43, k...@keldix.com wrote:
> 
> Well, the pathname processing should be a function of the filesystem.
> E.g. if you have a Windows filesystem, or an Apple filesystem, mounted
> on a Linux operating system, then the file names of the foreign system
> should be interpreted as for the originating system in question.
> I am not sure of the encoding of filenames on Windows and Apple
> systems, but their modern default character encoding is UTF-16.

The deprecated HFS uses UTF-16, but macOS sets LC_CTYPE=UTF-8, that is,
with no additional qualifiers as in LC_CTYPE=en_US.UTF-8. It would be
interesting to know whether that is POSIX-conforming, as it causes
confusion with some software.





Re: UTF-8 locale & POSIX text model

2017-11-26 Thread Hans Åberg


> On 26 Nov 2017, at 13:43, k...@keldix.com wrote:
> 
> I don't have Windows nor Apple systems, but they run UTF-16 natively, and
> recent Windows 10 systems have a full Linux (Ubuntu) subsystem. I could
> also see problems with UTF-16 and POSIX, but at least Apple should have
> solved that problem with OS X and iOS.

APFS uses UTF-8.

https://developer.apple.com/library/content/documentation/FileManagement/Conceptual/APFS_Guide/FAQ/FAQ.html





Re: UTF-8 locale & POSIX text model

2017-11-26 Thread Stephane Chazelas
2017-11-26 14:07:50 +0100, k...@keldix.com:
[...]
> > For instance, as currently specified, POSIX requires that the
> > output of the "locale" utility be suitable for reinput to the
> > shell, with double-quote quoting required in some cases.
> > 
> > Using double-quote quoting is problematic because of the
> > backslash and backtick characters that are special inside double
> > quotes but whose encoding is found in other characters in
> > charsets like GB18030 or BIG5 (still found on some system
> > locales in many systems), causing vulnerabilities if people
> > try to reinput the output of locale to the shell in some
> > implementations.
> 
> Linux locales avoid this by not using \ in the source code.

Not sure what you mean. Locales are a purely user-space concept,
so Linux itself is not involved.

I don't know about the other libcs found on Linux-based systems,
but the GNU libc is one of the vulnerable ones. It doesn't even
try to quote characters properly in the current charset.

And anyway, it's not even possible to quote the output properly
with double quotes if that output is to be interpreted by both
multi-byte-aware shells like bash/ksh93/zsh/yash and
non-multi-byte-aware ones like dash or mksh (that one supports
UTF-8, though not by default, and not other multi-byte encodings).

$ LANG=zh_HK.big5hkscs luit
$ LC_NUMERIC='α' locale
LANG=zh_HK.big5hkscs
LANGUAGE=
LC_CTYPE="zh_HK.big5hkscs"
LC_NUMERIC=α\
LC_TIME=en_GB
[...]
$ printf 'α' | od -tc -tx1
0000000 243   \
         a3  5c
0000002


See how α was quoted improperly (for bash/zsh/ksh93/yash, though not for dash).

That's because α in BIG5-HKSCS is 0xa3 0x5c, 0x5c being also \
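The same byte sequence can be checked outside the shell; a quick Python sketch (Python's big5hkscs codec standing in for the system locale's charset):

```python
# α encoded in BIG5-HKSCS ends in byte 0x5c, which is '\' in ASCII,
# so it collides with the backslash that is special inside "..." quoting.
b = 'α'.encode('big5hkscs')
print(b.hex())        # a35c
print(0x5c in b)      # True: the trailing byte is the backslash byte
```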

The only safe way to quote things is with single quotes, as the byte
value of the single quote does not, in practice (though that's not
required by POSIX), occur in the encoding of other characters.
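A single-quote quoter along those lines can be sketched as follows (the helper name is mine; it assumes, as said above, that byte 0x27 does not occur inside multi-byte characters):

```python
def sh_single_quote(raw: bytes) -> bytes:
    # Quote a byte string for POSIX sh using single quotes only.
    # A literal ' cannot appear inside '...', so each one is emitted
    # as a 4-byte sequence: close quote, backslash, quote, reopen quote.
    return b"'" + raw.replace(b"'", b"'\\''") + b"'"

# The BIG5-HKSCS bytes for α pass through untouched, backslash byte and all:
print(sh_single_quote(b'\xa3\x5c'))   # b"'\xa3\\'"
print(sh_single_quote(b"it's"))       # b"'it'\\''s'"
```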

It's not as bad as on Solaris 11, where you don't even need character
encoding issues to cause vulnerabilities:

$ LANG='`uname`' locale
LANG=`uname`
[...]

> > I'd also wish the remaining cases where the parsing of code is
> > locale dependent (like the honouring of the locale's blanks for
> > token delimitation in shells, awk, bc... (which many
> > implementations thankfully don't honour), or the [:alpha:] for
> > identifiers) disappear from the spec.
> 
> Linux locales only use the invariant charset (83 characters) 
> which are found to be portable over almost all platforms, incl.
> gb18030, big5, and national ebcdics. So you can design your locales
> to not have these problems.

bash and yash honour the locale's blanks (not in multi-byte
locales with bash because of a bug) as delimiters as currently
sort of required by POSIX and ksh/zsh/bash/yash accept the
locale's [:alpha:] for identifiers (in more or less buggy
fashions).

$ yash -c $'echo\u2006test'
test
$ zsh -c 'à=1; echo $à'
1
$ ksh -c 'à=1; echo $à'
1
$ LC_ALL=en_GB.iso885915  bash -c $'\xe0=1; echo $\xe0'
1

solaris11$ LC_ALL=en_US.ISO8859-15 bash -c $'echo\xa0test'
test
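Whether the transcripts above split comes down to whether the tokenizer is Unicode-aware; a rough illustration in Python (U+2006 is the SIX-PER-EM SPACE from the yash example):

```python
cmd = 'echo\u2006test'
# A Unicode-aware split treats U+2006 as whitespace, as yash does:
print(cmd.split())                    # ['echo', 'test']
# A byte-oriented split does not, as dash would:
print(cmd.encode('utf-8').split())    # [b'echo\xe2\x80\x86test']
```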

If you have à=1 in your shell script, that code is interpreted
differently when invoked in a different locale (here executing an
à=1 command instead).

So I'd rather POSIX didn't *require* that à=1 be understood as an
assignment, so that people are not tempted to use it.


> > Ideally, I'd like to see OS implementers remove all non-ASCII
> > compatible charsets and all multi-byte charsets other than
> > UTF-8 in available system locales. Sorry, but I don't see the
> > value in EBCDIC in a connected 21st century.
> 
> Well, banking and aviation can see the value in ebcdic.
> And Apple and Microsoft have seen the value in non utf-8 comforming platforms.

For UTF-16, you can't use the POSIX API. Most of the POSIX API
uses NUL-delimited arrays of 8-bit chars.

It's out of scope here.

> I agree that utf-8 is a good solution, but I acknowledge that there
> is a different world out there, and that we can easily accommodate these
> platforms with POSIX/C/C++ systems.

How do we do that? How do we pass UTF-16 strings to open() for
instance?
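One concrete obstacle, illustrated in Python: every other byte of a UTF-16 ASCII string is NUL, and NUL terminates the byte strings the POSIX API passes around:

```python
# 'A' in UTF-16LE is two bytes, the second of which is NUL; a C string
# (and hence a pathname passed to open(2)) would be truncated there.
name = 'A'.encode('utf-16-le')
print(name)        # b'A\x00'
print(0 in name)   # True
```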

> > Note that I don't mean removing EBCDIC, BIG5, GB18030... from
> > the list supported by iconv for instance (we'd still need those
> > to import data from ancient systems), only from system locales,
> > that is, not have all software on the system automatically
> > exposed to those.
> 
> That does not compute on platforms where the system charset
> is not compatible with utf-8. They need to have all their software available
> in the platform's encoding.
[...]

Maybe, but I doubt anybody is going to write *new* POSIX code
with the intention of it being portable to EBCDIC systems. The
remaining few EBCDIC systems are highly specialised and not
meant to interoperate generally with the rest of the open world.
I don't understand why IBM would care about POSIX compliance for
them or why POSIX would bother catering for them.

And again, UTF-16 is not relevant to the POSIX API.

BIG5 and GB18030 are compatible with the POSIX API, they are a
superset of ASCII, but they are the [...]

Re: UTF-8 locale & POSIX text model

2017-11-26 Thread k...@keldix.com
On Sun, Nov 26, 2017 at 02:09:21AM +, Danny Niu wrote:
> 
> 
> On 26 Nov 2017, at 3:53 AM, k...@keldix.com wrote:
> 
> On Wed, Nov 22, 2017 at 05:43:51PM +, Stephane Chazelas wrote:
> 2017-11-22 16:27:15 +0100, Martijn Dekker:
> Op 22-11-17 om 16:02 schreef Geoff Clare:
> Danny Niu > wrote, on 22 
> Nov 2017:
> 
> Q1: What is the rationale for not making POSIX an application of ASCII?
> 
> So that systems which use other encodings (specifically EBCDIC) can
> be POSIX-conforming.  IBM z/OS is certified UNIX 95 and uses EBCDIC.
> 
> But then how should I interpret the table in 6.1 Portable Character Set,
> particularly the UCS column?
> [...]
> 
> It just says those characters are the ones constituting the
> portable character set. It doesn't specify the encoding other
> than it mandates the encoding of those characters to be
> invariant in the charsets in the system's supported locales.
> 
> Well, for EBCDIC this does not hold true over different national variants.
> For example dollar is coded x5b in IBM038 and coded x67 in IBM277.
> 
> For a POSIX system to have a locale where UTF-8 is the charset,
> that means that any other locale charset would have to have the
> same encoding for those characters in the portable character
> set, which happens to be the same as ASCII. That doesn't mean
> that the C locale's charset would have to be a superset of
> ASCII, but that it would have to match ASCII on all the
> characters of the portable character set.
> 
> A EBCDIC-based system cannot have a locale where UTF-8 is the
> charset as there are many characters from the portable character
> set that don't have the same encoding in EBCDIC and UTF-8.
> 
> Depends on the specification of conformance. I think we should lift
> the constraint that all charsets in a given implementation need to have the
> same encoding of the invariant set.
> 
> I might be wrong, but I believe the "invariant" requirement is there so that
> the kernel can safely process pathnames containing <slash> and <NUL>
> and terminal characters such as <newline> and <carriage-return>
> while safely ignoring the current locale setting of processes.

Well, the pathname processing should be a function of the filesystem.
E.g. if you have a Windows filesystem, or an Apple filesystem, mounted
on a Linux operating system, then the file names of the foreign system
should be interpreted as for the originating system in question.
I am not sure of the encoding of filenames on Windows and Apple
systems, but their modern default character encoding is UTF-16.

> I do however question the nativity of EBCDIC on IBM systems.
> 
> Does z/OS support EBCDIC regular expressions, process pathnames and
> kernel configuration strings in EBCDIC, and always have a compilation
> environment where char is unsigned?
> (to hold positive character values as required in section 6.1 bullet point 5)
> 
> I've never had a licensed z/OS on hand, therefore I wouldn't know.
> 
> Then we could have conforming EBCDIC
> systems, and also implementations that can conform using UTF-16 and
> different 8-bit codesets. For instance 'A' is coded x0041 (two bytes) in 
> UTF-16
> and x41 (only one byte) in cp850, and UTF-8.
> 
> The majority of the programs specified in the standard use byte-oriented
> string coding; the type of argv in main is char*[]; the majority of ustar
> and pax files in the wild store strings in byte-oriented encodings.
> 
> UTF-16 as native coding on *nix systems I believe just causes problems.

I don't have Windows nor Apple systems, but they run UTF-16 natively, and
recent Windows 10 systems have a full Linux (Ubuntu) subsystem. I could
also see problems with UTF-16 and POSIX, but at least Apple should have
solved that problem with OS X and iOS.

> ISO 30112 - which has enhanced POSIX locales and charmaps in a backward 
> compatible way -
> has lifted the constraint on the coding of the portable
> character set. This is also easy to code; it costs something like one extra
> indexing instruction. This is not rocket science.
> 
> ISO 30112 also has an elaborated locale covering all of UCS/Unicode,
> but in a coding-independent way, so it would work both for utf-8 and utf-16.
> You know, there are a few major systems out there that run utf-18, such as at
> least OS X and iOS.
> 
> The 30112 i18n locale is actually proven technology, and has been used in 
> Linux
> implementations for something like 20 years. Many hundreds of locales have 
> been built
> on it.
> 
> I get confused with this part.
> 1) I'm not entirely sure ISO TR 30112 would be within the scope of POSIX.

30112 should be within the scope of POSIX. The POSIX group and SC22 agreed
many years ago that POSIX did not have the expertise to develop POSIX i18n,
and let SC22 go on doing it. This eventually led to 30112.

> 2) What is with that UTF-18 thing? Byte is defined as octet in POSIX!

Sorry, it was a misspelling, I meant utf-16.

Keld



Re: UTF-8 locale & POSIX text model

2017-11-26 Thread Stephane Chazelas
2017-11-25 20:53:20 +0100, k...@keldix.com:
[...]
> > It just says those characters are the ones constituting the
> > portable character set. It doesn't specify the encoding other
> > than it mandates the encoding of those characters to be
> > invariant in the charsets in the system's supported locales.
> 
> 
> Well, for EBCDIC this does not hold true over different national variants.
> For example dollar is coded x5b in IBM038 and coded x67 in IBM277.
[...]

Are you saying that there are POSIX systems out there with
system locales whose charsets have different encodings of $?

How does that work?

Does that mean that on those systems, sh/sed/awk/Makefile/bc...
scripts only work in some locales?

Like a

#! /bin/sh -
length() { echo "${#1}"; }

Where that $ is encoded in one of those charsets, would it become a
completely different script when invoked in a different locale?


[...]
> Depends on the specification of conformance. I think we should lift
> the constraint that all charsets in a given implementation need to have the
> same encoding of the invariant set.

I'm more at the opposite end here: trying to minimize bugs and
vulnerabilities, I would be in favour of POSIX going further.

Not only should the characters of the portable character set (the
ones used in the languages specified by POSIX) be invariant
across system locales, but the encoding of those characters should
also not be found (as a subset) in the encoding of other characters
(which ATM IIRC is only required for / and newline), as that's a
source of bugs and vulnerabilities.

For instance, as currently specified, POSIX requires that the
output of the "locale" utility be suitable for reinput to the
shell, with double-quote quoting required in some cases.

Using double-quote quoting is problematic because of the
backslash and backtick characters that are special inside double
quotes but whose encoding is found in other characters in
charsets like GB18030 or BIG5 (still found on some system
locales in many systems), causing vulnerabilities if people
try to reinput the output of locale to the shell in some
implementations.

I'd also wish the remaining cases where the parsing of code is
locale dependent (like the honouring of the locale's blanks for
token delimitation in shells, awk, bc... (which many
implementations thankfully don't honour), or the [:alpha:] for
identifiers) disappear from the spec.

Ideally, I'd like to see OS implementers remove all non-ASCII
compatible charsets and all multi-byte charsets other than
UTF-8 in available system locales. Sorry, but I don't see the
value in EBCDIC in a connected 21st century.

Note that I don't mean removing EBCDIC, BIG5, GB18030... from
the list supported by iconv for instance (we'd still need those
to import data from ancient systems), only from system locales,
that is, not have all software on the system automatically
exposed to those.
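That is, conversion stays a one-off import step rather than a system-wide locale setting; a sketch in Python (iconv would do the same on the command line):

```python
# One-off import of legacy BIG5 data into a UTF-8-only system:
legacy = b'\xa3\x5c'                         # α in BIG5
utf8 = legacy.decode('big5').encode('utf-8')
print(utf8.hex())                            # ceb1, i.e. α in UTF-8
```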

-- 
Stephane