I don't think I'm qualified to comment on the details, but there is one
thing I believe I can add to this thread: to be supported and portable
across all supported locales and codesets, names and identifiers must be
in the so-called Portable Character Set (which, on many Unix/Linux
systems, is a proper subset of 7-bit ASCII) or a subset of it.

This way, regardless of the current locale/codeset, names and identifiers
can be used to identify things.
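
As a rough illustration of that rule, here is a minimal sketch of a name
check. The function name_is_portable() is hypothetical (it is not an
existing libscf interface), and it assumes an ASCII-based system and
accepts only space and the graphic characters 0x21-0x7E, i.e. a subset
of the Portable Character Set. Real naming rules are likely stricter.

```c
/*
 * Hypothetical helper, for illustration only.
 * Returns 1 if every byte of name is in 0x20..0x7E (space plus the
 * ASCII graphic characters), 0 otherwise.
 * Assumption: an empty name is rejected.
 */
int
name_is_portable(const char *name)
{
	const unsigned char *p = (const unsigned char *)name;

	if (*p == '\0')
		return (0);

	for (; *p != '\0'; p++) {
		/* Control characters and all bytes >= 0x7F are refused. */
		if (*p < 0x20 || *p > 0x7E)
			return (0);
	}
	return (1);
}
```

Any byte of a UTF-8 multibyte sequence is 0x80 or above, so a name that
passes this check reads the same in every ASCII-compatible codeset.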

If ustring property values contain non-ASCII characters encoded as UTF-8,
those bytes will be invalid character bytes in non-UTF-8 locales, or
incomprehensible even where they happen to be "valid" character bytes
in terms of binary representation or encoding.

BTW, I just looked at the scf_types.c file, esp. valid_utf8() and
UTF8_MAX_BYTES, and found that the UTF-8 binary definition used is
pre-Unicode 3.1, i.e., the current code allows "ill-formed" UTF-8 byte
sequences that are invalid in the latest UTF-8 definition. This is
potentially a security issue, so I filed CR 6607481.

The UTF-8 Corrigendum, introduced in Unicode 3.1, revised in Unicode 3.2,
and kept in the Unicode Standard since then as the UTF-8 definition,
specifies a somewhat more restrictive binary representation:

        http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf

Please see pages 103-104 of the above, especially the part on well-formed
UTF-8 byte sequences and Table 3-7.
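
To make the restriction concrete, here is a minimal sketch of a checker
that follows the byte ranges of Table 3-7. It is not the valid_utf8()
in scf_types.c and not a proposed fix; utf8_well_formed() and in_range()
are names I made up for illustration. It rejects overlong forms,
surrogates (U+D800-U+DFFF), and anything above U+10FFFF.

```c
#include <stddef.h>

/* Hypothetical helper: is b within [lo, hi]? */
static int
in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
	return (b >= lo && b <= hi);
}

/*
 * Hypothetical checker, for illustration only.
 * Returns 1 if s[0..len-1] consists solely of well-formed UTF-8 byte
 * sequences as listed in Table 3-7, 0 otherwise.
 */
int
utf8_well_formed(const unsigned char *s, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char b0 = s[i];
		/* Allowed range of the second byte; narrowed below. */
		unsigned char lo = 0x80, hi = 0xBF;
		size_t ntrail, k;

		if (b0 <= 0x7F) {
			i++;			/* plain ASCII */
			continue;
		} else if (in_range(b0, 0xC2, 0xDF)) {
			ntrail = 1;
		} else if (in_range(b0, 0xE0, 0xEF)) {
			ntrail = 2;
			if (b0 == 0xE0)
				lo = 0xA0;	/* no overlong 3-byte forms */
			else if (b0 == 0xED)
				hi = 0x9F;	/* no surrogates */
		} else if (in_range(b0, 0xF0, 0xF4)) {
			ntrail = 3;
			if (b0 == 0xF0)
				lo = 0x90;	/* no overlong 4-byte forms */
			else if (b0 == 0xF4)
				hi = 0x8F;	/* nothing above U+10FFFF */
		} else {
			/* 80..C1 and F5..FF never start a sequence. */
			return (0);
		}

		if (len - i <= ntrail)
			return (0);		/* truncated sequence */
		if (!in_range(s[i + 1], lo, hi))
			return (0);
		for (k = 2; k <= ntrail; k++) {
			if (!in_range(s[i + k], 0x80, 0xBF))
				return (0);
		}
		i += ntrail + 1;
	}
	return (1);
}
```

For example, the two-byte sequence C0 80 (an overlong encoding of NUL)
and the five-byte form starting with F8 are both rejected by this
sketch, while C3 A9 (U+00E9) is accepted.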

Ienup

Jordan Brown wrote at 09/20/07 17:49:
> Roland Mainz wrote:
> 
>> ... if I interpret the situation correctly you're outputting a UTF-8
>> encoded string, right ? If that's "true" then there is a _serious_
>> problem since such a value would be an invalid character sequence for
>> non-UTF-8 multibyte encodings. You may get away in some shells like
>> ksh93, but only by "accident" because one of the implementation details
>> of ksh93 is that it treats all things as plain strings unless it needs
>> to do special handling like quotes, IFS etc. In that case the shell
>> script will break because you hit invalid characters... which is AFAIK
>> bad... ;-(
> 
> 
> Note that the design of UTF-8 is such that "plain ASCII" values 00-7F 
> always represent the plain ASCII characters.  Non-ASCII characters, 
> including all of the bytes of multibyte sequences, are always in the 
> range 80-FF.
> 
> That largely protects UTF-8 strings from misinterpretation by 
> applications that only understand ASCII.
