I don't think I'm qualified to comment on the details, but there is one thing I believe I can add to this thread: to be supported and portable across all supported locales and codesets, names and identifiers must be in the so-called Portable Character Set (which on many Unix/Linux systems is a proper subset of 7-bit ASCII) or a subset of it.
This way, names and identifiers can be used to identify things regardless of the current locale/codeset. If ustring property values contain non-ASCII UTF-8 bytes, those bytes will be invalid character bytes in non-UTF-8 locales, and incomprehensible even where they happen to be "valid" character bytes in terms of binary representation or encoding.

BTW, I just looked at the scf_types.c file, esp. valid_utf8() and UTF8_MAX_BYTES, and found that the UTF-8 binary definition used is pre-Unicode 3.1, i.e., the current code allows "ill-formed" UTF-8 byte sequences that are invalid in the latest UTF-8 definition. This is potentially a security issue, so I filed CR 6607481.

The UTF-8 Corrigendum, introduced in Unicode 3.1, revised in Unicode 3.2, and kept in the Unicode standards since then as the UTF-8 definition, has a somewhat more restrictive binary representation:

http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf

Please see pages 103-104 of the above, especially the part on well-formed UTF-8 byte sequences and Table 3-7.

Ienup

Jordan Brown wrote at 09/20/07 17:49:
> Roland Mainz wrote:
>
>> ... if I interpret the situation correctly you're output an UTF-8
>> encoding string, right ? If that's "true" then there is a _serious_
>> problem since such a value would be an invalid character sequence for
>> non-UTF-8 multibyte encodings. You may get away in some shells like
>> ksh93, but only by "accident" because one of the implementation details
>> of ksh93 is that it treats all things as plain strings unless it needs
>> to do special handling like quotes, IFS etc. In that case the shell
>> script will break because you hit invalid characters... which is AFAIK
>> bad... ;-(
>
>
> Note that the design of UTF-8 is such that "plain ASCII" values 00-7F
> always represent the plain ASCII characters. Non-ASCII characters,
> including all of the bytes of multibyte sequences, are always in the
> range 80-FF.
>
> That largely protects UTF-8 strings from misinterpretation by
> applications that only understand ASCII.
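
For what it's worth, here is a rough sketch of what a check against the well-formed byte ranges of Table 3-7 could look like. It is not the valid_utf8() code from scf_types.c, and utf8_is_well_formed() is a hypothetical name chosen only for this illustration. It also reflects the point quoted above: bytes 00-7F are accepted only as single ASCII characters, and every byte of a multibyte sequence must fall in 80-FF.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not the valid_utf8() in scf_types.c).
 * Returns 1 if buf[0..len) consists only of well-formed UTF-8 byte
 * sequences as listed in Table 3-7 of the Unicode Standard, else 0.
 */
int
utf8_is_well_formed(const unsigned char *buf, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char b = buf[i];
		unsigned char lo = 0x80;	/* lower bound for 2nd byte */
		unsigned char hi = 0xBF;	/* upper bound for 2nd byte */
		size_t trail;			/* number of trailing bytes */
		size_t k;

		if (b <= 0x7F) {		/* plain ASCII, one byte */
			i++;
			continue;
		} else if (b >= 0xC2 && b <= 0xDF) {
			trail = 1;
		} else if (b >= 0xE0 && b <= 0xEF) {
			trail = 2;
			if (b == 0xE0)
				lo = 0xA0;	/* rejects overlong forms */
			else if (b == 0xED)
				hi = 0x9F;	/* rejects surrogates */
		} else if (b >= 0xF0 && b <= 0xF4) {
			trail = 3;
			if (b == 0xF0)
				lo = 0x90;	/* rejects overlong forms */
			else if (b == 0xF4)
				hi = 0x8F;	/* rejects > U+10FFFF */
		} else {
			/* 80..C1 and F5..FF can never start a sequence */
			return (0);
		}

		if (len - i <= trail)		/* truncated sequence */
			return (0);
		if (buf[i + 1] < lo || buf[i + 1] > hi)
			return (0);
		for (k = 2; k <= trail; k++) {
			if (buf[i + k] < 0x80 || buf[i + k] > 0xBF)
				return (0);
		}
		i += trail + 1;
	}
	return (1);
}
```

With this sketch, the overlong sequence C0 AF, the surrogate encoding ED A0 80, and the out-of-range F4 90 80 80 are all rejected, while C3 A9 (U+00E9) is accepted. A decoder that only checks the lead byte and the generic 80-BF range for trailing bytes would accept the second and third of these.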