David Powell wrote:
> Roland Mainz wrote:
[snip]
> > If that's "true" then there is a _serious_
> > problem since such a value would be an invalid character sequence for
> > non-UTF-8 multibyte encodings. You may get away in some shells like
> > ksh93, but only by "accident" because one of the implementation details
> > of ksh93 is that it treats all things as plain strings unless it needs
> > to do special handling like quotes, IFS etc. In that case the shell
> > script will break because you hit invalid characters... which is AFAIK
> > bad... ;-(
>
> It simply means that if you assume the *parseable* output of this
> command is using the caller's encoding and character set, you are
> wrong.

Erm, my concern is not about "parseable" output. My point is that the
output cannot be processed. If you put the output into a file and then
let the shell read it, it may not be able to recover from this kind of
error (e.g. illegal character sequence), i.e. the processing will be
aborted for the whole rest of the file/stream (this is no problem for
UTF-8 encoded streams (which is IMO one of the huge innovations of the
UTF-8 encoding... but unfortunately other encoding schemes are not that
forgiving...)).

> >> If you expect svcprop to emit localized output, it isn't.
> >
> > No, I am not asking for "localisation", I am asking about which
> > encodings the strings use, e.g. "UTF-8" vs. "Shift-JIS" (ja_JP.PCK uses
> > "Shift-JIS" as encoding).
> >
> >> Since svcprop is primarily intended for scripting purposes, I think
> >> the former argument is in line with our expectations.
> >
> > Erm, not really.. it seems I found something like a giant "dataloss"
> > bug... ;-(
>
> There is no guarantee that a code point encoded in a ustring can be
> unambiguously represented using the caller's encoding and/or
> character set. If svcprop's parseable output was sensitive to the
> caller's locale, then it would not be emitting the value in the
> ustring property, it would be emitting some mapping of that value.

Erm, yes and no. If no 1:1 mapping is available you could generate some
kind of "escape sequence", for example something like the SGML/XML
"entities", e.g. &<some-hex-val>; ...

... for example one working solution may be:
1. /usr/bin/svcprop converts all unicode strings to the current locale -
   which leads to a dataloss situation (which we already have anyway
   because no application can read the stuff generated) but avoids the
   problem that the application/shell will completely implode because
   it hits an illegal character sequence
2. /usr/bin/svcprop gets a new option to "encode" characters which
   cannot be represented by the current locale in some way, for example
   "-E hexentity" to use the SGML/XML-like entity format, "-E brackets"
   to use something like "\u[<some-hex-val>]" to represent such
   characters (bash and ksh93 use this format to represent unicode
   values in script code which is ASCII-only)
3. All svc*-applications which are expected to accept unicode values
   get the same "-E" option to convert the entity stuff back to unicode
   values

> Always emitting the UTF-8 encoded data is the only way for svcprop to
> *avoid* data loss.

Erm, I think this technically means that the current /usr/bin/svcprop
is unsuited for usage within shell scripts or other applications if it
runs in a non-("C"|"UTF-8")-locale - /usr/bin/svcprop won't lose the
data - the shell or application won't be able to read the output (i.e.
you have the "dataloss" problem in any case (see above for a possible
solution)).

> > I am reading
> > http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/svc/svcprop/svcprop.c
> > right now ... and the first thing I found was (line 55):
> > -- snip --
> > 151 /*
> > 152  * Return an allocated copy of str, with the Bourne shell's metacharacters
> > 153  * escaped by '\'.
> > 154  *
> > 155  * What about unicode?
> > 156  */
> > 157 static char *
> > 158 quote_for_shell(const char *str)
> > 159 {
> > -- snip --
> >
> > Erm... is it possible that the code is completely unaware about things
> > like "multibyte encodings" and that the system's default locale may be
> > something like ja_JP.PCK, zh_CN.GB18030 or en_US.ISO8859-1 (e.g. not
> > *.UTF-8-related or compatible) ?
>
> quote_for_shell() is called on data obtained from the repository. In
> the case of ustrings, that data will be UTF-8. Non-UTF-8 multibyte
> encodings simply don't enter the picture.
>
> Given the ASCII-transparency of UTF-8, I believe anything caught in
> quote_for_shell()'s special-character dragnet will be ASCII.

What about the EUC, GB18030 or ShiftJis encodings ?

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) roland.mainz at nrubsig.org
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 7950090
 (;O/ \/ \O;)