On Wed, Sep 26, 2007 at 04:11:25AM +0200, Roland Mainz wrote:
> Nicolas Williams wrote:
> > Yes, but...
> >
> > The real answer is that everybody should have been using UTF-8 for the
> > past 40 years or so.
>
> Groan... I said this elsewhere... each generation of software engineer
> has its own "ultimate" language encoding system... remember when ISO2022
> was "cool" in the last decade? Now we have Unicode which is "cool". And

No, I don't.

> maybe the next decade will have its own cool encoding system (maybe
> called "interCode" or "iCode"?).

Unicode is the last one. We're building it into too many places for us
to just be able to rip it out.

> In any case it was IMO not a good idea to make the output of "svcprop"
> Unicode-specific where it may have been better to just use the standard
> multibyte API and handle the possible "loss" of information differently.

I think svcprop should provide BOTH kinds of interfaces: one that deals
strictly in Unicode, preferably UTF-8 only, locale be damned, and one
that deals in the current locale's codeset. Both should warn about
possible data loss: when using the UTF-8 interface in a non-UTF-8
locale, or when using the locale-aware interface in a non-UTF-8 locale
with strings for which the conversion may be lossy.

I'm assuming, BTW, that data loss through normalization is not a
problem.

> "svcprop" may currently have no "dataloss"[1] problem but any possible
> real-world consumer will have a problem. And that's very bad...

> [1]=(which isn't completely correct since Unicode is _not_ a lossless
> encoding (e.g. see Unicode's Han unification system (which may lead to
> some ambiguity if you mix some Asian languages)))

There's more too (think of the 'K' in NFKC and NFKD). There are ways to
deal with the data loss in Han unification, which, incidentally, was a
temporary measure anyways.

> > Seriously. We have non-UTF-8 locales. We might be able to EOF some of
> > them (e.g., all the ISO-8859 locales), but not all of them,
>
> ... like zh_CN.GB18030 (which is _MANDATORY_ for China (assuming you
> want government contracts)) and ja_JP.PCK (which is more or less
> unavoidable for Japanese installations in the next ten or twenty years)

Yup.

> > and we can't
> > actually remove any of them any time soon. So non-UTF-8 locales are
> > here for the foreseeable future and we have to deal.
>
> Right...
> and IMO it may not be a good idea to hardcode every API to
> Unicode without adding options for alternatives (e.g. an encoding
> identifier, file format version number etc.) ... as I said there may be a
> new one in twenty years. The current version of Unicode isn't completely
> undisputed (see "tron"&co.) ...

I don't agree. The IETF and others are burning Unicode into every
protocol that matters. Unicode has already become the one, primary
codeset. Deal with it. And don't tell me about Klingon, please.

Nico
--
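
To make the "warn about possible data loss" idea above concrete, here is a rough sketch in Python. It is not how svcprop works or would work; the function names (lossy_in_codeset, changed_by_nfkc, data_loss_warnings) are made up for illustration, and it assumes the property value is already available as a Unicode string. It only shows the two checks discussed: whether a string survives conversion to a target codeset, and whether a compatibility ('K') normalization would alter it.

```python
import unicodedata


def lossy_in_codeset(text: str, codeset: str) -> bool:
    """Return True if `text` cannot be encoded exactly in `codeset`."""
    try:
        # Strict encoding raises instead of silently substituting '?'.
        text.encode(codeset)
    except UnicodeEncodeError:
        return True
    return False


def changed_by_nfkc(text: str) -> bool:
    """Return True if NFKC normalization would alter `text`."""
    return unicodedata.normalize("NFKC", text) != text


def data_loss_warnings(text: str, codeset: str) -> list:
    """Collect human-readable warnings (hypothetical wording)."""
    warnings = []
    if lossy_in_codeset(text, codeset):
        warnings.append("value not representable in codeset " + codeset)
    if changed_by_nfkc(text):
        warnings.append("value changes under NFKC normalization")
    return warnings


if __name__ == "__main__":
    # U+FB01 (LATIN SMALL LIGATURE FI) is outside ISO-8859-1 and is
    # folded to plain "fi" by NFKC, so both checks fire for it.
    for warning in data_loss_warnings("\ufb01", "iso-8859-1"):
        print("warning:", warning)
```

A locale-aware front end could run the first check against the current locale's codeset before printing, while a UTF-8-only front end would need at most the second one.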