> In fact, there are no characters defined in ISO 8859-8 for those code
> points. If you encounter 0xBF in text that purports to be ISO 8859-8,
> it is an error.

Another example showing that it would be very useful to have 128 (possibly 256) codepoints that would be reserved for such purposes.
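To make concrete what I mean, here is a rough sketch in Python (purely illustrative; the 0xhh -> 0x2Dhh mapping is discussed below, and the reserved block, the tiny table and the helper names are hypothetical, not actual assignments):

    # Hypothetical sketch: map undefined single-byte values to a reserved
    # block (0x2D00 + byte value), so the original byte can always be
    # recovered when converting back.
    RESERVED_BASE = 0x2D00   # hypothetical reserved block, not a real assignment

    def decode_byte(b, table):
        # 'table' maps the *defined* byte values of the charset to codepoints
        if b in table:
            return table[b]
        return RESERVED_BASE + b          # undefined byte -> reserved codepoint

    def encode_codepoint(cp, reverse_table):
        if cp in reverse_table:
            return reverse_table[cp]
        if RESERVED_BASE <= cp <= RESERVED_BASE + 0xFF:
            return cp - RESERVED_BASE     # reserved codepoint -> original byte
        raise ValueError("codepoint not representable in this charset")

    # Round trip for 0xBF, which is undefined in ISO 8859-8:
    table = {0x41: 0x0041}                # tiny stand-in for the real table
    reverse_table = {v: k for k, v in table.items()}
    cp = decode_byte(0xBF, table)         # -> 0x2DBF
    assert encode_codepoint(cp, reverse_table) == 0xBF

The point is not the exact code, only that a converter which is not allowed to throw anything away would always have a well-defined place to put the byte.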
Suppose ISO 8859-8 is ever upgraded (not likely, but suppose it for the sake of argument). One might say that it would be bad to change an existing definition in the table, e.g. for 0xBF from 0x2DBF to 0x20AC. But how is that worse than changing it from <undefined> to 0x20AC? I think it is actually better, since you can never guess what will be implemented for <undefined>. "Throw an exception" is what I keep seeing in these discussions. Who will catch it? The secretary on the third floor? If the mapping for undefined values were 0xhh -> 0x2Dhh, there would be a consistent definition of what to do when somebody wants to do something other than throw things out the window. Consequently, there would be a better chance of being able to repair inadvertently processed data at some later time. Yes, I am talking about the roundtrip issue again.

Thanks to David Hopwood's reply (see http://www.unicode.org/mail-arch/unicode-ml/y2002-m02/0362.html), I am now convinced that unpaired surrogates (UTF-8B) are not a good approach. However, the %hh mechanism has many drawbacks (such as strings that are potentially three times longer after conversion). That will lead to the use of other mechanisms, and all of them will have the same multiple-representation problem. My point is: not providing a suitable mechanism (or at least the means for one) within Unicode will not make the multiple-representation problem go away.

IMHO, it would be better to accept the multiple-representation problem as a fact and try to deal with it. I think 128 codepoints is not such a high price to pay for what we could do with them. True, all the things people could do with these codepoints might raise new issues (yes, security too), but then again, isn't it better that everybody does the dirty things in a consistent and well-known manner? At least then you have a chance of having a single validator to find potentially problematic codepoints or sequences.

Of course, I am not saying that the definition of UTF-8 would be changed. These reserved codepoints would only allow a UTF-8D algorithm that would be somewhat simpler than UTF-8B and would produce valid UTF-16 data, thus significantly expanding the possibilities for its usage.

Lars Kristan
Storage & Data Management Lab
HERMES SoftLab
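P.S. To illustrate the length issue with %hh: every byte that needs escaping becomes three characters (and '%' itself must then be escaped as well). A simplified sketch of one possible %hh scheme, again purely for illustration:

    # Every undefined byte (and '%' itself) becomes a three-character escape.
    def escape_hh(data, defined):
        out = []
        for b in data:                        # 'data' is a byte string
            if b in defined and b != 0x25:    # 0x25 is '%'
                out.append(chr(defined[b]))
            else:
                out.append("%%%02X" % b)      # undefined (or '%') -> "%hh"
        return "".join(out)

    print(escape_hh(bytes([0xBF, 0xBF, 0xBF]), {}))   # -> "%BF%BF%BF"

Three input bytes become nine characters, which is where the potential factor of three comes from.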

