Shlomi Tal <shlompi at hotmail dot com> wrote, and Markus Scherer <markus dot scherer at jtcsv dot com> responded, regarding Shlomi's experimental UTF.

Please note, before anyone gets the wrong idea, that these experimental UTFs are *not* intended as candidates to replace the official ones. As far as I am concerned, they are for fun. Not all are "jokes" in the sense of being ridiculous or absurd, however. Many are meant as intellectual exercises, to help understand the thought process behind designing a good UTF (e.g. what made UTF-8 so much more successful than UTF-1, which was superior in some ways?)

For a good sense of what goes on in the mind of someone like me or Shlomi, or Markus, or Marco Cimarosti, when we invent these things, see the Jargon File entry on "hacker humor" (example 2 in particular):

http://www.tuxedo.org/~esr/jargon/html/entry/hacker-humor.html

Now on to the discussion.

[Shlomi]
> If you think 7-bit issues are totally obsolete, then sorry for
> bothering...

[Markus]
> Personally, I think they are, but I do find encoding schemes
> entertaining :-)

I agree with Markus. I might have to drag my vestigial "UTF-Fieldata" concept out of the closet...

[Shlomi]
> UTF-7 is both stateful and fragile. Stateful it has to be, because
> any attempt to encode a large charset AND maintain compatibility to
> ASCII has to be stateful. However, it is also fragile in that there
> is no self-sync or seek coherence (that's the advantage of UTF-8, as
> we all know).

[Markus]
> Fragile. You assume lossy transport instead of trusting the error
> correction of the lower layers.

But people do continue to design file formats with CRCs and other validity checks. This was a very important feature in the days when our 300-baud modems had lousy error checking. I don't know how valuable it is today.

[Markus]
> ... if you also care to stay within 7 bits.

Which was the original intent.

[Shlomi]
> Borrowing from the idea of ISO-2022-JP extended into EUC, but the
> other way round, I had the following "Gedankenexperiment":
>
>   00..A0   stay the same
>   FF       not used
>   C2..FE   leadbytes (1 leadbyte)
>   A1..C1   trailbytes (2 trailbytes)
>
> allowing 61 x 33 x 33 codepoints - a little more than 65536.

Where do the 3-byte sequences begin? Does the sequence C2 A1 A1 represent U+0000 or U+00A1?

In the first case (like UTF-8), you have the possibility of non-shortest sequences, which you can either allow or forbid. If you allow them as alternatives to the single-byte form, any search operations that operate on undecoded data (for whatever reason) must recognize the two equivalent forms. If you forbid them (again like UTF-8), then all decoders must be vigilant about forbidding them.

In the second case (like UTF-16), you have no duplicate sequences, but now you have an additive offset of 0x00A1. Some people find this annoying about UTF-16.

I don't think either solution is "right" or "wrong," it's just something you have to think about.

[Markus]
> What about the other 1M code points? Would this encode UTF-16 code
> units?

In private communication, Shlomi indicated that yes, you would need to apply this algorithm to UTF-16 code units rather than Unicode scalar values. This is like UTF-7 and CESU-8. (Yuck.)

[Shlomi]
> And now, with an ISO-2022 sequence for state, reduce to 7-bit:
>
>   42..7E   leadbytes (1 leadbyte)
>   21..41   trailbytes (2 trailbytes)

[Markus]
> What about 80..9f which would collide with C0 control codes?
>
> What about U+00a0 which would become 20 (space) which might be
> removed/replaced by emailers in ways that you would not expect for
> U+00a0?

Good questions. These would have to be resolved before the 7-bit variant could work. Personally I place ISO 2022 code page switching in the same "yuck" category as piggybacking an encoding scheme on top of UTF-16.

> What about users' complaint of the high byte-per-code point ratio in
> Unicode encodings?
>
> For everything but ASCII (U+0000..U+007f), UTF-7 uses 2.67 B/cp,
> while this uses 3 B/cp.

Another good point. But at least this is easier to encode and decode than UTF-7.

[Shlomi]
> Stateful, yes... fragile, no! Any relevance, or is this just an
> amusing experiment to be kept among geeks privately?

[Markus]
> Time will tell. You could ask Doug to add it to his collection :-)

Oh goody, I'm famous for something. Oh well, no such thing as bad publicity, right?

-Doug Ewell
 Fullerton, California
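
For anyone who wants to play with the 8-bit form quoted above, here is a minimal sketch in Python. It is not Shlomi's specification, which leaves several details open. The sketch assumes the "second case" (C2 A1 A1 represents U+00A1, so there are no duplicate sequences), assumes the lead byte carries the most significant digit, and works on UTF-16 code units as described in the private communication. The names `encode_unit`, `encode`, and `decode` are made up for illustration.

```python
# Byte ranges from the quoted proposal.
LEAD_BASE, LEAD_LAST = 0xC2, 0xFE      # 61 lead byte values
TRAIL_BASE, TRAIL_LAST = 0xA1, 0xC1    # 33 trail byte values
LEAD_COUNT = LEAD_LAST - LEAD_BASE + 1
TRAIL_COUNT = TRAIL_LAST - TRAIL_BASE + 1
OFFSET = 0xA1  # assumption: C2 A1 A1 means U+00A1, not U+0000


def encode_unit(u):
    """Encode one UTF-16 code unit as 1 or 3 bytes."""
    if not 0 <= u <= 0xFFFF:
        raise ValueError("not a 16-bit code unit")
    if u <= 0xA0:
        return bytes([u])                      # 00..A0 stay the same
    hi, rest = divmod(u - OFFSET, TRAIL_COUNT * TRAIL_COUNT)
    mid, lo = divmod(rest, TRAIL_COUNT)
    return bytes([LEAD_BASE + hi, TRAIL_BASE + mid, TRAIL_BASE + lo])


def encode(text):
    """Encode a string by way of its UTF-16 code units."""
    raw = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(raw), 2):
        out += encode_unit(int.from_bytes(raw[i:i + 2], "big"))
    return bytes(out)


def decode(data):
    """Decode bytes back to a string, rejecting malformed sequences."""
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0xA0:
            units.append(b)
            i += 1
        elif LEAD_BASE <= b <= LEAD_LAST:
            trail = data[i + 1:i + 3]
            if len(trail) != 2 or any(not TRAIL_BASE <= t <= TRAIL_LAST for t in trail):
                raise ValueError("lead byte without two trail bytes")
            v = (b - LEAD_BASE) * TRAIL_COUNT + (trail[0] - TRAIL_BASE)
            v = v * TRAIL_COUNT + (trail[1] - TRAIL_BASE)
            if v + OFFSET > 0xFFFF:
                raise ValueError("value beyond 16 bits")
            units.append(v + OFFSET)
            i += 3
        else:
            raise ValueError("stray trail byte or FF")  # A1..C1 alone, or FF
    raw = b"".join(u.to_bytes(2, "big") for u in units)
    return raw.decode("utf-16-be")
```

With these assumptions, U+00A1 becomes C2 A1 A1 and U+FFFF becomes FE A2 A2, so every lead byte value from C2 through FE is reachable. Because the single-byte, lead-byte, and trail-byte ranges do not overlap, a decoder dropped into the middle of a stream can find the next sequence boundary by skipping at most two trail bytes.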

