Shlomi Tal wrote:
> If you think 7-bit issues are totally obsolete, then sorry for bothering...

Personally, I think they are, but I do find encoding schemes entertaining :-)

> UTF-7 is both stateful and fragile. Stateful it has to be, because any

Fragile. You assume lossy transport instead of trusting the error correction
of the lower layers.

> attempt to encode a large charset AND maintain compatibility to ASCII has
> to be stateful.

... if you also care to stay within 7 bits.

> However, it is also fragile in that there is no
> self-sync or seek coherence (that's the advantage of UTF-8, as we all
> know).
>
> Borrowing from the idea of ISO-2022-JP extended into EUC, but the other
> way round, I had the following "Gedankenexperiment":
>
> 00..A0 stay the same
> FF not used
> C2..FE leadbytes (1 leadbyte)
> A1..C1 trailbytes (2 trailbytes)
>
> allowing 61 x 33 x 33 codepoints - a little more than 65536.

What about the other 1M code points? Would this encode UTF-16 code units?

> And now, with an ISO-2022 sequence for state, reduce to 7-bit:

You seem to imply just switching between "lower bytes" (00..7f) and
"upper bytes" (80..ff), which you can do with SI/SO alone, without the rest
of the ISO 2022 apparatus.

> 42..7E leadbytes (1 leadbyte)
> 21..41 trailbytes (2 trailbytes)

What about 80..9f, which would collide with the C0 control codes?

What about U+00a0, which would become 20 (space) and might be removed or
replaced by emailers in ways that you would not expect for U+00a0?

What about users' complaints about the high byte-per-code point ratio in
Unicode encodings? For everything but ASCII (U+0000..U+007f), UTF-7 uses
2.67 B/cp, while this uses 3 B/cp.

> Stateful, yes... fragile, no! Any relevance, or is this just an amusing
> experiment to be kept among geeks privately?

Time will tell. You could ask Doug to add it to his collection :-)

markus
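
P.S. For anyone who wants to check the numbers in this thread, here is a rough Python sketch. The byte ranges come from the quoted proposal. The helper names (`capacity`, `to_7bit`) are made up for illustration, and `to_7bit` assumes that the 7-bit reduction simply subtracts 0x80 from each upper byte, which is only my reading of the proposal, not something it states.

```python
# Sketch only: ranges are from the quoted proposal, names are hypothetical.
LEAD_8BIT = range(0xC2, 0xFE + 1)   # C2..FE lead bytes
TRAIL_8BIT = range(0xA1, 0xC1 + 1)  # A1..C1 trail bytes


def capacity(lead, trail, n_trail=2):
    """Number of sequences made of one lead byte plus n_trail trail bytes."""
    return len(lead) * len(trail) ** n_trail


three_byte_capacity = capacity(LEAD_8BIT, TRAIL_8BIT)
unicode_code_points = 0x110000  # U+0000..U+10FFFF

# UTF-7 carries 6 bits per base64 byte, and a UTF-16 code unit has 16 bits.
utf7_bytes_per_unit = 16 / 6
scheme_bytes_per_unit = 3  # one lead byte + two trail bytes


def to_7bit(byte):
    """ASSUMPTION: the 7-bit form is the upper byte minus 0x80."""
    return byte - 0x80


# Upper bytes 80..9F would land on C0 control codes (00..1F).
c0_collisions = [b for b in range(0x80, 0xA0) if to_7bit(b) < 0x20]

print(len(LEAD_8BIT), len(TRAIL_8BIT), three_byte_capacity)
print(three_byte_capacity < unicode_code_points)
print(round(utf7_bytes_per_unit, 2), scheme_bytes_per_unit)
print(len(c0_collisions), hex(to_7bit(0xA0)))
```

The capacity is enough for 16-bit code units but far short of the full code point range, and under the stated assumption all 32 bytes in 80..9f end up on C0 controls while A0 ends up on 20.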

