Re: Sync/Seek-robust UTF-7

Markus Scherer Tue, 18 Jun 2002 09:35:33 -0700

Shlomi Tal wrote:

> If you think 7-bit issues are totally obsolete, then sorry for bothering...



Personally, I think they are, but I do find encoding schemes entertaining :-)


> UTF-7 is both stateful and fragile. Stateful it has to be, because any 


Fragile. You assume lossy transport instead of trusting the error correction of the 
lower layers.

> attempt to encode a large charset AND maintain compatibility to ASCII has 
> to be stateful.


... if you also care to stay within 7 bits.

> However, it is also fragile in that there is no 
> self-sync or seek coherence (that's the advantage of UTF-8, as we all 
> know).
> 
> Borrowing from the idea of ISO-2022-JP extended into EUC, but the other 
> way round, I had the following "Gedankenexperiment":
> 
> 00..A0 stay the same
> FF not used
> C2..FE leadbytes (1 leadbyte)
> A1..C1 trailbytes (2 trailbytes)
> 
> allowing 61 x 33 x 33 codepoints - a little more than 65536.


What about the other 1M code points? Would this encode UTF-16 code units?
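
For reference, here is a minimal Python sketch of the quoted 8-bit layout. The post gives only the byte ranges, so the arithmetic that maps a value onto one lead byte and two trail bytes (a plain mixed-radix split), the treatment of the input as 16-bit values, and the names encode_unit, decode_unit and char_start are all assumptions made for illustration.

```python
# Sketch only: the byte ranges come from the quoted proposal; the
# value -> (lead, trail, trail) mapping is an assumed mixed-radix split.
LEAD_FIRST, LEAD_LAST = 0xC2, 0xFE      # lead bytes
TRAIL_FIRST, TRAIL_LAST = 0xA1, 0xC1    # trail bytes
N_LEAD = LEAD_LAST - LEAD_FIRST + 1     # 61
N_TRAIL = TRAIL_LAST - TRAIL_FIRST + 1  # 33
CAPACITY = N_LEAD * N_TRAIL * N_TRAIL   # 61 x 33 x 33


def encode_unit(u: int) -> bytes:
    """Encode one 16-bit value; 00..A0 stay single bytes."""
    if not 0 <= u <= 0xFFFF:
        raise ValueError("only 16-bit values are covered by this sketch")
    if u <= 0xA0:
        return bytes([u])
    idx = u - 0xA1
    lead, rest = divmod(idx, N_TRAIL * N_TRAIL)
    t1, t2 = divmod(rest, N_TRAIL)
    return bytes([LEAD_FIRST + lead, TRAIL_FIRST + t1, TRAIL_FIRST + t2])


def decode_unit(b: bytes) -> int:
    """Inverse of encode_unit for a single 1-byte or 3-byte sequence."""
    if len(b) == 1:
        return b[0]
    lead, t1, t2 = b
    idx = ((lead - LEAD_FIRST) * N_TRAIL + (t1 - TRAIL_FIRST)) * N_TRAIL
    return 0xA1 + idx + (t2 - TRAIL_FIRST)


def char_start(buf: bytes, i: int) -> int:
    """Back up from an arbitrary offset to the start of its character.

    Trail bytes never double as single or lead bytes, so stepping back
    over them is enough to resynchronize.
    """
    while i > 0 and TRAIL_FIRST <= buf[i] <= TRAIL_LAST:
        i -= 1
    return i
```

Under these assumptions the disjoint lead and trail ranges give the seek behaviour the proposal is after: from any offset, char_start steps back at most two bytes.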


> And now, with an ISO-2022 sequence for state, reduce to 7-bit:


You seem to imply simply switching between "lower bytes" (00..7f) and "upper bytes" 
(80..ff), which you can do with just SI/SO, without the rest of the ISO 2022 apparatus.

> 42..7E leadbytes (1 leadbyte)
> 21..41 trailbytes (2 trailbytes)


What about 80..9f which would collide with C0 control codes?

What about U+00a0 which would become 20 (space) which might be removed/replaced by 
emailers in ways that you would not expect for U+00a0?
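
A rough Python sketch of the SI/SO reduction, with the two collisions above made explicit. It assumes the usual ISO 2022 meaning of SO (0x0E, switch to the upper set) and SI (0x0F, switch back); the function names to_7bit and from_7bit are hypothetical, and rejecting the problem bytes outright is just one way to surface the issue, not something the proposal specifies.

```python
SO, SI = 0x0E, 0x0F  # shift-out / shift-in control codes


def to_7bit(data: bytes) -> bytes:
    """Fold 8-bit bytes into 7 bits, bracketing upper-half runs with SO/SI."""
    out = bytearray()
    shifted = False
    for b in data:
        if b in (SO, SI):
            raise ValueError("literal SO/SI would be read as a shift")
        if 0x80 <= b <= 0xA0:
            # 80..9F would land on C0 controls, A0 on space (0x20).
            raise ValueError(f"byte {b:#04x} collides after dropping bit 7")
        if b == 0xFF:
            raise ValueError("FF is unused in the 8-bit form")
        high = b >= 0x80
        if high != shifted:
            out.append(SO if high else SI)
            shifted = high
        out.append(b & 0x7F)
    if shifted:
        out.append(SI)  # always end in the ASCII state
    return bytes(out)


def from_7bit(data: bytes) -> bytes:
    """Undo to_7bit by tracking the SO/SI state."""
    out = bytearray()
    shifted = False
    for b in data:
        if b == SO:
            shifted = True
        elif b == SI:
            shifted = False
        else:
            out.append(b | 0x80 if shifted else b)
    return bytes(out)
```

For example, the 8-bit sequence C2 A1 A1 between two ASCII letters comes out as SO 42 21 21 SI, which matches the 42..7E lead and 21..41 trail ranges quoted above.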



What about users' complaints about the high byte-per-code point ratio in Unicode encodings?

For everything but ASCII (U+0000..U+007f), UTF-7 uses 2.67 B/cp, while this uses 3 
B/cp.
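
The 2.67 figure follows from UTF-7's modified base64 carrying 6 bits per byte for each 16-bit code unit. A quick Python check, assuming a long run of BMP characters so that the "+" and "-" delimiters are amortized away (the sample character and run length are arbitrary choices):

```python
# Theoretical ratio: a 16-bit code unit at 6 payload bits per byte.
utf7_ratio = 16 / 6

# Fixed cost of the proposed scheme outside the single-byte range:
# one lead byte plus two trail bytes.
proposed_ratio = 1 + 2

# Empirical check with Python's built-in UTF-7 codec on a long run of
# one non-ASCII BMP character (arbitrary sample).
text = "\u4e2d" * 300
measured_ratio = len(text.encode("utf-7")) / len(text)

print(f"UTF-7 theoretical: {utf7_ratio:.2f} B/cp")
print(f"UTF-7 measured:    {measured_ratio:.2f} B/cp")
print(f"proposed scheme:   {proposed_ratio} B/cp")
```

Short runs of non-ASCII text fare worse in UTF-7 than the amortized figure, because each run pays for its own shift characters.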



> Stateful, yes... fragile, no! Any relevance, or is this just an amusing 
> experiment to be kept among geeks privately?


Time will tell. You could ask Doug to add it to his collection :-)

markus
