At 10:21 AM 6/18/02 +0000, Shlomi Tal wrote:
>Stateful, yes... fragile, no! Any relevance, or is this just an amusing experiment to 
>be kept among geeks privately?

There's a huge number of features to be traded off when making a UTF:
complexity, encoding/decoding speed, uniqueness, statefulness, and
seekability (being able to start at an arbitrary point in the stream
and find character boundaries either (a) right off, or (b) after
short seeks back or forward). The current UTFs made certain choices,
and it's generally thought better to stick with the UTFs we have,
for simplicity's sake, than to add more that don't make radically new
choices. Given that a 7-bit UTF is not a major need, and UTF-8 is
more often used even in UTF-7's home field of email, I don't see
why a new UTF would be more than an amusing experiment. UTF-7
works well enough for what it does.

That said, I've been working on my own UTF, privately dubbed 
ISO-2022-UTF. It does end up mapping 96-character planes to 
G0, but ISO-2022-JP-3 does it, and that's a MIME-legal charset.

Code point range       Designation
U+0000-U+007F (ASCII)  ESC 2/8  4/2
U+0000-U+23FF          ESC 2/14 3/1
U+2400-U+47FF          ESC 2/14 3/2
U+4800-U+6BFF          ESC 2/14 3/3
U+6C00-U+8FFF          ESC 2/14 3/4
U+9000-U+B3FF          ESC 2/14 3/5
U+B400-U+D7FF          ESC 2/14 3/6
U+D800-U+FBFF          ESC 2/14 3/7
U+FC00-U+11FFF         ESC 2/14 3/8
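
Since every ESC 2/14 row covers 96*96 = 0x2400 code points, the table
can be read as a simple division. The following is only a sketch of
that lookup; the function name designation_for is hypothetical, and
rejecting code points above U+11FFF here (rather than splitting them
into surrogates) is a choice made for this illustration.

```python
WINDOW = 96 * 96  # 0x2400 code points per designated set, as in the table


def designation_for(cp: int) -> bytes:
    """Return the escape sequence that the table assigns to a code point.

    Hypothetical helper for illustration; not part of any formal spec.
    """
    if cp < 0:
        raise ValueError("negative code point")
    if cp <= 0x7F:
        # ASCII must use ESC 2/8 4/2, never ESC 2/14 3/1.
        return bytes([0x1B, 0x28, 0x42])
    if cp > 0x11FFF:
        # Outside the table; such characters go through surrogates instead.
        raise ValueError("above U+11FFF: split into surrogates first")
    # Rows 3/1 .. 3/8 follow each other in steps of 0x2400 code points.
    return bytes([0x1B, 0x2E, 0x31 + cp // WINDOW])
```

For example, designation_for(0x2400) gives ESC 2/14 3/2, the second
ESC 2/14 row of the table.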

ISO-2022-UTF starts with ASCII in G0 and the normal C0 controls in C0.
It's invalid to use ESC 2/14 3/1 for characters in ASCII. For characters
above U+11FFF, surrogate characters are used. For characters between
U+10000 and U+11FFF, ESC 2/14 3/8 should be used. When used as a MIME
charset, it is suggested that every line end with a return to ASCII,
for compatibility with ISO-2022-JP-*. Also, when used as a MIME
charset, CRLF must be used as the line ending.
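
To make these rules concrete, here is a rough sketch of a line encoder.
It is not a formal definition. In particular, how a code point maps to
the two bytes inside a 96x96 set is not specified above, so the
row-major layout over bytes 2/0 to 7/15 used here is purely an
assumption, as are the names code_units and encode_line.

```python
ESC = 0x1B
WINDOW = 96 * 96                      # code points per ESC 2/14 set
TO_ASCII = bytes([ESC, 0x28, 0x42])   # ESC 2/8 4/2


def code_units(text):
    """Yield code points, splitting those above U+11FFF into surrogates."""
    for ch in text:
        cp = ord(ch)
        if cp > 0x11FFF:
            v = cp - 0x10000          # standard UTF-16 surrogate arithmetic
            yield 0xD800 + (v >> 10)
            yield 0xDC00 + (v & 0x3FF)
        else:
            yield cp                  # U+10000-U+11FFF stay whole (set 3/8)


def encode_line(text: str) -> bytes:
    """Encode one line (given without its line ending) and append CRLF."""
    out = bytearray()
    current = None                    # None: ASCII is designated (initial state)
    for unit in code_units(text):
        wanted = None if unit < 0x80 else unit // WINDOW
        if wanted != current:
            # Emit an escape sequence only when the designation changes.
            out += TO_ASCII if wanted is None else bytes([ESC, 0x2E, 0x31 + wanted])
            current = wanted
        if wanted is None:
            out.append(unit)
        else:
            offset = unit - wanted * WINDOW
            # ASSUMPTION: row-major two-byte layout, each byte in 0x20..0x7F.
            out += bytes([0x20 + offset // 96, 0x20 + offset % 96])
    if current is not None:
        out += TO_ASCII               # suggested return to ASCII at end of line
    return bytes(out) + b"\r\n"       # CRLF, as required for MIME use
```

A pure-ASCII line comes out unchanged apart from the CRLF, and a
character such as U+1F600 needs only one ESC 2/14 3/7, since both of
its surrogates fall in U+D800-U+FBFF.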

I don't see a real use for it, and as is, it could use some
formalization before actual use. But it seems like a workable enough
design for an ISO-2022 UTF.

