At 10:21 AM 6/18/02 +0000, Shlomi Tal wrote:
>Stateful, yes... fragile, no! Any relevance, or is this just an amusing experiment to
>be kept among geeks privately?
There's a huge number of features to be traded off when making a UTF: complexity, encoding/decoding speed, uniqueness, statefulness, and seekability (being able to start at an arbitrary point in the stream and find character boundaries either (a) right off, or (b) after short seeks back or forward). The current UTFs made certain choices, and it's generally thought better to stick with the UTFs we have, for simplicity's sake, than to add more that don't make radically new choices.

Given that a 7-bit UTF is not a major need, and UTF-8 is more often used even in UTF-7's home field of email, I don't see why a new UTF would be more than an amusing experiment. UTF-7 works well enough for what it does.

That said, I've been working on my own UTF, privately dubbed ISO-2022-UTF. It does end up mapping 96-character planes to G0, but ISO-2022-JP-3 does it, and that's a MIME-legal charset.

U+0000-U+007F (ASCII)   ESC 2/8 4/2
U+0000-U+23FF           ESC 2/14 3/1
U+2400-U+47FF           ESC 2/14 3/2
U+4800-U+6BFF           ESC 2/14 3/3
U+6C00-U+8FFF           ESC 2/14 3/4
U+9000-U+B3FF           ESC 2/14 3/5
U+B400-U+D7FF           ESC 2/14 3/6
U+D800-U+FBFF           ESC 2/14 3/7
U+FC00-U+11FFF          ESC 2/14 3/8

ISO-2022-UTF starts with ASCII in G0 and normal C0 in C0. It's invalid to use ESC 2/14 3/1 for characters in ASCII. For characters above U+11FFF, surrogate characters are used. For characters between U+10000 and U+11FFF, ESC 2/14 3/8 should be used.

When used as a MIME charset, it is suggested that every line end with a return to ASCII, for compatibility with ISO-2022-JP-*. Also, when used as a MIME charset, CRLF must be used as the line ending.

I don't see a real use for it, and as is, it could use some formalization before actual use. But it seems like a workable enough design for an ISO-2022 UTF.
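
For anyone who wants to poke at it, here is a rough Python sketch of an encoder for the table above. Each range spans 0x2400 = 96 x 96 code points, so every character outside ASCII fits in two bytes of a 96-character set. Only the escape sequences and ranges come from the description above; the byte layout inside a plane (a row byte and a column byte, each in 2/0 through 7/15), the function names, and the choice to switch back to ASCII for every ASCII or C0 character are assumptions made for illustration, not part of any formal definition.

```python
from typing import Iterator

ESC = 0x1B
PLANE_SIZE = 96 * 96  # 9216 == 0x2400, the width of every range in the table

ASCII_DESIGNATION = bytes([ESC, 0x28, 0x42])  # ESC 2/8 4/2


def designation_for(unit: int) -> bytes:
    """Return the escape sequence selecting the set that contains `unit`."""
    if unit < 0x80:
        # ASCII must not be sent through ESC 2/14 3/1.
        return ASCII_DESIGNATION
    if unit > 0x11FFF:
        raise ValueError("above U+11FFF: split into surrogates first")
    plane = unit // PLANE_SIZE  # 0..7
    return bytes([ESC, 0x2E, 0x31 + plane])  # ESC 2/14 3/1 .. 3/8


def to_units(text: str) -> Iterator[int]:
    """Yield code points, splitting anything above U+11FFF into surrogates."""
    for ch in text:
        cp = ord(ch)
        if cp > 0x11FFF:
            cp -= 0x10000
            yield 0xD800 + (cp >> 10)
            yield 0xDC00 + (cp & 0x3FF)
        else:
            yield cp


def encode(text: str) -> bytes:
    out = bytearray()
    current = ASCII_DESIGNATION  # the stream starts with ASCII in G0
    for unit in to_units(text):
        wanted = designation_for(unit)
        if wanted != current:
            out += wanted
            current = wanted
        if unit < 0x80:
            out.append(unit)
        else:
            # ASSUMED layout: row and column bytes, each in 0x20..0x7F.
            row, col = divmod(unit % PLANE_SIZE, 96)
            out += bytes([0x20 + row, 0x20 + col])
    if current != ASCII_DESIGNATION:
        out += ASCII_DESIGNATION  # assumed: finish back in ASCII
    return bytes(out)
```

Because CR and LF are below 0x80, this sketch re-designates ASCII before every line ending, which happens to satisfy the suggestion that each line end with a return to ASCII. A real decoder would also need rules for malformed input, which this does not attempt.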

