OK, the eXperimental Transformation Format works as follows (I didn't make it 
clear enough before):

C0, G0, C1 and NBSP (0xA0) stay the same: a single byte each.
All Unicode characters from U+00A1 onwards are encoded in three bytes, the 
first of which is in the range C2..FE, the other two A1..C1.

Thus U+00A1 = 0xC2 0xA1 0xA1
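
The description above only fixes the byte ranges and the first code point, so 
here is a rough sketch of one way the rest could be filled in. It assumes 
(this is not stated in the scheme itself) that code points are simply numbered 
from U+00A1 upward and that number is written in base 33, since A1..C1 gives 
33 trail values. With 61 lead values that makes 61 x 33 x 33 = 66429 slots, 
enough for the 65375 code points U+00A1..U+FFFF. The names encode_bmp and 
decode_bmp are made up for illustration.

```python
FIRST = 0xA1               # first code point that needs three bytes
LEAD_BASE = 0xC2           # lead bytes run C2..FE
TRAIL_BASE = 0xA1          # trail bytes run A1..C1
RADIX = 0xC1 - 0xA1 + 1    # 33 trail values (assumed base-33 numbering)


def encode_bmp(cp):
    """Encode one BMP code point under the assumed base-33 mapping."""
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("only U+0000..U+FFFF handled here")
    if cp < FIRST:
        return bytes([cp])             # C0, G0, C1, NBSP: one byte, unchanged
    lead, rest = divmod(cp - FIRST, RADIX * RADIX)
    mid, low = divmod(rest, RADIX)
    return bytes([LEAD_BASE + lead, TRAIL_BASE + mid, TRAIL_BASE + low])


def decode_bmp(seq):
    """Decode a one-byte or three-byte sequence back to a code point."""
    if len(seq) == 1 and seq[0] < FIRST:
        return seq[0]
    if (len(seq) == 3 and 0xC2 <= seq[0] <= 0xFE
            and all(TRAIL_BASE <= b <= 0xC1 for b in seq[1:])):
        lead = seq[0] - LEAD_BASE
        mid = seq[1] - TRAIL_BASE
        low = seq[2] - TRAIL_BASE
        cp = FIRST + (lead * RADIX + mid) * RADIX + low
        if cp <= 0xFFFF:
            return cp
    raise ValueError("not a valid sequence")


print(encode_bmp(0xA1).hex(" "))   # the example from the text: c2 a1 a1
```

Under this assumed numbering U+FFFF comes out as FE A2 A2, so the top of the 
BMP just fits under the FE lead byte.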

Advantages:

1. ASCII compatibility
2. C1 compatibility
3. Can be reduced to a 7-bit SI/SO scheme with no control-code overlap, thus 
being a UTF-7 without the real UTF-7's chief disadvantage: no sync.
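
To illustrate point 3, a minimal sketch of one possible 7-bit reduction. The 
framing is an assumption, not something the scheme spells out: a run of 
three-byte sequences is bracketed with SO (0x0E) and SI (0x0F) and the top bit 
of each byte is cleared. Lead bytes then land in 42..7E and trail bytes in 
21..41, so nothing collides with a control code and the two ranges stay 
disjoint. The function name to_7bit is hypothetical, and C1 and NBSP are left 
out because the scheme does not say how they would be carried.

```python
SO, SI = 0x0E, 0x0F   # Shift Out / Shift In


def to_7bit(data):
    """Convert an 8-bit stream (ASCII plus three-byte sequences) to 7 bits."""
    out = bytearray()
    shifted = False
    i = 0
    while i < len(data):
        b = data[i]
        if 0xC2 <= b <= 0xFE:                      # lead byte
            seq = data[i:i + 3]
            if len(seq) != 3 or not all(0xA1 <= t <= 0xC1 for t in seq[1:]):
                raise ValueError("malformed three-byte sequence")
            if not shifted:
                out.append(SO)
                shifted = True
            out.extend(x & 0x7F for x in seq)      # clear the top bit
            i += 3
        elif b < 0x80:                             # plain ASCII
            if shifted:
                out.append(SI)
                shifted = False
            out.append(b)
            i += 1
        else:
            raise ValueError("C1, NBSP and stray trail bytes not handled")
    if shifted:
        out.append(SI)
    return bytes(out)


print(to_7bit(b"a\xc2\xa1\xa1b"))   # U+00A1 between two ASCII letters
```

This sketch also assumes the ASCII text itself never contains a literal SO or 
SI.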

Disadvantages:

1. No simple way of filling bits like UTF-8's 110xxxxx 10xxxxxx, since the 33 
trail values (A1..C1) are not a power of two. I suppose this brings us back to 
UTF-1's modulo complexities...

2. 3 bytes for all Unicode characters above U+00A0.

3. UTF-16 surrogate piggybacking - 6 bytes per outside-BMP codepoint. Really 
yucky, but those characters are rare.
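
The 6-byte figure in point 3 can be sketched like this: split the code point 
into a UTF-16 surrogate pair (that part is standard) and encode each surrogate 
as its own three-byte sequence. The three-byte mapping used here is an 
assumed base-33 numbering from U+00A1, which the scheme does not actually 
specify, and the function names are invented for the example.

```python
def surrogates(cp):
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def three_bytes(unit):
    """Assumed base-33 mapping of one 16-bit unit at or above U+00A1."""
    if not 0xA1 <= unit <= 0xFFFF:
        raise ValueError("unit outside U+00A1..U+FFFF")
    lead, rest = divmod(unit - 0xA1, 33 * 33)
    mid, low = divmod(rest, 33)
    return bytes([0xC2 + lead, 0xA1 + mid, 0xA1 + low])


def encode_supplementary(cp):
    """Encode a non-BMP code point as two three-byte sequences (6 bytes)."""
    high, low = surrogates(cp)
    return three_bytes(high) + three_bytes(low)


print(len(encode_supplementary(0x1F600)))   # two surrogates, three bytes each
```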
