Markus Scherer wrote:

>+/v8 is the encoding of U+FEFF as the first code point in a text. So far, 
>so good.
>The '-' as the next byte switches UTF-7 back to direct-encoding of a subset 
>of US-ASCII.
>
>What if there is no '-' there? What if a non-ASCII code point immediately 
>follows the U+FEFF?
>In such a case, depending on the following code point, the first four bytes 
>could be
>   +/v8  or  +/v9  or  +/v+  or  +/v/
>
>The 4th byte will not be '8' if the following code point is >=U+4000.
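
The four variants quoted above can be reproduced with a minimal Python sketch. It assumes the code point after U+FEFF sits in the same base64 run, and `bom_prefix` is a hypothetical helper name chosen for illustration, not part of any library. The last 4 bits of FEFF share a base64 digit with the top 2 bits of the next UTF-16 code unit, which is why the fourth byte varies.

```python
import base64

def bom_prefix(next_char: str) -> str:
    """Return the first four bytes of a UTF-7 text in which next_char
    follows U+FEFF inside the same base64 run (no '-' in between)."""
    utf16 = ("\ufeff" + next_char).encode("utf-16-be")
    b64 = base64.b64encode(utf16).decode("ascii")
    # '+' opens the shifted sequence; the next three base64 digits cover
    # the 16 bits of U+FEFF plus the top 2 bits of the following code unit.
    return "+" + b64[:3]

# One sample code point from each quarter of the BMP.
for ch in ("\u03a3", "\u4e00", "\uac00", "\uff21"):
    print(f"U+{ord(ch):04X} -> {bom_prefix(ch)}")
# U+03A3 -> +/v8
# U+4E00 -> +/v9
# U+AC00 -> +/v+
# U+FF21 -> +/v/
```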

This goes beyond the stateful irregularity of UTF-7; it also demonstrates 
a violation of the principle that the same characters should always have 
the same encoded form. You could write the sequence U+xxxx U+yyyy as either 
+uuvww- or +uvu-+wvw- (the letters are just placeholders; I didn't intend 
any specific equation in them). Ever since I first read about UTF-7, it has 
shocked me that Greek "Sokrates" and "S o k r a t e s" (with spaces between 
the Greek letters in the latter) would have different encodings for the 
same Greek letters.
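
As a rough illustration of this point, the sketch below builds two different UTF-7 byte strings for the same two Greek letters and checks that Python's built-in "utf-7" codec decodes both to the same text. The helper `shifted` is a hypothetical name used only here; it encodes one run by hand (unpadded base64 of UTF-16BE between '+' and '-') and is not meant to reproduce what any particular encoder emits.

```python
import base64

def shifted(run: str) -> bytes:
    """Encode run as a single UTF-7 shifted sequence:
    '+', unpadded base64 of the UTF-16BE bytes, then '-'."""
    b64 = base64.b64encode(run.encode("utf-16-be")).rstrip(b"=")
    return b"+" + b64 + b"-"

sigma, omega = "\u03a3", "\u03c9"

one_run = shifted(sigma + omega)                  # b'+A6MDyQ-'
two_runs = shifted(sigma) + shifted(omega)        # b'+A6M-+A8k-'
spaced = shifted(sigma) + b" " + shifted(omega)   # b'+A6M- +A8k-'

# Different bytes, identical decoded text.
print(one_run, two_runs)
print(one_run.decode("utf-7") == two_runs.decode("utf-7"))  # True

# With a space in between, omega is encoded as 'A8k'; inside the joined
# run its bits are spread across 'DyQ' instead.
print(spaced.decode("utf-7") == sigma + " " + omega)        # True
```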

It's a good thing UTF-7 is deprecated; the only reason for still mentioning 
it is that it appears as an option on mail clients.

By the way, when converting UTF-16 to UTF-7 through the Win2K/XP command 
prompt (doing "chcp 65000" and then piping the output of the UTF-16 file 
into a new file), the OS also encodes those characters which are deemed 
unsafe by MIME, such as quotation marks, exclamation marks, ampersands and 
so forth. This is in contrast to GNU recode (I have the DJGPP 32-bit DOS 
version from Simtelnet), which leaves those characters as they are.
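
Both behaviours yield valid UTF-7, because RFC 2152 makes characters such as the quotation mark, exclamation mark and ampersand "optional direct" characters. The sketch below does not reproduce the actual output of Windows or GNU recode; it only checks, with Python's "utf-7" codec, that the escaped and the unescaped form decode to the same character. `shifted` is again a hypothetical helper written for this example.

```python
import base64

def shifted(ch: str) -> bytes:
    """Hand-encode ch as a UTF-7 shifted sequence:
    '+', unpadded base64 of UTF-16BE, '-'."""
    b64 = base64.b64encode(ch.encode("utf-16-be")).rstrip(b"=")
    return b"+" + b64 + b"-"

for ch in ('"', "!", "&"):
    escaped = shifted(ch)        # base64-escaped form
    direct = ch.encode("ascii")  # character left as it is
    same = escaped.decode("utf-7") == direct.decode("utf-7") == ch
    print(direct, escaped, same)
# b'"' b'+ACI-' True
# b'!' b'+ACE-' True
# b'&' b'+ACY-' True
```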
