Markus Scherer wrote:

>+/v8 is the encoding of U+FEFF as the first code point in a text. So far,
>so good.
>The '-' as the next byte switches UTF-7 back to direct-encoding of a subset
>of US-ASCII.
>
>What if there is no '-' there? What if a non-ASCII code point immediately
>follows the U+FEFF?
>In such a case, depending on the following code point, the first four bytes
>could be
>   +/v8 or +/v9 or +/v+ or +/v/
>
>The 4th byte will not be '8' if the following code point is >=U+4000.
This shows more than the stateful irregularity of UTF-7. It also shows that UTF-7 breaks the principle that a given sequence of Unicode characters has exactly one encoded form. You could write the character sequence U+xxxx U+yyyy either as +uuvww- or as +uvu-+wvw- (the letters are just placeholders; I don't intend any specific relation between them). Ever since I first read about UTF-7, it has shocked me that Greek "Sokrates" and "S o k r a t e s" (with a space between each Greek letter in the latter) encode the same Greek letters as different bytes.

It's a good thing UTF-7 is deprecated; the only reason to still mention it is that it appears as an option in mail clients.

By the way, when converting UTF-16 to UTF-7 at the Win2K/XP command prompt (doing "chcp 65000" and then redirecting the output of the UTF-16 file into a new file), the OS also encodes those characters that are deemed unsafe for MIME, such as quotation marks, exclamation marks, ampersands and so forth. This is in contrast to GNU recode (I have the DJGPP 32-bit DOS version from Simtelnet), which leaves those characters as they are.
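To make both points concrete, here is a small Python sketch. It is only an illustration: utf7_block is a hypothetical helper of mine, not part of any library, and it assumes the simplest possible encoder, one that wraps its whole input in a single shifted block ('+', then base64 of the UTF-16BE bytes without padding, then '-'). Real encoders make different choices about when to shift in and out.

```python
import base64


def utf7_block(text: str) -> bytes:
    """Encode text as one shifted UTF-7 block: '+', base64(UTF-16BE), '-'.

    Hypothetical helper for illustration only; it always shifts, even for
    characters that UTF-7 would allow to be written directly.
    """
    b64 = base64.b64encode(text.encode("utf-16-be")).rstrip(b"=")
    return b"+" + b64 + b"-"


# The 4th byte carries the low four bits of U+FEFF plus the top two bits
# of whatever code unit follows, so it changes with the follower.
followers = ["", "\u03a3", "\u4e00", "\u8000", "\uc000"]
fourth_bytes = [utf7_block("\ufeff" + f)[3:4] for f in followers]
# fourth_bytes == [b"8", b"8", b"9", b"+", b"/"]

# The same two Greek letters (sigma, omega) as one block or as two blocks.
one_block = utf7_block("\u03a3\u03c9")                    # b"+A6MDyQ-"
two_blocks = utf7_block("\u03a3") + utf7_block("\u03c9")  # b"+A6M-+A8k-"

# Python's built-in UTF-7 decoder maps both byte strings to the same text.
same_text = one_block.decode("utf-7") == two_blocks.decode("utf-7")
```

The byte strings differ while the decoded text is identical, which is the "Sokrates" versus "S o k r a t e s" effect in miniature: once a run of letters is split into separate blocks, each letter starts at a fresh base64 boundary.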

