Shlomi Tal wrote:

> UTF-7, it shocked me how Greek "Sokrates" and "S o k r a t e s" (with 
> spaces between each Greek letter in the latter) would have different 
> encodings for the same Unicode characters.


That is not unusual for stateful encodings.
BOCU-1 behaves the same way (not in this particular case, but it would with, for 
example, U+00A0 between the letters).
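
To make the statefulness concrete, here is a small sketch using Python's built-in "utf-7" codec. The Greek spelling below is my assumption (the post only says Greek "Sokrates"), and other encoders may make different but equally valid choices:

```python
# Assumed spelling of Greek "Sokrates"; the original post does not give it.
name = "Σωκράτης"
spaced = " ".join(name)  # same letters, with U+0020 between them

plain_bytes = name.encode("utf-7")
spaced_bytes = spaced.encode("utf-7")

# In the unspaced form the letters share one base64 run, so each 16-bit
# code unit starts at a different bit offset inside the 6-bit groups.
# In the spaced form every space ends the run and the next letter
# starts a fresh one.
print(plain_bytes)
print(spaced_bytes)

# Both byte strings still decode to exactly the text they came from.
print(plain_bytes.decode("utf-7") == name)
print(spaced_bytes.decode("utf-7") == spaced)
```

Removing the spaces from the second byte string does not give you the first one, which is the effect Shlomi noticed.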


> It's a good thing UTF-7 is deprecated; ...


I agree. In an 8-bit-clean world, it has little use.
It is interesting, though, that UTF-7 actually encodes some text in fewer bytes than 
UTF-8: within a long base64 run, each character in U+0800..U+FFFF takes about 2.67 
bytes in UTF-7 (16 bits at 6 bits per byte) but 3 bytes in UTF-8 :-)
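
A quick way to check the arithmetic, again as a sketch with Python's built-in codecs. The sample character U+4E2D and the run length of 300 are arbitrary choices of mine; the run has to be long enough that the "+" and "-" around the base64 section stop mattering:

```python
# 300 copies of one character from the U+0800..U+FFFF range.
text = "\u4e2d" * 300

utf7_bytes = text.encode("utf-7")
utf8_bytes = text.encode("utf-8")

# UTF-8 needs three bytes for every character in this range.
print("UTF-8 bytes per character:", len(utf8_bytes) / len(text))

# UTF-7 packs 16 bits per character into 6-bit base64 digits,
# so a long run approaches 16/6 bytes per character.
print("UTF-7 bytes per character:", len(utf7_bytes) / len(text))
```

For short runs, or text that keeps switching between ASCII and non-ASCII, the shift overhead eats the advantage.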


> By the way, when converting UTF-16 to UTF-7 through the Win2K/XP command 
> prompt (doing "chcp 65000" and then piping the output of the UTF-16 file 
> into a new file), the OS transcodes also those values which are deemed 
> unsafe by MIME, such as quotation marks, excls, ampersands and so forth. 
> This is in contrast to GNU recode (I have the DJGPP 32-bit DOS version 
> from Simtelnet), which leaves those characters as they are.


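Either is correct. The UTF-7 spec (RFC 2152) leaves this open: it defines a set of 
optional direct characters that an encoder may write either as themselves or in base64.
ICU's UTF-7 converter has an option to use the minimum or maximum set of direct-ASCII 
characters allowed by the UTF-7 spec.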
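
To illustrate the two choices, here is a minimal toy encoder. It is not ICU's API, not what Win2K/XP does internally, and not GNU recode's code; the function name encode_utf7 and its direct_optional flag are made up for this sketch. The character sets are Set D and Set O as I read them in RFC 2152:

```python
import base64
import string

# Set D: always allowed as direct ASCII.
SET_D = set(string.ascii_letters + string.digits + "'(),-./:?")
# Set O: optionally direct; an encoder may also put these in base64.
SET_O = set('!"#$%&*;<=>@[]^_`{|}')
WHITESPACE = set(" \t\r\n")

def encode_utf7(text, direct_optional):
    """Hypothetical helper: encode text as UTF-7, with or without Set O direct."""
    direct = SET_D | WHITESPACE | (SET_O if direct_optional else set())
    out = []
    run = []  # pending characters that must go through base64

    def flush():
        if run:
            raw = "".join(run).encode("utf-16-be")
            digits = base64.b64encode(raw).rstrip(b"=").decode("ascii")
            out.append("+" + digits + "-")  # always close the run explicitly
            run.clear()

    for ch in text:
        if ch in direct:
            flush()
            out.append(ch)
        elif ch == "+":
            flush()
            out.append("+-")  # the escape for a literal plus sign
        else:
            run.append(ch)
    flush()
    return "".join(out)

sample = 'Say "hi" & go!'
maximal = encode_utf7(sample, direct_optional=True)   # recode-like behaviour
minimal = encode_utf7(sample, direct_optional=False)  # Windows-like behaviour
print(maximal)
print(minimal)
```

Both outputs decode to the same text with any conforming UTF-7 decoder, which is why neither tool is wrong.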

markus

