Shlomi Tal wrote:
> UTF-7, it shocked me how Greek "Sokrates" and "S o k r a t e s" (with
> spaces between each Greek letter in the latter) would have different
> encodings for the same Unicode characters.
That is not unusual for stateful encodings. The same happens with BOCU-1 (not in this particular case, but for example with U+00A0 between the letters).

> It's a good thing UTF-7 is deprecated; ...

I agree. In an 8-bit-clean world, it has little use. It is interesting, though, to see that UTF-7 actually encodes some text with fewer bytes than UTF-8: everything in U+0800..U+FFFF takes about 2.67 bytes per character in UTF-7 (16 bits at 6 bits per base64 character, not counting the shift characters) but 3 bytes in UTF-8 :-)

> By the way, when converting UTF-16 to UTF-7 through the Win2K/XP command
> prompt (doing "chcp 65000" and then piping the output of the UTF-16 file
> into a new file), the OS transcodes also those values which are deemed
> unsafe by MIME, such as quotation marks, excls, ampersands and so forth.
> This is in contrast to GNU recode (I have the DJGPP 32-bit DOS version
> from Simtelnet), which leaves those characters as they are.

Either is correct. The UTF-7 spec is vague on this. ICU's UTF-7 converter has an option to use the minimum or maximum set of direct-ASCII characters allowed by the UTF-7 spec.

markus
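
As a rough illustration of the two points above (encodings that depend on how the letters are grouped, and the byte counts), here is a small sketch using Python's built-in "utf-7" codec. The choice of Python, the sample strings, and the variable names are assumptions for illustration only; none of them come from the thread, and other encoders may make different choices about the optional direct characters.

```python
# Sketch: UTF-7 output depends on how characters are grouped into base64 runs.
# Uses Python's built-in "utf-7" codec; the sample strings are illustrative.

joined = "Σωκρατης"             # Greek letters in one contiguous run
spaced = " ".join(joined)       # the same letters, separated by spaces

enc_joined = joined.encode("utf-7")
enc_spaced = spaced.encode("utf-7")

# In the joined form the 16-bit units are packed back to back and then cut
# into 6-bit base64 digits, so a letter's bits land differently than when
# every letter starts its own base64 run.
print(enc_joined)
print(enc_spaced)

# Both byte sequences still decode to the text they came from.
assert enc_joined.decode("utf-7") == joined
assert enc_spaced.decode("utf-7") == spaced

# Byte counts for characters in U+0800..U+FFFF (here: CJK ideographs).
sample = "日本語" * 100
utf7_len = len(sample.encode("utf-7"))
utf8_len = len(sample.encode("utf-8"))

# Average bytes per character: close to 16/6 for UTF-7, exactly 3 for UTF-8.
print(utf7_len / len(sample), utf8_len / len(sample))
```

The long sample string is deliberate: the shift-in and shift-out characters add a fixed overhead per run, so the 2.67 figure only shows up once a run is long enough for that overhead to be negligible.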