On Thu, Feb 21, 2013 at 2:12 PM, Richard Wordingham < [email protected]> wrote:
> Microsoft chose WEOF=0xffff. I don't think it can easily be changed to
> a better value until an incompatible processor architecture is used.
> Changing it is likely to break existing executables and object
> libraries.

If this is true, it's certainly a poor choice, and might violate the C standard. (I have not checked the actual standard for getwc(), wint_t & WEOF.)

> 16-bit wchar_t doesn't exactly support 21-bit Unicode.

Right -- that's why the standard library uses a separate type, wint_t, which can be wider if necessary. Nothing requires a library that processes 16-bit Unicode strings to have a 16-bit type for a single-character return value. Likewise, the C standard getc() returns a *negative* EOF value, in an integer type that is wider than a byte.

> The UTC is now applying additional pressure for the making of the
> distinction between UTF-16 and UTF-16LE.

The UTC is doing no such thing. Nothing has changed with regard to the UTF-16 encoding scheme and the BOM. U+FFFE has always been a code point that will never have a real character assigned to it; that's why it is *unlikely* to appear as the first character in a text file and thus useful as a "reverse BOM". However, it was never forbidden from occurring in the text.

Best practice for file encodings has always been to declare the encoding. Second best for UTF-16 is to always include the BOM, even if the byte order is big-endian. And since most computers are little-endian, they need to include the BOM in UTF-16 file encodings anyway (if they use their native endianness).

markus
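
For illustration, here is a minimal sketch in C of two points from this message: a reader for 16-bit code units whose return type is wider than the code unit, so that an end-of-input sentinel cannot collide with U+FFFF, and a check of the first two bytes of a buffer for a UTF-16 BOM. All names here (u16_reader, u16_next, U16_EOF, u16_count, detect_utf16_bom) are hypothetical and are not part of any standard library; this is not how any particular C runtime implements wint_t or WEOF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel: negative, so it can never equal a 16-bit code unit. */
#define U16_EOF (-1)

/* Hypothetical in-memory reader over UTF-16 code units. */
struct u16_reader {
    const uint16_t *p;   /* next code unit to return */
    size_t left;         /* code units remaining */
};

/* Returns the next code unit (0..0xFFFF) or U16_EOF at end of input.
 * The return type is 32 bits wide, in the same way that getc() returns
 * an int rather than a char. */
int32_t u16_next(struct u16_reader *r)
{
    if (r->left == 0)
        return U16_EOF;
    r->left--;
    return *r->p++;
}

/* Counts code units by reading until the sentinel; a 0xFFFF in the
 * data does not end the loop early. */
size_t u16_count(const uint16_t *s, size_t n)
{
    struct u16_reader r = { s, n };
    size_t count = 0;
    while (u16_next(&r) != U16_EOF)
        count++;
    return count;
}

/* Byte order as indicated by a leading BOM (U+FEFF). */
enum utf16_order { UTF16_NO_BOM, UTF16_BIG_ENDIAN, UTF16_LITTLE_ENDIAN };

/* Inspects the first two bytes only. FE FF is the BOM in big-endian order;
 * FF FE is the BOM in little-endian order (it would read as U+FFFE if
 * taken as big-endian). This is a heuristic: U+FFFE is unlikely but not
 * forbidden at the start of text, and FF FE 00 00 is also the UTF-32LE BOM. */
enum utf16_order detect_utf16_bom(const unsigned char *buf, size_t len)
{
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF)
            return UTF16_BIG_ENDIAN;
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            return UTF16_LITTLE_ENDIAN;
    }
    return UTF16_NO_BOM;
}
```

With a 16-bit return type and a sentinel of 0xFFFF, the loop in u16_count would stop at the first U+FFFF in the data. The wider return type avoids that without changing the 16-bit string type.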

