On Thu, 21 Feb 2013 11:52:07 -0800 Markus Scherer <[email protected]> wrote:
> On Thu, Feb 21, 2013 at 11:06 AM, Richard Wordingham
> <[email protected]> wrote:
>
> "fgetwc returns, as a
> wint_t <http://msdn.microsoft.com/en-us/library/323b6b3k.aspx>, the
> wide character that corresponds to the character read or returns
> WEOF to indicate an error or end of file. For both functions, use
> feof or ferror to distinguish between an error and an end-of-file
> condition." http://msdn.microsoft.com/en-us/library/c7sskzc1.aspx
>
> In other words, the wint_t value WEOF is supposed to be out-of-range
> for normal characters, and if in doubt, the API docs tell you to
> call feof().

Actually, you have to call both! If both feof() and ferror() return
zero, then you have U+FFFF. Just calling feof() would lead one, by UTC
ruling, to misdiagnose an error.

> On my Ubuntu laptop, wchar.h defines WEOF=0xffffffffu, which is
> thoroughly out of range for Unicode.

Microsoft chose WEOF=0xffff. I don't think it can easily be changed to
a better value until an incompatible processor architecture is used.
Changing it is likely to break existing executables and object
libraries.

> The comment for *wint_t* says
>
> /* Integral type unchanged by default argument promotions that can
>    hold any value corresponding to members of the extended character
>    set, as well as *at least one value that does not correspond to
>    any member of the extended character set*. */
>
> I don't have a Windows system handy to check for the value there. I
> assume that it follows the standard:

A 16-bit wchar_t doesn't exactly support 21-bit Unicode. Hitherto, one
could always have tried claiming that reading U+FFFF when expecting
ordinary characters was tantamount to interchanging code containing
it, or that this internal usage was one of the restrictions of the
system. The 'correction' destroys that defence. One can still note
that U+FFFF is not an assigned character and never will be!

>> U+FFFE at the start of a UTF-16 file should also cause some
>> headaches!
>> Doesn't Microsoft Windows still interpret this as a byte-order mark
>> without asking whether there may be a byte-order mark?

> In the UTF-16 *encoding scheme*, such as in an otherwise unmarked
> file, the leading bytes FF FE and FE FF have special meaning. Again,
> this has nothing to do with the first character in a string in code.
> None of this has changed.

Those believing the restrictive interpretation would not expect
UTF-16LE or UTF-16BE files to start with U+FFFE, so if the first
character appeared to be U+FFFE, they could get away with assuming the
file was actually in the UTF-16 encoding scheme and deducing that it
was not in the default endianity assigned by the higher-level
protocol.

The UTC is now applying additional pressure to distinguish UTF-16 from
UTF-16LE. To be precise, if the text of a file using the UTF-16
encoding scheme with x-endian content is to start with U+FFFE as its
first character, the file must begin with what would be interpreted as
U+FEFF U+FFFE if it were declared to be in the UTF-16xE encoding
scheme.

What has changed is that before, such a file could be regarded as
erroneous - it should not have escaped from the application that
spawned it. Now the question of whether it is in the UTF-16 encoding
scheme or the UTF-16xE encoding scheme needs to be resolved.

Richard.
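For concreteness, the fgetwc() disambiguation described earlier (call
both feof() and ferror(); if both return zero, the WEOF-valued result
was really U+FFFF) might be sketched like this in C. The function and
enum names are my own, not from any quoted API:

```c
#include <stdio.h>
#include <wchar.h>

typedef enum { WC_CHAR, WC_EOF, WC_ERROR } wc_status;

/* Read one wide character, distinguishing a genuine end-of-file or
 * error from a stream that actually contained U+FFFF.  The last case
 * only arises where WEOF collides with a character value, as on
 * systems with WEOF=0xffff; where WEOF=0xffffffffu it cannot occur. */
wc_status read_wide(FILE *f, wint_t *out)
{
    wint_t c = fgetwc(f);
    if (c != WEOF) {            /* ordinary character, unambiguous */
        *out = c;
        return WC_CHAR;
    }
    if (feof(f))
        return WC_EOF;          /* genuine end of file */
    if (ferror(f))
        return WC_ERROR;        /* genuine read error */
    *out = c;                   /* neither set: the stream really
                                   contained U+FFFF */
    return WC_CHAR;
}
```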

