You cannot even be "very confident" of not finding actual ill-formed UTF-16, like unpaired surrogates, in an external file, let alone noncharacters.
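To make that concrete: the only way to know what an external UTF-16 file contains is to scan its code units. Here is a minimal sketch in Python of such a scan, reporting unpaired surrogates and noncharacters. The function names `units_from_bytes` and `scan_utf16_units` are made up for illustration and do not come from any Unicode-distributed tool, and the sketch assumes the byte order is already known rather than inferred from a BOM.

```python
import struct


def units_from_bytes(data, big_endian=True):
    """Split raw bytes into 16-bit code units (hypothetical helper).

    Assumes the caller already knows the byte order; a trailing odd
    byte, if any, is ignored.
    """
    count = len(data) // 2
    fmt = (">" if big_endian else "<") + "%dH" % count
    return list(struct.unpack(fmt, data[: count * 2]))


def scan_utf16_units(units):
    """Return (index, description) pairs for unpaired surrogates and
    noncharacters found in a sequence of UTF-16 code units."""
    problems = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: well-formed only if a low surrogate follows.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
                # U+nFFFE and U+nFFFF are noncharacters on every plane.
                if (cp & 0xFFFE) == 0xFFFE:
                    problems.append((i, "noncharacter U+%04X" % cp))
                i += 2
                continue
            problems.append((i, "unpaired high surrogate %04X" % u))
        elif 0xDC00 <= u <= 0xDFFF:
            problems.append((i, "unpaired low surrogate %04X" % u))
        elif 0xFDD0 <= u <= 0xFDEF or (u & 0xFFFE) == 0xFFFE:
            # BMP noncharacters: U+FDD0..U+FDEF, U+FFFE, U+FFFF.
            problems.append((i, "noncharacter U+%04X" % u))
        i += 1
    return problems
```

Note that the scan only reports what it finds. Whether to reject, replace, or pass such code units through is a separate decision for the receiving process.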

As for the noncharacters, take a look at the collation test files that we distribute with each version of UCA. The test data includes test strings like the following, to verify that UCA implementations do the correct thing when faced with unusual edge cases:

FFFE 0021
FFFE 003F
FFFE 0061
FFFE 0041
FFFE 0062
1FFFE 0021
1FFFE 003F
1FFFE 0334
...

As well as test strings starting with unpaired surrogates:

D800 0021
D800 003F
D800 0061
D800 0041
D800 0062

And while it is true that the *file* CollationTest_SHIFTED.txt doesn't start with either a noncharacter or an unpaired surrogate -- because all of the test data in it is represented in ASCII hex strings instead of directly in UTF-16 -- the issue in any case isn't whether a *file* starts with a noncharacter, but whether a UTF-16 *string* starts with a noncharacter. Any one of those test strings could be trivially turned into a text file by piping out that one UTF-16 string to a file. And I could then write conformant test software that would read UTF-16 string input data from that file and run it through the UCA algorithm to construct sortkeys for it.

As Peter said, the main thing that prevents running into these is that it isn't very *useful* to start off files (or strings) with U+FFFE. (And, additionally, in the case of UTF-16 text data files, it would be confusing and possibly lead to misinterpretation of byte order, if you were somehow depending solely on initial BOMs -- which I wouldn't advise, anyway.)

Basically, the rules of standards (e.g., you shouldn't try to publicly interchange noncharacters) are not like laws of physics. Just because the standard says you shouldn't do it doesn't mean it doesn't happen.

--Ken

> On Tue, 3 Jun 2014 21:28:05 +0000
> Peter Constable <peter...@microsoft.com> wrote:
>
> > There's never been anything preventing a file from containing and
> > beginning with U+FFFE. It's just not a very useful thing to do, hence
> > not very likely.
>
> Well, while U+FFFE was apparently prohibited from public interchange,
> one could be very confident of not finding it in an external file. As
> an internally generated file, it would then be much more likely to be
> in the UTF-16BE or UTF-16LE encoding scheme.
>
> Richard.