Stephan Stiller wrote:

> With that in mind, there is value in documenting, however briefly,
> that reading FF FE 00 00 is by itself technically ambiguous.

I have seen this documented many times, though I can't say for sure that it was in official Unicode literature.

Even though you can never flat-out guarantee that a plain-text application won't use U+0000 (which is what FF FE 00 00 would contain if it were little-endian UTF-16: a BOM followed by U+0000), the fact is that very few do. And UTF-32 files are almost never seen outside of laboratory environments. So you're probably safe in assuming that FF FE 00 00 is little-endian UTF-32 and that any other FF FE xx xx is little-endian UTF-16. If you want more assurance than that, apply a "halfway decent heuristic" like this:

For a file to be little-endian UTF-32, the file size must be a multiple of 4, and for each 4-byte chunk <aa bb cc dd>:

• aa bb must not be FE FF or FF FF
• cc must not be 11 through FF
• dd must be 00
• (add your own checks)
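
The checks above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not a definitive detector: the function names looks_like_utf32le and guess_le_encoding are hypothetical, and the surrogate test is an extra check added here in the spirit of "add your own checks".

```python
from typing import Optional


def looks_like_utf32le(data: bytes) -> bool:
    """Return True if every 4-byte chunk passes the checks listed above."""
    if len(data) % 4 != 0:
        return False  # file size must be a multiple of 4
    for i in range(0, len(data), 4):
        aa, bb, cc, dd = data[i:i + 4]
        if dd != 0x00:
            return False  # dd must be 00
        if cc > 0x10:
            return False  # cc must not be 11 through FF
        if bb == 0xFF and aa in (0xFE, 0xFF):
            return False  # aa bb must not be FE FF or FF FF
        # Extra check (an assumption, not in the list above):
        # reject surrogate code points U+D800..U+DFFF.
        if cc == 0x00 and 0xD8 <= bb <= 0xDF:
            return False
    return True


def guess_le_encoding(data: bytes) -> Optional[str]:
    """Guess between little-endian UTF-32 and UTF-16 from a leading FF FE."""
    if data[:2] != b"\xff\xfe":
        return None  # no little-endian BOM; this sketch does not guess further
    if data[:4] == b"\xff\xfe\x00\x00" and looks_like_utf32le(data):
        return "utf-32-le"
    return "utf-16-le"
```

With this sketch, a buffer that starts with FF FE 00 00 is reported as UTF-32 only if the whole buffer passes the chunk checks; otherwise it falls back to UTF-16, which covers the rare case of UTF-16 text that begins with U+0000.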

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell
