Re: BOM ambiguity?

Doug Ewell Sun, 15 Jul 2012 18:25:43 -0700

Stephan Stiller wrote:

> With that in mind, there is value in documenting, however briefly,
> that reading FF FE 00 00 is by itself technically ambiguous.

I have seen this documented many times, though I can't say for sure that it was in official Unicode literature.

Even though you can never flat-out guarantee that a plain-text application won't use U+0000, the fact is that very few do. And UTF-32 files are almost never seen outside of laboratory environments. So you're probably safe in assuming that FF FE 00 00 is little-endian UTF-32, and any other FF FE xx xx is little-endian UTF-16, and if you want more assurance than that, apply a "halfway decent heuristic" like this:

For a file to be little-endian UTF-32, the file size must be a multiple of 4, and for each 4-byte chunk <aa bb cc dd>:


• aa bb must not be FE FF or FF FF
• cc must not be 11 through FF
• dd must be 00
• (add your own checks)
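
As a rough illustration, the checks above could be sketched in Python something like this. The function names looks_like_utf32le and guess_ff_fe_encoding are made up for this example, and the strings returned are only labels that happen to match Python codec names. Treat it as a sketch of the heuristic, not a tested detector:

```python
from typing import Optional


def looks_like_utf32le(data: bytes) -> bool:
    """Apply the per-chunk checks listed above to the whole buffer (BOM included)."""
    # The size must be a multiple of 4.
    if len(data) % 4 != 0:
        return False
    for i in range(0, len(data), 4):
        aa, bb, cc, dd = data[i:i + 4]
        # aa bb must not be FE FF or FF FF (U+xxFFFE and U+xxFFFF are noncharacters).
        if (aa, bb) in ((0xFE, 0xFF), (0xFF, 0xFF)):
            return False
        # cc must not be 11 through FF (code points stop at U+10FFFF).
        if cc > 0x10:
            return False
        # dd must be 00.
        if dd != 0x00:
            return False
    return True


def guess_ff_fe_encoding(data: bytes) -> Optional[str]:
    """Guess the encoding of data that starts with FF FE; None if it does not."""
    if not data.startswith(b"\xff\xfe"):
        return None
    # FF FE 00 00 is the ambiguous case: UTF-32LE BOM, or UTF-16LE BOM plus U+0000.
    if data.startswith(b"\xff\xfe\x00\x00") and looks_like_utf32le(data):
        return "utf-32-le"
    return "utf-16-le"
```

For example, guess_ff_fe_encoding(b"\xff\xfe\x00\x00" + "hi".encode("utf-32-le")) comes back as "utf-32-le", while a UTF-16 file that really does start with U+0000 followed by "AB" fails the cc check on its second chunk and comes back as "utf-16-le". Rejecting surrogate code points (cc equal to 00 with bb in D8 through DF) would be one obvious candidate for the "add your own checks" slot.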

--
Doug Ewell | Thornton, Colorado, USA

http://www.ewellic.org | @DougEwell
