Steven Atreju on 28/7/'12,  0:22:
"Doug Ewell" wrote:

 |> Well, i still see a bug in the Unicode Standard here.
 |> Whereas for the multioctet UTFs there is «The BOM is not
 |> considered part of the content of the text» (Conformance, 3.10,
 |> D98, D101), i cannot find any such clarifying text for its usage
 |> as a signature.
 |
 |There really isn't as much difference between using U+FEFF "as a byte
 |order mark" and using it "as a signature" as this makes it seem. The
 |definitions you quote have to do with whether U+FEFF is treated as a
 |BOM/signature or as a zero-width no-break space.

I really think that a clarification in equal spirit to those of
D98 and D101 (but maybe with different content :) would be an
improvement of the Unicode Standard.

Once more i want to point out that on Unix/POSIX systems the file
content can be seen as a whole, and i hope and think that this
will not change.  This situation is completely different than on
Windows, which had textfiles with appended (separated by ^Z or so)
meta information that was invisible in normal text editors already
in the nineties (or even earlier, but i don't know).

I.e., this is why we do have this messy text OR binary file I/O
distinction like O_BINARY (for open(2)), "b" (for fopen(3)) or
binmode (perl(1)).  Because without those a text file will see
End-Of-File at the ^Z, not at the real end of the file.  (Which
raises the immediate question why the Microsoft programmers did not
embed the meta information in this section at the end of the file.
But i don't really want to know.)
Anyway.  On Unix a UTF-8 file *will* show the BOM, because it is
file content.
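
A minimal Python sketch of that last point, under stated assumptions: the
helper name first_bytes_and_text and the sample payload are made up for
illustration and are not from this thread. It writes a UTF-8 file that
starts with the signature and reads it back three ways, to show that the
three octets are ordinary file content unless the decoder is explicitly
asked to drop them.

```python
import codecs
import os
import tempfile


def first_bytes_and_text(payload: str):
    """Write payload as UTF-8 with a leading BOM, then read it back three ways."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # The signature is written like any other bytes of the file.
        with open(path, "wb") as f:
            f.write(codecs.BOM_UTF8 + payload.encode("utf-8"))
        # 1. Raw bytes: the octets EF BB BF are simply there.
        with open(path, "rb") as f:
            raw = f.read()
        # 2. Plain UTF-8 decode: U+FEFF shows up as the first character.
        with open(path, "r", encoding="utf-8", newline="") as f:
            plain = f.read()
        # 3. "utf-8-sig" decode: the codec removes a leading U+FEFF.
        with open(path, "r", encoding="utf-8-sig", newline="") as f:
            stripped = f.read()
    finally:
        os.remove(path)
    return raw, plain, stripped


raw, plain, stripped = first_bytes_and_text("hello\n")
print(raw[:3])          # the three signature octets
print(repr(plain[0]))   # the character a plain UTF-8 decode yields first
print(repr(stripped))   # the text with the signature removed
```

Whether U+FEFF counts as part of the text therefore depends on which
decoder the reader picks, not on the file itself.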

I agree with Doug that there is no enormous difference between "BOM" and
"encoding signature". In XML 1.0 the BOM is in fact described as a signature
regardless of which Unicode encoding it is used with:

http://www.w3.org/TR/xml/#charencoding
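
To make the "signature" reading concrete, here is a rough Python sketch of
sniffing an encoding from the leading bytes of a document. It is only an
illustration in the spirit of the autodetection the XML specification
describes, not the specification's algorithm: the function name
sniff_signature and the table layout are assumptions, and the codec names
are Python's, not XML's.

```python
import codecs

# Longer signatures are tried first, because the UTF-32 LE signature
# (FF FE 00 00) begins with the UTF-16 LE signature (FF FE).
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_signature(data: bytes):
    """Return (codec name, signature length), or (None, 0) if no signature."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    return None, 0


name, skip = sniff_signature(codecs.BOM_UTF8 + b"<a/>")
print(name, skip)
```

In every case the same code point, U+FEFF, is what gets recognised; only
its byte serialisation differs. A UTF-16 LE text that really starts with
U+0000 would be misread as UTF-32 LE by this sketch, which is one reason
to treat it as a heuristic.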

Also, whether UTF-16 is one or two encodings is a matter of definition.
(Microsoft at one time defined it as two encodings.) --
Leif Halvard Silli
