Some comments:

- If you can avoid it, don't use a BOM at the start of a UTF-8
  HTML file. Without it, the file will display nicely in more browsers.

- The W3C Validator http://validator.w3.org/ accepts the BOM for
  HTML 4.01, and also XHTML. It probably should produce a warning.
  It did when I originally added code to handle it. I have requested
  that it be added again.

- Adding a BOM/ZWNBSP to the whitespace definition is a bad idea,
  because it would allow a ZWNBSP in all kinds of places where
  not seeing a space would be confusing (e.g. between attributes).
  Also, HTML 4 is only being maintained, not being developed.

- That HTML 4.0 allows ZWSP (&#x200B;) as whitespace in
  http://www.w3.org/TR/REC-html40/struct/text.html#h-9.1 is for
  line breaking/rendering reasons (Thai), within element content.
  This is in conflict with the whitespace definition for syntactic
  purposes, which is formally given at
  http://www.w3.org/TR/REC-html40/sgml/sgmldecl.html and does
  not include ZWSP (&#x200B;). I have filed a request for
  clarification.

- RFC 2279 does not approve or disapprove of the BOM. Both Unicode
  and ISO 10646 allow the BOM as a signature for UTF-8. RFC 2279
  is being updated. See
  http://lists.w3.org/Archives/Public/ietf-charsets/2003JanMar/0209.html.

- For XML, a BOM at the start of UTF-8 is allowed by an erratum at
  http://www.w3.org/XML/xml-V10-2e-errata#E22. But as with HTML,
  it is better not to start your XML files with a BOM, because there
  are XML parsers out there that don't like it (and rejecting it was
  okay at least until 2001-07-25).

- The BOM is both rather handy in a Windows/Notepad scenario and
  seriously disruptive in a Unix-like filter scenario (which may
  also be on Windows). I have found that Notepad doesn't need the
  BOM to detect that a file is UTF-8 if it has enough other information
  (this is on a Japanese Win2000; your mileage may vary). It would be
  nice if Notepad had a setting to not produce a BOM.

- I append a small Perl program that removes a UTF-8 BOM if there
  is one. Quite handy, I use it regularly. Feel free to use and change
  it on your own responsibility
  (i.e. if it starts to eat up your files, don't blame me!).

Regards,   Martin.




#!/usr/bin/perl

# program to remove a leading UTF-8 BOM from a file
# works both STDIN -> STDOUT and on the spot (with filename as argument)

use strict;
use warnings;

if (@ARGV > 1) {
    print STDERR "Too many arguments!\n";
    exit 1;
}

my @file;   # file content
my $lineno = 0;

my $filename = $ARGV[0];
if (defined $filename) {
    # read the whole file first, because it is overwritten below
    open my $in, '<', $filename or die "Cannot read $filename: $!\n";
    binmode $in;    # work on raw bytes, leave line endings alone
    while (<$in>) {
        if (!$lineno++) {
            s/^\xEF\xBB\xBF//;   # only the first line can start with the BOM
        }
        push @file, $_;
    }
    close $in;
    open my $out, '>', $filename or die "Cannot write $filename: $!\n";
    binmode $out;
    print $out @file;
    close $out or die "Cannot finish writing $filename: $!\n";
}
else {  # STDIN -> STDOUT
    binmode STDIN;
    binmode STDOUT;
    while (<STDIN>) {
        if (!$lineno++) {
            s/^\xEF\xBB\xBF//;
        }
        print;
    }
}
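
Before stripping anything, it can be useful to see which files actually
carry a BOM. The following is only a minimal sketch of such a check, not
part of the program above: the helper name has_utf8_bom is made up for
illustration, and it assumes that looking at the first three bytes
(EF BB BF) is all that is wanted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the BOM remover above): returns 1 if
# the file at $path begins with the UTF-8 BOM bytes EF BB BF, else 0.
sub has_utf8_bom {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot read $path: $!\n";
    binmode $fh;                          # compare raw bytes
    my $head = '';
    my $count = read($fh, $head, 3);      # number of bytes read, undef on error
    die "Cannot read $path: $!\n" unless defined $count;
    close $fh;
    return ($count == 3 && $head eq "\xEF\xBB\xBF") ? 1 : 0;
}

# Report the BOM status of every file named on the command line.
foreach my $path (@ARGV) {
    my $status = has_utf8_bom($path) ? "starts with a UTF-8 BOM" : "no BOM";
    print "$path: $status\n";
}
```

Called with a list of filenames, it prints one line per file. It only
reads, so unlike the remover it cannot eat up your files.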
