Some comments:

- If you can avoid it, don't use a BOM at the start of a UTF-8 HTML file. Without the BOM, the file will display correctly in more browsers.

- The W3C Validator http://validator.w3.org/ accepts the BOM for HTML 4.01 and also for XHTML. It probably should produce a warning. It did when I originally added the code to handle the BOM. I have requested that the warning be added again.

- Adding the BOM/ZWNBSP to the whitespace definition is a bad idea, because it would allow a ZWNBSP in all kinds of places where not seeing a space would be confusing (e.g. between attributes). Also, HTML 4 is only being maintained, not being developed.

- That HTML 4.0 allows ZWSP (&#x200B;) as whitespace in http://www.w3.org/TR/REC-html40/struct/text.html#h-9.1 is for line breaking/rendering reasons (Thai), within element content. This is in conflict with the whitespace definition for syntactic purposes, which is formally given at http://www.w3.org/TR/REC-html40/sgml/sgmldecl.html and does not include ZWSP (&#x200B;). I have filed a request for clarification.

- RFC 2279 neither approves nor disapproves of the BOM. Both Unicode and ISO 10646 allow the BOM as a signature for UTF-8. RFC 2279 is being updated. See http://lists.w3.org/Archives/Public/ietf-charsets/2003JanMar/0209.html.

- For XML, a BOM at the start of UTF-8 is allowed by an erratum at http://www.w3.org/XML/xml-V10-2e-errata#E22. But as with HTML, it is better not to start your XML files with a BOM, because there are XML parsers out there that don't like it (and rejecting it was okay at least until 2001-07-25).

- The BOM is both rather handy in a Windows/Notepad scenario and seriously disruptive in a Unix-like filter scenario (which may also be on Windows). I have found that Notepad doesn't need the BOM to detect that a file is UTF-8 if it has enough other information (this is on a Japanese Win2000, your mileage may vary). It would be nice if Notepad had a setting to not produce a BOM.

- I append a small perl program that removes a UTF-8 BOM if there is one. Quite handy, I use it regularly. Feel free to use and change on your own responsibility (i.e. if it starts to eat up your files, don't blame me!).

Regards,
Martin.

#!/usr/bin/perl
# program to remove a leading UTF-8 BOM from a file
# works both STDIN -> STDOUT and on the spot (with filename as argument)

use strict;
use warnings;

if ($#ARGV > 0) {
    print STDERR "Too many arguments!\n";
    exit 1;
}

my @file;        # file content
my $lineno = 0;  # number of lines read so far

my $filename = $ARGV[0];
if (defined $filename) {
    # on the spot: read the whole file, then overwrite it
    open(BOMFILE, "<", $filename) or die "Cannot read $filename: $!\n";
    while (<BOMFILE>) {
        if (!$lineno++) {
            s/^\xEF\xBB\xBF//;  # strip the BOM from the first line only
        }
        push @file, $_;
    }
    close BOMFILE;
    open(NOBOMFILE, ">", $filename) or die "Cannot write $filename: $!\n";
    foreach my $line (@file) {
        print NOBOMFILE $line;
    }
    close NOBOMFILE;
}
else {  # STDIN -> STDOUT
    while (<>) {
        if (!$lineno++) {
            s/^\xEF\xBB\xBF//;  # strip the BOM from the first line only
        }
        push @file, $_;
    }
    foreach my $line (@file) {
        print $line;
    }
}
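
In case the check itself is useful in other scripts, here is a minimal sketch of it separated from the file handling. The name strip_utf8_bom is hypothetical and not part of the program above, and the sketch assumes the data is handled as raw bytes (i.e. it has not been decoded to characters first).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the program above): returns the given
# byte string without a leading UTF-8 BOM (bytes EF BB BF), if present.
# Assumes raw bytes, i.e. the data was not decoded to characters first.
sub strip_utf8_bom {
    my ($bytes) = @_;
    $bytes =~ s/^\xEF\xBB\xBF//;  # ^ without /m matches at string start only
    return $bytes;
}

# Small demonstration with in-memory data: one input with a BOM, one without.
for my $input ("\xEF\xBB\xBF<html>", "<html>") {
    my $output = strip_utf8_bom($input);
    printf "%d bytes in, %d bytes out\n", length($input), length($output);
}
```

Only a BOM at the very start of the string is removed; the same three bytes anywhere later are left alone, which matches what the program above does for the first line.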

