Re: [xml] HTMLparser: UTF-8 byte order mark

Bjoern Hoehrmann Tue, 03 Jan 2006 13:12:48 -0800

* Daniel Veillard wrote:
>  Hum, I don't know how it should be processed in theory! In XML
>the BOM is fine at the beginning of a document entity in UTF-8 or UTF-16
>but will usually mess things up in different encodings. For HTML I don't
>know what the theory suggests. For compatibility I guess the character
>should be dropped if detected.


HTML character encoding detection is a terrible mess, and last time I
checked libxml2 was not a compliant implementation: it treats <meta>
elements as encoding switches and does not re-parse the content
preceding the <meta> element (quite unlike browsers). Browsers
typically treat a BOM in HTML the same way they would in an XML
document.
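
To make the browser-like behaviour concrete, here is a minimal sketch in
Python (not libxml2 code, and not the algorithm any particular browser
uses). It assumes a BOM takes precedence over a <meta> declaration, that
the BOM is dropped before parsing, and that the whole document, including
anything before the <meta> element, is decoded with the detected encoding.
The names sniff_encoding and decode_html, the 1024-byte prescan window,
the naive regular expression, and the windows-1252 default are all
assumptions made for illustration.

```python
import codecs
import re

# BOM signatures checked in order; the names are Python codec names.
# UTF-32 BOMs are deliberately ignored in this sketch.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

# Hypothetical, deliberately naive <meta> prescan; real browsers are
# far more careful about attributes, comments, and quoting.
_META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.I
)


def sniff_encoding(data, default="windows-1252"):
    """Return (encoding, bom_length) for a raw HTML byte string."""
    # A BOM wins over anything declared inside the document.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    # Otherwise look for a <meta> declaration near the start.
    match = _META_CHARSET.search(data[:1024])
    if match:
        name = match.group(1).decode("ascii").lower()
        try:
            codecs.lookup(name)  # reject labels Python does not know
        except LookupError:
            return default, 0
        return name, 0
    return default, 0


def decode_html(data):
    """Decode the whole document from the start, dropping any BOM."""
    encoding, skip = sniff_encoding(data)
    return data[skip:].decode(encoding, errors="replace")
```

Because decode_html only decodes after the encoding has been settled,
bytes that precede the <meta> element are interpreted with the declared
encoding too, which is the re-parse behaviour described above.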
-- 
Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
