Hello,

I'm using libxml2 (2.7.6) and I have a question regarding encoding
precedence.

I have an array of bytes (UTF-8 HTML data) and I invoke
htmlCreatePushParserCtxt() with the encoding set to XML_CHAR_ENCODING_UTF8.
When I walk the document's nodes, everything is fine unless the HTML file
was poorly generated, such as:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
...

The charset specified here is wrong, as the HTML data really is UTF-8 (I know
this for sure). Nonetheless, the charset specified by the meta tag seems to take
precedence over the encoding passed to htmlCreatePushParserCtxt().

That is, when walking the document's nodes after parsing with that wrong
charset, it seems that the xmlNodePtr's content isn't in UTF-8, which breaks
my handler since it expects UTF-8 data.

How can I best handle this? I could of course strip the charset parameter from
the meta tag prior to calling htmlCreatePushParserCtxt(), but I would
rather "force" libxml to trust me and use UTF-8 on that poorly generated
content.
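
For reference, the stripping workaround I have in mind would look roughly
like the sketch below. It uses plain C only (no libxml2 calls), and the
function names blank_meta_charset() and find_ci() are my own, hypothetical
helpers. It assumes the first "charset=" in the buffer is the one inside the
http-equiv meta tag, and it overwrites the parameter with spaces in place so
the buffer length does not change. It does not handle the
<meta charset="..."> form.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive search for needle in buf[0..len).
 * Returns the index of the first match, or -1 if there is none. */
static long find_ci(const char *buf, size_t len, const char *needle)
{
    size_t n = strlen(needle);
    if (n == 0 || n > len)
        return -1;
    for (size_t i = 0; i + n <= len; i++) {
        size_t j = 0;
        while (j < n &&
               tolower((unsigned char)buf[i + j]) ==
               tolower((unsigned char)needle[j]))
            j++;
        if (j == n)
            return (long)i;
    }
    return -1;
}

/* Hypothetical helper: overwrite the first "charset=<value>" with spaces.
 * The value is taken to end at a quote, '>', ';' or whitespace.
 * Returns the number of bytes blanked (0 if nothing was found). */
size_t blank_meta_charset(char *buf, size_t len)
{
    long start = find_ci(buf, len, "charset=");
    if (start < 0)
        return 0;

    size_t end = (size_t)start;
    while (end < len && buf[end] != '"' && buf[end] != '\'' &&
           buf[end] != '>' && buf[end] != ';' &&
           !isspace((unsigned char)buf[end]))
        end++;

    memset(buf + start, ' ', end - (size_t)start);
    return end - (size_t)start;
}
```

I would call blank_meta_charset(data, size) on the byte array before feeding
it to the parser. It works, but it means scanning and modifying the input
myself, which is why I would prefer a way to do this inside libxml.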

Thanks and best regards,
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml