Re: [xml] Workaround to incorrect UTF8 encoding of HTML transformed to XHTML in xmllint?

Bjoern Hoehrmann Mon, 12 Jan 2009 12:27:34 -0800

* Matt Poff wrote:
>I have a migration pipeline that takes HTML files with UTF-8 encoded
>characters and pipes them through xmllint to produce valid XHTML. This
>is then queried by XSLT files called by ETL scripts. However, no
>matter what flags I use on xmllint, I cannot get it to output the
>XHTML with the UTF-8 encoding preserved.


Do you specify the encoding when calling htmlCtxtRead* or whatever you
are using to parse the document? Generally, it is better to check what
values are stored in memory by querying parts of the document than to
rely on the serialized result.
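
To illustrate why the in-memory values are the thing to check, here is a
minimal sketch in plain Python (standard library only, not libxml2). It
assumes the parser falls back to a single-byte encoding such as ISO-8859-1
when no encoding is given; the sample markup and the helper name
`codepoints` are made up for the example.

```python
# The same bytes, decoded with and without the correct encoding.
raw = "<p>caf\u00e9</p>".encode("utf-8")  # e-acute is stored as bytes C3 A9

as_utf8 = raw.decode("utf-8")         # encoding stated explicitly
as_latin1 = raw.decode("iso-8859-1")  # assumed fallback when none is stated


def codepoints(text):
    """List the non-ASCII code points actually held in memory."""
    return [f"U+{ord(ch):04X}" for ch in text if ord(ch) > 0x7F]


print(codepoints(as_utf8))    # ['U+00E9']
print(codepoints(as_latin1))  # ['U+00C3', 'U+00A9']
```

If the parsed document holds two code points where one was expected, the
input was misread at parse time, and no output flag will repair it.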

>If I download the file with curl the UTF-8 is preserved and visible, so
>it's specific to xmllint. I even tried downloading the HTML file,
>running it through iconv and then xmllint, but this made no difference.

Does the HTML document start with a byte order mark? Does it include a
<meta http-equiv='Content-Type' content='text/html;charset=utf-8'> tag?
If you have libxml2 download the content, does the HTTP response contain
a Content-Type: text/html;charset=utf-8 header?
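
These three checks can be run against the raw bytes before any parser is
involved. The following is a rough sketch using only the Python standard
library; the function names and the 1024-byte scan limit are assumptions
for the example, and the regular expression is a simplification, not a
full implementation of HTML encoding sniffing.

```python
import codecs
import re
from email.message import Message

# Simplified pattern: finds charset=... inside the first <meta> tag that has one.
META_RE = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def has_utf8_bom(data: bytes) -> bool:
    """True if the document starts with the UTF-8 byte order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def meta_charset(data: bytes, limit: int = 1024):
    """Return the charset named in a <meta> tag near the start, or None."""
    match = META_RE.search(data[:limit])
    return match.group(1).decode("ascii").lower() if match else None


def header_charset(content_type: str):
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


html = (b"<html><head><meta http-equiv='Content-Type' "
        b"content='text/html;charset=utf-8'></head><body></body></html>")
print(has_utf8_bom(html))                          # False
print(meta_charset(html))                          # utf-8
print(header_charset("text/html;charset=utf-8"))   # utf-8
```

If all three come back empty, the parser has nothing to go on and will
have to guess the encoding.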
-- 
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml
