Re: [xml] strange encoding behavior when parsing HTML files

Daniel Veillard Fri, 17 Apr 2009 01:54:11 -0700

On Thu, Apr 16, 2009 at 01:51:10PM -0700, Aaron Patterson wrote:
> Hi,
> 
> There seems to be strange behavior in libxml2 with regard to encoding
> when parsing an HTML file.  If an HTML file contains a meta tag
> hinting at the encoding, libxml2 will use the encoding in the meta tag
> *unless* there are strange characters before the meta tag.
> 
> If there are strange characters before the meta tag, libxml2 will
> guess the encoding and use the guessed encoding for the rest of the
> document even though the meta tag reported the correct encoding.
> What's worse is that libxml2 will report that it used the encoding
> from the meta tag, even though the output of the document's content
> indicates that it did not.
> 
> Here is an example of the behavior in action:
> 
>   http://gist.github.com/96641
> 
> fail.html fails, and success.html "does the right thing".
> 
> Should I report this in bugzilla?


  Yes please. Encoding handling is a real problem in HTML because
content can come before the meta tag, so you have to start parsing
before you possibly reach the meta tag (if there is one at all!).
  XML fixed that with the XML declaration (xmlDecl) and rules for
parsing it without any a priori encoding information.
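
  To illustrate the chicken-and-egg problem, here is a minimal sketch in
Python of scanning the raw bytes for a meta charset hint before any real
parsing starts. This is not how libxml2 does it; the helper names
(sniff_meta_charset, decode_html), the 1024-byte limit and the UTF-8
fallback are all assumptions made up for the example, and it only works
for encodings where the ASCII bytes of the meta tag appear unchanged.

```python
import re

# Hypothetical helper, not part of libxml2: look for a charset hint in the
# raw bytes of an HTML document before any real parsing happens.
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def sniff_meta_charset(data: bytes, limit: int = 1024):
    """Return the charset named in a meta tag within the first `limit` bytes, or None."""
    match = _META_CHARSET.search(data[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


def decode_html(data: bytes, fallback: str = "utf-8") -> str:
    """Decode with the sniffed charset, using `fallback` when there is no usable hint."""
    encoding = sniff_meta_charset(data) or fallback
    try:
        return data.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name, or bytes that do not fit it: use the fallback
        # and replace undecodable bytes rather than failing.
        return data.decode(fallback, errors="replace")
```

  Because the scan works on bytes rather than decoded text, non-ASCII
content in front of the meta tag does not affect which encoding gets
picked, as long as the tag falls inside the scanned window.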

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
[email protected]  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
