Re: [xml] strange encoding behavior when parsing HTML files

Aaron Patterson Fri, 17 Apr 2009 09:39:46 -0700

On Fri, Apr 17, 2009 at 1:53 AM, Daniel Veillard <[email protected]> wrote:
> On Thu, Apr 16, 2009 at 01:51:10PM -0700, Aaron Patterson wrote:
>> Hi,
>>
>> There seems to be strange behavior in libxml2 with regard to encoding
>> when parsing an HTML file.  If an HTML file contains a meta tag
>> hinting at the encoding, libxml2 will use the encoding in the meta tag
>> *unless* there are strange characters before the meta tag.
>>
>> If there are strange characters before the meta tag, libxml2 will
>> guess the encoding and use the guessed encoding for the rest of the
>> document even though the meta tag reported the correct encoding.
>> What's worse, libxml2 reports that it used the encoding from the
>> meta tag, even though the output it produces for the document
>> shows that it did not.
>>
>> Here is an example of the behavior in action:
>>
>>   http://gist.github.com/96641
>>
>> fail.html fails, and success.html "does the right thing".
>>
>> Should I report this in bugzilla?
>
>  Yes please. Encoding handling is a real problem in HTML, because
> content can appear before the meta tag (if there is one at all!), so
> the parser has to start parsing before it knows the declared encoding.
>  XML fixed that with the XML declaration (xmlDecl) and rules for
> parsing it without any a priori encoding information.


I've reported the bug here:

  http://bugzilla.gnome.org/show_bug.cgi?id=579317

I wasn't sure how to set the priority.  I set it to critical because
my data comes out incorrect and the only workaround I have is to parse
the document myself, look for the encoding, and then pass that
encoding to libxml2.
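
For reference, here is a minimal sketch of that workaround in Python.
It only covers the "look for the encoding" step. The helper name
sniff_meta_charset, the 4096-byte scan limit, and the regular
expression are assumptions made for illustration; none of them come
from libxml2. The value it returns is what would then be handed to the
parser, for example as the encoding argument of htmlReadMemory().

```python
import re

# Hypothetical helper, not part of libxml2: matches both
#   <meta http-equiv="Content-Type" content="text/html; charset=X">
# and
#   <meta charset="X">
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(data: bytes, limit: int = 4096):
    """Return the charset named in a <meta> tag, or None if none is found.

    Only the first `limit` bytes are scanned (an arbitrary assumption).
    The scan works on raw bytes, so non-ASCII bytes that appear before
    the meta tag do not affect the result.
    """
    match = _META_CHARSET.search(data[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii")


if __name__ == "__main__":
    sample = (
        b"<html><head><title>\x82\xb1\x82\xf1</title>"
        b'<meta http-equiv="Content-Type" '
        b'content="text/html; charset=Shift_JIS"></head></html>'
    )
    # Pass this value to the parser instead of letting it guess.
    print(sniff_meta_charset(sample))
```

This is a plain byte scan rather than a real HTML parse, so it can be
fooled by a meta tag inside a comment or script; it is only meant to
show the shape of the workaround.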

Thanks for the help!

-- 
Aaron Patterson
http://tenderlovemaking.com/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
