On Fri, Apr 17, 2009 at 1:53 AM, Daniel Veillard <[email protected]> wrote:
> On Thu, Apr 16, 2009 at 01:51:10PM -0700, Aaron Patterson wrote:
>> Hi,
>>
>> There seems to be strange behavior in libxml2 with regard to encoding
>> when parsing an HTML file.  If an HTML file contains a meta tag
>> hinting at the encoding, libxml2 will use the encoding in the meta tag
>> *unless* there are strange characters before the meta tag.
>>
>> If there are strange characters before the meta tag, libxml2 will
>> guess the encoding and use the guessed encoding for the rest of the
>> document even though the meta tag reported the correct encoding.
>> What's worse is that libxml2 will report that it used the encoding
>> from the meta tag when outputting the content of the document
>> indicates that it did not.
>>
>> Here is an example of the behavior in action:
>>
>>   http://gist.github.com/96641
>>
>> fail.html fails, and success.html "does the right thing".
>>
>> Should I report this in bugzilla?
>
>  Yes please. The encoding handling is a real problem in HTML
> because you can get content and hence have to parse before possibly
> getting the meta tag (if available !)
>  That was fixed in XML by the xmlDecl and rules to parse it without
> encoding informations a priori.
I've reported the bug here:

  http://bugzilla.gnome.org/show_bug.cgi?id=579317

I wasn't sure how I should set the priority.  I set it to critical
because my data comes out incorrect and the only workaround I have is
to scan the document myself for the declared encoding and then pass
that encoding to libxml2 explicitly.

Thanks for the help!

-- 
Aaron Patterson
http://tenderlovemaking.com/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
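
The workaround described above (scan the document for the declared
encoding, then hand that encoding to the parser) could be sketched
roughly as follows.  This is a minimal Python illustration using only
the standard library, not what libxml2 does internally.  The function
name sniff_meta_charset, the regular expression, and the 1024-byte
scan window are all assumptions made for the example.

```python
import re

# Hypothetical pattern: finds charset=... inside a single <meta ...> tag.
# Covers both <meta http-equiv="Content-Type" content="text/html; charset=X">
# and <meta charset="X">. It works on raw bytes, so odd characters
# earlier in the document do not affect the match.
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def sniff_meta_charset(data: bytes, window: int = 1024):
    """Return the charset named in a <meta> tag near the start of data.

    Only the first `window` bytes are scanned (an assumed limit).
    Returns None when no charset declaration is found.
    """
    match = _META_CHARSET.search(data[:window])
    if match is None:
        return None
    return match.group(1).decode("ascii")
```

The returned name would then be supplied as the explicit encoding when
parsing, for example through the encoding argument of libxml2's
htmlReadMemory(), so that the parser does not fall back to guessing.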
