On Wed, Nov 08, 2006 at 10:03:29AM +1100, Michael Day wrote:
> Hi Daniel,
>
> > ctxt->charset is a remnant from libxml1 where strings were stored
> >in the document encoding (this was a complete and total mess), now
> >they are always stored as UTF-8 so whether the value is 0 or
> >XML_CHAR_ENCODING_UTF8 this should not change anything, really.
>
> However, the charset value is used in htmlCurrentChar():
>
> if (ctxt->charset == XML_CHAR_ENCODING_UTF8) {
>
> I'm trying to parse a HTML file encoded in ISO-8859-1 using
> htmlCtxtReadFile() and I'm getting encoding errors on some of the
> characters because they are not in UTF-8. If I use htmlReadFile()
> everything works fine. If I use htmlCtxtReadFile() and comment out this
> line of htmlCtxtReset():
>
> ctxt->charset = XML_CHAR_ENCODING_UTF8;
>
> then everything works fine. However, if that line is not commented out,
> then the behaviour of htmlCtxtReadFile() is different from the behaviour
> of htmlReadFile(), and appears to be wrong. So I suggest replacing that
> line with this:
>
> ctxt->charset = 0;
>
> which will truly reset the parsing context to what it was when it was
> created and give identical behaviour to htmlReadFile() and
> htmlCtxtReadFile().
Okay, what I thought was a general rule is limited to XML parsing. In the
HTML parser we actually do
ctxt->charset = XML_CHAR_ENCODING_8859_1
when an encoding error is detected, so you're right and the reset code
needs to be fixed. HTML parsing is a really scary mess :-\
So the best fix is to change htmlCtxtReset() to do
ctxt->charset = XML_CHAR_ENCODING_NONE;
Thanks for the report, I have committed this change to CVS now!
Daniel
--
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard | virtualization library http://libvirt.org/
[EMAIL PROTECTED] | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/