Re: [xml] Minor bug in htmlCtxtReset

Michael Day Tue, 07 Nov 2006 15:03:36 -0800

Hi Daniel,

>   ctxt->charset is a remain from libxml1 where strings were stored
> in the document encoding (this was a complete and total mess), now
> they are always stored as UTF-8 so whether the value is 0 or 
> XML_CHAR_ENCODING_UTF8 this should not change anything, really.


However, the charset value is used in htmlCurrentChar():

     if (ctxt->charset == XML_CHAR_ENCODING_UTF8) {

I'm trying to parse an HTML file encoded in ISO-8859-1 using 
htmlCtxtReadFile() and I'm getting encoding errors on some of the 
characters because they are not in UTF-8. If I use htmlReadFile() 
everything works fine. If I use htmlCtxtReadFile() and comment out this 
line of htmlCtxtReset():

     ctxt->charset = XML_CHAR_ENCODING_UTF8;

then everything works fine. However, if that line is not commented out, 
then the behaviour of htmlCtxtReadFile() is different from the behaviour 
of htmlReadFile(), and appears to be wrong. So I suggest replacing that 
line with this:

     ctxt->charset = 0;

which will truly reset the parsing context to what it was when it was 
created and give identical behaviour to htmlReadFile() and 
htmlCtxtReadFile().
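
To show why the value matters, here is a rough, self-contained model of 
the branch quoted above. It is not the libxml2 source: the names 
model_current_char, MODEL_CHARSET_NONE and MODEL_CHARSET_UTF8 are 
hypothetical, and the non-UTF-8 path simply assumes the input is 
ISO-8859-1, which is my simplification of the fallback the parser takes 
when the charset is not forced to UTF-8.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the charset values; not libxml2 definitions. */
enum model_charset { MODEL_CHARSET_NONE = 0, MODEL_CHARSET_UTF8 = 1 };

/*
 * Simplified model of a "current character" routine.
 * Returns the decoded code point, or -1 on an encoding error.
 * *len receives the number of bytes consumed (1 on error, 0 if empty).
 * Overlong and surrogate checks are left out to keep the sketch short.
 */
static int model_current_char(const unsigned char *cur, size_t avail,
                              int charset, int *len)
{
    unsigned char c;
    int need, val, i;

    if (avail == 0) {
        *len = 0;
        return -1;
    }
    c = cur[0];

    /* ASCII is the same in both encodings. */
    if (c < 0x80) {
        *len = 1;
        return c;
    }

    /* Assumed fallback: treat a high byte as ISO-8859-1, where the
     * byte value is the code point. */
    if (charset != MODEL_CHARSET_UTF8) {
        *len = 1;
        return c;
    }

    /* charset forced to UTF-8: the byte must start a valid sequence. */
    if ((c & 0xE0) == 0xC0) {
        need = 1; val = c & 0x1F;
    } else if ((c & 0xF0) == 0xE0) {
        need = 2; val = c & 0x0F;
    } else if ((c & 0xF8) == 0xF0) {
        need = 3; val = c & 0x07;
    } else {
        *len = 1;
        return -1;          /* stray continuation or invalid lead byte */
    }
    if (avail < (size_t)need + 1) {
        *len = 1;
        return -1;          /* sequence is truncated */
    }
    for (i = 1; i <= need; i++) {
        if ((cur[i] & 0xC0) != 0x80) {
            *len = 1;
            return -1;      /* missing continuation byte */
        }
        val = (val << 6) | (cur[i] & 0x3F);
    }
    *len = need + 1;
    return val;
}
```

In this model the ISO-8859-1 byte 0xE9 followed by '<' is rejected when 
the charset is MODEL_CHARSET_UTF8, but decodes to U+00E9 when the 
charset is MODEL_CHARSET_NONE. That is the same kind of difference I 
see between htmlCtxtReadFile() and htmlReadFile().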

Best regards,

Michael

-- 
Print XML with Prince!
http://www.princexml.com