Re: [xml] HTML parsing with libxml2

Daniel Veillard Fri, 05 Aug 2005 06:11:11 -0700

On Fri, Aug 05, 2005 at 03:01:24PM +0200, Paweł Pałucha wrote:
> 
> >So, basically, how can I make libxml2 parse the document and ignore the 
> >character encoding (or fallback to a default encoding and continue, on 
> >error)? Or how can I make it simply ignore any unknown characters?
> >I really need to use libxml and "out-of-range" characters are messing 
> >the parsing :(


  First make sure the HTTP server is not passing an encoding, which
would take precedence over the one embedded in the file.
  Then give your own encoding string to the parser, or define your own
encoding handling routines. Or debug libxml2 to find out why ASCII
conversion is so obtuse in the HTML parsing case, and suggest a patch.
Of course, if the patch breaks the well-formedness checks at the
libxml2 level, it will be forgotten.

> libxml is an XML parser, do not require it to parse IE-ready html code ;-)

  Wrong, this is about the HTML parser in libxml2.

> You can always clean the document on your own before passing it to 
> libxml2. Or you can use libtidy or similar tool to clean your code.

  Ironically, if you look at the document, it has been tidied, or
is supposed to have been, even though it's not XML: there are unclosed tags.
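
  For the "clean it yourself" route, here is a minimal sketch in C,
assuming it is acceptable to replace every out-of-range byte with a
placeholder before parsing. The helper name sanitize_to_ascii is
hypothetical and is not part of libxml2; it only illustrates the
pre-processing step and does no real encoding conversion.

```c
#include <stddef.h>

/* Hypothetical helper, not part of libxml2: overwrite every byte outside
 * 7-bit ASCII with `replacement`, in place. Returns the number of bytes
 * that were replaced. */
size_t sanitize_to_ascii(char *buf, size_t len, char replacement)
{
    size_t replaced = 0;

    for (size_t i = 0; i < len; i++) {
        /* Compare as unsigned char so bytes >= 0x80 are detected even
         * when plain char is signed. */
        if ((unsigned char)buf[i] > 0x7F) {
            buf[i] = replacement;
            replaced++;
        }
    }
    return replaced;
}
```

  The cleaned buffer could then be handed to the HTML parser, for
example through htmlReadMemory() with an explicit encoding argument.
Note that this is lossy: legitimate accented characters are replaced
as well, so it only makes sense when the markup matters more than the
text.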

Daniel

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
[EMAIL PROTECTED]  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
