Actually, libxml2 includes a "non-verifying parser" for HTML 4.0 called
HTMLparser.  (See http://xmlsoft.org/html/libxml-HTMLparser.html.)  Its API
is similar to libxml2's XML parser interface, but it is most definitely NOT
an XML parser.

That said, if you can specify that your documents must be XHTML, you'll be
able to use an XML parser, which will be a much more robust and durable
solution.  You'll be able to validate the documents, and any XML processor
(whether based on libxml2, Xerces, or some other parser) will be able to
understand them.  This brings flexibility in the parser interface, too: you
can choose DOM, SAX, or other APIs (such as libxml2's).
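
As a rough illustration of the difference, here is a minimal sketch that
uses Python's standard-library XML parser as a stand-in (it is not libxml2
or Xerces, and the two sample documents are made up for this example).  It
only checks well-formedness, not validity against a DTD, but that is enough
to show an XML parser accepting XHTML and rejecting ordinary HTML:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents, invented for this illustration.
XHTML_DOC = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Test</title></head>"
    "<body><p>Hello<br/>world</p></body>"
    "</html>"
)

# Typical HTML 4 style: <br> is not self-closed and <p> is never closed.
HTML_DOC = (
    "<html><head><title>Test</title></head>"
    "<body><p>Hello<br>world</body></html>"
)


def is_well_formed(text):
    """Return True if the XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print("XHTML accepted:", is_well_formed(XHTML_DOC))
print("HTML accepted:", is_well_formed(HTML_DOC))
```

Any conformant XML parser should make the same distinction, which is why
requiring XHTML leaves you free to pick the parser and API.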

HTML parsers often have one feature that no conformant XML parser will have:
some mechanism for attempting to fix up broken documents.  Personally, I
think it's a mistake to attempt such repairs, because the code has to guess
at the original intent.  If possible, it's much better to get the document
fixed before processing it.  If, however, you need to be able to process any
HTML found on the Web today, such a feature is probably necessary.
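
To show why such repairs involve guessing, here is a small sketch using
Python's standard-library html.parser module (again a stand-in, not
libxml2's HTMLparser; the TagLogger class and the sample input are
hypothetical).  This parser tolerates broken markup and simply reports what
it sees, without repairing anything:

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record start and end tags in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_events(text):
    """Feed text to a lenient HTML parser and return the tag events."""
    parser = TagLogger()
    parser.feed(text)
    parser.close()
    return parser.events


# Two unclosed <p> elements and a stray </div> that was never opened.
BROKEN = "<p>one<p>two</div>"
events = tag_events(BROKEN)
print(events)
```

The parser reports two paragraph starts and a closing div with no matching
open tag.  Any code that turns this into a tree has to decide where each
paragraph ends and what to do with the stray end tag, and that decision is
the guess about the author's intent.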

> -----Original Message-----
> From: Jeroen N. Witmond [mailto:[EMAIL PROTECTED]]
> Sent: Friday, December 12, 2003 5:55 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Can I use libxml2 to parse HTML?
> 
> 
> Hi,
> 
> First of all, this is the mailing list for Xerces-C++, not for libxml2.
> More importantly, you cannot use any XML parser for HTML, as HTML does
> not conform to the XML rules (see http://www.w3.org/TR/REC-xml ).
> You CAN use an XML parser for documents written in XHTML (see
> http://www.w3.org/MarkUp/ ).
> 
> Regards,
> 
> Jeroen.
> 
> > Hi,
> >
> > I downloaded, installed and compiled the libxml2
> > today. I wonder if I can use it to parse an
> > HTML-file.
> >
> > Regards
> >
> > Wei Chen
