Actually, libxml2 includes a "non-verifying parser" for HTML 4.0 called HTMLparser. (See http://xmlsoft.org/html/libxml-HTMLparser.html.) Its API is similar to libxml2's XML parser interface, but it is most definitely NOT an XML parser.
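
To make the HTML/XML distinction concrete, here is a minimal sketch. It does not use libxml2 at all: it assumes Python's standard-library xml.etree.ElementTree as a stand-in for a strict XML parser and html.parser as a stand-in for a lenient HTML tokenizer (note that html.parser only tokenizes; it builds no tree and performs no repairs). The sample documents and the helper names is_well_formed_xml, TagCollector, and html_start_tags are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Typical HTML 4 style markup: unclosed <p> and <br> are legal HTML,
# but they violate XML well-formedness rules.
HTML_DOC = "<html><body><p>First<br>Second<p>Third</body></html>"

# The same content written as well-formed XHTML-style markup.
XHTML_DOC = "<html><body><p>First<br/>Second</p><p>Third</p></body></html>"


def is_well_formed_xml(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Lenient HTML tokenizer that records start tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_start_tags(text):
    """Tokenize text as HTML and return the list of start-tag names."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print("HTML accepted by XML parser: ", is_well_formed_xml(HTML_DOC))
    print("XHTML accepted by XML parser:", is_well_formed_xml(XHTML_DOC))
    print("Start tags seen by HTML tokenizer:", html_start_tags(HTML_DOC))
```

The XML parser rejects the first document outright, while the HTML tokenizer reads it without complaint. That is the practical reason a dedicated HTML parser exists alongside the XML one.
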
That said, if you can specify that your documents must be XHTML, you'll be able to use an XML parser, which will be a much more robust and durable solution. You'll be able to validate the documents, and any XML processor (whether based on libxml2, Xerces, or some other parser) will be able to understand them. This brings flexibility in the parser interface, too: you can choose DOM, SAX, or other APIs (such as libxml2's).

HTML parsers often have one feature that no conformant XML parser will have: some mechanism for attempting to fix up broken documents. Personally, I think it's a mistake to attempt such repairs, because the code has to guess at the original intent. If possible, it's much better to get the document fixed before processing it. If, however, you need to be able to process any HTML found on the Web today, such a feature is probably necessary.

> -----Original Message-----
> From: Jeroen N. Witmond [mailto:[EMAIL PROTECTED]
> Sent: Friday, December 12, 2003 5:55 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Can I use libxml2 to parse HTML?
>
> Hi,
>
> First of all, this is the mailing list for Xerces-C++, not for libxml2.
> More importantly, you cannot use any XML parser for HTML, as HTML does not
> conform to the XML rules (see http://www.w3.org/TR/REC-xml ). You CAN use
> an XML parser for documents written in XHTML (see http://www.w3.org/MarkUp/ ).
>
> Regards,
>
> Jeroen.
>
> > Hi,
> >
> > I downloaded, installed and compiled the libxml2
> > today. I wonder if I can use it to parse an
> > HTML-file.
> >
> > Regards
> >
> > Wei Chen