Hi,

Does anyone know whether either JTidy or NekoHTML builds an HTML DOM (using the classes defined in the package org.apache.html.dom) from an HTML document? In other words, does it create an HTMLInputElementImpl object for each <input> tag, an HTMLImageElementImpl object for each <img> tag, and so on?
Thanks for any feedback.

Sam

-----Original Message-----
From: Andy Clark [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 16, 2002 1:11 AM
To: [EMAIL PROTECTED]
Subject: Re: validating html

"David M. Hirst" wrote:
> I'm using the parser to parse html as well as xml documents. When
> I read in an html file, the parser is generating the following error on

Please realize that most HTML documents are *not* well-formed XML documents and therefore cannot be parsed by any conformant XML parser. As long as the HTML documents in question are also well-formed XML documents (e.g. XHTML documents), then you can follow the suggestions given by Eric and Benson.

However, if you really need to parse HTML documents, then you need another solution. Two options I can recommend are the following: JTidy[1] and NekoHTML[2]. JTidy is excellent at fixing up HTML documents but must read the entire document into memory and can only handle a restricted set of character encodings. NekoHTML does less but is written directly to the Xerces Native Interface (XNI) so it integrates well with Xerces2, can operate in a streaming fashion, and handle more character encodings.

[1] http://sourceforge.net/projects/jtidy
[2] http://www.apache.org/~andyc/

--
Andy Clark * [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
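
To illustrate the point in the quoted message that typical HTML is not well-formed XML, here is a minimal Java sketch that uses only the JDK's JAXP API. The class name WellFormedCheck, the helper isWellFormedXml, and the two sample strings are made up for illustration; they do not come from the thread, JTidy, or NekoHTML.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class WellFormedCheck {

    /** Returns true if the markup parses as well-formed XML, false otherwise. */
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default handler so fatal errors are not echoed to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { }
                public void fatalError(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // A conformant XML parser must reject input that is not well-formed.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical samples: ordinary HTML with unclosed <p> and <br> tags,
        // and the same content written as well-formed (XHTML-style) markup.
        String html = "<html><body><p>one<br><p>two</body></html>";
        String xhtml = "<html><body><p>one<br/></p><p>two</p></body></html>";

        System.out.println(isWellFormedXml(html));   // prints "false"
        System.out.println(isWellFormedXml(xhtml));  // prints "true"
    }
}
```

The first sample is the kind of input that needs an HTML-aware parser such as JTidy or NekoHTML; the second can go straight through an XML parser.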
