Hi, 

Does anyone know whether either JTidy or NekoHTML creates an HTML DOM (as
defined in the package org.apache.html.dom) from an HTML document? In other
words, for each <input> tag it would create an HTMLInputElementImpl object,
for each <img> tag an HTMLImageElementImpl object, and so on.

Thanks for any feedback.
Sam

-----Original Message-----
From: Andy Clark [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 16, 2002 1:11 AM
To: [EMAIL PROTECTED]
Subject: Re: validating html


"David M. Hirst" wrote:
>         I'm using the parser to parse html as well as xml documents. When
> I read in an html file, the parser is generating the following error on

Please realize that most HTML documents are *not* well-formed
XML documents and therefore cannot be parsed by any conformant
XML parser. As long as the HTML documents in question are also
well-formed XML documents (e.g. XHTML documents), then you can
follow the suggestions given by Eric and Benson.
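
As a rough illustration of the point above, here is a minimal sketch that
feeds two snippets to a standard JAXP DocumentBuilder. The class name, the
helper method, and both sample strings are made up for this example; they
are not taken from David's code. The first snippet is ordinary HTML with
unclosed <br> and <img> tags, the second is the same markup written as
well-formed XML.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper class, for illustration only.
class WellFormedCheck {

    /** Returns true if the text parses as a well-formed XML document. */
    static boolean isWellFormed(String text) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default handler so errors are not printed to stderr;
            // fatal (well-formedness) errors are rethrown and caught below.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { }
                public void fatalError(SAXParseException e) throws SAXParseException {
                    throw e;
                }
            });
            builder.parse(new ByteArrayInputStream(
                text.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input as not well-formed.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Typical HTML: <br> and <img> are never closed, so this is rejected.
        String html = "<html><body><p>one<br><img src=\"a.gif\"></p></body></html>";
        // The same content with every element closed, as XHTML requires.
        String xhtml = "<html><body><p>one<br/><img src=\"a.gif\"/></p></body></html>";
        System.out.println("html well-formed:  " + isWellFormed(html));
        System.out.println("xhtml well-formed: " + isWellFormed(xhtml));
    }
}
```

Only the second snippet gets through, which is why a dedicated HTML parser
is needed for documents of the first kind.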

However, if you really need to parse HTML documents, then you
need another solution. Two options I can recommend are the
following: JTidy[1] and NekoHTML[2]. 

JTidy is excellent at fixing up HTML documents but must read the 
entire document into memory and can only handle a restricted set 
of character encodings. 

NekoHTML does less but is written directly to the Xerces Native 
Interface (XNI), so it integrates well with Xerces2, can operate 
in a streaming fashion, and can handle more character encodings.

[1] http://sourceforge.net/projects/jtidy
[2] http://www.apache.org/~andyc/

-- 
Andy Clark * [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
