My impression was that JTidy had to make a complete pass over the document
in order to tidy it.  This would preclude using it for a SAX (stream-based)
parser.

court

Wong Kok Wai wrote:

> Hi, Assaf,
>
> I believe you've suggested using Tidy/JTidy to preprocess the HTML before
> parsing. Is it possible to integrate JTidy, or apply Tidy's rules, in the
> HTML parser?
>
> Assaf Arkin wrote:
>
> >
> > If a tag is not closed but its parent is closed, the tag will be
> > forcefully closed and an error issued (but will not stop the parser). If
> > the tag's closing is optional (like P), no error will be issued. If the
> > tag is implicitly closed (e.g. LI closes another LI, /UL and /OL close
> > any open LI) it will be properly dealt with.
> >
> > HTML and BODY tags are always created whether they exist or not in the
> > file.
> >
> > This is all taken care of and is most of what the HTML parser is
> > supposed to do, as opposed to an XML parser, which demands well-formed
> > documents.
> >
> > As for overlapping <b>, <i> and <form> (tricky), I use the DOM
> > normalization and not any specific approach taken by any one parser.
> > It's a bit easier for a parser to work with <b>/<i> since it need not
> > create a DOM but just fontify text sections.
> >
> > arkin
> >
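
The closing rules Assaf describes above can be sketched with a stack of open
tags. This is only an illustration, not the parser's actual code: the class
name TagCloser, the OPTIONAL_CLOSE set and the events/errors lists are all
hypothetical, and the LI rule is simplified to the case where the open LI is
the innermost open tag.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;

public class TagCloser {
    // Assumption: tags whose end tag may be omitted without an error.
    private static final Set<String> OPTIONAL_CLOSE = Set.of("P", "LI");

    // Innermost open tag is at the head of the deque.
    private final Deque<String> open = new ArrayDeque<>();
    final List<String> events = new ArrayList<>();
    final List<String> errors = new ArrayList<>();

    public void startTag(String name) {
        // A new LI closes an LI that is still open (simplified: only when
        // that LI is the innermost open tag).
        if (name.equals("LI") && "LI".equals(open.peek())) {
            events.add("/" + open.pop());
        }
        open.push(name);
        events.add(name);
    }

    public void endTag(String name) {
        if (!open.contains(name)) {
            // Report and carry on; an error never stops the parser.
            errors.add("unmatched end tag: " + name);
            return;
        }
        // Forcefully close every child still open inside the closing tag.
        while (!open.peek().equals(name)) {
            String child = open.pop();
            events.add("/" + child);
            if (!OPTIONAL_CLOSE.contains(child)) {
                errors.add("unclosed tag: " + child);
            }
        }
        events.add("/" + open.pop());
    }
}
```

Feeding it UL, LI, LI, B and then /UL closes the first LI when the second
one starts, and /UL then forces B and the second LI closed. Only B is
reported as an error, since LI's closing is optional.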
