My impression was that JTidy had to make a complete pass over the document in order to tidy it. This would preclude using it for a SAX (stream-based) parser.
court

Wong Kok Wai wrote:
> Hi, Assaf,
>
> I believe you've suggested using Tidy/JTidy to preprocess the HTML before
> parsing. Is it possible to integrate JTidy or apply Tidy's rules in the
> HTML parser?
>
> Assaf Arkin wrote:
> >
> > If a tag is not closed but its parent is closed, the tag will be
> > forcefully closed and an error issued (but will not stop the parser). If
> > the tag is optional closing (like P), no error will be issued. If the
> > tag is explicitly closed (e.g. LI closes another LI, /UL and /OL close
> > any open LI) it will be properly dealt with.
> >
> > HTML and BODY tags are always created whether they exist or not in the
> > file.
> >
> > This is all taken care of and is most of what the HTML parser is
> > supposed to do, as opposed to an XML parser which demands well formed
> > documents.
> >
> > As for overlapping <b>, <i> and <form> (tricky), I use the DOM
> > normalization and not any specific approach taken by any one parser.
> > It's a bit easier for a parser to work with <b>/<i> since it need not
> > create a DOM but just fontify text sections.
> >
> > arkin
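
For reference, the closing rules described in the quoted message can be sketched roughly as below. This is only an illustration under my own assumptions, not the actual HTML parser code: the class name ImplicitCloseSketch, the process method, the token format ("li" for a start tag, "/ul" for an end tag) and the contents of OPTIONAL_END are all hypothetical, and the LI rule is simplified to the case where the LI is the innermost open element. Overlapping <b>/<i>/<form> handling is not covered.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;

public class ImplicitCloseSketch {
    // Assumed subset of elements whose end tag may be omitted without an error.
    private static final Set<String> OPTIONAL_END = Set.of("html", "body", "p", "li");

    // Turns a flat list of tag tokens into open/close/error events.
    public static List<String> process(List<String> tokens) {
        Deque<String> open = new ArrayDeque<>();   // innermost open element is at the head
        List<String> events = new ArrayList<>();

        // HTML and BODY are always created, whether or not the input has them.
        for (String implied : List.of("html", "body")) {
            open.push(implied);
            events.add("open " + implied);
        }

        for (String token : tokens) {
            if (token.startsWith("/")) {
                String name = token.substring(1);
                if (!open.contains(name)) {
                    events.add("error: stray end tag " + name);
                    continue;                      // report and keep parsing
                }
                // Close children until the named element itself is closed.
                while (true) {
                    String top = open.pop();
                    if (top.equals(name)) {
                        events.add("close " + top);
                        break;
                    }
                    if (!OPTIONAL_END.contains(top)) {
                        events.add("error: unclosed " + top);
                    }
                    events.add("close " + top);    // forcefully closed with its parent
                }
            } else if (token.equals("html") || token.equals("body")) {
                // Already created above; nothing to do.
            } else {
                // A new LI closes an LI that is still open directly above it.
                if (token.equals("li") && "li".equals(open.peek())) {
                    events.add("close " + open.pop());
                }
                open.push(token);
                events.add("open " + token);
            }
        }

        // End of input: close whatever is still open.
        while (!open.isEmpty()) {
            String top = open.pop();
            if (!OPTIONAL_END.contains(top)) {
                events.add("error: unclosed " + top);
            }
            events.add("close " + top);
        }
        return events;
    }
}
```

Feeding it the tokens ul, li, li, b, /ul closes the first LI silently when the second one opens, and reports an error only for the B that is forcefully closed by /UL. Since it works off a stack of open elements and emits events as it goes, it needs no second pass over the input.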
