Hi Assaf,

I believe you've suggested using Tidy/JTidy to preprocess the HTML before parsing. Would it be possible to integrate JTidy, or to apply Tidy's rules, in the HTML parser?
Assaf Arkin wrote:
>
> If a tag is not closed but its parent is closed, the tag will be
> forcefully closed and an error issued (but will not stop the parser). If
> the tag is optional closing (like P), no error will be issued. If the
> tag is explicitly closed (e.g. LI closes another LI, /UL and /OL close
> any open LI) it will be properly dealt with.
>
> HTML and BODY tags are always created whether they exist or not in the
> file.
>
> This is all taken care of and is most of what the HTML parser is
> supposed to do, as opposed to an XML parser which demands well formed
> documents.
>
> As for overlapping <b>, <i> and <form> (tricky), I use the DOM
> normalization and not any specific approach taken by any one parser.
> It's a bit easier for a parser to work with <b>/<i> since it need not
> create a DOM but just fontify text sections.
>
> arkin
>
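To make the closing rules quoted above concrete, here is a minimal sketch of how I read them. It is not the parser's actual code: the class name TagCloser, the OPTIONAL_END set and the event/error lists are all hypothetical, and it leaves out the implied HTML/BODY elements and the handling of overlapping <b>/<i>/<form>.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not taken from the HTML parser discussed here.
class TagCloser {
    // Assumption: a small sample of elements whose end tag is optional.
    private static final Set<String> OPTIONAL_END = Set.of("p", "li");

    private final Deque<String> open = new ArrayDeque<>(); // stack of open elements
    final List<String> events = new ArrayList<>();         // tags as emitted
    final List<String> errors = new ArrayList<>();         // non-fatal errors

    void start(String tag) {
        // An LI start tag closes an LI that is still open on top of the stack.
        if (tag.equals("li") && "li".equals(open.peek())) {
            close();
        }
        open.push(tag);
        events.add("<" + tag + ">");
    }

    void end(String tag) {
        if (!open.contains(tag)) {
            errors.add("stray end tag: " + tag); // report and carry on
            return;
        }
        // Force-close every child that is still open when its parent closes.
        while (!open.peek().equals(tag)) {
            String child = open.peek();
            if (!OPTIONAL_END.contains(child)) {
                errors.add("unclosed: " + child); // error, but parsing continues
            }
            close();
        }
        close();
    }

    private void close() {
        events.add("</" + open.pop() + ">");
    }

    public static void main(String[] args) {
        TagCloser c = new TagCloser();
        c.start("ul");
        c.start("li");
        c.start("li"); // closes the first LI
        c.start("b");
        c.end("ul");   // force-closes B (with an error) and LI (silently)
        System.out.println(c.events);
        System.out.println(c.errors);
    }
}
```

The point of the stack is that every repair is local: an end tag only ever closes elements above its match, so an unclosed child never stops the parse.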
