Both Tidy and OpenXML can correctly parse the same set of HTML documents, but Tidy can also sort out things like overlapping <b>/<i>. In some cases, parsing such markup through OpenXML and writing it back out as HTML will not give the same bold/italic effect as the original (viewing the original in a different browser can sometimes produce the same discrepancy).
Tidy can make this sort of correction because it processes the HTML DOM, whereas OpenXML creates the DOM as it parses, for SAX compatibility. Most of the time you won't notice a difference, but if you do and you want to correct such a document, I recommend HTML -> Tidy -> OpenXML -> HTML.

arkin

Wong Kok Wai wrote:
>
> Hi, Assaf,
>
> I believe you've suggested using Tidy/JTidy to preprocess the HTML before
> parsing. Is it possible to integrate JTidy or apply Tidy's rules in the HTML
> parser?
>
> Assaf Arkin wrote:
> >
> > If a tag is not closed but its parent is closed, the tag will be
> > forcefully closed and an error issued (but this will not stop the parser).
> > If the tag is optional closing (like P), no error will be issued. If the
> > tag is explicitly closed (e.g. LI closes another LI, /UL and /OL close
> > any open LI) it will be properly dealt with.
> >
> > HTML and BODY tags are always created whether they exist or not in the
> > file.
> >
> > This is all taken care of and is most of what the HTML parser is
> > supposed to do, as opposed to an XML parser, which demands well-formed
> > documents.
> >
> > As for overlapping <b>, <i> and <form> (tricky), I use the DOM
> > normalization and not any specific approach taken by any one parser.
> > It's a bit easier for a parser to work with <b>/<i> since it need not
> > create a DOM but just fontify text sections.
> >
> > arkin
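
For what it's worth, the tag-closing rules in the quoted message can be modeled with a simple stack of open elements. The following is only a rough sketch in Java, not OpenXML's actual code: the class TagCloseSketch and its methods are made-up names, treating LI as optional-closing is my assumption (the message only names P), and text nodes and the always-created HTML/BODY elements are left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;

/** Toy model of the tag-closing rules (hypothetical, not OpenXML's code). */
final class TagCloseSketch {
    // Elements that may be closed implicitly without an error.
    // P comes from the message; LI is an assumption made for this sketch.
    static final Set<String> OPTIONAL_END = Set.of("p", "li");

    final List<String> events = new ArrayList<>(); // normalized start/end tags
    final List<String> errors = new ArrayList<>(); // non-fatal errors
    private final Deque<String> open = new ArrayDeque<>(); // open elements, innermost first

    void startTag(String name) {
        // A new LI closes an LI that is still open in the same list.
        if (name.equals("li") && liOpenInCurrentList()) {
            closeThrough("li");
        }
        open.push(name);
        events.add("<" + name + ">");
    }

    void endTag(String name) {
        if (!open.contains(name)) {
            errors.add("stray </" + name + ">");
            return; // report and carry on; the parser does not stop
        }
        closeThrough(name);
    }

    // True if an LI is open and no UL/OL has been opened inside it.
    private boolean liOpenInCurrentList() {
        for (String tag : open) { // iterates from the innermost element outwards
            if (tag.equals("ul") || tag.equals("ol")) return false;
            if (tag.equals("li")) return true;
        }
        return false;
    }

    // Closes everything nested inside `name`, then `name` itself.
    // Precondition: `name` is currently open.
    private void closeThrough(String name) {
        while (!open.peek().equals(name)) {
            closeTop(true); // child left open when its parent closes
        }
        closeTop(false);
    }

    private void closeTop(boolean forced) {
        String top = open.pop();
        events.add("</" + top + ">");
        if (forced && !OPTIONAL_END.contains(top)) {
            errors.add("unclosed <" + top + ">");
        }
    }
}
```

Feeding it the start tags ul, li, li, b followed by the end tag /ul gives the events <ul><li></li><li><b></b></li></ul> and the single error "unclosed <b>": the second LI closes the first, /UL closes the open LI silently, and the dangling B is closed by force with an error.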
