That's the main difference between OpenXML and Tidy: OpenXML can generate SAX events as it parses, with little memory overhead. Most of the time the two produce the same DOM.
arkin

Court Demas wrote:
>
> My impression was that JTidy had to make a complete pass over the document
> in order to tidy it. This would preclude using it for a SAX (stream-based)
> parser.
>
> court
>
> Wong Kok Wai wrote:
>
> > Hi, Assaf,
> >
> > I believe you've suggested using Tidy/JTidy to preprocess the HTML before
> > parsing. Is it possible to integrate JTidy, or apply Tidy's rules, in the
> > HTML parser?
> >
> > Assaf Arkin wrote:
> >
> > >
> > > If a tag is not closed but its parent is closed, the tag will be
> > > forcefully closed and an error issued (but this will not stop the
> > > parser). If the tag's closing is optional (like P), no error will be
> > > issued. If the tag is explicitly closed (e.g. LI closes another LI,
> > > /UL and /OL close any open LI) it will be properly dealt with.
> > >
> > > HTML and BODY tags are always created whether they exist or not in the
> > > file.
> > >
> > > This is all taken care of and is most of what the HTML parser is
> > > supposed to do, as opposed to an XML parser which demands well-formed
> > > documents.
> > >
> > > As for overlapping <b>, <i> and <form> (tricky), I use the DOM
> > > normalization and not any specific approach taken by any one parser.
> > > It's a bit easier for a parser to work with <b>/<i> since it need not
> > > create a DOM but just fontify text sections.
> > >
> > > arkin
> > >
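
For anyone following the thread, the tag-closing rules described in the quoted message can be illustrated with a small stack-based sketch. This is not OpenXML's actual code. The token format, the `balance` function, and the `OPTIONAL_END` set are all hypothetical names made up for illustration, and overlapping <b>/<i>/<form> handling is left out entirely.

```python
# Assumption: tags whose end tag may be omitted without reporting an error.
OPTIONAL_END = {"p", "li"}


def balance(tokens):
    """Turn a flat list of (kind, name) tokens into balanced events.

    kind is "open", "close" or "text". Returns (events, errors), where
    events is a list of ("start" | "end" | "text", value) pairs.
    """
    stack, events, errors = [], [], []

    def open_tag(name):
        stack.append(name)
        events.append(("start", name))

    def close_top(report):
        name = stack.pop()
        events.append(("end", name))
        # Forced closes are reported unless the end tag is optional.
        if report and name not in OPTIONAL_END:
            errors.append(f"<{name}> was not closed")

    # HTML and BODY are always created, whether or not the input has them.
    open_tag("html")
    open_tag("body")

    for kind, name in tokens:
        if kind == "text":
            events.append(("text", name))
        elif name in ("html", "body"):
            continue  # already created above, closed at the very end
        elif kind == "open":
            # An LI implicitly closes a previous LI that is still open.
            if name == "li" and stack[-1] == "li":
                close_top(report=False)
            open_tag(name)
        elif kind == "close":
            if name not in stack:
                errors.append(f"stray </{name}>")
                continue
            # The parent is closing: force-close any children still open.
            while stack[-1] != name:
                close_top(report=True)
            close_top(report=False)

    # End of input: close whatever remains, including BODY and HTML.
    while stack:
        close_top(report=False)
    return events, errors


# Input roughly equivalent to: <ul><li>a<li><b>b</ul>
sample = [("open", "ul"), ("open", "li"), ("text", "a"),
          ("open", "li"), ("open", "b"), ("text", "b"), ("close", "ul")]
events, errors = balance(sample)
print(events)
print(errors)
```

Because the events come out in document order as the tokens are consumed, a balancer of this shape could feed a SAX-style handler directly instead of building a DOM first, which is the streaming property discussed at the top of this message.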
