If the document is well formed, Tidy, OpenXML and browsers can all create a valid document.
If the document has strange overlapping in decorators (b, i, tt, etc.), OpenXML forces a well-formed structure on it, while Tidy tries to figure out the author's intentions. Some browsers behave like Tidy, others behave like OpenXML. But unlike Tidy, a browser does not need to correct the DOM: since a decoration only affects how fonts are rendered, it can produce the right rendering while holding an OpenXML-equivalent DOM in memory.

Actual mileage might vary depending on the browser in use :-)

arkin

Mike Pogue wrote:
>
> I suspect that fixing the HTML should be done however a *browser* would
> do it (there are many million of those in use!).
>
> In particular, IE5 exposes its DOM, so it should be possible to run
> large amounts of HTML through the browser, and through the HTML parser,
> and then compare them. In cases where the input may be ambiguous, I
> think browsers would be a good "reference" implementation for this
> purpose...
>
> Mike
>
> Tom Palmer wrote:
> >
> > > My impression was that JTidy had to make a complete pass over the document
> > > in order to tidy it. This would preclude using it for a SAX
> > > (stream-based) parser.
> >
> > If so, too bad. (Of course, this wouldn't _preclude_ it, just make it
> > extremely inefficient.)
> >
> > I think it may get more complicated than Assaf listed in his algorithms, but
> > I still think a knowledge of the stack of what tags are currently open is
> > sufficient to fix the HTML.
> >
> > - Tom Palmer
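
To make the difference between the two behaviours concrete, here is a minimal sketch in Python of a stack-of-open-tags repair for overlapping decorators. It is not how Tidy, OpenXML or any browser actually implements this; the function name `fix_overlap` and the `reopen` flag are made up for illustration, and the tokenizer only understands bare tags without attributes. With `reopen=False` it simply forces a well-formed structure; with `reopen=True` it tries to keep the decoration the author probably intended by reopening the tags it had to close early.

```python
import re

# Matches a bare open/close tag such as <b> or </i>, or a run of text.
# Assumption: no attributes, comments or entities; this is only a sketch.
TOKEN = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)>|([^<]+)")


def fix_overlap(html, reopen=True):
    """Repair overlapping inline tags using a stack of currently open tags."""
    out = []
    stack = []  # names of open tags, innermost last

    for match in TOKEN.finditer(html):
        closing, name, text = match.groups()

        if text is not None:
            out.append(text)
            continue

        name = name.lower()
        if not closing:
            stack.append(name)
            out.append(f"<{name}>")
            continue

        if name not in stack:
            # Stray close tag with no matching open tag: drop it.
            continue

        # Close every tag that was opened after the one being closed.
        closed_early = []
        while stack[-1] != name:
            inner = stack.pop()
            out.append(f"</{inner}>")
            closed_early.append(inner)

        stack.pop()
        out.append(f"</{name}>")

        if reopen:
            # Reopen the early-closed tags in their original order so the
            # following text keeps its decoration.
            for inner in reversed(closed_early):
                stack.append(inner)
                out.append(f"<{inner}>")

    # Close anything still open at the end of the input.
    while stack:
        out.append(f"</{stack.pop()}>")

    return "".join(out)
```

For the input `<b>bold <i>both</b> italic</i>`, the `reopen=True` variant yields `<b>bold <i>both</i></b><i> italic</i>`, while `reopen=False` yields `<b>bold <i>both</i></b> italic`, losing the italics on the trailing text. Both need only the current stack, so they could in principle run in a single streaming pass.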
