If the document is well formed, Tidy, OpenXML and browsers can all
create a valid document.

If the document has strangely overlapping decorators (b, i, tt, etc.),
OpenXML forces a well-formed structure on it, while Tidy tries to figure
out the author's intentions. Some browsers behave like Tidy, others
behave like OpenXML.

But unlike Tidy, a browser does not need to correct the DOM. Since a
decoration only affects how fonts are rendered, the browser can produce
the right rendering while holding an OpenXML-equivalent DOM in memory.
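
To make the two strategies for overlapping decorators concrete, here is
a minimal sketch in Python. It is not how Tidy, OpenXML or any browser
is actually implemented; the function name fix_overlap, the reopen flag
and the toy tokenizer (bare tags without attributes only) are all
assumptions made for illustration. It keeps a stack of open tags and,
when a close tag arrives out of order, either re-opens the tags it had
to close early (roughly "guess the author's intentions") or simply
forces nesting and drops the stray close tag.

```python
import re

# Toy tokenizer (assumption): bare tags such as <b> or </i>, plus text.
# Tags with attributes and stray "<" characters are not handled.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>|([^<]+)")


def fix_overlap(html, reopen=True):
    """Return html with its tags properly nested.

    reopen=True:  tags closed early are re-opened afterwards, so the
                  text keeps the styling the author probably meant.
    reopen=False: nesting is forced and stray close tags are dropped.
    """
    out = []
    stack = []  # names of the currently open tags, innermost last
    for match in TOKEN.finditer(html):
        closing, name, text = match.groups()
        if text is not None:
            out.append(text)
            continue
        name = name.lower()
        if not closing:
            stack.append(name)
            out.append(f"<{name}>")
            continue
        if name not in stack:
            continue  # close tag with no matching open tag: drop it
        # Close every tag that was opened after `name`.
        closed_early = []
        while stack[-1] != name:
            inner = stack.pop()
            out.append(f"</{inner}>")
            closed_early.append(inner)
        stack.pop()
        out.append(f"</{name}>")
        if reopen:
            # Re-open in the original order (outermost first).
            for inner in reversed(closed_early):
                stack.append(inner)
                out.append(f"<{inner}>")
    # Close anything still open at the end of the input.
    while stack:
        out.append(f"</{stack.pop()}>")
    return "".join(out)


sample = "<b>bold <i>both</b> italic</i>"
print(fix_overlap(sample))                # <b>bold <i>both</i></b><i> italic</i>
print(fix_overlap(sample, reopen=False))  # <b>bold <i>both</i></b> italic
```

With reopen=False the word "italic" loses its styling, which is the
price of forcing structure without guessing what the author meant.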

Actual mileage might vary depending on the browser in use :-)

arkin


Mike Pogue wrote:
> 
> I suspect that fixing the HTML should be done however a *browser* would
> do it (there are many millions of those in use!).
> 
> In particular, IE5 exposes its DOM, so it should be possible to run
> large amounts of HTML through the browser, and through the HTML parser,
> and then compare them.  In cases where the input may be ambiguous, I
> think browsers would be a good "reference" implementation for this
> purpose...
> 
> Mike
> 
> Tom Palmer wrote:
> >
> > > My impression was that JTidy had to make a complete pass over the document
> > > in order to tidy it.  This would preclude using it for a SAX
> > > (stream-based) parser.
> > >
> > If so, too bad.  (Of course, this wouldn't _preclude_ it, just make it
> > extremely inefficient.)
> >
> > I think it may get more complicated than Assaf listed in his algorithms, but
> > I still think a knowledge of the stack of what tags are currently open is
> > sufficient to fix the HTML.
> >
> > - Tom Palmer
