On Tue, 24 Nov 2009 21:14:54 +0100, John Cowan <co...@ccil.org> wrote:

Ian Hickson scripsit:

TagSoup could be made more compatible with existing deployed content,
then. It might be compatible enough for most purposes already, but there
are pages on the Web that depend on the <head> element being always
present. Also, the <ul> element should certainly not be implied.

TagSoup is not intended for deployment in browsers.  Rather, it generates
SAX events based on HTML input, permitting fairly arbitrary HTML to
be processed using XML tools such as XSLT.  It guarantees, therefore,
that the output is well-formed XML (except for encoding issues) rather
than that it conforms to any specific schema.  If you don't like what
TagSoup outputs, you can always transform the output further until the
result is more like what you expect.
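
As a rough illustration of the idea described above (HTML in, balanced
SAX events out), here is a minimal Python sketch. It is not TagSoup and
does not use TagSoup's API: the class SoupToSax and the function
soup_to_xml are hypothetical names, Python's html.parser stands in for
the tokenizer, and the only repair attempted is balancing start and end
tags. It makes no attempt at a single root element, valid XML names, or
implied elements.

```python
from html.parser import HTMLParser
from io import StringIO
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

# Small, incomplete set of elements that never have content (assumption).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class SoupToSax(HTMLParser):
    """Toy converter: feeds balanced SAX events to a ContentHandler."""

    def __init__(self, handler):
        super().__init__(convert_charrefs=True)
        self.handler = handler
        self.open_elements = []  # names of elements not yet closed

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (<input disabled>) get their name as value;
        # building a dict also collapses duplicate attribute names.
        attr_map = {k: (v if v is not None else k) for k, v in attrs}
        self.handler.startElement(tag, AttributesImpl(attr_map))
        if tag in VOID:
            self.handler.endElement(tag)
        else:
            self.open_elements.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_elements:
            return  # stray end tag: nothing to close
        # Close everything opened since the matching start tag.
        while self.open_elements:
            name = self.open_elements.pop()
            self.handler.endElement(name)
            if name == tag:
                break

    def handle_data(self, data):
        if self.open_elements:  # text outside any element is dropped
            self.handler.characters(data)

    def close(self):
        super().close()
        while self.open_elements:  # close whatever the input left open
            self.handler.endElement(self.open_elements.pop())


def soup_to_xml(html):
    out = StringIO()
    generator = XMLGenerator(out, encoding="utf-8")
    generator.startDocument()
    parser = SoupToSax(generator)
    parser.feed(html)
    parser.close()
    generator.endDocument()
    return out.getvalue()
```

Any other ContentHandler could take the place of XMLGenerator here,
which is where further transformation of the output would plug in.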

In particular, there are absolutely no guarantees that CSS paths or
JavaScript DOM references that work on the HTML will continue to work
on the XML; they probably won't.

In principle it would be possible to use an implementation of the HTML5
algorithm to construct a DOM and then use a simple DOM walker to read
out SAX events, but this would be much more heavyweight in time and
space than TagSoup is, so I imagine it will continue to be used.
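
The DOM-walker approach could look roughly like the Python sketch
below. It assumes xml.dom.minidom as a stand-in for a tree built by an
HTML5 parser, and the function names walk, dom_to_sax and serialize are
made up for illustration. The cost mentioned above is visible in
serialize: the whole tree exists in memory before the first event is
emitted.

```python
from io import StringIO
from xml.dom import Node, minidom
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl


def walk(node, handler):
    """Depth-first walk of a DOM subtree, replayed as SAX events."""
    if node.nodeType == Node.ELEMENT_NODE:
        attrs = dict(node.attributes.items())
        handler.startElement(node.tagName, AttributesImpl(attrs))
        for child in node.childNodes:
            walk(child, handler)
        handler.endElement(node.tagName)
    elif node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
        handler.characters(node.data)
    # Comments and processing instructions are skipped in this sketch.


def dom_to_sax(document, handler):
    handler.startDocument()
    walk(document.documentElement, handler)
    handler.endDocument()


def serialize(xml_text):
    # The complete DOM is built first; only then do events start flowing.
    document = minidom.parseString(xml_text)
    out = StringIO()
    dom_to_sax(document, XMLGenerator(out, encoding="utf-8"))
    return out.getvalue()
```

The walker recurses once per nesting level, so very deeply nested
input would need an explicit stack instead.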

The Validator.nu HTML parser can be run in SAX streaming mode, which doesn't construct an intermediate DOM.

Because of things like attributes on stray <html> tags affecting attributes on the root element, a streaming parser sometimes either has to abort, emit non-SAX events or violate HTML5.
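
To make the three options concrete, here is a toy Python sketch. It is
not the Validator.nu API: the token format, the stream function, the
policy names and the addRootAttributes event are all invented for
illustration, and the first start tag is simply assumed to be the root.
The point is that startElement for the root has already been emitted
when a stray <html> tag arrives, so attributes that HTML5 says to add
to the root can no longer be expressed in plain SAX.

```python
class StreamingError(Exception):
    """Raised when the stream cannot represent what HTML5 requires."""


def stream(tokens, policy="abort"):
    """Convert (kind, value, attrs) tokens into a list of events.

    policy "abort"  raises StreamingError on late root attributes,
    policy "extend" emits a non-SAX addRootAttributes event,
    policy "drop"   ignores them (and so deviates from HTML5).
    """
    if policy not in ("abort", "extend", "drop"):
        raise ValueError("unknown policy: " + policy)
    events = []
    root_attrs = None  # attributes already emitted on the root element
    for kind, value, attrs in tokens:
        if kind == "start" and value == "html" and root_attrs is not None:
            # Stray <html>: only attributes missing from the root matter.
            late = {k: v for k, v in attrs.items() if k not in root_attrs}
            if late and policy == "abort":
                raise StreamingError("root attributes arrived too late")
            if late and policy == "extend":
                events.append(("addRootAttributes", late))  # not SAX
                root_attrs.update(late)
            continue
        if kind == "start":
            if root_attrs is None:
                root_attrs = dict(attrs)
            events.append(("startElement", value, dict(attrs)))
        elif kind == "end":
            events.append(("endElement", value))
        elif kind == "text":
            events.append(("characters", value))
    return events
```

A tree-building parser avoids the dilemma because the root node is
still available to be modified when the stray tag is seen.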


Also, on another note, TagSoup is not compliant with HTML4 if it fails to
output a HEAD element when there is no explicit <HEAD> tag, since <HEAD>
is an optional tag in HTML4. :-)

True; see above.



--
Simon Pieters
Opera Software
