On Apr 16, 2008, at 12:58, Paul Libbrecht wrote:
> In fact, the reason why the proportion of Web pages that get parsed
> as XML is negligible is that the XML approach totally failed to
> plug into the existing text/html network effects[...]
> My hypothesis here is that this problem is mostly a parsing problem
> and not a model problem. HTML5 mixes the two.
For backwards compatibility in scripted browser environments, the HTML
DOM can't behave exactly like the XHTML5 DOM. For non-scripted non-
browser environments, using an XML data model (XML DOM, XOM, JDOM,
dom4j, SAX, ElementTree, lxml, etc., etc.) works fine.
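To illustrate the point that the XML data model is sufficient for non-scripted, non-browser processing: a minimal sketch using only the JDK's standard DOM API to parse an XHTML5 document and query it by namespace. (The document string and class name here are made up for the example.)

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XhtmlDomDemo {
    public static void main(String[] args) throws Exception {
        // A minimal XHTML5 document; a plain XML data model
        // handles it like any other namespaced XML.
        String xhtml =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
            + "<head><title>Demo</title></head>"
            + "<body><p>Hello</p></body></html>";

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // XHTML elements live in a namespace
        Document doc = factory.newDocumentBuilder()
            .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));

        Element root = doc.getDocumentElement();
        System.out.println(root.getNamespaceURI());
        System.out.println(doc.getElementsByTagNameNS(
            "http://www.w3.org/1999/xhtml", "p").item(0).getTextContent());
    }
}
```

Nothing in the consuming code needs to know whether the tree was produced by an XML parser or by an HTML5 parser that builds the same namespaced tree from text/html.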
> There are tools that convert quite a lot of text/html pages (whose
> compliance is user-defined to be "it works in my browser") to an XML
> stream today; NekoHTML is one of them. The goal would be to
> formalize this parsing, and just this parsing.
Like NekoHTML and TagSoup, the Validator.nu HTML parser turns text/
html input into Java XML models. The difference is that the
Validator.nu HTML parser implements the HTML5 algorithm instead of
something the authors of NekoHTML and TagSoup figured out on their
own. So if you are asking for a NekoHTML-like product for HTML5, it
already exists and supports three popular Java XML APIs (SAX, DOM and
XOM). It doesn't support XNI at the moment, nor the recent MathML
addition, *yet*.
http://about.validator.nu/htmlparser/
> Currently HTML5 defines at the same time parsing and the model and
> this is what can cause us to expect that XML is getting weaker. I
> believe that the whole model-definition work of XML is rich, has
> many libraries, has empowered a lot of great developments and it
> is a bad idea to drop it instead of enriching it.
The dominant design of non-browser HTML5 parsing libraries is
exposing the document tree using an XML parser API. The non-browser
HTML5 libraries, therefore, plug into the network of XML libraries.
For example, Validator.nu's internals operate on SAX events that
look like SAX events for an XHTML5 document. This allows
Validator.nu to use libraries written for XML, such as oNVDL and
Saxon.
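A minimal sketch of what "operating on SAX events" means in practice: a consumer written purely against the SAX ContentHandler interface. Here the JDK's own XML parser drives it over an XHTML string; an HTML5 parser that emits XHTML5-shaped SAX events (as the Validator.nu parser does) could drive the same handler from text/html input instead. The class names and document string are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxConsumerDemo {
    // A downstream consumer written purely against SAX: it never
    // knows whether the events came from an XML parser or from an
    // HTML5 parser emitting equivalent events.
    static class ElementCounter extends DefaultHandler {
        int elements = 0;
        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            elements++;
        }
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
            + "<body><p>one</p><p>two</p></body></html>";
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        ElementCounter counter = new ElementCounter();
        factory.newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), counter);
        // html, body, p, p
        System.out.println(counter.elements);
    }
}
```

This is exactly how such a parser plugs into the network of XML libraries: anything that accepts SAX events (validators, XSLT engines, serializers) works unchanged.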
> So, except for needing yet another XHTML version to accommodate all
> wishes, I think it would be much saner that browsers'
> implementations and related specifications rely on an XML-based
> model of HTML (as the DOM is) instead of a coupled parsing-and-
> modelling specification which has different interpretations at
> different places.
HTML5 already specifies parsing in terms of DOM output. However, when
the DOM is in the HTML mode, it has to behave slightly differently.
--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/