On Jun 17, 2009, at 14:47, Jonathan Rees wrote:
> I don't see how your answer or the linked documents bear on my
> question, so let me amplify.
Anne's answer seems entirely relevant to me.
> The ideal situation: you can take any HTML5 document, convert it to
> some XML-based language designed for the purpose (not necessarily
> XHTML), convert it back, and get a semantically equivalent HTML5
> document.
The only HTML5 to XML conversion we have defined is conversion to
XHTML5, which is not a 100% reversible conversion for some edge cases.
The edge cases are all arbitrary restrictions that XML places on what
characters may appear where. For example, the conversion of an HTML5
document that has a form feed somewhere in element content is lossy,
because XML doesn't allow form feed in element content. Likewise, the
conversion is lossy when the source document has local names that are
not NCNames. Also, the conversion is lossy for documents that have
Unicode non-characters (e.g. U+FFFF) in element content.
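As a rough illustration of the character-level restrictions above, here is a minimal Python sketch that lists the characters in a string that fall outside the XML 1.0 Char production. The function name is made up for this example, and the sketch does not cover the NCName restriction on local names.

```python
import re

# Complement of the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NON_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def find_non_xml_chars(text):
    """Return the characters in text that XML 1.0 does not allow anywhere.

    Hypothetical helper for illustration only.
    """
    return _NON_XML_CHAR.findall(text)

# A form feed (U+000C) and U+FFFF are both reported; ordinary text is not.
print([hex(ord(c)) for c in find_non_xml_chars("a\x0cb\uffff")])
print(find_non_xml_chars("plain text\n"))
```

Any text for which this returns a non-empty list cannot be carried over into XML element content unchanged.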
However, for *conforming* HTML5 documents, the only lossiness is
form feed and the loss of semantically void talisman attributes
(attributes in no namespace that have "xml:lang" or "xmlns" as the
local name). Note that "xml:lang" in no namespace means nothing in
text/html and conformance requires it to be accompanied by "lang" in
no namespace with the same value and that does carry meaning. To the
extent the semantics of a form feed in text/html are the same as the
semantics of a space and the semantics of non-characters are the same
as the semantics of U+FFFD, for conforming documents, semantics are
round-tripped.
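To make the lossy steps concrete, here is a minimal Python sketch of the kind of coercion described above: form feed becomes a space, other characters XML 1.0 disallows become U+FFFD, and the no-namespace "xml:lang" and "xmlns" attributes are dropped. The function names are hypothetical, and this is an assumption about how such a conversion could be written, not a description of what any particular parser does. Mapping every disallowed character (not only non-characters) to U+FFFD is likewise a simplification made for this example.

```python
import re

# Complement of the XML 1.0 "Char" production.
_NON_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

# No-namespace attribute names that carry no meaning in text/html.
TALISMAN_ATTRS = {"xml:lang", "xmlns"}

def coerce_text(text):
    """Make element content representable in XML (hypothetical helper)."""
    text = text.replace("\x0c", " ")          # form feed -> space
    return _NON_XML_CHAR.sub("\ufffd", text)  # anything else illegal -> U+FFFD

def coerce_attrs(attrs):
    """Drop talisman attributes from a {name: value} dict of no-namespace attributes."""
    return {name: value for name, value in attrs.items()
            if name not in TALISMAN_ATTRS}

print(repr(coerce_text("a\x0cb")))
print(coerce_attrs({"lang": "en", "xml:lang": "en"}))
```

Because a conforming document must carry "lang" alongside "xml:lang" with the same value, dropping the latter loses no meaning.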
So I think it's fair to say that for conforming HTML5 documents,
HTML5->XHTML5->HTML5 round trips semantics. (Note, however, that the
conversion from XHTML5 to HTML5 is lossless if the XHTML5 document was
a result of an HTML5 to XHTML5 conversion but it isn't lossless for
arbitrary XHTML5 documents.)
> The problem I'm worried about is the lack of interoperability between
> HTML5 and XML processors. (It has nothing to do with browsers.) Other
> specs such as OWL 2 and XQuery have addressed this problem by
> providing XML syntax as an alternative. But this only achieves the
> intended effect if semantics-preserving round trips work.
The Validator.nu HTML Parser works as a drop-in replacement for an XML
parser in apps that have been programmed to consume XHTML using the
DOM, SAX or XOM APIs. That is, the Validator.nu HTML Parser appears to
the application as if it were an XML parser parsing XHTML5.
> For comparison, 'tidy' provides conversion from HTML4 to XHTML (I
> think), and the resulting XHTML is in a subset (I think) of HTML4, so
> the round trip property holds.
The Validator.nu HTML Parser comes with a sample application called
HTML2XML. When the input is a conforming HTML5 document, the output is
the semantically equivalent XHTML5 document. HTML2XML doesn't repair
non-conforming documents.
You can obtain the Java version from http://about.validator.nu/htmlparser/
Sam Ruby is working on a version that doesn't require the JVM
invocation overhead:
http://intertwingly.net/blog/2009/06/15/Invoking-HtmlParser-from-C
If your pipeline is in Java, you don't need HTML2XML; you should
just use the Validator.nu HTML Parser directly, which optimizes away
the steps of serializing the document as XML and reparsing it.
> I assume this approach doesn't work for
> HTML5, which is why I do not necessarily have XHTML in mind as the
> representation.
In my opinion, it would be bad if XHTML5 weren't the XML
representation for HTML5 you can use in this case.
Our draft Design Principles contain the DOM Consistency design
principle that is intended to keep the design of HTML5 such that
XHTML5 is that representation. ("DOM" is rather browser-oriented. It
helps to read it as "Infoset Consistency".)
http://www.w3.org/TR/html-design-principles/#dom-consistency
--
Henri Sivonen
[email protected]
http://hsivonen.iki.fi/