Re: question about XML and HTML5

Henri Sivonen Thu, 18 Jun 2009 00:47:44 -0700

On Jun 17, 2009, at 14:47, Jonathan Rees wrote:

I don't see how your answer or the linked documents bear on my
question, so let me amplify.


Anne's answer seems entirely relevant to me.

The ideal situation:  you can take any HTML5 document, convert it to
some XML-based language designed for the purpose (not necessarily
XHTML), convert it back, and get a semantically equivalent HTML5
document.

The only HTML5 to XML conversion we have defined is conversion to XHTML5, which is not a 100% reversible conversion for some edge cases.

The edge cases are all arbitrary restrictions that XML places on what characters may appear where. For example, the conversion of an HTML5 document that has a form feed somewhere in element content is lossy, because XML doesn't allow form feed in element content. Likewise, the conversion is lossy when the source document has local names that are not NCNames. Also, the conversion is lossy for documents that have Unicode non-characters (e.g. U+FFFF) in element content.
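To make the character restriction concrete, here is a minimal sketch that lists the characters in a piece of text that the Char production of XML 1.0 does not permit. The function names are made up for illustration and are not part of any parser mentioned in this thread; the sketch covers only the character restriction, not the NCName check on local names.

```python
def is_xml_char(ch: str) -> bool:
    """True if ch matches the Char production of XML 1.0."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD      # excludes U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_non_xml_chars(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for each character XML 1.0 would reject."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if not is_xml_char(ch)
    ]


if __name__ == "__main__":
    # A form feed and U+FFFF in element content, as in the examples above.
    print(find_non_xml_chars("a\x0cb\uffff"))
```

Any text for which this returns a non-empty list cannot be carried over into XML element content unchanged.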

However, for *conforming* HTML5 documents, the only lossiness is form feed and the loss of semantically void talisman attributes (attributes in no namespace that have "xml:lang" or "xmlns" as the local name). Note that "xml:lang" in no namespace means nothing in text/html, and conformance requires it to be accompanied by "lang" in no namespace with the same value, which does carry meaning. To the extent the semantics of a form feed in text/html are the same as the semantics of a space and the semantics of non-characters are the same as the semantics of U+FFFD, for conforming documents, semantics are round-tripped.
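The substitutions described above can be sketched as follows. This is an illustration under stated assumptions, not the Validator.nu implementation: the function names are hypothetical, attributes are modeled as a plain dict of no-namespace names to values, form feed is mapped to a space, and any other character outside the XML 1.0 Char production is mapped to U+FFFD.

```python
# Hypothetical helper names; attributes are modeled as a simple dict.
TALISMAN_NAMES = {"xml:lang", "xmlns"}


def coerce_text_for_xml(text: str) -> str:
    """Map form feed to space and other non-XML characters to U+FFFD."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch == "\x0c":
            out.append(" ")            # form feed treated like a space
        elif (
            cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF
        ):
            out.append(ch)             # allowed by XML 1.0 as-is
        else:
            out.append("\ufffd")       # e.g. U+FFFF becomes U+FFFD
    return "".join(out)


def drop_talisman_attributes(attrs: dict[str, str]) -> dict[str, str]:
    """Drop no-namespace attributes whose local name is xml:lang or xmlns."""
    return {k: v for k, v in attrs.items() if k not in TALISMAN_NAMES}


if __name__ == "__main__":
    print(repr(coerce_text_for_xml("a\x0cb\uffffc")))
    print(drop_talisman_attributes(
        {"lang": "en", "xml:lang": "en", "xmlns": "http://www.w3.org/1999/xhtml"}
    ))
```

Under these assumptions the "lang" attribute survives, so the language information that carries meaning in text/html is preserved even though the talisman attributes are dropped.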

So I think it's fair to say that for conforming HTML5 documents, HTML5 -> XHTML5 -> HTML5 round trips semantics. (Note, however, that the conversion from XHTML5 to HTML5 is lossless if the XHTML5 document was a result of an HTML5 to XHTML5 conversion but it isn't lossless for arbitrary XHTML5 documents.)

The problem I'm worried about is the lack of interoperability between
HTML5 and XML processors. (It has nothing to do with browsers.) Other
specs such as OWL 2 and XQuery have addressed this problem by
providing XML syntax as an alternative. But this only achieves the
intended effect if semantics-preserving round trips work.

The Validator.nu HTML Parser works as a drop-in replacement for an XML parser in apps that have been programmed to consume XHTML using the DOM, SAX or XOM APIs. That is, the Validator.nu HTML Parser appears to the application as if it were an XML parser parsing XHTML5.

For comparison, 'tidy' provides conversion from HTML4 to XHTML (I
think), and the resulting XHTML is in a subset (I think) of HTML4, so
the round trip property holds.

The Validator.nu HTML Parser comes with a sample application called HTML2XML. When the input is a conforming HTML5 document, the output is the semantically equivalent XHTML5 document. HTML2XML doesn't repair non-conforming documents.


You can obtain the Java version from http://about.validator.nu/htmlparser/

Sam Ruby is working on a version that doesn't require the JVM invocation overhead:

http://intertwingly.net/blog/2009/06/15/Invoking-HtmlParser-from-C

If your pipeline is in Java, you don't need HTML2XML; you should just use the Validator.nu HTML Parser directly, which optimizes away the steps of serializing as XML and reparsing it.

I assume this approach doesn't work for
HTML5, which is why I do not necessarily have XHTML in mind as the
representation.

In my opinion, it would be bad if XHTML5 weren't the XML representation for HTML5 you can use in this case.

Our draft Design Principles contain the DOM Consistency design principle that is intended to keep the design of HTML5 such that XHTML5 is that representation. ("DOM" is rather browser-oriented. It helps to read it as "Infoset Consistency".)

http://www.w3.org/TR/html-design-principles/#dom-consistency

--
Henri Sivonen
[email protected]
http://hsivonen.iki.fi/
