Sam Ruby wrote:
James Graham wrote:
As I understand it, the full chain of events should look like this:

 [Internal data model in server]
                |
                |
       HTML 5 Serializer
                |
                |
            {Network}
                |
                |
          HTML 5 Parser
                |
                |
 [Whatever client tools you like]

This only works if the internal-data-model to HTML5 conversion is lossless.

The potentially-lossy-conversion argument is rather pointless when you consider that reserialising XHTML as HTML gives, for all practical purposes, almost exactly the same result as serving XHTML as text/html, or a better one.

The main difference is that instead of the conversion to HTML5 happening on the server side, as in that diagram, the browser receives XHTML which it then attempts to treat as HTML anyway. What practical difference is there? The following example illustrates this.

Say the following was your XHTML document. I'm only including the doctype because it's necessary for the example, not because it's useful to have in XHTML at all.

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
  <title>Example</title>
</head>
<body>
  <p>This document cannot be converted losslessly because:
    <ul>
      <li>A paragraph cannot contain a ul in HTML</li>
    </ul>
    and they will become siblings instead.</p>
</body>
</html>

There are 3 scenarios. In scenario 1, it's sent unchanged as XML. In scenario 2, the XHTML is serialised to HTML on the server side. In scenario 3, it's sent unchanged as text/html.


*Scenario 1: XHTML as XML*
When parsed by the browser using an XML parser, it produces the following DOM:
(whitespace nodes omitted and all elements are in the XHTML namespace)

* #DOCTYPE html
* html
    - ("http://www.w3.org/2000/xmlns/", "xmlns")
    - ("http://www.w3.org/XML/1998/namespace", "xml:lang")
  * head
    * title
      * #text: Example
  * body
    * p
      * #text: This document cannot be converted losslessly because:
      * ul
        * li
          * #text: A paragraph cannot contain a ul in HTML
      * #text: and they will become siblings instead.
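
The nesting in this outline can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree, not a browser; the names SOURCE, local_name and outline are made up for this illustration. ElementTree does not expose the doctype or the xmlns declaration as nodes, so only the element structure is printed.

```python
import xml.etree.ElementTree as ET

# The example document from this message, parsed as XML.
SOURCE = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
  <title>Example</title>
</head>
<body>
  <p>This document cannot be converted losslessly because:
    <ul>
      <li>A paragraph cannot contain a ul in HTML</li>
    </ul>
    and they will become siblings instead.</p>
</body>
</html>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def outline(element, depth=0):
    """Return one indented line per element, in document order."""
    lines = ["  " * depth + "* " + local_name(element.tag)]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(SOURCE)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

With an XML parser the ul stays a child of p, which is exactly the structure that cannot survive the trip through text/html.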


*Scenario 2: Reserialising as HTML*

Because a <p> cannot contain a <ul>, the document gets converted into the following:

<!DOCTYPE html>
<html lang="en">
<head>
  <title>Example</title>
</head>
<body>
  <p>This document cannot be converted losslessly because:
    </p><ul>
      <li>A paragraph cannot contain a ul in HTML</li>
    </ul><p>
    and they will become siblings instead.</p>
</body>
</html>

In this simple example, there were 4 changes:
* Removal of xmlns
* Changed xml:lang to lang
* The <p> element had to end immediately before the <ul>
* Created a new paragraph after the UL for the remaining sentence.

When parsed, the browser will produce a DOM that looks like this:

* #DOCTYPE html
* html
    - ("", "lang")
  * head
    * title
      * #text: Example
  * body
    * p
      * #text: This document cannot be converted losslessly because:
    * ul
      * li
        * #text: A paragraph cannot contain a ul in HTML
    * p
      * #text: and they will become siblings instead.
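
The four changes listed for scenario 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a real HTML5 serialiser: the set CLOSES_P is a deliberately short, hypothetical list of elements that end an open paragraph, the function names are invented here, and void elements such as br, comments and namespaced attributes other than xml:lang are not handled properly.

```python
import xml.etree.ElementTree as ET
from html import escape

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
# Assumption: a short, incomplete list of elements that cannot sit inside <p>.
CLOSES_P = {"ul", "ol", "dl", "div", "table", "blockquote", "pre", "p"}

SOURCE = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
  <title>Example</title>
</head>
<body>
  <p>This document cannot be converted losslessly because:
    <ul>
      <li>A paragraph cannot contain a ul in HTML</li>
    </ul>
    and they will become siblings instead.</p>
</body>
</html>"""


def local_name(qualified):
    """Drop the '{namespace}' prefix ElementTree adds to names."""
    return qualified.rsplit("}", 1)[-1]


def serialise(element):
    name = local_name(element.tag)
    # xml:lang becomes lang; xmlns is already consumed by the XML parser.
    attrs = "".join(
        ' %s="%s"' % ("lang" if key == XML_LANG else local_name(key), escape(value))
        for key, value in element.attrib.items()
    )
    parts = ["<%s%s>" % (name, attrs), escape(element.text or "", quote=False)]
    is_open = True  # only ever becomes False while serialising a <p>
    for child in element:
        blocks_p = name == "p" and local_name(child.tag) in CLOSES_P
        if blocks_p and is_open:
            parts.append("</p>")  # end the paragraph before the block
            is_open = False
        elif not blocks_p and not is_open:
            parts.append("<p>")  # inline content after a block: reopen
            is_open = True
        parts.append(serialise(child))
        tail = child.tail or ""
        if not is_open and tail.strip():
            parts.append("<p>")  # new paragraph for the remaining text
            is_open = True
        parts.append(escape(tail, quote=False))
    if is_open:
        parts.append("</%s>" % name)
    return "".join(parts)


def reserialise_as_html(xhtml_source):
    root = ET.fromstring(xhtml_source)
    return "<!DOCTYPE html>\n" + serialise(root)


html_output = reserialise_as_html(SOURCE)
print(html_output)
```

The design choice that matters is the last one in the loop: the serialiser, not the browser, decides that the trailing sentence gets its own paragraph.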


*Scenario 3: XHTML as text/html*

This relies on browser error recovery. The document is sent unchanged and produces the following DOM:

* #DOCTYPE html
* html
    - ("", "xmlns")
    - ("", "xml:lang")
  * head
    * title
      * #text: Example
  * body
    * p
      * #text: This document cannot be converted losslessly because:
    * ul
      * li
        * #text: A paragraph cannot contain a ul in HTML
    * #text: and they will become siblings instead.

In this final case, the DOM is similar to scenario 2, except for the following:

* The "xmlns" and "xml:lang" attributes are present in the null namespace.
* The "lang" attribute in the null namespace is missing.
* The final text node has become a child of body, instead of a child of a p element.
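
The attribute difference is easy to see at the tokeniser level. The sketch below uses Python's standard html.parser, which only reports tags and attributes and does not build a tree, so it says nothing about where the final text node ends up; the class name StartTagLogger and the shortened input are invented for this illustration.

```python
from html.parser import HTMLParser


class StartTagLogger(HTMLParser):
    """Record every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, attrs))


logger = StartTagLogger()
logger.feed(
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">'
    "<body><p>text<ul><li>item</li></ul>more</p></body></html>"
)
logger.close()

# An HTML tokeniser has no namespace handling: both are plain attribute names.
print(logger.start_tags[0])
```

To the HTML side, "xml:lang" is just an attribute whose name happens to contain a colon, and no "lang" attribute appears unless the author or a reserialiser adds one.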

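You've ended up with a lossy conversion of your XHTML in both text/html cases. In fact, the result is marginally better when you perform the reserialisation yourself, because you get to make smarter decisions than the browser's error recovery does: in scenario 2 the trailing sentence stays inside a paragraph, whereas in scenario 3 it ends up as a bare text node in body.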

The point is that complaining about the inability to perform lossless conversion in some cases is not really practically relevant for anyone who's willing to serve their XHTML documents as text/html anyway: the end result is practically the same, if not better, when you reserialise it yourself.

This issue has been around for years, ever since XHTML 1.0 began wreaking havoc on the world, yet it doesn't seem to have particularly bothered anyone trying to use it, or even promoting it.

You just need to realise that, if you wish to have your documents reserialised as HTML or even wrongly serve XHTML as text/html, you need to take care to avoid features which will result in a lossy conversion, or put up with the minor discrepancies.
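
One way to take that care is to check documents before publishing them. The following is a rough sketch, assuming the only feature of interest is a block element placed directly inside a p; the list BLOCK_IN_P is a hypothetical, incomplete selection and lossy_paragraphs is a name invented here.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
# Assumption: a short, incomplete list of elements that end a <p> in HTML.
BLOCK_IN_P = ("ul", "ol", "dl", "div", "table", "blockquote", "pre", "p")


def lossy_paragraphs(xhtml_source):
    """Report block elements that are direct children of a p element."""
    root = ET.fromstring(xhtml_source)
    findings = []
    for paragraph in root.iter(XHTML + "p"):
        for child in paragraph:
            name = child.tag.replace(XHTML, "")
            if name in BLOCK_IN_P:
                findings.append("p contains " + name)
    return findings


risky = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<p>before<ul><li>item</li></ul>after</p></body></html>"
)
safe = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<p>before</p><ul><li>item</li></ul><p>after</p></body></html>"
)
print(lossy_paragraphs(risky))
print(lossy_paragraphs(safe))
```

A document that passes a check like this should come through text/html with its paragraph structure intact; anything it reports is where the minor discrepancies would appear.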

--
Lachlan Hunt
http://lachy.id.au/
