Re: [whatwg] several messages about XML syntax and HTML5

Lachlan Hunt Mon, 04 Dec 2006 15:26:23 -0800

Sam Ruby wrote:

James Graham wrote:

As I understand it, the full chain of events should look like this:


 [Internal data model in server]
                |
                |
       HTML 5 Serializer
                |
                |
            {Network}
                |
                |
          HTML 5 Parser
                |
                |
 [Whatever client tools you like]

This only works if the internal-data-model to HTML5 conversion is lossless.

The potentially-lossy-conversion argument is rather pointless when you consider that reserialising XHTML as HTML is, for all practical purposes, almost exactly the same as, or better than, serving XHTML as text/html.

The main difference is that instead of the conversion to HTML5 happening on the server side, as in that diagram, the browser receives XHTML which it then attempts to treat as HTML anyway. What practical difference is there? The following example illustrates this.

Say the following was your XHTML document. I'm only including the doctype because it's necessary for the example, not because it's useful to have in XHTML at all.


<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
  <title>Example</title>
</head>
<body>
  <p>This document cannot be converted losslessy because:
    <ul>
      <li>A paragraph cannot contain a ul in HTML</li>
    </ul>
    and they will become siblings instead.</p>
</body>
</html>

There are 3 scenarios. In scenario 1, it's sent unchanged as XML. In scenario 2, the XHTML is serialised to HTML on the server side. In scenario 3, it's sent unchanged as text/html.



*Scenario 1: XHTML as XML*

When parsed by the browser using an XML parser, it produces the following DOM:

(whitespace nodes omitted and all elements are in the XHTML namespace)

* #DOCTYPE html
* html
    - ("http://www.w3.org/2000/xmlns/", "xmlns")
    - ("http://www.w3.org/XML/1998/namespace", "xml:lang")
  * head
    * title
      * #text: Example
  * body
    * p
      * #text: This document cannot be converted losslessy because:
      * ul
        * li
          * #text: A paragraph cannot contain a ul in HTML
      * #text: and they will become siblings instead.
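
The nesting in scenario 1 can be checked outside a browser. This is only a rough sketch: it uses Python's xml.etree.ElementTree in place of a browser's XML parser, and the dump helper is my own name, not part of any library. ElementTree does not expose the xmlns declaration as an attribute and keeps no doctype node, so the outline only approximates the DOM listed above.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The example document from this message, unchanged.
DOC = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
  <title>Example</title>
</head>
<body>
  <p>This document cannot be converted losslessy because:
    <ul>
      <li>A paragraph cannot contain a ul in HTML</li>
    </ul>
    and they will become siblings instead.</p>
</body>
</html>"""


def dump(el, depth=0):
    """Hypothetical helper: indented outline, whitespace-only text skipped."""
    lines = ["  " * depth + "* " + el.tag.replace(XHTML, "")]
    if el.text and el.text.strip():
        lines.append("  " * (depth + 1) + "* #text: " + " ".join(el.text.split()))
    for child in el:
        lines.extend(dump(child, depth + 1))
        # Text following a child element is stored on that child as .tail.
        if child.tail and child.tail.strip():
            lines.append("  " * (depth + 1) + "* #text: " + " ".join(child.tail.split()))
    return lines


root = ET.fromstring(DOC)
outline = dump(root)
print("\n".join(outline))
```

With an XML parser the ul stays a child of p, and the trailing sentence stays inside the same p.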


*Scenario 2: Reserialising as HTML*

Because a <p> cannot contain a <ul>, the document gets converted into the following:


<!DOCTYPE html>
<html lang="en">
<head>
  <title>Example</title>
</head>
<body>
  <p>This document cannot be converted losslessy because:
    </p><ul>
      <li>A paragraph cannot contain a ul in HTML</li>
    </ul><p>
    and they will become siblings instead.</p>
</body>
</html>

In this simple example, there were 4 changes:
* Removal of xmlns
* Changed xml:lang to lang
* The <p> element had to end immediately before the <ul>
* Created a new paragraph after the UL for the remaining sentence.

When parsed, the browser will produce a DOM that looks like this:

* #DOCTYPE html
* html
    - ("", "lang")
  * head
    * title
      * #text: Example
  * body
    * p
      * #text: This document cannot be converted losslessy because:
    * ul
      * li
        * #text: A paragraph cannot contain a ul in HTML
    * p
      * #text: and they will become siblings instead.
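
The reserialisation in scenario 2 could be sketched roughly as follows. This is not a real HTML5 serialiser: to_html, attrs, local and the CLOSES_P set are names I made up for illustration, the set of elements that end a paragraph is deliberately incomplete, and void elements, comments and foreign namespaces are ignored.

```python
import xml.etree.ElementTree as ET
from html import escape

XHTML = "{http://www.w3.org/1999/xhtml}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Assumption: a small, incomplete set of elements that cannot sit inside <p>.
CLOSES_P = {"ul", "ol", "div", "table", "p"}


def local(tag):
    """Strip the XHTML namespace so <{ns}p> is written as <p>."""
    return tag[len(XHTML):] if tag.startswith(XHTML) else tag


def attrs(el):
    """Serialise attributes, renaming xml:lang to lang."""
    out = []
    for name, value in el.attrib.items():
        name = "lang" if name == XML_LANG else name
        out.append(f' {name}="{escape(value)}"')
    return "".join(out)


def to_html(el):
    """Serialise an element, ending an open <p> before any child it cannot contain."""
    name = local(el.tag)
    parts = [f"<{name}{attrs(el)}>", escape(el.text or "", quote=False)]
    for child in el:
        splits = name == "p" and local(child.tag) in CLOSES_P
        if splits:
            parts.append("</p>")  # end the paragraph before the block element
        parts.append(to_html(child))
        if splits:
            parts.append("<p>")   # new paragraph for whatever text follows
        parts.append(escape(child.tail or "", quote=False))
    parts.append(f"</{name}>")
    return "".join(parts)


sample = ET.fromstring(
    '<p xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">'
    "before<ul><li>item</li></ul>after</p>"
)
print(to_html(sample))
```

Doing the split on the server is what lets you decide that the trailing text deserves its own paragraph, rather than leaving that to browser error recovery.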


*Scenario 3: XHTML as text/html*

This relies on browser error recovery. The document is sent unchanged and produces the following DOM:


* #DOCTYPE html
* html
    - ("", "xmlns")
    - ("", "xml:lang")
  * head
    * title
      * #text: Example
  * body
    * p
      * #text: This document cannot be converted losslessy because:
    * ul
      * li
        * #text: A paragraph cannot contain a ul in HTML
    * #text: and they will become siblings instead.

In this final case, the DOM is similar to scenario 2, except for the following:


* The "xmlns" and "xml:lang" attributes in the null namespace.
* The lack of the "lang" attribute in the null namespace.
* The final text node has become a child of body, instead of a p element.
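
The recovery behind scenario 3 can be imitated with a toy tree builder. This is a sketch under loud assumptions: it sits on Python's html.parser tokenizer, ToyTreeBuilder and build are hypothetical names, and its two recovery rules are hand-picked to reproduce the tree shown above. It is not the HTML5 parsing algorithm, and real browsers may treat the stray </p> differently.

```python
from html.parser import HTMLParser

DOC = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
  <title>Example</title>
</head>
<body>
  <p>This document cannot be converted losslessy because:
    <ul>
      <li>A paragraph cannot contain a ul in HTML</li>
    </ul>
    and they will become siblings instead.</p>
</body>
</html>"""


class ToyTreeBuilder(HTMLParser):
    """Toy builder; nodes are (name, attributes, children) tuples."""

    def __init__(self):
        super().__init__()
        self.root = ("#document", {}, [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Rule 1 (assumed): a <ul> start tag ends a currently open <p>.
        if tag == "ul" and self.stack[-1][0] == "p":
            self.stack.pop()
        node = (tag, dict(attrs), [])
        self.stack[-1][2].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Rule 2 (assumed): an end tag with no matching open element is ignored.
        names = [node[0] for node in self.stack]
        if tag in names:
            last = len(names) - 1 - names[::-1].index(tag)
            del self.stack[last:]

    def handle_data(self, data):
        # Whitespace-only text is dropped, as in the outlines in this message.
        if data.strip():
            self.stack[-1][2].append(" ".join(data.split()))


def build(doc):
    """Hypothetical convenience wrapper: parse doc and return the root node."""
    builder = ToyTreeBuilder()
    builder.feed(doc)
    builder.close()
    return builder.root


tree = build(DOC)
```

Because the tokenizer knows nothing about namespaces, xmlns and xml:lang come through as ordinary attribute names, and the text after the ul lands directly in body.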

You've ended up with a lossy conversion of your XHTML in both text/html cases. In fact, it's marginally better when you perform the reserialisation yourself because you get to make smarter decisions.

The point is that complaining about the inability to perform lossless conversion in some cases is not really practically relevant for anyone who's willing to serve their XHTML documents as text/html anyway; the end result is practically the same, if not better, when you reserialise it yourself.

This issue has been around for years, ever since XHTML 1.0 began wreaking havoc on the world, yet it doesn't seem to have particularly bothered anyone trying to use it, or even promoting it.

You just need to realise that, if you wish to have your documents reserialised as HTML or even wrongly serve XHTML as text/html, you need to take care to avoid features which will result in a lossy conversion, or put up with the minor discrepancies.


--
Lachlan Hunt
http://lachy.id.au/
