On 13/08/15 15:43, MZMcBride wrote:
> Or could we replace Tidy with nothing? Relying on the principle of
> "garbage in, garbage out" seems reasonable in some ways. And modern
> browsers are fairly adept at handling moderately bad HTML.

The HTML 5 spec makes a distinction between valid, balanced HTML and
error recovery algorithms. Browsers are basically the only clients
able to handle moderately bad HTML, and as I've previously said in
discussions of HTML 5 output, I don't think it is acceptable to screw
over all non-browser clients by sending output that relies on obscure
details of the HTML 5 spec. I think XHTML or something close to it is
an appropriate machine-readable output format.
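
To illustrate what this means for a non-browser client, here is a minimal sketch, assuming the client only has a strict XML parser available (Python's standard xml.etree.ElementTree in this case). The helper name parses_as_xml and the sample markup are made up for illustration and are not taken from MediaWiki.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Balanced, XHTML-like output: any XML parser can consume it.
balanced = "<div><p>Hello <b>world</b></p></div>"

# Moderately bad HTML: <b> is never closed. A browser recovers from this
# using the HTML 5 error recovery rules; a strict XML parser rejects it.
unbalanced = "<div><p>Hello <b>world</p></div>"

print(parses_as_xml(balanced))
print(parses_as_xml(unbalanced))
```

The first input is accepted and the second is rejected, so a client like this gets nothing at all from the unbalanced version unless it ships a full HTML 5 parser.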

Have you looked at my survey on the bug? Compliant HTML 5 parsers are
10-30k source lines and are in pretty short supply.

Wikitext is not meant to be easily machine-readable; it is meant to be
easily human-writable. Unbalanced tags are errors in HTML, but in
wikitext they are allowed. This is a design choice. Most humans don't
really care about the spec; they just want the machine to figure out
what they meant.

And, as several others have noted, you can't just disable Tidy, since
the effects of unclosed tags are not confined to the content area, and
there is a large amount of existing content that depends on Tidy's
cleanup. I have seen the effects of Tidy being accidentally disabled
on the English Wikipedia, and it is not pleasant.
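
A rough sketch of why unclosed tags leak out of the content area, using a naive open-tag stack built on Python's html.parser. This is not Tidy's algorithm and not the HTML 5 tree construction algorithm; the class OpenTagTracker, the function unclosed_tags and the sample fragment are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (a subset, enough for the sketch).
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta", "wbr"}


class OpenTagTracker(HTMLParser):
    """Track which elements are still open, using a naive stack model."""

    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray end tags; otherwise pop up to and including the match.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def unclosed_tags(fragment: str) -> list:
    """Return the elements still open at the end of the fragment."""
    tracker = OpenTagTracker()
    tracker.feed(fragment)
    tracker.close()
    return tracker.stack


# Parser output for a page whose wikitext opened two tags and closed neither.
content = '<div class="box">text <small>note'
print(unclosed_tags(content))  # ['div', 'small']
```

If the skin simply concatenates its own markup after this content, everything that follows is parsed inside the still-open div and small elements, which is how breakage in one article's text ends up affecting the rest of the page.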

Am I correct in saying that MZMcBride is the only person in this
thread in favour of the idea of getting rid of HTML cleanup?


By the way, you can see my work in progress on an HTML reserializer
web service in the mediawiki/services/html5depurate project on Gerrit:

<https://gerrit.wikimedia.org/r/#/q/status:open+project:mediawiki/services/html5depurate+branch:master,n,z>

-- Tim Starling


_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
