Good evening, this week I looked at different ways of cajoling overlapping, improperly nested, or otherwise horrible but real-life wiki content into the WikiDom structure for consumption by the visual editor currently in development. So far, MediaWiki delegates the sanitization of those horrors to HTML Tidy, which employs (mostly) good heuristics to make sense of its input.
The [HTML5] spec finally standardized parsing and error recovery for HTML, which seems to overlap widely with what we need for the new parser (how far?). Open-source reference implementations of the parsing spec are available in Java [VNU], which is also translated to C++ and (via GWT) to JavaScript (http://livedom.validator.nu/), with PHP and Python ports at [HLib]. Modern browsers have similar implementations built in.

The reference parsers all use a relatively simple tokenizer in combination with a mostly switch-based parser / tree builder that constructs a cleaned-up DOM from the token stream. Tags are balanced and matched using a random-access stack of open elements, with a separate list of active formatting elements (very similar to the annotations in WikiDom). For each combination of parsing context and token, an error recovery strategy can be specified directly in a switch case. The strength of this strategy is clearly the ease of implementing error recovery; the big disadvantage is the absence of a nicely declarative grammar, except perhaps a shallow one for the tokenizer. (Is there actually an example of a parser with serious HTML-like error recovery and an elegant grammar?)

In our specific visual editor application, performing a full error recovery / clean-up while constructing the WikiDom is at odds with the desire to round-trip wiki source. Performing full sanitation only in the HTML serializer, while doing none in the Wikitext serializer, seems to be a better fit. The WikiDom design, with its support for overlapping annotations, allows the omission of most early sanitation for inline elements. Block-level constructs, however, still need to be fully parsed so that implicit scopes of inline elements can be determined (e.g., limiting the range of annotations to table cells) and a DOM tree can be built. This tree then allows the visual editor to present some sensible, editable outline of the document.
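To make the stack-based tree building concrete, here is a minimal, hypothetical sketch in Python. It is not any of the actual reference implementations: it only models the stack of open elements and one recovery rule per token kind (pop until a matching start tag, or silently drop an unmatched end tag), and it omits the list of active formatting elements that the real spec uses to reopen elements such as the misnested <i> in the example input.

```python
class Node:
    """A trivial DOM node: a name plus ordered children."""
    def __init__(self, name):
        self.name = name
        self.children = []

def build_tree(tokens):
    """Build a well-formed tree from a possibly misnested token stream.

    tokens is a sequence of (kind, name) pairs with kind in
    {"start", "end", "text"}; recovery decisions are made per token,
    as in the switch-based HTML5 tree builders.
    """
    root = Node("#document")
    stack = [root]  # stack of open elements
    for kind, name in tokens:
        if kind == "start":
            node = Node(name)
            stack[-1].children.append(node)
            stack.append(node)
        elif kind == "end":
            # Recovery rule: if a matching element is open, implicitly
            # close everything above it; otherwise ignore the end tag.
            if any(n.name == name for n in stack[1:]):
                while stack[-1].name != name:
                    stack.pop()
                stack.pop()
        elif kind == "text":
            stack[-1].children.append(Node("#text:" + name))
    return root

# Misnested input, roughly <b><i>x</b>y</i>:
tokens = [("start", "b"), ("start", "i"), ("text", "x"),
          ("end", "b"), ("text", "y"), ("end", "i")]
tree = build_tree(tokens)
```

Running this, the stray </b> implicitly closes the open <i>, and the trailing </i> is dropped, so "y" ends up outside <i>. The full HTML5 algorithm would instead reopen <i> around "y" via the active formatting elements list, which is exactly the overlapping-annotation behavior WikiDom represents directly.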
A possible implementation could use a simplified version of the current PEG parser mostly as a combined wiki and HTML tokenizer, feeding a token stream to a parser / tree builder modeled on the HTML5 parsers. Separating the sanitation of inline and block-level elements to minimize early sanitation seems to be quite doable.

What do you think about this general direction of building on HTML parsers? Where should a wiki parser differ in its error recovery strategy? How important is having a full grammar?

Gabriel

[HTML5] Parsing spec: http://dev.w3.org/html5/spec/Overview.html#parsing
[VNU] Reference implementation (Java, C++, JS): http://about.validator.nu/htmlparser/
      Live JS parser demo: http://livedom.validator.nu/
[HLib] PHP and Python parsers: http://code.google.com/p/html5lib/

_______________________________________________
Wikitext-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitext-l
