It seems to me that a rough grammar plus an extensive test suite to verify the correctness of any parser is a much bigger win. Especially with story-based tests, you end up with something that helps you write a parser and validate it at the same time.
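Concretely, a story-based suite can be nothing more than a table of (description, input, expected output) cases run against whatever parser is plugged in. A minimal sketch in JavaScript, with a toy `parse` (handling only '''bold''') standing in for the real parser:

```javascript
// A story-based parser test: each case pairs a wikitext snippet with the
// output we expect, so the suite doubles as an executable specification.
// `parse` is a stand-in toy that only handles '''bold''' -- the real
// parser under test would be dropped in instead.
function parse(wikitext) {
  return wikitext.replace(/'''([^']+)'''/g, '<b>$1</b>');
}

const stories = [
  { desc: 'plain text passes through',
    input: 'hello world',
    expected: 'hello world' },
  { desc: "bold markup becomes <b>",
    input: "some '''bold''' text",
    expected: 'some <b>bold</b> text' },
];

// Run every story and collect human-readable failure messages.
function runStories(cases) {
  const failures = [];
  for (const c of cases) {
    const got = parse(c.input);
    if (got !== c.expected) {
      failures.push(c.desc + ': expected "' + c.expected + '", got "' + got + '"');
    }
  }
  return failures;
}
```

Because the cases are plain data, the same table could be fed to the PHP and the JS implementation alike, which is what makes such a suite useful for keeping parallel parsers equivalent.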
It can also be used to validate our own parser.

On Sat, Nov 12, 2011 at 1:58 AM, Neil Kandalgaonkar <[email protected]> wrote:
> +1 on doing HTML and Wikitext in the same parser, only because I've
> found that it is necessary, in my limited experience doing it in JS.
>
> I'm not knowledgeable enough about the HTML5 error recovery spec to
> comment. I don't know of any other models for "recovery" in parsers out
> there, other than our own. I don't know how you would find out if the
> HTML5 way is appropriate for us other than trying it. Since it seems to
> point the way towards a more understandable means of normalizing
> wikitext, I would vote for it, but I'm voting from a position of
> relative ignorance.
>
> Should we have a formal grammar? Let's be pragmatic -- a formal grammar
> is a means to a couple of ends as far as I see it.
>
> 1 - to easily have equivalent parsers in PHP and JS, and to allow the
> community to help develop it in an interactive way a la ParserPlayground.
>
> This is not an either-or thing. If the parser is MOSTLY formal, that's
> good enough. But we should still be shooting for like 97% of the cases
> to be handled by the grammar.
>
> 2 - to give others a way to parse wikitext better.
>
> This may not be necessary. If our parser can produce a nice abstract
> syntax tree at some point, the API can just emit some other regular
> format for people to use, perhaps XML or JSON based. Wikidom is more
> optimized for the editor, but it's probably also good for this purpose.
>
> Then *maybe* one day we can transition to this more regular format, but
> that's a decision we'll probably face in 2013, if ever.
>
> On 11/11/11 3:57 PM, Gabriel Wicke wrote:
>> Good evening,
>>
>> this week I looked at different ways of cajoling overlapping, improperly
>> nested or otherwise horrible but real-life wiki content into the WikiDom
>> structure for consumption by the visual editor currently in development.
>> So far, MediaWiki delegates the sanitization of those horrors to HTML
>> Tidy, which employs (mostly) good heuristics to make sense of its input.
>>
>> The [HTML5] spec finally standardized parsing and error recovery for
>> HTML, which seems to overlap widely with what we need for the new parser
>> (how far?). Open-source reference implementations of the parser spec are
>> available in Java [VNU], which compiles to C++ and JavaScript
>> (http://livedom.validator.nu/) through GWT, and PHP and Python ports at
>> [HLib]. Modern browsers have similar implementations built in.
>>
>> The reference parsers all use a relatively simple tokenizer in
>> combination with a mostly switch-based parser / tree builder that
>> constructs a cleaned-up DOM from the token stream. Tags are balanced and
>> matched using a random-access stack, with a separate list of open
>> formatting elements (very similar to the annotations in WikiDom). For
>> each parsing context and token combination, an error recovery strategy
>> can be directly specified in a switch case.
>>
>> The strength of this strategy is clearly the ease of implementing error
>> recovery. The big disadvantage is the absence of a nicely declarative
>> grammar, except perhaps a shallow one for the tokenizer. (Is there
>> actually an example of a parser with serious HTML-like error recovery
>> and an elegant grammar?)
>>
>> In our specific visual editor application, performing a full error
>> recovery / clean-up while constructing the WikiDom is at odds with the
>> desire to round-trip wiki source. Performing full sanitization only in
>> the HTML serializer while doing none in the wikitext serializer seems to
>> be a better fit. The WikiDom design with its support for overlapping
>> annotations allows the omission of most early sanitization for inline
>> elements.
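(For reference, the stack-plus-switch style Gabriel describes can be sketched in a few lines of JavaScript. This is a loose illustration, not the HTML5 algorithm: real HTML5 tree construction adds insertion modes and the active formatting elements list on top of this.)

```javascript
// Minimal HTML5-style tree builder: tokens stream in, a stack of open
// elements tracks nesting, and each token type / context combination
// selects a branch that can also encode an error-recovery action.
function buildTree(tokens) {
  const root = { name: '#root', children: [] };
  const stack = [root]; // stack of open elements

  for (const tok of tokens) {
    const top = stack[stack.length - 1];
    switch (tok.type) {
      case 'text':
        top.children.push({ name: '#text', value: tok.value });
        break;
      case 'start': {
        const node = { name: tok.name, children: [] };
        top.children.push(node);
        stack.push(node);
        break;
      }
      case 'end':
        if (stack.some((n) => n.name === tok.name)) {
          // Recovery: implicitly close anything still open inside.
          while (stack[stack.length - 1].name !== tok.name) stack.pop();
          stack.pop();
        }
        // Recovery: an unmatched end tag is simply dropped.
        break;
    }
  }
  return root; // anything still open is implicitly closed at EOF
}
```

Fed the stream for `<b><i>x</b>y`, this builder implicitly closes the `i` when the `b` end tag arrives, so `y` lands back at the root; the HTML5 spec would instead keep the formatting element alive for `y` via its adoption agency algorithm, which is exactly the kind of per-case recovery choice the switch structure makes easy to express.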
>> Block-level constructs, however, still need to be fully parsed
>> so that implicit scopes of inline elements can be determined (e.g.,
>> limiting the range of annotations to table cells) and a DOM tree can be
>> built. This tree then allows the visual editor to present some sensible,
>> editable outline of the document.
>>
>> A possible implementation could use a simplified version of the current
>> PEG parser mostly as a combined wiki and HTML tokenizer that feeds a
>> token stream to a parser / tree builder modeled on the HTML5 parsers.
>> Separating the sanitization of inline and block-level elements to
>> minimize early sanitization seems to be quite doable.
>>
>> What do you think about this general direction of building on HTML
>> parsers? Where should a wiki parser differ in its error recovery
>> strategy? How important is having a full grammar?
>>
>> Gabriel
>>
>> [HTML5] Parsing spec: http://dev.w3.org/html5/spec/Overview.html#parsing
>> [VNU] Ref. impl. (Java, C++, JS): http://about.validator.nu/htmlparser/
>>       Live JS parser demo: http://livedom.validator.nu/
>> [HLib] PHP and Python parsers: http://code.google.com/p/html5lib/
>>
>> _______________________________________________
>> Wikitext-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/wikitext-l
>
> --
> Neil Kandalgaonkar ( ) <[email protected]>
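Reading Gabriel's proposal concretely: the PEG stage reduces both wikitext and inline HTML to one flat token stream, and all nesting and recovery decisions move into the tree builder. A rough sketch of the tokenizer half in JavaScript, where a single regex stands in for a real PEG grammar and only `<tag>` pairs and '''bold''' are handled:

```javascript
// Sketch of a combined wiki/HTML tokenizer: input is reduced to a flat
// stream of start-tag, end-tag and text tokens, leaving all balancing
// and error recovery to a later tree-building stage. Two constructs are
// handled as stand-ins for a fuller PEG-based tokenizer:
// <tag>...</tag> pairs and wikitext '''bold'''.
function tokenize(src) {
  const tokens = [];
  const re = /<(\/?)([a-z]+)>|'''/g;
  let bold = false; // wikitext bold quotes toggle between start and end
  let last = 0;
  let m;
  while ((m = re.exec(src)) !== null) {
    if (m.index > last) {
      tokens.push({ type: 'text', value: src.slice(last, m.index) });
    }
    if (m[0] === "'''") {
      tokens.push({ type: bold ? 'end' : 'start', name: 'b' });
      bold = !bold;
    } else {
      tokens.push({ type: m[1] ? 'end' : 'start', name: m[2] });
    }
    last = re.lastIndex;
  }
  if (last < src.length) tokens.push({ type: 'text', value: src.slice(last) });
  return tokens;
}
```

Note that the tokenizer itself stays almost stateless (one boolean for bold) and never worries about nesting; mis-nested input simply comes out as an unbalanced token stream for the tree builder to repair, which is the division of labor the HTML5 parsers use.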
