An XSD + formal token list (for text elements) + formal grammar seems most maintainable, explainable, and durable. Maybe put pages with exceptions on a flag list for a human review queue and repair. The fun part is the parsing during active edits.

-Paul
On Nov 12, 2011, at 20:13, "Olivier Beaton" <[email protected]> wrote:

> It seems to me like a rough grammar and an extensive test suite to
> verify the correctness of any parser is a much bigger win. Especially
> with story-based tests, you end up with something that helps you write
> a parser and validate it at the same time.
>
> It can also be used to validate our own parser.
>
> On Sat, Nov 12, 2011 at 1:58 AM, Neil Kandalgaonkar <[email protected]> wrote:
>> +1 on doing HTML and Wikitext in the same parser, only because I've
>> found that it is necessary, in my limited experience doing it in JS.
>>
>> I'm not knowledgeable enough about the HTML5 error recovery spec to
>> comment. I don't know of any other models for "recovery" in parsers
>> out there, other than our own. I don't know how you would find out if
>> the HTML5 way is appropriate for us other than trying it. Since it
>> seems to point the way towards a more understandable means of
>> normalizing wikitext, I would vote for it, but I'm voting from a
>> position of relative ignorance.
>>
>> Should we have a formal grammar? Let's be pragmatic -- a formal
>> grammar is a means to a couple of ends, as far as I see it.
>>
>> 1 - to easily have equivalent parsers in PHP and JS, and to allow the
>> community to help develop it in an interactive way a la ParserPlayground.
>>
>> This is not an either-or thing. If the parser is MOSTLY formal,
>> that's good enough. But we should still be shooting for something
>> like 97% of the cases to be handled by the grammar.
>>
>> 2 - to give others a way to parse wikitext better.
>>
>> This may not be necessary. If our parser can produce a nice abstract
>> syntax tree at some point, the API can just emit some other regular
>> format for people to use, perhaps XML or JSON based. WikiDom is more
>> optimized for the editor, but it's probably also good for this
>> purpose.
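The story-based test suite Olivier describes above could be sketched as fixture cases that pair wikitext input with the output a conforming parser should emit, so the same fixtures can drive both the PHP and JS implementations. The `parseInline()` function here is a toy stand-in invented for illustration, not MediaWiki's parser:

```javascript
// Toy inline parser: handles only '''bold''' and ''italic''.
// A stand-in so the fixture format below has something to run against.
function parseInline(wikitext) {
  return wikitext
    .replace(/'''([^']+)'''/g, '<b>$1</b>')
    .replace(/''([^']+)''/g, '<i>$1</i>');
}

// Each "story" pairs an input with the expected rendering; any parser
// implementation can be validated against the same list of cases.
const stories = [
  { name: 'bold',   input: "'''brave''' new world", expected: '<b>brave</b> new world' },
  { name: 'italic', input: "''slanted'' text",      expected: '<i>slanted</i> text' },
];

for (const s of stories) {
  const got = parseInline(s.input);
  console.log(`${s.name}: ${got === s.expected ? 'PASS' : 'FAIL (' + got + ')'}`);
}
```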
>>
>> Then *maybe* one day we can transition to this more regular format,
>> but that's a decision we'll probably face in 2013, if ever.
>>
>> On 11/11/11 3:57 PM, Gabriel Wicke wrote:
>>> Good evening,
>>>
>>> this week I looked at different ways of cajoling overlapping,
>>> improperly nested or otherwise horrible but real-life wiki content
>>> into the WikiDom structure for consumption by the visual editor
>>> currently in development. So far, MediaWiki delegates the
>>> sanitization of those horrors to HTML Tidy, which employs (mostly)
>>> good heuristics to make sense of its input.
>>>
>>> The [HTML5] spec finally standardized parsing and error recovery for
>>> HTML, which seems to overlap widely with what we need for the new
>>> parser (how far?). Open-source reference implementations of the
>>> parser spec are available in Java [VNU], which compiles to C++ and
>>> JavaScript (http://livedom.validator.nu/) through GWT, with PHP and
>>> Python ports at [HLib]. Modern browsers have similar implementations
>>> built in.
>>>
>>> The reference parsers all use a relatively simple tokenizer in
>>> combination with a mostly switch-based parser / tree builder that
>>> constructs a cleaned-up DOM from the token stream. Tags are balanced
>>> and matched using a random-access stack, with a separate list of
>>> open formatting elements (very similar to the annotations in
>>> WikiDom). For each parsing context and token combination, an error
>>> recovery strategy can be directly specified in a switch case.
>>>
>>> The strength of this strategy is clearly the ease of implementing
>>> error recovery. The big disadvantage is the absence of a nicely
>>> declarative grammar, except perhaps a shallow one for the tokenizer.
>>> (Is there actually an example of a parser with serious HTML-like
>>> error recovery and an elegant grammar?)
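The switch-based tree builder Gabriel describes can be sketched roughly as follows; the token shapes, tag handling, and recovery rule here are simplified illustrations, not the actual HTML5 tree-construction algorithm:

```javascript
// Sketch of an HTML5-style tree builder: a switch on token type, a
// random-access stack of open elements, and a per-case error recovery
// strategy (here: pop to a matching tag, ignore unmatched end tags).
function buildTree(tokens) {
  const root = { tag: '#root', children: [] };
  const stack = [root]; // stack of open elements
  for (const tok of tokens) {
    const top = () => stack[stack.length - 1];
    switch (tok.type) {
      case 'start': {
        const el = { tag: tok.tag, children: [] };
        top().children.push(el);
        stack.push(el);
        break;
      }
      case 'end': {
        // Recovery: if a matching open tag exists anywhere on the
        // stack, implicitly close everything above it; otherwise the
        // stray end tag is simply dropped.
        if (stack.some(e => e.tag === tok.tag)) {
          while (top().tag !== tok.tag) stack.pop();
          stack.pop();
        }
        break;
      }
      case 'text': {
        top().children.push({ tag: '#text', value: tok.value });
        break;
      }
    }
  }
  return root;
}

// Misnested input <b><i>x</b>y: the </b> implicitly closes the <i>,
// and the trailing text attaches to the root.
const tree = buildTree([
  { type: 'start', tag: 'b' },
  { type: 'start', tag: 'i' },
  { type: 'text', value: 'x' },
  { type: 'end', tag: 'b' },
  { type: 'text', value: 'y' },
]);
console.log(JSON.stringify(tree, null, 1));
```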
>>>
>>> In our specific visual editor application, performing a full error
>>> recovery / clean-up while constructing the WikiDom is at odds with
>>> the desire to round-trip wiki source. Performing full sanitation
>>> only in the HTML serializer, while doing none in the Wikitext
>>> serializer, seems to be a better fit. The WikiDom design with its
>>> support for overlapping annotations allows the omission of most
>>> early sanitation for inline elements. Block-level constructs,
>>> however, still need to be fully parsed so that implicit scopes of
>>> inline elements can be determined (e.g., limiting the range of
>>> annotations to table cells) and a DOM tree can be built. This tree
>>> then allows the visual editor to present some sensible, editable
>>> outline of the document.
>>>
>>> A possible implementation could use a simplified version of the
>>> current PEG parser mostly as a combined wiki and HTML tokenizer that
>>> feeds a token stream to a parser / tree builder modeled on the HTML5
>>> parsers. Separating the sanitation of inline and block-level
>>> elements to minimize early sanitation seems to be quite doable.
>>>
>>> What do you think about this general direction of building on HTML
>>> parsers? Where should a wiki parser differ in its error recovery
>>> strategy? How important is having a full grammar?
>>>
>>> Gabriel
>>>
>>> [HTML5] Parsing spec: http://dev.w3.org/html5/spec/Overview.html#parsing
>>> [VNU] Ref impl.
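The idea of limiting annotation ranges to block scope, rather than eagerly sanitizing inline markup, could be sketched like this; the data shapes are invented for illustration and are not the actual WikiDom schema:

```javascript
// Inline formatting is stored as offset ranges over the text
// (WikiDom-style annotations), which may freely overlap. A range that
// spills past a block boundary (e.g. the end of a table cell) is
// clamped at it instead of being repaired eagerly.
function clampAnnotations(annotations, blockEnd) {
  return annotations
    .map(a => ({ ...a, end: Math.min(a.end, blockEnd) }))
    .filter(a => a.start < a.end); // drop ranges entirely outside the block
}

// A bold range running past the cell boundary at offset 10 is cut off
// there; the overlapping italic range is left untouched.
const cell = clampAnnotations(
  [
    { type: 'bold',   start: 2, end: 14 },
    { type: 'italic', start: 5, end: 8  },
  ],
  10
);
console.log(cell);
```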
>>> (Java, C++, JS): http://about.validator.nu/htmlparser/
>>> Live JS parser demo: http://livedom.validator.nu/
>>> [HLib] PHP and Python parsers: http://code.google.com/p/html5lib/
>>>
>>> _______________________________________________
>>> Wikitext-l mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/wikitext-l
>>
>> --
>> Neil Kandalgaonkar ( ) <[email protected]>
