It seems to me that a rough grammar plus an extensive test suite to verify the correctness of any parser is a much bigger win. Especially with story-based tests, you end up with something that helps you write a parser and validate it at the same time.
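Concretely, a story-based suite can be nothing more than a table of (description, input, expected output) cases run against whatever parser is plugged in. A minimal sketch in JavaScript, with a toy `parse` (handling only '''bold''') standing in for the real parser:

```javascript
// A story-based parser test: each case pairs a wikitext snippet with the
// output we expect, so the suite doubles as an executable specification.
// `parse` is a stand-in toy that only handles '''bold''' -- the real
// parser under test would be dropped in instead.
function parse(wikitext) {
  return wikitext.replace(/'''([^']+)'''/g, '<b>$1</b>');
}

const stories = [
  { desc: 'plain text passes through',
    input: 'hello world',
    expected: 'hello world' },
  { desc: "bold markup becomes <b>",
    input: "some '''bold''' text",
    expected: 'some <b>bold</b> text' },
];

// Run every story and collect human-readable failure messages.
function runStories(cases) {
  const failures = [];
  for (const c of cases) {
    const got = parse(c.input);
    if (got !== c.expected) {
      failures.push(c.desc + ': expected "' + c.expected + '", got "' + got + '"');
    }
  }
  return failures;
}
```

Because the cases are plain data, the same table could be fed to the PHP and the JS implementation alike, which is what makes such a suite useful for keeping parallel parsers equivalent.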
It can also be used to validate our own parser.

On Sat, Nov 12, 2011 at 1:58 AM, Neil Kandalgaonkar <[email protected]> wrote:
> +1 on doing HTML and Wikitext in the same parser, only because I've
> found that it is necessary, in my limited experience doing it in JS.
>
> I'm not knowledgeable enough about the HTML5 error recovery spec to
> comment. I don't know of any other models for "recovery" in parsers out
> there, other than our own. I don't know how you would find out if the
> HTML5 way is appropriate for us other than trying it. Since it seems to
> point the way towards a more understandable means of normalizing
> wikitext, I would vote for it, but I'm voting from a position of
> relative ignorance.
>
> Should we have a formal grammar? Let's be pragmatic -- a formal grammar
> is a means to a couple of ends as far as I see it.
>
> 1 - to easily have equivalent parsers in PHP and JS, and to allow the
> community to help develop it in an interactive way a la ParserPlayground.
>
> This is not an either-or thing. If the parser is MOSTLY formal, that's
> good enough. But we should still be shooting for like 97% of the cases
> to be handled by the grammar.
>
> 2 - to give others a way to parse wikitext better.
>
> This may not be necessary. If our parser can produce a nice abstract
> syntax tree at some point, the API can just emit some other regular
> format for people to use, perhaps XML or JSON based. Wikidom is more
> optimized for the editor, but it's probably also good for this purpose.
>
> Then *maybe* one day we can transition to this more regular format, but
> that's a decision we'll probably face in 2013, if ever.
>
> On 11/11/11 3:57 PM, Gabriel Wicke wrote:
>> Good evening,
>>
>> this week I looked at different ways of cajoling overlapping, improperly
>> nested or otherwise horrible but real-life wiki content into the WikiDom
>> structure for consumption by the visual editor currently in development.
>> So far, MediaWiki delegates the sanitization of those horrors to HTML
>> Tidy, which employs (mostly) good heuristics to make sense of its input.
>>
>> The [HTML5] spec finally standardized parsing and error recovery for
>> HTML, which seems to overlap widely with what we need for the new parser
>> (how far?). Open-source reference implementations of the parser spec are
>> available in Java [VNU], which compiles to C++ and JavaScript
>> (http://livedom.validator.nu/) through GWT, and PHP and Python ports at
>> [HLib]. Modern browsers have similar implementations built in.
>>
>> The reference parsers all use a relatively simple tokenizer in
>> combination with a mostly switch-based parser / tree builder that
>> constructs a cleaned-up DOM from the token stream. Tags are balanced and
>> matched using a random-access stack, with a separate list of open
>> formatting elements (very similar to the annotations in WikiDom). For
>> each parsing context and token combination, an error recovery strategy
>> can be directly specified in a switch case.
>>
>> The strength of this strategy is clearly the ease of implementing error
>> recovery. The big disadvantage is the absence of a nicely declarative
>> grammar, except perhaps a shallow one for the tokenizer. (Is there
>> actually an example of a parser with serious HTML-like error recovery
>> and an elegant grammar?)
>>
>> In our specific visual editor application, performing a full error
>> recovery / clean-up while constructing the WikiDom is at odds with the
>> desire to round-trip wiki source. Performing full sanitization only in
>> the HTML serializer while doing none in the wikitext serializer seems to
>> be a better fit. The WikiDom design with its support for overlapping
>> annotations allows the omission of most early sanitization for inline
>> elements.
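(For reference, the stack-plus-switch style Gabriel describes can be sketched in a few lines of JavaScript. This is a loose illustration, not the HTML5 algorithm: real HTML5 tree construction adds insertion modes and the active formatting elements list on top of this.)

```javascript
// Minimal HTML5-style tree builder: tokens stream in, a stack of open
// elements tracks nesting, and each token type / context combination
// selects a branch that can also encode an error-recovery action.
function buildTree(tokens) {
  const root = { name: '#root', children: [] };
  const stack = [root]; // stack of open elements

  for (const tok of tokens) {
    const top = stack[stack.length - 1];
    switch (tok.type) {
      case 'text':
        top.children.push({ name: '#text', value: tok.value });
        break;
      case 'start': {
        const node = { name: tok.name, children: [] };
        top.children.push(node);
        stack.push(node);
        break;
      }
      case 'end':
        if (stack.some((n) => n.name === tok.name)) {
          // Recovery: implicitly close anything still open inside.
          while (stack[stack.length - 1].name !== tok.name) stack.pop();
          stack.pop();
        }
        // Recovery: an unmatched end tag is simply dropped.
        break;
    }
  }
  return root; // anything still open is implicitly closed at EOF
}
```

Fed the stream for `<b><i>x</b>y`, this builder implicitly closes the `i` when the `b` end tag arrives, so `y` lands back at the root; the HTML5 spec would instead keep the formatting element alive for `y` via its adoption agency algorithm, which is exactly the kind of per-case recovery choice the switch structure makes easy to express.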
>> Block-level constructs, however, still need to be fully parsed
>> so that implicit scopes of inline elements can be determined (e.g.,
>> limiting the range of annotations to table cells) and a DOM tree can be
>> built. This tree then allows the visual editor to present some sensible,
>> editable outline of the document.
>>
>> A possible implementation could use a simplified version of the current
>> PEG parser mostly as a combined wiki and HTML tokenizer that feeds a
>> token stream to a parser / tree builder modeled on the HTML5 parsers.
>> Separating the sanitization of inline and block-level elements to
>> minimize early sanitization seems to be quite doable.
>>
>> What do you think about this general direction of building on HTML
>> parsers? Where should a wiki parser differ in its error recovery
>> strategy? How important is having a full grammar?
>>
>> Gabriel
>>
>> [HTML5] Parsing spec: http://dev.w3.org/html5/spec/Overview.html#parsing
>> [VNU] Ref. impl. (Java, C++, JS): http://about.validator.nu/htmlparser/
>>       Live JS parser demo: http://livedom.validator.nu/
>> [HLib] PHP and Python parsers: http://code.google.com/p/html5lib/
>>
>> _______________________________________________
>> Wikitext-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/wikitext-l
>
> --
> Neil Kandalgaonkar ( ) <[email protected]>
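Reading Gabriel's proposal concretely: the PEG stage reduces both wikitext and inline HTML to one flat token stream, and all nesting and recovery decisions move into the tree builder. A rough sketch of the tokenizer half in JavaScript, where a single regex stands in for a real PEG grammar and only `<tag>` pairs and '''bold''' are handled:

```javascript
// Sketch of a combined wiki/HTML tokenizer: input is reduced to a flat
// stream of start-tag, end-tag and text tokens, leaving all balancing
// and error recovery to a later tree-building stage. Two constructs are
// handled as stand-ins for a fuller PEG-based tokenizer:
// <tag>...</tag> pairs and wikitext '''bold'''.
function tokenize(src) {
  const tokens = [];
  const re = /<(\/?)([a-z]+)>|'''/g;
  let bold = false; // wikitext bold quotes toggle between start and end
  let last = 0;
  let m;
  while ((m = re.exec(src)) !== null) {
    if (m.index > last) {
      tokens.push({ type: 'text', value: src.slice(last, m.index) });
    }
    if (m[0] === "'''") {
      tokens.push({ type: bold ? 'end' : 'start', name: 'b' });
      bold = !bold;
    } else {
      tokens.push({ type: m[1] ? 'end' : 'start', name: m[2] });
    }
    last = re.lastIndex;
  }
  if (last < src.length) tokens.push({ type: 'text', value: src.slice(last) });
  return tokens;
}
```

Note that the tokenizer itself stays almost stateless (one boolean for bold) and never worries about nesting; mis-nested input simply comes out as an unbalanced token stream for the tree builder to repair, which is the division of labor the HTML5 parsers use.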
