An XSD + formal token list (for text elements) + formal grammar seems most maintainable, explainable, and durable. Maybe put pages with exceptions on a flag list for a human review queue and repair. The fun part is the parsing during active edits.

-Paul
On Nov 12, 2011, at 20:13, "Olivier Beaton" <[email protected]> wrote:

> It seems to me like a rough grammar and an extensive test suite to
> verify the correctness of any parser is a much bigger win. Especially
> with story-based tests, you end up with something that helps you write
> a parser and validate it at the same time.
>
> It can also be used to validate our own parser.
>
> On Sat, Nov 12, 2011 at 1:58 AM, Neil Kandalgaonkar <[email protected]> wrote:
>> +1 on doing HTML and Wikitext in the same parser, only because I've
>> found that it is necessary, in my limited experience doing it in JS.
>>
>> I'm not knowledgeable enough about the HTML5 error recovery spec to
>> comment. I don't know of any other models for "recovery" in parsers
>> out there, other than our own. I don't know how you would find out if
>> the HTML5 way is appropriate for us other than trying it. Since it
>> seems to point the way towards a more understandable means of
>> normalizing wikitext, I would vote for it, but I'm voting from a
>> position of relative ignorance.
>>
>> Should we have a formal grammar? Let's be pragmatic -- a formal
>> grammar is a means to a couple of ends, as far as I see it.
>>
>> 1 - to easily have equivalent parsers in PHP and JS, and to allow the
>> community to help develop it in an interactive way a la ParserPlayground.
>>
>> This is not an either-or thing. If the parser is MOSTLY formal,
>> that's good enough. But we should still be shooting for something
>> like 97% of the cases to be handled by the grammar.
>>
>> 2 - to give others a way to parse wikitext better.
>>
>> This may not be necessary. If our parser can produce a nice abstract
>> syntax tree at some point, the API can just emit some other regular
>> format for people to use, perhaps XML or JSON based. WikiDom is more
>> optimized for the editor, but it's probably also good for this
>> purpose.
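The story-based test suite Olivier describes above could be sketched as fixture cases that pair wikitext input with the output a conforming parser should emit, so the same fixtures can drive both the PHP and JS implementations. The `parseInline()` function here is a toy stand-in invented for illustration, not MediaWiki's parser:

```javascript
// Toy inline parser: handles only '''bold''' and ''italic''.
// A stand-in so the fixture format below has something to run against.
function parseInline(wikitext) {
  return wikitext
    .replace(/'''([^']+)'''/g, '<b>$1</b>')
    .replace(/''([^']+)''/g, '<i>$1</i>');
}

// Each "story" pairs an input with the expected rendering; any parser
// implementation can be validated against the same list of cases.
const stories = [
  { name: 'bold',   input: "'''brave''' new world", expected: '<b>brave</b> new world' },
  { name: 'italic', input: "''slanted'' text",      expected: '<i>slanted</i> text' },
];

for (const s of stories) {
  const got = parseInline(s.input);
  console.log(`${s.name}: ${got === s.expected ? 'PASS' : 'FAIL (' + got + ')'}`);
}
```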
>>
>> Then *maybe* one day we can transition to this more regular format,
>> but that's a decision we'll probably face in 2013, if ever.
>>
>> On 11/11/11 3:57 PM, Gabriel Wicke wrote:
>>> Good evening,
>>>
>>> this week I looked at different ways of cajoling overlapping,
>>> improperly nested or otherwise horrible but real-life wiki content
>>> into the WikiDom structure for consumption by the visual editor
>>> currently in development. So far, MediaWiki delegates the
>>> sanitization of those horrors to HTML Tidy, which employs (mostly)
>>> good heuristics to make sense of its input.
>>>
>>> The [HTML5] spec finally standardized parsing and error recovery for
>>> HTML, which seems to overlap widely with what we need for the new
>>> parser (how far?). Open-source reference implementations of the
>>> parser spec are available in Java [VNU], which compiles to C++ and
>>> JavaScript (http://livedom.validator.nu/) through GWT, with PHP and
>>> Python ports at [HLib]. Modern browsers have similar implementations
>>> built in.
>>>
>>> The reference parsers all use a relatively simple tokenizer in
>>> combination with a mostly switch-based parser / tree builder that
>>> constructs a cleaned-up DOM from the token stream. Tags are balanced
>>> and matched using a random-access stack, with a separate list of
>>> open formatting elements (very similar to the annotations in
>>> WikiDom). For each parsing context and token combination, an error
>>> recovery strategy can be directly specified in a switch case.
>>>
>>> The strength of this strategy is clearly the ease of implementing
>>> error recovery. The big disadvantage is the absence of a nicely
>>> declarative grammar, except perhaps a shallow one for the tokenizer.
>>> (Is there actually an example of a parser with serious HTML-like
>>> error recovery and an elegant grammar?)
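The switch-based tree builder Gabriel describes can be sketched roughly as follows; the token shapes, tag handling, and recovery rule here are simplified illustrations, not the actual HTML5 tree-construction algorithm:

```javascript
// Sketch of an HTML5-style tree builder: a switch on token type, a
// random-access stack of open elements, and a per-case error recovery
// strategy (here: pop to a matching tag, ignore unmatched end tags).
function buildTree(tokens) {
  const root = { tag: '#root', children: [] };
  const stack = [root]; // stack of open elements
  for (const tok of tokens) {
    const top = () => stack[stack.length - 1];
    switch (tok.type) {
      case 'start': {
        const el = { tag: tok.tag, children: [] };
        top().children.push(el);
        stack.push(el);
        break;
      }
      case 'end': {
        // Recovery: if a matching open tag exists anywhere on the
        // stack, implicitly close everything above it; otherwise the
        // stray end tag is simply dropped.
        if (stack.some(e => e.tag === tok.tag)) {
          while (top().tag !== tok.tag) stack.pop();
          stack.pop();
        }
        break;
      }
      case 'text': {
        top().children.push({ tag: '#text', value: tok.value });
        break;
      }
    }
  }
  return root;
}

// Misnested input <b><i>x</b>y: the </b> implicitly closes the <i>,
// and the trailing text attaches to the root.
const tree = buildTree([
  { type: 'start', tag: 'b' },
  { type: 'start', tag: 'i' },
  { type: 'text', value: 'x' },
  { type: 'end', tag: 'b' },
  { type: 'text', value: 'y' },
]);
console.log(JSON.stringify(tree, null, 1));
```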
>>>
>>> In our specific visual editor application, performing a full error
>>> recovery / clean-up while constructing the WikiDom is at odds with
>>> the desire to round-trip wiki source. Performing full sanitation
>>> only in the HTML serializer, while doing none in the Wikitext
>>> serializer, seems to be a better fit. The WikiDom design with its
>>> support for overlapping annotations allows the omission of most
>>> early sanitation for inline elements. Block-level constructs,
>>> however, still need to be fully parsed so that implicit scopes of
>>> inline elements can be determined (e.g., limiting the range of
>>> annotations to table cells) and a DOM tree can be built. This tree
>>> then allows the visual editor to present some sensible, editable
>>> outline of the document.
>>>
>>> A possible implementation could use a simplified version of the
>>> current PEG parser mostly as a combined wiki and HTML tokenizer that
>>> feeds a token stream to a parser / tree builder modeled on the HTML5
>>> parsers. Separating the sanitation of inline and block-level
>>> elements to minimize early sanitation seems to be quite doable.
>>>
>>> What do you think about this general direction of building on HTML
>>> parsers? Where should a wiki parser differ in its error recovery
>>> strategy? How important is having a full grammar?
>>>
>>> Gabriel
>>>
>>> [HTML5] Parsing spec: http://dev.w3.org/html5/spec/Overview.html#parsing
>>> [VNU] Ref impl.
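The idea of limiting annotation ranges to block scope, rather than eagerly sanitizing inline markup, could be sketched like this; the data shapes are invented for illustration and are not the actual WikiDom schema:

```javascript
// Inline formatting is stored as offset ranges over the text
// (WikiDom-style annotations), which may freely overlap. A range that
// spills past a block boundary (e.g. the end of a table cell) is
// clamped at it instead of being repaired eagerly.
function clampAnnotations(annotations, blockEnd) {
  return annotations
    .map(a => ({ ...a, end: Math.min(a.end, blockEnd) }))
    .filter(a => a.start < a.end); // drop ranges entirely outside the block
}

// A bold range running past the cell boundary at offset 10 is cut off
// there; the overlapping italic range is left untouched.
const cell = clampAnnotations(
  [
    { type: 'bold',   start: 2, end: 14 },
    { type: 'italic', start: 5, end: 8  },
  ],
  10
);
console.log(cell);
```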
>>> (Java, C++, JS): http://about.validator.nu/htmlparser/
>>> Live JS parser demo: http://livedom.validator.nu/
>>> [HLib] PHP and Python parsers: http://code.google.com/p/html5lib/
>>>
>>> _______________________________________________
>>> Wikitext-l mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/wikitext-l
>>
>> --
>> Neil Kandalgaonkar ( ) <[email protected]>
