On 12/13/2012 06:43 AM, Marco Fleckinger wrote:
> Implementing this is not very easy, but developers may be able to
> reuse some of the old ideas. Parsing in the other direction has to be
> built from scratch, but it is easier because everything is in a tree
> rather than in a single text string.
>
> Since neither deserializing nor serializing involves any user
> interface, testing could be automated quite easily by comparing the
> results of the conventional and the new parsing. The result of the
> serialization can be compared with the original markup.
Hi Marco,

we (the Parsoid team) have been doing many of the things you describe
over the last year:

* We wrote a new bidirectional parser / serializer - see
  http://www.mediawiki.org/wiki/Parsoid. This includes a grammar-based
  tokenizer, async/parallel token stream transformations, and HTML5 DOM
  building.

* We developed an HTML5 / RDFa document model spec at
  http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec.

* Our parserTests runner tests the wt2html (wikitext to HTML), wt2wt,
  html2html, and html2wt modes with the same wikitext / HTML pairs as
  used in the PHP parser tests. We have roughly doubled the number of
  such pairs in the process.

* Automated and distributed round-trip tests are currently run over a
  random selection of 100k English Wikipedia pages:
  http://parsoid.wmflabs.org:8001/. This test infrastructure can easily
  be pointed at a different set of pages or another wiki.

Parsoid is by no means complete, but we are very happy with how far we
have come since last October.

Cheers,

Gabriel

--
Gabriel Wicke
Senior Software Engineer
Wikimedia Foundation

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
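To make the wt2wt round-trip idea above concrete, here is a toy sketch in Python. This is not Parsoid code and the function names are made up for illustration; it only shows the testing principle: parse markup into a tree, serialize the tree back, and check the result against the original input.

```python
import re

def parse(wikitext):
    """Toy parser: turn '''bold''' spans into a tiny node tree."""
    tree, pos = [], 0
    for m in re.finditer(r"'''(.*?)'''", wikitext):
        if m.start() > pos:
            tree.append(("text", wikitext[pos:m.start()]))
        tree.append(("bold", m.group(1)))
        pos = m.end()
    if pos < len(wikitext):
        tree.append(("text", wikitext[pos:]))
    return tree

def serialize(tree):
    """Serialize the node tree back to wikitext."""
    return "".join(
        "'''%s'''" % content if kind == "bold" else content
        for kind, content in tree
    )

def roundtrip_ok(wikitext):
    """wt2wt test: serializing the parse must reproduce the input."""
    return serialize(parse(wikitext)) == wikitext

print(roundtrip_ok("Hello '''world''' again"))  # True
```

Because neither step touches any user interface, a harness like this can run unattended over large page sets, which is essentially what the distributed round-trip tests do at scale.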
