On 12/13/2012 06:43 AM, Marco Fleckinger wrote:
> Implementing this is not easy, but developers may be able to reuse some
> of the old ideas. Parsing the other way around has to be built from
> scratch, but it is easier because everything is in a tree, not in a
> single text string.
> 
> Neither de- nor serialization involves any user interface, so testing
> could be done automatically quite easily by comparing the results of the
> conventional and the new parsing. The result of the serialization can be
> compared with the original markup.
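
The round-trip check suggested above can be sketched as follows. This is a toy illustration, not Parsoid's actual API: parse() and serialize() are hypothetical stand-ins that only handle '''bold''' spans in otherwise well-formed markup.

```python
# Toy round-trip test: parse markup into a tree (here, a flat list of
# (kind, text) nodes), serialize the tree back, and compare the result
# with the original markup string.

def parse(wikitext):
    # Split "'''bold'''" spans into ("bold", ...) and ("text", ...) nodes.
    # Assumes well-formed markup (every ''' has a matching close).
    nodes, rest = [], wikitext
    while rest:
        if rest.startswith("'''"):
            end = rest.index("'''", 3)
            nodes.append(("bold", rest[3:end]))
            rest = rest[end + 3:]
        else:
            cut = rest.find("'''")
            cut = len(rest) if cut == -1 else cut
            nodes.append(("text", rest[:cut]))
            rest = rest[cut:]
    return nodes

def serialize(nodes):
    # Invert parse(): emit the original markup for each node.
    return "".join(t if k == "text" else "'''%s'''" % t for k, t in nodes)

def roundtrip_ok(wikitext):
    # The automatic test: serialization of the parse tree must
    # reproduce the original markup exactly.
    return serialize(parse(wikitext)) == wikitext

print(roundtrip_ok("Hello '''world''' again"))  # True
```

Because the comparison is purely string-to-string, such tests need no user interface and can run unattended over large page sets.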

Hi Marco,

we (the Parsoid team) have been doing many of the things you describe
over the last year:

* We wrote a new bidirectional parser / serializer - see
http://www.mediawiki.org/wiki/Parsoid. This includes a grammar-based
tokenizer, async/parallel token stream transformations and HTML5 DOM
building.

* We developed an HTML5 / RDFa document model spec at
http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec.

* Our parserTests runner tests wt2html (wikitext to html), wt2wt,
html2html and html2wt modes with the same wikitext / HTML pairs as used
in the PHP parser tests. We have roughly doubled the number of such
pairs in the process.

* Automated and distributed round-trip tests are currently run over a
random selection of 100k English Wikipedia pages:
http://parsoid.wmflabs.org:8001/. This test infrastructure can easily be
pointed at a different set of pages or another wiki.
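
The four test modes listed above can be thought of as compositions of two primitives, wt2html and html2wt. The sketch below uses toy regex converters for '''bold''' / <b> spans only; the function names mirror the mode names but the implementations are illustrative, not Parsoid's.

```python
import re

def wt2html(wt):
    # Toy wikitext -> HTML: convert '''bold''' spans to <b> elements.
    return re.sub(r"'''(.*?)'''", r"<b>\1</b>", wt)

def html2wt(html):
    # Toy HTML -> wikitext: the inverse conversion.
    return re.sub(r"<b>(.*?)</b>", r"'''\1'''", html)

def wt2wt(wt):
    # Round-trip through HTML; should be the identity on clean input.
    return html2wt(wt2html(wt))

def html2html(html):
    # Round-trip through wikitext; likewise the identity.
    return wt2html(html2wt(html))

# A single wikitext/HTML pair exercises all four modes:
pairs = [("a '''b''' c", "a <b>b</b> c")]
for wt, html in pairs:
    assert wt2html(wt) == html
    assert html2wt(html) == wt
    assert wt2wt(wt) == wt
    assert html2html(html) == html
print("all modes round-trip")
```

Reusing one corpus of wikitext/HTML pairs across all four modes is what lets the same parserTests fixtures drive both directions of the conversion.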

Parsoid is by no means complete, but we are very happy with how far we
have come since last October.

Cheers,

Gabriel

-- 
Gabriel Wicke
Senior Software Engineer
Wikimedia Foundation

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
