Is it possible to use part of the Parsoid code to do this?

- Trevor

On Tuesday, August 11, 2015, Tim Starling <tstarl...@wikimedia.org> wrote:

> I'm elevating this task of mine to RFC status:
>
> https://phabricator.wikimedia.org/T89331
>
> Running the output of the MediaWiki parser through HTML Tidy always
> seemed like a nasty hack. The effects on wikitext syntax are arbitrary
> and change from version to version. When we upgrade our Linux
> distribution, we sometimes see changes in the HTML generated by given
> wikitext, which is not ideal.
>
> Parsoid took a different approach. After token-level transformations,
> tokens are fed into the HTML 5 parse algorithm, a complex but
> well-specified algorithm which generates a DOM tree from quirky input
> text.
>
> http://www.w3.org/TR/html5/syntax.html
>
> We can get nearly the same effect in MediaWiki by replacing the Tidy
> transformation stage with an HTML 5 parse followed by serialization of
> the DOM back to HTML. This would stabilize wikitext syntax and resolve
> several important syntax differences compared to Parsoid.
>
> However:
>
> * I have not been able to find any PHP implementation of this
> algorithm. Masterminds and Ressio do not even attempt it. Electrolinux
> attempts it but does not implement the error recovery parts that are
> of interest to us.
> * Writing our own would be difficult.
> * Even if we did write it, it would probably be too slow.
>
> So the question is: what language should we use? Since this is the
> standard programmer troll question, please bring popcorn.
>
> The best implementation of this algorithm is in Java: the validator.nu
> parser is maintained by Mozilla, and has source translation to C++,
> which is used by Mozilla and could potentially be used for an HHVM
> extension.
>
> There is also a Rust port (also written by Mozilla), and notable
> implementations in JavaScript and Python.
>
> For WMF, a Java service would be quite easily done, and I have
> prototyped it already. An HHVM extension might also be possible. A
> non-service fallback for small installations might be Node.js or a
> compiled binary from Rust or C++.
>
> -- Tim Starling
>
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l