On Tue, Aug 11, 2015 at 5:16 PM, Trevor Parscal <[email protected]> wrote:
> Is it possible to use part of the Parsoid code to do this?
>

It is possible to do this in Parsoid (or any node service) with this line:

    var sanerHTML = domino.createDocument(input).outerHTML;

However, performance is about 2x slower than the current Tidy (238ms vs.
116ms for the Obama article), and about 4x slower than the fastest option
in our tests. The task lists many more benchmarks of the various options.

Gabriel

>
> - Trevor
>
> On Tuesday, August 11, 2015, Tim Starling <[email protected]> wrote:
>
> > I'm elevating this task of mine to RFC status:
> >
> > https://phabricator.wikimedia.org/T89331
> >
> > Running the output of the MediaWiki parser through HTML Tidy always
> > seemed like a nasty hack. The effects on wikitext syntax are arbitrary
> > and change from version to version. When we upgrade our Linux
> > distribution, we sometimes see changes in the HTML generated by given
> > wikitext, which is not ideal.
> >
> > Parsoid took a different approach. After token-level transformations,
> > tokens are fed into the HTML 5 parse algorithm, a complex but
> > well-specified algorithm which generates a DOM tree from quirky input
> > text.
> >
> > http://www.w3.org/TR/html5/syntax.html
> >
> > We can get nearly the same effect in MediaWiki by replacing the Tidy
> > transformation stage with an HTML 5 parse followed by serialization of
> > the DOM back to HTML. This would stabilize wikitext syntax and resolve
> > several important syntax differences compared to Parsoid.
> >
> > However:
> >
> > * I have not been able to find any PHP implementation of this
> > algorithm. Masterminds and Ressio do not even attempt it. Electrolinux
> > attempts it but does not implement the error recovery parts that are
> > of interest to us.
> > * Writing our own would be difficult.
> > * Even if we did write it, it would probably be too slow.
> >
> > So the question is: what language should we use? Since this is the
> > standard programmer troll question, please bring popcorn.
> >
> > The best implementation of this algorithm is in Java: the validator.nu
> > parser is maintained by Mozilla, and has source translation to C++,
> > which is used by Mozilla and could potentially be used for an HHVM
> > extension.
> >
> > There is also a Rust port (also written by Mozilla), and notable
> > implementations in JavaScript and Python.
> >
> > For WMF, a Java service would be quite easily done, and I have
> > prototyped it already. An HHVM extension might also be possible. A
> > non-service fallback for small installations might be Node.js or a
> > compiled binary from Rust or C++.
> >
> > -- Tim Starling
> >
> > _______________________________________________
> > Wikitech-l mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>

--
Gabriel Wicke
Principal Engineer, Wikimedia Foundation
