The slower speed comes down to language choice: Tidy is written in C. Note that I included shelling out to Node.js as an option in my original post. Domino is not really part of Parsoid; it's a JavaScript library that Parsoid uses. We would use the same JavaScript library with a few lines of wrapper code.
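For illustration, such a wrapper could look roughly like the sketch below. It reuses the domino call quoted further down in this thread. The function names normalizeHtml and runFilter are made up for this example, and passing the library in as a parameter is only an assumption to keep the sketch independent of how domino gets loaded.

```javascript
'use strict';

// Parse possibly malformed HTML and serialize the resulting DOM back to HTML.
// `dom` is any object with a domino-style createDocument(html) method.
function normalizeHtml(dom, input) {
  return dom.createDocument(input).outerHTML;
}

// Buffer everything from `inStream`, normalize it, and write the result
// to `outStream` once the input has ended.
function runFilter(dom, inStream, outStream) {
  const chunks = [];
  inStream.setEncoding('utf8');
  inStream.on('data', (chunk) => chunks.push(chunk));
  inStream.on('end', () => {
    outStream.write(normalizeHtml(dom, chunks.join('')));
  });
}

// Hypothetical entry point, assuming the domino package is installed:
//   runFilter(require('domino'), process.stdin, process.stdout);
```

MediaWiki would then pipe parser output through this script much as it does with the tidy binary today. Error handling and encoding checks are left out of the sketch.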
-- Tim Starling

On 12/08/15 10:24, Trevor Parscal wrote:
> Interesting. What is the cause of the slower speed?
>
> - Trevor
>
> On Tuesday, August 11, 2015, Gabriel Wicke <[email protected]> wrote:
>
>> On Tue, Aug 11, 2015 at 5:16 PM, Trevor Parscal <[email protected]>
>> wrote:
>>
>>> Is it possible use part of the Parsoid code to do this?
>>>
>>
>> It is possible to do this in Parsoid (or any node service) with this line:
>>
>> var sanerHTML = domino.createDocument(input).outerHTML;
>>
>> However, performance is about 2x worse than current tidy (116ms vs. 238ms
>> for Obama), and about 4x slower than the fastest option in our tests. The
>> task has a lot more benchmarks of various options.
>>
>> Gabriel
>>
>>>
>>> - Trevor
>>>
>>> On Tuesday, August 11, 2015, Tim Starling <[email protected]> wrote:
>>>
>>>> I'm elevating this task of mine to RFC status:
>>>>
>>>> https://phabricator.wikimedia.org/T89331
>>>>
>>>> Running the output of the MediaWiki parser through HTML Tidy always
>>>> seemed like a nasty hack. The effects on wikitext syntax are arbitrary
>>>> and change from version to version. When we upgrade our Linux
>>>> distribution, we sometimes see changes in the HTML generated by given
>>>> wikitext, which is not ideal.
>>>>
>>>> Parsoid took a different approach. After token-level transformations,
>>>> tokens are fed into the HTML 5 parse algorithm, a complex but
>>>> well-specified algorithm which generates a DOM tree from quirky input
>>>> text.
>>>>
>>>> http://www.w3.org/TR/html5/syntax.html
>>>>
>>>> We can get nearly the same effect in MediaWiki by replacing the Tidy
>>>> transformation stage with an HTML 5 parse followed by serialization of
>>>> the DOM back to HTML. This would stabilize wikitext syntax and resolve
>>>> several important syntax differences compared to Parsoid.
>>>>
>>>> However:
>>>>
>>>> * I have not been able to find any PHP implementation of this
>>>> algorithm. Masterminds and Ressio do not even attempt it. Electrolinux
>>>> attempts it but does not implement the error recovery parts that are
>>>> of interest to us.
>>>> * Writing our own would be difficult.
>>>> * Even if we did write it, it would probably be too slow.
>>>>
>>>> So the question is: what language should we use? Since this is the
>>>> standard programmer troll question, please bring popcorn.
>>>>
>>>> The best implementation of this algorithm is in Java: the validator.nu
>>>> parser is maintained by Mozilla, and has source translation to C++,
>>>> which is used by Mozilla and could potentially be used for an HHVM
>>>> extension.
>>>>
>>>> There is also a Rust port (also written by Mozilla), and notable
>>>> implementations in JavaScript and Python.
>>>>
>>>> For WMF, a Java service would be quite easily done, and I have
>>>> prototyped it already. An HHVM extension might also be possible. A
>>>> non-service fallback for small installations might be Node.js or a
>>>> compiled binary from Rust or C++.
>>>>
>>>> -- Tim Starling
>>>>
>>>> _______________________________________________
>>>> Wikitech-l mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>
>> --
>> Gabriel Wicke
>> Principal Engineer, Wikimedia Foundation
