Language choice. Tidy is written in C. Note that I included shelling
out to Node.js as an option in my original post. It's not really part
of Parsoid; it's a JavaScript library that Parsoid uses. We would use
the same JavaScript library with a few lines of wrapper code.
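
As a rough sketch of what "a few lines of wrapper code" might look
like (an illustration only, not an actual implementation): it assumes
the library is the domino npm package from Gabriel's one-liner quoted
below, and the names tidyHtml and runCli are hypothetical.

```javascript
// Hypothetical wrapper: parse with HTML5 error recovery, then serialize
// the DOM back to HTML. `createDocument` defaults to domino's
// createDocument (assumes `npm install domino`); it can be swapped out,
// for example with a stub in tests.
function tidyHtml(input, createDocument) {
  if (createDocument === undefined) {
    createDocument = require('domino').createDocument; // loaded lazily
  }
  return createDocument(String(input)).outerHTML;
}

// Hypothetical entry point for shelling out: read all of the input
// stream, then write the re-serialized HTML to the output stream.
function runCli(stdin, stdout, createDocument) {
  let input = '';
  stdin.setEncoding('utf8');
  stdin.on('data', (chunk) => { input += chunk; });
  stdin.on('end', () => {
    stdout.write(tidyHtml(input, createDocument));
  });
}
```

A caller shelling out from PHP would run a script that ends with
runCli(process.stdin, process.stdout), pipe the parser output in, and
read the cleaned HTML back.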

-- Tim Starling

On 12/08/15 10:24, Trevor Parscal wrote:
> Interesting. What is the cause of the slower speed?
> 
> - Trevor
> 
> On Tuesday, August 11, 2015, Gabriel Wicke <[email protected]> wrote:
> 
>> On Tue, Aug 11, 2015 at 5:16 PM, Trevor Parscal <[email protected]> wrote:
>>
>>> Is it possible to use part of the Parsoid code to do this?
>>>
>>
>> It is possible to do this in Parsoid (or any node service) with this line:
>>
>>  var sanerHTML = domino.createDocument(input).outerHTML;
>>
>> However, it is about 2x slower than current Tidy (238ms vs. 116ms for
>> Tidy on the Obama article), and about 4x slower than the fastest option
>> in our tests. The task has a lot more benchmarks of various options.
>>
>> Gabriel
>>
>>
>>
>>
>>
>>>
>>> - Trevor
>>>
>>> On Tuesday, August 11, 2015, Tim Starling <[email protected]> wrote:
>>>
>>>> I'm elevating this task of mine to RFC status:
>>>>
>>>> https://phabricator.wikimedia.org/T89331
>>>>
>>>> Running the output of the MediaWiki parser through HTML Tidy always
>>>> seemed like a nasty hack. The effects on wikitext syntax are arbitrary
>>>> and change from version to version. When we upgrade our Linux
>>>> distribution, we sometimes see changes in the HTML generated by given
>>>> wikitext, which is not ideal.
>>>>
>>>> Parsoid took a different approach. After token-level transformations,
>>>> tokens are fed into the HTML 5 parse algorithm, a complex but
>>>> well-specified algorithm which generates a DOM tree from quirky input
>>>> text.
>>>>
>>>> http://www.w3.org/TR/html5/syntax.html
>>>>
>>>> We can get nearly the same effect in MediaWiki by replacing the Tidy
>>>> transformation stage with an HTML 5 parse followed by serialization of
>>>> the DOM back to HTML. This would stabilize wikitext syntax and resolve
>>>> several important syntax differences compared to Parsoid.
>>>>
>>>> However:
>>>>
>>>> * I have not been able to find any PHP implementation of this
>>>> algorithm. Masterminds and Ressio do not even attempt it. Electrolinux
>>>> attempts it but does not implement the error recovery parts that are
>>>> of interest to us.
>>>> * Writing our own would be difficult.
>>>> * Even if we did write it, it would probably be too slow.
>>>>
>>>> So the question is: what language should we use? Since this is the
>>>> standard programmer troll question, please bring popcorn.
>>>>
>>>> The best implementation of this algorithm is in Java: the validator.nu
>>>> parser is maintained by Mozilla, and has source translation to C++,
>>>> which is used by Mozilla and could potentially be used for an HHVM
>>>> extension.
>>>>
>>>> There is also a Rust port (also written by Mozilla), and notable
>>>> implementations in JavaScript and Python.
>>>>
>>>> For WMF, a Java service would be quite easily done, and I have
>>>> prototyped it already. An HHVM extension might also be possible. A
>>>> non-service fallback for small installations might be Node.js or a
>>>> compiled binary from Rust or C++.
>>>>
>>>> -- Tim Starling
>>>>
>>>>
>>>> _______________________________________________
>>>> Wikitech-l mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>>
>>
>>
>>
>> --
>> Gabriel Wicke
>> Principal Engineer, Wikimedia Foundation
> 


