Interesting. What is the cause of the slower speed?

- Trevor

On Tuesday, August 11, 2015, Gabriel Wicke <[email protected]> wrote:

> On Tue, Aug 11, 2015 at 5:16 PM, Trevor Parscal <[email protected]> wrote:
>
> > Is it possible to use part of the Parsoid code to do this?
> >
>
> It is possible to do this in Parsoid (or any Node service) with this line:
>
>  var sanerHTML = domino.createDocument(input).outerHTML;
>
> However, performance is about 2x worse than current Tidy (116ms vs. 238ms
> for the Obama article), and about 4x slower than the fastest option in our
> tests. The task has a lot more benchmarks of various options.
>
> Gabriel
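For reference, timings like the 116ms vs. 238ms above are wall-clock reparse times, and the measurement itself is easy to sketch with Node's monotonic clock. A minimal harness; the `domino` call is stubbed out here so the sketch runs standalone (swap in the real one-liner quoted above to reproduce the benchmark):

```javascript
// Minimal wall-clock harness of the kind behind the numbers above.
// reparse() is a stand-in for domino.createDocument(input).outerHTML,
// stubbed here so the harness runs without the domino package.
function reparse(input) {
  return input; // stub; replace with the real domino call to reproduce
}

// Best-of-N timing in milliseconds, using Node's monotonic clock
// (process.hrtime.bigint) to avoid wall-clock adjustments mid-run.
function timeMs(fn, input, runs = 5) {
  let best = Infinity;
  for (let i = 0; i < runs; i++) {
    const start = process.hrtime.bigint();
    fn(input);
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    best = Math.min(best, elapsedMs);
  }
  return best;
}

const sample = '<p>paragraph '.repeat(1000);
console.log(`reparse: ${timeMs(reparse, sample).toFixed(2)} ms`);
```

Best-of-N is used rather than an average so that GC pauses and JIT warm-up in early runs do not inflate the figure.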
>
> >
> > - Trevor
> >
> > On Tuesday, August 11, 2015, Tim Starling <[email protected]> wrote:
> >
> > > I'm elevating this task of mine to RFC status:
> > >
> > > https://phabricator.wikimedia.org/T89331
> > >
> > > Running the output of the MediaWiki parser through HTML Tidy always
> > > seemed like a nasty hack. The effects on wikitext syntax are arbitrary
> > > and change from version to version. When we upgrade our Linux
> > > distribution, we sometimes see changes in the HTML generated by given
> > > wikitext, which is not ideal.
> > >
> > > Parsoid took a different approach. After token-level transformations,
> > > tokens are fed into the HTML 5 parse algorithm, a complex but
> > > well-specified algorithm which generates a DOM tree from quirky input
> > > text.
> > >
> > > http://www.w3.org/TR/html5/syntax.html
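The error recovery Tim is referring to is fully deterministic. As a toy illustration only (not the real algorithm — an actual implementation follows the spec's tree-construction rules in full), here is a sketch of two of those rules, implied end tags for `<p>` and reconstruction of active formatting elements, handling only `<p>`, `<b>`, and text:

```javascript
// Toy sketch of two HTML5 tree-construction rules that give the
// algorithm its deterministic error recovery:
//  1. implied end tags: a new <p> closes any open <p>;
//  2. reconstruction of active formatting elements: an unclosed <b>
//     is reopened inside the next block.
// Handles only <p>, <b>, and text; everything else is out of scope.
function miniParse(input) {
  const tokens = input.split(/(<\/?[a-z]+>)/).filter(s => s !== '');
  const root = { tag: '#root', children: [] };
  const stack = [root];       // stack of open elements
  let activeFormatting = [];  // active formatting elements (<b> only here)

  const top = () => stack[stack.length - 1];
  const open = tag => {
    const node = { tag, children: [] };
    top().children.push(node);
    stack.push(node);
  };
  const inScope = tag => stack.some(n => n.tag === tag);

  for (const tok of tokens) {
    if (tok === '<p>') {
      while (inScope('p')) stack.pop();            // implied </p>
      open('p');
      for (const t of activeFormatting) open(t);   // reconstruct <b>
    } else if (tok === '<b>') {
      open('b');
      activeFormatting.push('b');
    } else if (tok === '</b>') {
      while (stack.length > 1 && top().tag !== 'b') stack.pop();
      if (top().tag === 'b') stack.pop();
      activeFormatting = activeFormatting.filter(t => t !== 'b');
    } else if (tok === '</p>') {
      while (stack.length > 1 && top().tag !== 'p') stack.pop();
      if (top().tag === 'p') stack.pop();
    } else {
      top().children.push({ text: tok });
    }
  }

  const serialize = node =>
    node.text !== undefined
      ? node.text
      : `<${node.tag}>` + node.children.map(serialize).join('') + `</${node.tag}>`;
  return root.children.map(serialize).join('');
}
```

So `miniParse('<p><b>one<p>two')` yields `<p><b>one</b></p><p><b>two</b></p>` — the same fixup a spec-compliant parser performs on that input, and the kind of stable, version-independent behaviour that Tidy does not guarantee.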
> > >
> > > We can get nearly the same effect in MediaWiki by replacing the Tidy
> > > transformation stage with an HTML 5 parse followed by serialization of
> > > the DOM back to HTML. This would stabilize wikitext syntax and resolve
> > > several important syntax differences compared to Parsoid.
> > >
> > > However:
> > >
> > > * I have not been able to find any PHP implementation of this
> > > algorithm. Masterminds and Ressio do not even attempt it. Electrolinux
> > > attempts it but does not implement the error recovery parts that are
> > > of interest to us.
> > > * Writing our own would be difficult.
> > > * Even if we did write it, it would probably be too slow.
> > >
> > > So the question is: what language should we use? Since this is the
> > > standard programmer troll question, please bring popcorn.
> > >
> > > The best implementation of this algorithm is in Java: the validator.nu
> > > parser is maintained by Mozilla, and has source translation to C++,
> > > which is used by Mozilla and could potentially be used for an HHVM
> > > extension.
> > >
> > > There is also a Rust port (also written by Mozilla), and notable
> > > implementations in JavaScript and Python.
> > >
> > > For WMF, a Java service would be quite easily done, and I have
> > > prototyped it already. An HHVM extension might also be possible. A
> > > non-service fallback for small installations might be Node.js or a
> > > compiled binary from Rust or C++.
> > >
> > > -- Tim Starling
> > >
> > >
> > > _______________________________________________
> > > Wikitech-l mailing list
> > > [email protected] <javascript:;> <javascript:;>
> > > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> >
>
>
>
> --
> Gabriel Wicke
> Principal Engineer, Wikimedia Foundation