Hi,

I've recently been working on the HTML parsing and rewriting features in
Shindig. One aspect of this work was to investigate the performance of the
Caja DOM parser and evaluate alternatives. I evaluated the Neko HTML parser
(http://nekohtml.sourceforge.net/), which is used in many common open-source
tools and seemed to have decent performance
(http://www.portletbridge.org/saxbenchmark/results.html). It generally gives
significantly better performance than the Caja DOM parser for equivalent
content (roughly 3x faster in the runs below) and seems to do a good job of
maintaining document structure and parsing oddly formed HTML.

I expanded on johnh's earlier benchmark results to get comparison times
between Caja and Neko. The results below are from parsing an Amazon.com home
page of ~22k. The test accounts for the usual JIT warmup and compilation
phase.
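
For anyone wanting to reproduce this kind of measurement, here is a minimal
sketch of a timing harness with a warmup phase. It is not the actual
benchmark code: the class and method names are made up for illustration, the
workload is a placeholder, and the run count of 30 is only inferred from the
total and per-run figures reported in this mail.

```java
public class ParseBenchmark {

    /**
     * Runs the task warmupRuns times untimed (so the JIT can compile the hot
     * paths), then timedRuns times under the clock. Returns elapsed millis.
     */
    static long timeRuns(Runnable task, int warmupRuns, int timedRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Formats a result line in the same shape as the output in this mail. */
    static String report(String label, long totalMs, int runs) {
        return label + " [" + totalMs + " ms total: "
                + ((double) totalMs / runs) + "ms/run]";
    }

    public static void main(String[] args) {
        // Placeholder workload; a real test would parse the page content here.
        StringBuilder sink = new StringBuilder();
        Runnable task = () -> {
            sink.setLength(0);
            sink.append("<html><body></body></html>");
        };
        long totalMs = timeRuns(task, 100, 30);
        System.out.println(report("Parsing", totalMs, 30));
    }
}
```

Swapping the placeholder Runnable for a call into each parser with the same
input string should give directly comparable numbers.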

Caja Parse------------------------
Parsing [749 ms total: 24.966666666666665ms/run]

Neko Parse------------------------
Parsing [275 ms total: 9.166666666666666ms/run]


The Neko parser actually generates an org.w3c.dom.Document, which I need to
wrap in order to map it to org.apache.shindig.parse.ParsedHtmlNode. So I
added support to GadgetHtmlParser to also produce a Document object. Here are
the benchmark results for that parse, including implementing a converter from
Caja DOM to w3c DOM.

Caja Parse------------------------
Parsing W3C DOM [292 ms total: 9.733333333333333ms/run]

Neko Parse------------------------
Parsing W3C DOM [82 ms total: 2.7333333333333334ms/run]

Some things worth noting: converting Caja DOM to w3c DOM is low overhead, but
the other way around is not (though this may just be poor coding on my
part).
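
To give an idea of what a conversion into w3c DOM involves, here is a rough
sketch of a recursive tree-to-DOM converter. SimpleNode is a hypothetical
stand-in for a parser-specific node type; it is not the Caja DOM or the
ParsedHtmlNode API, and the real converter may look quite different.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class TreeToDomSketch {

    /** Hypothetical parser-specific node: either an element or a text node. */
    static class SimpleNode {
        final String tagName; // null for text nodes
        final String text;    // null for elements
        final Map<String, String> attributes = new LinkedHashMap<>();
        final List<SimpleNode> children = new ArrayList<>();

        private SimpleNode(String tagName, String text) {
            this.tagName = tagName;
            this.text = text;
        }

        static SimpleNode element(String tagName) {
            return new SimpleNode(tagName, null);
        }

        static SimpleNode textNode(String text) {
            return new SimpleNode(null, text);
        }
    }

    /** Builds a new w3c Document; the root must be an element node. */
    static Document toDocument(SimpleNode root) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        doc.appendChild(convert(doc, root));
        return doc;
    }

    /** Recursively copies one node and its subtree into the target document. */
    private static Node convert(Document doc, SimpleNode node) {
        if (node.tagName == null) {
            return doc.createTextNode(node.text);
        }
        Element element = doc.createElement(node.tagName);
        for (Map.Entry<String, String> attr : node.attributes.entrySet()) {
            element.setAttribute(attr.getKey(), attr.getValue());
        }
        for (SimpleNode child : node.children) {
            element.appendChild(convert(doc, child));
        }
        return element;
    }
}
```

Each source node is visited once and produces one DOM node, so a walk like
this is linear in the size of the tree.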

There is really no functional advantage to having the ParsedHtmlNode
abstraction over DOM if we can use w3c DOM more cheaply, or with minimal
overhead in the case of Caja, so I propose eliminating these interfaces from
the implementation and altering the rewriter pipeline to consume w3c DOM.
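
As a rough illustration of the proposal, a rewriter that consumes w3c DOM
directly could look something like the sketch below. The DomRewriter
interface, the ImageProxyRewriter class and the proxy prefix are all
hypothetical names for this example, not existing Shindig code. The demo
uses the JDK's XML parser on well-formed markup as a stand-in; in practice
the Document would come from the HTML parser.

```java
import java.io.StringReader;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomRewriterSketch {

    /** Hypothetical rewriter contract: mutate the document in place. */
    interface DomRewriter {
        void rewrite(Document doc);
    }

    /** Hypothetical example: prefixes every img src with a proxy URL. */
    static class ImageProxyRewriter implements DomRewriter {
        private final String proxyPrefix;

        ImageProxyRewriter(String proxyPrefix) {
            this.proxyPrefix = proxyPrefix;
        }

        @Override
        public void rewrite(Document doc) {
            NodeList images = doc.getElementsByTagName("img");
            for (int i = 0; i < images.getLength(); i++) {
                Element img = (Element) images.item(i);
                if (img.hasAttribute("src")) {
                    img.setAttribute("src", proxyPrefix + img.getAttribute("src"));
                }
            }
        }
    }

    /** Stand-in parser for the demo; only accepts well-formed markup. */
    static Document parseWellFormed(String markup) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
    }

    /** Applies each rewriter in order to the same shared document. */
    static void runPipeline(Document doc, List<DomRewriter> rewriters) {
        for (DomRewriter rewriter : rewriters) {
            rewriter.rewrite(doc);
        }
    }
}
```

Because every rewriter works on the same Document, the content only needs to
be parsed once and serialized once at the end of the pipeline.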

Overall, I think the performance of the Neko parser speaks for itself, and I
believe it's the one we should be using in Shindig by default.

-Louis
