On Fri, Oct 17, 2008 at 4:23 PM, Louis Ryan <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I've been working recently on the HTML parsing and rewriting features in
> Shindig. One aspect of this work was to investigate the performance of the
> Caja DOM parser and evaluate others. I evaluated the Neko HTML parser
> (http://nekohtml.sourceforge.net/), which is used in many common OS tools
> and seemed to have decent performance
> (http://www.portletbridge.org/saxbenchmark/results.html). It generally
> gives significantly better performance than the Caja DOM parser for
> equivalent content and seems to do a good job of maintaining document
> structure and parsing oddly-formed HTML.
>
> I expanded on johnh's earlier benchmark results to get comparison times
> between Caja and Neko. The results below are from parsing an Amazon.com
> home page of ~22k. The test accounts for the usual JIT warmup and
> compilation phase.
>
> Caja Parse------------------------
> Parsing [749 ms total: 24.966666666666665ms/run]
>
> Neko Parse------------------------
> Parsing [275 ms total: 9.166666666666666ms/run]
>
>
> The Neko parser actually generates an org.w3c.dom.Document, which I need to
> wrap to map to the org.apache.shindig.parse.ParsedHtmlNode. So I added
> support to GadgetHtmlParser to also produce a Document object. Here are the
> benchmark results for that parse, including implementing a converter from
> Caja DOM to w3c DOM.
>
> Caja Parse------------------------
> Parsing W3C DOM [292 ms total: 9.733333333333333ms/run]
>
> Neko Parse------------------------
> Parsing W3C DOM [82 ms total: 2.7333333333333334ms/run]
>
> Some things worth noting. Converting Caja DOM to w3c DOM is low overhead,
> but the other way around is not (though this may just be poor coding on my
> part).
>
> There is really no functional advantage to having the ParsedHtmlNode
> abstraction over DOM if we can use w3c DOM more cheaply or with minimal
> overhead in the case of Caja so I propose eliminating these interfaces from
> the implementation and altering the rewriter pipeline to consume w3c DOM.
>
> Overall I think the performance of the Neko parser speaks for itself and I
> believe it's the one we should be using in Shindig by default.


+1 to both of your conclusions. For my part, I'd be happy to see w3c DOM
replace Gadget/ParsedHtmlNode, and Neko replace CajaHtmlParser.
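
For anyone who wants to repeat this kind of measurement, here is a minimal
sketch of a warmup-then-time loop that parses into an org.w3c.dom.Document
and walks the result directly. It is not the benchmark used above. The class
and method names (DomParseBench, parse, countElements, averageParseMillis)
are made up for illustration, and the JDK's built-in XML DocumentBuilder
stands in for Neko or Caja, so it only accepts well-formed markup.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical harness, not part of Shindig.
class DomParseBench {

  // Stand-in parser (assumption): the JDK XML parser, which rejects
  // malformed HTML. A real comparison would call Neko or Caja here.
  static Document parse(String markup) throws Exception {
    DocumentBuilder builder =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();
    return builder.parse(new InputSource(new StringReader(markup)));
  }

  // Walks the w3c DOM directly, the way a rewriter could, and counts
  // element nodes at or below the given node.
  static int countElements(Node node) {
    int count = node.getNodeType() == Node.ELEMENT_NODE ? 1 : 0;
    for (Node child = node.getFirstChild(); child != null;
        child = child.getNextSibling()) {
      count += countElements(child);
    }
    return count;
  }

  // Runs untimed warmup parses so the JIT can compile the hot paths,
  // then returns the average milliseconds per timed parse.
  static double averageParseMillis(String markup, int warmupRuns,
      int timedRuns) throws Exception {
    for (int i = 0; i < warmupRuns; i++) {
      parse(markup);
    }
    long start = System.nanoTime();
    for (int i = 0; i < timedRuns; i++) {
      parse(markup);
    }
    long elapsedNanos = System.nanoTime() - start;
    return elapsedNanos / 1_000_000.0 / timedRuns;
  }

  public static void main(String[] args) throws Exception {
    String markup =
        "<html><body><div><p>hello</p><p>world</p></div></body></html>";
    System.out.println("elements: " + countElements(parse(markup)));
    System.out.printf("Parsing W3C DOM [%.3f ms/run]%n",
        averageParseMillis(markup, 1000, 30));
  }
}
```

The timings it prints depend on the machine and on the stand-in parser, so
they are not comparable to the numbers quoted above.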

--John


>
>
> -Louis
>
