On Fri, Oct 17, 2008 at 4:23 PM, Louis Ryan <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I've been working recently on the HTML parsing and rewriting features in
> Shindig. One aspect of this work was to investigate the performance of the
> Caja DOM parser and evaluate others. I evaluated the Neko HTML parser
> (http://nekohtml.sourceforge.net/), which is used in many common
> open-source tools and seemed to have decent performance
> (http://www.portletbridge.org/saxbenchmark/results.html). It generally
> gives significantly better performance than the Caja DOM parser for
> equivalent content and seems to do a good job of maintaining document
> structure and parsing oddly-formed HTML.
>
> I expanded on johnh's earlier benchmark results to get comparison times
> between Caja and Neko. The results below are from parsing an Amazon.com
> home page of ~22k. The test accounts for the usual JIT warmup and
> compilation phase.
>
> Caja Parse------------------------
> Parsing [749 ms total: 24.966666666666665ms/run]
>
> Neko Parse------------------------
> Parsing [275 ms total: 9.166666666666666ms/run]
>
>
> The Neko parser actually generates an org.w3c.dom.Document, which I need
> to wrap to map to the org.apache.shindig.parse.ParsedHtmlNode. So I added
> support to GadgetHtmlParser to also produce a Document object. Here are
> the benchmark results for that parse, including implementing a converter
> from Caja DOM to w3c DOM.
>
> Caja Parse------------------------
> Parsing W3C DOM [292 ms total: 9.733333333333333ms/run]
>
> Neko Parse------------------------
> Parsing W3C DOM [82 ms total: 2.7333333333333334ms/run]
>
> Some things worth noting. Converting Caja DOM to w3c DOM is low overhead,
> but the other way around is not (though this may just be poor coding on my
> part).
>
> There is really no functional advantage to having the ParsedHtmlNode
> abstraction over DOM if we can use w3c DOM more cheaply, or with minimal
> overhead in the case of Caja, so I propose eliminating these interfaces
> from the implementation and altering the rewriter pipeline to consume w3c
> DOM.
>
> Overall I think the performance of the Neko parser speaks for itself and I
> believe it's the one we should be using in Shindig by default.

+1 to both of your conclusions. For my part, I'd be happy to see w3c DOM
replace Gadget/ParsedHtmlNode, and Neko replace CajaHtmlParser.

--John

>
>
> -Louis
>
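
For anyone wanting to reproduce this kind of comparison: the timing lines
quoted above are consistent with 30 timed runs per parser (749 ms / 30 is
about 24.97 ms/run). A minimal sketch of such a harness in Java, with warm-up
passes before timing, might look like the following. Everything in it is an
assumption for illustration: the class ParseBenchmark, its Result and time
helpers, and the run counts are hypothetical and not taken from Shindig, and
the JDK's built-in XML DocumentBuilder stands in for the Caja and Neko
parsers, so it accepts only well-formed markup and says nothing about their
actual performance.

```java
import java.io.StringReader;
import java.util.function.Function;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical harness: names and run counts are illustrative assumptions.
final class ParseBenchmark {

    /** Total and per-run wall-clock time for one parser. */
    static final class Result {
        final int runs;
        final long totalMs;
        final double msPerRun;

        Result(int runs, long totalMs) {
            this.runs = runs;
            this.totalMs = totalMs;
            this.msPerRun = totalMs / (double) runs;
        }
    }

    /** Runs untimed warm-up passes so the JIT can compile, then times the rest. */
    static Result time(Function<String, Document> parser, String content,
                       int warmupRuns, int timedRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            parser.apply(content);
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            parser.apply(content);
        }
        long totalMs = (System.nanoTime() - start) / 1_000_000;
        return new Result(timedRuns, totalMs);
    }

    /** Stand-in parser: the JDK XML parser, which needs well-formed input. */
    static Document parseWithJdk(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String content = "<html><body><p>hello</p></body></html>";
        Result r = time(ParseBenchmark::parseWithJdk, content, 100, 30);
        System.out.printf("Parsing W3C DOM [%d ms total: %sms/run]%n",
                r.totalMs, r.msPerRun);
    }
}
```

Any parser that returns an org.w3c.dom.Document can be passed to time() as a
Function<String, Document>, which is the main reason for taking the parser as
a parameter here; swapping in a real HTML parser would also allow malformed
input like the Amazon.com page mentioned above.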

