Hi,

I've recently been working on the HTML parsing and rewriting features in Shindig. One aspect of this work was to investigate the performance of the Caja DOM parser and evaluate alternatives. I evaluated the Neko HTML parser (http://nekohtml.sourceforge.net/), which is used in many common OS tools and seemed to have decent performance (http://www.portletbridge.org/saxbenchmark/results.html). It generally gives significantly better performance than the Caja DOM parser for equivalent content, and seems to do a good job of maintaining document structure and parsing oddly-formed HTML.
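
For anyone who wants to reproduce a comparison like this, a timing harness could look roughly like the following. This is a minimal sketch, not the benchmark code behind the numbers in this message: ParseBench, timeRuns and parseToDom are hypothetical names, the warmup count is an assumption, and the JDK's built-in XML parser stands in for the Caja and Neko parsers so the example has no external dependencies. The totals and per-run times reported in this message work out to 30 timed runs per parser, so the sketch uses 30 as well.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical harness: warm up first, then time a fixed number of runs.
final class ParseBench {

    /** Total and per-run time for one timed batch. */
    static final class Result {
        final int runs;
        final double totalMillis;

        Result(int runs, double totalMillis) {
            this.runs = runs;
            this.totalMillis = totalMillis;
        }

        double millisPerRun() {
            return totalMillis / runs;
        }

        @Override
        public String toString() {
            return "Parsing [" + Math.round(totalMillis) + " ms total: "
                    + millisPerRun() + "ms/run]";
        }
    }

    /** Runs the task untimed warmupRuns times, then times timedRuns executions. */
    static Result timeRuns(Runnable task, int warmupRuns, int timedRuns) {
        if (timedRuns <= 0) {
            throw new IllegalArgumentException("timedRuns must be positive");
        }
        // Untimed runs give the JIT a chance to compile the hot paths.
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            task.run();
        }
        long elapsedNanos = System.nanoTime() - start;
        return new Result(timedRuns, elapsedNanos / 1_000_000.0);
    }

    /** Stand-in parser: turns well-formed markup into an org.w3c.dom.Document. */
    static Document parseToDom(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("stand-in parse failed", e);
        }
    }

    public static void main(String[] args) {
        // Tiny well-formed page; a real test would load the full HTML under test.
        String page = "<html><body><p>hello</p></body></html>";
        Result result = timeRuns(() -> parseToDom(page), 100, 30);
        System.out.println(result);
    }
}
```

To compare two parsers, the same timeRuns call would be made once per parser with the same input and the same run counts, swapping only the task passed in.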
I expanded on johnh's earlier benchmark results to get comparison times between Caja and Neko. The results below are from parsing an Amazon.com home page of ~22k. The test accounts for the usual JIT warmup and compilation phase.

Caja Parse------------------------
Parsing [749 ms total: 24.966666666666665ms/run]

Neko Parse------------------------
Parsing [275 ms total: 9.166666666666666ms/run]

The Neko parser actually generates an org.w3c.dom.Document, which I need to wrap in order to map it to org.apache.shindig.parse.ParsedHtmlNode. So I added support to GadgetHtmlParser to also produce a Document object. Here are the benchmark results for that parse, including implementing a converter from Caja DOM to w3c DOM.

Caja Parse------------------------
Parsing W3C DOM [292 ms total: 9.733333333333333ms/run]

Neko Parse------------------------
Parsing W3C DOM [82 ms total: 2.7333333333333334ms/run]

Some things worth noting:

Converting Caja DOM to w3c DOM is low overhead, but the other way around is not (though this may just be poor coding on my part).

There is really no functional advantage to having the ParsedHtmlNode abstraction over DOM if we can use w3c DOM more cheaply, or with minimal overhead in the case of Caja. So I propose eliminating these interfaces from the implementation and altering the rewriter pipeline to consume w3c DOM.

Overall I think the performance of the Neko parser speaks for itself, and I believe it's the one we should be using in Shindig by default.

-Louis

