On Thu, Oct 2, 2008 at 5:46 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
> All,
> We've had a number of discussions on this list regarding our ability to
> get rid of rewritten-content caching altogether. The primary cost savings
> associated with doing so, by percentage, comes from avoiding the
> re-parsing of gadget contents in order to apply rewriter passes on them
> (which themselves are typically very cheap, in the sub-1ms range for
> reasonably large input).

The primary cost is for parsing content that isn't cacheable to begin with
because it changes every request (proxied gadget renders, makeRequest,
etc.). Until we can get a very fast parser, we can't actually do the more
complex optimizations that a parse tree facilitates, so we're stuck with
string-based manipulations anyway. The real thing we should be
investigating is why it takes 25ms to use the parser on buddypoke when it
only takes 10ms to cajole it.

>
> With this in mind, I've written and submitted r701267, which provides
> custom serialization and deserialization routines for parsed content,
> along with a helper base class for any GadgetHtmlParser choosing to
> support caching.
>
> In coming to this solution, I implemented three mechanisms: Java
> serialization, overridden Java serialization routines
> (writeObject/readObject), and finally a simplified, ad hoc byte-packed
> routine. Standard and overridden Java serialization results were
> virtually identical.
>
> I ran each serialization/deserialization routine across a variety of
> gadget contents. In sum:
> * Custom serialization measured 10-30% more efficient in space. Space
>   savings largely came from lack of Java class information and other
>   metadata, so are more pronounced for highly structured content.
> * Custom serialization measured 30-40% faster than Java's, and
>   deserialization was 40-50% faster.
>
> As one example, I took the BuddyPoke gadget's canvas view contents and
> ran them through these routines, as well as through CajaHtmlParser.
> Results:
> * CajaHtmlParser average parse time = 25ms.
> * Java serialization average = 2.25ms; deserialization = 3.35ms;
>   size = 35kB.
> * Custom serialization average = 1.25ms; deserialization = 2.3ms;
>   size = 30kB.
>
> So I removed the Java serialization impl and stuck with custom. This has
> the corollary minor benefit that different tools can easily write and
> read the same format - consider a cache warmer job for instance.
>
> Given these results, combined with fast, relatively cheap caching by
> things like memcache, I'm encouraged that we're getting close to where we
> can remove rewritten content caching altogether. Per several previous
> comments, many rewriting passes simply can't be cached anyway. The
> remainder are extremely cheap given a low-cost parse tree.
>
> The biggest risk with caching content in this way is the universe of
> possible input. Now seems like the time we should reduce that, by finally
> going ahead with our long-proposed plan to allow hangman variable
> substitution only in String contexts (HTML attributes, cdata, and text
> nodes). Assuming we reach agreement on this, we can hook up parsed
> content caching and implement all existing rewriting operations in terms
> of a parse tree with relatively low cost.
>
> In the meantime, I still plan to enable this for CajaHtmlParser, since
> the parse tree is only used in opt-in fashion today by "new" gadgets that
> don't use __UP substitution in structural elements. I'm also inclined to
> get rid of rewritten content caching, since it's largely useless today.
> I'd be interested to hear others' opinions on this.
>
> --John
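For anyone following along without the patch open, here is a rough sketch of what an ad hoc byte-packed format for a parse tree can look like. It is not the code from r701267: the SketchNode and SketchNodeCodec classes, the field layout, and the TEXT/TAG markers are all made up for illustration. The point is only that each node is written as a one-byte kind marker plus length-prefixed UTF-8 strings, with none of the class metadata that Java serialization emits.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical parse-tree node for this sketch; not Shindig's actual class. */
final class SketchNode {
  final String tagName; // null for text nodes
  final String text;    // null for tag nodes
  final Map<String, String> attributes = new LinkedHashMap<>();
  final List<SketchNode> children = new ArrayList<>();

  private SketchNode(String tagName, String text) {
    this.tagName = tagName;
    this.text = text;
  }

  static SketchNode tag(String name) { return new SketchNode(name, null); }
  static SketchNode text(String text) { return new SketchNode(null, text); }
}

/** Hypothetical byte-packed codec: kind marker, then length-prefixed strings. */
final class SketchNodeCodec {
  private static final byte TEXT = 0;
  private static final byte TAG = 1;

  static byte[] serialize(SketchNode root) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      write(root, out);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  static SketchNode deserialize(byte[] data) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      return read(in);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  private static void write(SketchNode node, DataOutputStream out) throws IOException {
    if (node.tagName == null) {
      out.writeByte(TEXT);
      writeString(node.text, out);
      return;
    }
    out.writeByte(TAG);
    writeString(node.tagName, out);
    out.writeInt(node.attributes.size());
    for (Map.Entry<String, String> attr : node.attributes.entrySet()) {
      writeString(attr.getKey(), out);
      writeString(attr.getValue(), out);
    }
    out.writeInt(node.children.size());
    for (SketchNode child : node.children) {
      write(child, out);
    }
  }

  private static SketchNode read(DataInputStream in) throws IOException {
    if (in.readByte() == TEXT) {
      return SketchNode.text(readString(in));
    }
    SketchNode node = SketchNode.tag(readString(in));
    int attrCount = in.readInt();
    for (int i = 0; i < attrCount; i++) {
      String key = readString(in); // read key before value to match write order
      node.attributes.put(key, readString(in));
    }
    int childCount = in.readInt();
    for (int i = 0; i < childCount; i++) {
      node.children.add(read(in));
    }
    return node;
  }

  // Strings are stored as a 4-byte length followed by raw UTF-8 bytes.
  private static void writeString(String s, DataOutputStream out) throws IOException {
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    out.writeInt(utf8.length);
    out.write(utf8);
  }

  private static String readString(DataInputStream in) throws IOException {
    byte[] utf8 = new byte[in.readInt()];
    in.readFully(utf8);
    return new String(utf8, StandardCharsets.UTF_8);
  }
}
```

Because the format is just markers and length-prefixed strings, a separate tool (the cache warmer John mentions, say) could read or write it without sharing any Java classes. A real implementation would also want a version byte and bounds checks on the lengths it reads back from the cache.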

