All,

We've had a number of discussions on this list regarding our ability to get rid of rewritten-content caching altogether. The primary cost savings that this caching provides, by percentage, come from avoiding the re-parsing of gadget contents in order to apply rewriter passes on them (which themselves are typically very cheap, in the sub-1ms range for reasonably large input).

With this in mind, I've written and submitted r701267, which provides custom serialization and deserialization routines for parsed content, along with a helper base class for any GadgetHtmlParser choosing to support caching.

In coming to this solution, I implemented three mechanisms: Java serialization, overridden Java serialization routines (writeObject/readObject), and finally a simplified, ad hoc byte-packed routine. Standard and overridden Java serialization results were virtually identical.

I ran each serialization/deserialization routine across a variety of gadget contents. In sum:
* Custom serialization measured 10-30% more efficient in space. Space savings largely came from lack of Java class information and other metadata, so are more pronounced for highly structured content.
* Custom serialization measured 30-40% faster than Java's, and deserialization was 40-50% faster.

As one example, I took the BuddyPoke gadget's canvas view contents and ran them through these routines, as well as through CajaHtmlParser. Results:
* CajaHtmlParser average parse time = 25ms.
* Java serialization average = 2.25ms; deserialization = 3.35ms; size = 35kB.
* Custom serialization average = 1.25ms; deserialization = 2.3ms; size = 30kB.

So I removed the Java serialization impl and stuck with custom. This has the corollary minor benefit that different tools can easily write and read the same format - consider a cache warmer job for instance.

Given these results, combined with fast, relatively cheap caching by things like memcache, I'm encouraged that we're getting close to where we can remove rewritten content caching altogether. Per several previous comments, many rewriting passes simply can't be cached anyway. The remainder are extremely cheap given a low-cost parse tree.

The biggest risk with caching content in this way is the universe of possible input. Now seems like the time we should reduce that, by finally going ahead with our long-proposed plan to allow hangman variable substitution only in String contexts (HTML attributes, cdata, and text nodes). Assuming we reach agreement on this, we can hook up parsed content caching and implement all existing rewriting operations in terms of a parse tree with relatively low cost.

In the meantime, I still plan to enable this for CajaHtmlParser, since the parse tree is only used in opt-in fashion today by "new" gadgets that don't use __UP substitution in structural elements. I'm also inclined to get rid of rewritten content caching, since it's largely useless today.

I'd be interested to hear others' opinions on this.

--John
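
For readers who want a concrete picture of what an ad hoc byte-packed routine can look like, here is a minimal sketch. It is not the code in r701267: the class name ParsedNodeCodec, the simplified Node type, the tag bytes, and the length-prefixed UTF-8 layout are all assumptions made for illustration. It only shows why such a format carries no Java class metadata: each node is written as a kind byte followed by length-prefixed strings and counts.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical byte-packed codec for a simplified parse tree (illustration only). */
class ParsedNodeCodec {
  /** Simplified node: either a tag with attributes and children, or a text node. */
  public static final class Node {
    public final String tagName; // null for text nodes
    public final String text;    // null for tag nodes
    public final Map<String, String> attrs = new LinkedHashMap<>();
    public final List<Node> children = new ArrayList<>();

    private Node(String tagName, String text) {
      this.tagName = tagName;
      this.text = text;
    }
    public static Node tag(String name) { return new Node(name, null); }
    public static Node text(String text) { return new Node(null, text); }
  }

  // Assumed one-byte markers identifying the node kind.
  private static final byte TEXT = 0;
  private static final byte TAG = 1;

  public static byte[] serialize(Node root) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      write(root, out);
    }
    return bytes.toByteArray();
  }

  public static Node deserialize(byte[] data) throws IOException {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      return read(in);
    }
  }

  private static void write(Node n, DataOutputStream out) throws IOException {
    if (n.tagName == null) {
      out.writeByte(TEXT);
      writeString(n.text, out);
      return;
    }
    out.writeByte(TAG);
    writeString(n.tagName, out);
    out.writeInt(n.attrs.size());
    for (Map.Entry<String, String> e : n.attrs.entrySet()) {
      writeString(e.getKey(), out);
      writeString(e.getValue(), out);
    }
    out.writeInt(n.children.size());
    for (Node child : n.children) {
      write(child, out);
    }
  }

  private static Node read(DataInputStream in) throws IOException {
    byte kind = in.readByte();
    if (kind == TEXT) {
      return Node.text(readString(in));
    }
    if (kind != TAG) {
      throw new IOException("Unknown node kind: " + kind);
    }
    Node n = Node.tag(readString(in));
    int attrCount = in.readInt();
    for (int i = 0; i < attrCount; i++) {
      String key = readString(in);
      String value = readString(in);
      n.attrs.put(key, value);
    }
    int childCount = in.readInt();
    for (int i = 0; i < childCount; i++) {
      n.children.add(read(in));
    }
    return n;
  }

  // Strings are stored as a 4-byte length followed by UTF-8 bytes.
  private static void writeString(String s, DataOutputStream out) throws IOException {
    byte[] b = s.getBytes(StandardCharsets.UTF_8);
    out.writeInt(b.length);
    out.write(b);
  }

  private static String readString(DataInputStream in) throws IOException {
    byte[] b = new byte[in.readInt()];
    in.readFully(b);
    return new String(b, StandardCharsets.UTF_8);
  }
}
```

A cache warmer or other tool could call ParsedNodeCodec.serialize(root) and store the resulting bytes in something like memcache, then call deserialize on a hit. A real format would likely also need a version byte and handling for comments, cdata, and other node types, which this sketch omits.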

