Serializing parsed content and caching GadgetHtmlParsers

John Hjelmstad Thu, 02 Oct 2008 17:49:03 -0700

All,
We've had a number of discussions on this list regarding our ability to get
rid of rewritten-content caching altogether. The primary cost savings
associated with doing so, by percentage, comes from avoiding the re-parsing
of gadget contents in order to apply rewriter passes on them (which
themselves are typically very cheap, in the sub-1ms range for reasonably
large input).


With this in mind, I've written and submitted r701267, which provides custom
serialization and deserialization routines for parsed content, along with a
helper base class for any GadgetHtmlParser choosing to support caching.
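
For readers who haven't looked at the change, here is a rough sketch of the
shape such a caching helper base class could take. This is not the code in
r701267: CachingParserSketch, ByteCache, InMemoryByteCache, and the SHA-256
cache key are all names and choices assumed purely for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical byte-oriented cache; memcache or similar could sit behind it. */
interface ByteCache {
  byte[] get(String key);
  void put(String key, byte[] value);
}

/** Trivial in-process implementation, for illustration only. */
final class InMemoryByteCache implements ByteCache {
  private final Map<String, byte[]> entries = new ConcurrentHashMap<>();
  public byte[] get(String key) { return entries.get(key); }
  public void put(String key, byte[] value) { entries.put(key, value); }
}

/** Hypothetical base class: parse once, then serve the serialized tree from cache. */
abstract class CachingParserSketch<T> {
  private final ByteCache cache;

  protected CachingParserSketch(ByteCache cache) {
    this.cache = cache;
  }

  /** The expensive parse, supplied by the concrete parser. */
  protected abstract T parseImpl(String source);

  /** Serialization hooks for the parser's own tree type. */
  protected abstract byte[] serialize(T tree);
  protected abstract T deserialize(byte[] bytes);

  public final T parse(String source) {
    String key = keyFor(source);
    byte[] cached = cache.get(key);
    if (cached != null) {
      // Cache hit: pay only the deserialization cost.
      return deserialize(cached);
    }
    T parsed = parseImpl(source);
    cache.put(key, serialize(parsed));
    return parsed;
  }

  /** Assumed keying scheme: hex SHA-256 of the raw content. */
  private static String keyFor(String source) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      byte[] hash = digest.digest(source.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : hash) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }
}
```

A concrete parser would subclass this and supply parseImpl plus the two
serialization hooks; callers only ever see parse().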

In coming to this solution, I implemented three mechanisms: Java
serialization, overridden Java serialization routines
(writeObject/readObject), and finally a simplified, ad hoc byte-packed
routine. Standard and overridden Java serialization results were virtually
identical.
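
To give a feel for what "byte-packed" means here, the following is a minimal
sketch of that style of routine, not the actual format in the commit.
SketchNode and PackedNodeCodec are hypothetical names, and the layout (a type
byte, then length-prefixed UTF-8 strings and counts) is an assumption. The
point is only that no class descriptors or other metadata get written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a parsed node; not the class used in r701267. */
final class SketchNode {
  final String tagName;  // null for text nodes
  final String text;     // null for tag nodes
  final Map<String, String> attributes = new LinkedHashMap<>();
  final List<SketchNode> children = new ArrayList<>();

  private SketchNode(String tagName, String text) {
    this.tagName = tagName;
    this.text = text;
  }
  static SketchNode ofTag(String name) { return new SketchNode(name, null); }
  static SketchNode ofText(String text) { return new SketchNode(null, text); }
}

/** Assumed layout: node count, then per node a type byte plus its payload. */
final class PackedNodeCodec {
  private static final byte TEXT = 0;
  private static final byte TAG = 1;

  static byte[] serialize(List<SketchNode> roots) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeInt(roots.size());
      for (SketchNode root : roots) writeNode(root, out);
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  static List<SketchNode> deserialize(byte[] packed) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(packed));
      int count = in.readInt();
      List<SketchNode> roots = new ArrayList<>();
      for (int i = 0; i < count; i++) roots.add(readNode(in));
      return roots;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  private static void writeNode(SketchNode node, DataOutputStream out) throws IOException {
    if (node.tagName == null) {
      out.writeByte(TEXT);
      writeString(node.text, out);
      return;
    }
    out.writeByte(TAG);
    writeString(node.tagName, out);
    out.writeInt(node.attributes.size());
    for (Map.Entry<String, String> attr : node.attributes.entrySet()) {
      writeString(attr.getKey(), out);
      writeString(attr.getValue(), out);
    }
    out.writeInt(node.children.size());
    for (SketchNode child : node.children) writeNode(child, out);
  }

  private static SketchNode readNode(DataInputStream in) throws IOException {
    if (in.readByte() == TEXT) return SketchNode.ofText(readString(in));
    SketchNode node = SketchNode.ofTag(readString(in));
    int attrCount = in.readInt();
    for (int i = 0; i < attrCount; i++) {
      String name = readString(in);
      node.attributes.put(name, readString(in));
    }
    int childCount = in.readInt();
    for (int i = 0; i < childCount; i++) node.children.add(readNode(in));
    return node;
  }

  // Length-prefixed UTF-8, avoiding the 64KB limit of DataOutputStream.writeUTF.
  private static void writeString(String s, DataOutputStream out) throws IOException {
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    out.writeInt(utf8.length);
    out.write(utf8);
  }

  private static String readString(DataInputStream in) throws IOException {
    byte[] utf8 = new byte[in.readInt()];
    in.readFully(utf8);
    return new String(utf8, StandardCharsets.UTF_8);
  }
}
```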

I ran each serialization/deserialization routine across a variety of gadget
contents. In sum:
* Custom serialization measured 10-30% more efficient in space. Space
savings largely came from lack of Java class information and other metadata,
so are more pronounced for highly structured content.
* Custom serialization measured 30-40% faster than Java's, and
deserialization was 40-50% faster.

As one example, I took the BuddyPoke gadget's canvas view contents and ran
them through these routines, as well as through CajaHtmlParser. Results:
* CajaHtmlParser average parse time = 25ms.
* Java serialization average = 2.25ms; deserialization = 3.35ms; size =
35kB.
* Custom serialization average = 1.25ms; deserialization = 2.3ms; size =
30kB.

So I removed the Java serialization impl and stuck with custom. This has the
minor side benefit that different tools can easily write and read the
same format - consider a cache warmer job, for instance.

Given these results, combined with fast, relatively cheap caching by things
like memcache, I'm encouraged that we're getting close to where we can
remove rewritten content caching altogether. Per several previous comments,
many rewriting passes simply can't be cached anyway. The remainder are
extremely cheap given a low-cost parse tree.

The biggest risk with caching content in this way is the universe of
possible input. Now seems like the time we should reduce that, by finally
going ahead with our long-proposed plan to allow hangman variable
substitution only in String contexts (HTML attributes, CDATA, and text
nodes). Assuming we reach agreement on this, we can hook up parsed content
caching and implement all existing rewriting operations in terms of a parse
tree with relatively low cost.
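
As an illustration of what String-context-only substitution could look like
against a parse tree, here is a small sketch. It is not Shindig's
implementation: SubstNode and StringContextSubstituter are hypothetical names,
only __UP_name__ variables are handled, and leaving unknown variables
untouched is an assumption. Tag names and tree structure are never rewritten,
so the cached tree's shape stays independent of per-user values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical mutable parse-tree node, for illustration only. */
final class SubstNode {
  final String tagName;  // null for text nodes
  String text;           // null for tag nodes
  final Map<String, String> attributes = new LinkedHashMap<>();
  final List<SubstNode> children = new ArrayList<>();

  SubstNode(String tagName, String text) {
    this.tagName = tagName;
    this.text = text;
  }
}

final class StringContextSubstituter {
  // Matches __UP_name__, capturing the shortest name before the closing "__".
  private static final Pattern USER_PREF = Pattern.compile("__UP_(\\w+?)__");

  /** Substitutes only in text nodes and attribute values, never in structure. */
  static void apply(SubstNode node, Map<String, String> prefs) {
    if (node.tagName == null) {
      node.text = substitute(node.text, prefs);
      return;
    }
    for (Map.Entry<String, String> attr : node.attributes.entrySet()) {
      attr.setValue(substitute(attr.getValue(), prefs));
    }
    for (SubstNode child : node.children) {
      apply(child, prefs);
    }
  }

  static String substitute(String input, Map<String, String> prefs) {
    Matcher m = USER_PREF.matcher(input);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      String value = prefs.get(m.group(1));
      // Assumption: unknown variables are left in place as written.
      String replacement = (value != null) ? value : m.group();
      m.appendReplacement(out, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(out);
    return out.toString();
  }
}
```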

In the meantime, I still plan to enable this for CajaHtmlParser, since the
parse tree is only used in opt-in fashion today by "new" gadgets that don't
use __UP substitution in structural elements. I'm also inclined to get rid
of rewritten content caching, since it's largely useless today. I'd be
interested to hear others' opinions on this.

--John
