Re: Serializing parsed content and caching GadgetHtmlParsers

Kevin Brown Thu, 02 Oct 2008 18:04:22 -0700

On Thu, Oct 2, 2008 at 5:46 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:


> All,
> We've had a number of discussions on this list regarding our ability to get
> rid of rewritten-content caching altogether. The primary cost savings
> associated with doing so, by percentage, come from avoiding the re-parsing
> of gadget contents in order to apply rewriter passes on them (which
> themselves are typically very cheap, in the sub-1ms range for reasonably
> large input).


The primary cost is for parsing content that isn't cacheable to begin with
because it changes on every request (proxied gadget renders, makeRequest,
etc.).

Until we can get a very fast parser, we can't actually do the more complex
optimizations that a parse tree facilitates, so we're stuck with
string-based manipulations anyway.

The real thing we should be investigating is why it takes 25ms to use the
parser on BuddyPoke when it only takes 10ms to cajole it.


>
> With this in mind, I've written and submitted r701267, which provides
> custom serialization and deserialization routines for parsed content,
> along with a helper base class for any GadgetHtmlParser choosing to
> support caching.
>
> In coming to this solution, I implemented three mechanisms: Java
> serialization, overridden Java serialization routines
> (writeObject/readObject), and finally a simplified, ad hoc byte-packed
> routine. Standard and overridden Java serialization results were virtually
> identical.
>
> I ran each serialization/deserialization routine across a variety of gadget
> contents. In sum:
> * Custom serialization measured 10-30% more efficient in space. Space
> savings largely came from lack of Java class information and other
> metadata, so are more pronounced for highly structured content.
> * Custom serialization measured 30-40% faster than Java's, and
> deserialization was 40-50% faster.
>
> As one example, I took the BuddyPoke gadget's canvas view contents and ran
> them through these routines, as well as through CajaHtmlParser. Results:
> * CajaHtmlParser average parse time = 25ms.
> * Java serialization average = 2.25ms; deserialization = 3.35ms;
> size = 35kB.
> * Custom serialization average = 1.25ms; deserialization = 2.3ms;
> size = 30kB.
>
> So I removed the Java serialization impl and stuck with custom. This has
> the corollary minor benefit that different tools can easily write and
> read the same format - consider a cache warmer job for instance.
>
> Given these results, combined with fast, relatively cheap caching by things
> like memcache, I'm encouraged that we're getting close to where we can
> remove rewritten content caching altogether. Per several previous comments,
> many rewriting passes simply can't be cached anyway. The remainder are
> extremely cheap given a low-cost parse tree.
>
> The biggest risk with caching content in this way is the universe of
> possible input. Now seems like the time we should reduce that, by finally
> going ahead with our long-proposed plan to allow hangman variable
> substitution only in String contexts (HTML attributes, cdata, and text
> nodes). Assuming we reach agreement on this, we can hook up parsed content
> caching and implement all existing rewriting operations in terms of a parse
> tree with relatively low cost.
>
> In the meantime, I still plan to enable this for CajaHtmlParser, since the
> parse tree is only used in opt-in fashion today by "new" gadgets that don't
> use __UP substitution in structural elements. I'm also inclined to get rid
> of rewritten content caching, since it's largely useless today. I'd be
> interested to hear others' opinions on this.
>
> --John
>
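
For readers who want a concrete picture of the "ad hoc byte-packed routine"
described in the quoted message, here is a minimal sketch of the general
idea. It is not the code from r701267: the class ParsedNodeCodec, the Node
type, and the wire layout (a one-byte node kind, then length-prefixed UTF-8
strings and int counts) are all assumptions made up for illustration. It
only shows why such a format can be smaller than default Java
serialization: no class descriptors or other metadata are written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a byte-packed parse tree codec (not Shindig's). */
public class ParsedNodeCodec {
  /** Hypothetical stand-in for a parsed node: a tag or a text node. */
  public static final class Node {
    public final boolean isText;
    public final String value; // tag name, or the text of a text node
    public final Map<String, String> attrs = new LinkedHashMap<>();
    public final List<Node> children = new ArrayList<>();

    public Node(boolean isText, String value) {
      this.isText = isText;
      this.value = value;
    }
  }

  private static final byte TAG = 0;
  private static final byte TEXT = 1;

  /** Packs the tree rooted at {@code root} into a byte array. */
  public static byte[] serialize(Node root) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      write(root, out);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Rebuilds a tree from bytes produced by {@link #serialize}. */
  public static Node deserialize(byte[] data) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      return read(in);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  private static void write(Node node, DataOutputStream out) throws IOException {
    out.writeByte(node.isText ? TEXT : TAG);
    writeString(node.value, out);
    if (node.isText) {
      return; // text nodes carry no attributes or children
    }
    out.writeInt(node.attrs.size());
    for (Map.Entry<String, String> attr : node.attrs.entrySet()) {
      writeString(attr.getKey(), out);
      writeString(attr.getValue(), out);
    }
    out.writeInt(node.children.size());
    for (Node child : node.children) {
      write(child, out);
    }
  }

  private static Node read(DataInputStream in) throws IOException {
    byte kind = in.readByte();
    Node node = new Node(kind == TEXT, readString(in));
    if (node.isText) {
      return node;
    }
    int attrCount = in.readInt();
    for (int i = 0; i < attrCount; i++) {
      String name = readString(in);
      node.attrs.put(name, readString(in));
    }
    int childCount = in.readInt();
    for (int i = 0; i < childCount; i++) {
      node.children.add(read(in));
    }
    return node;
  }

  // Length-prefixed UTF-8, used instead of writeUTF to avoid its 64kB limit.
  private static void writeString(String s, DataOutputStream out) throws IOException {
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    out.writeInt(utf8.length);
    out.write(utf8);
  }

  private static String readString(DataInputStream in) throws IOException {
    byte[] utf8 = new byte[in.readInt()];
    in.readFully(utf8);
    return new String(utf8, StandardCharsets.UTF_8);
  }
}
```

A caller would build or parse a tree once, store the result of
serialize(root) in a cache such as memcache, and call deserialize(bytes) on
a hit instead of re-parsing. Because the layout is plain bytes with no Java
class information, another tool (a cache warmer, say) could write the same
format.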
