Re: Serializing parsed content and caching GadgetHtmlParsers

John Hjelmstad Fri, 03 Oct 2008 09:48:06 -0700

My interest here is for gadget rendering in its various forms, not for
makeRequest (which hopefully will stop being used for gadget rendering
purposes as proxied rendering continues to roll out). Proxied rendering
could derail all this, given that it's likely to generate new content for
each user and thus yield terrible cache hit rates. Sigh, a possible waste
of time here.


So let's ask the Caja folks. Do any of you have time to help figure out why
cajoling is faster than just using Caja's DomParser to yield a parse tree?

--John

On Thu, Oct 2, 2008 at 6:03 PM, Kevin Brown <[EMAIL PROTECTED]> wrote:

> On Thu, Oct 2, 2008 at 5:46 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
>
> > All,
> > We've had a number of discussions on this list regarding our ability to
> > get rid of rewritten-content caching altogether. The primary cost savings
> > associated with doing so, by percentage, come from avoiding the re-parsing
> > of gadget contents in order to apply rewriter passes on them (which
> > themselves are typically very cheap, in the sub-1ms range for reasonably
> > large input).
>
>
> The primary cost is for parsing content that isn't cacheable to begin with
> because it changes every request (proxied gadget renders, makeRequest,
> etc.)
>
> Until we can get a very fast parser, we can't actually do the more complex
> optimizations that a parse tree facilitates, so we're stuck with
> string-based manipulations anyway.
>
> The real thing we should be investigating is why it takes 25ms to use the
> parser on buddypoke when it only takes 10ms to cajole it.
>
>
> >
> > With this in mind, I've written and submitted r701267, which provides
> > custom serialization and deserialization routines for parsed content,
> > along with a helper base class for any GadgetHtmlParser choosing to
> > support caching.
> >
> > In coming to this solution, I implemented three mechanisms: Java
> > serialization, overridden Java serialization routines
> > (writeObject/readObject), and finally a simplified, ad hoc byte-packed
> > routine. Standard and overridden Java serialization results were
> > virtually identical.
> >
> > I ran each serialization/deserialization routine across a variety of
> > gadget contents. In sum:
> > * Custom serialization measured 10-30% more efficient in space. Space
> > savings largely came from lack of Java class information and other
> > metadata, so are more pronounced for highly structured content.
> > * Custom serialization measured 30-40% faster than Java's, and
> > deserialization was 40-50% faster.
> >
> > As one example, I took the BuddyPoke gadget's canvas view contents and
> > ran them through these routines, as well as through CajaHtmlParser.
> > Results:
> > * CajaHtmlParser average parse time = 25ms.
> > * Java serialization average = 2.25ms; deserialization = 3.35ms; size =
> > 35kB.
> > * Custom serialization average = 1.25ms; deserialization = 2.3ms; size =
> > 30kB.
> >
> > So I removed the Java serialization impl and stuck with custom. This has
> > the corollary minor benefit that different tools can easily write and
> > read the same format - consider a cache warmer job for instance.
> >
> > Given these results, combined with fast, relatively cheap caching by
> > things like memcache, I'm encouraged that we're getting close to where we
> > can remove rewritten content caching altogether. Per several previous
> > comments, many rewriting passes simply can't be cached anyway. The
> > remainder are extremely cheap given a low-cost parse tree.
> >
> > The biggest risk with caching content in this way is the universe of
> > possible input. Now seems like the time we should reduce that, by finally
> > going ahead with our long-proposed plan to allow hangman variable
> > substitution only in String contexts (HTML attributes, cdata, and text
> > nodes). Assuming we reach agreement on this, we can hook up parsed
> > content caching and implement all existing rewriting operations in terms
> > of a parse tree with relatively low cost.
> >
> > In the meantime, I still plan to enable this for CajaHtmlParser, since
> > the parse tree is only used in opt-in fashion today by "new" gadgets that
> > don't use __UP substitution in structural elements. I'm also inclined to
> > get rid of rewritten content caching, since it's largely useless today.
> > I'd be interested to hear others' opinions on this.
> >
> > --John
> >
>
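
For readers following the thread: the "simplified, ad hoc byte-packed
routine" mentioned above is not shown in the message. What follows is only
a minimal sketch of the general idea, under loud assumptions: Node and
PackedTreeCodec are hypothetical names, the two-kind node model (tags and
text) is a simplification, and the byte layout is invented for
illustration. It is not the format committed in r701267. It writes counts
and length-prefixed UTF-8 strings directly, which is where the absence of
Java class metadata comes from.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical parse-tree node (illustration only, not a Shindig class). */
final class Node {
  static final byte TAG = 0;
  static final byte TEXT = 1;

  final byte kind;
  final String value; // tag name for TAG, character data for TEXT
  final Map<String, String> attrs = new LinkedHashMap<>();
  final List<Node> children = new ArrayList<>();

  Node(byte kind, String value) {
    this.kind = kind;
    this.value = value;
  }
}

/** Hypothetical byte-packed codec: no class names or field descriptors are written. */
final class PackedTreeCodec {
  static byte[] serialize(List<Node> roots) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      writeNodes(out, roots);
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  static List<Node> deserialize(byte[] data) {
    try {
      return readNodes(new DataInputStream(new ByteArrayInputStream(data)));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Layout per node: kind byte, value string, then (tags only) attributes and children.
  private static void writeNodes(DataOutputStream out, List<Node> nodes) throws IOException {
    out.writeInt(nodes.size());
    for (Node n : nodes) {
      out.writeByte(n.kind);
      writeString(out, n.value);
      if (n.kind == Node.TAG) {
        out.writeInt(n.attrs.size());
        for (Map.Entry<String, String> e : n.attrs.entrySet()) {
          writeString(out, e.getKey());
          writeString(out, e.getValue());
        }
        writeNodes(out, n.children);
      }
    }
  }

  private static List<Node> readNodes(DataInputStream in) throws IOException {
    int count = in.readInt();
    List<Node> nodes = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      byte kind = in.readByte();
      Node n = new Node(kind, readString(in));
      if (kind == Node.TAG) {
        int attrCount = in.readInt();
        for (int a = 0; a < attrCount; a++) {
          String name = readString(in);
          n.attrs.put(name, readString(in));
        }
        n.children.addAll(readNodes(in));
      }
      nodes.add(n);
    }
    return nodes;
  }

  // Length-prefixed UTF-8 avoids the 64KB limit of DataOutputStream.writeUTF.
  private static void writeString(DataOutputStream out, String s) throws IOException {
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    out.writeInt(utf8.length);
    out.write(utf8);
  }

  private static String readString(DataInputStream in) throws IOException {
    byte[] utf8 = new byte[in.readInt()];
    in.readFully(utf8);
    return new String(utf8, StandardCharsets.UTF_8);
  }
}
```

Because a layout like this depends only on primitive reads and writes, a
separate tool such as a cache warmer could produce or consume the same
bytes without sharing Java classes with the server.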
