On Wed, Sep 24, 2008 at 10:35 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:

> I'm confused. You currently execute, in the new rendering pass,
> RenderingContentRewriter directly.


No, we don't. RenderingContentRewriter is not currently used by anything
other than tests. It won't be bound until GadgetServer & GadgetRenderingTask
get replaced by Renderer (sometime in the next few days).


> That's fine, but the implication of this discussion is that we want it -
> and related functionality (UP substitution,
> message bundles, preloads, JS features) - to run as part of a single
> compound rewriting (could be renamed "processing") pass. These things sound
> pretty "core" to me, as they're right in the spec. It seems an unnecessary
> burden to require anyone constructing a server to inject all of those. But
> that's just a convenience, and not my main concern here.


Lots of classes fall into the same category (HttpFetcher, HttpCache,
GadgetSpecFactory...). It doesn't matter, because most people will probably
never write their own rewriters anyway, and will just use the default
bindings. Those people who do write rewriters aren't going to forget to
include an important rewriter.
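To make the "default bindings" point concrete, here is a minimal pure-Java sketch of the pattern being argued for: the default ("core") rewriter list lives in one place, and a custom configuration reuses it and appends its own rewriters. Every name here (`CustomServerConfig`, `defaultRewriters`, the placeholder substitutions) is hypothetical and stands in for Shindig's actual Guice bindings.

```java
import java.util.ArrayList;
import java.util.List;

public class CustomServerConfig {
    // Hypothetical stand-in for Shindig's ContentRewriter interface.
    interface ContentRewriter {
        String rewrite(String content);
    }

    // One place that assembles the default ("core") rewriter list, so a
    // custom configuration can reuse it instead of re-listing every rewriter.
    static List<ContentRewriter> defaultRewriters() {
        List<ContentRewriter> rewriters = new ArrayList<>();
        rewriters.add(c -> c.replace("__UP_title__", "Hello")); // stand-in for UP substitution
        rewriters.add(c -> c.replace("__MSG_greet__", "Hi"));   // stand-in for message bundles
        return rewriters;
    }

    static String apply(List<ContentRewriter> rewriters, String content) {
        for (ContentRewriter r : rewriters) {
            content = r.rewrite(content);
        }
        return content;
    }

    public static void main(String[] args) {
        // A custom server starts from the defaults and appends its own rewriter,
        // so it can't accidentally drop a core one.
        List<ContentRewriter> rewriters = new ArrayList<>(defaultRewriters());
        rewriters.add(c -> c + "<!-- custom -->");
        System.out.println(apply(rewriters, "__UP_title__ / __MSG_greet__"));
        // prints: Hello / Hi<!-- custom -->
    }
}
```

With Guice the same shape falls out of binding `List<ContentRewriter>` in a shared module that custom modules import or override.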


> >
> > these things are cacheable, others are not. You can't possibly flip
> > between the two, so you always have to do cacheable things first. Even if
> > you really have a compelling case for TTLs on cacheable things, you still
> > have to group all the cacheable things together unless you're going to
> > create separate cache entries for each stage, which is horribly wasteful.
>
> Yes, it's wasteful, which is why you'd batch as many cacheable things
> together as possible. If for some reason we found that there's a
> significant interleaving of cacheable and uncacheable rewriter passes, then
> our conversation is moot: in that case it's not worth caching rewriter
> passes at all. But it still may be a good idea to cache parse trees, an
> orthogonal discussion to this.


Yes, caching parse trees instead of doing a full parse each time might be a
useful optimization, but interleaving is still impossible, and probably not
even worth worrying about. As long as the order of cacheable vs. uncacheable
operations is documented, anyone writing rewriters will be sure to write
them with that in mind. We don't have to address every potential edge case.
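The "cacheable things first" ordering both sides agree on can be sketched in a few lines of plain Java. This is illustrative only, not Shindig's API: each hypothetical rewriter declares whether its output is cacheable, the pipeline runs the whole cacheable prefix once per cache key, and the uncacheable suffix (per-request substitutions and the like) runs on every request.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

public class TwoPhasePipeline {
    // Hypothetical interface: rewriters declare whether their output is cacheable.
    interface Rewriter {
        String rewrite(String content);
        boolean cacheable();
    }

    private final Map<String, String> cache = new HashMap<>();
    int cacheableRuns = 0; // counts how often the cacheable prefix actually executes

    String process(String key, String content, List<Rewriter> rewriters) {
        // Phase 1: cacheable rewriters run once per key; one cache entry
        // covers the whole prefix, not one entry per stage.
        String intermediate = cache.get(key);
        if (intermediate == null) {
            intermediate = content;
            for (Rewriter r : rewriters) {
                if (r.cacheable()) {
                    intermediate = r.rewrite(intermediate);
                }
            }
            cacheableRuns++;
            cache.put(key, intermediate);
        }
        // Phase 2: uncacheable rewriters (per-request work) run every time.
        String result = intermediate;
        for (Rewriter r : rewriters) {
            if (!r.cacheable()) {
                result = r.rewrite(result);
            }
        }
        return result;
    }

    // Small helper to build a Rewriter from a flag and a function.
    static Rewriter rewriter(boolean cacheable, UnaryOperator<String> fn) {
        return new Rewriter() {
            public String rewrite(String c) { return fn.apply(c); }
            public boolean cacheable() { return cacheable; }
        };
    }

    public static void main(String[] args) {
        TwoPhasePipeline p = new TwoPhasePipeline();
        List<Rewriter> rs = List.of(
            rewriter(true, c -> c.toUpperCase()),  // cacheable stage
            rewriter(false, c -> c + " @req"));    // per-request stage
        System.out.println(p.process("spec1", "hello", rs)); // HELLO @req
        System.out.println(p.process("spec1", "hello", rs)); // HELLO @req, prefix from cache
        System.out.println(p.cacheableRuns);                 // 1
    }
}
```

Interleaving (cacheable, uncacheable, cacheable again) is exactly what this shape forbids, which is why documenting the phase order is enough.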


> >
> >
> > iv is a generalization of iii, and it's unnecessary for the same reason.
>
> I'm not convinced iii/iv are overly complicated. I could achieve them with
> very little code and not a whole lot of complexity.


The *current* code is already overly complicated for what it's doing. This
is shaping up to be a repeat of GadgetFeature. There's no current need for
this extra level of complexity, so why make code that's going to be hard to
maintain and understand even harder? What real value does grouping arbitrary
rewriters together offer, other than making Guice bindings much more
complicated (you'll now have to inject rewriters into rewriter groups
instead of just injecting rewriters)? The end result is the same, because
you can't cache the results of each group individually.


> If default/common configuration is in a single Guice module that people can
> use directly, that's great. But as presently written, a single global
> List<ContentRewriter> is injected into the Registry, which means that in
> order to inject any custom rewriter, you need to inject all rewriters.
> Again, this is important only if we run what I'm calling "core" rewriter
> passes through a single unified Registry rewriting pass.


Again, this doesn't matter. If someone writes a custom binding and forgets
to bind an important rewriter, they'll see the problem immediately. This is
*NOT* a problem unique to rewriters. It also impacts both auth handlers and
social data handlers. It hasn't been a problem there, and it won't be a
problem here because far fewer people will ever write custom rewriters. Most
people will have to write custom social data handlers. Many people will
write custom auth handlers.


> > Option 1 may very well be viable, but nobody can know until there are
> > real benchmarks. Parsing isn't a free operation, but it's not the most
> > expensive thing either (it pales in comparison to, say, RSA signed
> > fetch). Option 2 (in any variant) is workable, but all variants require
> > huge compromises in what we can do to the document. If it isn't
> > cacheable, and parsing is expensive, you're going to wind up with ugly
> > string manipulation like you see in RenderingContentRewriter right now.
> >
> > Once again, I think our best bet is to stop speculating and do some real
> > benchmarks, figuring out what the latency and CPU consumption is when
> > using the parser. There are many, many projects that parse HTML documents
> > in real time -- there's little reason why we shouldn't be able to.
>
> Agreed that we should be as data-driven as possible. My numbers from before
> give a good sense for parsing cost IMO. Granted, CPU utilization wasn't
> included in them. The conclusion I drew is that Caja's parser is O(C + n^x)
> for some x > 1 in cost. 1K of content = 1.5ms. 10K = 6.5ms. 50K = 121ms.
> 100K = 464ms. So that covers at least a good chunk of latency.


>
> In any case, these numbers will ultimately answer only one question: is it
> worth doing content caching in the first place, or not? If not, we're all
> wasting our time :) I'm assuming the answer is possibly yes (as has been
> asserted on this list a few times, and implied by the very existence of
> this conversation), and want to build just enough infrastructure to
> accommodate the classes of need I've identified.
>
> In sum:
> 1. I'm willing to build 2.ii or 2.iv right now. I prefer 2.iv since it
> involves no change to the CR interface.


Why does it matter if it requires a change to this interface? Nobody is
using this code, and they're not going to until this thing has stabilized.

And it doesn't change CR, but it changes everything else involved in the
pipeline.


> 2. I also wonder if we should just add a caching layer to the HTML parser.
> Key = MD5 of String. Value = tree. Only time will tell us which caching
> layer(s) we really need. IMO it's easier to add the capability of this
> cache than to spend the time hypothesizing about future rewriters - the
> performance of "core" ones in particular - and myriad execution
> environments (different caches, JVMs, CPU, etc.) figuring out whether we
> need to enable it.
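The "Key = MD5 of String, Value = tree" idea is simple enough to sketch in a few lines of plain Java. Everything here is hypothetical: the "parse tree" is just a token list standing in for a real DOM, and a production version would bound the cache (e.g. LRU) rather than use a plain `HashMap`.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ParseCache {
    // Unbounded for illustration only; a real cache would evict entries.
    private final Map<String, List<String>> cache = new HashMap<>();
    int parses = 0; // counts actual (cache-missing) parses

    // Hex MD5 of the content string, used as the cache key.
    static String md5(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // MD5 is guaranteed present in the JRE
        }
    }

    List<String> parse(String content) {
        return cache.computeIfAbsent(md5(content), k -> {
            parses++;
            // Stand-in for a real HTML parse producing a tree.
            return List.of(content.split("\\s+"));
        });
    }

    public static void main(String[] args) {
        ParseCache pc = new ParseCache();
        System.out.println(pc.parse("<b>hi</b> there")); // [<b>hi</b>, there]
        pc.parse("<b>hi</b> there");                     // same content, no reparse
        System.out.println(pc.parses);                   // 1
    }
}
```

Note the trade-off: every lookup still pays an MD5 over the full content string, which is cheap relative to the parse costs quoted above but not free.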


That doesn't address the more fundamental problem of needing to have a parse
tree in scope for other operations. It is definitely not just the
optimization pass that needs to have a parse tree.


>
>
> --John
>
