2.ii sounds good to me, though I still think just going for option 1 is even
simpler.

On Thu, Sep 25, 2008 at 1:17 AM, John Hjelmstad <[EMAIL PROTECTED]> wrote:

> Putting pedantry aside, it sounds like 2.ii is palatable to both of us.
> Agreed? If so, I'm happy to implement it. Parse tree optimizations can
> happen later - where by later I mean "as soon as immediately, but
> orthogonal
> to this discussion" :)
> John
>
> On Wed, Sep 24, 2008 at 4:08 PM, Kevin Brown <[EMAIL PROTECTED]> wrote:
>
> > On Wed, Sep 24, 2008 at 10:35 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
> >
> > > I'm confused. You currently execute, in the new rendering pass,
> > > RenderingContentRewriter directly.
> >
> >
> > No, we don't. RenderingContentRewriter is not currently used by anything
> > other than tests. It won't be bound until GadgetServer &
> > GadgetRenderingTask
> > get replaced by Renderer (sometime in the next few days).
> >
> >
> > > That's fine, but the implication of this discussion is that we want
> > > it - and related functionality (UP substitution, message bundles,
> > > preloads, JS features) - to run as part of a single compound
> > > rewriting (could be renamed "processing") pass. These things sound
> > > pretty "core" to me, as they're right in the spec. It seems an
> > > unnecessary burden to require that anyone constructing a server need
> > > to inject all those. But that's just a convenience, and not my main
> > > concern here.
> >
> >
> > Lots of classes fall into the same category (HttpFetcher, HttpCache,
> > GadgetSpecFactory...). It doesn't matter, because most people will
> > probably never write their own rewriters anyway, and will just use the
> > default bindings. Those people who do write rewriters aren't going to
> > forget to include an important rewriter.
> >
> >
> > > >
> > > > these things are cacheable, others are not. You can't possibly
> > > > flip between the two, so you always have to do cacheable things
> > > > first. Even if you really have a compelling case for TTLs on
> > > > cacheable things, you still have to group all the cacheable things
> > > > together unless you're going to create separate cache entries for
> > > > each stage, which is horribly wasteful.
> > >
> > > Yes, it's wasteful, which is why you'd batch as many cacheable
> > > things together as possible. If for some reason we found that
> > > there's a significant interleaving of cacheable and uncacheable
> > > rewriter passes, then our conversation is moot: in that case it's
> > > not worth caching rewriter passes at all. But it still may be a good
> > > idea to cache parse trees, an orthogonal discussion to this.
> >
> >
> > Yes, caching parse trees instead of a full parse might be a useful
> > optimization, but interleaving is still impossible, and probably not
> > even worth worrying about. As long as the order of cacheable vs.
> > uncacheable operations is documented, anyone writing rewriters will be
> > sure to write them with that in mind. We don't have to address every
> > potential edge case.
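[The "cacheable things first" ordering being debated above can be sketched in a few lines of Java. This is a hypothetical illustration, not Shindig's actual ContentRewriter API: the Rewriter interface, the isCacheable() flag, and the single-entry-per-document cache are all stand-ins for the design under discussion.]

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: run all cacheable rewriters first, cache that one
// combined result, then apply uncacheable rewriters on every request.
// This is the "group all the cacheable things together" ordering from
// the thread; names are illustrative, not Shindig's real interfaces.
interface Rewriter {
    String rewrite(String content);
    boolean isCacheable();
}

class RewriterPipeline {
    private final List<Rewriter> rewriters;
    // One cache entry per document, not one per stage - the key point
    // made above about avoiding wasteful per-stage cache entries.
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    RewriterPipeline(List<Rewriter> rewriters) {
        this.rewriters = rewriters;
    }

    String process(String cacheKey, String content) {
        String cached = cache.get(cacheKey);
        if (cached == null) {
            cached = content;
            for (Rewriter r : rewriters) {
                if (r.isCacheable()) {
                    cached = r.rewrite(cached);
                }
            }
            cache.put(cacheKey, cached);
        }
        // Uncacheable passes always run on top of the cached result,
        // which is why interleaving the two kinds is impossible.
        String result = cached;
        for (Rewriter r : rewriters) {
            if (!r.isCacheable()) {
                result = r.rewrite(result);
            }
        }
        return result;
    }
}
```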
> >
> >
> > > >
> > > >
> > > > iv is a generalization of iii, and it's unnecessary for the same
> > > > reason.
> > >
> > > I'm not convinced iii/iv are overly complicated. I could achieve
> > > them with very little code and not a whole lot of complexity.
> >
> >
> > The *current* code is already overly complicated for what it's doing.
> > This is shaping up to be a repeat of GadgetFeature. There's no current
> > need for this extra level of complexity, so why make code that's going
> > to be hard to maintain and understand even harder? What real value
> > does grouping arbitrary rewriters together offer, other than making
> > Guice bindings much more complicated (you'll now have to inject
> > rewriters into rewriter groups instead of just injecting rewriters)?
> > The end result is the same, because you can't cache the results of
> > each group individually.
> >
> >
> > > If default/common configuration is in a single Guice module that
> > > people can use directly, that's great. But as presently written, a
> > > single global List<ContentRewriter> is injected into the Registry,
> > > which means that in order to inject any custom rewriter, you need to
> > > inject all rewriters. Again, this is important only if we run what
> > > I'm calling "core" rewriter passes through a single unified Registry
> > > rewriting pass.
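[The binding concern above - "to inject any custom rewriter, you need to inject all rewriters" - can be sidestepped by exposing the default list and letting a custom server append to it. The sketch below is hypothetical plain Java, not Shindig's actual wiring; in Guice itself a Multibinder plays this role. All names here are illustrative.]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: expose the default rewriter list so a custom
// server can add one rewriter without re-declaring the whole
// List<ContentRewriter>. Names are stand-ins, not Shindig's real API.
interface ContentRewriter {
    String rewrite(String content);
}

final class RewriterBindings {
    // The "core" rewriters everyone gets by default.
    static List<ContentRewriter> defaults() {
        List<ContentRewriter> list = new ArrayList<>();
        // Stand-in for a spec-mandated pass such as UP substitution.
        list.add(c -> c.replace("__UP_name__", "Bob"));
        return list;
    }

    // A custom server appends to the defaults rather than replacing
    // them, so it cannot accidentally drop a core rewriter.
    static List<ContentRewriter> withCustom(ContentRewriter custom) {
        List<ContentRewriter> list = defaults();
        list.add(custom);
        return list;
    }
}
```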
> >
> >
> > Again, this doesn't matter. If someone writes a custom binding and
> > forgets to bind an important rewriter, they'll see the problem
> > immediately. This is *NOT* a problem unique to rewriters. It also
> > impacts both auth handlers and social data handlers. It hasn't been a
> > problem there, and it won't be a problem here because far fewer people
> > will ever write custom rewriters. Most people will have to write
> > custom social data handlers. Many people will write custom auth
> > handlers.
> >
> >
> > > > Option 1 may very well be viable, but nobody can know until there
> > > > are real benchmarks. Parsing isn't a free operation, but it's not
> > > > the most expensive thing either (it pales in comparison to, say,
> > > > RSA signed fetch). Option 2 (in any variant) is workable, but all
> > > > variants require huge compromises in what we can do to the
> > > > document. If it isn't cacheable, and parsing is expensive, you're
> > > > going to wind up with ugly string manipulation like you see in
> > > > RenderingContentRewriter right now.
> > > >
> > > > Once again, I think our best bet is to stop speculating and do
> > > > some real benchmarks, figuring out what the latency and CPU
> > > > consumption is when using the parser. There are many, many
> > > > projects that parse HTML documents in real time -- there's little
> > > > reason why we shouldn't be able to.
> > >
> > > Agreed that we should be as data-driven as possible. My numbers
> > > from before give a good sense for parsing cost IMO. Granted, CPU
> > > utilization wasn't included in them. The conclusion I drew is that
> > > Caja's parser is O(C + n^x), with x > 1, in cost. 1K of content =
> > > 1.5ms. 10K = 6.5ms. 50K = 121ms. 100K = 464ms. So that covers at
> > > least a good chunk of latency.
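[The superlinear claim above can be sanity-checked from the quoted figures alone: the slope of log(time) vs. log(size) between adjacent samples approximates the local exponent x in t ~ C + n^x. A minimal sketch; only the latency numbers quoted in this thread are used.]

```java
// Sanity check of the exponent claim, using only the figures quoted in
// the thread: 1K = 1.5ms, 10K = 6.5ms, 50K = 121ms, 100K = 464ms.
final class ParserCostFit {
    // Local exponent estimate: log(t2/t1) / log(n2/n1).
    static double slope(double n1, double t1, double n2, double t2) {
        return Math.log(t2 / t1) / Math.log(n2 / n1);
    }

    public static void main(String[] args) {
        // Between 50K and 100K the constant term C is negligible, so the
        // slope approaches the true exponent: ~1.94, i.e. x > 1, roughly
        // quadratic at these sizes.
        System.out.println(slope(50, 121, 100, 464));
    }
}
```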
> >
> >
> > >
> > > In any case, these numbers will ultimately answer only one
> > > question: is it worth doing content caching in the first place, or
> > > not? If not, we're all wasting our time :) I'm assuming the answer
> > > is possibly yes (as has been asserted on this list a few times, and
> > > implied by the very existence of this conversation), and want to
> > > build just enough infrastructure to accommodate the classes of need
> > > I've identified.
> > >
> > > In sum:
> > > 1. I'm willing to build 2.ii or 2.iv right now. I prefer 2.iv since
> > > it involves no change to the CR interface.
> >
> >
> > Why does it matter if it requires a change to this interface? Nobody
> > is using this code, and they're not going to until this thing has
> > stabilized.
> >
> > And it doesn't change CR, but it changes everything else involved in
> > the pipeline.
> >
> >
> > > 2. I also wonder if we should just add a caching layer to the HTML
> > > parser. Key = MD5 of String. Value = tree. Only time will tell us
> > > which caching layer(s) we really need. IMO it's easier to add the
> > > capability of this cache than to spend the time hypothesizing about
> > > future rewriters - the performance of "core" ones in particular -
> > > and myriad execution environments (different caches, JVMs, CPU,
> > > etc.) figuring out whether we need to enable it.
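[The "Key = MD5 of String, Value = tree" cache proposed above is small enough to sketch directly. This is a hypothetical illustration: the generic tree type and the parser function are stand-ins, not Caja's or Shindig's real parsing API.]

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of the MD5-keyed parse cache proposed above.
// The tree type T and the parser function are stand-ins for a real
// HTML parser such as Caja's; this is not its actual API.
final class ParseCache<T> {
    private final Map<String, T> cache = new ConcurrentHashMap<>();
    private final Function<String, T> parser;

    ParseCache(Function<String, T> parser) {
        this.parser = parser;
    }

    T parse(String html) {
        // Key = MD5 of the raw content string, as suggested in the
        // thread; identical markup parses only once.
        return cache.computeIfAbsent(md5(html), k -> parser.apply(html));
    }

    static String md5(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is always available", e);
        }
    }
}
```

As Kevin notes in the reply that follows, a cache like this only saves the parse itself; it does not solve the separate problem of getting a parse tree in scope for the other pipeline operations.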
> >
> >
> > That doesn't address the more fundamental problem of needing to have
> > a parse tree in scope for other operations. It is definitely not just
> > the optimization pass that needs to have a parse tree.
> >
> >
> > >
> > >
> > > --John
> > >
> >
>
