Putting pedantry aside, it sounds like 2.ii is palatable to both of us.
Agreed? If so, I'm happy to implement it. Parse tree optimizations can
happen later - where by later I mean "as soon as immediately, but orthogonal
to this discussion" :)
John

On Wed, Sep 24, 2008 at 4:08 PM, Kevin Brown <[EMAIL PROTECTED]> wrote:

> On Wed, Sep 24, 2008 at 10:35 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
>
> > I'm confused. You currently execute, in the new rendering pass,
> > RenderingContentRewriter directly.
>
>
> No, we don't. RenderingContentRewriter is not currently used by anything
> other than tests. It won't be bound until GadgetServer & GadgetRenderingTask
> get replaced by Renderer (sometime in the next few days).
>
>
> > That's fine, but the implication of this discussion is that we want it -
> > and related functionality (UP substitution, message bundles, preloads, JS
> > features) - to run as part of a single compound rewriting (could be renamed
> > "processing") pass. These things sound pretty "core" to me, as they're right
> > in the spec. It seems an unnecessary burden to require that anyone
> > constructing a server need to inject all those. But that's just a
> > convenience, and not my main concern here.
>
>
> Lots of classes fall into the same category (HttpFetcher, HttpCache,
> GadgetSpecFactory...). It doesn't matter, because most people will probably
> never write their own rewriters anyway, and will just use the default
> bindings. Those people who do write rewriters aren't going to forget to
> include an important rewriter.
>
>
> > >
> > > these things are cacheable, others are not. You can't possibly flip between
> > > the two, so you always have to do cacheable things first. Even if you really
> > > have a compelling case for TTLs on cacheable things, you still have to group
> > > all the cacheable things together unless you're going to create separate
> > > cache entries for each stage, which is horribly wasteful.
> >
> > Yes, it's wasteful, which is why you'd batch as many cacheable things
> > together as possible. If for some reason we found that there's a significant
> > interleaving of cacheable and uncacheable rewriter passes, then our
> > conversation is moot: in that case it's not worth caching rewriter passes at
> > all. But it still may be a good idea to cache parse trees, an orthogonal
> > discussion to this.
>
>
> Yes, caching parse trees instead of re-parsing each time might be a useful
> optimization, but interleaving is still impossible, and probably not even
> worth worrying about. As long as the order of cacheable vs. uncacheable
> operations is documented, anyone writing rewriters will be sure to write
> them with that in mind. We don't have to address every potential edge case.
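
To make that ordering concrete, here is a minimal sketch of a two-stage
pipeline along those lines. The Rewriter interface and class names below are
simplified placeholders rather than the actual ContentRewriter API, and the
cache is just an in-memory map:

  import java.util.List;
  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  // Simplified stand-in for the ContentRewriter interface (signature assumed
  // for illustration; the real interface works on a Gadget/MutableContent).
  interface Rewriter {
    String rewrite(String content);
  }

  // Sketch of the ordering discussed above: all cacheable rewriters run as a
  // single group whose result is cached once, and the uncacheable
  // (per-request) rewriters always run afterwards on top of that result.
  class TwoStageRewriterPipeline {
    private final List<Rewriter> cacheable;
    private final List<Rewriter> uncacheable;
    private final Map<String, String> cache =
        new ConcurrentHashMap<String, String>();

    TwoStageRewriterPipeline(List<Rewriter> cacheable,
                             List<Rewriter> uncacheable) {
      this.cacheable = cacheable;
      this.uncacheable = uncacheable;
    }

    String rewrite(String original) {
      String base = cache.get(original);
      if (base == null) {
        base = original;
        for (Rewriter r : cacheable) {
          base = r.rewrite(base);
        }
        cache.put(original, base);   // one cache entry covers the whole group
      }
      for (Rewriter r : uncacheable) {
        base = r.rewrite(base);      // per-request passes (user prefs, bundles, ...)
      }
      return base;
    }
  }

The point of grouping is visible in the cache: the cacheable stage produces
exactly one entry per input, and the per-request stage never feeds back into it.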
>
>
> > >
> > >
> > > iv is a generalization of iii, and it's unnecessary for the same reason.
> >
> > I'm not convinced iii/iv are overly complicated. I could achieve them with
> > very little code and not a whole lot of complexity.
>
>
> The *current* code is already overly complicated for what it's doing. This
> is shaping up to be a repeat of GadgetFeature. There's no current need for
> this extra level of complexity, so why make code that's going to be hard to
> maintain and understand even harder? What real value does grouping arbitrary
> rewriters together offer, other than making Guice bindings much more
> complicated (you'll now have to inject rewriters into rewriter groups
> instead of just injecting rewriters)? The end result is the same, because
> you can't cache the results of each group individually.
>
>
> > If default/common configuration is in a single Guice module that people can
> > use directly, that's great. But as presently written, a single global
> > List<ContentRewriter> is injected into the Registry, which means that in
> > order to inject any custom rewriter, you need to inject all rewriters.
> > Again, this is important only if we run what I'm calling "core" rewriter
> > passes through a single unified Registry rewriting pass.
>
>
> Again, this doesn't matter. If someone writes a custom binding and forgets
> to bind an important rewriter, they'll see the problem immediately. This is
> *NOT* a problem unique to rewriters. It also impacts both auth handlers and
> social data handlers. It hasn't been a problem there, and it won't be a
> problem here because far fewer people will ever write custom rewriters. Most
> people will have to write custom social data handlers. Many people will
> write custom auth handlers.
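
For what it's worth, a custom binding along these lines might look roughly like
the following. This is only a sketch: MyCachingRewriter is a hypothetical
custom rewriter, and the module layout is illustrative rather than the actual
Shindig Guice configuration.

  import java.util.List;
  import com.google.common.collect.ImmutableList;
  import com.google.inject.AbstractModule;
  import com.google.inject.Provides;
  import com.google.inject.Singleton;

  // Illustrative custom module: anyone overriding the rewriter list re-binds
  // List<ContentRewriter> themselves, including the "core" rewriters plus
  // their own additions.
  public class CustomRewriterModule extends AbstractModule {
    @Override
    protected void configure() {
      // Nothing else to bind here; the list is provided below.
    }

    @Provides @Singleton
    List<ContentRewriter> provideRewriters(RenderingContentRewriter core,
                                           MyCachingRewriter custom) {
      // Forgetting a core rewriter here shows up immediately at render time.
      return ImmutableList.<ContentRewriter>of(core, custom);
    }
  }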
>
>
> > > Option 1 may very well be viable, but nobody can know until there are real
> > > benchmarks. Parsing isn't a free operation, but it's not the most expensive
> > > thing either (it pales in comparison to, say, RSA signed fetch). Option 2
> > > (in any variant) is workable, but all variants require huge compromises in
> > > what we can do to the document. If it isn't cacheable, and parsing is
> > > expensive, you're going to wind up with ugly string manipulation like you
> > > see in RenderingContentRewriter right now.
> > >
> > > Once again, I think our best bet is to stop speculating and do some real
> > > benchmarks, figuring out what the latency and CPU consumption is when using
> > > the parser. There are many, many projects that parse HTML documents in real
> > > time -- there's little reason why we shouldn't be able to.
> >
> > Agreed that we should be as data-driven as possible. My numbers from before
> > give a good sense for parsing cost IMO. Granted, CPU utilization wasn't
> > included in them. The conclusion I drew is that Caja's parser is
> > O(C + n^(x > 1)) in cost. 1K of content = 1.5ms. 10K = 6.5ms. 50K = 121ms.
> > 100K = 464ms. So that covers at least a good chunk of latency.
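
As a rough sanity check on those numbers: 10K -> 100K is about 71x more time
for 10x more content, i.e. an exponent of log(71)/log(10) ~ 1.85, and
10K -> 50K gives log(18.6)/log(5) ~ 1.8, while the 1K -> 10K step is dominated
by the constant term. So the data is at least consistent with an exponent in
the neighborhood of 1.8.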
>
>
> >
> > In any case, these numbers will ultimately answer only one question: is it
> > worth doing content caching in the first place, or not? If not, we're all
> > wasting our time :) I'm assuming the answer is possibly yes (as has been
> > asserted on this list a few times, and implied by the very existence of this
> > conversation), and want to build just enough infrastructure to accommodate
> > the classes of need I've identified.
> >
> > In sum:
> > 1. I'm willing to build 2.ii or 2.iv right now. I prefer 2.iv since it
> > involves no change to the CR interface.
>
>
> Why does it matter if it requires a change to this interface? Nobody is
> using this code, and they're not going to until this thing has stabilized.
>
> And it doesn't change CR, but it changes everything else involved in the
> pipeline.
>
>
> > 2. I also wonder if we should just add a caching layer to the HTML parser.
> > Key = MD5 of String. Value = tree. Only time will tell us which caching
> > layer(s) we really need. IMO it's easier to add the capability of this cache
> > than to spend the time hypothesizing about future rewriters - the
> > performance of "core" ones in particular - and myriad execution environments
> > (different caches, JVMs, CPU, etc.) figuring out whether we need to enable
> > it.
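
A minimal sketch of what that cache could look like, assuming a generic parser
interface (HtmlParser and ParseTree are placeholders for whichever parser we
end up with, and cache eviction is ignored here):

  import java.nio.charset.StandardCharsets;
  import java.security.MessageDigest;
  import java.security.NoSuchAlgorithmException;
  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  interface ParseTree {}                                     // placeholder tree type
  interface HtmlParser { ParseTree parse(String content); }  // placeholder parser

  // Sketch of the parse-tree cache idea: key = MD5 of the content string,
  // value = the parsed tree.
  class CachingHtmlParser {
    private final HtmlParser delegate;
    private final Map<String, ParseTree> cache =
        new ConcurrentHashMap<String, ParseTree>();

    CachingHtmlParser(HtmlParser delegate) {
      this.delegate = delegate;
    }

    ParseTree parse(String content) {
      String key = md5Hex(content);
      ParseTree tree = cache.get(key);
      if (tree == null) {
        tree = delegate.parse(content);
        cache.put(key, tree);
      }
      return tree;
    }

    private static String md5Hex(String content) {
      try {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
          hex.append(String.format("%02x", b));
        }
        return hex.toString();
      } catch (NoSuchAlgorithmException e) {
        throw new RuntimeException(e);  // MD5 is always available
      }
    }
  }

One caveat the sketch glosses over: a cached tree has to be treated as
read-only (or cloned) before any rewriter mutates it, otherwise one request's
rewrites leak into the next.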
>
>
> That doesn't address the more fundamental problem of needing to have a parse
> tree in scope for other operations. It is definitely not just the
> optimization pass that needs to have a parse tree.
>
>
> >
> >
> > --John
> >
>
