On Thu, Sep 25, 2008 at 9:18 AM, Louis Ryan <[EMAIL PROTECTED]> wrote:

> I think it's worth pointing out some observations about the code as it
> stands today that will hopefully make our decisions easier.
>
> 1. There is essentially no difference between the ContentRewriter interface
> and the ContentRewriterRegistry interface. I'm fairly certain we can
> eliminate ContentRewriterRegistry.


There is one substantial difference: the registry is responsible for parsing
the document before handing it to the rewriters. This means callers have a
single dependency (on the registry) instead of dependencies on the parser and
every rewriter. It doesn't do much, but it's important. The alternative would
be for all three places that need to do rewriting (makeRequest, the open
proxy / concat, and rendering) to both parse the source document and iterate
over all the rewriters.
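To make the dependency argument concrete, here is a minimal sketch of that split. Only ContentRewriter and ContentRewriterRegistry are names from the thread; ParsedDocument, Parser, and the method signatures are illustrative stand-ins, not Shindig's actual API.

```java
import java.util.List;

// Sketch of the registry/rewriter split described above. The registry owns
// the parser and the rewriter list, so the three call sites (makeRequest,
// proxy/concat, rendering) depend on the registry alone.
public class RegistrySketch {

    /** Stand-in for a real DOM / parse tree. */
    static class ParsedDocument {
        final StringBuilder content;
        ParsedDocument(String source) { this.content = new StringBuilder(source); }
    }

    /** Each rewriter receives an already-parsed document. */
    interface ContentRewriter {
        void rewrite(ParsedDocument doc);
    }

    /** Stand-in for the HTML parser dependency. */
    interface Parser {
        ParsedDocument parse(String source);
    }

    /** Parses exactly once, then runs every rewriter over the tree. */
    static class ContentRewriterRegistry {
        private final Parser parser;
        private final List<ContentRewriter> rewriters;

        ContentRewriterRegistry(Parser parser, List<ContentRewriter> rewriters) {
            this.parser = parser;
            this.rewriters = rewriters;
        }

        String rewrite(String source) {
            ParsedDocument doc = parser.parse(source);  // single parse
            for (ContentRewriter r : rewriters) {
                r.rewrite(doc);                         // each pass mutates the tree
            }
            return doc.content.toString();
        }
    }
}
```

Without the registry, each of the three call sites would carry its own parser dependency and its own copy of the iteration loop.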


>
> 2. A ContentRewriter implementation can wrap many subsidiary
> ContentRewriters.
>
> It seems like composition allows folks to implement any bundling, grouping,
> and caching their heart desires, and as long as we don't make composition
> impossible we'll be just fine. A composition of rewriters that have
> specialized caching requirements could be handled by declaring those
> requirements as interface extensions (isCacheable/getTTL) to ContentRewriter,
> but that is probably overkill. It's perfectly fine for there to be private
> contracts between a cache and a group of specialized rewriters, as they will
> in all likelihood be tightly coupled anyway. There is almost certainly value
> in providing some straightforward concrete implementations of composite
> rewriters that group usefully related individual rewriters together.
> Implementations can still mix, match, and substitute as they see fit, but a
> working reference is provided.


Yes, that's my thought exactly. If a ContentRewriter is composed of multiple
related pieces (the way RenderingContentRewriter is, for example), that
rewriter can simply invoke the appropriate sub-processes separately.
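The composite shape both posts describe could look something like the sketch below. CompositeRewriter and the simplified String-based rewrite signature are illustrative assumptions, not Shindig's actual classes.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of a composite rewriter: one ContentRewriter that delegates to an
// ordered list of sub-rewriters. Names and signatures are illustrative.
public class CompositeSketch {

    interface ContentRewriter {
        String rewrite(String content);
    }

    /** Runs its children in order; callers see a single rewriter. */
    static class CompositeRewriter implements ContentRewriter {
        private final List<ContentRewriter> children;

        CompositeRewriter(ContentRewriter... children) {
            this.children = Arrays.asList(children);
        }

        @Override
        public String rewrite(String content) {
            for (ContentRewriter child : children) {
                content = child.rewrite(content);  // each child sees the prior output
            }
            return content;
        }
    }
}
```

A RenderingRewriter composed of MsgSubstituteRewriter, JsFeatureRewriter, etc. would then be a CompositeRewriter over those instances, and a cache with a private contract can wrap the whole composite rather than each child.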


>
>
> Based on what Kevin's been doing, I'm assuming he's already following this
> model and that RenderingRewriter will compose MsgSubstituteRewriter,
> JsFeatureRewriter, etc. at some point.


You mean like the TODO says? :)


>
> As for providing a caching implementation, I agree we can defer adding it to
> the default pipeline in Shindig until there's a proven need. It's probably
> fine to keep the TTLCache in a patch on JIRA for experimentation.
>
>
> On Wed, Sep 24, 2008 at 4:57 PM, Kevin Brown <[EMAIL PROTECTED]> wrote:
>
> > 2.ii sounds good to me, though I still think just going for option 1 is
> > even simpler.
> >
> > On Thu, Sep 25, 2008 at 1:17 AM, John Hjelmstad <[EMAIL PROTECTED]>
> wrote:
> >
> > > Putting pedantry aside, it sounds like 2.ii is palatable to both of us.
> > > Agreed? If so, I'm happy to implement it. Parse tree optimizations can
> > > happen later - where by later I mean "as soon as immediately, but
> > > orthogonal
> > > to this discussion" :)
> > > John
> > >
> > > On Wed, Sep 24, 2008 at 4:08 PM, Kevin Brown <[EMAIL PROTECTED]> wrote:
> > >
> > > > On Wed, Sep 24, 2008 at 10:35 PM, John Hjelmstad <[EMAIL PROTECTED]>
> > > wrote:
> > > >
> > > > > I'm confused. You currently execute, in the new rendering pass,
> > > > > RenderingContentRewriter directly.
> > > >
> > > >
> > > > No, we don't. RenderingContentRewriter is not currently used by
> > > > anything other than tests. It won't be bound until GadgetServer &
> > > > GadgetRenderingTask get replaced by Renderer (sometime in the next
> > > > few days).
> > > >
> > > >
> > > > > That's fine, but the implication of this discussion is that we want
> > > > > it - and related functionality (UP substitution, message bundles,
> > > > > preloads, JS features) - to run as part of a single compound
> > > > > rewriting (could be renamed "processing") pass. These things sound
> > > > > pretty "core" to me, as they're right in the spec. It seems an
> > > > > unnecessary burden to require that anyone constructing a server
> > > > > need to inject all those. But that's just a convenience, and not my
> > > > > main concern here.
> > > >
> > > >
> > > > Lots of classes fall into the same category (HttpFetcher, HttpCache,
> > > > GadgetSpecFactory...). It doesn't matter, because most people will
> > > > probably never write their own rewriters anyway, and will just use
> > > > the default bindings. Those people who do write rewriters aren't
> > > > going to forget to include an important rewriter.
> > > >
> > > >
> > > > > >
> > > > > > these things are cacheable, others are not. You can't possibly
> > > > > > flip between the two, so you always have to do cacheable things
> > > > > > first. Even if you really have a compelling case for TTLs on
> > > > > > cacheable things, you still have to group all the cacheable
> > > > > > things together unless you're going to create separate cache
> > > > > > entries for each stage, which is horribly wasteful.
> > > > >
> > > > > Yes, it's wasteful, which is why you'd batch as many cacheable
> > > > > things together as possible. If for some reason we found that
> > > > > there's a significant interleaving of cacheable and uncacheable
> > > > > rewriter passes, then our conversation is moot: in that case it's
> > > > > not worth caching rewriter passes at all. But it still may be a
> > > > > good idea to cache parse trees, an orthogonal discussion to this.
> > > >
> > > >
> > > > Yes, caching parse trees instead of full parses might be a useful
> > > > optimization, but interleaving is still impossible, and probably not
> > > > even worth worrying about. As long as the order of cacheable vs.
> > > > uncacheable operations is documented, anyone writing rewriters will
> > > > be sure to write them with that in mind. We don't have to address
> > > > every potential edge case.
> > > >
> > > >
> > > > > >
> > > > > >
> > > > > > iv is a generalization of iii, and it's unnecessary for the same
> > > > > > reason.
> > > > >
> > > > > I'm not convinced iii/iv are overly complicated. I could achieve
> > > > > them with very little code and not a whole lot of complexity.
> > > >
> > > >
> > > > The *current* code is already overly complicated for what it's doing.
> > > > This is shaping up to be a repeat of GadgetFeature. There's no
> > > > current need for this extra level of complexity, so why make code
> > > > that's going to be hard to maintain and understand even harder? What
> > > > real value does grouping arbitrary rewriters together offer, other
> > > > than making Guice bindings much more complicated (you'll now have to
> > > > inject rewriters into rewriter groups instead of just injecting
> > > > rewriters)? The end result is the same, because you can't cache the
> > > > results of each group individually.
> > > >
> > > >
> > > > > If default/common configuration is in a single Guice module that
> > > > > people can use directly, that's great. But as presently written, a
> > > > > single global List<ContentRewriter> is injected into the Registry,
> > > > > which means that in order to inject any custom rewriter, you need
> > > > > to inject all rewriters. Again, this is important only if we run
> > > > > what I'm calling "core" rewriter passes through a single unified
> > > > > Registry rewriting pass.
> > > >
> > > >
> > > > Again, this doesn't matter. If someone writes a custom binding and
> > > > forgets to bind an important rewriter, they'll see the problem
> > > > immediately. This is *NOT* a problem unique to rewriters. It also
> > > > impacts both auth handlers and social data handlers. It hasn't been
> > > > a problem there, and it won't be a problem here because far fewer
> > > > people will ever write custom rewriters. Most people will have to
> > > > write custom social data handlers. Many people will write custom
> > > > auth handlers.
> > > >
> > > >
> > > > > > Option 1 may very well be viable, but nobody can know until
> > > > > > there are real benchmarks. Parsing isn't a free operation, but
> > > > > > it's not the most expensive thing either (it pales in comparison
> > > > > > to, say, RSA signed fetch). Option 2 (in any variant) is
> > > > > > workable, but all variants require huge compromises in what we
> > > > > > can do to the document. If it isn't cacheable, and parsing is
> > > > > > expensive, you're going to wind up with ugly string manipulation
> > > > > > like you see in RenderingContentRewriter right now.
> > > > > >
> > > > > > Once again, I think our best bet is to stop speculating and do
> > > > > > some real benchmarks, figuring out what the latency and CPU
> > > > > > consumption is when using the parser. There are many, many
> > > > > > projects that parse HTML documents in real time -- there's
> > > > > > little reason why we shouldn't be able to.
> > > > >
> > > > > Agreed that we should be as data-driven as possible. My numbers
> > > > > from before give a good sense for parsing cost IMO. Granted, CPU
> > > > > utilization wasn't included in them. The conclusion I drew is that
> > > > > Caja's parser is O(C + n^x) with x > 1 in cost: 1K of content =
> > > > > 1.5ms, 10K = 6.5ms, 50K = 121ms, 100K = 464ms. So that covers at
> > > > > least a good chunk of latency.
> > > >
> > > >
> > > > >
> > > > > In any case, these numbers will ultimately answer only one
> > > > > question: is it worth doing content caching in the first place, or
> > > > > not? If not, we're all wasting our time :) I'm assuming the answer
> > > > > is possibly yes (as has been asserted on this list a few times,
> > > > > and implied by the very existence of this conversation), and want
> > > > > to build just enough infrastructure to accommodate the classes of
> > > > > need I've identified.
> > > > >
> > > > > In sum:
> > > > > 1. I'm willing to build 2.ii or 2.iv right now. I prefer 2.iv
> > > > > since it involves no change to the CR interface.
> > > >
> > > >
> > > > Why does it matter if it requires a change to this interface? Nobody
> > > > is using this code, and they're not going to until this thing has
> > > > stabilized.
> > > >
> > > > And it doesn't change CR, but it changes everything else involved in
> > > > the pipeline.
> > > >
> > > >
> > > > > 2. I also wonder if we should just add a caching layer to the HTML
> > > > > parser. Key = MD5 of String. Value = tree. Only time will tell us
> > > > > which caching layer(s) we really need. IMO it's easier to add the
> > > > > capability of this cache than to spend the time hypothesizing
> > > > > about future rewriters - the performance of "core" ones in
> > > > > particular - and myriad execution environments (different caches,
> > > > > JVMs, CPU, etc.) figuring out whether we need to enable it.
> > > >
> > > >
> > > > That doesn't address the more fundamental problem of needing to have
> > > > a parse tree in scope for other operations. It is definitely not
> > > > just the optimization pass that needs to have a parse tree.
> > > >
> > > >
> > > > >
> > > > >
> > > > > --John
> > > > >
> > > >
> > >
> >
>
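The parse-tree cache John floats in point 2 above (key = MD5 of the source string, value = the parse tree) could be sketched as follows. The class, its generic tree type, and the parser function are illustrative assumptions, not Shindig's actual API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of an MD5-keyed parse-tree cache placed in front of an HTML
// parser. "T" stands in for whatever parse-tree type the parser produces.
public class ParseTreeCacheSketch<T> {
    private final Map<String, T> cache = new ConcurrentHashMap<>();
    private final Function<String, T> parser;

    public ParseTreeCacheSketch(Function<String, T> parser) {
        this.parser = parser;
    }

    /** Parses source, reusing a previously built tree when the MD5 matches. */
    public T parse(String source) {
        return cache.computeIfAbsent(md5Hex(source), key -> parser.apply(source));
    }

    static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Whether entries would also need eviction or a TTL ties back to the TTLCache patch mentioned earlier in the thread; and, as Kevin notes, a cache like this does not by itself solve the problem of keeping a parse tree in scope for the other operations that need it.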
