On Thu, Sep 25, 2008 at 9:18 AM, Louis Ryan <[EMAIL PROTECTED]> wrote:
> I think it's worth pointing out some observations about the code as it
> stands today that will hopefully make our decisions easier.
>
> 1. There is essentially no difference between the ContentRewriter
> interface and the ContentRewriterRegistry interface. I'm fairly certain
> we can eliminate ContentRewriterRegistry.

There is one substantial difference, which is that the registry is
responsible for parsing the document before handing it to the rewriters.
This means you have a single dependency (on the registry) instead of
dependencies on the parser and every rewriter. It doesn't do much, but
it's important. The alternative would be for all 3 places that need to do
rewriting (makeRequest, open proxy / concat, and rendering) to both parse
the source document and iterate over all rewriters.

> 2. A ContentRewriter implementation can wrap many subsidiary
> ContentRewriters.
>
> It seems like composition allows folks to implement any bundling,
> grouping and caching their heart desires, and as long as we don't make
> composition impossible we'll be just fine. A composition of rewriters
> that have specialized caching requirements could be handled by declaring
> their requirements as interface extensions (isCacheable/getTTL) to
> ContentRewriter, but that is probably overkill. It's perfectly fine for
> there to be private contracts between a cache and a group of specialized
> rewriters, as they will in all likelihood be tightly coupled anyway.
> There is almost certainly value in providing some straightforward
> concrete implementations of composite rewriters that group usefully
> related individual rewriters together. Implementations can still mix,
> match and substitute as they see fit, but a working reference is
> provided.

Yes, that's my thought exactly. If a ContentRewriter is composed of
multiple related things (for example, the way that
RenderingContentRewriter is), that rewriter can simply invoke the
appropriate sub-rendering processes separately.
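The split described above can be sketched roughly as follows. All names here are illustrative stand-ins, not Shindig's actual interfaces: the point is only that the registry owns the parser, so the three call sites depend on one object rather than on the parser plus every individual rewriter.

```java
import java.util.List;

interface HtmlParser {
  Object parse(String content); // stand-in for a real parse-tree type
}

interface ContentRewriter {
  void rewrite(Object parseTree); // mutate the tree in place
}

class ContentRewriterRegistry {
  private final HtmlParser parser;
  private final List<ContentRewriter> rewriters;

  ContentRewriterRegistry(HtmlParser parser, List<ContentRewriter> rewriters) {
    this.parser = parser;
    this.rewriters = rewriters;
  }

  // Parse once, then run every registered rewriter over the same tree.
  // Callers (makeRequest, proxy/concat, rendering) need only this class.
  Object rewriteAll(String content) {
    Object tree = parser.parse(content);
    for (ContentRewriter rewriter : rewriters) {
      rewriter.rewrite(tree);
    }
    return tree;
  }
}
```

A composite rewriter in this scheme is just another ContentRewriter whose rewrite() invokes its sub-rewriters in order, which is what makes the wrapping described in point 2 possible.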
> Based on what Kevin's been doing I'm assuming he's already following
> this model, and that RenderingRewriter will compose
> MsgSubstituteRewriter, JsFeatureRewriter etc. at some point.

You mean like the TODO says? :)

> As for providing a caching implementation, I agree we can defer adding
> it to the default pipeline in Shindig until there's a proven need. It's
> probably fine to keep the TTLCache in a patch on JIRA for
> experimentation.
>
> On Wed, Sep 24, 2008 at 4:57 PM, Kevin Brown <[EMAIL PROTECTED]> wrote:
>
> > 2 ii sounds good to me, though I still think just going for option 1
> > is even simpler.
> >
> > On Thu, Sep 25, 2008 at 1:17 AM, John Hjelmstad <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Putting pedantry aside, it sounds like 2.ii is palatable to both of
> > > us. Agreed? If so, I'm happy to implement it. Parse tree
> > > optimizations can happen later - where by later I mean "as soon as
> > > immediately, but orthogonal to this discussion" :)
> > > John
> > >
> > > On Wed, Sep 24, 2008 at 4:08 PM, Kevin Brown <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > On Wed, Sep 24, 2008 at 10:35 PM, John Hjelmstad
> > > > <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > I'm confused. You currently execute, in the new rendering pass,
> > > > > RenderingContentRewriter directly.
> > > >
> > > > No, we don't. RenderingContentRewriter is not currently used by
> > > > anything other than tests. It won't be bound until GadgetServer &
> > > > GadgetRenderingTask get replaced by Renderer (sometime in the
> > > > next few days).
> > > >
> > > > > That's fine, but the implication of this discussion is that we
> > > > > want it - and related functionality (UP substitution, message
> > > > > bundles, preloads, JS features) - to run as part of a single
> > > > > compound rewriting (could be renamed "processing") pass. These
> > > > > things sound pretty "core" to me, as they're right in the spec.
> > > > > It seems an unnecessary burden to require that anyone
> > > > > constructing a server need to inject all those. But that's
> > > > > just a convenience, and not my main concern here.
> > > >
> > > > Lots of classes fall into the same category (HttpFetcher,
> > > > HttpCache, GadgetSpecFactory...). It doesn't matter, because most
> > > > people will probably never write their own rewriters anyway, and
> > > > will just use the default bindings. Those people who do write
> > > > rewriters aren't going to forget to include an important
> > > > rewriter.
> > > >
> > > > > > These things are cacheable, others are not. You can't
> > > > > > possibly flip between the two, so you always have to do
> > > > > > cacheable things first. Even if you really have a compelling
> > > > > > case for TTLs on cacheable things, you still have to group
> > > > > > all the cacheable things together unless you're going to
> > > > > > create separate cache entries for each stage, which is
> > > > > > horribly wasteful.
> > > > >
> > > > > Yes, it's wasteful, which is why you'd batch as many cacheable
> > > > > things together as possible. If for some reason we found that
> > > > > there's a significant interleaving of cacheable and uncacheable
> > > > > rewriter passes, then our conversation is moot: in that case
> > > > > it's not worth caching rewriter passes at all. But it still may
> > > > > be a good idea to cache parse trees, an orthogonal discussion
> > > > > to this.
> > > >
> > > > Yes, caching parse trees instead of full parses might be a useful
> > > > optimization, but interleaving is still impossible, and probably
> > > > not even worth worrying about. As long as the order of cacheable
> > > > vs. uncacheable operations is documented, anyone writing
> > > > rewriters will be sure to write them with that in mind.
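The "cacheable things first" ordering being argued for above can be sketched as a two-stage pipeline: one cache entry for the combined output of the whole cacheable stage, with uncacheable passes always run afterwards. This is a hypothetical illustration of the grouping, not Shindig code; the interfaces and names are invented.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

interface Rewriter {
  String rewrite(String content);
}

class TwoStageRewriterPipeline {
  private final List<Rewriter> cacheable;
  private final List<Rewriter> uncacheable;
  private final Map<String, String> cache = new HashMap<>();

  TwoStageRewriterPipeline(List<Rewriter> cacheable,
                           List<Rewriter> uncacheable) {
    this.cacheable = cacheable;
    this.uncacheable = uncacheable;
  }

  String process(String cacheKey, String content) {
    // One cache entry for the entire cacheable stage, not one per
    // rewriter -- avoiding the per-stage entries called "horribly
    // wasteful" above.
    String base =
        cache.computeIfAbsent(cacheKey, k -> runAll(cacheable, content));
    // Uncacheable passes (e.g. per-request substitutions) run every time.
    return runAll(uncacheable, base);
  }

  private static String runAll(List<Rewriter> rewriters, String in) {
    for (Rewriter r : rewriters) {
      in = r.rewrite(in);
    }
    return in;
  }
}
```

Interleaving is impossible in this shape by construction: once an uncacheable pass has run, nothing downstream can be cached, which is the documented ordering constraint rewriter authors would write against.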
> > > > We don't have to address every potential edge case.
> > > >
> > > > > > iv is a generalization of iii, and it's unnecessary for the
> > > > > > same reason.
> > > > >
> > > > > I'm not convinced iii/iv are overly complicated. I could
> > > > > achieve them with very little code and not a whole lot of
> > > > > complexity.
> > > >
> > > > The *current* code is already overly complicated for what it's
> > > > doing. This is shaping up to be a repeat of GadgetFeature.
> > > > There's no current need for this extra level of complexity, so
> > > > why make code that's going to be hard to maintain and understand
> > > > even harder? What real value does grouping arbitrary rewriters
> > > > together offer, other than making Guice bindings much more
> > > > complicated (you'll now have to inject rewriters into rewriter
> > > > groups instead of just injecting rewriters)? The end result is
> > > > the same, because you can't cache the results of each group
> > > > individually.
> > > >
> > > > > If default/common configuration is in a single Guice module
> > > > > that people can use directly, that's great. But as presently
> > > > > written, a single global List<ContentRewriter> is injected into
> > > > > the Registry, which means that in order to inject any custom
> > > > > rewriter, you need to inject all rewriters. Again, this is
> > > > > important only if we run what I'm calling "core" rewriter
> > > > > passes through a single unified Registry rewriting pass.
> > > >
> > > > Again, this doesn't matter. If someone writes a custom binding
> > > > and forgets to bind an important rewriter, they'll see the
> > > > problem immediately. This is *NOT* a problem unique to rewriters.
> > > > It also impacts both auth handlers and social data handlers.
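The "default bindings" compromise discussed above amounts to shipping a factory for the standard rewriter list, so a custom configuration starts from the defaults and appends rather than re-declaring every rewriter. Shindig wires this with Guice; the plain-Java sketch below only illustrates the idea, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

interface ContentRewriter {
  String rewrite(String content);
}

final class DefaultRewriters {
  private DefaultRewriters() {}

  /** The "core" passes every server wants; order matters. */
  static List<ContentRewriter> defaults() {
    List<ContentRewriter> rewriters = new ArrayList<>();
    // Placeholder standing in for e.g. message-bundle substitution,
    // JS feature injection, preloads -- the spec-mandated passes.
    rewriters.add(content -> content);
    return rewriters;
  }

  /** Defaults plus caller-supplied rewriters, appended at the end. */
  static List<ContentRewriter> withCustom(List<ContentRewriter> custom) {
    List<ContentRewriter> rewriters = defaults();
    rewriters.addAll(custom);
    return rewriters;
  }
}
```

With this shape, forgetting a core rewriter requires deliberately bypassing defaults(), which matches the argument that a broken custom binding would be noticed immediately.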
> > > > It hasn't been a problem there, and it won't be a problem here,
> > > > because far fewer people will ever write custom rewriters. Most
> > > > people will have to write custom social data handlers. Many
> > > > people will write custom auth handlers.
> > > >
> > > > > > Option 1 may very well be viable, but nobody can know until
> > > > > > there are real benchmarks. Parsing isn't a free operation,
> > > > > > but it's not the most expensive thing either (it pales in
> > > > > > comparison to, say, RSA signed fetch). Option 2 (in any
> > > > > > variant) is workable, but all variants require huge
> > > > > > compromises in what we can do to the document. If it isn't
> > > > > > cacheable, and parsing is expensive, you're going to wind up
> > > > > > with ugly string manipulation like you see in
> > > > > > RenderingContentRewriter right now.
> > > > > >
> > > > > > Once again, I think our best bet is to stop speculating and
> > > > > > do some real benchmarks, figuring out what the latency and
> > > > > > CPU consumption is when using the parser. There are many,
> > > > > > many projects that parse HTML documents in real time --
> > > > > > there's little reason why we shouldn't be able to.
> > > > >
> > > > > Agreed that we should be as data-driven as possible. My numbers
> > > > > from before give a good sense for parsing cost IMO. Granted,
> > > > > CPU utilization wasn't included in them. The conclusion I drew
> > > > > is that Caja's parser is O(C + n^x), x > 1, in cost. 1K of
> > > > > content = 1.5ms. 10K = 6.5ms. 50K = 121ms. 100K = 464ms. So
> > > > > that covers at least a good chunk of latency.
> > > > >
> > > > > In any case, these numbers will ultimately answer only one
> > > > > question: is it worth doing content caching in the first place,
> > > > > or not?
> > > > > If not, we're all wasting our time :) I'm assuming the answer
> > > > > is possibly yes (as has been asserted on this list a few times,
> > > > > and implied by the very existence of this conversation), and
> > > > > want to build just enough infrastructure to accommodate the
> > > > > classes of need I've identified.
> > > > >
> > > > > In sum:
> > > > > 1. I'm willing to build 2.ii or 2.iv right now. I prefer 2.iv
> > > > > since it involves no change to the CR interface.
> > > >
> > > > Why does it matter if it requires a change to this interface?
> > > > Nobody is using this code, and they're not going to until this
> > > > thing has stabilized.
> > > >
> > > > And it doesn't change CR, but it changes everything else involved
> > > > in the pipeline.
> > > >
> > > > > 2. I also wonder if we should just add a caching layer to the
> > > > > HTML parser. Key = MD5 of String. Value = tree. Only time will
> > > > > tell us which caching layer(s) we really need. IMO it's easier
> > > > > to add the capability of this cache than to spend the time
> > > > > hypothesizing about future rewriters - the performance of
> > > > > "core" ones in particular - and myriad execution environments
> > > > > (different caches, JVMs, CPU, etc.) figuring out whether we
> > > > > need to enable it.
> > > >
> > > > That doesn't address the more fundamental problem of needing to
> > > > have a parse tree in scope for other operations. It is definitely
> > > > not just the optimization pass that needs to have a parse tree.
> > > >
> > > > > --John
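For reference, John's item 2 ("Key = MD5 of String. Value = tree") is small enough to sketch directly. The parser type is a stand-in, and a real version would need a bounded/evicting cache rather than a plain map; this only shows the shape.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class ParseTreeCache {
  private final Function<String, Object> parser;
  private final Map<String, Object> trees = new HashMap<>();

  ParseTreeCache(Function<String, Object> parser) {
    this.parser = parser;
  }

  /** Parse content, reusing the cached tree for identical input. */
  Object parse(String content) {
    // Key = MD5 of the source string; value = the parse tree.
    return trees.computeIfAbsent(md5(content), k -> parser.apply(content));
  }

  static String md5(String s) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5")
          .digest(s.getBytes(StandardCharsets.UTF_8));
      return new BigInteger(1, digest).toString(16);
    } catch (Exception e) {
      throw new RuntimeException(e); // MD5 is always available on the JVM
    }
  }
}
```

Note that this answers only the "is parsing too expensive to repeat?" question; it does nothing for the separate problem raised above of having a parse tree in scope for the other operations.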