2.ii sounds good to me, though I still think just going for option 1 is even simpler.
On Thu, Sep 25, 2008 at 1:17 AM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
> Putting pedantry aside, it sounds like 2.ii is palatable to both of us.
> Agreed? If so, I'm happy to implement it. Parse tree optimizations can
> happen later - where by later I mean "as soon as immediately, but
> orthogonal to this discussion" :)
>
> John
>
> On Wed, Sep 24, 2008 at 4:08 PM, Kevin Brown <[EMAIL PROTECTED]> wrote:
>
> > On Wed, Sep 24, 2008 at 10:35 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
> >
> > > I'm confused. You currently execute, in the new rendering pass,
> > > RenderingContentRewriter directly.
> >
> > No, we don't. RenderingContentRewriter is not currently used by anything
> > other than tests. It won't be bound until GadgetServer &
> > GadgetRenderingTask get replaced by Renderer (sometime in the next few
> > days).
> >
> > > That's fine, but the implication of this discussion is that we want
> > > it - and related functionality (UP substitution, message bundles,
> > > preloads, JS features) - to run as part of a single compound rewriting
> > > (could be renamed "processing") pass. These things sound pretty "core"
> > > to me, as they're right in the spec. It seems an unnecessary burden to
> > > require that anyone constructing a server inject all of those. But
> > > that's just a convenience, and not my main concern here.
> >
> > Lots of classes fall into the same category (HttpFetcher, HttpCache,
> > GadgetSpecFactory...). It doesn't matter, because most people will
> > probably never write their own rewriters anyway, and will just use the
> > default bindings. Those people who do write rewriters aren't going to
> > forget to include an important rewriter.
> >
> > > > these things are cacheable, others are not. You can't possibly flip
> > > > between the two, so you always have to do cacheable things first.
> > > > Even if you really have a compelling case for TTLs on cacheable
> > > > things, you still have to group all the cacheable things together
> > > > unless you're going to create separate cache entries for each stage,
> > > > which is horribly wasteful.
> > >
> > > Yes, it's wasteful, which is why you'd batch as many cacheable things
> > > together as possible. If for some reason we found that there's
> > > significant interleaving of cacheable and uncacheable rewriter passes,
> > > then our conversation is moot: in that case it's not worth caching
> > > rewriter passes at all. But it still may be a good idea to cache parse
> > > trees - an orthogonal discussion to this.
> >
> > Yes, caching parse trees instead of the full parse might be a useful
> > optimization, but interleaving is still impossible, and probably not
> > even worth worrying about. As long as the order of cacheable vs.
> > uncacheable operations is documented, anyone writing rewriters will be
> > sure to write them with that in mind. We don't have to address every
> > potential edge case.
> >
> > > > iv is a generalization of iii, and it's unnecessary for the same
> > > > reason.
> > >
> > > I'm not convinced iii/iv are overly complicated. I could achieve them
> > > with very little code and not a whole lot of complexity.
> >
> > The *current* code is already overly complicated for what it's doing.
> > This is shaping up to be a repeat of GadgetFeature. There's no current
> > need for this extra level of complexity, so why make code that's going
> > to be hard to maintain and understand even harder? What real value does
> > grouping arbitrary rewriters together offer, other than making Guice
> > bindings much more complicated (you'll now have to inject rewriters
> > into rewriter groups instead of just injecting rewriters)? The end
> > result is the same, because you can't cache the results of each group
> > individually.
> >
> > > If default/common configuration is in a single Guice module that
> > > people can use directly, that's great. But as presently written, a
> > > single global List<ContentRewriter> is injected into the Registry,
> > > which means that in order to inject any custom rewriter, you need to
> > > inject all rewriters. Again, this is important only if we run what
> > > I'm calling "core" rewriter passes through a single unified Registry
> > > rewriting pass.
> >
> > Again, this doesn't matter. If someone writes a custom binding and
> > forgets to bind an important rewriter, they'll see the problem
> > immediately. This is *NOT* a problem unique to rewriters. It also
> > impacts both auth handlers and social data handlers. It hasn't been a
> > problem there, and it won't be a problem here, because far fewer people
> > will ever write custom rewriters. Most people will have to write custom
> > social data handlers. Many people will write custom auth handlers.
> >
> > > > Option 1 may very well be viable, but nobody can know until there
> > > > are real benchmarks. Parsing isn't a free operation, but it's not
> > > > the most expensive thing either (it pales in comparison to, say,
> > > > RSA signed fetch). Option 2 (in any variant) is workable, but all
> > > > variants require huge compromises in what we can do to the
> > > > document. If it isn't cacheable, and parsing is expensive, you're
> > > > going to wind up with ugly string manipulation like you see in
> > > > RenderingContentRewriter right now.
> > > >
> > > > Once again, I think our best bet is to stop speculating and do some
> > > > real benchmarks, figuring out what the latency and CPU consumption
> > > > is when using the parser. There are many, many projects that parse
> > > > HTML documents in real time -- there's little reason why we
> > > > shouldn't be able to.
> > >
> > > Agreed that we should be as data-driven as possible. My numbers from
> > > before give a good sense of parsing cost, IMO. Granted, CPU
> > > utilization wasn't included in them. The conclusion I drew is that
> > > Caja's parser costs O(C + n^x) for some x > 1: 1K of content = 1.5ms,
> > > 10K = 6.5ms, 50K = 121ms, 100K = 464ms. So that covers at least a
> > > good chunk of latency.
> > >
> > > In any case, these numbers will ultimately answer only one question:
> > > is it worth doing content caching in the first place, or not? If not,
> > > we're all wasting our time :) I'm assuming the answer is possibly yes
> > > (as has been asserted on this list a few times, and implied by the
> > > very existence of this conversation), and I want to build just enough
> > > infrastructure to accommodate the classes of need I've identified.
> > >
> > > In sum:
> > > 1. I'm willing to build 2.ii or 2.iv right now. I prefer 2.iv since
> > > it involves no change to the CR interface.
> >
> > Why does it matter if it requires a change to this interface? Nobody is
> > using this code, and they're not going to until this thing has
> > stabilized.
> >
> > And it doesn't change CR, but it changes everything else involved in
> > the pipeline.
> >
> > > 2. I also wonder if we should just add a caching layer to the HTML
> > > parser. Key = MD5 of String. Value = tree. Only time will tell us
> > > which caching layer(s) we really need. IMO it's easier to add the
> > > capability of this cache than to spend the time hypothesizing about
> > > future rewriters - the performance of "core" ones in particular - and
> > > myriad execution environments (different caches, JVMs, CPUs, etc.)
> > > figuring out whether we need to enable it.
> >
> > That doesn't address the more fundamental problem of needing to have a
> > parse tree in scope for other operations. It is definitely not just the
> > optimization pass that needs a parse tree.
> >
> > > --John
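For readers following along: the "cacheable things first" constraint both sides agree on can be sketched roughly as below. The names (RewriterPipeline, ContentRewriter, isCacheable) are illustrative only, not Shindig's actual interfaces; the point is just that a single compound pass can run all cacheable rewrites first (so their combined output could be cached under one entry) before any per-request, uncacheable rewrites.

```java
import java.util.List;

// Illustrative sketch, not Shindig code: a compound rewriting pass that
// batches cacheable rewriters ahead of uncacheable ones, regardless of
// registration order.
class RewriterPipeline {
  interface ContentRewriter {
    String rewrite(String content);
    boolean isCacheable();
  }

  private final List<ContentRewriter> rewriters;

  RewriterPipeline(List<ContentRewriter> rewriters) {
    this.rewriters = rewriters;
  }

  String rewriteAll(String content) {
    // Phase 1: cacheable passes. Their combined output is a candidate for
    // a single cache entry keyed on the input content.
    for (ContentRewriter r : rewriters) {
      if (r.isCacheable()) {
        content = r.rewrite(content);
      }
    }
    // Phase 2: per-request passes (user prefs, message bundles, etc.)
    // that must run on every request.
    for (ContentRewriter r : rewriters) {
      if (!r.isCacheable()) {
        content = r.rewrite(content);
      }
    }
    return content;
  }
}
```

Note there is deliberately no grouping layer here: as Kevin argues, the two-phase split is the only ordering that matters, since interleaving cacheable and uncacheable stages would defeat caching anyway.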
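John's point 2 (a caching layer in front of the HTML parser, key = MD5 of the content string, value = the parse tree) could look something like the following sketch. Everything here is hypothetical: ParseTreeCache, the placeholder parse() method (a real implementation would return the parser's DOM-like tree, not a String), and the LRU bound MAX_ENTRIES are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of an MD5-keyed parse cache, as proposed above.
class ParseTreeCache {
  private static final int MAX_ENTRIES = 1000;

  // LRU eviction via LinkedHashMap's access-order mode.
  private final Map<String, String> cache =
      new LinkedHashMap<String, String>(16, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
          return size() > MAX_ENTRIES;
        }
      };

  // MD5 of the source string, hex-encoded, used as the cache key.
  static String md5Hex(String content) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
      StringBuilder sb = new StringBuilder();
      for (byte b : digest) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e); // MD5 is required on all JVMs
    }
  }

  // Returns the cached tree, parsing and caching on a miss.
  String getOrParse(String content) {
    String key = md5Hex(content);
    String tree = cache.get(key);
    if (tree == null) {
      tree = parse(content);
      cache.put(key, tree);
    }
    return tree;
  }

  // Placeholder for the real HTML parser invocation.
  private String parse(String content) {
    return "tree(" + content.length() + " chars)";
  }
}
```

As Kevin notes in reply, a cache like this only avoids repeat parses; it doesn't solve the separate problem of keeping a parse tree in scope across the other pipeline operations that need one.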