Putting pedantry aside, it sounds like 2.ii is palatable to both of us. Agreed? If so, I'm happy to implement it. Parse tree optimizations can happen later - where by later I mean "as soon as immediately, but orthogonal to this discussion" :)

John
On Wed, Sep 24, 2008 at 4:08 PM, Kevin Brown <[EMAIL PROTECTED]> wrote:
> On Wed, Sep 24, 2008 at 10:35 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
> > I'm confused. You currently execute, in the new rendering pass,
> > RenderingContentRewriter directly.
>
> No, we don't. RenderingContentRewriter is not currently used by anything
> other than tests. It won't be bound until GadgetServer & GadgetRenderingTask
> get replaced by Renderer (sometime in the next few days).
>
> > That's fine, but the implication of this discussion is that we want it -
> > and related functionality (UP substitution, message bundles, preloads, JS
> > features) - to run as part of a single compound rewriting (could be
> > renamed "processing") pass. These things sound pretty "core" to me, as
> > they're right in the spec. It seems an unnecessary burden to require that
> > anyone constructing a server need to inject all those. But that's just a
> > convenience, and not my main concern here.
>
> Lots of classes fall into the same category (HttpFetcher, HttpCache,
> GadgetSpecFactory...). It doesn't matter, because most people will probably
> never write their own rewriters anyway, and will just use the default
> bindings. Those people who do write rewriters aren't going to forget to
> include an important rewriter.
>
> > > these things are cacheable, others are not. You can't possibly flip
> > > between the two, so you always have to do cacheable things first. Even
> > > if you really have a compelling case for TTLs on cacheable things, you
> > > still have to group all the cacheable things together unless you're
> > > going to create separate cache entries for each stage, which is
> > > horribly wasteful.
> >
> > Yes, it's wasteful, which is why you'd batch as many cacheable things
> > together as possible.
> > If for some reason we found that there's a significant interleaving of
> > cacheable and uncacheable rewriter passes, then our conversation is moot:
> > in that case it's not worth caching rewriter passes at all. But it still
> > may be a good idea to cache parse trees, an orthogonal discussion to this.
>
> Yes, caching parse trees instead of full parse might be a useful
> optimization, but interleaving is still impossible, and probably not even
> worth worrying about. As long as the order of cacheable vs. uncacheable
> operations is documented, anyone writing rewriters will be sure to write
> them with that in mind. We don't have to address every potential edge case.
>
> > > iv is a generalization of iii, and it's unnecessary for the same reason.
> >
> > I'm not convinced iii/iv are overly complicated. I could achieve them
> > with very little code and not a whole lot of complexity.
>
> The *current* code is already overly complicated for what it's doing. This
> is shaping up to be a repeat of GadgetFeature. There's no current need for
> this extra level of complexity, so why make code that's going to be hard to
> maintain and understand even harder? What real value does grouping
> arbitrary rewriters together offer, other than making Guice bindings much
> more complicated (you'll now have to inject rewriters into rewriter groups
> instead of just injecting rewriters)? The end result is the same, because
> you can't cache the results of each group individually.
>
> > If default/common configuration is in a single Guice module that people
> > can use directly, that's great. But as presently written, a single global
> > List<ContentRewriter> is injected into the Registry, which means that in
> > order to inject any custom rewriter, you need to inject all rewriters.
> > Again, this is important only if we run what I'm calling "core" rewriter
> > passes through a single unified Registry rewriting pass.
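The cacheable-first grouping both sides agree on can be sketched roughly as follows. This is an illustrative sketch only - the `ContentRewriter` interface and `TwoStageRewriterRegistry` name are simplified stand-ins, not Shindig's actual API - but it shows why one cache entry per cacheable batch suffices and why interleaving never arises: all cacheable passes run (and are cached) before any uncacheable pass.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: a registry that runs all cacheable rewriters as one
// batch, caches the combined result, then applies uncacheable rewriters on
// every request. Names are hypothetical, not Shindig's real classes.
class TwoStageRewriterRegistry {
  interface ContentRewriter {
    String rewrite(String content);
  }

  private final List<ContentRewriter> cacheable;
  private final List<ContentRewriter> uncacheable;
  private final Map<String, String> cache = new ConcurrentHashMap<>();

  TwoStageRewriterRegistry(List<ContentRewriter> cacheable,
                           List<ContentRewriter> uncacheable) {
    this.cacheable = cacheable;
    this.uncacheable = uncacheable;
  }

  String rewrite(String content) {
    // One cache entry covers the whole cacheable batch; per-stage entries
    // would multiply storage for no benefit, as argued above.
    String intermediate = cache.computeIfAbsent(content, c -> {
      String result = c;
      for (ContentRewriter r : cacheable) {
        result = r.rewrite(result);
      }
      return result;
    });
    String result = intermediate;
    for (ContentRewriter r : uncacheable) {
      result = r.rewrite(result);
    }
    return result;
  }
}
```

A real binding would presumably inject the two lists via Guice, which is where the "inject rewriters into rewriter groups" objection above applies.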
> Again, this doesn't matter. If someone writes a custom binding and forgets
> to bind an important rewriter, they'll see the problem immediately. This is
> *NOT* a problem unique to rewriters. It also impacts both auth handlers and
> social data handlers. It hasn't been a problem there, and it won't be a
> problem here because far fewer people will ever write custom rewriters.
> Most people will have to write custom social data handlers. Many people
> will write custom auth handlers.
>
> > > Option 1 may very well be viable, but nobody can know until there are
> > > real benchmarks. Parsing isn't a free operation, but it's not the most
> > > expensive thing either (it pales in comparison to, say, RSA signed
> > > fetch). Option 2 (in any variant) is workable, but all variants require
> > > huge compromises in what we can do to the document. If it isn't
> > > cacheable, and parsing is expensive, you're going to wind up with ugly
> > > string manipulation like you see in RenderingContentRewriter right now.
> > >
> > > Once again, I think our best bet is to stop speculating and do some
> > > real benchmarks, figuring out what the latency and CPU consumption is
> > > when using the parser. There are many, many projects that parse HTML
> > > documents in real time -- there's little reason why we shouldn't be
> > > able to.
> >
> > Agreed that we should be as data-driven as possible. My numbers from
> > before give a good sense for parsing cost IMO. Granted, CPU utilization
> > wasn't included in them. The conclusion I drew is that Caja's parser is
> > O(C + n^(x > 1)) in cost. 1K of content = 1.5ms. 10K = 6.5ms. 50K =
> > 121ms. 100K = 464ms. So that covers at least a good chunk of latency.
> >
> > In any case, these numbers will ultimately answer only one question: is
> > it worth doing content caching in the first place, or not?
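A quick back-of-the-envelope check of those quoted timings supports the superlinear claim. Fitting the exponent x in t ≈ C + n^x as a log-log slope between measurement pairs (numbers copied from the thread; the class and method names below are just illustrative):

```java
// Sanity check of the quoted Caja parser timings:
// 1K -> 1.5ms, 10K -> 6.5ms, 50K -> 121ms, 100K -> 464ms.
// The log-log slope between two (size, time) pairs estimates the growth
// exponent x in t ~ n^x.
class ParseCostSlope {
  static double slope(double n1, double t1, double n2, double t2) {
    return Math.log(t2 / t1) / Math.log(n2 / n1);
  }

  public static void main(String[] args) {
    double lowEnd = slope(1_000, 1.5, 10_000, 6.5);    // ~0.64 (constant C dominates)
    double highEnd = slope(10_000, 6.5, 100_000, 464); // ~1.85 (superlinear)
    System.out.printf("low=%.2f high=%.2f%n", lowEnd, highEnd);
  }
}
```

The exponent well above 1 at the larger sizes is consistent with the O(C + n^(x > 1)) characterization, with the constant term dominating only for small documents.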
> > If not, we're all wasting our time :) I'm assuming the answer is
> > possibly yes (as has been asserted on this list a few times, and implied
> > by the very existence of this conversation), and want to build just
> > enough infrastructure to accommodate the classes of need I've identified.
> >
> > In sum:
> > 1. I'm willing to build 2.ii or 2.iv right now. I prefer 2.iv since it
> > involves no change to the CR interface.
>
> Why does it matter if it requires a change to this interface? Nobody is
> using this code, and they're not going to until this thing has stabilized.
>
> And it doesn't change CR, but it changes everything else involved in the
> pipeline.
>
> > 2. I also wonder if we should just add a caching layer to the HTML
> > parser. Key = MD5 of String. Value = tree. Only time will tell us which
> > caching layer(s) we really need. IMO it's easier to add the capability
> > of this cache than to spend the time hypothesizing about future
> > rewriters - the performance of "core" ones in particular - and myriad
> > execution environments (different caches, JVMs, CPU, etc.) figuring out
> > whether we need to enable it.
>
> That doesn't address the more fundamental problem of needing to have a
> parse tree in scope for other operations. It is definitely not just the
> optimization pass that needs to have a parse tree.
>
> > --John
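The MD5-keyed parse cache floated in point 2 is small enough to sketch. This is a hypothetical sketch under stated assumptions - `CachingParser` and the generic tree type are illustrative, not Shindig's API, and a production version would need a bounded, shared cache rather than an unbounded map:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative sketch of the proposed cache: key = MD5 of the content
// string, value = the parse tree produced by the wrapped parser.
class CachingParser<T> {
  private final Function<String, T> parser;
  private final Map<String, T> cache = new ConcurrentHashMap<>();

  CachingParser(Function<String, T> parser) {
    this.parser = parser;
  }

  T parse(String content) {
    // Identical content hashes to the same key, so the expensive parse
    // runs at most once per distinct document.
    return cache.computeIfAbsent(md5(content), k -> parser.apply(content));
  }

  private static String md5(String s) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5")
          .digest(s.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : digest) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new AssertionError("MD5 is required of every JVM", e);
    }
  }
}
```

Note this only memoizes parsing; it does not answer Kevin's objection that a parse tree must be in scope for the other pipeline operations regardless of how it was obtained.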