On Fri, Sep 5, 2008 at 1:45 AM, Kevin Brown <[EMAIL PROTECTED]> wrote:
> On Thu, Sep 4, 2008 at 11:53 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote: > > > Hey Louis: > > > > I appreciate the vote of confidence in the parse tree approach :) ...and > > agree that the numbers on rewriting are pretty encouraging. That said, a > > few > > more points on which I'd like to get your thoughts: > > > > 1. The case for String based substitution applies to any circumstance in > > which a given expansion spans the equivalent of multiple nodes. Because > > GadgetHtmlNode treats text-type nodes as literal text, the expansion of a > > message whose value is, for instance, "<b>foo!</b>" will in fact be > > rendered > > "<b>foo!<>". That's not necessarily a bad thing, but it and > > cases like it are almost certainly non-backward-compatible with _some_ > use > > cases out there. > > > So is whitespace removal. It's not enough to warrant becoming software > contortionists. > > > > 2. We've discussed using this new content rewriting infrastructure as a > way > > of implementing UP, MSG, and other Substitutions, and potentially other > > basic content manipulations. Past functional regressions (however > > dismissable, per #1), satisfying these sorts of requests will be made > > inherently more expensive, and by a considerable margin, due to parsing > > cost. Further, they will fail when the parser fails. The alternative is > to > > keep these routines as-is. That's reasonable, but in general there > remains > > a > > set of requests for which parse tree manipulation simply isn't necessary, > > which would be made much more expensive if a parse tree is the only way > to > > manipulate content through a rewriter. > > > You keep coming back to 'expensive'. Parsing should not be expensive. If it's too expensive to do in real time for all requests, it's too expensive > period and it needs to be revisited. Once this code stabilizes, I can > already see many places that can have performance improvements, and I'm > sure > there are many more if we start sending patches to caja. > > While parsing is certainly not free, it can be provably demonstrated that > it's quite possible to write an efficient html parser for these purposes. > There are hundreds of thousands of websites that are using far less > efficient html parsers to do everything from input validation to template > processing without issue. I don't dispute this. My entire point is that it's inherently more expensive than manipulating a String directly, and notably more expensive given today's parser (more comments about this below). I'd like to understand what our position is regarding CPU usage in rewriting. Commenting to SHINDIG-500, you said the following when I proposed temporarily removing caching of rewritten content - in that case, only for the minority of gadgets that opted into rewriting in the first place: "No, it isn't acceptable.. As Louis and I have posted on other bugs, rewriting (and associated sub-processes) is one of the most expensive operations done by Shindig today (after RSA signing and content encoding detection). We need to optimize this in the long run anyway, but we can't afford to rewrite every render in the short term either". A reasonable chunk of the work I've done on this rewriting stuff is designed to maintain prior performance characteristics (before we clean up how that's implemented too). The only breaking change I've made is to modify ContentRewriter's Gadget-rewriting interface, which we all agree needs to change anyway. > > > Parsing is absolutely required to handle opensocial templates server side. > > What might be expensive here is analysis, such as content reordering, but > those steps can be skipped if they're deemed too costly. > > > > 3. By my count, the number of lines of code associated with maintaining > > both > > String and Parse Tree representations of content is 62, in > > MutableHtmlContent.java. > > > Lines of code is a bogus metric, and you know it. There's far more to > maintaining the more complex logic than a few lines of code. > > The biggest cost is not even in our own code, it's in keeping a clean > interface for others to integrate against. We've been horrible API > designers > thus far (constantly breaking interfaces, changing data structures, and > generally creating source incompatibility), and that needs to stop -- we're > releasing soon. The most consistently proven way to write quality APIs with > as little backwards compatibility issues as possible is to write clean, > simple interfaces that are as small as possible. It's not a complete metric, no doubt, which is what I acknowledged. I also agree that writing a quality API that we actually do set in relative stone is of paramount importance. The fact that we're breaking the ContentRewriter interface right now is a given, however, so the question becomes what to change it to and how to get there. Today the candidates are: ContentRewriter.rewrite(HttpRequest context, [MutableContent|ParseTree] content) ContentRewriter.rewrite([Gadget gadget | GadgetContext context, [MutableContent|ParseTree] content)); We have two interface decisions: 1. Use MutableContent or ParseTree. You and Louis strongly prefer ParseTree. I prefer MutableContent at present. In my view, to users of the Rewriter interface will almost always simply take content.getParseTree() and call it a day. Others will take content.getContent() and be finished: not especially difficult from a usage perspective. Meantime, calling code would simply construct a MutableContent, pass it into the rewriters, and getContent() afterward to render. 2. Pass in Gadget, or GadgetContext plus some rewritable content. Once JsLibraries are made into standard ParseTree nodes, or we agree to stick with the separate notion of JsLibrary forever (forever = as long as we don't change the rewriter interface), I'm happy to go the latter route. I could use your help with either of these. Again, I too want to solidify these interfaces as soon as possible. If we had a wicked-fast parser that can handle the vast majority of gadget input we have today (a requirement to make Substitutions and JsLibraries Rewriters from an implementation perspective), we could use that. But we don't, yet, and we need to use this code in the reasonably near term. > > > > > True, there's some complexity inherent in two > > representations being made available, and the corresponding burden to > > ensure > > that all String-based rewriters are bundled together and all Parse-based > > ones are as well. Just about every rewriter will use only one > > representation, though, so the latter burden is the major issue. Given > our > > dogged insistence (rightly so) on performance/latency reduction, isn't > that > > a reasonable price to pay? > > > No. Again, you're assuming that parsing automatically and eternally means > "expensive". Just looking at cajita should tell you that it's not an issue. > We create the right architecture, and then we worry about optimization if > we > need to. Not so. I'm acknowledging that "relatively expensive" is the case today (and comparatively expensive to String manipulation by definition, the weaker point once parsing is "fast enough"), and that it will take nontrivial amounts of work to get to the point where that's not the case any longer. > > > I suppose it's ironic that I'm championing the String representation given > > the main thrust of my changes is to introduce a parse tree. If you both - > > and anyone else in the community with an opinion! - still feel that parse > > tree only is the way to go, I can accede... though to set expectations, > > it'll take me a little while to get to making those mods. In the > meantime, > > I'd like to add *some* ability to retrieve <intermediary/mutable content > > representation, currently MutableHtmlContent, later maybe just parse > tree> > > from an HttpResponse in such a way that it can (through collaboration > with > > HttpResponseBuilder) be shared between HttpResponse instances to avoid > > re-parsing in a chain of Rewriters. Seem reasonable? Like Kevin said, > we're > > dealing with code, not stone here, and we can fiddle with the relevant > APIs > > until the balance feels right. > > > No, just fix the interface. Why do you want to go through these unnecessary > intermediate steps? Fixing the interface and tweaking the existing > implementations is infinitely easier than mucking around "temporarily" with > HttpResponse. If you think it's too much work or don't have the time to do > this, just say so and one of us will do it -- I've been held up on > completing proxied content because of this and I'd rather not wait any > more. > The current code base is not in a stable, usable state to integrate against > and it needs to be moved back to that state rapidly. Looking at HttpResponse a little more, it's painfully clear that it shouldn't be mucked with even temporarily. Passing in *some* content param is the solution. Re: being held up and amount of work. Can you elaborate how you're held up by this? What do you plan on implementing in terms of ContentRewriters that I couldn't, for instance, change while making interface modifications described above? I'm always happy to receive help. The major piece is to resolve the interface. If you can A) own the task of optimizing Caja's parser so it's suitable for application on every request for every piece of gadget content we have, or B) agree we won't implement JsLibraries and Substitutions as Rewriters, then we're done. We go parse tree only. > > > There's no reason why a chain of rewriters should have to re-parse a piece > of content, and there's no reason to change HttpResponse. Pass the parse > tree as an explicit parameter and move on. Even if it did require > re-parsing > right now, it wouldn't matter because there's exactly 1 rewriter to worry > about at the moment. Not so. The "1 rewriter" is actually 3 rewriters bundled behind the old hack-y interface we all agree needs to change. See my latest commit splitting ParseTreeDefaultContentRewriter into 3 separate rewriters that employ the new-style API for gadget rewriting.
