On Fri, Sep 5, 2008 at 12:30 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
> I completely agree that this is a great place to go for the parse tree. But > 2 concerns: > 1. It'll take a while to get there. The CSS piece is actually somewhat > easy.. I've already written a basic CssNode along with a tested CssParser > interface and implementation... but it might prove incomplete at this point > (I just have Rules composed of multiple Selectors and Declarations -- no > comments, structured Declaration values, etc yet) The JS piece will be > significantly more difficult. Maybe implementing one or both to our full > satisfaction is a requirement to move on from all this rewriter work; I was > hoping we could remain optimistic > ...yeesh, somehow that sentence got mangled in a copy-and-paste job. I had said I was optimistic we could just use a String in cases where we don't yet have a parse tree for whatever content type is applicable. > > 2. A somewhat more esoteric, but potentially valid point for the future: > what about non-browser-displayable content types? In theory rewriters could > operate on, say, images for sanitization or optimization purposes.. XML > (yes, HTML-like, but not exactly HTML?) for transformations.. JSON for same, > or security rewrites or something.. > > Comment #2 gets to the heart of my previous comment about hedging bets, > since I still don't find the String+Tree wrapper all that confusing from an > API point of view. On the other hand, ensuring that we jam everything into > ParseTree feels like it might limit us at some point, which would lead to > another API change. > > --John > > > On Fri, Sep 5, 2008 at 12:16 PM, Kevin Brown <[EMAIL PROTECTED]> wrote: > >> On Fri, Sep 5, 2008 at 11:02 AM, John Hjelmstad <[EMAIL PROTECTED]> wrote: >> >> > Addendum argument for using something like MutableContent rather than >> > ParseTree always: as you said, not all content is HTML (agreed, >> > MutableHtmlContent is not the right name)... >> > We seem to agree that HttpResponse rewriting via an API that looks >> > schematically like so: >> > >> > ContentRewriter.rewrite(HttpRequest req, HttpResponse orig, >> > SomethingEditable editable) >> > >> > SomethingEditable needs to be passed in, as opposed to the return value >> of >> > the method being rewritten content, in order to ensure that we don't >> > re-parse in the HTML case. >> > >> > The concrete use case for MutableContent rather than ParseTree is >> > mimeType="text/css". We could parse this into some canonical parse tree >> > (top-level node wrapping text or a future CssTree), but requiring that >> we >> > do >> > so seems a little odd to me. Other mime types provide different >> examples. >> >> >> I'd prefer that the parse tree be unified and that the rewriter can >> programmatically determine if the thing it's rewriting is CSS or HTML (or >> javascript). I don't care what the name is, but "html" anything is >> probably >> inappropriate. >> >> This is of paramount importance because you may be rewriting external CSS >> (common for rewriting of HttpResponse objects) or you may be rewriting >> inline CSS (common for rewriting of Gadget objects). >> >> A tree might look like this: >> >> Tree >> |__HtmlNode[head] >> | |__HtmlNode[style] >> | | |__CssNode[...] >> |__HtmlNode[body] >> |__HtmlNode[div] >> |__... >> >> Or it might just look like this: >> >> Tree >> |__CssNode[...] >> |__CssNode[...] >> >> The rewriter can trivially handle both cases (or choose not to handle one >> or >> the other, if that's the nature of the rewriter). >> >> >> > >> > >> > Thoughts? >> > >> > --John >> > >> > On Fri, Sep 5, 2008 at 10:18 AM, John Hjelmstad <[EMAIL PROTECTED]> >> wrote: >> > >> > > On Fri, Sep 5, 2008 at 1:45 AM, Kevin Brown <[EMAIL PROTECTED]> wrote: >> > > >> > >> On Thu, Sep 4, 2008 at 11:53 PM, John Hjelmstad <[EMAIL PROTECTED]> >> > wrote: >> > >> >> > >> > Hey Louis: >> > >> > >> > >> > I appreciate the vote of confidence in the parse tree approach :) >> > ...and >> > >> > agree that the numbers on rewriting are pretty encouraging. That >> said, >> > a >> > >> > few >> > >> > more points on which I'd like to get your thoughts: >> > >> > >> > >> > 1. The case for String based substitution applies to any >> circumstance >> > in >> > >> > which a given expansion spans the equivalent of multiple nodes. >> > Because >> > >> > GadgetHtmlNode treats text-type nodes as literal text, the >> expansion >> > of >> > >> a >> > >> > message whose value is, for instance, "<b>foo!</b>" will in fact be >> > >> > rendered >> > >> > "<b>foo!<>". That's not necessarily a bad thing, but it >> > and >> > >> > cases like it are almost certainly non-backward-compatible with >> _some_ >> > >> use >> > >> > cases out there. >> > >> >> > >> >> > >> So is whitespace removal. It's not enough to warrant becoming >> software >> > >> contortionists. >> > >> >> > >> >> > >> > 2. We've discussed using this new content rewriting infrastructure >> as >> > a >> > >> way >> > >> > of implementing UP, MSG, and other Substitutions, and potentially >> > other >> > >> > basic content manipulations. Past functional regressions (however >> > >> > dismissable, per #1), satisfying these sorts of requests will be >> made >> > >> > inherently more expensive, and by a considerable margin, due to >> > parsing >> > >> > cost. Further, they will fail when the parser fails. The >> alternative >> > is >> > >> to >> > >> > keep these routines as-is. That's reasonable, but in general there >> > >> remains >> > >> > a >> > >> > set of requests for which parse tree manipulation simply isn't >> > >> necessary, >> > >> > which would be made much more expensive if a parse tree is the only >> > way >> > >> to >> > >> > manipulate content through a rewriter. >> > >> >> > >> >> > >> You keep coming back to 'expensive'. Parsing should not be expensive. >> If >> > > >> > > it's too expensive to do in real time for all requests, it's too >> > expensive >> > >> period and it needs to be revisited. Once this code stabilizes, I can >> > >> already see many places that can have performance improvements, and >> I'm >> > >> sure >> > >> there are many more if we start sending patches to caja. >> > >> >> > >> While parsing is certainly not free, it can be provably demonstrated >> > that >> > >> it's quite possible to write an efficient html parser for these >> > purposes. >> > >> There are hundreds of thousands of websites that are using far less >> > >> efficient html parsers to do everything from input validation to >> > template >> > >> processing without issue. >> > > >> > > >> > > I don't dispute this. My entire point is that it's inherently more >> > > expensive than manipulating a String directly, and notably more >> expensive >> > > given today's parser (more comments about this below). >> > > >> > > I'd like to understand what our position is regarding CPU usage in >> > > rewriting. Commenting to SHINDIG-500, you said the following when I >> > proposed >> > > temporarily removing caching of rewritten content - in that case, only >> > for >> > > the minority of gadgets that opted into rewriting in the first place: >> > "No, >> > > it isn't acceptable.. As Louis and I have posted on other bugs, >> rewriting >> > > (and associated sub-processes) is one of the most expensive operations >> > done >> > > by Shindig today (after RSA signing and content encoding detection). >> We >> > need >> > > to optimize this in the long run anyway, but we can't afford to >> rewrite >> > > every render in the short term either". >> > > >> > > A reasonable chunk of the work I've done on this rewriting stuff is >> > > designed to maintain prior performance characteristics (before we >> clean >> > up >> > > how that's implemented too). The only breaking change I've made is to >> > modify >> > > ContentRewriter's Gadget-rewriting interface, which we all agree needs >> to >> > > change anyway. >> > > >> > > >> > >> >> > >> >> > >> Parsing is absolutely required to handle opensocial templates server >> > side. >> > >> >> > >> What might be expensive here is analysis, such as content reordering, >> > but >> > >> those steps can be skipped if they're deemed too costly. >> > >> >> > >> >> > >> > 3. By my count, the number of lines of code associated with >> > maintaining >> > >> > both >> > >> > String and Parse Tree representations of content is 62, in >> > >> > MutableHtmlContent.java. >> > >> >> > >> >> > >> Lines of code is a bogus metric, and you know it. There's far more to >> > >> maintaining the more complex logic than a few lines of code. >> > > >> > > >> > >> >> > >> The biggest cost is not even in our own code, it's in keeping a clean >> > >> interface for others to integrate against. We've been horrible API >> > >> designers >> > >> thus far (constantly breaking interfaces, changing data structures, >> and >> > >> generally creating source incompatibility), and that needs to stop -- >> > >> we're >> > >> releasing soon. The most consistently proven way to write quality >> APIs >> > >> with >> > >> as little backwards compatibility issues as possible is to write >> clean, >> > >> simple interfaces that are as small as possible. >> > > >> > > >> > > It's not a complete metric, no doubt, which is what I acknowledged. I >> > also >> > > agree that writing a quality API that we actually do set in relative >> > stone >> > > is of paramount importance. The fact that we're breaking the >> > ContentRewriter >> > > interface right now is a given, however, so the question becomes what >> to >> > > change it to and how to get there. >> > > >> > > Today the candidates are: >> > > ContentRewriter.rewrite(HttpRequest context, >> [MutableContent|ParseTree] >> > > content) >> > > ContentRewriter.rewrite([Gadget gadget | GadgetContext context, >> > > [MutableContent|ParseTree] content)); >> > > >> > > We have two interface decisions: >> > > 1. Use MutableContent or ParseTree. You and Louis strongly prefer >> > > ParseTree. I prefer MutableContent at present. In my view, to users of >> > the >> > > Rewriter interface will almost always simply take >> content.getParseTree() >> > and >> > > call it a day. Others will take content.getContent() and be finished: >> not >> > > especially difficult from a usage perspective. Meantime, calling code >> > would >> > > simply construct a MutableContent, pass it into the rewriters, and >> > > getContent() afterward to render. >> > > >> > > 2. Pass in Gadget, or GadgetContext plus some rewritable content. Once >> > > JsLibraries are made into standard ParseTree nodes, or we agree to >> stick >> > > with the separate notion of JsLibrary forever (forever = as long as we >> > don't >> > > change the rewriter interface), I'm happy to go the latter route. I >> could >> > > use your help with either of these. >> > > >> > > Again, I too want to solidify these interfaces as soon as possible. If >> we >> > > had a wicked-fast parser that can handle the vast majority of gadget >> > input >> > > we have today (a requirement to make Substitutions and JsLibraries >> > Rewriters >> > > from an implementation perspective), we could use that. But we don't, >> > yet, >> > > and we need to use this code in the reasonably near term. >> > > >> > > >> > >> >> > >> >> > >> >> > >> > True, there's some complexity inherent in two >> > >> > representations being made available, and the corresponding burden >> to >> > >> > ensure >> > >> > that all String-based rewriters are bundled together and all >> > Parse-based >> > >> > ones are as well. Just about every rewriter will use only one >> > >> > representation, though, so the latter burden is the major issue. >> Given >> > >> our >> > >> > dogged insistence (rightly so) on performance/latency reduction, >> isn't >> > >> that >> > >> > a reasonable price to pay? >> > >> >> > >> >> > >> No. Again, you're assuming that parsing automatically and eternally >> > means >> > >> "expensive". Just looking at cajita should tell you that it's not an >> > >> issue. >> > >> We create the right architecture, and then we worry about >> optimization >> > if >> > >> we >> > >> need to. >> > > >> > > >> > > Not so. I'm acknowledging that "relatively expensive" is the case >> today >> > > (and comparatively expensive to String manipulation by definition, the >> > > weaker point once parsing is "fast enough"), and that it will take >> > > nontrivial amounts of work to get to the point where that's not the >> case >> > any >> > > longer. >> > > >> > > >> > >> >> > >> >> > >> I suppose it's ironic that I'm championing the String representation >> > given >> > >> > the main thrust of my changes is to introduce a parse tree. If you >> > both >> > >> - >> > >> > and anyone else in the community with an opinion! - still feel that >> > >> parse >> > >> > tree only is the way to go, I can accede... though to set >> > expectations, >> > >> > it'll take me a little while to get to making those mods. In the >> > >> meantime, >> > >> > I'd like to add *some* ability to retrieve <intermediary/mutable >> > content >> > >> > representation, currently MutableHtmlContent, later maybe just >> parse >> > >> tree> >> > >> > from an HttpResponse in such a way that it can (through >> collaboration >> > >> with >> > >> > HttpResponseBuilder) be shared between HttpResponse instances to >> avoid >> > >> > re-parsing in a chain of Rewriters. Seem reasonable? Like Kevin >> said, >> > >> we're >> > >> > dealing with code, not stone here, and we can fiddle with the >> relevant >> > >> APIs >> > >> > until the balance feels right. >> > >> >> > >> >> > >> No, just fix the interface. Why do you want to go through these >> > >> unnecessary >> > >> intermediate steps? Fixing the interface and tweaking the existing >> > >> implementations is infinitely easier than mucking around >> "temporarily" >> > >> with >> > >> HttpResponse. If you think it's too much work or don't have the time >> to >> > do >> > >> this, just say so and one of us will do it -- I've been held up on >> > >> completing proxied content because of this and I'd rather not wait >> any >> > >> more. >> > >> The current code base is not in a stable, usable state to integrate >> > >> against >> > >> and it needs to be moved back to that state rapidly. >> > > >> > > >> > > Looking at HttpResponse a little more, it's painfully clear that it >> > > shouldn't be mucked with even temporarily. Passing in *some* content >> > param >> > > is the solution. >> > > >> > > Re: being held up and amount of work. Can you elaborate how you're >> held >> > up >> > > by this? What do you plan on implementing in terms of ContentRewriters >> > that >> > > I couldn't, for instance, change while making interface modifications >> > > described above? >> > > >> > > I'm always happy to receive help. The major piece is to resolve the >> > > interface. If you can A) own the task of optimizing Caja's parser so >> it's >> > > suitable for application on every request for every piece of gadget >> > content >> > > we have, or B) agree we won't implement JsLibraries and Substitutions >> as >> > > Rewriters, then we're done. We go parse tree only. >> > > >> > > >> > >> >> > >> >> > >> There's no reason why a chain of rewriters should have to re-parse a >> > piece >> > >> of content, and there's no reason to change HttpResponse. Pass the >> parse >> > >> tree as an explicit parameter and move on. Even if it did require >> > >> re-parsing >> > >> right now, it wouldn't matter because there's exactly 1 rewriter to >> > worry >> > >> about at the moment. >> > > >> > > >> > > Not so. The "1 rewriter" is actually 3 rewriters bundled behind the >> old >> > > hack-y interface we all agree needs to change. See my latest commit >> > > splitting ParseTreeDefaultContentRewriter into 3 separate rewriters >> that >> > > employ the new-style API for gadget rewriting. >> > > >> > > >> > >> > >
