Re: [jira] Commented: (SHINDIG-571) Create shared intermediary representation of mutable HTML content for Gadget and HttpResponse to use.

Kevin Brown Fri, 05 Sep 2008 12:17:01 -0700

On Fri, Sep 5, 2008 at 11:02 AM, John Hjelmstad <[EMAIL PROTECTED]> wrote:


> Addendum argument for using something like MutableContent rather than
> ParseTree always: as you said, not all content is HTML (agreed,
> MutableHtmlContent is not the right name)...
> We seem to agree that HttpResponse rewriting via an API that looks
> schematically like so:
>
> ContentRewriter.rewrite(HttpRequest req, HttpResponse orig,
> SomethingEditable editable)
>
> SomethingEditable needs to be passed in, as opposed to the return value of
> the method being rewritten content, in order to ensure that we don't
> re-parse in the HTML case.
>
> The concrete use case for MutableContent rather than ParseTree is
> mimeType="text/css". We could parse this into some canonical parse tree
> (top-level node wrapping text or a future CssTree), but requiring that we
> do
> so seems a little odd to me. Other mime types provide different examples.


I'd prefer that the parse tree be unified and that the rewriter can
programmatically determine if the thing it's rewriting is CSS or HTML (or
javascript). I don't care what the name is, but "html" anything is probably
inappropriate.

This is of paramount importance because you may be rewriting external CSS
(common for rewriting of HttpResponse objects) or you may be rewriting
inline CSS (common for rewriting of Gadget objects).

A tree might look like this:

Tree
|__HtmlNode[head]
|  |__HtmlNode[style]
|  |   |__CssNode[...]
|__HtmlNode[body]
    |__HtmlNode[div]
    |__...

Or it might just look like this:

Tree
|__CssNode[...]
|__CssNode[...]

The rewriter can trivially handle both cases (or choose not to handle one or
the other, if that's the nature of the rewriter).


>
>
> Thoughts?
>
> --John
>
> On Fri, Sep 5, 2008 at 10:18 AM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
>
> > On Fri, Sep 5, 2008 at 1:45 AM, Kevin Brown <[EMAIL PROTECTED]> wrote:
> >
> >> On Thu, Sep 4, 2008 at 11:53 PM, John Hjelmstad <[EMAIL PROTECTED]>
> wrote:
> >>
> >> > Hey Louis:
> >> >
> >> > I appreciate the vote of confidence in the parse tree approach :)
> ...and
> >> > agree that the numbers on rewriting are pretty encouraging. That said,
> a
> >> > few
> >> > more points on which I'd like to get your thoughts:
> >> >
> >> > 1. The case for String based substitution applies to any circumstance
> in
> >> > which a given expansion spans the equivalent of multiple nodes.
> Because
> >> > GadgetHtmlNode treats text-type nodes as literal text, the expansion
> of
> >> a
> >> > message whose value is, for instance, "<b>foo!</b>" will in fact be
> >> > rendered
> >> > "&lt;b&gt;foo!&lt;&gt;". That's not necessarily a bad thing, but it
> and
> >> > cases like it are almost certainly non-backward-compatible with _some_
> >> use
> >> > cases out there.
> >>
> >>
> >> So is whitespace removal. It's not enough to warrant becoming software
> >> contortionists.
> >>
> >>
> >> > 2. We've discussed using this new content rewriting infrastructure as
> a
> >> way
> >> > of implementing UP, MSG, and other Substitutions, and potentially
> other
> >> > basic content manipulations. Past functional regressions (however
> >> > dismissable, per #1), satisfying these sorts of requests will be made
> >> > inherently more expensive, and by a considerable margin, due to
> parsing
> >> > cost. Further, they will fail when the parser fails. The alternative
> is
> >> to
> >> > keep these routines as-is. That's reasonable, but in general there
> >> remains
> >> > a
> >> > set of requests for which parse tree manipulation simply isn't
> >> necessary,
> >> > which would be made much more expensive if a parse tree is the only
> way
> >> to
> >> > manipulate content through a rewriter.
> >>
> >>
> >> You keep coming back to 'expensive'. Parsing should not be expensive. If
> >
> > it's too expensive to do in real time for all requests, it's too
> expensive
> >> period and it needs to be revisited. Once this code stabilizes, I can
> >> already see many places that can have performance improvements, and I'm
> >> sure
> >> there are many more if we start sending patches to caja.
> >>
> >> While parsing is certainly not free, it can be provably demonstrated
> that
> >> it's quite possible to write an efficient html parser for these
> purposes.
> >> There are hundreds of thousands of websites that are using far less
> >> efficient html parsers to do everything from input validation to
> template
> >> processing without issue.
> >
> >
> > I don't dispute this. My entire point is that it's inherently more
> > expensive than manipulating a String directly, and notably more expensive
> > given today's parser (more comments about this below).
> >
> > I'd like to understand what our position is regarding CPU usage in
> > rewriting. Commenting to SHINDIG-500, you said the following when I
> proposed
> > temporarily removing caching of rewritten content - in that case, only
> for
> > the minority of gadgets that opted into rewriting in the first place:
> "No,
> > it isn't acceptable.. As Louis and I have posted on other bugs, rewriting
> > (and associated sub-processes) is one of the most expensive operations
> done
> > by Shindig today (after RSA signing and content encoding detection). We
> need
> > to optimize this in the long run anyway, but we can't afford to rewrite
> > every render in the short term either".
> >
> > A reasonable chunk of the work I've done on this rewriting stuff is
> > designed to maintain prior performance characteristics (before we clean
> up
> > how that's implemented too). The only breaking change I've made is to
> modify
> > ContentRewriter's Gadget-rewriting interface, which we all agree needs to
> > change anyway.
> >
> >
> >>
> >>
> >> Parsing is absolutely required to handle opensocial templates server
> side.
> >>
> >> What might be expensive here is analysis, such as content reordering,
> but
> >> those steps can be skipped if they're deemed too costly.
> >>
> >>
> >> > 3. By my count, the number of lines of code associated with
> maintaining
> >> > both
> >> > String and Parse Tree representations of content is 62, in
> >> > MutableHtmlContent.java.
> >>
> >>
> >> Lines of code is a bogus metric, and you know it. There's far more to
> >> maintaining the more complex logic than a few lines of code.
> >
> >
> >>
> >> The biggest cost is not even in our own code, it's in keeping a clean
> >> interface for others to integrate against. We've been horrible API
> >> designers
> >> thus far (constantly breaking interfaces, changing data structures, and
> >> generally creating source incompatibility), and that needs to stop --
> >> we're
> >> releasing soon. The most consistently proven way to write quality APIs
> >> with
> >> as little backwards compatibility issues as possible is to write clean,
> >> simple interfaces that are as small as possible.
> >
> >
> > It's not a complete metric, no doubt, which is what I acknowledged. I
> also
> > agree that writing a quality API that we actually do set in relative
> stone
> > is of paramount importance. The fact that we're breaking the
> ContentRewriter
> > interface right now is a given, however, so the question becomes what to
> > change it to and how to get there.
> >
> > Today the candidates are:
> > ContentRewriter.rewrite(HttpRequest context, [MutableContent|ParseTree]
> > content)
> > ContentRewriter.rewrite([Gadget gadget | GadgetContext context,
> > [MutableContent|ParseTree] content));
> >
> > We have two interface decisions:
> > 1. Use MutableContent or ParseTree. You and Louis strongly prefer
> > ParseTree. I prefer MutableContent at present. In my view, to users of
> the
> > Rewriter interface will almost always simply take content.getParseTree()
> and
> > call it a day. Others will take content.getContent() and be finished: not
> > especially difficult from a usage perspective. Meantime, calling code
> would
> > simply construct a MutableContent, pass it into the rewriters, and
> > getContent() afterward to render.
> >
> > 2. Pass in Gadget, or GadgetContext plus some rewritable content. Once
> > JsLibraries are made into standard ParseTree nodes, or we agree to stick
> > with the separate notion of JsLibrary forever (forever = as long as we
> don't
> > change the rewriter interface), I'm happy to go the latter route. I could
> > use your help with either of these.
> >
> > Again, I too want to solidify these interfaces as soon as possible. If we
> > had a wicked-fast parser that can handle the vast majority of gadget
> input
> > we have today (a requirement to make Substitutions and JsLibraries
> Rewriters
> > from an implementation perspective), we could use that. But we don't,
> yet,
> > and we need to use this code in the reasonably near term.
> >
> >
> >>
> >>
> >>
> >> > True, there's some complexity inherent in two
> >> > representations being made available, and the corresponding burden to
> >> > ensure
> >> > that all String-based rewriters are bundled together and all
> Parse-based
> >> > ones are as well. Just about every rewriter will use only one
> >> > representation, though, so the latter burden is the major issue. Given
> >> our
> >> > dogged insistence (rightly so) on performance/latency reduction, isn't
> >> that
> >> > a reasonable price to pay?
> >>
> >>
> >> No. Again, you're assuming that parsing automatically and eternally
> means
> >> "expensive". Just looking at cajita should tell you that it's not an
> >> issue.
> >> We create the right architecture, and then we worry about optimization
> if
> >> we
> >> need to.
> >
> >
> > Not so. I'm acknowledging that "relatively expensive" is the case today
> > (and comparatively expensive to String manipulation by definition, the
> > weaker point once parsing is "fast enough"), and that it will take
> > nontrivial amounts of work to get to the point where that's not the case
> any
> > longer.
> >
> >
> >>
> >>
> >> I suppose it's ironic that I'm championing the String representation
> given
> >> > the main thrust of my changes is to introduce a parse tree. If you
> both
> >> -
> >> > and anyone else in the community with an opinion! - still feel that
> >> parse
> >> > tree only is the way to go, I can accede... though to set
> expectations,
> >> > it'll take me a little while to get to making those mods. In the
> >> meantime,
> >> > I'd like to add *some* ability to retrieve <intermediary/mutable
> content
> >> > representation, currently MutableHtmlContent, later maybe just parse
> >> tree>
> >> > from an HttpResponse in such a way that it can (through collaboration
> >> with
> >> > HttpResponseBuilder) be shared between HttpResponse instances to avoid
> >> > re-parsing in a chain of Rewriters. Seem reasonable? Like Kevin said,
> >> we're
> >> > dealing with code, not stone here, and we can fiddle with the relevant
> >> APIs
> >> > until the balance feels right.
> >>
> >>
> >> No, just fix the interface. Why do you want to go through these
> >> unnecessary
> >> intermediate steps? Fixing the interface and tweaking the existing
> >> implementations is infinitely easier than mucking around "temporarily"
> >> with
> >> HttpResponse. If you think it's too much work or don't have the time to
> do
> >> this, just say so and one of us will do it -- I've been held up on
> >> completing proxied content because of this and I'd rather not wait any
> >> more.
> >> The current code base is not in a stable, usable state to integrate
> >> against
> >> and it needs to be moved back to that state rapidly.
> >
> >
> > Looking at HttpResponse a little more, it's painfully clear that it
> > shouldn't be mucked with even temporarily. Passing in *some* content
> param
> > is the solution.
> >
> > Re: being held up and amount of work. Can you elaborate how you're held
> up
> > by this? What do you plan on implementing in terms of ContentRewriters
> that
> > I couldn't, for instance, change while making interface modifications
> > described above?
> >
> > I'm always happy to receive help. The major piece is to resolve the
> > interface. If you can A) own the task of optimizing Caja's parser so it's
> > suitable for application on every request for every piece of gadget
> content
> > we have, or B) agree we won't implement JsLibraries and Substitutions as
> > Rewriters, then we're done. We go parse tree only.
> >
> >
> >>
> >>
> >> There's no reason why a chain of rewriters should have to re-parse a
> piece
> >> of content, and there's no reason to change HttpResponse. Pass the parse
> >> tree as an explicit parameter and move on. Even if it did require
> >> re-parsing
> >> right now, it wouldn't matter because there's exactly 1 rewriter to
> worry
> >> about at the moment.
> >
> >
> > Not so. The "1 rewriter" is actually 3 rewriters bundled behind the old
> > hack-y interface we all agree needs to change. See my latest commit
> > splitting ParseTreeDefaultContentRewriter into 3 separate rewriters that
> > employ the new-style API for gadget rewriting.
> >
> >
>

Re: [jira] Commented: (SHINDIG-571) Create shared intermediary representation of mutable HTML content for Gadget and HttpResponse to use.

Reply via email to