Re: [jira] Commented: (SHINDIG-571) Create shared intermediary representation of mutable HTML content for Gadget and HttpResponse to use.

Kevin Brown Fri, 05 Sep 2008 13:39:15 -0700

On Fri, Sep 5, 2008 at 12:30 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:


> I completely agree that this is a great place to go for the parse tree. But
> 2 concerns:
> 1. It'll take a while to get there. The CSS piece is actually somewhat
> easy.. I've already written a basic CssNode along with a tested CssParser
> interface and implementation... but it might prove incomplete at this point
> (I just have Rules composed of multiple Selectors and Declarations -- no
> comments, structured Declaration values, etc yet) The JS piece will be
> significantly more difficult. Maybe implementing one or both to our full
> satisfaction is a requirement to move on from all this rewriter work; I was
> hoping we could remain optimistic


I don't think this is an issue either, since the only substantial JS
optimization we can reasonably do now is comment and whitespace removal. If
the js elements are just text nodes right now, we'll just rewrite that
string using a regex or some trivial tokenization routine.


>
> 2. A somewhat more esoteric, but potentially valid point for the future:
> what about non-browser-displayable content types? In theory rewriters could
> operate on, say, images for sanitization or optimization purposes.. XML
> (yes, HTML-like, but not exactly HTML?) for transformations.. JSON for
> same,
> or security rewrites or something..


That's not going to happen any time in the near future, and when it does
we'll have more than enough room to release a new version of Shindig. We've
wasted way too much time speculating about potential future issues.


>
> Comment #2 gets to the heart of my previous comment about hedging bets,
> since I still don't find the String+Tree wrapper all that confusing from an
> API point of view. On the other hand, ensuring that we jam everything into
> ParseTree feels like it might limit us at some point, which would lead to
> another API change.


 The API has to remain stable across major version boundaries. If Shindig
2.0 (based on OpenSocial 0.9, for example) needs to substantially change
some interfaces, that's OK, especially if we're providing an
AbstractRewriter class that they extend from instead of directly
implementing the Rewriter interface. This is a very common practice and it
gives us plenty of freedom to evolve the API without breaking compatibility.


>
> --John
>
> On Fri, Sep 5, 2008 at 12:16 PM, Kevin Brown <[EMAIL PROTECTED]> wrote:
>
> > On Fri, Sep 5, 2008 at 11:02 AM, John Hjelmstad <[EMAIL PROTECTED]>
> wrote:
> >
> > > Addendum argument for using something like MutableContent rather than
> > > ParseTree always: as you said, not all content is HTML (agreed,
> > > MutableHtmlContent is not the right name)...
> > > We seem to agree that HttpResponse rewriting via an API that looks
> > > schematically like so:
> > >
> > > ContentRewriter.rewrite(HttpRequest req, HttpResponse orig,
> > > SomethingEditable editable)
> > >
> > > SomethingEditable needs to be passed in, as opposed to the return value
> > of
> > > the method being rewritten content, in order to ensure that we don't
> > > re-parse in the HTML case.
> > >
> > > The concrete use case for MutableContent rather than ParseTree is
> > > mimeType="text/css". We could parse this into some canonical parse tree
> > > (top-level node wrapping text or a future CssTree), but requiring that
> we
> > > do
> > > so seems a little odd to me. Other mime types provide different
> examples.
> >
> >
> > I'd prefer that the parse tree be unified and that the rewriter can
> > programmatically determine if the thing it's rewriting is CSS or HTML (or
> > javascript). I don't care what the name is, but "html" anything is
> probably
> > inappropriate.
> >
> > This is of paramount importance because you may be rewriting external CSS
> > (common for rewriting of HttpResponse objects) or you may be rewriting
> > inline CSS (common for rewriting of Gadget objects).
> >
> > A tree might look like this:
> >
> > Tree
> > |__HtmlNode[head]
> > |  |__HtmlNode[style]
> > |  |   |__CssNode[...]
> > |__HtmlNode[body]
> >    |__HtmlNode[div]
> >    |__...
> >
> > Or it might just look like this:
> >
> > Tree
> > |__CssNode[...]
> > |__CssNode[...]
> >
> > The rewriter can trivially handle both cases (or choose not to handle one
> > or
> > the other, if that's the nature of the rewriter).
> >
> >
> > >
> > >
> > > Thoughts?
> > >
> > > --John
> > >
> > > On Fri, Sep 5, 2008 at 10:18 AM, John Hjelmstad <[EMAIL PROTECTED]>
> > wrote:
> > >
> > > > On Fri, Sep 5, 2008 at 1:45 AM, Kevin Brown <[EMAIL PROTECTED]> wrote:
> > > >
> > > >> On Thu, Sep 4, 2008 at 11:53 PM, John Hjelmstad <[EMAIL PROTECTED]>
> > > wrote:
> > > >>
> > > >> > Hey Louis:
> > > >> >
> > > >> > I appreciate the vote of confidence in the parse tree approach :)
> > > ...and
> > > >> > agree that the numbers on rewriting are pretty encouraging. That
> > said,
> > > a
> > > >> > few
> > > >> > more points on which I'd like to get your thoughts:
> > > >> >
> > > >> > 1. The case for String based substitution applies to any
> > circumstance
> > > in
> > > >> > which a given expansion spans the equivalent of multiple nodes.
> > > Because
> > > >> > GadgetHtmlNode treats text-type nodes as literal text, the
> expansion
> > > of
> > > >> a
> > > >> > message whose value is, for instance, "<b>foo!</b>" will in fact
> be
> > > >> > rendered
> > > >> > "&lt;b&gt;foo!&lt;&gt;". That's not necessarily a bad thing, but
> it
> > > and
> > > >> > cases like it are almost certainly non-backward-compatible with
> > _some_
> > > >> use
> > > >> > cases out there.
> > > >>
> > > >>
> > > >> So is whitespace removal. It's not enough to warrant becoming
> software
> > > >> contortionists.
> > > >>
> > > >>
> > > >> > 2. We've discussed using this new content rewriting infrastructure
> > as
> > > a
> > > >> way
> > > >> > of implementing UP, MSG, and other Substitutions, and potentially
> > > other
> > > >> > basic content manipulations. Past functional regressions (however
> > > >> > dismissable, per #1), satisfying these sorts of requests will be
> > made
> > > >> > inherently more expensive, and by a considerable margin, due to
> > > parsing
> > > >> > cost. Further, they will fail when the parser fails. The
> alternative
> > > is
> > > >> to
> > > >> > keep these routines as-is. That's reasonable, but in general there
> > > >> remains
> > > >> > a
> > > >> > set of requests for which parse tree manipulation simply isn't
> > > >> necessary,
> > > >> > which would be made much more expensive if a parse tree is the
> only
> > > way
> > > >> to
> > > >> > manipulate content through a rewriter.
> > > >>
> > > >>
> > > >> You keep coming back to 'expensive'. Parsing should not be
> expensive.
> > If
> > > >
> > > > it's too expensive to do in real time for all requests, it's too
> > > expensive
> > > >> period and it needs to be revisited. Once this code stabilizes, I
> can
> > > >> already see many places that can have performance improvements, and
> > I'm
> > > >> sure
> > > >> there are many more if we start sending patches to caja.
> > > >>
> > > >> While parsing is certainly not free, it can be provably demonstrated
> > > that
> > > >> it's quite possible to write an efficient html parser for these
> > > purposes.
> > > >> There are hundreds of thousands of websites that are using far less
> > > >> efficient html parsers to do everything from input validation to
> > > template
> > > >> processing without issue.
> > > >
> > > >
> > > > I don't dispute this. My entire point is that it's inherently more
> > > > expensive than manipulating a String directly, and notably more
> > expensive
> > > > given today's parser (more comments about this below).
> > > >
> > > > I'd like to understand what our position is regarding CPU usage in
> > > > rewriting. Commenting to SHINDIG-500, you said the following when I
> > > proposed
> > > > temporarily removing caching of rewritten content - in that case,
> only
> > > for
> > > > the minority of gadgets that opted into rewriting in the first place:
> > > "No,
> > > > it isn't acceptable.. As Louis and I have posted on other bugs,
> > rewriting
> > > > (and associated sub-processes) is one of the most expensive
> operations
> > > done
> > > > by Shindig today (after RSA signing and content encoding detection).
> We
> > > need
> > > > to optimize this in the long run anyway, but we can't afford to
> rewrite
> > > > every render in the short term either".
> > > >
> > > > A reasonable chunk of the work I've done on this rewriting stuff is
> > > > designed to maintain prior performance characteristics (before we
> clean
> > > up
> > > > how that's implemented too). The only breaking change I've made is to
> > > modify
> > > > ContentRewriter's Gadget-rewriting interface, which we all agree
> needs
> > to
> > > > change anyway.
> > > >
> > > >
> > > >>
> > > >>
> > > >> Parsing is absolutely required to handle opensocial templates server
> > > side.
> > > >>
> > > >> What might be expensive here is analysis, such as content
> reordering,
> > > but
> > > >> those steps can be skipped if they're deemed too costly.
> > > >>
> > > >>
> > > >> > 3. By my count, the number of lines of code associated with
> > > maintaining
> > > >> > both
> > > >> > String and Parse Tree representations of content is 62, in
> > > >> > MutableHtmlContent.java.
> > > >>
> > > >>
> > > >> Lines of code is a bogus metric, and you know it. There's far more
> to
> > > >> maintaining the more complex logic than a few lines of code.
> > > >
> > > >
> > > >>
> > > >> The biggest cost is not even in our own code, it's in keeping a
> clean
> > > >> interface for others to integrate against. We've been horrible API
> > > >> designers
> > > >> thus far (constantly breaking interfaces, changing data structures,
> > and
> > > >> generally creating source incompatibility), and that needs to stop
> --
> > > >> we're
> > > >> releasing soon. The most consistently proven way to write quality
> APIs
> > > >> with
> > > >> as little backwards compatibility issues as possible is to write
> > clean,
> > > >> simple interfaces that are as small as possible.
> > > >
> > > >
> > > > It's not a complete metric, no doubt, which is what I acknowledged. I
> > > also
> > > > agree that writing a quality API that we actually do set in relative
> > > stone
> > > > is of paramount importance. The fact that we're breaking the
> > > ContentRewriter
> > > > interface right now is a given, however, so the question becomes what
> > to
> > > > change it to and how to get there.
> > > >
> > > > Today the candidates are:
> > > > ContentRewriter.rewrite(HttpRequest context,
> [MutableContent|ParseTree]
> > > > content)
> > > > ContentRewriter.rewrite([Gadget gadget | GadgetContext context,
> > > > [MutableContent|ParseTree] content));
> > > >
> > > > We have two interface decisions:
> > > > 1. Use MutableContent or ParseTree. You and Louis strongly prefer
> > > > ParseTree. I prefer MutableContent at present. In my view, to users
> of
> > > the
> > > > Rewriter interface will almost always simply take
> > content.getParseTree()
> > > and
> > > > call it a day. Others will take content.getContent() and be finished:
> > not
> > > > especially difficult from a usage perspective. Meantime, calling code
> > > would
> > > > simply construct a MutableContent, pass it into the rewriters, and
> > > > getContent() afterward to render.
> > > >
> > > > 2. Pass in Gadget, or GadgetContext plus some rewritable content.
> Once
> > > > JsLibraries are made into standard ParseTree nodes, or we agree to
> > stick
> > > > with the separate notion of JsLibrary forever (forever = as long as
> we
> > > don't
> > > > change the rewriter interface), I'm happy to go the latter route. I
> > could
> > > > use your help with either of these.
> > > >
> > > > Again, I too want to solidify these interfaces as soon as possible.
> If
> > we
> > > > had a wicked-fast parser that can handle the vast majority of gadget
> > > input
> > > > we have today (a requirement to make Substitutions and JsLibraries
> > > Rewriters
> > > > from an implementation perspective), we could use that. But we don't,
> > > yet,
> > > > and we need to use this code in the reasonably near term.
> > > >
> > > >
> > > >>
> > > >>
> > > >>
> > > >> > True, there's some complexity inherent in two
> > > >> > representations being made available, and the corresponding burden
> > to
> > > >> > ensure
> > > >> > that all String-based rewriters are bundled together and all
> > > Parse-based
> > > >> > ones are as well. Just about every rewriter will use only one
> > > >> > representation, though, so the latter burden is the major issue.
> > Given
> > > >> our
> > > >> > dogged insistence (rightly so) on performance/latency reduction,
> > isn't
> > > >> that
> > > >> > a reasonable price to pay?
> > > >>
> > > >>
> > > >> No. Again, you're assuming that parsing automatically and eternally
> > > means
> > > >> "expensive". Just looking at cajita should tell you that it's not an
> > > >> issue.
> > > >> We create the right architecture, and then we worry about
> optimization
> > > if
> > > >> we
> > > >> need to.
> > > >
> > > >
> > > > Not so. I'm acknowledging that "relatively expensive" is the case
> today
> > > > (and comparatively expensive to String manipulation by definition,
> the
> > > > weaker point once parsing is "fast enough"), and that it will take
> > > > nontrivial amounts of work to get to the point where that's not the
> > case
> > > any
> > > > longer.
> > > >
> > > >
> > > >>
> > > >>
> > > >> I suppose it's ironic that I'm championing the String representation
> > > given
> > > >> > the main thrust of my changes is to introduce a parse tree. If you
> > > both
> > > >> -
> > > >> > and anyone else in the community with an opinion! - still feel
> that
> > > >> parse
> > > >> > tree only is the way to go, I can accede... though to set
> > > expectations,
> > > >> > it'll take me a little while to get to making those mods. In the
> > > >> meantime,
> > > >> > I'd like to add *some* ability to retrieve <intermediary/mutable
> > > content
> > > >> > representation, currently MutableHtmlContent, later maybe just
> parse
> > > >> tree>
> > > >> > from an HttpResponse in such a way that it can (through
> > collaboration
> > > >> with
> > > >> > HttpResponseBuilder) be shared between HttpResponse instances to
> > avoid
> > > >> > re-parsing in a chain of Rewriters. Seem reasonable? Like Kevin
> > said,
> > > >> we're
> > > >> > dealing with code, not stone here, and we can fiddle with the
> > relevant
> > > >> APIs
> > > >> > until the balance feels right.
> > > >>
> > > >>
> > > >> No, just fix the interface. Why do you want to go through these
> > > >> unnecessary
> > > >> intermediate steps? Fixing the interface and tweaking the existing
> > > >> implementations is infinitely easier than mucking around
> "temporarily"
> > > >> with
> > > >> HttpResponse. If you think it's too much work or don't have the time
> > to
> > > do
> > > >> this, just say so and one of us will do it -- I've been held up on
> > > >> completing proxied content because of this and I'd rather not wait
> any
> > > >> more.
> > > >> The current code base is not in a stable, usable state to integrate
> > > >> against
> > > >> and it needs to be moved back to that state rapidly.
> > > >
> > > >
> > > > Looking at HttpResponse a little more, it's painfully clear that it
> > > > shouldn't be mucked with even temporarily. Passing in *some* content
> > > param
> > > > is the solution.
> > > >
> > > > Re: being held up and amount of work. Can you elaborate how you're
> held
> > > up
> > > > by this? What do you plan on implementing in terms of
> ContentRewriters
> > > that
> > > > I couldn't, for instance, change while making interface modifications
> > > > described above?
> > > >
> > > > I'm always happy to receive help. The major piece is to resolve the
> > > > interface. If you can A) own the task of optimizing Caja's parser so
> > it's
> > > > suitable for application on every request for every piece of gadget
> > > content
> > > > we have, or B) agree we won't implement JsLibraries and Substitutions
> > as
> > > > Rewriters, then we're done. We go parse tree only.
> > > >
> > > >
> > > >>
> > > >>
> > > >> There's no reason why a chain of rewriters should have to re-parse a
> > > piece
> > > >> of content, and there's no reason to change HttpResponse. Pass the
> > parse
> > > >> tree as an explicit parameter and move on. Even if it did require
> > > >> re-parsing
> > > >> right now, it wouldn't matter because there's exactly 1 rewriter to
> > > worry
> > > >> about at the moment.
> > > >
> > > >
> > > > Not so. The "1 rewriter" is actually 3 rewriters bundled behind the
> old
> > > > hack-y interface we all agree needs to change. See my latest commit
> > > > splitting ParseTreeDefaultContentRewriter into 3 separate rewriters
> > that
> > > > employ the new-style API for gadget rewriting.
> > > >
> > > >
> > >
> >
>

Re: [jira] Commented: (SHINDIG-571) Create shared intermediary representation of mutable HTML content for Gadget and HttpResponse to use.

Reply via email to