Re: [jira] Commented: (SHINDIG-571) Create shared intermediary representation of mutable HTML content for Gadget and HttpResponse to use.

John Hjelmstad Fri, 05 Sep 2008 12:36:50 -0700

On Fri, Sep 5, 2008 at 12:30 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:


> I completely agree that this is a great place to go for the parse tree. But
> 2 concerns:
> 1. It'll take a while to get there. The CSS piece is actually somewhat
> easy.. I've already written a basic CssNode along with a tested CssParser
> interface and implementation... but it might prove incomplete at this point
> (I just have Rules composed of multiple Selectors and Declarations -- no
> comments, structured Declaration values, etc yet) The JS piece will be
> significantly more difficult. Maybe implementing one or both to our full
> satisfaction is a requirement to move on from all this rewriter work; I was
> hoping we could remain optimistic
>

...yeesh, somehow that sentence got mangled in a copy-and-paste job. I had
said I was optimistic we could just use a String in cases where we don't yet
have a parse tree for whatever content type is applicable.


>
> 2. A somewhat more esoteric, but potentially valid point for the future:
> what about non-browser-displayable content types? In theory rewriters could
> operate on, say, images for sanitization or optimization purposes.. XML
> (yes, HTML-like, but not exactly HTML?) for transformations.. JSON for same,
> or security rewrites or something..
>
> Comment #2 gets to the heart of my previous comment about hedging bets,
> since I still don't find the String+Tree wrapper all that confusing from an
> API point of view. On the other hand, ensuring that we jam everything into
> ParseTree feels like it might limit us at some point, which would lead to
> another API change.
>
> --John
>
>
> On Fri, Sep 5, 2008 at 12:16 PM, Kevin Brown <[EMAIL PROTECTED]> wrote:
>
>> On Fri, Sep 5, 2008 at 11:02 AM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
>>
>> > Addendum argument for using something like MutableContent rather than
>> > ParseTree always: as you said, not all content is HTML (agreed,
>> > MutableHtmlContent is not the right name)...
>> > We seem to agree that HttpResponse rewriting via an API that looks
>> > schematically like so:
>> >
>> > ContentRewriter.rewrite(HttpRequest req, HttpResponse orig,
>> > SomethingEditable editable)
>> >
>> > SomethingEditable needs to be passed in, as opposed to the return value
>> of
>> > the method being rewritten content, in order to ensure that we don't
>> > re-parse in the HTML case.
>> >
>> > The concrete use case for MutableContent rather than ParseTree is
>> > mimeType="text/css". We could parse this into some canonical parse tree
>> > (top-level node wrapping text or a future CssTree), but requiring that
>> we
>> > do
>> > so seems a little odd to me. Other mime types provide different
>> examples.
>>
>>
>> I'd prefer that the parse tree be unified and that the rewriter can
>> programmatically determine if the thing it's rewriting is CSS or HTML (or
>> javascript). I don't care what the name is, but "html" anything is
>> probably
>> inappropriate.
>>
>> This is of paramount importance because you may be rewriting external CSS
>> (common for rewriting of HttpResponse objects) or you may be rewriting
>> inline CSS (common for rewriting of Gadget objects).
>>
>> A tree might look like this:
>>
>> Tree
>> |__HtmlNode[head]
>> |  |__HtmlNode[style]
>> |  |   |__CssNode[...]
>> |__HtmlNode[body]
>>    |__HtmlNode[div]
>>    |__...
>>
>> Or it might just look like this:
>>
>> Tree
>> |__CssNode[...]
>> |__CssNode[...]
>>
>> The rewriter can trivially handle both cases (or choose not to handle one
>> or
>> the other, if that's the nature of the rewriter).
>>
>>
>> >
>> >
>> > Thoughts?
>> >
>> > --John
>> >
>> > On Fri, Sep 5, 2008 at 10:18 AM, John Hjelmstad <[EMAIL PROTECTED]>
>> wrote:
>> >
>> > > On Fri, Sep 5, 2008 at 1:45 AM, Kevin Brown <[EMAIL PROTECTED]> wrote:
>> > >
>> > >> On Thu, Sep 4, 2008 at 11:53 PM, John Hjelmstad <[EMAIL PROTECTED]>
>> > wrote:
>> > >>
>> > >> > Hey Louis:
>> > >> >
>> > >> > I appreciate the vote of confidence in the parse tree approach :)
>> > ...and
>> > >> > agree that the numbers on rewriting are pretty encouraging. That
>> said,
>> > a
>> > >> > few
>> > >> > more points on which I'd like to get your thoughts:
>> > >> >
>> > >> > 1. The case for String based substitution applies to any
>> circumstance
>> > in
>> > >> > which a given expansion spans the equivalent of multiple nodes.
>> > Because
>> > >> > GadgetHtmlNode treats text-type nodes as literal text, the
>> expansion
>> > of
>> > >> a
>> > >> > message whose value is, for instance, "<b>foo!</b>" will in fact be
>> > >> > rendered
>> > >> > "&lt;b&gt;foo!&lt;&gt;". That's not necessarily a bad thing, but it
>> > and
>> > >> > cases like it are almost certainly non-backward-compatible with
>> _some_
>> > >> use
>> > >> > cases out there.
>> > >>
>> > >>
>> > >> So is whitespace removal. It's not enough to warrant becoming
>> software
>> > >> contortionists.
>> > >>
>> > >>
>> > >> > 2. We've discussed using this new content rewriting infrastructure
>> as
>> > a
>> > >> way
>> > >> > of implementing UP, MSG, and other Substitutions, and potentially
>> > other
>> > >> > basic content manipulations. Past functional regressions (however
>> > >> > dismissable, per #1), satisfying these sorts of requests will be
>> made
>> > >> > inherently more expensive, and by a considerable margin, due to
>> > parsing
>> > >> > cost. Further, they will fail when the parser fails. The
>> alternative
>> > is
>> > >> to
>> > >> > keep these routines as-is. That's reasonable, but in general there
>> > >> remains
>> > >> > a
>> > >> > set of requests for which parse tree manipulation simply isn't
>> > >> necessary,
>> > >> > which would be made much more expensive if a parse tree is the only
>> > way
>> > >> to
>> > >> > manipulate content through a rewriter.
>> > >>
>> > >>
>> > >> You keep coming back to 'expensive'. Parsing should not be expensive.
>> If
>> > >
>> > > it's too expensive to do in real time for all requests, it's too
>> > expensive
>> > >> period and it needs to be revisited. Once this code stabilizes, I can
>> > >> already see many places that can have performance improvements, and
>> I'm
>> > >> sure
>> > >> there are many more if we start sending patches to caja.
>> > >>
>> > >> While parsing is certainly not free, it can be provably demonstrated
>> > that
>> > >> it's quite possible to write an efficient html parser for these
>> > purposes.
>> > >> There are hundreds of thousands of websites that are using far less
>> > >> efficient html parsers to do everything from input validation to
>> > template
>> > >> processing without issue.
>> > >
>> > >
>> > > I don't dispute this. My entire point is that it's inherently more
>> > > expensive than manipulating a String directly, and notably more
>> expensive
>> > > given today's parser (more comments about this below).
>> > >
>> > > I'd like to understand what our position is regarding CPU usage in
>> > > rewriting. Commenting to SHINDIG-500, you said the following when I
>> > proposed
>> > > temporarily removing caching of rewritten content - in that case, only
>> > for
>> > > the minority of gadgets that opted into rewriting in the first place:
>> > "No,
>> > > it isn't acceptable.. As Louis and I have posted on other bugs,
>> rewriting
>> > > (and associated sub-processes) is one of the most expensive operations
>> > done
>> > > by Shindig today (after RSA signing and content encoding detection).
>> We
>> > need
>> > > to optimize this in the long run anyway, but we can't afford to
>> rewrite
>> > > every render in the short term either".
>> > >
>> > > A reasonable chunk of the work I've done on this rewriting stuff is
>> > > designed to maintain prior performance characteristics (before we
>> clean
>> > up
>> > > how that's implemented too). The only breaking change I've made is to
>> > modify
>> > > ContentRewriter's Gadget-rewriting interface, which we all agree needs
>> to
>> > > change anyway.
>> > >
>> > >
>> > >>
>> > >>
>> > >> Parsing is absolutely required to handle opensocial templates server
>> > side.
>> > >>
>> > >> What might be expensive here is analysis, such as content reordering,
>> > but
>> > >> those steps can be skipped if they're deemed too costly.
>> > >>
>> > >>
>> > >> > 3. By my count, the number of lines of code associated with
>> > maintaining
>> > >> > both
>> > >> > String and Parse Tree representations of content is 62, in
>> > >> > MutableHtmlContent.java.
>> > >>
>> > >>
>> > >> Lines of code is a bogus metric, and you know it. There's far more to
>> > >> maintaining the more complex logic than a few lines of code.
>> > >
>> > >
>> > >>
>> > >> The biggest cost is not even in our own code, it's in keeping a clean
>> > >> interface for others to integrate against. We've been horrible API
>> > >> designers
>> > >> thus far (constantly breaking interfaces, changing data structures,
>> and
>> > >> generally creating source incompatibility), and that needs to stop --
>> > >> we're
>> > >> releasing soon. The most consistently proven way to write quality
>> APIs
>> > >> with
>> > >> as little backwards compatibility issues as possible is to write
>> clean,
>> > >> simple interfaces that are as small as possible.
>> > >
>> > >
>> > > It's not a complete metric, no doubt, which is what I acknowledged. I
>> > also
>> > > agree that writing a quality API that we actually do set in relative
>> > stone
>> > > is of paramount importance. The fact that we're breaking the
>> > ContentRewriter
>> > > interface right now is a given, however, so the question becomes what
>> to
>> > > change it to and how to get there.
>> > >
>> > > Today the candidates are:
>> > > ContentRewriter.rewrite(HttpRequest context,
>> [MutableContent|ParseTree]
>> > > content)
>> > > ContentRewriter.rewrite([Gadget gadget | GadgetContext context,
>> > > [MutableContent|ParseTree] content));
>> > >
>> > > We have two interface decisions:
>> > > 1. Use MutableContent or ParseTree. You and Louis strongly prefer
>> > > ParseTree. I prefer MutableContent at present. In my view, to users of
>> > the
>> > > Rewriter interface will almost always simply take
>> content.getParseTree()
>> > and
>> > > call it a day. Others will take content.getContent() and be finished:
>> not
>> > > especially difficult from a usage perspective. Meantime, calling code
>> > would
>> > > simply construct a MutableContent, pass it into the rewriters, and
>> > > getContent() afterward to render.
>> > >
>> > > 2. Pass in Gadget, or GadgetContext plus some rewritable content. Once
>> > > JsLibraries are made into standard ParseTree nodes, or we agree to
>> stick
>> > > with the separate notion of JsLibrary forever (forever = as long as we
>> > don't
>> > > change the rewriter interface), I'm happy to go the latter route. I
>> could
>> > > use your help with either of these.
>> > >
>> > > Again, I too want to solidify these interfaces as soon as possible. If
>> we
>> > > had a wicked-fast parser that can handle the vast majority of gadget
>> > input
>> > > we have today (a requirement to make Substitutions and JsLibraries
>> > Rewriters
>> > > from an implementation perspective), we could use that. But we don't,
>> > yet,
>> > > and we need to use this code in the reasonably near term.
>> > >
>> > >
>> > >>
>> > >>
>> > >>
>> > >> > True, there's some complexity inherent in two
>> > >> > representations being made available, and the corresponding burden
>> to
>> > >> > ensure
>> > >> > that all String-based rewriters are bundled together and all
>> > Parse-based
>> > >> > ones are as well. Just about every rewriter will use only one
>> > >> > representation, though, so the latter burden is the major issue.
>> Given
>> > >> our
>> > >> > dogged insistence (rightly so) on performance/latency reduction,
>> isn't
>> > >> that
>> > >> > a reasonable price to pay?
>> > >>
>> > >>
>> > >> No. Again, you're assuming that parsing automatically and eternally
>> > means
>> > >> "expensive". Just looking at cajita should tell you that it's not an
>> > >> issue.
>> > >> We create the right architecture, and then we worry about
>> optimization
>> > if
>> > >> we
>> > >> need to.
>> > >
>> > >
>> > > Not so. I'm acknowledging that "relatively expensive" is the case
>> today
>> > > (and comparatively expensive to String manipulation by definition, the
>> > > weaker point once parsing is "fast enough"), and that it will take
>> > > nontrivial amounts of work to get to the point where that's not the
>> case
>> > any
>> > > longer.
>> > >
>> > >
>> > >>
>> > >>
>> > >> I suppose it's ironic that I'm championing the String representation
>> > given
>> > >> > the main thrust of my changes is to introduce a parse tree. If you
>> > both
>> > >> -
>> > >> > and anyone else in the community with an opinion! - still feel that
>> > >> parse
>> > >> > tree only is the way to go, I can accede... though to set
>> > expectations,
>> > >> > it'll take me a little while to get to making those mods. In the
>> > >> meantime,
>> > >> > I'd like to add *some* ability to retrieve <intermediary/mutable
>> > content
>> > >> > representation, currently MutableHtmlContent, later maybe just
>> parse
>> > >> tree>
>> > >> > from an HttpResponse in such a way that it can (through
>> collaboration
>> > >> with
>> > >> > HttpResponseBuilder) be shared between HttpResponse instances to
>> avoid
>> > >> > re-parsing in a chain of Rewriters. Seem reasonable? Like Kevin
>> said,
>> > >> we're
>> > >> > dealing with code, not stone here, and we can fiddle with the
>> relevant
>> > >> APIs
>> > >> > until the balance feels right.
>> > >>
>> > >>
>> > >> No, just fix the interface. Why do you want to go through these
>> > >> unnecessary
>> > >> intermediate steps? Fixing the interface and tweaking the existing
>> > >> implementations is infinitely easier than mucking around
>> "temporarily"
>> > >> with
>> > >> HttpResponse. If you think it's too much work or don't have the time
>> to
>> > do
>> > >> this, just say so and one of us will do it -- I've been held up on
>> > >> completing proxied content because of this and I'd rather not wait
>> any
>> > >> more.
>> > >> The current code base is not in a stable, usable state to integrate
>> > >> against
>> > >> and it needs to be moved back to that state rapidly.
>> > >
>> > >
>> > > Looking at HttpResponse a little more, it's painfully clear that it
>> > > shouldn't be mucked with even temporarily. Passing in *some* content
>> > param
>> > > is the solution.
>> > >
>> > > Re: being held up and amount of work. Can you elaborate how you're
>> held
>> > up
>> > > by this? What do you plan on implementing in terms of ContentRewriters
>> > that
>> > > I couldn't, for instance, change while making interface modifications
>> > > described above?
>> > >
>> > > I'm always happy to receive help. The major piece is to resolve the
>> > > interface. If you can A) own the task of optimizing Caja's parser so
>> it's
>> > > suitable for application on every request for every piece of gadget
>> > content
>> > > we have, or B) agree we won't implement JsLibraries and Substitutions
>> as
>> > > Rewriters, then we're done. We go parse tree only.
>> > >
>> > >
>> > >>
>> > >>
>> > >> There's no reason why a chain of rewriters should have to re-parse a
>> > piece
>> > >> of content, and there's no reason to change HttpResponse. Pass the
>> parse
>> > >> tree as an explicit parameter and move on. Even if it did require
>> > >> re-parsing
>> > >> right now, it wouldn't matter because there's exactly 1 rewriter to
>> > worry
>> > >> about at the moment.
>> > >
>> > >
>> > > Not so. The "1 rewriter" is actually 3 rewriters bundled behind the
>> old
>> > > hack-y interface we all agree needs to change. See my latest commit
>> > > splitting ParseTreeDefaultContentRewriter into 3 separate rewriters
>> that
>> > > employ the new-style API for gadget rewriting.
>> > >
>> > >
>> >
>>
>
>

Re: [jira] Commented: (SHINDIG-571) Create shared intermediary representation of mutable HTML content for Gadget and HttpResponse to use.

Reply via email to