Re: [jira] Commented: (SHINDIG-571) Create shared intermediary representation of mutable HTML content for Gadget and HttpResponse to use.

Kevin Brown Fri, 05 Sep 2008 01:45:52 -0700

On Thu, Sep 4, 2008 at 11:53 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:


> Hey Louis:
>
> I appreciate the vote of confidence in the parse tree approach :) ...and
> agree that the numbers on rewriting are pretty encouraging. That said, a
> few
> more points on which I'd like to get your thoughts:
>
> 1. The case for String based substitution applies to any circumstance in
> which a given expansion spans the equivalent of multiple nodes. Because
> GadgetHtmlNode treats text-type nodes as literal text, the expansion of a
> message whose value is, for instance, "<b>foo!</b>" will in fact be
> rendered
> "&lt;b&gt;foo!&lt;&gt;". That's not necessarily a bad thing, but it and
> cases like it are almost certainly non-backward-compatible with _some_ use
> cases out there.


So is whitespace removal. It's not enough to warrant becoming software
contortionists.


> 2. We've discussed using this new content rewriting infrastructure as a way
> of implementing UP, MSG, and other Substitutions, and potentially other
> basic content manipulations. Past functional regressions (however
> dismissable, per #1), satisfying these sorts of requests will be made
> inherently more expensive, and by a considerable margin, due to parsing
> cost. Further, they will fail when the parser fails. The alternative is to
> keep these routines as-is. That's reasonable, but in general there remains
> a
> set of requests for which parse tree manipulation simply isn't necessary,
> which would be made much more expensive if a parse tree is the only way to
> manipulate content through a rewriter.


You keep coming back to 'expensive'. Parsing should not be expensive. If
it's too expensive to do in real time for all requests, it's too expensive
period and it needs to be revisited. Once this code stabilizes, I can
already see many places that can have performance improvements, and I'm sure
there are many more if we start sending patches to caja.

While parsing is certainly not free, it can be provably demonstrated that
it's quite possible to write an efficient html parser for these purposes.
There are hundreds of thousands of websites that are using far less
efficient html parsers to do everything from input validation to template
processing without issue.

Parsing is absolutely required to handle opensocial templates server side.

What might be expensive here is analysis, such as content reordering, but
those steps can be skipped if they're deemed too costly.


> 3. By my count, the number of lines of code associated with maintaining
> both
> String and Parse Tree representations of content is 62, in
> MutableHtmlContent.java.


Lines of code is a bogus metric, and you know it. There's far more to
maintaining the more complex logic than a few lines of code.

The biggest cost is not even in our own code, it's in keeping a clean
interface for others to integrate against. We've been horrible API designers
thus far (constantly breaking interfaces, changing data structures, and
generally creating source incompatibility), and that needs to stop -- we're
releasing soon. The most consistently proven way to write quality APIs with
as little backwards compatibility issues as possible is to write clean,
simple interfaces that are as small as possible.


> True, there's some complexity inherent in two
> representations being made available, and the corresponding burden to
> ensure
> that all String-based rewriters are bundled together and all Parse-based
> ones are as well. Just about every rewriter will use only one
> representation, though, so the latter burden is the major issue. Given our
> dogged insistence (rightly so) on performance/latency reduction, isn't that
> a reasonable price to pay?


No. Again, you're assuming that parsing automatically and eternally means
"expensive". Just looking at cajita should tell you that it's not an issue.
We create the right architecture, and then we worry about optimization if we
need to.

I suppose it's ironic that I'm championing the String representation given
> the main thrust of my changes is to introduce a parse tree. If you both -
> and anyone else in the community with an opinion! - still feel that parse
> tree only is the way to go, I can accede... though to set expectations,
> it'll take me a little while to get to making those mods. In the meantime,
> I'd like to add *some* ability to retrieve <intermediary/mutable content
> representation, currently MutableHtmlContent, later maybe just parse tree>
> from an HttpResponse in such a way that it can (through collaboration with
> HttpResponseBuilder) be shared between HttpResponse instances to avoid
> re-parsing in a chain of Rewriters. Seem reasonable? Like Kevin said, we're
> dealing with code, not stone here, and we can fiddle with the relevant APIs
> until the balance feels right.


No, just fix the interface. Why do you want to go through these unnecessary
intermediate steps? Fixing the interface and tweaking the existing
implementations is infinitely easier than mucking around "temporarily" with
HttpResponse. If you think it's too much work or don't have the time to do
this, just say so and one of us will do it -- I've been held up on
completing proxied content because of this and I'd rather not wait any more.
The current code base is not in a stable, usable state to integrate against
and it needs to be moved back to that state rapidly.

There's no reason why a chain of rewriters should have to re-parse a piece
of content, and there's no reason to change HttpResponse. Pass the parse
tree as an explicit parameter and move on. Even if it did require re-parsing
right now, it wouldn't matter because there's exactly 1 rewriter to worry
about at the moment.

Re: [jira] Commented: (SHINDIG-571) Create shared intermediary representation of mutable HTML content for Gadget and HttpResponse to use.

Reply via email to