On Wed, Sep 24, 2008 at 8:15 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:

> On Wed, Sep 24, 2008 at 10:51 AM, Kevin Brown <[EMAIL PROTECTED]> wrote:
>
> > >
> > > It seems this is the crux of the problem - the rendering rewriter
> > > (which, I admit, I hadn't inspected as closely as I should have until
> > > today). That either makes adding the rendering rewriter to a
> > > ContentRewriterRegistry a bad idea (it can still be called outside this
> > > context), or calls for some more intelligent caching (which would be a
> > > side effect of moving the rendering rewriter out of the registry).
> >
> > You seem to be under the impression that rewriting is purely for
> > optimization, which is not the case. Every other aspect of the rendering
> > pipeline has been hinging on this, and it has been backed up extensively
> > while you were making the changes to rewriting. The original plan was to
> > take the caja parse tree and use it for doing all the necessary
> > manipulation to produce the output markup. This has been the plan since
> > July, right here on this mailing list. We needed a structure representing
> > the document that we could manipulate in a variety of ways, with many
> > different components performing modifications to the source document to
> > produce the final output. One rewriter might insert javascript, another
> > might inject configuration data, another would perform variable
> > substitution, and others might perform optimization.
>
>
> I've long been aware of this approach. But it's not a foregone conclusion
> how rewriters are run, which is what this discussion is about. Let's decide
> how we want that to work.
>
> Given:
> A) I have an immediate and pressing need for rewriters that are
> caching-sensitive (or alternatively, the ability to say that we'll run
> every rewriter on every pass without caching at all -- caching which this
> list has repeatedly said is a necessity).
> B) Past my needs, there is a full class of rewriters that are
> caching-sensitive anyway, which we should accommodate now, at least in
> terms of API, so we can solidify it ahead of 1.0.
> C) We're implementing a lot of rendering atop rewriter-linked
> infrastructure, which is not cacheable.
>
> Do we:
> 1. Eschew all rewriter caching.
> 2. Make CachingContentRewriterRegistry smart enough to bundle rewriters for
> caching decisions. Ideas on mechanisms:
>  i) Inspecting RewriterResults to determine cache-batching by some kind of
> measurement and heuristics. [Sounds painful, difficult, and imprecise.]
>  ii) Add a method to ContentRewriter indicating whether a given pass can
> possibly cache anything. CachingContentRewriterRegistry batches caching
> decisions based on the results.
>  iii) Manually batch rewrites in rough fashion, something like: core
> features first (maybe cached?), optimizations next (cached), rendering last
> (not cached).
>  iv) Add some concept of rewriter "groups" to
> CachingContentRewriterRegistry, e.g. change its ctor to
> CCRR(List<RewriterGroup>, ...) -- where RewriterGroup has a String label,
> a boolean cacheEnabled, and a List<ContentRewriter>. Caching decisions are
> made by group.
>
> I vote 2(iii) or 2(iv). In fact, 2(iv) could simply be a generalization of
> 2(iii), with CCRR automatically registering "core" RewriterGroups in
> appropriate order by default, preventing the need for everyone to
> reconstruct rewriters that are necessary anyway.



i, iii, and iv are overly complicated solutions to a very simple problem:

i is impractical, if not impossible.

iii is unnecessary. There's no such thing as a "core feature" when it comes
to rewriting -- there are just things that modify the document. Some of
these things are cacheable, others are not. You can't feasibly interleave
the two, so you always have to do the cacheable things first. Even if you
really have a compelling case for TTLs on cacheable things, you still have
to group all the cacheable things together, unless you're going to create
separate cache entries for each stage, which is horribly wasteful.
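
To make the grouping concrete, here's a minimal sketch of what I mean. The
ContentRewriter interface and its isCacheable() method are stand-ins for
illustration (roughly option 2(ii) above), not the real API:

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for the real rewriter interface; isCacheable() is the
// hypothetical method from option 2(ii).
interface ContentRewriter {
  String rewrite(String content);
  boolean isCacheable();
}

class SketchRegistry {
  private final List<ContentRewriter> rewriters;
  // One cache entry per source document, covering the combined output of
  // all cacheable passes.
  private final Map<String, String> cache =
      new ConcurrentHashMap<String, String>();

  SketchRegistry(List<ContentRewriter> rewriters) {
    this.rewriters = rewriters;
  }

  String rewrite(String original) {
    String content = cache.get(original);
    if (content == null) {
      // All cacheable rewriters run as one batch, producing one cache
      // entry. Caching each stage separately would mean N copies of the
      // document instead of one.
      content = original;
      for (ContentRewriter r : rewriters) {
        if (r.isCacheable()) {
          content = r.rewrite(content);
        }
      }
      cache.put(original, content);
    }
    // Uncacheable rewriters (per-request rendering, etc.) always run last,
    // on every request.
    for (ContentRewriter r : rewriters) {
      if (!r.isCacheable()) {
        content = r.rewrite(content);
      }
    }
    return content;
  }
}

The point is just that there's exactly one cache entry, covering every
cacheable pass, and everything uncacheable happens after the cache.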

iv is a generalization of iii, and it's unnecessary for the same reason.

Nobody has to construct rewriters -- they just bind them in their Guice
module. If they want to swap out an optimizing rewriter, or a rewriter used
for injecting configuration, or whatever, they can do so. It doesn't matter
to the rest of the code at all, and rewriters can be blissfully ignorant of
one another. If order really matters (within the cacheable / uncacheable
groups), they can control this when they configure their binding.
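
For example (a sketch only -- the rewriter class names are made up, and this
reuses the stand-in ContentRewriter interface from the sketch above, not the
real API):

import com.google.inject.AbstractModule;
import com.google.inject.TypeLiteral;
import java.util.Arrays;
import java.util.List;

// Made-up rewriter implementations, standing in for whatever a deployer
// actually binds.
class OptimizingRewriter implements ContentRewriter {
  public String rewrite(String content) { return content; }
  public boolean isCacheable() { return true; }
}

class ConfigInjectingRewriter implements ContentRewriter {
  public String rewrite(String content) { return content; }
  public boolean isCacheable() { return false; }
}

public class RewriterModule extends AbstractModule {
  protected void configure() {
    // The order of the list is the execution order (within the cacheable /
    // uncacheable split), so controlling order is just a matter of how the
    // list is assembled. Swapping an implementation is a one-line change
    // here, invisible to the rest of the code.
    bind(new TypeLiteral<List<ContentRewriter>>() {}).toInstance(
        Arrays.asList(
            (ContentRewriter) new OptimizingRewriter(),
            new ConfigInjectingRewriter()));
  }
}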

We don't have to worry about people forgetting to bind an important
rewriter. They'll see the problem right away (the expected output will be
incorrect), and even if they didn't, there's no reason why we should care.
We may as well worry that people are going to write a custom Guice module
and forget to bind an HTTP fetcher or something.

Option 1 may very well be viable, but nobody can know until there are real
benchmarks. Parsing isn't a free operation, but it's not the most expensive
thing either (it pales in comparison to, say, an RSA signed fetch). Option 2
(in any variant) is workable, but all variants require huge compromises in
what we can do to the document: if a pass isn't cacheable, and parsing is
expensive, you're going to wind up with ugly string manipulation like you
see in RenderingContentRewriter right now.

Once again, I think our best bet is to stop speculating and do some real
benchmarks, figuring out what the latency and CPU consumption are when
using the parser. There are many, many projects that parse HTML documents
in real time -- there's little reason why we shouldn't be able to as well.
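
Something like this rough harness would get us a first number. The
HtmlParser interface and its no-op stand-in are placeholders; plug in
whichever parser we actually want to measure (Caja's, neko, or whatever)
before trusting any numbers:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Placeholder for the parser under test.
interface HtmlParser {
  Object parse(String markup);
}

public class ParseBenchmark {
  public static void main(String[] args) throws IOException {
    String markup = readFile(args[0]); // a representative gadget document

    // No-op stand-in so the harness compiles; replace with the real parser.
    HtmlParser parser = new HtmlParser() {
      public Object parse(String m) { return m; }
    };

    // Warm up so we measure steady state, not classloading and JIT.
    for (int i = 0; i < 100; i++) {
      parser.parse(markup);
    }

    int iterations = 1000;
    long start = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      parser.parse(markup);
    }
    long elapsedNanos = System.nanoTime() - start;
    System.out.println("mean parse time: "
        + (elapsedNanos / iterations / 1000) + "us");
  }

  private static String readFile(String path) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(path));
    try {
      StringBuilder sb = new StringBuilder();
      String line;
      while ((line = in.readLine()) != null) {
        sb.append(line).append('\n');
      }
      return sb.toString();
    } finally {
      in.close();
    }
  }
}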
