Re: Content Rewriter Modularization: Design/Change

Louis Ryan Tue, 12 Aug 2008 19:16:28 -0700

Can we prove this out incrementally, bottom-up? In general I think using a DOM
is the right thing to do from a rewriting standpoint. So here's how I
propose we proceed:


1. If the Caja DOM is a little awkward, wrap it; if not, let's just use it as
is. We can always resolve this later.
2. Change the existing content rewriters to use the DOM instead of a lexer,
which should be pretty easy. Maybe add some fancier rewriting, like moving CSS
into HEAD.
3. Do some perf testing: look into the memory overhead of DOM transformation,
etc.
4. Alter GadgetSpecs to retain the DOM when they are cached.
5. Alter the gadget rendering phase to serialize the content of the DOM to
output.
6. Annotate the DOM at parse time to make render-time user-pref substitutions
faster; this should be easy enough too.
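
For illustration only, a DOM-based rewriter along the lines of step 2 might
look something like the sketch below. It uses the JDK's org.w3c.dom API as a
stand-in for whatever DOM we end up with (Caja's or a wrapper), StyleHoister
is a hypothetical name, and it assumes well-formed markup with a single HEAD.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical DOM-based rewriter: moves every style element into head. */
final class StyleHoister {
  private StyleHoister() {}

  /** Rewrites the tree in place. Assumes the document has one head element. */
  static void hoistStyles(Document doc) {
    Node head = doc.getElementsByTagName("head").item(0);
    // The NodeList is live, so take a snapshot before moving any nodes.
    NodeList live = doc.getElementsByTagName("style");
    List<Node> styles = new ArrayList<>();
    for (int i = 0; i < live.getLength(); i++) {
      styles.add(live.item(i));
    }
    for (Node style : styles) {
      if (style.getParentNode() != head) {
        // appendChild detaches the node from its old parent first.
        head.appendChild(style);
      }
    }
  }

  /** Parse, rewrite, serialize: the whole round trip for one document. */
  static String rewrite(String wellFormedMarkup) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(wellFormedMarkup)));
      hoistStyles(doc);
      Transformer t = TransformerFactory.newInstance().newTransformer();
      // Force XML output so the serializer does not inject HTML-specific tags.
      t.setOutputProperty(OutputKeys.METHOD, "xml");
      t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
      StringWriter out = new StringWriter();
      t.transform(new DOMSource(doc), new StreamResult(out));
      return out.toString();
    } catch (Exception e) {
      throw new IllegalStateException("rewrite failed", e);
    }
  }
}
```

The rewrite() round trip is roughly what steps 2 and 5 add up to; with step 4
in place the parse would happen once and only the serialization would remain
on the render path.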

This should be enough to prove out the pipeline end-to-end and identify any
major perf niggles. Once this is done we can look into how to inject a
rewriter pipeline into the parsing phase and the rendering phase.
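
As a rough idea of what such a pipeline could look like (a sketch only;
DomRewriter and RewriterPipeline are hypothetical names, not existing Shindig
or Caja classes, and the JDK's org.w3c.dom types stand in for the real tree):

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;

/** Hypothetical rewriter contract: each pass mutates the shared DOM in place. */
interface DomRewriter {
  void rewrite(Document doc);
}

/** Hypothetical pipeline that runs its rewriters in registration order. */
final class RewriterPipeline {
  private final List<DomRewriter> rewriters;

  RewriterPipeline(List<DomRewriter> rewriters) {
    // Defensive copy so later changes to the caller's list have no effect.
    this.rewriters = new ArrayList<>(rewriters);
  }

  /** Applies every rewriter to the same document, one after another. */
  void apply(Document doc) {
    for (DomRewriter rewriter : rewriters) {
      rewriter.rewrite(doc);
    }
  }

  /** Convenience for experiments: an empty document with one root element. */
  static Document newDocument(String rootName) {
    try {
      Document doc =
          DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
      doc.appendChild(doc.createElement(rootName));
      return doc;
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException("no DOM implementation available", e);
    }
  }
}
```

One instance could be injected at parse time (its result cached with the
spec) and a second at render time for per-request rewrites.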

-Louis



On Tue, Aug 12, 2008 at 5:57 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:

> Re-responding in order to apply the last few exchanges to
> google-caja-discuss@ (@gmail vs. @google membership issues).
>
> On Tue, Aug 12, 2008 at 4:48 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
>
> > Hello,
> >
> > While beginning to refactor the rewriter APIs I've discovered that there
> > unfortunately is one semantic difference inherent to moving getContent()
> > and setContent() methods into the Gadget object (replacing
> > View.get/setRewrittenContent()): BasicGadgetSpecFactory no longer caches
> > rewritten content.
> >
> > I've written a discussion of this in issue SHINDIG-500, which tracks this
> > implementation sub-task:
> > https://issues.apache.org/jira/browse/SHINDIG-500
> >
> > To summarize:
> > 1. Is this change acceptable for the time being?
> > 2. I suggest that we can, at a later date, move fetching of gadget specs
> > into GadgetServer while injecting a Gadget(Spec) cache there as well,
> > offering finer-tuned control over caching characteristics.
> >
> > Thanks,
> > John
> >
> >
> > On Mon, Aug 11, 2008 at 2:20 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
> >
> >> I understand these concerns, and should be clear that I don't (despite my
> >> personal interest in experimenting with the idea, agreed that we don't have
> >> time for it at the moment) have any plans to introduce this sort of RPC
> >> anywhere - certainly not in Shindig itself, as any such call would be hidden
> >> behind an interface anyway.
> >>
> >> Putting the RPC hypothetical aside, I still feel that there's value to
> >> implementing HTML parsing in terms of an interface:
> >> * Clearer separation of concerns/boundary between projects.
> >>   - Corollary simplicity in testing.
> >> * Clearer API for content manipulation (that doesn't require knowledge of
> >>   Caja).
> >>
> >> I could be convinced otherwise, but at this point the code involved seems
> >> of manageable size, so still worth doing. Thoughts?
> >>
> >> John
> >>
> >>
> >>
> >> On Mon, Aug 11, 2008 at 1:00 PM, Kevin Brown <[EMAIL PROTECTED]> wrote:
> >>
> >>> I agree with Louis -- that's just not practical. Every rewriting operation
> >>> must work in real time. Caja's existing html parser is adequate for our
> >>> needs, and we shouldn't go out of our way to tolerate every oddity of random
> >>> web browsers (especially as it simply wouldn't work unless you farmed it out
> >>> to *every* browser). Any new code needs to be grounded in practical, current
> >>> needs, not theoretical options. We can always change code later if we find a
> >>> real need for something like that. We have real work to do in the meantime.
> >>>
> >>> On Mon, Aug 11, 2008 at 12:06 PM, Louis Ryan <[EMAIL PROTECTED]> wrote:
> >>>
> >>> > John,
> >>> >
> >>> > From a practicality standpoint I'm a little nervous about this plan to make
> >>> > RPC calls out of a Java process to a native process to fetch a parse tree
> >>> > for transformations that have to occur in real time. I don't think the
> >>> > motivating factor here is to accept all inputs that browsers can. Gadget
> >>> > developers will tailor their markup to the platform as they have done
> >>> > already. I would greatly prefer us to pick one 'good' parser and stick with
> >>> > it for all the manageability and consumability benefits that come with that
> >>> > decision. Perhaps I'm missing something here?
> >>> >
> >>> > -Louis
> >>> >
> >>> > On Mon, Aug 11, 2008 at 11:59 AM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
> >>> >
> >>> > > On Fri, Aug 8, 2008 at 6:10 AM, Ben Laurie <[EMAIL PROTECTED]> wrote:
> >>> > >
> >>> > > > [+google-caja-discuss]
> >>> > > >
> >>> > > > On Thu, Aug 7, 2008 at 9:27 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
> >>> > > > > On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <[EMAIL PROTECTED]> wrote:
> >>> > > > >
> >>> > > > >> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
> >>> > > > >> > This proposal effectively enables the renderer to become a multi-pass
> >>> > > > >> > compiler for gadget content (essentially, arbitrary web content). Such a
> >>> > > > >> > compiler can provide several benefits: static optimization of gadget content
> >>> > > > >> > (auto-proxying of images, whitespace/comment removal, consolidation of CSS
> >>> > > > >> > blocks), security benefits (caja et al), new functionality (annotation of
> >>> > > > >> > content for stats, document analysis, container-specific features), etc. To
> >>> > > > >> > my knowledge no such infrastructure exists today (with the possible
> >>> > > > >> > exception of Caja itself, which I'd like to dovetail with this work).
> >>> > > > >>
> >>> > > > >> Caja clearly provides a large chunk of the code you'd need for this.
> >>> > > > >> I'd like to hear how we'd manage to avoid duplication between the two
> >>> > > > >> projects.
> >>> > > > >>
> >>> > > > >> A generalised framework for manipulating content sounds like a great
> >>> > > > >> idea, but probably should not live in either of the two projects (Caja
> >>> > > > >> and Shindig) but rather should be shared by both of them, I suspect.
> >>> > > > >
> >>> > > > >
> >>> > > > > I agree on both counts. As I mentioned, the piece of this idea that I expect
> >>> > > > > to change the most is the parse tree, and Caja's .parser.html and
> >>> > > > > .parser.css packages contain much of what I've thrown in here as a base.
> >>> > > > >
> >>> > > > > My key requirements are:
> >>> > > > > * Lightweight framework.
> >>> > > > > * Parser modularity, mostly for HTML parsers (to re-use the good work done
> >>> > > > >   by WebKit or Gecko.. CSS/JS can come direct from Caja I'd bet)
> >>> > > > > * Automatic maintenance of DOM<->String conversion.
> >>> > > > > * Easy to manipulate structure.
> >>> > > >
> >>> > > > I'm not sure what the value of parser modularity is? If the resulting
> >>> > > > tree is different, then that's a problem for people processing the
> >>> > > > tree. And if it is not, then why do we care?
> >>> > >
> >>> > >
> >>> > > IMO the value of parser modularity is that the lenient parsers native to
> >>> > > browsers can be used in place of those that might not accept all inputs. One
> >>> > > could (and I'd like to) adapt WebKit or Gecko's parsing code into a server
> >>> > > that runs parallel to Shindig and provides a "local RPC" service for parsing
> >>> > > semi-structured HTML. The resulting tree for WebKit's parser might be
> >>> > > different than that for an XHTML parser, Gecko's parser, etc, but if the
> >>> > > algorithm implemented atop it is rule-based rather than strict-structure
> >>> > > based that should be fine, no?
> >>> > >
> >>> > >
> >>> > > >
> >>> > > >
> >>> > > > >
> >>> > > > > I'd love to see both projects share the same base syntax tree
> >>> > > > > representations. I considered .parser.html(.DomTree) and .parser.css for
> >>> > > > > these, but at the moment these appeared to be a little more tied to Caja's
> >>> > > > > lexer/parser implementation than I preferred (though I admit
> >>> > > > > AbstractParseTreeNode contains most of what's needed).
> >>> > > > >
> >>> > > > > To be sure, I don't see this as an end-all-be-all transformation system in
> >>> > > > > any way. I'd just like to put *something* reasonable in place that we can
> >>> > > > > play with, provide some benefit, and enhance into a truly sophisticated
> >>> > > > > vision of document rewriting.
> >>> > > > >
> >>> > > > >
> >>> > > > >>
> >>> > > > >>
> >>> > > > >> >  c. Add Gadget.getParsedContent().
> >>> > > > >> >    i. Returns a mutable GadgetContentParseTree used to manipulate Gadget
> >>> > > > >> >       Contents.
> >>> > > > >> >    ii. Mutable tree calls back to the Gadget object indicating when any
> >>> > > > >> >        change is made, and emits an error if setContent() has been called
> >>> > > > >> >        in the interim.
> >>> > > > >>
> >>> > > > >> In Caja we have been moving towards immutable trees...
> >>> > > > >
> >>> > > > >
> >>> > > > > Interested to hear more about this. The whole idea is for the gadget's tree
> >>> > > > > representation to be modifiable. Doing that with immutable trees to me
> >>> > > > > suggests that a rewriter would have to create a completely new tree and set
> >>> > > > > it as a representation of new content. That's convenient as far as the
> >>> > > > > Gadget's maintenance of String<->Tree representations is concerned... but
> >>> > > > > seems pretty heavyweight for many types of edits: in-situ modifications of
> >>> > > > > text, content reordering, etc. That's particularly so in a single-threaded
> >>> > > > > (viz rewriting) environment.
> >>> > > >
> >>> > > > Never having been entirely sold on the concept, I'll let those on the
> >>> > > > Caja team who advocate immutability explain why.
> >>> > > >
> >>> > >
> >>> >
> >>>
> >>
> >>
> >
>
