Re: Content Rewriter Modularization: Design/Change

John Hjelmstad Tue, 12 Aug 2008 16:48:56 -0700

Hello,

While beginning to refactor the rewriter APIs, I've discovered that there is
unfortunately one semantic difference inherent in moving the getContent() and
setContent() methods into the Gadget object (replacing
View.get/setRewrittenContent()): BasicGadgetSpecFactory no longer caches
rewritten content.
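
To illustrate the difference, here is a rough sketch. It assumes the factory
caches spec objects and that rewritten content used to travel with the cached
spec's View. The classes below are simplified, hypothetical stand-ins, not
Shindig's actual signatures:

```java
// Hypothetical, simplified stand-ins for Shindig's classes. Names and
// signatures are assumptions made for illustration only.
final class GadgetSpec {
    private final String viewContent;

    GadgetSpec(String viewContent) {
        this.viewContent = viewContent;
    }

    // The spec only ever exposes the original, unrewritten content.
    String getViewContent() {
        return viewContent;
    }
}

final class Gadget {
    private final GadgetSpec spec;
    private String content;

    Gadget(GadgetSpec spec) {
        this.spec = spec;
        // Each Gadget starts from the spec's original content.
        this.content = spec.getViewContent();
    }

    String getContent() {
        return content;
    }

    // Rewriters store their output here, on the Gadget, not on the spec.
    void setContent(String newContent) {
        this.content = newContent;
    }

    GadgetSpec getSpec() {
        return spec;
    }
}
```

Because the rewritten string lives on the Gadget, a cache that holds only
GadgetSpec objects never sees it, so each new Gadget built from a cached spec
has to be rewritten again.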


I've written a discussion of this in issue SHINDIG-500, which tracks this
implementation sub-task: https://issues.apache.org/jira/browse/SHINDIG-500

To summarize:
1. Is this change acceptable for the time being?
2. I suggest that we could, at a later date, move fetching of gadget specs
into GadgetServer and inject a Gadget(Spec) cache there as well, which would
offer finer-grained control over caching characteristics.
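
As a minimal sketch of what suggestion 2 might look like: the SpecCache
interface, MapSpecCache, and SketchGadgetServer below are hypothetical names,
and specs are reduced to plain strings to keep the example short. Nothing
here is existing Shindig API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical cache abstraction; not an existing Shindig interface.
interface SpecCache {
    String get(String url);
    void put(String url, String spec);
}

// Trivial in-memory implementation. A real one could add expiry, size
// limits, or separate entries for rewritten content.
final class MapSpecCache implements SpecCache {
    private final Map<String, String> entries = new HashMap<>();

    @Override
    public String get(String url) {
        return entries.get(url);
    }

    @Override
    public void put(String url, String spec) {
        entries.put(url, spec);
    }
}

// Simplified stand-in for GadgetServer: it fetches specs itself and
// consults whatever cache was injected at construction time.
final class SketchGadgetServer {
    private final SpecCache cache;
    private final Function<String, String> fetcher;
    int fetchCount = 0; // counts cache misses, for demonstration only

    SketchGadgetServer(SpecCache cache, Function<String, String> fetcher) {
        this.cache = cache;
        this.fetcher = fetcher;
    }

    String getSpec(String url) {
        String spec = cache.get(url);
        if (spec == null) {
            // Cache miss: fetch once, then remember the result.
            spec = fetcher.apply(url);
            fetchCount++;
            cache.put(url, spec);
        }
        return spec;
    }
}
```

Since the cache is passed in rather than created inside the factory, a
container could swap in its own policy without touching the fetching code.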

Thanks,
John

On Mon, Aug 11, 2008 at 2:20 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:

> I understand these concerns, and to be clear: despite my personal interest
> in experimenting with the idea, I agree that we don't have time for it at
> the moment, and I have no plans to introduce this sort of RPC anywhere.
> Certainly not in Shindig itself, as any such call would be hidden behind
> an interface anyway.
>
> Putting the RPC hypothetical aside, I still feel that there's value to
> implementing HTML parsing in terms of an interface:
> * Clearer separation of concerns/boundary between projects.
>   - Corollary simplicity in testing.
> * Clearer API for content manipulation (that doesn't require knowledge of
> Caja).
>
> I could be convinced otherwise, but at this point the code involved seems
> to be of manageable size, so I think it is still worth doing. Thoughts?
>
> John
>
>
>
> On Mon, Aug 11, 2008 at 1:00 PM, Kevin Brown <[EMAIL PROTECTED]> wrote:
>
>> I agree with Louis -- that's just not practical. Every rewriting operation
>> must work in real time. Caja's existing html parser is adequate for our
>> needs, and we shouldn't go out of our way to tolerate every oddity of
>> random
>> web browsers (especially as it simply wouldn't work unless you farmed it
>> out
>> to *every* browser). Any new code needs to be grounded in practical,
>> current
>> needs, not theoretical options. We can always change code later if we find
>> a
>> real need for something like that. We have real work to do in the
>> meantime.
>>
>> On Mon, Aug 11, 2008 at 12:06 PM, Louis Ryan <[EMAIL PROTECTED]> wrote:
>>
>> > John,
>> >
>> > From a practicality standpoint I'm a little nervous about this plan to
>> make
>> > RPC calls out of a Java process to a native process to fetch a parse
>> tree
>> > for transformations that have to occur in real time. I don't think the
>> > motivating factor here is to accept all inputs that browsers can. Gadget
>> > developers will tailor their markup to the platform as they have done
>> > already. I would greatly prefer us to pick one 'good' parser and stick
>> with
>> > it for all the manageability and consumability benefits that come with
>> that
>> > decision. Perhaps I'm missing something here?
>> >
>> > -Louis
>> >
>> > On Mon, Aug 11, 2008 at 11:59 AM, John Hjelmstad <[EMAIL PROTECTED]>
>> wrote:
>> >
>> > > On Fri, Aug 8, 2008 at 6:10 AM, Ben Laurie <[EMAIL PROTECTED]> wrote:
>> > >
>> > > > [+google-caja-discuss]
>> > > >
>> > > > On Thu, Aug 7, 2008 at 9:27 PM, John Hjelmstad <[EMAIL PROTECTED]>
>> > wrote:
>> > > > > On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <[EMAIL PROTECTED]>
>> wrote:
>> > > > >
>> > > > >> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <
>> [EMAIL PROTECTED]>
>> > > > wrote:
>> > > > >> > This proposal effectively enables the renderer to become a
>> > > multi-pass
>> > > > >> > compiler for gadget content (essentially, arbitrary web
>> content).
>> > > Such
>> > > > a
>> > > > >> > compiler can provide several benefits: static optimization of
>> > gadget
>> > > > >> content
>> > > > >> > (auto-proxying of images, whitespace/comment removal,
>> > consolidation
>> > > of
>> > > > >> CSS
>> > > > >> > blocks), security benefits (caja et al), new functionality
>> > > (annotation
>> > > > of
>> > > > >> > content for stats, document analysis, container-specific
>> > features),
>> > > > etc.
>> > > > >> To
>> > > > >> > my knowledge no such infrastructure exists today (with the
>> > possible
>> > > > >> > exception of Caja itself, which I'd like to dovetail with this
>> > > work).
>> > > > >>
>> > > > >> Caja clearly provides a large chunk of the code you'd need for
>> this.
>> > > > >> I'd like to hear how we'd manage to avoid duplication between the
>> > two
>> > > > >> projects.
>> > > > >>
>> > > > >> A generalised framework for manipulating content sounds like a
>> great
>> > > > >> idea, but probably should not live in either of the two projects
>> > (Caja
>> > > > >> and Shindig) but rather should be shared by both of them, I
>> suspect.
>> > > > >
>> > > > >
>> > > > > I agree on both counts. As I mentioned, the piece of this idea
>> that I
>> > > > expect
>> > > > > to change the most is the parse tree, and Caja's .parser.html and
>> > > > > .parser.css packages contain much of what I've thrown in here as a
>> > > base.
>> > > > >
>> > > > > My key requirements are:
>> > > > > * Lightweight framework.
>> > > > > * Parser modularity, mostly for HTML parsers (to re-use the good
>> work
>> > > > done
>> > > > > by WebKit or Gecko; CSS/JS can come directly from Caja, I'd bet).
>> > > > > * Automatic maintenance of DOM<->String conversion.
>> > > > > * Easy to manipulate structure.
>> > > >
>> > > > I'm not sure what the value of parser modularity is? If the
>> resulting
>> > > > tree is different, then that's a problem for people processing the
>> > > > tree. And if it is not, then why do we care?
>> > >
>> > >
>> > > IMO the value of parser modularity is that the lenient parsers native
>> to
>> > > browsers can be used in place of those that might not accept all
>> inputs.
>> > > One
>> > > could (and I'd like to) adapt WebKit or Gecko's parsing code into a
>> > server
>> > > that runs parallel to Shindig and provides a "local RPC" service for
>> > > parsing
>> > > semi-structured HTML. The resulting tree for WebKit's parser might be
>> > > different than that for an XHTML parser, Gecko's parser, etc, but if
>> the
>> > > algorithm implemented atop it is rule-based rather than
>> strict-structure
>> > > based that should be fine, no?
>> > >
>> > >
>> > > >
>> > > >
>> > > > >
>> > > > > I'd love to see both projects share the same base syntax tree
>> > > > > representations. I considered .parser.html(.DomTree) and
>> .parser.css
>> > > for
>> > > > > these, but at the moment these appeared to be a little more tied
>> to
>> > > > Caja's
>> > > > > lexer/parser implementation than I preferred (though I admit
>> > > > > AbstractParseTreeNode contains most of what's needed).
>> > > > >
>> > > > > To be sure, I don't see this as an end-all-be-all transformation
>> > system
>> > > > in
>> > > > > any way. I'd just like to put *something* reasonable in place that
>> we
>> > > can
>> > > > > play with, provide some benefit, and enhance into a truly
>> > sophisticated
>> > > > > vision of document rewriting.
>> > > > >
>> > > > >
>> > > > >>
>> > > > >>
>> > > > >> >  c. Add Gadget.getParsedContent().
>> > > > >> >    i. Returns a mutable GadgetContentParseTree used to
>> manipulate
>> > > > Gadget
>> > > > >> > Contents.
>> > > > >> >    ii. Mutable tree calls back to the Gadget object indicating
>> > when
>> > > > any
>> > > > >> > change is made, and emits an error if setContent() has been
>> called
>> > > in
>> > > > the
>> > > > >> > interim.
>> > > > >>
>> > > > >> In Caja we have been moving towards immutable trees...
>> > > > >
>> > > > >
>> > > > > Interested to hear more about this. The whole idea is for the
>> > gadget's
>> > > > tree
>> > > > > representation to be modifiable. Doing that with immutable trees
>> to
>> > me
>> > > > > suggests that a rewriter would have to create a completely new
>> tree
>> > and
>> > > > set
>> > > > > it as a representation of new content. That's convenient as far as
>> > the
>> > > > > Gadget's maintenance of String<->Tree representations is
>> concerned...
>> > > but
>> > > > > seems pretty heavyweight for many types of edits: in-situ
>> > modifications
>> > > > of
>> > > > > text, content reordering, etc. That's particularly so in a
>> > > > single-threaded
>> > > > > (viz rewriting) environment.
>> > > >
>> > > > Never having been entirely sold on the concept, I'll let those on
>> the
>> > > > Caja team who advocate immutability explain why.
>> > > >
>> > >
>> >
>>
>
>
