Re: [Caja] Re: Content Rewriter Modularization: Design/Change

Ben Laurie Tue, 02 Sep 2008 08:46:19 -0700

On Wed, Aug 27, 2008 at 2:12 AM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
> Agreed with these conclusions. There are a number of avenues open to us for
> rewriter optimization, and per stats, working with the Caja team on an
> optimized, ideally purpose-built-for-Shindig parser is tops on the list.


Caja also has optimisation on the road map. "Working" comes first, though :-)

Curious what "purpose-built-for-Shindig" would entail?

> Re: making ParseTreeDefaultContentRewriter the default... thanks! I plan to
> do this, but FYI I'll be making the actual change only after cleaning up
> plumbing a bit: making GadgetSpec immutable, then cleaning up the rewriter
> interface to consume a (mutable) Gadget, which maintains its own parse tree
> (per the original plan). This in turn will allow
> ParseTreeDefaultContentRewriter to be split into its constituent parts (link
> rewriter, js consolidator, style rewriter), and more importantly allow other
> rewriters to modularly take advantage of the parse tree. Until then, the
> parse tree is hidden from other rewriters.
> Expect these changes soon, along with resuscitation of the discussion on how
> best to handle caching of rewritten content.
> --John
>
> On Mon, Aug 25, 2008 at 5:21 PM, Kevin Brown <[EMAIL PROTECTED]> wrote:
>>
>> I think any arguments one way or another contingent on caching should be
>> ignored, because caching isn't always going to be possible (many
>> makeRequest calls, proxied content, etc.)
>>
>> Instead, we should just make sure that the total time spent rewriting a
>> given piece of content is under some target -- say 50-100ms. We can
>> optimize aggressively to meet this goal, regardless of which approach we
>> take. From the initial results, I think we can simply work with the caja
>> team to improve the performance of the parser to bring it into that range.
>> Having a parse tree is more useful than not having one, so we should go
>> ahead and use the parser instead of the lexer and then just focus on making
>> it better.
>>
>> On Mon, Aug 25, 2008 at 4:44 PM, Louis Ryan <[EMAIL PROTECTED]> wrote:
>>
>> > These numbers seem pretty solid, so I believe it's fine to switch the
>> > general rewriter over to this new code and ditch the lexer-based stuff.
>> > Nice work.
>> >
>> > On Mon, Aug 25, 2008 at 2:27 PM, John Hjelmstad <[EMAIL PROTECTED]>
>> > wrote:
>> >
>> > > I've implemented a parse-tree based ContentRewriter using the existing
>> > > plumbing (same caching semantics et al), as
>> > > ParseTreeDefaultContentRewriter and ParseTreeHtmlRewriter, respectively.
>> > > The latter contains essentially all rewriting functionality from the
>> > > previous HtmlTagTransformer classes. The parse-tree based rewriter is now
>> > > functionally equivalent to the previous rewriter. The new classes use a
>> > > Caja-based HTML parser under the hood. This proves out the functional
>> > > viability of a tree-based rewriter, but performance also needs to be
>> > > assessed.
>> > >
>> > > I've gone ahead and profiled the comparative performance of each
>> > > rewriter, "Lexer" based and "ParseTree" based. To no surprise, "Lexer"
>> > > wins out every time essentially by definition, since obviously Caja's
>> > > parser uses its own lexer under the hood.
>> > >
>> > > Summary:
>> > > * The performance of each is fundamentally O(n), though...
>> > > * For any given input size, Lexer-based rewriting averages between 2.5 -
>> > >   3.5x faster than ParseTree-based (i.e. c =~ 3.5 at worst).
>> > > * By far, the majority of time involved in ParseTree-based optimization
>> > >   is initial parsing: 75% of all processing.
>> > >
>> > > Details:
>> > > 1. I wrote a simple rewriter/parser profiler which rewrites (the sample
>> > > rewriter gadget's content * X repetitions) N times, recording the
>> > > resulting run time. The run time of parse-based rewriting degraded as N
>> > > increased, in all likelihood due to the additional cost of object
>> > > management (lexer-based rewriting involves few intermediate objects).
>> > > Given that the results of rewriting will be variously cached, it's very
>> > > unlikely that rewriting will happen in immediate succession hundreds or
>> > > thousands of times. As such, I fixed N = 1 to re-run the tests in
>> > > relative isolation from one another.
>> > > Results from a given run:
>> > >
>> > > LEX-BASED*100 rewriter, 1 runs in 177047 microsecs [177.04704] millis/run
>> > >
>> > > PARSE-BASED*100 rewriter, 1 runs in 609136 microsecs [609.136128] millis/run
>> > >
>> > > Parse/lex ratio: 3.4405327398939263
>> > >
>> > > LEX-BASED*50 rewriter, 1 runs in 43936 microsecs [43.936] millis/run
>> > >
>> > > PARSE-BASED*50 rewriter, 1 runs in 148980 microsecs [148.979968] millis/run
>> > >
>> > > Parse/lex ratio: 3.3908412235979606
>> > >
>> > > LEX-BASED*10 rewriter, 1 runs in 3093 microsecs [3.092992] millis/run
>> > >
>> > > PARSE-BASED*10 rewriter, 1 runs in 11020 microsecs [11.020032] millis/run
>> > >
>> > > Parse/lex ratio: 3.5628839314581313
>> > >
>> > > LEX-BASED*1 rewriter, 1 runs in 600 microsecs [0.600064] millis/run
>> > >
>> > > PARSE-BASED*1 rewriter, 1 runs in 1819 microsecs [1.819136] millis/run
>> > >
>> > > Parse/lex ratio: 3.0316666666666667
>> > >
>> > >
>> > > 2. Drilling down, I added simple operation profiling to each component of
>> > > parse-tree rewriting: original parse (CajaHtmlParser); building mutable
>> > > tree nodes; rewriting links; concatenating JS nodes; rewriting style
>> > > blocks; rendering parse tree. I then reran the same tests.
>> > >
>> > > Results from subsequent run:
>> > >
>> > > LEX-BASED*100 rewriter, 1 runs in 165321 microsecs [165.32096] millis/run
>> > >
>> > > PARSE-BASED*100 rewriter, 1 runs in 646884 microsecs [646.88384] millis/run
>> > >
>> > > Parse/lex ratio: 3.912896728183352
>> > >
>> > > [PARSE OPS]
>> > >
>> > > Op[style-rewrite] min:25.419ms, max:25.419ms, avg:25.419ms
>> > >
>> > > Op[render] min:36.851ms, max:36.851ms, avg:36.851ms
>> > >
>> > > Op[js-rewrite] min:53.983ms, max:53.983ms, avg:53.983ms
>> > >
>> > > Op[link-rewrite] min:31.136ms, max:31.136ms, avg:31.136ms
>> > >
>> > > Op[build-nodes] min:32.929ms, max:32.929ms, avg:32.929ms
>> > >
>> > > Op[parse] min:464.211ms, max:464.211ms, avg:464.211ms
>> > >
>> > >
>> > > LEX-BASED*50 rewriter, 1 runs in 30684 microsecs [30.683904] millis/run
>> > >
>> > > PARSE-BASED*50 rewriter, 1 runs in 161132 microsecs [161.132032] millis/run
>> > >
>> > > Parse/lex ratio: 5.251336201277539
>> > >
>> > > [PARSE OPS]
>> > >
>> > > Op[style-rewrite] min:8.581ms, max:8.581ms, avg:8.581ms
>> > >
>> > > Op[render] min:5.184ms, max:5.184ms, avg:5.184ms
>> > >
>> > > Op[js-rewrite] min:11.606ms, max:11.606ms, avg:11.606ms
>> > >
>> > > Op[link-rewrite] min:7.533ms, max:7.533ms, avg:7.533ms
>> > >
>> > > Op[build-nodes] min:3.41ms, max:3.41ms, avg:3.41ms
>> > >
>> > > Op[parse] min:121.367ms, max:121.367ms, avg:121.367ms
>> > >
>> > >
>> > > LEX-BASED*10 rewriter, 1 runs in 3371 microsecs [3.371008] millis/run
>> > >
>> > > PARSE-BASED*10 rewriter, 1 runs in 10336 microsecs [10.336] millis/run
>> > >
>> > > Parse/lex ratio: 3.066152477009789
>> > >
>> > > [PARSE OPS]
>> > >
>> > > Op[style-rewrite] min:0.563ms, max:0.563ms, avg:0.563ms
>> > >
>> > > Op[render] min:0.678ms, max:0.678ms, avg:0.678ms
>> > >
>> > > Op[js-rewrite] min:1.374ms, max:1.374ms, avg:1.374ms
>> > >
>> > > Op[link-rewrite] min:0.718ms, max:0.718ms, avg:0.718ms
>> > >
>> > > Op[build-nodes] min:0.295ms, max:0.295ms, avg:0.295ms
>> > >
>> > > Op[parse] min:6.466ms, max:6.466ms, avg:6.466ms
>> > >
>> > >
>> > > LEX-BASED*1 rewriter, 1 runs in 592 microsecs [0.592128] millis/run
>> > >
>> > > PARSE-BASED*1 rewriter, 1 runs in 2083 microsecs [2.083072] millis/run
>> > >
>> > > Parse/lex ratio: 3.518581081081081
>> > >
>> > > [PARSE OPS]
>> > >
>> > > Op[style-rewrite] min:0.082ms, max:0.082ms, avg:0.082ms
>> > >
>> > > Op[render] min:0.077ms, max:0.077ms, avg:0.077ms
>> > >
>> > > Op[js-rewrite] min:0.143ms, max:0.143ms, avg:0.143ms
>> > >
>> > > Op[link-rewrite] min:0.111ms, max:0.111ms, avg:0.111ms
>> > >
>> > > Op[build-nodes] min:0.043ms, max:0.043ms, avg:0.043ms
>> > >
>> > > Op[parse] min:1.437ms, max:1.437ms, avg:1.437ms
>> > >
>> > >
>> > > 3. Drilling further, I wrote a separate test breaking out the performance
>> > > components of parsing: calling the Caja DomParser.parseFragment(...) API,
>> > > and subsequently wrapping the results of that call with ParsedHtmlNode
>> > > objects to satisfy interface requirements:
>> > >
>> > > Typical run:
>> > >
>> > > Caja parser [size*1, runs:1] in 97538 microsecs [97.538048] millis/run
>> > >
>> > > [PARSER COMPONENTS]
>> > >
>> > > Op[raw-caja-parse] min:70.033ms, max:70.033ms, avg:70.033ms
>> > >
>> > > Op[build-parse-nodes] min:3.644ms, max:3.644ms, avg:3.644ms
>> > >
>> > >
>> > > Caja parser [size*10, runs:1] in 42915 microsecs [42.915072] millis/run
>> > >
>> > > [PARSER COMPONENTS]
>> > >
>> > > Op[raw-caja-parse] min:34.676ms, max:34.676ms, avg:34.676ms
>> > >
>> > > Op[build-parse-nodes] min:7.148ms, max:7.148ms, avg:7.148ms
>> > >
>> > >
>> > > Caja parser [size*50, runs:1] in 157048 microsecs [157.048064] millis/run
>> > >
>> > > [PARSER COMPONENTS]
>> > >
>> > > Op[raw-caja-parse] min:138.904ms, max:138.904ms, avg:138.904ms
>> > >
>> > > Op[build-parse-nodes] min:17.313ms, max:17.313ms, avg:17.313ms
>> > >
>> > >
>> > > Caja parser [size*100, runs:1] in 236073 microsecs [236.07296] millis/run
>> > >
>> > > [PARSER COMPONENTS]
>> > >
>> > > Op[raw-caja-parse] min:173.743ms, max:173.743ms, avg:173.743ms
>> > >
>> > > Op[build-parse-nodes] min:43.295ms, max:43.295ms, avg:43.295ms
>> > >
>> > >
>> > > Conclusions and Discussion:
>> > >
>> > > The purpose of this task was to prove that tree-based parsing is
>> > > functionally viable, which has succeeded. Past that, it's a matter of
>> > > choosing functionality vs. performance. Given that rewriting results are
>> > > cached, perhaps even a ~3x increase in rewriting cost will be worth
>> > > paying.
>> > >
>> > >
>> > > That's particularly true given the new class of optimizations/rewrites
>> > > made possible with a parse tree, as well as some bugs that are more
>> > > easily fixed using it. For instance, I recently discovered a bug with the
>> > > existing JS tag rewriter which ignores type="..." attributes and doesn't
>> > > maintain "id" attributes in certain situations. These can be resolved in
>> > > the lexer case, but are clearer in the parser one.
>> > >
>> > >
>> > > Lastly, as mentioned at the beginning of this thread, I plan to maintain
>> > > the ability to manipulate a gadget by string, meaning a lexer-based
>> > > approach can still be used where desired and parse-tree isn't required.
>> > >
>> > >
>> > > Next steps:
>> > >
>> > > 1. My next step is to add modularity to content rewriting, but again
>> > > without changing any caching semantics. Instead, rather than a single
>> > > ContentRewriter being injected, a ContentRewriterRegistry will be. The
>> > > default Registry will support injection of a single ContentRewriter to
>> > > maintain backward compatibility for now.
>> > >
>> > > 2. GadgetSpec immutability restored, ensuring post-rewritten caching.
>> > >
>> > > 3. ContentRewriter API cleanup.
>> > >
>> > >
>> > > --John
>> > >
>> > >
>> > > On Tue, Aug 12, 2008 at 7:43 PM, John Hjelmstad <[EMAIL PROTECTED]>
>> > wrote:
>> > >
>> > > > Interesting idea, and sounds fine to me. Concretely, this lets me
>> > > > sidestep SHINDIG-500 for a little while, which is nice (though I'd
>> > > > _really_ like to see the API cleanup go in! :)), in favor of migrating
>> > > > the existing rewriter to a tree-based approach. Turns out I've been
>> > > > working on #1 and #2 independently anyway. I'll post a patch soon.
>> > > > Thanks!
>> > > >
>> > > > John
>> > > >
>> > > >
>> > > > On Tue, Aug 12, 2008 at 7:14 PM, Louis Ryan <[EMAIL PROTECTED]>
>> > > > wrote:
>> > > >
>> > > >> Can we prove this out incrementally, bottom-up? In general I think
>> > > >> using DOM is the right thing to do from a rewriting standpoint. So
>> > > >> here's how I propose we proceed:
>> > > >>
>> > > >> 1. If the Caja DOM is a little awkward, wrap it; if not, let's just
>> > > >> use it as is. We can always resolve this later.
>> > > >> 2. Change the existing content rewriters to use the DOM instead of a
>> > > >> lexer; should be pretty easy. Maybe add some fancier rewriting like
>> > > >> moving CSS into HEAD.
>> > > >> 3. Do some perf testing, look into memory overhead of DOM
>> > > >> transformation etc.
>> > > >> 4. Alter GadgetSpecs to retain the DOM when they are cached.
>> > > >> 5. Alter the gadget rendering phase to serialize the content of the
>> > > >> DOM to output.
>> > > >> 6. Annotate the DOM at parse time to make render-time user-pref
>> > > >> substitutions faster; this should be easy enough too...
>> > > >>
>> > > >> This should be enough to prove out the pipeline end-to-end and
>> > > >> identify any major perf niggles. Once this is done we can look into
>> > > >> how to inject a rewriter pipeline into the parsing phase and the
>> > > >> rendering phase.
>> > > >>
>> > > >> -Louis
>> > > >>
>> > > >>
>> > > >>
>> > > >> On Tue, Aug 12, 2008 at 5:57 PM, John Hjelmstad <[EMAIL PROTECTED]>
>> > > wrote:
>> > > >>
>> > > >> > Re-responding in order to apply the last few exchanges to
>> > > >> > google-caja-discuss@ (@gmail vs. @google membership issues).
>> > > >> >
>> > > >> > On Tue, Aug 12, 2008 at 4:48 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
>> > > >> >
>> > > >> > > Hello,
>> > > >> > >
>> > > >> > > While beginning to refactor the rewriter APIs I've discovered that
>> > > >> > > there unfortunately is one semantic difference inherent to moving
>> > > >> > > getContent() and setContent() methods into the Gadget object
>> > > >> > > (replacing View.get/setRewrittenContent()): BasicGadgetSpecFactory
>> > > >> > > no longer caches rewritten content.
>> > > >> > >
>> > > >> > > I've written a discussion of this in issue SHINDIG-500, which
>> > > >> > > tracks this implementation sub-task:
>> > > >> > > https://issues.apache.org/jira/browse/SHINDIG-500
>> > > >> > >
>> > > >> > > To summarize:
>> > > >> > > 1. Is this change acceptable for the time being?
>> > > >> > > 2. I suggest that we can, at a later date, move fetching of gadget
>> > > >> > > specs into GadgetServer while injecting a Gadget(Spec) cache there
>> > > >> > > as well, offering finer-tuned control over caching characteristics.
>> > > >> > >
>> > > >> > > Thanks,
>> > > >> > > John
>> > > >> > >
>> > > >> > >
>> > > >> > > On Mon, Aug 11, 2008 at 2:20 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
>> > > >> > >
>> > > >> > >> I understand these concerns, and should be clear that I don't
>> > > >> > >> (despite my personal interest in experimenting with the idea,
>> > > >> > >> agreed that we don't have time for it at the moment) have any
>> > > >> > >> plans to introduce this sort of RPC anywhere - certainly not in
>> > > >> > >> Shindig itself, as any such call would be hidden behind an
>> > > >> > >> interface anyway.
>> > > >> > >>
>> > > >> > >> Putting the RPC hypothetical aside, I still feel that there's
>> > > >> > >> value to implementing HTML parsing in terms of an interface:
>> > > >> > >> * Clearer separation of concerns/boundary between projects.
>> > > >> > >>   - Corollary simplicity in testing.
>> > > >> > >> * Clearer API for content manipulation (that doesn't require
>> > > >> > >>   knowledge of Caja).
>> > > >> > >>
>> > > >> > >> I could be convinced otherwise, but at this point the code
>> > > >> > >> involved seems of manageable size, so still worth doing. Thoughts?
>> > > >> > >>
>> > > >> > >> John
>> > > >> > >>
>> > > >> > >>
>> > > >> > >>
>> > > >> > >> On Mon, Aug 11, 2008 at 1:00 PM, Kevin Brown <[EMAIL PROTECTED]>
>> > > >> wrote:
>> > > >> > >>
>> > > >> > >>> I agree with Louis -- that's just not practical. Every rewriting
>> > > >> > >>> operation must work in real time. Caja's existing html parser is
>> > > >> > >>> adequate for our needs, and we shouldn't go out of our way to
>> > > >> > >>> tolerate every oddity of random web browsers (especially as it
>> > > >> > >>> simply wouldn't work unless you farmed it out to *every*
>> > > >> > >>> browser). Any new code needs to be grounded in practical, current
>> > > >> > >>> needs, not theoretical options. We can always change code later
>> > > >> > >>> if we find a real need for something like that. We have real work
>> > > >> > >>> to do in the meantime.
>> > > >> > >>>
>> > > >> > >>> On Mon, Aug 11, 2008 at 12:06 PM, Louis Ryan <[EMAIL PROTECTED]> wrote:
>> > > >> > >>>
>> > > >> > >>> > John,
>> > > >> > >>> >
>> > > >> > >>> > From a practicality standpoint I'm a little nervous about this
>> > > >> > >>> > plan to make RPC calls out of a Java process to a native
>> > > >> > >>> > process to fetch a parse tree for transformations that have to
>> > > >> > >>> > occur in real time. I don't think the motivating factor here is
>> > > >> > >>> > to accept all inputs that browsers can. Gadget developers will
>> > > >> > >>> > tailor their markup to the platform as they have done already.
>> > > >> > >>> > I would greatly prefer us to pick one 'good' parser and stick
>> > > >> > >>> > with it for all the manageability and consumability benefits
>> > > >> > >>> > that come with that decision. Perhaps I'm missing something
>> > > >> > >>> > here?
>> > > >> > >>> >
>> > > >> > >>> > -Louis
>> > > >> > >>> >
>> > > >> > >>> > On Mon, Aug 11, 2008 at 11:59 AM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
>> > > >> > >>> >
>> > > >> > >>> > > On Fri, Aug 8, 2008 at 6:10 AM, Ben Laurie <[EMAIL PROTECTED]> wrote:
>> > > >> > >>> > >
>> > > >> > >>> > > > [+google-caja-discuss]
>> > > >> > >>> > > >
>> > > >> > >>> > > > On Thu, Aug 7, 2008 at 9:27 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
>> > > >> > >>> > > > > On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <[EMAIL PROTECTED]> wrote:
>> > > >> > >>> > > > >
>> > > >> > >>> > > > >> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
>> > > >> > >>> > > > >> > This proposal effectively enables the renderer to become a
>> > > >> > >>> > > > >> > multi-pass compiler for gadget content (essentially,
>> > > >> > >>> > > > >> > arbitrary web content). Such a compiler can provide several
>> > > >> > >>> > > > >> > benefits: static optimization of gadget content
>> > > >> > >>> > > > >> > (auto-proxying of images, whitespace/comment removal,
>> > > >> > >>> > > > >> > consolidation of CSS blocks), security benefits (caja et
>> > > >> > >>> > > > >> > al), new functionality (annotation of content for stats,
>> > > >> > >>> > > > >> > document analysis, container-specific features), etc. To my
>> > > >> > >>> > > > >> > knowledge no such infrastructure exists today (with the
>> > > >> > >>> > > > >> > possible exception of Caja itself, which I'd like to
>> > > >> > >>> > > > >> > dovetail with this work).
>> > > >> > >>> > > > >>
>> > > >> > >>> > > > >> Caja clearly provides a large chunk of the code you'd need for
>> > > >> > >>> > > > >> this. I'd like to hear how we'd manage to avoid duplication
>> > > >> > >>> > > > >> between the two projects.
>> > > >> > >>> > > > >>
>> > > >> > >>> > > > >> A generalised framework for manipulating content sounds like a
>> > > >> > >>> > > > >> great idea, but probably should not live in either of the two
>> > > >> > >>> > > > >> projects (Caja and Shindig) but rather should be shared by both
>> > > >> > >>> > > > >> of them, I suspect.
>> > > >> > >>> > > > >
>> > > >> > >>> > > > >
>> > > >> > >>> > > > > I agree on both counts. As I mentioned, the piece of this idea
>> > > >> > >>> > > > > that I expect to change the most is the parse tree, and Caja's
>> > > >> > >>> > > > > .parser.html and .parser.css packages contain much of what I've
>> > > >> > >>> > > > > thrown in here as a base.
>> > > >> > >>> > > > >
>> > > >> > >>> > > > > My key requirements are:
>> > > >> > >>> > > > > * Lightweight framework.
>> > > >> > >>> > > > > * Parser modularity, mostly for HTML parsers (to re-use the good
>> > > >> > >>> > > > >   work done by WebKit or Gecko.. CSS/JS can come direct from
>> > > >> > >>> > > > >   Caja I'd bet)
>> > > >> > >>> > > > > * Automatic maintenance of DOM<->String conversion.
>> > > >> > >>> > > > > * Easy to manipulate structure.
>> > > >> > >>> > > >
>> > > >> > >>> > > > I'm not sure what the value of parser modularity is? If the
>> > > >> > >>> > > > resulting tree is different, then that's a problem for people
>> > > >> > >>> > > > processing the tree. And if it is not, then why do we care?
>> > > >> > >>> > >
>> > > >> > >>> > >
>> > > >> > >>> > > IMO the value of parser modularity is that the lenient parsers
>> > > >> > >>> > > native to browsers can be used in place of those that might not
>> > > >> > >>> > > accept all inputs. One could (and I'd like to) adapt WebKit or
>> > > >> > >>> > > Gecko's parsing code into a server that runs parallel to Shindig
>> > > >> > >>> > > and provides a "local RPC" service for parsing semi-structured
>> > > >> > >>> > > HTML. The resulting tree for WebKit's parser might be different
>> > > >> > >>> > > than that for an XHTML parser, Gecko's parser, etc, but if the
>> > > >> > >>> > > algorithm implemented atop it is rule-based rather than
>> > > >> > >>> > > strict-structure based that should be fine, no?
>> > > >> > >>> > >
>> > > >> > >>> > >
>> > > >> > >>> > > >
>> > > >> > >>> > > >
>> > > >> > >>> > > > >
>> > > >> > >>> > > > > I'd love to see both projects share the same base syntax tree
>> > > >> > >>> > > > > representations. I considered .parser.html(.DomTree) and
>> > > >> > >>> > > > > .parser.css for these, but at the moment these appeared to be a
>> > > >> > >>> > > > > little more tied to Caja's lexer/parser implementation than I
>> > > >> > >>> > > > > preferred (though I admit AbstractParseTreeNode contains most of
>> > > >> > >>> > > > > what's needed).
>> > > >> > >>> > > > >
>> > > >> > >>> > > > > To be sure, I don't see this as an end-all-be-all transformation
>> > > >> > >>> > > > > system in any way. I'd just like to put *something* reasonable
>> > > >> > >>> > > > > in place that we can play with, provide some benefit, and
>> > > >> > >>> > > > > enhance into a truly sophisticated vision of document rewriting.
>> > > >> > >>> > > > >
>> > > >> > >>> > > > >
>> > > >> > >>> > > > >>
>> > > >> > >>> > > > >>
>> > > >> > >>> > > > >> >  c. Add Gadget.getParsedContent().
>> > > >> > >>> > > > >> >    i. Returns a mutable GadgetContentParseTree used to
>> > > >> > >>> > > > >> >       manipulate Gadget Contents.
>> > > >> > >>> > > > >> >    ii. Mutable tree calls back to the Gadget object indicating
>> > > >> > >>> > > > >> >       when any change is made, and emits an error if
>> > > >> > >>> > > > >> >       setContent() has been called in the interim.
>> > > >> > >>> > > > >>
>> > > >> > >>> > > > >> In Caja we have been moving towards immutable trees...
>> > > >> > >>> > > > >
>> > > >> > >>> > > > >
>> > > >> > >>> > > > > Interested to hear more about this. The whole idea is for the
>> > > >> > >>> > > > > gadget's tree representation to be modifiable. Doing that with
>> > > >> > >>> > > > > immutable trees to me suggests that a rewriter would have to
>> > > >> > >>> > > > > create a completely new tree and set it as a representation of
>> > > >> > >>> > > > > new content. That's convenient as far as the Gadget's
>> > > >> > >>> > > > > maintenance of String<->Tree representations is concerned...
>> > > >> > >>> > > > > but seems pretty heavyweight for many types of edits: in-situ
>> > > >> > >>> > > > > modifications of text, content reordering, etc. That's
>> > > >> > >>> > > > > particularly so in a single-threaded (viz rewriting)
>> > > >> > >>> > > > > environment.
>> > > >> > >>> > > >
>> > > >> > >>> > > > Never having been entirely sold on the concept, I'll let those on
>> > > >> > >>> > > > the Caja team who advocate immutability explain why.
>> > > >> > >>> > > >
>> > > >> > >>> > >
>> > > >> > >>> >
>> > > >> > >>>
>> > > >> > >>
>> > > >> > >>
>> > > >> > >
>> > > >> >
>> > > >>
>> > > >
>> > > >
>> > >
>> >
>
>
>
>
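For reference, the measurement John describes in the quoted thread (rewrite the sample content repeated X times, N times over, record the elapsed time, then report a parse/lex ratio) could be sketched roughly as below. This is only a minimal sketch under stated assumptions, not the profiler used for the numbers above: the class name, method names, and the two stand-in rewriters are all hypothetical, and the real test would plug in the lexer-based and parse-tree-based rewriters instead.

```java
import java.util.function.UnaryOperator;

/** Hypothetical timing harness; not the profiler used in the thread. */
final class RewriteProfilerSketch {

    /** Repeats the sample content x times to scale the input size. */
    static String repeat(String content, int x) {
        StringBuilder sb = new StringBuilder(content.length() * x);
        for (int i = 0; i < x; i++) {
            sb.append(content);
        }
        return sb.toString();
    }

    /** Runs the rewriter n times over the input; returns elapsed microseconds. */
    static long timeMicros(UnaryOperator<String> rewriter, String input, int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            rewriter.apply(input);
        }
        return (System.nanoTime() - start) / 1000L;
    }

    /** Parse/lex ratio, computed the way the quoted results appear to be. */
    static double ratio(long parseMicros, long lexMicros) {
        return (double) parseMicros / lexMicros;
    }

    public static void main(String[] args) {
        // Stand-in rewriters (assumptions): the real ones would be the
        // lexer-based and parse-tree-based implementations.
        UnaryOperator<String> lexLike = s -> s.replace("href=", "href=");
        UnaryOperator<String> parseLike =
            s -> new StringBuilder(s).reverse().reverse().toString();

        String sample = "<a href=\"http://example.org/\">link</a>\n";
        for (int x : new int[] {100, 50, 10, 1}) {
            String input = repeat(sample, x);
            long lex = timeMicros(lexLike, input, 1);     // N fixed at 1
            long parse = timeMicros(parseLike, input, 1);
            System.out.println("LEX-BASED*" + x + ": " + lex + " microsecs");
            System.out.println("PARSE-BASED*" + x + ": " + parse + " microsecs");
            if (lex > 0) { // avoid dividing by zero on very fast runs
                System.out.println("Parse/lex ratio: " + ratio(parse, lex));
            }
        }
    }
}
```

With N = 1 and no warm-up, timings from a harness like this are dominated by JIT and allocation noise, which may be part of why some of the quoted single-run figures are not monotonic in input size.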

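The registry idea from John's "Next steps" in the quoted thread (inject a registry of rewriters rather than a single rewriter, with a default that wraps exactly one for backward compatibility) might look something like the following. This is a sketch under assumptions only: `SketchRewriter` and `SketchRewriterRegistry` are hypothetical stand-ins for ContentRewriter and ContentRewriterRegistry, and the string-in/string-out signature is invented for illustration rather than taken from Shindig.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a rewriter; the signature is an assumption. */
interface SketchRewriter {
    /** Returns the rewritten content for the given input. */
    String rewrite(String content);
}

/** Hypothetical registry that applies its rewriters in registration order. */
final class SketchRewriterRegistry {
    private final List<SketchRewriter> rewriters = new ArrayList<>();

    /** Backward-compatible shape: a registry wrapping exactly one rewriter. */
    static SketchRewriterRegistry ofSingle(SketchRewriter rewriter) {
        return new SketchRewriterRegistry().register(rewriter);
    }

    /** Appends a rewriter and returns this registry to allow chaining. */
    SketchRewriterRegistry register(SketchRewriter rewriter) {
        rewriters.add(rewriter);
        return this;
    }

    /** Feeds the output of each rewriter into the next one. */
    String rewriteAll(String content) {
        String current = content;
        for (SketchRewriter r : rewriters) {
            current = r.rewrite(current);
        }
        return current;
    }

    public static void main(String[] args) {
        // Toy rewriters (assumptions) standing in for link rewriting and cleanup.
        SketchRewriter links = c -> c.replace("href=\"/", "href=\"/proxy/");
        SketchRewriter trim = String::trim;

        SketchRewriterRegistry registry =
            new SketchRewriterRegistry().register(links).register(trim);
        System.out.println(registry.rewriteAll("  <a href=\"/x\">x</a>  "));
        // prints: <a href="/proxy/x">x</a>
    }
}
```

Passing strings between rewriters keeps the sketch small; the plan discussed in the thread would instead hand each rewriter a mutable Gadget that owns the parse tree, so the content is parsed once rather than per rewriter.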