I've implemented a parse-tree-based ContentRewriter using the existing
plumbing (same caching semantics and so on), in two new classes:
ParseTreeDefaultContentRewriter and ParseTreeHtmlRewriter. The latter contains
essentially all of the rewriting functionality from the previous
HtmlTagTransformer classes, so the parse-tree-based rewriter is now
functionally equivalent to the previous one. Both classes use a Caja-based
HTML parser under the hood. This proves out the functional viability of a
tree-based rewriter; performance also needs to be assessed.
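
For orientation, here's a rough sketch of the overall flow (parse, mutate
tree, render). The interfaces and names below are illustrative stand-ins,
not the actual ParseTreeHtmlRewriter or Caja APIs:

    import java.util.List;

    public class TreeRewriteFlowSketch {
      interface Node { /* mutable tree node: tag name, attributes, children */ }
      interface HtmlParser { Node parse(String html); }
      interface TreeRewriter { void rewrite(Node root); }  // one pass over the tree
      interface Renderer { String render(Node root); }

      static String rewrite(String content, HtmlParser parser,
                            List<TreeRewriter> passes, Renderer renderer) {
        Node root = parser.parse(content);  // Caja-based parse in the real classes
        for (TreeRewriter pass : passes) {
          pass.rewrite(root);               // e.g. link, JS, and style rewriting
        }
        return renderer.render(root);       // serialize the tree back to markup
      }
    }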

I've gone ahead and profiled the comparative performance of each rewriter,
"Lexer"-based and "ParseTree"-based. Unsurprisingly, "Lexer" wins out every
time, essentially by definition, since Caja's parser uses its own lexer under
the hood.

Summary:
* The performance of each is fundamentally O(n) in input size, though the
constant factors differ.
* For any given input size, lexer-based rewriting averages 2.5-3.5x faster
than parse-tree-based (i.e., the constant factor is roughly 3.5 at worst).
* The large majority of time in parse-tree-based rewriting goes to the
initial parse: roughly 75% of all processing.

Details:
1. I wrote a simple rewriter/parser profiler which rewrites (the sample
rewriter gadget's content * X repetitions) N times, recording the resulting
run time. The run time of parse-based rewriting degraded as N increased, in
all likelihood due to the additional cost of object management (lexer-based
rewriting involves few intermediate objects). Given that the results of
rewriting will be variously cached, it's very unlikely that rewriting will
happen in immediate succession hundreds or thousands of times. So I fixed
N = 1 and re-ran the tests in relative isolation from one another.
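
A minimal sketch of the harness, assuming a stand-in Rewriter interface in
place of the real ContentRewriter (all names here are illustrative):

    public class RewriterProfiler {
      interface Rewriter { String rewrite(String content); }

      static void profile(String label, Rewriter rewriter, String sampleContent,
                          int repetitions, int runs) {
        // Build the input: the sample gadget content repeated X times.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < repetitions; i++) {
          sb.append(sampleContent);
        }
        String content = sb.toString();

        // Time N consecutive rewrites of that input.
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
          rewriter.rewrite(content);
        }
        long micros = (System.nanoTime() - start) / 1000;
        System.out.printf(
            "%s*%d rewriter, %d runs in %d microsecs [%s] millis/run%n",
            label, repetitions, runs, micros, micros / 1000.0 / runs);
      }
    }
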
Results from a given run:

LEX-BASED*100 rewriter, 1 runs in 177047 microsecs [177.04704] millis/run

PARSE-BASED*100 rewriter, 1 runs in 609136 microsecs [609.136128] millis/run

Parse/lex ratio: 3.4405327398939263

LEX-BASED*50 rewriter, 1 runs in 43936 microsecs [43.936] millis/run

PARSE-BASED*50 rewriter, 1 runs in 148980 microsecs [148.979968] millis/run

Parse/lex ratio: 3.3908412235979606

LEX-BASED*10 rewriter, 1 runs in 3093 microsecs [3.092992] millis/run

PARSE-BASED*10 rewriter, 1 runs in 11020 microsecs [11.020032] millis/run

Parse/lex ratio: 3.5628839314581313

LEX-BASED*1 rewriter, 1 runs in 600 microsecs [0.600064] millis/run

PARSE-BASED*1 rewriter, 1 runs in 1819 microsecs [1.819136] millis/run

Parse/lex ratio: 3.0316666666666667


2. Drilling down, I added simple operation profiling to each component of
parse-tree rewriting: original parse (CajaHtmlParser); building mutable tree
nodes; rewriting links; concatenating JS nodes; rewriting style blocks;
rendering the parse tree. I then reran the same tests.
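
The per-operation profiler is straightforward; a minimal sketch follows
(names are illustrative, and the report format mirrors the [PARSE OPS]
blocks below):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class OpProfiler {
      private static class Stats {
        long count;
        double min = Double.MAX_VALUE, max, total;
      }

      private final Map<String, Stats> ops = new LinkedHashMap<String, Stats>();

      // Record one timed execution of the named operation.
      void record(String op, double millis) {
        Stats s = ops.get(op);
        if (s == null) { s = new Stats(); ops.put(op, s); }
        s.count++;
        s.total += millis;
        s.min = Math.min(s.min, millis);
        s.max = Math.max(s.max, millis);
      }

      // Print min/max/avg per operation, in insertion order.
      void report() {
        for (Map.Entry<String, Stats> e : ops.entrySet()) {
          Stats s = e.getValue();
          System.out.printf("Op[%s] min:%.3fms, max:%.3fms, avg:%.3fms%n",
              e.getKey(), s.min, s.max, s.total / s.count);
        }
      }
    }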

Results from subsequent run:

LEX-BASED*100 rewriter, 1 runs in 165321 microsecs [165.32096] millis/run

PARSE-BASED*100 rewriter, 1 runs in 646884 microsecs [646.88384] millis/run

Parse/lex ratio: 3.912896728183352

[PARSE OPS]

Op[style-rewrite] min:25.419ms, max:25.419ms, avg:25.419ms

Op[render] min:36.851ms, max:36.851ms, avg:36.851ms

Op[js-rewrite] min:53.983ms, max:53.983ms, avg:53.983ms

Op[link-rewrite] min:31.136ms, max:31.136ms, avg:31.136ms

Op[build-nodes] min:32.929ms, max:32.929ms, avg:32.929ms

Op[parse] min:464.211ms, max:464.211ms, avg:464.211ms


LEX-BASED*50 rewriter, 1 runs in 30684 microsecs [30.683904] millis/run

PARSE-BASED*50 rewriter, 1 runs in 161132 microsecs [161.132032] millis/run

Parse/lex ratio: 5.251336201277539

[PARSE OPS]

Op[style-rewrite] min:8.581ms, max:8.581ms, avg:8.581ms

Op[render] min:5.184ms, max:5.184ms, avg:5.184ms

Op[js-rewrite] min:11.606ms, max:11.606ms, avg:11.606ms

Op[link-rewrite] min:7.533ms, max:7.533ms, avg:7.533ms

Op[build-nodes] min:3.41ms, max:3.41ms, avg:3.41ms

Op[parse] min:121.367ms, max:121.367ms, avg:121.367ms


LEX-BASED*10 rewriter, 1 runs in 3371 microsecs [3.371008] millis/run

PARSE-BASED*10 rewriter, 1 runs in 10336 microsecs [10.336] millis/run

Parse/lex ratio: 3.066152477009789

[PARSE OPS]

Op[style-rewrite] min:0.563ms, max:0.563ms, avg:0.563ms

Op[render] min:0.678ms, max:0.678ms, avg:0.678ms

Op[js-rewrite] min:1.374ms, max:1.374ms, avg:1.374ms

Op[link-rewrite] min:0.718ms, max:0.718ms, avg:0.718ms

Op[build-nodes] min:0.295ms, max:0.295ms, avg:0.295ms

Op[parse] min:6.466ms, max:6.466ms, avg:6.466ms


LEX-BASED*1 rewriter, 1 runs in 592 microsecs [0.592128] millis/run

PARSE-BASED*1 rewriter, 1 runs in 2083 microsecs [2.083072] millis/run

Parse/lex ratio: 3.518581081081081

[PARSE OPS]

Op[style-rewrite] min:0.082ms, max:0.082ms, avg:0.082ms

Op[render] min:0.077ms, max:0.077ms, avg:0.077ms

Op[js-rewrite] min:0.143ms, max:0.143ms, avg:0.143ms

Op[link-rewrite] min:0.111ms, max:0.111ms, avg:0.111ms

Op[build-nodes] min:0.043ms, max:0.043ms, avg:0.043ms

Op[parse] min:1.437ms, max:1.437ms, avg:1.437ms


3. Drilling further, I wrote a separate test breaking parsing down into its
two components: calling the Caja DomParser.parseFragment(...) API, and then
wrapping the results of that call in ParsedHtmlNode objects to satisfy
interface requirements.
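
The test times those two phases separately, roughly as sketched below. The
rawCajaParse(...) and buildParseNodes(...) methods are placeholder stubs for
the real DomParser.parseFragment(...) call and the ParsedHtmlNode wrapping;
they are not the actual APIs:

    public class ParsePhaseTimer {
      interface PhaseListener { void onPhase(String name, double millis); }

      static Object timeParse(String html, PhaseListener listener) {
        long t0 = System.nanoTime();
        Object rawTree = rawCajaParse(html);                // "raw-caja-parse"
        listener.onPhase("raw-caja-parse", (System.nanoTime() - t0) / 1e6);

        long t1 = System.nanoTime();
        Object wrapped = buildParseNodes(rawTree);          // "build-parse-nodes"
        listener.onPhase("build-parse-nodes", (System.nanoTime() - t1) / 1e6);
        return wrapped;
      }

      // Placeholder stubs -- stand-ins for the real parse and node-wrapping code.
      private static Object rawCajaParse(String html) { return html; }
      private static Object buildParseNodes(Object rawTree) { return rawTree; }
    }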

Typical run:

Caja parser [size*1, runs:1] in 97538 microsecs [97.538048] millis/run

[PARSER COMPONENTS]

Op[raw-caja-parse] min:70.033ms, max:70.033ms, avg:70.033ms

Op[build-parse-nodes] min:3.644ms, max:3.644ms, avg:3.644ms


Caja parser [size*10, runs:1] in 42915 microsecs [42.915072] millis/run

[PARSER COMPONENTS]

Op[raw-caja-parse] min:34.676ms, max:34.676ms, avg:34.676ms

Op[build-parse-nodes] min:7.148ms, max:7.148ms, avg:7.148ms


Caja parser [size*50, runs:1] in 157048 microsecs [157.048064] millis/run

[PARSER COMPONENTS]

Op[raw-caja-parse] min:138.904ms, max:138.904ms, avg:138.904ms

Op[build-parse-nodes] min:17.313ms, max:17.313ms, avg:17.313ms


Caja parser [size*100, runs:1] in 236073 microsecs [236.07296] millis/run

[PARSER COMPONENTS]

Op[raw-caja-parse] min:173.743ms, max:173.743ms, avg:173.743ms

Op[build-parse-nodes] min:43.295ms, max:43.295ms, avg:43.295ms


Conclusions and Discussion:

The purpose of this task was to prove that tree-based rewriting is
functionally viable, and that has succeeded. Beyond that, it's a matter of
weighing functionality against performance. Given that rewriting results are
cached, even a ~3x increase in rewriting cost may well be worth paying.


That's particularly true given the new class of optimizations/rewrites made
possible with a parse tree, as well as some bugs that are more easily fixed
using it. For instance, I recently discovered a bug in the existing JS tag
rewriter: it ignores type="..." attributes and doesn't preserve "id"
attributes in certain situations. These issues can be fixed in the
lexer-based rewriter, but the fixes are cleaner with a parse tree.


Lastly, as mentioned at the beginning of this thread, I plan to maintain the
ability to manipulate a gadget by string, meaning a lexer-based approach can
still be used where desired and a parse tree isn't required.


Next steps:

1. My next step is to add modularity to content rewriting, again without
changing any caching semantics. Rather than a single ContentRewriter being
injected, a ContentRewriterRegistry will be injected. The default registry
will accept a single injected ContentRewriter to maintain backward
compatibility for now (see the sketch after this list).

2. Restore GadgetSpec immutability, ensuring post-rewrite content is cached.

3. ContentRewriter API cleanup.
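
For step 1, a hypothetical sketch of the registry shape (ContentRewriter is
the existing Shindig interface; the registry name matches the plan above,
but the method signature and default implementation are illustrative only):

    import java.util.Collections;
    import java.util.List;

    public interface ContentRewriterRegistry {
      /** Returns the rewriters to apply, in order. */
      List<ContentRewriter> getRewriters();
    }

    // Default implementation: wraps the single injected ContentRewriter so
    // existing injection configurations keep working unchanged.
    class DefaultContentRewriterRegistry implements ContentRewriterRegistry {
      private final ContentRewriter rewriter;

      DefaultContentRewriterRegistry(ContentRewriter rewriter) {
        this.rewriter = rewriter;
      }

      public List<ContentRewriter> getRewriters() {
        return Collections.singletonList(rewriter);
      }
    }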


--John


On Tue, Aug 12, 2008 at 7:43 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:

> Interesting idea, and sounds fine to me. Concretely, this lets me sidestep
> SHINDIG-500 for a little while, which is nice (though I'd _really_ like to
> see the API cleanup go in! :)), in favor of migrating the existing rewriter
> to a tree-based approach. Turns out I've been working on #1 and #2
> independently anyway. I'll post a patch soon. Thanks!
>
> John
>
>
> On Tue, Aug 12, 2008 at 7:14 PM, Louis Ryan <[EMAIL PROTECTED]> wrote:
>
>> Can we prove this out incrementally bottom-up. In general I think using
>> DOM
>> is the right thing to do from a rewriting standpoint. So here's how I
>> propose we proceed
>>
>> 1. If the Caja dom is a little awkward wrap it, if not lets just use it as
>> is. We can always resolve this later
>> 2. Change the existing content rewriters to use the DOM instead of a
>> lexer,
>> should be pretty easy. Maybe add some fancier rewriting like moving CSS
>> into
>> HEAD
>> 3. Do some perf testing, look into memory overhead of dom transformation
>> etc.
>> 4. Alter GadgetSpec's to retain the dom when they are cached
>> 5. Alter the gadget rendering phase to serialize the content of the dom to
>> output
>> 6. Annotate the dom at parse time to make render time user-pref
>> substituions
>> faster, this should be easy enough too...
>>
>> This should be enough to prove out the pipeline end-to-end and identify
>> any
>> major perf niggles. Once this is done we can look into how to inject a
>> rewriter pipeline into the parsing phase and the rendering phase.
>>
>> -Louis
>>
>>
>>
>> On Tue, Aug 12, 2008 at 5:57 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
>>
>> > Re-responding in order to apply the last few exchanges to
>> > google-caja-discuss@ (@gmail vs. @google membership issues).
>> >
>> > On Tue, Aug 12, 2008 at 4:48 PM, John Hjelmstad <[EMAIL PROTECTED]>
>> wrote:
>> >
>> > > Hello,
>> > >
>> > > While beginning to refactor the rewriter APIs I've discovered that
>> there
>> > > unfortunately is one semantic difference inherent to moving
>> getContent()
>> > and
>> > > setContent() methods into the Gadget object (replacing
>> > > View.get/setRewrittenContent()): BasicGadgetSpecFactory no longer
>> caches
>> > > rewritten content.
>> > >
>> > > I've written a discussion of this in issue SHINDIG-500, which tracks
>> this
>> > > implementation sub-task:
>> > https://issues.apache.org/jira/browse/SHINDIG-500
>> > >
>> > > To summarize:
>> > > 1. Is this change acceptable for the time being?
>> > > 2. I suggest that we can, at a later date, move fetching of gadget
>> specs
>> > > into GadgetServer while injecting a Gadget(Spec) cache there as well,
>> > > offering finer-tuned control over caching characteristics.
>> > >
>> > > Thanks,
>> > > John
>> > >
>> > >
>> > > On Mon, Aug 11, 2008 at 2:20 PM, John Hjelmstad <[EMAIL PROTECTED]>
>> > wrote:
>> > >
>> > >> I understand these concerns, and should be clear that I don't
>> (despite
>> > my
>> > >> personal interest in experimenting with the idea, agreed that we
>> don't
>> > have
>> > >> time for it at the moment) have any plans to introduce this sort of
>> RPC
>> > >> anywhere - certainly not in Shindig itself, as any such call would be
>> > hidden
>> > >> behind an interface anyway.
>> > >>
>> > >> Putting the RPC hypothetical aside, I still feel that there's value
>> to
>> > >> implementing HTML parsing in terms of an interface:
>> > >> * Clearer separation of concerns/boundary between projects.
>> > >>   - Corollary simplicity in testing.
>> > >> * Clearer API for content manipulation (that doesn't require
>> knowledge
>> > of
>> > >> Caja).
>> > >>
>> > >> I could be convinced otherwise, but at this point the code involved
>> > seems
>> > >> of manageable size, so still worth doing. Thoughts?
>> > >>
>> > >> John
>> > >>
>> > >>
>> > >>
>> > >> On Mon, Aug 11, 2008 at 1:00 PM, Kevin Brown <[EMAIL PROTECTED]>
>> wrote:
>> > >>
>> > >>> I agree with Louis -- that's just not practical. Every rewriting
>> > >>> operation
>> > >>> must work in real time. Caja's existing html parser is adequate for
>> our
>> > >>> needs, and we shouldn't go out of our way to tolerate every oddity
>> of
>> > >>> random
>> > >>> web browsers (especially as it simply wouldn't work unless you
>> farmed
>> > it
>> > >>> out
>> > >>> to *every* browser). Any new code needs to be grounded in practical,
>> > >>> current
>> > >>> needs, not theoretical options. We can always change code later if
>> we
>> > >>> find a
>> > >>> real need for something like that. We have real work to do in the
>> > >>> meantime.
>> > >>>
>> > >>> On Mon, Aug 11, 2008 at 12:06 PM, Louis Ryan <[EMAIL PROTECTED]>
>> wrote:
>> > >>>
>> > >>> > John,
>> > >>> >
>> > >>> > From a practicality standpoint I'm a little nervous about this
>> plan
>> > to
>> > >>> make
>> > >>> > RPCs calls out of a Java process to a native process to fetch a
>> parse
>> > >>> tree
>> > >>> > for transformations that have to occur realtime. I don't think the
>> > >>> > motivating factor here is to accept all inputs that browsers can.
>> > >>> Gadget
>> > >>> > developers will tailor their markup to the platform as they have
>> done
>> > >>> > already. I would greatly prefer us to pick one 'good' parser and
>> > stick
>> > >>> with
>> > >>> > it for all the manageability and consumability benefits that come
>> > with
>> > >>> that
>> > >>> > decision. Perhaps Im missing something here?
>> > >>> >
>> > >>> > -Louis
>> > >>> >
>> > >>> > On Mon, Aug 11, 2008 at 11:59 AM, John Hjelmstad <
>> [EMAIL PROTECTED]>
>> > >>> wrote:
>> > >>> >
>> > >>> > > On Fri, Aug 8, 2008 at 6:10 AM, Ben Laurie <[EMAIL PROTECTED]>
>> > wrote:
>> > >>> > >
>> > >>> > > > [+google-caja-discuss]
>> > >>> > > >
>> > >>> > > > On Thu, Aug 7, 2008 at 9:27 PM, John Hjelmstad <
>> [EMAIL PROTECTED]
>> > >
>> > >>> > wrote:
>> > >>> > > > > On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <[EMAIL PROTECTED]
>> >
>> > >>> wrote:
>> > >>> > > > >
>> > >>> > > > >> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <
>> > >>> [EMAIL PROTECTED]>
>> > >>> > > > wrote:
>> > >>> > > > >> > This proposal effectively enables the renderer to become
>> a
>> > >>> > > multi-pass
>> > >>> > > > >> > compiler for gadget content (essentially, arbitrary web
>> > >>> content).
>> > >>> > > Such
>> > >>> > > > a
>> > >>> > > > >> > compiler can provide several benefits: static
>> optimization
>> > of
>> > >>> > gadget
>> > >>> > > > >> content
>> > >>> > > > >> > (auto-proxying of images, whitespace/comment removal,
>> > >>> > consolidation
>> > >>> > > of
>> > >>> > > > >> CSS
>> > >>> > > > >> > blocks), security benefits (caja et al), new
>> functionality
>> > >>> > > (annotation
>> > >>> > > > of
>> > >>> > > > >> > content for stats, document analysis, container-specific
>> > >>> > features),
>> > >>> > > > etc.
>> > >>> > > > >> To
>> > >>> > > > >> > my knowledge no such infrastructure exists today (with
>> the
>> > >>> > possible
>> > >>> > > > >> > exception of Caja itself, which I'd like to dovetail with
>> > this
>> > >>> > > work).
>> > >>> > > > >>
>> > >>> > > > >> Caja clearly provides a large chunk of the code you'd need
>> for
>> > >>> this.
>> > >>> > > > >> I'd like to hear how we'd manage to avoid duplication
>> between
>> > >>> the
>> > >>> > two
>> > >>> > > > >> projects.
>> > >>> > > > >>
>> > >>> > > > >> A generalised framework for manipulating content sounds
>> like a
>> > >>> great
>> > >>> > > > >> idea, but probably should not live in either of the two
>> > projects
>> > >>> > (Caja
>> > >>> > > > >> and Shindig) but rather should be shared by both of them, I
>> > >>> suspect.
>> > >>> > > > >
>> > >>> > > > >
>> > >>> > > > > I agree on both counts. As I mentioned, the piece of this
>> idea
>> > >>> that I
>> > >>> > > > expect
>> > >>> > > > > to change the most is the parse tree, and Caja's
>> .parser.html
>> > and
>> > >>> > > > > .parser.css packages contain much of what I've thrown in
>> here
>> > as
>> > >>> a
>> > >>> > > base.
>> > >>> > > > >
>> > >>> > > > > My key requirements are:
>> > >>> > > > > * Lightweight framework.
>> > >>> > > > > * Parser modularity, mostly for HTML parsers (to re-use the
>> > good
>> > >>> work
>> > >>> > > > done
>> > >>> > > > > by WebKit or Gecko.. CSS/JS can come direct from Caja I'd
>> bet)
>> > >>> > > > > * Automatic maintenance of DOM<->String conversion.
>> > >>> > > > > * Easy to manipulate structure.
>> > >>> > > >
>> > >>> > > > I'm not sure what the value of parser modularity is? If the
>> > >>> resulting
>> > >>> > > > tree is different, then that's a problem for people processing
>> > the
>> > >>> > > > tree. And if it is not, then why do we care?
>> > >>> > >
>> > >>> > >
>> > >>> > > IMO the value of parser modularity is that the lenient parsers
>> > native
>> > >>> to
>> > >>> > > browsers can be used in place of those that might not accept all
>> > >>> inputs.
>> > >>> > > One
>> > >>> > > could (and I'd like to) adapt WebKit or Gecko's parsing code
>> into a
>> > >>> > server
>> > >>> > > that runs parallel to Shindig and provides a "local RPC" service
>> > for
>> > >>> > > parsing
>> > >>> > > semi-structured HTML. The resulting tree for WebKit's parser
>> might
>> > be
>> > >>> > > different than that for an XHTML parser, Gecko's parser, etc,
>> but
>> > if
>> > >>> the
>> > >>> > > algorithm implemented atop it is rule-based rather than
>> > >>> strict-structure
>> > >>> > > based that should be fine, no?
>> > >>> > >
>> > >>> > >
>> > >>> > > >
>> > >>> > > >
>> > >>> > > > >
>> > >>> > > > > I'd love to see both projects share the same base syntax
>> tree
>> > >>> > > > > representations. I considered .parser.html(.DomTree) and
>> > >>> .parser.css
>> > >>> > > for
>> > >>> > > > > these, but at the moment these appeared to be a little more
>> > tied
>> > >>> to
>> > >>> > > > Caja's
>> > >>> > > > > lexer/parser implementation than I preferred (though I admit
>> > >>> > > > > AbstractParseTreeNode contains most of what's needed).
>> > >>> > > > >
>> > >>> > > > > To be sure, I don't see this as an end-all-be-all
>> > transformation
>> > >>> > system
>> > >>> > > > in
>> > >>> > > > > any way. I'd just like to put *something* reasonable in
>> place
>> > >>> that we
>> > >>> > > can
>> > >>> > > > > play with, provide some benefit, and enhance into a truly
>> > >>> > sophisticated
>> > >>> > > > > vision of document rewriting.
>> > >>> > > > >
>> > >>> > > > >
>> > >>> > > > >>
>> > >>> > > > >>
>> > >>> > > > >> >  c. Add Gadget.getParsedContent().
>> > >>> > > > >> >    i. Returns a mutable GadgetContentParseTree used to
>> > >>> manipulate
>> > >>> > > > Gadget
>> > >>> > > > >> > Contents.
>> > >>> > > > >> >    ii. Mutable tree calls back to the Gadget object
>> > indicating
>> > >>> > when
>> > >>> > > > any
>> > >>> > > > >> > change is made, and emits an error if setContent() has
>> been
>> > >>> called
>> > >>> > > in
>> > >>> > > > the
>> > >>> > > > >> > interim.
>> > >>> > > > >>
>> > >>> > > > >> In Caja we have been moving towards immutable trees...
>> > >>> > > > >
>> > >>> > > > >
>> > >>> > > > > Interested to hear more about this. The whole idea is for
>> the
>> > >>> > gadget's
>> > >>> > > > tree
>> > >>> > > > > representation to be modifiable. Doing that with immutable
>> > trees
>> > >>> to
>> > >>> > me
>> > >>> > > > > suggests that a rewriter would have to create a completely
>> new
>> > >>> tree
>> > >>> > and
>> > >>> > > > set
>> > >>> > > > > it as a representation of new content. That's convenient as
>> far
>> > >>> as
>> > >>> > the
>> > >>> > > > > Gadget's maintenance of String<->Tree representations is
>> > >>> concerned...
>> > >>> > > but
>> > >>> > > > > seems pretty heavyweight for many types of edits: in-situ
>> > >>> > modifications
>> > >>> > > > of
>> > >>> > > > > text, content reordering, etc. That's particularly so in a
>> > >>> > > > single-threaded
>> > >>> > > > > (viz rewriting) environment.
>> > >>> > > >
>> > >>> > > > Never having been entirely sold on the concept, I'll let those
>> on
>> > >>> the
>> > >>> > > > Caja team who advocate immutability explain why.
>> > >>> > > >
>> > >>> > >
>> > >>> >
>> > >>>
>> > >>
>> > >>
>> > >
>> >
>>
>
>
