Re: Content Rewriter Modularization: Design/Change

Louis Ryan Mon, 25 Aug 2008 16:44:50 -0700

These numbers seem pretty solid so I believe its fine to switch the general
rewriter over to this new code and ditch the lexer based stuff. Nice work.


On Mon, Aug 25, 2008 at 2:27 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:

> I've implemented a parse-tree based ContentRewriter using the existing
> plumbing (same caching semantics et al), as ParseTreeDefaultContentRewriter
> and ParseTreeHtmlRewriter, respectively. The latter contains essentially
> all
> rewriting functionality from the previous HtmlTagTransformer classes. The
> parse-tree based rewriter is now functionally equivalent to the previous
> rewriter. The new classes use a Caja-based HTML parser under the hood.
> This proves out the functional viability of a tree-based rewriter, but
> performance also needs to be assessed.
>
> I've gone ahead and profiled the comparative performance of each rewriter,
> "Lexer" based and "ParseTree" based. To no surprise, "Lexer" wins out every
> time essentially by definition, since obviously Caja's parser uses its own
> lexer under the hood.
>
> Summary:
> * The performance of each is fundamentally O(n), though...
> * For any given input size, Lexer-based rewriting averages between 2.5 -
> 3.5x faster than ParseTree-based (ie. c =~ 3.5 at worst).
> * By far, the majority of time involved in ParseTree-based optimization is
> initial parsing: 75% of all processing.
>
> Details:
> 1. I wrote a simple rewriter/parser profiler which rewrites (the sample
> rewriter gadget's content * X repetitions) N times, recording the resulting
> run time. The run time of parse-based rewriting degraded as N increased, in
> all likelihood due to the additional cost of object management (lexer-based
> rewriting involves few intermediate objects). Given that the results of
> rewriting will be variously cached, it's very unlikely that rewriting will
> happen in immediate succession hundreds or thousands of times. As such, I
> fixed N = 1 to re-run the tests in relative isolation from one another.
> Results from a given run:
>
> LEX-BASED*100 rewriter, 1 runs in 177047 microsecs [177.04704] millis/run
>
> PARSE-BASED*100 rewriter, 1 runs in 609136 microsecs [609.136128]
> millis/run
>
> Parse/lex ratio: 3.4405327398939263
>
> LEX-BASED*50 rewriter, 1 runs in 43936 microsecs [43.936] millis/run
>
> PARSE-BASED*50 rewriter, 1 runs in 148980 microsecs [148.979968] millis/run
>
> Parse/lex ratio: 3.3908412235979606
>
> LEX-BASED*10 rewriter, 1 runs in 3093 microsecs [3.092992] millis/run
>
> PARSE-BASED*10 rewriter, 1 runs in 11020 microsecs [11.020032] millis/run
>
> Parse/lex ratio: 3.5628839314581313
>
> LEX-BASED*1 rewriter, 1 runs in 600 microsecs [0.600064] millis/run
>
> PARSE-BASED*1 rewriter, 1 runs in 1819 microsecs [1.819136] millis/run
>
> Parse/lex ratio: 3.0316666666666667
>
>
> 2. Drilling down, I added simple operation profiling to each component of
> parse-tree rewriting: original parse (CajaHtmlParser); building mutable
> tree
> nodes; rewriting links; concatenating JS nodes; rewriting style blocks;
> rendering parse tree. I then reran the same tests.
>
> Results from subsequent run:
>
> LEX-BASED*100 rewriter, 1 runs in 165321 microsecs [165.32096] millis/run
>
> PARSE-BASED*100 rewriter, 1 runs in 646884 microsecs [646.88384] millis/run
>
> Parse/lex ratio: 3.912896728183352
>
> [PARSE OPS]
>
> Op[style-rewrite] min:25.419ms, max:25.419ms, avg:25.419ms
>
> Op[render] min:36.851ms, max:36.851ms, avg:36.851ms
>
> Op[js-rewrite] min:53.983ms, max:53.983ms, avg:53.983ms
>
> Op[link-rewrite] min:31.136ms, max:31.136ms, avg:31.136ms
>
> Op[build-nodes] min:32.929ms, max:32.929ms, avg:32.929ms
>
> Op[parse] min:464.211ms, max:464.211ms, avg:464.211ms
>
>
> LEX-BASED*50 rewriter, 1 runs in 30684 microsecs [30.683904] millis/run
>
> PARSE-BASED*50 rewriter, 1 runs in 161132 microsecs [161.132032] millis/run
>
> Parse/lex ratio: 5.251336201277539
>
> [PARSE OPS]
>
> Op[style-rewrite] min:8.581ms, max:8.581ms, avg:8.581ms
>
> Op[render] min:5.184ms, max:5.184ms, avg:5.184ms
>
> Op[js-rewrite] min:11.606ms, max:11.606ms, avg:11.606ms
>
> Op[link-rewrite] min:7.533ms, max:7.533ms, avg:7.533ms
>
> Op[build-nodes] min:3.41ms, max:3.41ms, avg:3.41ms
>
> Op[parse] min:121.367ms, max:121.367ms, avg:121.367ms
>
>
> LEX-BASED*10 rewriter, 1 runs in 3371 microsecs [3.371008] millis/run
>
> PARSE-BASED*10 rewriter, 1 runs in 10336 microsecs [10.336] millis/run
>
> Parse/lex ratio: 3.066152477009789
>
> [PARSE OPS]
>
> Op[style-rewrite] min:0.563ms, max:0.563ms, avg:0.563ms
>
> Op[render] min:0.678ms, max:0.678ms, avg:0.678ms
>
> Op[js-rewrite] min:1.374ms, max:1.374ms, avg:1.374ms
>
> Op[link-rewrite] min:0.718ms, max:0.718ms, avg:0.718ms
>
> Op[build-nodes] min:0.295ms, max:0.295ms, avg:0.295ms
>
> Op[parse] min:6.466ms, max:6.466ms, avg:6.466ms
>
>
> LEX-BASED*1 rewriter, 1 runs in 592 microsecs [0.592128] millis/run
>
> PARSE-BASED*1 rewriter, 1 runs in 2083 microsecs [2.083072] millis/run
>
> Parse/lex ratio: 3.518581081081081
>
> [PARSE OPS]
>
> Op[style-rewrite] min:0.082ms, max:0.082ms, avg:0.082ms
>
> Op[render] min:0.077ms, max:0.077ms, avg:0.077ms
>
> Op[js-rewrite] min:0.143ms, max:0.143ms, avg:0.143ms
>
> Op[link-rewrite] min:0.111ms, max:0.111ms, avg:0.111ms
>
> Op[build-nodes] min:0.043ms, max:0.043ms, avg:0.043ms
>
> Op[parse] min:1.437ms, max:1.437ms, avg:1.437ms
>
>
> 3. Drilling further, I wrote a separate test breaking out the performance
> components to parsing: calling the Caja DomParser.parseFragment(...) API,
> and subsequently wrapping the results of that call with ParsedHtmlNode
> objects to satisfy interface requirements:
>
> Typical run:
>
> Caja parser [size*1, runs:1] in 97538 microsecs [97.538048] millis/run
>
> [PARSER COMPONENTS]
>
> Op[raw-caja-parse] min:70.033ms, max:70.033ms, avg:70.033ms
>
> Op[build-parse-nodes] min:3.644ms, max:3.644ms, avg:3.644ms
>
>
> Caja parser [size*10, runs:1] in 42915 microsecs [42.915072] millis/run
>
> [PARSER COMPONENTS]
>
> Op[raw-caja-parse] min:34.676ms, max:34.676ms, avg:34.676ms
>
> Op[build-parse-nodes] min:7.148ms, max:7.148ms, avg:7.148ms
>
>
> Caja parser [size*50, runs:1] in 157048 microsecs [157.048064] millis/run
>
> [PARSER COMPONENTS]
>
> Op[raw-caja-parse] min:138.904ms, max:138.904ms, avg:138.904ms
>
> Op[build-parse-nodes] min:17.313ms, max:17.313ms, avg:17.313ms
>
>
> Caja parser [size*100, runs:1] in 236073 microsecs [236.07296] millis/run
>
> [PARSER COMPONENTS]
>
> Op[raw-caja-parse] min:173.743ms, max:173.743ms, avg:173.743ms
>
> Op[build-parse-nodes] min:43.295ms, max:43.295ms, avg:43.295ms
>
>
> Conclusions and Discussion:
>
> The purpose of this task was to prove that tree-based parsing is
> functionally viable, which has succeeded. Past that, it's a matter of
> choosing functionality vs. performance. Given that rewriting results are
> cached, perhaps even ~3x increase in rewriting cost will be worth paying.
>
>
> That's particularly true given the new class of optimizations/rewrites made
> possible with a parse tree, as well as some bugs that are more easily fixed
> using it. For instance, I recently discovered a bug with the existing JS
> tag
> rewriter which ignores type="..." attributes and doesn't maintain "id"
> attributes in certain situations. These can be resolved in the lexer case,
> but are clearer in the parser one.
>
>
> Lastly, as mentioned at the beginning of this thread, I plan to maintain
> the
> ability to manipulate a gadget by string, meaning a lexer-based approach
> can
> still be used where desired and parse-tree isn't required.
>
>
> Next steps:
>
> 1. My next step is to add modularity to content rewriting, but again
> without
> changing any caching semantics. Instead, rather than a single
> ContentRewriter being injected, a ContentRewriterRegistry will be. The
> default Registry will support injection of a single ContentRewriter to
> maintain backward compatibility for now.
>
> 2. GadgetSpec immutability restored, ensuring post-rewritten caching.
>
> 3. ContentRewriter API cleanup.
>
>
> --John
>
>
> On Tue, Aug 12, 2008 at 7:43 PM, John Hjelmstad <[EMAIL PROTECTED]> wrote:
>
> > Interesting idea, and sounds fine to me. Concretely, this lets me
> sidestep
> > SHINDIG-500 for a little while, which is nice (though I'd _really_ like
> to
> > see the API cleanup go in! :)), in favor of migrating the existing
> rewriter
> > to a tree-based approach. Turns out I've been working on #1 and #2
> > independently anyway. I'll post a patch soon. Thanks!
> >
> > John
> >
> >
> > On Tue, Aug 12, 2008 at 7:14 PM, Louis Ryan <[EMAIL PROTECTED]> wrote:
> >
> >> Can we prove this out incrementally bottom-up. In general I think using
> >> DOM
> >> is the right thing to do from a rewriting standpoint. So here's how I
> >> propose we proceed
> >>
> >> 1. If the Caja dom is a little awkward wrap it, if not lets just use it
> as
> >> is. We can always resolve this later
> >> 2. Change the existing content rewriters to use the DOM instead of a
> >> lexer,
> >> should be pretty easy. Maybe add some fancier rewriting like moving CSS
> >> into
> >> HEAD
> >> 3. Do some perf testing, look into memory overhead of dom transformation
> >> etc.
> >> 4. Alter GadgetSpec's to retain the dom when they are cached
> >> 5. Alter the gadget rendering phase to serialize the content of the dom
> to
> >> output
> >> 6. Annotate the dom at parse time to make render time user-pref
> >> substituions
> >> faster, this should be easy enough too...
> >>
> >> This should be enough to prove out the pipeline end-to-end and identify
> >> any
> >> major perf niggles. Once this is done we can look into how to inject a
> >> rewriter pipeline into the parsing phase and the rendering phase.
> >>
> >> -Louis
> >>
> >>
> >>
> >> On Tue, Aug 12, 2008 at 5:57 PM, John Hjelmstad <[EMAIL PROTECTED]>
> wrote:
> >>
> >> > Re-responding in order to apply the last few exchanges to
> >> > google-caja-discuss@ (@gmail vs. @google membership issues).
> >> >
> >> > On Tue, Aug 12, 2008 at 4:48 PM, John Hjelmstad <[EMAIL PROTECTED]>
> >> wrote:
> >> >
> >> > > Hello,
> >> > >
> >> > > While beginning to refactor the rewriter APIs I've discovered that
> >> there
> >> > > unfortunately is one semantic difference inherent to moving
> >> getContent()
> >> > and
> >> > > setContent() methods into the Gadget object (replacing
> >> > > View.get/setRewrittenContent()): BasicGadgetSpecFactory no longer
> >> caches
> >> > > rewritten content.
> >> > >
> >> > > I've written a discussion of this in issue SHINDIG-500, which tracks
> >> this
> >> > > implementation sub-task:
> >> > https://issues.apache.org/jira/browse/SHINDIG-500
> >> > >
> >> > > To summarize:
> >> > > 1. Is this change acceptable for the time being?
> >> > > 2. I suggest that we can, at a later date, move fetching of gadget
> >> specs
> >> > > into GadgetServer while injecting a Gadget(Spec) cache there as
> well,
> >> > > offering finer-tuned control over caching characteristics.
> >> > >
> >> > > Thanks,
> >> > > John
> >> > >
> >> > >
> >> > > On Mon, Aug 11, 2008 at 2:20 PM, John Hjelmstad <[EMAIL PROTECTED]>
> >> > wrote:
> >> > >
> >> > >> I understand these concerns, and should be clear that I don't
> >> (despite
> >> > my
> >> > >> personal interest in experimenting with the idea, agreed that we
> >> don't
> >> > have
> >> > >> time for it at the moment) have any plans to introduce this sort of
> >> RPC
> >> > >> anywhere - certainly not in Shindig itself, as any such call would
> be
> >> > hidden
> >> > >> behind an interface anyway.
> >> > >>
> >> > >> Putting the RPC hypothetical aside, I still feel that there's value
> >> to
> >> > >> implementing HTML parsing in terms of an interface:
> >> > >> * Clearer separation of concerns/boundary between projects.
> >> > >>   - Corollary simplicity in testing.
> >> > >> * Clearer API for content manipulation (that doesn't require
> >> knowledge
> >> > of
> >> > >> Caja).
> >> > >>
> >> > >> I could be convinced otherwise, but at this point the code involved
> >> > seems
> >> > >> of manageable size, so still worth doing. Thoughts?
> >> > >>
> >> > >> John
> >> > >>
> >> > >>
> >> > >>
> >> > >> On Mon, Aug 11, 2008 at 1:00 PM, Kevin Brown <[EMAIL PROTECTED]>
> >> wrote:
> >> > >>
> >> > >>> I agree with Louis -- that's just not practical. Every rewriting
> >> > >>> operation
> >> > >>> must work in real time. Caja's existing html parser is adequate
> for
> >> our
> >> > >>> needs, and we shouldn't go out of our way to tolerate every oddity
> >> of
> >> > >>> random
> >> > >>> web browsers (especially as it simply wouldn't work unless you
> >> farmed
> >> > it
> >> > >>> out
> >> > >>> to *every* browser). Any new code needs to be grounded in
> practical,
> >> > >>> current
> >> > >>> needs, not theoretical options. We can always change code later if
> >> we
> >> > >>> find a
> >> > >>> real need for something like that. We have real work to do in the
> >> > >>> meantime.
> >> > >>>
> >> > >>> On Mon, Aug 11, 2008 at 12:06 PM, Louis Ryan <[EMAIL PROTECTED]>
> >> wrote:
> >> > >>>
> >> > >>> > John,
> >> > >>> >
> >> > >>> > From a practicality standpoint I'm a little nervous about this
> >> plan
> >> > to
> >> > >>> make
> >> > >>> > RPCs calls out of a Java process to a native process to fetch a
> >> parse
> >> > >>> tree
> >> > >>> > for transformations that have to occur realtime. I don't think
> the
> >> > >>> > motivating factor here is to accept all inputs that browsers
> can.
> >> > >>> Gadget
> >> > >>> > developers will tailor their markup to the platform as they have
> >> done
> >> > >>> > already. I would greatly prefer us to pick one 'good' parser and
> >> > stick
> >> > >>> with
> >> > >>> > it for all the manageability and consumability benefits that
> come
> >> > with
> >> > >>> that
> >> > >>> > decision. Perhaps Im missing something here?
> >> > >>> >
> >> > >>> > -Louis
> >> > >>> >
> >> > >>> > On Mon, Aug 11, 2008 at 11:59 AM, John Hjelmstad <
> >> [EMAIL PROTECTED]>
> >> > >>> wrote:
> >> > >>> >
> >> > >>> > > On Fri, Aug 8, 2008 at 6:10 AM, Ben Laurie <[EMAIL PROTECTED]>
> >> > wrote:
> >> > >>> > >
> >> > >>> > > > [+google-caja-discuss]
> >> > >>> > > >
> >> > >>> > > > On Thu, Aug 7, 2008 at 9:27 PM, John Hjelmstad <
> >> [EMAIL PROTECTED]
> >> > >
> >> > >>> > wrote:
> >> > >>> > > > > On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <
> [EMAIL PROTECTED]
> >> >
> >> > >>> wrote:
> >> > >>> > > > >
> >> > >>> > > > >> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <
> >> > >>> [EMAIL PROTECTED]>
> >> > >>> > > > wrote:
> >> > >>> > > > >> > This proposal effectively enables the renderer to
> become
> >> a
> >> > >>> > > multi-pass
> >> > >>> > > > >> > compiler for gadget content (essentially, arbitrary web
> >> > >>> content).
> >> > >>> > > Such
> >> > >>> > > > a
> >> > >>> > > > >> > compiler can provide several benefits: static
> >> optimization
> >> > of
> >> > >>> > gadget
> >> > >>> > > > >> content
> >> > >>> > > > >> > (auto-proxying of images, whitespace/comment removal,
> >> > >>> > consolidation
> >> > >>> > > of
> >> > >>> > > > >> CSS
> >> > >>> > > > >> > blocks), security benefits (caja et al), new
> >> functionality
> >> > >>> > > (annotation
> >> > >>> > > > of
> >> > >>> > > > >> > content for stats, document analysis,
> container-specific
> >> > >>> > features),
> >> > >>> > > > etc.
> >> > >>> > > > >> To
> >> > >>> > > > >> > my knowledge no such infrastructure exists today (with
> >> the
> >> > >>> > possible
> >> > >>> > > > >> > exception of Caja itself, which I'd like to dovetail
> with
> >> > this
> >> > >>> > > work).
> >> > >>> > > > >>
> >> > >>> > > > >> Caja clearly provides a large chunk of the code you'd
> need
> >> for
> >> > >>> this.
> >> > >>> > > > >> I'd like to hear how we'd manage to avoid duplication
> >> between
> >> > >>> the
> >> > >>> > two
> >> > >>> > > > >> projects.
> >> > >>> > > > >>
> >> > >>> > > > >> A generalised framework for manipulating content sounds
> >> like a
> >> > >>> great
> >> > >>> > > > >> idea, but probably should not live in either of the two
> >> > projects
> >> > >>> > (Caja
> >> > >>> > > > >> and Shindig) but rather should be shared by both of them,
> I
> >> > >>> suspect.
> >> > >>> > > > >
> >> > >>> > > > >
> >> > >>> > > > > I agree on both counts. As I mentioned, the piece of this
> >> idea
> >> > >>> that I
> >> > >>> > > > expect
> >> > >>> > > > > to change the most is the parse tree, and Caja's
> >> .parser.html
> >> > and
> >> > >>> > > > > .parser.css packages contain much of what I've thrown in
> >> here
> >> > as
> >> > >>> a
> >> > >>> > > base.
> >> > >>> > > > >
> >> > >>> > > > > My key requirements are:
> >> > >>> > > > > * Lightweight framework.
> >> > >>> > > > > * Parser modularity, mostly for HTML parsers (to re-use
> the
> >> > good
> >> > >>> work
> >> > >>> > > > done
> >> > >>> > > > > by WebKit or Gecko.. CSS/JS can come direct from Caja I'd
> >> bet)
> >> > >>> > > > > * Automatic maintenance of DOM<->String conversion.
> >> > >>> > > > > * Easy to manipulate structure.
> >> > >>> > > >
> >> > >>> > > > I'm not sure what the value of parser modularity is? If the
> >> > >>> resulting
> >> > >>> > > > tree is different, then that's a problem for people
> processing
> >> > the
> >> > >>> > > > tree. And if it is not, then why do we care?
> >> > >>> > >
> >> > >>> > >
> >> > >>> > > IMO the value of parser modularity is that the lenient parsers
> >> > native
> >> > >>> to
> >> > >>> > > browsers can be used in place of those that might not accept
> all
> >> > >>> inputs.
> >> > >>> > > One
> >> > >>> > > could (and I'd like to) adapt WebKit or Gecko's parsing code
> >> into a
> >> > >>> > server
> >> > >>> > > that runs parallel to Shindig and provides a "local RPC"
> service
> >> > for
> >> > >>> > > parsing
> >> > >>> > > semi-structured HTML. The resulting tree for WebKit's parser
> >> might
> >> > be
> >> > >>> > > different than that for an XHTML parser, Gecko's parser, etc,
> >> but
> >> > if
> >> > >>> the
> >> > >>> > > algorithm implemented atop it is rule-based rather than
> >> > >>> strict-structure
> >> > >>> > > based that should be fine, no?
> >> > >>> > >
> >> > >>> > >
> >> > >>> > > >
> >> > >>> > > >
> >> > >>> > > > >
> >> > >>> > > > > I'd love to see both projects share the same base syntax
> >> tree
> >> > >>> > > > > representations. I considered .parser.html(.DomTree) and
> >> > >>> .parser.css
> >> > >>> > > for
> >> > >>> > > > > these, but at the moment these appeared to be a little
> more
> >> > tied
> >> > >>> to
> >> > >>> > > > Caja's
> >> > >>> > > > > lexer/parser implementation than I preferred (though I
> admit
> >> > >>> > > > > AbstractParseTreeNode contains most of what's needed).
> >> > >>> > > > >
> >> > >>> > > > > To be sure, I don't see this as an end-all-be-all
> >> > transformation
> >> > >>> > system
> >> > >>> > > > in
> >> > >>> > > > > any way. I'd just like to put *something* reasonable in
> >> place
> >> > >>> that we
> >> > >>> > > can
> >> > >>> > > > > play with, provide some benefit, and enhance into a truly
> >> > >>> > sophisticated
> >> > >>> > > > > vision of document rewriting.
> >> > >>> > > > >
> >> > >>> > > > >
> >> > >>> > > > >>
> >> > >>> > > > >>
> >> > >>> > > > >> >  c. Add Gadget.getParsedContent().
> >> > >>> > > > >> >    i. Returns a mutable GadgetContentParseTree used to
> >> > >>> manipulate
> >> > >>> > > > Gadget
> >> > >>> > > > >> > Contents.
> >> > >>> > > > >> >    ii. Mutable tree calls back to the Gadget object
> >> > indicating
> >> > >>> > when
> >> > >>> > > > any
> >> > >>> > > > >> > change is made, and emits an error if setContent() has
> >> been
> >> > >>> called
> >> > >>> > > in
> >> > >>> > > > the
> >> > >>> > > > >> > interim.
> >> > >>> > > > >>
> >> > >>> > > > >> In Caja we have been moving towards immutable trees...
> >> > >>> > > > >
> >> > >>> > > > >
> >> > >>> > > > > Interested to hear more about this. The whole idea is for
> >> the
> >> > >>> > gadget's
> >> > >>> > > > tree
> >> > >>> > > > > representation to be modifiable. Doing that with immutable
> >> > trees
> >> > >>> to
> >> > >>> > me
> >> > >>> > > > > suggests that a rewriter would have to create a completely
> >> new
> >> > >>> tree
> >> > >>> > and
> >> > >>> > > > set
> >> > >>> > > > > it as a representation of new content. That's convenient
> as
> >> far
> >> > >>> as
> >> > >>> > the
> >> > >>> > > > > Gadget's maintenance of String<->Tree representations is
> >> > >>> concerned...
> >> > >>> > > but
> >> > >>> > > > > seems pretty heavyweight for many types of edits: in-situ
> >> > >>> > modifications
> >> > >>> > > > of
> >> > >>> > > > > text, content reordering, etc. That's particularly so in a
> >> > >>> > > > single-threaded
> >> > >>> > > > > (viz rewriting) environment.
> >> > >>> > > >
> >> > >>> > > > Never having been entirely sold on the concept, I'll let
> those
> >> on
> >> > >>> the
> >> > >>> > > > Caja team who advocate immutability explain why.
> >> > >>> > > >
> >> > >>> > >
> >> > >>> >
> >> > >>>
> >> > >>
> >> > >>
> >> > >
> >> >
> >>
> >
> >
>

Re: Content Rewriter Modularization: Design/Change

Reply via email to