Sorry, I didn't mean to imply that this division was my idea or anything. The phases of parsing are explicit already. By 'monster' object I don't mean that it is large or incomprehensible, but that it has a few too many responsibilities to be easy to test.
For instance, right now it's returning its output as a property of itself, and the serializer is sort of added on later. The pipeline should be a bit clearer and more stateless.

Anyway, this is easily fixed, and will be soon...

On 12/28/11 4:35 AM, Gabriel Wicke wrote:
> On 12/28/2011 05:45 AM, Neil Kandalgaonkar wrote:
>> I pulled out most of the parser-y parts from the parserTests, leaving
>> behind just tests.
>
> Very good, this was really needed.
>
>> However, the parser is still a bit of a monster object, hence the
>> deliberately silly name, ParserThingy.
>>
>> I'm trying to decompose it into a chain, roughly like:
>
> The current implementation already operates as a chain, as documented in
> https://www.mediawiki.org/wiki/Future/Parser_development:
>
> PEG wiki/HTML tokenizer      (or other tokenizers / SAX-like parsers)
>     | Chunks of tokens
>     V
> Token stream transformations
>     | Chunks of tokens
>     V
> HTML5 tree builder
>     | HTML5 DOM tree
>     V
> DOM Postprocessors
>     | HTML5 DOM tree
>     +------------------> (X)HTML serialization
>     |
>     V
> DomConverter
>     | WikiDom
>     V
> JSON serialization
>     | JSON string
>     V
> Visual Editor
>
> The token stream transformation phase and to some degree the DOM
> postprocessor phase will soon differ in their configuration depending on
> the intended output format, enabled extensions and other wiki-specific
> settings. Output intended for viewing will have templates fully expanded
> and more aggressive sanitization applied in DOM postprocessors. Output
> destined for the editor will have templates and extension tag results
> encapsulated. At least, that is the plan so far; we might come up with
> better ways to handle this later.
>
> The interface between the tokenizer, token stream transforms and the
> tree builder wrapper is currently synchronous, with a single list of
> tokens being passed from one phase to the next. This should be changed
> to event emitters that emit chunks of tokens. The tree builder wrapper
> already implements the event emitter pattern to internally communicate
> with the HTML5 tree builder library.
>
> The tree builder consumes token events until the end token is reached.
> The FauxHTML5.TreeBuilder wrapper could be extended to emit an
> additional signal when the end token has been processed, so that DOM
> postprocessing, WikiDom conversion and JSON serialization can be
> triggered. All DOM-based processing is essentially synchronous and does
> not perform any IO, so these stages can all be called from a single
> function for now. This stage should in turn be an event emitter, so that
> you can register for further asynchronous processing of the result.
>
> After the conversion to EventEmitters, the wrapper object (the
> ParserThingy you just created) still configures the stages in a
> particular way, and registers the stages as event listeners with each
> other. The size of the wrapper can eventually be reduced a bit by
> pushing more of the phase-specific setup into the phase constructors and
> setup functions themselves. The high degree of decomposition into phases
> already there means that a few lines of setup per phase will still
> add up to a 'monster object' of a few dozen lines. A reasonable price to
> pay for independent testing, potential parallel execution of stages and
> modularity, IMHO.
>
> Finally, the wrapper will start the pipeline by calling the tokenizer.
> No result will be returned, but a callback is called or an event emitted
> when the pipeline is done.
>
>> I'm assuming exceptions are not a good idea, due to Node's async nature
>> and there are certain constructs where we are explicitly async --
>> tokenizing can be streamed, and I assume when we start doing lookups to
>> figure out what to do with templates we'll need to go async for seconds
>> at a time.
>
> Error reporting will have to happen in-band in the form of specific
> tokens or DOM nodes with specific attributes that allow the editor or
> browser to render some error message. We should decide on an
> encapsulation for these that makes it easy to render or otherwise handle
> them generically. Exceptions should only be thrown for fatal bugs, but
> not network failures or similar.
>
>> I'm also assuming that 99.99% of the time we want a simple in-out
>> interface as described above. But for testing and debugging, we want to
>> instrument what's really going on. And we may want to pass control off
>> for a while when we bring template parsing into the mix. So that means
>> that either there are magic values, or there's some way to attach event
>> listeners to the serializer?
>
> Converting the pipeline to communicate using events is sufficient
> really. Apart from interface definitions regarding the representation of
> errors, tokens etc., no magic values are involved. Note that the parse()
> function of a simplified wrapper will also require a callback to receive
> the result, or be an EventEmitter itself to support asynchronous processing.
>
>> Is it okay to attach event listeners to the
>> serializer without tying them to a specific pipeline of wikitext that's
>> finding its way through the code?
>
> Depends on what you are trying to do. Reusing a parser pipeline for
> multiple parses will be fine (after adding implicit clean-ups for the
> tree builder phase). Your event receiver or callback will have to know
> what to do with the results from different parses though.
>
> Gabriel
>
>
> _______________________________________________
> Wikitext-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitext-l

--
Neil Kandalgaonkar ) <[email protected]>
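For illustration only, here is a rough sketch of the event-based chain discussed in the quoted message: stages that emit chunks of tokens, a final stage that signals when the end token has been seen, and a wrapper whose parse() takes a callback instead of returning a result. It is not the actual parser code. Every class name (SketchTokenizer, SketchTransform, SketchTreeBuilder, SketchPipeline), the event names ('chunk', 'end', 'document') and the token shape are assumptions made up for this example; only Node's EventEmitter is real.

```javascript
'use strict';
const { EventEmitter } = require('events');

// Hypothetical tokenizer stand-in: emits tokens in chunks of two, then 'end'.
class SketchTokenizer extends EventEmitter {
  tokenize(text) {
    const words = text.split(/\s+/).filter(Boolean);
    for (let i = 0; i < words.length; i += 2) {
      const chunk = words.slice(i, i + 2).map((w) => ({ type: 'TEXT', value: w }));
      this.emit('chunk', chunk);
    }
    this.emit('end');
  }
}

// Hypothetical token stream transform: re-emits each upstream chunk
// after applying a per-token function, and forwards the end signal.
class SketchTransform extends EventEmitter {
  constructor(fn) {
    super();
    this.fn = fn;
  }
  listenTo(upstream) {
    upstream.on('chunk', (tokens) => this.emit('chunk', tokens.map(this.fn)));
    upstream.on('end', () => this.emit('end'));
    return this;
  }
}

// Hypothetical stand-in for the tree builder plus the synchronous DOM
// stages: collects tokens and emits one 'document' event after 'end'.
class SketchTreeBuilder extends EventEmitter {
  constructor() {
    super();
    this.tokens = [];
  }
  listenTo(upstream) {
    upstream.on('chunk', (tokens) => this.tokens.push(...tokens));
    upstream.on('end', () => {
      const doc = { children: this.tokens };
      this.tokens = []; // clean up so the pipeline can be reused
      this.emit('document', doc);
    });
    return this;
  }
}

// The wrapper only wires stages together and starts the tokenizer.
class SketchPipeline {
  constructor() {
    this.tokenizer = new SketchTokenizer();
    this.transform = new SketchTransform(
      (t) => ({ ...t, value: t.value.toUpperCase() })
    ).listenTo(this.tokenizer);
    this.treeBuilder = new SketchTreeBuilder().listenTo(this.transform);
  }
  // Nothing is returned; the callback fires when the pipeline is done.
  parse(text, callback) {
    this.treeBuilder.once('document', (doc) => callback(null, doc));
    this.tokenizer.tokenize(text);
  }
}

const pipeline = new SketchPipeline();
pipeline.parse('a b c', (err, doc) => {
  console.log(doc.children.map((t) => t.value).join(' ')); // prints "A B C"
});
```

Because each stage only knows the events of the stage before it, a test can attach its own listener to any stage (for example pipeline.transform.on('chunk', ...)) without touching the wrapper.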

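The in-band error reporting described in the quoted message could look roughly like the sketch below: a failed template lookup does not throw, it becomes an ordinary token that a renderer can handle generically. The token shapes, the 'ERROR' type, and the functions errorToken, lookupTemplate, expandTemplates and render are all hypothetical names invented for this example, and the lookup is a tiny in-memory stand-in for what would really be an asynchronous request.

```javascript
'use strict';

// Hypothetical encapsulation for an in-band error.
function errorToken(source, message) {
  return { type: 'ERROR', source, message };
}

// Hypothetical template lookup. Failure is reported through the
// callback's first argument, never by throwing.
function lookupTemplate(name, callback) {
  const known = { greeting: 'Hello' };
  if (Object.prototype.hasOwnProperty.call(known, name)) {
    callback(null, known[name]);
  } else {
    callback(new Error('template not found: ' + name));
  }
}

// Expands TEMPLATE tokens. A failed lookup yields an ERROR token in the
// same position, so downstream stages keep receiving a normal stream.
function expandTemplates(tokens, done) {
  const out = new Array(tokens.length);
  let pending = tokens.length;
  if (pending === 0) {
    done(out);
    return;
  }
  const finishOne = () => {
    pending -= 1;
    if (pending === 0) done(out);
  };
  tokens.forEach((token, i) => {
    if (token.type !== 'TEMPLATE') {
      out[i] = token;
      finishOne();
      return;
    }
    lookupTemplate(token.name, (err, text) => {
      out[i] = err
        ? errorToken(token.name, err.message)
        : { type: 'TEXT', value: text };
      finishOne();
    });
  });
}

// A generic renderer: error tokens are displayed, not thrown.
function render(tokens) {
  return tokens
    .map((t) => (t.type === 'ERROR' ? '[error: ' + t.message + ']' : t.value))
    .join(' ');
}

expandTemplates(
  [
    { type: 'TEMPLATE', name: 'greeting' },
    { type: 'TEXT', value: 'world' },
    { type: 'TEMPLATE', name: 'missing' },
  ],
  (tokens) => console.log(render(tokens))
);
// prints "Hello world [error: template not found: missing]"
```

The pending counter means the same function keeps working if the lookup callback later arrives asynchronously and out of order, since each result is written to its original index.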