Sorry, I didn't mean to imply that this division was my idea or 
anything. The phases of parsing are explicit already. By 'monster' 
object I don't mean that it is large or incomprehensible, but that it 
has a few too many responsibilities to be easy to test.

For instance, right now it returns its output as a property of 
itself, and the serializer is bolted on after the fact. The pipeline 
should be clearer and more stateless.

Anyway, this is easily fixed, and will be soon...


On 12/28/11 4:35 AM, Gabriel Wicke wrote:
> On 12/28/2011 05:45 AM, Neil Kandalgaonkar wrote:
>> I pulled out most of the parser-y parts from the parserTests, leaving
>> behind just tests.
>
> Very good, this was really needed.
>
>> However, the parser is still a bit of a monster object, hence the
>> deliberately silly name, ParserThingy.
>>
>> I'm trying to decompose it into a chain, roughly like:
>
> The current implementation already operates as a chain, as documented in
> https://www.mediawiki.org/wiki/Future/Parser_development:
>
> PEG wiki/HTML tokenizer         (or other tokenizers / SAX-like parsers)
>      | Chunks of tokens
>      V
> Token stream transformations
>      | Chunks of tokens
>      V
> HTML5 tree builder
>      | HTML 5 DOM tree
>      V
> DOM Postprocessors
>      | HTML5 DOM tree
>      +------------------>  (X)HTML serialization
>      |
>      V
> DomConverter
>      | WikiDom
>      V
> JSON serialization
>      | JSON string
>      V
> Visual Editor
>
> The token stream transformation phase and to some degree the DOM
> postprocessor phase will soon differ in their configuration depending on
> the intended output format, enabled extensions and other wiki-specific
> settings. Output intended for viewing will have templates fully expanded
> and more aggressive sanitation applied in DOM postprocessors. Output
> destined for the editor will have templates and extension tag results
> encapsulated. At least, that is the plan so far; we might come up with
> better ways to handle this later.
>
> The interface between the tokenizer, token stream transforms and the
> tree builder wrapper is currently synchronous with a single list of
> tokens being passed from one phase to the next. This should be changed
> to event emitters that emit chunks of tokens. The tree builder wrapper
> already implements the event emitter pattern to internally communicate
> with the HTML5 tree builder library.
>
> The tree builder consumes token events until the end token is reached.
> The FauxHTML5.TreeBuilder wrapper could be extended to emit an
> additional signal when the end token was processed, so that DOM
> postprocessing and WikiDom conversion and JSON serialization can be
> triggered. All DOM-based processing is essentially synchronous and does
> not perform any IO, so these stages can all be called from a single
> function for now. This stage should in turn be an event emitter, so that
> you can register for further asynchronous processing of the result.
>
> After the conversion to EventEmitters, the wrapper object (the
> ParserThingy you just created) still configures the stages in a
> particular way, and registers the stages as event listeners with each
> other. The size of the wrapper can eventually be reduced a bit by
> pushing more of the phase-specific setup into the phase constructors and
> setup functions themselves. Even so, the high degree of decomposition
> into phases means that a few lines of setup per phase will still add
> up to a 'monster object' of a few dozen lines. A reasonable price to
> pay for independent testing, potential parallel execution of stages and
> modularity, IMHO.
>
> Finally, the wrapper will start the pipeline by calling the tokenizer.
> No result will be returned, but a callback is called or an event emitted
> when the pipeline is done.
>
>> I'm assuming exceptions are not a good idea, due to Node's async nature
>> and there are certain constructs where we are explicitly async --
>> tokenizing can be streamed, and I assume when we start doing lookups to
>> figure out what to do with templates we'll need to go async for seconds
>> at a time.
>
> Error reporting will have to happen in-band in the form of specific
> tokens or DOM nodes with specific attributes that allow the editor or
> browser to render some error message. We should decide on an
> encapsulation for these that makes it easy to render or otherwise handle
> them generically. Exceptions should only be thrown for fatal bugs, but
> not network failures or similar.
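
Something like this, maybe? A failure becomes an ordinary token with
specific attributes, so downstream stages can render or handle it like
any other content. The exact shape is my assumption, not a settled
format:

```javascript
// Build an in-band error token. A stage that hits, say, a network
// failure emits this into the stream instead of throwing; exceptions
// stay reserved for fatal bugs.
function makeErrorToken(message, origin) {
    return {
        type: 'ERROR',
        attribs: {
            'class': 'parse-error',  // lets the editor/browser style it
            'data-error': message,   // human-readable description
            'data-origin': origin    // which stage produced it
        }
    };
}
```
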
>
>> I'm also assuming that 99.99% of the time we want a simple in-out
>> interface as described above. But for testing and debugging, we want to
>> instrument what's really going on.  And we may want to pass control off
>> for a while when we bring template parsing into the mix. So that means
>> that either there are magic values, or there's some way to attach event
>> listeners to the serializer?
>
> Converting the pipeline to communicate using events is sufficient
> really. Apart from interface definitions regarding the representation of
> errors, tokens, etc., no magic values are involved. Note that the parse()
> function of a simplified wrapper will also require a callback to receive
> the result, or be an EventEmitter itself to support asynchronous processing.
>
>> Is it okay to attach event listeners to the
>> serializer without tying them to a specific pipeline of wikitext that's
>> finding its way through the code?
>
> Depends on what you are trying to do. Reusing a parser pipeline for
> multiple parses will be fine (after adding implicit clean-ups for the
> tree builder phase). Your event receiver or callback will have to know
> what to do with the results from different parses though.
>
> Gabriel
>
>
> _______________________________________________
> Wikitext-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitext-l

-- 
Neil Kandalgaonkar   ) <[email protected]>
