[WARNING: long post only for those interested in TT parser internals]

Sergey Martynoff wrote:
> As I can see, current TT2 parser is not state-driven. 

It is, once you get into the Template::Parser::_parse() method.
The parser/Parser.yp file defines the Parse::Yapp grammar.  This is then
compiled to generate the Template::Grammar module, which defines the
$STATES and $RULES tables.  The _parse() method then implements the
state transition machine.
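
To make that concrete, here is a stripped-down sketch of the kind of loop
that _parse() runs.  The tables and token format below are invented for
illustration; the real $STATES and $RULES structures in Template::Grammar
are rather more involved.

use strict;
use warnings;

# Toy tables: each state maps a lookahead token to either a shift
# (move to a new state) or a reduce (apply a grammar rule).
my $STATES = [
    { TEXT => { shift => 1 } },            # state 0
    { ''   => { reduce => 0 } },           # state 1: end of input
];
my $RULES = [
    # [ rule name, items popped, action ]
    [ 'template', 1, sub { "block(@_)" } ],
];

sub _parse {
    my ($tokens) = @_;
    my @stack = ( [ 0, undef ] );          # pairs of [ state, value ]

    while (1) {
        my $state     = $stack[-1][0];
        my $lookahead = @$tokens ? $tokens->[0][0] : '';
        my $action    = $STATES->[$state]{$lookahead}
            or die "parse error in state $state at '$lookahead'\n";

        if (defined $action->{shift}) {
            my $token = shift @$tokens;
            push @stack, [ $action->{shift}, $token->[1] ];
        }
        else {
            my ($name, $len, $code) = @{ $RULES->[ $action->{reduce} ] };
            my @values = map { $_->[1] } splice @stack, -$len;
            return $code->(@values) if $name eq 'template';   # accept
            # a full driver would now consult a goto table for $name
        }
    }
}

print _parse([ [ TEXT => 'Hello World' ] ]), "\n";   # block(Hello World)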

> But why don's use "multi-level" structure of parsers? 

Yep, that's pretty much how it works.  In TT2, the parser's split_text()
and tokenise_directive() methods produce the array of tokens, _parse()
implements the table-driven syntax analyser, and the Template::Directive
module provides the code-generating back end.
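
(For reference, the outermost entry point to all of that in TT2 looks
roughly like this; parse() drives split_text(), _parse() and the code
generator under the hood.)

use strict;
use warnings;
use Template::Parser;

my $parser = Template::Parser->new();
my $parsed = $parser->parse("Hello [% name %]\n")
    or die $parser->error(), "\n";

# $parsed->{BLOCK} holds the Perl source generated for the template
print $parsed->{BLOCK};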

In TT3 things are a little more complicated, but they still follow the 
same multi-level approach.

At the front end, the Template::Scanner module scans the template to 
identify text and tags.  Depending on what kind of tag it finds (e.g. 
a regular [% ... %] TT tag, or an embedded variable like $this or 
${this}), it delegates to the appropriate Template::Tag module.  This 
then performs any tag-specific pre-processing (e.g. the - and + chomp 
flags allowed in TT directives that remove leading/trailing whitespace), 
and then goes on to parse the directives contained therein.
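
To give a flavour of the front end, the text/tag splitting is essentially
a regex loop along these lines (not the actual Template::Scanner code, and
ignoring embedded variables and chomp flags):

use strict;
use warnings;

my $START = qr/\[%/;        # assuming the default tag delimiters
my $END   = qr/%\]/;

sub scan {
    my ($input) = @_;
    my @chunks;

    while (length $input) {
        if ($input =~ s/^(.*?)$START(.*?)$END//s) {
            push @chunks, [ text => $1 ] if length $1;
            push @chunks, [ tag  => $2 ];
        }
        else {
            push @chunks, [ text => $input ];   # trailing text, no more tags
            last;
        }
    }
    return \@chunks;
}

my $chunks = scan("Hello [% name %], welcome!\n");
printf "%-4s => '%s'\n", @$_ for @$chunks;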

Directives are identified by keywords like INCLUDE, FOREACH, etc.
In TT3, I want the grammar to be dynamic, so that you can enable or
disable inbuilt TT3 directives, or add your own custom directives to 
perform particular tasks.  

For example, you might want a custom EMAIL directive to send email 
messages (and then again you might not, but you get the idea...)

  [% EMAIL to=user.email subject="Hello World" %]
     Dear [% user.name %]
    
     I like your shirt.  Is it new?
  [% END %]
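
Under the hood, enabling a directive like that could be as simple as adding
an entry to a keyword table.  This is only a sketch, and the module names
are made up, but it shows the general shape:

use strict;
use warnings;

# the grammar, conceptually: a map from keywords to directive modules
my %grammar = (
    INCLUDE => 'Template::Directive::Include',
    WRAPPER => 'Template::Directive::Wrapper',
    IF      => 'Template::Directive::If',
    DEBUG   => 'Template::Directive::Debug',
    END     => 'Template::Directive::End',
);

# adding a custom directive is just adding an entry...
$grammar{EMAIL} = 'My::Directive::Email';

# ...and disabling an inbuilt one is just deleting it
delete $grammar{DEBUG};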

So the TT3 grammar is essentially an extensible collection of directive 
modules, each indexed by their identifying keyword(s).  When we detect 
an active directive keyword in a tag, we delegate to the appropriate 
directive module, leaving it to parse any further arguments expected by 
the directive.  For example, the INCLUDE directive expects a template
name, and a list of one or more named parameters.  The code looks 
something like this:

package Template::Directive::Include;

sub parse {
   my ($self, $text, $handler) = @_;

   # consume the template name and any named parameters from the tag
   my $template = $self->parse_template_name($text);
   my $params   = $self->parse_params($text);

   # pass the parsed pieces to the back end to generate code
   return $handler->directive($self, $template, $params);
}

So each directive works as a recursive descent parser (i.e. the
calls to parse_template_name() and parse_params()) to analyse the
input text, and then calls off to the back-end handler to generate
the compiled template code.
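
To make that concrete, a back-end handler's directive() method might look
something like the following.  This is just a sketch: the class name and
the generated code are illustrative, not the real Template::Directive
output.

package My::Handler::PerlCode;
use strict;
use warnings;

sub new { bless {}, shift }

# called from Template::Directive::Include::parse() above
sub directive {
    my ($self, $directive, $template, $params) = @_;
    my $args = join ', ', map { "'$_' => '$params->{$_}'" } sort keys %$params;
    return "\$output .= \$context->include('$template', { $args });\n";
}

# e.g. [% INCLUDE header title="hello world" %] would come out as:
#   $output .= $context->include('header', { 'title' => 'hello world' });

1;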

Note that when it comes to parsing the arguments of a directive, 
the above parse() method commits fully to it, and expects to find
all arguments inside the current tag.  Like so:

  [% INCLUDE header title="hello world" %]

You cannot do this (for good reason):

  [% INCLUDE %][% header %][% title="hello world" %]

When it comes to block directives, things get a little more complicated.  
Take the WRAPPER directive, for example.  We parse the arguments as for
the INCLUDE directive, but then must begin a nested block that continues
until the corresponding END keyword.  The directive parse() method must
return after parsing the opening arguments to allow the scanner to continue
scanning on past the current tag (i.e. to scan the body of the WRAPPER...END
directive).  

So it first calls the begin_directive() method on the handler to register 
itself as a nested block directive.  At some point later, when the END 
keyword is encountered, the T::D::End directive calls end_directive() to 
terminate the WRAPPER block.

package Template::Directive::Wrapper;

sub parse {
   my ($self, $text, $handler) = @_;

   my $template = $self->parse_template_name($text);
   my $params   = $self->parse_params($text);

   # open a nested block; its body is collected until END arrives
   return $handler->begin_directive($self, $template, $params);
}

package Template::Directive::End;

sub parse {
   my ($self, $text, $handler) = @_;
   # close the innermost open block directive
   return $handler->end_directive();
}
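
Behind the scenes the handler just keeps a stack of open blocks, something
along these lines (again a sketch, with made-up names rather than a final
interface):

package My::Handler;
use strict;
use warnings;

sub new { bless { stack => [] }, shift }

sub begin_directive {
    my ($self, $directive, @args) = @_;
    # remember the open block; its body accumulates until END turns up
    push @{ $self->{stack} }, { directive => $directive, args => \@args, body => [] };
    return $self;
}

sub end_directive {
    my ($self) = @_;
    my $block = pop @{ $self->{stack} }
        or die "END found with no open block directive\n";
    # the completed block can now be handed back to generate its code
    return $block;
}

1;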

For things like IF...ELSIF...ELSE...END, it gets more complicated.  The
current TT3 implementation works, but is a bit of a proof-of-concept hack.

What I'm working on at the moment is a process whereby each directive
declares the keywords that constitute its own context-free grammar.
Then the container grammar calculates the 'first' and 'follow' tables
(see the Dragon Book for info) to work out which keywords can appear
where, and how to shift/reduce the handlers to nest and un-nest block
directives as intended.

The upshot is that the TT3 grammar will be defined something like so:

  GET;
  SET;
  WRAPPER END;
  IF ELSIF* ELSE? END;
  TRY CATCH* FINAL? END;
  ...etc...
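
In module terms, each directive might declare that structure for itself,
something like this (method names entirely hypothetical, just to
illustrate the idea):

package Template::Directive::If;

sub keywords  { qw( IF ELSIF ELSE END ) }   # keywords involved in the block
sub structure { 'IF ELSIF* ELSE? END' }     # its shape, as in the list above

package Template::Directive::Wrapper;

sub keywords  { qw( WRAPPER END ) }
sub structure { 'WRAPPER END' }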

This keyword-level view of the grammar defines the overall block structure
of the language.  It is used to generate the states that control how 
directives are pushed onto and popped off the parser control stack 
when significant keywords are identified in the input stream
(equating to the shift and reduce actions of a purely table-driven parser).
This effectively determines which directive object gets a call to its
parse() method, and when.
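
In code terms the keyword-level dispatch boils down to something like
this.  A very rough sketch: the real thing will be driven by the generated
state tables, and the hash names below are invented purely for
illustration.

package My::Grammar;
use strict;
use warnings;

sub keyword {
    my ($self, $keyword, $text, $handler) = @_;
    my $stack     = $self->{block_stack};
    my $directive = $self->{directive}{$keyword}
        or die "unknown or disabled directive keyword: $keyword\n";

    if ($self->{opens_block}{$keyword}) {
        push @$stack, $keyword;                  # "shift": a new nested block
    }
    elsif ($keyword eq 'END') {
        pop @$stack or die "unexpected END\n";   # "reduce": close the block
    }
    # keywords like ELSIF, ELSE and CATCH neither open nor close a block;
    # they continue the one on top of the stack (the states check legality)

    return $directive->parse($text, $handler);
}

1;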

For the technically inclined, this reduces the "sentence-level" view (e.g.
a directive like <INCLUDE template params>) and the "paragraph-level" view
(e.g. something like <WRAPPER template params> ... <END>) to an LL(1)
grammar.  This allows this part of the parser to be quite a bit simpler 
than it would have been otherwise.  It also means that you should be 
able to add directives to a grammar at any time (even half way through 
parsing a template) and generate new state transition tables on the fly.  
With a more powerful grammar class like the LALR(1) of the current TT2
Parse::Yapp implementation, modifying the grammar in even a slight way is
likely to throw up a bunch of conflicts that won't make a great deal of
sense to anyone but a hardened parser hacker.

So, LL(1) is a Good Thing here.

However, when it comes to parsing the arguments of a directive (i.e. 
the bits following a keyword), we opt for recursive descent.  This 
allows us to harness the full power of Perl's regex engine to perform
pretty much any kind of syntax analysis we like, with as much lookahead
(or lookbehind) as we need.  That makes us much more flexible when it
comes to parsing the expression-level elements of the grammar.
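
For example, the parse_template_name() and parse_params() methods used
earlier might be little more than a couple of anchored regexes chewing
through the remaining tag text.  A sketch, assuming $text is a scalar
reference to that text:

package My::Directive::Base;
use strict;
use warnings;

sub parse_template_name {
    my ($self, $text) = @_;
    $$text =~ s/^\s*(\w+|"[^"]*"|'[^']*')//
        or die "expected a template name\n";
    return $1;
}

sub parse_params {
    my ($self, $text) = @_;
    my %params;
    # name=value pairs, e.g. title="hello world"
    while ($$text =~ s/^\s*(\w+)\s*=\s*("[^"]*"|'[^']*'|\S+)//) {
        $params{$1} = $2;
    }
    return \%params;
}

1;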

So hopefully we'll have the best of both worlds: partly driven by 
state tables, partly recursive descent.  But this is just the middle
layer of the three.  We still have the scanner out front and the 
code generator out back, making it like a classic "multi-level" parser,
but with some extra magic going on in the middle.

> P.S. I have not reviewed TT3 parser, so maybe it is designed just
> like I suggested. Sorry for idle talk in this case.

Not at all.  

My apologies for the long, rambling reply which I'm sure has more detail 
than anyone is interested in.  

But in my defence, the process of describing it above has been very 
useful in forcing me to think carefully about it and clarify my 
thoughts regarding the current state of play.  So even if I'm mostly 
talking to myself, at least I'm enjoying the conversation :-)

Any comments, suggestions, etc., are of course most welcome.

Cheers
A

