[Tutor] indented grammar parsing

spir Sun, 11 Jan 2009 10:12:29 -0800

Hello,

this is a rather specific question about parsing an indented grammar.
I have a working implementation (presented below) that may be worth a critic 
review -- if you like
it.


First issue is: is there a way to express an indented formatfing using a common 
grammar
language such as BNF (or regex)? If yes, how to do that? If not, what kind of 
formal grammar is
actually such a format? Is there any standard way to express it?

I have searched python's grammar for this: actually, for any compound 
statement, a block (=
section content) is simply renamed 'suite' with no more format expression. The 
lexical analysis
section on indentation reads:

'' The indentation levels of consecutive lines are used to generate INDENT and 
DEDENT tokens, using
a stack, as follows. Before the first line of the file is read, a single zero 
is pushed on the
stack; this will never be popped off again. The numbers pushed on the stack 
will always be
strictly increasing from bottom to top. At the beginning of each logical line, 
the line’s
indentation level is compared to the top of the stack. If it is equal, nothing 
happens. If it is
larger, it is pushed on the stack, and one INDENT token is generated. If it is 
smaller, it must be
one of the numbers occurring on the stack; all numbers on the stack that are 
larger are popped
off, and for each number popped off a DEDENT token is generated. At the end of 
the file, a DEDENT
token is generated for each number remaining on the stack that is larger than 
zero. ''

I guess from this text that python parsers have a pre-parse phase that generate 
the so-called
INDENT and DEDENT tokens -- which are then used during actual parsing. So that 
a typical section
looks like this for the parser:

header
        INDENT
        line
        line
        line
        DEDENT

Which is, in fact, recreating a common block start / stop format (e.g. C's 
{block}). Is it? Do you
know whether python parsers actually do that during a kind of pre-parse phase, 
or whether their
designers have found a way of matching indent/dedent together with the rest of 
the source text?

Thank you,
Denis

==================

My trial:
I have have implemented an overall parser to parse an indented grammar in the 
simplest (recursive)
case I could specify:
* no blank line, no comment
* a single format of inline pattern
* each indent level marked as a single (tab) char
* only step 1 indentation increase

To do this, I use a set of tricks. Below the grammar (using a pseudo BNF/regex
format -- actually it is implemented using a custom parser generator):

inline  : [a..z0..9_]+
line    : Indent inline NL
bloc    : (section | line)+
section : line IndentMore bloc IndentLess
all     : <= bloc>

Oviously, I use 3 names starting with Indent, that are not defined. They are 
kinds of
pseudo-patterns and all rely on an external variable that holds the current 
indent level and is
hosted as an attribute of Indent:
* Indent: matches if a proper indentation is found, meaning equal to current 
indent level --
advances pointer
* IndentMore: matches if the actual indentation found is =level+1 -- does not 
advance pointer
(=lookahead) -- increments current level
* IndentLess: matches if the actual indentation found is <level -- does not 
advance pointer --
decrements current level of -1 whatever the actual indent level found

Last note causes that n IndentLess matches will be generated even when the 
indent level is
brutally decremented of n steps at once -- so that in a valid text each 
IndentMore match gets its
IndentLess counter-part.
I wonder about the indent level record stack introduced in CPython's 
description: I need only a
single variable.

As a example, here is the actual code of IndentLess:

class IndentLess(Pattern):
        def __init__(self):
                Pattern.__init__(self)
        def _check(self,text,pos):
                ''' check indent level decrease (of any number of step)
                        -- but decrement current level by 1 only in any case
                        ~ no pointer advance <==> Right() '''
                # match indentation
                start_pos = pos
                (result,pos) = ZeroOrMore(TAB)._check(text,pos)
                # case proper indent decrement
                indent_level = pos - start_pos
                if indent_level < Indent.level:
                        Indent.level -= 1
                        result = Result(self, None, text,start_pos,start_pos)
                        return (result,start_pos)       # (no pointer advance)
                # case no indent decrement (by step 1 or more)
                # (may also be end-of-text reached)
                if pos >= len(text):
                        raise EndOfText(self, text, start_pos)
                message =       "No proper indentation decrement:\nindent level 
%s --> %s"\
                                %(Indent.level, indent_level)
                raise NoMatch(self, text, start_pos, message)
        def _full_expr(self):
                return "INDENT--"

Below a test text and the corresponding parse result shown in tree view:

=======
1
2
3
        31
        32
4
5
6
        61
        62
        63
                631
                632
7
=======

=======
all
        line:1
        line:2
        section
                line:3
                bloc
                        line:31
                        line:32
        line:4
        line:5
        section
                line:6
                bloc
                        line:61
                        line:62
                        section
                                line:63
                                bloc
                                        line:631
                                        line:632
        line:7
=======

I first want to use this structure to parse config files written according to 
an recursive
indented format. So that the result would be a (record) object holding 
properties which actually
each are either a parameter or a nested sub-config, e.g.:

config
        name : val
        name : val
        connexion
                name : val
                name : val
        transmission
                name : val
                name : val


------
la vida e estranya
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor

[Tutor] indented grammar parsing

Reply via email to