Hello, this is a rather specific question about parsing an indented grammar. I have a working implementation (presented below) that may be worth a critic review -- if you like it.
First issue is: is there a way to express an indented formatfing using a common grammar language such as BNF (or regex)? If yes, how to do that? If not, what kind of formal grammar is actually such a format? Is there any standard way to express it? I have searched python's grammar for this: actually, for any compound statement, a block (= section content) is simply renamed 'suite' with no more format expression. The lexical analysis section on indentation reads: '' The indentation levels of consecutive lines are used to generate INDENT and DEDENT tokens, using a stack, as follows. Before the first line of the file is read, a single zero is pushed on the stack; this will never be popped off again. The numbers pushed on the stack will always be strictly increasing from bottom to top. At the beginning of each logical line, the line’s indentation level is compared to the top of the stack. If it is equal, nothing happens. If it is larger, it is pushed on the stack, and one INDENT token is generated. If it is smaller, it must be one of the numbers occurring on the stack; all numbers on the stack that are larger are popped off, and for each number popped off a DEDENT token is generated. At the end of the file, a DEDENT token is generated for each number remaining on the stack that is larger than zero. '' I guess from this text that python parsers have a pre-parse phase that generate the so-called INDENT and DEDENT tokens -- which are then used during actual parsing. So that a typical section looks like this for the parser: header INDENT line line line DEDENT Which is, in fact, recreating a common block start / stop format (e.g. C's {block}). Is it? Do you know whether python parsers actually do that during a kind of pre-parse phase, or whether their designers have found a way of matching indent/dedent together with the rest of the source text? Thank you, Denis ================== My trial: I have have implemented an overall parser to parse an indented grammar in the simplest (recursive) case I could specify: * no blank line, no comment * a single format of inline pattern * each indent level marked as a single (tab) char * only step 1 indentation increase To do this, I use a set of tricks. Below the grammar (using a pseudo BNF/regex format -- actually it is implemented using a custom parser generator): inline : [a..z0..9_]+ line : Indent inline NL bloc : (section | line)+ section : line IndentMore bloc IndentLess all : <= bloc> Oviously, I use 3 names starting with Indent, that are not defined. They are kinds of pseudo-patterns and all rely on an external variable that holds the current indent level and is hosted as an attribute of Indent: * Indent: matches if a proper indentation is found, meaning equal to current indent level -- advances pointer * IndentMore: matches if the actual indentation found is =level+1 -- does not advance pointer (=lookahead) -- increments current level * IndentLess: matches if the actual indentation found is <level -- does not advance pointer -- decrements current level of -1 whatever the actual indent level found Last note causes that n IndentLess matches will be generated even when the indent level is brutally decremented of n steps at once -- so that in a valid text each IndentMore match gets its IndentLess counter-part. I wonder about the indent level record stack introduced in CPython's description: I need only a single variable. As a example, here is the actual code of IndentLess: class IndentLess(Pattern): def __init__(self): Pattern.__init__(self) def _check(self,text,pos): ''' check indent level decrease (of any number of step) -- but decrement current level by 1 only in any case ~ no pointer advance <==> Right() ''' # match indentation start_pos = pos (result,pos) = ZeroOrMore(TAB)._check(text,pos) # case proper indent decrement indent_level = pos - start_pos if indent_level < Indent.level: Indent.level -= 1 result = Result(self, None, text,start_pos,start_pos) return (result,start_pos) # (no pointer advance) # case no indent decrement (by step 1 or more) # (may also be end-of-text reached) if pos >= len(text): raise EndOfText(self, text, start_pos) message = "No proper indentation decrement:\nindent level %s --> %s"\ %(Indent.level, indent_level) raise NoMatch(self, text, start_pos, message) def _full_expr(self): return "INDENT--" Below a test text and the corresponding parse result shown in tree view: ======= 1 2 3 31 32 4 5 6 61 62 63 631 632 7 ======= ======= all line:1 line:2 section line:3 bloc line:31 line:32 line:4 line:5 section line:6 bloc line:61 line:62 section line:63 bloc line:631 line:632 line:7 ======= I first want to use this structure to parse config files written according to an recursive indented format. So that the result would be a (record) object holding properties which actually each are either a parameter or a nested sub-config, e.g.: config name : val name : val connexion name : val name : val transmission name : val name : val ------ la vida e estranya _______________________________________________ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor