Pyparsing has a built-in helper called nestedExpr that fits this kind of data neatly. Here is the whole script:

    from pyparsing import nestedExpr

    # st_data holds the bracketed treebank string to parse
    syntax_tree = nestedExpr()
    results = syntax_tree.parseString(st_data)

    from pprint import pprint
    pprint(results.asList())

Prints:

    [[['S',
       ['NP-SBJ-1',
        ['NP', ['NNP', 'Rudolph'], ['NNP', 'Agnew']],
        [',', ','],
        ['UCP',
         ['ADJP', ['NP', ['CD', '55'], ['NNS', 'years']], ['JJ', 'old']],
         ['CC', 'and'],
         ['NP',
          ['NP', ['JJ', 'former'], ['NN', 'chairman']],
          ['PP',
           ['IN', 'of'],
           ['NP',
            ['NNP', 'Consolidated'],
            ['NNP', 'Gold'],
            ['NNP', 'Fields'],
            ['NNP', 'PLC']]]]],
        [',', ',']],
       ['VP',
        ['VBD', 'was'],
        ['VP',
         ['VBN', 'named'],
         ['S',
          ['NP-SBJ', ['-NONE-', '*-1']],
          ['NP-PRD',
           ['NP', ['DT', 'a'], ['JJ', 'nonexecutive'], ['NN', 'director']],
           ['PP',
            ['IN', 'of'],
            ['NP',
             ['DT', 'this'],
             ['JJ', 'British'],
             ['JJ', 'industrial'],
             ['NN', 'conglomerate']]]]]]],
       ['.', '.']]]]

If you want to delve deeper into this, you can, since the content of the () groups is so regular. In essence you reconstruct nestedExpr in your own code, but you gain more control over, and visibility into, the parsed content.

Since this is a recursive syntax, you will need to use pyparsing's mechanism for recursion, which is the Forward class. Forward is sort of an "I can't define the whole thing yet, just create a placeholder" placeholder.

    syntax_element = Forward()
    LPAR,RPAR = map(Suppress,"()")
    syntax_tree = LPAR + syntax_element + RPAR

Now in your example, a syntax_element can be one of 4 things:
- a punctuation mark, twice
- a syntax marker followed by one or more syntax_trees
- a syntax marker followed by a word
- a syntax tree

Here is how I define those:

    marker = oneOf("VBD ADJP VBN JJ DT PP NN UCP NP-PRD "
                   "NP NNS NNP CC NP-SBJ-1 CD VP -NONE- "
                   "IN NP-SBJ S")
    punc = oneOf(", . ! ?")
    wordchars = printables.replace("(","").replace(")","")
    syntax_element << ( punc + punc |
                        marker + OneOrMore(Group(syntax_tree)) |
                        marker + Word(wordchars) |
                        syntax_tree )

Note that we use the '<<' operator to "inject" the definition into syntax_element. We can't use '=', because that would bind the name to a new expression, different from the placeholder we already used to define syntax_tree.

Now parse the string, and voila! Same as before.

Here is the entire script:

    from pyparsing import (Suppress, oneOf, Forward, OneOrMore, Word,
                           printables, Group)

    # Placeholder for the recursive part of the grammar
    syntax_element = Forward()
    LPAR,RPAR = map(Suppress,"()")
    syntax_tree = LPAR + syntax_element + RPAR

    marker = oneOf("VBD ADJP VBN JJ DT PP NN UCP NP-PRD "
                   "NP NNS NNP CC NP-SBJ-1 CD VP -NONE- "
                   "IN NP-SBJ S")
    punc = oneOf(", . ! ?")
    # Any printable character except the parentheses themselves
    wordchars = printables.replace("(","").replace(")","")

    # Fill in the placeholder now that syntax_tree exists
    syntax_element << ( punc + punc |
                        marker + OneOrMore(Group(syntax_tree)) |
                        marker + Word(wordchars) |
                        syntax_tree )

    # st_data holds the bracketed treebank string to parse
    results = syntax_tree.parseString(st_data)

    from pprint import pprint
    pprint(results.asList())

-- Paul
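
P.S. Whichever version you use, asList() hands back plain nested Python lists, so ordinary recursion is enough to pull things out of them. Here is a minimal sketch that collects the (tag, word) pairs from a fragment of the printed output above. Note that leaves() is only an illustrative helper name, not something pyparsing provides, and it assumes every leaf is a list of exactly two strings, as in that output:

```python
# Hypothetical helper (not part of pyparsing): walk the nested lists
# returned by results.asList() and collect the (tag, word) leaves.
def leaves(node):
    """Yield (tag, word) for each two-string leaf such as ['NNP', 'Rudolph']."""
    if len(node) == 2 and all(isinstance(item, str) for item in node):
        yield tuple(node)
        return
    for child in node:
        # Strings at this level are markers like 'NP'; only lists can hold leaves.
        if isinstance(child, list):
            yield from leaves(child)

# A fragment copied from the printed parse results.
fragment = ['NP-SBJ-1',
            ['NP', ['NNP', 'Rudolph'], ['NNP', 'Agnew']],
            [',', ',']]

pairs = list(leaves(fragment))
print(pairs)
```

The same function should work on the full results.asList() value, since the extra outer lists are just walked through like any other non-leaf node.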