Liam - I made some changes and timed them, I think this problem is solvable. (All timings are done using your data, on a P4 800MHz laptop.)
1. Baseline, the current state, in the parser code you sent me: bracketGroup << ( pp.Group( LBRACE + ( pp.empty ^ pp.OneOrMore(assignment) ^ pp.OneOrMore(identifier) ^ pp.OneOrMore(pp.dblQuotedString) ^ pp.OneOrMore(number) ^ pp.OneOrMore(bracketGroup) ) + RBRACE ) ) Time: 02:20.71 (mm:ss.sss) Just for general edification, '^' means Or, and it will evaluate all the alternatives and choose the longest match (in regexp docs, this is sometimes referred to as "greedy" matching); '|' means MatchFirst, and it will only evaluate alternatives until it finds a match (which I refer to as "eager" matching). In the past, I've had only slight results converting '^' to '|', but since this is a recursive expression, evaluating all of the possible alternatives can involve quite a bit of processing before selecting the longest. 2. Convert to '|', replace empty with ZeroOrMore: bracketGroup << ( pp.Group( LBRACE + pp.ZeroOrMore( assignment | identifier | pp.dblQuotedString | number | bracketGroup ) + RBRACE ) ) Time: 00:14.57 This is getting us somewhere! Replaced empty and OneOrMore's with a single ZeroOrMore, and changed from '^' to '|'. Since there is no overlap of the various alternatives *in their current order*, it is safe to use '|'. (This would not be the case if assignment came after identifier - this should be a hint on how to resolve the 'empty' problem.) One problem with this expression is that it will permit mixed bracket groups, such as { "A" 10 b=1000 {} }. 3. Go back to baseline, change '^' to '|', *and put empty at the end* bracketGroup << ( pp.Group( LBRACE + ( pp.OneOrMore(assignment) | pp.OneOrMore(identifier) | pp.OneOrMore(pp.dblQuotedString) | pp.OneOrMore(number) | pp.OneOrMore(bracketGroup) | pp.empty ) + RBRACE ) ) Time: 00:12.04 Best solution yet! This is faster than #2, since, once a match is made on the first term within a bracketGroup, all others in the group are expected to be the same type. Since '|' means "take first match", we resolve empty's "accept anything" behavior simply by putting it at the end of the list. 4. Make change in #3, also convert '^' to '|' in RHS. RHS << ( pp.dblQuotedString | identifier | number | bracketGroup ) Time: 00:01.15 Voila! I'm happy to say, this is the first time I've seen a 100X improvement, mostly by replacing '^' by '|'. While this is not *always* possible (see the CORBA IDL parser in the examples directory), it is worth the effort, especially with a recursive expression. The one item to be wary of when using '|' is when expressions mask each other. The easiest example is when trying to parse numbers, which may be integers or reals. If I write the expression as (assuming that integers will match a sequence of digits, and reals will match digits with a decimal point and some more digits): number = (integer | real) I will never match a real number! The integer expression "masks" the real, and since it occurs first, it will match first. The two solutions are: number = (integer ^ real) Or number = (real | integer) That is, use an Or, which will match the longest, or reorder the MatchFirst to put the most restrictive expression first. Welcome to pyparsing, please let me know how your project goes! -- Paul -----Original Message----- From: Liam Clarke [mailto:[EMAIL PROTECTED] Sent: Monday, July 25, 2005 8:31 AM To: Paul McGuire Subject: Re: [Tutor] Parsing problem Hi Paul, I've attached the latest version. It includes my sample data within the file. The sample data came in at 8 minutes 32 seconds without Pysco, 5 minutes 25 with, on a 650MHz Athlon. I was pondering whether breaking the test data down into separate bits via some preprocessing and feeding the simpler data structures in would help at all. Unfortunately, as I'm using pp.empty to deal with empty bracket sets (which were causing my 'expected "}" ' exceptions), using | matches to pp.empty first. I'm not sure how to get around the empty brackets without using that. I also get the feeling that pyparsing was more designed for making parsing small complex expressions easy, as opposed to my data churning. That said, I can think of about ten different projects I'd played with before giving up because of a problem that pyparsing handles elegantly. Usually it was regexes that got me. So if I have to attack this another way, at least I know the basics of a good module now. :) Much thanks for your time and energy, having read your listing on the c2 wiki (I searched for pyparsing on the offchance) I can see you're a busy man, and I'm grateful for your efforts to help me. Regards, Liam Clarke On 7/26/05, Paul McGuire <[EMAIL PROTECTED]> wrote: Liam- Please send me your latest version of the grammar, and I will post suggestions on the Tutor list. -- Paul -----Original Message----- From: Liam Clarke [mailto: [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]> ] Sent: Monday, July 25, 2005 7:38 AM To: Paul McGuire Cc: tutor@python.org Subject: Re: [Tutor] Parsing problem Hi Paul, Well various tweaks and such done, it parses perfectly, so much thanks, I think I now have a rough understanding of the basics of pyparsing. Now, onto the fun part of optimising it. At the moment, I'm looking at 2 - 5 minutes to parse a 2000 line country section, and that's with psyco. Only problem is, I have 157 country sections... I am running a 650 MHz processor, so that isn't helping either. I read this quote on http://pyparsing.sourceforge.net . "Thanks again for your help and thanks for writing pyparser! It seems my code needed to be optimized and now I am able to parse a 200mb file in 3 seconds. Now I can stick my tongue out at the Perl guys ;)" I'm jealous, 200mb in 3 seconds, my file's only 4mb. Are there any general approaches to optimisation that work well? My current thinking is to use string methods to split the string into each component section, and then parse each section to a bare minimum key, value. ie - instead of parsing x = { foo = { bar = 10 bob = 20 } type = { z = { } y = { } }} out fully, just parse to "x":"{ foo = { bar = 10 bob = 20 } type = { z = { } y = { } }}" I'm thinking that would avoid the complicated nested structure I have now, and I could parse data out of the string as needed, if needed at all. Erk, I don't know, I've never had to optimise anything. Much thanks for creating pyparsing, and doubly thank-you for your assistance in learning how to use it. Regards, Liam Clarke On 7/25/05, Liam Clarke <[EMAIL PROTECTED]> wrote: Hi Paul, My apologies, as I was jumping into my car after sending that email, it clicked in my brain. "Oh yeah... initial & body..." But good to know about how to accept valid numbers. Sorry, getting a bit too quick to fire off emails here. Regards, Liam Clarke On 7/25/05, Paul McGuire < [EMAIL PROTECTED] <mailto: [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]> > > wrote: Liam - The two arguments to Word work this way: - the first argument lists valid *initial* characters - the second argument lists valid *body* or subsequent characters For example, in the identifier definition, identifier = pp.Word(pp.alphas, pp.alphanums + "_/:.") identifiers *must* start with an alphabetic character, and then may be followed by 0 or more alphanumeric or _/: or . characters. If only one argument is supplied, then the same string of characters is used as both initial and body. Identifiers are very typical for 2 argument Word's, as they often start with alphas, but then accept digits and other punctuation. No whitespace is permitted within a Word. The Word matching will end when a non-body character is seen. Using this definition: integer = pp.Word(pp.nums+"-+.", pp.nums) It will accept "+123", "-345", "678", and ".901". But in a real number, a period may occur anywhere in the number, not just as the initial character, as in "3.14159". So your bodyCharacters must also include a ".", as in: integer = pp.Word(pp.nums+"-+.", pp.nums+".") Let me say, though, that this is a very permissive definition of integer - for one thing, we really should rename it something like "number", since it now accepts non-integers as well! But also, there is no restriction on the frequency of body characters. This definition would accept a "number" that looks like "3.4.3234.111.123.3234". If you are certain that you will only receive valid inputs, then this simple definition will be fine. But if you will have to handle and reject erroneous inputs, then you might do better with a number definition like: number = Combine( Word( "+-"+nums, nums ) + Optional( point + Optional( Word( nums ) ) ) ) This will handle "+123", "-345", "678", and "0.901", but not ".901". If you want to accept numbers that begin with "."s, then you'll need to tweak this a bit further. One last thing: you may want to start using setName() on some of your expressions, as in: number = Combine( Word( "+-"+nums, nums ) + Optional( point + Optional( Word( nums ) ) ) ).setName("number") Note, this is *not* the same as setResultsName. Here setName is attaching a name to this pattern, so that when it appears in an exception, the name will be used instead of an encoded pattern string (such as W:012345...). No need to do this for Literals, the literal string is used when it appears in an exception. -- Paul -- 'There is only one basic human right, and that is to do as you damn well please. And with it comes the only basic human duty, to take the consequences.' -- 'There is only one basic human right, and that is to do as you damn well please. And with it comes the only basic human duty, to take the consequences.' -- 'There is only one basic human right, and that is to do as you damn well please. And with it comes the only basic human duty, to take the consequences.' _______________________________________________ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor