Re: [Tutor] Parsing problem

Paul McGuire Mon, 25 Jul 2005 14:10:10 -0700

Liam -

I made some changes and timed them, I think this problem is solvable.  (All
timings are done using your data, on a P4 800MHz laptop.)


1. Baseline, the current state, in the parser code you sent me:

bracketGroup << ( pp.Group( LBRACE + ( pp.empty ^ pp.OneOrMore(assignment) ^
pp.OneOrMore(identifier) ^ pp.OneOrMore(pp.dblQuotedString) ^
pp.OneOrMore(number) ^ pp.OneOrMore(bracketGroup) ) + RBRACE ) )

Time: 02:20.71 (mm:ss.sss)

Just for general edification, '^' means Or, and it will evaluate all the
alternatives and choose the longest match (in regexp docs, this is sometimes
referred to as "greedy" matching); '|' means MatchFirst, and it will only
evaluate alternatives until it finds a match (which I refer to as "eager"
matching).  In the past, I've had only slight results converting '^' to '|',
but since this is a recursive expression, evaluating all of the possible
alternatives can involve quite a bit of processing before selecting the
longest.

2. Convert to '|', replace empty with ZeroOrMore:

bracketGroup << ( pp.Group( LBRACE + pp.ZeroOrMore( assignment | identifier
| pp.dblQuotedString | number | bracketGroup ) + RBRACE ) )

Time: 00:14.57

This is getting us somewhere!  Replaced empty and OneOrMore's with a single
ZeroOrMore, and changed from '^' to '|'.  Since there is no overlap of the
various alternatives *in their current order*, it is safe to use '|'. (This
would not be the case if assignment came after identifier - this should be a
hint on how to resolve the 'empty' problem.)  One problem with this
expression is that it will permit mixed bracket groups, such as { "A" 10
b=1000 {} }.

3. Go back to baseline, change '^' to '|', *and put empty at the end*

bracketGroup << ( pp.Group( LBRACE + ( pp.OneOrMore(assignment) |
pp.OneOrMore(identifier) | pp.OneOrMore(pp.dblQuotedString) |
pp.OneOrMore(number) | pp.OneOrMore(bracketGroup) | pp.empty ) + RBRACE ) )

Time: 00:12.04

Best solution yet!  This is faster than #2, since, once a match is made on
the first term within  a bracketGroup, all others in the group are expected
to be the same type.  Since '|' means "take first match", we resolve empty's
"accept anything" behavior simply by putting it at the end of the list.

4. Make change in #3, also convert '^' to '|' in RHS.

RHS << ( pp.dblQuotedString | identifier | number | bracketGroup )

Time: 00:01.15

Voila!  I'm happy to say, this is the first time I've seen a 100X
improvement, mostly by replacing '^' by '|'.  While this is not *always*
possible (see the CORBA IDL parser in the examples directory), it is worth
the effort, especially with a recursive expression.

The one item to be wary of when using '|' is when expressions mask each
other.  The easiest example is when trying to parse numbers, which may be
integers or reals.  If I write the expression as (assuming that integers
will match a sequence of digits, and reals will match digits with a decimal
point and some more digits):

number = (integer | real)

I will never match a real number! The integer expression "masks" the real,
and since it occurs first, it will match first.  The two solutions are:

number = (integer ^ real)
Or
number = (real | integer)

That is, use an Or, which will match the longest, or reorder the MatchFirst
to put the most restrictive expression first.

Welcome to pyparsing, please let me know how your project goes!

-- Paul


-----Original Message-----
From: Liam Clarke [mailto:[EMAIL PROTECTED] 
Sent: Monday, July 25, 2005 8:31 AM
To: Paul McGuire
Subject: Re: [Tutor] Parsing problem

Hi Paul, 

I've attached the latest version. It includes my sample data within the
file. The sample data came in at 8 minutes 32 seconds without Pysco, 5
minutes 25 with, on a  650MHz Athlon.

I was pondering whether breaking the test data down into separate bits via
some preprocessing and feeding the simpler data structures in would help at
all.

Unfortunately, as I'm using pp.empty to deal with empty bracket sets (which
were causing my 'expected "}" ' exceptions), using | matches to pp.empty
first. 

I'm not sure how to get around the empty brackets without using that.

I also get the feeling that pyparsing was more designed for making parsing
small complex expressions easy, as opposed to my data churning. That said, I
can think of about ten different projects I'd played with before giving up
because of a problem that pyparsing handles elegantly.
Usually it was regexes that got me. So if I have to attack this another way,
at least I know the basics of a good module now. :)

Much thanks for your time and energy, having read your listing on the c2
wiki (I searched for pyparsing on the offchance) I can see you're a busy
man, and I'm grateful for your efforts to help me.

Regards, 

Liam Clarke


On 7/26/05, Paul McGuire <[EMAIL PROTECTED]> wrote:

        Liam-
        
        Please send me your latest version of the grammar, and I will post
        suggestions on the Tutor list.
        
        -- Paul
        
        -----Original Message-----
        From: Liam Clarke [mailto: [EMAIL PROTECTED]
<mailto:[EMAIL PROTECTED]> ]
        Sent: Monday, July 25, 2005 7:38 AM
        To: Paul McGuire
        Cc: [email protected]
        Subject: Re: [Tutor] Parsing problem
        
        Hi Paul,
        
        Well various tweaks and such done, it parses perfectly, so much
thanks, I 
        think I now have a rough understanding of the basics of pyparsing.
        
        Now, onto the fun part of optimising it. At the moment, I'm looking
at 2 - 5
        minutes to parse a 2000 line country section, and that's with psyco.
Only 
        problem is, I have 157 country sections...
        
        I am running a 650 MHz processor, so that isn't helping either. I
read this
        quote on http://pyparsing.sourceforge.net .
        
        "Thanks again for your help and thanks for writing pyparser! It
seems my
        code needed to be optimized and now I am able to parse a 200mb file
in 3
        seconds. Now I can stick my tongue out at the Perl guys ;)" 
        
        I'm jealous, 200mb in 3 seconds, my file's only 4mb.
        
        Are there any general approaches to optimisation that work well?
        
        My current thinking is to use string methods to split the string
into each
        component section, and then parse each section to a bare minimum
key, value. 
        ie - instead of parsing
        
        x = { foo = { bar = 10 bob = 20 } type = { z = { } y = { } }}
        
        out fully, just parse to "x":"{ foo = { bar = 10 bob = 20 } type = {
z = { }
        y = { } }}"
        
        I'm thinking that would avoid the complicated nested structure I
have now,
        and I could parse data out of the string as needed, if needed at
all.
        
        Erk, I don't know, I've never had to optimise anything.
        
        Much thanks for creating pyparsing, and doubly thank-you for your
assistance 
        in learning how to use it.
        
        Regards,
        
        Liam Clarke
        
        On 7/25/05, Liam Clarke <[EMAIL PROTECTED]> wrote:
        
                Hi Paul,
        
                My apologies, as I was jumping into my car after sending
that email, 
        it clicked in my brain.
                "Oh yeah... initial & body..."
        
                But good to know about how to accept valid numbers.
        
                Sorry, getting a bit too quick to fire off emails here. 
        
                Regards,
        
                Liam Clarke
        
        
                On 7/25/05, Paul McGuire < [EMAIL PROTECTED]
        <mailto: [EMAIL PROTECTED]
<mailto:[EMAIL PROTECTED]> > > wrote:
        
        
                        Liam -
        
                        The two arguments to Word work this way:
                        - the first argument lists valid *initial*
characters
                        - the second argument lists valid *body* or
subsequent
        characters
        
                        For example, in the identifier definition,
        
                        identifier = pp.Word(pp.alphas, pp.alphanums +
"_/:.")
        
                        identifiers *must* start with an alphabetic
character, and
        then may be
                        followed by 0 or more alphanumeric or _/: or .
characters.
        If only one
                        argument is supplied, then the same string of
characters is
        used as both
                        initial and body.  Identifiers are very typical for
2
        argument Word's, as
                        they often start with alphas, but then accept digits
and
        other punctuation.
                        No whitespace is permitted within a Word.  The Word
matching
        will end when a
                        non-body character is seen.
        
                        Using this definition:
        
                        integer = pp.Word(pp.nums+"-+.", pp.nums)
        
                        It will accept "+123", "-345", "678", and ".901".
But in a
        real number, a
                        period may occur anywhere in the number, not just as
the
        initial character,
                        as in "3.14159".  So your bodyCharacters must also
include a
        ".", as in:
        
                        integer = pp.Word(pp.nums+"-+.", pp.nums+".")
        
                        Let me say, though, that this is a very permissive
        definition of integer -
                        for one thing, we really should rename it something
like
        "number", since it
                        now accepts non-integers as well!  But also, there
is no
        restriction on the
                        frequency of body characters.  This definition would
accept
        a "number" that
                        looks like "3.4.3234.111.123.3234".  If you are
certain that
        you will only
                        receive valid inputs, then this simple definition
will be
        fine.  But if you
                        will have to handle and reject erroneous inputs,
then you
        might do better
                        with a number definition like:
        
                        number = Combine( Word( "+-"+nums, nums ) +
                                          Optional( point + Optional( Word(
nums ) )
        ) )
        
                        This will handle "+123", "-345", "678", and "0.901",
but not
        ".901".  If you
                        want to accept numbers that begin with "."s, then
you'll
        need to tweak this
                        a bit further.
        
                        One last thing: you may want to start using
setName() on
        some of your
                        expressions, as in:
        
                        number = Combine( Word( "+-"+nums, nums ) +
                                          Optional( point + Optional( Word(
nums ) )
        )
                        ).setName("number")
        
                        Note, this is *not* the same as setResultsName.
Here
        setName is attaching a
                        name to this pattern, so that when it appears in an
        exception, the name will
                        be used instead of an encoded pattern string (such
as
        W:012345...).  No need
                        to do this for Literals, the literal string is used
when it
        appears in an
                        exception.
        
                        -- Paul
        
        
        
        
        
        
        
                --
        
                'There is only one basic human right, and that is to do as
you damn 
        well please.
                And with it comes the only basic human duty, to take the
        consequences.'
        
        
        
        
        --
        'There is only one basic human right, and that is to do as you damn
well
        please.
        And with it comes the only basic human duty, to take the
consequences.' 
        
        




--
'There is only one basic human right, and that is to do as you damn well
please.
And with it comes the only basic human duty, to take the consequences.' 

_______________________________________________
Tutor maillist  -  [email protected]
http://mail.python.org/mailman/listinfo/tutor

Re: [Tutor] Parsing problem

Reply via email to