Danny Yoo wrote:
| On Wed, 23 Nov 2005, Chris or Leslie Smith wrote:
| 
|| I agree that handling this with Python is pretty straightforward, but
|| I'm wondering if there exists some sort of mechanism for reading
|| these types of well structured (though not XML format, etc...) files.
| 
| Hi Chris,
| 
| Yes, take a look at "parser" tools like pyparsing, mxTextTools, and
| Martel:
| 
|    http://pyparsing.sourceforge.net/
| 
|    http://www.egenix.com/files/python/mxTextTools.html
| 
|    http://www.dalkescientific.com/Martel
| 

Great links, Danny.  Thanks.  I had seen mxTextTools before but didn't search 
for the right thing before raising the question.  pyparsing seems very 
interesting.  The code that I attach below is a very light-weight version of a 
formatted reader. It assumes that you just want to pluck white-space delimited 
values out of lines in a text file (something I've had to do from time to time 
and something others have asked about on tutor before). Perhaps this is the 
sort of simple approach that evolves into one of the tools above as more 
complex parsing rules are needed.

Again, thanks for the pointers.

----

OK, here's a first draft of a simple formatted reader that can be used to read 
and keep certain white-space delimited strings from lines in an input 
stream/file.  The basic idea is to write the template using a representative 
chunk of the text file (so the codes that you write can be seen directly next 
to the data that you are going to read), or else you can keep the two separate. 
At the start of a line that you want processed, you put in angle brackets the 
number of items that should appear on the line when it is split on white space, 
then the letter x, and then a comma-delimited list of the zero-based positions 
of the items that you want to keep. Here's a working example using the data 
submitted in this thread:

######
# A template can be written like this (with no visual reference to the actual
# lines), but don't forget to put the backslash after the opening triple quotes
# or else an extra line will be processed, and don't put an extra return before
# the closing triple quote. The example below indicates that 4 lines will be
# processed.

templ1 = '''\
_
_
<5x2,3>
_'''

# or like this, where a sample line is shown

templ1 = '''1 Polonijna Liga Mistrzow
26 wrzesnia 2005
<5x2,3> 6 12 6 4 1
0 1 0'''

# here is another template that will be used to parse the lines

templ2='''<3x0,2>Bohossian - Kolinski
1 
<4x2>      1.000 9 13 19
<4x2>      2.000 2 4 16
<4x2>      1.000 10 8 17
<4x2>      0.000 8 6 17'''

# -------------------here is the data---------------------------------
data = '''1 Polonijna Liga Mistrzow
26 wrzesnia 2005
 6 12 6 4 1
 0 1 0
Bohossian - Kolinski
1 
      1.000 9 13 19
      2.000 2 4 16
      1.000 10 8 17
      0.000 8 6 17
Szadkowska - Szczurek
2
      0.000 11 16 20
      3.000 1 -4 14
      3.500 3 -7 13
      2.500 10 13 19
and then here is single line
'''.split('\n')

lines = iter(data) # to get data from a string that has been split into lines
#----------------------------------------------------------------------

def partition(s, t):
    # from python-dev list, I believe
    if not isinstance(t, basestring) or not t:
        raise ValueError('partition argument must be a non-empty string')
    parts = s.split(t, 1)
    if len(parts) == 1:
        result = (s, '', '')
    else:
        result = (parts[0], t, parts[1])
    return result

def temp_read(templ, lines):
    '''
    Use a template to extract strings from the given lines. Lines in the
    template that start with "<" are assumed to contain a parsing command
    in the format <NxL>, where

        N = number of white-space separated items expected on the line
        x is the letter x
        L = a list of comma-separated integers giving the (zero-based)
            positions of the items to keep from the line

    e.g. <4x2,3> appearing at the start of a line in the template means
    that the corresponding line of data should have 4 items on it, and
    the items at positions 2 and 3 (the third and fourth) should be returned.

    If one or more lines of the data do not jibe with the parsing
    instructions, None is returned. This may indicate the end of the data
    that can be interpreted with the template you gave.
    '''

    rv = [] #all return values for the template will go here
    try: 
        for ti in templ.splitlines(): #get a template line
            li = lines.next()            #and a physical line of data
            if ti.startswith('<'):        #check for a parse command on the line
                # get the command
                cmd = ti[1:].split('>')[0]
                things,_,keep = partition(cmd, 'x')
                things = int(things)
                keep = [int(x.strip()) for x in keep.split(',')]
                #split the physical line
                data = li.split()
                #check that the # of items matches the template specs
                assert len(data)==things
                #add the items to the return list
                for k in keep:
                    rv.append(data[k])
            else:
                pass #don't parse for data
        return rv
    except (StopIteration, AssertionError, ValueError, IndexError): #data ran out or did not match
        return None

print temp_read(templ1,lines)
while True:
    vals = temp_read(templ2,lines)
    if vals is None: break
    print vals
######
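
For anyone who wants to try the same idea on a newer interpreter, here is a rough sketch of the <NxL> scheme that does not rely on assert or a catch-all except. It is only an illustration, not a replacement for the code above: the names COMMAND, parse_command and read_record are made up for this sketch, and using a regular expression to decode the command is my assumption rather than part of the original reader. One difference in behaviour: a template line whose command is malformed is treated here as a line to skip, whereas the reader above gives up and returns None.

```python
import re

# Hypothetical pattern for the <NxL> command described in the post:
# N items expected, the letter x, then comma-separated zero-based positions.
COMMAND = re.compile(r'<(\d+)x(\d+(?:\s*,\s*\d+)*)>')

def parse_command(template_line):
    """Return (expected_count, keep_positions), or None if there is no command."""
    match = COMMAND.match(template_line)
    if match is None:
        return None
    count = int(match.group(1))
    keep = [int(pos) for pos in match.group(2).split(',')]
    return count, keep

def read_record(template, lines):
    """Consume one data line per template line.

    Returns the kept items as a list of strings, or None when the data
    runs out or a line does not match its command.
    """
    kept = []
    for template_line in template.splitlines():
        try:
            data_line = next(lines)
        except StopIteration:
            return None  # no more data to read
        command = parse_command(template_line)
        if command is None:
            continue  # no command: the data line is consumed but ignored
        count, keep = command
        items = data_line.split()
        if len(items) != count or any(pos >= count for pos in keep):
            return None  # line does not fit the template
        kept.extend(items[pos] for pos in keep)
    return kept

# Usage with a few lines taken from the data in this thread.
sample = iter(['Bohossian - Kolinski', '1', '      1.000 9 13 19'])
template = '<3x0,2>Bohossian - Kolinski\n1\n<4x2>      1.000 9 13 19'
print(read_record(template, sample))
```

As with the reader above, a mismatch part-way through a template leaves the already-consumed lines consumed, so the None return is best treated as "stop here" rather than "retry".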

/c
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
