Danny Yoo wrote:
| On Wed, 23 Nov 2005, Chris or Leslie Smith wrote:
|
|| I agree that handling this with Python is pretty straightforward, but
|| I'm wondering if there exists some sort of mechanism for reading
|| these types of well structured (though not XML format, etc...) files.
|
| Hi Chris,
|
| Yes, take a look at "parser" tools like pyparsing, mxTextTools, and
| Martel:
|
| http://pyparsing.sourceforge.net/
|
| http://www.egenix.com/files/python/mxTextTools.html
|
| http://www.dalkescientific.com/Martel
|
Great links, Danny. Thanks. I had seen mxTextTools before but didn't search
for the right thing before raising the question. pyparsing looks very
interesting. The code that I attach below is a very light-weight version of a
formatted reader. It assumes that you just want to pluck white-space delimited
values out of lines in a text file (something I've had to do from time to time
and something others have asked about on tutor before). Perhaps this is the
sort of simple approach that evolves into one of the tools above as more
complex parsing rules are needed.
Again, thanks for the pointers.
----
OK, here's a first draft of a simple formatted reader that can be used to read
and keep certain white-space delimited strings from lines in an input
stream/file. The basic idea is to write the template on top of a representative
chunk of the text file, so the codes that you write can be seen directly next
to the data that you are going to read; you can also keep the two separate. At
the start of each line that you want processed, you put in angle brackets the
number of items that should appear on the line when it is split on white space,
then the letter x, then a comma-delimited list of the (zero-based) positions of
the items that you want to keep. Here's a working example using the data
submitted in this thread:
######
# a template can be done like this (w/ no visual reference to actual lines)...
# but don't forget to put the \ after the triple quotes or else an extra line
# will be processed, and don't put an extra return before the last triple quote.
# The example below indicates that 4 lines will be processed.
templ1 = '''\
_
_
<5x2,3>
_'''
# or like this, where a sample line is shown
templ1 = '''1 Polonijna Liga Mistrzow
26 wrzesnia 2005
<5x2,3> 6 12 6 4 1
0 1 0'''
# here is another template that will be used to parse the lines
templ2='''<3x0,2>Bohossian - Kolinski
1
<4x2> 1.000 9 13 19
<4x2> 2.000 2 4 16
<4x2> 1.000 10 8 17
<4x2> 0.000 8 6 17'''
# -------------------here is the data---------------------------------
data = '''1 Polonijna Liga Mistrzow
26 wrzesnia 2005
6 12 6 4 1
0 1 0
Bohossian - Kolinski
1
1.000 9 13 19
2.000 2 4 16
1.000 10 8 17
0.000 8 6 17
Szadkowska - Szczurek
2
0.000 11 16 20
3.000 1 -4 14
3.500 3 -7 13
2.500 10 13 19
and then here is single line
'''.split('\n')
lines = iter(data) # to get data from a string that has been split into lines
#----------------------------------------------------------------------
def partition(s, t):
    # from python-dev list, I believe
    if not isinstance(t, basestring) or not t:
        raise ValueError('partition argument must be a non-empty string')
    parts = s.split(t, 1)
    if len(parts) == 1:
        result = (s, '', '')
    else:
        result = (parts[0], t, parts[1])
    return result
def temp_read(templ, lines):
    '''
    Use a template to extract strings from the given lines. Lines in the
    template that start with "<" are assumed to contain a parsing command in
    the format <NxL>, where
        N = number of white space separated items expected on the line
        x is the letter x
        L = a list of comma separated integers indicating which items
            (counting from 0) to keep from the line
    e.g. <4x2,3> appearing at the start of a line in the template means that
    the corresponding line of data should have 4 items on it, and items 2 and
    3 (the last two) should be returned.
    If one or more lines of the data do not match the parsing instructions,
    or the data runs out, a value of None will be returned. This may indicate
    the end of the data that can be interpreted with the template you gave.
    '''
    rv = [] #all return values for the template will go here
    try:
        for ti in templ.splitlines(): #get a template line
            li = lines.next() #and a physical line of data
            if ti.startswith('<'): #check for a parse command on the line
                # get the command
                cmd = ti[1:].split('>')[0]
                things,_,keep = partition(cmd, 'x')
                things = int(things)
                keep = [int(x.strip()) for x in keep.split(',')]
                #split the physical line
                data = li.split()
                #check that the # of items matches the template specs
                if len(data) != things:
                    return None
                #add the items to the return list
                for k in keep:
                    rv.append(data[k])
            else:
                pass #don't parse for data
        return rv
    except (StopIteration, ValueError, IndexError):
        # data ran out, or the template had a malformed command or bad index
        return None
print temp_read(templ1,lines)
while True:
    vals = temp_read(templ2,lines)
    if vals is None: break
    print vals
######
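For anyone reading this on Python 3 (where basestring, lines.next() and the
print statement are gone), here is a rough sketch of the same idea. It is not
the code above, just one way it might be restated: the names COMMAND,
parse_command and read_with_template are made up for illustration, and a
regular expression is swapped in for the partition() helper.

```python
import re

# Hypothetical names for illustration; none of these appear in the script above.
# Matches a command such as <4x2> or <5x2,3> at the start of a template line.
COMMAND = re.compile(r'<(\d+)x(\d+(?:\s*,\s*\d+)*)>')

def parse_command(template_line):
    """Return (expected_count, keep_indexes), or None if there is no command."""
    match = COMMAND.match(template_line)
    if match is None:
        return None
    count = int(match.group(1))
    keep = [int(i) for i in match.group(2).split(',')]
    return count, keep

def read_with_template(template, lines):
    """Consume one data line per template line.

    Returns the kept items as a list of strings, or None if the data runs
    out or a line does not match its command.
    """
    kept = []
    for template_line in template.splitlines():
        try:
            data_line = next(lines)
        except StopIteration:
            return None  # ran out of data
        command = parse_command(template_line)
        if command is None:
            continue  # no command: the data line is consumed but ignored
        count, keep = command
        items = data_line.split()
        if len(items) != count or any(k >= count for k in keep):
            return None
        kept.extend(items[k] for k in keep)
    return kept

template = '<3x0,2>\n_\n<4x2>'
lines = iter(['Bohossian - Kolinski', '1', '1.000 9 13 19', 'trailing text'])
print(read_with_template(template, lines))  # ['Bohossian', 'Kolinski', '13']
print(read_with_template(template, lines))  # None
```

As in the original, the indexes after the x count from 0, and a None result is
the signal to stop looping over the template.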
/c