Danny Yoo wrote:
| On Wed, 23 Nov 2005, Chris or Leslie Smith wrote:
|
|| I agree that handling this with Python is pretty straightforward, but
|| I'm wondering if there exists some sort of mechanism for reading
|| these types of well structured (though not XML format, etc...) files.
|
| Hi Chris,
|
| Yes, take a look at "parser" tools like pyparsing, mxTextTools, and
| Martel:
|
| http://pyparsing.sourceforge.net/
|
| http://www.egenix.com/files/python/mxTextTools.html
|
| http://www.dalkescientific.com/Martel
|

Great links, Danny. Thanks. I had seen mxTextTools before but didn't search for the right thing before raising the question. pyparsing seems very interesting.

The code that I attach below is a very light-weight version of a formatted reader. It assumes that you just want to pluck white-space delimited values out of lines in a text file (something I've had to do from time to time and something others have asked about on tutor before). Perhaps this is the sort of simple approach that evolves into one of the tools above as more complex parsing rules are needed.

Again, thanks for the pointers.

----

OK, here's a first draft of a simple formatted reader that can be used to read and keep certain white-space delimited strings from lines in an input stream/file. The basic idea is to write the template using a representative chunk of the text file (so the codes that you write can be seen directly next to the data that you are going to read), or else you can keep the two separate.

At the start of a line that you want processed, you put a command in angle brackets: the number of items that (should) appear on the line when separated by white space, then the letter x, and then a comma-delimited list of the positions (counting from 0) of the items that you want to keep. Here's a working example using the data submitted in this thread:

######
#a template can be done like this (w/ no visual reference to actual lines)...but don't forget to put the \
#after the triple quotes or else an extra line will be processed and don't put an extra return before the
#last triple quote. The example below indicates that 4 lines will be processed.
templ1 = '''\
_
_
<5x2,3>
_'''

# or like this, where a sample line is shown
templ1 = '''1 Polonijna Liga Mistrzow
26 wrzesnia 2005
<5x2,3> 6 12 6 4 1
0 1 0'''

# here is another template that will be used to parse the lines
templ2='''<3x0,2>Bohossian - Kolinski
1
<4x2> 1.000 9 13 19
<4x2> 2.000 2 4 16
<4x2> 1.000 10 8 17
<4x2> 0.000 8 6 17'''

# -------------------here is the data---------------------------------
data = '''1 Polonijna Liga Mistrzow
26 wrzesnia 2005
6 12 6 4 1
0 1 0
Bohossian - Kolinski
1
1.000 9 13 19
2.000 2 4 16
1.000 10 8 17
0.000 8 6 17
Szadkowska - Szczurek
2
0.000 11 16 20
3.000 1 -4 14
3.500 3 -7 13
2.500 10 13 19
and then here is single line
'''.split('\n')
lines = iter(data) # to get data from a string that has been split into lines
#----------------------------------------------------------------------

def partition(s, t): # from python-dev list, I believe
    if not isinstance(t, str) or not t:
        raise ValueError('partition argument must be a non-empty string')
    parts = s.split(t, 1)
    if len(parts) == 1:
        result = (s, '', '')
    else:
        result = (parts[0], t, parts[1])
    return result

def temp_read(templ, lines):
    '''
    Use a template to extract strings from the given lines.

    Lines in the template that start with "<" are assumed to contain
    a parsing command that is in the format, <NxL>, where
        N = number of white space separated items expected on the line
        x is the letter x
        L = a list of comma separated integers indicating which items
            to keep from the line (counting from 0)

    e.g. <4x2,3> appearing at the start of a line in the template means
    that the corresponding line of data should have 4 items on it, and
    items 2 and 3 should be returned

    If one or more lines of the data do not jibe with the parsing
    instructions, a value of None will be returned. This may indicate the
    end of the data that can be interpreted with the template you gave.
    '''
    rv = [] #all return values for the template will go here
    try:
        for ti in templ.splitlines(): #get a template line
            li = next(lines) #and a physical line of data
            if ti.startswith('<'): #check to see if there is a parse command on the line
                # get the command
                cmd = ti[1:].split('>')[0]
                things,_,keep = partition(cmd, 'x')
                things = int(things)
                keep = [int(x.strip()) for x in keep.split(',')]
                #split the physical line
                data = li.split()
                #check that the # of items matches the template specs
                if len(data) != things:
                    return None
                #add the items to the return list
                for k in keep:
                    rv.append(data[k])
            else:
                pass #don't parse for data
        return rv
    except (StopIteration, ValueError, IndexError):
        #ran out of data, or the command or the line could not be interpreted
        return None

print(temp_read(templ1,lines))
while True:
    vals = temp_read(templ2,lines)
    if vals is None:
        break
    print(vals)
######

/c
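
P.S. To make the <NxL> command format described above a bit more concrete, here is a minimal sketch of how one command could be decoded and applied to one line of data. It is written for Python 3 and uses a regular expression instead of the partition() helper; the names parse_command and apply_command are made up for this illustration and are not part of the reader above.

```python
import re

# Hypothetical helper names, used only for this illustration.
# Matches a command such as <5x2,3> at the start of a template line.
_CMD = re.compile(r'<(\d+)x(\d+(?:\s*,\s*\d+)*)>')

def parse_command(template_line):
    """Return (count, keep) for a line that starts with <NxL>, else None."""
    m = _CMD.match(template_line)
    if m is None:
        return None
    count = int(m.group(1))                          # N: items expected
    keep = [int(k) for k in m.group(2).split(',')]   # L: positions to keep
    return count, keep

def apply_command(template_line, data_line):
    """Pluck the requested items from data_line.

    Returns [] when the template line carries no command, and None when
    the data line does not match the command.
    """
    parsed = parse_command(template_line)
    if parsed is None:
        return []
    count, keep = parsed
    items = data_line.split()
    if len(items) != count or any(k >= count for k in keep):
        return None
    return [items[k] for k in keep]

print(apply_command('<5x2,3> 6 12 6 4 1', '6 12 6 4 1'))
print(apply_command('<3x0,2>Bohossian - Kolinski', 'Szadkowska - Szczurek'))
```

The first call keeps items 2 and 3 of a five-item line; the second shows that the sample text after the command in the template is only a visual aid, since the items are always taken from the data line.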