On Wed, Apr 22, 2009 at 9:41 PM, William Witteman <y...@nerd.cx> wrote:
> On Wed, Apr 22, 2009 at 11:23:11PM +0200, Eike Welk wrote:
>
>>How do you decide that a word is a keyword (AU, AB, UN) and not a part
>>of the text? There could be a file like this:
>>
>><567>
>>AU - Bibliographical Theory and Practice - Volume 1 - The AU - Tag
>>and its applications
>>AB - Texts in Library Science
>><568>
>>AU - Bibliographical Theory and Practice - Volume 2 - The
>>AB - Tag and its applications
>>AB - Texts in Library Science
>><569>
>>AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU -
>>AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU
>>AB - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU -
>>AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU
>>ZZ - Somewhat nonsensical case
>
> This is a good case, and luckily the files are validated on the other
> end to prevent this kind of collision.

>>To me it seems that a parsing library is unnecessary. Just look at the
>>first few characters of each line and decide if it's the start of a
>>record, a tag or normal text. You might need some additional
>>algorithm for corner cases.

I agree with this. The structure is simple and the lines are easily
recognized. Here is one way to do it:

data = '''<567>
AU - Bibliographical Theory and Practice - Volume 1 - The AU - Tag
and its applications
AB - Texts in Library Science
<568>
AU - Bibliographical Theory and Practice - Volume 2 - The
AB - Tag and its applications
AB - Texts in Library Science
<569>
AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU -
AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU
AB - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU -
AU - AU - AU - AU - AU - AU - AU - AU - AU - AU - AU
ZZ - Somewhat nonsensical case
'''.splitlines()

import pprint, re
from collections import defaultdict

def parse(data):
    ''' Yields dictionaries corresponding to bibliographic entries'''
    result = None
    key = None

    for line in data:
        if not line.strip():
            continue # skip blank lines

        if re.search(r'^<\d+>', line):
            # start of a new entry
            if result:
                # return the previous entry and initialize
                yield result
            result = defaultdict(list)
            key = None
        else:
            m = re.search(r'^([A-Z]{2}) +- +(.*)', line)
            if m:
                # New field
                key, value = m.group(1, 2)
                result[key].append(value)
            else:
                # Extension of previous field
                if result and key: # sanity check
                    result[key][-1] += '\n' + line

    if result:
        yield result

for entry in parse(data):
    for key, value in entry.iteritems():
        print key
        pprint.pprint(value)
    print

Note that dicts do not preserve order so the fields are not output in
the same order as they appear in the file.

> If this was the only type of file I'd need to parse, I'd agree with you,
> but this is one of at least 4 formats I'll need to process, and so a
> robust methodology will serve me better than a regex-based one-off.

Unless there is some commonality between the formats, each parser is
going to be a one-off no matter how you implement it.

Kent
_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
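
A follow-up on the note that dicts do not preserve order: if the order
of the fields matters, one option is to collect each entry as a list of
[tag, text] pairs instead of a dict. What follows is only a minimal
sketch, assuming Python 3 (the listing above is Python 2). The names
parse_ordered, RECORD_RE, FIELD_RE and sample are made up for this
illustration; the two patterns are the ones used in parse(), and the
sample is the first two entries from the thread.

```python
import re

# Same two patterns as in parse(), compiled once (names are hypothetical).
RECORD_RE = re.compile(r'^<\d+>')
FIELD_RE = re.compile(r'^([A-Z]{2}) +- +(.*)')


def parse_ordered(lines):
    """Yield one list of [tag, text] pairs per entry, in file order."""
    fields = None  # stays None until the first <number> header is seen
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if RECORD_RE.match(line):
            if fields:
                yield fields  # hand back the previous entry
            fields = []
            continue
        if fields is None:
            continue  # assumption: ignore text before the first header
        m = FIELD_RE.match(line)
        if m:
            # New field: remember the tag and its first line of text
            fields.append([m.group(1), m.group(2)])
        elif fields:
            # Continuation line: extend the most recent field
            fields[-1][1] += '\n' + line
    if fields:
        yield fields


sample = '''<567>
AU - Bibliographical Theory and Practice - Volume 1 - The AU - Tag
and its applications
AB - Texts in Library Science
<568>
AU - Bibliographical Theory and Practice - Volume 2 - The
AB - Tag and its applications
AB - Texts in Library Science
'''.splitlines()

entries = list(parse_ordered(sample))
for entry in entries:
    for tag, text in entry:
        print('%s: %r' % (tag, text))
    print('')
```

Repeated tags, such as the two AB lines in entry 568, stay as separate
pairs in the order they were read. The price is that looking up a field
by tag means scanning the list rather than indexing a dict.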