Chris This block of code made my day - especially yummydataaddrs & "here's your stupid data"
> for start,end in yummydataaddrs: > fd.seek(start) > print "here's your stupid data:", fd.read(end-start+1) Nothing is more impressive than solid code, with a good sense of humor. Thanks for the code & especially since i am in a time crunch, this approach, might get me what i need more quickly. Thanks also for Knuth's awesome quote & reminded me of my stanford friend who told me that Prof. Knuth, still holds a christmas tree lecture every year...unfortunately inspite of being in the bay area this year, i missed it :( http://stanford-online.stanford.edu/seminars/knuth/101206-knuth-500.asx Thanks a ton cheers ashish p.s. To everybody OT(off_topic): I moved to the bay area recently & am passionate about technology in general & linux, python, c, embedded, mobile, wireless stuff,..... I was wondering if any of you guys, are part of some bay area python( or other tech) meetup ( as in do you guys meetup, in person) for like a tech talk / discussion / brainstorming/ hack nights ? If yes, i would love to know more & be a part of it On Mon, Dec 20, 2010 at 9:27 PM, Chris Fuller <cfuller...@thinkingplanet.net > wrote: > > This isn't XML, it's an abomination of XML. Best to not treat it as XML. > Good thing you're only after one class of tags. Here's what I'd do. I'll > give a general solution, but there are two parameters / four cases that > could > make the code simpler, I'll just point them out at the end. > > Iterate over the file descriptor, reading in line-by-line. This will be > slow > on a huge file, but probably not so bad if you're only doing it once. It > makes > the rest easier. Knuth has some sage advice on this point (*) :) Some > feedback on progress to the user can be helpful here, if it is slow. > > Keep track of your offset into the file. There are two ways: use the > tell() > method of the file descriptor (but you will have to subtract the length of > the > current line), or just add up the line lengths as you process them. > > Scan each line for the open tag. Add the offset to the tag to the offset > within > the file of the current line, and push that to a stack. Scan for the end > tag, > when you find one, pop an address from the stack, and put the two > (start/end) > addresses a list for later. Keep doing this until you run out of file. > > Now, take that list, and pull off the address-pairs; seek() and read() them > directly. Lather, rinse, repeat. > > Some off-the-cuff untested code: > > stk = [] > yummydataaddrs = [] > > fileoff = 0 > > fd = open('ginormous.xml', 'r') > for line in fd: > lineoff = line.index(start_tag) > if fileoff != -1: > stk.append(fileoff+lineoff) > > lineoff = line.index(end_tag) > if lineoff != -1: > yummydataaddr.append( (stk.pop(-1), fileoff+lineoff) ) > > fileoff += len(line) > > for start,end in yummydataaddrs: > fd.seek(start) > print "here's your stupid data:", fd.read(end-start+1) > > > You can simplify a bit if the tags are one a line by themselves, since you > don't have to keep track of the offset with the line of the tag. The other > simplification is if they aren't nested. You don't need to mess around > with a > stack in this case. > > > (*) "Premature optimization is the root of all evil." > > > Cheers > _______________________________________________ > Tutor maillist - Tutor@python.org > To unsubscribe or change subscription options: > http://mail.python.org/mailman/listinfo/tutor > *"We act as though comfort and luxury were the chief requirements of life, when all that we need to make us happy is something to be enthusiastic about." -- Albert Einstein*
_______________________________________________ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor