I would wrap the record buffering into a generator function and probably
use plain slicing to return the individual records instead of StringIO.
I have a writeup on generators here:
http://personalpages.tds.net/~kent37/kk/00004.html
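
For example (untested, and read_records is just a name I picked;
Insurance, Global, offPath and recLen below are from your code):

def read_records(path, recLen, bufSize=4096):
    """Yield fixed-length records from path, bufSize records per read."""
    with open(path, 'rb') as inFile:
        inFile.read(recLen)              # throw away the header record
        while True:
            tmpIn = inFile.read(recLen * bufSize)
            if len(tmpIn) < recLen:      # end of file (or a stray partial record)
                break
            # plain slicing instead of a StringIO wrapper
            for pos in range(0, len(tmpIn) - recLen + 1, recLen):
                yield tmpIn[pos:pos + recLen]

Your calling code then collapses to a single loop:

for inRec in read_records(offPath.Ref + obj.TLA + '.dat', recLen):
    obj = Insurance(inRec)
    if obj.Valid:
        Global.Ins.append((obj.ID, obj.Name))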

Marc Tompkins wrote:
> Alan Gauld wrote:
> > "Marc Tompkins" <[EMAIL PROTECTED]> wrote
> > > realized I can implement this myself, using 'read(bigsize)' -
> > > currently I'm using 'read(recordsize)'; I just need to add an extra
> > > loop around my record reads.  Please disregard...
> >
> > If you just want to navigate to a specific record then it might be
> > easier to use seek(); that will save you having to read all the
> > previous records into memory.
>
> No, I need to parse the entire file, checking records as I go.  Here's
> the solution I came up with - I'm sure it could be optimized, but it's
> already about six times faster than going record-by-record:
>
> def loadInsurance(self):
>     header = ('Code', 'Name')
>     Global.Ins.append(header)
>     obj = Insurance()
>     recLen = obj.RecordLength
>     for offNum, offPath in Global.offices.iteritems():
>         if offPath.Ref == '':
>             offPath.Ref = offPath.Default
>         with open(offPath.Ref + obj.TLA + '.dat', 'rb') as inFile:
>             tmpIn = inFile.read(recLen)    # throw away the header record
>             tmpIn = inFile.read(recLen * 4096)
>             while not (len(tmpIn) < recLen):
>                 buf = StringIO.StringIO(tmpIn)
>                 inRec = buf.read(recLen)
>                 while not (len(inRec) < recLen):
>                     obj = Insurance(inRec)
>                     if obj.Valid:
>                         Global.Ins.append((obj.ID, obj.Name))
>                     inRec = buf.read(recLen)
>                 buf.close()
>                 tmpIn = inFile.read(recLen * 4096)
>
> Obviously this is taken out of context, and I'm afraid I'm too lazy to
> sanitize it (much) for posting right now, so here's a brief summary
> instead.
>
> 1- I don't want my calling code to need to know many details.  So if I
> create an object with no parameters, it provides me with the record
> length (files vary from 80-byte records up to 1024) and the TLA portion
> of the filename (the data files are named in the format xxTLA.dat, where
> xx is the 2-digit office number and TLA is the three-letter acronym for
> what the file contains - e.g. INS for insurance.)
>
> 2- Using the information I just obtained, I then read through the file
> one record-length chunk at a time, creating an object out of each chunk
> and reading the attributes of that object.  In the next version of my
> class library, I'll move the whole list-generation logic inside the
> classes so I can just pass in a filename and receive a list... but
> that's one for my copious free time.
>
> 3- Each file contains a header record, which is pure garbage.  I read
> it in and throw it away before I even begin.  (I could seek to just
> past it instead - would it really be more efficient?)
>
> 4- Now here's where the read-ahead buffer comes in - I (attempt to)
> read 4096 records' worth of data and store it in a StringIO file-like
> object.  (4096 is just a number I pulled out of the air, but I've tried
> increasing and decreasing it, and it seems good.  If I have the time, I
> may benchmark to find the best number for each record length, and
> retrieve that number along with the record length and TLA.  Of course,
> the optimal number probably varies per machine, so maybe I won't bother.)
>
> 5- Now I go through the buffer, one record's worth at a time, and do
> whatever I'm doing with the records - in this case, I'm making a list of
> insurance company IDs and names to display in a wx.CheckListCtrl.
>
> 6- If I try to read past the end of the file, there's no error - so I
> need to check the size of what's returned.  If it's smaller than recLen,
> I know I've hit the end.
> 6a- When I hit the end of the buffer, I close it and read in another
> 4096 records.
> 6b- When I try to read 4096 records and end up with less than recLen,
> I know I've hit the end of the file.
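
Re 3 - reading and throwing away a single record-length chunk costs next
to nothing, so I wouldn't expect a measurable difference from seek();
where seek() really pays off is random access, as Alan said.  A sketch
(untested - read_record is just my name for it, and it assumes the
header is exactly one record long, as in your code):

def read_record(path, recLen, n):
    """Fetch record n directly, without reading the records before it."""
    with open(path, 'rb') as inFile:
        inFile.seek(recLen * (n + 1))    # skip the header plus n records
        return inFile.read(recLen)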

> I've only tested on a few machines/client databases so far, but when I
> added step 4, processing a 250MB transaction table (256-byte records)
> went from nearly 30 seconds down to about 3.5 seconds.  Other results
> have varied, but they've all shown improvement.
>
> If anybody sees any glaring inefficiencies, let me know; OTOH if anybody
> else needs to do something similar... here's one way to do it.
>
> --
> www.fsrtechnologies.com
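
As for benchmarking the read-ahead size (your step 4): a throwaway
harness along these lines would let you compare candidates for each
record length - untested, and it assumes a generator like the
read_records() sketch above:

import time

def time_buffer_sizes(path, recLen, sizes=(256, 1024, 4096, 16384)):
    """Print a rough timing for each candidate read-ahead size."""
    for bufSize in sizes:
        start = time.time()
        count = 0
        for inRec in read_records(path, recLen, bufSize):
            count += 1                  # consume the file, counting records
        print('%6d records/read: %d records in %.2f seconds'
              % (bufSize, count, time.time() - start))

Kent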