Alan Gauld wrote:
> "Marc Tompkins" <[EMAIL PROTECTED]> wrote
>> realized I can implement this myself, using 'read(bigsize)' -
>> currently I'm using 'read(recordsize)'; I just need to add an extra
>> loop around my record reads.  Please disregard...
>
> If you just want to navigate to a specific record then it might be
> easier to use seek(), that will save you having to read all the
> previous records into memory.
No, I need to parse the entire file, checking records as I go.  Here's the
solution I came up with - I'm sure it could be optimized further, but it's
already about six times faster than going record-by-record:

    import StringIO

    def loadInsurance(self):
        header = ('Code', 'Name')
        Global.Ins.append(header)
        obj = Insurance()
        recLen = obj.RecordLength
        for offNum, offPath in Global.offices.iteritems():
            if offPath.Ref == '':
                offPath.Ref = offPath.Default
            with open(offPath.Ref + obj.TLA + '.dat', 'rb') as inFile:
                inFile.read(recLen)                 # throw away the header record
                tmpIn = inFile.read(recLen * 4096)  # fill the read-ahead buffer
                while len(tmpIn) >= recLen:
                    buf = StringIO.StringIO(tmpIn)
                    inRec = buf.read(recLen)
                    while len(inRec) >= recLen:     # a short read means buffer exhausted
                        obj = Insurance(inRec)
                        if obj.Valid:
                            Global.Ins.append((obj.ID, obj.Name))
                        inRec = buf.read(recLen)
                    buf.close()
                    tmpIn = inFile.read(recLen * 4096)

Obviously this is taken out of context, and I'm afraid I'm too lazy to
sanitize it (much) for posting right now, so here's a brief summary instead:

1- I don't want my calling code to need to know many details, so if I create
an object with no parameters, it provides me with the record length (files
vary from 80-byte records up to 1024-byte) and the TLA portion of the
filename.  (The data files are named in the format xxTLA.dat, where xx is
the 2-digit office number and TLA is a three-letter acronym for what the
file contains - e.g. INS for insurance.)

2- Using the information I just obtained, I read through the file one
record-length chunk at a time, creating an object out of each chunk and
reading that object's attributes.  In the next version of my class library
I'll move the whole list-generation logic inside the classes, so I can just
pass in a filename and receive a list... but that's one for my copious free
time.

3- Each file starts with a header record, which is pure garbage.  I read it
in and throw it away before I even begin.  (I could seek() to just past it
instead - would that really be more efficient?)

4- Now here's where the read-ahead buffer comes in: I (attempt to) read 4096
records' worth of data and store it in a StringIO file-like object.  (4096
is just a number I pulled out of the air, but I've tried increasing and
decreasing it, and it seems good.  If I have the time, I may benchmark to
find the best number for each record length, and retrieve that number along
with the record length and TLA - see the P.P.S. below for a rough harness.
Of course, the optimal number probably varies per machine, so maybe I won't
bother.)

5- Now I go through the buffer, one record's worth at a time, and do
whatever I'm doing with the records - in this case, making a list of
insurance company IDs and names to display in a wx.CheckListCtrl.  (The
P.S. below sketches a way to factor this loop out into a reusable
generator.)

6- Reading past the end of the file doesn't raise an error, so I need to
check the size of what comes back; if it's smaller than recLen, I know I've
hit the end.

6a- When I hit the end of the buffer, I close it and read in another 4096
records' worth.

6b- When I try to read 4096 records and end up with less than recLen, I
know I've hit the end of the file.

I've only tested on a few machines/client databases so far, but when I
added step 4, processing a 250MB transaction table (256-byte records) went
from nearly 30 seconds down to about 3.5 seconds.  Other results have
varied, but they've all shown improvement.  If anybody sees any glaring
inefficiencies, let me know; OTOH if anybody else needs to do something
similar... here's one way to do it.

-- 
www.fsrtechnologies.com
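P.S. If I ever get around to the refactoring in point 2, the buffered loop
could be pulled out into a reusable generator along these lines.  This is
just a sketch, not tested against real data - iter_records(), its
parameters, and the '02INS.dat' filename are names I'm inventing for
illustration.  It assumes the same layout as above (one garbage header
record, then fixed-length records), and it uses seek() to skip the header,
per Alan's suggestion in point 3:

    def iter_records(path, rec_len, chunk_records=4096):
        """Yield fixed-length records from path, reading in big chunks.

        Skips the header record; a trailing partial record is dropped,
        just like the short-read check in loadInsurance() above.
        """
        with open(path, 'rb') as in_file:
            in_file.seek(rec_len)  # skip the garbage header record
            while True:
                chunk = in_file.read(rec_len * chunk_records)
                if len(chunk) < rec_len:  # short read: end of file
                    break
                # slice the chunk directly rather than wrapping it in StringIO
                for offset in xrange(0, len(chunk) - rec_len + 1, rec_len):
                    yield chunk[offset:offset + rec_len]

The calling code would then shrink to something like:

    for rec in iter_records(somePath + '02INS.dat', recLen):
        obj = Insurance(rec)
        if obj.Valid:
            Global.Ins.append((obj.ID, obj.Name))

Slicing the chunk saves creating a StringIO object per chunk, though I'd
expect the big win to remain the read-ahead itself.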
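P.P.S. Re the "best number" in point 4: if I do benchmark it, something as
crude as this would probably do.  Again just a sketch - it assumes the
iter_records() generator from the P.S., and '02INS.dat' is a made-up
filename:

    import time

    def time_chunk_size(path, rec_len, chunk_records):
        # Time one full pass at the given read-ahead size.  A real
        # benchmark would repeat runs and discard the first one, since
        # the OS file cache makes warm passes much faster than a cold read.
        start = time.time()
        for rec in iter_records(path, rec_len, chunk_records):
            pass
        return time.time() - start

    for n in (256, 1024, 4096, 16384):
        print n, time_chunk_size('02INS.dat', 256, n)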