On Sun, Nov 23, 2014 at 7:20 PM, Cameron Simpson <c...@zip.com.au> wrote:
>
> A remark about the create_lookup() function on pastebin: you go:
>
>     record_start += len(line)
>
> This presumes that a single text character on a line consumes a single byte
> of memory or file disc space. However, your data file is utf-8 encoded, and
> some characters may be more than one byte of storage. This means that your
> record_start values will not be useful because they are character counts,
> not byte counts, and you need byte counts to offset into a file if you are
> doing random access.

mmap.readline returns a byte string, so len(line) is a byte count. That
said, CsvIter._get_row_lookup shouldn't use the mmap object. Limit its use
to __getitem__.

In CsvIter.__getitem__, I don't see the need to wrap the line in a
file-like object. It's clearly documented that csv.reader takes an
iterable object, such as a list. For example:

    # 2.x csv lacks unicode support
    line = self.data[start:end].strip()
    row = next(csv.reader([line]))
    return [cell.decode('utf-8') for cell in row]

    # 3.x csv requires unicode
    line = self.data[start:end].strip()
    row = next(csv.reader([line.decode('utf-8')]))
    return row

CsvIter._get_row_lookup should work on a regular file from built-in open
(not codecs.open), opened in binary mode. I/O on a regular file releases
the GIL, so the main thread can run in the meantime. mmap objects don't do
this. Binary mode ensures the offsets are byte offsets, and thus valid for
use with the mmap object in __getitem__. Since lines are split on the
b'\n' byte, this requires an ASCII compatible encoding such as UTF-8.

Also, iterate in a for loop instead of calling readline in a while loop.
In 2.x, file.next uses a read-ahead buffer to improve performance. To see
this, check tell() inside a for loop: the position jumps ahead by the
buffer size instead of advancing one line at a time.
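
Putting the pieces together, here is a minimal 3.x sketch of the approach described above. It is only an illustration: build_row_lookup, get_row, demo and the sample data are hypothetical names of mine, not part of the original CsvIter class, and it assumes no quoted CSV field contains an embedded newline.

```python
import csv
import mmap
import os
import tempfile


def build_row_lookup(path):
    """Return a list of (start, end) byte offsets, one pair per line."""
    lookup = []
    record_start = 0
    # Built-in open in binary mode: each line is a bytes object, so
    # len(line) is a byte count and the offsets are valid for the mmap.
    with open(path, 'rb') as f:
        for line in f:  # for loop instead of readline in a while loop
            record_end = record_start + len(line)
            lookup.append((record_start, record_end))
            record_start = record_end
    return lookup


def get_row(data, lookup, index):
    """Slice one record out of the mmap and parse it with csv.reader."""
    start, end = lookup[index]
    line = data[start:end].strip()
    # 3.x csv requires str, so decode before handing the line over.
    # csv.reader accepts any iterable of lines, such as a list.
    return next(csv.reader([line.decode('utf-8')]))


def demo():
    """Write a small UTF-8 file, index it, and read every row back."""
    # Hypothetical sample data with 2-byte and 3-byte characters.
    text = 'id,name\n1,Ren\u00e9e\n2,\u5f20\u4f1f\n'
    fd, path = tempfile.mkstemp(suffix='.csv')
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(text.encode('utf-8'))
        lookup = build_row_lookup(path)
        with open(path, 'rb') as f:
            data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                rows = [get_row(data, lookup, i) for i in range(len(lookup))]
            finally:
                data.close()
    finally:
        os.remove(path)
    return lookup, rows
```

Calling demo() returns the offsets and the decoded rows. The sample text is 21 characters long but 26 bytes on disk, which is why the offsets have to be computed from the binary-mode lines rather than from decoded text.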