>________________________________
> From: eryksun <eryk...@gmail.com>
>To: Python Mailing List <tutor@python.org>
>Sent: Tuesday, November 25, 2014 6:41 AM
>Subject: Re: [Tutor] multiprocessing question
>
>
>On Sun, Nov 23, 2014 at 7:20 PM, Cameron Simpson <c...@zip.com.au> wrote:
>>
>> A remark about the create_lookup() function on pastebin: you go:
>>
>>   record_start += len(line)
>>
>> This presumes that a single text character on a line consumes a single byte
>> of memory or file disc space. However, your data file is utf-8 encoded, and
>> some characters may be more than one byte of storage. This means that your
>> record_start values will not be useful because they are character counts,
>> not byte counts, and you need byte counts to offset into a file if you are
>> doing random access.
>
>mmap.readline returns a byte string, so len(line) is a byte count.
>That said, CsvIter._get_row_lookup shouldn't use the mmap
>object. Limit its use to __getitem__.

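The difference between the two counts is easy to check. The following is only a small sketch for Python 3; the sample text and the temporary file are made up for illustration and are not taken from the pastebin code.

```python
import mmap
import tempfile

# Hypothetical sample line containing two non-ASCII characters.
text = u"caf\u00e9,na\u00efve\n"
data = text.encode("utf-8")

with tempfile.TemporaryFile() as f:
    f.write(data)
    f.flush()
    # Map the whole file read-only and read one line from the mapping.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    line = mm.readline()   # a byte string, not text
    mm.close()

char_count = len(text)     # number of characters in the decoded text
byte_count = len(line)     # number of bytes the line occupies on disk
print(char_count, byte_count)
```

The two numbers differ because each accented character takes two bytes in UTF-8, so only the byte count is usable as a file offset.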
Ok, thanks, I will modify the code.

>In CsvIter.__getitem__, I don't see the need to wrap the line in a
>filelike object. It's clearly documented that csv.reader takes an
>iterable object, such as a list. For example:
>
>    # 2.x csv lacks unicode support
>    line = self.data[start:end].strip()
>    row = next(csv.reader([line]))
>    return [cell.decode('utf-8') for cell in row]
>
>    # 3.x csv requires unicode
>    line = self.data[start:end].strip()
>    row = next(csv.reader([line.decode('utf-8')]))
>    return row

Nice, thank you! I indeed wanted to write the code for use in Python 2.7 and 3.3+.

>CsvIter._get_row_lookup should work on a regular file from built-in
>open (not codecs.open), opened in binary mode. I/O on a regular file
>will release the GIL back to the main thread. mmap objects don't do
>this.

Will io.open also work? Until today I thought that Python 3's open was the equivalent of codecs.open in Python 2 (probably because Python 3 is all about unicode strings, and Python 3's open has an encoding argument).

>Binary mode ensures the offsets are valid for use with
>the mmap object in __getitem__. This requires an ASCII compatible
>encoding such as UTF-8.

What exactly do you mean by "ASCII compatible"? Does it mean 'superset of ASCII', such as utf-8, windows-1252, latin-1? Hmmm, but Asian encodings like cp874 and Shift-JIS are Thai/Japanese on top of ASCII, so this leaves me unsure. In my code I am using icu to guess the encoding; I simply put 'utf-8' in the sample code for brevity.

>Also, iterate in a for loop instead of calling readline in a while loop.
>2.x file.__next__ uses a read-ahead buffer to improve performance.
>To see this, check tell() in a for loop.

Wow, great tip. I just modified some sample code that I will post shortly.

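Taken together, the suggestions above can be sketched roughly as follows. This is not the CsvIter code from the pastebin: the function names build_row_lookup and get_row, the sample data, and the temporary file are all made up for illustration. It targets Python 3 only and assumes UTF-8 input with one record per physical line (a quoted field containing a newline would break the line-based lookup).

```python
import csv
import io
import mmap
import os
import tempfile


def build_row_lookup(path):
    """Return a list of (start, end) byte offsets, one pair per line."""
    lookup = []
    start = 0
    # Regular file in binary mode, so len(line) is a byte count.
    with io.open(path, "rb") as f:
        for line in f:  # for loop instead of readline() in a while loop
            end = start + len(line)
            lookup.append((start, end))
            start = end
    return lookup


def get_row(data, lookup, index, encoding="utf-8"):
    """Slice one record out of a bytes-like object and parse it."""
    start, end = lookup[index]
    line = data[start:end].strip()
    # csv.reader accepts any iterable of strings, e.g. a one-item list.
    return next(csv.reader([line.decode(encoding)]))


# Hypothetical sample data written to a temporary file.
sample = u'id,name\n1,caf\u00e9\n2,"Z\u00fcrich, CH"\n'.encode("utf-8")
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as f:
    f.write(sample)

lookup = build_row_lookup(path)

# The mmap object is used only for random access to single records.
with io.open(path, "rb") as f:
    data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    row = get_row(data, lookup, 2)
    data.close()
os.remove(path)

print(lookup)
print(row)
```

Building the lookup from a plain binary file and touching the mmap only when a record is fetched keeps the offsets in bytes, which is what slicing the mmap needs.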
_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor