Re: [Tutor] multiprocessing question

Albert-Jan Roskam Thu, 27 Nov 2014 12:47:17 -0800

>________________________________
> From: eryksun <[email protected]>
>To: Python Mailing List <[email protected]> 
>Sent: Tuesday, November 25, 2014 6:41 AM
>Subject: Re: [Tutor] multiprocessing question
> 
>
>On Sun, Nov 23, 2014 at 7:20 PM, Cameron Simpson <[email protected]> wrote:
>>
>> A remark about the create_lookup() function on pastebin: you go:
>>
>>  record_start += len(line)
>>
>> This presumes that a single text character on a line consumes a single byte
>> of memory or file disc space. However, your data file is utf-8 encoded, and
>> some characters may be more than one byte of storage. This means that your
>> record_start values will not be useful because they are character counts,
>> not byte counts, and you need byte counts to offset into a file if you are
>> doing random access.
>
>mmap.readline returns a byte string, so len(line) is a byte count.
>That said, CsvIter._get_row_lookup shouldn't use the mmap
>object. Limit its use to __getitem__.



Ok, thanks, I will modify the code.
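
For the archive, a minimal sketch of the difference between a character count and a byte count. The sample string is made up for illustration. The point is only that len() of a decoded line and len() of the raw bytes diverge once non-ASCII characters appear, while len() of a byte string (which is what mmap.readline returns) is already a byte count:

```python
# Hypothetical sample line; any text with non-ASCII characters will do.
text = u"caf\u00e9,na\u00efve\n"
raw = text.encode("utf-8")

char_count = len(text)   # counts characters (code points)
byte_count = len(raw)    # counts bytes, which is what file offsets need

print(char_count)
print(byte_count)
```

Here the two accented characters each take two bytes in UTF-8, so the byte count is two higher than the character count.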

>In CsvIter.__getitem__, I don't see the need to wrap the line in a
>filelike object. It's clearly documented that csv.reader takes an
>iterable object, such as a list. For example:
>
>    # 2.x csv lacks unicode support
>    line = self.data[start:end].strip()
>    row = next(csv.reader([line]))
>    return [cell.decode('utf-8') for cell in row]
>
>    # 3.x csv requires unicode
>    line = self.data[start:end].strip()
>    row = next(csv.reader([line.decode('utf-8')]))
>    return row


Nice, thank you! I indeed wanted to write the code for use in Python 2.7 and 
3.3+.
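
A rough sketch of how the two variants above might be folded into one helper that runs on both 2.7 and 3.3+. The name parse_csv_line and the sample record are my own inventions for illustration, and the sketch assumes one complete record per line (no embedded newlines inside quoted cells) and a non-empty line:

```python
import csv
import sys

PY2 = sys.version_info[0] == 2


def parse_csv_line(raw_line, encoding="utf-8"):
    """Parse one CSV record given as bytes; return a list of text cells."""
    line = raw_line.strip()
    if PY2:
        # 2.x csv works on byte strings, so decode each cell afterwards.
        row = next(csv.reader([line]))
        return [cell.decode(encoding) for cell in row]
    # 3.x csv works on text, so decode the whole line first.
    return next(csv.reader([line.decode(encoding)]))


# Hypothetical record with a quoted comma and a non-ASCII character.
row = parse_csv_line(u'1,"Zo\u00eb, Jr.",Utrecht\r\n'.encode("utf-8"))
print(row)
```

In both branches the reader is handed a one-element list, as suggested, instead of a file-like wrapper.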

>CsvIter._get_row_lookup should work on a regular file from built-in
>open (not codecs.open), opened in binary mode. I/O on a regular file
>will release the GIL back to the main thread. mmap objects don't do
>this.

Will io.open also work? Until today I thought that Python 3's open was the 
equivalent of codecs.open in Python 2 (probably because Python 3 is all about 
unicode strings, and Python 3's open has an encoding argument).
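
As far as I can tell from the docs, io.open is available from Python 2.6 on, and in Python 3 the built-in open is simply an alias for it. A small sketch of what that would mean in practice, assuming a throwaway temporary file (the file and its contents are made up for the example): binary mode returns bytes, so len() is a byte count, whereas text mode with an encoding returns decoded text.

```python
import io
import os
import sys
import tempfile

# Write a small UTF-8 file to a temporary location (illustrative data only).
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with io.open(path, "wb") as f:
    f.write(u"na\u00efve\n".encode("utf-8"))

# Binary mode: no decoding, so the line is bytes and len() is a byte count.
with io.open(path, "rb") as f:
    binary_line = f.readline()

# Text mode with an encoding: the line is decoded, so len() counts characters.
with io.open(path, "r", encoding="utf-8") as f:
    text_line = f.readline()

os.remove(path)

if sys.version_info[0] >= 3:
    print(io.open is open)   # in Python 3 both names refer to the same function
print(len(binary_line))
print(len(text_line))
```

So for computing offsets, io.open(path, "rb") looks like a reasonable way to get the same behaviour on 2.7 and 3.3+.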

>
>Binary mode ensures the offsets are valid for use with
>the mmap object in __getitem__. This requires an ASCII compatible
>encoding such as UTF-8.

What exactly do you mean by "ASCII compatible"? Does it mean 'superset of 
ASCII', such as utf-8, windows-1252, latin-1? Hmmm, but Asian encodings like 
cp874 and Shift-JIS are Thai/Japanese on top of ASCII, so I am not sure. 
In my code I am using icu to guess the encoding; I simply put 'utf-8' in the 
sample code for brevity.
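
One way to read "ASCII compatible" for the purpose of the offset lookup (this is my interpretation, not something stated above) is that the byte 0x0A must only ever occur as a real newline, since lines are split on that byte in binary mode. A small sketch that tests this for one sample string; the helper name newline_bytes_are_real and the sample are hypothetical, and a single sample can only expose a problem, not prove an encoding safe:

```python
def newline_bytes_are_real(text, encoding):
    """True if every 0x0A byte in the encoded text is an actual newline."""
    return text.encode(encoding).count(b"\n") == text.count(u"\n")


# U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE) followed by a newline.
sample = u"\u010a\n"

utf8_ok = newline_bytes_are_real(sample, "utf-8")
utf16_ok = newline_bytes_are_real(sample, "utf-16-le")

print(utf8_ok)
print(utf16_ok)
```

In UTF-8 the character U+010A becomes the two bytes C4 8A, so the only 0x0A byte is the newline. In UTF-16-LE the same character becomes 0A 01, which a byte-wise line splitter would mistake for a line end. The same check could be run with cp874 or shift_jis against representative samples of the real data; I have not verified those here.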

>
>Also, iterate in a for loop instead of calling readline in a while loop.
>2.x file.next uses a read-ahead buffer to improve performance.
>To see this, check tell() in a for loop.


Wow, great tip. I just modified some sample code that I will post shortly.
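
In the meantime, a minimal sketch of the kind of loop I understand this to mean. It is not the pastebin code. The function name build_row_lookup and the temporary sample file are assumptions for illustration. The offsets are accumulated from len(line) rather than read from tell(), because with the read-ahead buffer tell() inside the loop would not necessarily point at the end of the current line:

```python
import io
import os
import tempfile


def build_row_lookup(path):
    """Return a list of (start, end) byte offsets, one pair per line."""
    lookup = []
    start = 0
    with io.open(path, "rb") as f:
        for line in f:                # iterate instead of calling readline()
            end = start + len(line)   # len() of bytes is a byte count
            lookup.append((start, end))
            start = end
    return lookup


# Illustrative data: a temporary two-record UTF-8 file.
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with io.open(path, "wb") as f:
    f.write(u"1,caf\u00e9\n2,tea\n".encode("utf-8"))

lookup = build_row_lookup(path)
with io.open(path, "rb") as f:
    data = f.read()
os.remove(path)

# Slice the second record out of the raw bytes using its offsets.
start, end = lookup[1]
second = data[start:end]
print(lookup)
print(second)
```

The same (start, end) pairs should be usable for slicing the mmap object in __getitem__, since both work on raw bytes.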

>
>
>_______________________________________________
>Tutor maillist  -  [email protected]
>To unsubscribe or change subscription options:
>https://mail.python.org/mailman/listinfo/tutor