[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write?
Changes by Serhiy Storchaka storch...@gmail.com: -- resolution: -> fixed stage: commit review -> resolved status: open -> closed ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue23688 ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write?
Roundup Robot added the comment: New changeset 4dc69e5124f8 by Serhiy Storchaka in branch 'default': Issue #23688: Added support of arbitrary bytes-like objects and avoided https://hg.python.org/cpython/rev/4dc69e5124f8 -- nosy: +python-dev
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write?
Serhiy Storchaka added the comment: OK, so left it as is if nobody complains.
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write?
Wolfgang Maier added the comment: I see now that it is just issue21560 that went into 2.7 and that's fine. As I said: sorry for the noise
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write?
Serhiy Storchaka added the comment:

> I think I saw that you committed this also to the 2.7 branch,

I committed only working tests and a fix from issue21560.
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write?
Serhiy Storchaka added the comment:

Here is a patch that restores support of non-contiguous memoryviews. It would be better to drop support of non-contiguous data, because it worked only by accident. Only support for bytes-like memoryviews written by BufferedWriter was needed.

-- Added file: http://bugs.python.org/file38656/gzip_write_noncontiguous.patch

diff -r 7e179ee91af0 Lib/gzip.py
--- a/Lib/gzip.py Mon Mar 23 15:26:49 2015 +0200
+++ b/Lib/gzip.py Mon Mar 23 16:27:08 2015 +0200
@@ -340,6 +340,8 @@ class GzipFile(io.BufferedIOBase):
             # accept any data that supports the buffer protocol
             data = memoryview(data)
             length = data.nbytes
+            if not data.contiguous:
+                data = bytes(data)
 
         if length > 0:
             self.fileobj.write(self.compress.compress(data))
diff -r 7e179ee91af0 Lib/test/test_gzip.py
--- a/Lib/test/test_gzip.py Mon Mar 23 15:26:49 2015 +0200
+++ b/Lib/test/test_gzip.py Mon Mar 23 16:27:08 2015 +0200
@@ -74,6 +74,7 @@ class TestGzip(BaseTest):
         m = memoryview(bytes(range(256)))
         data = m.cast('B', shape=[8,8,4])
         self.write_and_read_back(data)
+        self.write_and_read_back(memoryview(data1 * 50)[::-1])
 
     def test_write_bytearray(self):
         self.write_and_read_back(bytearray(data1 * 50))
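The fallback in the patch above can be tried outside of gzip. This is only a sketch: the helper name `contiguous_buffer` is made up for illustration and is not part of Lib/gzip.py. It keeps a contiguous buffer as a zero-copy view and copies only when the view is non-contiguous.

```python
import zlib

def contiguous_buffer(data):
    """Hypothetical helper mirroring the patch: copy only if non-contiguous."""
    view = memoryview(data)
    length = view.nbytes
    if not view.contiguous:
        # bytes() serializes the view in its logical order (this is a copy)
        view = bytes(view)
    return view, length

plain, plain_len = contiguous_buffer(b"abcdef")
flipped, flipped_len = contiguous_buffer(memoryview(b"abcdef")[::-1])

# Both results can be handed to zlib without a BufferError.
checksums = (zlib.crc32(plain), zlib.crc32(flipped))
```

The contiguous case stays a memoryview, so the copy is paid only for the inputs that need it.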
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write?
Wolfgang Maier added the comment: To preserve compatibility: there is the memoryview.c_contiguous flag. Maybe we should just check it and, if it is False, fall back to the old copying behavior?
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write?
Wolfgang Maier added the comment:

something like:

    def write(self,data):
        self._check_closed()
        if self.mode != WRITE:
            import errno
            raise OSError(errno.EBADF, "write() on read-only GzipFile object")

        if self.fileobj is None:
            raise ValueError("write() on closed GzipFile object")

        if isinstance(data, bytes):
            length = len(data)
        elif isinstance(data, memoryview) and not data.c_contiguous:
            data = data.tobytes()
            length = len(data)
        else:
            # accept any data that supports the buffer protocol
            data = memoryview(data)
            length = data.nbytes

        if length > 0:
            self.fileobj.write(self.compress.compress(data))
            self.size += length
            self.crc = zlib.crc32(data, self.crc) & 0xffffffff
            self.offset += length

        return length
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write?
Stefan Krah added the comment:

> In a sense, the old behavior was an artefact of silently copying the memoryview to bytes.

It likely wasn't intentional, but tobytes() *is* used to serialize weird arrays to their C-contiguous representation (following the logical structure of the array rather than the physical one).

Since the gzip docs don't help much, I guess the new behavior is probably okay.
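A small illustration of "logical rather than physical" serialization, using only builtin memoryview slicing. The variable names are arbitrary.

```python
base = memoryview(b"abcdefgh")

every_other = base[::2]      # stride 2 over the same underlying buffer
backwards = base[::-1]       # negative stride, still no copy

# tobytes() walks each view in its logical order and returns a fresh,
# C-contiguous bytes object.
step_bytes = every_other.tobytes()
reversed_bytes = backwards.tobytes()

print(step_bytes, reversed_bytes)
```

Neither slice copies anything until tobytes() is called, which is why the old code accepted such views only as a side effect of the copy.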
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write?
Stefan Krah added the comment: I just see that non-contiguous arrays didn't work in 2.7 either, so that was probably the original intention.
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write?
Wolfgang Maier added the comment: Serhiy: I think I saw that you committed this also to the 2.7 branch, but that would not work since memoryviews do not have the nbytes attribute (they do not seem to have cast either). One would have to calculate the length instead from other properties. Tests would fail too I think. If I'm mistaken, then sorry for the noise.
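For illustration only, one way to derive the byte length "from other properties" is itemsize times the product of the shape. The sketch below runs on Python 3, where the result can be compared against nbytes; the thread does not say how (or whether) this would have been done for 2.7, and the helper name is hypothetical.

```python
import array
from functools import reduce
from operator import mul

def nbytes_from_shape(view):
    """Hypothetical stand-in for memoryview.nbytes: itemsize * product(shape)."""
    return view.itemsize * reduce(mul, view.shape, 1)

flat = memoryview(array.array('I', [1, 2, 3]))
cube = memoryview(bytes(range(256))).cast('B', shape=[8, 8, 4])

sizes = (nbytes_from_shape(flat), nbytes_from_shape(cube))
```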
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write?
Wolfgang Maier added the comment:

Ouch. I hadn't thought of this. OTOH, just plain io with your example:

    with open('xy', 'wb') as f:
        f.write(y)

    Traceback (most recent call last):
      File "<pyshell#29>", line 2, in <module>
        f.write(y)
    BufferError: memoryview: underlying buffer is not C-contiguous

fails too and after all that's not too surprising. In a sense, the old behavior was an artefact of silently copying the memoryview to bytes. You never used it *directly*.

But, yes, it is a change in (undocumented) behavior :(
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write?
Stefan Krah added the comment: I think there's a behavior change: Before you could gzip non-contiguous views directly, now that operation raises BufferError. -- nosy: +skrah
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write?
Serhiy Storchaka added the comment: Could you provide an example?
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write?
Stefan Krah added the comment:

Sure:

    import gzip
    x = memoryview(b'x' * 10)
    y = x[::-1]
    with gzip.GzipFile("x", 'w') as f:
        f.write(y)
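The same kind of view can be probed without touching the file system. The sketch below assumes only that zlib.crc32 asks for a simple contiguous buffer, which is what makes a reversed view fail once the tobytes() copy is gone.

```python
import zlib

x = memoryview(b"0123456789")
y = x[::-1]                      # reversed view, not C-contiguous

print(x.c_contiguous, y.c_contiguous)

try:
    zlib.crc32(y)                # consumer that needs a contiguous buffer
except BufferError as exc:
    rejected = True
    print("rejected:", exc)
else:
    rejected = False
```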
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write?
Serhiy Storchaka added the comment: In general the patch LGTM. -- assignee: -> serhiy.storchaka stage: -> commit review
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write?
Changes by Wolfgang Maier wolfgang.ma...@biologie.uni-freiburg.de: Added file: http://bugs.python.org/file38650/write_bytes_like_objects_v3.patch
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write?
Wolfgang Maier added the comment: Here is a revised version of my patch addressing Serhiy's review comments. -- Added file: http://bugs.python.org/file38639/write_bytes_like_objects_v2.patch
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write?
Wolfgang Maier added the comment: ok, I've prepared a patch including tests based on my last suggestion, which I think is ready for getting reviewed. -- Added file: http://bugs.python.org/file38600/write_bytes_like_objects.patch
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write?
Wolfgang Maier added the comment:

Thanks everyone for the lively discussion!

I like Serhiy's idea of making write work with arbitrary objects supporting the buffer protocol. In fact, I noticed before that GzipFile.write misbehaves with array.array input. It pretends to accept that, but it'll use len(data) for calculating the zip file metadata, so reading from the file will later fail. I was assuming then that fixing that would be too complicated for a rather exotic use case, but now that I see how simple it really is I think it should be done.

As for the concrete implementation, I guess an isinstance(data, bytes) check to speed up treatment of the most common input is a good idea, but I am not convinced that bytearray deserves the same attention.

Regarding memoryview.cast('B') vs memoryview.nbytes, I see Serhiy's point of keeping the patch size smaller. Personally though, I find use of nbytes much more self-explanatory than cast('B'), the purpose of which was not immediately obvious to me. So I would opt for better readability of the final code rather than optimizing patch size, but I would be ok with either solution since it is not about the essence of the patch anyway.

Finally, the bug I reported in issue21560 would be fixed as a side effect of this patch (because trying to get a memoryview from str would throw an early TypeError). Still, I think it would be a good idea to try to write to the wrapped fileobj *before* updating self.size and self.crc, to be protected from unforeseen errors. So maybe we could include that change in the patch here?

With all that the final code section could look like this:

        if isinstance(data, bytes):
            length = len(data)
        else:
            data = memoryview(data)
            length = data.nbytes

        if length > 0:
            self.fileobj.write( self.compress.compress(data) )
            self.size = self.size + length
            self.crc = zlib.crc32(data, self.crc) & 0xffffffff
            self.offset += length

        return length

One remaining detail then would be whether one would want to catch the TypeError possibly raised by the memoryview constructor to turn it into something less confusing (after all, many users will not know what a memoryview has to do with all this). The above code would throw (with str input for example):

    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "/home/wolma/gzip-bug/Lib/gzip.py", line 340, in write
        data = memoryview(data)
    TypeError: memoryview: a bytes-like object is required, not 'str'

Maybe this could be turned into:

    TypeError: must be bytes / bytes-like object, not 'str'

to be consistent with the corresponding error in 'wt' mode?

Let me know which of the above options you favour and I'll provide a new patch.
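The error-translation idea can be sketched as a standalone function. `prepare_write_data` is a hypothetical name, not something in gzip.py, and the message text is simply the one proposed above.

```python
import array

def prepare_write_data(data):
    """Hypothetical pre-processing step: return (buffer, byte_length)."""
    if isinstance(data, bytes):
        return data, len(data)           # fast path for the common case
    try:
        view = memoryview(data)          # any buffer-protocol object
    except TypeError:
        # Hide the memoryview detail from callers who passed e.g. a str.
        raise TypeError(
            "must be bytes / bytes-like object, not %r" % type(data).__name__
        ) from None
    return view, view.nbytes

numbers = array.array('I', [1, 2, 3])
buf, length = prepare_write_data(numbers)

try:
    prepare_write_data("text")
except TypeError as exc:
    message = str(exc)
```

With array.array input the returned length counts bytes rather than items, which is the metadata problem described in the message.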
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write ?
Serhiy Storchaka added the comment: Better way is data = data.cast('B').
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write ?
STINNER Victor added the comment:

> Better way is data = data.cast('B').

Why is this cast required? Can you please elaborate? If some memoryview must be rejected, again, we need more unit tests.
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write?
Martin Panter added the comment:

I would say that the current patch looks correct enough, in that it would still get the correct lengths when a memoryview() object is passed in. The zlib module’s crc32() function and compress() method already seem to support arbitrary bytes-like objects.

But to make GzipFile.write() also accept arbitrary bytes-like objects, you probably only need to change the code calculating the length to something like:

    with memoryview(data) as view:
        length = view.nbytes
    # Go on to call compress(data) and crc32(data)

-- nosy: +vadmium
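As a standalone sketch of that suggestion (the function name `byte_length` is made up here), the view is opened only long enough to read nbytes and is released when the with block exits:

```python
import array

def byte_length(data):
    """Byte size of any object supporting the buffer protocol (sketch)."""
    with memoryview(data) as view:   # released again on exit
        return view.nbytes

doubles = array.array('d', [1.0, 2.0])
sizes = (byte_length(b"abc"), byte_length(bytearray(5)), byte_length(doubles))
```

Releasing the view promptly matters for resizable exporters such as bytearray, which cannot be resized while a view is still held.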
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write ?
STINNER Victor added the comment:

> While we are here, it is possible to add the support of general byte-like objects.

With and without the patch, write() accepts bytes, bytearray and memoryview. Which other byte-like types do you know?

writeframesraw() method of aifc, sunau and wave modules use this pattern:

    if not isinstance(data, (bytes, bytearray)):
        data = memoryview(data).cast('B')

We can maybe reuse it in gzip module?
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write ?
Serhiy Storchaka added the comment:

Your patch is correct Wolfgang, but with cast('B') the patch would be smaller (no need to replace len(data) with nbytes).

While we are here, it is possible to add the support of general byte-like objects.

    if not isinstance(data, bytes):
        data = memoryview(data).cast('B')

The isinstance() check is just for optimization; it can be omitted if it doesn't affect performance.
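A short sketch of why cast('B') lets the existing len(data) calls stay as they are: after the cast, len() counts bytes. The wrapper name `as_byte_view` is hypothetical, and the item size of 'H' is platform-defined, so the printed numbers are not fixed.

```python
import array

def as_byte_view(data):
    """Return bytes unchanged, otherwise a flat unsigned-byte view (sketch)."""
    if not isinstance(data, bytes):
        data = memoryview(data).cast('B')   # zero-copy reinterpretation
    return data

items = array.array('H', [1, 2, 3])
view = as_byte_view(items)

# len() on the array counts items; len() on the cast view counts bytes.
print(len(items), len(view), memoryview(items).nbytes)
```

Note that cast() itself requires a C-contiguous view, so this pattern does not accept the reversed slices discussed elsewhere in the thread.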
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write?
Serhiy Storchaka added the comment:

> With and without the patch, write() accepts bytes, bytearray and memoryview. Which other byte-like types do you know?

The term "bytes-like object" is used as an alias for an instance of a type that supports the buffer protocol. Besides bytes, bytearray and memoryview, these include array.array and NumPy arrays. file.write() supports arbitrary bytes-like objects, including array.array and NumPy arrays.

> writeframesraw() method of aifc, sunau and wave modules use this pattern:

Yes, I wrote this code, if I remember correctly.

-- title: unnecessary copying of memoryview in gzip.GzipFile.write ? -> unnecessary copying of memoryview in gzip.GzipFile.write?
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write ?
New submission from Wolfgang Maier:

I thought I'd go back to work on a test patch for issue21560 today, but now I'm puzzled by the explicit handling of memoryviews in gzip.GzipFile.write. The method is defined as:

    def write(self,data):
        self._check_closed()
        if self.mode != WRITE:
            import errno
            raise OSError(errno.EBADF, "write() on read-only GzipFile object")

        if self.fileobj is None:
            raise ValueError("write() on closed GzipFile object")

        # Convert data type if called by io.BufferedWriter.
        if isinstance(data, memoryview):
            data = data.tobytes()

        if len(data) > 0:
            self.size = self.size + len(data)
            self.crc = zlib.crc32(data, self.crc) & 0xffffffff
            self.fileobj.write( self.compress.compress(data) )
            self.offset += len(data)

        return len(data)

So for some reason, when it gets passed data as a memoryview it will first copy its content to a bytes object, and I do not understand why. zlib.crc32 and zlib.compress seem to be able to deal with memoryviews, so the only special casing that seems required here is in determining the byte length of the data, which I guess needs to use memoryview.nbytes. I've prepared a patch (overlapping the one for issue21560) that avoids copying the data and seems to work fine.

Did I miss something about the importance of the tobytes conversion?

-- components: Library (Lib) messages: 238294 nosy: wolma priority: normal severity: normal status: open title: unnecessary copying of memoryview in gzip.GzipFile.write ? type: resource usage versions: Python 3.5
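The observation that zlib copes with memoryviews can be checked in isolation. This is a sketch on a current Python 3, with a made-up helper name; it is not the patch attached to the issue.

```python
import zlib

def checksum_and_compress(data):
    """Hypothetical helper: CRC and deflate a buffer without tobytes()."""
    view = memoryview(data)          # no copy of the underlying bytes
    length = view.nbytes             # byte length, independent of item size
    crc = zlib.crc32(view)
    compressor = zlib.compressobj()
    compressed = compressor.compress(view) + compressor.flush()
    return length, crc, compressed

payload = bytearray(b"spam and eggs " * 100)
length, crc, compressed = checksum_and_compress(payload)
```

The only place where the view differs from bytes here is the length calculation, which is the point made in the report.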
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write ?
Changes by Wolfgang Maier wolfgang.ma...@biologie.uni-freiburg.de: -- keywords: +patch Added file: http://bugs.python.org/file38521/memoryview_write.patch
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write ?
STINNER Victor added the comment: The patch looks good to me, but it lacks a unit test. Can you please add a simple unit test to ensure that it's possible to pass a memoryview to write(), and that the result is correct? (ex: uncompress and ensure that you get the same content) -- nosy: +haypo
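A round-trip check of the kind requested might look like the sketch below. It is not the test from the attached patch; it uses an in-memory file and a made-up payload, and assumes a Python version in which write() accepts a memoryview.

```python
import gzip
import io

def write_and_read_back(data):
    """Compress data through GzipFile.write() and decompress the result."""
    raw = io.BytesIO()
    with gzip.GzipFile(fileobj=raw, mode="wb") as f:
        written = f.write(data)
    return written, gzip.decompress(raw.getvalue())

payload = b"some repetitive test content\n" * 50
written, restored = write_and_read_back(memoryview(payload))
```

Comparing both the return value of write() and the decompressed bytes catches the case where the size is computed from items instead of bytes.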
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write ?
Changes by STINNER Victor victor.stin...@gmail.com: -- nosy: +serhiy.storchaka type: resource usage -> performance
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write ?
Wolfgang Maier added the comment: Here is a patch with memoryview tests. Are tests and code patches supposed to go in one file or separate ones ? -- Added file: http://bugs.python.org/file38526/test_memoryview_write.patch
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write ?
Wolfgang Maier added the comment:

> memoryview is converted to bytes because len() for memoryview returns a size
> of first dimension (a number of items for one-dimension view), not a number
> of bytes.
>
> >>> m = memoryview(array.array('I', [1, 2, 3]))
> >>> len(m)
> 3
> >>> len(m.tobytes())
> 12
> >>> len(m.cast('B'))
> 12

Right, I was aware of this. But are you saying that my proposed solution (using memoryview.nbytes) is wrong? If so, then cast is certainly an option and should still outperform tobytes.
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write ?
Wolfgang Maier added the comment: @Serhiy: Why would data = data.cast('B') be required ? When would the memoryview not be in 'B' format already ?
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write ?
Serhiy Storchaka added the comment:

memoryview is converted to bytes because len() for memoryview returns a size of first dimension (a number of items for one-dimension view), not a number of bytes.

    >>> m = memoryview(array.array('I', [1, 2, 3]))
    >>> len(m)
    3
    >>> len(m.tobytes())
    12
    >>> len(m.cast('B'))
    12
[issue23688] unnecessary copying of memoryview in gzip.GzipFile.write ?
STINNER Victor added the comment:

> Are tests and code patches supposed to go in one file or separate ones ?

It's more convenient to have a single patch with both changes.