[issue43451] pydoc terminal suboptimal rendering of complex annotations
Change by David Wilson : -- keywords: +patch pull_requests: +23570 stage: -> patch review pull_request: https://github.com/python/cpython/pull/24802 ___ Python tracker <https://bugs.python.org/issue43451> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue43451] pydoc terminal suboptimal rendering of complex annotations
New submission from David Wilson : When viewing docs for classes that use annotations, pydoc's rendering of argument lists is regularly truncated at the terminal edge (if using `less -S`), or wrapped in a manner where quickly scanning the output is next to impossible. My 'classic' example is 'pydoc aiohttp.client.ClientSession', where the __init__ argument list wraps to 24 lines. The pull request attached to this issue works around the problem by formatting arguments one-per-line if the sum of the arguments would exceed a hard-coded width of 150 characters. It is more of an RFC than a suggested fix: it produces acceptable formatting, but I'm not sure where the correct fix should be made -- in the inspect module, or somehow in the pydoc module -- or even what the correct fix should be. I will attach a before/after screenshot to the pull request. -- components: Library (Lib) messages: 388373 nosy: dw priority: normal severity: normal status: open title: pydoc terminal suboptimal rendering of complex annotations versions: Python 3.10 ___ Python tracker <https://bugs.python.org/issue43451> ___
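The workaround described above can be sketched roughly as follows. Note this is an illustration of the approach, not the actual pydoc internals: format_args is an invented helper name, and 150 stands in for the hard-coded threshold mentioned in the report.

```python
def format_args(name, args, max_width=150):
    # Hypothetical helper mirroring the PR's approach: render the
    # argument list on one line unless the result would exceed
    # max_width, in which case fall back to one argument per line.
    one_line = '%s(%s)' % (name, ', '.join(args))
    if len(one_line) <= max_width:
        return one_line
    indent = ' ' * 4
    return '%s(\n%s%s\n)' % (name, indent, (',\n' + indent).join(args))
```

A short signature stays on one line; a ClientSession-style annotated signature gets one argument per line instead of wrapping at the terminal edge.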
[issue37696] FileIO.read() on a closed TTY throws an exception prematurely
David Wilson added the comment: A real example of where returning the partial buffer is dangerous would be EBADF.

- Repeated reading succeeds, building up a partial buffer. Another thread runs, and buggy code causes an unrelated fd belonging to the file object to be closed.
- The original thread resumes and calls read(), which now returns EBADF.

If the partial buffer is returned and EBADF is masked, and the user only ever calls readall() once, a potentially deadly IO corruption bug is completely hidden in their code. I think the correct behaviour in the case of a 'bad' errno must remain that the partial buffer is discarded; the interface does not seem to make room for any safer or better option. So I think to reach the desired outcome of this ticket, the suggested approach is to add special handling for a small list of errnos generally accepted to unambiguously mean EOF, and in that special case, allow returning the 'partial' (actually complete) buffer. -- ___ Python tracker <https://bugs.python.org/issue37696> ___
[issue37696] FileIO.read() on a closed TTY throws an exception prematurely
David Wilson added the comment: If we treat different errnos specially, the list of 'okay to silently fail' errors seems quite succinct. In another project I treat EIO, EPIPE and ECONNRESET as EOF, and raise all others -- https://github.com/dw/mitogen/blob/c6de090f083a58344e91ab97847bf7ae3feb9134/mitogen/core.py#L501-L532 But even in this case, if readall() returns the partial buffer given a 'good' errno, does it still discard the partial buffer given a 'bad' one? That still doesn't feel right. -- ___ Python tracker <https://bugs.python.org/issue37696> ___
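The suggested special-casing might look like this sketch. The errno set matches the one described above; read_all and the 8192-byte chunk size are illustrative stand-ins, not the fileio.c implementation:

```python
import errno
import os

# Errnos generally accepted to mean the peer is gone, i.e. EOF
# (the same set used in the mitogen link above).
_EOF_ERRNOS = {errno.EIO, errno.EPIPE, errno.ECONNRESET}

def read_all(fd):
    # Accumulate chunks until EOF. If a read fails with an 'EOF-like'
    # errno, return the partial (actually complete) buffer rather than
    # discarding it; any other ('bad') errno propagates as usual.
    chunks = []
    while True:
        try:
            chunk = os.read(fd, 8192)
        except OSError as e:
            if e.errno in _EOF_ERRNOS:
                break
            raise
        if not chunk:
            break
        chunks.append(chunk)
    return b''.join(chunks)
```

Against the PTY reproduction from this ticket, read_all() returns b'foo' instead of raising once the slave side is closed.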
[issue37696] FileIO.read() on a closed TTY throws an exception prematurely
David Wilson added the comment: Interesting, this immediately turns into a little rabbit hole :) The reason read() is failing in this case is that Argument Clinic defaults the size parameter to -1, which redirects the call to readall(). So this issue is actually about readall(). Calling read(8) in the previous reproduction returns the buffer up to the point of EIO as expected. readall() is somewhat awkward. It is not really expected that a user would call it twice, and so the old semantic of the second read() returning EIO doesn't seem to apply cleanly. So >=2 issues:

- readall() is discarding the partial buffer, which seems unavoidably like a bug
- readall() does not intuitively feel like a function you might want to call twice
- nothing in fileio.c or fileutils.c makes any attempt to understand errno, except for handling EINTR.

I'm guessing the 2.x behaviour in this case was that no parsing of errno was done either: any error was silently discarded when a partial buffer existed. But that leaves the awkward possibility that some real scary error occurred, for example EBADF or EFAULT, and the single call by the user to readall() never flagged it. Opinions? -- ___ Python tracker <https://bugs.python.org/issue37696> ___
[issue37696] FileIO.read() on a closed TTY throws an exception prematurely
David Wilson added the comment: Happy to send a patch for this if we can agree on the semantic being incorrect, and more importantly, someone is happy to review the patch once it reaches GitHub ;) -- ___ Python tracker <https://bugs.python.org/issue37696> ___
[issue37696] FileIO.read() on a closed TTY throws an exception prematurely
New submission from David Wilson : Given:

$ cat tty-failure.py
import pty
import os

master, slave = pty.openpty()
master = os.fdopen(master, 'r+b', 0)
slave = os.fdopen(slave, 'r+b', 0)
slave.write(b'foo')
slave.close()
print(master.read())

On Python 2, read() would return b'foo', with subsequent calls raising IOError, whereas on Python 3 an OSError is raised due to the underlying file descriptor returning EIO. In the case of a PTY, EIO indicates the remote side has hung up and can more or less be treated as an EOF indicator. On Python 3 the partial buffer should not be discarded when a subsequent read() syscall returns an error. Secondarily, the change from IOError to OSError looks wrong. Does anyone know what's going on there? I would never expect to see OSError raised by a builtin. -- components: IO messages: 348578 nosy: dw priority: normal severity: normal status: open title: FileIO.read() on a closed TTY throws an exception prematurely type: behavior versions: Python 3.9 ___ Python tracker <https://bugs.python.org/issue37696> ___
[issue36279] os.wait3() leaks some uninitialized stack when no processes exist
David Wilson added the comment: The original diff is attached here (per the old process) so others can find it, and the PR+fork are closed, as carrying a fork in my GitHub for 4 months has non-zero cost. I'm presently more interested in having a clean GH account than carrying around the baggage of forgotten old patches. -- resolution: -> out of date stage: patch review -> resolved status: open -> closed Added file: https://bugs.python.org/file48507/c805d2764e6a0ee4d22a338c5f4fef6154df8687.diff ___ Python tracker <https://bugs.python.org/issue36279> ___
[issue36279] os.wait3() leaks some uninitialized stack when no processes exist
New submission from David Wilson : Not sure if this is worth reporting...

p = os.popen('sleep 1234')
os.wait3(os.WNOHANG)
os.wait3(os.WNOHANG)
os.wait3(os.WNOHANG)

Notice the struct rusage return value. When wait3() succeeds on Linux but no child was waiting to be reaped, the struct rusage is not updated by the kernel, and is therefore passed uninitialized back to the user, essentially leaking a little bit of stack memory. -- messages: 337833 nosy: dw priority: normal severity: normal status: open title: os.wait3() leaks some uninitialized stack when no processes exist ___ Python tracker <https://bugs.python.org/issue36279> ___
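Pending a fix at the C level, callers can avoid acting on the garbage rusage with a defensive wrapper along these lines. wait3_safe is an invented name, not anything in the stdlib:

```python
import os

def wait3_safe(options):
    # Hypothetical wrapper: when wait3() returns pid 0 (WNOHANG and no
    # child was ready to be reaped), the kernel never filled in struct
    # rusage, so discard it rather than hand back whatever uninitialized
    # stack memory happened to be there.
    pid, status, rusage = os.wait3(options)
    if pid == 0:
        rusage = None
    return pid, status, rusage
```

With a still-running child, wait3_safe(os.WNOHANG) returns (0, 0, None) instead of (0, 0, &lt;garbage rusage&gt;).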
[issue35486] subprocess module import hooks breaks back compatibility
David Wilson added the comment: Hi Nick, The purpose of ModuleNotFoundError is clear and unrelated to the problem documented here. The problem is that due to the introduction of ModuleNotFoundError, ImportError's semantics have been changed within a minor release in a breaking, backwards-incompatible manner. As of Python 3.5, ImportError meant both "ImportError" and "ModuleNotFoundError". As of 3.6, the senses are distinct, and thus code written against ImportError as it existed in 3.5 no longer works correctly as of 3.6. -- ___ Python tracker <https://bugs.python.org/issue35486> ___
[issue35486] subprocess module breaks backwards compatibility with import hooks
David Wilson added the comment: Having considered this for a few hours, it seems the following is clear:

- 3.6 introduces ModuleNotFoundError
- 3.7 begins using it within importlib
- 3.8 contains the first instance of stdlib code catching it
- PEP-302 and PEP-451 are the definitive specifications for how sys.meta_path is supposed to work, and neither makes mention of ModuleNotFoundError. It seems at least this should have been flagged during review of the original change, but apparently the name of the exception was more important.
- The recent work done to move most of the import machinery on to sys.meta_path has exposed a set of import hooks that don't quite comply with the documented import hook interface
- The newly advertised ModuleNotFoundError appearing in stack traces for everyone means that more people will continue to write new cases of "except ModuleNotFoundError:", which, while following best practice (catch the most specific relevant exception), at present amounts to relying on an implementation detail of the default importer. A GitHub search reveals this to be an accurate reading: https://github.com/search?q=%22except+ModuleNotFoundError%22&type=Code

Therefore we are in a position where:

- Natural developer habit will cause much more protocol-violating code to exist over time; there is no option to stop this process
- New import hooks written against the existing documentation will break in the face of developer habit
- Existing import hooks were broken in Python 3.6 and this is not documented anywhere
- Python's own import machinery contravenes its specification

Options:

- Updating PEP-302 to mention the introduction of the new exception type
- Updating the ModuleNotFoundError/ImportError documentation to somehow capture the compatibility issue

-- ___ Python tracker <https://bugs.python.org/issue35486> ___
[issue35486] subprocess module breaks backwards compatibility with import hooks
Change by David Wilson : -- title: subprocess module breaks backwards compatibility with older import hooks -> subprocess module breaks backwards compatibility with import hooks ___ Python tracker <https://bugs.python.org/issue35486> ___
[issue35486] subprocess module breaks backwards compatibility with older import hooks
David Wilson added the comment: Corrected GitHub link for the commit: https://github.com/python/cpython/commit/880d42a3b24 -- ___ Python tracker <https://bugs.python.org/issue35486> ___
[issue35486] subprocess module breaks backwards compatibility with older import hooks
New submission from David Wilson : The subprocess package since 880d42a3b24 / September 2018 has begun using this idiom:

try:
    import _foo
except ModuleNotFoundError:
    bar

However, ModuleNotFoundError may not be thrown by older import hook implementations, since that subclass was only introduced in Python 3.6 -- and so the test above will fail. PEP-302 continues to document ImportError as the appropriate exception that should be raised. https://mitogen.readthedocs.io/en/stable/ is one such import hook that lazily loads packages over the network when they aren't available locally. Current subprocess master breaks with Mitogen, because when the hook discovers the master cannot satisfy the import, it throws ImportError. The new exception subtype was introduced in https://bugs.python.org/issue15767 , however very little in the way of rationale was included, and so it's unclear to me what need the new subtype is addressing, whether this is a problem with the subprocess module or the subtype as a whole, or indeed whether any of this should be considered a bug. It seems clear that some kind of regression is in the process of occurring during a minor release, and it also seems clear the new subtype will potentially spawn a whole variety of similar new regressions. I will be updating Mitogen to throw the new subtype if it is available, but I felt it was important to report the regression to see what others think. -- components: Library (Lib) messages: 331774 nosy: dw priority: normal severity: normal status: open title: subprocess module breaks backwards compatibility with older import hooks versions: Python 3.8 ___ Python tracker <https://bugs.python.org/issue35486> ___
[issue26601] Use new madvise()'s MADV_FREE on the private heap
David Wilson added the comment: @Julian note that ARENA_SIZE is double the threshold after which at least glibc resorts to calling mmap directly, so using malloc in place of mmap on at least Linux would have zero effect -- nosy: +dw ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue26601> ___
[issue22003] BytesIO copy-on-write
David Wilson added the comment: Hi Piotr, There wasn't an obvious fix that didn't involve changing the buffer interface itself. There is presently ambiguity in the interface regarding the difference between a read-only buffer and an immutable buffer, which is crucial for its use in this case. Fixing the interface, followed by every buffer interface user, is a significantly more complicated task than simply optimizing for the most common case, as done here. FWIW I still think this work is worth doing, though I personally don't have time to approach it just now. We could (and possibly should) approach fixing e.g. mmap.mmap() hashability, possibly causing user code regressions, but even if such cases were fixed it still wouldn't be enough to rely on for the optimization implemented here. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue22003 ___
[issue22003] BytesIO copy-on-write
David Wilson added the comment: Attached trivial patch for whatsnew.rst. -- Added file: http://bugs.python.org/file38058/whatsnew.diff ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue22003 ___
[issue14099] ZipFile.open() should not reopen the underlying file
David Wilson added the comment: Could we also make a small tweak to zipfile.rst indicating the new behaviour? I had made an initial attempt in my patch but wasn't particularly happy with the wording. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue14099 ___
[issue14099] ZipFile.open() should not reopen the underlying file
David Wilson added the comment: Sounds great :) -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue14099 ___
[issue14099] ZipFile.open() should not reopen the underlying file
David Wilson added the comment: While in spirit this is a bug fix, it's reasonably complex and affects a popular module -- I'm not sure it should be applied to 2.x, and probably not in a minor release of 3.x either. Would it make sense to include as part of 3.5? (That said, I'd love to see this fixed in 2.x ;)) -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue14099 ___
[issue14099] ZipFile.open() should not reopen the underlying file
David Wilson added the comment: Hi Serhiy, Thanks for the new patch, it looks better than my attempt. :) -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue14099 ___
[issue14099] ZipFile.open() should not reopen the underlying file
David Wilson added the comment: Per my comment on issue16569, the overhead of performing one seek before each (raw file data) read is quite minimal. I have attached a new (but incomplete) patch, on which the following microbenchmarks are based. The patch is essentially identical to Stepan's 2012 patch, except I haven't yet decided how best to preserve the semantics of ZipFile.close(). my.zip is the same my.zip from issue22842. It contains 10,000 files each containing 10 bytes over 2 lines. my2.zip contains 8,000 files each containing the same copy of 64kb of /dev/urandom output. The resulting ZIP is 500mb. For each test, the first run is the existing zipfile module, and the second run is with the patch. In summary:

* There is a 35% perf increase in str mode when handling many small files (on OS X at least)
* There is a 6% perf decrease in file mode when handling small sequential reads.
* There is a 2.4% perf decrease in file mode when handling large sequential reads.

From my reading of zipfile.py, it is clear there are _many_ ways to improve its performance (probably starting with readline()), and rejection of a functional fix should almost certainly be at the bottom of that list. For each of the tests below, the functions used were:

def a():
    """Test concurrent line reads to a str mode ZipFile."""
    zf = zipfile.ZipFile('my2.zip')
    members = [zf.open(n) for n in zf.namelist()]
    for m in members:
        m.readline()
    for m in members:
        m.readline()

def c():
    """Test sequential small reads to a str mode ZipFile."""
    zf = zipfile.ZipFile('my2.zip')
    for name in zf.namelist():
        with zf.open(name) as zfp:
            zfp.read(1000)

def d():
    """Test sequential small reads to a file mode ZipFile."""
    fp = open('my2.zip', 'rb')
    zf = zipfile.ZipFile(fp)
    for name in zf.namelist():
        with zf.open(name) as zfp:
            zfp.read(1000)

def e():
    """Test sequential large reads to a file mode ZipFile."""
    fp = open('my2.zip', 'rb')
    zf = zipfile.ZipFile(fp)
    for name in zf.namelist():
        with zf.open(name) as zfp:
            zfp.read()

my.zip

$ python3.4 -m timeit -s 'import my' 'my.a()'
10 loops, best of 3: 1.47 sec per loop
$ python3.4 -m timeit -s 'import my' 'my.a()'
10 loops, best of 3: 950 msec per loop
---
$ python3.4 -m timeit -s 'import my' 'my.c()'
10 loops, best of 3: 1.3 sec per loop
$ python3.4 -m timeit -s 'import my' 'my.c()'
10 loops, best of 3: 865 msec per loop
---
$ python3.4 -m timeit -s 'import my' 'my.d()'
10 loops, best of 3: 800 msec per loop
$ python3.4 -m timeit -s 'import my' 'my.d()'
10 loops, best of 3: 851 msec per loop

my2.zip

$ python3.4 -m timeit -s 'import my' 'my.a()'
10 loops, best of 3: 1.46 sec per loop
$ python3.4 -m timeit -s 'import my' 'my.a()'
10 loops, best of 3: 1.16 sec per loop
---
$ python3.4 -m timeit -s 'import my' 'my.c()'
10 loops, best of 3: 1.13 sec per loop
$ python3.4 -m timeit -s 'import my' 'my.c()'
10 loops, best of 3: 892 msec per loop
---
$ python3.4 -m timeit -s 'import my' 'my.d()'
10 loops, best of 3: 842 msec per loop
$ python3.4 -m timeit -s 'import my' 'my.d()'
10 loops, best of 3: 882 msec per loop
---
$ python3.4 -m timeit -s 'import my' 'my.e()'
10 loops, best of 3: 1.65 sec per loop
$ python3.4 -m timeit -s 'import my' 'my.e()'
10 loops, best of 3: 1.69 sec per loop

-- nosy: +dw Added file: http://bugs.python.org/file37191/zipfile_concurrent_read_1.diff ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue14099 ___
[issue22842] zipfile simultaneous open broken and/or needlessly(?) consumes unreasonable number of file descriptors
New submission from David Wilson: There is some really funky behaviour in the zipfile module, where, depending on whether zipfile.ZipFile() is passed a string filename or a file-like object, one of two things happens:

a) Given a file-like object, zipfile does not (since it cannot) consume excess file descriptors on each call to '.open()', however simultaneous calls to .open() the zip file's members (from the same thread) will produce file-like objects for each member that appear intertwingled in some unfortunate manner:

Traceback (most recent call last):
  File "my.py", line 23, in <module>
    b()
  File "my.py", line 18, in b
    m.readline()
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 689, in readline
    return io.BufferedIOBase.readline(self, limit)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 727, in peek
    chunk = self.read(n)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 763, in read
    data = self._read1(n)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 839, in _read1
    data = self._decompressor.decompress(data, n)
zlib.error: Error -3 while decompressing data: invalid stored block lengths

b) Given a string filename, simultaneous use of .open() produces a new file descriptor for each opened member, which does not result in the above error, but triggers an even worse one: file descriptor exhaustion given a sufficiently large zip file. This tripped me up rather badly last week during consulting work, and I'd like to see both these behaviours fixed somehow.

The ticket is more an RFC to see if anyone has thoughts on how this fix should happen. It seems to me a no-brainer that, since the ZIP file format fundamentally always requires a seekable file, in both the constructed-from-a-file-like-object case and the constructed-from-a-filename case we should somehow reuse the sole file object passed to us to satisfy all reads of compressed member data. It seems the problems can be fixed in both cases without damaging interface semantics by simply tracking the expected 'current' read offset in each ZipExtFile instance: prior to any read, we simply call .seek() on the file object. Of course the result would not be thread-safe, but at least in the current code, ZipExtFile for a zipfile constructed from a file-like object is already not thread-safe. With some additional work we could make the module thread-safe in both cases, however this is not the current semantic and doesn't appear to be guaranteed by the module documentation.

---

Finally, as to why you'd want to simultaneously open huge numbers of ZIP members: well, ZIP itself easily supports streamy reads, and ZIP files can be quite large, even larger than RAM. So it should be possible, as I needed last week, to read streamily from a large number of members.

---

The attached my.zip is sufficient to demonstrate both problems. The attached my.py has function a() to demonstrate the FD leak and b() to demonstrate the intertwingled state. -- components: Library (Lib) files: mymy.zip messages: 230987 nosy: dw priority: normal severity: normal status: open title: zipfile simultaneous open broken and/or needlessly(?) consumes unreasonable number of file descriptors versions: Python 3.4 Added file: http://bugs.python.org/file37171/mymy.zip ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue22842 ___
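The seek-before-read idea can be demonstrated outside zipfile. SharedFileReader is an invented stand-in for ZipExtFile's raw-read layer, not actual zipfile code:

```python
import io

class SharedFileReader:
    # Many readers share one underlying file object; each remembers its
    # own offset and seeks before every read, so interleaved reads from
    # the same thread never see each other's position. Like the zipfile
    # code it models, this is deliberately not thread-safe.
    def __init__(self, fileobj, start):
        self._file = fileobj
        self._pos = start

    def read(self, n):
        self._file.seek(self._pos)
        data = self._file.read(n)
        self._pos += len(data)
        return data
```

Two readers over one buffer can then be read alternately without the intertwingled results shown in the traceback above, and without opening a second file descriptor.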
[issue22842] zipfile simultaneous open broken and/or needlessly(?) consumes unreasonable number of file descriptors
David Wilson added the comment: As a random side-note, this is another case where I really wish Python had a .pread() function. It's uniquely valuable for coordinating positioned reads in a threaded app without synchronization (at user level, anyway) or extraneous system calls. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue22842 ___
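For what it's worth, positioned reads do exist at the fd level on POSIX as os.pread(), which reads at an absolute offset without moving the descriptor's shared file position; it just isn't surfaced on file objects. A quick illustration:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b'hello world')
    # os.pread(fd, length, offset): no lseek(), and the fd's shared
    # position is left untouched, so threads need no coordination.
    assert os.pread(fd, 5, 6) == b'world'
    assert os.pread(fd, 5, 0) == b'hello'
finally:
    os.close(fd)
    os.unlink(path)
```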
[issue16569] Preventing errors of simultaneous access in zipfile
David Wilson added the comment: Compared to the cost of everything else ZipExtFile must do (e.g. 4kb string concatenation in a loop, zlib), it's surprising that lseek() would be measurable at all. The attached file 'patch' is the minimal change I tested. It represents, in terms of computation and system call overhead, all that is required to implement the seek-before-read solution to simultaneous access. On OSX, churning over every member of every ZIP in my downloads directory (about 400mb worth), this change results in around 0.9% overhead compared to the original module. Subsequently I'm strongly against the patch here. It is in effect papering over an implementation deficiency of the current zipfile module, one that could easily and cheaply be addressed. (My comment on this ticket is in the context of the now-marked-duplicate issue22842.) -- nosy: +dw Added file: http://bugs.python.org/file37172/patch ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue16569 ___
[issue15381] Optimize BytesIO to do less reallocations when written, similarly to StringIO
David Wilson added the comment: Hey Serhiy, The implementation of your readline optimization seems less contentious (and less risky) than the remainder of the patch -- it could perhaps be easily split off into a separate patch, which may be far more easily committed. I love the concept of this patch, although from my last reading (weeks ago), it's slightly scary that it relies on Py_REFCNT() to know whether to mutate a string or not. In principle this should never break; in practice, however, it is uncertain that there are no strange edge cases that aren't immediately obvious. The _PyBytes_Resize doc is quite clear: "Only use this to build up a brand new bytes object; don't use this if the bytes may already be known in other parts of the code." -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue15381 ___
[issue22125] Cure signedness warnings introduced by #22003
New submission from David Wilson: The attached patch (hopefully) silences the signedness warnings generated by Visual Studio and reported on python-dev in https://mail.python.org/pipermail/python-dev/2014-July/135603.html. This was sloppiness on my part; I even noted the problem in the original ticket and never fixed it. :) I don't have a local dev environment setup for MSVC and Python, but at least the attached patch cures the signedness errors on Clang. They don't seem to occur at all with GCC on my Mac. The added casts ensure comparisons uniformly compare in the unsigned domain. It seems size_t buf_size is pretty redundant in the original struct: it just introduces lots of casting, when it only appears to be required during write_bytes() to avoid signed overflow (undefined behaviour). -- components: Library (Lib) files: cow-sign.patch keywords: patch messages: 224593 nosy: dw, pitrou, zach.ware priority: normal severity: normal status: open title: Cure signedness warnings introduced by #22003 type: compile error versions: Python 3.5 Added file: http://bugs.python.org/file36217/cow-sign.patch ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue22125 ___
[issue22003] BytesIO copy-on-write
David Wilson added the comment: I suspect it's all covered now, but is there anything else I can help with to get this patch pushed along its merry way? -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue22003 ___
[issue22003] BytesIO copy-on-write
David Wilson added the comment: Newest patch incorporates Antoine's review comments. The final benchmark results are below. Just curious, what causes e.g. telco to differ up to 7% between runs? That's really huge.

Report on Linux k2 3.14-1-amd64 #1 SMP Debian 3.14.9-1 (2014-06-30) x86_64
Total CPU cores: 4

### call_method_slots ###
Min: 0.329869 -> 0.340487: 1.03x slower
Avg: 0.330512 -> 0.341786: 1.03x slower
Significant (t=-216.69)
Stddev: 0.00067 -> 0.00060: 1.x smaller

### call_method_unknown ###
Min: 0.351167 -> 0.343961: 1.02x faster
Avg: 0.351731 -> 0.344580: 1.02x faster
Significant (t=238.89)
Stddev: 0.00033 -> 0.00040: 1.2271x larger

### call_simple ###
Min: 0.257487 -> 0.277366: 1.08x slower
Avg: 0.257942 -> 0.277809: 1.08x slower
Significant (t=-845.64)
Stddev: 0.00029 -> 0.00029: 1.0126x smaller

### etree_generate ###
Min: 0.377985 -> 0.365952: 1.03x faster
Avg: 0.381797 -> 0.369452: 1.03x faster
Significant (t=31.15)
Stddev: 0.00314 -> 0.00241: 1.3017x smaller

### etree_iterparse ###
Min: 0.545668 -> 0.565437: 1.04x slower
Avg: 0.554188 -> 0.576807: 1.04x slower
Significant (t=-17.00)
Stddev: 0.00925 -> 0.00956: 1.0340x larger

### etree_process ###
Min: 0.294158 -> 0.286617: 1.03x faster
Avg: 0.296354 -> 0.288877: 1.03x faster
Significant (t=36.22)
Stddev: 0.00149 -> 0.00143: 1.0435x smaller

### fastpickle ###
Min: 0.458961 -> 0.475828: 1.04x slower
Avg: 0.460226 -> 0.481228: 1.05x slower
Significant (t=-109.38)
Stddev: 0.00082 -> 0.00173: 2.1051x larger

### nqueens ###
Min: 0.305883 -> 0.295858: 1.03x faster
Avg: 0.308085 -> 0.297755: 1.03x faster
Significant (t=90.22)
Stddev: 0.00077 -> 0.00085: 1.0942x larger

### silent_logging ###
Min: 0.074152 -> 0.075818: 1.02x slower
Avg: 0.074345 -> 0.076005: 1.02x slower
Significant (t=-96.29)
Stddev: 0.00013 -> 0.00012: 1.0975x smaller

### spectral_norm ###
Min: 0.355738 -> 0.364419: 1.02x slower
Avg: 0.356691 -> 0.365764: 1.03x slower
Significant (t=-126.23)
Stddev: 0.00054 -> 0.00047: 1.1533x smaller

### telco ###
Min: 0.012152 -> 0.013038: 1.07x slower
Avg: 0.012264 -> 0.013157: 1.07x slower
Significant (t=-83.98)
Stddev: 0.8 -> 0.7: 1.0653x smaller

The following not significant results are hidden, use -v to show them: 2to3, call_method, chaos, django_v2, etree_parse, fannkuch, fastunpickle, float, formatted_logging, go, hexiom2, iterative_count, json_dump, json_dump_v2, json_load, mako, mako_v2, meteor_contest, nbody, normal_startup, pathlib, pickle_dict, pickle_list, pidigits, raytrace, regex_compile, regex_effbot, regex_v8, richards, simple_logging, startup_nosite, threaded_count, tornado_http, unpack_sequence, unpickle_list. -- Added file: http://bugs.python.org/file36137/cow6.patch ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue22003 ___
[issue22003] BytesIO copy-on-write
David Wilson added the comment:

Hey Antoine,

Thanks for the link. I'm having trouble getting reproducible results at present, and running out of ideas as to what might be causing it. Even after totally isolating a CPU for e.g. django_v2 and with frequency scaling disabled, numbers still jump around for the same binary by as much as 3%. I could not detect any significant change between runs of the old and new binary that could not be described as noise, given the isolation issues above.
[issue22003] BytesIO copy-on-write
David Wilson added the comment:

This new patch abandons the buffer interface and specializes for Bytes per the comments on this issue. Anyone care to glance at least at the general structure? Tests could probably use a little more work.

The microbenchmark seems fine, at least for construction. It doesn't seem likely this patch would introduce severe performance trouble elsewhere, but I'd like to try it out with some example heavy BytesIO consumers (any suggestions? Some popular template engine?)

cpython] ./python.exe -m timeit -s 'import i' 'i.readlines()'
lines: 54471
100 loops, best of 3: 13.3 msec per loop
[23:52:55 eldil!58 cpython] ./python-nocow -m timeit -s 'import i' 'i.readlines()'
lines: 54471
10 loops, best of 3: 19.6 msec per loop
[23:52:59 eldil!59 cpython] cat i.py
import io
word = b'word'
line = (word * int(79/len(word))) + b'\n'
ar = line * int((4 * 1048576) / len(line))
def readlines():
    return len(list(io.BytesIO(ar)))
print('lines: %s' % (readlines(),))

--
Added file: http://bugs.python.org/file36078/cow5.patch
[issue22003] BytesIO copy-on-write
David Wilson added the comment:

Stefan, I like your new idea. If there isn't some backwards compatibility argument about mmap.mmap being hashable, then it could be considered a bug, and fixed in the same hypothetical future release that includes this BytesIO change. The only cost now is that to test for hashability, we must hash the object, which causes every byte in it to be touched (aka. almost 50% of the cost of a copy).

If however we can't fix mmap.mmap due to the interface change (I think that's a silly idea -- Python has never been about letting the user shoot themselves in the foot), then the specialized-for-Bytes approach is almost as good (and perhaps even better, since the resulting concept and structure layout is more aligned with Serhiy's patch in issue15381).

tl;dr:

a) mmap.mmap can be fixed:
   - use hashability as a strong test for immutability (instead of an ugly heuristic involving buffer flags)
   - undecided: is calling hash(obj) to check for immutability too costly?

b) mmap.mmap can't be fixed:
   - use the Bytes specialization approach.
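The hashability probe discussed above can be sketched in a few lines of Python (a toy illustration, not part of the patch; `is_hashable` is a hypothetical helper name):

```python
def is_hashable(obj):
    # Hashable objects are supposed to be immutable, so a successful
    # hash() is used here as a proxy for "safe to share". Note that
    # hashing a bytes-like object touches every byte, which is the
    # ~50%-of-a-copy cost discussed above.
    try:
        hash(obj)
        return True
    except TypeError:
        return False

print(is_hashable(b"abc"))        # bytes is immutable and hashable
print(is_hashable(bytearray(3)))  # bytearray is mutable, hash() raises
```

As the message notes, mmap.mmap objects are hashable today, so they would wrongly pass this probe unless that is also fixed.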
[issue22003] BytesIO copy-on-write
David Wilson added the comment:

Stefan, thanks for digging here. As much as I'd love to follow this interpretation, it simply doesn't match existing buffer implementations, including within the standard library.

For example, mmap.mmap(..., flags=mmap.MAP_SHARED, prot=mmap.PROT_READ) will produce a read-only buffer, yet mutability is entirely at the whim of the operating system. In this case, immutability may be apparent for years, until some machine has memory pressure, causing the shared mapping to be flushed, and refreshed from (say, incoherent NFS storage) on next access.

I thought it would be worth auditing some of the most popular types of buffer just to check your interpretation, and this was the first, most obvious candidate. Any thoughts? I'm leaning heavily toward the Bytes specialization approach.
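The hazard above can be demonstrated directly (a POSIX-only sketch using a throwaway temp file; variable names are illustrative): the mapping exports a read-only buffer, yet a writer to the underlying file still changes the bytes behind it.

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"hello world")
    # Shared read-only mapping: ACCESS_READ maps with PROT_READ on POSIX.
    m = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
    view = memoryview(m)
    readonly = view.readonly    # the buffer is exported as read-only...
    os.pwrite(fd, b"HELLO", 0)  # ...but the file behind it can change
    after = m[:5]
    view.release()              # must release before closing the mmap
    m.close()
finally:
    os.close(fd)
    os.unlink(path)

print(readonly, after)
```

On Linux a MAP_SHARED mapping is coherent with write(), so `after` reflects the new file contents even though the exported buffer claimed to be read-only: exactly the read-only-but-not-immutable distinction being argued here.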
[issue22003] BytesIO copy-on-write
David Wilson added the comment:

I'm not sure how much work it would be, or even if it could be made sufficient to solve our problem, but what about extending the buffer interface to include an int `stable` flag, defaulting to 0? It seems, though, that it would just be making the already arcane buffer interface even more arcane simply for the benefit of our specific use case.
[issue22003] BytesIO copy-on-write
Changes by David Wilson d...@botanicus.net:

Added file: http://bugs.python.org/file36016/cow4.patch
[issue22003] BytesIO copy-on-write
David Wilson added the comment:

Hi Stefan,

How does this approach in reinit() look? We first ask for a writable buffer, and if the object obliges, immediately copy it. Otherwise, if it refused, ask for a read-only buffer, and this time expect that it will never change.

This still does not catch the case of mmap.mmap. I am not sure how to deal with mmap.mmap. There is no way for it to export PROT_READ as a read-only buffer without permitting mutation, so the only options seem to be either a) remove buffer support from mmap, or b) blacklist it in bytesio(!).

Antoine, I have padded out the unit tests a little. test_memoryio.py seems the best place for them. Also modified test_sizeof(), although the way this test is designed seems inherently brittle to begin with. Now it is also sensitive to changes in the Py_buffer struct.

Various other changes:

* __new__ once again returns a valid, open, empty BytesIO, since the alternative breaks pickling.
* reinit() preserves existing BytesIO state until it knows it can succeed, which fixes another of the pickle tests.
* setstate() had CHECK_CLOSED() re-added, again for the pickle tests.

Probably the patch guts could be rearranged again, since the definition of the functions is no longer as clear as it was in cow3.patch.
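The reinit() strategy above, rendered as a rough Python sketch (the real code is C inside the _io module; `reinit_sketch` and its return convention are made up for illustration):

```python
def reinit_sketch(obj):
    # Ask the exporter for a buffer. A writable exporter may mutate the
    # memory later, so copy it immediately; a read-only exporter is
    # shared on the expectation that it never changes.
    view = memoryview(obj)
    if view.readonly:
        return view, True    # share the read-only buffer
    data = bytes(view)       # writable: copy right away
    view.release()
    return data, False

buf1, shared1 = reinit_sketch(b"immutable source")    # shared, no copy
buf2, shared2 = reinit_sketch(bytearray(b"mutable"))  # copied
```

As the message says, this heuristic is still fooled by mmap: its buffer reports read-only yet is not actually immutable.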
[issue22003] BytesIO copy-on-write
David Wilson added the comment:

This version is tidied up enough that I think it could be reviewed. Changes are:

* Defer `buf' allocation until __init__, rather than __new__ as was previously done. Now upon completion, BytesIO.__new__ returns a valid, closed BytesIO, whereas previously a valid, empty, open BytesIO was returned. Is this interface change permissible?
* Move __init__ guts into a reinit(), for sharing with __setstate__, which also previously caused an unnecessary copy. Additionally gather up various methods for deallocating buffers into a single reset() function, called by reinit(), _dealloc(), and _close().
* Per Stefan's suggested approach, reinit() now explicitly checks for a read-only buffer, falling back to silently performing a copy if the returned buffer is read-write. That seems vastly preferable to throwing an exception, which would probably be another interface change.
* Call unshare() any place the buffer is about to be modified. If the buffer needs to be made private, it also accepts a size hint indicating how much less/more space the subsequent operation needs, to avoid a redundant reallocation after the unsharing.

Outstanding issues:

* I don't understand why buf_size is a size_t, and I'm certain the casting in unshare() is incorrect somehow. Is it simply to avoid signed overflow?

--
Added file: http://bugs.python.org/file36004/cow2.patch
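The unshare()-with-size-hint idea above can be modelled in a toy Python class (names and structure hypothetical; the actual patch does this in C with raw buffers): the object borrows the shared bytes until the first mutation, then copies exactly once, pre-sized for the pending write.

```python
class CowBuf:
    """Toy model of a copy-on-write buffer with an unshare size hint."""

    def __init__(self, data: bytes):
        self._shared = data  # borrowed, never written
        self._own = None     # private, writable storage after unsharing

    def write_at(self, pos, data):
        end = pos + len(data)
        if self._own is None:
            # Unshare: one copy, allocated large enough for the pending
            # write so we do not immediately reallocate again.
            self._own = bytearray(max(len(self._shared), end))
            self._own[: len(self._shared)] = self._shared
            self._shared = None
        elif end > len(self._own):
            self._own.extend(b"\0" * (end - len(self._own)))
        self._own[pos:end] = data

    def getvalue(self):
        src = self._own if self._own is not None else self._shared
        return bytes(src)

buf = CowBuf(b"hello")
first = buf.getvalue()        # still sharing the original bytes
buf.write_at(5, b" world")    # first mutation triggers the single copy
second = buf.getvalue()
```

The size hint matters because without it the copy would be sized to the old contents and the very next write would force a second reallocation.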
[issue22003] BytesIO copy-on-write
David Wilson added the comment:

New patch also calls unshare() during getbuffer().

--
Added file: http://bugs.python.org/file36005/cow3.patch
[issue22003] BytesIO copy-on-write
David Wilson added the comment:

I'm not sure the read-only buffer test is strong enough: having a readonly view is not a guarantee that the data in the view cannot be changed through some other means, i.e. it is read-only, not immutable.

Pretty sure this approach is broken. What about the alternative approach of specializing for Bytes?
[issue22003] BytesIO copy-on-write
David Wilson added the comment:

Good catch :( There doesn't seem to be a way to ask for an immutable buffer, so perhaps it could just be a little more selective. I think the majority of use cases would still be covered if the sharing behaviour was restricted only to BytesType.

In that case Py_buffer initialdata could become a PyObject*, saving a small amount of memory, and allowing reuse of the struct member if BytesIO was also modified to directly write into a private BytesObject.
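The bytes-only sharing rule being proposed can be sketched at the Python level (a hypothetical helper for illustration; the real change lives in C inside BytesIO): share only when the initializer is exactly bytes, and defensively copy everything else.

```python
def initial_buffer(initial):
    # Share only exact bytes objects: bytes is immutable, so the shared
    # buffer can never change underneath the BytesIO. Subclasses and
    # other buffer exporters (bytearray, mmap, ...) get a defensive copy.
    if type(initial) is bytes:
        return initial
    return bytes(initial)

payload = b"payload"
print(initial_buffer(payload) is payload)         # shared, zero-copy
print(initial_buffer(bytearray(b"x")) == b"x")    # copied
```

Note the `type(...) is bytes` check rather than `isinstance`: a bytes subclass could define extra mutable state, so only the exact type is trusted.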
[issue22003] BytesIO copy-on-write
New submission from David Wilson:

This is a followup to the thread at https://mail.python.org/pipermail/python-dev/2014-July/135543.html , discussing the existing behaviour of BytesIO copying its source object, and how this regresses compared to cStringIO.StringI.

The goal of posting the patch on list was to try and stimulate discussion around the approach. The patch itself obviously isn't ready for review, and I'm not in a position to dedicate time to it just now (although in a few weeks I'd love to give it full attention!).

Ignoring this quick implementation, are there any general comments around the approach? My only concern is that it might keep large objects alive in a non-intuitive way in certain circumstances, though I can't think of any obvious ones immediately.

Also interested in comments on the second half of that thread: a natural extension of this is to do something very similar on the write side: instead of generating a temporary private heap allocation, generate (and freely resize) a private PyBytes object until it is exposed to the user, at which point getvalue() returns it, and converts it into an IO_SHARED buffer. There are quite a few interactions with making that work correctly, in particular:

* How BytesIO would implement the buffer interface without causing the under-construction Bytes to become readonly
* Avoiding redundant copies and resizes -- we can't simply tack 25% slack on the end of the Bytes and then truncate it during getvalue() without likely triggering a copy and move, however with careful measurement of allocator behavior there are various tradeoffs that could be made - e.g. obmalloc won't move a 500 byte allocation if it shrinks by 25%. glibc malloc's rules are a bit more complex though. Could also add a private _PyBytes_SetSize() API to allow truncation to the final size during getvalue() without informing the allocator. Then we'd simply overallocate by up to 10% or 1-2kb, and write off the loss of the slack space.
Notably, this approach completely differs from the one documented in http://bugs.python.org/issue15381 -- it's not clear to me which is better.

--
components: Library (Lib)
files: cow.patch
keywords: patch
messages: 223383
nosy: dw
priority: normal
severity: normal
status: open
title: BytesIO copy-on-write
type: performance
versions: Python 3.5
Added file: http://bugs.python.org/file35988/cow.patch
[issue22003] BytesIO copy-on-write
David Wilson added the comment:

Submitted contributor agreement. Please consider the demo patch licensed under the Apache 2 licence.