[issue18003] lzma module very slow with line-oriented reading.
Antoine Pitrou added the comment: He accepted it already: A small last-minute optimization is not a release-blocker. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue18003 ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue18003] lzma module very slow with line-oriented reading.
Serhiy Storchaka added the comment: Larry, do you accept the patch for 3.5?
[issue18003] lzma module very slow with line-oriented reading.
Serhiy Storchaka added the comment: The patch is not so harmless. First, my change in BZ2File is not correct, because reading every line should be guarded with a lock (BZ2File is thread-safe). Second, for now all three compressed-file classes are not only iterables but also iterators: iter(f) returns f, and changing this can have non-obvious effects. I think the patch is too complex for 3.5; we should have more time to analyze all the consequences. -- stage: commit review -> patch review versions: -Python 3.5
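The iterator-versus-iterable concern above can be illustrated with a minimal sketch. `DelegatingReader` and `ShortcutReader` are hypothetical stand-ins invented for this example, not the real gzip, bz2 or lzma classes; they only mimic the pattern of delegating to an internal BufferedReader.

```python
import io


class DelegatingReader(io.BufferedIOBase):
    """Hypothetical stand-in that keeps the inherited __iter__ and __next__."""

    def __init__(self, data):
        self._buffer = io.BufferedReader(io.BytesIO(data))

    def readline(self, size=-1):
        # Each line requested through next() passes through this Python method.
        return self._buffer.readline(size)


class ShortcutReader(DelegatingReader):
    """Same reader, but __iter__ hands out the internal buffer directly."""

    def __iter__(self):
        return iter(self._buffer)


data = b"one\ntwo\nthree\n"
plain = DelegatingReader(data)
fast = ShortcutReader(data)

# Inherited __iter__ returns the object itself, so the object is an iterator.
plain_is_own_iterator = iter(plain) is plain
# The override returns a different object, so the reader is now only an iterable.
fast_is_own_iterator = iter(fast) is fast

plain_lines = list(plain)
fast_lines = list(fast)
print(plain_is_own_iterator, fast_is_own_iterator)
```

Both readers yield the same lines; what changes is the identity of the object returned by iter(), which is the behaviour code relying on `iter(f) is f` could notice.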
[issue18003] lzma module very slow with line-oriented reading.
Larry Hastings added the comment: Sounds good to me.
[issue18003] lzma module very slow with line-oriented reading.
Martin Panter added the comment: This patch adds an entry to the What’s New for 3.5 (though maybe it will have to be 3.6), and adds three tests to check that next() raises ValueError when the files have been closed. -- Added file: http://bugs.python.org/file39662/decomp-optim.v4.patch
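A check of this kind might look roughly like the sketch below. It is not the test code from the patch; the helper names and the in-memory payload are assumptions made for illustration.

```python
import bz2
import gzip
import io
import lzma

payload = b"first line\nsecond line\n"


def open_readers():
    """Return freshly opened readers over in-memory compressed copies of payload."""
    return {
        "gzip": gzip.GzipFile(fileobj=io.BytesIO(gzip.compress(payload))),
        "bz2": bz2.BZ2File(io.BytesIO(bz2.compress(payload))),
        "lzma": lzma.LZMAFile(io.BytesIO(lzma.compress(payload))),
    }


def next_after_close_raises(reader):
    """Read one line, close the reader, then report whether next() raises ValueError."""
    assert next(reader) == b"first line\n"
    reader.close()
    try:
        next(reader)
    except ValueError:
        return True
    return False


results = {name: next_after_close_raises(r) for name, r in open_readers().items()}
print(results)
```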
[issue18003] lzma module very slow with line-oriented reading.
Martin Panter added the comment: The BufferedReader class is documented as being thread safe: https://docs.python.org/dev/library/io.html#multi-threading. Some experimentation suggests that checking the “raw.closed” property is not actually serialized, but that raw.readinto() calls are serialized. I don’t think this is a big problem in practice, so I think BZ2File would remain as thread-safe as BufferedReader is. But Serhiy’s point is definitely valid about the classes breaking the iterator protocol. FWIW I originally made this patch to satisfy my personal curiosity about why wrapping a second BufferedReader made things so much faster. Now I accept it is partly due to the overhead of the extra LZMAFile.readline() to BufferedReader.readline() delegation. Maybe it is not even worth optimizing around this kind of overhead, so I would even be happy to drop this patch and close the issue.
[issue18003] lzma module very slow with line-oriented reading.
Antoine Pitrou added the comment: This looks good to me. -- stage: patch review -> commit review
[issue18003] lzma module very slow with line-oriented reading.
Serhiy Storchaka added the comment: Perhaps this change is worth mentioning in What's New. Could you add this, Martin? It would also be nice to add tests to ensure that next() after closing the file always raises ValueError.
[issue18003] lzma module very slow with line-oriented reading.
Changes by Antoine Pitrou pit...@free.fr: -- priority: normal -> release blocker
[issue18003] lzma module very slow with line-oriented reading.
Serhiy Storchaka added the comment: bz2 will benefit greatly from such an optimization too. Microbenchmark results:

$ ./python -m timeit -s "import gzip" -- "f=gzip.GzipFile('words.gz', 'r')" "for line in f: pass"
2.7: 10 loops, best of 3: 374 msec per loop
3.2: 10 loops, best of 3: 325 msec per loop
3.3: 10 loops, best of 3: 311 msec per loop
3.4: 10 loops, best of 3: 328 msec per loop
3.5: 10 loops, best of 3: 325 msec per loop
3.5+decomp-optim.v3: 10 loops, best of 3: 61.2 msec per loop

$ ./python -m timeit -s "import bz2" -- "f=bz2.BZ2File('words.bz2', 'r')" "for line in f: pass"
2.7: 10 loops, best of 3: 92.1 msec per loop
3.2: 10 loops, best of 3: 92.4 msec per loop
3.3: 10 loops, best of 3: 567 msec per loop
3.4: 10 loops, best of 3: 535 msec per loop
3.5: 10 loops, best of 3: 603 msec per loop
3.5+decomp-optim.v2: 10 loops, best of 3: 525 msec per loop
3.5+decomp-optim.v3: 10 loops, best of 3: 131 msec per loop

$ python -m timeit -s "import lzma" -- "f=lzma.LZMAFile('words.xz', 'r')" "for line in f: pass"
2.7: 10 loops, best of 3: 49.4 msec per loop
3.3: 10 loops, best of 3: 1.67 sec per loop
3.4: 10 loops, best of 3: 400 msec per loop
3.5: 10 loops, best of 3: 423 msec per loop
3.5+decomp-optim.v3: 10 loops, best of 3: 89.6 msec per loop

The fact that bz2 and lzma show a 5-15% regression in 3.5 (compared to 3.4) makes applying this patch to 3.5 more desirable. -- Added file: http://bugs.python.org/file39640/decomp-optim.v3.patch
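For readers without the words files used above, a comparable measurement can be sketched in a single script. The synthetic data, the `count_lines` helper and the in-memory streams are assumptions of this sketch, so its timings are not comparable with the figures quoted in the message.

```python
import bz2
import gzip
import io
import lzma
import timeit

# Synthetic stand-in for a word list: many short lines.
raw = b"".join(b"word%d\n" % i for i in range(20000))

# Compress once up front so that only decompression and line splitting are timed.
compressed = {
    "gzip": gzip.compress(raw),
    "bz2": bz2.compress(raw),
    "lzma": lzma.compress(raw),
}
openers = {
    "gzip": lambda data: gzip.GzipFile(fileobj=io.BytesIO(data)),
    "bz2": lambda data: bz2.BZ2File(io.BytesIO(data)),
    "lzma": lambda data: lzma.LZMAFile(io.BytesIO(data)),
}


def count_lines(name):
    """Iterate over the decompressed stream line by line and count the lines."""
    with openers[name](compressed[name]) as f:
        return sum(1 for _ in f)


counts = {}
for name in compressed:
    counts[name] = count_lines(name)
    best = min(timeit.repeat(lambda: count_lines(name), number=1, repeat=3))
    print(f"{name}: best of 3: {best * 1000:.1f} msec per loop")
```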
[issue18003] lzma module very slow with line-oriented reading.
Martin Panter added the comment: Looking at https://bugs.python.org/file39586/decomp-optim.patch, the “closed” property is the first of the three hunks:
1. Adds @property / def closed(self) to Lib/_compression.py
2. Adds def __iter__(self) to Lib/gzip.py
3. Adds def __iter__(self) to Lib/lzma.py
[issue18003] lzma module very slow with line-oriented reading.
Martin Panter added the comment: New patch just fixes the spelling error in the comment. -- stage: needs patch -> patch review Added file: http://bugs.python.org/file39604/decomp-optim.v2.patch
[issue18003] lzma module very slow with line-oriented reading.
Larry Hastings added the comment: I don't see anything about closed in the patch you posted.
[issue18003] lzma module very slow with line-oriented reading.
Antoine Pitrou added the comment: Yes, this is right.
[issue18003] lzma module very slow with line-oriented reading.
Martin Panter added the comment: Yes that’s basically right Larry. The __iter__() was previously inherited; now I am overriding it with a custom version. Similarly for the “closed” property, but that one is only a member of objects internal to the gzip, lzma and bz2 modules.
[issue18003] lzma module very slow with line-oriented reading.
Antoine Pitrou added the comment: We were saying that you would probably have to validate this change, but that we might also be able to sneak it quietly into the code base, since you are not reading these messages.
[issue18003] lzma module very slow with line-oriented reading.
Larry Hastings added the comment: What? I only understand French.
[issue18003] lzma module very slow with line-oriented reading.
Larry Hastings added the comment: If I understand this correctly, I can ignore everything up to May 2015, as it has to do with line-reading a compressed binary file (!) being slow. Then, Martin Panter proposes a new optimization in May 2015, which is to simply add __iter__ methods to gzip.GzipFile and lzma.LZMAFile. Is this right?
[issue18003] lzma module very slow with line-oriented reading.
Antoine Pitrou added the comment: This looks good to me. Larry would probably have to validate it for 3.5, although we may try to sneak it in (he isn't reading :-D). -- nosy: +larry
[issue18003] lzma module very slow with line-oriented reading.
Martin Panter added the comment: This bug was originally raised against Python 3.3, and the speed has improved a lot since then. Perhaps this bug can be closed as it is, or maybe people would like to consider my decomp-optim.patch which squeezes a bit more speed out. I don’t actually have a strong opinion either way.

Python 3.4 was apparently much faster than 3.3 courtesy of Issue 16034. In Python 3.5, all three decompression modules (LZMA, gzip and bzip) now use a BufferedReader internally, due to my work in Issue 23529. The modules delegate method calls to the internal BufferedReader, rather than returning an instance directly, for backwards compatibility.

I found that bypassing the readline() delegation speeds things up significantly, and adding a custom “closed” property on the underlying raw reader class also helps. However, I did not think it would be wise to bypass the locking in the “bz2” module, so I didn’t bypass BZ2File.readline() in the patch.

Timing results and a test script I used to investigate different options below:

                          lzma      gzip      bz2
                          =======   =======   =======
Unpatched                 3.2 s     2.513 s   5.180 s
Custom __iter__()         1.31 s    1.317 s   2.433 s
__iter__() and closed     0.53 s*   0.543 s*  1.650 s
closed change only                            4.047 s*
External BufferedReader   0.64 s    0.597 s   1.750 s
Direct from BytesIO       0.33 s    0.370 s   1.280 s
Command-line tool         0.063 s   0.053 s   0.993 s

* Option implemented in decomp-optim.patch

---

    import lzma, io
    filename = "pacman.log.xz"  # 256206 lines; 389 kB -> 13 MB

    # Basic case
    reader = lzma.LZMAFile(filename)  # 3.2 s

    # Add __iter__() optimization
    def lzma_iter(self):
        self._check_can_read()
        return iter(self._buffer)
    lzma.LZMAFile.__iter__ = lzma_iter  # 1.31 s

    # Add “closed” optimization
    def decompressor_closed(self):
        return self._decompressor is None
    import _compression
    _compression.DecompressReader.closed = property(decompressor_closed)  # 0.53 s

    #~ # External BufferedReader baseline
    #~ reader = io.BufferedReader(lzma.LZMAFile(filename))  # 0.64 s

    #~ # Direct from BytesIO baseline
    #~ with open(filename, "rb") as file:
    #~     data = file.read()
    #~ reader = io.BytesIO(lzma.decompress(data))  # 0.33 s

    for line in reader:
        pass

-- keywords: +patch versions: +Python 3.5, Python 3.6 -Python 3.4 Added file: http://bugs.python.org/file39586/decomp-optim.patch
[issue18003] lzma module very slow with line-oriented reading.
Martin Panter added the comment: I haven’t done any tests, but my LZMAFile patch to Issue 15955 uses BufferedReader, so it might satisfy this issue. -- nosy: +vadmium
[issue18003] lzma module very slow with line-oriented reading.
Serhiy Storchaka added the comment: See issue19051. Even a preliminary Python implementation noticeably speeds up the reading of short lines.

$ ./python -m timeit -s "import lzma, io" "f=lzma.LZMAFile('words.xz', 'r')" "for line in f: pass"
Unpatched: 1.44 sec per loop
Patched: 1.06 sec per loop

With a C implementation it should be as fast as with BufferedReader.
[issue18003] lzma module very slow with line-oriented reading.
Antoine Pitrou added the comment:
> With C implementation it should be as fast as with BufferedReader.
So why not simply use BufferedReader?
[issue18003] lzma module very slow with line-oriented reading.
Antoine Pitrou added the comment:
> > So why not simply use BufferedReader?
> Because we want good performance LZMAFile and compatibility with older versions.
You're reading me wrong. I'm simply suggesting that users interested in readline() performance wrap LZMAFile in a BufferedReader. The documentation can mention it.
> And I guess that it will be even faster than wrapping in BufferedReader (due to the avoiding of double buffering).
Let's wait for the numbers, then. The performance increase would have to be quite large to justify such code duplication.
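The wrapping suggested here can be sketched as follows. The in-memory stream stands in for a file on disk and is an assumption of this example; with a real file, the path would be passed to lzma.LZMAFile instead.

```python
import io
import lzma

# In-memory stand-in for an .xz file on disk (an assumption for this sketch).
compressed = lzma.compress(b"alpha\nbeta\ngamma\n")

# Wrap the LZMAFile so that line splitting is done by BufferedReader.
with io.BufferedReader(lzma.LZMAFile(io.BytesIO(compressed))) as reader:
    wrapped_lines = [line for line in reader]

print(wrapped_lines)
```

Closing the BufferedReader also closes the wrapped LZMAFile, so a single with block is enough.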
[issue18003] lzma module very slow with line-oriented reading.
Serhiy Storchaka added the comment:
> So why not simply use BufferedReader?
Because we want good performance from LZMAFile and compatibility with older versions. And I guess that it will be even faster than wrapping in BufferedReader (because it avoids double buffering).
[issue18003] lzma module very slow with line-oriented reading.
Éric Araujo added the comment: A higher-level interface to abstract differences between gzip, xz and others is actually provided in the tarfile module. (zipfile is left out and its file objects have different methods, but that’s another issue. shutil provides even higher-level functions to work on top of tarfile or zipfile.) -- nosy: +eric.araujo title: New lzma crazy slow with line-oriented reading. -> lzma module very slow with line-oriented reading.
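As a rough illustration of that abstraction, the sketch below reads the same member from gzip, bzip2 and xz compressed tar archives through one code path. The helper names, the member name and the in-memory archives are assumptions made for the example.

```python
import io
import tarfile

payload = b"hello from the archive\n"


def make_archive(mode):
    """Build a one-member tar archive in memory using the given write mode."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tar:
        info = tarfile.TarInfo(name="greeting.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_member(archive_bytes):
    """Open with transparent compression detection and return the member's bytes."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        return tar.extractfile("greeting.txt").read()


contents = {mode: read_member(make_archive(mode)) for mode in ("w:gz", "w:bz2", "w:xz")}
print(contents)
```

The reading side never names a compression format: mode "r:*" lets tarfile detect it.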