[issue18003] lzma module very slow with line-oriented reading.

2015-06-09 Thread Antoine Pitrou

Antoine Pitrou added the comment:

He accepted it already:

A small last-minute optimization is not a release-blocker.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18003
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18003] lzma module very slow with line-oriented reading.

2015-06-09 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Larry, do you accept the patch for 3.5?

--




[issue18003] lzma module very slow with line-oriented reading.

2015-06-09 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

The patch is not so harmless. First, my change in BZ2File is not correct, 
because reading every line should be guarded with a lock (BZ2File is 
thread-safe). Second, for now all three compressed file classes are not only 
iterables but also iterators: iter(f) returns f, and changing this can have 
non-obvious effects. I think the patch is too complex for 3.5; we should take 
more time to analyze all the consequences.
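
To illustrate the second point, here is a minimal sketch of what "iterator, not just iterable" means for the three file classes. The sample payload and the helper name open_readers are invented for this illustration, and the check reflects the behaviour inherited from io.IOBase as far as I know it, not anything taken from the patch:

```python
import bz2
import gzip
import io
import lzma

# Sample payload, invented for this illustration.
payload = b"first line\nsecond line\n"

def open_readers(data):
    """Return one open reader per compression module, backed by memory."""
    return {
        "gzip": gzip.GzipFile(fileobj=io.BytesIO(gzip.compress(data))),
        "bz2": bz2.BZ2File(io.BytesIO(bz2.compress(data))),
        "lzma": lzma.LZMAFile(io.BytesIO(lzma.compress(data))),
    }

results = {}
for name, f in open_readers(payload).items():
    with f:
        # An iterator returns itself from iter() and supports next().
        results[name] = (iter(f) is f, next(f))

for name, (is_own_iterator, first_line) in results.items():
    print(name, is_own_iterator, first_line)
```

If __iter__() returned a different object (for example the internal buffer), the identity check would no longer hold, which is the kind of non-obvious change meant above.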

--
stage: commit review -> patch review
versions:  -Python 3.5




[issue18003] lzma module very slow with line-oriented reading.

2015-06-09 Thread Larry Hastings

Larry Hastings added the comment:

Sounds good to me.

--




[issue18003] lzma module very slow with line-oriented reading.

2015-06-09 Thread Martin Panter

Martin Panter added the comment:

This patch adds an entry to the What’s New for 3.5 (though maybe it will have 
to be 3.6), and adds three tests to check that next() raises ValueError when 
the files have been closed.

--
Added file: http://bugs.python.org/file39662/decomp-optim.v4.patch




[issue18003] lzma module very slow with line-oriented reading.

2015-06-09 Thread Martin Panter

Martin Panter added the comment:

The BufferedReader class is documented as being thread safe: 
https://docs.python.org/dev/library/io.html#multi-threading. Some 
experimentation suggests that checking the “raw.closed” property is not 
actually serialized, but that raw.readinto() calls are serialized. I don’t 
think this is a big problem in practice, so I think BZ2File would remain as 
thread-safe as BufferedReader is.

But Serhiy’s point is definitely valid about the classes breaking the iterator 
protocol. FWIW I originally made this patch to satisfy my personal curiosity 
about why wrapping a second BufferedReader made things so much faster. Now I 
accept it is partly due to the overhead of the extra LZMAFile.readline() to 
BufferedReader.readline() delegation. Maybe it is not even worth optimizing 
around this kind of overhead, so I would even be happy to drop this patch and 
close the issue.

--




[issue18003] lzma module very slow with line-oriented reading.

2015-06-07 Thread Antoine Pitrou

Antoine Pitrou added the comment:

This looks good to me.

--
stage: patch review -> commit review




[issue18003] lzma module very slow with line-oriented reading.

2015-06-07 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Perhaps this change is worth mentioning in What's New. Could you add this, Martin?

It would also be nice to add tests to ensure that next() after closing the file 
always raises ValueError.
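
A rough sketch of what such tests could look like, assuming in-memory sample data; the class name NextAfterCloseTest, the helper check() and the payload are invented here and are not taken from the patch:

```python
import bz2
import gzip
import io
import lzma
import unittest

DATA = b"spam\neggs\n"  # sample payload, invented for this sketch

class NextAfterCloseTest(unittest.TestCase):
    """next() on a closed compressed file should raise ValueError."""

    def check(self, f):
        # Reading works while the file is open...
        self.assertEqual(next(f), b"spam\n")
        f.close()
        # ...and fails with ValueError once it has been closed.
        with self.assertRaises(ValueError):
            next(f)

    def test_gzip(self):
        self.check(gzip.GzipFile(fileobj=io.BytesIO(gzip.compress(DATA))))

    def test_bz2(self):
        self.check(bz2.BZ2File(io.BytesIO(bz2.compress(DATA))))

    def test_lzma(self):
        self.check(lzma.LZMAFile(io.BytesIO(lzma.compress(DATA))))
```

Saved to a file, this can be run with "python -m unittest".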

--




[issue18003] lzma module very slow with line-oriented reading.

2015-06-06 Thread Antoine Pitrou

Changes by Antoine Pitrou pit...@free.fr:


--
priority: normal -> release blocker




[issue18003] lzma module very slow with line-oriented reading.

2015-06-06 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

bz2 would benefit greatly from such an optimization too.

Microbenchmark results:

$ ./python -m timeit -s "import gzip" -- "f=gzip.GzipFile('words.gz', 'r')" "for line in f: pass"
2.7: 10 loops, best of 3: 374 msec per loop
3.2: 10 loops, best of 3: 325 msec per loop
3.3: 10 loops, best of 3: 311 msec per loop
3.4: 10 loops, best of 3: 328 msec per loop
3.5: 10 loops, best of 3: 325 msec per loop
3.5+decomp-optim.v3: 10 loops, best of 3: 61.2 msec per loop

$ ./python -m timeit -s "import bz2" -- "f=bz2.BZ2File('words.bz2', 'r')" "for line in f: pass"
2.7: 10 loops, best of 3: 92.1 msec per loop
3.2: 10 loops, best of 3: 92.4 msec per loop
3.3: 10 loops, best of 3: 567 msec per loop
3.4: 10 loops, best of 3: 535 msec per loop
3.5: 10 loops, best of 3: 603 msec per loop
3.5+decomp-optim.v2: 10 loops, best of 3: 525 msec per loop
3.5+decomp-optim.v3: 10 loops, best of 3: 131 msec per loop

$ python -m timeit -s "import lzma" -- "f=lzma.LZMAFile('words.xz', 'r')" "for line in f: pass"
2.7: 10 loops, best of 3: 49.4 msec per loop
3.3: 10 loops, best of 3: 1.67 sec per loop
3.4: 10 loops, best of 3: 400 msec per loop
3.5: 10 loops, best of 3: 423 msec per loop
3.5+decomp-optim.v3: 10 loops, best of 3: 89.6 msec per loop

The fact that bz2 and lzma show a 5-15% regression in 3.5 (compared to 3.4) 
makes applying this patch to 3.5 more desirable.
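
For anyone wanting to repeat this kind of measurement without the 'words' files, here is a self-contained sketch using the timeit module on in-memory data. The generated payload, the CASES table and count_lines() are assumptions made for this illustration, so the absolute numbers will not match the figures above:

```python
import bz2
import gzip
import io
import lzma
import timeit

# Invented stand-in for the 'words' file: many short lines.
DATA = b"".join(b"word%d\n" % i for i in range(20000))

# Each entry: (compressed bytes, function opening a reader over them).
CASES = {
    "gzip": (gzip.compress(DATA),
             lambda blob: gzip.GzipFile(fileobj=io.BytesIO(blob))),
    "bz2": (bz2.compress(DATA),
            lambda blob: bz2.BZ2File(io.BytesIO(blob))),
    "lzma": (lzma.compress(DATA),
             lambda blob: lzma.LZMAFile(io.BytesIO(blob))),
}

def count_lines(opener, blob):
    """Open a fresh reader and iterate over it line by line."""
    with opener(blob) as f:
        return sum(1 for _ in f)

timings = {}
for name, (blob, opener) in CASES.items():
    # Best of 3 single runs, similar in spirit to timeit's "best of 3".
    runs = timeit.repeat(lambda: count_lines(opener, blob), number=1, repeat=3)
    timings[name] = min(runs)

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.1f} msec per loop")
```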

--
Added file: http://bugs.python.org/file39640/decomp-optim.v3.patch




[issue18003] lzma module very slow with line-oriented reading.

2015-06-03 Thread Martin Panter

Martin Panter added the comment:

Looking at https://bugs.python.org/file39586/decomp-optim.patch, the “closed” 
property is the first of the three hunks:

1. Adds @property / def closed(self) to Lib/_compression.py
2. Adds def __iter__(self) to Lib/gzip.py
3. Adds def __iter__(self) to Lib/lzma.py

--




[issue18003] lzma module very slow with line-oriented reading.

2015-06-03 Thread Martin Panter

Martin Panter added the comment:

New patch just fixes the spelling error in the comment.

--
stage: needs patch -> patch review
Added file: http://bugs.python.org/file39604/decomp-optim.v2.patch




[issue18003] lzma module very slow with line-oriented reading.

2015-06-03 Thread Larry Hastings

Larry Hastings added the comment:

I don't see anything about closed in the patch you posted.

--




[issue18003] lzma module very slow with line-oriented reading.

2015-06-03 Thread Antoine Pitrou

Antoine Pitrou added the comment:

Yes, this is right.

--




[issue18003] lzma module very slow with line-oriented reading.

2015-06-03 Thread Martin Panter

Martin Panter added the comment:

Yes that’s basically right Larry. The __iter__() was previously inherited; now 
I am overriding it with a custom version. Similarly for the “closed” property, 
but that one is only a member of objects internal to the gzip, lzma and bz2 
modules.

--




[issue18003] lzma module very slow with line-oriented reading.

2015-06-02 Thread Antoine Pitrou

Antoine Pitrou added the comment:

We were saying that you would probably have to validate this change, but that 
we could perhaps also sneak it quietly into the code base, since you don't read 
these messages.

--




[issue18003] lzma module very slow with line-oriented reading.

2015-06-02 Thread Larry Hastings

Larry Hastings added the comment:

What? I only understand French.

--




[issue18003] lzma module very slow with line-oriented reading.

2015-06-02 Thread Larry Hastings

Larry Hastings added the comment:

If I understand this correctly, I can ignore everything up to May 2015, as it 
has to do with line-reading a compressed binary file (!) being slow.

Then, Martin Panter proposes a new optimization in May 2015, which is to simply 
add __iter__ methods to gzip.GzipFile and lzma.LZMAFile.

Is this right?

--




[issue18003] lzma module very slow with line-oriented reading.

2015-06-02 Thread Antoine Pitrou

Antoine Pitrou added the comment:

This looks good to me. Larry would probably have to validate it for 3.5, 
although we may try to sneak it in (he isn't reading :-D).

--
nosy: +larry




[issue18003] lzma module very slow with line-oriented reading.

2015-06-01 Thread Martin Panter

Martin Panter added the comment:

This bug was originally raised against Python 3.3, and the speed has improved a 
lot since then. Perhaps this bug can be closed as it is, or maybe people would 
like to consider my decomp-optim.patch which squeezes a bit more speed out. I 
don’t actually have a strong opinion either way.

Python 3.4 was apparently much faster than 3.3 courtesy of Issue 16034. In 
Python 3.5, all three decompression modules (LZMA, gzip and bzip) now use a 
BufferedReader internally, due to my work in Issue 23529. The modules delegate 
method calls to the internal BufferedReader, rather than returning an instance 
directly, for backwards compatibility.

I found that bypassing the readline() delegation speeds things up 
significantly, and adding a custom “closed” property on the underlying raw 
reader class also helps. However, I did not think it would be wise to bypass 
the locking in the “bz2” module, so I didn’t bypass BZ2File.readline() in the 
patch. Timing results and the test script I used to investigate different 
options are below:

                         lzma      gzip      bz2
                         =======   =======   =======
Unpatched                3.2 s     2.513 s   5.180 s
Custom __iter__()        1.31 s    1.317 s   2.433 s
__iter__() and closed    0.53 s*   0.543 s*  1.650 s
closed change only                           4.047 s*
External BufferedReader  0.64 s    0.597 s   1.750 s
Direct from BytesIO      0.33 s    0.370 s   1.280 s
Command-line tool        0.063 s   0.053 s   0.993 s

* Option implemented in decomp-optim.patch

---

import lzma, io
filename = "pacman.log.xz"  # 256206 lines; 389 kB -> 13 MB

# Basic case
reader = lzma.LZMAFile(filename)  # 3.2 s

# Add __iter__() optimization
def lzma_iter(self):
    self._check_can_read()
    return iter(self._buffer)
lzma.LZMAFile.__iter__ = lzma_iter  # 1.31 s

# Add “closed” optimization
def decompressor_closed(self):
    return self._decompressor is None
import _compression
_compression.DecompressReader.closed = property(decompressor_closed)  # 0.53 s

#~ # External BufferedReader baseline
#~ reader = io.BufferedReader(lzma.LZMAFile(filename))  # 0.64 s

#~ # Direct from BytesIO baseline
#~ with open(filename, "rb") as file:
#~     data = file.read()
#~ reader = io.BytesIO(lzma.decompress(data))  # 0.33 s

for line in reader:
    pass

--
keywords: +patch
versions: +Python 3.5, Python 3.6 -Python 3.4
Added file: http://bugs.python.org/file39586/decomp-optim.patch




[issue18003] lzma module very slow with line-oriented reading.

2015-01-11 Thread Martin Panter

Martin Panter added the comment:

I haven’t done any tests, but my LZMAFile patch to Issue 15955 uses 
BufferedReader, so it might satisfy this issue.

--
nosy: +vadmium




[issue18003] lzma module very slow with line-oriented reading.

2013-09-20 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

See issue19051. Even a preliminary Python implementation noticeably speeds up 
the reading of short lines.

$ ./python -m timeit -s "import lzma, io" "f=lzma.LZMAFile('words.xz', 'r')" "for line in f: pass"

Unpatched: 1.44 sec per loop
Patched: 1.06 sec per loop

With a C implementation it should be as fast as with BufferedReader.

--




[issue18003] lzma module very slow with line-oriented reading.

2013-09-20 Thread Antoine Pitrou

Antoine Pitrou added the comment:

> With C implementation it should be as fast as with BufferedReader.

So why not simply use BufferedReader?

--




[issue18003] lzma module very slow with line-oriented reading.

2013-09-20 Thread Antoine Pitrou

Antoine Pitrou added the comment:

>> So why not simply use BufferedReader?
>
> Because we want good performance LZMAFile and compatibility with older
> versions.

You're reading me wrong. I'm simply suggesting that users interested in
readline() performance wrap LZMAFile in a BufferedReader. The
documentation can mention it.
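
A minimal sketch of that suggestion, using invented in-memory sample data so that it is self-contained (a real use would pass a filename to LZMAFile instead):

```python
import io
import lzma

# Invented sample data: many short lines, compressed in memory.
raw = b"".join(b"line %d\n" % i for i in range(1000))
compressed = lzma.compress(raw)

# Wrap the LZMAFile so that readline() and line iteration are served
# from BufferedReader's own buffer.
with io.BufferedReader(lzma.LZMAFile(io.BytesIO(compressed))) as reader:
    line_count = sum(1 for _ in reader)

print(line_count)
```

Closing the BufferedReader also closes the wrapped LZMAFile.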

> And I guess that it will be even faster than wrapping in
> BufferedReader (due to the avoiding of double buffering).

Let's wait for the numbers, then. The performance increase would have to
be quite large to justify such code duplication.

--




[issue18003] lzma module very slow with line-oriented reading.

2013-09-20 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

> So why not simply use BufferedReader?

Because we want good performance from LZMAFile and compatibility with older 
versions. And I guess that it will be even faster than wrapping in 
BufferedReader (because it avoids double buffering).

--




[issue18003] lzma module very slow with line-oriented reading.

2013-05-24 Thread Éric Araujo

Éric Araujo added the comment:

A higher-level interface to abstract differences between gzip, xz and others is 
actually provided in the tarfile module.  (zipfile is left out and its file 
objects have different methods, but that’s another issue.  shutil provides even 
higher-level functions to work on top of tarfile or zipfile.)
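
As a small illustration of that abstraction, here is a sketch in which the archive is built in memory and the member name and contents are invented. tarfile's "r:*" mode detects the compression when reading:

```python
import io
import tarfile

# Build a small .tar.xz archive in memory; the member name and contents
# are invented for this illustration.
content = b"alpha\nbeta\n"
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w:xz") as tar:
    info = tarfile.TarInfo(name="words.txt")
    info.size = len(content)
    tar.addfile(info, io.BytesIO(content))

# mode="r:*" asks tarfile to detect the compression (gzip, bz2 or xz).
archive.seek(0)
with tarfile.open(fileobj=archive, mode="r:*") as tar:
    member = tar.extractfile("words.txt")
    lines = member.readlines()

print(lines)
```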

--
nosy: +eric.araujo
title: New lzma crazy slow with line-oriented reading. -> lzma module very slow with line-oriented reading.
