[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-20 Thread Michael Fox
Michael Fox added the comment: I was thinking about this line: `end = self._buffer.find(b"\n", self._buffer_offset) + 1`. Might it be a bug? For example, is there a Unicode encoding in which one of the several bytes of a character is '\n'? In that case it would split the line in the middle of a character, right?

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-20 Thread Nadeem Vawda
Nadeem Vawda added the comment: No, that is the intended behavior for binary streams - they operate at the level of individual bytes. If you want to treat your input file as Unicode-encoded text, you should open it in text mode. This will return a TextIOWrapper which handles the decoding and line
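The binary/text distinction described here can be sketched as follows. This is only an illustration, not code from the thread: the sample file and its contents are made up, and the UTF-16 check at the end is an added example of an encoding in which the byte 0x0A can occur inside a character.

```python
import lzma
import os
import tempfile

# Hypothetical sample file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "sample.xz")
with lzma.open(path, "wt", encoding="utf-8", newline="\n") as f:
    f.write("naïve café\nsecond line\n")

# Binary mode (the default for lzma.open): lines are bytes, split at b"\n".
with lzma.open(path, "rb") as f:
    binary_lines = list(f)

# Text mode: a TextIOWrapper decodes the bytes and splits lines as text.
with lzma.open(path, "rt", encoding="utf-8") as f:
    text_lines = list(f)

# UTF-8 never uses the byte 0x0A inside a multi-byte character, but UTF-16
# can: U+010A encodes to b"\n\x01" in UTF-16-LE, so a byte-level split on
# b"\n" would cut that character in half.
utf16_char = "\u010a".encode("utf-16-le")
splits_inside_character = b"\n" in utf16_char
```

In binary mode the caller gets raw bytes and is responsible for any decoding; in text mode the line splitting happens after decoding, so it cannot land inside a character.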

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-20 Thread Michael Fox
Michael Fox added the comment: You're right. In fact, what doesn't make sense is to be doing line-oriented reads on a binary file. Why was I doing that? I do have another quibble though. The open() function is like this: open(file, mode='r', buffering=-1, encoding=None, errors=None,

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-20 Thread Michael Fox
Michael Fox added the comment: I thought of an even more hazardous case:

    if compression == 'gz':
        import gzip
        open = gzip.open
    elif compression == 'xz':
        import lzma
        open = lzma.open
    else:
        pass
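One way to defuse this hazard is to pass the mode explicitly, because the built-in open() defaults to text mode ('r') while gzip.open() and lzma.open() default to binary mode ('rb'). A minimal sketch follows; the helper name open_text and the sample files are hypothetical and do not come from the thread.

```python
import gzip
import lzma
import os
import tempfile

def open_text(path, compression=None, encoding="utf-8"):
    """Open a plain, gzip, or xz file in text mode (hypothetical helper)."""
    if compression == "gz":
        opener = gzip.open
    elif compression == "xz":
        opener = lzma.open
    else:
        opener = open
    # An explicit mode means all three openers return str lines.
    return opener(path, mode="rt", encoding=encoding)

# Demonstration on throwaway files.
tmpdir = tempfile.mkdtemp()
paths = {}
results = {}
for compression, writer in (("gz", gzip.open), ("xz", lzma.open), (None, open)):
    paths[compression] = os.path.join(tmpdir, "sample." + (compression or "txt"))
    with writer(paths[compression], "wt", encoding="utf-8") as f:
        f.write("first\nsecond\n")
    with open_text(paths[compression], compression) as f:
        results[compression] = f.readline()

# Without an explicit mode, lzma.open falls back to "rb" and yields bytes.
with lzma.open(paths["xz"]) as f:
    default_xz_line = f.readline()
```

With the explicit mode, swapping one opener for another no longer changes the type of the lines the caller receives.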

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-20 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Wrapping a raw LZMAFile in a BufferedReader is a simple solution. But I am thinking about extending BufferedReader so that LZMAFile and BufferedReader could use a common buffer. Perhaps add a new method to BufferedIOBase which will be called when a buffer is

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-20 Thread Michael Fox
Michael Fox added the comment: I thought about it some more and the only bug here is mine, failing to explicitly set mode='rt'. Maybe back when someone invented text and binary modes they should have been clear which was to be the default for all things. Maybe when someone made the base class,

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-19 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: I'm against implementing LZMAFile in pure C. It was a great win that LZMAFile was implemented in pure Python. However, maybe we could reuse the existing accelerated implementation of io.BufferedReader.

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-19 Thread Antoine Pitrou
Antoine Pitrou added the comment: I second Serhiy here. Wrapping the LZMAFile in a BufferedReader is the simple solution to the performance problem:

    ./python -m timeit -s "import lzma, io" "f=lzma.LZMAFile('words.xz', 'r')" "for line in f: pass"
    10 loops, best of 3: 148 msec per loop
    $ ./python

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-19 Thread Michael Fox
Michael Fox added the comment: io.BufferedReader works well for me. Thanks for the good suggestion. Now python 3.3 and 3.4 have similar performance to each other and they are only 2x slower than pyliblzma. From my perspective default wrapping with io.BufferedReader is a great idea. I can't

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-19 Thread Nadeem Vawda
Nadeem Vawda added the comment: I agree that making lzma.open() wrap its return value in a BufferedReader (or BufferedWriter, as appropriate) is the way to go. I'm currently travelling and don't have my SSH key with me - Serhiy, can you make the change? I'll put together a documentation patch

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-19 Thread Nadeem Vawda
Nadeem Vawda added the comment:

> I agree that making lzma.open() wrap its return value in a BufferedReader (or BufferedWriter, as appropriate) is the way to go.

On second thoughts, there's no need to change the behavior for mode='wb'. We can just return a BufferedReader for mode='rb', and

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-19 Thread Arfrever Frehtes Taifersar Arahesis
Changes by Arfrever Frehtes Taifersar Arahesis arfrever@gmail.com: -- nosy: +Arfrever ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue18003 ___

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-18 Thread Nadeem Vawda
Nadeem Vawda added the comment: Have you tried running the benchmark against the default (3.4) branch? There was some significant optimization work done in issue 16034, but the changes were not backported to 3.3.

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-18 Thread Michael Fox
Michael Fox added the comment: 3.4 is much better but still 4x slower than 2.7:

    m@air:~/q/topaz/parse_datalog$ time python2.7 lzmaperf.py
    102368
    real 0m0.053s
    user 0m0.052s
    sys 0m0.000s
    m@air:~/q/topaz/parse_datalog$ time ~/tmp/cpython-23836f17e4a2/bin/python3.4 lzmaperf.py
    102368

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-18 Thread Michael Fox
Michael Fox added the comment: I looked into it a little and it looks like pyliblzma is a pure C extension, whereas the new lzma library wraps liblzma but the rest is Python. In particular, this happens for every line:

    if size < 0:
        end = self._buffer.find(b"\n", self._buffer_offset)

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-18 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Try `f = io.BufferedReader(f)`. -- nosy: +serhiy.storchaka
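A minimal sketch of this suggestion is shown below. The file name, its contents, and the helper count_lines are assumptions made for the illustration; no timings are claimed here, since the results depend on the Python version and the data.

```python
import io
import lzma
import os
import tempfile

# Hypothetical test file with 1000 short lines.
path = os.path.join(tempfile.mkdtemp(), "bigfile.xz")
with lzma.open(path, "wb") as f:
    f.write(b"".join(b"line %d\n" % i for i in range(1000)))

def count_lines(path, buffered):
    """Count lines, optionally wrapping the LZMAFile in a BufferedReader."""
    raw = lzma.LZMAFile(path, "r")
    # BufferedReader does the line splitting in C instead of calling the
    # pure-Python LZMAFile.readline() once per line.
    f = io.BufferedReader(raw) if buffered else raw
    with f:  # closing the wrapper also closes the underlying LZMAFile
        return sum(1 for _ in f)

unbuffered_count = count_lines(path, buffered=False)
buffered_count = count_lines(path, buffered=True)
```

Both variants read the same lines; only the layer that performs the per-line work differs.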

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-18 Thread Raymond Hettinger
Raymond Hettinger added the comment: So, unless someone thinks that a pure C extension is the right technical direction, lzma in 3.4 is probably as fast as it's ever going to be. I would support the inclusion of a C extension. Reasonable performance is a prerequisite for broader adoption.

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-18 Thread Raymond Hettinger
Raymond Hettinger added the comment: Serhiy, would you like to take this one? -- assignee: -> serhiy.storchaka stage: -> needs patch versions: +Python 3.4 -Python 3.3

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-17 Thread Michael Fox
New submission from Michael Fox:

    import lzma
    count = 0
    f = lzma.LZMAFile('bigfile.xz', 'r')
    for line in f:
        count += 1
    print(count)

Comparing python2 with pyliblzma to python3.3.1 with the new lzma:

    m@air:~/q/topaz/parse_datalog$ time python lzmaperf.py
    102368
    real 0m0.062s
    user

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-17 Thread STINNER Victor
Changes by STINNER Victor victor.stin...@gmail.com: -- nosy: +haypo