Michael Fox added the comment:
I was thinking about this line:
    end = self._buffer.find(b"\n", self._buffer_offset) + 1
Might it be a bug? For example, is there a Unicode encoding where one of a
character's several bytes is '\n'? In that case it splits the line in the
middle of a character, right?
On Sun, May 19, 2013 at
Nadeem Vawda added the comment:
No, that is the intended behavior for binary streams - they operate at
the level of individual bytes. If you want to treat your input file as
Unicode-encoded text, you should open it in text mode. This will return a
TextIOWrapper which handles the decoding and line
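
To make the binary-versus-text distinction concrete, here is a minimal sketch that uses in-memory streams rather than an actual LZMAFile. It assumes UTF-16-LE input, where U+010A is encoded as the bytes 0x0A 0x01; in UTF-8 this cannot happen, because every byte of a multi-byte sequence is 0x80 or above. The sample string is made up for illustration.

```python
import io

# U+010A encodes in UTF-16-LE as the bytes 0x0A 0x01, so its first
# byte is identical to b"\n".
text = "a\u010ab\n"
raw = text.encode("utf-16-le")

# Binary stream: lines are split on the byte 0x0A, whatever the encoding,
# so the character U+010A is cut in half.
binary_lines = io.BytesIO(raw).readlines()

# Text stream: TextIOWrapper decodes first and only then looks for newlines.
text_lines = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-16-le").readlines()

print(binary_lines)
print(text_lines)
```

The binary read yields three fragments, none of which decodes cleanly on its own, while the text read yields the single intended line.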
Michael Fox added the comment:
You're right. In fact, what doesn't make sense is to be doing
line-oriented reads on a binary file. Why was I doing that?
I do have another quibble though. The open() function is like this:
open(file, mode='r', buffering=-1, encoding=None,
errors=None,
Michael Fox added the comment:
I thought of an even more hazardous case:
    if compression == 'gz':
        import gzip
        open = gzip.open
    elif compression == 'xz':
        import lzma
        open = lzma.open
    else:
        pass
On Mon, May 20, 2013 at 9:41 AM, Michael Fox rep...@bugs.python.org wrote:
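
The hazard in the snippet above is that the builtin open() defaults to text mode while gzip.open() and lzma.open() default to binary mode, so the same call returns str in one branch and bytes in the others. One way to avoid this, sketched here with a hypothetical helper named open_text (not part of any library), is to pass the mode explicitly in every branch.

```python
import gzip
import lzma
import os
import tempfile


def open_text(path, compression=None, encoding="utf-8"):
    """Hypothetical helper: open for reading as text, whatever the backend."""
    if compression == "gz":
        opener = gzip.open
    elif compression == "xz":
        opener = lzma.open
    else:
        opener = open
    # gzip.open and lzma.open default to mode="rb"; the builtin open
    # defaults to mode="r" (text). Spelling out "rt" removes the mismatch.
    return opener(path, mode="rt", encoding=encoding)


with tempfile.TemporaryDirectory() as tmp:
    xz_path = os.path.join(tmp, "sample.xz")  # throwaway sample file
    with lzma.open(xz_path, "wt", encoding="utf-8") as f:
        f.write("one\ntwo\n")

    with lzma.open(xz_path) as f:          # default mode: binary
        default_line = f.readline()
    with open_text(xz_path, "xz") as f:    # explicit text mode
        text_line = f.readline()

print(type(default_line).__name__, type(text_line).__name__)
```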
Serhiy Storchaka added the comment:
Wrapping a raw LZMAFile in a BufferedReader is a simple solution. But I am
thinking about extending BufferedReader so that LZMAFile and BufferedReader could use a
common buffer. Perhaps add a new method to BufferedIOBase which will be called
when a buffer is
Michael Fox added the comment:
I thought about it some more and the only bug here is mine, failing to
explicitly set mode='rt'.
Maybe back when someone invented text and binary modes they should
have been clear which was to be the default for all things. Maybe when
someone made the base class,
Serhiy Storchaka added the comment:
I'm against implementing LZMAFile in pure C. It was a great win that LZMAFile
was implemented in pure Python. However, maybe we could reuse the existing
accelerated implementation of io.BufferedReader.
Antoine Pitrou added the comment:
I second Serhiy here. Wrapping the LZMAFile in a BufferedReader is the simple
solution to the performance problem:
$ ./python -m timeit -s "import lzma, io" "f=lzma.LZMAFile('words.xz', 'r')" "for line in f: pass"
10 loops, best of 3: 148 msec per loop
$ ./python
Michael Fox added the comment:
io.BufferedReader works well for me. Thanks for the good suggestion.
Now python 3.3 and 3.4 have similar performance to each other and they
are only 2x slower than pyliblzma.
From my perspective default wrapping with io.BufferedReader is a great
idea. I can't
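
As a rough illustration of the wrapping discussed in this thread, the sketch below reads the same file once through a bare LZMAFile and once through io.BufferedReader. The file name and contents are made up, and no timing is claimed: the speedup reported here was measured on the 3.3 and 3.4 builds of the time and may not hold on later versions.

```python
import io
import lzma
import os
import tempfile


def count_lines(fileobj):
    """Iterate line by line, as the benchmark in this issue does."""
    return sum(1 for _ in fileobj)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "words.xz")  # throwaway sample file
    with lzma.open(path, "wb") as f:
        f.write(b"".join(b"word%d\n" % i for i in range(1000)))

    # Unwrapped: every readline() goes through LZMAFile itself.
    with lzma.LZMAFile(path, "r") as raw:
        plain_count = count_lines(raw)

    # Wrapped: BufferedReader handles line splitting and pulls larger
    # chunks from the LZMAFile. Closing the wrapper closes the LZMAFile.
    with io.BufferedReader(lzma.LZMAFile(path, "r")) as buffered:
        buffered_count = count_lines(buffered)

print(plain_count, buffered_count)
```

Both paths return the same lines; only the layer that does the per-line work differs.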
Nadeem Vawda added the comment:
I agree that making lzma.open() wrap its return value in a BufferedReader
(or BufferedWriter, as appropriate) is the way to go. I'm currently
travelling and don't have my SSH key with me - Serhiy, can you make the
change?
I'll put together a documentation patch
Nadeem Vawda added the comment:
> I agree that making lzma.open() wrap its return value in a BufferedReader
> (or BufferedWriter, as appropriate) is the way to go.
On second thoughts, there's no need to change the behavior for mode='wb'.
We can just return a BufferedReader for mode='rb', and
Changes by Arfrever Frehtes Taifersar Arahesis arfrever@gmail.com:
--
nosy: +Arfrever
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18003
___
Nadeem Vawda added the comment:
Have you tried running the benchmark against the default (3.4) branch?
There was some significant optimization work done in issue 16034, but
the changes were not backported to 3.3.
Michael Fox added the comment:
3.4 is much better but still 4x slower than 2.7
m@air:~/q/topaz/parse_datalog$ time python2.7 lzmaperf.py
102368
real 0m0.053s
user 0m0.052s
sys 0m0.000s
m@air:~/q/topaz/parse_datalog$ time ~/tmp/cpython-23836f17e4a2/bin/python3.4 lzmaperf.py
102368
Michael Fox added the comment:
I looked into it a little and it looks like pyliblzma is a pure C
extension, whereas the new lzma library wraps liblzma but the rest is
Python. In particular, this happens for every line:
    if size < 0:
        end = self._buffer.find(b"\n", self._buffer_offset)
Serhiy Storchaka added the comment:
Try `f = io.BufferedReader(f)`.
--
nosy: +serhiy.storchaka
Raymond Hettinger added the comment:
So, unless someone thinks that a pure C extension is the
right technical direction, lzma in 3.4 is probably as fast
as it's ever going to be.
I would support the inclusion of a C extension. Reasonable performance is a
prerequisite for broader adoption.
Raymond Hettinger added the comment:
Serhiy, would you like to take this one?
--
assignee: -> serhiy.storchaka
stage: -> needs patch
versions: +Python 3.4 -Python 3.3
New submission from Michael Fox:
import lzma
count = 0
f = lzma.LZMAFile('bigfile.xz', 'r')
for line in f:
    count += 1
print(count)
Comparing python2 with pyliblzma to python3.3.1 with the new lzma:
m@air:~/q/topaz/parse_datalog$ time python lzmaperf.py
102368
real 0m0.062s
user
Changes by STINNER Victor victor.stin...@gmail.com:
--
nosy: +haypo
Python-bugs-list