[issue15955] gzip, bz2, lzma: add option to limit output size

2015-02-26 Thread Martin Panter

Martin Panter added the comment:

Yes I think that is a good idea:

* Moved gzip-bomb.patch to Issue 23528
* Opened Issue 23529 with a new version of the LZMAFile patch, and plan to post 
a BZ2File patch there as well when it is ready

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue15955
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue15955] gzip, bz2, lzma: add option to limit output size

2015-02-26 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 10dab1b596d4 by Antoine Pitrou in branch 'default':
Issue #15955: Add an option to limit the output size in bz2.decompress().
https://hg.python.org/cpython/rev/10dab1b596d4

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2015-02-25 Thread Nikolaus Rath

Nikolaus Rath added the comment:

Martin, I'll try to review your GzipFile patch. But maybe it would make sense 
to open a separate issue for this?

I think the LZMAFile patch has not yet been reviewed or committed, and we 
probably want a patch for BZ2File too. The review page is already pretty 
cluttered right now, so I think it would make sense to use this issue for the 
low-level compressor/decompressor API and handle the *File API separately.

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2015-02-23 Thread Martin Panter

Martin Panter added the comment:

Rev3 still seems to have the same weird indentation as rev2. Are you using some 
sort of diff --ignore-all-space option by accident?

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2015-02-22 Thread Nikolaus Rath

Nikolaus Rath added the comment:

Updated patch attached.

--
Added file: http://bugs.python.org/file38206/issue15955_bz2_rev2.diff




[issue15955] gzip, bz2, lzma: add option to limit output size

2015-02-21 Thread Martin Panter

Martin Panter added the comment:

The new bzip parameter still needs documenting in the library reference.

However, I reviewed the doc string, C code, and tests, and I can’t find 
anything wrong.

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2015-02-20 Thread Nikolaus Rath

Nikolaus Rath added the comment:

Attached is a patch for the bz2 module.

--
Added file: http://bugs.python.org/file38194/issue15955_bz2.diff




[issue15955] gzip, bz2, lzma: add option to limit output size

2015-02-19 Thread Nikolaus Rath

Nikolaus Rath added the comment:

I've started to work on the bz2 module. I'll attach a work-in-progress patch 
if I get stuck or run out of time.

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2015-01-31 Thread Nikolaus Rath

Nikolaus Rath added the comment:

Yes, I still plan to do it, but I haven't started yet.

That said, I certainly won't be offended if someone else implements the feature 
either. Please just let me know when you start working on this (I'll do the 
same), so there's no duplication of efforts.

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2015-01-31 Thread Nikolaus Rath

Nikolaus Rath added the comment:

On a more practical note: 

I believe Nadeem at one point said that the bz2 module is not exactly an 
example of good stdlib coding, while the lzma module's implementation is quite 
clean.

Therefore, one idea I had for the bz2 module was to port the lzma module to 
use the bzip2 library (instead of adding a max_size parameter to the bz2 
module). Just something to consider if you find time to work on this before me.

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2015-01-30 Thread Martin Panter

Martin Panter added the comment:

Nikolaus, do you still plan on doing the bzip module? If not, I could have a go 
when I get a chance. I’m also keen for the GzipFile decompression to be fixed, 
if anyone wants to review my gzip-bomb.patch.

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2015-01-17 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 00e552a23bcc by Antoine Pitrou in branch 'default':
Issue #15955: Add an option to limit output size when decompressing LZMA data.
https://hg.python.org/cpython/rev/00e552a23bcc

--
nosy: +python-dev




[issue15955] gzip, bz2, lzma: add option to limit output size

2015-01-17 Thread Antoine Pitrou

Antoine Pitrou added the comment:

I've committed the LZMADecompressor patch. Now let's tackle the rest.

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2015-01-16 Thread Antoine Pitrou

Antoine Pitrou added the comment:

I've posted a review of the latest lzma decompressor patch. I think it'll be 
able to go in once the comments are addressed.

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2015-01-16 Thread Martin Panter

Martin Panter added the comment:

Adding issue15955_lzma_r5.diff. Main changes from r4:

* Consistent Py_ssize_t type for data_size
* max_size → max_length to match Python parameter name
* Arranged for EOF handling to occur before, and instead of, saving the input 
buffer
* Removed my LZMAFile test

Nikolaus, I hope I am not getting in your way by updating this patch. I thought 
I should take responsibility for removing the test I accidentally added in r4.

--
Added file: http://bugs.python.org/file37734/issue15955_lzma_r5.diff




[issue15955] gzip, bz2, lzma: add option to limit output size

2015-01-13 Thread Nikolaus Rath

Nikolaus Rath added the comment:

Martin Panter rep...@bugs.python.org writes:
> We still need a patch for max_length in BZ2Decompressor, and to use it
> in BZ2File.

I'm still quite interested to do this. The only reason I haven't done it
yet is that I'm waiting for the LZMA patch to be reviewed and committed
first (no need to make any mistakes twice).

Best,
-Nikolaus

-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

 »Time flies like an arrow, fruit flies like a Banana.«

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2015-01-11 Thread Martin Panter

Martin Panter added the comment:

We still need a patch for max_length in BZ2Decompressor, and to use it in 
BZ2File. Also, I think my GzipFile patch should be considered as a bug fix 
(rather than enhancement), e.g. the fix for Issue 16043 assumes 
GzipFile.read(size) is limited.

I’m adding v4 of Nikolaus’s low-level LZMA patch. I merged v3 with the recent 
default branch and fixed the conflicts, since I couldn’t tell what revision it 
was originally based on. I also made some fixes based on my review comments.

--
Added file: http://bugs.python.org/file37677/issue15955_lzma_r4.diff




[issue15955] gzip, bz2, lzma: add option to limit output size

2015-01-11 Thread Antoine Pitrou

Changes by Antoine Pitrou pit...@free.fr:


--
stage: needs patch -> patch review




[issue15955] gzip, bz2, lzma: add option to limit output size

2015-01-09 Thread Martin Panter

Martin Panter added the comment:

Here is a patch for the higher-level LZMAFile implementation to use Nikolaus’s 
“max_length” parameter. It depends on Nikolaus’s patch also being applied.

I split out a _RawReader class that does the actual decompress() calls, and 
then wrapped that in a BufferedReader. This avoids needing any special code to 
implement buffering, readline(), etc. The only significant changes in the API 
that I can see are:

* LZMAFile now inherits the useless specification of BufferedReader.peek(), 
losing the guarantee of returning at least a single byte. I questioned the 
BufferedReader specification at https://bugs.python.org/issue5811#msg233750.
* read() now accepts size=None, because BufferedReader does. I had to change a 
test case for this.
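The split Martin describes (a raw reader that does the decompress() calls, wrapped in io.BufferedReader) can be sketched roughly as follows. This is an illustration only, not the actual patch: the class and function names are made up, and it relies on the max_length/needs_input decompressor API this issue introduces (Python 3.5+).

```python
import io
import lzma

class _RawLZMAReader(io.RawIOBase):
    """Raw reader that yields at most len(b) decompressed bytes per call."""

    def __init__(self, fp):
        self._fp = fp
        self._decomp = lzma.LZMADecompressor()

    def readable(self):
        return True

    def readinto(self, b):
        while True:
            if self._decomp.eof:
                return 0
            if self._decomp.needs_input:
                data = self._fp.read(8192)
                if not data:
                    raise EOFError("compressed stream ended prematurely")
            else:
                # The decompressor still holds buffered output; feed nothing.
                data = b''
            out = self._decomp.decompress(data, max_length=len(b))
            if out or self._decomp.eof:
                b[:len(out)] = out
                return len(out)

def open_limited(fp):
    # BufferedReader supplies read(), readline(), peek(), etc. for free,
    # which is the point of the refactoring described above.
    return io.BufferedReader(_RawLZMAReader(fp))
```

With this structure, a call like `open_limited(f).read(1)` only ever decompresses up to one buffer's worth of data, rather than the whole stream.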

--
Added file: http://bugs.python.org/file37658/LZMAFile.patch




[issue15955] gzip, bz2, lzma: add option to limit output size

2015-01-08 Thread Martin Panter

Martin Panter added the comment:

It turns out that GzipFile.read(size) etc. is also susceptible to 
decompression bombing. Here is a patch to test and fix that, making use of the 
existing “max_length” parameter in the “zlib” module.

--
Added file: http://bugs.python.org/file37644/gzip-bomb.patch




[issue15955] gzip, bz2, lzma: add option to limit output size

2014-12-19 Thread Nikolaus Rath

Nikolaus Rath added the comment:

*ping*. It's been another 8 months. It would be nice if someone could review 
the patch.

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2014-09-10 Thread Martin Panter

Martin Panter added the comment:

If people are worried about the best low-level decompressor API, maybe leave 
that as a future enhancement, and just rely on using the existing file reader 
APIs. I would expect them to have a sensible decompressed buffer size limit; 
however, “bzip2” and LZMA look susceptible to zip bombing:

>>> GzipFile(fileobj=gzip_bomb).read(1)
b'\x00'
>>> BZ2File(bzip_bomb).read(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/bz2.py", line 293, in read
    return self._read_block(size)
  File "/usr/lib/python3.4/bz2.py", line 254, in _read_block
    while n > 0 and self._fill_buffer():
  File "/usr/lib/python3.4/bz2.py", line 218, in _fill_buffer
    self._buffer = self._decompressor.decompress(rawblock)
MemoryError
>>> z = LZMAFile(lzma_bomb)
>>> z.read(1)
b'\x00'  # Slight delay before returning
>>> len(z._buffer)
55675075  # Decompressed much more data than I asked for
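For anyone wanting to reproduce this, the bomb files are easy to make. A quick sketch (the gzip_bomb/bzip_bomb/lzma_bomb names above are Martin's; this construction is mine):

```python
import bz2

# Highly repetitive data compresses extremely well, so a tiny
# attacker-controlled payload can demand an enormous output buffer.
payload = bytes(10 * 1024 * 1024)   # 10 MiB of zeros
bomb = bz2.compress(payload)        # only a few KiB of bz2 data
print(len(bomb), len(payload) // len(bomb))  # tiny input, large expansion
```

Writing `bomb` to a file and opening it with BZ2File reproduces the unbounded-allocation behaviour shown in the session above.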

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2014-08-11 Thread Nikolaus Rath

Nikolaus Rath added the comment:

*ping*

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2014-06-16 Thread STINNER Victor

Changes by STINNER Victor victor.stin...@gmail.com:


--
nosy: +haypo




[issue15955] gzip, bz2, lzma: add option to limit output size

2014-06-15 Thread Nadeem Vawda

Nadeem Vawda added the comment:

Sorry, I just haven't had any free time lately, and may still not be able
to give this the attention it deserves for another couple of weeks.

Serhiy, would you be interested in reviewing Nikolaus' patch?

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2014-06-14 Thread Nikolaus Rath

Nikolaus Rath added the comment:

Nadeem, did you get a chance to look at this?

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2014-04-12 Thread Nikolaus Rath

Changes by Nikolaus Rath nikol...@rath.org:


Added file: http://bugs.python.org/file34792/issue15955_r3.diff




[issue15955] gzip, bz2, lzma: add option to limit output size

2014-04-11 Thread Nikolaus Rath

Nikolaus Rath added the comment:

I've attached the second iteration of the patch. I've factored out a new 
function decompress_buf, and added two new (C only) instance variables 
input_buffer and input_buffer_size.

I've tested the patch by enabling Py_USING_MEMORY_DEBUGGER in 
Objects/obmalloc.c and compiling --with-pydebug --without-pymalloc. Output is:

$ valgrind --tool=memcheck --suppressions=Misc/valgrind-python.supp ./python -m 
test -R : -v test_lzma
==18635== Memcheck, a memory error detector
==18635== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==18635== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==18635== Command: ./python -m test -R : -v test_lzma
==18635== 
== CPython 3.5.0a0 (default:a8f3ca72f703+, Apr 11 2014, 21:48:07) [GCC 4.8.2]
[]
Ran 103 tests in 43.655s

OK
beginning 9 repetitions
123456789
[...]
OK
.
1 test OK.
==18635== 
==18635== HEAP SUMMARY:
==18635== in use at exit: 2,328,705 bytes in 13,625 blocks
==18635==   total heap usage: 2,489,623 allocs, 2,475,998 frees, 91,305,463,593 
bytes allocated
==18635== 
==18635== LEAK SUMMARY:
==18635==definitely lost: 0 bytes in 0 blocks
==18635==indirectly lost: 0 bytes in 0 blocks
==18635==  possibly lost: 963,533 bytes in 1,334 blocks
==18635==still reachable: 1,365,172 bytes in 12,291 blocks
==18635== suppressed: 0 bytes in 0 blocks
==18635== Rerun with --leak-check=full to see details of leaked memory
==18635== 
==18635== For counts of detected and suppressed errors, rerun with: -v
==18635== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)


When running the tests only once (no -R option), the number of possibly lost 
bytes is 544,068 in 1,131 blocks. When running without the patch, the number is 
545,740 bytes in 1,134 blocks (for one iteration). I guess this means that 
Python is using interior pointers, so these blocks are not truly lost?

--
Added file: http://bugs.python.org/file34789/issue15955_r2.diff




[issue15955] gzip, bz2, lzma: add option to limit output size

2014-04-06 Thread Nadeem Vawda

Nadeem Vawda added the comment:

I've posted a review at http://bugs.python.org/review/15955/. (For some reason, 
it looks like Rietveld didn't send out email notifications. But maybe it never 
sends a notification to the sender? Hmm.)

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2014-04-06 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Yes, Rietveld never sends a notification to the sender. I've received a 
notification.

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2014-03-30 Thread Nadeem Vawda

Nadeem Vawda added the comment:

Thanks for the patch, Nikolaus. I'm afraid I haven't had a chance to look
over it yet; this past week has been a bit crazy for me. I'll definitely
get back to you with a review in the next week, though.

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2014-03-24 Thread Nikolaus Rath

Nikolaus Rath added the comment:

Thanks Nadeem. I'll get going.

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2014-03-24 Thread Nikolaus Rath

Nikolaus Rath added the comment:

I have attached a patch adding the max_length parameter to 
LZMADecompressor.decompress(). 

I have chosen to store the pointer to any unconsumed input in the lzma stream 
object itself. The new convention is: if lzs.next_in != NULL, then there is 
valid data that needs to be prepended.

Since I expect that max_length will often be specified as a keyword argument, I 
had to change the argument clinic definition to allow all arguments to be 
specified as keyword arguments. I hope that's ok.

Testcases seem to run fine. No reference leaks with -R : either.

Comments?

--
keywords: +patch
Added file: http://bugs.python.org/file34609/issue15955.diff




[issue15955] gzip, bz2, lzma: add option to limit output size

2014-02-02 Thread Nadeem Vawda

Nadeem Vawda added the comment:

After some consideration, I've come to agree with Serhiy that it would be better
to keep a private internal buffer, rather than having the user manage unconsumed
input data. I'm also in favor of having a flag to indicate whether the
decompressor needs more input to produce more decompressed data. (I'd prefer to
call it 'needs_input' or similar, though - 'data_ready' feels too vague to me.)

In msg176883 and msg177228, Serhiy raises the possibility that the decompressor
might be unable to produce decompressed output from a given piece of (non-empty)
input, but will still leave the input unconsumed. I do not think that this can
actually happen (based on the libraries' documentation), but this API will work
even if that situation can occur.

So, to summarize, the API will look like this:

class LZMADecompressor:

    ...

    def decompress(self, data, max_length=-1):
        """Decompresses *data*, returning uncompressed data as bytes.

        If *max_length* is nonnegative, returns at most *max_length* bytes
        of decompressed data. If this limit is reached and further output
        can be produced, *self.needs_input* will be set to False. In this
        case, the next call to *decompress()* should provide *data* as b''
        to obtain more of the output.

        If all of the input data was decompressed and returned (either
        because this was less than *max_length* bytes, or because
        *max_length* was negative), *self.needs_input* will be set to True.
        """

        ...

Data not consumed due to the use of 'max_length' should be saved in an internal
buffer (that is not exposed to Python code at all), which is then prepended to
any data provided in the next call to decompress() before providing the data to
the underlying compression library. The cases where either the internal buffer
or the new data are empty should be optimized to avoid unnecessary allocations
or copies, since these will be the most common cases.

Note that this API does not need a Python-level 'unconsumed_tail' attribute -
its role is served by the internal buffer (which is private to the C module
implementation). This is not to be confused with the already-existing
'unused_data' attribute that stores data found after the end of the compressed
stream. 'unused_data' should continue to work as before, regardless of whether
decompress() is called with a max_length argument or not.
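With these semantics, a consumer loop over the proposed API would look roughly like this (a sketch against the lzma module, assuming the decompress()/needs_input behaviour described above; BUFSIZE is an arbitrary choice):

```python
import lzma

BUFSIZE = 8192

def decompress_stream(infh, outfh):
    """Copy decompressed data from infh to outfh, holding at most
    BUFSIZE bytes of decompressed output in memory at a time."""
    d = lzma.LZMADecompressor()
    while not d.eof:
        if d.needs_input:
            data = infh.read(BUFSIZE)
            if not data:
                raise ValueError('End-of-stream marker not found')
        else:
            # max_length was hit last time; drain buffered output first.
            data = b''
        outfh.write(d.decompress(data, max_length=BUFSIZE))
```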

As a starting point I would suggest writing a patch for LZMADecompressor first,
since its implementation is a bit simpler than BZ2Decompressor. Once this patch
and an analogous one for BZ2Decompressor have been committed, we can then
convert GzipFile, BZ2File and LZMAFile to use this feature.

If you have any questions while you're working on this issue, feel free to send
them my way.

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2014-01-25 Thread Nikolaus Rath

Nikolaus Rath added the comment:

Let me be more precise: 

My suggestion is not to remove `unconsumed_tail` entirely, but I think its 
value needs to be defined only when the end of the compressed stream has been 
reached.

In other words, you could still do:

while not decomp.eof:
    # ...
if decomp.unconsumed_tail:
    raise RuntimeError('unexpected data after end of compressed stream')

but as long as decomp.eof is False, decomp.unconsumed_tail could (as far as I 
can tell) be None, no matter whether there is uncompressed data in the internal 
buffer or not.

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2014-01-24 Thread Nikolaus Rath

Nikolaus Rath added the comment:

Is there any reason why unconsumed_tail needs to be exposed?

I would suggest instead introducing a boolean attribute data_ready that 
indicates that more decompressed data can be provided without additional 
compressed input.

Example:

# decomp = decompressor object
# infh = compressed input stream
# outfh = decompressed output stream
while not decomp.eof:
    if decomp.data_ready:
        buf = decomp.decompress(max_size=BUFSIZE)
        # or maybe:
        #buf = decomp.decompress(b'', max_size=BUFSIZE)
    else:
        buf = infh.read(BUFSIZE)
        if not buf:
            raise RuntimeError('Unexpected end of compressed stream')
        buf = decomp.decompress(buf, max_size=BUFSIZE)

    assert len(buf) > 0
    outfh.write(buf)

This is short, easily readable (in my opinion) and also avoids the problem 
where the decompressor blocks because it needs more data even though there 
still is an unconsumed tail.

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2014-01-22 Thread Nadeem Vawda

Nadeem Vawda added the comment:

No, I'm afraid I haven't had a chance to do any work on this issue since my last
message.

I would be happy to review a patch for this, but before you start writing one,
we should settle on how the API will look. I'll review the existing discussion
in detail over the weekend and come up with something that avoids the potential
problems raised by Serhiy.

--
versions: +Python 3.5 -Python 3.4




[issue15955] gzip, bz2, lzma: add option to limit output size

2014-01-19 Thread Nikolaus Rath

Nikolaus Rath added the comment:

Nadeem, did you have a chance to look at this again, or do you have any partial 
patch already?

If not, I'd like to try working on a patch.

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2013-06-17 Thread Martin Panter

Changes by Martin Panter vadmium...@gmail.com:


--
nosy: +vadmium




[issue15955] gzip, bz2, lzma: add option to limit output size

2013-04-21 Thread Nikolaus Rath

Nikolaus Rath added the comment:

The lack of output size limiting has security implications as well.

Without being able to limit the size of the uncompressed data returned per 
call, it is not possible to decompress untrusted lzma or bz2 data without 
becoming susceptible to a DoS attack, as the attacker can force allocation of 
gigantic buffers by sending just a tiny amount of compressed data.
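The zlib module's decompressor objects already support this; the request is for bz2 and lzma to grow the same knob. A quick sketch of the existing zlib behaviour being asked for:

```python
import zlib

# A small compressed payload that expands to 10 MB of zeros.
bomb = zlib.compress(bytes(10 * 1024 * 1024))

d = zlib.decompressobj()
chunk = d.decompress(bomb, 4096)   # second argument caps the output size
# No giant allocation: at most 4096 bytes come back per call, and the
# not-yet-consumed compressed input is kept in d.unconsumed_tail.
more = d.decompress(d.unconsumed_tail, 4096)
```

Nothing equivalent exists for bz2.BZ2Decompressor or lzma.LZMADecompressor at this point, which is the gap this issue is about.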

--
nosy: +Nikratio




[issue15955] gzip, bz2, lzma: add option to limit output size

2013-01-19 Thread Nadeem Vawda

Nadeem Vawda added the comment:

> What if unconsumed_tail is not empty but less than needed to decompress at
> least one byte? We need to read more data until unconsumed_tail grows enough
> to be decompressed.

This is possible in zlib, but not in bz2. According to the manual [1], it is
perfectly OK to supply one byte at a time.

For xz, I'm not sure whether this problem could occur. I had assumed that it
could not, but I may be mistaken ;-). Unfortunately liblzma has no proper
manual, so I'll have to dig into the implementation to find out, and I haven't
had the time to do this yet.
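The bz2 half of that claim is easy to verify against the existing BZ2Decompressor (a quick check, not part of any patch here):

```python
import bz2

original = b'one byte at a time ' * 200
compressed = bz2.compress(original)

# Feed the decompressor a single byte per call; per the bzip2 manual,
# arbitrarily small input chunks are fine.
d = bz2.BZ2Decompressor()
out = b''.join(d.decompress(compressed[i:i + 1])
               for i in range(len(compressed)))
assert out == original and d.eof
```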


[As an aside, it would be nice if the documentation for the zlib module
 mentioned this problem. We can't assume that users of the Python module are
 familiar with the C API for zlib...]


[1] http://www.bzip.org/1.0.5/bzip2-manual-1.0.5.html

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2012-12-09 Thread Nadeem Vawda

Nadeem Vawda added the comment:

> # Using zlib's interface
> while not d.eof:
>     compressed = d.unconsumed_tail or f.read(8192)
>     if not compressed:
>         raise ValueError('End-of-stream marker not found')
>     output = d.decompress(compressed, 8192)
>     # process output
>
> This is not usable with bzip2. Bzip2 uses a large block size and
> unconsumed_tail can be non-empty but decompress() will return b''. With zlib
> you can possibly see the same effect on some input when read by one byte.

I don't see how this is a problem. If (for some strange reason) the
application-specific processing code can't handle empty blocks properly, you can
just stick if not output: continue before it.


> Actually it should be:
>
> # Using zlib's interface
> while not d.eof:
>     output = d.decompress(d.unconsumed_tail, 8192)
>     while not output and not d.eof:
>         compressed = f.read(8192)
>         if not compressed:
>             raise ValueError('End-of-stream marker not found')
>         output = d.decompress(d.unconsumed_tail + compressed, 8192)
>     # process output
>
> Note that you should use d.unconsumed_tail + compressed as input, and
> therefore do an unnecessary copy of the data.

Why is this necessary? If unconsumed_tail is b'', then there's no need to
prepend it (and the concatenation would be a no-op anyway). If unconsumed_tail
does contain data, then we don't need to read additional compressed data from
the file until we've finished decompressing the data we already have.


> Without explicit unconsumed_tail you can write input data in the internal
> mutable buffer; it will be more effective for a large buffer (hundreds of KB)
> and small input chunks (several KB).

Are you proposing that the decompressor object maintain its own buffer, and
copy the input data into it before passing it to the decompression library?
Doesn't that just duplicate work that the library is already doing for us?

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2012-12-09 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

 you can just stick "if not output: continue" before it.

And then it hangs, because d.unconsumed_tail is not empty and no new data will 
be read.

 Why is this necessary? If unconsumed_tail is b'', then there's no need to
 prepend it (and the concatenation would be a no-op anyway). If
 unconsumed_tail does contain data, then we don't need to read additional
 compressed data from the file until we've finished decompressing the data
 we already have.

What if unconsumed_tail is not empty but holds less than is needed to decompress 
at least one byte? We need to read more data until unconsumed_tail grows enough 
to be decompressed.

 Are you proposing that the decompressor object maintain its own buffer, and
 copy the input data into it before passing it to the decompression library?
 Doesn't that just duplicate work that the library is already doing for us?

unconsumed_tail is such a buffer, and when we call the decompressor with a new 
chunk of data we have to allocate a buffer of size 
(len(unconsumed_tail)+len(compressed)) and copy len(unconsumed_tail) bytes from 
unconsumed_tail and len(compressed) bytes from the new data. But with an 
internal buffer, only the new data needs to be copied.
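For context: the design that eventually shipped in Python 3.5's bz2 and lzma modules takes exactly this internal-buffer approach — the decompressor keeps unconsumed input itself and exposes a needs_input flag, so the caller never concatenates. A minimal sketch against the 3.5+ bz2 API (payload and chunk sizes are arbitrary):

```python
import bz2
import io

payload = bz2.compress(b'x' * 100_000)
f = io.BytesIO(payload)
d = bz2.BZ2Decompressor()
total = 0
while not d.eof:
    # Feed new input only when the decompressor has drained its internal buffer.
    chunk = f.read(8192) if d.needs_input else b''
    if d.needs_input and not chunk:
        raise ValueError('End-of-stream marker not found')
    # max_length bounds the output of each call; leftover input stays inside d.
    total += len(d.decompress(chunk, max_length=8192))
assert total == 100_000
```

Note there is no unconsumed_tail in this interface at all — the caller only ever copies the freshly read chunk, which is the saving being argued for here.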

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2012-12-04 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

 # Using zlib's interface
 while not d.eof:
     compressed = d.unconsumed_tail or f.read(8192)
     if not compressed:
         raise ValueError('End-of-stream marker not found')
     output = d.decompress(compressed, 8192)
     # process output

This is not usable with bzip2. Bzip2 uses a large block size, so unconsumed_tail 
can be non-empty while decompress() returns b''. With zlib you can possibly see 
the same effect on some input when reading one byte at a time.

 A related, but orthogonal proposal: We might want to make unconsumed_tail
 a memoryview (provided the input data is known to be immutable), to avoid
 creating an unnecessary copy of the data.

It looks interesting. However, the data must be copied anyway if the input 
data is not a bytes object.

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2012-12-04 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Actually it should be:

# Using zlib's interface
while not d.eof:
    output = d.decompress(d.unconsumed_tail, 8192)
    while not output and not d.eof:
        compressed = f.read(8192)
        if not compressed:
            raise ValueError('End-of-stream marker not found')
        output = d.decompress(d.unconsumed_tail + compressed, 8192)
    # process output

Note that you have to use d.unconsumed_tail + compressed as input, and therefore 
make an unnecessary copy of the data.

Without an explicit unconsumed_tail you can write input data into an internal 
mutable buffer, which is more effective for a large buffer (hundreds of KB) and 
small input chunks (several KB).
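The corrected nested loop does run unchanged against the real zlib module; a self-contained check (payload and sizes are arbitrary):

```python
import io
import zlib

payload = zlib.compress(b'b' * 50_000)
f = io.BytesIO(payload)
d = zlib.decompressobj()
total = 0
while not d.eof:
    # First retry whatever input was left over from the previous call.
    output = d.decompress(d.unconsumed_tail, 8192)
    while not output and not d.eof:
        compressed = f.read(8192)
        if not compressed:
            raise ValueError('End-of-stream marker not found')
        # The leftover tail must be re-fed together with the new chunk.
        output = d.decompress(d.unconsumed_tail + compressed, 8192)
    total += len(output)  # process output
assert total == 50_000
```

The inner while is what guards against the bzip2-style case where one chunk is too small to yield any output.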

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2012-12-02 Thread Nadeem Vawda

Nadeem Vawda added the comment:

I've tried reimplementing LZMAFile in terms of the decompress_into()
method, and it has ended up not being any faster than the existing
implementation. (It is _slightly_ faster for readinto() with a large
buffer size, but in all other cases it was either equal in performance or
significantly slower.)

In addition, decompress_into() is more complicated to work with than I
had expected, so I withdraw my objection to the approach based on
max_length/unconsumed_tail.


 unconsumed_tail should be a private, hidden attribute, which automatically 
 prepends any unconsumed data.

I don't think this is a good idea. In order to have predictable memory
usage, the caller will need to ensure that the current input is fully
decompressed before passing in the next block of compressed data. This
can be done more simply with the interface used by zlib. Compare:

while not d.eof:
    output = d.decompress(b'', 8192)
    if not output:
        compressed = f.read(8192)
        if not compressed:
            raise ValueError('End-of-stream marker not found')
        output = d.decompress(compressed, 8192)
    # process output

with:

# Using zlib's interface
while not d.eof:
    compressed = d.unconsumed_tail or f.read(8192)
    if not compressed:
        raise ValueError('End-of-stream marker not found')
    output = d.decompress(compressed, 8192)
    # process output


A related, but orthogonal proposal: We might want to make unconsumed_tail
a memoryview (provided the input data is known to be immutable), to avoid
creating an unnecessary copy of the data.
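The zlib-style loop being advocated here works as-is against the real zlib module; a self-contained check (payload and chunk sizes are arbitrary):

```python
import io
import zlib

payload = zlib.compress(b'a' * 100_000)
f = io.BytesIO(payload)
d = zlib.decompressobj()
total = 0
while not d.eof:
    # Retry leftover input before reading more from the file.
    compressed = d.unconsumed_tail or f.read(8192)
    if not compressed:
        raise ValueError('End-of-stream marker not found')
    # The second argument (max_length) bounds each call's output to 8192 bytes.
    output = d.decompress(compressed, 8192)
    total += len(output)  # process output
assert total == 100_000
```

Because decompress() always makes progress on a non-empty zlib input, this single-loop form cannot hang the way the bzip2 counterexample above suggests it might for other formats.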

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2012-12-02 Thread Arfrever Frehtes Taifersar Arahesis

Changes by Arfrever Frehtes Taifersar Arahesis arfrever@gmail.com:


--
nosy: +Arfrever




[issue15955] gzip, bz2, lzma: add option to limit output size

2012-11-06 Thread Nadeem Vawda

Nadeem Vawda added the comment:

I suspect that it will be slower than the decompress_into() approach, but
as you say, we need to do benchmarks to see for sure.

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2012-11-05 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

unconsumed_tail should be a private, hidden attribute, which automatically 
prepends any unconsumed data.  This should not be very complicated.  But 
benchmarks are needed to show which kind of approach is more efficient.

--




[issue15955] gzip, bz2, lzma: add option to limit output size

2012-11-04 Thread Nadeem Vawda

Nadeem Vawda added the comment:

I agree that being able to limit output size is useful and desirable, but
I'm not keen on copying the max_length/unconsumed_tail approach used by
zlib's decompressor class. It feels awkward to use, and it complicates
the implementation of the existing decompress() method, which is already
unwieldy enough.

As an alternative, I propose a thin wrapper around the underlying C API:

def decompress_into(self, src, dst, src_start=0, dst_start=0): ...

This would store decompressed data in a caller-provided bytearray, and
return a pair of integers indicating the end points of the consumed and
produced data in the respective buffers.

The implementation should be extremely simple - it does not need to do
any memory allocation or reference management.
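decompress_into() never landed in the stdlib, so the following is only a hypothetical pure-Python emulation of the proposed signature, built on zlib's existing max_length/unconsumed_tail (the real proposal was a C-level method doing no allocation at all):

```python
import zlib

def decompress_into(d, src, dst, src_start=0, dst_start=0):
    # Hypothetical emulation of the proposed API: decompress src[src_start:]
    # into dst[dst_start:], returning the end points of the consumed input
    # and the produced output, as the proposal describes.
    data = bytes(src[src_start:])
    out = d.decompress(data, len(dst) - dst_start)
    dst[dst_start:dst_start + len(out)] = out
    consumed = len(data) - len(d.unconsumed_tail)
    return src_start + consumed, dst_start + len(out)

comp = zlib.compress(b'hello world ' * 500)   # 6000 bytes when decompressed
d = zlib.decompressobj()
dst = bytearray(1000)
src_end, dst_end = decompress_into(d, comp, dst)
assert dst_end == 1000                        # output buffer was filled
assert bytes(dst).startswith(b'hello world')
```

Unlike the proposal, this sketch still copies through intermediate bytes objects; it only illustrates the calling convention, not the zero-allocation property.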

I think it could also be useful for optimizing the implementation of
BZ2File and LZMAFile. I plan to write a prototype and run some benchmarks
some time in the next few weeks.

(Aside: if implemented for zlib, this could also be a nicer (I think)
 solution for the problem raised in issue 5804.)

--
stage:  -> needs patch




[issue15955] gzip, bz2, lzma: add option to limit output size

2012-09-25 Thread Antoine Pitrou

Antoine Pitrou added the comment:

You don't need to know the final size. You just need to be able to pass an 
output size limit. This would be an easy feature to add to functions like 
zlib.decompress(), since they already handle sizing of the output buffer.
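The decompressor objects already offer this in zlib: the max_length argument caps the output of a single call, which is exactly the decompression-bomb safeguard being asked for (sizes below are arbitrary):

```python
import zlib

bomb = zlib.compress(b'y' * 1_000_000)  # ~1 MB expands from ~1 KB of input
d = zlib.decompressobj()
out = d.decompress(bomb, 64 * 1024)     # never produce more than 64 KiB
assert len(out) == 64 * 1024
assert not d.eof                        # the rest was deliberately left undone
```

The one-shot zlib.decompress() function has no such limit, which is the gap this issue proposes to close for gzip, bz2, and lzma.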

--
nosy: +pitrou
title: gzip, bz2, lzma: add method to get decompressed size -> gzip, bz2, lzma: 
add option to limit output size




[issue15955] gzip, bz2, lzma: add option to limit output size

2012-09-25 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Agree with Antoine. This would be a desirable feature.

--
nosy: +storchaka
