[issue44134] lzma: stream padding in xz files

2021-05-16 Thread rogdham


rogdham  added the comment:

It must be decided what to do in the following cases, which are not valid per 
the XZ file specification, but are supported by the lzma module (and tested):
 1. different formats concatenated together (e.g. a .xz and a .lzma); this 
in a way includes trailing null bytes (12 null bytes is a valid .lzma)
 2. trailing junk (i.e. non-null bytes after the stream)

The answer may be different depending on the format arg (e.g. FORMAT_AUTO vs 
FORMAT_XZ).

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue44134] lzma: stream padding in xz files

2021-05-15 Thread Ma Lin


Change by Ma Lin :


--
nosy: +malin




[issue44134] lzma: stream padding in xz files

2021-05-14 Thread rogdham


Change by rogdham :


Added file: https://bugs.python.org/file50045/example2.xz




[issue44134] lzma: stream padding in xz files

2021-05-14 Thread rogdham

New submission from rogdham :

Hello,

The lzma module does not work well with XZ stream padding. Depending on the 
case, it may work; or it may stop the stream prematurely without error; or an 
error may be raised; or no error may be raised when one must be.


In the XZ file format, stream padding is a run of null bytes (whose length is a 
multiple of 4) that can appear between and after streams.

From the specification (section 2.2):

> Only the decoders that support decoding of concatenated Streams MUST support 
> Stream Padding.

Since the lzma module supports decoding of concatenated streams, it must 
support stream padding as well.
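
To make the padding rule concrete, here is a minimal Python sketch of the check 
described above. The helper name is_valid_stream_padding is made up for 
illustration and is not part of the lzma module. It also treats empty padding 
as valid, on the assumption that zero counts as a multiple of 4.

```python
def is_valid_stream_padding(padding: bytes) -> bool:
    """Return True if `padding` is only null bytes and its length is a multiple of 4."""
    # Hypothetical helper for illustration; not part of the lzma module.
    only_nulls = padding.count(b"\x00") == len(padding)
    return only_nulls and len(padding) % 4 == 0
```

With this helper, 4 null bytes would be accepted as stream padding, while 18 
null bytes would be rejected.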



 Examples to reproduce the issue:

1. example1.xz:
- made of one stream followed by 4 null bytes:
$ (echo 'Hi!' | xz; head -c 4 /dev/zero) > example1.xz
- will raise an exception in both modes (FORMAT_AUTO and FORMAT_XZ)

>>> with lzma.open('/example1.xz', format=lzma.FORMAT_AUTO) as f:
...     f.read()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python3.9/lzma.py", line 200, in read
    return self._buffer.read(size)
  File "/usr/lib/python3.9/_compression.py", line 99, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
>>> with lzma.open('/example1.xz', format=lzma.FORMAT_XZ) as f:
...     f.read()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python3.9/lzma.py", line 200, in read
    return self._buffer.read(size)
  File "/usr/lib/python3.9/_compression.py", line 99, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached


2. example2.xz:
- made of two streams with 18 null bytes of stream padding between them
$ (echo 'Hi!' | xz; head -c 18 /dev/zero; echo 'Second stream' | xz) > example2.xz
- second stream will be ignored with FORMAT_XZ
- the two streams will be decoded with FORMAT_AUTO, where it should raise 
an error (18 null bytes is not a multiple of 4, so the stream padding is invalid 
according to the XZ specification and the decoder “MUST indicate an error”)

>>> with lzma.open('/tmp/example2.xz', format=lzma.FORMAT_AUTO) as f:
...     f.read()
...
b'Hi!\nSecond stream\n'
>>> with lzma.open('/tmp/example2.xz', format=lzma.FORMAT_XZ) as f:
...     f.read()
...
b'Hi!\n'
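
For anyone without the xz command line tool at hand, the two example files 
above can probably be approximated in pure Python. This is only a sketch: the 
function names are made up, and the bytes produced by lzma.compress are not 
guaranteed to be identical to what the xz tool writes, although the layout 
(stream, null bytes, optional second stream) is the same. The last helper 
decodes only the first stream and returns whatever follows it, which makes the 
padding visible.

```python
import lzma


def make_example1() -> bytes:
    # One XZ stream followed by 4 null bytes (valid stream padding).
    return lzma.compress(b"Hi!\n") + b"\x00" * 4


def make_example2() -> bytes:
    # Two XZ streams separated by 18 null bytes (invalid stream padding).
    return (
        lzma.compress(b"Hi!\n")
        + b"\x00" * 18
        + lzma.compress(b"Second stream\n")
    )


def split_first_stream(data: bytes) -> tuple:
    # Decode a single stream; the bytes after it end up in unused_data.
    decompressor = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
    first = decompressor.decompress(data)
    return first, decompressor.unused_data
```

Writing make_example1() or make_example2() to a file and opening it with 
lzma.open should then exercise the same code paths as the files attached to 
this issue.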



 Analysis

This issue comes from the relation between _lzma and _compression. In _lzma, 
the C library is called without the LZMA_CONCATENATED flag, which means that 
multiple streams and stream padding must be supported in Python.

In _compression, when an LZMADecompressor is done (.eof is True), another one 
is created to decompress from that point. If the new one fails to decompress 
the remaining data, the LZMAError is ignored and we assume we reached the end.

So the behavior seen above can be explained as follows:
 - In FORMAT_AUTO, it seems that .eof stays False until 18 bytes have been read
 - In FORMAT_AUTO, 18 null bytes will be decompressed as b'' with .eof being 
True afterwards
 - In FORMAT_XZ, it seems that .eof stays False until 12 bytes have been read
 - In FORMAT_XZ, stream padding is never accepted, so as soon as we have more 
than 12 bytes an LZMAError is raised
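
The stream-chaining logic described in this analysis can be sketched roughly as 
follows. This is a simplified in-memory emulation written for illustration, 
not the actual code of _compression: the function name is made up, and the 
whole input is handed over at once instead of being read in blocks from a 
file.

```python
import lzma


def read_all_like_compression(data: bytes, fmt: int = lzma.FORMAT_AUTO) -> bytes:
    """Simplified emulation of how streams are chained; not the real code."""
    out = []
    pending = data  # stands in for the bytes still unread from the file
    decompressor = lzma.LZMADecompressor(format=fmt)
    while True:
        if decompressor.eof:
            rest = decompressor.unused_data
            if not rest:
                break  # nothing after the last stream: clean end
            # Start a fresh decompressor on whatever follows the stream.
            decompressor = lzma.LZMADecompressor(format=fmt)
            try:
                out.append(decompressor.decompress(rest))
            except lzma.LZMAError:
                break  # treated as trailing junk and silently ignored
        else:
            if not pending:
                # The current decompressor wants more input but none is left.
                raise EOFError("Compressed file ended before the "
                               "end-of-stream marker was reached")
            chunk, pending = pending, b""
            out.append(decompressor.decompress(chunk))
    return b"".join(out)
```

In this sketch, a few null bytes after a stream are handed to a fresh 
decompressor that never reaches .eof, which is consistent with the EOFError 
seen for example1.xz, while data that triggers an LZMAError is dropped without 
any error being reported.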



 Possible solution

A possible solution would be to add a finish method to the decompressor 
interface, and support it appropriately in _compression when we reach EOF on 
the input. Then, in the LZMADecompressor implementation, use the 
LZMA_CONCATENATED flag, and implement the finish method to call lzma_code with 
LZMA_FINISH as the action.

I think this would be preferable to trying to solve the issue in Python, 
because if the format is FORMAT_AUTO we don't know if the format is XZ (and we 
should support stream padding) or not.
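
To illustrate what a decompressor with a finish method could look like from 
the caller's side, here is a pure-Python stand-in. It is only a sketch under 
several assumptions: the class name and its finish() method are hypothetical, 
the proposal above is to do this in C with LZMA_CONCATENATED rather than in 
Python, and the sketch handles FORMAT_XZ only, so it sidesteps the FORMAT_AUTO 
question entirely. It also rejects null bytes before the first stream, since 
padding is only described as appearing between and after streams.

```python
import lzma


class PaddingAwareXZDecompressor:
    """Hypothetical sketch: concatenated .xz streams with stream padding."""

    def __init__(self):
        self._current = None  # decompressor for the stream in progress, if any
        self._padding = 0     # null bytes seen since the last stream ended
        self._streams = 0     # number of streams fully decoded so far

    def decompress(self, data: bytes) -> bytes:
        out = []
        while data:
            if self._current is None:
                # Between streams: count leading null bytes as stream padding.
                rest = data.lstrip(b"\x00")
                self._padding += len(data) - len(rest)
                data = rest
                if not data:
                    break
                # A new stream starts here, so the padding before it is complete.
                self._check_padding()
                self._padding = 0
                self._current = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
            out.append(self._current.decompress(data))
            if self._current.eof:
                data = self._current.unused_data
                self._current = None
                self._streams += 1
            else:
                data = b""  # everything was consumed by the current stream
        return b"".join(out)

    def finish(self) -> None:
        # To be called once the input is exhausted.
        if self._current is not None or self._streams == 0:
            raise EOFError("input ended before the end of a stream")
        self._check_padding()

    def _check_padding(self) -> None:
        if self._padding and self._streams == 0:
            raise lzma.LZMAError("null bytes before the first stream")
        if self._padding % 4:
            raise lzma.LZMAError("stream padding is not a multiple of 4 bytes")
```

The point of finish() here is that only the caller knows when the input has 
ended, so that is the moment where a truncated stream or an incomplete run of 
padding can be reported.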

--
components: Library (Lib)
files: example1.xz
messages: 393681
nosy: nadeem.vawda, rogdham
priority: normal
severity: normal
status: open
title: lzma: stream padding in xz files
type: behavior
versions: Python 3.10, Python 3.11, Python 3.6, Python 3.7, Python 3.8, Python 3.9
Added file: https://bugs.python.org/file50044/example1.xz
