[issue41210] LZMADecompressor.decompress(FORMAT_RAW) truncate output when input is paticular LZMA+BCJ data

2020-07-11 Thread Hiroshi Miura


Hiroshi Miura  added the comment:

Here is a BCJ-only CFFI test project.
https://github.com/miurahr/bcj-cffi

It imports two bcj_x86 C sources: one is from liblzma (src/xz_bcj_x86.c), which 
Python's lzma module binds to, and the other is from the xz-embedded project for 
the Linux kernel (src/xz_simple_bcj.c).

We can observe that:

1. It has an interface that overwrites the buffer in place.
2. It returns a correct result buffer (checked by a digest assertion) in both cases.
3. It returns a size 4 bytes smaller than expected.

For 3, this is because the return value of the BCJ filter is defined like this:

```
size -= 4;
for (i = 0; i < size; ++i) {...}
return i;
```
and the variable i is sometimes advanced by an extra 4 bytes when a target 
sequence is found and processed.

So it may be natural that the size value returned from a BCJ filter is often 
4 bytes smaller than the actual size.
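
The loop shape quoted above can be modelled in a few lines of Python. This is only a simplified sketch of the return-value behaviour, not the liblzma or xz-embedded code: `bcj_processed_size` is a hypothetical name, the real filters also rewrite the operand bytes and apply further checks, and the treatment of 0xE8/0xE9 as the "target sequence" is an assumption based on the x86 CALL/JMP opcodes.

```python
def bcj_processed_size(buf: bytes) -> int:
    """Return how many leading bytes a simplified x86 BCJ pass reports as processed."""
    if len(buf) < 5:
        # Too short to hold an opcode plus its 4-byte operand.
        return 0
    limit = len(buf) - 4          # corresponds to "size -= 4" in the C excerpt
    i = 0
    while i < limit:
        if buf[i] in (0xE8, 0xE9):
            # Target found: skip the opcode and its 4-byte operand.
            i += 5
        else:
            i += 1
    return i


# A buffer without any target opcode near the end stops 4 bytes short.
print(bcj_processed_size(bytes(12800)))                          # 12796
# A target opcode exactly 5 bytes before the end consumes the tail.
print(bcj_processed_size(bytes(12795) + b"\xe8" + bytes(4)))     # 12800
```

Under this model the caller has to flush the remaining bytes itself once it knows no more input will arrive, which matches the 12796 versus 12800 observation in this issue.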

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue41210] LZMADecompressor.decompress(FORMAT_RAW) truncate output when input is paticular LZMA+BCJ data

2020-07-07 Thread Hiroshi Miura


Hiroshi Miura  added the comment:

Thank you for the information about the similar problem.

This problem was observed and reported in a 7-zip library project, 
https://github.com/miurahr/py7zr/issues/178.
py7zr depends heavily on the lzma FORMAT_RAW interface.

Fortunately, the 7-zip container format has a size database, so the library can 
tell whether the output is complete or not.
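
A minimal sketch of such a size check, assuming the expected uncompressed size is already known from the container metadata. `decompress_checked` and `expected_size` are hypothetical names for illustration; this is not how py7zr is actually implemented.

```python
import lzma


def decompress_checked(decompressor, payload: bytes, expected_size: int) -> bytes:
    """Decompress payload and fail loudly if the output length is not the expected one."""
    out = decompressor.decompress(payload)
    if len(out) != expected_size:
        raise ValueError(f"expected {expected_size} bytes, got {len(out)}")
    return out


# Usage with a raw LZMA2 stream built on the spot.
filters = [{"id": lzma.FILTER_LZMA2}]
data = b"size database example " * 100
payload = lzma.compress(data, format=lzma.FORMAT_RAW, filters=filters)
decompressor = lzma.LZMADecompressor(format=lzma.FORMAT_RAW, filters=filters)
assert decompress_checked(decompressor, payload, len(data)) == data
```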

In the reported case, the library/caller ends up in a state where all input data 
has been sent to the decompressor, but the decompressor does not output anything.

I'd like to wait for the upstream reaction.

--




[issue41210] LZMADecompressor.decompress(FORMAT_RAW) truncate output when input is paticular LZMA+BCJ data

2020-07-07 Thread Hiroshi Miura


Change by Hiroshi Miura :


Added file: 
https://bugs.python.org/file49301/0001-lzma-support-LZMA1-with-FORMAT_RAW.patch




[issue41210] LZMADecompressor.decompress(FORMAT_RAW) truncate output when input is paticular LZMA+BCJ data

2020-07-06 Thread Ma Lin


Ma Lin  added the comment:

There was a similar issue (issue21872).

When decompressing lzma.FORMAT_ALONE data that has no end marker (but has the 
correct "Uncompressed Size" in the .lzma header), sometimes the last one to 
several dozen bytes cannot be output.

issue21872 fixed the problem in `_lzmamodule.c`. But if liblzma strictly 
followed zlib's API (IMO it should), this problem would not exist.


I debugged your code with the attached file `lzmabcj.bin`. When it had output 
12796 bytes, the output buffer still had 353 bytes of space, so it seems to be 
a problem in liblzma.

IMHO, we should first wait for a reply from the liblzma maintainer. If Lasse 
Collin thinks this is a bug, let us wait for the upstream fix. I will also 
report issue21872 to see if he can fix that problem upstream as well.

--




[issue41210] LZMADecompressor.decompress(FORMAT_RAW) truncate output when input is paticular LZMA+BCJ data

2020-07-06 Thread Hiroshi Miura


Hiroshi Miura  added the comment:

I think FORMAT_RAW is only tested with LZMA2 in Lib/test/test_lzma.py. Since 
there is no test for LZMA1, the documentation states that FORMAT_RAW is for LZMA2.

I'd like to add tests for LZMA1 and change the wording in the documentation.
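
As an illustration of what such a test could look like, here is a minimal round-trip sketch. It is not the content of the attached patch; `roundtrip_raw_lzma1` and `test_raw_roundtrip_lzma1` are hypothetical names, and the filter chain relies on the default preset on both sides.

```python
import lzma


def roundtrip_raw_lzma1(data: bytes) -> bytes:
    """Compress and decompress data with FORMAT_RAW and a single LZMA1 filter."""
    filters = [{"id": lzma.FILTER_LZMA1}]
    compressor = lzma.LZMACompressor(format=lzma.FORMAT_RAW, filters=filters)
    compressed = compressor.compress(data) + compressor.flush()
    decompressor = lzma.LZMADecompressor(format=lzma.FORMAT_RAW, filters=filters)
    return decompressor.decompress(compressed)


def test_raw_roundtrip_lzma1():
    data = b"FORMAT_RAW with LZMA1 " * 1000
    assert roundtrip_raw_lzma1(data) == data
```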

--
keywords: +patch
Added file: 
https://bugs.python.org/file49300/0001-lzma-support-LZMA1-with-FORMAT_RAW.patch




[issue41210] LZMADecompressor.decompress(FORMAT_RAW) truncate output when input is paticular LZMA+BCJ data

2020-07-06 Thread Hiroshi Miura


Hiroshi Miura  added the comment:

>Compression filters:
>FILTER_LZMA1 (for use with FORMAT_ALONE)
>FILTER_LZMA2 (for use with FORMAT_XZ and FORMAT_RAW)

I looked into the past discussion, BPO-6715, from when the lzma module was proposed.
https://bugs.python.org/issue6715

There is only one comment about FORMAT_ALONE and LZMA1 there:
https://bugs.python.org/issue6715#msg92174

> .lzma is actually not a format. It is just the raw output of the LZMA1
> coder. XZ instead is a container format for the LZMA2 coder, which
> probably means LZMA+some metadata.

It says that FORMAT_ALONE decodes a .lzma archive, which uses LZMA1 as its 
coder, and FORMAT_XZ decodes a .xz archive, which uses LZMA2 as its coder.
There is no discussion about FORMAT_RAW.

This indicates that the relation runs the opposite way from what the docs state:
FORMAT_ALONE should be used with LZMA1.
FORMAT_XZ should be used with LZMA2.

FORMAT_RAW actually has no restriction regarding LZMA1/2.

Here is another discussion about lzma_raw_encoder and LZMA1.
The xz/liblzma maintainer, Lasse, suggests that lzma_raw_encoder is usable for LZMA1.
https://sourceforge.net/p/lzmautils/discussion/708858/thread/cd04b6ace0/#6050


I think we need to fix the documentation.

--




[issue41210] LZMADecompressor.decompress(FORMAT_RAW) truncate output when input is paticular LZMA+BCJ data

2020-07-06 Thread Ma Lin


Ma Lin  added the comment:

The docs[1] said:

Compression filters:
FILTER_LZMA1 (for use with FORMAT_ALONE)
FILTER_LZMA2 (for use with FORMAT_XZ and FORMAT_RAW)

But your code uses a combination of `FILTER_LZMA1` and `FORMAT_RAW`, is this ok?

[1] https://docs.python.org/3/library/lzma.html#specifying-custom-filter-chains

--




[issue41210] LZMADecompressor.decompress(FORMAT_RAW) truncate output when input is paticular LZMA+BCJ data

2020-07-05 Thread Ma Lin


Change by Ma Lin :


--
components: +Library (Lib) -Extension Modules
nosy: +malin




[issue41210] LZMADecompressor.decompress(FORMAT_RAW) truncate output when input is paticular LZMA+BCJ data

2020-07-04 Thread Hiroshi Miura


New submission from Hiroshi Miura :

When decompressing a particular archive, the result is truncated by its last word. 
The attached test data has an uncompressed size of 12800 bytes and was compressed 
with the LZMA1+BCJ algorithm into 11327 bytes.
The data is the payload of a 7zip archive.

Here is pytest code to reproduce it.


:: code-block::

import lzma

# testdata_path is assumed to be a pathlib.Path pointing at the directory
# that holds the attached lzmabcj.bin.
def test_lzma_raw_decompressor_lzmabcj():
    filters = []
    filters.append({'id': lzma.FILTER_X86})
    filters.append(lzma._decode_filter_properties(lzma.FILTER_LZMA1,
                                                  b']\x00\x00\x01\x00'))
    decompressor = lzma.LZMADecompressor(format=lzma.FORMAT_RAW,
                                         filters=filters)
    with testdata_path.joinpath('lzmabcj.bin').open('rb') as infile:
        out = decompressor.decompress(infile.read(11327))
    assert len(out) == 12800


The test fails because len(out) is 12796 bytes; the last 4 bytes, which should 
be b'\x00\x00\x00\x00', are missing.
When specifying a single LZMA1 decompression filter instead, I got the expected 
length of data, 12800 bytes. (*1)
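
The five property bytes b']\x00\x00\x01\x00' used in the reproduction can also be read by hand. The following is a minimal sketch, assuming the usual LZMA1 property layout in which the first byte encodes (pb * 5 + lp) * 9 + lc and the next four bytes hold the dictionary size as a little-endian integer. `decode_lzma1_props` is a hypothetical helper for illustration, not part of the lzma module.

```python
def decode_lzma1_props(props: bytes) -> dict:
    """Decode a 5-byte LZMA1 properties blob into lc, lp, pb and dict_size."""
    if len(props) != 5:
        raise ValueError("LZMA1 properties must be exactly 5 bytes")
    d = props[0]
    lc = d % 9            # literal context bits
    d //= 9
    lp = d % 5            # literal position bits
    pb = d // 5           # position bits
    dict_size = int.from_bytes(props[1:5], "little")
    return {"lc": lc, "lp": lp, "pb": pb, "dict_size": dict_size}


print(decode_lzma1_props(b"]\x00\x00\x01\x00"))
# {'lc': 3, 'lp': 0, 'pb': 2, 'dict_size': 65536}
```

So the test data uses the common lc=3, lp=0, pb=2 settings with a 64 KiB dictionary.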

When I created test data with LZMA2+BCJ and examined it, I got the expected 
data.
When specifying a single LZMA2 decompression filter against the LZMA2+BCJ 
payload, the result is exactly the same as the (*1) data.
This suggests that the LZMA1/LZMA2 --> BCJ pipeline is suspect.


After investigating and understanding that _lzmamodule.c is a thin wrapper 
around liblzma, I found that the problem can be reproduced in liblzma itself.
I have reported it upstream to the xz-devel ML with test code:
https://www.mail-archive.com/xz-devel@tukaani.org/msg00370.html

--
components: Extension Modules
files: lzmabcj.bin
messages: 373008
nosy: miurahr
priority: normal
severity: normal
status: open
title: LZMADecompressor.decompress(FORMAT_RAW) truncate output when input is 
paticular LZMA+BCJ  data
versions: Python 3.6, Python 3.7, Python 3.8, Python 3.9
Added file: https://bugs.python.org/file49296/lzmabcj.bin
