Re: [xz-devel] BUG: liblzma: LZMA+BCJ raw decode: output truncated last word

Lasse Collin Sun, 12 Jul 2020 10:55:48 -0700

Hello!

On 2020-07-08 ｍｉｕｒａ＠ｌｉｎｕｘ wrote:
> when setting filter as only LZMA1, it returns expected SIZE of
> output. 
> 
> Because BCJ should not change size, BCJ may have a bug, or LZMA1 -
> BCJ pipeline may be a problem.


liblzma cannot be used to decode data from .7z files except in certain
cases. This isn't a bug, it's a missing feature.

The raw encoder and decoder APIs only support streams that contain an
end of payload marker (EOPM) alias end of stream (EOS) marker. .7z
files use LZMA1 without such an end marker. Instead, the end is handled
by the decoder knowing the exact uncompressed size of the data.

The API of liblzma supports LZMA1 without end marker via
lzma_alone_decoder(). That API can be abused to properly decode raw
LZMA1 with known uncompressed size by feeding the decoder a fake 13-byte
header. Everything else in the public API requires some end marker.

Decoding LZMA1 without BCJ or other extra filters from .7z with
lzma_raw_decoder() kind of works but you will notice that it will never
return LZMA_STREAM_END, only LZMA_OK. This is because it will never see
an end marker. A minor downside is that it won't then do a small
integrity check at the end either (one variable in the range decoder
state will be 0 at the end of any valid LZMA1 stream);
lzma_alone_decoder() does this check even when end marker is missing.

If you use lzma_raw_decoder() for end-markerless LZMA1, make sure that
you never give it more output space than the real uncompressed size. In
rare cases this could result in extra output or an error since the
decoder would try to decode more output using the input it has gotten
so far. Overall I think the hack with lzma_alone_decoder() is a better
way with the current API.
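
To illustrate the output-space rule, here is a minimal sketch of a helper that limits how much output room is offered on each call. The function name and parameters are hypothetical and not part of liblzma; the caller is assumed to track how many bytes it has already received.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a liblzma function): returns how many bytes
 * of output space may be offered to the decoder on the next call so that
 * the total never exceeds the known uncompressed size.
 */
static size_t capped_avail_out(uint64_t uncompressed_size,
                               uint64_t produced_so_far,
                               size_t buffer_size)
{
    /* Nothing left to decode: offer no output space at all. */
    if (produced_so_far >= uncompressed_size)
        return 0;

    uint64_t remaining = uncompressed_size - produced_so_far;

    /* Offer the smaller of the remaining size and the buffer size. */
    return remaining < buffer_size ? (size_t)remaining : buffer_size;
}
```

A caller would assign the result to the avail_out member of its lzma_stream before each lzma_code() call and stop once the helper returns 0, since LZMA_STREAM_END will not arrive for end-markerless data.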

BCJ filters process the input data in chunks a few bytes long, thus
they need to hold a few bytes in a look-ahead buffer. With some filters
like ARM the look-ahead is aligned and if the uncompressed size is a
multiple of this alignment, lzma_raw_decoder() will give you all the
data even when the LZMA1 layer doesn't have an end marker. The x86
filter has one-byte alignment but needs to see five bytes from the
future before producing output. When the LZMA1 layer doesn't return
LZMA_STREAM_END, the x86 filter doesn't know that the end was reached
and cannot flush the last bytes out.
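
The effect can be shown with a toy model. This is not liblzma code: the function name and the holdback parameter are invented for illustration, and the number of bytes the real x86 filter withholds is not asserted here.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model of a look-ahead filter (hypothetical, not liblzma code).
 * "buffered" is how many bytes the filter has received but not yet
 * emitted, "holdback" is how many trailing bytes it must keep until it
 * knows no more input is coming, and "end_seen" says whether the
 * previous layer has signalled the end of the stream.
 */
static size_t toy_emittable(size_t buffered, size_t holdback, bool end_seen)
{
    /* Once the end is known, everything can be flushed. */
    if (end_seen)
        return buffered;

    /* Otherwise the trailing bytes stay in the look-ahead buffer. */
    return buffered > holdback ? buffered - holdback : 0;
}
```

With end_seen false the last holdback bytes are never emitted, which is the kind of truncation described in the original report.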

Also note that .7z files tend to use BCJ2 for x86 code. liblzma doesn't
support the x86 BCJ2 filter at all because it isn't streamable (it
could be modified to be but then it's not compatible). So even if
liblzma were improved to handle the lack of end marker in a better way,
you still couldn't decompress all .7z files. For .7z, LZMA SDK is the
way to go.

Using liblzma to decode .7z works in these cases:

  - LZMA1 using a fake 13-byte header with lzma_alone_decoder():

        1 byte   LZMA properties byte that encodes lc/lp/pb
        4 bytes  dictionary size as little endian uint32_t
        8 bytes  uncompressed size as little endian uint64_t;
                 UINT64_MAX means unknown and then (and only then)
                 EOPM must be present

  - LZMA2, possibly together with a BCJ or Delta filter, with
    lzma_raw_decoder() since LZMA2 always includes the end marker.
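
The 13-byte header layout listed above can be sketched as follows. The function name and the size macro are hypothetical. The sketch assumes the caller already has the five LZMA1 property bytes (the lc/lp/pb byte followed by the little endian dictionary size) and the exact uncompressed size from the container.

```c
#include <stddef.h>
#include <stdint.h>

#define FAKE_HEADER_SIZE 13 /* hypothetical name, not a liblzma macro */

/*
 * Hypothetical helper (not a liblzma function): builds the 13-byte
 * header that precedes raw LZMA1 data in the layout described above.
 *
 * props[0]     properties byte encoding lc/lp/pb
 * props[1..4]  dictionary size, little endian uint32_t
 */
static void build_fake_alone_header(uint8_t out[FAKE_HEADER_SIZE],
                                    const uint8_t props[5],
                                    uint64_t uncompressed_size)
{
    size_t i;

    /* Bytes 0..4: properties byte and dictionary size, copied as-is. */
    for (i = 0; i < 5; ++i)
        out[i] = props[i];

    /* Bytes 5..12: uncompressed size as little endian uint64_t. */
    for (i = 0; i < 8; ++i)
        out[5 + i] = (uint8_t)(uncompressed_size >> (8 * i));
}
```

The resulting 13 bytes would be given to a decoder initialized with lzma_alone_decoder() before the raw LZMA1 data itself.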

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode
