Hello!

On 2020-07-08 miura@linux wrote:
> when setting filter as only LZMA1, it returns expected SIZE of
> output.
>
> Because BCJ should not change size, BCJ may have a bug, or LZMA1 -
> BCJ pipeline may be a problem.

liblzma cannot be used to decode data from .7z files except in certain
cases. This isn't a bug, it's a missing feature.

The raw encoder and decoder APIs only support streams that contain an
end of payload marker (EOPM), also known as an end of stream (EOS)
marker. .7z files use LZMA1 without such an end marker. Instead, the
end is handled by the decoder knowing the exact uncompressed size of
the data.

The API of liblzma supports LZMA1 without end marker via
lzma_alone_decoder(). That API can be abused to properly decode raw
LZMA1 with known uncompressed size by feeding the decoder a fake
13-byte header. Everything else in the public API requires some end
marker.

Decoding LZMA1 without BCJ or other extra filters from .7z with
lzma_raw_decoder() kind of works but you will notice that it will never
return LZMA_STREAM_END, only LZMA_OK. This is because it will never see
an end marker. A minor downside is that it then won't do a small
integrity check at the end either (one variable in the range decoder
state will be 0 at the end of any valid LZMA1 stream);
lzma_alone_decoder() does this check even when the end marker is
missing.

If you use lzma_raw_decoder() for end-markerless LZMA1, make sure that
you never give it more output space than the real uncompressed size.
Otherwise, in rare cases, you could get extra output or an error since
the decoder would try to decode more output using the input it has
gotten so far.

Overall I think the hack with lzma_alone_decoder() is a better way with
the current API.

BCJ filters process the input data in chunks a few bytes long, thus
they need to hold a few bytes of look-ahead buffer. With some filters
like ARM the look-ahead is aligned and if the uncompressed size is a
multiple of this alignment, lzma_raw_decoder() will give you all the
data even when the LZMA1 layer doesn't have an end marker. The x86
filter has one-byte alignment but needs to see five bytes from the
future before producing output.
When the LZMA1 layer doesn't return LZMA_STREAM_END, the x86 filter
doesn't know that the end was reached and cannot flush the last bytes
out.

Also note that .7z files tend to use BCJ2 for x86 code. liblzma doesn't
support the x86 BCJ2 filter at all because it isn't streamable (it
could be modified to be but then it wouldn't be compatible). So even if
liblzma was improved to handle the lack of end marker in a better way,
you still couldn't decompress all .7z files. For .7z, LZMA SDK is the
way to go.

Using liblzma to decode .7z works in these cases:

  - LZMA1 using a fake 13-byte header with lzma_alone_decoder():

        1 byte   LZMA properties byte that encodes lc/lp/pb
        4 bytes  dictionary size as little endian uint32_t
        8 bytes  uncompressed size as little endian uint64_t;
                 UINT64_MAX means unknown and then (and only then)
                 EOPM must be present

  - LZMA2, possibly together with a BCJ or Delta filter, with
    lzma_raw_decoder() since LZMA2 always includes the end marker.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode
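
The fake 13-byte header described above can be sketched roughly as
follows. This is only an illustration, not tested against real .7z
data. It uses Python's standard lzma module, which wraps liblzma and
whose FORMAT_ALONE decoder handles the same .lzma container as
lzma_alone_decoder(). The names fake_alone_header, decode_lzma1 and
UNKNOWN_SIZE are made up for this example, and it assumes the caller
has already read the properties byte, dictionary size and uncompressed
size from the .7z metadata.

```python
import lzma
import struct

# UINT64_MAX in the size field means "size unknown"; the stream must
# then contain an end of payload marker (EOPM).
UNKNOWN_SIZE = 0xFFFFFFFFFFFFFFFF


def fake_alone_header(props_byte, dict_size, uncompressed_size=UNKNOWN_SIZE):
    """Build the 13-byte .lzma header (hypothetical helper).

    "<BIQ" packs, little endian and without padding:
      B  1 byte   LZMA properties byte (encodes lc/lp/pb)
      I  4 bytes  dictionary size
      Q  8 bytes  uncompressed size
    """
    return struct.pack("<BIQ", props_byte, dict_size, uncompressed_size)


def decode_lzma1(payload, props_byte, dict_size, uncompressed_size):
    """Decode raw LZMA1 data by prepending a fake .lzma header."""
    decoder = lzma.LZMADecompressor(format=lzma.FORMAT_ALONE)
    header = fake_alone_header(props_byte, dict_size, uncompressed_size)
    return decoder.decompress(header + payload)
```

With end-markerless data from .7z, uncompressed_size has to be the
exact size; a caller would probably also want to check that the length
of the returned data matches it.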