On 2018-10-26 Bhargava Shastry wrote:
> On 10/25/18 11:34 PM, Lasse Collin wrote:
> >> +    // Decode BUFSIZ==8192 bytes of inbuf at a time  
> > 
> > The value of BUFSIZ depends on libc and isn't always 8192.  
> 
> Agreed. What would you like it to be? Afaiu, we are doing not only
> header decoding but also contents of the compressed block itself,
> right?

Yes, not only headers but the whole file.

The buffer size doesn't matter too much. For mem-to-mem operations, 4096
bytes is plenty (there's no syscall overhead for disk I/O). If it is
important to clear output buffer with memset(), 1024 bytes could be
enough to reduce the overhead from the memset() calls. Going much below
512 starts to slow things down due to the increased number of calls to
lzma_code().

> As a reference, fuzzers usually generate between 1 byte and a few
> hundred KB of compressed data.

OK, this is good to know.

> Test harnesses for other compression libs such as lzo [1] use an
> output buffer size equal to default block size (= 256 KB).

This is not comparable. With that LZO API you need to decompress the
whole block with a single call and thus the caller needs to know the
block size.

By the way, a few hopefully helpful comments about the linked [1]:

  - If lzo_init() fails, 0 is returned and thus the error is silently
    ignored. There is a printf if __DEBUG__ is defined but that's it.
    I'm not familiar with the fuzzing framework so I don't know if it
    matters, but without such knowledge, silently ignoring errors looks
    very suspicious to me.

  - out[bufSize] is a 256 KiB buffer that is allocated on stack. On
    Linux/glibc the default stack size is 8 MiB, so maybe it's OK in
    this particular situation, but in general 256 KiB buffer is a bit
    big to be allocated on stack (it could be made static since it
    doesn't need to be thread safe).

> > I guess the reason to split the input into BUFSIZ chunks is there to
> > emulate reading from a file, but to me that doesn't seem very useful
> > from point of view of fuzzing. Splitting the input into smaller
> > chunks is good in sense that it tests that liblzma behaves
> > correctly when liblzma doesn't get the whole file at once. However,
> > the BUFSIZ is far too big to do this for header decoding.  
> 
> It is my understanding that the harness is doing more than just header
> decoding, right? I notice (in the generated html coverage report) that
> calls to "block_decode" are also executed while fuzzing.

Yes. Testing with small buffer sizes might help in finding corner cases
outside header decoding too (the decoder must be able to stop and
continue processing input at any point in the file). Doing it one byte
at a time hurts fuzzing performance though. Depending on the input, it's
like 3-20 times slower than with normal buffer sizes (1 KiB or more).
Perhaps a compromise (a few bytes at a time) should be considered too.

> > It should *never* happen if the code calling the liblzma functions
> > is correct. It is better to use if() instead of assert() here to
> > catch LZMA_PROG_ERROR. It's not good to rely on that assertion
> > checks haven't been disabled with #define NDEBUG.  
> 
> The reason I chose assert() was because fuzzers pick up on signals
> such as SIGABRT, SIGSEGV etc. Also, libFuzzer interface reserves
> non-zero return codes, the LLVMFuzzerTestOneInput() API is required
> to always return 0.

OK.

> Would replacing "assert" with an "abort" make sense? This way, we
> don't need to use a debug build (which may slow down fuzzing).

Yes, abort() combined with a fprintf(stderr, ...) should be fine.

> >> +    // Init decoder.
> >> +    if (!init_decoder(&strm)) {
> >> +        // Decoder initialization failed. There's no point
> >> +        // retrying, so bail out.
> >> +        return 0;
> >> +    }  
> > 
> > Returning 0 indicates success, I guess. If initialization fails, the
> > actual fuzzing is never done. While it's unlikely that the
> > initialization fails, it's still essential to report the error to
> > the caller instead of silently skipping the whole fuzzing step.  
> 
> Would a call to "abort" make sense here?

Yes, here too.

> >   - Is it enough to pass the whole input to liblzma at once? If not,
> >     how small chunks to use? One byte could give most thorough
> > fuzzing but is slower too.  
> 
> One suggestion is to keep the current output buffer size I guess.

Yes, it is one suggestion. It's just that I'm thinking that small
buffers could improve the fuzzing coverage (more places where the
decoder has to be able to stop & continue the processing of input data).

I apologize that instead of trying to explain in detail what I think
could be tried, I thought it needed less energy from me to write some
code myself.

The first version is effectively very similar to the version you most
recently sent to the list, but I wrote it from scratch because I didn't
know the license of the version you sent and the code is very short
anyway.

The second version passes input/output to/from liblzma in tiny chunks.
Instead of one-byte chunks, I chose 13 and 29 bytes for input and
output, respectively. This gives much better speed than one-byte
chunks. They are odd (well, also primes) because some things are
aligned to 2^n bytes and this way different states get tested better, I
*hope*.

These version have a proper error check for the initialization
function call, and both that and the LZMA_PROG_ERROR situation also
print an error message before calling abort().

These versions don't clear the output buffer because I think it's not
necessary. liblzma shouldn't ever read anything from the output buffer.

Do you see any difference in test coverage between these versions? Is
the speed of the second version good enough? Note that I have now pushed
support for FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION to xz.git.

---------

///////////////////////////////////////////////////////////////////////////////
//
/// \file       fuzz.c
/// \brief      Fuzz test program for liblzma
//
//  Author:     Lasse Collin
//
//  This file has been put into the public domain.
//  You can do whatever you want with this file.
//
///////////////////////////////////////////////////////////////////////////////

#include <inttypes.h>
#include <stdlib.h>
#include <stdio.h>
#include "lzma.h"


// Output buffer for decompressed data. This is write only; nothing cares
// about the actual data written here.
static uint8_t outbuf[4096];


extern int
LLVMFuzzerTestOneInput(const uint8_t *inbuf, size_t inbuf_size)
{
        // Some header values can make liblzma allocate a lot of RAM
        // (up to about 4 GiB with liblzma 5.2.x). We set a limit here to
        // prevent extreme allocations when fuzzing.
        const uint64_t memlimit = 300 << 20; // 300 MiB

        // Making strm static here and omitting the lzma_end() call at
        // the end of this function would cause subsequent calls to this
        // function to reuse the existing decoder state (lzma_stream_decoder()
        // would re-initialize it). It could be good to fuzz that that code
        // path too, but on the other hand it makes the current fuzzing round
        // depend on the previous rounds which isn't a good thing.
        /*static*/ lzma_stream strm = LZMA_STREAM_INIT;

        // Initialize a .xz decoder using the above memory usage limit.
        // Enable support for concatenated .xz files which is used when
        // decompressing regular .xz files (instead of data embedded inside
        // some other file format). Integrity checks on the uncompressed
        // data are ignored to make fuzzing more effective (incorrect check
        // values won't prevent the decoder from processing more input).
        //
        // The flag LZMA_IGNORE_CHECK doesn't disable verification of header
        // CRC32 values. Those checks are disabled when liblzma is built
        // with the #define FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION.
        lzma_ret ret = lzma_stream_decoder(&strm, memlimit,
                        LZMA_CONCATENATED | LZMA_IGNORE_CHECK);
        if (ret != LZMA_OK) {
                // This should never happen unless the system has
                // no free memory or address space to allow the small
                // allocations that the initialization requires.
                fprintf(stderr, "lzma_stream_decoder() failed (%d)\n", ret);
                abort();
        }

        // Give the whole input buffer at once to liblzma.
        strm.next_in = inbuf;
        strm.avail_in = inbuf_size;
        strm.next_out = outbuf;
        strm.avail_out = sizeof(outbuf);

        while ((ret = lzma_code(&strm, LZMA_FINISH)) == LZMA_OK) {
                if (strm.avail_out == 0) {
                        // outbuf became full. We don't care about the
                        // uncompressed data there, so we simply reuse
                        // the outbuf and overwrite the old data.
                        strm.next_out = outbuf;
                        strm.avail_out = sizeof(outbuf);
                }
        }

        // LZMA_PROG_ERROR should never happen as long as the code calling
        // the liblzma functions is correct. Thus LZMA_PROG_ERROR is a sign
        // of a bug in either this function or in liblzma.
        if (ret == LZMA_PROG_ERROR) {
                fprintf(stderr, "lzma_code() returned LZMA_PROG_ERROR\n");
                abort();
        }

        // Free the allocated memory.
        //
        // NOTE: If strm were static, this should be commented out to allow
        // reusing the decoder memory on the next fuzzing round.
        lzma_end(&strm);

        return 0;
}

---------

///////////////////////////////////////////////////////////////////////////////
//
/// \file       fuzz.c
/// \brief      Fuzz test program for liblzma
//
//  Author:     Lasse Collin
//
//  This file has been put into the public domain.
//  You can do whatever you want with this file.
//
///////////////////////////////////////////////////////////////////////////////

#include <inttypes.h>
#include <stdlib.h>
#include <stdio.h>
#include "lzma.h"


// Chunk sizes (in bytes) to be used for passing input and output data.
//
// Passing the whole input file to liblzma at once and using an output
// buffer of 1-4 KiB would be the fastest, but using tiny odd-sized
// buffers exercises the corner cases where liblzma has to be able to
// stop and continue the decoding when running out of input data or
// the output buffer becomes full.
//
// One-byte chunks would be quite slow. As a compromise, bigger values are
// used to get better speed (only 50-150 % slower than the fast version).
#define IN_CHUNK_SIZE 13
#define OUT_CHUNK_SIZE 29


// Output buffer for decompressed data. This is write only; nothing
// cares about the actual data written here.
static uint8_t outbuf[OUT_CHUNK_SIZE];


extern int
LLVMFuzzerTestOneInput(const uint8_t *inbuf, size_t inbuf_size)
{
        // Some header values can make liblzma allocate a lot of RAM
        // (up to about 4 GiB with liblzma 5.2.x). We set a limit here to
        // prevent extreme allocations when fuzzing.
        const uint64_t memlimit = 300 << 20; // 300 MiB

        // Making strm static here and omitting the lzma_end() call at
        // the end of this function would cause subsequent calls to this
        // function to reuse the existing decoder state (lzma_stream_decoder()
        // would re-initialize it). It could be good to fuzz that that code
        // path too, but on the other hand it makes the current fuzzing round
        // depend on the previous rounds which isn't a good thing.
        /*static*/ lzma_stream strm = LZMA_STREAM_INIT;

        // Initialize a .xz decoder using the above memory usage limit.
        // Enable support for concatenated .xz files which is used when
        // decompressing regular .xz files (instead of data embedded inside
        // some other file format). Integrity checks on the uncompressed
        // data are ignored to make fuzzing more effective (incorrect check
        // values won't prevent the decoder from processing more input).
        //
        // The flag LZMA_IGNORE_CHECK doesn't disable verification of header
        // CRC32 values. Those checks are disabled when liblzma is built
        // with the #define FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION.
        lzma_ret ret = lzma_stream_decoder(&strm, memlimit,
                        LZMA_CONCATENATED | LZMA_IGNORE_CHECK);
        if (ret != LZMA_OK) {
                // This should never happen unless the system has
                // no free memory or address space to allow the small
                // allocations that the initialization requires.
                fprintf(stderr, "lzma_stream_decoder() failed (%d)\n", ret);
                abort();
        }

        strm.next_in = inbuf;
        strm.avail_in = 0;
        strm.next_out = outbuf;
        strm.avail_out = sizeof(outbuf);

        // Use LZMA_RUN until the last input byte is available to lzma_code().
        lzma_action action = LZMA_RUN;

        do {
                if (strm.avail_in == 0) {
                        // Add at most CHUNK_SIZE bytes of more input.
                        // We don't need to set strm.next_in as that
                        // already points to the correct byte.
                        if (inbuf_size > 0) {
                                strm.avail_in = inbuf_size < IN_CHUNK_SIZE
                                                ? inbuf_size
                                                : IN_CHUNK_SIZE;
                                inbuf_size -= strm.avail_in;
                        }

                        // Use LZMA_FINISH when the last input byte is
                        // available to lzma_code().
                        if (inbuf_size == 0)
                                action = LZMA_FINISH;
                }

                if (strm.avail_out == 0) {
                        // outbuf became full. We don't care about the
                        // uncompressed data there, so we simply reuse
                        // the outbuf and overwrite the old data.
                        strm.next_out = outbuf;
                        strm.avail_out = sizeof(outbuf);
                }

                ret = lzma_code(&strm, action);
        } while (ret == LZMA_OK);

        // LZMA_PROG_ERROR should never happen as long as the code calling
        // the liblzma functions is correct. Thus LZMA_PROG_ERROR is a sign
        // of a bug in either this function or in liblzma.
        if (ret == LZMA_PROG_ERROR) {
                fprintf(stderr, "lzma_code() returned LZMA_PROG_ERROR\n");
                abort();
        }

        // Free the allocated memory.
        //
        // NOTE: If strm were static, this should be commented out to allow
        // reusing the decoder memory on the next fuzzing round.
        lzma_end(&strm);

        return 0;
}

---------

Comments are welcome. Thanks!

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode

Reply via email to