Stephen Jung (Stripe) created HADOOP-17096:
----------------------------------------------

             Summary: ZStandardCompressor throws java.lang.InternalError: Error 
(generic)
                 Key: HADOOP-17096
                 URL: https://issues.apache.org/jira/browse/HADOOP-17096
             Project: Hadoop Common
          Issue Type: Bug
          Components: io
    Affects Versions: 3.2.1
         Environment: Our repro is on Ubuntu Xenial LTS, with Hadoop 3.2.1 
linking against libzstd 1.3.1. The bug is difficult to reproduce in an 
end-to-end environment (e.g. running an actual Hadoop job with zstd 
compression) because it is very sensitive to the exact input and output 
characteristics. I reproduced the bug by turning one of the existing unit 
tests into a crude fuzzer, but I'm not sure upstream will accept that patch, 
so I've attached it separately on this ticket.

Note that the existing unit test testCompressingWithOneByteOutputBuffer fails 
to reproduce this bug. That test uses the license file as input, and that file 
is too small: libzstd buffers input internally (in our environment the buffer 
appears to be 128 kilobytes), and the license file is only 10 kilobytes. 
libzstd is therefore able to consume and compress all of the input in a single 
call, then hand back pieces of its internal buffer one byte at a time. Since 
all the input is consumed in a single call, uncompressedDirectBufOff and 
uncompressedDirectBufLen are both set to zero, and the bug does not reproduce.
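
As a rough check of that buffering claim: libzstd's streaming API reports its 
recommended input buffer size via ZSTD_CStreamInSize(), which reflects the 
internal block buffering and should be on the order of 128 kilobytes for this 
version, comfortably larger than the 10-kilobyte license file:

{code:c}
#include <stdio.h>
#include <zstd.h>

/* Print libzstd's recommended streaming input size. Any input smaller than
 * this is liable to be consumed in a single ZSTD_compressStream() call,
 * which is exactly the case where the bug does not reproduce. */
int main(void) {
    printf("ZSTD_CStreamInSize() = %zu bytes\n", ZSTD_CStreamInSize());
    return 0;
}
{code}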
            Reporter: Stephen Jung (Stripe)


A bug in index handling causes ZStandardCompressor.c to pass a malformed 
ZSTD_inBuffer to libzstd. libzstd then returns an "Error (generic)" that gets 
thrown as a java.lang.InternalError. The crux of the issue is two variables, 
uncompressedDirectBufLen and uncompressedDirectBufOff. The Hadoop code counts 
uncompressedDirectBufOff from the start of uncompressedDirectBuf, but counts 
uncompressedDirectBufLen from uncompressedDirectBufOff. libzstd, however, 
considers pos and size to both be counted from the start of the buffer. As a 
result, this line 
https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L228
causes a malformed buffer, with pos > size, to be passed to libzstd. Here's a 
longer description of the bug in case this abstract explanation is unclear:

----

Suppose we initialize uncompressedDirectBuf (via setInputFromSavedData) with 
five bytes of input. This results in uncompressedDirectBufOff=0 and 
uncompressedDirectBufLen=5 
(https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.java#L140-L146).

Then we call compress(), which initializes a ZSTD_inBuffer 
(https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L195-L196).
 The definition of those libzstd structs is here 
https://github.com/facebook/zstd/blob/v1.3.1/lib/zstd.h#L251-L261 - note that 
we set size=uncompressedDirectBufLen and pos=uncompressedDirectBufOff. The 
ZSTD_inBuffer gets passed to libzstd, compression happens, etc. When libzstd 
returns from the compression function, it updates the ZSTD_inBuffer struct to 
indicate how many bytes were consumed 
(https://github.com/facebook/zstd/blob/v1.3.1/lib/compress/zstd_compress.c#L3919-L3920).
 Note that pos is advanced, but size is unchanged.
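
For reference, here is the ZSTD_inBuffer definition from the linked v1.3.1 
header (the inline comments are libzstd's own, and they state the invariant 
directly):

{code:c}
typedef struct ZSTD_inBuffer_s {
    const void* src;    /**< start of input buffer */
    size_t size;        /**< size of input buffer */
    size_t pos;         /**< position where reading stopped. Will be updated.
                         *   Necessarily 0 <= pos <= size */
} ZSTD_inBuffer;
{code}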

Now, libzstd does not guarantee that the entire input will be compressed in a 
single call of the compression function. (Some of the compression libraries 
used by Hadoop, such as snappy, _do_ provide this guarantee, but libzstd is 
not one of them.) So the Hadoop native code updates uncompressedDirectBufOff 
and uncompressedDirectBufLen using the updated ZSTD_inBuffer: 
https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L227-L228
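
Paraphrasing those two lines (the real code writes the values back into the 
Java object via JNI, which I've elided here), the update amounts to:

{code:c}
/* Paraphrase of the index update at ZStandardCompressor.c L227-L228;
 * input is the ZSTD_inBuffer that libzstd just updated. */
uncompressedDirectBufOff = input.pos;               /* counted from buffer start */
uncompressedDirectBufLen = input.size - input.pos;  /* counted from the new offset */
{code}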

Returning to our example: we started with 5 bytes of uncompressed input. 
Suppose libzstd compressed 4 of those bytes, leaving one unread. This would 
result in a ZSTD_inBuffer struct with size=5 (unchanged) and pos=4 (four bytes 
were consumed). The Hadoop native code would then set 
uncompressedDirectBufOff=4, but it would also set uncompressedDirectBufLen=1 
(five minus four equals one).

Since some of the input was not consumed, we will eventually call compress() 
again. Then we instantiate another ZSTD_inBuffer struct, this time with size=1 
and pos=4. This is a bug - libzstd expects size and pos to both be counted 
from the start of the buffer, so pos > size is unsound. libzstd therefore 
returns an error 
https://github.com/facebook/zstd/blob/v1.3.1/lib/compress/zstd_compress.c#L3932 
which the Java wrapper throws as a java.lang.InternalError.
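
The malformed struct can be demonstrated against libzstd directly, without 
involving Hadoop at all. A minimal sketch, assuming the zstd 1.3.1 streaming 
API (newer versions reject the same input but may report a different error 
name):

{code:c}
#include <stdio.h>
#include <zstd.h>

int main(void) {
    ZSTD_CStream* cs = ZSTD_createCStream();
    ZSTD_initCStream(cs, 1 /* compression level */);

    const char src[] = "abcde";
    char dst[128];

    /* Mimic the Hadoop state after a partial consume: size counted from
     * the offset (1 byte remaining) but pos counted from the start of the
     * buffer (4 bytes consumed), so pos > size. */
    ZSTD_inBuffer input = { src, 1, 4 };
    ZSTD_outBuffer output = { dst, sizeof(dst), 0 };

    size_t ret = ZSTD_compressStream(cs, &output, &input);
    if (ZSTD_isError(ret)) {
        /* On 1.3.1 this should print "Error (generic)" */
        printf("zstd error: %s\n", ZSTD_getErrorName(ret));
    }
    ZSTD_freeCStream(cs);
    return 0;
}
{code}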

I will be submitting a pull request on GitHub with a fix for this bug. The key 
is that the Hadoop code should handle offsets the same way libzstd does, i.e. 
uncompressedDirectBufLen should be counted from the start of 
uncompressedDirectBuf, not from uncompressedDirectBufOff.
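
To sketch the idea (illustrative names, not the actual patch): with both 
indices counted from the start of the buffer, the length no longer shrinks as 
the offset advances, and libzstd's 0 <= pos <= size invariant holds on every 
call:

{code:c}
#include <assert.h>
#include <stddef.h>

/* Stand-in for libzstd's ZSTD_inBuffer bookkeeping, to contrast the buggy
 * and fixed index updates. Names are illustrative. */
typedef struct { size_t size; size_t pos; } in_buf;

int main(void) {
    in_buf input = { 5, 4 };  /* libzstd consumed 4 of 5 bytes */

    /* Buggy update: length counted from the offset, so the next call
     * passes pos=4 > size=1. */
    size_t buggy_off = input.pos;               /* 4 */
    size_t buggy_len = input.size - input.pos;  /* 1 */
    assert(buggy_off > buggy_len);

    /* Fixed update: both counted from the start of the buffer, matching
     * libzstd's convention; the length stays 5 for the whole stream. */
    size_t fixed_off = input.pos;   /* 4 */
    size_t fixed_len = input.size;  /* 5 */
    assert(fixed_off <= fixed_len);
    return 0;
}
{code}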


