[ https://issues.apache.org/jira/browse/HADOOP-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17191286#comment-17191286 ]
Michael South edited comment on HADOOP-15171 at 9/6/20, 3:18 PM:
-----------------------------------------------------------------

Issue should be closed, unfounded.

The Hive Orc driver creates a decompression object and repeatedly calls it to inflate Orc blocks. It's treating each block as an entirely separate chunk (stream), completely decompressing each with one call to ...{{_inflateBytesDirect()}}. However, it wasn't calling {{inflateReset()}} or {{inflateEnd()}} / {{inflateInit()}} between the streams, which naturally left things in a confused state. It appears to be fixed in trunk Hive.

Also, returning 0 for {{Z_BUF_ERROR}} or {{Z_NEED_DICT}} is correct, and should not throw an error. The Java decompression object is agnostic as to whether the application is working in stream or all-at-once mode. The only thing that determines which mode is active is whether the application (the Hive Orc driver in this case) passes the entire input in one chunk and allocates sufficient space for all of the output. Therefore, the application must check for a zero return. If no progress (a zero return) is an impossible situation, it can throw an exception; otherwise it needs to look at one or more of ...{{_finished()}}, ...{{_getRemaining()}}, and/or ...{{_needDict()}} to figure out what's needed to make further progress. (It would be nice if JNI exposed the {{avail_out}} field, but if it's not an input or dictionary issue it must be a full output buffer.)

There *is* a very minor bug in ...{{inflateBytesDirect()}}. It's calling {{inflate()}} with {{Z_PARTIAL_FLUSH}}, which only applies to {{deflate()}}. It should be {{Z_NO_FLUSH}}. However, in the current zlib code (1.2.11) the {{flush}} parameter only affects the return code, and {{inflate()}} only checks whether or not it is {{Z_FINISH}}.

Edit: The zlib docs (excellent overall) kind of assume you realize that the internals, allocated in ...{{init()}}, are only valid for one stream run.
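To make the reuse problem concrete, here is a minimal sketch. It uses the JDK's {{java.util.zip.Inflater}} rather than Hadoop's native decompressor, purely so that it is self-contained; that substitution is an assumption, and the class {{InflaterReuseDemo}} and its helper methods are hypothetical names invented for the illustration. It feeds two independent zlib streams to one decompressor object, first without and then with a {{reset()}} in between, and uses the JDK's {{finished()}}, {{needsDictionary()}} and {{needsInput()}} to explain a zero return.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class; not part of Hadoop, Hive, or ORC.
final class InflaterReuseDemo {

    /** Compresses data into one complete, self-contained zlib stream. */
    static byte[] compress(byte[] data) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(data);
            deflater.finish();
            byte[] buf = new byte[data.length + 64]; // ample for the short inputs used here
            int n = deflater.deflate(buf);
            if (!deflater.finished()) {
                throw new IllegalStateException("output buffer too small");
            }
            return Arrays.copyOf(buf, n);
        } finally {
            deflater.end();
        }
    }

    /** One all-at-once inflate call; returns the number of bytes produced (possibly 0). */
    static int inflateOnce(Inflater inflater, byte[] compressed, byte[] out) {
        inflater.setInput(compressed);
        try {
            return inflater.inflate(out);
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt stream", e);
        }
    }

    /** Explains a zero return by querying the decompressor state. */
    static String diagnoseZero(Inflater inflater) {
        if (inflater.finished()) {
            return "FINISHED";          // previous stream ended; reset() before the next one
        }
        if (inflater.needsDictionary()) {
            return "NEEDS_DICTIONARY";  // caller must supply a preset dictionary
        }
        if (inflater.needsInput()) {
            return "NEEDS_INPUT";       // all supplied input has been consumed
        }
        return "OUTPUT_FULL";           // none of the above, so the output buffer is the limit
    }

    public static void main(String[] args) {
        byte[] first = compress("first block".getBytes(StandardCharsets.UTF_8));
        byte[] second = compress("second block".getBytes(StandardCharsets.UTF_8));
        byte[] out = new byte[64];

        Inflater inflater = new Inflater();
        try {
            int n1 = inflateOnce(inflater, first, out);
            System.out.println(new String(out, 0, n1, StandardCharsets.UTF_8));

            // No reset: the object still holds the end-of-stream state of the first block.
            int stale = inflateOnce(inflater, second, out);
            System.out.println(stale + " bytes, state: " + diagnoseZero(inflater));

            // Reset between streams, then retry the same input.
            inflater.reset();
            int n2 = inflateOnce(inflater, second, out);
            System.out.println(new String(out, 0, n2, StandardCharsets.UTF_8));
        } finally {
            inflater.end();
        }
    }
}
```

The order of checks in {{diagnoseZero()}} follows the reasoning above: once end-of-stream, dictionary and input are ruled out, a full output buffer is the only remaining explanation. Whether Hadoop's native decompressor reports exactly the same states is not something this sketch verifies.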
The docs *do* say that ...{{end()}} deallocates these internal buffers; therefore, to reuse the base compressor / decompressor you need to call ...{{init()}} again to reallocate them (otherwise NPE). The docs also state that ...{{reset()}} is equivalent to calling ...{{end()}} followed by ...{{init()}}, except that it resets the internals instead of deallocating and reallocating them.

> native ZLIB decompressor produces 0 bytes on the 2nd call; also incorrectly
> handles some zlib errors
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-15171
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15171
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 3.1.0
>            Reporter: Sergey Shelukhin
>            Assignee: Lokesh Jain
>            Priority: Blocker
>
> While reading an ORC file via direct buffers, Hive gets a 0-sized buffer
> for a particular compressed segment of the file. We narrowed it down to
> the Hadoop native ZLIB codec; when the data is copied to a heap-based buffer
> and the JDK Inflater is used, it produces correct output. Input is only 127
> bytes so I can paste it here.
> All the other (many) blocks of the file are decompressed without problems by
> the same code.
> {noformat}
> 2018-01-13T02:47:40,815 TRACE [IO-Elevator-Thread-0
> (1515637158315_0079_1_00_000000_0)] encoded.EncodedReaderImpl: Decompressing
> 127 bytes to dest buffer pos 524288, limit 786432
> 2018-01-13T02:47:40,816 WARN [IO-Elevator-Thread-0
> (1515637158315_0079_1_00_000000_0)] encoded.EncodedReaderImpl: The codec has
> produced 0 bytes for 127 bytes at pos 0, data hash 1719565039: [e3 92 e1 62
> 66 60 60 10 12 e5 98 e0 27 c4 c7 f1 e8 12 8f 40 c3 7b 5e 89 09 7f 6e 74 73 04
> 30 70 c9 72 b1 30 14 4d 60 82 49 37 bd e7 15 58 d0 cd 2f 31 a1 a1 e3 35 4c fa
> 15 a3 02 4c 7a 51 37 bf c0 81 e5 02 12 13 5a b6 9f e2 04 ea 96 e3 62 65 b8 c3
> b4 01 ae fd d0 72 01 81 07 87 05 25 26 74 3c 5b c9 05 35 fd 0a b3 03 50 7b 83
> 11 c8 f2 c3 82 02 0f 96 0b 49 34 7c fa ff 9f 2d 80 01 00
> 2018-01-13T02:47:40,816 WARN [IO-Elevator-Thread-0
> (1515637158315_0079_1_00_000000_0)] encoded.EncodedReaderImpl: Fell back to
> JDK decompressor with memcopy; got 155 bytes
> {noformat}
> Hadoop version is based on 3.1 snapshot.
> The size of libhadoop.so is 824403 bytes, and libgplcompression is 78273
> FWIW. Not sure how to extract versions from those.