[ https://issues.apache.org/jira/browse/IMPALA-4177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Armstrong resolved IMPALA-4177.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.11.0

IMPALA-4177,IMPALA-6039: batched bit reading and rle decoding

Switch the decoders to use more batch-oriented interfaces. As an
intermediate step this doesn't make the interfaces of LevelDecoder or
DictDecoder batch-oriented, only the lower-level utility classes.

The next step would be to change those interfaces to be batch-oriented
and make corresponding optimisations in parquet. This could deliver much
larger perf improvements than the current patch.

The high-level changes are:
* BitReader -> BatchedBitReader, which is built to unpack runs of 32
  bit-packed values efficiently.
* RleDecoder -> RleBatchDecoder, which exposes the repeated and literal
  runs to the caller and uses BatchedBitReader to unpack literal runs
  efficiently.
* Dict decoding uses RleBatchDecoder to decode repeated runs efficiently
  and uses the BitPacking utilities to unpack and encode in a single step.

Also removes an older benchmark that isn't too interesting (since
the batch-oriented approach to encoding and decoding is so much faster
than the value-by-value approach).

Testing:
* Ran core tests.
* Updated unit tests to exercise new code.
* Added test coverage for the deprecated bit-packed level encoding to
  verify that it still works (there was no coverage previously).

Perf:
Single-node benchmarks showed a performance gain of a few percent. 16 node
cluster benchmarks only showed a gain for TPC-H nested.
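
To make the batch-oriented idea above concrete, here is a minimal sketch of
the two pieces: unpacking a run of bit-packed values in one call, and
dictionary-decoding runs that are exposed to the caller as either repeated or
literal. This is not Impala's code. UnpackBits, Run and DecodeDictRuns are
hypothetical names invented for illustration, the run descriptor is a
simplification that skips the RLE header parsing, and the LSB-first bit layout
is assumed to match the Parquet RLE/bit-packed hybrid encoding.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: unpack up to `num_values` values of `bit_width` bits
// each (0 <= bit_width <= 32), packed LSB-first. Returns fewer values than
// requested if the input is truncated.
std::vector<uint32_t> UnpackBits(const std::vector<uint8_t>& in, int bit_width,
                                 size_t num_values) {
  std::vector<uint32_t> out;
  out.reserve(num_values);
  const uint64_t mask = (1ull << bit_width) - 1;
  uint64_t buffer = 0;     // bits read from `in` but not yet consumed
  int bits_in_buffer = 0;  // number of valid low bits in `buffer`
  size_t pos = 0;          // next byte of `in` to read
  for (size_t i = 0; i < num_values; ++i) {
    while (bits_in_buffer < bit_width) {
      if (pos >= in.size()) return out;  // truncated input
      buffer |= static_cast<uint64_t>(in[pos++]) << bits_in_buffer;
      bits_in_buffer += 8;
    }
    out.push_back(static_cast<uint32_t>(buffer & mask));
    buffer >>= bit_width;
    bits_in_buffer -= bit_width;
  }
  return out;
}

// Hypothetical, simplified run descriptor. A real decoder would produce
// these by parsing run headers from the encoded buffer.
struct Run {
  bool repeated;                   // true: one index repeated repeat_count times
  uint32_t repeat_count;           // used only when `repeated` is true
  uint32_t repeat_index;           // used only when `repeated` is true
  std::vector<uint32_t> literals;  // used only when `repeated` is false
};

// Appends the decoded values to `out`. A repeated run costs one dictionary
// lookup and one bulk fill; a literal run is looked up index by index.
// Returns false if any index is out of range for `dict`.
template <typename T>
bool DecodeDictRuns(const std::vector<Run>& runs, const std::vector<T>& dict,
                    std::vector<T>* out) {
  for (const Run& run : runs) {
    if (run.repeated) {
      if (run.repeat_index >= dict.size()) return false;
      out->insert(out->end(), run.repeat_count, dict[run.repeat_index]);
    } else {
      for (uint32_t idx : run.literals) {
        if (idx >= dict.size()) return false;
        out->push_back(dict[idx]);
      }
    }
  }
  return true;
}
```

In this sketch a caller would feed the output of UnpackBits into the literals
field of a Run. The point of exposing runs is that a repeated run of N values
needs a single dictionary lookup rather than N.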

Change-Id: I35de0cf80c86f501c4a39270afc8fb8111552ac6
Reviewed-on: http://gerrit.cloudera.org:8080/8267
Reviewed-by: Tim Armstrong <tarmstr...@cloudera.com>
Tested-by: Impala Public Jenkins
---

> Add batch dictionary/RLE decoding in Parquet
> --------------------------------------------
>
>                 Key: IMPALA-4177
>                 URL: https://issues.apache.org/jira/browse/IMPALA-4177
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>   Affects Versions: Impala 2.8.0
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Minor
>              Labels: perf
>             Fix For: Impala 2.11.0
>
>
> parquet-cpp implemented this optimisation here:
> https://github.com/apache/parquet-cpp/pull/140/commits/3f10378c5fc56c346ce77bf9e9faf011ead9c5e6
> The basic idea is to add a batched interface to DictDecoder and RleDecoder,
> and support passing in a dictionary to RleDecoder. It should then be possible
> to significantly optimise the decoding.
> We should add a microbenchmark for DictDecoder and update the benchmark for
> RleDecoder so we can understand the perf.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)