Timothy Miller created PARQUET-2135:
---------------------------------------

             Summary: Performance optimizations: Merged all 
LittleEndianDataInputStream functionality into ByteBufferInputStream
                 Key: PARQUET-2135
                 URL: https://issues.apache.org/jira/browse/PARQUET-2135
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-mr
    Affects Versions: 1.12.2
            Reporter: Timothy Miller


This PR consists entirely of performance optimizations. In benchmarks with 
Trino, query performance improves by 5% to 15%, depending on the query, and 
that figure includes all the I/O time from S3.

The main modification is to merge all of LittleEndianDataInputStream 
functionality into ByteBufferInputStream, which yields the following benefits:
 * Eliminates extra layers of abstraction and method call overhead
 * Enables the use of intrinsics for readInt, readLong, etc.
 * Makes faster access methods like readFully and skipFully available, without 
the need for helper functions
 * Reduces some object creation in the performance-critical path
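
To illustrate the kind of difference the list above describes, here is a 
minimal sketch. It is not the code from this PR: the class 
LittleEndianReadSketch and its helper methods are hypothetical names invented 
for this example, and only standard java.nio calls are used. It contrasts 
assembling a little-endian int from single-byte reads with a single 
ByteBuffer.getInt() call on a buffer set to little-endian order, and shows how 
readFully and skipFully style operations can map onto plain ByteBuffer calls. 
Whether the JIT actually replaces getInt/getLong with intrinsics depends on 
the JVM and platform.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; not the parquet-mr implementation.
public class LittleEndianReadSketch {

    // Layered style: build the int from four separate single-byte reads.
    static int readIntBytewise(byte[] data, int offset) {
        int b0 = data[offset] & 0xFF;
        int b1 = data[offset + 1] & 0xFF;
        int b2 = data[offset + 2] & 0xFF;
        int b3 = data[offset + 3] & 0xFF;
        return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
    }

    // Direct style: one call on a buffer already set to LITTLE_ENDIAN.
    static int readIntDirect(ByteBuffer buf) {
        return buf.getInt();
    }

    // Same idea for 8-byte values.
    static long readLongDirect(ByteBuffer buf) {
        return buf.getLong();
    }

    // readFully-style bulk copy: fills dst completely, or throws
    // BufferUnderflowException if too few bytes remain.
    static void readFully(ByteBuffer buf, byte[] dst) {
        buf.get(dst);
    }

    // skipFully-style skip: advances the position by n bytes, or throws
    // if fewer than n bytes remain.
    static void skipFully(ByteBuffer buf, int n) {
        if (n < 0 || n > buf.remaining()) {
            throw new BufferUnderflowException();
        }
        buf.position(buf.position() + n);
    }

    public static void main(String[] args) {
        byte[] data = {0x78, 0x56, 0x34, 0x12, 1, 0, 0, 0, 0, 0, 0, 0, 9, 9, 0x2A};
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);

        int viaBytes = readIntBytewise(data, 0);
        int viaBuffer = readIntDirect(buf);
        long nextLong = readLongDirect(buf);
        skipFully(buf, 2);
        byte[] tail = new byte[1];
        readFully(buf, tail);

        System.out.println(Integer.toHexString(viaBytes));
        System.out.println(Integer.toHexString(viaBuffer));
        System.out.println(nextLong);
        System.out.println(tail[0]);
    }
}
```

Both read paths return the same value; the direct path simply does it in one 
call, with no intermediate stream wrapper object between the caller and the 
buffer.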

This also includes and enables performance optimizations to:
 * ByteBitPackingValuesReader
 * PlainValuesReader
 * RunLengthBitPackingHybridDecoder

Context:
I've been working on improving Parquet reading performance in Trino, mostly by 
profiling while running performance benchmarks and TPC-DS queries. This PR is a 
subset of the changes I made, which together have more than doubled the 
performance of many TPC-DS queries (wall clock time, including the S3 access 
time). If you are kind enough to accept these changes, I have more I would 
like to contribute.



