Timothy Miller created PARQUET-2135:
---------------------------------------
Summary: Performance optimizations: Merged all LittleEndianDataInputStream functionality into ByteBufferInputStream
Key: PARQUET-2135
URL: https://issues.apache.org/jira/browse/PARQUET-2135
Project: Parquet
Issue Type: Improvement
Components: parquet-mr
Affects Versions: 1.12.2
Reporter: Timothy Miller

This PR is all performance optimization. In benchmarking with Trino, we find that query performance improves by 5% to 15%, depending on the query, and that includes all the I/O time from S3.

The main modification is to merge all of the LittleEndianDataInputStream functionality into ByteBufferInputStream, which yields the following benefits:
 * Eliminates extra layers of abstraction and method call overhead
 * Enables the use of intrinsics for readInt, readLong, etc.
 * Makes faster access methods like readFully and skipFully available, without the need for helper functions
 * Reduces some object creation in the performance-critical path

This also includes and enables performance optimizations to:
 * ByteBitPackingValuesReader
 * PlainValuesReader
 * RunLengthBitPackingHybridDecoder

Context:
I've been working on improving Parquet reading performance in Trino, mostly by profiling while running performance benchmarks and TPCDS queries. This PR is a subset of the changes I made that have more than doubled the performance of a lot of TPCDS queries (wall clock time, including the S3 access time). If you are kind enough to accept these changes, I have more I would like to contribute.
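
To illustrate the point about abstraction layers and intrinsics, here is a minimal sketch that contrasts two ways of reading a little-endian int. It is not the parquet-mr code: the class and method names are hypothetical and it uses only standard JDK classes. The first method assembles the value from four single-byte stream reads, roughly what a wrapper stream has to do. The second issues one getInt() on a little-endian ByteBuffer, which is the kind of call a JIT compiler can typically turn into a single memory load.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch only; names are hypothetical and not taken from parquet-mr.
public class LittleEndianReadSketch {

  // Wrapper-style read: four single-byte reads through an InputStream,
  // combined by hand into a little-endian int.
  public static int readIntBytewise(InputStream in) {
    try {
      int b0 = in.read();
      int b1 = in.read();
      int b2 = in.read();
      int b3 = in.read();
      // read() returns -1 at end of stream, so any negative value means EOF.
      if ((b0 | b1 | b2 | b3) < 0) {
        throw new EOFException();
      }
      return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Direct read: the buffer is assumed to be set to LITTLE_ENDIAN by the caller,
  // so a single getInt() both decodes the value and advances the position.
  public static int readIntDirect(ByteBuffer buf) {
    return buf.getInt();
  }

  public static void main(String[] args) {
    byte[] data = {0x78, 0x56, 0x34, 0x12, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF};
    InputStream in = new ByteArrayInputStream(data);
    ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);

    System.out.println(Integer.toHexString(readIntBytewise(in))); // prints 12345678
    System.out.println(Integer.toHexString(readIntDirect(buf)));  // prints 12345678
    System.out.println(readIntBytewise(in) == readIntDirect(buf)); // prints true (both are -1)
  }
}
```

Both methods return the same values; the difference is only in how many calls and branches sit between the caller and the underlying bytes. Any actual speedup depends on the JVM and workload and would need to be measured.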