[ https://issues.apache.org/jira/browse/ORC-614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Norbert Luksa reassigned ORC-614:
---------------------------------

    Assignee: Norbert Luksa

> Implement efficient seek() in decompression streams
> ---------------------------------------------------
>
>                 Key: ORC-614
>                 URL: https://issues.apache.org/jira/browse/ORC-614
>             Project: ORC
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Csaba Ringhofer
>            Assignee: Norbert Luksa
>            Priority: Major
>
> The current implementation of
> ZlibDecompressionStream/BlockDecompressionStream::seek resets the state of
> the decompressor and the underlying file reader and throws away their
> buffers. The buffers can still hold usable data in the following cases:
> 1. If the new row group's start position is in the same compressed chunk we
> were reading, then the seek only jumps to another position within the same
> uncompressed buffer, so both the original compressed buffer and the
> decompressed buffer can be reused. This is a very common scenario with the
> default ORC configs of unaligned chunks of up to 256 KB and 10K-row row
> groups, e.g. a chunk can contain 3 full row groups of 8-byte ints without
> any encoding (10,000 rows x 8 bytes = 80 KB per row group).
> 2. If the new row group's start position is in another compressed chunk, but
> that chunk starts in the current compressed buffer (because we have read
> ahead during file reading), then the compressed buffer can be kept and only
> the uncompressed buffer needs to be dropped. This is the usual case in Apache
> Impala, which uses an 8 MB block size, so for typical columns the whole
> stream is read into the buffer.
> The lack of these optimizations led to a regression during the testing of
> https://github.com/apache/orc/pull/476, which uses seek() when a row group is
> skipped due to predicate push down, as every seek caused the whole stream to
> be read again.
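
The two cases above could be told apart with a small decision function. The following is only a minimal sketch, not the ORC C++ implementation: BufferState, SeekAction and classifySeek are hypothetical names, and it assumes that a seek target is given as the file offset of a compressed chunk plus an offset into that chunk's decompressed bytes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of what a decompression stream holds at seek time.
// None of these names come from the ORC C++ library.
struct BufferState {
  uint64_t compressedBufferStart;  // file offset of the first buffered compressed byte
  uint64_t compressedBufferEnd;    // file offset one past the last buffered compressed byte
  uint64_t currentChunkStart;      // file offset of the chunk that is currently decompressed
  bool hasDecompressedChunk;       // false if nothing has been decompressed yet
  uint64_t decompressedSize;       // number of decompressed bytes available for that chunk
};

enum class SeekAction {
  ReuseBoth,       // case 1: same chunk, keep compressed and decompressed buffers
  KeepCompressed,  // case 2: other chunk inside the read-ahead, drop only decompressed data
  ResetAll         // target is outside the buffers, fall back to a full reset
};

// Decide how much buffered data survives a seek to
// (targetChunkStart, offsetInChunk).
SeekAction classifySeek(const BufferState& s,
                        uint64_t targetChunkStart,
                        uint64_t offsetInChunk) {
  assert(s.compressedBufferStart <= s.compressedBufferEnd);

  // Case 1: the target lies in the chunk that is already decompressed,
  // so the seek is just a pointer move inside the decompressed buffer.
  if (s.hasDecompressedChunk &&
      targetChunkStart == s.currentChunkStart &&
      offsetInChunk <= s.decompressedSize) {
    return SeekAction::ReuseBoth;
  }

  // Case 2: the target chunk starts inside the compressed bytes that were
  // already read ahead, so only the decompressor state must be reset.
  if (targetChunkStart >= s.compressedBufferStart &&
      targetChunkStart < s.compressedBufferEnd) {
    return SeekAction::KeepCompressed;
  }

  // Otherwise nothing buffered is usable; this is what seek() does today
  // for every call, according to the description above.
  return SeekAction::ResetAll;
}
```

A caller would act on the result by adjusting the output pointer for ReuseBoth, by resetting only the decompressor and repositioning inside the compressed buffer for KeepCompressed, and by performing the existing full reset for ResetAll.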