No, it only records the start offset, so it doesn't matter how many compression chunks are in the row group.
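
As a rough illustration of the layout described in the spec excerpt quoted below, here is a minimal Python sketch that peels one stream's positions off the concatenated list in a RowIndexEntry. The names `SeekPoint` and `take_stream_positions` are hypothetical, not part of any ORC library, and the sketch assumes the plain RLE case from the excerpt (2 numbers per uncompressed stream, 3 per compressed stream). Other encodings may record a different number of positions.

```python
from typing import List, NamedTuple, Sequence, Tuple


class SeekPoint(NamedTuple):
    """Hypothetical container for one stream's positions (not an ORC API)."""
    start_offset: int        # byte offset of the compression chunk (or RLE run) in the stream
    decompressed_bytes: int  # decompressed bytes to consume inside the chunk; 0 if uncompressed
    rle_values: int          # values to consume from the RLE run


def take_stream_positions(
    positions: Sequence[int], compressed: bool
) -> Tuple[SeekPoint, List[int]]:
    """Split off the positions of the next stream and return the remainder.

    Assumes the simple RLE layout: (offset, values) when uncompressed,
    (chunk offset, decompressed bytes, values) when compressed.
    """
    width = 3 if compressed else 2
    if len(positions) < width:
        raise ValueError(f"expected at least {width} positions, got {len(positions)}")
    head, rest = positions[:width], list(positions[width:])
    if compressed:
        point = SeekPoint(head[0], head[1], head[2])
    else:
        point = SeekPoint(head[0], 0, head[1])
    return point, rest


# Made-up positions for a column with two compressed streams, concatenated.
entry_positions = [4096, 120, 7, 8192, 0, 3]
first, remaining = take_stream_positions(entry_positions, compressed=True)
second, remaining = take_stream_positions(remaining, compressed=True)
```

Each row group contributes exactly one such seek point per stream, which is why the number of compression chunks the row group spans does not show up in the index.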

On Tue, Feb 7, 2023 at 2:01 PM Xinyu Z <xzen...@gmail.com> wrote:

> I see. So for example, if one row group of a compressed stream spans
> two compression chunks, the positions in that RowIndex are [(byte
> offset of chunk1, decompressed size, # of values), (byte offset of
> chunk2, decompressed size, # of values)]. Is that correct?
>
> On Tue, Feb 7, 2023 at 1:04 PM Gang Wu <ust...@gmail.com> wrote:
> >
> > Not exactly. It starts with the byte offset of the compression chunk and
> > appends offset to values in the chunk based on the encoding type.
> >
> > I copied the description from the specs as below:
> >
> > To record positions, each stream needs a sequence of numbers. For
> > uncompressed streams, the position is the byte offset of the RLE run’s
> > start location followed by the number of values that need to be consumed
> > from the run. In compressed streams, the first number is the start of the
> > compression chunk in the stream, followed by the number of decompressed
> > bytes that need to be consumed, and finally the number of values consumed
> > in the RLE.
> >
> > For columns with multiple streams, the sequences of positions in each
> > stream are concatenated. That was an unfortunate decision on my part that
> > we should fix at some point, because it makes code that uses the indexes
> > error-prone.
> >
> > Because dictionaries are accessed randomly, there is not a position to
> > record for the dictionary and the entire dictionary must be read even if
> > only part of a stripe is being read.
> >
> > More details can be found here:
> > https://orc.apache.org/specification/ORCv1/
> >
> > Best,
> > Gang
> >
> > On Tue, Feb 7, 2023 at 11:57 AM Xinyu Z <xzen...@gmail.com> wrote:
> >>
> >> Hi Gang,
> >>
> >> Thanks for your reply.
> >> A follow up question on Row Index, what is the exact meaning of
> >> 'position' in RowIndexEntry? Is it the byte offset of the starting
> >> position of the first compression chunk of that row group?
> >>
> >> On Thu, Feb 2, 2023 at 4:40 PM Gang Wu <ust...@gmail.com> wrote:
> >> >
> >> > Hi Xinyu,
> >> >
> >> > Sorry I am not sure about that.
> >> >
> >> > You may be interested in the implementation of Apache Impala.
> >> >
> >> > Best,
> >> > Gang
> >> >
> >> > On Thu, Feb 2, 2023 at 4:05 PM Xinyu Z <xzen...@gmail.com> wrote:
> >> >>
> >> >> Hi Gang, do you know any upstream system that uses ORC C++ and does
> >> >> vectorized predicate evaluation on the resulting ColumnVectorBatch
> >> >> produced by C++ reader with PPD?
> >> >>
> >> >> On Thu, Jan 19, 2023 at 5:46 PM Xinyu Z <xzen...@gmail.com> wrote:
> >> >> >
> >> >> > Hi Gang,
> >> >> >
> >> >> > Thanks for your reply! It helps.
> >> >> >
> >> >> > Xinyu
> >> >> >
> >> >> > On Wed, Jan 18, 2023 at 10:42 AM Gang Wu <ust...@gmail.com> wrote:
> >> >> > >
> >> >> > > Hi Xinyu,
> >> >> > >
> >> >> > > The C++ library does not provide lazy materialization. The java
> >> >> > > library supports row level filtering, please check it if interested:
> >> >> > > https://issues.apache.org/jira/browse/ORC-577
> >> >> > >
> >> >> > > With regards to the IO magnification introduced by PPD, I think
> >> >> > > we have discussed this earlier and there is a pending work item:
> >> >> > > https://issues.apache.org/jira/browse/ORC-1264
> >> >> > >
> >> >> > > Best,
> >> >> > > Gang
> >> >> > >
> >> >> > > On Mon, Jan 16, 2023 at 5:41 PM Xinyu Z <xzen...@gmail.com> wrote:
> >> >> > >>
> >> >> > >> Hi,
> >> >> > >>
> >> >> > >> I know that in ORC with SearchArguments and row index, we can skip
> >> >> > >> reading and decoding row groups that are out of the range of
> >> >> > >> predicate. But does ORC have late materialization functionality?
> >> >> > >> Basically after decoding and evaluating the predicate column(s), we
> >> >> > >> can only read and decode the row groups of projection columns where
> >> >> > >> the matching rows reside. This can further reduce IO and decoding
> >> >> > >> overhead. It seems the C++ version does not have this. I am asking
> >> >> > >> because parquet-rs recently add this:
> >> >> > >> https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
> >> >> > >>
> >> >> > >> Another question is about row index. Since each row group is logically
> >> >> > >> 10000 rows and may not align with CompressionChunk boundaries, does
> >> >> > >> this cause issue for predicate pushdown? E.g, even we can skip one row
> >> >> > >> group, we may still need to do IO on the boundary CompressionChunks.
> >> >> > >>
> >> >> > >> Thanks a lot,
> >> >> > >> Xinyu
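
The boundary concern raised in the oldest message above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each row group's index entry is reduced to a hypothetical tuple `(chunk_start, decompressed_bytes, rle_values)` for a single compressed stream, and `read_plan` is a made-up helper, not anything from the ORC C++ or Java readers. Because the index records only where a row group starts, where it ends has to be inferred from the next row group's entry.

```python
from typing import List, Optional, Tuple

# Hypothetical per-row-group seek point for one compressed stream:
# (chunk_start, decompressed_bytes, rle_values)
SeekPoint = Tuple[int, int, int]


def starts_mid_chunk(point: SeekPoint) -> bool:
    """True if the row group begins part-way into its compression chunk."""
    _, decompressed_bytes, rle_values = point
    return decompressed_bytes > 0 or rle_values > 0


def read_plan(
    seek_points: List[SeekPoint], row_group: int
) -> Tuple[int, Optional[int], bool]:
    """Return (first_byte, next_chunk_start, needs_boundary_chunk).

    next_chunk_start is None for the last row group (read to end of stream).
    needs_boundary_chunk is True when the next row group starts mid-chunk,
    so that chunk must also be read and decompressed for this row group.
    """
    first_byte = seek_points[row_group][0]
    if row_group + 1 == len(seek_points):
        return first_byte, None, False
    nxt = seek_points[row_group + 1]
    return first_byte, nxt[0], starts_mid_chunk(nxt)


# Made-up index: row groups 0 and 1 share the chunk at offset 0,
# row groups 2 and 3 share the chunk at offset 4096.
index = [(0, 0, 0), (0, 5000, 12), (4096, 0, 0), (4096, 300, 0)]
plans = [read_plan(index, i) for i in range(len(index))]
```

In this made-up index, skipping row group 2 while keeping row group 3 saves no IO on the chunk at offset 4096, since both row groups start inside it.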