No, it only records the start offset. So it doesn't matter how many
compressed chunks in the row group.

On Tue, Feb 7, 2023 at 2:01 PM Xinyu Z <xzen...@gmail.com> wrote:

> I see. So for example, if one row group of a compressed stream spans
> two compression chunks, the positions in that RowIndex are [(byte
> offset of chunk1, decompressed size, # of values), (byte offset of
> chunk2, decompressed size, # of values)]. Is that correct?
>
> On Tue, Feb 7, 2023 at 1:04 PM Gang Wu <ust...@gmail.com> wrote:
> >
> > Not exactly. It starts with the byte offset of the compression chunk and
> appends offset to values in the chunk based on the encoding type.
> >
> > I copied the description from the specs as below:
> >
> > To record positions, each stream needs a sequence of numbers. For
> uncompressed streams, the position is the byte offset of the RLE run’s
> start location followed by the number of values that need to be consumed
> from the run. In compressed streams, the first number is the start of the
> compression chunk in the stream, followed by the number of decompressed
> bytes that need to be consumed, and finally the number of values consumed
> in the RLE.
> >
> > For columns with multiple streams, the sequences of positions in each
> stream are concatenated. That was an unfortunate decision on my part that
> we should fix at some point, because it makes code that uses the indexes
> error-prone.
> >
> > Because dictionaries are accessed randomly, there is not a position to
> record for the dictionary and the entire dictionary must be read even if
> only part of a stripe is being read.
> >
> > More details can be found here:
> https://orc.apache.org/specification/ORCv1/
> >
> > Best,
> > Gang
> >
> > On Tue, Feb 7, 2023 at 11:57 AM Xinyu Z <xzen...@gmail.com> wrote:
> >>
> >> Hi Gang,
> >>
> >> Thanks for your reply.
> >> A follow up question on Row Index, what is the exact meaning of
> >> 'position' in RowIndexEntry? Is it the byte offset of the starting
> >> position of the first compression chunk of that row group?
> >>
> >> On Thu, Feb 2, 2023 at 4:40 PM Gang Wu <ust...@gmail.com> wrote:
> >> >
> >> > Hi Xinyu,
> >> >
> >> > Sorry I am not sure about that.
> >> >
> >> > You may be interested in the implementation of Apache Impala.
> >> >
> >> > Best,
> >> > Gang
> >> >
> >> > On Thu, Feb 2, 2023 at 4:05 PM Xinyu Z <xzen...@gmail.com> wrote:
> >> >>
> >> >> Hi Gang, do you know any upstream system that uses ORC C++ and does
> >> >> vectorized predicate evaluation on the resulting ColumnVectorBatch
> >> >> produced by C++ reader with PPD?
> >> >>
> >> >> On Thu, Jan 19, 2023 at 5:46 PM Xinyu Z <xzen...@gmail.com> wrote:
> >> >> >
> >> >> > Hi Gang,
> >> >> >
> >> >> > Thanks for your reply! It helps.
> >> >> >
> >> >> > Xinyu
> >> >> >
> >> >> > On Wed, Jan 18, 2023 at 10:42 AM Gang Wu <ust...@gmail.com> wrote:
> >> >> > >
> >> >> > > Hi Xinyu,
> >> >> > >
> >> >> > > The C++ library does not provide lazy materialization. The java
> library supports row level filtering, please check it if interested:
> https://issues.apache.org/jira/browse/ORC-577
> >> >> > >
> >> >> > > With regards to the IO magnification introduced by PPD, I think
> we have discussed this earlier and there is a pending work item:
> https://issues.apache.org/jira/browse/ORC-1264
> >> >> > >
> >> >> > > Best,
> >> >> > > Gang
> >> >> > >
> >> >> > > On Mon, Jan 16, 2023 at 5:41 PM Xinyu Z <xzen...@gmail.com>
> wrote:
> >> >> > >>
> >> >> > >> Hi,
> >> >> > >>
> >> >> > >> I know that in ORC with SearchArguments and row index, we can
> skip
> >> >> > >> reading and decoding row groups that are out of the range of
> >> >> > >> predicate. But does ORC have late materialization functionality?
> >> >> > >> Basically after decoding and evaluating the predicate
> column(s), we
> >> >> > >> can only read and decode the row groups of projection columns
> where
> >> >> > >> the matching rows reside. This can further reduce IO and
> decoding
> >> >> > >> overhead. It seems the C++ version does not have this. I am
> asking
> >> >> > >> because parquet-rs recently add this:
> >> >> > >>
> https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
> >> >> > >>
> >> >> > >> Another question is about row index. Since each row group is
> logically
> >> >> > >> 10000 rows and may not align with CompressionChunk boundaries,
> does
> >> >> > >> this cause issue for predicate pushdown? E.g, even we can skip
> one row
> >> >> > >> group, we may still need to do IO on the boundary
> CompressionChunks.
> >> >> > >>
> >> >> > >> Thanks a lot,
> >> >> > >> Xinyu
>

Reply via email to