Yes, you are correct.
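
A minimal sketch of the point, under a toy model rather than the actual ORC C++ reader code: assume each row group covers a contiguous byte range of the uncompressed stream and that compression blocks have a fixed uncompressed size. The names RowGroupRange and blocksToDecompress are hypothetical and exist only for illustration. The function collects the distinct compression blocks touched by the surviving row groups, so a block shared by two surviving row groups appears once.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical toy model, NOT the ORC C++ reader's real data structures.
struct RowGroupRange {
  uint64_t begin;    // first uncompressed byte of the row group in the stream
  uint64_t end;      // one past the last uncompressed byte
  bool survivesPpd;  // true if predicate pushdown could not skip this row group
};

// Returns the indices of the compression blocks that must be read and
// decompressed, assuming (for illustration only) that every block holds
// exactly `blockSize` uncompressed bytes.
std::set<uint64_t> blocksToDecompress(const std::vector<RowGroupRange>& rowGroups,
                                      uint64_t blockSize) {
  std::set<uint64_t> blocks;
  if (blockSize == 0) {
    return blocks;  // degenerate input, nothing to compute
  }
  for (const auto& rg : rowGroups) {
    if (!rg.survivesPpd || rg.begin >= rg.end) {
      continue;  // skipped by PPD, or empty range
    }
    const uint64_t first = rg.begin / blockSize;
    const uint64_t last = (rg.end - 1) / blockSize;
    for (uint64_t b = first; b <= last; ++b) {
      blocks.insert(b);  // std::set keeps each shared block only once
    }
  }
  return blocks;
}
```

Under these assumptions, five 100-byte row groups inside one 500-byte block still require that whole block when only the third row group survives. Two surviving 100-byte row groups with 150-byte blocks need blocks 0 and 1, and block 0, which both row groups touch, is counted once.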

BTW, I have created a JIRA issue to follow up on this case:
https://issues.apache.org/jira/projects/ORC/issues/ORC-1264

Best,
Gang

On Tue, Sep 6, 2022 at 12:01 AM Xinyu Z <xzen...@gmail.com> wrote:

> Hi Gang,
>
> Thanks for your clear explanation. So, basically, a RowIndexStride that
> is too small does not help because (1) it adds protobuf
> deserialization overhead, and (2) it does not reduce I/O, since a row
> group is then smaller than a compression block.
>
> A follow-up question: if a compression block overlaps two row groups
> and both row groups survive the PPD, the compression block will be
> read and decompressed only once, right?
>
> On Mon, Sep 5, 2022 at 11:31 PM Gang Wu <gan...@apache.org> wrote:
> >
> > Hi Xinyu,
> >
> > When the row group stride is set to 100, we end up with many row groups,
> > and each contributes a protobuf object in the stripe index. That's why
> > you see the most expensive function is loadStripeIndex().
> >
> > I should add that smaller row groups may not reduce I/O, since
> > compression blocks are by design not aligned to row group boundaries.
> > For example, if one compression block contains 5 row groups and only
> > the 3rd row group survives the PPD, we still need to read the entire
> > compressed block and decompress the two row groups before the 3rd one.
> >
> > Hope my answer helps.
> >
> > Best,
> > Gang
> >
> > On Mon, Sep 5, 2022 at 4:15 PM Xinyu Z <xzen...@gmail.com> wrote:
> >>
> >> Hi community,
> >>
> >> I am using ORC C++ with filter pushdown (using an approach similar to
> >> the one in TestPredicatePushdown.cc). By varying rowIndexStride, I
> >> found that for a low-selectivity query, where a smaller rowIndexStride
> >> should eliminate more I/O, the scan time actually goes up. This
> >> typically happens when rowIndexStride is below 1000.
> >>
> >> Simple perf profiling shows that in an extreme case where I set
> >> rowIndexStride=100, the time is spent in loadStripeIndex(). I was
> >> wondering why. Is this because of the cost of protobuf parsing for a
> >> large number of index entries?
> >>
> >> Thanks a lot,
> >> Xinyu
>
