Yes, you are correct. BTW, I have created a JIRA to follow up on this case: https://issues.apache.org/jira/projects/ORC/issues/ORC-1264

Best,
Gang

On Tue, Sep 6, 2022 at 12:01 AM Xinyu Z <xzen...@gmail.com> wrote:

> Hi Gang,
>
> Thanks for your clear explanation. Basically too small RowIndexStride
> will not benefit because 1. extra protobuf deserialization overhead.
> 2. Not benefit I/O because row group is smaller than compression
> block.
>
> A follow up question, if a compression block overlaps with two row
> groups and both row groups survive the PPD, the compression block will
> be read and decompressed only once right?
>
> On Mon, Sep 5, 2022 at 11:31 PM Gang Wu <gan...@apache.org> wrote:
> >
> > Hi Xinyu,
> >
> > When the row group stride is set to 100, we end up with many row groups
> > and each contributes a protobuf object in the stripe index. That's why you
> > see the most expensive function is loadStripeIndex().
> >
> > I need to say that smaller row groups may not help reduce the I/Os since
> > the compression blocks by design are not aligned to the row group boundary.
> > For example, if we have one compression block containing 5 row groups and
> > only the 3rd row group survives the PPD, we still need the I/O of the
> > entire compressed block and decompress the two row groups before the 3rd
> > one.
> >
> > Hope my answer helps.
> >
> > Best,
> > Gang
> >
> > On Mon, Sep 5, 2022 at 4:15 PM Xinyu Z <xzen...@gmail.com> wrote:
> >>
> >> Hi community,
> >>
> >> I am using ORC C++ with filter pushdown (using similar approaches in
> >> TestPredicatePushdown.cc). By varying rowIndexStride, I found that for
> >> a low selectivity query, which means smaller rowIndexStride should
> >> eliminate more IO, the scan time even goes up. This typically happens
> >> when rowIndexStride is below 1000.
> >>
> >> A simple perf profiling shows that for an extreme case where I set
> >> rowIndexStride=100, the time cost is from loadStripeIndex(). I was
> >> wondering why? Is this because of the cost of protobuf parsing of a
> >> lot of indexes?
> >>
> >> Thanks a lot,
> >> Xinyu
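
The alignment point discussed in this thread can be illustrated with a toy model. The sketch below is not ORC's reader code and uses no ORC API: the function name blocksToRead and the parameters bytesPerRow and blockSize are hypothetical, and it assumes fixed-size rows and fixed-size compression blocks measured in uncompressed bytes, which real ORC streams do not guarantee. It only shows which blocks a set of surviving row groups would touch, with each block counted once.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Toy model (hypothetical, not part of the ORC library): every row occupies
// bytesPerRow uncompressed bytes and the stream is cut into compression
// blocks of blockSize uncompressed bytes, ignoring row group boundaries.
// Returns the indices of the blocks that must be read and decompressed.
std::set<uint64_t> blocksToRead(const std::vector<uint64_t>& survivingRowGroups,
                                uint64_t rowIndexStride,
                                uint64_t bytesPerRow,
                                uint64_t blockSize) {
  assert(rowIndexStride > 0 && bytesPerRow > 0 && blockSize > 0);
  const uint64_t groupBytes = rowIndexStride * bytesPerRow;
  std::set<uint64_t> blocks;  // a set, so a shared block is counted once
  for (uint64_t rg : survivingRowGroups) {
    const uint64_t firstByte = rg * groupBytes;
    const uint64_t lastByte = firstByte + groupBytes - 1;
    // A row group may start and end in the middle of a block.
    for (uint64_t b = firstByte / blockSize; b <= lastByte / blockSize; ++b) {
      blocks.insert(b);
    }
  }
  return blocks;
}
```

Under these assumptions, with rowIndexStride = 100, bytesPerRow = 8 and blockSize = 4000, one block holds 5 row groups, so blocksToRead({2}, 100, 8, 4000) still returns the whole block 0 even though only the 3rd row group survives. With blockSize = 1000, row groups 0 and 1 both overlap block 0, and blocksToRead({0, 1}, 100, 8, 1000) lists that block a single time.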