The page indices should solve a large part of this problem, but I can definitely come up with examples where the page indices aren't sufficient to avoid most materialisation if we have a predicate on an unsorted column.
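
To put rough numbers on one such case, here is a small sketch. The model is my own simplification, not anything taken from Impala or Parquet: a column holding the 50 US state codes, every data page containing rows from exactly two distinct states, and every pair of states equally likely. `page_stats` is a made-up helper name.

```python
from itertools import combinations

# The 50 US state codes, sorted so string comparison mirrors min/max stats.
STATES = sorted(
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD "
    "MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC "
    "SD TN TX UT VT VA WA WV WI WY".split()
)
TARGET = "MI"


def page_stats(states, target):
    """Enumerate every possible two-state page (assumed equally likely).

    Returns (fraction of pages min/max can skip,
             fraction of pages that really contain the target).
    """
    skipped = 0   # target falls outside [min, max], so the page is pruned
    contains = 0  # target is one of the two states on the page
    total = 0
    # combinations() over a sorted list yields pairs with lo < hi,
    # so (lo, hi) is exactly the page's min/max statistic.
    for lo, hi in combinations(states, 2):
        total += 1
        if target < lo or target > hi:
            skipped += 1
        elif target in (lo, hi):
            contains += 1
    return skipped / total, contains / total


skipped_frac, contains_frac = page_stats(STATES, TARGET)
print(f"pages skipped by min/max:    {skipped_frac:.0%}")
print(f"pages that contain {TARGET!r}:     {contains_frac:.0%}")
```

Under those assumptions min/max stats skip 48% of pages, while only 4% of pages hold any matching rows. The remaining 48% are pages whose range merely straddles the target value, and those are the ones where scanning the predicate column first would avoid materialising the other columns.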
E.g. take a predicate on a state column with 50 distinct values (I'm being US-centric):

    select * from sales where state = 'MI'

Suppose there is some locality in the data, so that on average you get 2 states per data page. You're probably only going to be able to filter out ~50% of pages using min-max filters, since 'MI' will lie between the min and max of many pairs of states. Whereas if you scanned the 'state' column first and materialised the other columns lazily, you could filter out a large majority of the data before materialising the other columns.

On Tue, Mar 20, 2018 at 9:20 AM, Alexander Behm <[email protected]> wrote:

> I think we do eventually want to support it. For highly selective queries
> the existing dictionary and min/max filtering can already be very
> effective. In addition, we plan to add indexes for finer-grained page
> pruning. See https://issues.apache.org/jira/browse/IMPALA-5842
>
> After all those improvements, it's not clear what the additional benefit
> of later materialization is going to be in practice.
>
> Do you have a case in mind that specifically requires late materialization
> to work well?
>
> On Tue, Mar 20, 2018 at 12:47 AM, Antoni Ivanov <[email protected]>
> wrote:
>
>> Hi,
>>
>> You can ignore my question, Found the relevant JIRA -
>> https://issues.apache.org/jira/browse/IMPALA-2017 So I guess the answer
>> is not yet.
>>
>> Regards,
>>
>> Antoni
>>
>> *From:* Antoni Ivanov
>> *Sent:* Tuesday, March 20, 2018 9:45 AM
>> *To:* '[email protected]' <[email protected]>
>> *Subject:* Does Impala supports or plan to support Late Materialization
>>
>> I don’t mean partition pruning but as described in
>>
>> https://aws.amazon.com/about-aws/whats-new/2017/12/amazon-redshift-introduces-late-materialization-for-faster-query-processing/
>>
>> It basically pre-fetches first the filter columns and then after applying
>> the filter it fetches only the data from the rest of columns only if filter
>> applies.
>>
>> Thanks
