[ 
https://issues.apache.org/jira/browse/IMPALA-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab updated IMPALA-3841:
---------------------------------
    Labels: complextype nested_types parquet performance  (was: nested_types 
parquet performance)

> Avoid materializing nested collections if top-level predicates already 
> disqualify the row.
> ------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-3841
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3841
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 2.5.0, Impala 2.6.0
>            Reporter: Alexander Behm
>            Priority: Minor
>              Labels: complextype, nested_types, parquet, performance
>
> Today, we fully materialize a row before evaluating the top-level conjuncts 
> when scanning Parquet. This includes materializing nested collections. We 
> should avoid materializing nested collections if top-level conjuncts already 
> discard the row. Our recent move to column-wise materialization makes this 
> improvement feasible (IMPALA-2736).
> To illustrate the problem, consider this query:
> {code}
> select * from customer c, c.orders o where c.id = 10
> {code}
> Even though we have a very selective predicate on the top-level customer, our 
> scanner will still fully materialize all orders of all customers. The 
> non-matches will be filtered, but we still pay the cost of materializing the 
> orders.
> The proposed improvement is to avoid materializing the orders of 
> non-qualifying customers.
> The improvement will require several things:
> * Analyze and separate the top-level conjuncts into those that can be 
> evaluated before materializing the nested collections and those that require 
> nested collections to be materialized. In particular, we need to be careful 
> with our auto-generated !empty() predicates on nested collections.
> * Add a new SkipValues() or similar interface to the Parquet column readers 
> to advance the scanner without actually materializing values. If possible, 
> we should skip entire blocks.
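
The two steps above can be sketched as a toy model. This is not Impala code: Conjunct, ColumnReader, split_conjuncts, skip_values and scan are hypothetical names chosen for illustration, and each nested collection is modeled as a single in-memory value, whereas a real Parquet reader would have to skip every value belonging to the row's collection. The sketch only shows the intended control flow: evaluate the conjuncts that touch no nested column first, and advance the nested readers without materializing anything when the row is already disqualified.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Conjunct:
    """A predicate together with the set of columns it reads (hypothetical)."""
    name: str
    columns: FrozenSet[str]
    evaluate: Callable[[Dict[str, object]], bool]


def split_conjuncts(
    conjuncts: List[Conjunct], nested_columns: FrozenSet[str]
) -> Tuple[List[Conjunct], List[Conjunct]]:
    """Separate conjuncts that need no nested column from those that do."""
    early, late = [], []
    for c in conjuncts:
        (late if c.columns & nested_columns else early).append(c)
    return early, late


class ColumnReader:
    """Toy column reader: a cursor over an in-memory list of values."""

    def __init__(self, values):
        self._values = list(values)
        self._pos = 0
        self.materialized = 0  # counts values actually read, for illustration

    def read_value(self):
        value = self._values[self._pos]
        self._pos += 1
        self.materialized += 1
        return value

    def skip_values(self, n: int) -> None:
        """Advance the cursor without materializing anything."""
        self._pos += n


def scan(scalar_readers, nested_readers, conjuncts, num_rows):
    early, late = split_conjuncts(conjuncts, frozenset(nested_readers))
    result = []
    for _ in range(num_rows):
        row = {name: r.read_value() for name, r in scalar_readers.items()}
        if not all(c.evaluate(row) for c in early):
            # Row is already disqualified: keep nested readers in sync, but
            # do not pay for materializing the collections.
            for r in nested_readers.values():
                r.skip_values(1)
            continue
        for name, r in nested_readers.items():
            row[name] = r.read_value()
        if all(c.evaluate(row) for c in late):
            result.append(row)
    return result


# Mirrors "select * from customer c, c.orders o where c.id = 10".
ids = ColumnReader([9, 10, 11])
orders = ColumnReader([["a"], ["b", "c"], []])
conjuncts = [
    Conjunct("id = 10", frozenset({"id"}), lambda r: r["id"] == 10),
    Conjunct("!empty(orders)", frozenset({"orders"}),
             lambda r: len(r["orders"]) > 0),
]
rows = scan({"id": ids}, {"orders": orders}, conjuncts, num_rows=3)
```

In this toy run the `!empty(orders)` conjunct is classified as one that requires the nested collection, so it is evaluated only for the customer that passes `id = 10`, and only that customer's orders are materialized.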