[jira] [Updated] (IMPALA-10602) Intersection of multiple min/max filters when applying to common equi-join columns

Qifan Chen (Jira) Tue, 23 Mar 2021 12:20:04 -0700


     [ https://issues.apache.org/jira/browse/IMPALA-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Qifan Chen updated IMPALA-10602:
--------------------------------
    Description: 
Currently, Impala generates two min/max filters, one from each of the two 
joins, for the following test query.

{quote}select straight_join  count(*)
from store_sales ss, date_dim d1, date_dim d2
where 
ss.ss_sold_time_sk = d1.d_date_sk and
ss.ss_sold_time_sk = d2.d_date_sk;{quote}

{quote}|   00:SCAN HDFS [tpcds_parquet.store_sales ss, RANDOM] |
|      HDFS partitions=1824/1824 files=1824 size=200.94MB |
|      runtime filters: RF001[min_max] -> ss.ss_sold_time_sk, RF003[min_max] -> ss.ss_sold_time_sk, RF000[bloom] -> ss.ss_sold_time_sk, RF002[bloom] -> ss.ss_sold_time_sk |
|      stored statistics: |
|        table: rows=2.88M size=200.94MB |
|        partitions: 1824/1824 rows=2.88M |
|        columns: all |
|      extrapolated-rows=disabled max-scan-range-rows=130.09K |
|      file formats: [PARQUET] |
|      mem-estimate=16.00MB mem-reservation=512.00KB thread-reservation=1 |
|      tuple-ids=0 row-size=4B cardinality=2.88M{quote}

Since both filters are applied to the same equi-join column 
ss.ss_sold_time_sk, it would be more efficient to intersect RF001 and RF003 
into a single filter and apply only the resulting filter.  For example, if the 
range of RF001 is [10, 50] and that of RF003 is [20, 60], then the combined 
filter is [20, 50].
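
The interval arithmetic behind the proposed intersection can be sketched as 
follows.  This is only an illustration, not Impala's implementation: the 
MinMaxFilter class and the intersect function are hypothetical names, and the 
ranges are assumed to be inclusive integer bounds as in the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MinMaxFilter:
    """Hypothetical stand-in for a min/max runtime filter (inclusive range)."""
    lo: int
    hi: int


def intersect(a: MinMaxFilter, b: MinMaxFilter) -> Optional[MinMaxFilter]:
    """Return the overlap of two inclusive ranges, or None if they are disjoint."""
    lo = max(a.lo, b.lo)  # the tighter lower bound wins
    hi = min(a.hi, b.hi)  # the tighter upper bound wins
    if lo > hi:
        return None  # no value can satisfy both filters
    return MinMaxFilter(lo, hi)


# Ranges taken from the example in the description.
rf001 = MinMaxFilter(10, 50)
rf003 = MinMaxFilter(20, 60)
combined = intersect(rf001, rf003)
print(combined)
```

A disjoint pair yields None in this sketch, meaning no row could pass both 
filters.  How such a case would be represented in Impala is not covered by 
this issue.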


  was:
Currently, Impala actually generates two min/max filters from the two joins, 
for a test query as follows.

{quote}select straight_join  count(*)
from store_sales ss, date_dim d1, date_dim d2
where 
ss.ss_sold_time_sk = d1.d_date_sk and
ss.ss_sold_time_sk = d2.d_date_sk;{quote}

{quote}|   00:SCAN HDFS [tpcds_parquet.store_sales ss, RANDOM] |
|      HDFS partitions=1824/1824 files=1824 size=200.94MB |
|      runtime filters: RF001[min_max] -> ss.ss_sold_time_sk, RF003[min_max] -> ss.ss_sold_time_sk, RF000[bloom] -> ss.ss_sold_time_sk, RF002[bloom] -> ss.ss_sold_time_sk |
|      stored statistics: |
|        table: rows=2.88M size=200.94MB |
|        partitions: 1824/1824 rows=2.88M |
|        columns: all |
|      extrapolated-rows=disabled max-scan-range-rows=130.09K |
|      file formats: [PARQUET] |
|      mem-estimate=16.00MB mem-reservation=512.00KB thread-reservation=1 |
|      tuple-ids=0 row-size=4B cardinality=2.88M{quote}

Since the two filters are applied to the same equi-join column 
ss.ss_sold_time_sk, it will be more efficient to intersect RF001 and RF003 into 
a single filter and apply the result one instead.  Assume the range of RF001 is 
[10, 50] and that for RF003 is [20, 60]  then the combined filter will be [20, 
50].  


        Summary: Intersection of multiple min/max filters when applying to 
common equi-join columns  (was: Intersection of multiple min/max filters when 
applying onto common equi-join column)

> Intersection of multiple min/max filters when applying to common equi-join 
> columns
> ----------------------------------------------------------------------------------
>
>                 Key: IMPALA-10602
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10602
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: Qifan Chen
>            Priority: Major


