[ https://issues.apache.org/jira/browse/IMPALA-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Qifan Chen updated IMPALA-10602:
--------------------------------
    Description:
Currently, Impala generates two min/max filters from the two joins, for a test query as follows.
{quote}select straight_join count(*)
from store_sales ss, date_dim d1, date_dim d2
where
ss.ss_sold_time_sk = d1.d_date_sk and
ss.ss_sold_time_sk = d2.d_date_sk;{quote}
{quote}| 00:SCAN HDFS [tpcds_parquet.store_sales ss, RANDOM] |
|    HDFS partitions=1824/1824 files=1824 size=200.94MB |
|    runtime filters: RF001[min_max] -> ss.ss_sold_time_sk, RF003[min_max] -> ss.ss_sold_time_sk, RF000[bloom] -> ss.ss_sold_time_sk, RF002[bloom] -> ss.ss_sold_time_sk |
|    stored statistics: |
|      table: rows=2.88M size=200.94MB |
|      partitions: 1824/1824 rows=2.88M |
|      columns: all |
|    extrapolated-rows=disabled max-scan-range-rows=130.09K |
|    file formats: [PARQUET] |
|    mem-estimate=16.00MB mem-reservation=512.00KB thread-reservation=1 |
|    tuple-ids=0 row-size=4B cardinality=2.88M{quote}
Since the two filters are applied to the same equi-join column ss.ss_sold_time_sk, it would be more efficient to intersect RF001 and RF003 into a single filter and apply the resulting filter instead. For example, if the range of RF001 is [10, 50] and that of RF003 is [20, 60], the combined filter will be [20, 50].

  was:
Currently, Impala actually generates two min/max filters from the two joins, for a test query as follows.
{quote}select straight_join count(*)
from store_sales ss, date_dim d1, date_dim d2
where
ss.ss_sold_time_sk = d1.d_date_sk and
ss.ss_sold_time_sk = d2.d_date_sk;{quote}
{quote}| 00:SCAN HDFS [tpcds_parquet.store_sales ss, RANDOM] |
|    HDFS partitions=1824/1824 files=1824 size=200.94MB |
|    runtime filters: RF001[min_max] -> ss.ss_sold_time_sk, RF003[min_max] -> ss.ss_sold_time_sk, RF000[bloom] -> ss.ss_sold_time_sk, RF002[bloom] -> ss.ss_sold_time_sk |
|    stored statistics: |
|      table: rows=2.88M size=200.94MB |
|      partitions: 1824/1824 rows=2.88M |
|      columns: all |
|    extrapolated-rows=disabled max-scan-range-rows=130.09K |
|    file formats: [PARQUET] |
|    mem-estimate=16.00MB mem-reservation=512.00KB thread-reservation=1 |
|    tuple-ids=0 row-size=4B cardinality=2.88M{quote}
Since the two filters are applied to the same equi-join column ss.ss_sold_time_sk, it will be more efficient to intersect RF001 and RF003 into a single filter and apply the result one instead. Assume the range of RF001 is [10, 50] and that for RF003 is [20, 60] then the combined filter will be [20, 50].

        Summary: Intersection of multiple min/max filters when applying to common equi-join columns  (was: Intersection of multiple min/max filters when applying onto common equi-join column)

> Intersection of multiple min/max filters when applying to common equi-join columns
> ----------------------------------------------------------------------------------
>
>                 Key: IMPALA-10602
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10602
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: Qifan Chen
>            Priority: Major
>
> Currently, Impala generates two min/max filters from the two joins, for a
> test query as follows.
> {quote}select straight_join count(*)
> from store_sales ss, date_dim d1, date_dim d2
> where
> ss.ss_sold_time_sk = d1.d_date_sk and
> ss.ss_sold_time_sk = d2.d_date_sk;{quote}
> {quote}| 00:SCAN HDFS [tpcds_parquet.store_sales ss, RANDOM] |
> |    HDFS partitions=1824/1824 files=1824 size=200.94MB |
> |    runtime filters: RF001[min_max] -> ss.ss_sold_time_sk, RF003[min_max] -> ss.ss_sold_time_sk, RF000[bloom] -> ss.ss_sold_time_sk, RF002[bloom] -> ss.ss_sold_time_sk |
> |    stored statistics: |
> |      table: rows=2.88M size=200.94MB |
> |      partitions: 1824/1824 rows=2.88M |
> |      columns: all |
> |    extrapolated-rows=disabled max-scan-range-rows=130.09K |
> |    file formats: [PARQUET] |
> |    mem-estimate=16.00MB mem-reservation=512.00KB thread-reservation=1 |
> |    tuple-ids=0 row-size=4B cardinality=2.88M{quote}
> Since the two filters are applied to the same equi-join column
> ss.ss_sold_time_sk, it would be more efficient to intersect RF001 and RF003
> into a single filter and apply the resulting filter instead. For example, if
> the range of RF001 is [10, 50] and that of RF003 is [20, 60], the combined
> filter will be [20, 50].
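
To illustrate the proposed intersection, here is a minimal Python sketch. It is not Impala's implementation: the `MinMaxFilter` class and its `intersect`, `is_empty`, and `may_contain` methods are hypothetical names chosen for this example, and it assumes each filter is a closed integer range [lo, hi] on the same column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinMaxFilter:
    """Hypothetical closed-range filter [lo, hi]; not Impala's actual class."""
    lo: int
    hi: int

    def is_empty(self) -> bool:
        # An empty range means no probe-side row can match both joins.
        return self.lo > self.hi

    def intersect(self, other: "MinMaxFilter") -> "MinMaxFilter":
        # The tighter lower bound and the tighter upper bound win.
        return MinMaxFilter(max(self.lo, other.lo), min(self.hi, other.hi))

    def may_contain(self, value: int) -> bool:
        # A row survives only if its value lies inside the range.
        return self.lo <= value <= self.hi


# Ranges taken from the example in the issue description.
rf001 = MinMaxFilter(10, 50)
rf003 = MinMaxFilter(20, 60)
combined = rf001.intersect(rf003)
print(combined)  # MinMaxFilter(lo=20, hi=50)
```

With the combined filter, the scan evaluates one range check per value of ss.ss_sold_time_sk instead of two. If the two ranges do not overlap, the intersection is empty and every row can be rejected.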