Song Jun created SPARK-27227:
--------------------------------

             Summary: Dynamic Partition Prune in Spark
                 Key: SPARK-27227
                 URL: https://issues.apache.org/jira/browse/SPARK-27227
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Song Jun


When we equi-join one big table with a smaller table, we can collect some 
statistics from the smaller table side, and use it to the scan of big table to 
do partition prune or data filter before execute the join.
This can significantly improve SQL perfermance.

For a simple example:
select * from A, B where A.a = B.b
A is big table ,B is small table.

There are two scenarios:
1. A.a is a partition column of table A
   we can collect  all the values  of B.b, and send it to table A to do 
   partition prune on A.a.
2. A.a is not a partition column of table A
  we can collect real-time some statistics(such as min/max/bloomfilter) of B.b 
by execute extra sql(select max(b),min(b),bbf(b) from B), and send it to table 
A to do filter on A.a.
  Addititionaly, if a more complex query select * from A join (select * from B 
where B.c = 1) X on A.a = B.b, then we collect real-time statistics(such as 
min/max/bloomfilter) of X by execute extra sql(select max(b),min(b),bbf(b) from 
X)

Above two scenarios, we can filter out lots of data by partition prune or data 
filter, thus we can imporve perfermance.

10TB TPC-DS  gain about 35%  improvement in our test.

I will submit a SPIP later.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to