Hi Vibhath, 

Thanks a lot for the email. 

Unfortunately, Impala does not compute the partition values from a general 
function and partitioning predicates, as demonstrated in your last two queries. 
 

One way to get the plan with partition pruning is through rewriting.  For the 
3rd test query, the query rewritten is as follows.

Query: explain SELECT * FROM TEST_TABLE
WHERE 
(TEST_PART_COL = abs(2000) or TEST_PART_COL = abs(2001)) 
AND (TEST_COL = 2000 OR TEST_COL = 2001)
+------------------------------------------------------------------------------------+
| Explain String                                                                
     |
+------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=8.00KB Threads=3                    
     |
| Per-Host Resource Estimates: Memory=16MB                                      
     |
| WARNING: The following tables are missing relevant table and/or column 
statistics. |
| default.test_table                                                            
     |
|                                                                               
     |
| PLAN-ROOT SINK                                                                
     |
| |                                                                             
     |
| 01:EXCHANGE [UNPARTITIONED]                                                   
     |
| |                                                                             
     |
| 00:SCAN HDFS [default.test_table]                                             
     |
|    partition predicates: TEST_PART_COL IN (2000, 2001)                        
     |
|    HDFS partitions=2/5 files=2 size=824B                                      
     |
|    predicates: TEST_COL IN (2000, 2001)                                       
     |
|    row-size=16B cardinality=26                                                
     |
+------------------------------------------------------------------------------------+


To handle the above rewriting automatically in Impala, an enhancement is 
needed.  Some work has been in the past (such as IMPALA-10064) to propagate 
constants for range predicates. 

Thanks —Qifan




> On Oct 26, 2020, at 3:06 PM, Joe McDonnell <joemcdonn...@cloudera.com> wrote:
> 
> 
> 
> ---------- Forwarded message ---------
> From: Vibhath Ileperuma <vibhatharunapr...@gmail.com 
> <mailto:vibhatharunapr...@gmail.com>>
> Date: Sun, Oct 25, 2020 at 3:48 AM
> Subject: Impala Fails to prune partitions if the partition column value is 
> generated by a function using the values of another column.
> To: <user@impala.apache.org <mailto:user@impala.apache.org>>
> 
> 
> Hi Team,
> 
> I, Vibhath Ileperuma, am an undergraduate, currently doing the internship. I 
> was assigned to evaluate the partition pruning functionality in impala and I 
> noticed that partition pruning does not happen in the following situation. 
> 
> Problem:
> 
> Impala Fails to prune partitions if the partition column value is generated 
> by a function using the values of another column.
> 
> Consider the following example.
> 
> Let’s say we create a table and insert a value as follows
> 
> CREATE TABLE TEST_TABLE (
> 
> TEST_COL BIGINT
> 
> )
> 
> PARTITIONED BY (
> 
> TEST_PART_COL BIGINT
> 
> )
> 
> STORED AS PARQUET
> 
> LOCATION 
> 'hdfs://localhost:11619/x01/vibhath/data/hadoop/dataNode/current/test_table';
> 
>  
> INSERT INTO TEST_TABLE (TEST_COL)
> 
> PARTITION (TEST_PART_COL)
> 
> (1999, 1999), (2000, 2000), (2001, 2001), (2010, 2010), (2018, 2018);
> 
>  
> If we execute a query, generating partition column value directly from other 
> column as follows, pruning happens
> 
>            
> SELECT * FROM TEST_TABLE
> 
> WHERE TEST_PART_COL = TEST_COL
> 
> AND (TEST_COL = 2000 OR TEST_COL = 2001)
> 
> 00:SCAN HDFS [default.test_table, RANDOM]
> 
>    partition predicates: default.test_table.test_part_col IN (CAST(2000 AS 
> BIGINT), CAST(2001 AS BIGINT))
> 
>    HDFS partitions=2/5 files=2 size=794B
> 
>    predicates: TEST_PART_COL = TEST_COL, TEST_COL IN (CAST(2000 AS BIGINT), 
> CAST(2001 AS BIGINT))
> 
>  
> SELECT * FROM TEST_TABLE
> 
> WHERE TEST_PART_COL = TEST_COL
> 
> AND TEST_COL > 1999
> 
> AND TEST_COL <= 2001
> 
> 00:SCAN HDFS [default.test_table, RANDOM]
>    partition predicates: default.test_table.test_part_col > CAST(1999 AS 
> BIGINT), default.test_table.test_part_col <= CAST(2001 AS BIGINT)
>    HDFS partitions=2/5 files=2 size=794B
>    predicates: TEST_COL <= CAST(2001 AS BIGINT), TEST_COL > CAST(1999 AS 
> BIGINT), TEST_PART_COL = TEST_COL
>  
>  
>  
> But If we execute a query, generating partition column value from a function 
> as follows, pruning doesn’t happen
> 
> SELECT * FROM TEST_TABLE
> 
> WHERE TEST_PART_COL = ABS(TEST_COL)
> 
> AND (TEST_COL = 2000 OR TEST_COL = 2001)
> 
> 00:SCAN HDFS [default.test_table, RANDOM]
>    HDFS partitions=5/5 files=5 size=1.94KB
>    predicates: TEST_COL IN (CAST(2000 AS BIGINT), CAST(2001 AS BIGINT)), 
> TEST_PART_COL = abs(TEST_COL)
>  
> SELECT * FROM TEST_TABLE
> 
> WHERE TEST_PART_COL = ABS(TEST_COL)
> 
> AND TEST_COL > 1999
> 
> AND TEST_COL <= 2001
> 
> 00:SCAN HDFS [default.test_table, RANDOM]
>    HDFS partitions=5/5 files=5 size=1.94KB
>    predicates: TEST_COL <= CAST(2001 AS BIGINT), TEST_COL > CAST(1999 AS 
> BIGINT), TEST_PART_COL = abs(TEST_COL)
>  
>  
> I would be grateful if you could check on this and let me know how to 
> overcome this issue.
> 
> 
> Thank You.
> Best Regards,
> Vibhath.

Reply via email to