Partition Pruning in Apache Drill

sreeparna bhabani Mon, 04 May 2020 09:01:20 -0700

Hi Team,

I have a question about partition pruning. We are using partition pruning
in Apache Drill for our current project and would like to confirm that it
is working as intended. The details of the scenario are below.


File Type-
Parquet generated from Python

Folder structure in HDFS-
/<root_folder>/<dir0>/<dir1>/<dir2>

Query used to select data under <dir2>, written to take advantage of
partition pruning-
select column1, column2, ... from dfs.`tmp`.`<root_folder>` where dir0 =
<dir0> and dir1 = <dir1> and dir2 = <dir2> and <filter> = ..;

Observation-
Although the execution is fast, the planning time is quite high.
I did not see a VALUES operator in the physical plan of the query; there
was a SCAN operator instead.
How can we confirm that partition pruning is applied to the selected data
here?
As an alternative, to bring down the planning time, I modified the query so
that the sub-directories are included in the path after the root directory.
The modified query is-
select column1, column2, ... from
dfs.`tmp`.`<root_folder>/<dir0>/<dir1>/<dir2>`  where <filter> = ..;
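
For reference, the physical plan mentioned above can be obtained by
prefixing a query with EXPLAIN PLAN FOR. A minimal sketch for the first
query follows; the directory values '2020', '05', '04' and the filter
column3 = 'x' are hypothetical placeholders, not our actual values.

```sql
-- Hypothetical values stand in for <dir0>, <dir1>, <dir2> and <filter>.
explain plan for
select column1, column2
from dfs.`tmp`.`<root_folder>`
where dir0 = '2020' and dir1 = '05' and dir2 = '04' and column3 = 'x';
```

Comparing the Scan entry of this plan with that of the modified query
should show whether both read the same set of files.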

Can you please tell me why the planning time is so high for the first
query? How can we take advantage of partition pruning with it? Or should
we include the sub-directories in the path, as in the modified query?

Thanks in advance.

*Sreeparna Bhabani*
