The information is stored in the footer of the Parquet files. Drill reads the metadata stored in the Parquet footer to discover columns that contain a single value within each file and treats them as partitioning columns.
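
For example, a CTAS along the following lines (the workspace, table and column
names are just placeholders) would produce that layout: each Parquet file
written under the table holds a single value of the partition column, so a
later filter on that column can be pruned using only the footer metadata.

-- write: every file under events_by_year carries exactly one event_year value
create table dfs.tmp.`events_by_year` partition by (event_year) as
select event_year, col1, col2 from dfs.`/home/user/raw_events`;

-- read: this filter is resolved against the footer metadata, so only the
-- files holding 2014 are scanned
select col1 from dfs.tmp.`events_by_year` where event_year = 2014;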

Thanks
Mehant

On 10/21/15 11:52 AM, Chris Mathews wrote:
Thanks Mehant; yes we did look at doing this, but the advantage of using the
new PARTITION BY feature is that the partitioned columns are automatically
detected during any subsequent queries.  This is a major advantage as our
customers are using the Tableau BI tool, and expecting them to know details
such as the exact partition levels and directories is not an option.

By the way, having created a table using PARTITION BY and CTAS, how does a
query know how to apply the pruning?  Where is this information stored for
the query to access the tables/files efficiently?

Cheers — Chris
On 21 Oct 2015, at 19:37, Mehant Baid <[email protected]> wrote:

In addition to the auto-partitioning done by CTAS, Drill also supports
directory-based pruning. You could load data into different (nested)
directories underneath the top-level table location and use the 'where'
clause to get the pruning performance benefits. The following is a typical
example:

Table location: /home/user/table_name
Within this you could create nested directory structure of the form
/home/user/table_name/2010/jan
/home/user/table_name/2010/feb
...
/home/user/table_name/2010/dec

/home/user/table_name/2011/jan
...
/home/user/table_name/2011/dec

Given this directory structure, you could have a query that looks like:

select col1 from dfs.`/home/user/table_name` where dir0 = 2011 and dir1 = 'jan';

This would prune out the Parquet files under the other directories, so they
are never scanned.
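
The implicit dir0/dir1 columns can also appear in the select list, and
filtering on just the top level is enough to skip the other years. For
instance (same illustrative layout as above):

-- only the 2011 subdirectories are scanned; dir1 returns the month name
select dir1 as mnth, col1 from dfs.`/home/user/table_name` where dir0 = 2011;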

Thanks
Mehant
On 10/21/15 11:26 AM, Chris Mathews wrote:
We have an existing ETL framework processing machine-generated data, which we
are updating to write Parquet files out directly to HDFS using
AvroParquetWriter, for access by Drill.

Some questions:

How do we take advantage of Drill’s partition pruning capabilities with
PARTITION BY if we are not using CTAS to load the Parquet files?

It seems there is no way of taking advantage of these features if the Parquet
files are created externally to CTAS - am I correct?

If I am, then is there any way, using a Drill API, of programmatically loading
our data into Parquet files and utilising Drill's parallelisation techniques
with CTAS, or do we have to write the data out to a file and then load that
file again as input to a CTAS command?
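
For example, I am assuming that second option would end up looking roughly
like the sketch below, with a CTAS pointed at the Parquet files already
landed in HDFS (the dfs.out workspace, output table name and column names
here are invented):

-- rewrite the externally written files as an auto-partitioned Drill table
create table dfs.out.`table_name_partitioned` partition by (event_date) as
select event_date, col1, col2 from dfs.`/home/user/incoming_parquet`;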

Another potential issue is that we are constantly writing Parquet files out to
HDFS directories, so the data in these files eventually appears as additional
data in a Drill query - how can we do this with CTAS?  Does CTAS append to an
existing directory structure, or does it insist on a new table name each time
it is executed?

What I am getting at here is that there seem to be performance enhancement
features available to Drill when the Parquet files are created using an
existing file as input to a CTAS that are not possible otherwise.  With the
volumes of data we are talking about, it is not really an option to write the
files out only for them to then be read back in again for conversion using
CTAS; which is why we write the Parquet files out directly to HDFS and append
them to existing directories.

Am I missing something obvious here?  Quite possibly, yes.

Thanks for any help.

Cheers — Chris

