We create a JSON-format schema for the Parquet file using the Avro specification, and use this schema when loading data.
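For concreteness, here is a minimal sketch of that write path using parquet-avro's builder API; the schema fields, output path, and compression codec below are illustrative assumptions, not our actual schema:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class ParquetWriteSketch {
        public static void main(String[] args) throws Exception {
            // The Avro schema, expressed as JSON (fields here are assumptions).
            String schemaJson =
                "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
              + "{\"name\":\"ts\",\"type\":\"long\"},"
              + "{\"name\":\"device\",\"type\":\"string\"},"
              + "{\"name\":\"reading\",\"type\":\"double\"}]}";
            Schema schema = new Schema.Parser().parse(schemaJson);

            // Write one file directly into the table's directory on HDFS.
            ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("hdfs:///data/events/part-00000.parquet"))
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build();

            GenericRecord rec = new GenericData.Record(schema);
            rec.put("ts", System.currentTimeMillis());
            rec.put("device", "sensor-01");
            rec.put("reading", 42.0);
            writer.write(rec);
            writer.close();
        }
    }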
Is there anything special we have to do to flag a column as a partitioning column? Sorry, I don’t understand your answer. What do you mean by ‘discover the columns with a single value’?

Cheers — Chris

> On 21 Oct 2015, at 20:02, Mehant Baid <[email protected]> wrote:
>
> The information is stored in the footer of the parquet files. Drill reads the
> metadata stored in the parquet footer to discover the columns with a single
> value and treats them as partitioning columns.
>
> Thanks
> Mehant
>
> On 10/21/15 11:52 AM, Chris Mathews wrote:
>> Thanks Mehant; yes, we did look at doing this, but the advantage of using the
>> new PARTITION BY feature is that the partitioned columns are automatically
>> detected during any subsequent queries. This is a major advantage as our
>> customers are using the Tableau BI tool, and knowing details such as the
>> exact partition levels and directories is not an option.
>>
>> By the way, having created a table using PARTITION BY and CTAS, how does a
>> query know how to action the pruning? Where is this information stored for
>> the query to access the tables/files efficiently?
>>
>> Cheers — Chris
>>
>>> On 21 Oct 2015, at 19:37, Mehant Baid <[email protected]> wrote:
>>>
>>> In addition to the auto-partitioning done by CTAS, Drill also supports
>>> directory-based pruning. You could load data into different (nested)
>>> directories underneath the top-level table location and use the 'where'
>>> clause to get the pruning performance benefits. The following is a typical
>>> example.
>>>
>>> Table location: /home/user/table_name
>>>
>>> Within this you could create a nested directory structure of the form:
>>>
>>> /home/user/table_name/2010/jan
>>> /home/user/table_name/2010/feb
>>> ...
>>> /home/user/table_name/2010/dec
>>>
>>> /home/user/table_name/2011/jan
>>> ...
>>> /home/user/table_name/2011/dec
>>>
>>> Given this directory structure you could run a query such as:
>>>
>>> select col1 from dfs.`/home/user/table_name` where dir0 = 2011 and dir1 = 'jan';
>>>
>>> This would prune the scan, skipping the parquet files under all the other
>>> directories.
>>>
>>> Thanks
>>> Mehant
>>>
>>> On 10/21/15 11:26 AM, Chris Mathews wrote:
>>>> We have an existing ETL framework processing machine-generated data, which
>>>> we are updating to write Parquet files out directly to HDFS using
>>>> AvroParquetWriter for access by Drill.
>>>>
>>>> Some questions:
>>>>
>>>> How do we take advantage of Drill’s partition-pruning capabilities with
>>>> PARTITION BY if we are not using CTAS to load the Parquet files?
>>>>
>>>> It seems there is no way of taking advantage of these features if the
>>>> Parquet files are created externally to CTAS - am I correct?
>>>>
>>>> If I am, then is there any way, using a Drill API, of programmatically
>>>> loading our data into Parquet files and utilising Drill's parallelisation
>>>> techniques via CTAS, or do we have to write the data out to a file and
>>>> then load that file again as input to a CTAS command?
>>>>
>>>> Another potential issue is that we are constantly writing Parquet files
>>>> out to HDFS directories, so the data in these files eventually appears as
>>>> additional data in a Drill query - so how can we do this with CTAS? Does
>>>> CTAS append to an existing directory structure, or does it insist on a new
>>>> table name each time it is executed?
>>>>
>>>> What I am getting at here is that there seem to be performance-enhancement
>>>> features available to Drill when the Parquet files are created using an
>>>> existing file as input to a CTAS that are not possible otherwise. With the
>>>> volumes of data we are talking about, it is not really an option to write
>>>> the files out only for them then to be read back in again for conversion
>>>> using CTAS; which is why we write the Parquet files out directly to HDFS
>>>> and append them to existing directories.
>>>>
>>>> Am I missing something obvious here? Quite possibly, yes.
>>>>
>>>> Thanks for any help.
>>>>
>>>> Cheers — Chris
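To make ‘discover the columns with a single value’ concrete: CTAS with PARTITION BY lays out the output so that, within any one file, the partitioning column holds a single value, and that fact is visible in the min/max statistics of each file's footer. A sketch of inspecting those statistics with the parquet-hadoop API (the path is an assumption):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.format.converter.ParquetMetadataConverter;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;

    public class FooterStatsSketch {
        public static void main(String[] args) throws Exception {
            // Read only the footer of one file (path is an assumption).
            ParquetMetadata footer = ParquetFileReader.readFooter(
                new Configuration(),
                new Path("hdfs:///data/events/part-00000.parquet"),
                ParquetMetadataConverter.NO_FILTER);

            // A column whose min equals its max in every row group of the
            // file holds a single value in that file; those are the columns
            // treated as partitioning columns.
            for (BlockMetaData rowGroup : footer.getBlocks()) {
                for (ColumnChunkMetaData col : rowGroup.getColumns()) {
                    System.out.println(col.getPath()
                        + " min=" + col.getStatistics().genericGetMin()
                        + " max=" + col.getStatistics().genericGetMax());
                }
            }
        }
    }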
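On the question of driving CTAS programmatically: there is no dedicated loading API, but a CTAS with PARTITION BY can be issued through Drill's JDBC driver. A sketch, assuming drill-jdbc-all is on the classpath; the ZooKeeper address, workspace, and table/column names are all made up for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CtasSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     DriverManager.getConnection("jdbc:drill:zk=zkhost:2181");
                 Statement stmt = conn.createStatement()) {
                // CTAS creates a new table directory and does not append to an
                // existing one, so continuous loads need a fresh table name
                // per batch (e.g. one per day).
                stmt.execute(
                    "CREATE TABLE dfs.tmp.`events_2015_10_21` "
                  + "PARTITION BY (device) AS "
                  + "SELECT ts, device, reading "
                  + "FROM dfs.`/staging/events_2015_10_21`");
            }
        }
    }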
