Thanks, guys, this is very helpful. I now need to go away and do some more research into this.

Cheers -- Chris
____________________
Sent from my iPhone

> On 21 Oct 2015, at 21:32, Jinfeng Ni <[email protected]> wrote:
>
> For each column in the parquet files, Drill will check the column
> metadata and see whether min == max within each file. If that holds for
> every file, the column has a single value per file, and Drill will use
> it as a partitioning column.
>
> The partitioning column can be one specified in the PARTITION BY clause
> of CTAS; in that case, CTAS ensures that each parquet file it creates
> holds a single value for that column. If the parquet files were not
> created by CTAS PARTITION BY, but some column happens to hold a single
> value in every file, it will be treated as a partitioning column as
> well.
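>
> As a minimal sketch of the CTAS case (the workspace, table, and column
> names here are hypothetical):
>
> create table dfs.tmp.`events_by_day`
> partition by (event_date)
> as select event_date, device_id, payload
> from dfs.`/staging/events`;
>
> Each file written by this CTAS holds exactly one event_date value, so
> min == max for that column in every file's footer, and a later query
> such as
>
> select device_id from dfs.tmp.`events_by_day`
> where event_date = date '2015-10-21';
>
> only has to scan the files whose event_date matches.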
> You may get more detail in the code [1].
>
> [1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L324
>
>> On Wed, Oct 21, 2015 at 12:18 PM, Chris Mathews <[email protected]> wrote:
>> We create a JSON format schema for the Parquet file using the Avro
>> specification and use this schema when loading data.
>>
>> Is there anything special we have to do to flag a column as a
>> partitioning column?
>> Sorry, I don't understand your answer. What do you mean by 'discover
>> the columns with a single value'?
>>
>> Cheers — Chris
>>
>>> On 21 Oct 2015, at 20:02, Mehant Baid <[email protected]> wrote:
>>>
>>> The information is stored in the footers of the parquet files. Drill
>>> reads the metadata stored in each parquet footer to discover the
>>> columns with a single value and treats them as partitioning columns.
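>>>
>>> For instance (a sketch; the path and column names are hypothetical):
>>> if your ETL job writes one file per day under /data/events, and every
>>> row in a given file carries the same event_date, then a query like
>>>
>>> select device_id from dfs.`/data/events` where event_date = date '2015-10-21';
>>>
>>> can be pruned to the matching files even though CTAS was never
>>> involved, because the footer statistics show min == max for event_date
>>> in each file.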
>>>
>>> Thanks
>>> Mehant
>>>
>>>> On 10/21/15 11:52 AM, Chris Mathews wrote:
>>>> Thanks Mehant; yes, we did look at doing this, but the advantage of
>>>> using the new PARTITION BY feature is that the partitioned columns
>>>> are detected automatically during any subsequent queries. This is a
>>>> major advantage, as our customers are using the Tableau BI tool, and
>>>> knowing details such as the exact partition levels and directories is
>>>> not an option.
>>>>
>>>> By the way, having created a table using PARTITION BY and CTAS, how
>>>> does a query know how to apply the pruning? Where is this information
>>>> stored for the query to access the tables/files efficiently?
>>>>
>>>> Cheers — Chris
>>>>
>>>>> On 21 Oct 2015, at 19:37, Mehant Baid <[email protected]> wrote:
>>>>>
>>>>> In addition to the auto partitioning done by CTAS, Drill also
>>>>> supports directory-based pruning. You could load data into different
>>>>> (nested) directories underneath the top-level table location and use
>>>>> the 'where' clause to get the pruning performance benefits. The
>>>>> following is a typical example.
>>>>>
>>>>> Table location: /home/user/table_name
>>>>> Within this you could create a nested directory structure of the form
>>>>> /home/user/table_name/2010/jan
>>>>> /home/user/table_name/2010/feb
>>>>> ...
>>>>> /home/user/table_name/2010/dec
>>>>>
>>>>> /home/user/table_name/2011/jan
>>>>> ...
>>>>> /home/user/table_name/2011/dec
>>>>>
>>>>> Given this directory structure, you could have a query that looks like
>>>>>
>>>>> select col1 from dfs.`/home/user/table_name` where dir0 = 2011 and dir1 = 'jan';
>>>>>
>>>>> This would prune out scanning the parquet files under the other
>>>>> directories.
>>>>>
>>>>> Thanks
>>>>> Mehant
>>>>>
>>>>>> On 10/21/15 11:26 AM, Chris Mathews wrote:
>>>>>> We have an existing ETL framework processing machine-generated
>>>>>> data, which we are updating to write Parquet files out directly to
>>>>>> HDFS using AvroParquetWriter, for access by Drill.
>>>>>>
>>>>>> Some questions:
>>>>>>
>>>>>> How do we take advantage of Drill's partition-pruning capabilities
>>>>>> with PARTITION BY if we are not using CTAS to load the Parquet
>>>>>> files?
>>>>>>
>>>>>> It seems there is no way of taking advantage of these features if
>>>>>> the Parquet files are created externally to CTAS - am I correct?
>>>>>>
>>>>>> If I am, then is there any way, using a Drill API, of
>>>>>> programmatically loading our data into Parquet files and utilising
>>>>>> Drill's parallelisation techniques with CTAS, or do we have to
>>>>>> write the data out to a file and then load that file again as input
>>>>>> to a CTAS command?
>>>>>>
>>>>>> Another potential issue is that we are constantly writing Parquet
>>>>>> files out to HDFS directories, so the data in these files
>>>>>> eventually appears as additional data in a Drill query - so how can
>>>>>> we do this with CTAS? Does CTAS append to an existing directory
>>>>>> structure, or does it insist on a new table name each time it is
>>>>>> executed?
>>>>>>
>>>>>> What I am getting at here is that there seem to be performance
>>>>>> enhancement features available to Drill when the Parquet files are
>>>>>> created using an existing file as input to a CTAS that are not
>>>>>> possible otherwise. With the volumes of data we are talking about,
>>>>>> it is not really an option to write the files out only for them to
>>>>>> be read back in again for conversion using CTAS; which is why we
>>>>>> write the Parquet files out directly to HDFS and append them to
>>>>>> existing directories.
>>>>>>
>>>>>> Am I missing something obvious here - quite possibly, yes?
>>>>>>
>>>>>> Thanks for any help.
>>>>>>
>>>>>> Cheers — Chris
