Thanks, guys, this is very helpful. I now need to go away and do some more research into this.

Cheers -- Chris
____________________
Sent from my iPhone

> On 21 Oct 2015, at 21:32, Jinfeng Ni <[email protected]> wrote:
>
> For each column in the parquet files, Drill will check the column
> metadata and see whether min == max within each file. If that holds for
> every file, the column has a single value per file, and Drill will use
> it as a partitioning column.
>
> The partitioning column can be one specified in the PARTITION BY clause
> of CTAS; in that case, CTAS ensures that each parquet file it creates
> holds a single value for that column. If the parquet files were not
> created by CTAS PARTITION BY, but some column happens to hold a single
> value in every file, it will be treated as a partitioning column as
> well.
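>
> As a minimal sketch of the CTAS case (the workspace, table, and column
> names here are hypothetical):
>
> create table dfs.tmp.`events_by_day`
> partition by (event_date)
> as select event_date, device_id, payload
> from dfs.`/staging/events`;
>
> Each file written by this CTAS holds exactly one event_date value, so
> min == max for that column in every file's footer, and a later query
> such as
>
> select device_id from dfs.tmp.`events_by_day`
> where event_date = date '2015-10-21';
>
> only has to scan the files whose event_date matches.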
> You may get more detail in the code [1].
>
> [1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L324
>
>> On Wed, Oct 21, 2015 at 12:18 PM, Chris Mathews <[email protected]> wrote:
>> We create a JSON format schema for the Parquet file using the Avro
>> specification and use this schema when loading data.
>>
>> Is there anything special we have to do to flag a column as a
>> partitioning column?
>> Sorry, I don't understand your answer. What do you mean by 'discover
>> the columns with a single value'?
>>
>> Cheers — Chris
>>
>>> On 21 Oct 2015, at 20:02, Mehant Baid <[email protected]> wrote:
>>>
>>> The information is stored in the footers of the parquet files. Drill
>>> reads the metadata stored in each parquet footer to discover the
>>> columns with a single value and treats them as partitioning columns.
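>>>
>>> For instance (a sketch; the path and column names are hypothetical):
>>> if your ETL job writes one file per day under /data/events, and every
>>> row in a given file carries the same event_date, then a query like
>>>
>>> select device_id from dfs.`/data/events` where event_date = date '2015-10-21';
>>>
>>> can be pruned to the matching files even though CTAS was never
>>> involved, because the footer statistics show min == max for event_date
>>> in each file.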
>>>
>>> Thanks
>>> Mehant
>>>
>>>> On 10/21/15 11:52 AM, Chris Mathews wrote:
>>>> Thanks Mehant; yes, we did look at doing this, but the advantage of
>>>> using the new PARTITION BY feature is that the partitioned columns
>>>> are detected automatically during any subsequent queries. This is a
>>>> major advantage, as our customers are using the Tableau BI tool, and
>>>> knowing details such as the exact partition levels and directories is
>>>> not an option.
>>>>
>>>> By the way, having created a table using PARTITION BY and CTAS, how
>>>> does a query know how to apply the pruning? Where is this information
>>>> stored for the query to access the tables/files efficiently?
>>>>
>>>> Cheers — Chris
>>>>
>>>>> On 21 Oct 2015, at 19:37, Mehant Baid <[email protected]> wrote:
>>>>>
>>>>> In addition to the auto partitioning done by CTAS, Drill also
>>>>> supports directory-based pruning. You could load data into different
>>>>> (nested) directories underneath the top-level table location and use
>>>>> the 'where' clause to get the pruning performance benefits. The
>>>>> following is a typical example.
>>>>>
>>>>> Table location: /home/user/table_name
>>>>> Within this you could create a nested directory structure of the form
>>>>> /home/user/table_name/2010/jan
>>>>> /home/user/table_name/2010/feb
>>>>> ...
>>>>> /home/user/table_name/2010/dec
>>>>>
>>>>> /home/user/table_name/2011/jan
>>>>> ...
>>>>> /home/user/table_name/2011/dec
>>>>>
>>>>> Given this directory structure, you could have a query that looks like
>>>>>
>>>>> select col1 from dfs.`/home/user/table_name` where dir0 = 2011 and dir1 = 'jan';
>>>>>
>>>>> This would prune out scanning the parquet files under the other
>>>>> directories.
>>>>>
>>>>> Thanks
>>>>> Mehant
>>>>>
>>>>>> On 10/21/15 11:26 AM, Chris Mathews wrote:
>>>>>> We have an existing ETL framework processing machine-generated
>>>>>> data, which we are updating to write Parquet files out directly to
>>>>>> HDFS using AvroParquetWriter, for access by Drill.
>>>>>>
>>>>>> Some questions:
>>>>>>
>>>>>> How do we take advantage of Drill's partition-pruning capabilities
>>>>>> with PARTITION BY if we are not using CTAS to load the Parquet
>>>>>> files?
>>>>>>
>>>>>> It seems there is no way of taking advantage of these features if
>>>>>> the Parquet files are created externally to CTAS - am I correct?
>>>>>>
>>>>>> If I am, then is there any way, using a Drill API, of
>>>>>> programmatically loading our data into Parquet files and utilising
>>>>>> Drill's parallelisation techniques with CTAS, or do we have to
>>>>>> write the data out to a file and then load that file again as input
>>>>>> to a CTAS command?
>>>>>>
>>>>>> Another potential issue is that we are constantly writing Parquet
>>>>>> files out to HDFS directories, so the data in these files
>>>>>> eventually appears as additional data in a Drill query - so how can
>>>>>> we do this with CTAS? Does CTAS append to an existing directory
>>>>>> structure, or does it insist on a new table name each time it is
>>>>>> executed?
>>>>>>
>>>>>> What I am getting at here is that there seem to be performance
>>>>>> enhancement features available to Drill when the Parquet files are
>>>>>> created using an existing file as input to a CTAS that are not
>>>>>> possible otherwise. With the volumes of data we are talking about,
>>>>>> it is not really an option to write the files out only for them to
>>>>>> be read back in again for conversion using CTAS; which is why we
>>>>>> write the Parquet files out directly to HDFS and append them to
>>>>>> existing directories.
>>>>>>
>>>>>> Am I missing something obvious here - quite possibly, yes?
>>>>>>
>>>>>> Thanks for any help.
>>>>>>
>>>>>> Cheers — Chris
