@Jacques:
>>> Jinfeng hit the nail on the head. If you have Parquet files with single
>>> value columns (and have Parquet footer metadata stats), Drill will
>>> automatically leverage the partitioning with zero additional setup required.
So as it stands at the moment, does this mean Drill has to read the footer
of all Parquet files in a directory to detect whether a column has a single
value across all files in that directory? Or is this where the metadata
comes into play?

— Chris

> On 28 Oct 2015, at 00:30, Jacques Nadeau <[email protected]> wrote:
>
> Sounds like a future optimization opportunity once someone has the hybrid
> issue and need.
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Sun, Oct 25, 2015 at 6:00 PM, Jinfeng Ni <[email protected]> wrote:
>
>> @Jacques,
>>
>> Steven probably could confirm whether my understanding of the code is
>> correct or not. From the code, it seems we enforce a check that only a
>> column with a single value in every one of the files is considered for
>> pruning.
>>
>> I just tried two simple cases with TPC-H sample data. It seems to be
>> the case: the table that meets this single-value requirement shows that
>> partition pruning works, while the table with one file that violates
>> the requirement does not.
>>
>> I think the code imposes this requirement because of the way partition
>> pruning currently works: we build a set of value vectors for the input
>> files before partition pruning does the interpreter evaluation. Those
>> value vectors assume that each input file has a single value; otherwise
>> it would not fit into this model.
>>
>> I agree with you that we should actually do the partitioning if any of
>> the files are excludable. That sounds like we should change the
>> partition pruning rule so that the filter evaluation is moved into
>> ParquetGroupScan itself.
>>
>> On Sun, Oct 25, 2015 at 4:32 PM, Jacques Nadeau <[email protected]>
>> wrote:
>>> Jinfeng hit the nail on the head. If you have Parquet files with single
>>> value columns (and have Parquet footer metadata stats), Drill will
>>> automatically leverage the partitioning with zero additional setup
>>> required.
>>>
>>> Jinfeng, based on what you said, it sounds as if we don't apply
>>> partitioning unless all the files read are partitioned (single value
>>> columns). We should actually do the partitioning if any of the files
>>> are excludable, right?
>>>
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>>
>>> On Wed, Oct 21, 2015 at 1:45 PM, Chris Mathews <[email protected]> wrote:
>>>
>>>> Thanks guys, this is very helpful.
>>>> I now need to go away and do some more research into this.
>>>>
>>>> Cheers -- Chris
>>>> ____________________
>>>> Sent from my iPhone
>>>>
>>>>> On 21 Oct 2015, at 21:32, Jinfeng Ni <[email protected]> wrote:
>>>>>
>>>>> For each column in the parquet files, Drill will check the column
>>>>> metadata and see whether min == max (i.e. the column holds a single
>>>>> value) in each parquet file. If so, Drill will use that column as a
>>>>> partitioning column.
>>>>>
>>>>> The partitioning column could be a column specified in the "PARTITION
>>>>> BY" clause of CTAS; in that case, CTAS will ensure each created
>>>>> parquet file has a single value for the column. If the parquet files
>>>>> were not created by CTAS PARTITION BY, but some column happens to
>>>>> have a single value in each of the parquet files, then it will be
>>>>> treated as a partitioning column as well.
>>>>>
>>>>> You may get more detail in the code [1].
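>>>>>
>>>>> For example (purely as an illustration; the table and column names
>>>>> below are made up, not from this thread), a CTAS such as
>>>>>
>>>>>   CREATE TABLE dfs.tmp.`orders_by_month`
>>>>>   PARTITION BY (order_month)
>>>>>   AS SELECT order_month, order_id, amount
>>>>>   FROM dfs.`/data/orders_raw`;
>>>>>
>>>>> writes each parquet file with a single value of order_month, so a
>>>>> later query like
>>>>>
>>>>>   SELECT order_id, amount
>>>>>   FROM dfs.tmp.`orders_by_month`
>>>>>   WHERE order_month = '2015-09'
>>>>>
>>>>> only needs to scan the files whose footer stats contain that value.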
>>>>>
>>>>> [1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L324
>>>>>
>>>>>> On Wed, Oct 21, 2015 at 12:18 PM, Chris Mathews <[email protected]> wrote:
>>>>>> We create a JSON-format schema for the Parquet file using the Avro
>>>>>> specification and use this schema when loading data.
>>>>>>
>>>>>> Is there anything special we have to do to flag a column as a
>>>>>> partitioning column?
>>>>>> Sorry, I don’t understand your answer. What do you mean by ‘discover
>>>>>> the columns with a single value’?
>>>>>>
>>>>>> Cheers — Chris
>>>>>>
>>>>>>> On 21 Oct 2015, at 20:02, Mehant Baid <[email protected]> wrote:
>>>>>>>
>>>>>>> The information is stored in the footer of the parquet files. Drill
>>>>>>> reads the metadata information stored in the parquet footer to
>>>>>>> discover the columns with a single value and treats them as
>>>>>>> partitioning columns.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Mehant
>>>>>>>
>>>>>>>> On 10/21/15 11:52 AM, Chris Mathews wrote:
>>>>>>>> Thanks Mehant; yes, we did look at doing this, but the advantage
>>>>>>>> of using the new PARTITION BY feature is that the partitioning
>>>>>>>> columns are automatically detected during any subsequent queries.
>>>>>>>> This is a major advantage as our customers are using the Tableau
>>>>>>>> BI tool, and knowing details such as the exact partition levels
>>>>>>>> and directories is not an option.
>>>>>>>>
>>>>>>>> By the way, having created a table using PARTITION BY and CTAS,
>>>>>>>> how does a query know how to apply the pruning? Where is this
>>>>>>>> information stored for the query to access the tables/files
>>>>>>>> efficiently?
>>>>>>>>
>>>>>>>> Cheers — Chris
>>>>>>>>
>>>>>>>>> On 21 Oct 2015, at 19:37, Mehant Baid <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> In addition to the auto partitioning done by CTAS, Drill also
>>>>>>>>> supports directory-based pruning. You could load data into
>>>>>>>>> different (nested) directories underneath the top-level table
>>>>>>>>> location and use the 'where' clause to get the pruning
>>>>>>>>> performance benefits. The following is a typical example.
>>>>>>>>>
>>>>>>>>> Table location: /home/user/table_name
>>>>>>>>> Within this you could create a nested directory structure of the form
>>>>>>>>> /home/user/table_name/2010/jan
>>>>>>>>> /home/user/table_name/2010/feb
>>>>>>>>> ...
>>>>>>>>> /home/user/table_name/2010/dec
>>>>>>>>>
>>>>>>>>> /home/user/table_name/2011/jan
>>>>>>>>> ...
>>>>>>>>> /home/user/table_name/2011/dec
>>>>>>>>>
>>>>>>>>> Given this directory structure you could have a query that looks like
>>>>>>>>>
>>>>>>>>> select col1 from dfs.`/home/user/table_name` where dir0 = 2011 and
>>>>>>>>> dir1 = 'jan';
>>>>>>>>>
>>>>>>>>> This would prune out scanning the parquet files under the other
>>>>>>>>> directories.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Mehant
>>>>>>>>>
>>>>>>>>>> On 10/21/15 11:26 AM, Chris Mathews wrote:
>>>>>>>>>> We have an existing ETL framework processing machine-generated
>>>>>>>>>> data, which we are updating to write Parquet files out directly
>>>>>>>>>> to HDFS using AvroParquetWriter for access by Drill.
>>>>>>>>>>
>>>>>>>>>> Some questions:
>>>>>>>>>>
>>>>>>>>>> How do we take advantage of Drill’s partition pruning
>>>>>>>>>> capabilities with PARTITION BY if we are not using CTAS to load
>>>>>>>>>> the Parquet files?
>>>>>>>>>>
>>>>>>>>>> It seems there is no way of taking advantage of these features
>>>>>>>>>> if the Parquet files are created externally to CTAS - am I correct?
>>>>>>>>>>
>>>>>>>>>> If I am, then is there any way, using a Drill API, of
>>>>>>>>>> programmatically loading our data into Parquet files and
>>>>>>>>>> utilising Drill's parallelisation techniques via CTAS, or do we
>>>>>>>>>> have to write the data out to a file and then load that file
>>>>>>>>>> again as input to a CTAS command?
>>>>>>>>>>
>>>>>>>>>> Another potential issue is that we are constantly writing Parquet
>>>>>>>>>> files out to HDFS directories, so the data in these files
>>>>>>>>>> eventually appears as additional data in a Drill query - so how
>>>>>>>>>> can we do this with CTAS? Does CTAS append to an existing
>>>>>>>>>> directory structure, or does it insist on a new table name each
>>>>>>>>>> time it is executed?
>>>>>>>>>>
>>>>>>>>>> What I am getting at here is that there seem to be performance
>>>>>>>>>> enhancement features available to Drill when the Parquet files
>>>>>>>>>> are created using an existing file as input to a CTAS that are
>>>>>>>>>> not possible otherwise. With the volumes of data we are talking
>>>>>>>>>> about, it is not really an option to write the files out only
>>>>>>>>>> for them to be read back in again for conversion using CTAS;
>>>>>>>>>> which is why we write the Parquet files out directly to HDFS and
>>>>>>>>>> append them to existing directories.
>>>>>>>>>>
>>>>>>>>>> Am I missing something obvious here - quite possibly, yes?
>>>>>>>>>>
>>>>>>>>>> Thanks for any help.
>>>>>>>>>>
>>>>>>>>>> Cheers — Chris
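>>>>>>>>>>
>>>>>>>>>> P.S. Just to make the two-step pattern we are trying to avoid
>>>>>>>>>> concrete, it would be roughly the following (the paths and column
>>>>>>>>>> names here are only placeholders, not our real schema):
>>>>>>>>>>
>>>>>>>>>>   CREATE TABLE dfs.tmp.`events_partitioned`
>>>>>>>>>>   PARTITION BY (event_date)
>>>>>>>>>>   AS SELECT event_date, device_id, payload
>>>>>>>>>>   FROM dfs.`/staging/events_raw`;
>>>>>>>>>>
>>>>>>>>>> i.e. land the data in a staging location first and then have
>>>>>>>>>> Drill rewrite it as auto-partitioned Parquet, which means
>>>>>>>>>> writing and re-reading everything at our data volumes.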
