Sounds like a future optimization opportunity, once someone actually hits the hybrid case and needs it.
--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Sun, Oct 25, 2015 at 6:00 PM, Jinfeng Ni <[email protected]> wrote:

@Jacques,

Steven could probably confirm whether my understanding of the code is correct or not. From the code, it seems we enforce a check such that only a column with a single value in each of the files is considered for pruning.

I just tried two simple cases with the TPC-H sample data, and that seems to be the case: the table that meets this unique-value requirement shows partition pruning working, while the table with one file violating the requirement does not.

I think the code imposes this requirement because of the way partition pruning currently works: we build a set of value vectors over the input files before partition pruning does the interpreter evaluation, and those value vectors assume that each input file has a single value per column; otherwise it does not fit the model.

I agree with you that we should actually do the partitioning if any of the files are excludable. That sounds like we should change the partition pruning rule so that the filter evaluation moves into ParquetGroupScan itself.

On Sun, Oct 25, 2015 at 4:32 PM, Jacques Nadeau <[email protected]> wrote:

Jinfeng hit the nail on the head. If you have Parquet files with single-value columns (and have Parquet footer metadata stats), Drill will automatically leverage the partitioning with zero additional setup required.

Jinfeng, based on what you said, it sounds as if we don't apply partitioning unless all the files read are partitioned (single-value columns). We should actually do the partitioning if any of the files are excludable, right?

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Oct 21, 2015 at 1:45 PM, Chris Mathews <[email protected]> wrote:

Thanks guys, this is very helpful. I now need to go away and do some more research into this.

Cheers -- Chris
____________________
Sent from my iPhone

On 21 Oct 2015, at 21:32, Jinfeng Ni <[email protected]> wrote:

For each column in the parquet files, Drill will check the column metadata to see whether min == max within every parquet file. If so, that indicates the column has a single value per file, and Drill will use that column as a partitioning column.

The partitioning column could be a column specified in the "PARTITION BY" clause of CTAS; in that case, CTAS will ensure each created parquet file has a single value for the column. If the parquet files are not created by CTAS PARTITION BY, but some column happens to have a single value in each file, then it will be treated as a partitioning column as well.

You may get more detail in the code [1].

[1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L324

On Wed, Oct 21, 2015 at 12:18 PM, Chris Mathews <[email protected]> wrote:

We create a JSON-format schema for the Parquet file using the Avro specification and use this schema when loading data.

Is there anything special we have to do to flag a column as a partitioning column? Sorry, I don't understand your answer. What do you mean by 'discover the columns with a single value'?

Cheers — Chris
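To unpack "discover the columns with a single value": the check amounts to reading each parquet file's footer statistics and testing min == max per file. Below is a rough sketch against the parquet-mr API; the helper name is illustrative, and this is not Drill's actual implementation (that lives in ParquetGroupScan, linked above):

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.statistics.Statistics;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class SingleValueColumnCheck {

  // Returns true if the named column holds exactly one value in every
  // file, i.e. min == max in each file's footer statistics. Illustrative
  // sketch only; Drill's real logic lives in ParquetGroupScan.
  static boolean isSingleValuePerFile(Configuration conf, List<Path> files,
      String columnDotPath) throws IOException {
    for (Path file : files) {
      ParquetMetadata footer = ParquetFileReader.readFooter(conf, file);
      Object fileValue = null;
      for (BlockMetaData block : footer.getBlocks()) {
        for (ColumnChunkMetaData chunk : block.getColumns()) {
          if (!chunk.getPath().toDotString().equals(columnDotPath)) {
            continue;
          }
          Statistics<?> stats = chunk.getStatistics();
          // Missing stats disqualify the column, as does min != max.
          if (stats == null || stats.isEmpty()
              || !stats.genericGetMin().equals(stats.genericGetMax())) {
            return false;
          }
          if (fileValue == null) {
            fileValue = stats.genericGetMin();  // first row group sets the value
          } else if (!fileValue.equals(stats.genericGetMin())) {
            return false;                       // row groups within the file disagree
          }
        }
      }
    }
    return true;
  }
}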
On 21 Oct 2015, at 20:02, Mehant Baid <[email protected]> wrote:

The information is stored in the footer of the parquet files. Drill reads the metadata stored in the parquet footer to discover the columns with a single value and treats them as partitioning columns.

Thanks
Mehant

On 10/21/15 11:52 AM, Chris Mathews wrote:

Thanks Mehant; yes, we did look at doing this, but the advantage of using the new PARTITION BY feature is that the partitioned columns are automatically detected during any subsequent queries. This is a major advantage, as our customers are using the Tableau BI tool, and needing to know details such as the exact partition levels and directories is not an option.

By the way, having created a table using PARTITION BY and CTAS, how does a query know how to action the pruning? Where is this information stored for the query to access the tables/files efficiently?

Cheers — Chris

On 21 Oct 2015, at 19:37, Mehant Baid <[email protected]> wrote:

In addition to the auto-partitioning done by CTAS, Drill also supports directory-based pruning. You could load data into different (nested) directories underneath the top-level table location and use the WHERE clause to get the pruning performance benefits. The following is a typical example.

Table location: /home/user/table_name

Within this you could create a nested directory structure of the form:

/home/user/table_name/2010/jan
/home/user/table_name/2010/feb
...
/home/user/table_name/2010/dec

/home/user/table_name/2011/jan
...
/home/user/table_name/2011/dec

Given this directory structure you could have a query that looks like:

select col1 from dfs.`/home/user/table_name` where dir0 = 2011 and dir1 = 'jan';

This would prune out scanning the parquet files under the other directories.

Thanks
Mehant
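To make that layout concrete for files written outside of CTAS (as in the ETL flow described below), here is a minimal sketch using parquet-avro's AvroParquetWriter; the schema, record fields, and paths are hypothetical stand-ins:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class DirectHdfsWriter {
  public static void main(String[] args) throws Exception {
    // Hypothetical schema; substitute the real Avro schema your ETL uses.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
        + "{\"name\":\"host\",\"type\":\"string\"},"
        + "{\"name\":\"value\",\"type\":\"long\"}]}");

    // Write into the dated subdirectory so Drill's dir0/dir1 pruning applies,
    // e.g. /home/user/table_name/2011/jan/part-0001.parquet.
    Path file = new Path("hdfs:///home/user/table_name/2011/jan/part-0001.parquet");

    try (ParquetWriter<GenericRecord> writer =
             AvroParquetWriter.<GenericRecord>builder(file)
                 .withSchema(schema)
                 .build()) {
      GenericRecord rec = new GenericData.Record(schema);
      rec.put("host", "node-42");
      rec.put("value", 123L);
      writer.write(rec);
    }
  }
}

With this layout no CTAS step is needed for pruning on dir0/dir1, and new files appended under existing directories simply show up in subsequent queries.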
On 10/21/15 11:26 AM, Chris Mathews wrote:

We have an existing ETL framework processing machine-generated data, which we are updating to write Parquet files out directly to HDFS using AvroParquetWriter for access by Drill.

Some questions:

How do we take advantage of Drill's partition pruning capabilities with PARTITION BY if we are not using CTAS to load the Parquet files?

It seems there is no way of taking advantage of these features if the Parquet files are created externally to CTAS - am I correct?

If I am, then is there any way, using a Drill API, of programmatically loading our data into Parquet files and utilising Drill's parallelisation techniques using CTAS, or do we have to write the data out to a file and then load that file again as input to a CTAS command?

Another potential issue is that we are constantly writing Parquet files out to HDFS directories, so the data in these files eventually appears as additional data in a Drill query - how can we do this with CTAS? Does CTAS append to an existing directory structure, or does it insist on a new table name each time it is executed?

What I am getting at here is that there seem to be performance-enhancement features available to Drill when the Parquet files are created using an existing file as input to CTAS that are not possible otherwise. With the volumes of data we are talking about, it is not really an option to write the files out only for them to be read back in again for conversion using CTAS; which is why we write the Parquet files out directly to HDFS and append them to existing directories.

Am I missing something obvious here - quite possibly, yes?

Thanks for any help.

Cheers — Chris
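On the programmatic question: Drill ships a JDBC driver, so one option is to submit the CTAS (including PARTITION BY) from the ETL code over JDBC. A minimal sketch, with a hypothetical ZooKeeper quorum, workspace, and column names:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CtasViaJdbc {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.drill.jdbc.Driver");
    // Hypothetical connection string; adjust the ZooKeeper quorum and
    // cluster id to your environment.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:drill:zk=zkhost:2181/drill/drillbits1");
         Statement stmt = conn.createStatement()) {
      // CTAS with PARTITION BY writes one value per file for the partition
      // column, so the min == max footer check described earlier holds and
      // pruning kicks in automatically. Table and column names are examples.
      stmt.execute(
          "CREATE TABLE dfs.tmp.`events_partitioned` "
          + "PARTITION BY (event_month) AS "
          + "SELECT * FROM dfs.`/staging/events`");
    }
  }
}

Note that CTAS expects the target table name not to exist, so each run needs a fresh name or directory; it does not append, which is why the directory layout described earlier is the usual route for continuously arriving data.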
