Sounds like a future optimization opportunity, once someone actually hits the hybrid case and needs it.
--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Sun, Oct 25, 2015 at 6:00 PM, Jinfeng Ni <[email protected]> wrote:

@Jacques,

Steven could probably confirm whether my understanding of the code is correct or not. From the code, it seems we enforce a check such that only a column with a single value in each of the files is considered for pruning.

I just tried two simple cases with the TPC-H sample data, and that seems to be the case: the table that meets this unique-value requirement shows partition pruning working, while the table with one file violating the requirement does not.

I think the code imposes this requirement because of the way partition pruning currently works: we build a set of value vectors over the input files before partition pruning does the interpreter evaluation, and those value vectors assume that each input file has a single value per column; otherwise it does not fit the model.

I agree with you that we should actually do the partitioning if any of the files are excludable. That sounds like we should change the partition pruning rule so that the filter evaluation moves into ParquetGroupScan itself.

On Sun, Oct 25, 2015 at 4:32 PM, Jacques Nadeau <[email protected]> wrote:

Jinfeng hit the nail on the head. If you have Parquet files with single-value columns (and have Parquet footer metadata stats), Drill will automatically leverage the partitioning with zero additional setup required.

Jinfeng, based on what you said, it sounds as if we don't apply partitioning unless all the files read are partitioned (single-value columns). We should actually do the partitioning if any of the files are excludable, right?

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Oct 21, 2015 at 1:45 PM, Chris Mathews <[email protected]> wrote:

Thanks guys, this is very helpful. I now need to go away and do some more research into this.

Cheers -- Chris
____________________
Sent from my iPhone

On 21 Oct 2015, at 21:32, Jinfeng Ni <[email protected]> wrote:

For each column in the parquet files, Drill will check the column metadata to see whether min == max within every parquet file. If so, that indicates the column has a single value per file, and Drill will use that column as a partitioning column.

The partitioning column could be a column specified in the "PARTITION BY" clause of CTAS; in that case, CTAS will ensure each created parquet file has a single value for the column. If the parquet files are not created by CTAS PARTITION BY, but some column happens to have a single value in each file, then it will be treated as a partitioning column as well.

You may get more detail in the code [1].

[1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L324

On Wed, Oct 21, 2015 at 12:18 PM, Chris Mathews <[email protected]> wrote:

We create a JSON-format schema for the Parquet file using the Avro specification and use this schema when loading data.

Is there anything special we have to do to flag a column as a partitioning column? Sorry, I don't understand your answer. What do you mean by 'discover the columns with a single value'?

Cheers — Chris
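To unpack "discover the columns with a single value": the check amounts to reading each parquet file's footer statistics and testing min == max per file. Below is a rough sketch against the parquet-mr API; the helper name is illustrative, and this is not Drill's actual implementation (that lives in ParquetGroupScan, linked above):

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.statistics.Statistics;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class SingleValueColumnCheck {

  // Returns true if the named column holds exactly one value in every
  // file, i.e. min == max in each file's footer statistics. Illustrative
  // sketch only; Drill's real logic lives in ParquetGroupScan.
  static boolean isSingleValuePerFile(Configuration conf, List<Path> files,
      String columnDotPath) throws IOException {
    for (Path file : files) {
      ParquetMetadata footer = ParquetFileReader.readFooter(conf, file);
      Object fileValue = null;
      for (BlockMetaData block : footer.getBlocks()) {
        for (ColumnChunkMetaData chunk : block.getColumns()) {
          if (!chunk.getPath().toDotString().equals(columnDotPath)) {
            continue;
          }
          Statistics<?> stats = chunk.getStatistics();
          // Missing stats disqualify the column, as does min != max.
          if (stats == null || stats.isEmpty()
              || !stats.genericGetMin().equals(stats.genericGetMax())) {
            return false;
          }
          if (fileValue == null) {
            fileValue = stats.genericGetMin();  // first row group sets the value
          } else if (!fileValue.equals(stats.genericGetMin())) {
            return false;                       // row groups within the file disagree
          }
        }
      }
    }
    return true;
  }
}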
On 21 Oct 2015, at 20:02, Mehant Baid <[email protected]> wrote:

The information is stored in the footer of the parquet files. Drill reads the metadata stored in the parquet footer to discover the columns with a single value and treats them as partitioning columns.

Thanks
Mehant

On 10/21/15 11:52 AM, Chris Mathews wrote:

Thanks Mehant; yes, we did look at doing this, but the advantage of using the new PARTITION BY feature is that the partitioned columns are automatically detected during any subsequent queries. This is a major advantage, as our customers are using the Tableau BI tool, and needing to know details such as the exact partition levels and directories is not an option.

By the way, having created a table using PARTITION BY and CTAS, how does a query know how to action the pruning? Where is this information stored for the query to access the tables/files efficiently?

Cheers — Chris

On 21 Oct 2015, at 19:37, Mehant Baid <[email protected]> wrote:

In addition to the auto-partitioning done by CTAS, Drill also supports directory-based pruning. You could load data into different (nested) directories underneath the top-level table location and use the WHERE clause to get the pruning performance benefits. The following is a typical example.

Table location: /home/user/table_name

Within this you could create a nested directory structure of the form:

/home/user/table_name/2010/jan
/home/user/table_name/2010/feb
...
/home/user/table_name/2010/dec

/home/user/table_name/2011/jan
...
/home/user/table_name/2011/dec

Given this directory structure you could have a query that looks like:

select col1 from dfs.`/home/user/table_name` where dir0 = 2011 and dir1 = 'jan';

This would prune out scanning the parquet files under the other directories.

Thanks
Mehant
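To make that layout concrete for files written outside of CTAS (as in the ETL flow described below), here is a minimal sketch using parquet-avro's AvroParquetWriter; the schema, record fields, and paths are hypothetical stand-ins:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class DirectHdfsWriter {
  public static void main(String[] args) throws Exception {
    // Hypothetical schema; substitute the real Avro schema your ETL uses.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
        + "{\"name\":\"host\",\"type\":\"string\"},"
        + "{\"name\":\"value\",\"type\":\"long\"}]}");

    // Write into the dated subdirectory so Drill's dir0/dir1 pruning applies,
    // e.g. /home/user/table_name/2011/jan/part-0001.parquet.
    Path file = new Path("hdfs:///home/user/table_name/2011/jan/part-0001.parquet");

    try (ParquetWriter<GenericRecord> writer =
             AvroParquetWriter.<GenericRecord>builder(file)
                 .withSchema(schema)
                 .build()) {
      GenericRecord rec = new GenericData.Record(schema);
      rec.put("host", "node-42");
      rec.put("value", 123L);
      writer.write(rec);
    }
  }
}

With this layout no CTAS step is needed for pruning on dir0/dir1, and new files appended under existing directories simply show up in subsequent queries.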
On 10/21/15 11:26 AM, Chris Mathews wrote:

We have an existing ETL framework processing machine-generated data, which we are updating to write Parquet files out directly to HDFS using AvroParquetWriter for access by Drill.

Some questions:

How do we take advantage of Drill's partition pruning capabilities with PARTITION BY if we are not using CTAS to load the Parquet files?

It seems there is no way of taking advantage of these features if the Parquet files are created externally to CTAS - am I correct?

If I am, then is there any way, using a Drill API, of programmatically loading our data into Parquet files and utilising Drill's parallelisation techniques using CTAS, or do we have to write the data out to a file and then load that file again as input to a CTAS command?

Another potential issue is that we are constantly writing Parquet files out to HDFS directories, so the data in these files eventually appears as additional data in a Drill query - how can we do this with CTAS? Does CTAS append to an existing directory structure, or does it insist on a new table name each time it is executed?

What I am getting at here is that there seem to be performance-enhancement features available to Drill when the Parquet files are created using an existing file as input to CTAS that are not possible otherwise. With the volumes of data we are talking about, it is not really an option to write the files out only for them to be read back in again for conversion using CTAS; which is why we write the Parquet files out directly to HDFS and append them to existing directories.

Am I missing something obvious here - quite possibly, yes?

Thanks for any help.

Cheers — Chris
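On the programmatic question: Drill ships a JDBC driver, so one option is to submit the CTAS (including PARTITION BY) from the ETL code over JDBC. A minimal sketch, with a hypothetical ZooKeeper quorum, workspace, and column names:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CtasViaJdbc {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.drill.jdbc.Driver");
    // Hypothetical connection string; adjust the ZooKeeper quorum and
    // cluster id to your environment.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:drill:zk=zkhost:2181/drill/drillbits1");
         Statement stmt = conn.createStatement()) {
      // CTAS with PARTITION BY writes one value per file for the partition
      // column, so the min == max footer check described earlier holds and
      // pruning kicks in automatically. Table and column names are examples.
      stmt.execute(
          "CREATE TABLE dfs.tmp.`events_partitioned` "
          + "PARTITION BY (event_month) AS "
          + "SELECT * FROM dfs.`/staging/events`");
    }
  }
}

Note that CTAS expects the target table name not to exist, so each run needs a fresh name or directory; it does not append, which is why the directory layout described earlier is the usual route for continuously arriving data.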
