To me there's practically very little difference between partitioning and
bucketing (partitioning defines the split criteria explicitly, whereas
bucketing does so somewhat implicitly). Hive, however, recognises the latter
as a separate feature and handles the two in quite different ways.

There's already a feature request proposing to unify the two and bring the
optimisations across (which would address the "bucket pruning" issue I
believe you're having):

https://issues.apache.org/jira/browse/HIVE-9523

Probably best if you vote for it so it gets some traction…
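
For reference, the contrast in DDL terms looks roughly like this (a minimal
sketch; the table and column names are made up, not from this thread):

```sql
-- Partitioning: the split criterion is explicit. Each distinct value of dt
-- becomes its own directory, and the planner prunes whole directories.
CREATE TABLE events_part (id BIGINT, payload STRING)
PARTITIONED BY (dt STRING);

-- Bucketing: the split criterion is implicit. hash(user_id) modulo 32
-- decides which of the 32 files within the table a row lands in.
CREATE TABLE events_buck (id BIGINT, user_id BIGINT, payload STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;
```

Partition pruning happens at plan time from the directory layout; the bucket
files carry no equivalent metadata the planner currently uses, hence HIVE-9523.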

Regards
~Maciek

On Fri, Mar 13, 2015 at 12:22 PM, cobby <ququr...@yahoo.com> wrote:

> hi, thanks for the detailed response.
> i will experiment with your suggested orc bloom filter solution.
>
> it seems to me the obvious, most straightforward solution is to add
> support for hash partitioning, so i can do something like:
>
> create table T()
> partitioned by (x into num_partitions,..).
>
> upon insert, hash(x) determines which partition to put the record in. upon
> select, the query processor can now hash on x and scan only that partition
> (this optimization will probably only work for = and other discrete
> filters, but that's true for partitioning in general).
> it seems all of this can be done early in the query planning phase and
> have no effect on the underlying infra.
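>
> the write half of this already exists in hive's bucketing DDL; roughly
> (table name made up for illustration):
>
> ```sql
> -- hash(x) modulo 16 picks the output file on insert
> CREATE TABLE t (x BIGINT, v STRING)
> CLUSTERED BY (x) INTO 16 BUCKETS;
> ```
>
> what's missing is the read-side pruning this thread is about.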
>
> regards,cobby.
>
>
>
> > On 12 Mar 2015, at 23:05, Gopal Vijayaraghavan <gop...@apache.org>
> > wrote:
> >
> > Hi,
> >
> > No, and it's a shame because we're stuck on some compatibility details
> > with this.
> >
> > The primary issue is the fact that the InputFormat is very generic and
> > offers no way to communicate StorageDescriptor or bucketing.
> >
> > The split generation for something like SequenceFileInputFormat lives
> > inside MapReduce, where it has no idea about bucketing.
> >
> > So InputFormat.getSplits(conf) returns something relatively arbitrary,
> > which contains a mixture of files when CombineInputFormat is turned on.
> >
> > I have implemented this twice so far for ORC (for custom Tez jobs, with
> > huge wins) by using an MRv2 PathFilter over the regular OrcNewInputFormat
> > implementation, by turning off combine input and using Tez grouping
> > instead.
> >
> > But that has proved to be very fragile for a trunk feature, since with
> > schema evolution of partitioned tables older partitions may be bucketed
> > with a different count from a newer partition - so the StorageDescriptor
> > for each partition has to be fetched across before we can generate a
> valid
> > PathFilter.
> >
> > The SARGs are probably a better way to do this eventually as they can
> > implement IN_BUCKET(1,2) to indicate 1 of 2 instead of the "00000_1"
> > PathFilter, which is fragile.
> >
> >
> > Right now, the most fool-proof solution we've hit upon was to apply the
> > ORC bloom filter to the bucket columns, which is far safer as it does not
> > care about the DDL - but does a membership check on the actual metadata &
> > prunes deeper at the stripe-level if it is sorted as well.
> >
> > That is somewhat neat since this doesn't need any new options for
> querying
> > - it automatically(*) kicks in for your query pattern.
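> >
> > A minimal sketch of that setup (the table name, bucket count and column
> > names here are illustrative, not from your schema):
> >
> > ```sql
> > -- Bloom filter on the bucket column: ORC records a membership
> > -- summary per stripe, so stripes whose metadata rules the value
> > -- out are skipped. SORTED BY tightens the stripe-level pruning.
> > CREATE TABLE testtble_orc (bucket_col STRING, v STRING)
> > CLUSTERED BY (bucket_col) SORTED BY (bucket_col) INTO 32 BUCKETS
> > STORED AS ORC
> > TBLPROPERTIES ("orc.bloom.filter.columns" = "bucket_col");
> > ```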
> >
> > Cheers,
> > Gopal
> > (*) - conditions apply - there's a threshold for file-size for these
> > filters to be evaluated during planning (to prevent HS2 from burning
> CPU).
> >
> >
> > From:  Daniel Haviv <daniel.ha...@veracity-group.com>
> > Reply-To:  "user@hive.apache.org" <user@hive.apache.org>
> > Date:  Thursday, March 12, 2015 at 2:36 AM
> > To:  "user@hive.apache.org" <user@hive.apache.org>
> > Subject:  Bucket pruning
> >
> >
> > Hi,
> > We created a bucketed table and when we select in the following way:
> > select *
> > from testtble
> > where bucket_col ='X';
> >
> > We observe that the entire table is being read and not just the
> > specific bucket.
> >
> > Does Hive support such a feature?
> >
> >
> > Thanks,
> > Daniel
> >
> >
>
