Hi,

No, and it's a shame because we're stuck on some compatibility details with
this.

The primary issue is the fact that the InputFormat is very generic and
offers no way to communicate StorageDescriptor or bucketing.

The split generation for something like SequenceFileInputFormat lives inside
MapReduce, where it has no idea about bucketing.

So InputFormat.getSplits(conf) returns something relatively arbitrary,
which contains a mixture of files when CombineInputFormat is turned on.

I have implemented this twice so far for ORC (for custom Tez jobs, with
huge wins) by using an MRv2 PathFilter over the regular OrcNewInputFormat
implementation, by turning off combine input and using Tez grouping
instead.

But that has proved to be very fragile for a trunk feature, since with
schema evolution of partitioned tables older partitions may be bucketed
with a different count from a newer partition - so the StorageDescriptor
for each partition has to be fetched before we can generate a valid
PathFilter.
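
To make the per-partition problem concrete, here is a rough sketch in
Python (not the actual Tez/ORC code). It assumes bucket files are named
"<zero-padded bucket index>_<attempt>", that the bucket is
(hash & Integer.MAX_VALUE) % bucket count, and that a 31-multiplier hash
over the bytes stands in for Hive's string hash (ASCII only here). The
buckets_per_partition dict is a hypothetical stand-in for the
StorageDescriptor lookup.

```python
import re

# Assumed file naming: "<zero-padded bucket index>_<attempt>", e.g. "000001_0".
BUCKET_FILE = re.compile(r"^(\d+)_\d+$")


def java_style_hash(value: str) -> int:
    """31-multiplier hash over the bytes, kept to 32 bits (illustrative only)."""
    h = 0
    for b in value.encode("utf-8"):
        h = (h * 31 + b) & 0xFFFFFFFF
    return h


def bucket_for(value: str, num_buckets: int) -> int:
    """Bucket index under the assumed (hash & MAX_INT) % num_buckets rule."""
    return (java_style_hash(value) & 0x7FFFFFFF) % num_buckets


def make_path_filter(value: str, buckets_per_partition: dict):
    """Build an accept(path) function that keeps only the matching bucket file.

    buckets_per_partition maps a partition directory to its bucket count;
    each partition may have a different count, which is the fragile part.
    """
    def accept(path: str) -> bool:
        partition, _, name = path.rpartition("/")
        match = BUCKET_FILE.match(name)
        count = buckets_per_partition.get(partition)
        if match is None or count is None:
            return True  # unknown layout or partition: never prune
        return int(match.group(1)) == bucket_for(value, count)

    return accept
```

The same key lands in a different file index once the bucket count
changes between partitions, so a single fixed file-name pattern cannot be
right for the whole table.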

The SARGs are probably a better way to do this eventually as they can
implement IN_BUCKET(1,2) to indicate 1 of 2 instead of the "00000_1"
PathFilter which is fragile.


Right now, the most fool-proof solution we've hit upon was to apply the
ORC bloom filter to the bucket columns, which is far safer as it does not
care about the DDL - but does a membership check on the actual metadata &
prunes deeper at the stripe-level if it is sorted as well.

That is somewhat neat since this doesn't need any new options for querying
- it automatically(*) kicks in for your query pattern.
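
For anyone unfamiliar with the idea, a toy sketch of bloom-filter pruning
in Python follows. This is not ORC's implementation: the bit count, the
number of hashes, the SHA-256 based hashing and the stripes_to_read helper
are all assumptions made for illustration. The point is only that a
"definitely not present" answer lets a reader skip a stripe without
consulting the DDL.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter; sizes and hash choice are illustrative assumptions."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # Python int used as a bitset

    def _positions(self, value: str):
        # Derive num_hashes bit positions by salting the value with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def stripes_to_read(stripe_filters, value: str):
    """Return indexes of stripes whose filter cannot rule the value out."""
    return [i for i, bf in enumerate(stripe_filters) if bf.might_contain(value)]
```

A filter never gives a false negative, so a stripe that does hold the
value is always read; false positives only cost an unnecessary read.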

Cheers,
Gopal
(*) - conditions apply - there's a threshold for file-size for these
filters to be evaluated during planning (to prevent HS2 from burning CPU).


From:  Daniel Haviv <daniel.ha...@veracity-group.com>
Reply-To:  "user@hive.apache.org" <user@hive.apache.org>
Date:  Thursday, March 12, 2015 at 2:36 AM
To:  "user@hive.apache.org" <user@hive.apache.org>
Subject:  Bucket pruning


Hi,
We created a bucketed table and when we select in the following way:
select * 
from testtble
where bucket_col ='X';

We observe that all of the table is being read and not just the
specific bucket.

Does Hive support such a feature ?


Thanks,
Daniel

