Re: clustered bucket and tablesample

no jihun Sat, 14 May 2016 05:39:21 -0700

ah, as i mentioned
both field type of action_id and classifier is STRING. and I can not change
the type.


CREATE TABLE `X`(`action_id` string,`classifier` string)
CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
STORED AS ORC

I use two fields for hash then bucketing because each one field is not so
well distributed.

my concern is not about the strong hash source but about How can I
tablesample to the a bucket by field value what provided by 'where clause'

when I clustered by string fields which one is right for tablesample?
1. provide fields
TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)

2. provide values
TABLESAMPLE(BUCKET 1 OUT OF 256 ON  'aaa', 'bbb')
2016. 5. 14. 오후 8:48에 "Mich Talebzadeh" <[email protected]>님이 작성:

> Is action_id can be created as a numeric column:
>
> CREATE TABLE X ( action_id bigint,  ..)
>
> Bucketing or hash partitioning best works on numeric columns with high
> cardinality (say a primary key).
>
> From my old notes:
>
> Bucketing in Hive refers to hash partitioning where a hashing function is
> applied. Likewise an RDBMS like Oracle, Hive will apply a linear hashing
> algorithm to prevent data from clustering within specific partitions.
> Hashing is very effective if the column selected for bucketing has very
> high selectivity like an ID column where selectivity (select
> count(distinct(column))/count(column) ) = 1.  In this case, the created
> partitions/ files will be as evenly sized as possible. In a nutshell
> bucketing is a method to get data evenly distributed over many
> partitions/files.  One should define the number of buckets by a power of
> two -- 2^n,  like 2, 4, 8, 16 etc to achieve best results. Again bucketing
> will help concurrency in Hive. It may even allow a partition wise join i.e.
> a join between two tables that are bucketed on the same column with the
> same number of buckets (anyone has tried this?)
>
>
>
> One more things. When one defines the number of buckets at table creation
> level in Hive, the number of partitions/files will be fixed. In contrast,
> with partitioning you do not have this limitation.
>
> can you do
>
> show create table X
>
> and send the output. please.
>
>
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 May 2016 at 12:23, no jihun <[email protected]> wrote:
>
>> Hello.
>>
>> I want to ask the correct bucketing and tablesample way.
>>
>> There is a table X which I created by
>>
>> CREATE TABLE `X`(`action_id` string,`classifier` string)
>> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
>> STORED AS ORC
>>
>> Then I inserted 500M of rows into X by
>>
>> set hive.enforce.bucketing=true;
>> INSERT OVERWRITE INTO X SELECT * FROM X_RAW
>>
>> Then I want to count or search some rows with condition. roughly,
>>
>> SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb'
>>
>> But I'd better to USE tablesample as I clustered X (action_id,
>> classifier). So, the better query will be
>>
>> SELECT COUNT(*) FROM X
>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>> WHERE action_id='aaa' AND classifier='bbb'
>>
>> Is there any wrong above? But I can't not find any performance gain
>> between these two query.
>>
>> query1 and RESULT( with no tablesample.)
>>
>> SELECT COUNT(*)) from X
>> WHERE action_id='aaa' and classifier='bbb'
>>
>>
>> --------------------------------------------------------------------------------
>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
>> KILLED
>>
>> --------------------------------------------------------------------------------
>> Map 1 ..........   SUCCEEDED    256        256        0        0
>> 0       0
>> Reducer 2 ......   SUCCEEDED      1          1        0        0
>> 0       0
>>
>> --------------------------------------------------------------------------------
>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 15.35
>> s
>>
>> --------------------------------------------------------------------------------
>> It scans full data.
>>
>> query 2 and RESULT
>>
>> SELECT COUNT(*)) from X
>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>> WHERE action_id='aaa' and classifier='bbb'
>>
>>
>> --------------------------------------------------------------------------------
>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
>> KILLED
>>
>> --------------------------------------------------------------------------------
>> Map 1 ..........   SUCCEEDED    256        256        0        0
>> 0       0
>> Reducer 2 ......   SUCCEEDED      1          1        0        0
>> 0       0
>>
>> --------------------------------------------------------------------------------
>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME:
>> 15.82     s
>>
>> --------------------------------------------------------------------------------
>> It ALSO scans full data.
>>
>> query 2 RESULT WHAT I EXPECTED.
>>
>> Result what I expected is something like...
>> (use 1 map and relatively faster than without tabmesample)
>>
>> --------------------------------------------------------------------------------
>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
>> KILLED
>>
>> --------------------------------------------------------------------------------
>> Map 1 ..........   SUCCEEDED      1          1        0        0
>> 0       0
>> Reducer 2 ......   SUCCEEDED      1          1        0        0
>> 0       0
>>
>> --------------------------------------------------------------------------------
>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME:
>> 3.xx     s
>>
>> --------------------------------------------------------------------------------
>>
>> Values of action_id and classifier are well distributed and there is no
>> skewed data.
>>
>> So I want to ask you what will be a correct query that prune and target
>> specific bucket by multiple column?
>>
>
>

Re: clustered bucket and tablesample

Reply via email to