clustered bucket and tablesample

no jihun Sat, 14 May 2016 04:24:07 -0700

Hello.

I want to ask the correct bucketing and tablesample way.


There is a table X which I created by

CREATE TABLE `X`(`action_id` string,`classifier` string)
CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
STORED AS ORC

Then I inserted 500M of rows into X by

set hive.enforce.bucketing=true;
INSERT OVERWRITE INTO X SELECT * FROM X_RAW

Then I want to count or search some rows with condition. roughly,

SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb'

But I'd better to USE tablesample as I clustered X (action_id, classifier).
So, the better query will be

SELECT COUNT(*) FROM X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
WHERE action_id='aaa' AND classifier='bbb'

Is there any wrong above? But I can't not find any performance gain between
these two query.

query1 and RESULT( with no tablesample.)

SELECT COUNT(*)) from X
WHERE action_id='aaa' and classifier='bbb'

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED    256        256        0        0
0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0
0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 15.35
s
--------------------------------------------------------------------------------
It scans full data.

query 2 and RESULT

SELECT COUNT(*)) from X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
WHERE action_id='aaa' and classifier='bbb'

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED    256        256        0        0
0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0
0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME:
15.82     s
--------------------------------------------------------------------------------
It ALSO scans full data.

query 2 RESULT WHAT I EXPECTED.

Result what I expected is something like...
(use 1 map and relatively faster than without tabmesample)
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0
0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0
0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME:
3.xx     s
--------------------------------------------------------------------------------

Values of action_id and classifier are well distributed and there is no
skewed data.

So I want to ask you what will be a correct query that prune and target
specific bucket by multiple column?

clustered bucket and tablesample

Reply via email to