Hello.
I want to ask the correct bucketing and tablesample way.
There is a table X which I created by
CREATE TABLE `X`(`action_id` string,`classifier` string)
CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
STORED AS ORC
Then I inserted 500M of rows into X by
set hive.enforce.bucketing=true;
INSERT OVERWRITE INTO X SELECT * FROM X_RAW
Then I want to count or search some rows with condition. roughly,
SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb'
But I'd better to USE tablesample as I clustered X (action_id, classifier).
So, the better query will be
SELECT COUNT(*) FROM X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
WHERE action_id='aaa' AND classifier='bbb'
Is there any wrong above? But I can't not find any performance gain between
these two query.
query1 and RESULT( with no tablesample.)
SELECT COUNT(*)) from X
WHERE action_id='aaa' and classifier='bbb'
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED
KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 256 256 0 0
0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0
0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 15.35
s
--------------------------------------------------------------------------------
It scans full data.
query 2 and RESULT
SELECT COUNT(*)) from X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
WHERE action_id='aaa' and classifier='bbb'
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED
KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 256 256 0 0
0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0
0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME:
15.82 s
--------------------------------------------------------------------------------
It ALSO scans full data.
query 2 RESULT WHAT I EXPECTED.
Result what I expected is something like...
(use 1 map and relatively faster than without tabmesample)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED
KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0
0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0
0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME:
3.xx s
--------------------------------------------------------------------------------
Values of action_id and classifier are well distributed and there is no
skewed data.
So I want to ask you what will be a correct query that prune and target
specific bucket by multiple column?