This is your code
SELECT COUNT(*) FROM X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
WHERE action_id='aaa' AND classifier='bbb'
Well I have a table dummy with 1 billion rows imported from Oracle as ORC
format
hive> show create table dummy;
OK
CREATE TABLE `dummy`(
`id` int,
`clustered` int,
`scattered` int,
`randomised` int,
`random_string` varchar(50),
`small_vc` varchar(10),
`padding` varchar(10))
*CLUSTERED BY ( id)INTO 256 BUCKETS*ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/dummy'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
'numFiles'='1',
'numRows'='100000000',
'orc.bloom.filter.columns'='ID',
'orc.bloom.filter.fpp'='0.05',
'orc.compress'='SNAPPY',
'orc.create.index'='true',
'orc.row.index.stride'='10000',
'orc.stripe.size'='16777216',
'rawDataSize'='0',
'totalSize'='5662644579',
'transient_lastDdlTime'='1463245925')
If I turn on the plan for the following two cases. First a simple case
hive>
*EXPLAIN SELECT COUNT(1) FROM dummy > where id = 20;*OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Spark
Edges:
Reducer 2 <- Map 1 (GROUP, 1)
DagName: hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:7
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: dummy
Statistics: Num rows: 100000000 Data size: 5662644736
Basic stats: COMPLETE Column stats: NONE
Filter Operator
* predicate: (id = 20) (type: boolean) *
Statistics: Num rows: 50000000 Data size: 2831322368 Basic stats: COMPLETE
Column stats: NONE
Select Operator
Statistics: Num rows: 50000000 Data size: 2831322368
Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(1)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
value expressions: _col0 (type: bigint)
Reducer 2
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.064 seconds, Fetched: 51 row(s)
Now we try with tablesample but assigning predicate values inside the
bracket as below because you are looking in bucket 1 for those values and
you want optimizer to know that.
*hive> EXPLAIN SELECT COUNT(1) FROM dummy > TABLESAMPLE (BUCKET 1 OUT
OF 256 ON ID = 10)* > ;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Spark
Edges:
Reducer 2 <- Map 1 (GROUP, 1)
DagName: hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:10
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: dummy
Statistics: Num rows: 100000000 Data size: 5662644736
Basic stats: COMPLETE Column stats: NONE
Filter Operator
* predicate: (((hash((id = 10)) & 2147483647) % 256) = 0)
(type: boolean)* Statistics: Num rows: 50000000 Data
size: 2831322368 Basic stats: COMPLETE Column stats: NONE
Select Operator
Statistics: Num rows: 50000000 Data size: 2831322368
Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(1)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
value expressions: _col0 (type: bigint)
Reducer 2
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Otherwise I don't see much happening
hive> EXPLAIN SELECT COUNT(1) FROM dummy
> TABLESAMPLE (BUCKET 1 OUT OF 256 ON ID)
> WHERE ID = 10;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Spark
Edges:
Reducer 2 <- Map 1 (GROUP, 1)
DagName: hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:13
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: dummy
Statistics: Num rows: 100000000 Data size: 5662644736
Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (false and (id = 10)) (type: boolean)
Statistics: Num rows: 25000000 Data size: 1415661184
Basic stats: COMPLETE Column stats: NONE
Select Operator
Statistics: Num rows: 25000000 Data size: 1415661184
Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(1)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
value expressions: _col0 (type: bigint)
Reducer 2
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
In general in my experience bucketing in ORC is the only area where ORC
tables come handy.
HTH
Dr Mich Talebzadeh
LinkedIn *
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
http://talebzadehmich.wordpress.com
On 14 May 2016 at 13:38, no jihun <[email protected]> wrote:
> ah, as i mentioned
> both field type of action_id and classifier is STRING. and I can not
> change the type.
>
> CREATE TABLE `X`(`action_id` string,`classifier` string)
> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
> STORED AS ORC
>
> I use two fields for hash then bucketing because each one field is not so
> well distributed.
>
> my concern is not about the strong hash source but about How can I
> tablesample to the a bucket by field value what provided by 'where clause'
>
> when I clustered by string fields which one is right for tablesample?
> 1. provide fields
> TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
>
> 2. provide values
> TABLESAMPLE(BUCKET 1 OUT OF 256 ON 'aaa', 'bbb')
> 2016. 5. 14. 오후 8:48에 "Mich Talebzadeh" <[email protected]>님이 작성:
>
> Is action_id can be created as a numeric column:
>>
>> CREATE TABLE X ( action_id bigint, ..)
>>
>> Bucketing or hash partitioning best works on numeric columns with high
>> cardinality (say a primary key).
>>
>> From my old notes:
>>
>> Bucketing in Hive refers to hash partitioning where a hashing function is
>> applied. Likewise an RDBMS like Oracle, Hive will apply a linear hashing
>> algorithm to prevent data from clustering within specific partitions.
>> Hashing is very effective if the column selected for bucketing has very
>> high selectivity like an ID column where selectivity (select
>> count(distinct(column))/count(column) ) = 1. In this case, the created
>> partitions/ files will be as evenly sized as possible. In a nutshell
>> bucketing is a method to get data evenly distributed over many
>> partitions/files. One should define the number of buckets by a power of
>> two -- 2^n, like 2, 4, 8, 16 etc to achieve best results. Again bucketing
>> will help concurrency in Hive. It may even allow a partition wise join i.e.
>> a join between two tables that are bucketed on the same column with the
>> same number of buckets (anyone has tried this?)
>>
>>
>>
>> One more things. When one defines the number of buckets at table creation
>> level in Hive, the number of partitions/files will be fixed. In contrast,
>> with partitioning you do not have this limitation.
>>
>> can you do
>>
>> show create table X
>>
>> and send the output. please.
>>
>>
>>
>> Thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn *
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 14 May 2016 at 12:23, no jihun <[email protected]> wrote:
>>
>>> Hello.
>>>
>>> I want to ask the correct bucketing and tablesample way.
>>>
>>> There is a table X which I created by
>>>
>>> CREATE TABLE `X`(`action_id` string,`classifier` string)
>>> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
>>> STORED AS ORC
>>>
>>> Then I inserted 500M of rows into X by
>>>
>>> set hive.enforce.bucketing=true;
>>> INSERT OVERWRITE INTO X SELECT * FROM X_RAW
>>>
>>> Then I want to count or search some rows with condition. roughly,
>>>
>>> SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb'
>>>
>>> But I'd better to USE tablesample as I clustered X (action_id,
>>> classifier). So, the better query will be
>>>
>>> SELECT COUNT(*) FROM X
>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
>>> WHERE action_id='aaa' AND classifier='bbb'
>>>
>>> Is there any wrong above? But I can't not find any performance gain
>>> between these two query.
>>>
>>> query1 and RESULT( with no tablesample.)
>>>
>>> SELECT COUNT(*)) from X
>>> WHERE action_id='aaa' and classifier='bbb'
>>>
>>> ------------------------------------------------------------
>>> --------------------
>>> VERTICES STATUS TOTAL COMPLETED RUNNING PENDING
>>> FAILED KILLED
>>> ------------------------------------------------------------
>>> --------------------
>>> Map 1 .......... SUCCEEDED 256 256 0 0
>>> 0 0
>>> Reducer 2 ...... SUCCEEDED 1 1 0 0
>>> 0 0
>>> ------------------------------------------------------------
>>> --------------------
>>> VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME:
>>> 15.35 s
>>> ------------------------------------------------------------
>>> --------------------
>>> It scans full data.
>>>
>>> query 2 and RESULT
>>>
>>> SELECT COUNT(*)) from X
>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
>>> WHERE action_id='aaa' and classifier='bbb'
>>>
>>> ------------------------------------------------------------
>>> --------------------
>>> VERTICES STATUS TOTAL COMPLETED RUNNING PENDING
>>> FAILED KILLED
>>> ------------------------------------------------------------
>>> --------------------
>>> Map 1 .......... SUCCEEDED 256 256 0 0
>>> 0 0
>>> Reducer 2 ...... SUCCEEDED 1 1 0 0
>>> 0 0
>>> ------------------------------------------------------------
>>> --------------------
>>> VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME:
>>> 15.82 s
>>> ------------------------------------------------------------
>>> --------------------
>>> It ALSO scans full data.
>>>
>>> query 2 RESULT WHAT I EXPECTED.
>>>
>>> Result what I expected is something like...
>>> (use 1 map and relatively faster than without tabmesample)
>>> ------------------------------------------------------------
>>> --------------------
>>> VERTICES STATUS TOTAL COMPLETED RUNNING PENDING
>>> FAILED KILLED
>>> ------------------------------------------------------------
>>> --------------------
>>> Map 1 .......... SUCCEEDED 1 1 0 0
>>> 0 0
>>> Reducer 2 ...... SUCCEEDED 1 1 0 0
>>> 0 0
>>> ------------------------------------------------------------
>>> --------------------
>>> VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME:
>>> 3.xx s
>>> ------------------------------------------------------------
>>> --------------------
>>>
>>> Values of action_id and classifier are well distributed and there is no
>>> skewed data.
>>>
>>> So I want to ask you what will be a correct query that prune and target
>>> specific bucket by multiple column?
>>>
>>
>>