ah, as i mentioned both field type of action_id and classifier is STRING. and I can not change the type.
CREATE TABLE `X`(`action_id` string,`classifier` string) CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS STORED AS ORC I use two fields for hash then bucketing because each one field is not so well distributed. my concern is not about the strong hash source but about How can I tablesample to the a bucket by field value what provided by 'where clause' when I clustered by string fields which one is right for tablesample? 1. provide fields TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier) 2. provide values TABLESAMPLE(BUCKET 1 OUT OF 256 ON 'aaa', 'bbb') 2016. 5. 14. 오후 8:48에 "Mich Talebzadeh" <[email protected]>님이 작성: > Is action_id can be created as a numeric column: > > CREATE TABLE X ( action_id bigint, ..) > > Bucketing or hash partitioning best works on numeric columns with high > cardinality (say a primary key). > > From my old notes: > > Bucketing in Hive refers to hash partitioning where a hashing function is > applied. Likewise an RDBMS like Oracle, Hive will apply a linear hashing > algorithm to prevent data from clustering within specific partitions. > Hashing is very effective if the column selected for bucketing has very > high selectivity like an ID column where selectivity (select > count(distinct(column))/count(column) ) = 1. In this case, the created > partitions/ files will be as evenly sized as possible. In a nutshell > bucketing is a method to get data evenly distributed over many > partitions/files. One should define the number of buckets by a power of > two -- 2^n, like 2, 4, 8, 16 etc to achieve best results. Again bucketing > will help concurrency in Hive. It may even allow a partition wise join i.e. > a join between two tables that are bucketed on the same column with the > same number of buckets (anyone has tried this?) > > > > One more things. When one defines the number of buckets at table creation > level in Hive, the number of partitions/files will be fixed. In contrast, > with partitioning you do not have this limitation. > > can you do > > show create table X > > and send the output. please. > > > > Thanks > > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 14 May 2016 at 12:23, no jihun <[email protected]> wrote: > >> Hello. >> >> I want to ask the correct bucketing and tablesample way. >> >> There is a table X which I created by >> >> CREATE TABLE `X`(`action_id` string,`classifier` string) >> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS >> STORED AS ORC >> >> Then I inserted 500M of rows into X by >> >> set hive.enforce.bucketing=true; >> INSERT OVERWRITE INTO X SELECT * FROM X_RAW >> >> Then I want to count or search some rows with condition. roughly, >> >> SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb' >> >> But I'd better to USE tablesample as I clustered X (action_id, >> classifier). So, the better query will be >> >> SELECT COUNT(*) FROM X >> TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier) >> WHERE action_id='aaa' AND classifier='bbb' >> >> Is there any wrong above? But I can't not find any performance gain >> between these two query. >> >> query1 and RESULT( with no tablesample.) >> >> SELECT COUNT(*)) from X >> WHERE action_id='aaa' and classifier='bbb' >> >> >> -------------------------------------------------------------------------------- >> VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED >> KILLED >> >> -------------------------------------------------------------------------------- >> Map 1 .......... SUCCEEDED 256 256 0 0 >> 0 0 >> Reducer 2 ...... SUCCEEDED 1 1 0 0 >> 0 0 >> >> -------------------------------------------------------------------------------- >> VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 15.35 >> s >> >> -------------------------------------------------------------------------------- >> It scans full data. >> >> query 2 and RESULT >> >> SELECT COUNT(*)) from X >> TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier) >> WHERE action_id='aaa' and classifier='bbb' >> >> >> -------------------------------------------------------------------------------- >> VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED >> KILLED >> >> -------------------------------------------------------------------------------- >> Map 1 .......... SUCCEEDED 256 256 0 0 >> 0 0 >> Reducer 2 ...... SUCCEEDED 1 1 0 0 >> 0 0 >> >> -------------------------------------------------------------------------------- >> VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: >> 15.82 s >> >> -------------------------------------------------------------------------------- >> It ALSO scans full data. >> >> query 2 RESULT WHAT I EXPECTED. >> >> Result what I expected is something like... >> (use 1 map and relatively faster than without tabmesample) >> >> -------------------------------------------------------------------------------- >> VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED >> KILLED >> >> -------------------------------------------------------------------------------- >> Map 1 .......... SUCCEEDED 1 1 0 0 >> 0 0 >> Reducer 2 ...... SUCCEEDED 1 1 0 0 >> 0 0 >> >> -------------------------------------------------------------------------------- >> VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: >> 3.xx s >> >> -------------------------------------------------------------------------------- >> >> Values of action_id and classifier are well distributed and there is no >> skewed data. >> >> So I want to ask you what will be a correct query that prune and target >> specific bucket by multiple column? >> > >
