Hi Saurabh.
Bucketing in Hive refers to hash partitioning where a hashing function is applied. Likewise an RDBMS, Hive will apply a linear hashing algorithm to prevent data from clustering within specific partitions. Hashing is very effective if the column selected for bucketing has very high selectivity like an ID column where selectivity (select count(distinct(column))/count(column) ) = 1. In this case, the created partitions/ files will be as evenly sized as possible. In a nutshell bucketing is a method to get data evenly distributed over many partitions/files. One should define the number of buckets by a power of two -- 2^n, like 2, 4, 8, 16 etc to achieve best results. Again bucketing will help concurrency in Hive. It may even allow a partition wise join i.e. a join between two tables that are bucketed on the same column with the same number of buckets (anyone has tried this?). You need to work out the selectivity of column you are using for bucketing using the above formulae or something similar then decide on the number of buckets say clustered by (object_id) into 256 buckets. HTH Mich Talebzadeh Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908. pdf Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7. co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4 Publications due shortly: Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8 Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly http://talebzadehmich.wordpress.com NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility. -----Original Message----- From: Db-Blog [mailto:mpp.databa...@gmail.com] Sent: 06 September 2015 21:22 To: user@hive.apache.org Subject: Re: Bucketing- Identify Number of Buckets Details of Hive Version: I am using Hive -14.0 with Tez as execution engine. Thanks, Saurabh Sent from my iPhone, please avoid typos. > On 07-Sep-2015, at 1:51 am, Db-Blog < <mailto:mpp.databa...@gmail.com> mpp.databa...@gmail.com> wrote: > > Hi, > > I need to join two big tables in hive. The join key is the grain of both these tables, hence clustering and sorting on the same will provide significant performance optimisation while joining. > > However, i am not sure how to calculate the exact number of buckets while creating these tables. Can someone please share any pointers on the same? > > Planning to keep these Clustered and Sorted tables as parquet/orc- for columnar storage and better compression. > > Thanks, > Saurabh > > Sent from my iPhone, please avoid typos.