RE: Bucketing in Hive

Mich Talebzadeh Tue, 26 Jan 2016 14:29:49 -0800

Thanks for the link Maciek.


I read it and quote:

 

“Logically and functionally bucketing and partitioning are quite similar - both 
provide mechanism to segregate and separate the table's data based on its 
content. Thanks to that significant further optimisations like [partition] 
PRUNING or [bucket] MAP JOIN are possible.

The difference seems to be imposed by design where the PARTITIONing is 
open/explicit while BUCKETing is discrete/implicit.

 

Partitioning seems to be very common if not a standard feature in all current 
RDBMS while BUCKETING seems to be HIVE specific only.

In a way BUCKETING could be also called by "hashing" or simply "IMPLICIT 
PARTITIONING".

 

Just to clarify, every RDBMS that I know of (Oracle and Sybase among others) 
provides Range, Hash and List partitioning plus local indexes on those 
partitions. One advantage is concurrent scanning of multiple partitions of very 
large tables. Hive on Tez or Spark (both of which use a DAG, as opposed to MR, 
which is essentially a serial scan) will benefit from bucketing. Otherwise, 
much like in an RDBMS, the use case for hash partitioning, AKA bucketing, is 
limited in practice.

 

 

HTH,

 

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Technology Ltd, its subsidiaries nor their employees 
accept any responsibility.

 

From: Maciek [mailto:mac...@sonra.io] 
Sent: 26 January 2016 22:01
To: user <user@hive.apache.org>
Subject: Re: Bucketing in Hive

 

These two serve the same purpose and logically are very much alike.

The difference is that partitioning may be explicit (partitioning, in pretty 
much all solid RDBMSs, Hive too) or implicit (hashing/bucketing, just Hive?).

In Hive, for some reason, they come with different, mutually exclusive sets of 
optimisations.

 

Pruning is a good example - available for Partitioned but not for Bucketed 
tables.

You can track the full list here: https://issues.apache.org/jira/browse/HIVE-9523




Thank you,

Kind Regards

~Maciek

 

On 26 January 2016 at 21:44, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Hi,

 

A number of questions have been brought up about Hive bucketing. As I see it, 
it is another name for hash partitioning (assuming that Hive partitioning is 
effectively range partitioning). I borrow these terms (range and hash 
partitioning) from industry-standard usage, as they are common among RDBMSs.

 

Excuse my ignorance, but I am at a loss to know why hash partitioning is called 
bucketing in Hive. Perhaps someone can throw light on the main differences, if 
any.

 

As I see it, partitioning in an RDBMS has these benefits:

 

1.    Availability -- each partition can reside on a different segment/device. 
Hence a problem with a device will take out a slice of the table's data instead 
of the whole thing. 

2.    Manageability -- partitioning provides a mechanism for splitting 
whole-table jobs into clear batches. Partition exchange can make it easier to 
bulk load data. It also helps with getting rid of fragmentation, moving older 
partitions to lower-tier storage, updating stats etc. 

3.    Performance -- Partition elimination 
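
To illustrate the partition elimination in point 3, here is a minimal Python 
sketch. The table layout, the build_partitions and scan helpers and the sample 
rows are all hypothetical and only model the idea; this is not how any 
particular RDBMS or Hive implements it.

```python
# Toy model of partition elimination (pruning). All names are illustrative.
from collections import defaultdict


def build_partitions(rows, key):
    """Group rows into partitions keyed by the value of the partition column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


def scan(partitions, wanted_keys):
    """Read only the partitions whose key satisfies the predicate.

    Returns the matching rows and the number of partitions actually read.
    """
    touched = [k for k in partitions if k in wanted_keys]
    rows = [row for k in touched for row in partitions[k]]
    return rows, len(touched)


sales = [
    {"year": 2014, "amount": 10},
    {"year": 2015, "amount": 20},
    {"year": 2015, "amount": 30},
    {"year": 2016, "amount": 40},
]
by_year = build_partitions(sales, "year")

# A predicate on the partition column lets us skip the other partitions.
rows_2015, partitions_read = scan(by_year, {2015})
```

Here the predicate on the partition column means only one of the three 
partitions is read; a predicate on any other column would have to scan them all.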

 

Hash partitioning is where a hashing function is applied. An RDBMS will apply a 
linear hashing algorithm f(x), like mod(x, n) for n partitions, to prevent data 
from clustering within specific partitions. Hashing is very effective if the 
column selected for partitioning has very high selectivity, like an ID column 
where selectivity (select count(distinct(column))/count(column)) = 1. In this 
case, the created partitions will be as evenly sized as possible. In a 
nutshell, hash partitioning is a method to get data evenly distributed over 
many files. One should define the number of hash partitions as a power of two 
(2^n, like 2, 4, 8, 16 etc.) to achieve best results. I am pretty sure this 
definition applies to Hive bucketing, although the hashing there is far simpler.
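
As a rough illustration of the selectivity point above, the following Python 
sketch assumes a plain modulo hash on integer keys (an assumption for the sake 
of the example, not a statement about what any given engine does). The 
selectivity and bucket_sizes helpers and the sample data are hypothetical.

```python
# Toy model of hash partitioning with a simple modulo hash. Names are illustrative.
from collections import Counter


def selectivity(values):
    """count(distinct(column)) / count(column); 1.0 means every value is unique."""
    return len(set(values)) / len(values)


def bucket_sizes(values, num_buckets):
    """Assign each integer key to bucket (key mod num_buckets); count rows per bucket."""
    counts = Counter(v % num_buckets for v in values)
    return [counts.get(b, 0) for b in range(num_buckets)]


unique_ids = list(range(1, 1001))          # an ID column, selectivity 1
skewed = [1] * 900 + list(range(2, 102))   # 900 of 1000 rows share one key

even = bucket_sizes(unique_ids, 8)    # rows per bucket for the unique keys
uneven = bucket_sizes(skewed, 8)      # rows per bucket for the skewed keys
```

With the unique IDs every one of the 8 buckets receives the same number of 
rows, while the low-selectivity column piles most rows into a single bucket.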

 

As for performance, physical co-location of records can speed up some queries, 
namely those which search for records by a defined range of keys. However, any 
queries which do not match the grain of the partitioning will not perform 
faster (and may even perform slower) than on a non-hash-partitioned (read: 
non-bucketed) table. 

 

IMO, Hash partitioning is unlikely to provide performance benefits, precisely 
because it shuffles the keys across the whole table. It will provide the 
availability and manageability benefits of partitioning. Unlike standard range 
partitioning, the number of buckets is fixed so it does not fluctuate with 
data. It may even allow a partition-wise join, i.e. a join between two tables 
that are hash partitioned (bucketed) on the same column with the same number of 
partitions (buckets), thus helping certain queries.
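
The partition-wise join mentioned above can be sketched roughly as follows in 
Python. The bucketize and bucket_join helpers, the modulo hash and the sample 
tables are hypothetical; the point is only that bucket i of one table ever needs 
to meet bucket i of the other.

```python
# Toy model of a partition-wise (bucket-by-bucket) join. Names are illustrative.

def bucketize(rows, key, num_buckets):
    """Split rows into num_buckets lists using key mod num_buckets."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[row[key] % num_buckets].append(row)
    return buckets


def bucket_join(left_buckets, right_buckets, key):
    """Join bucket i of the left table only with bucket i of the right table."""
    if len(left_buckets) != len(right_buckets):
        raise ValueError("both tables need the same number of buckets")
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        # Build a small lookup for this bucket only, then probe it.
        lookup = {}
        for right_row in right:
            lookup.setdefault(right_row[key], []).append(right_row)
        for left_row in left:
            for right_row in lookup.get(left_row[key], []):
                joined.append({**left_row, **right_row})
    return joined


customers = [{"id": i, "name": f"c{i}"} for i in range(1, 9)]
orders = [{"id": i, "total": i * 10} for i in (2, 4, 4, 7)]

result = bucket_join(
    bucketize(customers, "id", 4),
    bucketize(orders, "id", 4),
    "id",
)
```

Because both tables are bucketed on the same column with the same bucket count, 
matching keys always land in the same bucket number, so each pair of buckets can 
be joined independently of the rest.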

 

 

HTH

 

Dr Mich Talebzadeh

 

