Hi,

 

I will have a go at your points, but I am sure there are many experts 
around.

 

As you may know already, in an RDBMS partitioning (conceptually dividing a very 
large table into sub-tables) is deployed to address three areas. 

 

1.     Availability -- each partition can reside on a different 
tablespace/device. Hence a problem with a tablespace/device will take out a 
slice of the table's data instead of the whole thing. This does not really 
apply to Hive with 3-way block replication as standard

2.     Manageability -- partitioning provides a mechanism for splitting 
whole-table jobs into clear batches. Partition exchange can make it easier to 
bulk load data. Defragmenting, moving older partitions to lower-tier storage, 
updating stats etc. Most of these benefits apply to Hive as well. Please check 
the docs. 

3.     Performance -- partition elimination 

 

In its simplest form (excluding composite partitioning), Hive partitioning is 
similar to “range partitioning” in an RDBMS. One can partition a table (say 
partitioned_table as shown below, which is batch loaded from 
non_partitioned_table) by country, year, month etc. Each partition is stored 
in Hive under its own sub-directory of the table directory, like below

 

/user/hive/warehouse/scratchpad.db/partitioned_table/country=Italy/year=2014/month=Feb
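
For illustration only, a minimal sketch of the DDL and a static-partition load 
could look like this (the non-partition columns id and amount and the ORC 
storage format are just assumptions, not from the thread):

CREATE TABLE partitioned_table (
  id     INT,
  amount DOUBLE
)
PARTITIONED BY (country STRING, year INT, month STRING)
STORED AS ORC;

-- batch load one partition from the non-partitioned staging table
-- (assumes non_partitioned_table carries country/year/month as ordinary columns)
INSERT INTO TABLE partitioned_table PARTITION (country='Italy', year=2014, month='Feb')
SELECT id, amount
FROM   non_partitioned_table
WHERE  country = 'Italy' AND year = 2014 AND month = 'Feb';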

 

Hive does not have the concept of indexes, local or global, as yet. So without 
partitioning a simple query in Hive will have to read the entire table even if 
it is filtering for a smaller result set (WHERE clause). This becomes a 
bottleneck for running multiple MapReduce jobs over a large table. So 
partitioning will help localise the query by hitting only the relevant 
sub-directory or sub-directories. There is another important aspect with Hive 
as well. The locking granularity will be determined by the lowest slice in the 
file system (sub-directory). So entering data into the above partition/file 
will take an exclusive lock on that partition/file, but crucially the rest of 
the partitions will be available (assuming concurrency in Hive is enabled). 
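
To see the elimination at work, a query that filters on the partition columns 
should only touch the matching sub-directory; something like EXPLAIN DEPENDENCY 
can be used as a quick check of which partitions would actually be read 
(column choice as in the sketch above):

-- only .../country=Italy/year=2014/month=Feb should be scanned
SELECT count(*)
FROM   partitioned_table
WHERE  country = 'Italy' AND year = 2014 AND month = 'Feb';

-- lists the tables and partitions the query depends on
EXPLAIN DEPENDENCY
SELECT count(*)
FROM   partitioned_table
WHERE  country = 'Italy' AND year = 2014 AND month = 'Feb';

The listing below shows the locks taken while the partition above was being 
loaded from non_partitioned_table: a SHARED_READ lock on the source table and 
an EXCLUSIVE lock on the single target partition only.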

 

+---------+------------+-----------------------+-----------------------------------+------------+-------------+----------------+----------------+---------------+--------+----------+
| lockid  | database   | table                 | partition                         | lock_state | lock_type   | transaction_id | last_heartbeat | acquired_at   | user   | hostname |
+---------+------------+-----------------------+-----------------------------------+------------+-------------+----------------+----------------+---------------+--------+----------+
| Lock ID | Database   | Table                 | Partition                         | State      | Type        | Transaction ID | Last Hearbeat  | Acquired At   | User   | Hostname |
| 1711    | scratchpad | non_partitioned_table | NULL                              | ACQUIRED   | SHARED_READ | NULL           | 1428862154670  | 1428862151904 | hduser | rhes564  |
| 1711    | scratchpad | partitioned_table     | country=Italy/year=2014/month=Feb | ACQUIRED   | EXCLUSIVE   | NULL           | 1428862154670  | 1428862151905 | hduser | rhes564  |
+---------+------------+-----------------------+-----------------------------------+------------+-------------+----------------+----------------+---------------+--------+----------+
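
For reference, a listing like the above comes from SHOW LOCKS (here via 
beeline), assuming hive.support.concurrency=true and an appropriate 
lock/transaction manager have been configured:

SHOW LOCKS;
-- or for a single partition
SHOW LOCKS partitioned_table PARTITION (country='Italy', year=2014, month='Feb');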

 

Now to your second point: bucketing in Hive refers to hash partitioning, where 
a hashing function is applied. As in an RDBMS, Hive applies a hashing algorithm 
(essentially the hash of the bucketing column modulo the number of buckets) to 
prevent data from clustering within specific partitions. Hashing is very 
effective if the column selected for bucketing has very high selectivity, like 
an ID column where selectivity (select 
count(distinct(column))/count(column)) = 1. In this case, the created 
partitions/files will be as evenly sized as possible. In a nutshell, bucketing 
is a method to get data evenly distributed over many partitions/files. One 
should define the number of buckets as a power of two (2^n, i.e. 2, 4, 8, 16 
etc.) to achieve best results. Again, bucketing will help concurrency in Hive. 
It may even allow a partition-wise join, i.e. a join between two tables that 
are bucketed on the same column with the same number of buckets (has anyone 
tried this?)
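
A minimal sketch of a bucketed table and a bucket map join (the table 
other_bucketed_table and the columns are made up; on older Hive releases you 
would also set hive.enforce.bucketing=true before inserting):

CREATE TABLE bucketed_table (
  id     INT,
  amount DOUBLE
)
CLUSTERED BY (id) INTO 16 BUCKETS
STORED AS ORC;

-- ask the optimiser to consider a bucket map join between two tables
-- bucketed on the join key with compatible bucket counts
SET hive.optimize.bucketmapjoin = true;

SELECT a.id, a.amount, b.amount
FROM   bucketed_table a
JOIN   other_bucketed_table b
ON     a.id = b.id;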

 

One more thing. When one defines the number of buckets at table creation time 
in Hive, that number of buckets (and hence files) is fixed. In contrast, with 
partitioning you do not have this limitation: new partitions can be added as 
new partition key values arrive. 
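
For example, with dynamic partitioning enabled, a new sub-directory (partition) 
is created automatically for every new (country, year, month) combination that 
turns up in the data (again, id and amount are made-up columns):

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- partition columns must come last in the SELECT list
INSERT INTO TABLE partitioned_table PARTITION (country, year, month)
SELECT id, amount, country, year, month
FROM   non_partitioned_table;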

 

HTH

 

Mich

 

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries 
or their employees, unless expressly so stated. It is the responsibility of the 
recipient to ensure that this email is virus free, therefore neither Peridale 
Ltd, its subsidiaries nor their employees accept any responsibility.

 

From: Ashok Kumar [mailto:ashok34...@yahoo.com] 
Sent: 10 April 2015 17:46
To: user@hive.apache.org
Subject: partition and bucket

 


Greeting all,

Glad to join the user group. I am from DBA background Oracle/Sybase/MSSQL.

I would like to understand partitioning and bucketing in Hive and the 
difference between them.

I shall be grateful if someone explains where I should use partitioning or 
bucketing as best practice.

thanks

 
