Re: Storage requirements for intermediate (map-side-output) data during Hive joins

Bejoy Ks Mon, 07 May 2012 00:48:34 -0700

Hi Safdar
     A map-side join uses memory on the Hive client to build hash tables. Such 
joins never reach the key-value shuffle, because no reduce phase is involved 
in those jobs.
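
To illustrate the hash-table idea, here is a minimal Python sketch, not Hive's actual implementation. The function name map_side_join and the (key, value) row layout are assumptions made for the example. The smaller table is loaded into an in-memory hash table, and the larger table is streamed past it, so nothing is shuffled to a reducer.

```python
from collections import defaultdict


def map_side_join(small_rows, large_rows):
    """Join two lists of (key, value) rows using an in-memory hash table.

    Hypothetical helper for illustration only; Hive's real map join differs.
    """
    # Build phase: load the smaller table into a hash table keyed by join key.
    table = defaultdict(list)
    for key, value in small_rows:
        table[key].append(value)

    # Probe phase: stream the larger table and look up each key.
    # No shuffle or reduce step is needed.
    joined = []
    for key, value in large_rows:
        for small_value in table.get(key, []):
            joined.append((key, small_value, value))
    return joined
```

The memory cost here depends on the size of the smaller table only, which is why this approach suits joins where one side fits in memory.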


Regards
Bejoy KS


________________________________
 From: Ali Safdar Kureishy <safdar.kurei...@gmail.com>
To: user@hive.apache.org 
Sent: Monday, May 7, 2012 1:08 PM
Subject: Re: Storage requirements for intermediate (map-side-output) data 
during Hive joins
 

Please ignore my question below. I made a mistake with my calculation. The 
map-side joins do not perform a cross-product of the data. They just emit the 
data using the join-key as the row key.
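
As a rough sketch of why the intermediate data is not a cross product, the Python below models a map phase that emits each input row once, keyed by the join key. The names map_phase and intermediate_record_count and the "L"/"R" table tags are assumptions for the example, not Hive internals.

```python
def map_phase(rows, table_tag):
    """Emit one (join_key, (table_tag, row)) pair per input row.

    Assumes the join key is the first field of each row.
    """
    return [(row[0], (table_tag, row)) for row in rows]


def intermediate_record_count(left_rows, right_rows):
    """Count the records the map phase would emit for both tables."""
    # Each row is emitted exactly once, so the count is a sum, not a product.
    return len(map_phase(left_rows, "L")) + len(map_phase(right_rows, "R"))
```

Under this model, the map-side output for two 500 GB tables would be on the order of their combined size, roughly 1000 GB before any compression or column pruning, rather than 250000 GB.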


Thanks,
Safdar




On Mon, May 7, 2012 at 12:31 AM, Ali Safdar Kureishy 
<safdar.kurei...@gmail.com> wrote:

>Hi,
>
>
>I'm setting up a Hadoop cluster and would like to understand how much disk 
>space I should expect to need with joins.
>
>
>Let's assume that I have 2 tables, each of about 500 GB. Since the tables are 
>large, these will all be reduce-side joins. As far as I know about such joins, 
>the data generated is a cross product of the size of the two tables. Am I 
>wrong?
>
>
>In other words, for a reduce-side join in Hive involving 2 such tables, would 
>I need to accommodate 500 GB * 500 GB = 250000 GB of 
>intermediate (map-side output) data before the reducer(s) kick in on my 
>cluster? Or am I missing something? That seems ridiculously high, so I hope 
>I'm mistaken.
>
>
>But if the above IS accurate, what are the ways to reduce this consumption for 
>the same kind of join in Hive?
>
>
>Thanks,
>Safdar
