Hi Ali

The 500*500 gigs of data is actually processed by multiple tasks across 
multiple nodes. With default settings, each task processes 64 MB of data. 
So you don't need 250000 GB of temp space on a node at all. A few gigs 
of free space is more than enough for any MR task.
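
To put rough numbers on this, here is a small Python sketch (plain 
arithmetic, not Hive or Hadoop code). It assumes each map task reads exactly 
one 64 MB split, as with the default mentioned above, and that 1 GB = 1024 MB. 
The function name map_task_count is made up for illustration.

```python
import math

MB_PER_GB = 1024  # assumption: binary gigabytes


def map_task_count(input_gb, split_mb=64):
    """Estimate how many map tasks read an input of input_gb gigabytes,
    assuming one task per split of split_mb megabytes."""
    return math.ceil(input_gb * MB_PER_GB / split_mb)


# Two tables of about 500 GB each, as in the question.
tables_gb = [500, 500]
total_tasks = sum(map_task_count(size) for size in tables_gb)

print(total_tasks)  # 16000 tasks, each reading only 64 MB
```

So the work is split into thousands of small tasks spread over the cluster, 
and no single task handles anything close to the full table size.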

Regards

Bejoy KS


________________________________
 From: Ali Safdar Kureishy <safdar.kurei...@gmail.com>
To: user@hive.apache.org 
Sent: Monday, May 7, 2012 1:01 PM
Subject: Storage requirements for intermediate (map-side-output) data during 
Hive joins
 

Hi,

I'm setting up a Hadoop cluster and would like to understand how much disk 
space I should expect to need with joins.

Let's assume that I have 2 tables, each of about 500 GB. Since the tables are 
large, these will all be reduce-side joins. As far as I understand such joins, 
the data generated is the cross product of the sizes of the two tables. Am I wrong?

In other words, for a reduce-side join in Hive involving 2 such tables, would I 
need to accommodate 500 GB * 500 GB = 250000 GB of intermediate (map-side 
output) data before the reducer(s) kick in on my cluster? Or am I missing 
something? That seems ridiculously high, so I hope I'm mistaken.

But if the above IS accurate, what are the ways to reduce this consumption for 
the same kind of join in Hive?

Thanks,
Safdar
