Re: Storage requirements for intermediate (map-side-output) data during Hive joins

Bejoy Ks Mon, 07 May 2012 00:45:19 -0700

Hi Ali

      The 500*500 Gigs of data is actually processed by multiple tasks across 
multiple nodes. In default settings a task will process 64Mb of data per task. 
So you don't need 250000 GB temp space in a node at all . A few gigs 
of free space is more than enough for any MR task .

Regards

Bejoy KS

________________________________
 From: Ali Safdar Kureishy <safdar.kurei...@gmail.com>
To: user@hive.apache.org 
Sent: Monday, May 7, 2012 1:01 PM
Subject: Storage requirements for intermediate (map-side-output) data during 
Hive joins

Hi,

I'm setting up a Hadoop cluster and would like to understand how much disk 
space I should expect to need with joins.

Let's assume that I have 2 tables, each of about 500 GB. Since the tables are 
large, these will all be reduce-side joins. As far as I know about such joins, 
the data generated is a cross product of the size of the two tables. Am I wrong?

In other words, for a reduce-side join in Hive involving 2 such tables, would I 
need to accommodate for 500 GB * 500 GB = 250000 GB of intermediate (map-side 
output) data before the reducer(s) kick-in in my cluster? Or am I missing 
something? That seems rediculously high, so I hope I'm mistaken.

But if the above IS accurate, what are the ways to reduce this consumption for 
the same kind of join in Hive?

Thanks,
Safdar

Re: Storage requirements for intermediate (map-side-output) data during Hive joins

Reply via email to