Hi Ali The 500*500 Gigs of data is actually processed by multiple tasks across multiple nodes. In default settings a task will process 64Mb of data per task. So you don't need 250000 GB temp space in a node at all . A few gigs of free space is more than enough for any MR task .
Regards Bejoy KS ________________________________ From: Ali Safdar Kureishy <safdar.kurei...@gmail.com> To: user@hive.apache.org Sent: Monday, May 7, 2012 1:01 PM Subject: Storage requirements for intermediate (map-side-output) data during Hive joins Hi, I'm setting up a Hadoop cluster and would like to understand how much disk space I should expect to need with joins. Let's assume that I have 2 tables, each of about 500 GB. Since the tables are large, these will all be reduce-side joins. As far as I know about such joins, the data generated is a cross product of the size of the two tables. Am I wrong? In other words, for a reduce-side join in Hive involving 2 such tables, would I need to accommodate for 500 GB * 500 GB = 250000 GB of intermediate (map-side output) data before the reducer(s) kick-in in my cluster? Or am I missing something? That seems rediculously high, so I hope I'm mistaken. But if the above IS accurate, what are the ways to reduce this consumption for the same kind of join in Hive? Thanks, Safdar