Hi Bejoy,

Thanks, I see. I was asking because I wanted to know how much total storage space I would need on the cluster for the given data in the tables.
Are you saying that for 2 tables of 500 GB each (spread across the cluster), there would be a need for intermediate storage of 250000 GB? Or are you saying that it is the sum total of all data *processing* that happens, but is not actually stored? I'm guessing you were referring to the latter, because the former seems unscalable.

Regards,
Safdar

On Mon, May 7, 2012 at 10:44 AM, Bejoy Ks <bejoy...@yahoo.com> wrote:

> Hi Ali
>
> The 500*500 gigs of data is actually processed by multiple tasks
> across multiple nodes. With default settings, each task processes 64 MB
> of data. So you don't need 250000 GB of temp space on a node at all. A
> few gigs of free space is more than enough for any MR task.
>
> Regards
> Bejoy KS
>
> ------------------------------
> *From:* Ali Safdar Kureishy <safdar.kurei...@gmail.com>
> *To:* user@hive.apache.org
> *Sent:* Monday, May 7, 2012 1:01 PM
> *Subject:* Storage requirements for intermediate (map-side-output) data
> during Hive joins
>
> Hi,
>
> I'm setting up a Hadoop cluster and would like to understand how much disk
> space I should expect to need with joins.
>
> Let's assume that I have 2 tables, each of about 500 GB. Since the tables
> are large, these will all be reduce-side joins. As far as I understand such
> joins, the data generated is a cross product of the size of the two tables.
> Am I wrong?
>
> In other words, for a reduce-side join in Hive involving 2 such tables,
> would I need to accommodate 500 GB * 500 GB = 250000 GB of
> *intermediate* (map-side output) data before the reducer(s) kick in in my
> cluster? Or am I missing something? That seems ridiculously high, so I hope
> I'm mistaken.
>
> But if the above IS accurate, what are the ways to reduce this consumption
> for the same kind of join in Hive?
>
> Thanks,
> Safdar
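
For reference, the arithmetic discussed in this thread can be sketched roughly as below. In a reduce-side join each mapper emits its own input rows tagged with the join key, so the map-side output scales with the sum of the table sizes rather than their product. The function name `estimate_join_footprint` is hypothetical, and the sketch assumes a 64 MB split per map task (the default mentioned above) and map output about as large as the input, with no column pruning, filtering, or compression. Treat the numbers as a ballpark, not a sizing guarantee.

```python
import math

MB_PER_GB = 1024


def estimate_join_footprint(table_sizes_gb, split_mb=64):
    """Ballpark figures for a reduce-side join (hypothetical helper).

    Assumes one map task per input split and map output roughly equal
    in size to the map input.
    """
    # One map task per split, counted per table.
    map_tasks = sum(
        math.ceil(size_gb * MB_PER_GB / split_mb) for size_gb in table_sizes_gb
    )
    # Intermediate data grows with the SUM of the inputs, not the product.
    intermediate_gb = sum(table_sizes_gb)
    return {
        "map_tasks": map_tasks,
        "intermediate_gb_cluster_wide": intermediate_gb,
        "input_per_task_mb": split_mb,
    }


if __name__ == "__main__":
    # Two 500 GB tables, as in the question above.
    print(estimate_join_footprint([500, 500]))
```

Under these assumptions, two 500 GB tables give about 16000 map tasks and roughly 1000 GB of intermediate data spread across the whole cluster, while any single task only handles about 64 MB of input at a time.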