Hi Ali

Sorry, my short response may have confused you. Let us assume you are doing a LEFT OUTER JOIN on two tables 'A' and 'B' on a column 'id' (the tables are large, so only a reduce-side join is possible). From my understanding, this is how it happens (explanation based on a simple join):

- There are two sets of mappers: one set processes table A, the other processes table B. Each mapper emits its output with A.id or B.id as the key and the required columns from its table as the values (along with a tag identifying which table each value came from).
- On the reducer, the values for a common key from both sets of mappers come together. For A LEFT OUTER JOIN B, if there are n matching values in the value list from table B, there will be n rows in the output (matching is based on the equality conditions in the ON clause); if there are no matches for a key from A, one row is emitted with NULLs for B's columns.
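For illustration, a minimal sketch of such a query (the table aliases and column names are placeholders, not your actual schema):

    SELECT a.id, a.name, b.status
    FROM A a
    LEFT OUTER JOIN B b
    ON (a.id = b.id);

If a given a.id has three matching rows in B, the reducer emits three output rows for it; if it has none, it emits a single row with NULL in b.status.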
Now say table A has 100 columns and table B has 150 columns, but the final result needs only 25 of them. In that case the unneeded columns are dropped in the map phase itself, so the map output has a much smaller data volume (map output gets spilled to the local file system). The amount of data that goes to local disk therefore depends more on your join query than on the raw input size (see the first sketch below the quoted thread). Also, the intermediate output is stored in Hadoop's optimized serialized form, Writables, so it occupies less space than the actual input data in HDFS. The map output tmp dir on a node holds only the data processed by that node, so each node carries just a fraction of the total volume; that is one of the advantages of distributed processing. To add on, you can further shrink the intermediate output by enabling map output compression (see the second sketch below).

Regards
Bejoy KS

________________________________
From: Ali Safdar Kureishy <safdar.kurei...@gmail.com>
To: Bejoy Ks <bejoy...@yahoo.com>
Cc: "user@hive.apache.org" <user@hive.apache.org>
Sent: Tuesday, May 8, 2012 12:28 PM
Subject: Re: Storage requirements for intermediate (map-side-output) data during Hive joins

Hi Bejoy,

Thanks... I see... I was asking because I wanted to know how much total storage space I would need on the cluster for the given data in the tables. Are you saying that for 2 tables of 500 GB each (spread across the cluster), there would be a need for intermediate storage of 250000 GB? Or are you saying that it is the sum total of all data processing that happens, but is not actually stored? I'm guessing you were referring to the latter, because the former seems unscalable.

Regards,
Safdar

On Mon, May 7, 2012 at 10:44 AM, Bejoy Ks <bejoy...@yahoo.com> wrote:

> Hi Ali
>
> The 500*500 GB of data is actually processed by multiple tasks across multiple nodes. With default settings a task processes 64 MB of data (the default HDFS block size). So you don't need 250000 GB of temp space on a node at all; a few GB of free space is more than enough for any MR task.
>
> Regards
> Bejoy KS
>
> ________________________________
> From: Ali Safdar Kureishy <safdar.kurei...@gmail.com>
> To: user@hive.apache.org
> Sent: Monday, May 7, 2012 1:01 PM
> Subject: Storage requirements for intermediate (map-side-output) data during Hive joins
>
> Hi,
>
> I'm setting up a Hadoop cluster and would like to understand how much disk space I should expect to need with joins.
>
> Let's assume that I have 2 tables, each of about 500 GB. Since the tables are large, these will all be reduce-side joins. As far as I know about such joins, the data generated is a cross product of the size of the two tables. Am I wrong?
>
> In other words, for a reduce-side join in Hive involving 2 such tables, would I need to accommodate for 500 GB * 500 GB = 250000 GB of intermediate (map-side output) data before the reducer(s) kick in in my cluster? Or am I missing something? That seems ridiculously high, so I hope I'm mistaken.
>
> But if the above IS accurate, what are the ways to reduce this consumption for the same kind of join in Hive?
>
> Thanks,
> Safdar
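P.S. To illustrate the column-pruning point above with a sketch (made-up column names; the five columns here stand in for the 25 you need): only the columns you actually select, plus the join key, travel through the shuffle, while the other ~225 columns never leave the map phase.

    SELECT a.id, a.customer_name, a.order_total, b.ship_status, b.ship_date
    FROM A a
    LEFT OUTER JOIN B b
    ON (a.id = b.id);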
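And these are the properties I would set in the Hive session to enable map output compression; the Snappy codec is just an example and assumes its native library is installed on your cluster:

    SET mapred.compress.map.output=true;
    SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
    -- Hive's own intermediate output between chained MR jobs can be compressed as well:
    SET hive.exec.compress.intermediate=true;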