Hi Safdar,

A map-side join uses memory on the Hive client to build its hash tables. It never reaches the key-value shuffle, because such jobs have no reduce phase.
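
To make the difference concrete, here is a rough sketch in plain Python. It is not Hive code, and the function names (`map_side_join`, `reduce_side_map_output`) and sample rows are made up for illustration. It assumes the small table fits in memory for the map-side case. For the reduce-side case it only counts what the mappers would emit, ignoring tagging overhead and compression: each input row goes out once under its join key, so the shuffled volume grows with the sum of the two inputs, not their product.

```python
from collections import defaultdict


def map_side_join(small_rows, big_rows):
    """Hash join done entirely in the map phase; nothing is shuffled."""
    # Build an in-memory hash table from the small table, keyed by join key.
    table = defaultdict(list)
    for key, value in small_rows:
        table[key].append(value)

    # Stream the big table and probe the hash table row by row.
    joined = []
    for key, value in big_rows:
        for small_value in table.get(key, []):
            joined.append((key, small_value, value))
    return joined


def reduce_side_map_output(left_rows, right_rows):
    """Map output of a reduce-side join: every input row is emitted once,
    keyed by the join key and tagged with the table it came from."""
    emitted = [(key, ("L", value)) for key, value in left_rows]
    emitted += [(key, ("R", value)) for key, value in right_rows]
    return emitted


# Hypothetical sample data: (join_key, payload) pairs.
small = [(1, "a"), (2, "b")]
big = [(1, "x"), (1, "y"), (3, "z")]

joined = map_side_join(small, big)
shuffled = reduce_side_map_output(small, big)

print(joined)
print(len(shuffled), "records emitted for", len(small) + len(big), "input rows")
```

The matching of rows with equal keys only happens later, inside the reducers, so it does not inflate the intermediate map output.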
Regards
Bejoy KS

________________________________
From: Ali Safdar Kureishy <safdar.kurei...@gmail.com>
To: user@hive.apache.org
Sent: Monday, May 7, 2012 1:08 PM
Subject: Re: Storage requirements for intermediate (map-side-output) data during Hive joins

Please ignore my question below. I made a mistake with my calculation. The map-side joins do not perform a cross-product of the data. They just emit the data using the join-key as the row key.

Thanks,
Safdar

On Mon, May 7, 2012 at 12:31 AM, Ali Safdar Kureishy <safdar.kurei...@gmail.com> wrote:

> Hi,
>
> I'm setting up a Hadoop cluster and would like to understand how much disk
> space I should expect to need with joins.
>
> Let's assume that I have 2 tables, each of about 500 GB. Since the tables are
> large, these will all be reduce-side joins. As far as I know about such joins,
> the data generated is a cross product of the size of the two tables. Am I
> wrong?
>
> In other words, for a reduce-side join in Hive involving 2 such tables, would
> I need to accommodate 500 GB * 500 GB = 250000 GB of intermediate (map-side
> output) data before the reducer(s) kick in in my cluster? Or am I missing
> something? That seems ridiculously high, so I hope I'm mistaken.
>
> But if the above IS accurate, what are the ways to reduce this consumption for
> the same kind of join in Hive?
>
> Thanks,
> Safdar