Hi Ali

     Sorry, my short response may have left you confused. Let us assume you are 
doing a LEFT OUTER JOIN on two tables 'A' and 'B' on a column 'id' (the tables 
are large, so only reduce-side joins are possible). From my understanding, this 
is how it happens (explanation based on a simple join):
- There would be two sets of mappers: one set to process table A and one set 
to process table B. Each mapper emits its output with A.id and B.id as the 
keys and the required columns from the two tables as the values.
- On the reducer, the values for common keys from both sets of mappers come 
together. If it is A LEFT OUTER JOIN B and there are n matching values in the 
value list for table B, then there will be n rows in the output (matching is 
based on the equality conditions provided in the ON clause).
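
For concreteness, a query of the shape described above might look like this 
(table and column names here are hypothetical, just for illustration):

    -- hypothetical example: A LEFT OUTER JOIN B on 'id'
    SELECT a.id, a.name, b.amount
    FROM A a
    LEFT OUTER JOIN B b
      ON (a.id = b.id);

Hive compiles such a query into a single MapReduce job: roughly, the mappers 
tag each record with the table it came from, and the shuffle brings all 
records sharing the same id onto one reducer.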

Now suppose table A has 100 columns and table B has 150 columns, but the final 
result needs only 25 of them. In that case the data is filtered in the map 
phase itself, so the map output carries a much smaller data volume (map output 
gets spilled to the local file system). The amount of data that goes to local 
disk therefore depends mostly on your join query rather than on the raw input 
size. Also, the intermediate output is in Hadoop's well-optimized serialized 
form (Writables), so it occupies less space than the actual input data in 
HDFS. Finally, the map output tmp dir on a node holds only the data processed 
by that node, so each node stores just a fraction of the total volume. That is 
one of the advantages of distributed processing.
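
As a rough back-of-the-envelope illustration (assuming roughly uniform column 
sizes, and taking the 500 GB figures from your earlier mail): with 250 columns 
across the two tables and only 25 of them selected, the map output would be on 
the order of (25/250) * 1000 GB = ~100 GB before serialization and 
compression, spread across all the nodes in the cluster, not a cross product 
of the table sizes.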

To add on, you can further shrink the intermediate output by enabling map 
output compression.
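
A minimal sketch of the relevant settings (property names are from the 
Hive/Hadoop versions of this era, so please verify them against your release):

    -- compress Hive's intermediate output between MR jobs
    SET hive.exec.compress.intermediate=true;
    -- compress the map output that gets spilled and shuffled
    SET mapred.compress.map.output=true;
    -- Snappy favors speed over compression ratio (assumes the native
    -- Snappy libraries are available on the cluster nodes)
    SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;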

Regards
Bejoy KS


________________________________
 From: Ali Safdar Kureishy <safdar.kurei...@gmail.com>
To: Bejoy Ks <bejoy...@yahoo.com> 
Cc: "user@hive.apache.org" <user@hive.apache.org> 
Sent: Tuesday, May 8, 2012 12:28 PM
Subject: Re: Storage requirements for intermediate (map-side-output) data 
during Hive joins
 

Hi Bejoy,

Thanks... I see. I was asking because I wanted to know how much total storage 
space I would need on the cluster for the given data in the tables.

Are you saying that for 2 tables of 500 GB each (spread across the cluster), 
there would be a need for intermediate storage of 250,000 GB? Or are you 
saying that it is the sum total of all the data processing that happens, but 
is not actually stored? I'm guessing you meant the latter, because the former 
seems unscalable.

Regards,
Safdar



On Mon, May 7, 2012 at 10:44 AM, Bejoy Ks <bejoy...@yahoo.com> wrote:

Hi Ali
>
>
>      The 500 * 500 GB of data is actually processed by multiple tasks across 
>multiple nodes. With default settings, each task processes a 64 MB split. 
>So you don't need 250,000 GB of temp space on a node at all. A few gigs 
>of free space is more than enough for any MR task.
>
>
>Regards
>
>Bejoy KS
>
>
>
>________________________________
> From: Ali Safdar Kureishy <safdar.kurei...@gmail.com>
>To: user@hive.apache.org 
>Sent: Monday, May 7, 2012 1:01 PM
>Subject: Storage requirements for intermediate (map-side-output) data during 
>Hive joins
> 
>
>
>Hi,
>
>
>I'm setting up a Hadoop cluster and would like to understand how much disk 
>space I should expect to need for joins.
>
>
>Let's assume that I have 2 tables, each of about 500 GB. Since the tables are 
>large, these will all be reduce-side joins. As far as I know about such 
>joins, the data generated is a cross product of the sizes of the two tables. 
>Am I wrong?
>
>
>In other words, for a reduce-side join in Hive involving 2 such tables, would 
>I need to accommodate 500 GB * 500 GB = 250,000 GB of 
>intermediate (map-side output) data before the reducer(s) kick in on my 
>cluster? Or am I missing something? That seems ridiculously high, so I hope 
>I'm mistaken.
>
>
>But if the above IS accurate, what are the ways to reduce this consumption for 
>the same kind of join in Hive?
>
>
>Thanks,
>Safdar
>
>
