Hi Safdar and Bejoy,

I decided to add my 2 cents to this dialogue :-) All points that Bejoy made are valid. However, I don't see why multiplication of data sizes is involved.

You have two tables A and B, 500 GB each. They occupy 1000 GB in HDFS (that's not entirely true, since your DFS replication factor is most likely not 1, but let's assume it is). A given record from table A would only go to one reducer, depending on its id value. Same thing for table B. Records from A and B with the same id value would go to the same reducer. Therefore, in the worst case, you would see an intermediate file of size 500 GB + 500 GB = 1000 GB on your reducer. Like Bejoy said, depending on the columns you choose and the map output compression, this size may vary.

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation
www: oanda.com
www: fxtrade.com
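To make the above concrete, here is a minimal sketch of the kind of query and settings under discussion (the table and column names are hypothetical, and the mapred.* property names assume a Hadoop 1.x-era cluster; Snappy is just one possible codec):

    -- Compress Hive's intermediate data and the map output before it is
    -- spilled to local disk and shuffled to the reducers
    SET hive.exec.compress.intermediate=true;
    SET mapred.compress.map.output=true;
    SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

    -- Each record of A and B is routed to a reducer by its id, so the
    -- reducers together see roughly |A| + |B| of intermediate data, not |A| * |B|
    SELECT a.id, b.val
    FROM A a
    JOIN B b ON (a.id = b.id);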
----- Original Message -----
From: "Bejoy Ks" <bejoy...@yahoo.com>
To: "safdar kureishy" <safdar.kurei...@gmail.com>
Cc: user@hive.apache.org
Sent: Tuesday, May 8, 2012 5:10:54 AM
Subject: Re: Storage requirements for intermediate (map-side-output) data during Hive joins

Hi Ali

Sorry, my short response may have got you confused. Let us assume you are doing a LEFT OUTER JOIN on two tables 'A' and 'B' on a column 'id' (the tables are large, so only reduce-side joins are possible). From my understanding, this is how it should happen (explanation based on a simple join):

- There would be two sets of mappers: one set to process table A and one set to process table B. The mappers emit their output with A.id and B.id as the keys and the required columns from the two tables as the values.
- On the reducer, the values for a common key from both sets of mappers come together. If it is A LEFT OUTER JOIN B, and there are n matching values in the value list for table B, then there will be n rows in the output (matching is based on the equality conditions provided in the ON clause).

Now say table A has 100 columns and table B has 150 columns, but in the final result you just need 25 columns. In that case the data is filtered in the map itself, and the map output has a smaller data volume (the map output gets spilled to the local file system). So the amount of data that goes to local disk depends more on your join query than on the input data. Also, the intermediate output is in a well-optimized, serialized form for Hadoop (Writables), so it occupies less space than the actual input data in HDFS.

The map output tmp dir on a node holds only the data processed by that node, so each node ends up with only a part of the whole data volume. That is one of the advantages of distributed processing. To add on, you can still optimize the size of the intermediate output by enabling map output compression.

Regards
Bejoy KS
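As an illustration of the column-pruning point above (the column names here are made up): only the columns referenced in the SELECT list and the ON clause are carried in the map output and shuffled, not the full 100- and 150-column rows.

    -- Hypothetical: table A has 100 columns and B has 150, but the result needs only a few.
    -- Only the referenced columns end up in the intermediate (map-side) output.
    SELECT a.id, a.name, a.created_at, b.amount, b.status
    FROM A a
    LEFT OUTER JOIN B b ON (a.id = b.id);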
From: Ali Safdar Kureishy <safdar.kurei...@gmail.com>
To: Bejoy Ks <bejoy...@yahoo.com>
Cc: "user@hive.apache.org" <user@hive.apache.org>
Sent: Tuesday, May 8, 2012 12:28 PM
Subject: Re: Storage requirements for intermediate (map-side-output) data during Hive joins

Hi Bejoy,

Thanks... I see. I was asking because I wanted to know how much total storage space I would need on the cluster for the given data in the tables. Are you saying that for 2 tables of 500 GB each (spread across the cluster), there would be a need for intermediate storage of 250000 GB? Or are you saying that it is the sum total of all the data processing that happens, but is not actually stored? I'm guessing you were referring to the latter, because the former seems unscalable.

Regards,
Safdar

On Mon, May 7, 2012 at 10:44 AM, Bejoy Ks <bejoy...@yahoo.com> wrote:

Hi Ali

The 500*500 GB of data is actually processed by multiple tasks across multiple nodes. With default settings, a task will process 64 MB of data. So you don't need 250000 GB of temp space on a node at all; a few GB of free space is more than enough for any MR task.

Regards
Bejoy KS

From: Ali Safdar Kureishy <safdar.kurei...@gmail.com>
To: user@hive.apache.org
Sent: Monday, May 7, 2012 1:01 PM
Subject: Storage requirements for intermediate (map-side-output) data during Hive joins

Hi,

I'm setting up a Hadoop cluster and would like to understand how much disk space I should expect to need for joins. Let's assume that I have 2 tables, each of about 500 GB. Since the tables are large, these will all be reduce-side joins. As far as I know about such joins, the data generated is a cross product of the sizes of the two tables. Am I wrong? In other words, for a reduce-side join in Hive involving 2 such tables, would I need to accommodate 500 GB * 500 GB = 250000 GB of intermediate (map-side output) data before the reducer(s) kick in on my cluster? Or am I missing something? That seems ridiculously high, so I hope I'm mistaken. But if the above IS accurate, what are the ways to reduce this consumption for the same kind of join in Hive?

Thanks,
Safdar