Hi Safdar and Bejoy,

I decided to add my 2 cents to this dialogue :-) All points that Bejoy made are valid. However, I don't see why multiplication of data sizes is involved.

You have two tables A and B, 500 GB each. They occupy 1000 GB in HDFS (that's not entirely true, since your DFS replication factor is most likely not 1, but let's assume it is). A given record from table A would only go to one reducer, depending on its id value. Same thing for table B. Records from A and B with the same id value would go to the same reducer. Therefore, in the worst case, you would see an intermediate file of size 500 GB + 500 GB = 1000 GB on your reducer. Like Bejoy said, depending on the columns you choose and the map output compression, this size may vary.

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation
www: oanda.com
www: fxtrade.com
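To make the above concrete, here is a minimal sketch of the kind of query and settings under discussion (the table and column names are hypothetical, and the mapred.* property names assume a Hadoop 1.x-era cluster; Snappy is just one possible codec):

    -- Compress Hive's intermediate data and the map output before it is
    -- spilled to local disk and shuffled to the reducers
    SET hive.exec.compress.intermediate=true;
    SET mapred.compress.map.output=true;
    SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

    -- Each record of A and B is routed to a reducer by its id, so the
    -- reducers together see roughly |A| + |B| of intermediate data, not |A| * |B|
    SELECT a.id, b.val
    FROM A a
    JOIN B b ON (a.id = b.id);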
----- Original Message -----
From: "Bejoy Ks" <bejoy...@yahoo.com>
To: "safdar kureishy" <safdar.kurei...@gmail.com>
Cc: user@hive.apache.org
Sent: Tuesday, May 8, 2012 5:10:54 AM
Subject: Re: Storage requirements for intermediate (map-side-output) data during Hive joins

Hi Ali

Sorry, my short response may have got you confused. Let us assume you are doing a LEFT OUTER JOIN on two tables 'A' and 'B' on a column 'id' (the tables are large, so only reduce-side joins are possible). From my understanding, this is how it should happen (explanation based on a simple join):

- There would be two sets of mappers: one set to process table A and one set to process table B. The mappers emit their output with A.id and B.id as the keys and the required columns from the two tables as the values.
- On the reducer, the values for a common key from both sets of mappers come together. If it is A LEFT OUTER JOIN B, and there are n matching values in the value list for table B, then there will be n rows in the output (matching is based on the equality conditions provided in the ON clause).

Now say table A has 100 columns and table B has 150 columns, but in the final result you just need 25 columns. In that case the data is filtered in the map itself, and the map output has a smaller data volume (the map output gets spilled to the local file system). So the amount of data that goes to local disk depends more on your join query than on the input data. Also, the intermediate output is in a well-optimized, serialized form for Hadoop (Writables), so it occupies less space than the actual input data in HDFS.

The map output tmp dir on a node holds only the data processed by that node, so each node ends up with only a part of the whole data volume. That is one of the advantages of distributed processing. To add on, you can still optimize the size of the intermediate output by enabling map output compression.

Regards
Bejoy KS
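As an illustration of the column-pruning point above (the column names here are made up): only the columns referenced in the SELECT list and the ON clause are carried in the map output and shuffled, not the full 100- and 150-column rows.

    -- Hypothetical: table A has 100 columns and B has 150, but the result needs only a few.
    -- Only the referenced columns end up in the intermediate (map-side) output.
    SELECT a.id, a.name, a.created_at, b.amount, b.status
    FROM A a
    LEFT OUTER JOIN B b ON (a.id = b.id);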
From: Ali Safdar Kureishy <safdar.kurei...@gmail.com>
To: Bejoy Ks <bejoy...@yahoo.com>
Cc: "user@hive.apache.org" <user@hive.apache.org>
Sent: Tuesday, May 8, 2012 12:28 PM
Subject: Re: Storage requirements for intermediate (map-side-output) data during Hive joins

Hi Bejoy,

Thanks... I see. I was asking because I wanted to know how much total storage space I would need on the cluster for the given data in the tables. Are you saying that for 2 tables of 500 GB each (spread across the cluster), there would be a need for intermediate storage of 250000 GB? Or are you saying that it is the sum total of all the data processing that happens, but is not actually stored? I'm guessing you were referring to the latter, because the former seems unscalable.

Regards,
Safdar

On Mon, May 7, 2012 at 10:44 AM, Bejoy Ks <bejoy...@yahoo.com> wrote:

Hi Ali

The 500*500 GB of data is actually processed by multiple tasks across multiple nodes. With default settings, a task will process 64 MB of data. So you don't need 250000 GB of temp space on a node at all; a few GB of free space is more than enough for any MR task.

Regards
Bejoy KS

From: Ali Safdar Kureishy <safdar.kurei...@gmail.com>
To: user@hive.apache.org
Sent: Monday, May 7, 2012 1:01 PM
Subject: Storage requirements for intermediate (map-side-output) data during Hive joins

Hi,

I'm setting up a Hadoop cluster and would like to understand how much disk space I should expect to need for joins. Let's assume that I have 2 tables, each of about 500 GB. Since the tables are large, these will all be reduce-side joins. As far as I know about such joins, the data generated is a cross product of the sizes of the two tables. Am I wrong? In other words, for a reduce-side join in Hive involving 2 such tables, would I need to accommodate 500 GB * 500 GB = 250000 GB of intermediate (map-side output) data before the reducer(s) kick in on my cluster? Or am I missing something? That seems ridiculously high, so I hope I'm mistaken. But if the above IS accurate, what are the ways to reduce this consumption for the same kind of join in Hive?

Thanks,
Safdar