[Pig Wiki] Update of "PigSkewedJoinSpec" by yinghe

Apache Wiki Mon, 14 Dec 2009 16:52:45 -0800

The "PigSkewedJoinSpec" page has been changed by yinghe.
http://wiki.apache.org/pig/PigSkewedJoinSpec?action=diff&rev1=15&rev2=16

--------------------------------------------------

         Number of Tuples from First Table (tupleCount) = (sampleCount / totalSampleCount) * (inputFileSize / avgDiskUsage)
         Number of Reducers = (int) Math.round(Math.ceil((double) tupleCount / tupleMCount));
  }}}
+ 
+ For example, assume:
+  * total number of samples = 200
+  * number of samples with key k1 = 30
+  * size of input file = 1G
+  * totalMemory = 150M
+  * avgMemUsage for tuples of k1 = 150 bytes
+  * avgDiskUsage for tuples of k1 = 100 bytes
+ 
+ Then:
+  * estimated number of k1 tuples that fit in memory = 150M / 150 = 1M
+  * estimated total number of tuples in the input file = 1G / 100 = 10M
+  * estimated number of k1 tuples in the input file (tupleCount) = (30 / 200) * 10M = 1.5M
+  * estimated number of reducers for k1 = Math.ceil(1.5M / 1M) = 2
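
The arithmetic above can be reproduced with a short Java sketch. This is an illustration only, not the code of the actual UDF: the class and method names are hypothetical, tupleMCount is assumed to be totalMemory / avgMemUsage (as the example implies), and 1G and 150M are assumed to be decimal units.

```java
class SkewedKeyReducerEstimate {
    /**
     * Hypothetical helper: estimates the number of reducers for one key,
     * following the two formulas quoted above.
     */
    static int estimateReducers(long sampleCount, long totalSampleCount,
                                long inputFileSize, long totalMemory,
                                long avgMemUsage, long avgDiskUsage) {
        // Assumption: tupleMCount is the number of tuples of this key
        // that fit in memory, i.e. totalMemory / avgMemUsage.
        double tupleMCount = (double) totalMemory / avgMemUsage;

        // Estimated number of tuples of this key in the input file.
        double tupleCount = ((double) sampleCount / totalSampleCount)
                * ((double) inputFileSize / avgDiskUsage);

        return (int) Math.round(Math.ceil(tupleCount / tupleMCount));
    }

    public static void main(String[] args) {
        // Values from the example; decimal units assumed (1G = 10^9 bytes).
        int reducers = estimateReducers(30, 200,
                1_000_000_000L, 150_000_000L, 150, 100);
        System.out.println("reducers for k1 = " + reducers); // prints 2
    }
}
```

With the same inputs but only 10 of the 200 samples carrying the key, the estimate drops to 1 reducer, so that key would not be treated as skewed.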
+ 
+ This calculation is done for every key in the sample. If a key requires more
than one reducer, it is regarded as a skewed key and is pre-allocated multiple
reducers. Reducers are allocated to skewed keys in round-robin fashion.
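
One possible reading of the round-robin allocation is sketched below. The page does not spell out the exact scheme, so everything here is an assumption made for illustration: the class and method names are hypothetical, each skewed key is given a contiguous run of reducer indexes, and the run wraps around once the last reducer is reached.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RoundRobinAllocation {
    /**
     * Hypothetical helper: gives each skewed key a run of reducer indexes
     * {first, last}, starting where the previous key stopped and wrapping
     * around modulo totalReducers. Keys needing one reducer are skipped.
     */
    static Map<String, int[]> allocate(Map<String, Integer> reducersPerKey,
                                       int totalReducers) {
        Map<String, int[]> ranges = new LinkedHashMap<>();
        int next = 0;
        for (Map.Entry<String, Integer> e : reducersPerKey.entrySet()) {
            // Assumption: a key never gets more reducers than exist.
            int needed = Math.min(e.getValue(), totalReducers);
            if (needed <= 1) {
                continue; // not a skewed key
            }
            int first = next;
            int last = (next + needed - 1) % totalReducers;
            ranges.put(e.getKey(), new int[] {first, last});
            next = (last + 1) % totalReducers;
        }
        return ranges;
    }

    public static void main(String[] args) {
        Map<String, Integer> needed = new LinkedHashMap<>();
        needed.put("k1", 2);
        needed.put("k2", 1);
        needed.put("k3", 3);
        // With 4 reducers: k1 gets 0..1, k3 gets 2, 3 and wraps to 0.
        for (Map.Entry<String, int[]> e : allocate(needed, 4).entrySet()) {
            System.out.println(e.getKey() + " -> "
                    + e.getValue()[0] + ".." + e.getValue()[1]);
        }
    }
}
```

In this sketch a range whose last index is smaller than its first index means the run wrapped around, so k3 above covers reducers 2, 3, and 0.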
+ 
  This UDF generates output that is used by the subsequent join job. The
output file is formatted as a map with two keys:
  
   * totalreducers: the total number of reducers for the second job
