The "PigSkewedJoinSpec" page has been changed by yinghe.
http://wiki.apache.org/pig/PigSkewedJoinSpec?action=diff&rev1=15&rev2=16

--------------------------------------------------

  Number of Tuples from First Table (tupleCount) = (sampleCount / totalSampleCount) * (inputFileSize / avgDiskUsage)
  Number of Reducers = (int) Math.round(Math.ceil((double) tupleCount / tupleMCount));
  }}}
+
+ For example, if we assume
+  * total number of samples = 200
+  * total number of samples with key k1 = 30
+  * size of input file = 1G
+  * totalMemory = 150M
+  * avgMemUsage for tuples of k1 = 150 bytes
+  * avgDiskUsage for tuples of k1 = 100 bytes
+
+ then,
+  * estimated number of k1 tuples that can fit in memory = 150M/150 = 1M
+  * estimated total number of tuples from input file = 1G/100 = 10M tuples
+  * estimated number of tuples for k1 from input file = (30/200) * 10M = 1.5M
+  * estimated total number of reducers for k1 = Math.ceil(1.5M/1M) = 2
+
+ This calculation is done for every key in the sample. If a key requires more than 1 reducer, it is regarded as a skewed key and is pre-allocated multiple reducers. The reducers are allocated to skewed keys in round-robin fashion.
+
  This UDF generates an output which will be used by the following join job. The format of the output file is a map. It has two keys:
   * totalreducers: the total number of reducers for the second job