Hi,

Have you considered using the in-mapper combining pattern? That is, inside your Mapper object you create a Map object that holds the intermediate key-value pairs, and whose state is preserved across multiple calls of the map method. The values are emitted only periodically, when a certain threshold is reached (threshold = ratio between block size and memory consumed). You can use a counter to track how many key-value pairs have been processed.

This substantially avoids the problem you describe ("reducer to be the bottleneck when there are large volume of intermediate output"), because the pairs are already aggregated in memory and flushed at a fixed buffer size, so far fewer intermediate key-value pairs reach the reducer.
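To make the idea concrete, here is a minimal, Hadoop-free sketch of the pattern for a word count. It is only an illustration under stated assumptions: the class name `InMapperCombiner`, the `flushThreshold` parameter (a cap on distinct keys held in memory, chosen here instead of the block-size ratio above), and the `emitted` list are all hypothetical. In a real Hadoop job the buffer would live in your Mapper subclass, the emit would be `context.write(...)`, and the final flush would go in `cleanup()`.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of in-mapper combining for a word count, without Hadoop dependencies.
class InMapperCombiner {
    // Partial counts, preserved across calls to map().
    private final Map<String, Integer> buffer = new HashMap<>();
    // Assumption: flush once this many distinct keys are buffered.
    private final int flushThreshold;
    // Stand-in for context.write(): everything the mapper would emit.
    final List<Map.Entry<String, Integer>> emitted = new ArrayList<>();

    InMapperCombiner(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    // Called once per input record, like Mapper.map().
    void map(String line) {
        for (String word : line.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            // Aggregate in memory instead of emitting (word, 1) right away.
            buffer.merge(word, 1, Integer::sum);
        }
        if (buffer.size() >= flushThreshold) {
            flush();
        }
    }

    // Called once after the last record, like Mapper.cleanup().
    void cleanup() {
        flush();
    }

    // Emit every buffered (word, partialCount) pair, then free the memory.
    private void flush() {
        for (Map.Entry<String, Integer> e : buffer.entrySet()) {
            emitted.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue()));
        }
        buffer.clear();
    }

    public static void main(String[] args) {
        InMapperCombiner mapper = new InMapperCombiner(1000);
        mapper.map("a b a");
        mapper.map("b a");
        mapper.cleanup();
        // Two aggregated pairs are emitted instead of five (word, 1) pairs.
        System.out.println(mapper.emitted.size());
    }
}
```

Because a key can be flushed more than once, the reducer still has to sum the partial counts it receives; the gain is only that it receives far fewer of them.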
Thanks
Sambit Tripathy

On Thu, Sep 20, 2012 at 6:42 PM, Jason Yang <[email protected]> wrote:

> Hi, all
>
> I have a question that whether all the intermediate output with the same
> key go to the same reducer or not?
>
> If it is, in case of only two keys are generated from mapper, but there
> are 3 reducer running in this job, what would happen?
>
> If not, how could I do some processing over the all data, like counting? I
> think some would suggest to set the number of reducer to 1, but I thought
> this would make the reducer to be the bottleneck when there are large
> volume of intermediate output, isn't it?
>
> --
> YANG, Lin
