When using Hadoop Pipes from C++ with the Hadoop 2.3.0 distribution, I found that if a Combiner is specified, the Combiner's close() method is called before all of the Combiner's reduce() calls have been made. This call pattern differs from the normal Reducer call pattern (init()...reduce()*...close()).
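For illustration, here is a minimal combiner of the sort I was testing with. The class name and logging are mine, but the interface is the HadoopPipes::Reducer declared in Pipes.hh; it would be registered via the combiner slot of HadoopPipes::TemplateFactory. With 2.3.0, the "combiner: close" line shows up in the task log before the final "combiner: reduce(...)" lines:

#include <iostream>
#include "hadoop/Pipes.hh"

// Illustrative combiner that logs its lifecycle. Under 2.3.0 the task log
// shows "combiner: close" before the last "combiner: reduce(...)" lines.
class LoggingCombiner : public HadoopPipes::Reducer {
public:
  LoggingCombiner(HadoopPipes::TaskContext& /*context*/) {
    std::cerr << "combiner: created" << std::endl;
  }
  virtual void reduce(HadoopPipes::ReduceContext& context) {
    std::cerr << "combiner: reduce(" << context.getInputKey() << ")" << std::endl;
    // Pass values through unchanged; the point here is only the call order.
    while (context.nextValue()) {
      context.emit(context.getInputKey(), context.getInputValue());
    }
  }
  virtual void close() {
    std::cerr << "combiner: close" << std::endl;
  }
};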
Shouldn't the Combiner call sequence be the same as the Reducer call sequence?

After reviewing HadoopPipes.cc, the change in the call pattern appears to come from two things: the Combiner instance is wrapped by the CombineRunner record writer, and TaskContextImpl::closeAll() closes the writer after the reducer (the slot in which the Combiner is held on the map side). As a result, the Combiner's close() is called before CombineRunner::close() has invoked spillAll(), i.e. before reduce() has been called on all of its collected key/values.

I believe the fix would be to delegate ownership of the Combiner to the CombineRunner instance. CombineRunner::close() could then ensure the Combiner has combined all data by calling the Combiner's close() method after spillAll(). And, to complete the Combiner cleanup, the CombineRunner destructor would need to delete the combiner instance. A self-contained sketch of that change is below.
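To make the intended ordering concrete, here is a toy model of the proposed ownership change. The class and method names mirror HadoopPipes.cc, but the types are simplified stand-ins, not the real Pipes code:

#include <iostream>
#include <map>
#include <string>
#include <vector>

class Closable {
public:
  virtual void close() {}
  virtual ~Closable() {}
};

class Reducer : public Closable {
public:
  virtual void reduce(const std::string& key,
                      const std::vector<std::string>& values) = 0;
};

// The runner owns the combiner, so close() can flush all buffered data
// through reduce() *before* the combiner's own close() runs.
class CombineRunner : public Closable {
private:
  Reducer* combiner;  // ownership delegated here from TaskContextImpl
  std::map<std::string, std::vector<std::string> > data;

  void spillAll() {  // feed every buffered key/value group to reduce()
    std::map<std::string, std::vector<std::string> >::iterator it;
    for (it = data.begin(); it != data.end(); ++it) {
      combiner->reduce(it->first, it->second);
    }
    data.clear();
  }

public:
  explicit CombineRunner(Reducer* c) : combiner(c) {}

  void emit(const std::string& key, const std::string& value) {
    data[key].push_back(value);
  }

  virtual void close() {
    spillAll();         // all remaining reduce() calls happen first...
    combiner->close();  // ...and only then is the combiner closed
  }

  virtual ~CombineRunner() { delete combiner; }  // complete the cleanup
};

// Tiny combiner that prints its call sequence.
class PrintCombiner : public Reducer {
public:
  virtual void reduce(const std::string& key,
                      const std::vector<std::string>& values) {
    std::cout << "reduce(" << key << ") with " << values.size()
              << " values" << std::endl;
  }
  virtual void close() { std::cout << "close()" << std::endl; }
};

int main() {
  CombineRunner runner(new PrintCombiner());
  runner.emit("a", "1");
  runner.emit("a", "2");
  runner.emit("b", "3");
  runner.close();  // prints reduce(a), reduce(b), then close()
  return 0;
}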
Should this be submitted as a bug?

Thanks,
Joe