While using Hadoop Pipes from C++ with the Hadoop 2.3.0 distribution, I found 
that if a Combiner is specified, the Combiner's close() method is called before 
all of its reduce() calls have been made.  This call pattern differs from the 
normal Reducer call pattern (init()...reduce()*...close()).

Shouldn't the Combiner call sequence be the same as the Reducer call sequence?

After reviewing HadoopPipes.cc, the change in the call pattern appears to be 
caused by two things: the Combiner instance is wrapped by the CombineRunner 
writer, and TaskContextImpl::closeAll() closes the writer after the reducer.  
This means the Combiner's close() is called before CombineRunner::spillAll(), 
invoked from CombineRunner::close(), has had a chance to call reduce() on all 
of its collected key/value pairs.

I believe the fix would be to delegate ownership of the Combiner to the 
CombineRunner instance.  CombineRunner could then guarantee that all data has 
been combined by calling the Combiner's close() method from 
CombineRunner::close(), after spillAll() has run.  To complete the Combiner 
cleanup, the CombineRunner destructor would delete the combiner instance.

Should this be submitted as a bug?

Thanks,
Joe
