Hi,
I am currently playing around with Hadoop and have run into a problem
when trying to filter in the Reducer.
I extended the WordCount v1.0 example from the 2.7 MapReduce Tutorial with
some additional functionality: the option to filter on the aggregated
value of each key, i.e. only output the key-value pairs where
value > threshold.
Filtering Code in Reducer
#####################################
for (IntWritable val : values) {
    sum += val.get();
}
if (sum > threshold) {
    result.set(sum);
    context.write(key, result);
}
#####################################
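For context, the whole reducer is just the tutorial's IntSumReducer with
the threshold check added; it looks roughly like this (how I set the
threshold is not important here, so I simply hard-code it in this sketch):
#####################################
// imports at the top of WordCount.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// nested inside the WordCount class, as in the tutorial
public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();
  private int threshold = 1;   // hard-coded here only for illustration

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    // sum up all counts observed for this key
    for (IntWritable val : values) {
      sum += val.get();
    }
    // only emit the pair if its total count exceeds the threshold
    if (sum > threshold) {
      result.set(sum);
      context.write(key, result);
    }
  }
}
#####################################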
For a threshold smaller than any value the above code works as expected
and the output contains all key-value pairs. If I increase the threshold
to 1, some pairs are missing from the output even though their values are
larger than the threshold.
I tried to track down the error myself, but I could not get it to work as
intended. I use the exact Tutorial setup with Oracle JDK 8 on a CentOS 7
machine.
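For completeness, the rest of the job setup is unchanged from the
tutorial; the driver should look roughly like this:
#####################################
// main() of the WordCount class, as in the tutorial
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
#####################################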
As far as I understand, the Iterable<IntWritable> passed to the Reducer
already contains all the observed values for a specific key - how can some
of these key-value pairs be missing then? It only fails in a few cases.
The input file is fairly large - 250 MB - so I also tried increasing the
memory for the map and reduce tasks, but that did not help (I tried a lot
of different things without success).
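The memory changes I tried were roughly along these lines (property names
and values written down from memory, so not exact):
#####################################
// set in the driver before creating the Job
Configuration conf = new Configuration();
conf.setInt("mapreduce.map.memory.mb", 2048);        // map container size
conf.setInt("mapreduce.reduce.memory.mb", 4096);     // reduce container size
conf.set("mapreduce.map.java.opts", "-Xmx1638m");    // map task JVM heap
conf.set("mapreduce.reduce.java.opts", "-Xmx3276m"); // reduce task JVM heap
#####################################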
Maybe someone here has already run into similar problems or is more
experienced with this than I am.
Thank you,
Peter