Hi,
I am currently playing around with Hadoop and have run into a problem
when trying to filter in the Reducer.
I extended the WordCount v1.0 example from the 2.7 MapReduce Tutorial with
some additional functionality: the option to filter on the aggregated
value of each key, i.e. only output the key-value pairs where
value > threshold.
Filtering Code in Reducer
#####################################
for (IntWritable val : values) {
    sum += val.get();
}
if (sum > threshold) {
    result.set(sum);
    context.write(key, result);
}
#####################################
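For context, the whole reducer is just the tutorial's IntSumReducer with
the threshold check added; it looks roughly like this (how I set the
threshold is not important here, so I simply hard-code it in this sketch):
#####################################
// imports at the top of WordCount.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// nested inside the WordCount class, as in the tutorial
public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();
  private int threshold = 1;   // hard-coded here only for illustration

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    // sum up all counts observed for this key
    for (IntWritable val : values) {
      sum += val.get();
    }
    // only emit the pair if its total count exceeds the threshold
    if (sum > threshold) {
      result.set(sum);
      context.write(key, result);
    }
  }
}
#####################################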
For a threshold smaller than any value the above code works as expected
and the output contains all key-value pairs. If I increase the threshold
to 1, some pairs are missing from the output even though their values are
larger than the threshold.
I tried to track down the error myself, but I could not get it to work as
intended. I use the exact Tutorial setup with Oracle JDK 8 on a CentOS 7
machine.
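For completeness, the rest of the job setup is unchanged from the
tutorial; the driver should look roughly like this:
#####################################
// main() of the WordCount class, as in the tutorial
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
#####################################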
As far as I understand, the Iterable<IntWritable> passed to the Reducer
already contains all the observed values for a specific key - how can some
of these key-value pairs be missing then? It only fails in a few cases.
The input file is fairly large - 250 MB - so I also tried increasing the
memory for the map and reduce tasks, but that did not help (I tried a lot
of different things without success).
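The memory changes I tried were roughly along these lines (property names
and values written down from memory, so not exact):
#####################################
// set in the driver before creating the Job
Configuration conf = new Configuration();
conf.setInt("mapreduce.map.memory.mb", 2048);        // map container size
conf.setInt("mapreduce.reduce.memory.mb", 4096);     // reduce container size
conf.set("mapreduce.map.java.opts", "-Xmx1638m");    // map task JVM heap
conf.set("mapreduce.reduce.java.opts", "-Xmx3276m"); // reduce task JVM heap
#####################################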
Maybe someone here has already run into similar problems or is more
experienced with this than I am.
Thank you,
Peter