When we execute context.write(null, null), the current writer (which has a store file open) is closed, and on the next write request a new writer is created for a new store file. So if a row key has puts whose total size exceeds the threshold, they are written out in batches across multiple store files, meaning the same row key's data can be distributed over several store files. The outer while loop in the reducer then continues from the point at which we flushed and rolled, so no data is omitted. A simplified sketch of that control flow follows below.
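To make the control flow concrete, here is a simplified sketch modeled on PutSortReducer (trimmed for illustration, not the verbatim HBase source; the "putsortreducer.row.threshold" property name, the default value shown, and the getFamilyMap()/KeyValue types follow the 0.94-era client API and should be treated as approximate):

import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Simplified sketch of the PutSortReducer control flow described above.
public class PutSortReducerSketch
    extends Reducer<ImmutableBytesWritable, Put, ImmutableBytesWritable, KeyValue> {

  @Override
  protected void reduce(ImmutableBytesWritable row, Iterable<Put> puts, Context context)
      throws IOException, InterruptedException {
    // RAM budget for one in-memory batch of KeyValues (default here is illustrative).
    long threshold =
        context.getConfiguration().getLong("putsortreducer.row.threshold", 1L << 30);
    Iterator<Put> iter = puts.iterator();

    // Outer loop: one pass per batch. The iterator keeps its position across
    // passes, so nothing is skipped when a batch is cut off at the threshold.
    while (iter.hasNext()) {
      TreeSet<KeyValue> sorted = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
      long curSize = 0;

      // Inner loop: accumulate KeyValues until the row is exhausted or the
      // batch reaches the memory threshold.
      while (iter.hasNext() && curSize < threshold) {
        Put p = iter.next();
        for (List<KeyValue> kvs : p.getFamilyMap().values()) {
          for (KeyValue kv : kvs) {
            sorted.add(kv);
            curSize += kv.getLength();
          }
        }
      }

      // Emit the sorted batch.
      for (KeyValue kv : sorted) {
        context.write(row, kv);
      }

      // More puts remain for this row: signal the output format to roll the
      // writer, since sorted order across batches cannot be guaranteed within
      // a single store file.
      if (iter.hasNext()) {
        context.write(null, null); // HFileOutputFormat closes the current
                                   // writer; the next non-null write opens
                                   // a new store file.
      }
    }
  }
}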
________________________________________
From: Amit Sela [[email protected]]
Sent: Wednesday, November 06, 2013 3:54 PM
To: [email protected]
Subject: PutSortReducer memory threshold

Looking at the code of PutSortReducer, I see that if my key has puts whose size is bigger than the memory threshold, the iteration stops and all puts up to the threshold point are written to the context. If the iterator has more puts, context.write(null,null) is executed. Does this tell the bulk load tool to re-execute the reduce from that point in some way (and if so, how?), or is the rest of the data just omitted?

Thanks,
Amit.
