Hi, Just a question about the implementation of Map/Reduce. I've been thinking about the output of the map stage. Logically all of the records emitted by the mapper have to be partitioned and sorted before they go into the reducers. (We can ignore the partitioning for the moment and so I'm just interested in the sorting.)
Now it seems to me that the obvious way to do this would be to have some sort of sorted structure (balanced binary tree for example) so that the (K, V) pairs emitted would be held in sorted order. But when I read White's TDG (3ed p209) it's pretty explicit that the data emitted is just held in a circular buffer in memory and that the sort occurs in-memory as part of the dump to disc. Examining the code (version 1.0.4) MapTask.java proves this to be the case. So the first question is why is it done this way? As far as I can see buffering the objects and doing one sort at the end is going to be computational complexity of order NlgN but maintaining a sorted in-memory structure is going to be of the same order so, at least asymptotically, there isn't going to be much difference between the two approaches. I can see that by serializing the objects into the circular buffer immediately you can minimize the number of key and value objects that need to be instantiated. (Particularly if the mapper re-uses its objects and custom comparators are used then barely any objects need to be constructed or destroyed.) In this way I guess that the load on the heap can be kept minimal. Is this the reason that it is done this way? Is there some other reason, perhaps a property of Java (I'm no expert) which makes one way preferable to the other? It's just that if the emitted data were held sorted then the opportunity to run the "combiner" much earlier seems to be much easier. For example if were running wordCount (or anything that needs the SumReducer) then we could just increment the counts held in-memory and we'd never have to emit duplicates. (In fact I'm tempted to hold a HashSet of references to the (K,V) pairs held serialized in the buffer and then incrementing the counts in memory so that I don't need to emit duplicates. But this seems perverse and far too implementation dependent.) Any thoughts? Z
