Thanks for all the help, guys.

> When you do a shuffle from N map partitions to M reduce partitions, there
> are N * M output blocks created and each one is tracked.
Okay, that makes sense. I have a few questions then... If there are N * M
output blocks, does that mean each machine will (generally) be responsible
for (N * M) / (number of machines) blocks, and so the BlockManager data
structures would hold proportionally less data as machines are added?
(Turns out 7,000 * 7,000 / 5 = 9.8 million, which is in the ballpark of the
estimated 5 million or so entries that were in the BlockManager heap dump;
back-of-envelope math in the P.S. below.)

> Since you only have a few machines, you don't "need" the
> extra partitions to add more parallelism to the reduce.

True, I am perhaps overly cautious about favoring more partitions... It
seems like previously, when Spark shuffled a partition in ShuffleMapTask,
it buffered all the data in memory. So, even if you only had 5 machines, it
was important to have lots of tiny slices of data, rather than a few big
ones, to avoid OOMEs.

...but (I had forgotten this) I believe that's no longer the case, and
Spark/ShuffleMapTask can now fully stream partitions? If so, that seems
like a big deal: one of my first patches, which made the default
partitioner prefer the max number of partitions, no longer makes as much
sense, and spark.default.parallelism just became a whole lot more useful.
As in, it should always be set. (We're not setting it, currently; see the
P.P.S. for how I plan to.)

I'll give this another go later tonight/tomorrow with fewer partitions and
see what happens.

Thanks again!

- Stephen
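P.S. For anyone following along, here's the back-of-envelope math above as
a quick Scala snippet. The 7,000 / 7,000 / 5 numbers are from our job;
everything else is just arithmetic, not anything Spark reports directly:

    // Rough estimate of shuffle blocks tracked per node: N * M blocks
    // total, spread (roughly evenly) across the cluster.
    val mapPartitions    = 7000L // N
    val reducePartitions = 7000L // M
    val machines         = 5
    val totalBlocks   = mapPartitions * reducePartitions // 49,000,000
    val blocksPerNode = totalBlocks / machines           // ~9,800,000
    println("~" + blocksPerNode + " shuffle blocks tracked per node")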

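P.P.S. In case it helps anyone else, this is roughly how I'm planning to
set spark.default.parallelism. The property name is real; the master URL,
app name, and the "80" (a small multiple of total cores, assuming 8 cores
per machine, which is a guess) are just placeholders:

    // Set before constructing the SparkContext so the default partitioner
    // picks it up. Package name depends on Spark version ("spark" in the
    // old 0.x releases, "org.apache.spark" later).
    import spark.SparkContext

    System.setProperty("spark.default.parallelism", "80") // ~2x total cores
    val sc = new SparkContext("spark://master:7077", "shuffle-test")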