Hi Patrick,

> Just wondering, how many reducers are you using in this shuffle? By
> 7,000 partitions, I'm assuming you mean the map side of the shuffle.
> What about the reduce side?
7,000 on that side as well. We're loading about a month's worth of data into one RDD with ~7,000 partitions and cogrouping it with another RDD that has 50 partitions; the resulting RDD also has 7,000 partitions. (Since we don't have spark.default.parallelism set, the defaultPartitioner logic chooses the max of [50, 7,000] as the next partition count.) I believe that's what you're asking by number of reducers: the number of partitions in the post-cogroup ShuffledRDD?

Also, AFAICT, I don't believe we ever get to the reduce/ShuffledRDD side of this cogroup--it bogs down after 2,000-3,000 ShuffleMapTasks on the map side.

Thanks,
Stephen
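For anyone following along, here's a rough sketch of the partition-count rule described above. This is not Spark's actual code, just an illustration of the behavior when spark.default.parallelism is unset: the cogroup result inherits the largest partition count among its parents, so the 7,000-partition RDD wins over the 50-partition one.

```python
# Hypothetical illustration of Spark's defaultPartitioner behavior when
# spark.default.parallelism is NOT set: the output partition count of a
# cogroup is the max partition count among the parent RDDs.

def cogroup_num_partitions(*parent_partition_counts):
    """Return the partition count the cogrouped RDD would end up with."""
    return max(parent_partition_counts)

# A month's worth of data (~7,000 partitions) cogrouped with a 50-partition RDD:
print(cogroup_num_partitions(7000, 50))  # 7000
```

So setting spark.default.parallelism (or passing an explicit Partitioner to cogroup) is the lever if a smaller reduce side is wanted.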
