The difference between a shuffle dependency and a transformation that *can* cause a shuffle is probably worth pointing out. The transformations mentioned (groupByKey, join, etc.) *might* generate a shuffle dependency on their input RDDs, but they won't necessarily. For example, if you join() two RDDs that already use the same partitioner (e.g., a default HashPartitioner with the default parallelism), then no shuffle needs to be performed (and nothing should hit disk). Any records that need to be considered together will already be in the same partitions of the input RDDs: all records with key X are guaranteed to be in partition hash(X) of both inputs, so no shuffling is needed.

Sometimes this is *really* worth exploiting, even when it only applies to one of the input RDDs. For example, if you're joining two RDDs and one is much larger than the other and already partitioned, you can explicitly use the partitioner from the larger RDD so that only the smaller RDD gets shuffled.

This also means you probably want to pay attention to transformations that remove partitioners. For example, prefer mapValues() to map(): mapValues() has to maintain the same key, so its output is guaranteed to still be partitioned, while map() can change the keys, so partitioning is lost even if you actually keep the same key. There's a short sketch of all three points below.
-Ewen