Thanks Patrick and Ewen,

Great answers.

So a transformation that causes a shuffle will write the shuffle data to
disk, but it often stays in the OS buffer cache, i.e. effectively in memory.
Is my understanding correct?

Regards,
SB


On Fri, Jan 17, 2014 at 2:38 AM, Ewen Cheslack-Postava <[email protected]> wrote:

> The difference between a shuffle dependency and a transformation that can
> cause a shuffle is probably worth pointing out.
>
> The mentioned transformations (groupByKey, join, etc) *might* generate a
> shuffle dependency on input RDDs, but they won't necessarily. For example,
> if you join() two RDDs that already use the same partitioner (e.g. a
> default HashPartitioner with the default parallelism), then no shuffle
> needs to be performed (and nothing should hit disk). Any records that need
> to be considered together will already be in the same partitions of the
> input RDDs (e.g. all records with key X are guaranteed to be in partition
> hash(X) of both input RDDs, so no shuffling is needed).
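To make the co-partitioning point concrete, here is a minimal sketch in plain Scala (illustrative only, no Spark; `part` stands in for what a `HashPartitioner` computes, and all names here are invented for the example): joining partition i of one dataset with partition i of the other reproduces the full join, because equal keys always hash to the same partition index.

```scala
// Plain-Scala sketch of a co-partitioned join (a model, not Spark's API).
object CoPartitionedJoin {
  val numPartitions = 4

  // Stand-in for HashPartitioner: non-negative hash mod numPartitions.
  def part(key: String): Int = Math.floorMod(key.hashCode, numPartitions)

  // Bucket records by partition index, as partitionBy would.
  def partitioned(data: Seq[(String, Int)]): Map[Int, Seq[(String, Int)]] =
    data.groupBy { case (k, _) => part(k) }.withDefaultValue(Seq.empty)

  // Join each partition index locally; equal keys always share an index,
  // so no record ever needs to move to another partition.
  def localJoin(left: Seq[(String, Int)],
                right: Seq[(String, Int)]): Set[(String, (Int, Int))] = {
    val (lp, rp) = (partitioned(left), partitioned(right))
    (0 until numPartitions).flatMap { i =>
      for ((k, v) <- lp(i); (k2, w) <- rp(i) if k == k2) yield (k, (v, w))
    }.toSet
  }

  // Reference result: a global join over all records, ignoring partitions.
  def globalJoin(left: Seq[(String, Int)],
                 right: Seq[(String, Int)]): Set[(String, (Int, Int))] =
    (for ((k, v) <- left; (k2, w) <- right if k == k2) yield (k, (v, w))).toSet

  def main(args: Array[String]): Unit = {
    val left  = Seq("a" -> 1, "b" -> 2, "c" -> 3)
    val right = Seq("a" -> 10, "b" -> 20, "d" -> 40)
    println(localJoin(left, right) == globalJoin(left, right))  // true
  }
}
```

The per-partition join produces exactly the global join result, which is why Spark can skip the shuffle entirely when both inputs already share a partitioner.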
>
> Sometimes this is *really* worth exploiting, even if it only applies
> to one of the input RDDs. For example, if you're joining 2 RDDs and one is
> much larger than the other and already partitioned, you can explicitly use
> the partitioner from the larger RDD so that only the smaller RDD gets
> shuffled.
>
> This also means you probably want to pay attention to transformations that
> remove partitioners. For example, prefer mapValues() to map(). mapValues()
> has to maintain the same key, so the output is guaranteed to still be
> partitioned. map() can change the keys, so Spark drops the partitioner even
> if your function happens to keep the same key.
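One way to see why (a toy model in plain Scala, not Spark's actual classes; `MiniRDD` and its fields are invented for illustration): the partitioner is metadata attached to the RDD, and map() must discard it because Spark cannot look inside your function to know whether it preserves keys.

```scala
// Toy model (not Spark's API): an RDD as records plus optional partitioner metadata.
final case class MiniRDD[K, V](records: Seq[(K, V)], partitioner: Option[Int => Int]) {
  // Keys are untouched, so the existing partitioning still holds: keep the metadata.
  def mapValues[W](f: V => W): MiniRDD[K, W] =
    MiniRDD(records.map { case (k, v) => (k, f(v)) }, partitioner)

  // f may emit any key, and we cannot see inside it, so the metadata must be dropped.
  def map[K2, W](f: ((K, V)) => (K2, W)): MiniRDD[K2, W] =
    MiniRDD(records.map(f), None)
}

object PartitionerDemo {
  def main(args: Array[String]): Unit = {
    val hashPart: Int => Int = h => Math.floorMod(h, 4)
    val rdd = MiniRDD(Seq("a" -> 1, "b" -> 2), Some(hashPart))
    println(rdd.mapValues(_ * 10).partitioner.isDefined)                   // true
    println(rdd.map { case (k, v) => (k, v * 10) }.partitioner.isDefined)  // false, keys kept but unprovable
  }
}
```

A downstream join on the mapValues() result can still skip the shuffle; the same join after map() cannot, even though the data is identical.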
>
> -Ewen
>
> On January 16, 2014 12:16 PM, Patrick Wendell <[email protected]> wrote:
> The intermediate shuffle output gets written to disk, but it often
> hits the OS-buffer cache since it's not explicitly fsync'ed, so in
> many cases it stays entirely in memory. The behavior of the shuffle is
> agnostic to whether the base RDD is in cache or on disk.
>
> For on-disk RDDs or inputs, the shuffle path still has some key
> differences from Hadoop's implementation, including that it doesn't
> sort on the map side before shuffling.
>
> - Patrick
> On January 16, 2014 6:24 AM, suman bharadwaj <[email protected]> wrote:
> Hi,
>
> Is this behavior the same when the data is in memory?
> If the data is stored to disk, then how is it different from Hadoop
> MapReduce?
>
> Regards,
> SB
>
>
>
> On January 16, 2014 3:41 AM, Archit Thakur <[email protected]> wrote:
> For any shuffle operation (groupByKey, etc.), Spark does write the map
> output to disk before performing the reduce task on the data.
>
>
>
> On January 16, 2014 2:33 AM, suman bharadwaj <[email protected]> wrote:
> Hi,
>
> I'm new to Spark and wanted to understand more about how shuffle works in
> Spark.
>
> In Hadoop MapReduce, while performing a reduce operation, the
> intermediate map output gets written to disk. How does the same happen
> in Spark?
>
> Does Spark write the intermediate data to disk?
>
> Thanks in advance.
>
> Regards,
> SB
>
>
