Thanks Patrick and Ewen, great answers.
So a transformation that actually causes a shuffle will store the data in
memory + disk, more often in memory. Is my understanding correct?

Regards,
SB

On Fri, Jan 17, 2014 at 2:38 AM, Ewen Cheslack-Postava <[email protected]> wrote:

> The difference between a shuffle dependency and a transformation that can
> cause a shuffle is probably worth pointing out.
>
> The mentioned transformations (groupByKey, join, etc.) *might* generate a
> shuffle dependency on the input RDDs, but they won't necessarily. For
> example, if you join() two RDDs that already use the same partitioner
> (e.g. a default HashPartitioner with the default parallelism), then no
> shuffle needs to be performed (and nothing should hit disk). Any records
> that need to be considered together will already be in the same partitions
> of the input RDDs (e.g. all records with key X are guaranteed to be in
> partition hash(X) of both input RDDs, so no shuffling is needed).
>
> Sometimes this is *really* worth exploiting, even if it only applies to
> one of the input RDDs. For example, if you're joining 2 RDDs and one is
> much larger than the other and already partitioned, you can explicitly use
> the partitioner from the larger RDD so that only the smaller RDD gets
> shuffled.
>
> This also means you probably want to pay attention to transformations
> that remove partitioners. For example, prefer mapValues() to map().
> mapValues() has to maintain the same key, so the output is guaranteed to
> still be partitioned. map() can change the keys, so partitioning is lost
> even if you keep the same keys.
>
> -Ewen
>
> Patrick Wendell <[email protected]>
> January 16, 2014 12:16 PM
>
> The intermediate shuffle output gets written to disk, but it often hits
> the OS buffer cache since it's not explicitly fsync'd, so in many cases it
> stays entirely in memory. The behavior of the shuffle is agnostic to
> whether the base RDD is in cache or on disk.
>
> For on-disk RDDs or inputs, the shuffle path still has some key
> differences from Hadoop's implementation, including that it doesn't sort
> on the map side before shuffling.
>
> - Patrick
>
> suman bharadwaj <[email protected]>
> January 16, 2014 6:24 AM
>
> Hi,
>
> Is this behavior the same when the data is in memory?
> If the data is stored to disk, then how is it different from Hadoop
> MapReduce?
>
> Regards,
> SB
>
> Archit Thakur <[email protected]>
> January 16, 2014 3:41 AM
>
> For any shuffle operation (groupByKey, etc.) it does write the map output
> to disk before performing the reduce task on the data.
>
> suman bharadwaj <[email protected]>
> January 16, 2014 2:33 AM
>
> Hi,
>
> I'm new to Spark and wanted to understand more about how the shuffle
> works in Spark.
>
> In Hadoop MapReduce, while performing a reduce operation, the
> intermediate data from the map gets written to disk. How does the same
> happen in Spark?
>
> Does Spark write the intermediate data to disk?
>
> Thanks in advance.
>
> Regards,
> SB
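
P.S. To check that I follow the point about co-partitioned joins, is this
roughly what you mean? (Just a sketch in the spark-shell, where sc is the
shell's SparkContext; the RDDs and names are made up.)

import org.apache.spark.HashPartitioner

val p = new HashPartitioner(8)

// Both pair RDDs laid out with the same partitioner. partitionBy() itself
// shuffles each RDD once, but after that, records with the same key sit in
// the same partition index of both RDDs.
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"))).partitionBy(p)
val orders = sc.parallelize(Seq((1, "book"), (2, "pen"))).partitionBy(p)

// join() sees matching partitioners on both sides, so it can use a narrow
// dependency instead of adding a shuffle dependency.
val joined = users.join(orders)
joined.collect().foreach(println)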
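
For the case where only one side is already partitioned, I'm assuming it
would look something like this (again invented RDDs, just to check my
understanding):

import org.apache.spark.HashPartitioner

// Hypothetical: bigRdd is large and already partitioned, smallRdd is not.
val bigRdd   = sc.parallelize(1 to 100000).map(i => (i % 100, i)).partitionBy(new HashPartitioner(16))
val smallRdd = sc.parallelize(Seq((1, "x"), (2, "y")))

// Passing bigRdd's partitioner to join() means only smallRdd's records get
// shuffled; bigRdd's existing layout is reused as-is.
val joined = bigRdd.join(smallRdd, bigRdd.partitioner.get)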
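
And for mapValues() vs map(), I think the difference should be visible in
the partitioner field:

import org.apache.spark.HashPartitioner

val partitioned = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(new HashPartitioner(4))

// mapValues() cannot change the keys, so the partitioner is kept.
println(partitioned.mapValues(_.toUpperCase).partitioner)   // Some(HashPartitioner@...)

// map() could change the keys, so the partitioner is dropped, even though
// this particular function leaves the keys untouched.
println(partitioned.map { case (k, v) => (k, v.toUpperCase) }.partitioner)  // None

Please correct me if any of this is off.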
