Hey Corey,

Yes, when shuffles are smaller than the memory available to the OS, the outputs most often never get stored to disk. I believe the same holds for the YARN shuffle service, because the write path is actually the same, i.e. we don't fsync the writes and force them to disk. I would guess that in such shuffles the bottleneck is serializing the data rather than raw I/O, so I'm not sure explicitly buffering the data in the JVM process would yield a large improvement.
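To make the no-fsync point concrete, here's a minimal Python sketch (not Spark's actual shuffle code; the file name and payload are purely illustrative). Writes that aren't fsync'd land in the kernel page cache, and only an explicit fsync would force physical I/O:

```python
import os
import tempfile

# Hypothetical stand-in for a shuffle output file.
fd, path = tempfile.mkstemp(prefix="shuffle-block-")
try:
    os.write(fd, b"serialized shuffle records")
    # The bytes now sit in the OS page cache; no physical I/O has been forced.
    # Spark's shuffle writers likewise skip the fsync below, so shuffle output
    # smaller than free memory is typically read back straight from the cache.
    # os.fsync(fd)  # uncommenting this would force the data to stable storage

    os.lseek(fd, 0, os.SEEK_SET)
    data = os.read(fd, 1024)  # served from the page cache
    assert data == b"serialized shuffle records"
finally:
    os.close(fd)
    os.remove(path)
```

The same reasoning applies whether the writer runs in the executor or in the external shuffle service process: the kernel, not the JVM, decides when (or whether) those pages reach disk.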
Writing shuffle output to an explicitly pinned in-memory filesystem is also possible (per Davies' suggestion), but it's brittle because the job will fail if the shuffle output exceeds memory.

- Patrick

On Wed, Jun 10, 2015 at 9:50 PM, Davies Liu <dav...@databricks.com> wrote:
> If you have enough memory, you can put the temporary work directory in
> tmpfs (in-memory file system).
>
> On Wed, Jun 10, 2015 at 8:43 PM, Corey Nolet <cjno...@gmail.com> wrote:
>> Ok so it is the case that small shuffles can be done without hitting any
>> disk. Is this the same case for the aux shuffle service in YARN? Can that
>> be done without hitting disk?
>>
>> On Wed, Jun 10, 2015 at 9:17 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>>>
>>> In many cases the shuffle will actually hit the OS buffer cache and
>>> never touch spinning disk if it is a size that is less than the memory
>>> on the machine.
>>>
>>> - Patrick
>>>
>>> On Wed, Jun 10, 2015 at 5:06 PM, Corey Nolet <cjno...@gmail.com> wrote:
>>> > So with this... to help my understanding of Spark under the hood:
>>> >
>>> > Is this statement correct: "When data needs to pass between multiple
>>> > JVMs, a shuffle will always hit disk"?
>>> >
>>> > On Wed, Jun 10, 2015 at 10:11 AM, Josh Rosen <rosenvi...@gmail.com> wrote:
>>> >>
>>> >> There's a discussion of this at
>>> >> https://github.com/apache/spark/pull/5403
>>> >>
>>> >> On Wed, Jun 10, 2015 at 7:08 AM, Corey Nolet <cjno...@gmail.com> wrote:
>>> >>>
>>> >>> Is it possible to configure Spark to do all of its shuffling FULLY in
>>> >>> memory (given that I have enough memory to store all the data)?

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org