Hey Corey,

Yes, when shuffles are smaller than the memory available to the OS, the outputs most often never get stored to disk. I believe the same holds for the YARN shuffle service, because the write path is actually the same, i.e. we don't fsync the writes and force them to disk. I would guess that in such shuffles the bottleneck is serializing the data rather than raw I/O, so I'm not sure explicitly buffering the data in the JVM process would yield a large improvement.
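To make the no-fsync point concrete, here's a minimal Python sketch (not Spark's actual shuffle code; the file name and payload are purely illustrative). Writes that aren't fsync'd land in the kernel page cache, and only an explicit fsync would force physical I/O:

```python
import os
import tempfile

# Hypothetical stand-in for a shuffle output file.
fd, path = tempfile.mkstemp(prefix="shuffle-block-")
try:
    os.write(fd, b"serialized shuffle records")
    # The bytes now sit in the OS page cache; no physical I/O has been forced.
    # Spark's shuffle writers likewise skip the fsync below, so shuffle output
    # smaller than free memory is typically read back straight from the cache.
    # os.fsync(fd)  # uncommenting this would force the data to stable storage

    os.lseek(fd, 0, os.SEEK_SET)
    data = os.read(fd, 1024)  # served from the page cache
    assert data == b"serialized shuffle records"
finally:
    os.close(fd)
    os.remove(path)
```

The same reasoning applies whether the writer runs in the executor or in the external shuffle service process: the kernel, not the JVM, decides when (or whether) those pages reach disk.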
Writing shuffle output to an explicitly pinned in-memory filesystem is also possible (per Davies' suggestion), but it's brittle because the job will fail if the shuffle output exceeds memory.

- Patrick

On Wed, Jun 10, 2015 at 9:50 PM, Davies Liu <dav...@databricks.com> wrote:
> If you have enough memory, you can put the temporary work directory in
> tmpfs (in-memory file system).
>
> On Wed, Jun 10, 2015 at 8:43 PM, Corey Nolet <cjno...@gmail.com> wrote:
>> Ok so it is the case that small shuffles can be done without hitting any
>> disk. Is this the same case for the aux shuffle service in YARN? Can that
>> be done without hitting disk?
>>
>> On Wed, Jun 10, 2015 at 9:17 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>>>
>>> In many cases the shuffle will actually hit the OS buffer cache and
>>> never touch spinning disk if it is a size that is less than the memory
>>> on the machine.
>>>
>>> - Patrick
>>>
>>> On Wed, Jun 10, 2015 at 5:06 PM, Corey Nolet <cjno...@gmail.com> wrote:
>>> > So with this... to help my understanding of Spark under the hood:
>>> >
>>> > Is this statement correct: "When data needs to pass between multiple
>>> > JVMs, a shuffle will always hit disk"?
>>> >
>>> > On Wed, Jun 10, 2015 at 10:11 AM, Josh Rosen <rosenvi...@gmail.com> wrote:
>>> >>
>>> >> There's a discussion of this at
>>> >> https://github.com/apache/spark/pull/5403
>>> >>
>>> >> On Wed, Jun 10, 2015 at 7:08 AM, Corey Nolet <cjno...@gmail.com> wrote:
>>> >>>
>>> >>> Is it possible to configure Spark to do all of its shuffling FULLY in
>>> >>> memory (given that I have enough memory to store all the data)?

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org