Thanks, Andrew. That helps.

For 1, it sounds like the data for the RDD is held in memory and then only
written to disk after the entire RDD has been realized in memory. Is that
correct?
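
A minimal sketch of the case in point 1, assuming the Scala API with a local master and Spark on the classpath. The object name, method name, app name, and data are made up for illustration. It only shows that persist() marks the RDD and that storage happens when the first action evaluates it; it says nothing about how partitions are buffered internally, which is the open question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
// Pair-RDD implicits (groupByKey etc.); required on Spark 0.9/1.x,
// redundant on later versions.
import org.apache.spark.SparkContext._
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Hypothetical helper: groups some generated pairs and persists the result to disk.
  def groupedKeyCount(sc: SparkContext): Long = {
    val pairs = sc.parallelize(1 to 100000).map(i => (i % 10, i))

    // persist() only records the storage level; nothing is computed or stored yet.
    val grouped = pairs.groupByKey().persist(StorageLevel.DISK_ONLY)

    // The first action evaluates the RDD, and its partitions are stored at that point.
    val firstCount = grouped.count()

    // Later actions on the same RDD can read the stored partitions instead of recomputing.
    val secondCount = grouped.count()
    require(firstCount == secondCount)
    firstCount
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println("distinct keys: " + groupedKeyCount(sc))
    finally sc.stop()
  }
}
```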

-Suren



On Wed, Apr 9, 2014 at 9:25 AM, Andrew Ash <and...@andrewash.com> wrote:

> For 1, persist can be used to save an RDD to disk using the various
> persistence levels.  When a persistence level is set on an RDD, the RDD is
> saved to memory/disk/elsewhere when it is evaluated, so that subsequent
> uses of the RDD can use the cached value.
>
>
> https://spark.apache.org/docs/0.9.0/scala-programming-guide.html#rdd-persistence
>
> 2. The other place disk is most commonly used is shuffles.  If you have
> data across the cluster that comes from a source, then you might not have
> to hold it all in memory at once.  But if you do a shuffle, which scatters
> the data across the cluster in a certain way, then you have to have the
> memory/disk available for that RDD all at once.  In that case, shuffles
> will sometimes need to spill over to disk for large RDDs, which can be
> controlled with the spark.shuffle.spill setting.
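>
> As a rough sketch, assuming the SparkConf API (available since 0.9), the
> setting could be supplied like this.  The app name is made up for
> illustration.
>
> ```scala
> import org.apache.spark.SparkConf
>
> // Sketch only: "spark.shuffle.spill" is the setting named above.
> val conf = new SparkConf()
>   .setAppName("shuffle-spill-sketch")   // hypothetical app name
>   .set("spark.shuffle.spill", "true")   // let shuffles spill to disk
> ```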
>
> Does that help clarify?
>
>
> On Mon, Apr 7, 2014 at 10:20 AM, Surendranauth Hiraman <
> suren.hira...@velos.io> wrote:
>
>> It might help if I clarify my questions. :-)
>>
>> 1. Is persist() applied during the transformation right before the
>> persist() call in the graph? Or is it applied after the transform's
>> processing is complete? In the case of things like GroupBy, is the Seq
>> backed by disk as it is being created? We're trying to get a sense of how
>> the processing is handled behind the scenes with respect to disk.
>>
>> 2. When else is disk used internally?
>>
>> Any pointers are appreciated.
>>
>> -Suren
>>
>>
>>
>>
>> On Mon, Apr 7, 2014 at 8:46 AM, Surendranauth Hiraman <
>> suren.hira...@velos.io> wrote:
>>
>>> Hi,
>>>
>>> Any thoughts on this? Thanks.
>>>
>>> -Suren
>>>
>>>
>>>
>>> On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman <
>>> suren.hira...@velos.io> wrote:
>>>
>>>> Hi,
>>>>
>>>> I know if we call persist with the right options, we can have Spark
>>>> persist an RDD's data on disk.
>>>>
>>>> I am wondering what happens in intermediate operations that could
>>>> conceivably create large collections/Sequences, like GroupBy and shuffling.
>>>>
>>>> Basically, one part of the question is when is disk used internally?
>>>>
>>>> And is calling persist() on the RDD returned by such transformations
>>>> what lets it know to use disk in those situations? Trying to understand if
>>>> persist() is applied during the transformation or after it.
>>>>
>>>> Thank you.
>>>>
>>>>
>>>> SUREN HIRAMAN, VP TECHNOLOGY
>>>> Velos
>>>> Accelerating Machine Learning
>>>>
>>>> 440 NINTH AVENUE, 11TH FLOOR
>>>> NEW YORK, NY 10001
>>>> O: (917) 525-2466 ext. 105
>>>> F: 646.349.4063
>>>> E: suren.hiraman@velos.io
>>>> W: www.velos.io
>>>>
>>>>
>>>
>>>
>>
>>
>

