Re: Understanding Spark's caching

2015-04-28 Thread ayan guha
Hi, I replied to you on SO. If Option A had an action call, then it would
suffice too.
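For example, a rough sketch of Option A with an action per branch (the path
and the lambdas below are hypothetical stand-ins, not from the original
snippet):

val rdd1 = sc.textFile("hdfs:///some/data") // hypothetical input path
rdd1.cache() // marks rdd1 as cached
val rdd2 = rdd1.filter(_.nonEmpty) // stand-in predicate
val rdd3 = rdd1.map(_.toUpperCase) // stand-in mapping
rdd2.cache()
rdd3.cache()
rdd2.count() // action: computes rdd2 (reads rdd1 from disk once, caching it)
rdd3.count() // action: computes rdd3 (reads rdd1 from cache)
rdd1.unpersist() // safe now: rdd2 and rdd3 are already computed and cached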


Re: Understanding Spark's caching

2015-04-28 Thread Akhil Das
Option B would be fine. As the answer on SO itself says: "Since RDD
transformations merely build DAG descriptions without execution, in Option
A, by the time you call unpersist, you still only have job descriptions and
not a running execution."

Also note that in Option A you are not calling any action anywhere.
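A quick way to see that unpersist takes effect immediately on the driver
rather than being queued in the DAG (a sketch, assuming the default
MEMORY_ONLY level from cache() and a hypothetical path):

val rdd1 = sc.textFile("hdfs:///some/data") // hypothetical path
rdd1.cache()
println(rdd1.getStorageLevel) // MEMORY_ONLY: the level set by cache()
rdd1.unpersist() // takes effect right away, no job involved
println(rdd1.getStorageLevel) // back to StorageLevel.NONE, yet no action has run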

Thanks
Best Regards


Understanding Spark's caching

2015-04-27 Thread Eran Medan
Hi Everyone!

I'm trying to understand how Spark's caching works.

Here is my naive understanding; please let me know if I'm missing something:

val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd1 as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.saveAsTextFile(...)
rdd3.saveAsTextFile(...)

In the above, rdd1 will be loaded from disk (e.g. HDFS) only once (when rdd2
is saved, I assume) and then served from cache (assuming there is enough RAM)
when rdd3 is saved.
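(One rough way I could check that, assuming the default MEMORY_ONLY level:
toDebugString reports cached partitions once an action has materialized
them.)

val rdd1 = sc.textFile("hdfs:///some/data") // hypothetical path
rdd1.cache()
rdd1.count() // first action: reads from disk and fills the cache
println(rdd1.toDebugString) // lineage should now list CachedPartitions for rdd1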

Now here is my question. Let's say I want to cache rdd2 and rdd3, as they
will both be used later on, but I don't need rdd1 after creating them.

Basically there is duplication, isn't there? Once rdd2 and rdd3 are
computed, I don't need rdd1 anymore, so I should probably unpersist it,
right? The question is: when?

*Will this work? (Option A)*

val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd1 as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.cache()
rdd3.cache()
rdd1.unpersist()

Does Spark add the unpersist call to the DAG, or is it executed immediately?
If it's executed immediately, then rdd1 will effectively not be cached when I
read from rdd2 and rdd3, right?

*Should I do it this way instead (Option B)?*

val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd1 as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)

rdd2.cache()
rdd3.cache()

rdd2.saveAsTextFile(...)
rdd3.saveAsTextFile(...)

rdd1.unpersist()

*So the question is this:* Is Option A good enough, i.e. will rdd1 still
read the file only once? Or do I need to go with Option B?

(see also
http://stackoverflow.com/questions/29903675/understanding-sparks-caching)

Thanks in advance