Re: Understanding Spark's caching
Hi,

I replied to you on SO. If Option A had an action call, then it should suffice too.

On 28 Apr 2015 05:30, Eran Medan eran.me...@gmail.com wrote:
> Hi Everyone! I'm trying to understand how Spark's cache works. [...]
Re: Understanding Spark's caching
Option B would be fine. As the answer on SO itself says: since RDD transformations merely build DAG descriptions without executing anything, in Option A, by the time you call unpersist, you still only have job descriptions and not a running execution, so nothing has been cached yet.

Also note that in Option A you are not specifying any action anywhere.

Thanks
Best Regards

On Tue, Apr 28, 2015 at 12:58 AM, Eran Medan eran.me...@gmail.com wrote:
> Hi Everyone! I'm trying to understand how Spark's cache works. [...]
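To make the point about lazy evaluation concrete, here is a toy model in plain Scala. It is not Spark and not how Spark is implemented; `LazyData`, `LazyDemo` and `sourceReads` are hypothetical names made up for this sketch. It only mimics the two behaviours discussed above: transformations are recorded and run when an action is called, while `unpersist` takes effect as soon as it is called.

```scala
// Toy model only: NOT Spark's implementation. All names are hypothetical.
final class LazyData[A](compute: () => Seq[A]) {
  private var cacheRequested = false
  private var cached: Option[Seq[A]] = None

  // Like RDD.cache(): only marks the data set, computes nothing.
  def cache(): this.type = { cacheRequested = true; this }

  // Takes effect immediately: clears the mark and drops any cached data.
  def unpersist(): this.type = { cacheRequested = false; cached = None; this }

  // Transformations only describe new data sets in terms of this one.
  def map[B](f: A => B): LazyData[B] =
    new LazyData(() => materialize().map(f))
  def filter(p: A => Boolean): LazyData[A] =
    new LazyData(() => materialize().filter(p))

  // An action: the only thing that forces computation.
  def count(): Int = materialize().size

  private def materialize(): Seq[A] = cached.getOrElse {
    val data = compute()
    if (cacheRequested) cached = Some(data)
    data
  }
}

object LazyDemo {
  // Returns how many times the "source" had to be read.
  def sourceReads(unpersistBeforeActions: Boolean): Int = {
    var reads = 0
    val rdd1 = new LazyData(() => { reads += 1; Seq("a", "bb", "ccc") }).cache()
    val rdd2 = rdd1.filter(_.length > 1).cache()
    val rdd3 = rdd1.map(_.toUpperCase).cache()
    if (unpersistBeforeActions) rdd1.unpersist()  // Option A ordering
    rdd2.count()
    rdd3.count()
    if (!unpersistBeforeActions) rdd1.unpersist() // Option B ordering
    reads
  }
}
```

In this model, `LazyDemo.sourceReads(unpersistBeforeActions = true)` reads the source twice, because the cache mark on `rdd1` is already gone when the two actions run, while the Option B ordering reads it once.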
Understanding Spark's caching
Hi Everyone!

I'm trying to understand how Spark's cache works. Here is my naive understanding, please let me know if I'm missing something:

    val rdd1 = sc.textFile("some data")
    rdd1.cache() // marks rdd1 as cached
    val rdd2 = rdd1.filter(...)
    val rdd3 = rdd1.map(...)
    rdd2.saveAsTextFile(...)
    rdd3.saveAsTextFile(...)

In the above, rdd1 will be loaded from disk (e.g. HDFS) only once (when rdd2 is saved, I assume), and then read from cache (assuming there is enough RAM) when rdd3 is saved.

Now here is my question. Let's say I want to cache rdd2 and rdd3 as they will both be used later on, but I don't need rdd1 after creating them. Basically there is duplication, isn't there? Since I don't need rdd1 anymore once rdd2 and rdd3 are calculated, I should probably unpersist it, right? The question is when.

*Will this work? (Option A)*

    val rdd1 = sc.textFile("some data")
    rdd1.cache() // marks rdd1 as cached
    val rdd2 = rdd1.filter(...)
    val rdd3 = rdd1.map(...)
    rdd2.cache()
    rdd3.cache()
    rdd1.unpersist()

Does Spark add the unpersist call to the DAG, or is it done immediately? If it's done immediately, then basically rdd1 will not be cached when I read from rdd2 and rdd3, right?

*Should I do it this way instead (Option B)?*

    val rdd1 = sc.textFile("some data")
    rdd1.cache() // marks rdd1 as cached
    val rdd2 = rdd1.filter(...)
    val rdd3 = rdd1.map(...)
    rdd2.cache()
    rdd3.cache()
    rdd2.saveAsTextFile(...)
    rdd3.saveAsTextFile(...)
    rdd1.unpersist()

*So the question is this:* Is Option A good enough, i.e. will rdd1 still access the file only once? Or do I need to go with Option B?

(see also http://stackoverflow.com/questions/29903675/understanding-sparks-caching)

Thanks in advance
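
For anyone who wants to try the Option B ordering from the message above, here is a minimal sketch. It rests on several assumptions that are not in the original post: Spark is on the classpath, the job runs with a `local[2]` master, `sc.parallelize` over a tiny in-memory sequence stands in for `sc.textFile(...)`, and `count()` stands in for `saveAsTextFile(...)` as the action. `OptionBSketch` and `run` are names invented for this sketch.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object OptionBSketch {
  // Returns (rdd2 count, rdd3 count, whether rdd1 ends up unpersisted).
  def run(): (Long, Long, Boolean) = {
    val conf = new SparkConf().setAppName("option-b-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Assumption: in-memory data instead of sc.textFile(...).
      val rdd1 = sc.parallelize(Seq("a", "bb", "ccc", "dddd"))
      rdd1.cache() // only marks rdd1; nothing is computed yet

      val rdd2 = rdd1.filter(_.length > 2)
      val rdd3 = rdd1.map(_.toUpperCase)
      rdd2.cache()
      rdd3.cache()

      // Actions: these trigger the jobs that compute and cache the RDDs.
      val n2 = rdd2.count()
      val n3 = rdd3.count()

      // Only now drop rdd1; rdd2 and rdd3 have already been materialized.
      rdd1.unpersist()

      (n2, n3, rdd1.getStorageLevel == StorageLevel.NONE)
    } finally {
      sc.stop()
    }
  }
}
```

Moving the `rdd1.unpersist()` line above the two `count()` calls gives the Option A ordering, in which rdd1 is no longer marked for caching by the time any job runs.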