Re: Spark map performance question

2013-12-10 Thread Yadid Ayzenberg
Thanks Mark, that cleared things up for me. I applied the cache() before the count() and now it's behaving as expected. I really appreciate the fast response. Yadid On 12/10/13 12:20 PM, Mark Hamstra wrote: You're not marking rdd1 as cached (actually, to-be-cached-after-next-evaluation) unti

Re: Spark map performance question

2013-12-10 Thread Mark Hamstra
You're not marking rdd1 as cached (actually, to-be-cached-after-next-evaluation) until after rdd1.count; so when you hit rdd2.count, rdd1 is not yet cached (no action has been performed on it since it was marked as cached) and has to be completely re-evaluated. On the other hand, by the time you h

Spark map performance question

2013-12-10 Thread Yadid Ayzenberg
Hi All, I'm trying to understand the performance results I'm getting for the following:

rdd = sc.newAPIHadoopRDD( ... )
rdd1 = rdd.keyBy( func1() )
rdd1.count()
rdd1.cache()
rdd2 = rdd1.map(func2())
rdd2.count()
rdd3 = rdd2.map(func2())
rdd3.count()

I would expect the 2 maps to be more or le
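The ordering issue discussed in this thread, cache() taking effect only on the next action rather than retroactively, can be sketched with a minimal lazy-evaluation model in plain Python. This is a toy illustration, not real Spark: `ToyRDD`, its counters, and the lambdas stand in for the RDDs and funcs in the question above.

```python
# Toy model of Spark's lazy caching semantics (pure Python, no Spark
# required). ToyRDD and its internals are hypothetical illustrations,
# not real Spark APIs.

class ToyRDD:
    def __init__(self, compute):
        self._compute = compute   # deferred computation, like an RDD lineage
        self._cached = None
        self._marked = False      # cache() only marks; nothing stored yet
        self.evaluations = 0      # how many times the lineage actually ran

    def cache(self):
        self._marked = True       # takes effect on the NEXT evaluation
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        data = self._compute()
        if self._marked:
            self._cached = data
        return data

    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self._materialize()])

    def count(self):              # an "action": forces evaluation
        return len(self._materialize())


# The ordering from the original question: count() before cache().
rdd1 = ToyRDD(lambda: list(range(10)))
rdd1.count()                      # evaluates rdd1; nothing cached yet
rdd1.cache()                      # marked too late for the count above
rdd2 = rdd1.map(lambda x: x + 1)
rdd2.count()                      # re-evaluates rdd1, and now caches it
rdd3 = rdd2.map(lambda x: x + 1)
rdd3.count()                      # rdd1 served from cache this time
print(rdd1.evaluations)           # -> 2: rdd1 was computed twice

# The fix from the follow-up: cache() before the first count().
rdd1b = ToyRDD(lambda: list(range(10)))
rdd1b.cache()
rdd1b.count()                     # evaluates and caches in one pass
rdd1b.map(lambda x: x + 1).count()
print(rdd1b.evaluations)          # -> 1: later actions hit the cache
```

In the first sequence, rdd2.count() pays the cost of re-evaluating rdd1, which matches the thread's observation that the first map appeared slower than the second; moving cache() before the first count() removes the extra pass.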