Thanks Mark, that cleared things up for me.
I moved the cache() call before the count(), and now it's behaving as expected.
I really appreciate the fast response.
Yadid
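P.S. For anyone who finds this thread later, here is a rough way to see the effect Mark describes. This is only a toy model in plain Python, not Spark itself: ToyRDD, its evaluations counter, and the run() helper are names made up for illustration, and real RDDs are partitioned and distributed. The only behaviour it tries to mimic is that cache() just marks the dataset, and the data is actually stored on the next evaluation after the mark.

```python
class ToyRDD:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's RDD)."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument function producing the items
        self.evaluations = 0           # how many times the items were recomputed
        self._cache_requested = False  # set by cache(); nothing is stored yet
        self._cached = None            # filled in on the first evaluation after cache()

    def cache(self):
        # Only marks the dataset as to-be-cached; storage happens on the next evaluation.
        self._cache_requested = True
        return self

    def _items(self):
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        items = list(self._compute())
        if self._cache_requested:
            self._cached = items
        return items

    def map(self, f):
        # Lazy: the parent is only evaluated when an action runs on the child.
        return ToyRDD(lambda: (f(x) for x in self._items()))

    def count(self):
        # An action: forces evaluation.
        return len(self._items())


def run(cache_before_count):
    """Return how many times rdd1 was evaluated for the given ordering."""
    rdd1 = ToyRDD(lambda: range(5))
    if cache_before_count:
        rdd1.cache()
        rdd1.count()
    else:
        rdd1.count()   # evaluated, but not yet marked, so nothing is stored
        rdd1.cache()
    rdd2 = rdd1.map(lambda x: x + 1)
    rdd2.count()
    rdd3 = rdd2.map(lambda x: x * 2)
    rdd3.count()
    return rdd1.evaluations


print(run(cache_before_count=False), run(cache_before_count=True))  # prints: 2 1
```

With count() before cache(), rdd1 gets evaluated a second time when rdd2.count() runs, and only then is it stored. With cache() first, the initial count() is the only evaluation.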
On 12/10/13 12:20 PM, Mark Hamstra wrote:
You're not marking rdd1 as cached (actually,
to-be-cached-after-next-evaluation) until after rdd1.count; so when you hit
rdd2.count, rdd1 is not yet cached (no action has been performed on it
since it was marked as cached) and has to be completely re-evaluated. On
the other hand, by the time you h
Hi All,
I'm trying to understand the performance results I'm getting for the
following:
rdd = sc.newAPIHadoopRDD( ... )
rdd1 = rdd.keyBy( func1() )
rdd1.count()
rdd1.cache()
rdd2= rdd1.map(func2())
rdd2.count()
rdd3 = rdd2.map(func2())
rdd3.count()
I would expect the 2 maps to be more or le