Thanks Mark, that cleared things up for me.
I applied the cache() before the count() and now it's behaving as expected.
I really appreciate the fast response.
Yadid
On 12/10/13 12:20 PM, Mark Hamstra wrote:
You're not marking rdd1 as cached (actually,
to-be-cached-after-next-evaluation) until after rdd1.count; so when
you hit rdd2.count, rdd1 is not yet cached (no action has been
performed on it since it was marked as cached) and has to be
completely re-evaluated. On the other hand, by the time you hit
rdd3.count, there has been an action forcing evaluation of rdd1 after
it was marked as cached, so rdd1 does not need to be re-evaluated
within the rdd3.count job.
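To make the ordering point concrete, here is a toy model in plain Python. It is not Spark and does not use the Spark API: ToyRDD, run, and the evaluations counter are hypothetical names invented for this sketch. It assumes only the behaviour described above, namely that cache() merely marks a dataset and the data is actually stored by the next action that evaluates it.

```python
class ToyRDD:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark)."""

    def __init__(self, compute):
        self._compute = compute   # function that produces this dataset's elements
        self._marked = False      # set by cache(); nothing is stored yet
        self._stored = None       # filled in by the first evaluation after cache()
        self.evaluations = 0      # how many times this dataset was fully computed

    def cache(self):
        # Only marks the dataset; storage happens on the next evaluation.
        self._marked = True
        return self

    def map(self, f):
        # Lazy: the child pulls from this dataset only when it is evaluated.
        return ToyRDD(lambda: [f(x) for x in self._materialize()])

    def _materialize(self):
        if self._stored is not None:
            return self._stored          # served from the cache
        self.evaluations += 1            # full (re-)evaluation
        data = self._compute()
        if self._marked:
            self._stored = data          # keep the result for later jobs
        return data

    def count(self):
        # An action: forces evaluation.
        return len(self._materialize())


def run(cache_first):
    """Return how many times rdd1 is evaluated across the three count() jobs."""
    rdd1 = ToyRDD(lambda: list(range(5)))
    if cache_first:
        rdd1.cache()
        rdd1.count()
    else:
        rdd1.count()
        rdd1.cache()
    rdd2 = rdd1.map(lambda x: x + 1)
    rdd2.count()
    rdd3 = rdd2.map(lambda x: x + 1)
    rdd3.count()
    return rdd1.evaluations


print(run(cache_first=False))  # 2: rdd2.count() has to re-evaluate rdd1
print(run(cache_first=True))   # 1: rdd1 is stored by its own count()
```

In the original ordering, the rdd2.count job pays for a second full evaluation of rdd1, while the rdd3.count job finds it already stored; calling cache() before the first count() removes that extra evaluation.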
On Tue, Dec 10, 2013 at 9:11 AM, Yadid Ayzenberg <ya...@media.mit.edu> wrote:
Hi All,
I'm trying to understand the performance results I'm getting for
the following:
rdd = sc.newAPIHadoopRDD( ... )
rdd1 = rdd.keyBy( func1() )
rdd1.count()
rdd1.cache()
rdd2 = rdd1.map(func2())
rdd2.count()
rdd3 = rdd2.map(func2())
rdd3.count()
I would expect the two maps to be more or less equivalent in terms
of runtime; however, I'm seeing a more than 2x improvement for the second one.
My assumption is that after the first count() operation all of the
data is already loaded in RAM and therefore subsequent
transformations would have equivalent performance. I guess my
assumption may be flawed.
Yadid