You don't mark rdd1 as cached (more precisely, as to-be-cached on its next
evaluation) until after rdd1.count() has already run. So when you hit
rdd2.count(), rdd1 is not yet cached (no action has been performed on it
since it was marked as cached) and has to be completely re-evaluated. By
the time you hit rdd3.count(), on the other hand, an action has forced
evaluation of rdd1 after it was marked as cached, so rdd1 does not need to
be re-evaluated within the rdd3.count() job.
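
If you call rdd1.cache() before rdd1.count(), the first count populates the
cache and neither later job has to recompute rdd1. As a rough illustration,
here is a toy model in plain Python. It is not Spark's API: the LazyRDD
class and the run() helper are made up for this sketch, and it only mimics
the "cache() marks, the next action materializes" behaviour by counting how
often rdd1 is actually computed.

```python
class LazyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark code)."""

    def __init__(self, compute):
        self._compute = compute   # function that produces this dataset's rows
        self._marked = False      # set by cache(); nothing is stored yet
        self._cached = None       # filled by the first action after cache()
        self.evaluations = 0      # how many times the rows were recomputed

    def cache(self):
        # Only marks the dataset; materialization waits for the next action.
        self._marked = True
        return self

    def _rows(self):
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        rows = list(self._compute())
        if self._marked:
            self._cached = rows
        return rows

    def map(self, f):
        # Lazy: the parent is only evaluated when an action runs.
        return LazyRDD(lambda: (f(x) for x in self._rows()))

    def count(self):
        # An action: forces evaluation.
        return len(self._rows())


def run(cache_before_first_count):
    """Replays the sequence from the question; returns rdd1's evaluation count."""
    rdd1 = LazyRDD(lambda: range(5))
    if cache_before_first_count:
        rdd1.cache()
    rdd1.count()
    if not cache_before_first_count:
        rdd1.cache()              # the order used in the original post
    rdd2 = rdd1.map(lambda x: x + 1)
    rdd2.count()
    rdd3 = rdd2.map(lambda x: x * 2)
    rdd3.count()
    return rdd1.evaluations


print("cache() after first count: ", run(False), "evaluations of rdd1")
print("cache() before first count:", run(True), "evaluations of rdd1")
```

In this model the original ordering computes rdd1 twice (once for
rdd1.count(), once more inside the rdd2.count() job), while marking it
first computes it only once.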


On Tue, Dec 10, 2013 at 9:11 AM, Yadid Ayzenberg <[email protected]> wrote:

>
> Hi All,
>
> I'm trying to understand the performance results I'm getting for the
> following:
>
> rdd = sc.newAPIHadoopRDD( ... )
> rdd1 = rdd.keyBy( func1() )
> rdd1.count()
> rdd1.cache()
>
> rdd2 = rdd1.map(func2())
> rdd2.count()
> rdd3 = rdd2.map(func2())
> rdd3.count()
>
> I would expect the 2 maps to be more or less equivalent in terms of
> runtime; however, I'm seeing a more than 2x improvement.
> My assumption is that after the first count() operation all of the data is
> already loaded in RAM and therefore subsequent transformations would
> have equivalent performance. I guess my assumption may be flawed.
>
> Yadid
>
>
