Thanks Mark, that cleared things up for me.
I applied the cache() before the count() and now it's behaving as expected.

I really appreciate the fast response.

Yadid




On 12/10/13 12:20 PM, Mark Hamstra wrote:
You're not marking rdd1 as cached (actually, to-be-cached-after-next-evaluation) until after rdd1.count; so when you hit rdd2.count, rdd1 is not yet cached (no action has been performed on it since it was marked as cached) and has to be completely re-evaluated. On the other hand, by the time you hit rdd3.count, there has been an action forcing evaluation of rdd1 after it was marked as cached, so rdd1 does not need to be re-evaluated within the rdd3.count job.
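
To make the ordering effect concrete, here is a toy model in plain Python. It is not Spark: LazyDataset, its evaluations counter, and the run() helper are all made up for illustration, and the only thing it tries to mimic is the "marked now, stored on the next evaluation" behaviour described above.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not a Spark class)."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function that produces a list
        self._marked = False      # set by cache(); nothing is stored yet
        self._stored = None       # filled by the first evaluation after cache()
        self.evaluations = 0      # how many times the data was recomputed

    def cache(self):
        # Only marks the dataset; the data is kept on the next evaluation.
        self._marked = True
        return self

    def _evaluate(self):
        if self._stored is not None:
            return self._stored
        self.evaluations += 1
        data = self._compute()
        if self._marked:
            self._stored = data
        return data

    def map(self, f):
        # Lazy: the parent is only evaluated when an action runs.
        return LazyDataset(lambda: [f(x) for x in self._evaluate()])

    def count(self):
        # An action: forces evaluation.
        return len(self._evaluate())


def run(cache_first):
    """Return how many times rdd1 was computed across the three counts."""
    rdd1 = LazyDataset(lambda: list(range(5)))
    if cache_first:
        rdd1.cache()
        rdd1.count()
    else:
        rdd1.count()
        rdd1.cache()
    rdd2 = rdd1.map(lambda x: x + 1)
    rdd2.count()
    rdd3 = rdd2.map(lambda x: x + 1)
    rdd3.count()
    return rdd1.evaluations


print(run(cache_first=False))  # 2: rdd2.count() has to recompute rdd1
print(run(cache_first=True))   # 1: rdd1 is stored during its own count()
```

In this sketch, calling count() before cache() costs one extra full evaluation of rdd1, and that extra work lands in the rdd2.count() job, which is why that job looks slower than rdd3.count().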


On Tue, Dec 10, 2013 at 9:11 AM, Yadid Ayzenberg <ya...@media.mit.edu> wrote:


    Hi All,

    I'm trying to understand the performance results I'm getting for
    the following:

    rdd = sc.newAPIHadoopRDD( ... )
    rdd1 = rdd.keyBy( func1() )
    rdd1.count()
    rdd1.cache()

    rdd2 = rdd1.map(func2())
    rdd2.count()
    rdd3 = rdd2.map(func2())
    rdd3.count()

    I would expect the two maps to be more or less equivalent in terms
    of runtime; however, I'm seeing a more than 2x improvement in the
    second one.
    My assumption is that after the first count() operation all of the
    data is already loaded in RAM, and therefore subsequent
    transformations would have equivalent performance. I guess my
    assumption may be flawed.

    Yadid


