Re: Spark map performance question

Yadid Ayzenberg Tue, 10 Dec 2013 09:28:37 -0800

Thanks Mark, that cleared things up for me.
I applied the cache() before the count() and now it's behaving as expected.


I really appreciate the fast response.

Yadid




On 12/10/13 12:20 PM, Mark Hamstra wrote:

You're not marking rdd1 as cached (actually, to-be-cached-after-next-evaluation) until after rdd1.count; so when you hit rdd2.count, rdd1 is not yet cached (no action has been performed on it since it was marked as cached) and has to be completely re-evaluated. On the other hand, by the time you hit rdd3.count, there has been an action forcing evaluation of rdd1 after it was marked as cached, so rdd1 does not need to be re-evaluated within the rdd3.count job.
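
To make the ordering concrete, here is a toy model in plain Python. It is not Spark and not how Spark is implemented; ToyRDD and run are hypothetical names made up for this sketch. It assumes only the behaviour described above: cache() merely marks a dataset, and the data is stored the next time an action forces evaluation. It counts how many times rdd1 is computed from scratch under each ordering.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, for illustration only)."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument function producing a list
        self._cache_requested = False  # set by cache(); nothing is stored yet
        self._cached = None            # filled in on the next evaluation
        self.evaluations = 0           # times this dataset was computed from scratch

    def cache(self):
        # Only marks the dataset as to-be-cached-after-next-evaluation.
        self._cache_requested = True
        return self

    def _evaluate(self):
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        data = self._compute()
        if self._cache_requested:
            self._cached = data
        return data

    def map(self, fn):
        # Transformation: lazy, just records a dependency on this dataset.
        return ToyRDD(lambda: [fn(x) for x in self._evaluate()])

    def count(self):
        # Action: forces evaluation of the whole lineage.
        return len(self._evaluate())


def run(cache_before_count):
    """Replay the job sequence from the question; return how often rdd1 was computed."""
    rdd1 = ToyRDD(lambda: list(range(5)))
    if cache_before_count:
        rdd1.cache()
        rdd1.count()
    else:
        rdd1.count()
        rdd1.cache()
    rdd2 = rdd1.map(lambda x: x + 1)
    rdd2.count()
    rdd3 = rdd2.map(lambda x: x + 1)
    rdd3.count()
    return rdd1.evaluations


if __name__ == "__main__":
    print("count() then cache():", run(cache_before_count=False))
    print("cache() then count():", run(cache_before_count=True))
```

In this model, the count-then-cache ordering computes rdd1 twice (once for rdd1.count and again inside the rdd2.count job), while cache-then-count computes it only once.
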
On Tue, Dec 10, 2013 at 9:11 AM, Yadid Ayzenberg <ya...@media.mit.edu> wrote:
    Hi All,

    I'm trying to understand the performance results I'm getting for
    the following:

    rdd = sc.newAPIHadoopRDD( ... )
    rdd1 = rdd.keyBy( func1() )
    rdd1.count()
    rdd1.cache()

    rdd2 = rdd1.map(func2())
    rdd2.count()
    rdd3 = rdd2.map(func2())
    rdd3.count()

    I would expect the 2 maps to be more or less equivalent in terms
    of runtime; however, I'm seeing a more than 2x improvement.
    My assumption is that after the first count() operation all of the
    data is already loaded in RAM and therefore subsequent
    transformations would have equivalent performance. I guess my
    assumption may be flawed.

    Yadid
