Spark map performance question

Yadid Ayzenberg Tue, 10 Dec 2013 09:12:13 -0800


Hi All,

I'm trying to understand the performance results I'm getting for the following:


rdd = sc.newAPIHadoopRDD( ... )
rdd1 = rdd.keyBy( func1() )
rdd1.count()
rdd1.cache()

rdd2 = rdd1.map(func2())
rdd2.count()
rdd3 = rdd2.map(func2())
rdd3.count()

I would expect the two maps to be more or less equivalent in terms of runtime; however, I'm seeing a more than 2x improvement. My assumption is that after the first count() operation all of the data is already loaded in RAM, and therefore subsequent transformations would have equivalent performance. I guess my assumption may be flawed.
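
One thing that may matter here: cache() in Spark is lazy. It only marks the RDD for persistence, and the data is stored by the first action that runs afterwards. To check how that interacts with the call order above, here is a toy pure-Python model. It is only a sketch, not Spark itself: ToyRDD, read_source and source_reads are made-up names for illustration, and the lambdas stand in for func2.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: lazily computed, with an opt-in cache."""

    def __init__(self, compute):
        self._compute = compute          # zero-argument function producing the records
        self._cache_requested = False
        self._cached = None

    def cache(self):
        # Like Spark's cache(), this only marks the dataset; nothing is stored yet.
        self._cache_requested = True
        return self

    def _records(self):
        if self._cached is not None:
            return self._cached          # served from memory
        data = list(self._compute())     # recompute from the parent or the source
        if self._cache_requested:
            self._cached = data          # stored by the first action after cache()
        return data

    def map(self, f):
        # Lazy: the parent is only evaluated when an action runs on the child.
        return ToyRDD(lambda: (f(x) for x in self._records()))

    def count(self):
        return len(self._records())


source_reads = 0  # how many times the (slow) input was read


def read_source():
    global source_reads
    source_reads += 1
    return range(5)


rdd1 = ToyRDD(read_source)
rdd1.count()                        # reads the source; cache() not requested yet
rdd1.cache()
rdd2 = rdd1.map(lambda x: x + 1)
rdd2.count()                        # reads the source again and fills rdd1's cache
rdd3 = rdd2.map(lambda x: x + 1)
rdd3.count()                        # recomputes rdd2, but from rdd1's cached data
print(source_reads)                 # 2
```

In this model the count on rdd2 still pays for reading the input, while the count on rdd3 does not, even though it applies two maps. If the same holds for my job, calling rdd1.cache() before rdd1.count() should make the two timings comparable.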


Yadid
