Hi All,

I'm trying to understand the performance results I'm getting for the following:

rdd = sc.newAPIHadoopRDD( ... )
rdd1 = rdd.keyBy( func1() )
rdd1.count()
rdd1.cache()

rdd2 = rdd1.map(func2())
rdd2.count()
rdd3 = rdd2.map(func2())
rdd3.count()

I would expect the two maps to be more or less equivalent in terms of runtime; however, I'm seeing a more than 2x improvement between them. My assumption is that after the first count() operation all of the data is already loaded in RAM, and therefore subsequent transformations would have equivalent performance. I guess my assumption may be flawed.
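
One way to sanity-check that assumption is with a toy model in plain Python. This is only a sketch, not Spark's implementation: ToyRDD, read_source and source_reads are hypothetical names. It assumes that cache() is lazy, i.e. it only marks the dataset for storage and nothing is actually stored until the next action computes it. Under that assumption, a count() issued before cache() leaves nothing in memory:

```python
class ToyRDD:
    """Toy stand-in for an RDD: map() is lazy, cache() only sets a flag."""

    def __init__(self, compute):
        self._compute = compute    # function that produces this dataset's rows
        self._want_cache = False   # set by cache(); nothing is stored yet
        self._stored = None        # filled in by the first action after cache()

    def cache(self):
        self._want_cache = True
        return self

    def map(self, f):
        # Lazy: the parent is only evaluated when an action runs.
        return ToyRDD(lambda: [f(x) for x in self._rows()])

    def _rows(self):
        if self._stored is not None:
            return self._stored
        rows = self._compute()
        if self._want_cache:
            self._stored = rows
        return rows

    def count(self):
        return len(self._rows())


source_reads = []  # one entry per full read of the (hypothetical) input


def read_source():
    source_reads.append(1)
    return list(range(5))


rdd1 = ToyRDD(read_source).map(lambda x: (x % 2, x))
rdd1.count()                     # reads the source; cache() not requested yet
rdd1.cache()                     # only sets the flag
rdd2 = rdd1.map(lambda kv: kv)
rdd2.count()                     # reads the source again, rdd1 is now stored
rdd3 = rdd2.map(lambda kv: kv)
rdd3.count()                     # rdd1 comes from memory, no source read
print(len(source_reads))
```

In this model the second count() still pays for the full read and only the third one starts from stored data, which would be consistent with the gap I'm seeing. Calling cache() before the first count() should remove the difference in the model.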

Yadid
