Hi All,
I'm trying to understand the performance results I'm getting for the
following:
rdd = sc.newAPIHadoopRDD( ... )
rdd1 = rdd.keyBy( func1() )
rdd1.count()
rdd1.cache()
rdd2 = rdd1.map(func2())
rdd2.count()
rdd3 = rdd2.map(func2())
rdd3.count()
I would expect the two maps to be more or less equivalent in terms of
runtime; however, I'm seeing a more than 2x improvement.
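For reference, here is a minimal sketch of how each count could be timed. The `timed` helper is hypothetical (it is not part of Spark), and a plain Python generator stands in for the RDD action so the snippet runs without a cluster.

```python
import time


def timed(label, action):
    """Run a zero-argument callable once; return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed


# Stand-in for an RDD action (assumption: no Spark available here).
# Against a real job this would be e.g. timed("rdd2.count", rdd2.count).
n, secs = timed("stand-in count", lambda: sum(1 for _ in range(100_000)))
```

Timing each action separately (rdd1.count, rdd2.count, rdd3.count) should make it easier to see which step the difference comes from.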
My assumption is that after the first count() operation all of the data
is already loaded in RAM, and therefore subsequent transformations
would have equivalent performance. I guess my assumption may be flawed.
Yadid