Hi all,

I am trying to understand how Spark's lazy evaluation works, and I need
some help. I have noticed that creating an RDD once and reusing it across
many actions does not trigger recomputation on every use, whereas creating
a new RDD each time an action is performed recomputes the whole lineage
again (the two approaches I am comparing are at the end of this message).
I would have expected both approaches to behave the same (i.e. no caching)
because of Spark's lazy evaluation strategy, but I guess Spark keeps track
of the RDDs used and of the partial results computed so far, so that it
avoids unnecessary extra work. Could anybody point me to where Spark
decides what to cache, or to how I can disable this behaviour?
Thanks in advance!


Renato M.

Approach 1 --> this doesn't trigger recomputation of the RDD in every
iteration
=========
JavaRDD aggrRel = Utils.readJavaRDD(...)
    .groupBy(groupFunction)
    .map(mapFunction);
for (int i = 0; i < NUM_RUNS; i++) {
   // doing some computation like aggrRel.count()
   . . .
}

Approach 2 --> this triggers recomputation of the RDD in every iteration
=========
for (int i = 0; i < NUM_RUNS; i++) {
   JavaRDD aggrRel = Utils.readJavaRDD(...)
       .groupBy(groupFunction)
       .map(mapFunction);
   // doing some computation like aggrRel.count()
   . . .
}
