Good question :)

Although the RDD DAG is lazily evaluated, it’s not exactly the same as a
Scala lazy val. For a Scala lazy val, the evaluated value is automatically
cached, while evaluated RDD elements are not cached unless you call .cache()
explicitly, because materializing an RDD can often be expensive. Take local
file reading as an analogy:

val v0 = sc.textFile("input.log").cache()

is similar to a lazy val

lazy val u0 = Source.fromFile("input.log").mkString


val v1 = sc.textFile("input.log")

is similar to a function

def u1 = Source.fromFile("input.log").mkString

Think of it this way: if you want to “reuse” the evaluated elements, you
have to cache those elements somewhere. Without caching, you have to
re-evaluate the RDD, and the semantics of an uncached RDD simply degrade to
those of a function rather than a lazy val.
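
To make the difference concrete, here is a minimal sketch in plain Scala (no
Spark needed). The expensiveLoad() helper and the evaluations counter are
hypothetical stand-ins I made up for reading input.log; they only count how
often the "expensive" work actually runs:

```scala
object LazyValVsDef {
  // Counts how many times the "expensive" computation actually runs.
  var evaluations = 0

  // Hypothetical stand-in for reading and materializing input.log.
  def expensiveLoad(): String = {
    evaluations += 1
    "ERROR disk full\nINFO started"
  }

  // Like a cached RDD: computed on first access, then reused.
  lazy val cached: String = expensiveLoad()

  // Like an uncached RDD: recomputed on every reference.
  def uncached: String = expensiveLoad()

  def main(args: Array[String]): Unit = {
    // Two references to the lazy val trigger a single evaluation.
    val viaLazyVal = cached.length + cached.length
    println(s"evaluations after two lazy val reads: $evaluations") // 1

    // Two references to the def trigger two more evaluations.
    val viaDef = uncached.length + uncached.length
    println(s"evaluations after two def reads: $evaluations") // 3
  }
}
```

If expensiveLoad() returned something random, the two reads through the def
could also disagree with each other, which is roughly what happens when an
uncached RDD with nondeterministic steps is referenced more than once.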

On Wed, Apr 23, 2014 at 4:00 PM, Mayur Rustagi wrote:

> Shouldn't the DAG optimizer optimize these routines? Sorry if it's a dumb
> question :)
> Mayur Rustagi
> On Wed, Apr 23, 2014 at 12:29 PM, Cheng Lian wrote:
>> Without caching, an RDD will be evaluated multiple times if referenced
>> multiple times by other RDDs. A silly example:
>> val text = sc.textFile("input.log")
>> val r1 = text.filter(_ startsWith "ERROR")
>> val r2 = text.flatMap(_ split " ")
>> val r3 = (r1 ++ r2).collect()
>> Here the input file will be scanned twice unless you call .cache() on
>> text. So if your computation involves nondeterminism (e.g. random
>> numbers), you may get different results.
>> On Tue, Apr 22, 2014 at 11:30 AM, randylu wrote:
>>> It's OK when I call doc_topic_dist.cache() first.