Hi all,

I have a newbie question about Spark's StorageLevel. I came across these 
sentences in the Spark documentation:

If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), 
leave them that way. This is the most CPU-efficient option, allowing operations 
on the RDDs to run as fast as possible.
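If I read the API correctly, calling 'cache()' is just shorthand for 'persist()' 
with that default level, i.e. these two lines should be equivalent:

    rdd.cache()                               // my understanding: same as the line below
    rdd.persist(StorageLevel.MEMORY_ONLY)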

And

Spark automatically monitors cache usage on each node and drops out old data 
partitions in a least-recently-used (LRU) fashion. If you would like to 
manually remove an RDD instead of waiting for it to fall out of the cache, use 
the RDD.unpersist() method.
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence 
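So, assuming 'rdd' was persisted earlier, dropping it by hand would look 
something like this:

    rdd.unpersist(blocking = true)    // drop the cached blocks now instead of waiting for LRU eviction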

But I found that the default storageLevel is NONE in the source code, and if I 
never call 'persist(somelevel)', that value stays NONE. The 'iterator' method 
then goes to:

final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    if (storageLevel != StorageLevel.NONE) {
        // a storage level was set via persist()/cache(), so go through the cache manager
        SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
    } else {
        // no storage level set: compute the partition (or read the checkpoint) every time
        computeOrReadCheckpoint(split, context)
    }
}
Does that mean RDDs are cached in memory (or somewhere else) if and only if 
'persist' or 'cache' is called explicitly, and otherwise they are re-computed 
every time, even in an iterative situation?
This confused me because my first impression was that Spark is super fast 
because it automatically keeps intermediate results in memory.
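To make the question concrete, here is a small sketch of what I mean (the input 
path and the transformation are just placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("cache-test").setMaster("local[*]"))

    // pretend this lineage is expensive to compute
    val expensive = sc.textFile("hdfs://.../input").map(line => line.split(",").length)

    // without persist/cache: both actions re-run the whole lineage from the text file (as I understand it)
    expensive.count()
    expensive.count()

    // with persist/cache: the first action materialises the partitions, the second reuses them
    expensive.persist(StorageLevel.MEMORY_ONLY)   // same as expensive.cache()
    expensive.count()
    expensive.count()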

Forgive me if this is a stupid question.

Regards,
Kang Liu
