Hi all,

I have a newbie question about Spark's StorageLevel. I came across these sentences in the Spark documentation:
"If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible." And: "Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method."

http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence

But I found that the default storageLevel is NONE in the source code, and if I never call persist(someLevel), that value stays NONE. The iterator method goes to:

  final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    if (storageLevel != StorageLevel.NONE) {
      SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
    } else {
      computeOrReadCheckpoint(split, context)
    }
  }

Does that mean the RDDs are cached in memory (or somewhere else) if and only if persist or cache is called explicitly, and otherwise they are re-computed every time, even in an iterative situation? This confused me, because my first impression was that Spark is super-fast because it prefers to store intermediate results in memory automatically.

Forgive me if I asked a stupid question.

Regards,
Kang Liu
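
P.S. To make my question concrete, here is a minimal sketch of what I mean. The input path and names are made up for illustration, and the comments only describe what I *think* happens based on my reading of iterator():

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.storage.StorageLevel

  object CacheSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("cache-sketch"))

      // No persist/cache: storageLevel stays NONE, so (as I read iterator())
      // every action replays the whole lineage from the text file.
      val words = sc.textFile("hdfs:///some/input.txt").flatMap(_.split(" "))
      words.count()   // computed from scratch
      words.count()   // computed from scratch again?

      // Explicit persist: cache() is the same as persist(StorageLevel.MEMORY_ONLY),
      // so the second action should read the partitions back from memory.
      val cached = words.persist(StorageLevel.MEMORY_ONLY)
      cached.count()      // computed and cached
      cached.count()      // served from the cache
      cached.unpersist()  // manual removal, as the docs describe

      sc.stop()
    }
  }

Is that the right mental model?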