Yes, that is correct. When you call cache() on an RDD, it internally calls persist(StorageLevel.MEMORY_ONLY), which in turn calls persist(StorageLevel.MEMORY_ONLY, allowOverride = false) as long as the RDD is not marked for local checkpointing.
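To make that concrete, here is a minimal sketch (the app name and sample data are made up for illustration) showing that a second cache() call is a harmless no-op:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("cache-twice-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

val rdd = sc.parallelize(1 to 100)
rdd.cache()                    // first call: registers the RDD and marks it MEMORY_ONLY
rdd.cache()                    // second call: no-op, the level is already MEMORY_ONLY
println(rdd.getStorageLevel)   // still the MEMORY_ONLY storage level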
Below is what finally gets triggered:

/**
 * Mark this RDD for persisting using the specified level.
 *
 * @param newLevel the target storage level
 * @param allowOverride whether to override any existing level with the new one
 */
private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
  // TODO: Handle changes of StorageLevel
  if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) {
    throw new UnsupportedOperationException(
      "Cannot change storage level of an RDD after it was already assigned a level")
  }
  // If this is the first time this RDD is marked for persisting, register it
  // with the SparkContext for cleanups and accounting. Do this only once.
  if (storageLevel == StorageLevel.NONE) {
    sc.cleaner.foreach(_.registerRDDForCleanup(this))
    sc.persistRDD(this)
  }
  storageLevel = newLevel
  this
}

As the code shows, persistRDD is called only when the RDD's storage level has never been set, so it runs at most once no matter how many times you cache the same RDD (a quick demonstration from user code is in the P.S. below the quoted message). Also, persistRDD merely sets an entry in the persistentRdds map, which is keyed by RDD id, so registering the same RDD again would just overwrite its own entry:

/**
 * Register an RDD to be persisted in memory and/or disk storage
 */
private[spark] def persistRDD(rdd: RDD[_]) {
  persistentRdds(rdd.id) = rdd
}

Hope this helps.

Best,
Yash

On Thu, Mar 24, 2016 at 1:58 PM, charles li <charles.up...@gmail.com> wrote:

> I happened to see this problem on Stack Overflow:
> http://stackoverflow.com/questions/36195105/what-happens-if-i-cache-the-same-rdd-twice-in-spark/36195812#36195812
>
> I think it's very interesting, and the answer posted by Aaron sounds
> promising, but I'm not sure, and I couldn't find details on the caching
> principle in Spark, so I'm posting here to ask everyone about the
> internals of how cache is implemented.
>
> Great thanks.
>
> ----- aaron's answer to that question [Is that right?] -----
>
> Nothing happens; it will just cache the RDD once. The reason, I think,
> is that every RDD has an id internally, and Spark will use the id to mark
> whether an RDD has been cached or not, so caching one RDD multiple times
> will do nothing.
> -----------
>
> --
> *--------------------------------------*
> a spark lover, a quant, a developer and a good man.
>
> http://github.com/litaotao
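P.S. If it is useful, here is a small sketch of that guard in persist seen from the user's side (assuming the SparkContext sc from the sketch above; the variable name nums is made up):

import org.apache.spark.storage.StorageLevel

val nums = sc.parallelize(Seq(1, 2, 3))
nums.persist(StorageLevel.MEMORY_ONLY)    // equivalent to nums.cache()
nums.persist(StorageLevel.MEMORY_ONLY)    // allowed: the level is unchanged, nothing happens
// nums.persist(StorageLevel.DISK_ONLY)   // would throw UnsupportedOperationException:
//                                        // "Cannot change storage level of an RDD
//                                        //  after it was already assigned a level"
nums.unpersist()                          // resets the level to StorageLevel.NONE
nums.persist(StorageLevel.DISK_ONLY)      // now a different level is accepted

The unpersist-then-persist step at the end shows the only supported way to change a storage level that has already been assigned.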