hi yash, that really helps me, great thanks.

On Thu, Mar 24, 2016 at 7:07 PM, yash datta <sau...@gmail.com> wrote:
> Yes, that is correct.
>
> When you call cache on an RDD, internally it calls
> persist(StorageLevel.MEMORY_ONLY), which further calls
> persist(StorageLevel.MEMORY_ONLY, allowOverride = false) if the RDD is not
> marked for localCheckpointing.
>
> Below is what is finally triggered:
>
>   /**
>    * Mark this RDD for persisting using the specified level.
>    *
>    * @param newLevel the target storage level
>    * @param allowOverride whether to override any existing level with the new one
>    */
>   private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
>     // TODO: Handle changes of StorageLevel
>     if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) {
>       throw new UnsupportedOperationException(
>         "Cannot change storage level of an RDD after it was already assigned a level")
>     }
>     // If this is the first time this RDD is marked for persisting, register it
>     // with the SparkContext for cleanups and accounting. Do this only once.
>     if (storageLevel == StorageLevel.NONE) {
>       sc.cleaner.foreach(_.registerRDDForCleanup(this))
>       sc.persistRDD(this)
>     }
>     storageLevel = newLevel
>     this
>   }
>
> As is clear from the code, persistRDD is called only when the storageLevel of
> the RDD was never set (so it will be called only once across multiple calls
> on the same RDD).
> Also, persistRDD only sets an entry in the persistentRdds map, which is keyed
> by RDD id:
>
>   /**
>    * Register an RDD to be persisted in memory and/or disk storage
>    */
>   private[spark] def persistRDD(rdd: RDD[_]) {
>     persistentRdds(rdd.id) = rdd
>   }
>
> Hope this helps.
>
> Best
> Yash
>
> On Thu, Mar 24, 2016 at 1:58 PM, charles li <charles.up...@gmail.com> wrote:
>
>> happened to see this problem on stackoverflow:
>> http://stackoverflow.com/questions/36195105/what-happens-if-i-cache-the-same-rdd-twice-in-spark/36195812#36195812
>>
>> I think it's very interesting, and the answer posted by Aaron sounds
>> promising, but I'm not sure, and I can't find the details of the caching
>> mechanism in Spark, so I'm posting here to ask everyone about the
>> internal principle behind the implementation of cache.
>>
>> great thanks.
>>
>> ----- aaron's answer to that question [Is that right?] -----
>>
>> Nothing happens; it will just cache the RDD once. The reason, I think,
>> is that every RDD has an id internally; Spark uses the id to mark
>> whether an RDD has been cached or not, so caching one RDD multiple
>> times will do nothing.
>> -----------
>>
>> --
>> *--------------------------------------*
>> a spark lover, a quant, a developer and a good man.
>>
>> http://github.com/litaotao
>
> --
> When events unfold with calm and ease
> When the winds that blow are merely breeze
> Learn from nature, from birds and bees
> Live your life in love, and let joy not cease.
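
P.S. The guard logic quoted above can be sketched as a small standalone model. This is not the real Spark code: FakeRDD, the plain string storage levels, and the registrations counter are simplified stand-ins for RDD, StorageLevel, and the sc.persistRDD bookkeeping, just to make the "register once, never silently change the level" behaviour runnable outside Spark.

```scala
// Minimal model of RDD.persist's guard logic (simplified stand-ins, not Spark types).
object CacheDemo {
  var registrations = 0 // stands in for sc.persistRDD bookkeeping

  class FakeRDD {
    var storageLevel: String = "NONE"

    def persist(newLevel: String, allowOverride: Boolean = false): this.type = {
      // Changing an already-assigned level without allowOverride is an error
      if (storageLevel != "NONE" && newLevel != storageLevel && !allowOverride)
        throw new UnsupportedOperationException(
          "Cannot change storage level of an RDD after it was already assigned a level")
      // Register only on the first persist call
      if (storageLevel == "NONE") registrations += 1
      storageLevel = newLevel
      this
    }

    def cache(): this.type = persist("MEMORY_ONLY")
  }

  def main(args: Array[String]): Unit = {
    val rdd = new FakeRDD
    rdd.cache()
    rdd.cache() // no-op: level is already MEMORY_ONLY, nothing registered twice
    println(s"registrations = ${registrations}")
    try {
      rdd.persist("MEMORY_AND_DISK") // different level, no override: throws
    } catch {
      case e: UnsupportedOperationException => println("caught: " + e.getMessage)
    }
  }
}
```

Running main shows registration happens once across both cache() calls, and that asking for a different level afterwards throws rather than silently re-caching.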