Yes, that is correct. When you call cache() on an RDD, it internally calls persist(StorageLevel.MEMORY_ONLY), which in turn calls persist(StorageLevel.MEMORY_ONLY, allowOverride = false) as long as the RDD is not marked for local checkpointing.
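To make that concrete, here is a minimal sketch (the app name and sample data are made up for illustration) showing that a second cache() call is a harmless no-op:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("cache-twice-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

val rdd = sc.parallelize(1 to 100)
rdd.cache()                    // first call: registers the RDD and marks it MEMORY_ONLY
rdd.cache()                    // second call: no-op, the level is already MEMORY_ONLY
println(rdd.getStorageLevel)   // still the MEMORY_ONLY storage level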
Below is what finally gets triggered:

/**
 * Mark this RDD for persisting using the specified level.
 *
 * @param newLevel the target storage level
 * @param allowOverride whether to override any existing level with the new one
 */
private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
  // TODO: Handle changes of StorageLevel
  if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) {
    throw new UnsupportedOperationException(
      "Cannot change storage level of an RDD after it was already assigned a level")
  }
  // If this is the first time this RDD is marked for persisting, register it
  // with the SparkContext for cleanups and accounting. Do this only once.
  if (storageLevel == StorageLevel.NONE) {
    sc.cleaner.foreach(_.registerRDDForCleanup(this))
    sc.persistRDD(this)
  }
  storageLevel = newLevel
  this
}

As the code shows, persistRDD is called only when the RDD's storage level has never been set, so it runs at most once no matter how many times you cache the same RDD (a quick demonstration from user code is in the P.S. below the quoted message). Also, persistRDD merely sets an entry in the persistentRdds map, which is keyed by RDD id, so registering the same RDD again would just overwrite its own entry:

/**
 * Register an RDD to be persisted in memory and/or disk storage
 */
private[spark] def persistRDD(rdd: RDD[_]) {
  persistentRdds(rdd.id) = rdd
}

Hope this helps.

Best,
Yash

On Thu, Mar 24, 2016 at 1:58 PM, charles li <charles.up...@gmail.com> wrote:

> I happened to see this problem on Stack Overflow:
> http://stackoverflow.com/questions/36195105/what-happens-if-i-cache-the-same-rdd-twice-in-spark/36195812#36195812
>
> I think it's very interesting, and the answer posted by Aaron sounds
> promising, but I'm not sure, and I couldn't find details on the caching
> principle in Spark, so I'm posting here to ask everyone about the
> internals of how cache is implemented.
>
> Great thanks.
>
> ----- aaron's answer to that question [Is that right?] -----
>
> Nothing happens; it will just cache the RDD once. The reason, I think,
> is that every RDD has an id internally, and Spark will use the id to mark
> whether an RDD has been cached or not, so caching one RDD multiple times
> will do nothing.
> -----------
>
> --
> *--------------------------------------*
> a spark lover, a quant, a developer and a good man.
>
> http://github.com/litaotao
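P.S. If it is useful, here is a small sketch of that guard in persist seen from the user's side (assuming the SparkContext sc from the sketch above; the variable name nums is made up):

import org.apache.spark.storage.StorageLevel

val nums = sc.parallelize(Seq(1, 2, 3))
nums.persist(StorageLevel.MEMORY_ONLY)    // equivalent to nums.cache()
nums.persist(StorageLevel.MEMORY_ONLY)    // allowed: the level is unchanged, nothing happens
// nums.persist(StorageLevel.DISK_ONLY)   // would throw UnsupportedOperationException:
//                                        // "Cannot change storage level of an RDD
//                                        //  after it was already assigned a level"
nums.unpersist()                          // resets the level to StorageLevel.NONE
nums.persist(StorageLevel.DISK_ONLY)      // now a different level is accepted

The unpersist-then-persist step at the end shows the only supported way to change a storage level that has already been assigned.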