Hi Yash, that really helps me, great thanks.
On Thu, Mar 24, 2016 at 7:07 PM, yash datta <[email protected]> wrote:
> Yes, that is correct.
>
> When you call cache on an RDD, internally it calls
> persist(StorageLevel.MEMORY_ONLY), which in turn calls
> persist(StorageLevel.MEMORY_ONLY, allowOverride = false) if the RDD is not
> marked for local checkpointing.
>
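> For reference, the call chain looks roughly like this (a sketch based on
> Spark's RDD.scala around the 1.6 line; the exact source may differ slightly
> in your version):
>
> def persist(newLevel: StorageLevel): this.type = {
>   if (isLocallyCheckpointed) {
>     // localCheckpoint() already marked this RDD for persisting; override
>     // the old level with one adapted for checkpointing.
>     persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
>   } else {
>     persist(newLevel, allowOverride = false)
>   }
> }
>
> def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
> def cache(): this.type = persist()
>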
> Below is what is finally triggered:
>
> /**
>  * Mark this RDD for persisting using the specified level.
>  *
>  * @param newLevel the target storage level
>  * @param allowOverride whether to override any existing level with the new one
>  */
> private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
>   // TODO: Handle changes of StorageLevel
>   if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) {
>     throw new UnsupportedOperationException(
>       "Cannot change storage level of an RDD after it was already assigned a level")
>   }
>   // If this is the first time this RDD is marked for persisting, register it
>   // with the SparkContext for cleanups and accounting. Do this only once.
>   if (storageLevel == StorageLevel.NONE) {
>     sc.cleaner.foreach(_.registerRDDForCleanup(this))
>     sc.persistRDD(this)
>   }
>   storageLevel = newLevel
>   this
> }
>
> As is clear from the code, persistRDD is called only when the storage level
> for the RDD was never set, so it is called only once even across multiple
> cache/persist calls on the same RDD.
> Also, persistRDD only sets an entry in the persistentRdds map, which is
> keyed by RDD id:
>
> /**
>  * Register an RDD to be persisted in memory and/or disk storage
>  */
> private[spark] def persistRDD(rdd: RDD[_]) {
>   persistentRdds(rdd.id) = rdd
> }
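>
> You can see this behavior directly in the shell. Here is a minimal sketch
> (it assumes a running spark-shell with the usual SparkContext bound to sc):
>
> import org.apache.spark.storage.StorageLevel
>
> val rdd = sc.parallelize(1 to 100)
> rdd.cache()                            // registers the RDD once, level = MEMORY_ONLY
> rdd.cache()                            // no-op: newLevel == storageLevel
> rdd.persist(StorageLevel.MEMORY_ONLY)  // also a no-op for the same reason
>
> // Asking for a *different* level after one was assigned throws
> // UnsupportedOperationException:
> // rdd.persist(StorageLevel.MEMORY_AND_DISK)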
>
> Hope this helps.
>
> Best
> Yash
>
> On Thu, Mar 24, 2016 at 1:58 PM, charles li <[email protected]>
> wrote:
>
>>
>> I happened to see this problem on Stack Overflow:
>> http://stackoverflow.com/questions/36195105/what-happens-if-i-cache-the-same-rdd-twice-in-spark/36195812#36195812
>>
>>
>> I think it's very interesting. The answer posted by Aaron sounds
>> promising, but I'm not sure, and I couldn't find details on how caching
>> works internally in Spark, so I'm posting here to ask everyone about the
>> internals of how cache is implemented.
>>
>> great thanks.
>>
>>
>> -----Aaron's answer to that question [is that right?]-----
>>
>> Nothing happens; it will just cache the RDD once. The reason, I think, is
>> that every RDD has an id internally, and Spark uses that id to mark
>> whether an RDD has been cached or not, so caching one RDD multiple times
>> does nothing.
>> -----------
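>>
>> For what it's worth, a tiny check like the following (a sketch; assumes
>> spark-shell with the usual sc) would show the id and storage level that
>> Aaron's answer refers to:
>>
>> val rdd = sc.parallelize(1 to 10)
>> rdd.id               // the internal id Aaron mentions
>> rdd.getStorageLevel  // StorageLevel.NONE before any caching
>> rdd.cache()
>> rdd.getStorageLevel  // now MEMORY_ONLY
>> rdd.cache()          // a second cache() call changes nothing
>> rdd.getStorageLevel  // still MEMORY_ONLY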
>>
>>
>>
>> --
>> *--------------------------------------*
>> a spark lover, a quant, a developer and a good man.
>>
>> http://github.com/litaotao
>>
>
>
>
> --
> When events unfold with calm and ease
> When the winds that blow are merely breeze
> Learn from nature, from birds and bees
> Live your life in love, and let joy not cease.
>
--
*--------------------------------------*
a spark lover, a quant, a developer and a good man.
http://github.com/litaotao