You can remove the cached rdd1 from the cache manager by calling
rdd1.unpersist(). But there is a subtlety here: RDD.cache() is *lazy* while
RDD.unpersist() is *eager*. When .cache() is called, it just tells the Spark
runtime to cache the RDD *later*, when a job that uses this RDD is
submitted; when .unpersist() is called, the cached RDD is removed
immediately. So you may want to do something like this to avoid rdd1 taking
too much memory:

val rdd1 = sc.textFile(path).cache()
val rdd2 = rdd1.filter(...).cache()
val rdd3 = rdd1.filter(...).cache()
// Trigger a job to materialize and cache rdd1, rdd2 & rdd3
(rdd2 ++ rdd3).count()
// Remove rdd1
rdd1.unpersist()
// Use rdd2 & rdd3 in later logic.

In this way, one additional job is required, but it gives you the chance to
evict rdd1 as early as possible.
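
If you want to double check that the eviction actually happened, the
SparkContext keeps track of the currently persisted RDDs. Below is a minimal
sketch (it assumes the sc, rdd1, rdd2 & rdd3 from the snippet above; the
printed counts are only illustrative):

import org.apache.spark.storage.StorageLevel

// After the count() job above, rdd1, rdd2 & rdd3 are materialized in the cache.
println(sc.getPersistentRDDs.size)                  // 3

// Evict rdd1; pass blocking = false if you don't want to wait for the blocks
// to be dropped.
rdd1.unpersist()
println(sc.getPersistentRDDs.size)                  // 2
println(rdd1.getStorageLevel == StorageLevel.NONE)  // true

// rdd2 & rdd3 remain cached and are reused without recomputation.
println(rdd2.count() + rdd3.count())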


On Wed, Apr 16, 2014 at 2:43 PM, Arpit Tak <arpit.sparku...@gmail.com> wrote:

> Hi Cheng,
>
> Is it possible to delete or replicate an rdd?
>
>
> > rdd1 = textFile("hdfs...").cache()
> >
> > rdd2 = rdd1.filter(userDefinedFunc1).cache()
> > rdd3 = rdd1.filter(userDefinedFunc2).cache()
>
> To reframe my question above: say rdd1 is around 50G and, after filtering,
> the result comes down to around 4G.
> To increase computing performance we just cache rdd1, while rdd2 and rdd3
> stay on disk.
> Will this perform better than running the filter from disk and then caching
> rdd2 and rdd3?
>
> Or can we also remove a particular rdd, say rdd1 (if cached), from the cache
> after the filter operation, since it is no longer required and we would save
> memory?
>
> Regards,
> Arpit
>
>
> On Tue, Apr 15, 2014 at 7:14 AM, Cheng Lian <lian.cs....@gmail.com> wrote:
>
>> Hi Joe,
>>
>> You need to figure out which RDD is used most frequently. In your case,
>> rdd2 & rdd3 are filtered results of rdd1, so usually they are relatively
>> smaller than rdd1, and it would be more reasonable to cache rdd2 and/or
>> rdd3 if rdd1 is not referenced elsewhere.
>>
>> Say rdd1 takes 10G and rdd2 takes 1G after filtering; if you cache both of
>> them, you end up with 11G of memory consumption, which might not be what
>> you want.
>>
>> Regards
>> Cheng
>>
>>
>> On Mon, Apr 14, 2014 at 8:32 PM, Joe L <selme...@yahoo.com> wrote:
>>
>>> Hi, I am trying to cache 2 GB of data and to implement the following
>>> procedure. To cache it, I did as follows. Is it necessary to cache rdd2
>>> since rdd1 is already cached?
>>>
>>> rdd1 = textFile("hdfs...").cache()
>>>
>>> rdd2 = rdd1.filter(userDefinedFunc1).cache()
>>> rdd3 = rdd1.filter(userDefinedFunc2).cache()
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Proper-caching-method-tp4206.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>
>>
>
