You can remove the cached rdd1 from the cache manager by calling rdd1.unpersist(). But there is a subtlety here: RDD.cache() is *lazy* while RDD.unpersist() is *eager*. When .cache() is called, it just tells the Spark runtime to cache the RDD *later*, when a job that uses this RDD is submitted; when .unpersist() is called, the cached RDD is removed immediately. So you may want to do something like this to avoid rdd1 taking up too much memory:
val rdd1 = sc.textFile(path).cache()
val rdd2 = rdd1.filter(...).cache()
val rdd3 = rdd1.filter(...).cache()

// Trigger a job to materialize and cache rdd1, rdd2 & rdd3
(rdd2 ++ rdd3).count()

// Remove rdd1 from the cache
rdd1.unpersist()

// Use rdd2 & rdd3 for later logic.

In this way, an additional job is required, but it gives you the chance to evict rdd1 as early as possible.

On Wed, Apr 16, 2014 at 2:43 PM, Arpit Tak <arpit.sparku...@gmail.com> wrote:

> Hi Cheng,
>
> Is it possible to delete or replicate an RDD?
>
> rdd1 = textFile("hdfs...").cache()
>
> rdd2 = rdd1.filter(userDefinedFunc1).cache()
> rdd3 = rdd1.filter(userDefinedFunc2).cache()
>
> To reframe the question above: if rdd1 is around 50G and after filtering it
> comes to around, say, 4G, then to increase computing performance we just
> cache it, while rdd2 and rdd3 stay on disk. Would this show better
> performance than performing the filter on disk and then caching rdd2 and
> rdd3?
>
> Or can we also remove a particular RDD from the cache, say rdd1 (if
> cached), after the filter operation, since it is no longer required and we
> would save memory?
>
> Regards,
> Arpit
>
> On Tue, Apr 15, 2014 at 7:14 AM, Cheng Lian <lian.cs....@gmail.com> wrote:
>
>> Hi Joe,
>>
>> You need to figure out which RDD is used most frequently. In your case,
>> rdd2 & rdd3 are filtered results of rdd1, so usually they are relatively
>> smaller than rdd1, and it would be more reasonable to cache rdd2 and/or
>> rdd3 if rdd1 is not referenced elsewhere.
>>
>> Say rdd1 takes 10G and rdd2 takes 1G after filtering; if you cache both
>> of them, you end up with 11G of memory consumption, which might not be
>> what you want.
>>
>> Regards,
>> Cheng
>>
>>
>> On Mon, Apr 14, 2014 at 8:32 PM, Joe L <selme...@yahoo.com> wrote:
>>
>>> Hi, I am trying to cache 2 GB of data and to implement the following
>>> procedure. In order to cache them I did as follows. Is it necessary to
>>> cache rdd2 since rdd1 is already cached?
>>>
>>> rdd1 = textFile("hdfs...").cache()
>>>
>>> rdd2 = rdd1.filter(userDefinedFunc1).cache()
>>> rdd3 = rdd1.filter(userDefinedFunc2).cache()
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Proper-caching-method-tp4206.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
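
For contrast with the snippet at the top of this message, here is a minimal sketch (not from the original thread) of the alternative Cheng describes in the quoted reply above: skip caching rdd1 entirely and cache only the smaller filtered RDDs. It assumes an existing SparkContext `sc`; `path`, `isTypeA` and `isTypeB` are hypothetical placeholders for the input path and filter predicates.

// `path`, `isTypeA`, `isTypeB` are placeholders, not names from the thread.
val rdd1 = sc.textFile(path)             // read, but do not cache, the large input
val rdd2 = rdd1.filter(isTypeA).cache()  // cache only the small filtered results
val rdd3 = rdd1.filter(isTypeB).cache()

// One job materializes both cached RDDs. The input file is re-read from HDFS
// for each filter here, but rdd1 itself is never held in memory.
(rdd2 ++ rdd3).count()

Roughly, with the sizes mentioned above (rdd1 at 10G, each filtered RDD around 1G), this keeps about 2G in the cache instead of 11G or more, at the cost of scanning the input from disk once per filter.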