Which storage level are you using? I am guessing it is MEMORY_ONLY. For large datasets, MEMORY_AND_DISK or MEMORY_AND_DISK_SER works better.
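For reference, a minimal sketch of switching the storage level and releasing the cache afterwards (the master URL, input path, and transformations are placeholders for your own pipeline):

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    // Standalone sketch; master URL and input path are placeholders.
    val sc = new SparkContext("local[2]", "StorageLevelSketch")
    val lines = sc.textFile("hdfs:///path/to/1g-input.txt")

    // MEMORY_AND_DISK_SER keeps partitions serialized in memory and spills
    // the overflow to disk instead of dropping or recomputing them.
    val intermediate = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
    intermediate.persist(StorageLevel.MEMORY_AND_DISK_SER)

    intermediate.count()                    // first action materializes and caches the RDD
    intermediate.filter(_._2 > 1).count()   // later actions reuse the cached partitions

    intermediate.unpersist()                // drop it from the cache once you are done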
You can call unpersist() on an RDD to remove it from the cache, though.

On Thu, Mar 27, 2014 at 11:57 AM, Sai Prasanna <ansaiprasa...@gmail.com> wrote:

> No, I am running on 0.8.1.
> Yes, I am caching a lot. I am benchmarking a simple piece of code in Spark
> in which 512 MB, 1 GB and 2 GB text files are taken, some basic intermediate
> operations are performed, and the intermediate results that will be used in
> subsequent operations are cached.
>
> I thought that we need not manually unpersist: if I need to cache something
> and the cache is found full, space will automatically be created by evicting
> earlier entries. Do I need to unpersist?
>
> Moreover, if I run several times, will the previously cached RDDs still
> remain in the cache? If so, can I flush them out manually before the next
> run? [Something like a complete cache flush.]
>
>
> On Thu, Mar 27, 2014 at 11:16 PM, Andrew Or <and...@databricks.com> wrote:
>
>> Are you caching a lot of RDDs? If so, maybe you should unpersist() the
>> ones that you're not using. Also, if you're on 0.9, make sure
>> spark.shuffle.spill is enabled (which it is by default). This allows your
>> application to spill in-memory content to disk if necessary.
>>
>> How much memory are you giving to your executors? The default,
>> spark.executor.memory, is 512m, which is quite low. Consider raising this.
>> Checking the web UI is a good way to figure out your runtime memory usage.
>>
>>
>> On Thu, Mar 27, 2014 at 9:22 AM, Ognen Duzlevski
>> <og...@plainvanillagames.com> wrote:
>>
>>> Look at the tuning guide on Spark's web page for strategies to cope with
>>> this. I have run into quite a few memory issues like these; some are
>>> resolved by changing the StorageLevel strategy and employing things like
>>> Kryo, and some are solved by specifying the number of tasks to break a
>>> given operation down into, etc.
>>>
>>> Ognen
>>>
>>>
>>> On 3/27/14, 10:21 AM, Sai Prasanna wrote:
>>>
>>> "java.lang.OutOfMemoryError: GC overhead limit exceeded"
>>>
>>> What is the problem? With the same code, one run finishes in 8 seconds,
>>> and the next run takes a really long time, say 300-500 seconds...
>>> In the logs I see "GC overhead limit exceeded" a lot. What should be done?
>>>
>>> Can someone please throw some light on it?
>>>
>>>
>>> --
>>> Sai Prasanna. AN
>>> II M.Tech (CS), SSSIHL
>>>
>>> Entire water in the ocean can never sink a ship, unless it gets inside.
>>> All the pressures of life can never hurt you, unless you let them in.
>>>
>>
>
>
> --
> Sai Prasanna. AN
> II M.Tech (CS), SSSIHL
>
> Entire water in the ocean can never sink a ship, unless it gets inside.
> All the pressures of life can never hurt you, unless you let them in.
>
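P.S. As a rough sketch of how the settings mentioned above (spark.executor.memory, spark.shuffle.spill, Kryo) can be wired together on 0.9 via SparkConf; the values and master URL are placeholders, and on 0.8.x the same keys would instead be set as Java system properties before creating the SparkContext:

    import org.apache.spark.{SparkConf, SparkContext}

    // Values are illustrative; tune them to your cluster.
    val conf = new SparkConf()
      .setAppName("MemoryTuningSketch")
      .setMaster("local[2]")                  // placeholder master URL
      .set("spark.executor.memory", "4g")     // default is 512m, which is quite low
      .set("spark.shuffle.spill", "true")     // allow shuffles to spill to disk (default in 0.9)
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // compact serialization

    val sc = new SparkContext(conf)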