Re: Why RDDs are being dropped by Executors?

2015-09-23 Thread Tathagata Das
There could be multiple reasons for caching stopping at 90%: 1. Not enough aggregate memory in the cluster - increase cluster memory. 2. Data is skewed among executors, so one executor is trying to cache too much while others sit idle - repartition the data using RDD.repartition to force an even distribution (see the sketch below). The Storage…
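A minimal sketch of the repartition suggestion, assuming an existing SparkContext named sc; the input path and partition count are hypothetical:

import org.apache.spark.storage.StorageLevel

// Hypothetical skewed source.
val rdd = sc.textFile("hdfs:///data/table1")

// Spread the data evenly across partitions so no single executor
// has to hold a disproportionate share of the cached blocks.
val even = rdd.repartition(sc.defaultParallelism * 2)
even.persist(StorageLevel.MEMORY_ONLY)
even.count()  // materialize the cache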

Re: Why RDDs are being dropped by Executors?

2015-09-23 Thread Uthayan Suthakar
Thank you, Tathagata, for your response. It makes sense to use MEMORY_AND_DISK. But sometimes when I start the job it does not cache everything at the start; it only caches 90%. The LRU scheme should only take effect later, once the data is not in use, so why is it failing to cache the data at the…

Re: Why RDDs are being dropped by Executors?

2015-09-22 Thread Tathagata Das
If the RDD is not constantly in use, then the LRU scheme in each executor can kick some of the partitions out of memory. If you want to avoid recomputation in such cases, you could persist with StorageLevel.MEMORY_AND_DISK, where partitions will be dropped to disk when kicked out of memory. That will…
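A minimal sketch of that persistence choice (the variable name tableRdd is hypothetical; it stands for any RDD loaded at startup):

import org.apache.spark.storage.StorageLevel

// Evicted partitions spill to local disk instead of being discarded,
// so they are re-read from disk rather than recomputed from the source.
tableRdd.persist(StorageLevel.MEMORY_AND_DISK)
tableRdd.count()  // force the initial materialization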

Why RDDs are being dropped by Executors?

2015-09-22 Thread Uthayan Suthakar
Hello All, We have a Spark Streaming job that reads data from a DB (three tables) and caches it in memory ONLY at the start; it then happily carries out the incremental calculation with the new data. What we've noticed occasionally is that one of the RDDs caches only 90% of the data. Therefore,…
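For context, a hedged sketch of the load-and-cache-at-startup pattern described above; the JDBC URL, table name, and app name are all hypothetical, and Spark 1.x APIs are assumed:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("incremental-job"))
val sqlContext = new SQLContext(sc)

// Read one of the three reference tables once at startup.
val table1 = sqlContext.read
  .jdbc("jdbc:mysql://dbhost/refdb", "table1", new java.util.Properties())
  .rdd

// Pin it in memory; with MEMORY_ONLY, blocks that don't fit are simply
// not cached, which is one way to end up with a 90% cached fraction.
table1.persist(StorageLevel.MEMORY_ONLY)
table1.count()  // materialize the cache up front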