There could be multiple reasons for the caching stopping at 90%:
1. Not enough aggregate memory in the cluster - increase cluster memory.
2. Data is skewed among the executors, so one executor tries to cache too
much while the others sit idle - repartition the data using RDD.repartition
to force an even distribution.
The Storage tab of the Spark web UI shows how much of each RDD is cached,
which can help tell these cases apart.
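To see why RDD.repartition helps with the skew case, here is a minimal sketch in plain Python (not Spark itself; the helper name is illustrative) of what a repartition shuffle does: each record is assigned to a partition by hash, so records that were piled onto one executor end up spread roughly evenly.

```python
# Illustrative sketch of hash partitioning, the mechanism behind
# RDD.repartition: records are reassigned to partitions by hash,
# which breaks up skew regardless of how they were distributed before.

def hash_partition(records, num_partitions):
    """Assign each record to a partition by hash, as a shuffle would."""
    partitions = [[] for _ in range(num_partitions)]
    for r in records:
        partitions[hash(r) % num_partitions].append(r)
    return partitions

# Skewed input: imagine all 10,000 records currently sit on one executor.
records = list(range(10_000))
parts = hash_partition(records, 8)
sizes = [len(p) for p in parts]
print(sizes)  # eight partitions of 1250 records each
```

With an even spread, no single executor has to cache more than its share, so it is less likely to run out of memory while the others are underused.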
Thank you, Tathagata, for your response. It makes sense to use
MEMORY_AND_DISK.
But sometimes when I start the job it does not cache everything at the
start; it only caches 90%. The LRU scheme should only take effect later,
when the data is not in use, so why is it failing to cache all the data at
the start?
If the RDD is not constantly in use, then the LRU scheme in each executor
can kick out some of the partitions from memory.
If you want to avoid recomputing in such cases, you could persist with
StorageLevel.MEMORY_AND_DISK, where the partitions will be dropped to disk
when kicked out of memory. That will avoid recomputation.
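The difference between the two storage levels can be sketched in plain Python (illustrative names, not Spark's actual implementation): both evict the least-recently-used partition when memory runs out, but MEMORY_ONLY simply drops it (so it must be recomputed later), while MEMORY_AND_DISK spills it to disk.

```python
# Sketch contrasting MEMORY_ONLY vs MEMORY_AND_DISK eviction behavior.
# Class and field names are illustrative, not Spark internals.
from collections import OrderedDict

class PartitionCache:
    def __init__(self, memory_slots, spill_to_disk):
        self.memory = OrderedDict()   # partition_id -> data, in LRU order
        self.disk = {}                # only populated when spill_to_disk
        self.memory_slots = memory_slots
        self.spill_to_disk = spill_to_disk

    def put(self, pid, data):
        if len(self.memory) >= self.memory_slots:
            victim, victim_data = self.memory.popitem(last=False)  # evict LRU
            if self.spill_to_disk:
                self.disk[victim] = victim_data   # MEMORY_AND_DISK: spill
            # else: MEMORY_ONLY - victim is dropped and must be recomputed
        self.memory[pid] = data

mem_only = PartitionCache(memory_slots=9, spill_to_disk=False)
mem_and_disk = PartitionCache(memory_slots=9, spill_to_disk=True)
for pid in range(10):              # 10 partitions, memory for only 9 (~90%)
    mem_only.put(pid, f"data-{pid}")
    mem_and_disk.put(pid, f"data-{pid}")

print(len(mem_only.memory), len(mem_only.disk))          # 9 0 - one lost
print(len(mem_and_disk.memory), len(mem_and_disk.disk))  # 9 1 - one on disk
```

In the MEMORY_ONLY case the evicted partition is gone and Spark has to rebuild it from lineage; in the MEMORY_AND_DISK case it is read back from disk, which is slower than memory but avoids the recomputation.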
Hello All,
We have a Spark Streaming job that reads data from a DB (three tables),
caches it in memory ONLY at the start, and then happily carries out the
incremental calculation with the new data. What we've noticed occasionally
is that one of the RDDs caches only 90% of the data, and the uncached
partitions then have to be recomputed.