I run Spark 1.6.1 on YARN (EMR-4.5.0) and call `RDD.count` on an RDD cached with `MEMORY_ONLY_SER` (`spark.serializer` is set to `KryoSerializer`).
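For context, the caching path looks roughly like this (a minimal sketch; the input path and variable names are placeholders, not from my actual job):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("CacheCountExample")  // placeholder app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

// hdfs:///path/to/input is a placeholder for my real input
val rdd = sc.textFile("hdfs:///path/to/input")
rdd.persist(StorageLevel.MEMORY_ONLY_SER)  // serialized in-memory cache
val n = rdd.count()                        // materializes and caches the RDD
```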
After the count finishes, the Spark UI shows that the RDD's Fraction Cached is only 6%, with Size in Memory = 65.3 GB.

I looked at the executors' stderr on the Spark UI and saw lots of messages like:

```
16/04/15 19:08:03 INFO storage.MemoryStore: Will not store rdd_16_4383 as it would require dropping another block from the same RDD
16/04/15 19:08:03 WARN storage.MemoryStore: Not enough space to cache rdd_16_4383 in memory! (computed 1462.4 MB so far)
16/04/15 19:08:03 INFO storage.MemoryStore: Memory use = 11.0 KB (blocks) + 33.8 GB (scratch space shared across 17 tasks(s)) = 33.8 GB. Storage limit = 33.8 GB.
16/04/15 19:08:06 INFO storage.MemoryStore: Will not store rdd_16_4306 as it would require dropping another block from the same RDD
16/04/15 19:08:06 WARN storage.MemoryStore: Not enough space to cache rdd_16_4306 in memory! (computed 1920.6 MB so far)
16/04/15 19:08:06 INFO storage.MemoryStore: Memory use = 11.0 KB (blocks) + 33.8 GB (scratch space shared across 17 tasks(s)) = 33.8 GB. Storage limit = 33.8 GB.
```

But the cluster has enough memory to cache roughly three RDDs of that size:

- `spark.executor.instances` = 100
- `spark.executor.memory` = 48524M
- Storage Memory per executor = 33.8 GB
- Executors Memory (Spark UI): 67.2 GB Used (3.3 TB Total)

If my RDD takes 65.3 GB of storage memory at Fraction Cached = 6%, the full RDD should be about 65.3 GB / 0.06 ≈ 1.1 TB. The cluster has 3.3 TB of total storage memory, and only one application is running right now (this is the first RDD my program caches).

Why can't Spark store the entire RDD in memory?

By the way, under Spark 1.5.2 the same RDD was 100% cached. Should I switch to legacy mode (`spark.memory.useLegacyMode=true`)?
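If legacy mode is the answer, I assume it would be set roughly like this (a sketch only; the two fraction values are the Spark 1.5.x defaults, listed for illustration rather than something I've tuned):

```scala
import org.apache.spark.SparkConf

// Enable the pre-1.6 ("legacy") memory manager
val legacyConf = new SparkConf()
  .set("spark.memory.useLegacyMode", "true")
  .set("spark.storage.memoryFraction", "0.6")  // pre-1.6 default storage fraction
  .set("spark.shuffle.memoryFraction", "0.2")  // pre-1.6 default shuffle fraction
```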