Hello all,

I'm using Spark 1.6 and trying to cache a dataset that is 1.5 TB. I have
only ~800 GB of RAM in total, so I am using the DISK_ONLY storage level.
Unfortunately, I'm exceeding the memory overhead limit:


Container killed by YARN for exceeding memory limits. 27.0 GB of 27 GB
physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.


I'm giving 6 GB of overhead memory and using 10 cores per executor.
Apparently, that's not enough. Without persisting, just recomputing the
dataset each time it is needed (twice in my case), the job works fine. Can
anyone please explain what overhead consumes that much memory when
persisting to disk, and how I can estimate how much extra memory to give
the executors so the job doesn't fail?
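
For reference, the relevant part of the job looks roughly like this
(simplified sketch; the app name, input path, and the way the dataset is
built are just placeholders, not the real code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Spark 1.6 on YARN: 10 cores per executor, 6 GB executor memory overhead.
val conf = new SparkConf()
  .setAppName("disk-only-cache")
  .set("spark.executor.cores", "10")
  .set("spark.yarn.executor.memoryOverhead", "6144") // in MB
val sc = new SparkContext(conf)

// Placeholder for building the ~1.5 TB dataset.
val ds = sc.textFile("hdfs:///path/to/input")

// Keep the blocks on the executors' local disks only, not on the heap.
ds.persist(StorageLevel.DISK_ONLY)

// The persisted dataset is then used twice downstream, e.g.:
val rows = ds.count()
val bytes = ds.map(_.length.toLong).reduce(_ + _)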

Thanks, Alex.
