Consider an example where I have a cluster with 5 nodes, each with 64 cores
and 244 GB of memory. I decide to run 3 executors per node, with
executor-cores set to 21 and executor memory set to 80 GB, so that each
executor can run 21 tasks in parallel. Now suppose there are 315 (5 * 63)
partitions of data, of which 314 are 3 GB each but one is 30 GB (due to
data skew). Every executor that received only 3 GB partitions has 63 GB
occupied (21 * 3, since each executor runs 21 tasks in parallel and each
task takes 3 GB of memory). But the one executor that received the 30 GB
partition will need 90 GB (20 * 3 + 30). So will this executor first
execute the 20 tasks of 3 GB and only then load the 30 GB task, or will it
try to run all 21 tasks at once and find that for one task it has to spill
to disk? If I instead set executor-cores to 15, the executor that receives
the 30 GB partition will only need 14 * 3 + 30 = 72 GB, which fits within
its 80 GB, and hence won't spill to disk. So in this case, will reduced
parallelism lead to no shuffle spill?
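The back-of-envelope arithmetic above can be sketched as follows. Note this is a deliberately simplified model I'm using to frame the question: it assumes each task materializes its whole partition in memory at once, and it ignores Spark's unified memory manager, spark.memory.fraction, and overhead. The helper name is mine, not a Spark API.

```python
# Hypothetical model of per-executor peak memory demand under skew.
# Assumes the skewed task runs alongside (cores - 1) normal-sized tasks,
# and that each task needs its full partition size in memory.

def peak_demand_gb(executor_cores, skewed_partition_gb, normal_partition_gb):
    """Peak memory (GB) if the skewed task runs with (cores - 1) normal tasks."""
    return (executor_cores - 1) * normal_partition_gb + skewed_partition_gb

EXECUTOR_MEMORY_GB = 80

for cores in (21, 15):
    demand = peak_demand_gb(cores, skewed_partition_gb=30, normal_partition_gb=3)
    fits = demand <= EXECUTOR_MEMORY_GB
    print(f"executor-cores={cores}: peak demand {demand} GB, "
          f"fits in {EXECUTOR_MEMORY_GB} GB: {fits}")
```

Under this model, executor-cores=21 gives a peak demand of 90 GB (does not fit), while executor-cores=15 gives 72 GB (fits), which is what motivates the question of whether Spark actually behaves this way.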



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
