Why don't you just repartition the dataset? If the partitions really are that
unevenly sized, you should probably fix that first. It will potentially also
save a lot of trouble later on.
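As a quick sanity check on the memory arithmetic in the question below, here is a minimal sketch in plain Python (the function name and the 3 GB / 30 GB sizes are illustrative, taken from the question) of the peak per-executor memory when one skewed partition lands among uniform ones:

```python
def peak_executor_memory_gb(executor_cores, normal_gb=3, skewed_gb=30):
    """Peak memory (GB) on the executor that receives the one skewed
    partition, assuming every other concurrently running task holds a
    normal-sized partition at the same time."""
    # (cores - 1) tasks hold a normal partition; 1 task holds the skewed one.
    return (executor_cores - 1) * normal_gb + skewed_gb

print(peak_executor_memory_gb(21))  # 90 -> exceeds the 80 GB executor, so likely spill
print(peak_executor_memory_gb(15))  # 72 -> fits within 80 GB
```

In Spark itself, `df.repartition(n)` (or salting the hot key before the shuffle) is the usual way to even out partition sizes rather than tuning cores around a single oversized partition.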

On Thu, Nov 7, 2019 at 5:14 PM V0lleyBallJunki3 <venkatda...@gmail.com>
wrote:

> Consider an example where I have a cluster with 5 nodes, each with 64
> cores and 244 GB of memory. I decide to run 3 executors on each node,
> setting executor-cores to 21 and executor memory to 80 GB, so that each
> executor can run 21 tasks in parallel. Now consider 315 (63 * 5)
> partitions of data, of which 314 are 3 GB each but one is 30 GB (due to
> data skew). Each executor that received only 3 GB partitions has 63 GB
> (21 * 3, since each executor runs 21 tasks in parallel and each task
> takes 3 GB of memory) occupied. But the one executor that received the
> 30 GB partition will need 90 GB (20 * 3 + 30). So will this executor
> first execute the 20 tasks of 3 GB and then load the 30 GB task, or will
> it try to load all 21 tasks at once and find that for one task it has to
> spill to disk? If I set executor-cores to just 15, then the executor
> that receives the 30 GB partition will only need 14 * 3 + 30 = 72 GB and
> hence won't spill to disk. So in this case, will reduced parallelism
> lead to no shuffle spill?
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
