> What's the total number of partitions that you have?
18k

> What machines are you using ? Are you using an SSD ?
Using a family of r5.4xlarge nodes. Yes, I'm using five GP3 disks, which
together give me about 625 MB/s of sustained throughput (which is what I
see when writing the shuffle data).
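
(Aside for the archives: 625 MB/s is exactly five times gp3's 125 MB/s
baseline throughput per volume. For Spark to actually stripe shuffle writes
across all five disks, they all need to be listed as local dirs. A minimal
sketch; the mount points are placeholders, not my actual paths:

  import org.apache.spark.sql.SparkSession

  // Stripe shuffle and spill files across all five GP3 volumes.
  // /mnt/disk1 ... /mnt/disk5 are placeholder mount points.
  val spark = SparkSession.builder()
    .config("spark.local.dir",
      "/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/disk5")
    .getOrCreate()

On Kubernetes, spark.local.dir is typically superseded by mounted volumes;
see the spark-local-dir-* sketch further down.)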

> Can you please provide the size of the shuffle file that is generated in
each task?
I have to check that, but the total task sizes are around 150~180 MB.
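
A quick back-of-the-envelope cross-check against the ~4 TB total shuffle
from my original post:

  // total shuffle ÷ partitions ≈ per-task shuffle size
  val totalShuffleBytes = 4.0e12   // ~4 TB, from the original post
  val partitions        = 18000
  val perTaskMB = totalShuffleBytes / partitions / 1e6
  // ≈ 222 MB, the same ballpark as the 150~180 MB reported above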

> What exact instance types do you use? Unless you use local instance
storage and have actually configured your Kubernetes and Spark to use
instance storage, your 30x30 exchange can run into EBS IOPS limits. You can
investigate that by going to an instance, then to volume, and see
monitoring charts.
That's a pretty good hint. I forgot to check the IOPS limit; I was only
looking at the throughput.
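
For anyone who finds this thread later: gp3's baseline is 3,000 IOPS per
volume unless provisioned higher, so five volumes give roughly 15k
aggregate. A rough sketch of pulling per-volume op counts from CloudWatch
with the AWS SDK v2 (the volume id is a placeholder); the SUM of
VolumeReadOps over a bucket, divided by the bucket length in seconds,
approximates average read IOPS, which you can compare against the
volume's provisioned IOPS:

  import java.time.Instant
  import java.time.temporal.ChronoUnit
  import software.amazon.awssdk.services.cloudwatch.CloudWatchClient
  import software.amazon.awssdk.services.cloudwatch.model.{Dimension, GetMetricStatisticsRequest, Statistic}
  import scala.jdk.CollectionConverters._

  val cw = CloudWatchClient.create()
  val req = GetMetricStatisticsRequest.builder()
    .namespace("AWS/EBS")
    .metricName("VolumeReadOps")    // repeat for VolumeWriteOps
    .dimensions(Dimension.builder().name("VolumeId")
      .value("vol-0123456789abcdef0").build())   // placeholder volume id
    .startTime(Instant.now().minus(1, ChronoUnit.HOURS))
    .endTime(Instant.now())
    .period(300)                    // basic EBS monitoring uses 5-minute buckets
    .statistics(Statistic.SUM)
    .build()
  cw.getMetricStatistics(req).datapoints().asScala.foreach { dp =>
    println(s"${dp.timestamp}: ~${dp.sum() / 300} avg read IOPS")
  }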

> Another thought is that you're essentially giving 4GB per core. That
sounds pretty low, in my experience.
For this particular job, it seems fine. There are no long GCs, and peak
usage per task was reported at 1.4 GB, so there's plenty of room.
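
(For context on the 4 GB/core figure: under Spark's unified memory model, a
task slot sees noticeably less than that. A rough sketch with default
settings; the 112g heap is a placeholder, since I didn't state the actual
spark.executor.memory:

  // unified memory = (heap - 300 MB reserved) * spark.memory.fraction (default 0.6),
  // shared by execution and storage and split across concurrent tasks
  val heapBytes = 112L << 30            // placeholder executor heap (~112 GiB)
  val reserved  = 300L << 20            // fixed reserved memory
  val unified   = ((heapBytes - reserved) * 0.6).toLong
  val perSlotGB = unified.toDouble / 32 / (1L << 30)
  // ≈ 2.1 GB per task slot, so the 1.4 GB peak fits with some headroom

Execution memory can borrow from the storage half when it's free, so the
practical ceiling per task is often higher than this.)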

Thanks a lot for the responses. I'm betting this is related to EBS IOPS
limits, since most of our jobs use instances with local disks instead.
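
In case it helps anyone searching the archives: on Kubernetes,
instance-store disks can be handed to Spark as shuffle dirs via hostPath
volumes whose names start with spark-local-dir-. A sketch combining that
with the fetch-side knobs mentioned above (all values illustrative, not
what we actually run):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    // fetch-side knobs I already tried (defaults are 48m and 1)
    .config("spark.reducer.maxSizeInFlight", "96m")
    .config("spark.shuffle.io.numConnectionsPerPeer", "4")
    // hostPath volumes named spark-local-dir-* become shuffle/spill dirs on K8s
    .config("spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path",
      "/spark-local")
    .config("spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path",
      "/mnt/nvme0")                     // placeholder NVMe mount on the host
    .getOrCreate()

In practice these would typically go on spark-submit as --conf flags, since
the executor pod spec is built from the submit-time configuration.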

On Thu, Sep 29, 2022 at 7:44 PM Vladimir Prus <vladimir.p...@gmail.com>
wrote:

> Igor,
>
> What exact instance types do you use? Unless you use local instance
> storage and have actually configured your Kubernetes and Spark to use
> instance storage, your 30x30 exchange can run into EBS IOPS limits. You can
> investigate that by going to an instance, then to volume, and see
> monitoring charts.
>
> Another thought is that you're essentially giving 4GB per core. That
> sounds pretty low, in my experience.
>
>
>
> On Thu, Sep 29, 2022 at 9:13 PM Igor Calabria <igor.calab...@gmail.com>
> wrote:
>
>> Hi Everyone,
>>
>> I'm running Spark 3.2 on Kubernetes and have a job with a decently sized
>> shuffle of almost 4 TB. The relevant cluster config is as follows:
>>
>> - 30 executors, each with 16 physical cores, configured as 32 cores for Spark
>> - 128 GB RAM per executor
>> - shuffle.partitions is 18k, which gives me tasks of around 150~180 MB
>>
>> The job runs fine, but I'm bothered by how underutilized the cluster gets
>> during the reduce phase. During the map phase (reading data from S3 and
>> writing the shuffle data), CPU usage, disk throughput, and network usage
>> are as expected, but during the reduce phase they get really low. The main
>> bottleneck seems to be reading shuffle data from other nodes: task
>> statistics report shuffle read times ranging from 25s to several minutes
>> (the task sizes are really close, so they aren't skewed). I've tried
>> increasing "spark.reducer.maxSizeInFlight" and
>> "spark.shuffle.io.numConnectionsPerPeer", which improved performance a
>> little, but not enough to saturate the cluster's resources.
>>
>> Did I miss some more tuning parameters that could help?
>> One obvious thing would be to scale the machines vertically and use
>> fewer nodes to minimize traffic, but 30 nodes doesn't seem like much,
>> even considering the 30x30 connections.
>>
>> Thanks in advance!
>>
>>
>
> --
> Vladimir Prus
> http://vladimirprus.com
>
