How many partitions does the stream have? With 80 cores (5 executors x 16
tasks each), you need at least 80 tasks to even take advantage of them, so if
the stream has fewer than 80 partitions, at least .repartition(80) before the
write. A crude rule of thumb is to have 2-3x as many tasks as cores, to help
even out differences in task size by distributing the work more finely. You
might even go for more. I'd watch the task length: as long as tasks aren't
completing in a few seconds or less, you probably don't have too many.
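
For example, a minimal sketch in PySpark (the DataFrame name, partition
count, and S3 path are just illustrative placeholders for your job):

    # Repartition to roughly 2-3x the total core count before writing.
    total_cores = 5 * 16            # 5 executors x 16 task slots = 80
    target_partitions = total_cores * 2

    (df.repartition(target_partitions)
       .write
       .mode("append")
       .parquet("s3://your-bucket/your-prefix/"))  # placeholder output path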

This is also a good reason to use autoscaling, so that when not busy you
can (for example) scale down to 1 executor, but under load scale up to 10
or 20 machines if needed. Autoscaling is likewise a reason to repartition to
more partitions than you need right now, so the extra parallelism can
actually be used when the cluster scales up.
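
Glue normally manages the executor count from the job configuration, but if
you were tuning the Spark config yourself, dynamic allocation is the knob
this maps to. A rough sketch with illustrative bounds (not Glue-specific):

    from pyspark.sql import SparkSession

    # Let Spark scale executors between 1 and 20 based on pending tasks.
    spark = (SparkSession.builder
             .appName("kafka-to-s3")  # illustrative name
             .config("spark.dynamicAllocation.enabled", "true")
             .config("spark.dynamicAllocation.minExecutors", "1")
             .config("spark.dynamicAllocation.maxExecutors", "20")
             # Needed on Spark 3.x if no external shuffle service is available.
             .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
             .getOrCreate())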

On Tue, Jan 25, 2022 at 7:07 AM Aki Riisiö <aki.rii...@gmail.com> wrote:

> Hello.
>
> We have a very simple AWS Glue job running with Spark: get some events
> from Kafka stream, do minor transformations, and write to S3.
>
> Recently, there was a change in the Kafka topic which suddenly increased
> our data size 10x, and at the same time we were testing different
> repartition values during df.repartition(n).write ...
> At the time when Kafka started sending an increased volume of data, we
> didn't actually have the repartition value set in our write.
> Suddenly, our Glue job (or save at NativeMethodAccessorImpl.java:0) jumped
> from 2h to 10h. Here are some details of the save stage from SparkUI:
> - Only 5 executors, which can each run 16 tasks in parallel
> - 10500 tasks (job is still running...) with median duration = 2.6 min
> and median GC time = 2 s
> - Input size per executor is 9 GB and output is 4.5 GB
> - Executor memory is 20 GB
>
> My question is: now that we're trying to find a proper value for
> repartition, what would be the optimal value here? Our data volume was not
> expected to go this high, but there are times when it might. As this job
> is running in AWS Glue, should I also consider setting the executor count,
> cores, and memory manually? I think Glue actually sets those based on the
> Glue job configuration. Yes, this is probably not your concern, but still,
> worth a shot :)
>
> Thank you!
>
>
