Are you doing a join, groupBy, or a similar operation? In that case I would
suspect that the keys are not evenly distributed, which is why a few of the
tasks spend far too much time on the actual processing. You might want
to look into custom partitioners.
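
To make the skew idea concrete, here is a rough sketch in plain Python (no Spark needed) of how one hot key overloads a single partition under hash partitioning, and how a salt-aware partitioner could spread it out. Everything in it is made up for illustration: the partition count, the key names, the `stable_hash` helper, and the round-robin salt are assumptions, not anything taken from your job.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical partition count


def stable_hash(key):
    # Deterministic stand-in for the engine's key hash (assumption).
    return zlib.crc32(repr(key).encode("utf-8"))


def hash_partition(key, salt):
    # Plain hash partitioning: the salt is ignored, so every record
    # with the same key lands in the same partition.
    return stable_hash(key) % NUM_PARTITIONS


def salted_partition(key, salt):
    # Salt-aware partitioning: records of one key are spread over
    # neighbouring partitions according to their salt.
    return (stable_hash(key) + salt) % NUM_PARTITIONS


def partition_sizes(records, partitioner):
    # Count how many records each partition would receive.
    counts = Counter(partitioner(key, salt) for key, salt in records)
    return [counts.get(p, 0) for p in range(NUM_PARTITIONS)]


# Made-up skewed input: one hot key with 9000 records, 1000 rare keys.
keys = ["hot"] * 9000 + ["k%d" % i for i in range(1000)]
# Attach a round-robin salt to every record.
records = [(key, i % NUM_PARTITIONS) for i, key in enumerate(keys)]

plain = partition_sizes(records, hash_partition)
spread = partition_sizes(records, salted_partition)

print("hash partitioning  :", plain)
print("salted partitioning:", spread)
```

With plain hashing, one partition gets all 9000 hot records and the task that owns it becomes the straggler. Keep in mind that salting splits one key across partitions, so a groupBy needs a second step to merge the partial results per key, and a join needs the other side replicated for every salt value.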
Are there other processes running on sk3? Or, more generally, are you
sharing resources with somebody else (virtualization etc.)?
Does your transformation consume other services (e.g. reading from S3, in
which case S3 latency may play a role)?
Can it be that the task for some key will take longer than
It really depends on the code. I would say that the easiest way is to
rerun the problematic action, find the straggler task, and analyze what is
happening in it with jstack, or take a heap dump and analyze it locally. For
example, it might be the case that your tasks are connecting to some
external