Hi Isidoros,

Two possible causes come to mind:

1. Records are not evenly distributed across different keys, e.g. some 
accountIds simply have more events than others. If the record distribution 
is predictable, you can try to combine other fields or include more 
information in the key field to help balance the distribution (a sketch 
follows after point 2). 

2. The keys themselves are not distributed evenly. In short, the key group 
that a key belongs to is calculated as murmurHash(key.hashCode()) % 
maxParallelism, and each key group is then mapped to a subtask, so if the 
distribution of keys is unlucky, it's possible that most keys land in the 
same subtask. AFAIK there isn't a metric for monitoring the number of keys 
per subtask, but you can simply investigate it with a map function after 
keyBy (see the two sketches below). 
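
For the composite-key idea in point 1, a rough sketch below; the Event POJO 
and its eventType field are just stand-ins for your own types. One caveat: 
splitting a key also splits the per-key CEP state, so your patterns must 
still make sense on the finer-grained key.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;

public class CompositeKeyExample {

    // Stand-in for your POJO; only the fields used for keying are shown.
    public static class Event {
        public long accountId;
        public String eventType;  // hypothetical secondary field
    }

    // Key by accountId plus a secondary field so that one hot account is
    // spread over several keys (and thus, likely, several subtasks).
    static KeyedStream<Event, String> keyByComposite(DataStream<Event> events) {
        return events.keyBy(e -> e.accountId + "#" + e.eventType);
    }
}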
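
You can also check the distribution offline by replaying Flink's routing 
over a sample of your real keys, using KeyGroupRangeAssignment from 
flink-runtime. Minimal sketch, assuming your setup of parallelism 14 and 
maxParallelism 14 * 14; the accountIds here are placeholders:

import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class KeySkewCheck {
    public static void main(String[] args) {
        int parallelism = 14;
        int maxParallelism = parallelism * parallelism;  // 196

        // Replace with a representative sample of your real accountIds.
        List<Long> accountIds = Arrays.asList(1001L, 1002L, 1003L, 2001L, 2002L);

        // Count how many keys each subtask would receive. This applies the
        // same logic as keyBy: key group = murmurHash(key.hashCode()) %
        // maxParallelism, and key groups are then mapped onto subtasks.
        Map<Integer, Integer> keysPerSubtask = new TreeMap<>();
        for (Long accountId : accountIds) {
            int subtask = KeyGroupRangeAssignment.assignKeyToParallelOperator(
                    accountId, maxParallelism, parallelism);
            keysPerSubtask.merge(subtask, 1, Integer::sum);
        }
        keysPerSubtask.forEach((subtask, count) ->
                System.out.println("subtask " + subtask + " -> " + count + " keys"));
    }
}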
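
And to observe it inside the running job, a minimal pass-through probe 
right after keyBy could look like the following. Note that the output of 
map is no longer a KeyedStream, so this is for investigation only, not 
something to keep in front of the CEP operators:

import org.apache.flink.api.common.functions.RichMapFunction;

// Pass-through probe that logs which subtask each record lands on.
// Usage: keyedStream.map(new SubtaskProbe<>())
public class SubtaskProbe<T> extends RichMapFunction<T, T> {
    @Override
    public T map(T value) {
        int subtask = getRuntimeContext().getIndexOfThisSubtask();
        // In a real job, prefer a metric or a logger over stdout.
        System.out.println("subtask " + subtask + " processed: " + value);
        return value;
    }
}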

Hope this helps!

Qingsheng

> On Apr 1, 2022, at 17:37, Isidoros Ioannou <akis3...@gmail.com> wrote:
> 
> Hello,
> 
> we are running a Flink application (version 1.13.2) that consists of a Kafka 
> source with one partition so far.
> We then filter the data based on some conditions, map it to POJOs, and 
> transform it into a KeyedStream keyed by an accountId long property of the 
> POJO. The downstream operators are 10 CEP operators that run with a 
> parallelism of 14, and maxParallelism is set to (operatorParallelism * 
> operatorParallelism).
> As you can see in the attached image, the events are distributed unevenly, so 
> some subtasks are busy while others are idle.
> Is there any way to distribute the load evenly across the subtasks? Thank you 
> in advance.
> <Capture.PNG>
>  
