Hey team,

We are migrating our Flink codebase and a number of jobs from Flink 1.9 to
Flink 1.14. To verify performance parity, we ran a set of jobs on both 1.9
and 1.14 simultaneously for a week, with the same resources and
configurations, and monitored them.

Though most of the jobs are running fine, some of the high-throughput jobs
show significant performance degradation during peak hours. As a result, on
1.14 we see high consumer lag and dropped data while processing messages
from Kafka, while the same jobs on 1.9 work just fine. We are now debugging
and trying to understand the potential reason for it.

One of our hypotheses is a change in the sequence of processing in the
source operator. To illustrate this, we have attached screenshots of the
problematic tasks below; the first is from 1.14 and the second from 1.9.
On inspection, the processing sequence in 1.14 is:

data_streams_0 -> Timestamps/Watermarks -> Filter -> Select.

While in 1.9 it was:

data_streams_0 -> Filter -> Timestamps/Watermarks -> Select.
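
For context, this is roughly how the 1.9 pipeline is wired (a simplified
sketch; the broker, topic, group id, filter, and timestamp logic are
placeholders, not our exact code). In 1.9 the watermark assigner is a
user-placed operator, so it runs wherever we chain it, here after the
filter:

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();

Properties props = new Properties();
props.setProperty("bootstrap.servers", "broker:9092");  // placeholder
props.setProperty("group.id", "our-consumer-group");    // placeholder

// Flink 1.9: the Kafka source and the watermark assigner are separate
// operators; timestamps/watermarks are attached after the filter.
DataStream<String> stream = env
    .addSource(new FlinkKafkaConsumer<>(
        "our-topic", new SimpleStringSchema(), props))   // placeholder topic
    .name("data_streams_0")
    .filter(value -> !value.isEmpty())                   // placeholder filter
    .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<String>(Time.seconds(5)) {
            @Override
            public long extractTimestamp(String element) {
                // placeholder: extract the event time from the record
                return Long.parseLong(element);
            }
        });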

In 1.14 we are using the KafkaSource API, while in the older version it was
the FlinkKafkaConsumer API. We wanted to understand whether this change can
cause a performance decline, given that all other configurations and
resources for both jobs are identical, and if so, how to avoid it. We also
do not see any unusual CPU/memory behaviour while monitoring the affected
jobs.
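
For comparison, here is a minimal sketch of how the 1.14 source is built
(again, the broker, topic, group id, and filter are placeholders). As far
as we understand, with the new API the WatermarkStrategy is passed directly
to env.fromSource(), so watermark generation is attached at the source
itself rather than wherever assignTimestampsAndWatermarks() is called,
which would match the ordering difference above:

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();

KafkaSource<String> source = KafkaSource.<String>builder()
    .setBootstrapServers("broker:9092")                  // placeholder
    .setTopics("our-topic")                              // placeholder
    .setGroupId("our-consumer-group")                    // placeholder
    .setStartingOffsets(OffsetsInitializer.latest())
    .setValueOnlyDeserializer(new SimpleStringSchema())
    .build();

// Flink 1.14: the WatermarkStrategy is a parameter of fromSource(), so
// timestamps/watermarks are generated right at the source, before the
// downstream filter.
DataStream<String> stream = env
    .fromSource(source,
        WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(5)),
        "data_streams_0")
    .filter(value -> !value.isEmpty());                  // placeholder filter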

Source operator in 1.14:
[screenshot attached]
Source operator in 1.9:
[screenshot attached]
Thanks in advance,
//arujit
