Hi,

Unfortunately, the new KafkaSource was contributed without good benchmarks,
and so far you are the first one who has noticed and reported this issue.
Without a more direct comparison (as Martijn suggested), it's hard for us to
help right away. It would be a tremendous help if you could, for example,
provide us with steps to reproduce this exact issue. Another thing you could
do is attach a code profiler to both the Flink 1.9 and 1.14 versions of the
job and compare the results for the source task threads (Flink task threads
are named after the task name, so they are easy to distinguish).
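
If attaching a full profiler is inconvenient, even periodic thread dumps of
the TaskManager JVM can already show where the source threads spend their
time. Purely as an illustration (this is not part of any Flink API; jstack
or async-profiler attached to the TaskManager process gives you the same
information), a small helper like the sketch below, running inside the
TaskManager JVM, would print only the threads that belong to a given task:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    /** Hypothetical helper; it must run inside the TaskManager JVM. */
    public final class SourceThreadDumper {

        /** Periodically prints the threads whose name contains taskNamePart. */
        public static void start(String taskNamePart, long intervalMs) {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            Thread dumper = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
                        if (info.getThreadName().contains(taskNamePart)) {
                            // ThreadInfo#toString includes a (truncated) stack trace.
                            System.out.println(info);
                        }
                    }
                    try {
                        Thread.sleep(intervalMs);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }, "source-thread-dumper");
            dumper.setDaemon(true);
            dumper.start();
        }
    }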

Also, have you observed any degradation in the metrics reported by Flink,
such as the record processing rate, between those two versions?

Best,
Piotrek

Wed, 16 Feb 2022 at 13:24 Arujit Pradhan <arujit.prad...@gojek.com>
wrote:

> Hey Martijn,
>
> Thanks a lot for getting back to us. To give you a little bit more
> context, we maintain an open-source project around Flink, dagger
> <http://github.com/odpf/dagger>, which is a wrapper for proto processing.
> As part of the upgrade to the latest version, we did some refactoring and
> moved to KafkaSource, since the older FlinkKafkaConsumer was being
> deprecated.
>
> So we currently do not have any setup to test the hypothesis. Also,
> increasing the resources slightly fixes the issue, and it only happens
> with a small set of jobs during high traffic.
>
> We would love to get some input from the community, as this might cause
> errors in some of our production jobs.
>
> Thanks and regards,
> //arujit
>
> On Tue, Feb 15, 2022 at 8:48 PM Martijn Visser <mart...@ververica.com>
> wrote:
>
>> Hi Arujit,
>>
>> I'm also looping in some contributors from the connector and runtime
>> perspective in this thread. Did you also test the upgrade by only
>> upgrading to Flink 1.14 while keeping the FlinkKafkaConsumer? That would
>> offer a better way to determine whether the regression is caused by the
>> upgrade of Flink itself or by the change of connector.
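>>
>> To make that comparison concrete, the only thing that should change
>> between the two Flink 1.14 runs is the source definition, roughly as in
>> this sketch (the topic, group id, bootstrap servers and the string schema
>> are placeholders for your actual setup; keep exactly one of the two
>> variants per run):
>>
>>   import java.util.Properties;
>>
>>   import org.apache.flink.api.common.eventtime.WatermarkStrategy;
>>   import org.apache.flink.api.common.serialization.SimpleStringSchema;
>>   import org.apache.flink.connector.kafka.source.KafkaSource;
>>   import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
>>   import org.apache.flink.streaming.api.datastream.DataStream;
>>   import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>>   import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
>>
>>   public class SourceComparisonJob {
>>       public static void main(String[] args) throws Exception {
>>           StreamExecutionEnvironment env =
>>                   StreamExecutionEnvironment.getExecutionEnvironment();
>>
>>           // Variant A: Flink 1.14 + legacy FlinkKafkaConsumer
>>           // (deprecated in 1.14, but still available).
>>           Properties props = new Properties();
>>           props.setProperty("bootstrap.servers", "kafka:9092");
>>           props.setProperty("group.id", "perf-comparison");
>>           DataStream<String> legacy = env.addSource(
>>                   new FlinkKafkaConsumer<>("my-topic", new SimpleStringSchema(), props));
>>
>>           // Variant B: Flink 1.14 + new KafkaSource.
>>           KafkaSource<String> source = KafkaSource.<String>builder()
>>                   .setBootstrapServers("kafka:9092")
>>                   .setTopics("my-topic")
>>                   .setGroupId("perf-comparison")
>>                   .setStartingOffsets(OffsetsInitializer.latest())
>>                   .setValueOnlyDeserializer(new SimpleStringSchema())
>>                   .build();
>>           DataStream<String> fromNewApi = env.fromSource(
>>                   source, WatermarkStrategy.noWatermarks(), "data_streams_0");
>>
>>           // The rest of the pipeline (watermarks, filter, select, sink) stays
>>           // identical for both runs, attached to whichever stream is active.
>>           env.execute("source-comparison");
>>       }
>>   }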
>>
>> Best regards,
>>
>> Martijn Visser
>> https://twitter.com/MartijnVisser82
>>
>>
>> On Tue, 15 Feb 2022 at 13:07, Arujit Pradhan <arujit.prad...@gojek.com>
>> wrote:
>>
>>> Hey team,
>>>
>>> We are migrating our Flink codebase and a bunch of jobs from Flink 1.9
>>> to Flink 1.14. To ensure uniformity in performance, we ran a bunch of
>>> jobs for a week in both 1.9 and 1.14 simultaneously, with the same
>>> resources and configurations, and monitored them.
>>>
>>> Though most of the jobs are running fine, we see significant performance
>>> degradation in some of the high-throughput jobs during peak hours. As a
>>> result, we can see high lag and data drops while processing messages
>>> from Kafka in some of the 1.14 jobs, while in 1.9 they are working just
>>> fine. Now we are debugging and trying to understand the potential reason
>>> for it.
>>>
>>> One of the hypotheses we can think of is a change in the sequence of
>>> processing in the source operator. To explain this, we are adding
>>> screenshots of the problematic tasks below; the first one is for 1.14
>>> and the second is for 1.9. Upon inspection, it can be seen that the
>>> sequence of processing in 1.14 is:
>>>
>>> data_streams_0 -> Timestamps/Watermarks -> Filter -> Select.
>>>
>>> While in 1.9 it was:
>>>
>>> data_streams_0 -> Filter -> Timestamps/Watermarks -> Select.
>>>
>>> In 1.14 we are using the KafkaSource API, while in the older version it
>>> was the FlinkKafkaConsumer API. We wanted to understand whether this can
>>> cause a potential performance decline, given that all other
>>> configurations/resources for both jobs are identical, and if so, how to
>>> avoid it. Also, we cannot see any unusual CPU/memory behaviour while
>>> monitoring the affected jobs.
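>>>
>>> If it helps, this is the difference as we understand it: with
>>> KafkaSource the watermark strategy is bound to the source in
>>> env.fromSource(), so watermarks are generated before the filter, while
>>> with FlinkKafkaConsumer we assigned them after the filter. A minimal
>>> sketch (imports omitted; Message, isValid() and getEventTimeMillis()
>>> stand in for our actual proto type and filter, and kafkaSource is
>>> assumed to be built as usual) of how the 1.9 order could be reproduced
>>> with the new API:
>>>
>>>     // 1.14 as it is now: the watermark strategy is attached to the source
>>>     // itself, so watermarks are generated before the filter.
>>>     DataStream<Message> current =
>>>             env.fromSource(kafkaSource, watermarkStrategy, "data_streams_0")
>>>                     .filter(message -> message.isValid());
>>>
>>>     // Possible way to get the 1.9 order back (filter before watermarks):
>>>     // give the source no watermarks and assign them after the filter.
>>>     DataStream<Message> reordered =
>>>             env.fromSource(kafkaSource, WatermarkStrategy.noWatermarks(), "data_streams_0")
>>>                     .filter(message -> message.isValid())
>>>                     .assignTimestampsAndWatermarks(
>>>                             WatermarkStrategy.<Message>forBoundedOutOfOrderness(
>>>                                             Duration.ofSeconds(10))
>>>                                     .withTimestampAssigner(
>>>                                             (message, ts) -> message.getEventTimeMillis()));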
>>>
>>> Source Operator in 1.14:
>>> [image: image.png]
>>> Source Operator in 1.9:
>>> [image: image.png]
>>> Thanks in advance,
>>> //arujit
