Ah OK, I could reproduce the problem.
It seems to be tied to the capacity of the async operator; if you halve that,
the start delay doubles.
It looks like classic back-pressure delaying checkpoints, which would kind of
make sense
if unaligned checkpoints weren't enabled; those are supposed
to prevent exactly that from happening.
I think it'd be best to create a ticket; either something isn't behaving
as it should or the documentation is incomplete.
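
For anyone following along, here is a minimal sketch of where that capacity
is set, so the halving experiment can be repeated. It is not Nathan's actual
job: the class names, the 100 ms sleep, the 10 s checkpoint interval, the
60 s timeout and the capacity of 100 are all assumptions for illustration,
and it needs the Flink 1.15 streaming dependencies on the classpath.

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.AsyncFunction;
import org.apache.flink.streaming.api.functions.async.ResultFuture;

public class AsyncCapacitySketch {

    // Hypothetical stand-in for the real async call: echoes the input
    // after sleeping roughly 100 ms on a common-pool thread.
    static class SlowLookup implements AsyncFunction<Long, Long> {
        @Override
        public void asyncInvoke(Long input, ResultFuture<Long> resultFuture) {
            CompletableFuture
                .supplyAsync(() -> {
                    try {
                        Thread.sleep(100);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    return input;
                })
                .thenAccept(v -> resultFuture.complete(Collections.singleton(v)));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 10 s (assumed interval), with unaligned checkpoints on.
        env.enableCheckpointing(10_000L);
        env.getCheckpointConfig().enableUnalignedCheckpoints();

        // Simple bounded source, used here only to keep the operator busy.
        DataStream<Long> input = env.fromSequence(0, 1_000_000);

        // The last argument is the async operator's capacity, i.e. the maximum
        // number of in-flight requests. Change it to 50 and compare the start
        // delay reported for each checkpoint in the web UI.
        int capacity = 100;
        AsyncDataStream
            .unorderedWait(input, new SlowLookup(), 60, TimeUnit.SECONDS, capacity)
            .print();

        env.execute("async-capacity-sketch");
    }
}
```

Whether this stripped-down job shows the same delay as the original recipe is
untested; it is only meant to show which knob was changed.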
On 12/07/2022 20:43, Nathan Sharp wrote:
I have not found a solution yet, but some points:
- A co-worker has reproduced this issue on their own box using the recipe
given below
- I have tried using the RocksDB state backend, which did not help
- I have tried adding additional TaskManagers, which did not help
- I have checked the TaskManager stats and nothing seems awry. No unusual memory
consumption, for example. Nothing obvious in the stack traces
- If I change the code to be sequential instead of async, checkpoints work
fine
- The log file merely shows the checkpoint being triggered, then it being
completed 47 seconds later. No additional information is logged.
- See the attached image for the UI representation, which shows that the delay is under
the "Start Time" column.
Chesnay, how was your Flink cluster configured when it worked for you? Are
you able to reproduce it using my docker-compose file?
Thanks again!
Nathan
-----Original Message-----
From: Nathan Sharp
Sent: Monday, July 4, 2022 10:00 AM
To: 'Chesnay Schepler' <ches...@apache.org>; user@flink.apache.org
Subject: RE: Unaligned checkpoint waiting in 'start delay' with AsyncDataStream
Thank you for trying it out! Hopefully, there is just some setting that needs
to be changed.
I have an Ubuntu VM where I created a single-node Docker swarm. Then I used the
following command to run Flink 1.15.0 using the docker-compose.yml file in the
repository:
docker stack up -c docker-compose.yml flink
Then I used Flink's web UI to upload the .jar file and run it with default
settings.
Nathan