Re: How to debug checkpoints failing to complete

2020-03-27 Thread Congxian Qiu
Hi. From my experience, you can first check the jobmanager.log to find out whether the checkpoint expired or was declined by some task. If it expired, you can follow the advice seeksst gave above (enabling debug logging may help here). If it was declined, then you can go to the taskmanager.log to
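
As a rough illustration of that first triage step, here is a minimal Python sketch that counts expired versus declined checkpoints in a list of log lines. The two regular expressions are assumptions about how the messages are worded and may not match your Flink version, so adjust them to what your jobmanager.log actually contains. The function name classify_checkpoint_failures is hypothetical.

```python
import re
from collections import Counter

# ASSUMPTION: these patterns approximate the wording of the JobManager's
# checkpoint messages; verify them against your own jobmanager.log.
EXPIRED = re.compile(r"Checkpoint (\d+) .*expired before completing")
DECLINED = re.compile(r"Decline checkpoint (\d+)")


def classify_checkpoint_failures(lines):
    """Count how many log lines report an expired or a declined checkpoint."""
    counts = Counter()
    for line in lines:
        if EXPIRED.search(line):
            counts["expired"] += 1
        elif DECLINED.search(line):
            counts["declined"] += 1
    return counts
```

To use it on a real file, pass the open file object, e.g. classify_checkpoint_failures(open("jobmanager.log")). Mostly "expired" points toward timeouts; mostly "declined" points toward the task-side logs.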

Re: How to debug checkpoints failing to complete

2020-03-25 Thread David Anderson
seeksst has already covered many of the relevant points, but a few more thoughts: I would start by checking whether the checkpoints are failing because they time out, or for some other reason. Assuming they are timing out, then a good place to start is to look and see if this can be explained by

Re: How to debug checkpoints failing to complete

2020-03-23 Thread seeksst
Hi, in my experience there are several possible reasons for checkpoint failure. 1. If you use RocksDB as the state backend, insufficient disk space will cause it, because the files are saved on local disk, and you may see an exception. 2. The sink can’t be written to. all parallelism can’t be
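
For the first point, a quick way to rule out a full local disk is to check free space on the directory the state backend writes to. This is only a sketch: the path and threshold are placeholders you would supply yourself, and has_enough_disk is a hypothetical helper, not part of Flink or RocksDB.

```python
import shutil


def has_enough_disk(path, min_free_bytes):
    """Return True if the filesystem holding `path` has at least min_free_bytes free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free >= min_free_bytes
```

For example, has_enough_disk("/path/to/local/state/dir", 5 * 1024**3) checks for 5 GiB free; the directory shown is a placeholder for wherever your TaskManager keeps its local RocksDB files.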

How to debug checkpoints failing to complete

2020-03-23 Thread Stephen Connolly
We have a topology and the checkpoints fail to complete a *lot* of the time. Typically it is just one subtask that fails. We have a parallelism of 2 on this topology at present, and the other subtask will complete in 3ms, though the end-to-end duration on the rare times when the checkpointing