Thanks for your replies.
To Peter:
I had already increased heartbeat.timeout to 3 minutes, but the job
manager timeout still occurs. For now, I have added the following logic:
when the JM times out, onFatalError is called, which ensures that the job
fails and exits quickly. Does the method
Hi Anyang,
Thanks for raising the question. I didn't test the PR in batch mode, so
your observation helps me improve the implementation. From my understanding,
if the RM's heartbeat to a job manager times out, the job manager connection
will be closed, so it will not be reconnected. Are you running batch
Hi Anyang,
as far as I can tell, FLINK-10868 has not been merged into Flink yet. Thus,
I cannot tell much about how well it works. The case you are describing
should be properly handled in a version which gets merged though. I guess
what needs to happen is that once the JM reconnects to the RM
Hi ZhenQiu && Rohrmann:
I have backported FLINK-10868 to flink-1.5. Most of my jobs (all batch
jobs) exit immediately once the number of failed container requests
reaches the upper limit, but some jobs still do not exit immediately.
From the logs, I observed that these