Re: [FLINK-10868] job cannot be exited immediately if job manager is timed out for some reason

2019-07-04 Thread Anyang Hu
Thanks for your replies. To Peter: I had already increased heartbeat.timeout to 3 minutes, but the job manager timeout still occurs. For now, I have added the following logic: when the JM times out, onFatalError is called, which ensures that the job fails and exits quickly. Does the method

Re: [FLINK-10868] job cannot be exited immediately if job manager is timed out for some reason

2019-07-01 Thread Peter Huang
Hi Anyang, Thanks for raising the question. I didn't test the PR in batch mode; your observation helps me improve the implementation. From my understanding, if the RM-to-JM heartbeat times out, the job manager connection will be closed, so it will not be reconnected. Are you running batch

Re: [FLINK-10868] job cannot be exited immediately if job manager is timed out for some reason

2019-07-01 Thread Till Rohrmann
Hi Anyang, as far as I can tell, FLINK-10868 has not been merged into Flink yet, so I cannot say much about how well it works. The case you are describing should, however, be handled properly in whichever version gets merged. I guess what needs to happen is that once the JM reconnects to the RM

[FLINK-10868] job cannot be exited immediately if job manager is timed out for some reason

2019-06-26 Thread Anyang Hu
Hi ZhenQiu && Rohrmann: I have backported FLINK-10868 to flink-1.5. Most of my jobs (all batch jobs) exit immediately once the number of failed containers reaches the upper limit, but some jobs still cannot exit immediately. From the logs, I observe that these