I second Alex's observation. Based on the logs, the task restart
functionality worked as expected: it tried to restart the tasks until it
reached the limit of 4 attempts due to the missing TaskManager. The job
cluster then shut down with an error code. At this point, YARN should
pick it up and bring up a new JobManager based on the non-zero exit code
of the Flink cluster. It would be interesting to see the YARN logs to
figure out why the cluster failover didn't work.
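
To illustrate why four failures in quick succession exhaust the budget,
here is a minimal sketch of a failure-rate check. It is a simplified
model, not Flink's actual implementation: the class name
FailureRateSketch and the method recordFailureAndCheck are hypothetical,
the 10 ms spacing is my reading of the report, and the backoffTimeMS
delay between restarts is left out entirely.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical, simplified model of a failure-rate restart check.
// It is NOT the Flink class; it only mirrors the two parameters from the thread.
final class FailureRateSketch {
    private final long failuresIntervalMs;
    private final int maxFailuresPerInterval;
    // Timestamps (ms) of the most recent failures, oldest first.
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    public FailureRateSketch(long failuresIntervalMs, int maxFailuresPerInterval) {
        this.failuresIntervalMs = failuresIntervalMs;
        this.maxFailuresPerInterval = maxFailuresPerInterval;
    }

    // Records a failure at nowMs and returns true if a restart is still allowed.
    public boolean recordFailureAndCheck(long nowMs) {
        // Keep at most maxFailuresPerInterval timestamps by dropping the oldest.
        if (failureTimestamps.size() >= maxFailuresPerInterval) {
            failureTimestamps.removeFirst();
        }
        failureTimestamps.addLast(nowMs);
        // Fewer failures than the limit: restart is allowed.
        if (failureTimestamps.size() < maxFailuresPerInterval) {
            return true;
        }
        // Limit reached: allow a restart only if the oldest recorded
        // failure lies outside the measuring interval.
        return nowMs - failureTimestamps.peekFirst() > failuresIntervalMs;
    }

    public static void main(String[] args) {
        // Values from the thread: failuresIntervalMS=60000, maxFailuresPerInterval=4.
        FailureRateSketch strategy = new FailureRateSketch(60_000L, 4);
        for (int i = 0; i < 4; i++) {
            long t = i * 10L; // assumed spacing of roughly 10 ms between failures
            System.out.println("failure at " + t + " ms -> restart allowed: "
                    + strategy.recordFailureAndCheck(t));
        }
    }
}
```

In this model the first three failures still allow a restart, while the
fourth one, arriving within the same 60 s interval, does not. Failures
spread over more than 60 s would not trip the check.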


On Tue, Mar 1, 2022 at 8:00 AM Alexander Preuß <
alexanderpre...@ververica.com> wrote:

> Hi,
> At first glance, it looks like the exceptions were thrown in such rapid
> succession that they exceeded maxFailuresPerInterval, so the
> FailureRateRestartStrategy decided not to restart. Why do you think this
> is different from the expected behavior?
> Best,
> Alex
> On Tue, Mar 1, 2022 at 3:23 AM 刘 家锹 <ljq1120799...@outlook.com> wrote:
>> Hi all,
>> We have run into a problem with FailureRateRestartStrategy that confuses
>> us, and we don't know how to solve it. Here's the situation:
>> Flink version: 1.10.1
>> Development env: on YARN
>> FailureRateRestartStrategy:
>> failuresIntervalMS=60000, backoffTimeMS=15000, maxFailuresPerInterval=4
>> One of our Hadoop machines, on which our job's TaskManager was running,
>> got stuck and stopped responding. At that moment, the JobManager received
>> a heartbeat timeout exception, but after the exception was thrown 4 times
>> in a very short time (about 10 ms each), the job hit the
>> FailureRateRestartStrategy limit and quit entirely. We got the message
>> 'org.apache.flink.runtime.JobException: Recovery is suppressed by
>> FailureRateRestartBackoffTimeStrategy'.
>> As far as I understand from the documentation, the expected behavior is
>> that the JobManager tries to restart the job, which would bring up a new
>> TaskManager on another machine, but it did not.
>> We also ran a test: we started a new job and just killed the TaskManager,
>> and the job restarted as expected.
>> This is what confuses us most. If anyone knows what happened, we would be
>> grateful.
>> The JobManager and TaskManager logs are appended below.
