Re: Flink failure rate restart not work as expect

Matthias Pohl Tue, 01 Mar 2022 01:57:41 -0800

Hi,
I second Alex's observation. Based on the logs, the task restart
functionality worked as expected: it tried to restart the tasks until it
reached the limit of 4 attempts due to the missing TaskManager. The job
cluster then shut down with an error code. At this point, YARN should pick
this up and bring up a new JobManager based on the non-zero exit code of
the Flink cluster. It would be interesting to see the YARN logs to figure
out why the cluster failover didn't work.
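
To illustrate why four failures arriving about 10 ms apart exhaust the
failure budget, here is a minimal sketch of a sliding-window failure-rate
check. It is a simplified model and not Flink's actual implementation: the
class and function names are hypothetical, and the exact boundary behavior
(whether the 4th failure itself still allows a restart) is an assumption
chosen to match the behavior reported in this thread.

```python
from collections import deque


class FailureRateModel:
    """Hypothetical model of a failure-rate restart check (not Flink code)."""

    def __init__(self, max_failures_per_interval, failures_interval_ms):
        self.max_failures = max_failures_per_interval
        self.interval_ms = failures_interval_ms
        self.timestamps = deque()  # most recent failure times, oldest first

    def notify_failure(self, now_ms):
        # Keep only the last `max_failures` failure timestamps.
        if len(self.timestamps) >= self.max_failures:
            self.timestamps.popleft()
        self.timestamps.append(now_ms)

    def can_restart(self, now_ms):
        # Once the window is full, a restart is allowed only if the oldest
        # recorded failure is older than the configured interval.
        if len(self.timestamps) >= self.max_failures:
            return now_ms - self.timestamps[0] > self.interval_ms
        return True


def simulate(failure_times_ms, max_failures=4, interval_ms=60000):
    """Return the restart decision taken right after each failure."""
    model = FailureRateModel(max_failures, interval_ms)
    decisions = []
    for t in failure_times_ms:
        model.notify_failure(t)
        decisions.append(model.can_restart(t))
    return decisions


# Four failures 10 ms apart, using the settings from the report
# (maxFailuresPerInterval=4, failuresIntervalMS=60000).
print(simulate([0, 10, 20, 30]))  # [True, True, True, False]
```

With the same settings, failures spaced 30 seconds apart would never fill
the window within one interval, so in this model every restart would be
allowed.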


Best,
Matthias

On Tue, Mar 1, 2022 at 8:00 AM Alexander Preuß <
alexanderpre...@ververica.com> wrote:

> Hi,
> At first glance it looks like the exceptions were thrown so rapidly that
> they exceeded maxFailuresPerInterval, so the FailureRateRestartStrategy
> decided not to restart. Why do you think this is different from the
> expected behavior?
>
> Best,
> Alex
>
> On Tue, Mar 1, 2022 at 3:23 AM 刘 家锹 <ljq1120799...@outlook.com> wrote:
>
>> Hi all,
>> We have run into a problem with FailureRateRestartStrategy that confuses
>> us, and we don't know how to solve it. Here's the situation:
>>
>> Flink version: 1.10.1
>> Development env: on Yarn
>>
>> FailureRateRestartStrategy: 
>> failuresIntervalMS=60000,backoffTimeMS=15000,maxFailuresPerInterval=4
>>
>> One of our Hadoop machines, on which our job's TaskManager was running,
>> became unresponsive. The JobManager then received a heartbeat timeout
>> exception, but after the exception was thrown 4 times in a very short
>> time (about 10 ms apart), the job hit the FailureRateRestartStrategy
>> limit and quit entirely. We got the message
>> 'org.apache.flink.runtime.JobException: Recovery is suppressed by
>> FailureRateRestartBackoffTimeStrategy'.
>> As far as I understand from the documentation, the expected behavior is
>> that the JobManager tries to restart the job, which brings up a new
>> TaskManager on another machine, but it did not.
>> We also ran a test: we started a new job and just killed the TaskManager,
>> and in that case the job restarted as expected.
>>
>> This is what confuses us most. If anyone knows what happened, we would
>> appreciate the help.
>>
>> The JobManager and TaskManager logs are appended below.
>>
>
>
