Re: Flink HA cluster on YARN is restarted more than yarn.application-attempts value

Kazunori Shinhira Sun, 02 Jun 2019 22:42:39 -0700

Hi, Shuyi

Thank you for your reply.

I read the blog post you suggested.

I understand that YARN don’t count container failure when container exited
with status one of PREEMPTED, ABORTED, DISK_FAILED, and
KILLED_BY_RESOURCEMANAGER,  and stopping JobManager with kill command will
cause one of that status.

Now, I have another question about Flink recovery process.

I tried to create NON HA Flink cluster and killed JobManager, then that
cluster was failed after I executed second time kill command.

The text like bellow was recorded as YARN Application’s Diagnostics message.

```

Application application_1545044154652_0154 failed 2 times due to AM
Container for appattempt_1545044154652_0154_000002 exited with exitCode: 137

Failing this attempt.Diagnostics: Container killed on request. Exit code is
137

..(omitted)

```

Furthermore following messages was recorded in ResourceManager’s log, so I
think that is expected behavior described in Flink document at
https://ci.apache.org/projects/flink/flink-docs-stable/ops/jobmanager_high_availability.html#maximum-application-master-attempts-yarn-sitexml
.

```

The specific max attempts: 10 for application: 154 is invalid, because it
is out of the range [1, 2]. Use the global max attempts instead.

```

※ This message was also logged when using HA cluster, I found this after
sending my first e-mail.

My question is that why attempts time is counted only when using NON HA
cluster ?

I use same command to kill JobManager but only when using NON HA cluster, it
looks like JobManager failure is counted towards to calculate max attempts.

Is there any other conditions to count attempts time or is there
possibility to happen another failure in recovery process when using NON HA
cluster?

I check JobManager's log but can't find difference between HA cluster and
NON HA cluster.

May be  this is a question  about Hadoop YARN implementation rather than
Apache Flink.

I’m sorry if this is inappropriate  for this mailing list.

Thanks,

2019年6月3日(月) 4:00 Shuyi Chen <suez1...@gmail.com>:

> AFAIR,  your manual kill won't count towards the max-attempt counter in
> hadoop's logic. Please see this post for more details:
> http://johnjianfang.blogspot.com/2015/04/the-number-of-maximum-attempts-of-yarn.html
> .
>
> On Sun, Jun 2, 2019 at 9:48 AM 新平和礼 <k.shinhira.1...@gmail.com> wrote:
>
>> Hi all,
>>
>>
>> I'm Flink newbie, and trying to understand Flink cluster’s recovery
>> feature using Flink 1.7.2 and YARN 2.8.
>>
>> To confirm HA cluster’s behavior, I created Flink YARN session cluster
>> and stopped JobManager repeatedly using kill command after job deployment.
>>
>> In that test, I set “yarn.application-attempts” to 5, but Flink cluster
>> was recovered more than 5 times.
>>
>>
>> Does anyone know what “yarn.application-attempts” mean, and when Flink
>> cluster’s attempts time will be incremented ?
>>
>>
>> I asked same question at stackoverflow, but I still don’t  get it.
>>
>>
>>
>> https://stackoverflow.com/questions/56225088/why-is-flink-ha-cluster-on-yarn-recovered-more-than-the-maximum-number-of-attemp
>>
>>
>>
>> Best,
>> --
>> Kazunori Shinhira
>> Mail : k.shinhira.1...@gmail.com
>>
>

-- 
Kazunori Shinhira
Mail : k.shinhira.1...@gmail.com

Re: Flink HA cluster on YARN is restarted more than yarn.application-attempts value

Reply via email to