Hi,

You should still be able to get the Flink logs via:

> yarn logs -applicationId application_1623861596410_0010

That should tell you more about what happened.
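If the aggregated log is large, it can also help to dump it into a file and to check the diagnostics the ResourceManager recorded for the application. A minimal sketch (the output file name is just an example; this assumes log aggregation is enabled on your cluster):

> # Dump all container logs into a local file
> yarn logs -applicationId application_1623861596410_0010 > flink_0010.log
> # Show the final state and diagnostics reported by the ResourceManager
> yarn application -status application_1623861596410_0010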

Regarding the Flink and YARN behaviour, have you seen the documentation [1],
especially this part?

> Failed containers (including the JobManager) are replaced by YARN. The
> maximum number of JobManager container restarts is configured via
> yarn.application-attempts (default 1). The YARN Application will fail once
> all attempts are exhausted.

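If a single JobManager failure should not take down the whole application, you can raise that limit. A minimal sketch of the relevant settings (example values; the effective limit is also capped by YARN's yarn.resourcemanager.am.max-attempts):

# flink-conf.yaml (example value)
yarn.application-attempts: 10

<!-- yarn-site.xml: the cluster-wide cap that yarn.application-attempts cannot exceed -->
<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>10</value>
</property>

Note that without high availability configured, a restarted JobManager will not recover the running jobs, so for real fault tolerance you would also want to set up HA.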

Best,
Piotrek

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/yarn/#flink-on-yarn-reference

On Mon, Jun 28, 2021 at 02:26 Thomas Wang <w...@datability.io> wrote:

> Just found some additional info. It looks like one of the EC2 instances
> got terminated at the time the crash happened, and this job had 7 Task
> Managers running on that instance. I now suspect that when YARN tried to
> migrate those Task Managers, there were no idle containers, as this job
> was using roughly 99% of the entire cluster. However, in that case,
> shouldn't YARN wait for containers to become available? I'm not quite sure
> how Flink would behave in this case. Could someone provide some insights
> here? Thanks.
>
> Thomas
>
> On Sun, Jun 27, 2021 at 4:24 PM Thomas Wang <w...@datability.io> wrote:
>
>> Hi,
>>
>> I recently experienced a job crash due to the underlying YARN application
>> failing for some reason. Below is the only error message I saw; it seems I
>> can no longer see any of the Flink job logs.
>>
>> Application application_1623861596410_0010 failed 1 times (global limit
>> =2; local limit is =1) due to ApplicationMaster for attempt
>> appattempt_1623861596410_0010_000001 timed out. Failing the application.
>>
>> I was running the Flink job in YARN session mode, started with the
>> following command.
>>
>> export HADOOP_CLASSPATH=`hadoop classpath` &&
>> /usr/lib/flink/bin/yarn-session.sh -jm 7g -tm 7g -s 4 --detached
>>
>> I didn't have HA set up, but I believe the underlying YARN application
>> caused the crash, because if the Flink job itself had failed for some
>> reason, the YARN application should still have survived. Please correct me
>> if this is not the right assumption.
>>
>> My question is: how should I find the root cause in this case, and what's
>> the recommended way to avoid this going forward?
>>
>> Thanks.
>>
>> Thomas
>>
>
