Just found some additional info. It looks like one of the EC2 instances was
terminated around the time the crash happened, and this job had 7 Task
Managers running on that instance. I now suspect that when Yarn tried to
migrate those Task Managers, there were no idle containers available, since
this job was using roughly 99% of the entire cluster. In that case, though,
shouldn't Yarn wait for containers to become available? I'm not quite sure
how Flink would behave in this situation. Could someone provide some
insights here? Thanks.
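
For context, the settings I've been looking at are the job restart strategy
and the slot request timeout. My current understanding is that Flink does
wait for replacement slots, but only up to slot.request.timeout, after which
the pending slot request fails and the restart strategy decides what happens
next. The values below are just an illustrative sketch based on the docs,
not what we actually run:

restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
# If no TaskManager slot becomes available within this timeout (default 5
# minutes), the slot request fails and the job goes into failure handling.
slot.request.timeout: 300000

Is that roughly the right mental model?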

Thomas

On Sun, Jun 27, 2021 at 4:24 PM Thomas Wang <w...@datability.io> wrote:

> Hi,
>
> I recently experienced a job crash due to the underlying Yarn application
> failing for some reason. Here is the only error message I saw. It seems I
> can no longer see any of the Flink job logs.
>
> Application application_1623861596410_0010 failed 1 times (global limit
> =2; local limit is =1) due to ApplicationMaster for attempt
> appattempt_1623861596410_0010_000001 timed out. Failing the application.
>
> I was running the Flink job in Yarn session mode with the following
> command.
>
> export HADOOP_CLASSPATH=`hadoop classpath` &&
> /usr/lib/flink/bin/yarn-session.sh -jm 7g -tm 7g -s 4 --detached
>
> I didn't have HA set up, but I believe it was the underlying Yarn
> application that caused the crash, because if the Flink job itself had
> failed for some reason, the Yarn application should still have survived.
> Please correct me if this is not the right assumption.
>
> My question is: how should I find the root cause in this case, and what's
> the recommended way to avoid this going forward?
>
> Thanks.
>
> Thomas
>
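
P.S. Re my original questions quoted above: for the missing logs, I'm
planning to pull the aggregated Yarn logs for the application, something
like this (assuming log aggregation is enabled on the cluster):

yarn logs -applicationId application_1623861596410_0010

And for avoiding this going forward, my understanding is that the usual
advice is to enable HA and allow more application attempts, roughly along
these lines in flink-conf.yaml (the ZooKeeper quorum and the storage path
below are placeholders, not our actual setup):

high-availability: zookeeper
high-availability.storageDir: s3://<bucket>/flink/ha
high-availability.zookeeper.quorum: <zk-host>:2181
yarn.application-attempts: 10

Please correct me if any of this is off.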
