Hi,

I recently experienced a job crash because the underlying YARN
application failed for some reason. The message below is the only error
I saw, and it seems I can no longer see any of the Flink job logs.

Application application_1623861596410_0010 failed 1 times (global limit =2;
local limit is =1) due to ApplicationMaster for attempt
appattempt_1623861596410_0010_000001 timed out. Failing the application.
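
So far the only idea I have is pulling the application status and the
aggregated YARN logs for the dead application, roughly like this
(assuming log aggregation is enabled on the cluster; the output file
name is just an example):

yarn application -status application_1623861596410_0010
yarn logs -applicationId application_1623861596410_0010 > app_0010.log

Is that the right place to look for the JobManager logs, or do they end
up somewhere else once the application is gone?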

I was running the Flink job in YARN session mode, started with the
following command.

export HADOOP_CLASSPATH=`hadoop classpath` &&
/usr/lib/flink/bin/yarn-session.sh -jm 7g -tm 7g -s 4 --detached

I didn't have HA set up, but I believe the underlying YARN application
itself caused the crash, because even if the Flink job had failed for
some reason, the YARN application should still survive. Please correct
me if this is not the right assumption.

My questions are: how should I find the root cause in this case, and
what is the recommended way to avoid this going forward?
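
For the "going forward" part, my reading of the docs is that
ZooKeeper-based HA plus extra application attempts is the usual
recommendation, so I would try adding something like the following to
flink-conf.yaml (the storage dir and ZooKeeper quorum below are just
placeholders, I haven't applied this yet):

high-availability: zookeeper
high-availability.storageDir: hdfs:///flink/ha/
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
yarn.application-attempts: 10

Is that roughly the right direction?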

Thanks.

Thomas
