Hi,

You should still be able to get the Flink logs via:

> yarn logs -applicationId application_1623861596410_0010

That should tell you more about what happened.

As for how Flink behaves on YARN, have you seen the documentation [1]? Especially this part:

> Failed containers (including the JobManager) are replaced by YARN. The
> maximum number of JobManager container restarts is configured via
> yarn.application-attempts (default 1). The YARN Application will fail once
> all attempts are exhausted.

Best,
Piotrek

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/yarn/#flink-on-yarn-reference

On Mon, 28 Jun 2021 at 02:26, Thomas Wang <w...@datability.io> wrote:

> Just found some additional info. It looks like one of the EC2 instances
> got terminated at the time the crash happened and this job had 7 Task
> Managers running on that EC2 instance. Now I suspect it's possible
> that when Yarn tried to migrate the Task Managers, there were no idle
> containers as this job was using like 99% of the entire cluster. However in
> that case shouldn't Yarn wait for containers to become available? I'm not
> quite sure how Flink would behave in this case. Could someone provide some
> insights here? Thanks.
>
> Thomas
>
> On Sun, Jun 27, 2021 at 4:24 PM Thomas Wang <w...@datability.io> wrote:
>
>> Hi,
>>
>> I recently experienced a job crash due to the underlying Yarn application
>> failing for some reason. Here is the only error message I saw. It seems I
>> can no longer see any of the Flink job logs.
>>
>> Application application_1623861596410_0010 failed 1 times (global limit
>> =2; local limit is =1) due to ApplicationMaster for attempt
>> appattempt_1623861596410_0010_000001 timed out. Failing the application.
>>
>> I was running the Flink job using the Yarn session mode with the
>> following command.
>>
>> export HADOOP_CLASSPATH=`hadoop classpath` &&
>> /usr/lib/flink/bin/yarn-session.sh -jm 7g -tm 7g -s 4 --detached
>>
>> I didn't have HA setup, but I believe the underlying Yarn application
>> caused the crash because if, for some reason, the Flink job failed, the
>> Yarn application should still survive. Please correct me if this is not the
>> right assumption.
>>
>> My question is how I should find the root cause in this case and what's
>> the recommended way to avoid this going forward?
>>
>> Thanks.
>>
>> Thomas
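
For reference, here is a minimal sketch of the log-collection step from the reply at the top of this thread. It saves the aggregated logs to a file and searches them for likely failure markers. The output file name and the grep patterns are assumptions chosen for illustration, not something Flink or YARN prescribes, and the commented-out last line assumes that yarn-session.sh accepts dynamic properties via -D on your Flink version.

```shell
# Application ID taken from the error message in this thread.
APP_ID="application_1623861596410_0010"
# Hypothetical output file name, chosen for this example only.
LOG_FILE="${APP_ID}.log"

# Only call the YARN CLI if it is available on this machine.
if command -v yarn >/dev/null 2>&1; then
  # Dump the aggregated container logs (JobManager and TaskManagers) to a file.
  yarn logs -applicationId "$APP_ID" > "$LOG_FILE"
  # Illustrative patterns; adjust them to whatever the logs actually contain.
  grep -n -i -E 'exception|lost|killed|timed out' "$LOG_FILE" | head -n 50
fi

# Assumption: raising the attempt count when starting the session. The error
# message reports a global limit of 2, so a value above 2 would likely have no
# effect on this cluster.
# /usr/lib/flink/bin/yarn-session.sh -jm 7g -tm 7g -s 4 --detached -Dyarn.application-attempts=2
```

Note that without HA a restarted JobManager is not expected to recover the jobs that were running in the session, so raising the attempt count alone may only keep the YARN application alive.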