You are welcome :)

Piotrek
On Wed, Jun 30, 2021 at 08:34, Thomas Wang <w...@datability.io> wrote:

> Thanks Piotr. This is helpful.
>
> Thomas
>
> On Mon, Jun 28, 2021 at 8:29 AM Piotr Nowojski <pnowoj...@apache.org> wrote:
>
>> Hi,
>>
>> You should still be able to get the Flink logs via:
>>
>> > yarn logs -applicationId application_1623861596410_0010
>>
>> And it should give you more answers about what has happened.
>>
>> About the Flink and YARN behaviour, have you seen the documentation? [1]
>> Especially this part:
>>
>> > Failed containers (including the JobManager) are replaced by YARN. The
>> maximum number of JobManager container restarts is configured via
>> yarn.application-attempts (default 1). The YARN Application will fail once
>> all attempts are exhausted.
>>
>> ?
>>
>> Best,
>> Piotrek
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/yarn/#flink-on-yarn-reference
>>
>> On Mon, Jun 28, 2021 at 02:26, Thomas Wang <w...@datability.io> wrote:
>>
>>> Just found some additional info. It looks like one of the EC2 instances
>>> got terminated at the time the crash happened, and this job had 7 Task
>>> Managers running on that EC2 instance. Now I suspect it's possible that
>>> when Yarn tried to migrate the Task Managers, there were no idle
>>> containers, as this job was using about 99% of the entire cluster.
>>> However, in that case, shouldn't Yarn wait for containers to become
>>> available? I'm not quite sure how Flink would behave in this case. Could
>>> someone provide some insights here? Thanks.
>>>
>>> Thomas
>>>
>>> On Sun, Jun 27, 2021 at 4:24 PM Thomas Wang <w...@datability.io> wrote:
>>>
>>>> Hi,
>>>>
>>>> I recently experienced a job crash due to the underlying Yarn
>>>> application failing for some reason. Here is the only error message I
>>>> saw. It seems I can no longer see any of the Flink job logs.
>>>>
>>>> Application application_1623861596410_0010 failed 1 times (global limit
>>>> =2; local limit is =1) due to ApplicationMaster for attempt
>>>> appattempt_1623861596410_0010_000001 timed out. Failing the application.
>>>>
>>>> I was running the Flink job in Yarn session mode with the following
>>>> command:
>>>>
>>>> export HADOOP_CLASSPATH=`hadoop classpath` &&
>>>> /usr/lib/flink/bin/yarn-session.sh -jm 7g -tm 7g -s 4 --detached
>>>>
>>>> I didn't have HA set up, but I believe the underlying Yarn application
>>>> caused the crash, because if the Flink job itself had failed for some
>>>> reason, the Yarn application should still survive. Please correct me if
>>>> this is not the right assumption.
>>>>
>>>> My question is: how should I find the root cause in this case, and
>>>> what's the recommended way to avoid this going forward?
>>>>
>>>> Thanks.
>>>>
>>>> Thomas
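As a sketch of the advice in the thread: the documentation quoted above says the ApplicationMaster restart limit is controlled by `yarn.application-attempts` (default 1, which matches the "failed 1 times" error). The fragment below is only an illustration of that, assuming Flink 1.13 on YARN; the attempt count, ZooKeeper host, and HDFS path are hypothetical values, not settings recommended anywhere in this thread:

```yaml
# flink-conf.yaml (illustrative sketch)

# Let YARN restart a failed JobManager container instead of
# failing the whole application on the first attempt.
yarn.application-attempts: 2

# High availability, so a restarted JobManager can recover running
# jobs (quorum host and storage path below are hypothetical).
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-host:2181
high-availability.storageDir: hdfs:///flink/recovery
```

With something like this in place, the session would still be started the same way (`yarn-session.sh -jm 7g -tm 7g -s 4 --detached`), and losing the EC2 instance hosting the JobManager should cost one application attempt rather than the whole session.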