[jira] [Commented] (FLINK-8899) Submitting YARN job with FLIP-6 may lead to ApplicationAttemptNotFoundException
[ https://issues.apache.org/jira/browse/FLINK-8899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407618#comment-16407618 ]

Nico Kruber commented on FLINK-8899:

It should, however, also not be in "minor" priority as this may affect user experience - as do all the other mentioned exceptions (which should get JIRA tickets). Every exception in the log will potentially make the users (and us) investigate it and burn a lot of time.

> Submitting YARN job with FLIP-6 may lead to ApplicationAttemptNotFoundException
> ---
>
> Key: FLINK-8899
> URL: https://issues.apache.org/jira/browse/FLINK-8899
> Project: Flink
> Issue Type: Bug
> Components: ResourceManager, YARN
> Affects Versions: 1.5.0
> Reporter: Nico Kruber
> Priority: Minor
> Labels: flip-6
>
> Occasionally, running a simple word count as this
> {code}
> ./bin/flink run -m yarn-cluster -yjm 768 -ytm 3072 -ys 2 -p 20 -c org.apache.flink.streaming.examples.wordcount.WordCount ./examples/streaming/WordCount.jar --input /usr/share/doc/rsync-3.0.6/COPYING
> {code}
> leads to an {{ApplicationAttemptNotFoundException}} in the logs:
> {code}
> 2018-03-08 16:18:08,507 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Streaming WordCount (df707a3c9817ddf5936efe56d427e2bd) switched from state RUNNING to FINISHED.
> 2018-03-08 16:18:08,508 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job df707a3c9817ddf5936efe56d427e2bd
> 2018-03-08 16:18:08,508 INFO org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore - Shutting down
> 2018-03-08 16:18:08,536 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Job df707a3c9817ddf5936efe56d427e2bd reached globally terminal state FINISHED.
> 2018-03-08 16:18:08,611 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job Streaming WordCount(df707a3c9817ddf5936efe56d427e2bd).
> 2018-03-08 16:18:08,634 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection dcfdc329d61aae0ace2de26292c8916b: JobManager is shutting down..
> 2018-03-08 16:18:08,634 INFO org.apache.flink.yarn.YarnResourceManager - Disconnect job manager 0...@akka.tcp://fl...@ip-172-31-2-0.eu-west-1.compute.internal:38555/user/jobmanager_0 for job df707a3c9817ddf5936efe56d427e2bd from the resource manager.
> 2018-03-08 16:18:08,664 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.
> 2018-03-08 16:18:08,664 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.
> 2018-03-08 16:18:08,664 INFO org.apache.flink.runtime.jobmaster.JobManagerRunner - JobManagerRunner already shutdown.
> 2018-03-08 16:18:09,650 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Register TaskManager adc8090bdb3f7052943ff86bde7d2a7b at the SlotManager.
> 2018-03-08 16:18:09,654 INFO org.apache.flink.yarn.YarnResourceManager - Replacing old instance of worker for ResourceID container_1519984124671_0090_01_05
> 2018-03-08 16:18:09,654 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Unregister TaskManager adc8090bdb3f7052943ff86bde7d2a7b from the SlotManager.
> 2018-03-08 16:18:09,654 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Register TaskManager b975dbd16e0fd59c1168d978490a4b76 at the SlotManager.
> 2018-03-08 16:18:09,654 INFO org.apache.flink.yarn.YarnResourceManager - The target with resource ID container_1519984124671_0090_01_05 is already been monitored.
> 2018-03-08 16:18:09,992 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Register TaskManager 73c258a0dbad236501b8391971c330ba at the SlotManager.
> 2018-03-08 16:18:10,000 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
> 2018-03-08 16:18:10,028 ERROR org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl - Exception on heartbeat
> org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: Application attempt appattempt_1519984124671_0090_01 doesn't exist in ApplicationMasterService cache.
>     at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:403)
>     at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>
[jira] [Commented] (FLINK-8899) Submitting YARN job with FLIP-6 may lead to ApplicationAttemptNotFoundException
[ https://issues.apache.org/jira/browse/FLINK-8899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16406493#comment-16406493 ]

Till Rohrmann commented on FLINK-8899:
--

Agreed this is not a blocker for 1.5.

> Submitting YARN job with FLIP-6 may lead to ApplicationAttemptNotFoundException
> ---
>
> Key: FLINK-8899
> URL: https://issues.apache.org/jira/browse/FLINK-8899
> Project: Flink
> Issue Type: Bug
> Components: ResourceManager, YARN
> Affects Versions: 1.5.0
> Reporter: Nico Kruber
> Priority: Minor
> Labels: flip-6
>
> Occasionally, running a simple word count as this
> {code}
> ./bin/flink run -m yarn-cluster -yjm 768 -ytm 3072 -ys 2 -p 20 -c org.apache.flink.streaming.examples.wordcount.WordCount ./examples/streaming/WordCount.jar --input /usr/share/doc/rsync-3.0.6/COPYING
> {code}
> leads to an {{ApplicationAttemptNotFoundException}} in the logs:
> {code}
> 2018-03-08 16:18:08,507 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Streaming WordCount (df707a3c9817ddf5936efe56d427e2bd) switched from state RUNNING to FINISHED.
> 2018-03-08 16:18:08,508 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job df707a3c9817ddf5936efe56d427e2bd
> 2018-03-08 16:18:08,508 INFO org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore - Shutting down
> 2018-03-08 16:18:08,536 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Job df707a3c9817ddf5936efe56d427e2bd reached globally terminal state FINISHED.
> 2018-03-08 16:18:08,611 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job Streaming WordCount(df707a3c9817ddf5936efe56d427e2bd).
> 2018-03-08 16:18:08,634 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection dcfdc329d61aae0ace2de26292c8916b: JobManager is shutting down..
> 2018-03-08 16:18:08,634 INFO org.apache.flink.yarn.YarnResourceManager - Disconnect job manager 0...@akka.tcp://fl...@ip-172-31-2-0.eu-west-1.compute.internal:38555/user/jobmanager_0 for job df707a3c9817ddf5936efe56d427e2bd from the resource manager.
> 2018-03-08 16:18:08,664 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.
> 2018-03-08 16:18:08,664 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.
> 2018-03-08 16:18:08,664 INFO org.apache.flink.runtime.jobmaster.JobManagerRunner - JobManagerRunner already shutdown.
> 2018-03-08 16:18:09,650 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Register TaskManager adc8090bdb3f7052943ff86bde7d2a7b at the SlotManager.
> 2018-03-08 16:18:09,654 INFO org.apache.flink.yarn.YarnResourceManager - Replacing old instance of worker for ResourceID container_1519984124671_0090_01_05
> 2018-03-08 16:18:09,654 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Unregister TaskManager adc8090bdb3f7052943ff86bde7d2a7b from the SlotManager.
> 2018-03-08 16:18:09,654 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Register TaskManager b975dbd16e0fd59c1168d978490a4b76 at the SlotManager.
> 2018-03-08 16:18:09,654 INFO org.apache.flink.yarn.YarnResourceManager - The target with resource ID container_1519984124671_0090_01_05 is already been monitored.
> 2018-03-08 16:18:09,992 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Register TaskManager 73c258a0dbad236501b8391971c330ba at the SlotManager.
> 2018-03-08 16:18:10,000 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
> 2018-03-08 16:18:10,028 ERROR org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl - Exception on heartbeat
> org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: Application attempt appattempt_1519984124671_0090_01 doesn't exist in ApplicationMasterService cache.
>     at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:403)
>     at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>     at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>     at
>
[jira] [Commented] (FLINK-8899) Submitting YARN job with FLIP-6 may lead to ApplicationAttemptNotFoundException
[ https://issues.apache.org/jira/browse/FLINK-8899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16406458#comment-16406458 ]

Piotr Nowojski commented on FLINK-8899:
---

This exception is not thrown on job submission but during shutdown. One out of every ~30 runs on a test cluster ends up with this error on task termination on one of the TaskManagers.

I don't think it's a critical/blocking issue, especially since Flink often logs a lot of other exceptions on job termination.

cc [~till.rohrmann]

> Submitting YARN job with FLIP-6 may lead to ApplicationAttemptNotFoundException
> ---
>
> Key: FLINK-8899
> URL: https://issues.apache.org/jira/browse/FLINK-8899
> Project: Flink
> Issue Type: Bug
> Components: ResourceManager, YARN
> Affects Versions: 1.5.0
> Reporter: Nico Kruber
> Assignee: Piotr Nowojski
> Priority: Blocker
> Labels: flip-6
> Fix For: 1.5.0
>
>
> Occasionally, running a simple word count as this
> {code}
> ./bin/flink run -m yarn-cluster -yjm 768 -ytm 3072 -ys 2 -p 20 -c org.apache.flink.streaming.examples.wordcount.WordCount ./examples/streaming/WordCount.jar --input /usr/share/doc/rsync-3.0.6/COPYING
> {code}
> leads to an {{ApplicationAttemptNotFoundException}} in the logs:
> {code}
> 2018-03-08 16:18:08,507 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Streaming WordCount (df707a3c9817ddf5936efe56d427e2bd) switched from state RUNNING to FINISHED.
> 2018-03-08 16:18:08,508 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job df707a3c9817ddf5936efe56d427e2bd
> 2018-03-08 16:18:08,508 INFO org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore - Shutting down
> 2018-03-08 16:18:08,536 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Job df707a3c9817ddf5936efe56d427e2bd reached globally terminal state FINISHED.
> 2018-03-08 16:18:08,611 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job Streaming WordCount(df707a3c9817ddf5936efe56d427e2bd).
> 2018-03-08 16:18:08,634 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection dcfdc329d61aae0ace2de26292c8916b: JobManager is shutting down..
> 2018-03-08 16:18:08,634 INFO org.apache.flink.yarn.YarnResourceManager - Disconnect job manager 0...@akka.tcp://fl...@ip-172-31-2-0.eu-west-1.compute.internal:38555/user/jobmanager_0 for job df707a3c9817ddf5936efe56d427e2bd from the resource manager.
> 2018-03-08 16:18:08,664 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.
> 2018-03-08 16:18:08,664 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.
> 2018-03-08 16:18:08,664 INFO org.apache.flink.runtime.jobmaster.JobManagerRunner - JobManagerRunner already shutdown.
> 2018-03-08 16:18:09,650 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Register TaskManager adc8090bdb3f7052943ff86bde7d2a7b at the SlotManager.
> 2018-03-08 16:18:09,654 INFO org.apache.flink.yarn.YarnResourceManager - Replacing old instance of worker for ResourceID container_1519984124671_0090_01_05
> 2018-03-08 16:18:09,654 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Unregister TaskManager adc8090bdb3f7052943ff86bde7d2a7b from the SlotManager.
> 2018-03-08 16:18:09,654 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Register TaskManager b975dbd16e0fd59c1168d978490a4b76 at the SlotManager.
> 2018-03-08 16:18:09,654 INFO org.apache.flink.yarn.YarnResourceManager - The target with resource ID container_1519984124671_0090_01_05 is already been monitored.
> 2018-03-08 16:18:09,992 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Register TaskManager 73c258a0dbad236501b8391971c330ba at the SlotManager.
> 2018-03-08 16:18:10,000 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
> 2018-03-08 16:18:10,028 ERROR org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl - Exception on heartbeat
> org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: Application attempt appattempt_1519984124671_0090_01 doesn't exist in ApplicationMasterService cache.
>     at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:403)
>
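Piotr's point that the exception shows up during shutdown rather than on submission can be checked mechanically against a collected log. The following is only a rough sketch, not part of Flink or of this ticket: the function name `classify_exception` is hypothetical, and it assumes that the two log messages quoted in the issue ("reached globally terminal state" and "RECEIVED SIGNAL 15: SIGTERM") are reasonable markers for the start of shutdown.

```python
# Hypothetical helper, not part of Flink: decides whether the exception
# from FLINK-8899 appears before or after the log shows shutdown starting.

# Assumption: these two messages (taken from the log quoted in the issue)
# indicate that the job has finished or the cluster entrypoint is stopping.
SHUTDOWN_MARKERS = (
    "reached globally terminal state",
    "RECEIVED SIGNAL 15: SIGTERM",
)
EXCEPTION_NAME = "ApplicationAttemptNotFoundException"


def classify_exception(lines):
    """Classify the first occurrence of the exception in an iterable of log lines.

    Returns "absent" if the exception never appears, "during shutdown" if a
    shutdown marker was logged earlier, and "before shutdown" otherwise.
    """
    shutdown_seen = False
    for line in lines:
        if any(marker in line for marker in SHUTDOWN_MARKERS):
            shutdown_seen = True
        if EXCEPTION_NAME in line:
            return "during shutdown" if shutdown_seen else "before shutdown"
    return "absent"
```

It could be applied to a saved JobManager log with something like `classify_exception(open(path))`, where `path` is wherever the log was stored; repeating this over the logs of many runs would give a count comparable to the "one out of every ~30 runs" estimate.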