Hi, we are seeing a weird issue where one TaskManager is lost and then never re-allocated; operators subsequently fail with NoResourceAvailableException, and after 5 restarts (we have a fixed-delay restart strategy with 5 attempts) the application goes down.
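For context, the restart behaviour is the standard fixed-delay strategy. A minimal sketch of the equivalent job-side setup (the 10-second delay is illustrative, not our actual value):

    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Give up after 5 restart attempts; the delay between attempts is illustrative.
    env.setRestartStrategy(RestartStrategies.fixedDelayRestart(5, Time.of(10, TimeUnit.SECONDS)));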
- We have explicitly set yarn.reallocate-failed: true and have not specified yarn.maximum-failed-containers (and we see "org.apache.flink.yarn.YarnApplicationMasterRunner - YARN application tolerates 145 failed TaskManager containers before giving up" in the logs).
- After the initial startup, where all 145 TaskManagers are requested, I never see any log line saying "Requesting new TaskManager container" to reallocate the failed container.
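For reference, the relevant entries in our flink-conf.yaml look roughly like this (yarn.maximum-failed-containers is shown commented out because we deliberately leave it unset; judging from the AM log line quoted above, it then defaults to the number of requested TaskManager containers, 145 here):

    yarn.reallocate-failed: true
    # yarn.maximum-failed-containers: 145   # left unset; default inferred from the AM log above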
The relevant JobManager log around the failure:

2018-08-29 23:13:56,655 WARN  akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://fl...@blahabc.sfdc.net:123]
2018-08-29 23:13:57,364 INFO  org.apache.flink.yarn.YarnJobManager - Task manager akka.tcp://fl...@blahabc.sfdc.net:123/user/taskmanager terminated.
java.lang.Exception: TaskManager was lost/killed: container_e27_1535135887442_0906_01_000039 @ blahabc.sfdc.net (dataPort=124)
    at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
    at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:523)
    at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
    at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
    at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
    at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1198)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1096)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.clusterframework.ContaineredJobManager$$anonfun$handleContainerMessage$1.applyOrElse(ContaineredJobManager.scala:107)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
    at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnShutdown$1.applyOrElse(YarnJobManager.scala:110)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
    at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
    at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
    at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:374)
    at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:511)
    at akka.actor.ActorCell.invoke(ActorCell.scala:494)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
java.lang.Exception: TaskManager was lost/killed: container_e27_1535135887442_0906_01_000039 @ blahabc.sfdc.net:42414 (dataPort=124)
    (identical stack trace repeated)
2018-08-29 23:13:58,529 INFO  org.apache.flink.runtime.instance.InstanceManager - Unregistered task manager blahabc.sfdc.net:/1.1.1.1. Number of registered task managers 144. Number of available slots 720.
2018-08-29 23:13:58,645 INFO  org.apache.flink.yarn.YarnJobManager - Received message for non-existing checkpoint 1
2018-08-29 23:14:39,969 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.
2018-08-29 23:14:39,975 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 0 checkpoints in ZooKeeper.
2018-08-29 23:14:39,975 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to fetch 0 checkpoints from storage.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #1
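As a sanity check on the numbers: the log above is consistent with 5 slots per TaskManager, since the 144 remaining TaskManagers × 5 slots = 720 available slots (725 with the full 145 containers). So once one container is lost and never re-allocated, any restart attempt that needs more than 720 slots fails with the NoResourceAvailableException shown.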
After 5 retries of our SQL query execution graph (we have configured a fixed-delay restart strategy with 5 attempts), it outputs the following:

2018-08-29 23:15:22,216 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job ab7e96a1659fbb4bfb3c6cd9bccb0335
2018-08-29 23:15:22,216 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down
2018-08-29 23:15:22,225 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /prod/link/application_1535135887442_0906/checkpoints/ab7e96a1659fbb4bfb3c6cd9bccb0335 from ZooKeeper
2018-08-29 23:15:23,460 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.
2018-08-29 23:15:23,460 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/ab7e96a1659fbb4bfb3c6cd9bccb0335 from ZooKeeper
2018-08-29 23:15:24,114 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Removed job graph ab7e96a1659fbb4bfb3c6cd9bccb0335 from ZooKeeper.
2018-08-29 23:15:26,126 INFO  org.apache.flink.yarn.YarnJobManager - Job with ID ab7e96a1659fbb4bfb3c6cd9bccb0335 is in terminal state FAILED. Shutting down session
2018-08-29 23:15:26,128 INFO  org.apache.flink.yarn.YarnJobManager - Stopping JobManager with final application status FAILED and diagnostics: The monitored job with ID ab7e96a1659fbb4bfb3c6cd9bccb0335 has failed to complete.
2018-08-29 23:15:26,129 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Shutting down cluster with status FAILED : The monitored job with ID ab7e96a1659fbb4bfb3c6cd9bccb0335 has failed to complete.
2018-08-29 23:15:26,146 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Unregistering application from the YARN Resource Manager
2018-08-29 23:15:27,997 INFO  org.apache.flink.runtime.history.FsJobArchivist - Job ab7e96a1659fbb4bfb3c6cd9bccb0335 has been archived at hdfs:/savedSearches/prod/completed-jobs/ab7e96a1659fbb4bfb3c6cd9bccb0335.

Cheers,