Hi,

I am trying to understand why about 9 minutes pass between the last attempt 
to start a container and the point where the YarnApplicationMasterRunner 
finally receives the SIGTERM that kills it.

Client:



Calc Engine: 2017-08-28 12:39:23,596 INFO  
org.apache.flink.yarn.YarnClusterClient                       - Waiting until 
all TaskManagers have connected

Calc Engine: Waiting until all TaskManagers have connected

Calc Engine: 2017-08-28 12:39:23,600 INFO  
org.apache.flink.yarn.YarnClusterClient                       - Starting client 
actor system.

Calc Engine: 2017-08-28 12:39:24,077 INFO  akka.event.slf4j.Slf4jLogger         
                         - Slf4jLogger started

Calc Engine: 2017-08-28 12:39:24,366 INFO  Remoting                             
                         - Remoting started; listening on addresses 
:[akka.tcp://fl...@dlp-qa-176378-023.dc.gs.com:39353]

Calc Engine: 2017-08-28 12:39:24,609 INFO  
org.apache.flink.yarn.YarnClusterClient                       - TaskManager 
status (0/4)

Calc Engine: TaskManager status (0/4)

Calc Engine: 2017-08-28 12:39:29,864 INFO  
org.apache.flink.yarn.YarnClusterClient                       - TaskManager 
status (1/4)

Calc Engine: TaskManager status (1/4)

Calc Engine: 2017-08-28 12:39:30,389 INFO  
org.apache.flink.yarn.YarnClusterClient                       - TaskManager 
status (2/4)

Calc Engine: TaskManager status (2/4)

Calc Engine: 2017-08-28 12:41:04,920 INFO  
org.apache.flink.yarn.YarnClusterClient                       - TaskManager 
status (1/4)

Calc Engine: TaskManager status (1/4)

Calc Engine: 2017-08-28 12:41:13,775 INFO  
org.apache.flink.yarn.YarnClusterClient                       - TaskManager 
status (0/4)

Calc Engine: TaskManager status (0/4)

Calc Engine: 2017-08-28 12:50:43,133 WARN  
akka.remote.ReliableDeliverySupervisor                        - Association 
with remote system [akka.tcp://fl...@d191303-019.dc.gs.com:58084] has failed, 
address is now gated for [5000] ms. Reason: [Disassociated]



Logs:


Container id: container_e71_1503688027943_30786_01_000013

Exit code: 134

Stack trace: ExitCodeException exitCode=134:

        at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)

        at org.apache.hadoop.util.Shell.run(Shell.java:455)

        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)

        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:293)

        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)

        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)

        at java.util.concurrent.FutureTask.run(FutureTask.java:262)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

        at java.lang.Thread.run(Thread.java:745)



Shell output: main : command provided 1

main : user is delp

main : requested yarn user is delp



Container exited with a non-zero exit code 134



17/08/28 12:39:51 INFO yarn.YarnFlinkResourceManager: Total number of failed 
containers so far: 5

17/08/28 12:39:51 ERROR yarn.YarnFlinkResourceManager: Stopping YARN session 
because the number of failed containers (5) exceeded the maximum failed 
containers (4). This number is controlled by the 
'yarn.maximum-failed-containers' configuration setting. By default its the 
number of requested containers.

17/08/28 12:39:51 INFO yarn.YarnFlinkResourceManager: Shutting down cluster 
with status FAILED : Stopping YARN session because the number of failed 
containers (5) exceeded the maximum failed containers (4). This number is 
controlled by the 'yarn.maximum-failed-containers' configuration setting. By 
default its the number of requested containers.

17/08/28 12:39:51 INFO yarn.YarnFlinkResourceManager: Unregistering application 
from the YARN Resource Manager

17/08/28 12:39:51 INFO impl.AMRMClientImpl: Waiting for application to be 
successfully unregistered.

17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
d191303-010.dc.gs.com:45454

17/08/28 12:39:51 INFO impl.AMRMClientAsyncImpl: Interrupted while waiting for 
queue

java.lang.InterruptedException

        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)

        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)

        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)

        at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:274)

17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
d191303-010.dc.gs.com:45454

17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
d191303-019.dc.gs.com:45454

17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
d191303-010.dc.gs.com:45454

17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
d191303-016.dc.gs.com:45454

17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
d191303-013.dc.gs.com:45454

17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
d191303-019.dc.gs.com:45454

17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
d191303-019.dc.gs.com:45454

17/08/28 12:39:52 WARN remote.ReliableDeliverySupervisor: Association with 
remote system [akka.tcp://fl...@d191303-010.dc.gs.com:48786] has failed, 
address is now gated for [5000] ms. Reason: [Disassociated]

17/08/28 12:39:52 WARN remote.ReliableDeliverySupervisor: Association with 
remote system [akka.tcp://fl...@d191303-016.dc.gs.com:58367] has failed, 
address is now gated for [5000] ms. Reason: [Disassociated]

17/08/28 12:40:01 WARN remote.ReliableDeliverySupervisor: Association with 
remote system [akka.tcp://fl...@d191303-016.dc.gs.com:58367] has failed, 
address is now gated for [5000] ms. Reason: [Association failed with 
[akka.tcp://fl...@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: 
d191303-016.dc.gs.com/10.79.162.181:58367]

17/08/28 12:40:01 WARN remote.ReliableDeliverySupervisor: Association with 
remote system [akka.tcp://fl...@d191303-010.dc.gs.com:48786] has failed, 
address is now gated for [5000] ms. Reason: [Association failed with 
[akka.tcp://fl...@d191303-010.dc.gs.com:48786]] Caused by: [Connection refused: 
d191303-010.dc.gs.com/10.79.252.104:48786]

17/08/28 12:40:11 WARN remote.ReliableDeliverySupervisor: Association with 
remote system [akka.tcp://fl...@d191303-010.dc.gs.com:48786] has failed, 
address is now gated for [5000] ms. Reason: [Association failed with 
[akka.tcp://fl...@d191303-010.dc.gs.com:48786]] Caused by: [Connection refused: 
d191303-010.dc.gs.com/10.79.252.104:48786]

17/08/28 12:40:11 WARN remote.ReliableDeliverySupervisor: Association with 
remote system [akka.tcp://fl...@d191303-016.dc.gs.com:58367] has failed, 
address is now gated for [5000] ms. Reason: [Association failed with 
[akka.tcp://fl...@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: 
d191303-016.dc.gs.com/10.79.162.181:58367]

17/08/28 12:40:21 WARN remote.ReliableDeliverySupervisor: Association with 
remote system [akka.tcp://fl...@d191303-010.dc.gs.com:48786] has failed, 
address is now gated for [5000] ms. Reason: [Association failed with 
[akka.tcp://fl...@d191303-010.dc.gs.com:48786]] Caused by: [Connection refused: 
d191303-010.dc.gs.com/10.79.252.104:48786]

17/08/28 12:40:21 WARN remote.ReliableDeliverySupervisor: Association with 
remote system [akka.tcp://fl...@d191303-016.dc.gs.com:58367] has failed, 
address is now gated for [5000] ms. Reason: [Association failed with 
[akka.tcp://fl...@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: 
d191303-016.dc.gs.com/10.79.162.181:58367]

17/08/28 12:40:31 WARN remote.ReliableDeliverySupervisor: Association with 
remote system [akka.tcp://fl...@d191303-016.dc.gs.com:58367] has failed, 
address is now gated for [5000] ms. Reason: [Association failed with 
[akka.tcp://fl...@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: 
d191303-016.dc.gs.com/10.79.162.181:58367]

17/08/28 12:40:31 WARN remote.ReliableDeliverySupervisor: Association with 
remote system [akka.tcp://fl...@d191303-010.dc.gs.com:48786] has failed, 
address is now gated for [5000] ms. Reason: [Association failed with 
[akka.tcp://fl...@d191303-010.dc.gs.com:48786]] Caused by: [Connection refused: 
d191303-010.dc.gs.com/10.79.252.104:48786]

17/08/28 12:40:41 WARN remote.ReliableDeliverySupervisor: Association with 
remote system [akka.tcp://fl...@d191303-016.dc.gs.com:58367] has failed, 
address is now gated for [5000] ms. Reason: [Association failed with 
[akka.tcp://fl...@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: 
d191303-016.dc.gs.com/10.79.162.181:58367]

17/08/28 12:40:41 WARN remote.ReliableDeliverySupervisor: Association with 
remote system [akka.tcp://fl...@d191303-010.dc.gs.com:48786] has failed, 
address is now gated for [5000] ms. Reason: [Association failed with 
[akka.tcp://fl...@d191303-010.dc.gs.com:48786]] Caused by: [Connection refused: 
d191303-010.dc.gs.com/10.79.252.104:48786]

17/08/28 12:40:51 WARN remote.ReliableDeliverySupervisor: Association with 
remote system [akka.tcp://fl...@d191303-016.dc.gs.com:58367] has failed, 
address is now gated for [5000] ms. Reason: [Association failed with 
[akka.tcp://fl...@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: 
d191303-016.dc.gs.com/10.79.162.181:58367]

17/08/28 12:40:51 WARN remote.ReliableDeliverySupervisor: Association with 
remote system [akka.tcp://fl...@d191303-010.dc.gs.com:48786] has failed, 
address is now gated for [5000] ms. Reason: [Association failed with 
[akka.tcp://fl...@d191303-010.dc.gs.com:48786]] Caused by: [Connection refused: 
d191303-010.dc.gs.com/10.79.252.104:48786]

17/08/28 12:41:01 WARN remote.ReliableDeliverySupervisor: Association with 
remote system [akka.tcp://fl...@d191303-010.dc.gs.com:48786] has failed, 
address is now gated for [5000] ms. Reason: [Association failed with 
[akka.tcp://fl...@d191303-010.dc.gs.com:48786]] Caused by: [Connection refused: 
d191303-010.dc.gs.com/10.79.252.104:48786]

17/08/28 12:41:01 WARN remote.ReliableDeliverySupervisor: Association with 
remote system [akka.tcp://fl...@d191303-016.dc.gs.com:58367] has failed, 
address is now gated for [5000] ms. Reason: [Association failed with 
[akka.tcp://fl...@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: 
d191303-016.dc.gs.com/10.79.162.181:58367]

17/08/28 12:41:04 WARN remote.RemoteWatcher: Detected unreachable: 
[akka.tcp://fl...@d191303-010.dc.gs.com:48786]

17/08/28 12:41:04 INFO yarn.YarnJobManager: Task manager 
akka.tcp://fl...@d191303-010.dc.gs.com:48786/user/taskmanager terminated.

17/08/28 12:41:04 INFO instance.InstanceManager: Unregistered task manager 
d191303-010.dc.gs.com/10.79.252.104. Number of registered task managers 1. 
Number of available slots 2.

17/08/28 12:41:11 WARN remote.ReliableDeliverySupervisor: Association with 
remote system [akka.tcp://fl...@d191303-016.dc.gs.com:58367] has failed, 
address is now gated for [5000] ms. Reason: [Association failed with 
[akka.tcp://fl...@d191303-016.dc.gs.com:58367]] Caused by: [Connection refused: 
d191303-016.dc.gs.com/10.79.162.181:58367]

17/08/28 12:41:13 WARN remote.RemoteWatcher: Detected unreachable: 
[akka.tcp://fl...@d191303-016.dc.gs.com:58367]

17/08/28 12:41:13 INFO yarn.YarnJobManager: Task manager 
akka.tcp://fl...@d191303-016.dc.gs.com:58367/user/taskmanager terminated.

17/08/28 12:41:13 INFO instance.InstanceManager: Unregistered task manager 
d191303-016.dc.gs.com/10.79.162.181. Number of registered task managers 0. 
Number of available slots 0.

17/08/28 12:50:42 INFO yarn.YarnApplicationMasterRunner: RECEIVED SIGNAL 15: 
SIGTERM. Shutting down as requested.

17/08/28 12:50:42 INFO webmonitor.WebRuntimeMonitor: Removing web dashboard 
root cache directory /tmp/flink-web-d1eebf19-098f-419e-859e-101cfd6c0749

17/08/28 12:50:42 INFO webmonitor.WebRuntimeMonitor: Removing web dashboard jar 
upload directory /tmp/flink-web-4d9bcf76-ddcb-4dbe-b91d-4a8d8da3d716

17/08/28 12:50:42 INFO blob.BlobServer: Stopped BLOB server at 0.0.0.0:35815
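
To put a number on the delay, here is a minimal Python sketch that takes timestamps copied from the application master log above and prints the gaps before the SIGTERM. The helper name gap_seconds and the variable names are only illustrative; the timestamps themselves are the ones in the log.

```python
from datetime import datetime

# The application master log uses a two-digit year: yy/MM/dd HH:mm:ss.
LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"


def gap_seconds(start: str, end: str) -> int:
    """Return the number of whole seconds between two log timestamps."""
    delta = datetime.strptime(end, LOG_TIME_FORMAT) - datetime.strptime(
        start, LOG_TIME_FORMAT
    )
    return int(delta.total_seconds())


# Timestamps copied from the log lines above.
shutdown_decided = "17/08/28 12:39:51"      # "Stopping YARN session because ..."
last_tm_unregistered = "17/08/28 12:41:13"  # "Number of registered task managers 0"
sigterm_received = "17/08/28 12:50:42"      # "RECEIVED SIGNAL 15: SIGTERM"

for label, start in [
    ("shutdown decision -> SIGTERM", shutdown_decided),
    ("last task manager unregistered -> SIGTERM", last_tm_unregistered),
]:
    seconds = gap_seconds(start, sigterm_received)
    print(f"{label}: {seconds // 60} min {seconds % 60} s")
```

So the application master sits idle for roughly 9.5 minutes after the last task manager is unregistered, and for almost 11 minutes after it decided to shut the session down.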




Regina Chan
Goldman Sachs - Enterprise Platforms, Data Architecture
30 Hudson Street, 37th floor | Jersey City, NY 07302 *  (212) 902-5697
