AppMaster failed to launch container on an alternate NodeManager after a connection timeout to one NM

Khireswar Kalita Sun, 11 Feb 2018 07:37:23 -0800

Dear friends,

I need some help finding the root cause of this issue.


In a failed Sqoop job, we noticed that the app master wasn't able
to connect to an NM because of connection timeouts, and it kept
retrying the connection for close to 2 hrs, until it was killed manually.
The timeouts were due to a temporary network issue.

Here is an overview of what happened:

RM <-----> NM01(hdpn01)  Network ok
RM <-----> NM08(hdpn08)  Network ok
NM01 <---X---> NM08 Network failed

The AppMaster container was launched on the NM01 node.

Here is a brief log:

INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before
Scheduling: PendingReds:0 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:0
AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:0 ContRel:0
HostLocal:0 RackLocal:0
2018-02-03 21:12:51,734 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources()
for application_1517675224254_1052: ask=1 release= 0 newContainers=0
finishedContainers=0 resourcelimit=<memory:2776576, vCores:1> knownNMs=24
2018-02-03 21:12:52,751 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated
containers 1
2018-02-03 21:12:52,793 INFO [RMCommunicator Allocator]
org.apache.hadoop.yarn.util.RackResolver: Resolved hdpn08.ztpl.net to
/default-rack
2018-02-03 21:12:52,797 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned
container container_1517675224254_1052_02_000002 to
attempt_1517675224254_1052_m_000000_1000
2018-02-03 21:12:52,799 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After
Scheduling: PendingReds:0 ScheduledMaps:0 Sc

......................

2018-02-03 22:43:58,911 WARN [ContainerLauncher #0]
org.apache.hadoop.ipc.Client: Failed to connect to server:
hdpn08.ztpl.net/172.20.1.108:45454: retries get failed due to exceeded
maximum allowed retries number: 0
java.net.ConnectException: Connection timed out
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
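
To put a number on the gap visible in the excerpt above, here is a minimal sketch that subtracts the two timestamps copied from the log (container assignment and the connection-timeout WARN). The function and constant names are illustrative only, and the result covers just the excerpt shown, not the full period until the job was killed.

```python
from datetime import datetime

# Timestamps copied from the log excerpt above.
ASSIGNED_AT = "2018-02-03 21:12:52,797"  # container assigned to the map attempt
FAILED_AT = "2018-02-03 22:43:58,911"    # connection-timeout WARN

# Timestamp layout used in the excerpt: date, time, comma, milliseconds.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds()


if __name__ == "__main__":
    secs = elapsed_seconds(ASSIGNED_AT, FAILED_AT)
    print(f"{secs / 60:.1f} minutes between assignment and the WARN shown")
```

So the AM had already been trying to reach hdpn08 for roughly an hour and a half at the point of this WARN.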


Why did the AM keep retrying the connection to the NM on hdpn08 for close to
2 hrs, until it was manually killed? If it hadn't been killed, it would
presumably have continued for much longer.

Why did the AM not stop trying after some fixed number of attempts? Is there
a max-attempts property for the application master?
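
A minimal sketch for checking which retry-related settings are explicitly set in a site file (for example yarn-site.xml or mapred-site.xml). The keys listed are standard Hadoop 2.x property names, but whether any of them governs the behaviour described here is an assumption, and the sample XML and its value are made up purely to show the file shape.

```python
import xml.etree.ElementTree as ET

# Retry-related keys from Hadoop 2.x; treating them as relevant to this
# failure is an assumption, not something the log confirms.
CANDIDATE_KEYS = [
    "yarn.client.nodemanager-connect.max-wait-ms",
    "yarn.client.nodemanager-connect.retry-interval-ms",
    "ipc.client.connect.max.retries.on.timeouts",
    "mapreduce.map.maxattempts",
    "mapreduce.am.max-attempts",
]


def read_properties(site_xml: str, keys):
    """Map each key to its value in a Hadoop *-site.xml string, or None if unset."""
    root = ET.fromstring(site_xml)
    found = {}
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        value = prop.findtext("value", default="").strip()
        found[name] = value
    return {key: found.get(key) for key in keys}


# Made-up sample content, only to illustrate the expected file layout.
SAMPLE_SITE_XML = """
<configuration>
  <property>
    <name>mapreduce.map.maxattempts</name>
    <value>4</value>
  </property>
</configuration>
"""

if __name__ == "__main__":
    for key, value in read_properties(SAMPLE_SITE_XML, CANDIDATE_KEYS).items():
        shown = value if value is not None else "(not set in this file)"
        print(f"{key} = {shown}")
```

A key reported as not set falls back to the default shipped with the installed Hadoop version, and those defaults differ between releases.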

Why did the AM not launch another map task attempt on a different node to
compensate for this problematic task?



Thanks
Khireswar Kalita

