Dear friends, I need some help finding the root cause of this issue.
In a Sqoop job failure, we noticed that the application master (AM) could not connect to a NodeManager (NM) because of connection timeouts, and it kept retrying the connection for close to 2 hours until the job was killed manually. The timeouts were due to a temporary network issue.

Here is an overview of what happened:

RM <-----> NM01 (hdpn01): network OK
RM <-----> NM08 (hdpn08): network OK
NM01 <---X---> NM08: network failed

The AppMaster container was launched on the NM01 node.

Here is a brief log:

INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:0 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:0 ContRel:0 HostLocal:0 RackLocal:0
2018-02-03 21:12:51,734 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1517675224254_1052: ask=1 release= 0 newContainers=0 finishedContainers=0 resourcelimit=<memory:2776576, vCores:1> knownNMs=24
2018-02-03 21:12:52,751 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 1
2018-02-03 21:12:52,793 INFO [RMCommunicator Allocator] org.apache.hadoop.yarn.util.RackResolver: Resolved hdpn08.ztpl.net to /default-rack
2018-02-03 21:12:52,797 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1517675224254_1052_02_000002 to attempt_1517675224254_1052_m_000000_1000
2018-02-03 21:12:52,799 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:0 Sc ......................
2018-02-03 22:43:58,911 WARN [ContainerLauncher #0] org.apache.hadoop.ipc.Client: Failed to connect to server: hdpn08.ztpl.net/172.20.1.108:45454: retries get failed due to exceeded maximum allowed retries number: 0
java.net.ConnectException: Connection timed out
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)

My questions:

Why did the AM keep retrying the connection to the NM on hdpn08 for 2 hours, until it was manually killed? If it hadn't been killed, it would have continued for much longer.
Why did the AM not stop after some fixed number of tries? Is there a maximum-attempts property for the application master?
Why did the AM not launch another map task attempt to compensate for this problematic task?

Thanks,
Khireswar Kalita