[
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194592#comment-17194592
]
Yuanbo Liu edited comment on YARN-10393 at 9/12/20, 3:26 AM:
-------------------------------------------------------------
[~Jim_Brennan] Thanks for your comments.
{quote}My one issue with it is that it will not return the most up to date
status for containers on the heartbeat after the missed one...
{quote}
Great point. Your comments made me realize I was wrong; I was thinking about
this situation:
{quote} [heartbeat:1, container:1]->rm
nm does not get the response
[heartbeat:1, container:1, container:2]->rm
rm thinks it's a duplicated heartbeat and never processes container 2.
{quote}
I thought container 2 would never get updated if the heartbeat response went
missing, which is not correct. Your patch makes sense to me.
{quote}Another question is how would these two proposals behave if the NM
misses multiple heartbeats in a row?
{quote}
I believe that if that happens, the cluster or the NM may already be in an
unusable state, so continuously resending the completed containers doesn't
seem like a bad idea?
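For illustration, a rough sketch of the "keep resending until acknowledged"
idea (this is not the actual NodeStatusUpdaterImpl code; the class and method
names below are made up for the example):
{code:java}
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only, not the real NodeStatusUpdaterImpl. The idea is to
// keep completed containers in a pending map and drop an entry only after the
// RM has acknowledged it, so a lost heartbeat response just means a harmless
// re-send on the next heartbeat.
class CompletedContainerTracker {
  private final Map<String, String> pendingCompleted = new ConcurrentHashMap<>();

  void containerCompleted(String containerId, String finalStatus) {
    pendingCompleted.put(containerId, finalStatus);
  }

  // Called when building every heartbeat: always resend unacknowledged entries.
  List<String> statusesForHeartbeat() {
    return new ArrayList<>(pendingCompleted.values());
  }

  // Called when a heartbeat response arrives with the ids the RM has processed.
  void onRmAcknowledged(Collection<String> ackedContainerIds) {
    pendingCompleted.keySet().removeAll(ackedContainerIds);
  }
}
{code}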
> MR job live lock caused by completed state container leak in heartbeat
> between node manager and RM
> --------------------------------------------------------------------------------------------------
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager, yarn
> Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3,
> 3.4.0
> Reporter: zhenzhao wang
> Assignee: zhenzhao wang
> Priority: Major
> Attachments: YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following
> analysis is based on the core dump, logs, and code collected in 2017 with
> Hadoop 2.6.2. We haven't seen it after 2.9 in our environment, but that is
> because of the RPC retry policy change and other changes; unless I missed
> something, the issue is still possible with the current code.
> *High-level description:*
> We had seen a starving-mapper issue several times: the MR job was stuck in a
> livelock state and couldn't make any progress. The queue was full, so the
> pending mapper couldn't get any resources to continue, and the application
> master failed to preempt the reducer, which left the job stuck. The reason
> the application master didn't preempt the reducer was a leaked container
> among the assigned mappers: the node manager had failed to report the
> completed container to the resource manager.
> *Detailed steps:*
>
> # Container_1501226097332_249991_01_000199 was assigned to
> attempt_1501226097332_249991_m_000095_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned
> container container_1501226097332_249991_01_000199 to
> attempt_1501226097332_249991_m_000095_0
> {code}
> # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
> Container container_1501226097332_249991_01_000199 transitioned from RUNNING
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
> Cleaning up container container_1501226097332_249991_01_000199
> {code}
> # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08
> 16:07:04,238. The heartbeat request was in fact handled by the resource
> manager, but the node manager failed to receive the response. Let's assume
> heartBeatResponseId=$hid in the node manager. According to our current
> configuration, the next heartbeat will be sent 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException:
> Connection reset by peer; Host Details : local host is: ; destination host
> is: XXXXXXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:197)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
> at
> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
> at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
> at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> at java.io.FilterInputStream.read(FilterInputStream.java:133)
> at java.io.FilterInputStream.read(FilterInputStream.java:133)
> at
> org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:513)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
> at java.io.DataInputStream.readInt(DataInputStream.java:387)
> at
> org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1071)
> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:966)
> {code}
> # NodeStatusUpdaterImpl tried to send another heartbeat 10s after the last
> exception. Container_1501226097332_249991_01_000199 was added to the
> heartbeat request as a completed container for the first time. We can
> confirm this from the timestamp in
> recentlyStoppedContainers@NodeStatusUpdaterImpl (the arithmetic is worked
> out right after the dump below).
> {code:java}
> Java.util.LinkedHashMap$Entry@0x7889caca0
> http://localhost:7000/object/0x7889cacc8
> Value 1502209034239
> Time add to recently stopped
> 1502208434239 = 1502209034239 - 600000
> $date -d@1502208434
> Tue Aug 8 16:07:14 UTC 2017
> {code}
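> To make the arithmetic explicit: the stored value is the time the entry is
> due to be evicted, i.e. the add time plus durationToTrackStoppedContainers
> (600000 ms per the dump). A small standalone check of those numbers (an
> illustration only, not Hadoop code):
> {code:java}
> import java.time.Instant;
> import java.time.ZoneOffset;
>
> public class StoppedContainerTimestampCheck {
>   public static void main(String[] args) {
>     long evictionDeadlineMs = 1502209034239L;          // value from the heap dump
>     long durationToTrackStoppedContainersMs = 600_000; // 10 minutes, per the dump
>     long addedMs = evictionDeadlineMs - durationToTrackStoppedContainersMs;
>
>     System.out.println(addedMs);                                    // 1502208434239
>     System.out.println(Instant.ofEpochMilli(addedMs).atZone(ZoneOffset.UTC));
>     // 2017-08-08T16:07:14.239Z, i.e. 10s after the 16:07:04 exception in step 3
>   }
> }
> {code}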
> # The RM treated the request as a duplicate heartbeat because $hid was the
> same as the heartbeat id it had received last time. So it returned the last
> response generated from the previous heartbeat and didn't handle the
> request; in other words, it didn't add
> container_1501226097332_249991_01_000199 as a completed container to its own
> data structures (the concrete id values are spelled out after the code
> below).
> {code:java}
> // 3. Check if it's a 'fresh' heartbeat i.e. not duplicate heartbeat
> NodeHeartbeatResponse lastNodeHeartbeatResponse =
>     rmNode.getLastNodeHeartBeatResponse();
> if (remoteNodeStatus.getResponseId() + 1 == lastNodeHeartbeatResponse
>     .getResponseId()) {
>   LOG.info("Received duplicate heartbeat from node "
>       + rmNode.getNodeAddress() + " responseId="
>       + remoteNodeStatus.getResponseId());
>   return lastNodeHeartbeatResponse;
> {code}
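> Plugging this incident into the check above (a simplified illustration, not
> RM code; $hid stands for whatever id the NM was at in step 3):
> {code:java}
> public class DuplicateHeartbeatIllustration {
>   public static void main(String[] args) {
>     int hid = 42;                     // placeholder for the real $hid from step 3
>     int remoteResponseId = hid;       // NM resends the same id after the IOException
>     int lastResponseIdOnRm = hid + 1; // RM already processed hid and advanced
>
>     // Same comparison as in the RM code quoted above:
>     boolean duplicate = (remoteResponseId + 1 == lastResponseIdOnRm);
>     System.out.println(duplicate);    // true -> cached response is returned and the
>                                       // newly reported completed container is dropped
>   }
> }
> {code}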
> # The node manager received the response and cleared
> container_1501226097332_249991_01_000199 from
> pendingCompletedContainers@NodeStatusUpdaterImpl. However,
> container_1501226097332_249991_01_000199 is still in
> recentlyStoppedContainers@NodeStatusUpdaterImpl. So the node manager won't
> report container_1501226097332_249991_01_000199 again as a completed
> container in any later heartbeat request (a condensed view of the two data
> structures follows the code below).
> {code:java}
> if (containerStatus.getState() == ContainerState.COMPLETE) {
>   if (isApplicationStopped(applicationId)) {
>     if (LOG.isDebugEnabled()) {
>       LOG.debug(applicationId + " is completing, " + " remove "
>           + containerId + " from NM context.");
>     }
>     context.getContainers().remove(containerId);
>     pendingCompletedContainers.put(containerId, containerStatus);
>   } else {
>     if (!isContainerRecentlyStopped(containerId)) {
>       pendingCompletedContainers.put(containerId, containerStatus);
>       // Adding to finished containers cache. Cache will keep it around at
>       // least for #durationToTrackStoppedContainers duration. In the
>       // subsequent call to stop the container will get removed from cache.
>       addCompletedContainer(containerId);
>     }
>   }
> } else {
>   containerStatuses.add(containerStatus);
> }
> {code}
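> Condensing the interaction of the two maps into a toy sequence (hypothetical
> class and helper names; the real logic lives in NodeStatusUpdaterImpl, part
> of which is quoted above):
> {code:java}
> import java.util.*;
>
> // Toy reconstruction of the leak, not the real NodeStatusUpdaterImpl: a
> // completed container is only (re)queued when it is NOT in the "recently
> // stopped" cache, but that cache is populated the first time the container
> // is queued and survives a lost heartbeat response for ~10 minutes.
> public class CompletedContainerLeakSketch {
>   static Map<String, String> pendingCompletedContainers = new HashMap<>();
>   static Map<String, Long> recentlyStoppedContainers = new HashMap<>();
>
>   static void queueCompleted(String id) {
>     if (!recentlyStoppedContainers.containsKey(id)) {   // isContainerRecentlyStopped
>       pendingCompletedContainers.put(id, "COMPLETE");
>       recentlyStoppedContainers.put(id, System.currentTimeMillis() + 600_000L);
>     }
>   }
>
>   public static void main(String[] args) {
>     String id = "container_1501226097332_249991_01_000199";
>     queueCompleted(id);                 // heartbeat N: sent, but RM sees a duplicate
>     pendingCompletedContainers.clear(); // NM clears pending state on any response
>     queueCompleted(id);                 // heartbeat N+1: still "recently stopped"
>     System.out.println(pendingCompletedContainers); // {} -> RM never learns it finished
>   }
> }
> {code}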
> # The application master relies on getResources() to learn about completed
> containers and remove them from assignedRequests. Since the container will
> never be reported to the RM by the node manager again, the application
> master can't remove the attempt associated with the container, and the
> preemption calculation returns false because of the following code, which
> causes the starving mapper.
> {code:java}
> boolean preemptReducesIfNeeded() {
>   if (reduceResourceRequest.equals(Resources.none())) {
>     return false; // no reduces
>   }
>   if (assignedRequests.maps.size() > 0) {
>     // there are assigned mappers
>     return false;
>   }
> {code}