[
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194592#comment-17194592
]
Yuanbo Liu edited comment on YARN-10393 at 9/12/20, 3:26 AM:
-------------------------------------------------------------
[~Jim_Brennan] Thanks for your comments.
{quote}My one issue with it is that it will not return the most up to date
status for containers on the heartbeat after the missed one...
{quote}
Great point. Your comments made me realize I was wrong; I was thinking about
this situation:
{quote} [heartbeat:1, container:1]->rm
nm does not get the response
[heartbeat:1, container:1, container:2]->rm
rm thinks it's a duplicated heartbeat and never processes container 2.
{quote}
I thought container 2 would never get updated if the heartbeat response went
missing, which is not correct. Your patch makes sense to me.
{quote}Another question is how would these two proposals behave if the NM
misses multiple heartbeats in a row?
{quote}
I believe that if that happens, the cluster or the NM may already be in an
unusable state, so continuously resending the completed containers doesn't
seem like a bad idea?
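For illustration, a rough sketch of the "keep resending until acknowledged"
idea (this is not the actual NodeStatusUpdaterImpl code; the class and method
names below are made up for the example):
{code:java}
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only, not the real NodeStatusUpdaterImpl. The idea is to
// keep completed containers in a pending map and drop an entry only after the
// RM has acknowledged it, so a lost heartbeat response just means a harmless
// re-send on the next heartbeat.
class CompletedContainerTracker {
  private final Map<String, String> pendingCompleted = new ConcurrentHashMap<>();

  void containerCompleted(String containerId, String finalStatus) {
    pendingCompleted.put(containerId, finalStatus);
  }

  // Called when building every heartbeat: always resend unacknowledged entries.
  List<String> statusesForHeartbeat() {
    return new ArrayList<>(pendingCompleted.values());
  }

  // Called when a heartbeat response arrives with the ids the RM has processed.
  void onRmAcknowledged(Collection<String> ackedContainerIds) {
    pendingCompleted.keySet().removeAll(ackedContainerIds);
  }
}
{code}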
> MR job live lock caused by completed state container leak in heartbeat
> between node manager and RM
> --------------------------------------------------------------------------------------------------
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager, yarn
> Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3,
> 3.4.0
> Reporter: zhenzhao wang
> Assignee: zhenzhao wang
> Priority: Major
> Attachments: YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This was a bug we had seen multiple times on Hadoop 2.6.2, and the following
> analysis is based on the core dump, logs, and code collected in 2017 with
> Hadoop 2.6.2. We haven't seen it after 2.9 in our environment, but that is
> because of the RPC retry policy change and other changes; unless I missed
> something, the issue is still possible with the current code.
> *High-level description:*
> We had seen a starving-mapper issue several times: the MR job was stuck in a
> livelock state and couldn't make any progress. The queue was full, so the
> pending mapper couldn't get any resources to continue, and the application
> master failed to preempt the reducer, which left the job stuck. The reason
> the application master didn't preempt the reducer was a leaked container
> among the assigned mappers: the node manager had failed to report the
> completed container to the resource manager.
> *Detailed steps:*
>
> # Container_1501226097332_249991_01_000199 was assigned to
> attempt_1501226097332_249991_m_000095_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator]
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned
> container container_1501226097332_249991_01_000199 to
> attempt_1501226097332_249991_m_000095_0
> {code}
> # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
> Container container_1501226097332_249991_01_000199 transitioned from RUNNING
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
> Cleaning up container container_1501226097332_249991_01_000199
> {code}
> # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08
> 16:07:04,238. The heartbeat request was in fact handled by the resource
> manager, but the node manager failed to receive the response. Let's assume
> heartBeatResponseId=$hid in the node manager. According to our current
> configuration, the next heartbeat will be sent 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException:
> Connection reset by peer; Host Details : local host is: ; destination host
> is: XXXXXXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:197)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
> at
> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
> at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
> at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> at java.io.FilterInputStream.read(FilterInputStream.java:133)
> at java.io.FilterInputStream.read(FilterInputStream.java:133)
> at
> org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:513)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
> at java.io.DataInputStream.readInt(DataInputStream.java:387)
> at
> org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1071)
> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:966)
> {code}
> # NodeStatusUpdaterImpl tried to send another heartbeat 10s after the last
> exception. Container_1501226097332_249991_01_000199 was added to the
> heartbeat request as a completed container for the first time. We can
> confirm this from the timestamp in
> recentlyStoppedContainers@NodeStatusUpdaterImpl (the arithmetic is worked
> out right after the dump below).
> {code:java}
> Java.util.LinkedHashMap$Entry@0x7889caca0
> http://localhost:7000/object/0x7889cacc8
> Value 1502209034239
> Time add to recently stopped
> 1502208434239 = 1502209034239 - 600000
> $date -d@1502208434
> Tue Aug 8 16:07:14 UTC 2017
> {code}
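> To make the arithmetic explicit: the stored value is the time the entry is
> due to be evicted, i.e. the add time plus durationToTrackStoppedContainers
> (600000 ms per the dump). A small standalone check of those numbers (an
> illustration only, not Hadoop code):
> {code:java}
> import java.time.Instant;
> import java.time.ZoneOffset;
>
> public class StoppedContainerTimestampCheck {
>   public static void main(String[] args) {
>     long evictionDeadlineMs = 1502209034239L;          // value from the heap dump
>     long durationToTrackStoppedContainersMs = 600_000; // 10 minutes, per the dump
>     long addedMs = evictionDeadlineMs - durationToTrackStoppedContainersMs;
>
>     System.out.println(addedMs);                                    // 1502208434239
>     System.out.println(Instant.ofEpochMilli(addedMs).atZone(ZoneOffset.UTC));
>     // 2017-08-08T16:07:14.239Z, i.e. 10s after the 16:07:04 exception in step 3
>   }
> }
> {code}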
> # The RM treated the request as a duplicate heartbeat because $hid was the
> same as the heartbeat id it had received last time. So it returned the last
> response generated from the previous heartbeat and didn't handle the
> request; in other words, it didn't add
> container_1501226097332_249991_01_000199 as a completed container to its own
> data structures (the concrete id values are spelled out after the code
> below).
> {code:java}
> // 3. Check if it's a 'fresh' heartbeat i.e. not duplicate heartbeat
> NodeHeartbeatResponse lastNodeHeartbeatResponse =
>     rmNode.getLastNodeHeartBeatResponse();
> if (remoteNodeStatus.getResponseId() + 1 == lastNodeHeartbeatResponse
>     .getResponseId()) {
>   LOG.info("Received duplicate heartbeat from node "
>       + rmNode.getNodeAddress() + " responseId="
>       + remoteNodeStatus.getResponseId());
>   return lastNodeHeartbeatResponse;
> {code}
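> Plugging this incident into the check above (a simplified illustration, not
> RM code; $hid stands for whatever id the NM was at in step 3):
> {code:java}
> public class DuplicateHeartbeatIllustration {
>   public static void main(String[] args) {
>     int hid = 42;                     // placeholder for the real $hid from step 3
>     int remoteResponseId = hid;       // NM resends the same id after the IOException
>     int lastResponseIdOnRm = hid + 1; // RM already processed hid and advanced
>
>     // Same comparison as in the RM code quoted above:
>     boolean duplicate = (remoteResponseId + 1 == lastResponseIdOnRm);
>     System.out.println(duplicate);    // true -> cached response is returned and the
>                                       // newly reported completed container is dropped
>   }
> }
> {code}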
> # The node manager received the response and cleared
> container_1501226097332_249991_01_000199 from
> pendingCompletedContainers@NodeStatusUpdaterImpl. However,
> container_1501226097332_249991_01_000199 is still in
> recentlyStoppedContainers@NodeStatusUpdaterImpl. So the node manager won't
> report container_1501226097332_249991_01_000199 again as a completed
> container in any later heartbeat request (a condensed view of the two data
> structures follows the code below).
> {code:java}
> if (containerStatus.getState() == ContainerState.COMPLETE) {
>   if (isApplicationStopped(applicationId)) {
>     if (LOG.isDebugEnabled()) {
>       LOG.debug(applicationId + " is completing, " + " remove "
>           + containerId + " from NM context.");
>     }
>     context.getContainers().remove(containerId);
>     pendingCompletedContainers.put(containerId, containerStatus);
>   } else {
>     if (!isContainerRecentlyStopped(containerId)) {
>       pendingCompletedContainers.put(containerId, containerStatus);
>       // Adding to finished containers cache. Cache will keep it around at
>       // least for #durationToTrackStoppedContainers duration. In the
>       // subsequent call to stop the container will get removed from cache.
>       addCompletedContainer(containerId);
>     }
>   }
> } else {
>   containerStatuses.add(containerStatus);
> }
> {code}
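> Condensing the interaction of the two maps into a toy sequence (hypothetical
> class and helper names; the real logic lives in NodeStatusUpdaterImpl, part
> of which is quoted above):
> {code:java}
> import java.util.*;
>
> // Toy reconstruction of the leak, not the real NodeStatusUpdaterImpl: a
> // completed container is only (re)queued when it is NOT in the "recently
> // stopped" cache, but that cache is populated the first time the container
> // is queued and survives a lost heartbeat response for ~10 minutes.
> public class CompletedContainerLeakSketch {
>   static Map<String, String> pendingCompletedContainers = new HashMap<>();
>   static Map<String, Long> recentlyStoppedContainers = new HashMap<>();
>
>   static void queueCompleted(String id) {
>     if (!recentlyStoppedContainers.containsKey(id)) {   // isContainerRecentlyStopped
>       pendingCompletedContainers.put(id, "COMPLETE");
>       recentlyStoppedContainers.put(id, System.currentTimeMillis() + 600_000L);
>     }
>   }
>
>   public static void main(String[] args) {
>     String id = "container_1501226097332_249991_01_000199";
>     queueCompleted(id);                 // heartbeat N: sent, but RM sees a duplicate
>     pendingCompletedContainers.clear(); // NM clears pending state on any response
>     queueCompleted(id);                 // heartbeat N+1: still "recently stopped"
>     System.out.println(pendingCompletedContainers); // {} -> RM never learns it finished
>   }
> }
> {code}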
> # The application master relies on getResources() to learn about completed
> containers and remove them from assignedRequests. Since the container will
> never be reported to the RM by the node manager again, the application
> master can't remove the attempt associated with the container, and the
> preemption calculation returns false because of the following code, which
> causes the starving mapper.
> {code:java}
> boolean preemptReducesIfNeeded() {
>   if (reduceResourceRequest.equals(Resources.none())) {
>     return false; // no reduces
>   }
>   if (assignedRequests.maps.size() > 0) {
>     // there are assigned mappers
>     return false;
>   }
> {code}