[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-10-04 Thread zhenzhao wang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17207797#comment-17207797
 ] 

zhenzhao wang commented on YARN-10393:
--

+1, LGTM, thanks.

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-10393.001.patch, YARN-10393.002.patch, 
> YARN-10393.draft.2.patch, YARN-10393.draft.patch
>
>
> This is a bug we had seen multiple times on Hadoop 2.6.2. The following 
> analysis is based on the core dump, logs, and code from 2017 with Hadoop 2.6.2. 
> We haven't seen it after 2.9 in our environment, but that was because of the 
> RPC retry policy change and other changes. Unless I missed something, there is 
> still a possibility even with the current code.
> *High-level description:*
>  We had seen a starving-mapper issue several times. The MR job was stuck in a 
> livelock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn't get any resources to continue, and the application 
> master failed to preempt the reducer, leaving the job stuck. The reason the 
> application master didn't preempt the reducer was a leaked container among the 
> assigned mappers: the node manager had failed to report the completed 
> container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request was actually handled by the 
> resource manager; the node manager just failed to receive the response. Let's 
> assume heartBeatResponseId=$hid in the node manager. With our current 
> configuration, the next heartbeat will be sent 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at 

[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-26 Thread zhenzhao wang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202738#comment-17202738
 ] 

zhenzhao wang commented on YARN-10393:
--

[~Jim_Brennan] Feel free to re-assign the ticket to yourself if you are 
interested. You have been contributing more to the discussion and the solution 
recently.


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-26 Thread zhenzhao wang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202737#comment-17202737
 ] 

zhenzhao wang commented on YARN-10393:
--

[~Jim_Brennan] Sorry, I missed the message. Thanks a lot for all the discussion 
and suggestions. Feel free to put up the patch.


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-02 Thread zhenzhao wang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189046#comment-17189046
 ] 

zhenzhao wang commented on YARN-10393:
--

One more thing to clarify: the following line in the current PR could be 
avoided, because getNodeStatus() is only called when the heartbeatId changes, 
so pendingCompletedContainers won't be updated on a retry (getNodeStatus() is 
not called twice). I added it as a safeguard: it keeps sending completed 
containers until they are confirmed in a response, which could protect against 
potential errors in the RM or in RM-AM communication. But as [~Jim_Brennan] 
pointed out, it might cause duplicate reports for the same completed containers.

{quote}

pendingCompletedContainers.remove(containerId); 

{quote}
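
For illustration, here is a minimal sketch of that safeguard idea (a 
hypothetical helper class, not the actual patch), assuming the acked container 
ids come back in NodeHeartbeatResponse#getContainersToBeRemovedFromNM(): a 
completed container stays in the pending map, and keeps being reported, until a 
heartbeat response from the RM acknowledges it.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.server.api.protocolrecords.NodeHeartbeatResponse;

// Hypothetical sketch, not the actual NodeStatusUpdaterImpl code.
class PendingCompletedContainerTracker {
  private final Map<ContainerId, ContainerStatus> pendingCompletedContainers =
      new ConcurrentHashMap<>();

  // Called when a container reaches a completed state on the NM.
  void onContainerCompleted(ContainerStatus status) {
    pendingCompletedContainers.put(status.getContainerId(), status);
  }

  // Called after a heartbeat response arrives: only containers the RM
  // explicitly acked are dropped; everything else is re-sent in the next
  // heartbeat instead of the map being cleared wholesale.
  void onHeartbeatResponse(NodeHeartbeatResponse response) {
    for (ContainerId acked : response.getContainersToBeRemovedFromNM()) {
      pendingCompletedContainers.remove(acked);
    }
  }
}
{code}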


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-09-02 Thread zhenzhao wang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189009#comment-17189009
 ] 

zhenzhao wang commented on YARN-10393:
--

Thanks all for the great discussion.

As stated earlier, I think the problem can be discussed in two aspects:

{quote}
 # RM and NM have a different understanding of the heartbeat. RM uses the 
heartbeatId to distinguish heartbeats; however, NM might generate different 
requests with the same heartbeat id on heartbeat failure.
 # The cache for containers inside NM is not maintained correctly on heartbeat 
failure.
{quote}

The first problem can lead to multiple missing report fields. The potentially 
missing fields include completed containers (which leads to the livelock in 
this case), increasedContainers (I didn't dig into the impact, though), etc. It 
also means that people had better be aware of this when they add new heartbeat 
fields in the future. I hope we can fix it too, but I agree with the concern 
about changing the protocol. So if we don't want to fix it in this jira, we 
should keep track of it. What do you think? [~Jim_Brennan]

As for the second problem, it is directly related to the missing completed 
container issue. [~Jim_Brennan] proposed a good approach, and [~yuanbo] 
[~adam.antal] also made good points. We can't simply clear 
pendingCompletedContainers on the first successful response after a failure. 
The marker approach works, while the heartbeatId comparison approach wouldn't.


[jira] [Commented] (YARN-10398) Every NM will try to upload Jar/Archives/Files/Resources to Yarn Shared Cache Manager Like DDOS

2020-08-23 Thread zhenzhao wang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182949#comment-17182949
 ] 

zhenzhao wang commented on YARN-10398:
--

[~jiwq] I double-checked and confirmed the PR is the fix for the problem. The 
reason non-application-master node managers try to upload is that the clearing 
code didn't work. The code and the bug are in YARN; MR just uses the yarn 
shared cache, so I'm not sure we should move it to the MR project. Thanks.

> Every NM will try to upload Jar/Archives/Files/Resources to Yarn Shared Cache 
> Manager Like DDOS
> ---
>
> Key: YARN-10398
> URL: https://issues.apache.org/jira/browse/YARN-10398
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.9.1, 3.0.1, 3.0.2, 3.2.0, 3.1.1, 
> 2.9.2, 3.0.3, 3.0.4, 3.1.2, 3.3.0, 3.2.1, 2.9.3, 3.1.3, 3.2.2, 3.1.4, 3.4.0, 
> 3.3.1, 3.1.5
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
>
> The yarn shared cache manager is designed so that only the application master 
> uploads the jars/files/resources. However, there has been a bug in the code 
> since 2.9.0: every node manager that runs a task of the job will try to upload 
> the jars/resources. Say one job has 5000 tasks; then up to 5000 NMs will try 
> to upload the jar. This is like a DDOS and creates a snowball effect. It ends 
> up with unavailability of the yarn shared cache manager, causes timeouts in 
> localization, and leads to job failure.






[jira] [Comment Edited] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-20 Thread zhenzhao wang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180989#comment-17180989
 ] 

zhenzhao wang edited comment on YARN-10393 at 8/20/20, 7:03 AM:


Thanks [~Jim_Brennan] [~yuanbo] for the comment!
{quote}It seems to me that the change you made to 
NodeStatusUpdaterImpl.removeOrTrackCompletedContainersFromContext() is all that 
is required to ensure that the completed container status is not lost. I don't 
think you need to change the RM/NM protocol to manually resend the last 
NodeHeartbeatRequest again. As you noted, the RPC retry logic is already doing 
that. Also note that there is a lot of other state in that request, so I am not 
sure of the implications of not sending the most recent status for all that 
other state. Changing the protocol seems scary.
{quote}
[~Jim_Brennan] I guess the RM side assumes the heartbeatId is the unique 
identification of a heartbeat. The old logic for generating a heartbeat couldn't 
guarantee this: it might generate a new request and update the cache even when 
the heartbeatId didn't change. My intent is to make sure the NM only generates a 
new request when the heartbeatId changes (see the sketch below). This semantic 
guarantee is more important than the retry and could help prevent other errors. 
E.g. a running container can also be lost in this case; it's just that it will 
be reported again in the next heartbeat. I agree that this change is scary, but 
I think fixing it is even more meaningful than fixing the cache problem itself.
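
As an illustration only (a hypothetical helper, not the proposed protocol 
change itself), the semantic I have in mind looks roughly like this: the NM 
builds at most one request per heartbeatId, so an RPC-level retry resends 
identical content instead of regenerating (and possibly losing) state.

{code:java}
import java.util.function.Supplier;

import org.apache.hadoop.yarn.server.api.protocolrecords.NodeHeartbeatRequest;

// Hypothetical sketch: a fresh node-status snapshot is built only when the
// heartbeatId advances; a retry for the same heartbeatId reuses the cached
// request.
class OneRequestPerHeartbeatId {
  private NodeHeartbeatRequest lastRequest;
  private int lastHeartbeatId = -1;

  NodeHeartbeatRequest requestFor(int heartbeatId,
      Supplier<NodeHeartbeatRequest> freshRequestBuilder) {
    if (heartbeatId != lastHeartbeatId) {
      lastRequest = freshRequestBuilder.get();
      lastHeartbeatId = heartbeatId;
    }
    return lastRequest;
  }
}
{code}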
{quote}But the change you made in removeOrTrackCompletedContainersFromContext() 
seems to go directly to the problem. The current code is always clearing 
pendingCompletedContainers at the end of that function. I've read through 
YARN-2997 and it seems like this was a late addition to the patch, but it is 
not clear to me why it was added.
{quote}
[~Jim_Brennan] Yeah, I meant to remove the entry from the cache only if the 
completed container is acked by the RM. But the potential memory leak is a 
reasonable concern; [~yuanbo] also pointed it out, with a suggested solution.
{quote}This would be a potential memory leak if we remove 
"pendingCompletedContainers.clear()".
 I'd suggest that removing "!isContainerRecentlyStopped(containerId)" in 
NodeStatusUpdaterImpl.java[line: 613] would be good to fix this issue.

if (!isContainerRecentlyStopped(containerId)) {
  pendingCompletedContainers.put(containerId, containerStatus);
}
Completed containers will be cached in 10mins(default value) until it timeouts 
or gets response from heartbeat. And 10mins cache for completed container is 
long enough for retrying sending requests through heartbeat (default interval 
is 10s).
{quote}
I guess this will end up with completed containers being sent multiple times if 
we just remove line 613.

What about this? We keep pendingCompletedContainers.clear() unchanged. Before 
sending the heartbeat, we remove the completed containers in the heartbeat 
request from the cache (recentlyStoppedContainers). Then we add the acked 
containers back to the cache. From a high level, this is like updating the 
cache only if the heartbeat succeeded with a response.
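
A rough sketch of what I mean (hypothetical names and simplified types, not a 
patch): the reported completed containers are taken out of 
recentlyStoppedContainers before the heartbeat, and only the acked ones are put 
back, so the cache effectively reflects what the RM has confirmed.

{code:java}
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.yarn.api.records.ContainerId;

// Hypothetical sketch of the proposal above, not NodeStatusUpdaterImpl itself.
class RecentlyStoppedTracker {
  // containerId -> expiry timestamp, mirroring the 10-minute tracking window.
  private final Map<ContainerId, Long> recentlyStoppedContainers =
      new ConcurrentHashMap<>();
  private final long trackStoppedContainersMs = 10 * 60 * 1000L;

  // Before sending the heartbeat: the containers about to be reported are
  // removed from the cache, so a lost response cannot suppress them from the
  // next report.
  void beforeSendingHeartbeat(List<ContainerId> reportedCompleted) {
    reportedCompleted.forEach(recentlyStoppedContainers::remove);
  }

  // After a successful response: the acked containers go back into the cache
  // so they are not reported again.
  void afterAckedByRM(List<ContainerId> acked) {
    long expiry = System.currentTimeMillis() + trackStoppedContainersMs;
    for (ContainerId id : acked) {
      recentlyStoppedContainers.put(id, expiry);
    }
  }
}
{code}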



[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-20 Thread zhenzhao wang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180989#comment-17180989
 ] 

zhenzhao wang commented on YARN-10393:
--

Thanks [~Jim_Brennan] [~yuanbo] for the comment!
??It seems to me that the change you made to 
NodeStatusUpdaterImpl.removeOrTrackCompletedContainersFromContext() is all that 
is required to ensure that the completed container status is not lost. I don't 
think you need to change the RM/NM protocol to manually resend the last 
NodeHeartbeatRequest again. As you noted, the RPC retry logic is already doing 
that. Also note that there is a lot of other state in that request, so I am not 
sure of the implications of not sending the most recent status for all that 
other state. Changing the protocol seems scary.??
[~Jim_Brennan] I guess the RM side assumes the heartbeatId is the unique 
identification of a heartbeat. The old logic for generating a heartbeat couldn't 
guarantee this: it might generate a new request and update the cache even when 
the heartbeatId didn't change. My intent is to make sure the NM only generates a 
new request when the heartbeatId changes. This semantic guarantee is more 
important than the retry and could help prevent other errors. E.g. a running 
container can also be lost in this case; it's just that it will be reported 
again in the next heartbeat. I agree that this change is scary, but I think 
fixing it is even more meaningful than fixing the cache problem itself.

??But the change you made in removeOrTrackCompletedContainersFromContext() 
seems to go directly to the problem. The current code is always clearing 
pendingCompletedContainers at the end of that function. I've read through 
YARN-2997 and it seems like this was a late addition to the patch, but it is 
not clear to me why it was added.??
[~Jim_Brennan] Yeah, I meant to remove the entry from the cache only if the 
completed container is acked by the RM. But the potential memory leak is a 
reasonable concern; [~yuanbo] also pointed it out, with a suggested solution.
??This would be a potential memory leak if we remove 
"pendingCompletedContainers.clear()".
I'd suggest that removing "!isContainerRecentlyStopped(containerId)" in 
NodeStatusUpdaterImpl.java[line: 613] would be good to fix this issue.

if (!isContainerRecentlyStopped(containerId)) {
 pendingCompletedContainers.put(containerId, containerStatus);
}
Completed containers will be cached in 10mins(default value) until it timeouts 
or gets response from heartbeat. And 10mins cache for completed container is 
long enough for retrying sending requests through heartbeat (default interval 
is 10s).??
I guess this will end up with completed containers being sent multiple times if 
we just remove line 613.

What about this? We keep pendingCompletedContainers.clear() unchanged. Before 
sending the heartbeat, we remove the completed containers in the heartbeat 
request from the cache (recentlyStoppedContainers). Then we add the acked 
containers back to the cache. From a high level, this is like updating the 
cache only if the heartbeat succeeded with a response.







[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-13 Thread zhenzhao wang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177462#comment-17177462
 ] 

zhenzhao wang commented on YARN-10393:
--

[~adam.antal] This is a great question.
First, it's not that we upgraded to 2.9.2 and the problem was gone; we stopped 
seeing new cases reported while we were still running 2.6.x. This was because 
we have a stand-alone policing service that can kill long-running 
mappers/reducers or the job itself, and I guess all the users whose job 
patterns are prone to this problem have adopted that service to prevent it.
Second, I guess this is also because of the default retry policy change. Here's 
the code that creates the RM proxy in 2.6; I don't see any retry on proxy 
invocation failure.

{code:java}
  public <T> ProtocolProxy<T> getProxy(Class<T> protocol, long clientVersion,
 InetSocketAddress addr, UserGroupInformation ticket,
 Configuration conf, SocketFactory factory,
 int rpcTimeout, RetryPolicy connectionRetryPolicy,
 AtomicBoolean fallbackToSimpleAuth)
throws IOException {

if (connectionRetryPolicy != null) {
  throw new UnsupportedOperationException(
  "Not supported: connectionRetryPolicy=" + connectionRetryPolicy);
}

    T proxy = (T) Proxy.newProxyInstance(protocol.getClassLoader(),
        new Class[] { protocol }, new Invoker(protocol, addr, ticket, conf,
            factory, rpcTimeout, fallbackToSimpleAuth));
    return new ProtocolProxy<T>(protocol, proxy, true);
  }


Invoker.Java
@Override
public Object invoke(Object proxy, Method method, Object[] args)
  throws Throwable {
  long startTime = 0;
  if (LOG.isDebugEnabled()) {
startTime = Time.now();
  }
  TraceScope traceScope = null;
  if (Trace.isTracing()) {
traceScope = Trace.startSpan(
method.getDeclaringClass().getCanonicalName() +
"." + method.getName());
  }
  ObjectWritable value;
  try {
value = (ObjectWritable)
  client.call(RPC.RpcKind.RPC_WRITABLE, new Invocation(method, args),
remoteId, fallbackToSimpleAuth);
  } finally {
if (traceScope != null) traceScope.close();
  }
  if (LOG.isDebugEnabled()) {
long callTime = Time.now() - startTime;
LOG.debug("Call: " + method.getName() + " " + callTime);
  }
  return value.get();
}
{code}
In 2.9, the RMProxy default retry policy is the following: up to 15 min with a 
fixed 30s sleep time, so the client can do many retries.
{code:java}
    retryPolicy =
        RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
            rmConnectWaitMS,              // 15 * 60 * 1000 ms
            rmConnectionRetryIntervalMS,  // 30 * 1000 ms
            TimeUnit.MILLISECONDS);
{code}
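
For comparison, here is a minimal sketch of how a proxy gets wrapped with such 
a policy using the stock org.apache.hadoop.io.retry API (illustrative only, 
hypothetical factory class, not the exact RMProxy code path):

{code:java}
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;
import org.apache.hadoop.io.retry.RetryProxy;
import org.apache.hadoop.yarn.server.api.ResourceTracker;

// Illustrative sketch: failed nodeHeartbeat calls on the wrapped proxy are
// retried every 30s for up to 15 minutes before the exception surfaces.
final class RetryingResourceTrackerFactory {
  static ResourceTracker wrap(ResourceTracker rawProxy) {
    RetryPolicy retryPolicy = RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
        15 * 60 * 1000L,   // rmConnectWaitMS
        30 * 1000L,        // rmConnectionRetryIntervalMS
        TimeUnit.MILLISECONDS);
    return (ResourceTracker) RetryProxy.create(
        ResourceTracker.class, rawProxy, retryPolicy);
  }
}
{code}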





[jira] [Comment Edited] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-13 Thread zhenzhao wang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177462#comment-17177462
 ] 

zhenzhao wang edited comment on YARN-10393 at 8/14/20, 3:50 AM:


[~adam.antal] This is a great question.
# First, it's not that we upgraded to 2.9.2 and the problem was gone; we 
stopped seeing new cases reported while we were still running 2.6.x. This was 
because we have a stand-alone policing service that can kill long-running 
mappers/reducers or the job itself, and I guess all the users whose job 
patterns are prone to this problem have adopted that service to prevent it.
# Second, I guess this is also because of the default retry policy change. 
Here's the code that creates the RM proxy in 2.6; I don't see any retry on 
proxy invocation failure.

{code:java}
  public <T> ProtocolProxy<T> getProxy(Class<T> protocol, long clientVersion,
 InetSocketAddress addr, UserGroupInformation ticket,
 Configuration conf, SocketFactory factory,
 int rpcTimeout, RetryPolicy connectionRetryPolicy,
 AtomicBoolean fallbackToSimpleAuth)
throws IOException {

if (connectionRetryPolicy != null) {
  throw new UnsupportedOperationException(
  "Not supported: connectionRetryPolicy=" + connectionRetryPolicy);
}

    T proxy = (T) Proxy.newProxyInstance(protocol.getClassLoader(),
        new Class[] { protocol }, new Invoker(protocol, addr, ticket, conf,
            factory, rpcTimeout, fallbackToSimpleAuth));
    return new ProtocolProxy<T>(protocol, proxy, true);
  }


Invoker.Java
@Override
public Object invoke(Object proxy, Method method, Object[] args)
  throws Throwable {
  long startTime = 0;
  if (LOG.isDebugEnabled()) {
startTime = Time.now();
  }
  TraceScope traceScope = null;
  if (Trace.isTracing()) {
traceScope = Trace.startSpan(
method.getDeclaringClass().getCanonicalName() +
"." + method.getName());
  }
  ObjectWritable value;
  try {
value = (ObjectWritable)
  client.call(RPC.RpcKind.RPC_WRITABLE, new Invocation(method, args),
remoteId, fallbackToSimpleAuth);
  } finally {
if (traceScope != null) traceScope.close();
  }
  if (LOG.isDebugEnabled()) {
long callTime = Time.now() - startTime;
LOG.debug("Call: " + method.getName() + " " + callTime);
  }
  return value.get();
}
{code}
In 2.9, the RMProxy default retry policy is the following: up to 15 min with a 
fixed 30s sleep time, so the client can do many retries.
{code:java}
    retryPolicy =
        RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
            rmConnectWaitMS,              // 15 * 60 * 1000 ms
            rmConnectionRetryIntervalMS,  // 30 * 1000 ms
            TimeUnit.MILLISECONDS);
{code}

There might be other changes I'm not aware of. However, I guess the above two 
reasons did make a difference in our clusters. 




[jira] [Commented] (YARN-10398) Every NM will try to upload Jar/Archives/Files/Resources to Yarn Shared Cache Manager Like DDOS

2020-08-12 Thread zhenzhao wang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176758#comment-17176758
 ] 

zhenzhao wang commented on YARN-10398:
--

[~templedf] Could you please help review this patch? Thanks!







[jira] [Created] (YARN-10398) Every NM will try to upload Jar/Archives/Files/Resources to Yarn Shared Cache Manager Like DDOS

2020-08-12 Thread zhenzhao wang (Jira)
zhenzhao wang created YARN-10398:


 Summary: Every NM will try to upload Jar/Archives/Files/Resources 
to Yarn Shared Cache Manager Like DDOS
 Key: YARN-10398
 URL: https://issues.apache.org/jira/browse/YARN-10398
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.1.3, 3.2.1, 3.1.2, 3.0.3, 2.9.2, 3.1.1, 3.2.0, 3.0.2, 
3.0.1, 2.9.1, 3.1.0, 3.0.0, 2.9.0, 3.0.4, 3.3.0, 2.9.3, 3.2.2, 3.1.4, 3.4.0, 
3.3.1, 3.1.5
Reporter: zhenzhao wang
Assignee: zhenzhao wang


The yarn shared cache manager is designed so that only the application master 
uploads the jars/files/resources. However, there has been a bug in the code 
since 2.9.0: every node manager that runs a task of the job will try to upload 
the jars/resources. Say one job has 5000 tasks; then up to 5000 NMs will try to 
upload the jar. This is like a DDOS and creates a snowball effect. It ends up 
with unavailability of the yarn shared cache manager, causes timeouts in 
localization, and leads to job failure.






[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-12 Thread zhenzhao wang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176740#comment-17176740
 ] 

zhenzhao wang commented on YARN-10393:
--

[~bibinchundatt] [~adam.antal][~Jim_Brennan][~jdonofrio][~aceric] Could you 
please help with the review? Thanks


[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-12 Thread zhenzhao wang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176467#comment-17176467
 ] 

zhenzhao wang commented on YARN-10393:
--

I can see two issues here:
# RM and NM have a different understanding of the heartbeat. RM uses the 
heartbeatId to distinguish heartbeats (see the sketch below); however, NM might 
generate different requests with the same heartbeat id on heartbeat failure.
# The cache for containers inside NM is not maintained correctly on heartbeat 
failure.

I submitted a PR: https://github.com/apache/hadoop/pull/2204. I tried to make 
as few code changes as possible. However, I'd say some of the cache structures 
the NM uses (recentlyStoppedContainers, pendingCompletedContainers) are kind of 
complex and error-prone. E.g. the cache is updated inside getContainerStatuses 
regardless of the outcome, before the heartbeat request is even sent. I'd 
suggest it may be worth doing a refactor in the future.

[~templedf] [~yuanbo] I'd appreciate it if you could help with the review. 
Thanks!

Note that this patch is not tested in our production Hadoop clusters yet.
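
To make issue 1 concrete, here is a simplified, from-memory sketch of the 
RM-side de-duplication (stand-in types, not the verbatim ResourceTrackerService 
code): the RM keys entirely on the responseId, so a request that reuses the 
previous responseId is answered with the cached last response and its payload 
is never processed.

{code:java}
// Simplified sketch with stand-in types instead of the real protocol records.
final class HeartbeatDedupSketch {
  private int lastResponseId = 0;          // responseId of the last response sent
  private String lastResponse = "resp-0";  // cached last response

  String nodeHeartbeat(int requestResponseId, String statusPayload) {
    if (requestResponseId + 1 == lastResponseId) {
      // Duplicate heartbeat: statusPayload (e.g. newly completed containers
      // added to a regenerated request) is silently ignored.
      return lastResponse;
    }
    if (requestResponseId + 1 < lastResponseId) {
      return "RESYNC";                      // node is too far behind
    }
    lastResponseId = requestResponseId + 1; // normal case: process and advance
    lastResponse = "resp-" + lastResponseId;
    return lastResponse;
  }
}
{code}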



[jira] [Updated] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-12 Thread zhenzhao wang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhenzhao wang updated YARN-10393:
-
Affects Version/s: 3.4.0
   3.3.0
   2.6.1
   2.7.2
   2.6.2
   3.0.0
   2.9.2
   3.2.1
   3.1.3

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, 
> 3.4.0
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
>
> This was a bug we had seen multiple times on Hadoop 2.6.2. And the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
> We hadn't seen it after 2.9 in our env. However, it was because of the RPC 
> retry policy change and other changes. There's still a possibility even with 
> the current code if I didn't miss anything.
> *High-level description:*
>  We had seen a starving mapper issue several times. The MR job was stuck in a 
> live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn’t get any resource to continue, and the application 
> master failed to preempt the reducer, thus leaving the job stuck. The reason 
> why the application master didn’t preempt the reducer was that there was a 
> leaked container in the assigned mappers: the node manager failed to report 
> the completed container to the resource manager.
> *Detailed steps:*
>  
>  # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
>  # The container finished on 2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
>  # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request was actually handled by the 
> resource manager; however, the node manager failed to receive the response. 
> Let’s assume the heartBeatResponseId=$hid in the node manager. According to 
> our current configuration, the next heartbeat will be sent 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at 

[jira] [Updated] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-08 Thread zhenzhao wang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhenzhao wang updated YARN-10393:
-
Description: 
This was a bug we had seen multiple times on Hadoop 2.6.2. And the following 
analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.6.2. 
We hadn't seen it after 2.9 in our env. However, it was because of the RPC 
retry policy change and other changes. There's still a possibility even with 
the current code if I didn't miss anything.

*High-level description:*

 We had seen a starving mapper issue several times. The MR job was stuck in a 
live-lock state and couldn't make any progress. The queue was full, so the 
pending mapper couldn’t get any resource to continue, and the application master 
failed to preempt the reducer, thus leaving the job stuck. The reason why the 
application master didn’t preempt the reducer was that there was a leaked 
container in the assigned mappers: the node manager failed to report the 
completed container to the resource manager.

*Detailed steps:*

 
 # Container_1501226097332_249991_01_000199 was assigned to 
attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
{code:java}
appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_1501226097332_249991_01_000199 to 
attempt_1501226097332_249991_m_95_0
{code}
 # The container finished on 2017-08-08 16:02:53,313.
{code:java}
yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: 
Container container_1501226097332_249991_01_000199 transitioned from RUNNING to 
EXITED_WITH_SUCCESS
yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
 Cleaning up container container_1501226097332_249991_01_000199
{code}
 # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
16:07:04,238. In fact, the heartbeat request was actually handled by the 
resource manager; however, the node manager failed to receive the response. 
Let’s assume the heartBeatResponseId=$hid in the node manager. According to our 
current configuration, the next heartbeat will be sent 10s later.
{code:java}
2017-08-08 16:07:04,238 ERROR 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
exception in status-updater
java.io.IOException: Failed on local exception: java.io.IOException: Connection 
reset by peer; Host Details : local host is: ; destination host is: XXX
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
at org.apache.hadoop.ipc.Client.call(Client.java:1472)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
at 
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
at 
org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at 
org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:513)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
at 

[jira] [Updated] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-08 Thread zhenzhao wang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhenzhao wang updated YARN-10393:
-
Description: 
This was a bug we had seen multiple times on Hadoop 2.4.x. And the following 
analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.4.x. 
We hadn't seen it after 2.6 in our env. However, it was because of the RPC 
retry policy change and other changes. There's still a possibility even with 
the current code if I didn't miss anything.

*High-level description:*

 We had seen a starving mapper issue several times. The MR job was stuck in a 
live-lock state and couldn't make any progress. The queue was full, so the 
pending mapper couldn’t get any resource to continue, and the application master 
failed to preempt the reducer, thus leaving the job stuck. The reason why the 
application master didn’t preempt the reducer was that there was a leaked 
container in the assigned mappers: the node manager failed to report the 
completed container to the resource manager.

*Detailed steps:*

 
 # Container_1501226097332_249991_01_000199 was assigned to 
attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
{code:java}
appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_1501226097332_249991_01_000199 to 
attempt_1501226097332_249991_m_95_0
{code}
 # The container finished on 2017-08-08 16:02:53,313.
{code:java}
yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: 
Container container_1501226097332_249991_01_000199 transitioned from RUNNING to 
EXITED_WITH_SUCCESS
yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
 Cleaning up container container_1501226097332_249991_01_000199
{code}
 # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
16:07:04,238. In fact, the heartbeat request was actually handled by the 
resource manager; however, the node manager failed to receive the response. 
Let’s assume the heartBeatResponseId=$hid in the node manager. According to our 
current configuration, the next heartbeat will be sent 10s later.
{code:java}
2017-08-08 16:07:04,238 ERROR 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
exception in status-updater
java.io.IOException: Failed on local exception: java.io.IOException: Connection 
reset by peer; Host Details : local host is: ; destination host is: XXX
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
at org.apache.hadoop.ipc.Client.call(Client.java:1472)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
at 
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
at 
org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at 
org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:513)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
at 

[jira] [Updated] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-08 Thread zhenzhao wang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhenzhao wang updated YARN-10393:
-
Description: 
This was a bug we had seen multiple times on Hadoop 2.4.x. And the following 
analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.4.x. 
We hadn't seen it after 2.6 in our env. However, it was because of the RPC 
retry policy change and other changes. There's still a possibility even with 
the current code if I didn't miss anything.

*High-level description:*

 We had seen a starving mapper issue several times. The MR job was stuck in a 
live-lock state and couldn't make any progress. The queue was full, so the 
pending mapper couldn’t get any resource to continue, and the application master 
failed to preempt the reducer, thus leaving the job stuck. The reason why the 
application master didn’t preempt the reducer was that there was a leaked 
container in the assigned mappers: the node manager failed to report the 
completed container to the resource manager.

*Detailed steps:*

 
 # Container_1501226097332_249991_01_000199 was assigned to 
attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
{code:java}
appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_1501226097332_249991_01_000199 to 
attempt_1501226097332_249991_m_95_0
{code}

 # The container finished on 2017-08-08 16:02:53,313.
{code:java}
yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: 
Container container_1501226097332_249991_01_000199 transitioned from RUNNING to 
EXITED_WITH_SUCCESS
yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
 Cleaning up container container_1501226097332_249991_01_000199
{code}

 # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
16:07:04,238. In fact, the heartbeat request was actually handled by the 
resource manager; however, the node manager failed to receive the response. 
Let’s assume the heartBeatResponseId=$hid in the node manager. According to our 
current configuration, the next heartbeat will be sent 10s later.
{code:java}
2017-08-08 16:07:04,238 ERROR 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
exception in status-updater
java.io.IOException: Failed on local exception: java.io.IOException: Connection 
reset by peer; Host Details : local host is: ; destination host is: XXX
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
at org.apache.hadoop.ipc.Client.call(Client.java:1472)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
at 
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
at 
org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at 
org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:513)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
at 

[jira] [Updated] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-08 Thread zhenzhao wang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhenzhao wang updated YARN-10393:
-
Description: 
This was a bug we had seen multiple times on Hadoop 2.4.x.  And the following 
analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.4.x.  
We hadn't seen it after 2.6 in our env. However, it was because of the RPC 
retry policy change and other changes. There's still a possibility even with 
the current code if I didn't miss anything. 

*High-level description:*
We had seen a starving mapper issue several times. The MR job was stuck in a 
live-lock state and couldn't make any progress. The queue was full, so the 
pending mapper couldn’t get any resource to continue, and the application master 
failed to preempt the reducer, thus leaving the job stuck. The reason why the 
application master didn’t preempt the reducer was that there was a leaked 
container in the assigned mappers: the node manager failed to report the 
completed container to the resource manager.

*Detailed steps:*
# Container_1501226097332_249991_01_000199 was assigned to 
attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
{code:java}
appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_1501226097332_249991_01_000199 to 
attempt_1501226097332_249991_m_95_0
{code}
#  The container finished on  2017-08-08 16:02:53,313.
{code:java}
yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: 
Container container_1501226097332_249991_01_000199 transitioned from RUNNING to 
EXITED_WITH_SUCCESS
yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
 Cleaning up container container_1501226097332_249991_01_000199
{code}
# The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
16:07:04,238. In fact, the heartbeat request was actually handled by the 
resource manager; however, the node manager failed to receive the response. 
Let’s assume the heartBeatResponseId=$hid in the node manager. According to our 
current configuration, the next heartbeat will be sent 10s later.
{code:java}
2017-08-08 16:07:04,238 ERROR 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
exception in status-updater
java.io.IOException: Failed on local exception: java.io.IOException: Connection 
reset by peer; Host Details : local host is: ; destination host is: XXX
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
at org.apache.hadoop.ipc.Client.call(Client.java:1472)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
at 
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
at 
org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at 
org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:513)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
at 

[jira] [Updated] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-08 Thread zhenzhao wang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhenzhao wang updated YARN-10393:
-
Description: 
This was a bug we had seen multiple times on Hadoop 2.4.x.  And the following 
analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.4.x.  
We hadn't seen it after 2.6 in our env. However, it was because of the RPC 
retry policy change and other changes. There's still a possibility even with 
the current code if I didn't miss anything. 

*High-level description:*
We had seen a starving mapper issue several times. The MR job was stuck in a 
live-lock state and couldn't make any progress. The queue was full, so the 
pending mapper couldn’t get any resource to continue, and the application master 
failed to preempt the reducer, thus leaving the job stuck. The reason why the 
application master didn’t preempt the reducer was that there was a leaked 
container in the assigned mappers: the node manager failed to report the 
completed container to the resource manager.

*Detailed steps:*
# Container_1501226097332_249991_01_000199 was assigned to 
attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
{code:java}
appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_1501226097332_249991_01_000199 to 
attempt_1501226097332_249991_m_95_0
{code}
#  The container finished on  2017-08-08 16:02:53,313.
{code:java}
yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: 
Container container_1501226097332_249991_01_000199 transitioned from RUNNING to 
EXITED_WITH_SUCCESS
yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
 Cleaning up container container_1501226097332_249991_01_000199
{code}
# The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
16:07:04,238. In fact, the heartbeat request was actually handled by the 
resource manager; however, the node manager failed to receive the response. 
Let’s assume the heartBeatResponseId=$hid in the node manager. According to our 
current configuration, the next heartbeat will be sent 10s later.
{code:java}
2017-08-08 16:07:04,238 ERROR 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
exception in status-updater
java.io.IOException: Failed on local exception: java.io.IOException: Connection 
reset by peer; Host Details : local host is: ; destination host is: XXX
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
at org.apache.hadoop.ipc.Client.call(Client.java:1472)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
at 
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
at 
org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at 
org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:513)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
at 

[jira] [Updated] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM

2020-08-08 Thread zhenzhao wang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhenzhao wang updated YARN-10393:
-
Summary: MR job live lock caused by completed state container leak in 
heartbeat between node manager and RM  (was: MR job live lock caused by 
completed state container leak between node manager and RM heartbeat.)

> MR job live lock caused by completed state container leak in heartbeat 
> between node manager and RM
> --
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
>
> This was a bug we had seen multiple times on Hadoop 2.4.x.  And the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.4.x. 
>  We hadn't seen it after 2.6 in our env. However, it was because of the RPC 
> retry policy change and other changes. There's still a possibility even with 
> the current code if I didn't miss anything. 
> *High-level description:*
> We had seen a starving mapper issue several times. The MR job was stuck in a 
> live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn’t get any resource to continue, and the application 
> master failed to preempt the reducer, thus leaving the job stuck. The reason 
> why the application master didn’t preempt the reducer was that there was a 
> leaked container in the assigned mappers: the node manager failed to report 
> the completed container to the resource manager.
> *Detailed steps:*
> # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
> #  The container finished on  2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
> # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request was actually handled by the 
> resource manager; however, the node manager failed to receive the response. 
> Let’s assume the heartBeatResponseId=$hid in the node manager. According to 
> our current configuration, the next heartbeat will be sent 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at 

[jira] [Assigned] (YARN-10393) MR job live lock caused by completed state container leak between node manager and RM heartbeat.

2020-08-08 Thread zhenzhao wang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhenzhao wang reassigned YARN-10393:


Assignee: zhenzhao wang

> MR job live lock caused by completed state container leak between node 
> manager and RM heartbeat.
> 
>
> Key: YARN-10393
> URL: https://issues.apache.org/jira/browse/YARN-10393
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
>
> This was a bug we had seen multiple times on Hadoop 2.4.x.  And the following 
> analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.4.x. 
>  We hadn't seen it after 2.6 in our env. However, it was because of the RPC 
> retry policy change and other changes. There's still a possibility even with 
> the current code if I didn't miss anything. 
> *High-level description:*
> We had seen a starving mapper issue several times. The MR job was stuck in a 
> live-lock state and couldn't make any progress. The queue was full, so the 
> pending mapper couldn’t get any resource to continue, and the application 
> master failed to preempt the reducer, thus leaving the job stuck. The reason 
> why the application master didn’t preempt the reducer was that there was a 
> leaked container in the assigned mappers: the node manager failed to report 
> the completed container to the resource manager.
> *Detailed steps:*
> # Container_1501226097332_249991_01_000199 was assigned to 
> attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
> {code:java}
> appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned 
> container container_1501226097332_249991_01_000199 to 
> attempt_1501226097332_249991_m_95_0
> {code}
> #  The container finished on  2017-08-08 16:02:53,313.
> {code:java}
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1501226097332_249991_01_000199 transitioned from RUNNING 
> to EXITED_WITH_SUCCESS
> yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1501226097332_249991_01_000199
> {code}
> # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
> 16:07:04,238. In fact, the heartbeat request was actually handled by the 
> resource manager; however, the node manager failed to receive the response. 
> Let’s assume the heartBeatResponseId=$hid in the node manager. According to 
> our current configuration, the next heartbeat will be sent 10s later.
> {code:java}
> 2017-08-08 16:07:04,238 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
> exception in status-updater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: ; destination host 
> is: XXX
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> at org.apache.hadoop.ipc.Client.call(Client.java:1472)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:197)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
> at 
> 

[jira] [Created] (YARN-10393) MR job live lock caused by completed state container leak between node manager and RM heartbeat.

2020-08-08 Thread zhenzhao wang (Jira)
zhenzhao wang created YARN-10393:


 Summary: MR job live lock caused by completed state container leak 
between node manager and RM heartbeat.
 Key: YARN-10393
 URL: https://issues.apache.org/jira/browse/YARN-10393
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, yarn
Reporter: zhenzhao wang


This was a bug we had seen multiple times on Hadoop 2.4.x.  And the following 
analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.4.x.  
We hadn't seen it after 2.6 in our env. However, it was because of the RPC 
retry policy change and other changes. There's still a possibility even with 
the current code if I didn't miss anything. 

*High-level description:*
We had seen a starving mapper issue several times. The MR job was stuck in a 
live-lock state and couldn't make any progress. The queue was full, so the 
pending mapper couldn’t get any resource to continue, and the application master 
failed to preempt the reducer, thus leaving the job stuck. The reason why the 
application master didn’t preempt the reducer was that there was a leaked 
container in the assigned mappers: the node manager failed to report the 
completed container to the resource manager.

*Detailed steps:*
# Container_1501226097332_249991_01_000199 was assigned to 
attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417.
{code:java}
appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_1501226097332_249991_01_000199 to 
attempt_1501226097332_249991_m_95_0
{code}

#  The container finished on  2017-08-08 16:02:53,313.
{code:java}
yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: 
Container container_1501226097332_249991_01_000199 transitioned from RUNNING to 
EXITED_WITH_SUCCESS
yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
 Cleaning up container container_1501226097332_249991_01_000199
{code}

# The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 
16:07:04,238. In fact, the heartbeat request was actually handled by the 
resource manager; however, the node manager failed to receive the response. 
Let’s assume the heartBeatResponseId=$hid in the node manager. According to our 
current configuration, the next heartbeat will be sent 10s later.
{code:java}
2017-08-08 16:07:04,238 ERROR 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught 
exception in status-updater
java.io.IOException: Failed on local exception: java.io.IOException: Connection 
reset by peer; Host Details : local host is: ; destination host is: XXX
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
at org.apache.hadoop.ipc.Client.call(Client.java:1472)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source)
at 
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
at 
org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at 

[jira] [Commented] (YARN-9616) Shared Cache Manager Failed To Upload Unpacked Resources

2019-08-05 Thread zhenzhao wang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900507#comment-16900507
 ] 

zhenzhao wang commented on YARN-9616:
-

[~smarthan] Sorry, I missed the message. I have a patch which works well in our 
cluster internally. However, I haven't had a chance to clean it up and 
contribute it to the public repo. I uploaded [^YARN-9616.001-2.9.patch] for 
reference. Feel free to share your patch. Thanks.
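
For anyone skimming the description quoted below: the failure boils down to a 
path mismatch once an archive is auto-unpacked during localization. Here is a 
minimal, self-contained illustration of that mismatch; the directory layout and 
class are hypothetical stand-ins, not the actual SharedCacheUploader code.
{code:java}
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration of the path mismatch: after auto-unpacking, the
// archive name becomes a directory holding the extracted files, and the
// original zip no longer exists under the localized path.
public class UnpackedResourceDemo {
  public static void main(String[] args) throws IOException {
    // Simulate a localized, auto-unpacked archive, e.g. .../filecache/352/one.zip/
    Path localizedDir = Files.createTempDirectory("filecache-352");
    Path unpackedDir = Files.createDirectories(localizedDir.resolve("one.zip"));
    Files.createFile(unpackedDir.resolve("file1"));
    Files.createFile(unpackedDir.resolve("file2"));

    // An uploader that assumes the original archive is still present would
    // look for .../one.zip/one.zip, which was never created.
    File expectedArchive = unpackedDir.resolve("one.zip").toFile();
    if (!expectedArchive.exists()) {
      // This is the condition behind the FileNotFoundException in the trace.
      System.out.println("File " + expectedArchive + " does not exist");
    }
  }
}
{code}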

> Shared Cache Manager Failed To Upload Unpacked Resources
> 
>
> Key: YARN-9616
> URL: https://issues.apache.org/jira/browse/YARN-9616
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.3, 2.9.2, 2.8.5
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
> Attachments: YARN-9616.001-2.9.patch
>
>
> YARN will unpack archive files and some other files based on the file type 
> and configuration. E.g., if I start an MR job with -archive one.zip, then 
> one.zip will be unpacked during download. Let's say there are file1 and file2 
> inside one.zip. Then the files kept on the local disk will be 
> /disk3/yarn/local/filecache/352/one.zip/file1 and 
> /disk3/yarn/local/filecache/352/one.zip/file2. So the shared cache uploader 
> couldn't upload one.zip to the shared cache, as it was removed during 
> localization. The following errors will be thrown.
> {code:java}
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader:
>  Exception while uploading the file dict.zip
> java.io.FileNotFoundException: File 
> /disk3/yarn/local/filecache/352/one.zip/one.zip does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:631)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:857)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:621)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:442)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:146)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:347)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:926)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.computeChecksum(SharedCacheUploader.java:257)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.call(SharedCacheUploader.java:128)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.call(SharedCacheUploader.java:55)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}






[jira] [Updated] (YARN-9616) Shared Cache Manager Failed To Upload Unpacked Resources

2019-08-05 Thread zhenzhao wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhenzhao wang updated YARN-9616:

Attachment: YARN-9616.001-2.9.patch

> Shared Cache Manager Failed To Upload Unpacked Resources
> 
>
> Key: YARN-9616
> URL: https://issues.apache.org/jira/browse/YARN-9616
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.3, 2.9.2, 2.8.5
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
> Attachments: YARN-9616.001-2.9.patch
>
>
> YARN will unpack archive files and some other files based on the file type 
> and configuration. E.g., if I start an MR job with -archive one.zip, then 
> one.zip will be unpacked during download. Let's say there are file1 and file2 
> inside one.zip. Then the files kept on the local disk will be 
> /disk3/yarn/local/filecache/352/one.zip/file1 and 
> /disk3/yarn/local/filecache/352/one.zip/file2. So the shared cache uploader 
> couldn't upload one.zip to the shared cache, as it was removed during 
> localization. The following errors will be thrown.
> {code:java}
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader:
>  Exception while uploading the file dict.zip
> java.io.FileNotFoundException: File 
> /disk3/yarn/local/filecache/352/one.zip/one.zip does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:631)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:857)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:621)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:442)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:146)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:347)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:926)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.computeChecksum(SharedCacheUploader.java:257)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.call(SharedCacheUploader.java:128)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.call(SharedCacheUploader.java:55)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}






[jira] [Updated] (YARN-5727) Improve YARN shared cache support for LinuxContainerExecutor

2019-06-12 Thread zhenzhao wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-5727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhenzhao wang updated YARN-5727:

Attachment: YARN-5727-Design-v2.pdf

> Improve YARN shared cache support for LinuxContainerExecutor
> 
>
> Key: YARN-5727
> URL: https://issues.apache.org/jira/browse/YARN-5727
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chris Trezzo
>Assignee: zhenzhao wang
>Priority: Major
> Attachments: YARN-5727-Design-v1.pdf, YARN-5727-Design-v2.pdf, 
> YARN-5727.001.patch
>
>
> When running LinuxContainerExecutor in a secure mode 
> ({{yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users}} set 
> to {{false}}), all localized files are owned by the user that owns the 
> container which localized the resource. This presents a problem for the 
> shared cache when a YARN application requests a resource to be uploaded to 
> the shared cache that has a non-public visibility. The shared cache uploader 
> (running as the node manager user) does not have access to the localized 
> files and cannot compute the checksum of the file or upload it to the cache. 
> The solution should ideally satisfy the following three requirements:
> # Localized files should still be safe/secure. Other users that run 
> containers should not be able to modify or delete the publicly localized 
> files of others.
> # The node manager user should be able to access these files for the purpose 
> of checksumming and uploading to the shared cache without being a privileged 
> user.
> # The solution should avoid making unnecessary copies of the localized files.






[jira] [Commented] (YARN-9616) Shared Cache Manager Failed To Upload Unpacked Resources

2019-06-10 Thread zhenzhao wang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860417#comment-16860417
 ] 

zhenzhao wang commented on YARN-9616:
-

I had seen this issue in 2.9 and 2.6. More checking is needed to determine 
whether the problem exists in the latest version.

> Shared Cache Manager Failed To Upload Unpacked Resources
> 
>
> Key: YARN-9616
> URL: https://issues.apache.org/jira/browse/YARN-9616
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.3, 2.9.2, 2.8.5
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
>
> YARN will unpack archive files and some other files based on the file type 
> and configuration. E.g., if I start an MR job with -archive one.zip, then 
> one.zip will be unpacked during download. Let's say there are file1 and file2 
> inside one.zip. Then the files kept on the local disk will be 
> /disk3/yarn/local/filecache/352/one.zip/file1 and 
> /disk3/yarn/local/filecache/352/one.zip/file2. So the shared cache uploader 
> couldn't upload one.zip to the shared cache, as it was removed during 
> localization. The following errors will be thrown.
> {code:java}
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader:
>  Exception while uploading the file dict.zip
> java.io.FileNotFoundException: File 
> /disk3/yarn/local/filecache/352/one.zip/one.zip does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:631)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:857)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:621)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:442)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:146)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:347)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:926)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.computeChecksum(SharedCacheUploader.java:257)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.call(SharedCacheUploader.java:128)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.call(SharedCacheUploader.java:55)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}






[jira] [Updated] (YARN-9616) Shared Cache Manager Failed To Upload Unpacked Resources

2019-06-10 Thread zhenzhao wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhenzhao wang updated YARN-9616:

Affects Version/s: 2.8.3
   2.9.2

> Shared Cache Manager Failed To Upload Unpacked Resources
> 
>
> Key: YARN-9616
> URL: https://issues.apache.org/jira/browse/YARN-9616
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.3, 2.9.2, 2.8.5
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
>
> YARN will unpack archive files and some other files based on the file type 
> and configuration. E.g., if I start an MR job with -archive one.zip, then 
> one.zip will be unpacked during download. Let's say there are file1 and file2 
> inside one.zip. Then the files kept on the local disk will be 
> /disk3/yarn/local/filecache/352/one.zip/file1 and 
> /disk3/yarn/local/filecache/352/one.zip/file2. So the shared cache uploader 
> couldn't upload one.zip to the shared cache, as it was removed during 
> localization. The following errors will be thrown.
> {code:java}
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader:
>  Exception while uploading the file dict.zip
> java.io.FileNotFoundException: File 
> /disk3/yarn/local/filecache/352/one.zip/one.zip does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:631)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:857)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:621)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:442)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:347)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:926)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.computeChecksum(SharedCacheUploader.java:257)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.call(SharedCacheUploader.java:128)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.call(SharedCacheUploader.java:55)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9616) Shared Cache Manager Failed To Upload Unpacked Resources

2019-06-10 Thread zhenzhao wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhenzhao wang updated YARN-9616:

Affects Version/s: 2.8.5

> Shared Cache Manager Failed To Upload Unpacked Resources
> 
>
> Key: YARN-9616
> URL: https://issues.apache.org/jira/browse/YARN-9616
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.5
>Reporter: zhenzhao wang
>Assignee: zhenzhao wang
>Priority: Major
>
> YARN unpacks archive files and some other files based on the file type and 
> configuration. For example, if I start an MR job with -archives one.zip, then 
> one.zip is unpacked during download. Let's say there are file1 and file2 inside 
> one.zip. The files kept on the local disk will then be 
> /disk3/yarn/local/filecache/352/one.zip/file1 and 
> /disk3/yarn/local/filecache/352/one.zip/file2. So the shared cache uploader 
> can't upload one.zip to the shared cache, as the archive itself was removed 
> during localization. The following error is thrown:
> {code:java}
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader:
>  Exception while uploading the file dict.zip
> java.io.FileNotFoundException: File 
> /disk3/yarn/local/filecache/352/one.zip/one.zip does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:631)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:857)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:621)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:442)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:347)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:926)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.computeChecksum(SharedCacheUploader.java:257)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.call(SharedCacheUploader.java:128)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.call(SharedCacheUploader.java:55)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9616) Shared Cache Manager Failed To Upload Unpacked Resources

2019-06-10 Thread zhenzhao wang (JIRA)
zhenzhao wang created YARN-9616:
---

 Summary: Shared Cache Manager Failed To Upload Unpacked Resources
 Key: YARN-9616
 URL: https://issues.apache.org/jira/browse/YARN-9616
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: zhenzhao wang
Assignee: zhenzhao wang


YARN unpacks archive files and some other files based on the file type and 
configuration. For example, if I start an MR job with -archives one.zip, then 
one.zip is unpacked during download. Let's say there are file1 and file2 inside 
one.zip. The files kept on the local disk will then be 
/disk3/yarn/local/filecache/352/one.zip/file1 and 
/disk3/yarn/local/filecache/352/one.zip/file2. So the shared cache uploader 
can't upload one.zip to the shared cache, as the archive itself was removed 
during localization. The following error is thrown:

{code:java}
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader:
 Exception while uploading the file dict.zip
java.io.FileNotFoundException: File 
/disk3/yarn/local/filecache/352/one.zip/one.zip does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:631)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:857)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:621)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:442)
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146)
at 
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:347)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:926)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.computeChecksum(SharedCacheUploader.java:257)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.call(SharedCacheUploader.java:128)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.call(SharedCacheUploader.java:55)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

{code}
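
A minimal sketch of the kind of guard the uploader could apply before computing the checksum (my own illustration, not the attached patch; the class and method names are hypothetical): if the localized path is a directory left behind by archive expansion, skip the upload instead of failing on one.zip/one.zip.

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical helper: decide whether a localized resource can be uploaded to
 * the shared cache as a single file. Archives are unpacked during localization
 * (e.g. .../filecache/352/one.zip/ becomes a directory), so the original
 * archive no longer exists and its checksum cannot be computed.
 */
public class SharedCacheUploadCheck {

  /** Returns true only if the localized path is a plain file that still exists. */
  public static boolean isUploadable(Configuration conf, Path localized)
      throws IOException {
    FileSystem localFs = FileSystem.getLocal(conf);
    if (!localFs.exists(localized)) {
      return false;               // removed or never materialized
    }
    FileStatus status = localFs.getFileStatus(localized);
    return status.isFile();       // unpacked archives show up as directories
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // For an unpacked archive this prints "false" instead of throwing the
    // FileNotFoundException seen in the checksum path above.
    Path localized = new Path("/disk3/yarn/local/filecache/352/one.zip");
    System.out.println("uploadable? " + isUploadable(conf, localized));
  }
}
{code}

An alternative is to re-pack the directory before upload, which is closer to the full directory support tracked in YARN-6097.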





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-2774) shared cache service should authorize calls properly

2019-05-17 Thread zhenzhao wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhenzhao wang reassigned YARN-2774:
---

Assignee: zhenzhao wang

> shared cache service should authorize calls properly
> 
>
> Key: YARN-2774
> URL: https://issues.apache.org/jira/browse/YARN-2774
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Sangjin Lee
>Assignee: zhenzhao wang
>Priority: Major
>
> The shared cache manager (SCM) services should authorize calls properly.
> Currently, the uploader service (done in YARN-2186) does not authorize calls 
> that notify the SCM of newly uploaded resources; proper security/authorization 
> needs to be added to this RPC call. The use/release calls (done in YARN-2188) 
> and the scmAdmin commands (done in YARN-2189) are also not properly authorized, 
> and the same applies to the SCM UI done in YARN-2203.
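
As a rough sketch of the missing check (not taken from any of the linked patches; the configuration key and class name are made up), an RPC handler can resolve the calling user and gate the call on an ACL before acting on it:

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.Server;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.authorize.AccessControlList;

/** Hypothetical sketch of authorizing an SCM-facing RPC call. */
public class ScmCallAuthorizer {

  private final AccessControlList acl;

  public ScmCallAuthorizer(Configuration conf) {
    // "yarn.sharedcache.uploader.acl" is a made-up key for this sketch.
    this.acl = new AccessControlList(
        conf.get("yarn.sharedcache.uploader.acl", "*"));
  }

  /** Throws if the calling user is not allowed to notify the SCM. */
  public void checkAccess() throws IOException {
    // Inside an RPC handler the remote user is available from the server;
    // fall back to the current UGI when called outside of an RPC context.
    UserGroupInformation caller = Server.getRemoteUser();
    if (caller == null) {
      caller = UserGroupInformation.getCurrentUser();
    }
    if (!acl.isUserAllowed(caller)) {
      throw new IOException("User " + caller.getShortUserName()
          + " is not authorized to call the shared cache uploader service");
    }
  }
}
{code}

The same pattern could back the use/release calls and the scmAdmin commands mentioned above.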



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-6097) Add support for directories in the Shared Cache

2019-02-25 Thread zhenzhao wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-6097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhenzhao wang reassigned YARN-6097:
---

Assignee: zhenzhao wang

> Add support for directories in the Shared Cache
> ---
>
> Key: YARN-6097
> URL: https://issues.apache.org/jira/browse/YARN-6097
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chris Trezzo
>Assignee: zhenzhao wang
>Priority: Major
>
> Add support for directories in the shared cache.
> If a LocalResource URL points to a directory, the directory structure is 
> preserved during localization on the node manager. Currently, the shared 
> cache does not support directories and will fail to upload the URL to the 
> cache if shouldBeUploadedToSharedCache is set to true.
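
For context, this is roughly how a client flags a plain-file resource for shared-cache upload today (a sketch assuming the LocalResource shared-cache API; the path and class name are made up). The flag is only honored for files, which is the limitation described above:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;

/** Sketch: build a file LocalResource and mark it for shared-cache upload. */
public class SharedCacheResourceExample {

  public static LocalResource fileResource(Configuration conf, Path file)
      throws Exception {
    FileSystem fs = file.getFileSystem(conf);
    FileStatus status = fs.getFileStatus(file);
    LocalResource rsrc = LocalResource.newInstance(
        ConverterUtils.getYarnUrlFromPath(file),
        LocalResourceType.FILE,
        LocalResourceVisibility.PUBLIC,
        status.getLen(), status.getModificationTime());
    // Ask the node manager uploader to push this resource to the shared cache.
    // A directory-typed resource would need the support this issue proposes.
    rsrc.setShouldBeUploadedToSharedCache(true);
    return rsrc;
  }
}
{code}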



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-6910) Increase RM audit log coverage

2017-08-07 Thread zhenzhao wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhenzhao wang reassigned YARN-6910:
---

Assignee: zhenzhao wang

> Increase RM audit log coverage
> --
>
> Key: YARN-6910
> URL: https://issues.apache.org/jira/browse/YARN-6910
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Ming Ma
>Assignee: zhenzhao wang
>
> RM's audit logger logs certain API calls. It will be useful to increase its 
> coverage to include methods like {{getApplications}}. In addition, the audit 
> logger should track calls from REST APIs as well as RPC calls.
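
To make the ask concrete, here is a sketch of an audit record on a read path such as getApplications, reusing the existing RMAuditLogger.logSuccess pattern (the operation string and wrapper class are made up for illustration):

{code:java}
import java.io.IOException;

import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger;

/** Sketch: add an audit record to a read-only call such as getApplications. */
public class AuditCoverageSketch {

  public static void auditGetApplications() throws IOException {
    String user = UserGroupInformation.getCurrentUser().getShortUserName();
    // Same pattern ClientRMService already uses for mutating calls such as
    // application submission, applied here to a read path.
    RMAuditLogger.logSuccess(user, "Get Applications Request", "ClientRMService");
  }
}
{code}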



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org