[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17207797#comment-17207797 ] zhenzhao wang commented on YARN-10393: -- +1, LGTM, thanks. > MR job live lock caused by completed state container leak in heartbeat > between node manager and RM > -- > > Key: YARN-10393 > URL: https://issues.apache.org/jira/browse/YARN-10393 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, yarn > Affects Versions: 2.6.1, 2.7.2, 2.6.2, 3.0.0, 2.9.2, 3.3.0, 3.2.1, 3.1.3, > 3.4.0 > Reporter: zhenzhao wang > Assignee: Jim Brennan > Priority: Major > Attachments: YARN-10393.001.patch, YARN-10393.002.patch, > YARN-10393.draft.2.patch, YARN-10393.draft.patch > > > This is a bug we saw multiple times on Hadoop 2.6.2; the following > analysis is based on the core dump, logs, and code from 2017 with Hadoop 2.6.2. > We haven't seen it after 2.9 in our environment, but that is due to the RPC > retry policy change and other changes; unless I missed something, it is still > possible with the current code. > *High-level description:* > We had seen a starving-mapper issue several times: the MR job got stuck in a > live-lock state and couldn't make any progress. The queue was full, so the > pending mapper couldn't get any resources to continue, and the application master > failed to preempt the reducer, causing the job to be stuck. The application > master didn't preempt the reducer because a leaked container remained in the > assigned mappers: the node manager failed to report the completed container > to the resource manager. > *Detailed steps:* > > # Container_1501226097332_249991_01_000199 was assigned to > attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417. 
> {code:java} > appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned > container container_1501226097332_249991_01_000199 to > attempt_1501226097332_249991_m_95_0 > {code} > # The container finished on 2017-08-08 16:02:53,313. > {code:java} > yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: > Container container_1501226097332_249991_01_000199 transitioned from RUNNING > to EXITED_WITH_SUCCESS > yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: > Cleaning up container container_1501226097332_249991_01_000199 > {code} > # The NodeStatusUpdater got an exception in the heartbeat on 2017-08-08 > 16:07:04,238. The heartbeat request was actually handled by the resource > manager; however, the node manager failed to receive the response. Let’s > assume heartBeatResponseId=$hid in the node manager. With our current > configuration, the next heartbeat would be sent 10s later. 
> {code:java} > 2017-08-08 16:07:04,238 ERROR > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught > exception in status-updater > java.io.IOException: Failed on local exception: java.io.IOException: > Connection reset by peer; Host Details : local host is: ; destination host > is: XXX > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1472) > at org.apache.hadoop.ipc.Client.call(Client.java:1399) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source) > at > org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) > at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source) > at > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at
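The lost-response failure in the steps above can be sketched with a small illustrative model (field names like `nmPending` and `rmCompleted` are invented for the sketch, not the actual NodeStatusUpdaterImpl/ResourceTrackerService fields). The RM identifies heartbeats by id and ignores the payload of an id it already processed; if the NM regenerates the request on RPC retry and clears its pending completed containers once any response arrives, a container that completed between the original send and the retry is never reported:

```java
import java.util.*;

// Illustrative model of the completed-container leak described above.
class HeartbeatLeakModel {
    // RM side: next heartbeat id it expects, and completions it has learned of.
    int rmNextExpectedId = 0;
    final Set<String> rmCompleted = new HashSet<>();

    // NM side: id taken from the last response, and unreported completions.
    int nmResponseId = 0;
    final List<String> nmPending = new ArrayList<>();

    // One RPC attempt. 'loseResponse' simulates "RM handled the request but
    // the NM never received the response" (the connection-reset case above).
    void attempt(boolean loseResponse) {
        List<String> payload = new ArrayList<>(nmPending);
        if (nmResponseId == rmNextExpectedId) {  // fresh heartbeat: RM processes it
            rmCompleted.addAll(payload);
            rmNextExpectedId++;
        }                                        // stale id: RM ignores the payload
        if (!loseResponse) {
            nmResponseId = rmNextExpectedId;
            nmPending.clear();                   // NM assumes everything was delivered
        }
    }
}
```

Running the scenario: a heartbeat whose response is lost, a container completing before the retry, then the retry itself. The retry carries the stale id, the RM ignores its payload, and the NM still clears its pending list, so the completion is lost on both sides.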
[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202738#comment-17202738 ] zhenzhao wang commented on YARN-10393: -- [~Jim_Brennan] Feel free to re-assign the ticket to yourself if you are interested; you have been contributing more to the discussion and solution recently.
[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202737#comment-17202737 ] zhenzhao wang commented on YARN-10393: -- [~Jim_Brennan] Sorry, I missed the msg. Thanks a lot for all the discussion and suggestions. Feel free to put up the patch.
[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189046#comment-17189046 ] zhenzhao wang commented on YARN-10393: -- And one more thing to clarify: the following code in the current PR could be avoided, because getNodeStatus() is only called when the heartbeatId changes, so pendingCompletedContainers won't be updated on retry (getNodeStatus() is not called twice). I added it as a safeguard: the node manager keeps sending completed containers until they are confirmed in a response, which could prevent potential errors in RM or RM-AM communication. But as [~Jim_Brennan] pointed out, it might cause duplicate reports for the same completed containers. {quote} pendingCompletedContainers.remove(containerId); {quote}
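The ack-based safeguard discussed in the comment above can be sketched as follows (names such as `pending` and `onHeartbeatResponse` are illustrative, not the actual NodeStatusUpdaterImpl API): a completed container stays in the pending map until the RM explicitly acknowledges its id in a heartbeat response, so a lost response can delay a report but never drop it.

```java
import java.util.*;

// Sketch of "resend completed containers until acked" from the discussion.
class PendingCompletedTracker {
    private final Map<String, String> pending = new LinkedHashMap<>(); // id -> final state

    void containerCompleted(String id, String finalState) {
        pending.put(id, finalState);
    }

    // Every heartbeat resends everything still unacknowledged.
    Collection<String> idsToSend() {
        return new ArrayList<>(pending.keySet());
    }

    // Drop only what the RM acknowledged; the rest is resent next round.
    void onHeartbeatResponse(Collection<String> ackedIds) {
        for (String id : ackedIds) {
            pending.remove(id);
        }
    }
}
```

The cost is exactly the one raised in the comment: the same completion can appear in consecutive heartbeats until the ack arrives, so the RM has to treat completed-container reports idempotently.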
[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189009#comment-17189009 ] zhenzhao wang commented on YARN-10393: -- Thanks all for the great discussion. As stated earlier, the problem can be discussed in two aspects: {quote} # RM and NM have a different understanding of a heartbeat. RM uses the heartbeatId to distinguish heartbeats; however, NM might generate different requests with the same heartbeat id on heartbeat failure. # The cache for containers inside NM is not maintained correctly on heartbeat failure. {quote} The first problem can lead to multiple missing report fields. The potentially missing fields include completed containers (which leads to the live lock in this case), increasedContainers (I didn't dig into the impact), etc. It also means people had better be aware of this when adding new heartbeat fields in the future. I hope we can fix it too, but I agree with the concern about changing the protocol. If we don't want to fix it in this jira, we should keep track of it. What do you think? [~Jim_Brennan] As for the second problem, it's directly related to the missing-completed-container issue. [~Jim_Brennan] proposed a good approach; [~yuanbo] [~adam.antal] also made good points. We couldn't clear pendingCompletedContainers on the first successful response after a failure. The marker approach works, while the heartbeatId-comparison approach wouldn't.
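The first aspect above, guaranteeing that the NM never sends two different payloads under the same heartbeat id, can be sketched like this (invented names, not the Hadoop API): the NM builds the node status at most once per heartbeat id and reuses the cached request on RPC retries.

```java
// Sketch of the "one node status per heartbeat id" guarantee discussed above.
class StatusPerHeartbeatId {
    private int lastBuiltForId = Integer.MIN_VALUE;
    private String cachedRequest;

    // 'buildStatus' stands in for the expensive getNodeStatus() call.
    String requestFor(int responseId, java.util.function.IntFunction<String> buildStatus) {
        if (responseId != lastBuiltForId) {
            cachedRequest = buildStatus.apply(responseId);
            lastBuiltForId = responseId;
        }
        return cachedRequest;  // a retry with the same id resends the same request
    }
}
```

With this shape, the RM's assumption that a heartbeat id uniquely identifies a request actually holds, so the duplicate-id path on the RM side is safe to ignore.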
[jira] [Commented] (YARN-10398) Every NM will try to upload Jar/Archives/Files/Resources to Yarn Shared Cache Manager Like DDOS
[ https://issues.apache.org/jira/browse/YARN-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182949#comment-17182949 ] zhenzhao wang commented on YARN-10398: -- [~jiwq] I double-checked and confirmed the PR is the fix for the problem. The reason non-application-master containers try to upload is that the clearing code didn't work. The code and bug are in YARN; MR uses the yarn shared cache. I'm not sure we should move it to the MR project. Thanks. > Every NM will try to upload Jar/Archives/Files/Resources to Yarn Shared Cache > Manager Like DDOS > --- > > Key: YARN-10398 > URL: https://issues.apache.org/jira/browse/YARN-10398 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn > Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.9.1, 3.0.1, 3.0.2, 3.2.0, 3.1.1, > 2.9.2, 3.0.3, 3.0.4, 3.1.2, 3.3.0, 3.2.1, 2.9.3, 3.1.3, 3.2.2, 3.1.4, 3.4.0, > 3.3.1, 3.1.5 > Reporter: zhenzhao wang > Assignee: zhenzhao wang > Priority: Major > > The design of the yarn shared cache manager is that only the application master > should upload jars/files/resources. However, there has been a bug in the code > since 2.9.0: every node manager that takes a task of the job will try to upload the > jars/resources. Say one job has 5000 tasks: then up to 5000 NMs will try to upload > the jar. This is like a DDOS and creates a snowball effect. It ends up with > unavailability of the yarn shared cache manager, causes timeouts in localization, > and leads to job failure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
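The intended gating described in YARN-10398 can be sketched as follows. The configuration key and method names here are invented for illustration (the real MR/YARN code uses its own resource upload policies): the AM decides what to upload and then clears the flag from the task configuration, so the possibly thousands of task containers never re-upload the same jar.

```java
import java.util.*;

// Sketch of AM-only shared-cache upload gating; key name is hypothetical.
class SharedCacheUploadPolicy {
    static final String UPLOAD_KEY = "illustrative.upload.to.shared.cache";

    static boolean shouldUpload(Map<String, String> conf, boolean isAppMaster) {
        return isAppMaster && "true".equals(conf.get(UPLOAD_KEY));
    }

    // The bug described above amounts to this clearing step not taking effect:
    // every task container then evaluates the upload policy as true.
    static void clearForTasks(Map<String, String> conf) {
        conf.remove(UPLOAD_KEY);
    }
}
```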
[jira] [Comment Edited] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180989#comment-17180989 ] zhenzhao wang edited comment on YARN-10393 at 8/20/20, 7:03 AM: Thanks [~Jim_Brennan] [~yuanbo] for the comment! {quote}It seems to me that the change you made to NodeStatusUpdaterImpl.removeOrTrackCompletedContainersFromContext() is all that is required to ensure that the completed container status is not lost. I don't think you need to change the RM/NM protocol to manually resend the last NodeHeartbeatRequest again. As you noted, the RPC retry logic is already doing that. Also note that there is a lot of other state in that request, so I am not sure of the implications of not sending the most recent status for all that other state. Changing the protocol seems scary. {quote} [~Jim_Brennan] I guess the RM side assumes heartbeatId is the unique identification of a heartbeat. The old logic of generating a heartbeat couldn't guarantee this: it might generate a new request and update the cache even when the heartbeatId didn't change. I mean to make sure NM generates a request only when the heartbeatId changes. This semantic guarantee is more important than the retry itself and could help prevent other errors. E.g. a running container could also be lost in this case; it's just that it will be reported again in the next heartbeat. I agree that this change is scary, but I guess fixing it is even more meaningful than fixing the cache problem itself. {quote}But the change you made in removeOrTrackCompletedContainersFromContext() seems to go directly to the problem. The current code is always clearing pendingCompletedContainers at the end of that function. I've read through YARN-2997 and it seems like this was a late addition to the patch, but it is not clear to me why it was added. {quote} [~Jim_Brennan] Yeah, I mean to remove the cache entry only if the completed container is acked by RM. But the potential memory peak is a reasonable concern. [~yuanbo] also pointed it out, with a solution suggestion: {quote}This would be a potential memory leak if we remove "pendingCompletedContainers.clear()". I'd suggest that removing "!isContainerRecentlyStopped(containerId)" in NodeStatusUpdaterImpl.java [line: 613] would be good to fix this issue. if (!isContainerRecentlyStopped(containerId)) { pendingCompletedContainers.put(containerId, containerStatus); } Completed containers will be cached for 10 mins (default value) until the cache times out or a response is received from the heartbeat. And a 10-min cache for completed containers is long enough for retrying requests through the heartbeat (default interval is 10s). {quote} I guess this will end up with completed containers being sent multiple times if we just remove line 613. What about this? We keep pendingCompletedContainers.clear() unchanged, remove the completed containers in the heartbeat request from the cache (recentlyStoppedContainers) before sending the heartbeat, and then add the acked containers back to the cache. At a high level, this updates the cache only if the heartbeat succeeded with a response.
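The "update the cache only on a successful response" proposal from the comment above can be sketched like this (the name recentlyStoppedContainers mirrors the discussion, but this is an illustrative model, not the Hadoop implementation): drain the dedup cache into the heartbeat, then re-add only the containers the RM acknowledged. Acked containers stay suppressed by the usual expiry; unacked ones are absent from the cache, so the next status build re-reports them.

```java
import java.util.*;

// Sketch of drain-then-readd handling for the recently-stopped dedup cache.
class RecentlyStoppedCache {
    final Map<String, Long> recentlyStopped = new LinkedHashMap<>(); // id -> stop time

    // Take the batch for this heartbeat out of the cache before sending.
    Map<String, Long> takeForHeartbeat() {
        Map<String, Long> batch = new LinkedHashMap<>(recentlyStopped);
        recentlyStopped.clear();
        return batch;
    }

    // After the response: re-add only acked containers. On a lost response
    // nothing is acked, so everything gets re-reported next heartbeat.
    void onResponse(Map<String, Long> batch, Set<String> ackedIds) {
        for (Map.Entry<String, Long> e : batch.entrySet()) {
            if (ackedIds.contains(e.getKey())) {
                recentlyStopped.put(e.getKey(), e.getValue());
            }
        }
    }
}
```

Note that re-adding with the original stop time preserves the cache's 10-minute expiry semantics mentioned in the discussion, rather than restarting the clock on each ack.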
[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180989#comment-17180989 ] zhenzhao wang commented on YARN-10393: -- Thanks [~Jim_Brennan] [~yuanbo] for the comment! ??citation It seems to me that the change you made to NodeStatusUpdaterImpl.removeOrTrackCompletedContainersFromContext() is all that is required to ensure that the completed container status is not lost. I don't think you need to change the RM/NM protocol to manually resend the last NodeHeartbeatRequest again. As you noted, the RPC retry logic is already doing that. Also note that there is a lot of other state in that request, so I am not sure of the implications of not sending the most recent status for all that other state. Changing the protocol seems scary.?? [~Jim_Brennan] I guess the RM side assumes heartbeatId is the unique identification of a heartbeat. The old logic of generating a heartbeat couldn't guarantee this. It might generate a new request and update the cache even when the heartbeatid didn't change. I mean to make sure NM only generated request only if when heartbeatId changes. This semantic guarantee is more important than retry and could help prevent other errors. E.g. a running container is also possible to be lost in this case, it's just it will be reported again in the next heartbeat. I agree that this change is scary. But I guess fixing it is even more meaningful then fix the cache problem itself. ??But the change you made in removeOrTrackCompletedContainersFromContext() seems to go directly to the problem. The current code is always clearing pendingCompletedContainers at the end of that function. I've read through YARN-2997 and it seems like this was a late addition to the patch, but it is not clear to me why it was added. ?? [~Jim_Brennan] Yeah, I mean to remove the cache if only the completed container is backed by RM. But it's a reasonable concern of potential peak. 
[~yuanbo] also pointed it out, with a suggested solution. ??This would be a potential memory leak if we remove "pendingCompletedContainers.clear()". I'd suggest that removing "!isContainerRecentlyStopped(containerId)" in NodeStatusUpdaterImpl.java[line: 613] would be good to fix this issue. if (!isContainerRecentlyStopped(containerId)) { pendingCompletedContainers.put(containerId, containerStatus); } Completed containers will be cached in 10mins(default value) until it timeouts or gets response from heartbeat. And 10mins cache for completed container is long enough for retrying sending requests through heartbeat (default interval is 10s).?? I guess this will end up with completed containers being sent multiple times if we just remove line 613. What about this? We keep pendingCompletedContainers.clear() unchanged, but remove the completed containers carried in the heartbeat request from the cache (recentlyStoppedContainers) before sending the heartbeat, and then add the acked containers back to the cache afterwards. At a high level, this updates the cache only if the heartbeat succeeded with a response.
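The update-the-cache-only-on-ack semantics proposed above can be sketched as follows. This is a simplified model with invented names (HeartbeatCacheSketch, onHeartbeatAcked, and so on), not the actual NodeStatusUpdaterImpl code; it only illustrates that a completed container keeps being reported until some heartbeat response acknowledges it.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model of the proposed NM-side bookkeeping: completed containers stay
// pending until a heartbeat response arrives, and only acked containers are
// moved into the recently-stopped cache used for deduplication.
public class HeartbeatCacheSketch {
  private final Set<String> pendingCompleted = new LinkedHashSet<>();
  private final Set<String> recentlyStopped = new LinkedHashSet<>();

  // A container transitioned to a completed state locally.
  public void containerCompleted(String containerId) {
    pendingCompleted.add(containerId);
  }

  // Statuses to put into the next heartbeat: everything still pending is
  // reported again, no matter how many previous heartbeats were lost.
  public List<String> statusesForHeartbeat() {
    return new ArrayList<>(pendingCompleted);
  }

  // Called only when the RM's response is actually received; the reported
  // containers are now safe to dedupe and to drop from the pending set.
  public void onHeartbeatAcked(List<String> reported) {
    recentlyStopped.addAll(reported);
    pendingCompleted.removeAll(reported);
  }

  public boolean isPending(String containerId) {
    return pendingCompleted.contains(containerId);
  }

  public boolean isRecentlyStopped(String containerId) {
    return recentlyStopped.contains(containerId);
  }
}
```

If a heartbeat response is lost, statusesForHeartbeat() simply returns the same completed containers again on the next heartbeat, which is exactly the re-report the leaked container in this issue never got.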
[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177462#comment-17177462 ] zhenzhao wang commented on YARN-10393: -- [~adam.antal] This is a great question. First, it's not that we upgraded to 2.9.2 and the problem was gone; we stopped seeing new cases reported while we were still running 2.6.x. This was because we have a stand-alone policing service that can kill long-running mappers/reducers or the job itself. I guess all the users whose job patterns are prone to this problem have adopted the service to prevent it. Second, I guess this is also because of the default retry policy change. Here's the code that creates the RM proxy in 2.6; I don't see any retry on proxy invocation failure.

{code:java}
public ProtocolProxy getProxy(Class protocol, long clientVersion,
    InetSocketAddress addr, UserGroupInformation ticket, Configuration conf,
    SocketFactory factory, int rpcTimeout, RetryPolicy connectionRetryPolicy,
    AtomicBoolean fallbackToSimpleAuth) throws IOException {
  if (connectionRetryPolicy != null) {
    throw new UnsupportedOperationException(
        "Not supported: connectionRetryPolicy=" + connectionRetryPolicy);
  }
  T proxy = (T) Proxy.newProxyInstance(protocol.getClassLoader(),
      new Class[] { protocol },
      new Invoker(protocol, addr, ticket, conf, factory, rpcTimeout,
          fallbackToSimpleAuth));
  return new ProtocolProxy(protocol, proxy, true);
}

// Invoker.java
@Override
public Object invoke(Object proxy, Method method, Object[] args)
    throws Throwable {
  long startTime = 0;
  if (LOG.isDebugEnabled()) {
    startTime = Time.now();
  }
  TraceScope traceScope = null;
  if (Trace.isTracing()) {
    traceScope = Trace.startSpan(
        method.getDeclaringClass().getCanonicalName() + "." + method.getName());
  }
  ObjectWritable value;
  try {
    value = (ObjectWritable) client.call(RPC.RpcKind.RPC_WRITABLE,
        new Invocation(method, args), remoteId, fallbackToSimpleAuth);
  } finally {
    if (traceScope != null) traceScope.close();
  }
  if (LOG.isDebugEnabled()) {
    long callTime = Time.now() - startTime;
    LOG.debug("Call: " + method.getName() + " " + callTime);
  }
  return value.get();
}
{code}

And in 2.9, the RMProxy default retry policy is the following: up to 15 minutes with a fixed 30-second sleep, so the client can do lots of retries.

{code:java}
retryPolicy = RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
    rmConnectWaitMS (15 * 60 * 1000ms),
    rmConnectionRetryIntervalMS (30 * 1000ms),
    TimeUnit.MILLISECONDS);
{code}

There might be other changes I'm not aware of. However, I guess the above two reasons did make a difference in our clusters.
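As a rough sanity check on the retry budget above (this is just arithmetic on the two configuration values, not Hadoop's RetryPolicies implementation): a 15-minute window with a fixed 30-second sleep allows on the order of 30 retries per call.

```java
// Back-of-the-envelope retry budget for a fixed-sleep policy: with one
// attempt per sleep interval, the number of retries is roughly the total
// wait window divided by the sleep time.
public class RetryBudget {
  public static int approxMaxRetries(long maxWaitMs, long sleepMs) {
    return (int) (maxWaitMs / sleepMs);
  }
}
```

With that many in-RPC retries, a single lost heartbeat response is far less likely to be the last word, which matches the observation that the live lock mostly disappeared after 2.9.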
[jira] [Commented] (YARN-10398) Every NM will try to upload Jar/Archives/Files/Resources to Yarn Shared Cache Manager Like DDOS
[ https://issues.apache.org/jira/browse/YARN-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176758#comment-17176758 ] zhenzhao wang commented on YARN-10398: -- [~templedf] Could you please help review this patch? Thanks! > Every NM will try to upload Jar/Archives/Files/Resources to Yarn Shared Cache > Manager Like DDOS > --- > > Key: YARN-10398 > URL: https://issues.apache.org/jira/browse/YARN-10398 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.9.1, 3.0.1, 3.0.2, 3.2.0, 3.1.1, > 2.9.2, 3.0.3, 3.0.4, 3.1.2, 3.3.0, 3.2.1, 2.9.3, 3.1.3, 3.2.2, 3.1.4, 3.4.0, > 3.3.1, 3.1.5 >Reporter: zhenzhao wang >Assignee: zhenzhao wang >Priority: Major > > The design of the YARN shared cache manager is that only the application master > should upload the jars/files/resources. However, there has been a bug in the code > since 2.9.0. Every node manager that takes a task of the job will try to upload the > jars/resources. Say one job has 5000 tasks; then up to > 5000 NMs will try to upload the jar. This is like a DDOS and creates a snowball > effect. It will end up with unavailability of the YARN shared cache manager, and it > will cause timeouts in localization and lead to job failure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
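The fan-out described above can be illustrated with a toy count (invented names; this is not YARN's actual shared-cache uploader API): without an AM-only gate, a 5000-task job produces up to 5000 uploads to the shared cache manager instead of one.

```java
// Toy illustration of the fan-out: every NM running a task uploads unless
// the upload is gated to the application master only.
public class SharedCacheUploadGate {
  public static int uploadsSeenByScm(int numTasks, boolean amOnlyGate) {
    // Buggy behavior described in this issue: each task's NM uploads
    // independently. Intended behavior: only the application master uploads.
    return amOnlyGate ? 1 : numTasks;
  }
}
```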
[jira] [Created] (YARN-10398) Every NM will try to upload Jar/Archives/Files/Resources to Yarn Shared Cache Manager Like DDOS
zhenzhao wang created YARN-10398: Summary: Every NM will try to upload Jar/Archives/Files/Resources to Yarn Shared Cache Manager Like DDOS Key: YARN-10398 URL: https://issues.apache.org/jira/browse/YARN-10398 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 3.1.3, 3.2.1, 3.1.2, 3.0.3, 2.9.2, 3.1.1, 3.2.0, 3.0.2, 3.0.1, 2.9.1, 3.1.0, 3.0.0, 2.9.0, 3.0.4, 3.3.0, 2.9.3, 3.2.2, 3.1.4, 3.4.0, 3.3.1, 3.1.5 Reporter: zhenzhao wang Assignee: zhenzhao wang
[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176740#comment-17176740 ] zhenzhao wang commented on YARN-10393: -- [~bibinchundatt] [~adam.antal] [~Jim_Brennan] [~jdonofrio] [~aceric] Could you please help with the review? Thanks
[jira] [Commented] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176467#comment-17176467 ] zhenzhao wang commented on YARN-10393: -- I can see two issues here: # The RM and NM have different understandings of a heartbeat. The RM uses the heartbeatId to distinguish heartbeats. However, the NM might generate different requests with the same heartbeat id on heartbeat failure. # The caches for containers inside the NM are not maintained correctly on heartbeat failure. I submitted a PR: https://github.com/apache/hadoop/pull/2204. I tried to make as few code changes as possible. However, I'd say some of the cache structures the NM uses (recentlyStoppedContainers, pendingCompletedContainers) are complex and error-prone. E.g. the cache is updated in getContainerStatuses regardless of the outcome, which runs before the heartbeat request is even sent. It may be worth a refactor in the future. [~templedf] [~yuanbo] I'd appreciate it if you could help with the review. Thanks! Note that this patch has not been tested in our production Hadoop clusters yet.
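The first issue above — one immutable request per heartbeatId — can be sketched with a toy cache (hypothetical names, not the actual NM/RM protocol code): the NM builds the request once, replays the identical payload on retry, and only builds a fresh request after the RM acknowledges the previous id.

```java
import java.util.List;

// Toy model: the heartbeat request is built at most once per responseId and
// replayed verbatim on retry, so the RM's dedup-by-heartbeatId assumption holds.
public class HeartbeatRequestCache {
  private int lastResponseId = 0;
  private String cachedRequest = null;

  // Build (or replay) the request for the current responseId.
  public String nextRequest(List<String> containerStatuses) {
    if (cachedRequest == null) {
      cachedRequest = lastResponseId + ":" + String.join(",", containerStatuses);
    }
    return cachedRequest;
  }

  // A response arrived: advance the id and allow a fresh request.
  public void onResponse(int newResponseId) {
    lastResponseId = newResponseId;
    cachedRequest = null;
  }
}
```

With this invariant, a retried heartbeat can never carry a different container list under the same id, which is the mismatch that made the completed-container report disappear.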
[jira] [Updated] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhenzhao wang updated YARN-10393: - Affects Version/s: 3.4.0 3.3.0 2.6.1 2.7.2 2.6.2 3.0.0 2.9.2 3.2.1 3.1.3
[jira] [Updated] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhenzhao wang updated YARN-10393:

Description: (edited; the body is the same issue description quoted in the comment above)
[jira] [Updated] (YARN-10393) MR job live lock caused by completed state container leak in heartbeat between node manager and RM
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhenzhao wang updated YARN-10393:

Summary: MR job live lock caused by completed state container leak in heartbeat between node manager and RM (was: MR job live lock caused by completed state container leak between node manager and RM heartbeat.)
[jira] [Assigned] (YARN-10393) MR job live lock caused by completed state container leak between node manager and RM heartbeat.
[ https://issues.apache.org/jira/browse/YARN-10393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhenzhao wang reassigned YARN-10393:

Assignee: zhenzhao wang
[jira] [Created] (YARN-10393) MR job live lock caused by completed state container leak between node manager and RM heartbeat.
zhenzhao wang created YARN-10393: Summary: MR job live lock caused by completed state container leak between node manager and RM heartbeat. Key: YARN-10393 URL: https://issues.apache.org/jira/browse/YARN-10393 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, yarn Reporter: zhenzhao wang This was a bug we had seen multiple times on Hadoop 2.4.x. And the following analysis is based on the core dump, logs, and code in 2017 with Hadoop 2.4.x. We hadn't seen it after 2.6 in our env. However, it was because of the RPC retry policy change and other changes. There's still a possibility even with the current code if I didn't miss anything. *High-level description: * We had seen a starving mapper issue several times. The MR job stuck in a live lock state and couldn't make any progress. The queue is full so the pending mapper can’t get any resource to continue, and the application master failed to preempt the reducer, thus causing the job to be stuck. The reason why the application master didn’t preempt the reducer was that there was a leaked container in assigned mappers. The node manager failed to report the completed container to the resource manager. *Detailed steps: * # Container_1501226097332_249991_01_000199 was assigned to attempt_1501226097332_249991_m_95_0 on 2017-08-08 16:00:00,417. {code:java} appmaster.log:6464:2017-08-08 16:00:00,417 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1501226097332_249991_01_000199 to attempt_1501226097332_249991_m_95_0 {code} # The container finished on 2017-08-08 16:02:53,313. 
{code:java} yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1501226097332_249991_01_000199 transitioned from RUNNING to EXITED_WITH_SUCCESS yarn-mapred-nodemanager-.log.1:2017-08-08 16:02:53,313 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1501226097332_249991_01_000199 {code} # The NodeStatusUpdater go an exception in the heartbeat on 2017-08-08 16:07:04,238. In fact, the heartbeat request is actually handled by resource manager, however, the node manager failed to receive the response. Let’s assume the heartBeatResponseId=$hid in node manager. According to our current configuration, next heartbeat will be 10s later. {code:java} 2017-08-08 16:07:04,238 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught exception in status-updater java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: ; destination host is: XXX at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) at org.apache.hadoop.ipc.Client.call(Client.java:1472) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy33.nodeHeartbeat(Unknown Source) at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at 
com.sun.proxy.$Proxy34.nodeHeartbeat(Unknown Source) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:597) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:197) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384) at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at java.io.FilterInputStream.read(FilterInputStream.java:133) at java.io.FilterInputStream.read(FilterInputStream.java:133) at
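The leak described in the steps above hinges on the responseId handshake between the NodeManager's NodeStatusUpdater and the ResourceManager. The following toy model (all class and method names are hypothetical, not the actual Hadoop sources) sketches one way a lost heartbeat response can drop a completed-container report: the RM treats the NM's retried responseId as a duplicate heartbeat and replies with its cached response while ignoring the retry's payload, and the NM clears its pending report as soon as any acknowledgment finally arrives.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model of the NM -> RM heartbeat responseId handshake. This is a
 * sketch of the leak mechanism only, not the real NodeStatusUpdaterImpl
 * or ResourceTrackerService code.
 */
public class HeartbeatLeakDemo {

    /** Minimal RM side: de-duplicates heartbeats by responseId. */
    static class RmStub {
        int lastResponseId = 0;
        final List<String> knownCompleted = new ArrayList<>();

        /** Returns the id of the response sent back to the NM. */
        int nodeHeartbeat(int nmResponseId, List<String> completed) {
            if (nmResponseId == lastResponseId - 1) {
                // Duplicate heartbeat: resend the cached response id and
                // ignore the payload -- this is where the report is lost.
                return lastResponseId;
            }
            knownCompleted.addAll(completed);
            return ++lastResponseId;
        }
    }

    /** Minimal NM side: clears pending reports once a response arrives. */
    static class NmStub {
        int lastReceivedResponseId = 0;
        final List<String> pendingCompleted = new ArrayList<>();
    }

    /** Simulates the sequence from the bug report; true means "leaked". */
    public static boolean simulateLostResponse() {
        RmStub rm = new RmStub();
        NmStub nm = new NmStub();

        // Heartbeat 1: handled by the RM, but the response is lost
        // ("Connection reset by peer"), so the NM's id stays at 0.
        rm.nodeHeartbeat(nm.lastReceivedResponseId, new ArrayList<>());

        // A container completes before the next heartbeat.
        nm.pendingCompleted.add("container_1501226097332_249991_01_000199");

        // Heartbeat 2 (~10s later): same responseId, so the RM treats it
        // as a duplicate and never processes the completed container.
        int respId = rm.nodeHeartbeat(nm.lastReceivedResponseId, nm.pendingCompleted);

        // The NM sees a response and clears its pending report.
        nm.lastReceivedResponseId = respId;
        nm.pendingCompleted.clear();

        // Leaked: neither side will ever report this container again.
        return !rm.knownCompleted.contains("container_1501226097332_249991_01_000199");
    }

    public static void main(String[] args) {
        if (!simulateLostResponse()) {
            throw new AssertionError("expected the completed container to leak");
        }
        System.out.println("completed container leaked as described");
    }
}
```

The real protocol and the attached patches differ in detail; the point is only that de-duplicating retries by responseId while clearing completed-container state on any received response can silently drop a report.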
[jira] [Commented] (YARN-9616) Shared Cache Manager Failed To Upload Unpacked Resources
[ https://issues.apache.org/jira/browse/YARN-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900507#comment-16900507 ] zhenzhao wang commented on YARN-9616: - [~smarthan] Sorry, I missed the message. I have a patch that works well in our cluster internally; however, I haven't had a chance to sort it out and contribute it to the public repo. I uploaded [^YARN-9616.001-2.9.patch] for reference. Feel free to share your patch. Thanks. > Shared Cache Manager Failed To Upload Unpacked Resources > > > Key: YARN-9616 > URL: https://issues.apache.org/jira/browse/YARN-9616 > Project: Hadoop YARN > Issue Type: Bug > Affects Versions: 2.8.3, 2.9.2, 2.8.5 > Reporter: zhenzhao wang > Assignee: zhenzhao wang > Priority: Major > Attachments: YARN-9616.001-2.9.patch > > > Yarn will unpack archive files and some other files based on the file type and configuration. E.g., if I start an MR job with -archive one.zip, then one.zip will be unpacked during download. Let's say there are file1 and file2 inside one.zip. Then the files kept on local disk will be /disk3/yarn/local/filecache/352/one.zip/file1 and /disk3/yarn/local/filecache/352/one.zip/file2. So the shared cache uploader couldn't upload one.zip to the shared cache, as it was removed during localization. The following errors will be thrown.
> {code:java} > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader: > Exception while uploading the file dict.zip > java.io.FileNotFoundException: File > /disk3/yarn/local/filecache/352/one.zip/one.zip does not exist > at > org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:631) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:857) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:621) > at > org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:442) > at > org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:146) > at > org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:347) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:926) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.computeChecksum(SharedCacheUploader.java:257) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.call(SharedCacheUploader.java:128) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploader.call(SharedCacheUploader.java:55) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
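The FileNotFoundException above is mechanical: the localized path .../filecache/352/one.zip is a directory (the unpacked archive), the original archive file was deleted during localization, and the uploader then tries to open .../one.zip/one.zip. A minimal guard (a hypothetical helper, not the actual SharedCacheUploader code) would detect the unpacked-directory case before attempting the checksum:

```java
import java.io.File;

/**
 * Hypothetical guard for the unpacked-archive failure described above;
 * not part of the real SharedCacheUploader.
 */
public class UnpackedResourceCheck {

    /**
     * Returns the file the uploader could safely checksum, or null when
     * the localized resource is a directory (an unpacked archive whose
     * original file no longer exists) or is missing entirely.
     */
    public static File resolveUploadable(File localizedPath) {
        if (localizedPath.isFile()) {
            return localizedPath;   // plain file: checksum and upload as-is
        }
        // Directory (e.g. /disk3/yarn/local/filecache/352/one.zip holding
        // file1 and file2) or nonexistent path: nothing to upload.
        return null;
    }

    public static void main(String[] args) {
        // A directory that certainly exists (stand-in for the unpacked
        // one.zip directory) must be rejected.
        File dir = new File(System.getProperty("java.io.tmpdir"));
        if (resolveUploadable(dir) != null) {
            throw new AssertionError("directories must not be uploaded");
        }
        // A path that does not exist mirrors .../one.zip/one.zip.
        File missing = new File(dir, "missing-" + System.nanoTime() + "/one.zip");
        if (resolveUploadable(missing) != null) {
            throw new AssertionError("missing files must not be uploaded");
        }
        System.out.println("unpacked/missing resources correctly skipped");
    }
}
```

An actual fix would also have to decide what to do with the unpacked directory (skip it or upload its contents); the sketch only shows the detection step.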
[jira] [Updated] (YARN-9616) Shared Cache Manager Failed To Upload Unpacked Resources
[ https://issues.apache.org/jira/browse/YARN-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhenzhao wang updated YARN-9616: Attachment: YARN-9616.001-2.9.patch > Shared Cache Manager Failed To Upload Unpacked Resources > > > Key: YARN-9616 > URL: https://issues.apache.org/jira/browse/YARN-9616 > Project: Hadoop YARN > Issue Type: Bug > Affects Versions: 2.8.3, 2.9.2, 2.8.5 > Reporter: zhenzhao wang > Assignee: zhenzhao wang > Priority: Major > Attachments: YARN-9616.001-2.9.patch > > > Yarn will unpack archive files and some other files based on the file type and configuration. E.g., if I start an MR job with -archive one.zip, then one.zip will be unpacked during download. Let's say there are file1 and file2 inside one.zip. Then the files kept on local disk will be /disk3/yarn/local/filecache/352/one.zip/file1 and /disk3/yarn/local/filecache/352/one.zip/file2. So the shared cache uploader couldn't upload one.zip to the shared cache, as it was removed during localization. The following errors will be thrown.
[jira] [Updated] (YARN-5727) Improve YARN shared cache support for LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-5727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhenzhao wang updated YARN-5727: Attachment: YARN-5727-Design-v2.pdf > Improve YARN shared cache support for LinuxContainerExecutor > > > Key: YARN-5727 > URL: https://issues.apache.org/jira/browse/YARN-5727 > Project: Hadoop YARN > Issue Type: Sub-task > Reporter: Chris Trezzo > Assignee: zhenzhao wang > Priority: Major > Attachments: YARN-5727-Design-v1.pdf, YARN-5727-Design-v2.pdf, YARN-5727.001.patch > > > When running LinuxContainerExecutor in a secure mode ({{yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users}} set to {{false}}), all localized files are owned by the user that owns the container which localized the resource. This presents a problem for the shared cache when a YARN application requests a resource with non-public visibility to be uploaded to the shared cache. The shared cache uploader (running as the node manager user) does not have access to the localized files and cannot compute the checksum of the file or upload it to the cache. > The solution should ideally satisfy the following three requirements: > # Localized files should still be safe/secure. Other users that run containers should not be able to modify or delete the publicly localized files of others. > # The node manager user should be able to access these files for the purpose of checksumming and uploading to the shared cache without being a privileged user. > # The solution should avoid making unnecessary copies of the localized files.
[jira] [Commented] (YARN-9616) Shared Cache Manager Failed To Upload Unpacked Resources
[ https://issues.apache.org/jira/browse/YARN-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860417#comment-16860417 ] zhenzhao wang commented on YARN-9616: - I had seen this issue in 2.9 and 2.6. More checking is needed to determine whether the problem exists in the latest version. > Shared Cache Manager Failed To Upload Unpacked Resources > > > Key: YARN-9616 > URL: https://issues.apache.org/jira/browse/YARN-9616 > Project: Hadoop YARN > Issue Type: Bug > Affects Versions: 2.8.3, 2.9.2, 2.8.5 > Reporter: zhenzhao wang > Assignee: zhenzhao wang > Priority: Major > > Yarn will unpack archive files and some other files based on the file type and configuration. E.g., if I start an MR job with -archive one.zip, then one.zip will be unpacked during download. Let's say there are file1 and file2 inside one.zip. Then the files kept on local disk will be /disk3/yarn/local/filecache/352/one.zip/file1 and /disk3/yarn/local/filecache/352/one.zip/file2. So the shared cache uploader couldn't upload one.zip to the shared cache, as it was removed during localization. The following errors will be thrown.
[jira] [Updated] (YARN-9616) Shared Cache Manager Failed To Upload Unpacked Resources
[ https://issues.apache.org/jira/browse/YARN-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhenzhao wang updated YARN-9616: Affects Version/s: 2.8.3 2.9.2 > Shared Cache Manager Failed To Upload Unpacked Resources > > > Key: YARN-9616 > URL: https://issues.apache.org/jira/browse/YARN-9616 > Project: Hadoop YARN > Issue Type: Bug > Affects Versions: 2.8.3, 2.9.2, 2.8.5 > Reporter: zhenzhao wang > Assignee: zhenzhao wang > Priority: Major > > Yarn will unpack archive files and some other files based on the file type and configuration. E.g., if I start an MR job with -archive one.zip, then one.zip will be unpacked during download. Let's say there are file1 and file2 inside one.zip. Then the files kept on local disk will be /disk3/yarn/local/filecache/352/one.zip/file1 and /disk3/yarn/local/filecache/352/one.zip/file2. So the shared cache uploader couldn't upload one.zip to the shared cache, as it was removed during localization. The following errors will be thrown.
[jira] [Updated] (YARN-9616) Shared Cache Manager Failed To Upload Unpacked Resources
[ https://issues.apache.org/jira/browse/YARN-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhenzhao wang updated YARN-9616: Affects Version/s: 2.8.5 > Shared Cache Manager Failed To Upload Unpacked Resources > > > Key: YARN-9616 > URL: https://issues.apache.org/jira/browse/YARN-9616 > Project: Hadoop YARN > Issue Type: Bug > Affects Versions: 2.8.5 > Reporter: zhenzhao wang > Assignee: zhenzhao wang > Priority: Major > > Yarn will unpack archive files and some other files based on the file type and configuration. E.g., if I start an MR job with -archive one.zip, then one.zip will be unpacked during download. Let's say there are file1 and file2 inside one.zip. Then the files kept on local disk will be /disk3/yarn/local/filecache/352/one.zip/file1 and /disk3/yarn/local/filecache/352/one.zip/file2. So the shared cache uploader couldn't upload one.zip to the shared cache, as it was removed during localization. The following errors will be thrown.
[jira] [Created] (YARN-9616) Shared Cache Manager Failed To Upload Unpacked Resources
zhenzhao wang created YARN-9616: --- Summary: Shared Cache Manager Failed To Upload Unpacked Resources Key: YARN-9616 URL: https://issues.apache.org/jira/browse/YARN-9616 Project: Hadoop YARN Issue Type: Bug Reporter: zhenzhao wang Assignee: zhenzhao wang Yarn will unpack archive files and some other files based on the file type and configuration. E.g., if I start an MR job with -archive one.zip, then one.zip will be unpacked during download. Let's say there are file1 and file2 inside one.zip. Then the files kept on local disk will be /disk3/yarn/local/filecache/352/one.zip/file1 and /disk3/yarn/local/filecache/352/one.zip/file2. So the shared cache uploader couldn't upload one.zip to the shared cache, as it was removed during localization. The following errors will be thrown.
[jira] [Assigned] (YARN-2774) shared cache service should authorize calls properly
[ https://issues.apache.org/jira/browse/YARN-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhenzhao wang reassigned YARN-2774: --- Assignee: zhenzhao wang > shared cache service should authorize calls properly > > > Key: YARN-2774 > URL: https://issues.apache.org/jira/browse/YARN-2774 > Project: Hadoop YARN > Issue Type: Sub-task > Reporter: Sangjin Lee > Assignee: zhenzhao wang > Priority: Major > > The shared cache manager (SCM) services should authorize calls properly. > Currently, the uploader service (done in YARN-2186) does not authorize calls to notify the SCM of newly uploaded resources. Proper security/authorization needs to be done in this RPC call. Also, the use/release calls (done in YARN-2188) and the scmAdmin commands (done in YARN-2189) are not properly authorized, nor is the SCM UI done in YARN-2203.
[jira] [Assigned] (YARN-6097) Add support for directories in the Shared Cache
[ https://issues.apache.org/jira/browse/YARN-6097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhenzhao wang reassigned YARN-6097: --- Assignee: zhenzhao wang > Add support for directories in the Shared Cache > --- > > Key: YARN-6097 > URL: https://issues.apache.org/jira/browse/YARN-6097 > Project: Hadoop YARN > Issue Type: Sub-task > Reporter: Chris Trezzo > Assignee: zhenzhao wang > Priority: Major > > Add support for directories in the shared cache. > If a LocalResource URL points to a directory, the directory structure is preserved during localization on the node manager. Currently, the shared cache does not support directories and will fail to upload the URL to the cache if shouldBeUploadedToSharedCache is set to true.
[jira] [Assigned] (YARN-6910) Increase RM audit log coverage
[ https://issues.apache.org/jira/browse/YARN-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhenzhao wang reassigned YARN-6910: --- Assignee: zhenzhao wang > Increase RM audit log coverage > -- > > Key: YARN-6910 > URL: https://issues.apache.org/jira/browse/YARN-6910 > Project: Hadoop YARN > Issue Type: Improvement > Reporter: Ming Ma > Assignee: zhenzhao wang > > RM's audit logger logs certain API calls. It would be useful to increase its coverage to include methods like {{getApplications}}. In addition, the audit logger should track calls from REST APIs as well as RPC calls.