[jira] [Commented] (YARN-7007) NPE in RM while using YarnClient.getApplications()
[ https://issues.apache.org/jira/browse/YARN-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012529#comment-17012529 ] Yang Wang commented on YARN-7007:
-
[~cheersyang] [~Tao Yang] We came across the same problem in FLINK-15534, and I think many users are running Flink with a bundled hadoop-2.8.x. It would be very helpful if this fix could be backported to 2.8 and released in 2.8.6. Could you help with this?

> NPE in RM while using YarnClient.getApplications()
> --
>
> Key: YARN-7007
> URL: https://issues.apache.org/jira/browse/YARN-7007
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.2
> Reporter: Lingfeng Su
> Assignee: Lingfeng Su
> Priority: Major
> Labels: patch
> Fix For: 2.9.0, 3.0.0-beta1
>
> Attachments: YARN-7007.001.patch
>
> {code:java}
> java.lang.NullPointerException: java.lang.NullPointerException
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.getAggregateAppResourceUsage(RMAppAttemptMetrics.java:118)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.getApplicationResourceUsageReport(RMAppAttemptImpl.java:857)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.createAndGetApplicationReport(RMAppImpl.java:629)
> at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.verifyAndCreateAppReport(ClientRMService.java:972)
> at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplications(ClientRMService.java:898)
> at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplications(ClientRMService.java:734)
> at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplications(ApplicationClientProtocolPBServiceImpl.java:239)
> at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:441)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2198)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1738)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2196)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85)
> at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122)
> at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplications(ApplicationClientProtocolPBClientImpl.java:254)
> at sun.reflect.GeneratedMethodAccessor731.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy161.getApplications(Unknown Source)
> at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:479)
> at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:456)
> {code}
>
> When I use YarnClient.getApplications() to get all applications from the RM, it occasionally throws an NPE.
> {code:java}
> RMAppAttempt currentAttempt = rmContext.getRMApps()
>     .get(attemptId.getApplicationId()).getCurrentAppAttempt();
> {code}
> If the application id is no longer present in the ConcurrentMap returned by getRMApps(), the get() call returns null and the chained getCurrentAppAttempt() call throws an NPE.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
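The failing lookup quoted above chains .get() on a map that may no longer contain the application. A minimal, self-contained sketch of the usual null guard, with plain Java collections standing in for rmContext.getRMApps(); the class and method names here are hypothetical, not the actual YARN patch:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Toy model of the RM-side lookup: the chained call
// rmApps.get(appId).getCurrentAppAttempt() NPEs when the app was already
// removed from the map, so the map result must be null-checked first.
public class SafeAppLookup {
    static class AppAttempt {}
    static class App {
        AppAttempt getCurrentAppAttempt() { return new AppAttempt(); }
    }

    // Returns null instead of throwing when the application is unknown.
    static AppAttempt currentAttemptOrNull(ConcurrentMap<String, App> rmApps,
                                           String appId) {
        App app = rmApps.get(appId);  // may be null after app removal
        return app == null ? null : app.getCurrentAppAttempt();
    }

    public static void main(String[] args) {
        ConcurrentMap<String, App> rmApps = new ConcurrentHashMap<>();
        rmApps.put("application_1", new App());
        if (currentAttemptOrNull(rmApps, "application_1") == null)
            throw new AssertionError("known app must resolve");
        if (currentAttemptOrNull(rmApps, "application_2") != null)
            throw new AssertionError("unknown app must yield null, not NPE");
        System.out.println("ok");
    }
}
```

The caller can then skip the attempt-specific fields of the report when the result is null, instead of failing the whole getApplications() RPC.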
[jira] [Commented] (YARN-7007) NPE in RM while using YarnClient.getApplications()
[ https://issues.apache.org/jira/browse/YARN-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012913#comment-17012913 ] Yang Wang commented on YARN-7007:
-
[~Tao Yang] Cool, many thanks.

> NPE in RM while using YarnClient.getApplications()
> --
>
> Key: YARN-7007
> URL: https://issues.apache.org/jira/browse/YARN-7007
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.2
> Reporter: Lingfeng Su
> Assignee: Lingfeng Su
> Priority: Major
> Labels: patch
> Fix For: 2.9.0, 3.0.0-beta1, 2.8.6
>
> Attachments: YARN-7007.001.patch
[jira] [Created] (YARN-8153) Guaranteed containers always stay in SCHEDULED on NM after restart
Yang Wang created YARN-8153:
---
Summary: Guaranteed containers always stay in SCHEDULED on NM after restart
Key: YARN-8153
URL: https://issues.apache.org/jira/browse/YARN-8153
Project: Hadoop YARN
Issue Type: Bug
Reporter: Yang Wang

When NM recovery is enabled, some containers stay in SCHEDULED forever after an NM restart because the NM believes there are insufficient resources. The root cause is that utilizationTracker.addContainerResources is called twice during restart, so recovered containers are counted against the node's resources twice.
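The double accounting described above can be modeled with a toy tracker, assuming the fix is to make resource addition idempotent per container id; the class and field names are illustrative, not the actual utilizationTracker code:

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the NM's utilization tracker: if the recovery path re-adds
// resources for a container that was already accounted, the available
// headroom is double-counted away and new containers stay SCHEDULED.
// Guarding with the set of already-added container ids keeps the
// accounting idempotent across restart.
public class IdempotentTracker {
    private final Set<String> added = new HashSet<>();
    private long usedMemMB = 0;

    // Returns true only when the container was actually accounted.
    boolean addContainerResources(String containerId, long memMB) {
        if (!added.add(containerId)) {
            return false;  // already counted, e.g. by recovery
        }
        usedMemMB += memMB;
        return true;
    }

    long getUsedMemMB() { return usedMemMB; }

    public static void main(String[] args) {
        IdempotentTracker t = new IdempotentTracker();
        t.addContainerResources("container_01", 1024);  // recovery path
        t.addContainerResources("container_01", 1024);  // duplicate: ignored
        if (t.getUsedMemMB() != 1024) throw new AssertionError();
        System.out.println("used=" + t.getUsedMemMB() + "MB");
    }
}
```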
[jira] [Assigned] (YARN-8153) Guaranteed containers always stay in SCHEDULED on NM after restart
[ https://issues.apache.org/jira/browse/YARN-8153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang reassigned YARN-8153:
---
Assignee: Yang Wang
[jira] [Updated] (YARN-8153) Guaranteed containers always stay in SCHEDULED on NM after restart
[ https://issues.apache.org/jira/browse/YARN-8153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-8153:
Attachment: YARN-8153.001.patch
[jira] [Updated] (YARN-8153) Guaranteed containers always stay in SCHEDULED on NM after restart
[ https://issues.apache.org/jira/browse/YARN-8153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-8153:
Attachment: YARN-8153.002.patch
[jira] [Commented] (YARN-8153) Guaranteed containers always stay in SCHEDULED on NM after restart
[ https://issues.apache.org/jira/browse/YARN-8153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16436727#comment-16436727 ] Yang Wang commented on YARN-8153:
-
[~cheersyang] Thanks for your comment. I have fixed the UT failure.
[jira] [Commented] (YARN-8153) Guaranteed containers always stay in SCHEDULED on NM after restart
[ https://issues.apache.org/jira/browse/YARN-8153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437186#comment-16437186 ] Yang Wang commented on YARN-8153:
-
[~cheersyang] Thanks for your commit.

> Guaranteed containers always stay in SCHEDULED on NM after restart
> --
>
> Key: YARN-8153
> URL: https://issues.apache.org/jira/browse/YARN-8153
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Yang Wang
> Assignee: Yang Wang
> Priority: Major
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8153.001.patch, YARN-8153.002.patch
[jira] [Updated] (YARN-6630) Container worker dir could not recover when NM restart
[ https://issues.apache.org/jira/browse/YARN-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6630:
Attachment: YARN-6630.003.patch

> Container worker dir could not recover when NM restart
> --
>
> Key: YARN-6630
> URL: https://issues.apache.org/jira/browse/YARN-6630
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Yang Wang
> Assignee: Yang Wang
> Priority: Major
> Attachments: YARN-6630.001.patch, YARN-6630.002.patch, YARN-6630.003.patch
>
> When ContainerRetryPolicy is NEVER_RETRY, the container work dir is not saved in the NM state store:
> {code:title=ContainerLaunch.java}
> ...
> private void recordContainerWorkDir(ContainerId containerId,
>     String workDir) throws IOException {
>   container.setWorkDir(workDir);
>   if (container.isRetryContextSet()) {
>     context.getNMStateStore().storeContainerWorkDir(containerId, workDir);
>   }
> }
> {code}
> After the NM restarts, container.workDir cannot be recovered and stays null, which can cause exceptions. We already hit one such problem: after an NM restart, sending a resource localization request while the container is running (YARN-1503) makes the NM fail with the exception below.
> So container.workDir always needs to be saved in the NM state store.
> {code:title=ContainerImpl.java}
> static class ResourceLocalizedWhileRunningTransition
>     extends ContainerTransition {
> ...
>   String linkFile = new Path(container.workDir, link).toString();
> ...
> {code}
> {code}
> java.lang.IllegalArgumentException: Can not create a Path from a null string
> at org.apache.hadoop.fs.Path.checkPathArg(Path.java:159)
> at org.apache.hadoop.fs.Path.<init>(Path.java:175)
> at org.apache.hadoop.fs.Path.<init>(Path.java:110)
> ... ...
> {code}
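A minimal sketch of the behavior the issue argues for, persisting the work dir unconditionally instead of only when a retry context is set; StateStore and the method names below are stand-ins for the real NMStateStoreService API:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the proposed fix: always persist the container work dir in
// the NM state store so it survives an NM restart, regardless of the
// container's retry policy. Without this, container.workDir recovers as
// null and new Path(container.workDir, link) throws.
public class WorkDirRecovery {
    static class StateStore {
        final Map<String, String> workDirs = new HashMap<>();
        void storeContainerWorkDir(String id, String dir) { workDirs.put(id, dir); }
        String recoverWorkDir(String id) { return workDirs.get(id); }
    }

    // Before the fix this store call was guarded by isRetryContextSet();
    // here it is unconditional.
    static void recordContainerWorkDir(StateStore store, String containerId,
                                       String workDir) {
        store.storeContainerWorkDir(containerId, workDir);
    }

    public static void main(String[] args) {
        StateStore store = new StateStore();
        recordContainerWorkDir(store, "container_01", "/tmp/nm-local/container_01");
        // Simulated NM restart: the work dir must be recoverable, never null.
        if (store.recoverWorkDir("container_01") == null)
            throw new AssertionError("work dir lost across restart");
        System.out.println(store.recoverWorkDir("container_01"));
    }
}
```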
[jira] [Commented] (YARN-6589) Recover all resources when NM restart
[ https://issues.apache.org/jira/browse/YARN-6589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462233#comment-16462233 ] Yang Wang commented on YARN-6589:
-
ContainerImpl#getResource() has been changed to read the resource from the containerTokenIdentifier, and the containerTokenIdentifier is recovered correctly. Closing this JIRA as Won't Fix.

> Recover all resources when NM restart
> -
>
> Key: YARN-6589
> URL: https://issues.apache.org/jira/browse/YARN-6589
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Yang Wang
> Assignee: Yang Wang
> Priority: Blocker
> Attachments: YARN-6589-YARN-3926.001.patch, YARN-6589.001.patch, YARN-6589.002.patch
>
> When the NM restarts, containers are recovered. However, only the memory and vcores in the capability are recovered; all resource types need to be recovered.
> {code:title=ContainerImpl.java}
> // resource capability had been updated before NM was down
> this.resource =
>     Resource.newInstance(recoveredCapability.getMemorySize(),
>         recoveredCapability.getVirtualCores());
> {code}
> It should be like this.
> {code:title=ContainerImpl.java}
> // resource capability had been updated before NM was down
> // need to recover all resource types, not only memory and vcores
> this.resource = Resources.clone(recoveredCapability);
> {code}
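To see why the clone matters, here is a self-contained model in which a resource is just a map from resource-type name to value: rebuilding from only memory and vcores silently drops extended resource types, while a clone keeps them. The map-based classes are illustrative stand-ins, not the real org.apache.hadoop.yarn Resource/Resources API:

```java
import java.util.HashMap;
import java.util.Map;

// Models the bug quoted above: newInstance(mem, vcores) reconstructs a
// resource from two fields only, losing any extended resource types
// (e.g. a GPU count) that the recovered capability carried; a full clone
// preserves every type.
public class RecoverAllResources {
    static Map<String, Long> newInstance(long memMB, long vcores) {
        Map<String, Long> r = new HashMap<>();
        r.put("memory-mb", memMB);
        r.put("vcores", vcores);
        return r;                          // extended resources lost
    }

    static Map<String, Long> cloneOf(Map<String, Long> capability) {
        return new HashMap<>(capability);  // all resource types preserved
    }

    public static void main(String[] args) {
        Map<String, Long> recovered = new HashMap<>();
        recovered.put("memory-mb", 2048L);
        recovered.put("vcores", 2L);
        recovered.put("yarn.io/gpu", 1L);  // extended resource type

        if (newInstance(2048, 2).containsKey("yarn.io/gpu"))
            throw new AssertionError("newInstance should have dropped it");
        if (!cloneOf(recovered).containsKey("yarn.io/gpu"))
            throw new AssertionError("clone must keep all resource types");
        System.out.println("clone keeps " + cloneOf(recovered).size() + " resource types");
    }
}
```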
[jira] [Updated] (YARN-6589) Recover all resources when NM restart
[ https://issues.apache.org/jira/browse/YARN-6589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6589:
Release Note: (was: ContainerImpl#getResource() has been changed to get from containerTokenIdentifier and containerTokenIdentifier could be recovered correctly. Just close this jira as Won't Fix)
[jira] [Updated] (YARN-6578) Return container resource utilization from NM ContainerStatus call
[ https://issues.apache.org/jira/browse/YARN-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6578:
Attachment: YARN-6578.002.patch

> Return container resource utilization from NM ContainerStatus call
> --
>
> Key: YARN-6578
> URL: https://issues.apache.org/jira/browse/YARN-6578
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Yang Wang
> Assignee: Yang Wang
> Priority: Major
> Attachments: YARN-6578.001.patch, YARN-6578.002.patch
>
> When the ApplicationMaster wants to change (increase/decrease) the resources of an allocated container, resource utilization is an important input to that decision. So when the AM calls NMClient.getContainerStatus, resource utilization should be returned. Container resource utilization also needs to be reported to the RM to enable better scheduling.
[jira] [Updated] (YARN-6578) Return container resource utilization from NM ContainerStatus call
[ https://issues.apache.org/jira/browse/YARN-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6578:
Description:
When the ApplicationMaster wants to change (increase/decrease) the resources of an allocated container, resource utilization is an important input to that decision. So when the AM calls NMClient.getContainerStatus, resource utilization should be returned. Container resource utilization also needs to be reported to the RM to enable better scheduling. So put resource utilization in ContainerStatus.

was:
When the ApplicationMaster wants to change (increase/decrease) the resources of an allocated container, resource utilization is an important input to that decision. So when the AM calls NMClient.getContainerStatus, resource utilization should be returned. Container resource utilization also needs to be reported to the RM to enable better scheduling.
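A sketch of how an AM might consume utilization if it were carried on ContainerStatus as proposed; all types below are simplified stand-ins for ContainerStatus/ResourceUtilization, and the 50% threshold is an arbitrary illustration, not anything from the patch:

```java
// Toy AM-side decision logic: shrink a container whose measured physical
// memory usage is well below its allocation. The point is only that a
// utilization field on the status makes this decision possible.
public class UtilizationDrivenResize {
    static class ResourceUtilization {
        final int pmemMB;
        final float cpuVcores;
        ResourceUtilization(int pmemMB, float cpuVcores) {
            this.pmemMB = pmemMB;
            this.cpuVcores = cpuVcores;
        }
    }

    static class ContainerStatus {
        final int allocatedMemMB;
        final ResourceUtilization utilization;
        ContainerStatus(int allocatedMemMB, ResourceUtilization u) {
            this.allocatedMemMB = allocatedMemMB;
            this.utilization = u;
        }
    }

    // Decrease when sustained usage is under half the allocation
    // (illustrative threshold).
    static boolean shouldDecrease(ContainerStatus s) {
        return s.utilization.pmemMB < s.allocatedMemMB * 0.5;
    }

    public static void main(String[] args) {
        ContainerStatus idle = new ContainerStatus(4096, new ResourceUtilization(1024, 0.2f));
        ContainerStatus busy = new ContainerStatus(4096, new ResourceUtilization(3800, 3.5f));
        if (!shouldDecrease(idle) || shouldDecrease(busy)) throw new AssertionError();
        System.out.println("decrease idle container: " + shouldDecrease(idle));
    }
}
```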
[jira] [Commented] (YARN-6578) Return container resource utilization from NM ContainerStatus call
[ https://issues.apache.org/jira/browse/YARN-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465620#comment-16465620 ] Yang Wang commented on YARN-6578:
-
[~cheersyang], thanks for your comment. I have fixed the findbugs issues. The failed UT seems to be another issue, [YARN-8244|https://issues.apache.org/jira/browse/YARN-8244]. The checkstyle issues do not need to be fixed: like the other metric variables, vMemMBsStat and vMemMBQuantiles can be public in ContainerMetrics.java.
[jira] [Updated] (YARN-6578) Return container resource utilization from NM ContainerStatus call
[ https://issues.apache.org/jira/browse/YARN-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6578:
Attachment: YARN-6578.003.patch
[jira] [Commented] (YARN-6578) Return container resource utilization from NM ContainerStatus call
[ https://issues.apache.org/jira/browse/YARN-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482142#comment-16482142 ] Yang Wang commented on YARN-6578:
-
[~Naganarasimha] Thanks for your comment. Currently we just return pmem/vmem/vcores in ContainerStatus#getUtilization. As you mentioned, do we need to make ResourceUtilization extensible like Resource? Getting the utilization of extensible resources (gpu/fpga) is not as easy as pmem/vmem/vcores, and in most use cases (scheduling opportunistic containers or increasing/decreasing container resources) the utilization of pmem/vmem/vcores is enough.
[jira] [Created] (YARN-8331) Race condition in NM container launched after done
Yang Wang created YARN-8331:
---
Summary: Race condition in NM container launched after done
Key: YARN-8331
URL: https://issues.apache.org/jira/browse/YARN-8331
Project: Hadoop YARN
Issue Type: Bug
Reporter: Yang Wang

While a container is launching in ContainerLaunch#launchContainer (state SCHEDULED), a kill event is sent to the container, driving it SCHEDULED->KILLING->DONE. ContainerLaunch then still sends the CONTAINER_LAUNCHED event and starts the container processes. These orphaned container processes are never cleaned up.

{code:java}
2018-05-21 13:11:56,114 INFO [Thread-11] nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(94)) - USER=nobody OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_0_ CONTAINERID=container_0__01_00
2018-05-21 13:11:56,114 INFO [NM ContainerManager dispatcher] application.ApplicationImpl (ApplicationImpl.java:handle(632)) - Application application_0_ transitioned from NEW to INITING
2018-05-21 13:11:56,114 INFO [NM ContainerManager dispatcher] application.ApplicationImpl (ApplicationImpl.java:transition(446)) - Adding container_0__01_00 to application application_0_
2018-05-21 13:11:56,118 INFO [NM ContainerManager dispatcher] application.ApplicationImpl (ApplicationImpl.java:handle(632)) - Application application_0_ transitioned from INITING to RUNNING
2018-05-21 13:11:56,119 INFO [NM ContainerManager dispatcher] container.ContainerImpl (ContainerImpl.java:handle(2111)) - Container container_0__01_00 transitioned from NEW to SCHEDULED
2018-05-21 13:11:56,119 INFO [NM ContainerManager dispatcher] containermanager.AuxServices (AuxServices.java:handle(220)) - Got event CONTAINER_INIT for appId application_0_
2018-05-21 13:11:56,119 INFO [NM ContainerManager dispatcher] scheduler.ContainerScheduler (ContainerScheduler.java:startContainer(504)) - Starting container [container_0__01_00]
2018-05-21 13:11:56,226 INFO [NM ContainerManager dispatcher] container.ContainerImpl (ContainerImpl.java:handle(2111)) - Container container_0__01_00 transitioned from SCHEDULED to KILLING
2018-05-21 13:11:56,227 INFO [NM ContainerManager dispatcher] containermanager.TestContainerManager (BaseContainerManagerTest.java:delete(287)) - Psuedo delete: user - nobody, type - FILE
2018-05-21 13:11:56,227 INFO [NM ContainerManager dispatcher] nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(94)) - USER=nobody OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_0_ CONTAINERID=container_0__01_00
2018-05-21 13:11:56,238 INFO [NM ContainerManager dispatcher] container.ContainerImpl (ContainerImpl.java:handle(2111)) - Container container_0__01_00 transitioned from KILLING to DONE
2018-05-21 13:11:56,238 INFO [NM ContainerManager dispatcher] application.ApplicationImpl (ApplicationImpl.java:transition(489)) - Removing container_0__01_00 from application application_0_
2018-05-21 13:11:56,239 INFO [NM ContainerManager dispatcher] monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:onStopMonitoringContainer(932)) - Stopping resource-monitoring for container_0__01_00
2018-05-21 13:11:56,239 INFO [NM ContainerManager dispatcher] containermanager.AuxServices (AuxServices.java:handle(220)) - Got event CONTAINER_STOP for appId application_0_
2018-05-21 13:11:56,274 WARN [NM ContainerManager dispatcher] container.ContainerImpl (ContainerImpl.java:handle(2106)) - Can't handle this event at current state: Current: [DONE], eventType: [CONTAINER_LAUNCHED], container: [container_0__01_00]
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CONTAINER_LAUNCHED at DONE
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:2104)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:104)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1525)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1518)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at org.apache.hadoop.yarn.event.AsyncDispatch
{code}
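The race above can be modeled with a tiny state machine in which the launch path checks the container's state before actually starting the process; this is an illustrative sketch of one way to avoid the orphaned process, not the actual ContainerImpl/ContainerLaunch fix:

```java
// Toy model of the race: a kill can drive the container
// SCHEDULED -> KILLING -> DONE before ContainerLaunch fires
// CONTAINER_LAUNCHED. Making the launch path re-check state under the
// same lock the kill path uses means a container killed mid-launch is
// never exec'd, so no process is leaked.
public class LaunchAfterDoneGuard {
    enum State { SCHEDULED, KILLING, DONE, RUNNING }

    private State state = State.SCHEDULED;

    synchronized void kill() {
        if (state == State.SCHEDULED) state = State.KILLING;
        if (state == State.KILLING) state = State.DONE;
    }

    // Returns false (skip starting the process) if the container was
    // already killed; otherwise transitions to RUNNING.
    synchronized boolean tryLaunch() {
        if (state != State.SCHEDULED) return false;  // killed meanwhile
        state = State.RUNNING;
        return true;
    }

    public static void main(String[] args) {
        LaunchAfterDoneGuard c = new LaunchAfterDoneGuard();
        c.kill();  // the kill wins the race
        if (c.tryLaunch()) throw new AssertionError("must not launch after DONE");
        System.out.println("launch skipped, no orphan process");
    }
}
```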
[jira] [Updated] (YARN-6589) Recover all resources when NM restart
[ https://issues.apache.org/jira/browse/YARN-6589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6589: Attachment: YARN-6589.002.patch The constructor in ContainerImpl has changed, so we no longer need to recover the resource here: the resource is now obtained from the containerTokenIdentifier, which is recovered properly. So I updated the patch to just add a test for this case. > Recover all resources when NM restart > - > > Key: YARN-6589 > URL: https://issues.apache.org/jira/browse/YARN-6589 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Yang Wang >Assignee: Yang Wang >Priority: Blocker > Attachments: YARN-6589-YARN-3926.001.patch, YARN-6589.001.patch, > YARN-6589.002.patch > > > When the NM restarts, containers are recovered. However, only memory and > vcores in the capability are recovered; all resource types need to be > recovered. > {code:title=ContainerImpl.java} > // resource capability had been updated before NM was down > this.resource = > Resource.newInstance(recoveredCapability.getMemorySize(), > recoveredCapability.getVirtualCores()); > {code} > It should be like this. > {code:title=ContainerImpl.java} > // resource capability had been updated before NM was down > // need to recover all resources, not only > this.resource = Resources.clone(recoveredCapability); > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
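The difference between the two recovery snippets above can be illustrated with a simplified stand-in for YARN's Resource. The class and method names below are illustrative, not the real Hadoop API: rebuilding from memory and vcores alone drops any extended resource types (e.g. GPUs from the YARN-3926 work), while a full clone preserves them.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for YARN's Resource (illustrative, not the Hadoop API).
// All resource values live in one map so the loss is easy to see.
public class ResourceRecoverySketch {
    final Map<String, Long> values = new HashMap<>();

    // Mirrors Resource.newInstance(memory, vcores): only two types survive.
    static ResourceRecoverySketch newInstance(long memMB, long vcores) {
        ResourceRecoverySketch r = new ResourceRecoverySketch();
        r.values.put("memory-mb", memMB);
        r.values.put("vcores", vcores);
        return r;
    }

    // Mirrors Resources.clone(...): every resource type is preserved.
    ResourceRecoverySketch cloneResource() {
        ResourceRecoverySketch r = new ResourceRecoverySketch();
        r.values.putAll(values);
        return r;
    }
}
```

With a recovered capability carrying an extended type such as a GPU count, the newInstance-style rebuild silently discards it, which is exactly why the patch switches to a clone.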
[jira] [Created] (YARN-7647) NM print inappropriate error log when node-labels is enabled
Yang Wang created YARN-7647: --- Summary: NM print inappropriate error log when node-labels is enabled Key: YARN-7647 URL: https://issues.apache.org/jira/browse/YARN-7647 Project: Hadoop YARN Issue Type: Bug Reporter: Yang Wang {code:title=NodeStatusUpdaterImpl.java} ... ... if (response.getAreNodeLabelsAcceptedByRM() && LOG.isDebugEnabled()) { LOG.debug("Node Labels {" + StringUtils.join(",", previousNodeLabels) + "} were Accepted by RM "); } else { // case where updated labels from NodeLabelsProvider is sent to RM and // RM rejected the labels LOG.error( "NM node labels {" + StringUtils.join(",", previousNodeLabels) + "} were not accepted by RM and message from RM : " + response.getDiagnosticsMessage()); } ... ... {code} When LOG.isDebugEnabled() is false, the NM always takes the else branch and prints the error log, even when the node labels were accepted by the RM. It is an obvious error and quite misleading. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
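The faulty condition reduces to a two-boolean truth table. The sketch below is a simplified stand-in, not the actual NodeStatusUpdaterImpl code or patch: it models only whether the error branch is taken, and shows how moving the LOG.isDebugEnabled() check inside the "accepted" branch fixes the control flow.

```java
// Simplified stand-in for the NodeStatusUpdaterImpl branch; models only
// whether the error path is taken. Not the actual Hadoop code or patch.
public class LogGuardSketch {

    // Buggy shape: "accepted && debugEnabled" sends accepted-but-not-debug
    // runs into the error branch, so the error fires on every heartbeat
    // whenever debug logging is off.
    static boolean logsErrorBuggy(boolean accepted, boolean debugEnabled) {
        if (accepted && debugEnabled) {
            return false; // LOG.debug("... were Accepted by RM")
        } else {
            return true;  // LOG.error("... were not accepted by RM ...")
        }
    }

    // Fixed shape: branch on acceptance first; the debug-level check guards
    // only the debug message itself.
    static boolean logsErrorFixed(boolean accepted, boolean debugEnabled) {
        if (accepted) {
            if (debugEnabled) {
                // LOG.debug("... were Accepted by RM")
            }
            return false;
        }
        return true;      // LOG.error("... were not accepted by RM ...")
    }
}
```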
[jira] [Assigned] (YARN-7647) NM print inappropriate error log when node-labels is enabled
[ https://issues.apache.org/jira/browse/YARN-7647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang reassigned YARN-7647: --- Assignee: Yang Wang > NM print inappropriate error log when node-labels is enabled > > > Key: YARN-7647 > URL: https://issues.apache.org/jira/browse/YARN-7647 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Yang Wang > Attachments: YARN-7647.001.patch > > > {code:title=NodeStatusUpdaterImpl.java} > ... ... > if (response.getAreNodeLabelsAcceptedByRM() && LOG.isDebugEnabled()) { > LOG.debug("Node Labels {" + StringUtils.join(",", > previousNodeLabels) > + "} were Accepted by RM "); > } else { > // case where updated labels from NodeLabelsProvider is sent to RM > and > // RM rejected the labels > LOG.error( > "NM node labels {" + StringUtils.join(",", previousNodeLabels) > + "} were not accepted by RM and message from RM : " > + response.getDiagnosticsMessage()); > } > ... ... > {code} > When LOG.isDebugEnabled() is false, NM will always print error log. It is an > obvious error and is so misleading. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7647) NM print inappropriate error log when node-labels is enabled
[ https://issues.apache.org/jira/browse/YARN-7647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-7647: Attachment: YARN-7647.001.patch > NM print inappropriate error log when node-labels is enabled > > > Key: YARN-7647 > URL: https://issues.apache.org/jira/browse/YARN-7647 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Yang Wang > Attachments: YARN-7647.001.patch > > > {code:title=NodeStatusUpdaterImpl.java} > ... ... > if (response.getAreNodeLabelsAcceptedByRM() && LOG.isDebugEnabled()) { > LOG.debug("Node Labels {" + StringUtils.join(",", > previousNodeLabels) > + "} were Accepted by RM "); > } else { > // case where updated labels from NodeLabelsProvider is sent to RM > and > // RM rejected the labels > LOG.error( > "NM node labels {" + StringUtils.join(",", previousNodeLabels) > + "} were not accepted by RM and message from RM : " > + response.getDiagnosticsMessage()); > } > ... ... > {code} > When LOG.isDebugEnabled() is false, NM will always print error log. It is an > obvious error and is so misleading. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7659) NodeManager metrics return wrong value after update resource
Yang Wang created YARN-7659: --- Summary: NodeManager metrics return wrong value after update resource Key: YARN-7659 URL: https://issues.apache.org/jira/browse/YARN-7659 Project: Hadoop YARN Issue Type: Bug Reporter: Yang Wang {code:title=NodeManagerMetrics.java} public void addResource(Resource res) { availableMB = availableMB + res.getMemorySize(); availableGB.incr((int)Math.floor(availableMB/1024d)); availableVCores.incr(res.getVirtualCores()); } {code} When the node resource is updated through the RM-NM heartbeat, the NM metrics report wrong values. The root cause is that the new resource has already been accumulated into availableMB, so availableGB must not be incremented by the full running total again. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
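The arithmetic error can be shown without any Hadoop classes. The sketch below is a simplified stand-in for NodeManagerMetrics (plain fields instead of metrics2 gauges; the actual YARN patch may differ): the buggy path increments the GB gauge by the running total on every update, while the fixed path increments only by the delta.

```java
// Simplified stand-in for NodeManagerMetrics; plain fields instead of
// metrics2 gauges. Illustrative only, not the actual Hadoop code or patch.
public class AvailableGbSketch {
    long availableMB = 0; // running total in MB
    int availableGB = 0;  // gauge, updated by increments

    // Buggy pattern: incr by floor(total/1024) on every update, so each
    // call adds the whole running total, not just the new resource.
    void addResourceBuggy(long resMB) {
        availableMB += resMB;
        availableGB += (int) Math.floor(availableMB / 1024d);
    }

    // Fixed pattern: increment by the change in floor(total/1024) so the
    // gauge always equals floor(availableMB/1024).
    void addResourceFixed(long resMB) {
        int before = (int) Math.floor(availableMB / 1024d);
        availableMB += resMB;
        availableGB += (int) Math.floor(availableMB / 1024d) - before;
    }
}
```

Two consecutive 2 GB updates drive the buggy gauge to 6 GB (2 + 4) while the fixed gauge correctly reads 4 GB.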
[jira] [Created] (YARN-7660) NodeManager metrics return wrong value after update node resource
Yang Wang created YARN-7660: --- Summary: NodeManager metrics return wrong value after update node resource Key: YARN-7660 URL: https://issues.apache.org/jira/browse/YARN-7660 Project: Hadoop YARN Issue Type: Bug Reporter: Yang Wang {code:title=NodeManagerMetrics.java} public void addResource(Resource res) { availableMB = availableMB + res.getMemorySize(); availableGB.incr((int)Math.floor(availableMB/1024d)); availableVCores.incr(res.getVirtualCores()); } {code} When the node resource is updated through the RM-NM heartbeat, the NM metrics report wrong values. The root cause is that the new resource has already been accumulated into availableMB, so availableGB must not be incremented by the full running total again. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7661) NodeManager metrics return wrong value after update node resource
Yang Wang created YARN-7661: --- Summary: NodeManager metrics return wrong value after update node resource Key: YARN-7661 URL: https://issues.apache.org/jira/browse/YARN-7661 Project: Hadoop YARN Issue Type: Bug Reporter: Yang Wang {code:title=NodeManagerMetrics.java} public void addResource(Resource res) { availableMB = availableMB + res.getMemorySize(); availableGB.incr((int)Math.floor(availableMB/1024d)); availableVCores.incr(res.getVirtualCores()); } {code} When the node resource is updated through the RM-NM heartbeat, the NM metrics report wrong values. The root cause is that the new resource has already been accumulated into availableMB, so availableGB must not be incremented by the full running total again. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-7661) NodeManager metrics return wrong value after update node resource
[ https://issues.apache.org/jira/browse/YARN-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang reassigned YARN-7661: --- Assignee: Yang Wang > NodeManager metrics return wrong value after update node resource > - > > Key: YARN-7661 > URL: https://issues.apache.org/jira/browse/YARN-7661 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Yang Wang > > {code:title=NodeManagerMetrics.java} > public void addResource(Resource res) { > availableMB = availableMB + res.getMemorySize(); > availableGB.incr((int)Math.floor(availableMB/1024d)); > availableVCores.incr(res.getVirtualCores()); > } > {code} > When the node resource was updated through RM-NM heartbeat, the NM metric > will get wrong value. > The root cause of this issue is that new resource has been added to > availableMB, so not needed to increase for availableGB again. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7661) NodeManager metrics return wrong value after update node resource
[ https://issues.apache.org/jira/browse/YARN-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-7661: Attachment: YARN-7661.001.patch Attach a patch to resolve this issue. > NodeManager metrics return wrong value after update node resource > - > > Key: YARN-7661 > URL: https://issues.apache.org/jira/browse/YARN-7661 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Yang Wang > Attachments: YARN-7661.001.patch > > > {code:title=NodeManagerMetrics.java} > public void addResource(Resource res) { > availableMB = availableMB + res.getMemorySize(); > availableGB.incr((int)Math.floor(availableMB/1024d)); > availableVCores.incr(res.getVirtualCores()); > } > {code} > When the node resource was updated through RM-NM heartbeat, the NM metric > will get wrong value. > The root cause of this issue is that new resource has been added to > availableMB, so not needed to increase for availableGB again. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7660) NM node resource should be updated through heartbeat when rmadmin updateNodeResource execute successfully
[ https://issues.apache.org/jira/browse/YARN-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-7660: Summary: NM node resource should be updated through heartbeat when rmadmin updateNodeResource execute successfully (was: NodeManager metrics return wrong value after update node resource) > NM node resource should be updated through heartbeat when rmadmin > updateNodeResource execute successfully > - > > Key: YARN-7660 > URL: https://issues.apache.org/jira/browse/YARN-7660 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang > > {code:title=NodeManagerMetrics.java} > public void addResource(Resource res) { > availableMB = availableMB + res.getMemorySize(); > availableGB.incr((int)Math.floor(availableMB/1024d)); > availableVCores.incr(res.getVirtualCores()); > } > {code} > When the node resource was updated through RM-NM heartbeat, the NM metric > will get wrong value. > The root cause of this issue is that new resource has been added to > availableMB, so not needed to increase for availableGB again. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7660) NM node resource should be updated through heartbeat when rmadmin updateNodeResource execute successfully
[ https://issues.apache.org/jira/browse/YARN-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-7660: Description: When yarn rmadmin -updateNodeResource is used to update the node resource and executes successfully, the new capability should be sent to the NM through the RM-NM heartbeat. 1. NM jmx metrics need to be updated 2. NM cgroup quota need to be updated was: {code:title=NodeManagerMetrics.java} public void addResource(Resource res) { availableMB = availableMB + res.getMemorySize(); availableGB.incr((int)Math.floor(availableMB/1024d)); availableVCores.incr(res.getVirtualCores()); } {code} When the node resource was updated through RM-NM heartbeat, the NM metric will get wrong value. The root cause of this issue is that new resource has been added to availableMB, so not needed to increase for availableGB again. > NM node resource should be updated through heartbeat when rmadmin > updateNodeResource execute successfully > - > > Key: YARN-7660 > URL: https://issues.apache.org/jira/browse/YARN-7660 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang > > When yarn rmadmin -updateNodeResource is used to update the node resource and > executes successfully, the new capability should be sent to the NM through the > RM-NM heartbeat. > 1. NM jmx metrics need to be updated > 2. NM cgroup quota need to be updated -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7661) NodeManager metrics return wrong value after update node resource
[ https://issues.apache.org/jira/browse/YARN-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294470#comment-16294470 ] Yang Wang commented on YARN-7661: - [~jlowe] Thanks for your comment. I have fixed the test and updated the patch. > NodeManager metrics return wrong value after update node resource > - > > Key: YARN-7661 > URL: https://issues.apache.org/jira/browse/YARN-7661 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Yang Wang >Assignee: Yang Wang > Attachments: YARN-7661.001.patch > > > {code:title=NodeManagerMetrics.java} > public void addResource(Resource res) { > availableMB = availableMB + res.getMemorySize(); > availableGB.incr((int)Math.floor(availableMB/1024d)); > availableVCores.incr(res.getVirtualCores()); > } > {code} > When the node resource was updated through RM-NM heartbeat, the NM metric > will get wrong value. > The root cause of this issue is that new resource has been added to > availableMB, so not needed to increase for availableGB again. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7661) NodeManager metrics return wrong value after update node resource
[ https://issues.apache.org/jira/browse/YARN-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-7661: Attachment: YARN-7661.002.patch > NodeManager metrics return wrong value after update node resource > - > > Key: YARN-7661 > URL: https://issues.apache.org/jira/browse/YARN-7661 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Yang Wang >Assignee: Yang Wang > Attachments: YARN-7661.001.patch, YARN-7661.002.patch > > > {code:title=NodeManagerMetrics.java} > public void addResource(Resource res) { > availableMB = availableMB + res.getMemorySize(); > availableGB.incr((int)Math.floor(availableMB/1024d)); > availableVCores.incr(res.getVirtualCores()); > } > {code} > When the node resource was updated through RM-NM heartbeat, the NM metric > will get wrong value. > The root cause of this issue is that new resource has been added to > availableMB, so not needed to increase for availableGB again. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7661) NodeManager metrics return wrong value after update node resource
[ https://issues.apache.org/jira/browse/YARN-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16296229#comment-16296229 ] Yang Wang commented on YARN-7661: - [~jlowe], thanks for your review and commit. > NodeManager metrics return wrong value after update node resource > - > > Key: YARN-7661 > URL: https://issues.apache.org/jira/browse/YARN-7661 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Yang Wang >Assignee: Yang Wang > Fix For: 3.1.0, 2.10.0, 2.9.1, 3.0.1, 2.8.4, 2.7.6 > > Attachments: YARN-7661.001.patch, YARN-7661.002.patch > > > {code:title=NodeManagerMetrics.java} > public void addResource(Resource res) { > availableMB = availableMB + res.getMemorySize(); > availableGB.incr((int)Math.floor(availableMB/1024d)); > availableVCores.incr(res.getVirtualCores()); > } > {code} > When the node resource was updated through RM-NM heartbeat, the NM metric > will get wrong value. > The root cause of this issue is that new resource has been added to > availableMB, so not needed to increase for availableGB again. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5621) Support LinuxContainerExecutor to create symlinks for continuously localized resources
[ https://issues.apache.org/jira/browse/YARN-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114150#comment-16114150 ] Yang Wang commented on YARN-5621: - {code:title=LinuxContainerExecutor.java} protected void createSymlinkAsUser(String user, File privateScriptFile, String userScriptFile) throws PrivilegedOperationException { String runAsUser = getRunAsUser(user); ... ... {code} I think we should use containerUser instead of runAsUser here, because it may cause "Invalid command" in container-executor when getRunAsUser returns the nonsecureLocalUser. > Support LinuxContainerExecutor to create symlinks for continuously localized > resources > -- > > Key: YARN-5621 > URL: https://issues.apache.org/jira/browse/YARN-5621 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Jian He >Assignee: Jian He > Labels: oct16-hard > Attachments: YARN-5621.1.patch, YARN-5621.2.patch, YARN-5621.3.patch, > YARN-5621.4.patch, YARN-5621.5.patch > > > When new resources are localized, new symlink needs to be created for the > localized resource. This is the change for the LinuxContainerExecutor to > create the symlinks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
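A minimal sketch of the distinction (illustrative signature, not the real LinuxContainerExecutor API): in nonsecure mode, getRunAsUser collapses every container user to the configured local user, so an operation that must be attributed to the submitting user needs the container user instead.

```java
// Illustrative reduction, not the real LinuxContainerExecutor API: in
// nonsecure mode the effective run-as user is the configured local user
// (default "nobody"), regardless of which user submitted the container.
public class LceUserSketch {
    static String getRunAsUser(String containerUser, boolean secureMode,
                               String nonsecureLocalUser) {
        return secureMode ? containerUser : nonsecureLocalUser;
    }
}
```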
[jira] [Comment Edited] (YARN-5621) Support LinuxContainerExecutor to create symlinks for continuously localized resources
[ https://issues.apache.org/jira/browse/YARN-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114150#comment-16114150 ] Yang Wang edited comment on YARN-5621 at 8/4/17 9:08 AM: - {code:title=LinuxContainerExecutor.java} protected void createSymlinkAsUser(String user, File privateScriptFile, String userScriptFile) throws PrivilegedOperationException { String runAsUser = getRunAsUser(user); ... ... {code} Hi,[~jianhe] I think we should use containerUser instead of runAsUser here. Because it may cause "Invalid command" in container-executor when getRunAsUser return nonsecureLocalUser. was (Author: fly_in_gis): {code:title=LinuxContainerExecutor.java} protected void createSymlinkAsUser(String user, File privateScriptFile, String userScriptFile) throws PrivilegedOperationException { String runAsUser = getRunAsUser(user); ... ... {code} I think we should use containerUser instead of runAsUser here. Because it may cause "Invalid command" in container-executor when getRunAsUser return nonsecureLocalUser. > Support LinuxContainerExecutor to create symlinks for continuously localized > resources > -- > > Key: YARN-5621 > URL: https://issues.apache.org/jira/browse/YARN-5621 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Jian He >Assignee: Jian He > Labels: oct16-hard > Attachments: YARN-5621.1.patch, YARN-5621.2.patch, YARN-5621.3.patch, > YARN-5621.4.patch, YARN-5621.5.patch > > > When new resources are localized, new symlink needs to be created for the > localized resource. This is the change for the LinuxContainerExecutor to > create the symlinks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-6951) Fix debug log when Resource handler chain is enabled
Yang Wang created YARN-6951: --- Summary: Fix debug log when Resource handler chain is enabled Key: YARN-6951 URL: https://issues.apache.org/jira/browse/YARN-6951 Project: Hadoop YARN Issue Type: Bug Reporter: Yang Wang {code title=LinuxContainerExecutor.java} ... ... if (LOG.isDebugEnabled()) { LOG.debug("Resource handler chain enabled = " + (resourceHandlerChain == null)); } ... ... {code} I think it is just a typo. When resourceHandlerChain is not null, it should print the log "Resource handler chain enabled = true". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
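The typo inverts the meaning of the message. A minimal sketch of the condition (illustrative, not the actual LinuxContainerExecutor code):

```java
// Illustrative reduction of the condition, not the actual Hadoop code:
// the debug message should report "enabled = true" when the chain is
// non-null, but the original expression reports the opposite.
public class ChainEnabledSketch {
    static boolean enabledBuggy(Object resourceHandlerChain) {
        return resourceHandlerChain == null; // typo: inverted
    }
    static boolean enabledFixed(Object resourceHandlerChain) {
        return resourceHandlerChain != null;
    }
}
```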
[jira] [Updated] (YARN-6951) Fix debug log when Resource handler chain is enabled
[ https://issues.apache.org/jira/browse/YARN-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6951: Description: {code:title=LinuxContainerExecutor.java} ... ... if (LOG.isDebugEnabled()) { LOG.debug("Resource handler chain enabled = " + (resourceHandlerChain == null)); } ... ... {code} I think it is just a typo.When resourceHandlerChain is not null, print the log "Resource handler chain enabled = true". was: {code title=LinuxContainerExecutor.java} ... ... if (LOG.isDebugEnabled()) { LOG.debug("Resource handler chain enabled = " + (resourceHandlerChain == null)); } ... ... {code} I think it is just a typo.When resourceHandlerChain is not null, print the log "Resource handler chain enabled = true". > Fix debug log when Resource handler chain is enabled > > > Key: YARN-6951 > URL: https://issues.apache.org/jira/browse/YARN-6951 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang > > {code:title=LinuxContainerExecutor.java} > ... ... > if (LOG.isDebugEnabled()) { > LOG.debug("Resource handler chain enabled = " + (resourceHandlerChain > == null)); > } > ... ... > {code} > I think it is just a typo.When resourceHandlerChain is not null, print the > log "Resource handler chain enabled = true". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6951) Fix debug log when Resource handler chain is enabled
[ https://issues.apache.org/jira/browse/YARN-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6951: Attachment: YARN-6951.001.patch > Fix debug log when Resource handler chain is enabled > > > Key: YARN-6951 > URL: https://issues.apache.org/jira/browse/YARN-6951 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang > Attachments: YARN-6951.001.patch > > > {code:title=LinuxContainerExecutor.java} > ... ... > if (LOG.isDebugEnabled()) { > LOG.debug("Resource handler chain enabled = " + (resourceHandlerChain > == null)); > } > ... ... > {code} > I think it is just a typo.When resourceHandlerChain is not null, print the > log "Resource handler chain enabled = true". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-6951) Fix debug log when Resource handler chain is enabled
[ https://issues.apache.org/jira/browse/YARN-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang reassigned YARN-6951: --- Assignee: Yang Wang > Fix debug log when Resource handler chain is enabled > > > Key: YARN-6951 > URL: https://issues.apache.org/jira/browse/YARN-6951 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Yang Wang > Attachments: YARN-6951.001.patch > > > {code:title=LinuxContainerExecutor.java} > ... ... > if (LOG.isDebugEnabled()) { > LOG.debug("Resource handler chain enabled = " + (resourceHandlerChain > == null)); > } > ... ... > {code} > I think it is just a typo.When resourceHandlerChain is not null, print the > log "Resource handler chain enabled = true". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6212) NodeManager metrics returning wrong negative values
[ https://issues.apache.org/jira/browse/YARN-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16116452#comment-16116452 ] Yang Wang commented on YARN-6212: - Hi, Miklos Szegedi I'm afraid this JIRA is not a duplicate of YARN-3933. The primary cause of the negative values is that metrics do not recover properly when the NM restarts. *AllocatedContainers,ContainersLaunched,AllocatedGB,AvailableGB,AllocatedVCores,AvailableVCores* in metrics need to recover when the NM restarts. This should be done in ContainerManagerImpl#recoverContainer. The scenario can be reproduced by the following steps: # Make sure YarnConfiguration.NM_RECOVERY_ENABLED=true,YarnConfiguration.NM_RECOVERY_SUPERVISED=true in NM # Submit an application and keep it running # Restart NM # Stop the application # Now you get the negative values > NodeManager metrics returning wrong negative values > --- > > Key: YARN-6212 > URL: https://issues.apache.org/jira/browse/YARN-6212 > Project: Hadoop YARN > Issue Type: Bug > Components: metrics >Affects Versions: 2.7.3 >Reporter: Abhishek Shivanna > > It looks like the metrics returned by the NodeManager have negative values > for metrics that never should be negative. 
Here is the output from the NM endpoint > {noformat} > /jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics > {noformat} > {noformat} > { > "beans" : [ { > "name" : "Hadoop:service=NodeManager,name=NodeManagerMetrics", > "modelerType" : "NodeManagerMetrics", > "tag.Context" : "yarn", > "tag.Hostname" : "", > "ContainersLaunched" : 707, > "ContainersCompleted" : 9, > "ContainersFailed" : 124, > "ContainersKilled" : 579, > "ContainersIniting" : 0, > "ContainersRunning" : 19, > "AllocatedGB" : -26, > "AllocatedContainers" : -5, > "AvailableGB" : 252, > "AllocatedVCores" : -5, > "AvailableVCores" : 101, > "ContainerLaunchDurationNumOps" : 718, > "ContainerLaunchDurationAvgTime" : 18.0 > } ] > } > {noformat} > Is there any circumstance under which the value for AllocatedGB, > AllocatedContainers and AllocatedVCores go below 0? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-6212) NodeManager metrics returning wrong negative values
[ https://issues.apache.org/jira/browse/YARN-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16116452#comment-16116452 ] Yang Wang edited comment on YARN-6212 at 8/8/17 2:25 AM: - Hi, [~miklos.szeg...@cloudera.com] I'm afraid this JIRA is not a duplicate of YARN-3933. The primary cause of negative values is that metrics do not recover properly when NM restart. *AllocatedContainers,ContainersLaunched,AllocatedGB,AvailableGB,AllocatedVCores,AvailableVCores* in metrics need to recover when NM restart. This should be done in ContainerManagerImpl#recoverContainer. The scenario could be reproduction by the following steps: # Make sure YarnConfiguration.NM_RECOVERY_ENABLED=true,YarnConfiguration.NM_RECOVERY_SUPERVISED=true in NM # Submit an application and keep running # Restart NM # Stop the application # Now you get the negative values was (Author: fly_in_gis): Hi, Miklos Szegedi I'm afraid this JIRA is not a duplicate of YARN-3933. The primary cause of negative values is that metrics do not recover properly when NM restart. *AllocatedContainers,ContainersLaunched,AllocatedGB,AvailableGB,AllocatedVCores,AvailableVCores* in metrics need to recover when NM restart. This should be done in ContainerManagerImpl#recoverContainer. The scenario could be reproduction by the following steps: # Make sure YarnConfiguration.NM_RECOVERY_ENABLED=true,YarnConfiguration.NM_RECOVERY_SUPERVISED=true in NM # Submit an application and keep running # Restart NM # Stop the application # Now you get the negative values > NodeManager metrics returning wrong negative values > --- > > Key: YARN-6212 > URL: https://issues.apache.org/jira/browse/YARN-6212 > Project: Hadoop YARN > Issue Type: Bug > Components: metrics >Affects Versions: 2.7.3 >Reporter: Abhishek Shivanna > > It looks like the metrics returned by the NodeManager have negative values > for metrics that never should be negative. 
Here is an output form NM endpoint > {noformat} > /jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics > {noformat} > {noformat} > { > "beans" : [ { > "name" : "Hadoop:service=NodeManager,name=NodeManagerMetrics", > "modelerType" : "NodeManagerMetrics", > "tag.Context" : "yarn", > "tag.Hostname" : "", > "ContainersLaunched" : 707, > "ContainersCompleted" : 9, > "ContainersFailed" : 124, > "ContainersKilled" : 579, > "ContainersIniting" : 0, > "ContainersRunning" : 19, > "AllocatedGB" : -26, > "AllocatedContainers" : -5, > "AvailableGB" : 252, > "AllocatedVCores" : -5, > "AvailableVCores" : 101, > "ContainerLaunchDurationNumOps" : 718, > "ContainerLaunchDurationAvgTime" : 18.0 > } ] > } > {noformat} > Is there any circumstance under which the value for AllocatedGB, > AllocatedContainers and AllocatedVCores go below 0? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-6966) NodeManager metrics may returning wrong negative values when after restart
Yang Wang created YARN-6966: --- Summary: NodeManager metrics may returning wrong negative values when after restart Key: YARN-6966 URL: https://issues.apache.org/jira/browse/YARN-6966 Project: Hadoop YARN Issue Type: Bug Reporter: Yang Wang Just as YARN-6212. However, I think it is not a duplicate of YARN-3933. The primary cause of the negative values is that metrics do not recover properly when the NM restarts. AllocatedContainers,ContainersLaunched,AllocatedGB,AvailableGB,AllocatedVCores,AvailableVCores in metrics also need to recover when the NM restarts. This should be done in ContainerManagerImpl#recoverContainer. The scenario can be reproduced by the following steps: # Make sure YarnConfiguration.NM_RECOVERY_ENABLED=true,YarnConfiguration.NM_RECOVERY_SUPERVISED=true in NM # Submit an application and keep it running # Restart NM # Stop the application # Now you get the negative values {code} /jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics {code} {code} { name: "Hadoop:service=NodeManager,name=NodeManagerMetrics", modelerType: "NodeManagerMetrics", tag.Context: "yarn", tag.Hostname: "hadoop.com", ContainersLaunched: 0, ContainersCompleted: 0, ContainersFailed: 2, ContainersKilled: 0, ContainersIniting: 0, ContainersRunning: 0, AllocatedGB: 0, AllocatedContainers: -2, AvailableGB: 160, AllocatedVCores: -11, AvailableVCores: 3611, ContainerLaunchDurationNumOps: 2, ContainerLaunchDurationAvgTime: 6, BadLocalDirs: 0, BadLogDirs: 0, GoodLocalDirsDiskUtilizationPerc: 2, GoodLogDirsDiskUtilizationPerc: 2 } {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
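The negative values follow directly from asymmetric bookkeeping: after a restart the gauges start from zero, recovered containers are never re-allocated in the metrics, yet their completion still decrements them. The sketch below is a simplified stand-in (a plain counter instead of NodeManagerMetrics; the recoverContainer hook mirrors what ContainerManagerImpl#recoverContainer would need to do, not the actual patch):

```java
// Simplified stand-in for the NM metrics lifecycle across a restart.
// Illustrative only; not the NodeManagerMetrics API or the actual patch.
public class RestartMetricsSketch {
    int allocatedContainers = 0; // gauge, reset to 0 on NM restart

    void allocate() { allocatedContainers++; } // container started
    void release()  { allocatedContainers--; } // container finished

    // Hypothetical recovery hook: re-apply the allocation for each container
    // that was still live when the NM went down.
    void recoverContainer(boolean recoverMetrics) {
        if (recoverMetrics) {
            allocate();
        }
    }
}
```

Without the recovery step, the first completion of a recovered container drives the gauge to -1; with it, the gauge returns cleanly to 0.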
[jira] [Updated] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart
[ https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6966:
Summary: NodeManager metrics may return wrong negative values when NM restart (was: NodeManager metrics may returning wrong negative values when after restart)
[jira] [Updated] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart
[ https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6966:
Attachment: YARN-6966.001.patch
[jira] [Updated] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart
[ https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6966:
Attachment: YARN-6966.002.patch
Updated the patch.
[jira] [Assigned] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart
[ https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang reassigned YARN-6966:
---
Assignee: Yang Wang
[jira] [Assigned] (YARN-6589) Recover all resources when NM restart
[ https://issues.apache.org/jira/browse/YARN-6589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang reassigned YARN-6589:
---
Assignee: Yang Wang

> Recover all resources when NM restart
> -
>
> Key: YARN-6589
> URL: https://issues.apache.org/jira/browse/YARN-6589
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Yang Wang
> Assignee: Yang Wang
>
> When the NM restarts, containers are recovered. However, only the memory and vcores in the capability are recovered; all resource types need to be recovered.
> {code:title=ContainerImpl.java}
> // resource capability had been updated before NM was down
> this.resource =
>     Resource.newInstance(recoveredCapability.getMemorySize(),
>         recoveredCapability.getVirtualCores());
> {code}
> It should be:
> {code:title=ContainerImpl.java}
> // resource capability had been updated before NM was down;
> // need to recover all resource types, not only memory and vcores
> this.resource = Resources.clone(recoveredCapability);
> {code}
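The difference between the two snippets above can be sketched in plain Java with a stand-in for YARN's Resource. The ResourceSketch class and the "yarn.io/gpu" key below are illustrative assumptions, not Hadoop APIs: copying only memory and vcores silently drops any extended resource type, while a full clone preserves every entry.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for YARN's Resource: besides memory and vcores,
// a resource may carry extended resource types (e.g. "yarn.io/gpu").
class ResourceSketch {
    Map<String, Long> values = new HashMap<>();

    static ResourceSketch of(long memMb, long vcores) {
        ResourceSketch r = new ResourceSketch();
        r.values.put("memory-mb", memMb);
        r.values.put("vcores", vcores);
        return r;
    }

    // Mirrors the buggy recovery path: only memory and vcores survive.
    static ResourceSketch newInstance(long memMb, long vcores) {
        return of(memMb, vcores);
    }

    // Mirrors the Resources.clone fix: every resource type survives.
    static ResourceSketch copyOf(ResourceSketch src) {
        ResourceSketch r = new ResourceSketch();
        r.values.putAll(src.values);
        return r;
    }
}

public class RecoverAllResourcesDemo {
    public static void main(String[] args) {
        ResourceSketch recovered = ResourceSketch.of(4096, 2);
        recovered.values.put("yarn.io/gpu", 1L); // extended resource set before NM went down

        ResourceSketch partial = ResourceSketch.newInstance(4096, 2);
        ResourceSketch full = ResourceSketch.copyOf(recovered);

        System.out.println(partial.values.containsKey("yarn.io/gpu")); // false: GPU lost
        System.out.println(full.values.containsKey("yarn.io/gpu"));    // true: GPU kept
    }
}
```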
[jira] [Updated] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-4166:
Attachment: (was: YARN-4166-branch2.8-001.patch)

> Support changing container cpu resource
> ---
>
> Key: YARN-4166
> URL: https://issues.apache.org/jira/browse/YARN-4166
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: api, nodemanager, resourcemanager
> Affects Versions: 2.8.0, 3.0.0-alpha2
> Reporter: Jian He
> Assignee: Yang Wang
> Attachments: YARN-4166.001.patch, YARN-4166.002.patch, YARN-4166.003.patch, YARN-4166.004.patch
>
> Memory resizing is now supported; we need to support the same for cpu.
[jira] [Updated] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart
[ https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6966:
Attachment: YARN-6966.003.patch
[jira] [Updated] (YARN-6589) Recover all resources when NM restart
[ https://issues.apache.org/jira/browse/YARN-6589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6589:
Attachment: YARN-6589-YARN-3926.001.patch
[jira] [Assigned] (YARN-6578) Return container resource utilization from NM ContainerStatus call
[ https://issues.apache.org/jira/browse/YARN-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang reassigned YARN-6578:
---
Assignee: Yang Wang

> Return container resource utilization from NM ContainerStatus call
> --
>
> Key: YARN-6578
> URL: https://issues.apache.org/jira/browse/YARN-6578
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Yang Wang
> Assignee: Yang Wang
> Priority: Major
> Attachments: YARN-6578.001.patch
>
> When the ApplicationMaster wants to change (increase/decrease) the resources of an allocated container, resource utilization is an important reference indicator for the decision. So when the AM calls NMClient.getContainerStatus, the resource utilization needs to be returned.
[jira] [Updated] (YARN-6578) Return container resource utilization from NM ContainerStatus call
[ https://issues.apache.org/jira/browse/YARN-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6578:
Description: When the ApplicationMaster wants to change (increase/decrease) the resources of an allocated container, resource utilization is an important reference indicator for the decision. So when the AM calls NMClient.getContainerStatus, the resource utilization needs to be returned. The container resource utilization also needs to be reported to the RM to enable better scheduling.
(was: When the applicationMaster wants to change(increase/decrease) resources of an allocated container, resource utilization is an important reference indicator for decision making. So, when AM call NMClient.getContainerStatus, resource utilization needs to be returned.)
[jira] [Created] (YARN-8984) OutstandingSchedRequests in AMRMClient could not be removed when AllocationTags is null or empty
Yang Wang created YARN-8984:
---
Summary: OutstandingSchedRequests in AMRMClient could not be removed when AllocationTags is null or empty
Key: YARN-8984
URL: https://issues.apache.org/jira/browse/YARN-8984
Project: Hadoop YARN
Issue Type: Bug
Reporter: Yang Wang

In AMRMClient, entries in outstandingSchedRequests should be removed or decreased when a container is allocated. However, this does not work when the allocation tags are null or empty.
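A minimal sketch of the leak (all names below are hypothetical; the real bookkeeping lives in AMRMClient and is keyed by the request's allocation tags): if a request submitted with null tags and a container allocated with empty tags are not normalized to the same map key, the outstanding entry is never decremented and the request is re-sent on resync. Normalizing null to an empty set makes the add and remove sides agree.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class OutstandingRequestsDemo {
    // Hypothetical analogue of AMRMClient's outstanding scheduling-request
    // table, keyed by allocation tags; values are pending request counts.
    static Map<Set<String>, Integer> outstanding = new HashMap<>();

    // Treat null tags and empty tags as the same key.
    static Set<String> normalize(Set<String> tags) {
        return tags == null ? Collections.<String>emptySet() : tags;
    }

    static void addRequest(Set<String> tags) {
        outstanding.merge(normalize(tags), 1, Integer::sum);
    }

    static void onContainerAllocated(Set<String> tags) {
        // Decrement the pending count; drop the entry when it reaches zero.
        outstanding.computeIfPresent(normalize(tags), (k, n) -> n > 1 ? n - 1 : null);
    }

    public static void main(String[] args) {
        addRequest(null);                             // request carries no allocation tags
        onContainerAllocated(Collections.emptySet()); // allocated container reports empty tags
        System.out.println(outstanding.isEmpty());    // true: nothing leaks
    }
}
```

Without the normalize step, addRequest(null) and onContainerAllocated(emptySet) would touch different keys and the entry would stay in the table forever, which is the leak this issue describes.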
[jira] [Assigned] (YARN-8984) OutstandingSchedRequests in AMRMClient could not be removed when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang reassigned YARN-8984:
---
Assignee: Yang Wang
[jira] [Updated] (YARN-8984) OutstandingSchedRequests in AMRMClient could not be removed when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-8984:
Attachment: YARN-8984-001.patch
[jira] [Commented] (YARN-8984) OutstandingSchedRequests in AMRMClient could not be removed when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677649#comment-16677649 ] Yang Wang commented on YARN-8984:
This could be a critical bug during resync: all the outstanding scheduling requests with empty allocation tags will be sent again. In a big cluster, when the active RM switches, the new RM will receive a flood of requests. [~cheersyang] Could you please take a look?
[jira] [Commented] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677802#comment-16677802 ] Yang Wang commented on YARN-8984:
Hi [~cheersyang], I have tried to move the test to TestAMRMClientPlacementConstraints and found that the case fails, because containers cannot be allocated when allocationTags is empty. I think that is a separate issue in the placement processor.
[jira] [Updated] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-8984:
Attachment: YARN-8984-002.patch
[jira] [Commented] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678081#comment-16678081 ] Yang Wang commented on YARN-8984:
There is no difference between putting the test in a separate class and in TestAMRMClientPlacementConstraints. When YarnConfiguration.RM_PLACEMENT_CONSTRAINTS_HANDLER is set to scheduler, we cannot get rejectedSchedulingRequests from the AllocateResponse; it is not set by the capacity scheduler. So I added another test in TestAMRMClientPlacementConstraints. [~cheersyang] Please help review.
[jira] [Updated] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-8984:
Attachment: YARN-8984-003.patch
[jira] [Commented] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679208#comment-16679208 ] Yang Wang commented on YARN-8984:
[~cheersyang], I do not think it will throw an NPE when setAllocationTags is called with null. ContainerPBImpl#getAllocationTags() returns a new empty HashSet when the tags are null, and SchedulingRequestPBImpl#getAllocationTags() does the same, so the null check is not necessary. Btw, putting/getting a null key in a HashMap does not throw an NPE.
[~botong], thanks for your reply. The allocation tags in the SchedulingRequest in AMRMClient are empty, so the RM will not set any tags for the allocated containers.
[~kkaranasos], thanks for your reply. You are right that SchedulingRequests are used for placement constraints. However, that does not mean we have to set allocation tags on every SchedulingRequest. We use SchedulingRequest instead of ResourceRequest in our computing framework to allocate resources, which is how we hit this issue.
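The HashMap remark above is easy to verify: java.util.HashMap explicitly permits a single null key, so neither put nor get with null throws an exception.

```java
import java.util.HashMap;

public class NullKeyDemo {
    public static void main(String[] args) {
        HashMap<String, Integer> m = new HashMap<>();
        m.put(null, 1);                  // HashMap allows one null key
        System.out.println(m.get(null)); // prints 1, no NPE
        System.out.println(m.get("x"));  // prints null for an absent key
    }
}
```

Note this is specific to HashMap; other Map implementations such as Hashtable and ConcurrentHashMap do reject null keys.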
[jira] [Updated] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-8984:
Attachment: YARN-8984-004.patch
[jira] [Commented] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680902#comment-16680902 ] Yang Wang commented on YARN-8984:
[~botong], [~kkaranasos] Thanks for your replies. I have added the null check for AllocationTag. Please help review.
[jira] [Updated] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-8984:
Attachment: YARN-8984-005.patch
[jira] [Commented] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16681142#comment-16681142 ] Yang Wang commented on YARN-8984:
[~cheersyang], thanks for your comments. I have added a test to verify the three cases; they all map to the same empty-HashSet key of outstandingSchedRequests.
[jira] [Commented] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694602#comment-16694602 ] Yang Wang commented on YARN-8984:
Hi [~kkaranasos], [~botong], [~asuresh], could you please take a look at this patch? It is very important when using SchedulingRequest instead of ResourceRequest.
[jira] [Commented] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695741#comment-16695741 ] Yang Wang commented on YARN-8984: - [~cheersyang] [~kkaranasos] Thanks for all your reviews and the commit. > AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty > -- > > Key: YARN-8984 > URL: https://issues.apache.org/jira/browse/YARN-8984 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Yang Wang >Priority: Critical > Fix For: 3.3.0, 3.2.1 > > Attachments: YARN-8984-001.patch, YARN-8984-002.patch, > YARN-8984-003.patch, YARN-8984-004.patch, YARN-8984-005.patch > > > In AMRMClient, outstandingSchedRequests should be removed or decreased when > container allocated. However, it could not work when allocation tag is null > or empty.
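The leak described in YARN-8984 can be illustrated with a minimal sketch. This is not the actual AMRMClient code; the class and field names below are simplified stand-ins for the real `outstandingSchedRequests` structure, which is keyed by allocation tags. The point it shows: requests with null or empty tags all normalize to the same empty-set key, so their entries accumulate under one bucket and are never matched and removed when containers are allocated.

```java
import java.util.*;

// Simplified model of outstanding SchedulingRequests keyed by allocation
// tags. Null and empty tag sets collapse onto one shared empty-set key,
// so their entries can pile up without being matched and decremented.
public class OutstandingSchedRequestsSketch {
  static Map<Set<String>, List<String>> outstanding = new HashMap<>();

  static void addRequest(String requestId, Set<String> allocationTags) {
    Set<String> key = (allocationTags == null)
        ? new HashSet<>() : new HashSet<>(allocationTags);
    outstanding.computeIfAbsent(key, k -> new ArrayList<>()).add(requestId);
  }

  public static void main(String[] args) {
    addRequest("req-1", null);                    // null tags
    addRequest("req-2", Collections.emptySet());  // empty tags
    addRequest("req-3", Set.of("hbase"));         // a real tag

    // Both tag-less requests share the single empty-set key.
    System.out.println(outstanding.get(Collections.emptySet()).size()); // 2
    System.out.println(outstanding.size());                             // 2
  }
}
```

The fix in the patch, as the comments describe it, is to remove or decrease these entries on allocation even when the tag set is null or empty, rather than relying on tag-based matching alone.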
[jira] [Commented] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15972334#comment-15972334 ] Yang Wang commented on YARN-4166: - Hi [~Naganarasimha], are you still working on this? Could you share your progress, please? > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: Jian He >Assignee: Naganarasimha G R > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Commented] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15974130#comment-15974130 ] Yang Wang commented on YARN-4166: - We want to use container resizing (YARN-1197) in production ASAP, and I already have a patch for this JIRA. [~Naganarasimha], would you mind taking a look? > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: Jian He >Assignee: Naganarasimha G R > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Issue Comment Deleted] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-4166: Comment: was deleted (was: We want to use container resize(YARN-1197) in production ASAP. And I already have a patch for this JIRA. [~Naganarasimha] Would you mind take a look?) > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: Jian He >Assignee: Naganarasimha G R > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Commented] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15974513#comment-15974513 ] Yang Wang commented on YARN-4166: - [~Naganarasimha] Sorry, I cannot upload a patch. Could you give me the permission? Also, the Hadoop version of our production environment is 2.8, so the patch is for branch-2.8. > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: Jian He >Assignee: Naganarasimha G R > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Updated] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-4166: Attachment: YARN-4166-branch2.8-001.patch > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: Jian He >Assignee: Naganarasimha G R > Attachments: YARN-4166-branch2.8-001.patch > > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Updated] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-4166: Attachment: (was: YARN-4166-branch2.8-001.patch) > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: Jian He >Assignee: Naganarasimha G R > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Updated] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-4166: Attachment: YARN-4166-branch2.8-001.patch > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: Jian He >Assignee: Naganarasimha G R > Attachments: YARN-4166-branch2.8-001.patch > > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Commented] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15974605#comment-15974605 ] Yang Wang commented on YARN-4166: - Uploaded a patch for branch-2.8. > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: Jian He >Assignee: Naganarasimha G R > Attachments: YARN-4166-branch2.8-001.patch > > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Updated] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-4166: Attachment: YARN-4166.001.patch > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Affects Versions: 2.8.0, 3.0.0-alpha2 >Reporter: Jian He >Assignee: Naganarasimha G R > Attachments: YARN-4166.001.patch, YARN-4166-branch2.8-001.patch > > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Commented] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976485#comment-15976485 ] Yang Wang commented on YARN-4166: - [~Naganarasimha] Thanks for your help. I have already uploaded a patch for trunk. > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Affects Versions: 2.8.0, 3.0.0-alpha2 >Reporter: Jian He >Assignee: Naganarasimha G R > Attachments: YARN-4166.001.patch, YARN-4166-branch2.8-001.patch > > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Updated] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-4166: Attachment: YARN-4166.002.patch > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Affects Versions: 2.8.0, 3.0.0-alpha2 >Reporter: Jian He >Assignee: Naganarasimha G R > Attachments: YARN-4166.001.patch, YARN-4166.002.patch, > YARN-4166-branch2.8-001.patch > > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Commented] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15977015#comment-15977015 ] Yang Wang commented on YARN-4166: - Sure, I have fixed the red flags. The Findbugs red flag has nothing to do with this patch. > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Affects Versions: 2.8.0, 3.0.0-alpha2 >Reporter: Jian He >Assignee: Naganarasimha G R > Attachments: YARN-4166.001.patch, YARN-4166.002.patch, > YARN-4166-branch2.8-001.patch > > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Commented] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15986157#comment-15986157 ] Yang Wang commented on YARN-4166: - [~Naganarasimha], thanks for your comments on this patch; I will update it ASAP. > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Affects Versions: 2.8.0, 3.0.0-alpha2 >Reporter: Jian He >Assignee: Yang Wang > Attachments: YARN-4166.001.patch, YARN-4166.002.patch, > YARN-4166-branch2.8-001.patch > > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Updated] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-4166: Attachment: YARN-4166.003.patch > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Affects Versions: 2.8.0, 3.0.0-alpha2 >Reporter: Jian He >Assignee: Yang Wang > Attachments: YARN-4166.001.patch, YARN-4166.002.patch, > YARN-4166.003.patch, YARN-4166-branch2.8-001.patch > > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Updated] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-4166: Attachment: YARN-4166.004.patch > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Affects Versions: 2.8.0, 3.0.0-alpha2 >Reporter: Jian He >Assignee: Yang Wang > Attachments: YARN-4166.001.patch, YARN-4166.002.patch, > YARN-4166.003.patch, YARN-4166.004.patch, YARN-4166-branch2.8-001.patch > > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Commented] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996225#comment-15996225 ] Yang Wang commented on YARN-4166: - Updated the patch according to [~Naganarasimha]'s suggestions: # *updateContainerResource* is now abstract; *DefaultContainerExecutor* has an empty implementation. # *ResourceHandlerException* will be thrown when the container resource update fails. # If *ContainerExecutor.updateContainerResource* fails, the container resource change still needs to be persisted for recovery. # CGroupsCpuResourceHandlerImpl will not invoke *cGroupsHandler.deleteCGroup* in updateContainerResource; the ResourceHandlerException will be caught in ContainerManagerImpl and the container added to failedContainers. # Changed the signature to *void updateContainerResource*. # Added the test *testContainerManager.testUpdateContainerResourceFailed*. # Fixed the checkstyle issue. > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Affects Versions: 2.8.0, 3.0.0-alpha2 >Reporter: Jian He >Assignee: Yang Wang > Attachments: YARN-4166.001.patch, YARN-4166.002.patch, > YARN-4166.003.patch, YARN-4166.004.patch, YARN-4166-branch2.8-001.patch > > > Memory resizing is now supported, we need to support the same for cpu.
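Item 1 and item 2 of the list above can be sketched roughly as follows. This is a simplified illustration, not the actual ContainerExecutor/ResourceHandler API: the class names and the plain `Exception` standing in for *ResourceHandlerException* are assumptions for the sketch. It shows the shape of the design: an abstract update hook on the base executor, an empty implementation in the default executor, and a cgroups-based executor that signals failure by throwing.

```java
// Simplified sketch of the executor hierarchy described in the comment.
// Names are illustrative; a plain Exception stands in for the
// ResourceHandlerException mentioned above.
public class ExecutorSketch {
  abstract static class Executor {
    // Item 1: abstract on the base class, so every executor must decide.
    abstract void updateContainerResource(String containerId, int vcores)
        throws Exception;
  }

  static class DefaultExecutor extends Executor {
    // Item 1: the default executor does nothing (no cgroups to update).
    @Override
    void updateContainerResource(String containerId, int vcores) { /* no-op */ }
  }

  static class CgroupsExecutor extends Executor {
    // Item 2: a failed update surfaces as an exception, which the caller
    // (ContainerManagerImpl in the real code) catches and records.
    @Override
    void updateContainerResource(String containerId, int vcores)
        throws Exception {
      if (vcores <= 0) {
        throw new Exception("resource update failed for " + containerId);
      }
      // ... in the real executor, write the new cpu quota to the
      // container's cgroup here ...
    }
  }

  public static void main(String[] args) throws Exception {
    new DefaultExecutor().updateContainerResource("c1", 2);   // succeeds, no-op
    try {
      new CgroupsExecutor().updateContainerResource("c1", 0); // fails
    } catch (Exception e) {
      System.out.println("caught: " + e.getMessage());
    }
  }
}
```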
[jira] [Created] (YARN-6578) Return container resource utilization from NM ContainerStatus call
Yang Wang created YARN-6578: --- Summary: Return container resource utilization from NM ContainerStatus call Key: YARN-6578 URL: https://issues.apache.org/jira/browse/YARN-6578 Project: Hadoop YARN Issue Type: New Feature Reporter: Yang Wang When the ApplicationMaster wants to change (increase/decrease) the resources of an allocated container, resource utilization is an important reference indicator for decision making. So, when the AM calls NMClient.getContainerStatus, resource utilization needs to be returned.
[jira] [Commented] (YARN-6578) Return container resource utilization from NM ContainerStatus call
[ https://issues.apache.org/jira/browse/YARN-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004415#comment-16004415 ] Yang Wang commented on YARN-6578: - [~Naganarasimha], thanks for your reply. I plan to get the usage from ContainerMetrics and return it in ContainerStatus. If you are worried that this will make the NM heartbeat bigger, we could set the utilization only in the response of NMClient.getContainerStatus. {code} ContainerImpl.cloneAndGetContainerStatus() ... ContainerMetrics metrics = ContainerMetrics.getContainerMetrics(this.containerId); if (metrics != null) { status.setUtilization(ResourceUtilization .newInstance((int) metrics.pMemMBsStat.lastStat().mean(), 0, (float) metrics.cpuCoreUsagePercent.lastStat().mean())); } else { status.setUtilization(ResourceUtilization.newInstance(0, 0, 0)); } ... {code} > Return container resource utilization from NM ContainerStatus call > -- > > Key: YARN-6578 > URL: https://issues.apache.org/jira/browse/YARN-6578 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Yang Wang > > When the applicationMaster wants to change(increase/decrease) resources of an > allocated container, resource utilization is an important reference indicator > for decision making. So, when AM call NMClient.getContainerStatus, resource > utilization needs to be returned.
[jira] [Updated] (YARN-6578) Return container resource utilization from NM ContainerStatus call
[ https://issues.apache.org/jira/browse/YARN-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6578: Attachment: YARN-6578.001.patch > Return container resource utilization from NM ContainerStatus call > -- > > Key: YARN-6578 > URL: https://issues.apache.org/jira/browse/YARN-6578 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Yang Wang > Attachments: YARN-6578.001.patch > > > When the applicationMaster wants to change(increase/decrease) resources of an > allocated container, resource utilization is an important reference indicator > for decision making. So, when AM call NMClient.getContainerStatus, resource > utilization needs to be returned.
[jira] [Commented] (YARN-6578) Return container resource utilization from NM ContainerStatus call
[ https://issues.apache.org/jira/browse/YARN-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004749#comment-16004749 ] Yang Wang commented on YARN-6578: - [~Naganarasimha], I have uploaded a WIP patch. > Return container resource utilization from NM ContainerStatus call > -- > > Key: YARN-6578 > URL: https://issues.apache.org/jira/browse/YARN-6578 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Yang Wang > Attachments: YARN-6578.001.patch > > > When the applicationMaster wants to change(increase/decrease) resources of an > allocated container, resource utilization is an important reference indicator > for decision making. So, when AM call NMClient.getContainerStatus, resource > utilization needs to be returned.
[jira] [Created] (YARN-6589) Recover all resources when NM restart
Yang Wang created YARN-6589: --- Summary: Recover all resources when NM restart Key: YARN-6589 URL: https://issues.apache.org/jira/browse/YARN-6589 Project: Hadoop YARN Issue Type: Bug Reporter: Yang Wang When the NM restarts, containers are recovered. However, only the memory and vcores of the capability are recovered; all resource types need to be recovered. {code:title=ContainerImpl.java} // resource capability had been updated before NM was down this.resource = Resource.newInstance(recoveredCapability.getMemorySize(), recoveredCapability.getVirtualCores()); {code} It should be like this: {code:title=ContainerImpl.java} // resource capability had been updated before NM was down // need to recover all resources, not only memory and vcores this.resource = Resources.clone(recoveredCapability); {code}
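The effect of the YARN-6589 bug can be shown with a small self-contained sketch. This is not YARN's `Resource` API; the map-based `cloneAll`/`cloneMemVcoresOnly` helpers below are hypothetical stand-ins for `Resources.clone(...)` and `Resource.newInstance(mem, vcores)`. It demonstrates why rebuilding a resource from only memory and vcores silently drops any extended resource types (for example GPUs), while a full clone preserves them.

```java
import java.util.HashMap;
import java.util.Map;

// Model resources as a name -> amount map. Rebuilding from just memory
// and vcores (the buggy recovery path) loses extended resource types;
// cloning the whole thing (the fix) keeps them.
public class RecoverResourceSketch {
  // Analogous to Resources.clone(recoveredCapability).
  static Map<String, Long> cloneAll(Map<String, Long> recovered) {
    return new HashMap<>(recovered);
  }

  // Analogous to Resource.newInstance(getMemorySize(), getVirtualCores()).
  static Map<String, Long> cloneMemVcoresOnly(Map<String, Long> recovered) {
    Map<String, Long> r = new HashMap<>();
    r.put("memory-mb", recovered.get("memory-mb"));
    r.put("vcores", recovered.get("vcores"));
    return r;
  }

  public static void main(String[] args) {
    Map<String, Long> recovered = Map.of(
        "memory-mb", 4096L, "vcores", 2L, "yarn.io/gpu", 1L);
    System.out.println(
        cloneMemVcoresOnly(recovered).containsKey("yarn.io/gpu")); // false
    System.out.println(
        cloneAll(recovered).containsKey("yarn.io/gpu"));           // true
  }
}
```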
[jira] [Created] (YARN-6630) Container worker dir could not recover when NM restart
Yang Wang created YARN-6630: --- Summary: Container worker dir could not recover when NM restart Key: YARN-6630 URL: https://issues.apache.org/jira/browse/YARN-6630 Project: Hadoop YARN Issue Type: Bug Reporter: Yang Wang When ContainerRetryPolicy is NEVER_RETRY, the container work dir will not be saved in the NM state store. Then, when the NM restarts, container.workDir is null, which may cause other exceptions. {code:title=ContainerLaunch.java} ... private void recordContainerWorkDir(ContainerId containerId, String workDir) throws IOException{ container.setWorkDir(workDir); if (container.isRetryContextSet()) { context.getNMStateStore().storeContainerWorkDir(containerId, workDir); } } {code} {code:title=ContainerImpl.java} static class ResourceLocalizedWhileRunningTransition extends ContainerTransition { ... String linkFile = new Path(container.workDir, link).toString(); ... {code}
[jira] [Updated] (YARN-6630) Container worker dir could not recover when NM restart
[ https://issues.apache.org/jira/browse/YARN-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6630: Description: When yarn.nodemanager.recovery.enabled is true and ContainerRetryPolicy is NEVER_RETRY, container worker dir will not be saved in NM state store. {code:title=ContainerLaunch.java} ... private void recordContainerWorkDir(ContainerId containerId, String workDir) throws IOException{ container.setWorkDir(workDir); if (container.isRetryContextSet()) { context.getNMStateStore().storeContainerWorkDir(containerId, workDir); } } {code} Then NM restarts, container.workDir is null, and may cause other exceptions. {code:title=ContainerImpl.java} static class ResourceLocalizedWhileRunningTransition extends ContainerTransition { ... String linkFile = new Path(container.workDir, link).toString(); ... {code} {code} java.lang.IllegalArgumentException: Can not create a Path from a null string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:159) at org.apache.hadoop.fs.Path.(Path.java:175) at org.apache.hadoop.fs.Path.(Path.java:110) ... ... {code} was: When ContainerRetryPolicy is NEVER_RETRY, container worker dir will not be saved in NM state store. Then NM restarts, container.workDir is null, and may cause other exceptions. {code:title=ContainerLaunch.java} ... private void recordContainerWorkDir(ContainerId containerId, String workDir) throws IOException{ container.setWorkDir(workDir); if (container.isRetryContextSet()) { context.getNMStateStore().storeContainerWorkDir(containerId, workDir); } } {code} {code:title=ContainerImpl.java} static class ResourceLocalizedWhileRunningTransition extends ContainerTransition { ... String linkFile = new Path(container.workDir, link).toString(); ... 
{code} > Container worker dir could not recover when NM restart > -- > > Key: YARN-6630 > URL: https://issues.apache.org/jira/browse/YARN-6630 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang > > When yarn.nodemanager.recovery.enabled is true and ContainerRetryPolicy is > NEVER_RETRY, container worker dir will not be saved in NM state store. > {code:title=ContainerLaunch.java} > ... > private void recordContainerWorkDir(ContainerId containerId, > String workDir) throws IOException{ > container.setWorkDir(workDir); > if (container.isRetryContextSet()) { > context.getNMStateStore().storeContainerWorkDir(containerId, workDir); > } > } > {code} > Then NM restarts, container.workDir is null, and may cause other exceptions. > {code:title=ContainerImpl.java} > static class ResourceLocalizedWhileRunningTransition > extends ContainerTransition { > ... > String linkFile = new Path(container.workDir, link).toString(); > ... > {code} > {code} > java.lang.IllegalArgumentException: Can not create a Path from a null string > at org.apache.hadoop.fs.Path.checkPathArg(Path.java:159) > at org.apache.hadoop.fs.Path.(Path.java:175) > at org.apache.hadoop.fs.Path.(Path.java:110) > ... ... > {code}
[jira] [Commented] (YARN-6630) Container worker dir could not recover when NM restart
[ https://issues.apache.org/jira/browse/YARN-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16022445#comment-16022445 ] Yang Wang commented on YARN-6630: - When yarn.nodemanager.recovery.enabled is true, the NM will not clear any work dirs. However, container.workDir is not recovered and is null. > Container worker dir could not recover when NM restart > -- > > Key: YARN-6630 > URL: https://issues.apache.org/jira/browse/YARN-6630 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang > > When yarn.nodemanager.recovery.enabled is true and ContainerRetryPolicy is > NEVER_RETRY, container worker dir will not be saved in NM state store. > {code:title=ContainerLaunch.java} > ... > private void recordContainerWorkDir(ContainerId containerId, > String workDir) throws IOException{ > container.setWorkDir(workDir); > if (container.isRetryContextSet()) { > context.getNMStateStore().storeContainerWorkDir(containerId, workDir); > } > } > {code} > Then NM restarts, container.workDir is null, and may cause other exceptions. > {code:title=ContainerImpl.java} > static class ResourceLocalizedWhileRunningTransition > extends ContainerTransition { > ... > String linkFile = new Path(container.workDir, link).toString(); > ... > {code} > {code} > java.lang.IllegalArgumentException: Can not create a Path from a null string > at org.apache.hadoop.fs.Path.checkPathArg(Path.java:159) > at org.apache.hadoop.fs.Path.(Path.java:175) > at org.apache.hadoop.fs.Path.(Path.java:110) > ... ... > {code}
[jira] [Updated] (YARN-6630) Container worker dir could not recover when NM restart
[ https://issues.apache.org/jira/browse/YARN-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6630: Attachment: YARN-6630.001.patch > Container worker dir could not recover when NM restart > -- > > Key: YARN-6630 > URL: https://issues.apache.org/jira/browse/YARN-6630 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang > Attachments: YARN-6630.001.patch > > > When yarn.nodemanager.recovery.enabled is true and ContainerRetryPolicy is > NEVER_RETRY, container worker dir will not be saved in NM state store. > {code:title=ContainerLaunch.java} > ... > private void recordContainerWorkDir(ContainerId containerId, > String workDir) throws IOException{ > container.setWorkDir(workDir); > if (container.isRetryContextSet()) { > context.getNMStateStore().storeContainerWorkDir(containerId, workDir); > } > } > {code} > Then NM restarts, container.workDir is null, and may cause other exceptions. > {code:title=ContainerImpl.java} > static class ResourceLocalizedWhileRunningTransition > extends ContainerTransition { > ... > String linkFile = new Path(container.workDir, link).toString(); > ... > {code} > {code} > java.lang.IllegalArgumentException: Can not create a Path from a null string > at org.apache.hadoop.fs.Path.checkPathArg(Path.java:159) > at org.apache.hadoop.fs.Path.(Path.java:175) > at org.apache.hadoop.fs.Path.(Path.java:110) > ... ... > {code}
[jira] [Commented] (YARN-6630) Container worker dir could not recover when NM restart
[ https://issues.apache.org/jira/browse/YARN-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024097#comment-16024097 ] Yang Wang commented on YARN-6630: - Hi, [~jianhe], could you help review the patch? We have already hit this problem: after an NM restart, we sent a resource localization request while the container was running (YARN-1503), and the NM failed because of the following exception. Also, anywhere that uses *container.workDir* may cause a NullPointerException. {code} java.lang.IllegalArgumentException: Can not create a Path from a null string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:159) at org.apache.hadoop.fs.Path.(Path.java:175) at org.apache.hadoop.fs.Path.(Path.java:110) ... ... {code} > Container worker dir could not recover when NM restart > -- > > Key: YARN-6630 > URL: https://issues.apache.org/jira/browse/YARN-6630 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang > Attachments: YARN-6630.001.patch > > > When yarn.nodemanager.recovery.enabled is true and ContainerRetryPolicy is > NEVER_RETRY, container worker dir will not be saved in NM state store. > {code:title=ContainerLaunch.java} > ... > private void recordContainerWorkDir(ContainerId containerId, > String workDir) throws IOException{ > container.setWorkDir(workDir); > if (container.isRetryContextSet()) { > context.getNMStateStore().storeContainerWorkDir(containerId, workDir); > } > } > {code} > Then NM restarts, container.workDir is null, and may cause other exceptions. > {code:title=ContainerImpl.java} > static class ResourceLocalizedWhileRunningTransition > extends ContainerTransition { > ... > String linkFile = new Path(container.workDir, link).toString(); > ... > {code} > {code} > java.lang.IllegalArgumentException: Can not create a Path from a null string > at org.apache.hadoop.fs.Path.checkPathArg(Path.java:159) > at org.apache.hadoop.fs.Path.(Path.java:175) > at org.apache.hadoop.fs.Path.(Path.java:110) > ... ...
> {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6630) Container worker dir could not recover when NM restart
[ https://issues.apache.org/jira/browse/YARN-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025952#comment-16025952 ] Yang Wang commented on YARN-6630: - Yes, yarn.nodemanager.recovery.enabled=true and ContainerRetryPolicy=NEVER_RETRY are not contradictory. My point is that container.workDir always needs to be saved in the NM state store, regardless of the ContainerRetryPolicy.
[jira] [Commented] (YARN-6630) Container worker dir could not recover when NM restart
[ https://issues.apache.org/jira/browse/YARN-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025973#comment-16025973 ] Yang Wang commented on YARN-6630: - Even when ContainerRetryPolicy is NEVER_RETRY, container.workDir still needs to be saved in the NM state store. Otherwise it cannot be recovered and is null after an NM restart. {quote} We have already hit this problem: after an NM restart, sending a resource localization request while the container is running (YARN-1503) makes the NM fail with the following exception. More generally, any code that uses container.workDir may hit a NullPointerException. {code} java.lang.IllegalArgumentException: Can not create a Path from a null string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:159) at org.apache.hadoop.fs.Path.<init>(Path.java:175) at org.apache.hadoop.fs.Path.<init>(Path.java:110) ... ... {code} {quote}
[jira] [Updated] (YARN-6630) Container worker dir could not recover when NM restart
[ https://issues.apache.org/jira/browse/YARN-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6630: Description: When ContainerRetryPolicy is NEVER_RETRY, the container worker dir will not be saved in the NM state store. {code:title=ContainerLaunch.java} ... private void recordContainerWorkDir(ContainerId containerId, String workDir) throws IOException{ container.setWorkDir(workDir); if (container.isRetryContextSet()) { context.getNMStateStore().storeContainerWorkDir(containerId, workDir); } } {code} Then, after the NM restarts, container.workDir cannot be recovered and is null, which may cause other exceptions. We have already hit this problem: after an NM restart, sending a resource localization request while the container is running (YARN-1503) makes the NM fail with the following exception. So container.workDir always needs to be saved in the NM state store. {code:title=ContainerImpl.java} static class ResourceLocalizedWhileRunningTransition extends ContainerTransition { ... String linkFile = new Path(container.workDir, link).toString(); ... {code} {code} java.lang.IllegalArgumentException: Can not create a Path from a null string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:159) at org.apache.hadoop.fs.Path.<init>(Path.java:175) at org.apache.hadoop.fs.Path.<init>(Path.java:110) ... ... {code} was: When yarn.nodemanager.recovery.enabled is true and ContainerRetryPolicy is NEVER_RETRY, the container worker dir will not be saved in the NM state store. {code:title=ContainerLaunch.java} ... private void recordContainerWorkDir(ContainerId containerId, String workDir) throws IOException{ container.setWorkDir(workDir); if (container.isRetryContextSet()) { context.getNMStateStore().storeContainerWorkDir(containerId, workDir); } } {code} Then, after the NM restarts, container.workDir is null, which may cause other exceptions. {code:title=ContainerImpl.java} static class ResourceLocalizedWhileRunningTransition extends ContainerTransition { ... String linkFile = new Path(container.workDir, link).toString(); ... {code} {code} java.lang.IllegalArgumentException: Can not create a Path from a null string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:159) at org.apache.hadoop.fs.Path.<init>(Path.java:175) at org.apache.hadoop.fs.Path.<init>(Path.java:110) ... ... {code}
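The fix described in this issue, persisting the container work dir unconditionally instead of only when a retry context is set, can be sketched with simplified stand-in classes. The names below (WorkDirRecoverySketch, StateStore, recordBuggy, recordFixed) are hypothetical and are not the actual YARN NMStateStoreService API; this is only an illustration of the before/after behavior.

```java
import java.util.HashMap;
import java.util.Map;

public class WorkDirRecoverySketch {

    // Simplified stand-in for the NM state store: persists per-container metadata.
    static class StateStore {
        final Map<String, String> workDirs = new HashMap<>();
        void storeContainerWorkDir(String containerId, String workDir) {
            workDirs.put(containerId, workDir);
        }
        String recoverWorkDir(String containerId) {
            return workDirs.get(containerId);  // null if it was never stored
        }
    }

    // Buggy behavior: the work dir is stored only when a retry context is set,
    // so NEVER_RETRY containers lose it across an NM restart.
    static void recordBuggy(StateStore store, String containerId, String workDir,
                            boolean retryContextSet) {
        if (retryContextSet) {
            store.storeContainerWorkDir(containerId, workDir);
        }
    }

    // Fixed behavior: always store the work dir, independent of the retry policy.
    static void recordFixed(StateStore store, String containerId, String workDir) {
        store.storeContainerWorkDir(containerId, workDir);
    }

    public static void main(String[] args) {
        StateStore buggy = new StateStore();
        recordBuggy(buggy, "container_1", "/data/yarn/nm-local-dir/container_1", false);
        // After a simulated restart, the recovered work dir is null, which is
        // exactly what later makes new Path(container.workDir, link) fail.
        System.out.println("buggy recovered: " + buggy.recoverWorkDir("container_1"));

        StateStore fixed = new StateStore();
        recordFixed(fixed, "container_1", "/data/yarn/nm-local-dir/container_1");
        System.out.println("fixed recovered: " + fixed.recoverWorkDir("container_1"));
    }
}
```

The sketch shows why the bug only bites NEVER_RETRY containers: the store call is gated on the retry context, while recovery after restart needs the work dir for every running container.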
[jira] [Updated] (YARN-6589) Recover all resources when NM restart
[ https://issues.apache.org/jira/browse/YARN-6589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6589: Attachment: YARN-6589.001.patch > Recover all resources when NM restart > - > > Key: YARN-6589 > URL: https://issues.apache.org/jira/browse/YARN-6589 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Yang Wang >Assignee: Yang Wang >Priority: Blocker > Attachments: YARN-6589.001.patch, YARN-6589-YARN-3926.001.patch > > > When the NM restarts, containers are recovered. However, only memory and > vcores in the capability are recovered; all resource types need to be recovered. > {code:title=ContainerImpl.java} > // resource capability had been updated before NM was down > this.resource = > Resource.newInstance(recoveredCapability.getMemorySize(), > recoveredCapability.getVirtualCores()); > {code} > It should be like this: > {code:title=ContainerImpl.java} > // resource capability had been updated before NM was down > // need to recover all resource types, not only memory and vcores > this.resource = Resources.clone(recoveredCapability); > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6589) Recover all resources when NM restart
[ https://issues.apache.org/jira/browse/YARN-6589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172660#comment-16172660 ] Yang Wang commented on YARN-6589: - Thanks for your comment, [~leftnoteasy]. ContainerImpl.java on trunk has since been changed, and I think this bug has been fixed there. I have just updated the test.
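The one-line fix proposed in YARN-6589, cloning the whole recovered capability instead of rebuilding it from memory and vcores, can be illustrated with a simplified stand-in resource type. SimpleResource and the method names here are hypothetical, not the real org.apache.hadoop.yarn.api.records.Resource or Resources API; the sketch only demonstrates why copying two fields silently drops extended resource types.

```java
import java.util.HashMap;
import java.util.Map;

public class ResourceRecoverySketch {

    // Simplified stand-in for a YARN resource capability: named types -> amounts.
    static class SimpleResource {
        final Map<String, Long> values = new HashMap<>();
        SimpleResource set(String name, long amount) {
            values.put(name, amount);
            return this;
        }
        long get(String name) {
            return values.getOrDefault(name, 0L);
        }
    }

    // Buggy recovery: rebuilds the capability from memory and vcores only,
    // silently dropping any extended resource types (GPUs, FPGAs, ...).
    static SimpleResource recoverBuggy(SimpleResource recovered) {
        return new SimpleResource()
                .set("memory-mb", recovered.get("memory-mb"))
                .set("vcores", recovered.get("vcores"));
    }

    // Fixed recovery: copy every resource type, analogous in spirit to
    // Resources.clone(recoveredCapability) in the patch.
    static SimpleResource recoverFixed(SimpleResource recovered) {
        SimpleResource clone = new SimpleResource();
        clone.values.putAll(recovered.values);
        return clone;
    }

    public static void main(String[] args) {
        SimpleResource capability = new SimpleResource()
                .set("memory-mb", 4096).set("vcores", 2).set("yarn.io/gpu", 1);
        System.out.println("buggy gpu: " + recoverBuggy(capability).get("yarn.io/gpu")); // 0, lost
        System.out.println("fixed gpu: " + recoverFixed(capability).get("yarn.io/gpu")); // 1
    }
}
```

Cloning the full map keeps recovery future-proof: any resource type added later is preserved across an NM restart without touching the recovery path again.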
[jira] [Updated] (YARN-6630) Container worker dir could not recover when NM restart
[ https://issues.apache.org/jira/browse/YARN-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6630: Attachment: YARN-6630.002.patch > Container worker dir could not recover when NM restart > -- > > Key: YARN-6630 > URL: https://issues.apache.org/jira/browse/YARN-6630 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Yang Wang > Attachments: YARN-6630.001.patch, YARN-6630.002.patch
[jira] [Commented] (YARN-6630) Container worker dir could not recover when NM restart
[ https://issues.apache.org/jira/browse/YARN-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172698#comment-16172698 ] Yang Wang commented on YARN-6630: - Thanks for your comments, [~djp]. I have updated the patch and rebased it on trunk.