[jira] [Commented] (YARN-7007) NPE in RM while using YarnClient.getApplications()
[ https://issues.apache.org/jira/browse/YARN-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012529#comment-17012529 ] Yang Wang commented on YARN-7007:
-
[~cheersyang] [~Tao Yang] We came across the same problem in FLINK-15534, and I think many users are running Flink with a bundled hadoop-2.8.x. It would be very helpful if this fix could be backported to 2.8 and released in 2.8.6. Could you help with this?

> NPE in RM while using YarnClient.getApplications()
> --
>
> Key: YARN-7007
> URL: https://issues.apache.org/jira/browse/YARN-7007
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.2
> Reporter: Lingfeng Su
> Assignee: Lingfeng Su
> Priority: Major
> Labels: patch
> Fix For: 2.9.0, 3.0.0-beta1
>
> Attachments: YARN-7007.001.patch
>
> {code:java}
> java.lang.NullPointerException: java.lang.NullPointerException
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.getAggregateAppResourceUsage(RMAppAttemptMetrics.java:118)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.getApplicationResourceUsageReport(RMAppAttemptImpl.java:857)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.createAndGetApplicationReport(RMAppImpl.java:629)
> at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.verifyAndCreateAppReport(ClientRMService.java:972)
> at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplications(ClientRMService.java:898)
> at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplications(ClientRMService.java:734)
> at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplications(ApplicationClientProtocolPBServiceImpl.java:239)
> at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:441)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2198)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1738)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2196)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85)
> at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122)
> at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplications(ApplicationClientProtocolPBClientImpl.java:254)
> at sun.reflect.GeneratedMethodAccessor731.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy161.getApplications(Unknown Source)
> at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:479)
> at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:456)
> {code}
>
> When I use YarnClient.getApplications() to get all applications from the RM, it occasionally throws an NPE.
> {code:java}
> RMAppAttempt currentAttempt = rmContext.getRMApps()
>     .get(attemptId.getApplicationId()).getCurrentAppAttempt();
> {code}
> If the application id is no longer present in the ConcurrentMap returned by getRMApps(), the get() call returns null and the chained getCurrentAppAttempt() call throws an NPE.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
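The failing lookup quoted above chains .get() on a map that may no longer contain the application. A minimal, self-contained sketch of the usual null guard, with plain Java collections standing in for rmContext.getRMApps(); the class and method names here are hypothetical, not the actual YARN patch:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Toy model of the RM-side lookup: the chained call
// rmApps.get(appId).getCurrentAppAttempt() NPEs when the app was already
// removed from the map, so the map result must be null-checked first.
public class SafeAppLookup {
    static class AppAttempt {}
    static class App {
        AppAttempt getCurrentAppAttempt() { return new AppAttempt(); }
    }

    // Returns null instead of throwing when the application is unknown.
    static AppAttempt currentAttemptOrNull(ConcurrentMap<String, App> rmApps,
                                           String appId) {
        App app = rmApps.get(appId);  // may be null after app removal
        return app == null ? null : app.getCurrentAppAttempt();
    }

    public static void main(String[] args) {
        ConcurrentMap<String, App> rmApps = new ConcurrentHashMap<>();
        rmApps.put("application_1", new App());
        if (currentAttemptOrNull(rmApps, "application_1") == null)
            throw new AssertionError("known app must resolve");
        if (currentAttemptOrNull(rmApps, "application_2") != null)
            throw new AssertionError("unknown app must yield null, not NPE");
        System.out.println("ok");
    }
}
```

The caller can then skip the attempt-specific fields of the report when the result is null, instead of failing the whole getApplications() RPC.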
[jira] [Commented] (YARN-7007) NPE in RM while using YarnClient.getApplications()
[ https://issues.apache.org/jira/browse/YARN-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012913#comment-17012913 ] Yang Wang commented on YARN-7007:
-
[~Tao Yang] Cool, many thanks.

> NPE in RM while using YarnClient.getApplications()
> --
>
> Key: YARN-7007
> URL: https://issues.apache.org/jira/browse/YARN-7007
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.2
> Reporter: Lingfeng Su
> Assignee: Lingfeng Su
> Priority: Major
> Labels: patch
> Fix For: 2.9.0, 3.0.0-beta1, 2.8.6
>
> Attachments: YARN-7007.001.patch
[jira] [Created] (YARN-8153) Guaranteed containers always stay in SCHEDULED on NM after restart
Yang Wang created YARN-8153:
---
Summary: Guaranteed containers always stay in SCHEDULED on NM after restart
Key: YARN-8153
URL: https://issues.apache.org/jira/browse/YARN-8153
Project: Hadoop YARN
Issue Type: Bug
Reporter: Yang Wang

When NM recovery is enabled, some containers stay in SCHEDULED forever after an NM restart because the NM believes there are insufficient resources. The root cause is that utilizationTracker.addContainerResources is called twice during restart, so recovered containers are counted against the node's resources twice.
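The double accounting described above can be modeled with a toy tracker, assuming the fix is to make resource addition idempotent per container id; the class and field names are illustrative, not the actual utilizationTracker code:

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the NM's utilization tracker: if the recovery path re-adds
// resources for a container that was already accounted, the available
// headroom is double-counted away and new containers stay SCHEDULED.
// Guarding with the set of already-added container ids keeps the
// accounting idempotent across restart.
public class IdempotentTracker {
    private final Set<String> added = new HashSet<>();
    private long usedMemMB = 0;

    // Returns true only when the container was actually accounted.
    boolean addContainerResources(String containerId, long memMB) {
        if (!added.add(containerId)) {
            return false;  // already counted, e.g. by recovery
        }
        usedMemMB += memMB;
        return true;
    }

    long getUsedMemMB() { return usedMemMB; }

    public static void main(String[] args) {
        IdempotentTracker t = new IdempotentTracker();
        t.addContainerResources("container_01", 1024);  // recovery path
        t.addContainerResources("container_01", 1024);  // duplicate: ignored
        if (t.getUsedMemMB() != 1024) throw new AssertionError();
        System.out.println("used=" + t.getUsedMemMB() + "MB");
    }
}
```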
[jira] [Assigned] (YARN-8153) Guaranteed containers always stay in SCHEDULED on NM after restart
[ https://issues.apache.org/jira/browse/YARN-8153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang reassigned YARN-8153:
---
Assignee: Yang Wang
[jira] [Updated] (YARN-8153) Guaranteed containers always stay in SCHEDULED on NM after restart
[ https://issues.apache.org/jira/browse/YARN-8153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-8153:
Attachment: YARN-8153.001.patch
[jira] [Updated] (YARN-8153) Guaranteed containers always stay in SCHEDULED on NM after restart
[ https://issues.apache.org/jira/browse/YARN-8153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-8153:
Attachment: YARN-8153.002.patch
[jira] [Commented] (YARN-8153) Guaranteed containers always stay in SCHEDULED on NM after restart
[ https://issues.apache.org/jira/browse/YARN-8153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16436727#comment-16436727 ] Yang Wang commented on YARN-8153:
-
[~cheersyang] Thanks for your comment. I have fixed the UT failure.
[jira] [Commented] (YARN-8153) Guaranteed containers always stay in SCHEDULED on NM after restart
[ https://issues.apache.org/jira/browse/YARN-8153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437186#comment-16437186 ] Yang Wang commented on YARN-8153:
-
[~cheersyang] Thanks for your commit.

> Guaranteed containers always stay in SCHEDULED on NM after restart
> --
>
> Key: YARN-8153
> URL: https://issues.apache.org/jira/browse/YARN-8153
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Yang Wang
> Assignee: Yang Wang
> Priority: Major
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8153.001.patch, YARN-8153.002.patch
[jira] [Updated] (YARN-6630) Container worker dir could not recover when NM restart
[ https://issues.apache.org/jira/browse/YARN-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6630:
Attachment: YARN-6630.003.patch

> Container worker dir could not recover when NM restart
> --
>
> Key: YARN-6630
> URL: https://issues.apache.org/jira/browse/YARN-6630
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Yang Wang
> Assignee: Yang Wang
> Priority: Major
> Attachments: YARN-6630.001.patch, YARN-6630.002.patch, YARN-6630.003.patch
>
> When ContainerRetryPolicy is NEVER_RETRY, the container work dir is not saved in the NM state store:
> {code:title=ContainerLaunch.java}
> ...
> private void recordContainerWorkDir(ContainerId containerId,
>     String workDir) throws IOException {
>   container.setWorkDir(workDir);
>   if (container.isRetryContextSet()) {
>     context.getNMStateStore().storeContainerWorkDir(containerId, workDir);
>   }
> }
> {code}
> After the NM restarts, container.workDir cannot be recovered and stays null, which can cause exceptions. We already hit one such problem: after an NM restart, sending a resource localization request while the container is running (YARN-1503) makes the NM fail with the exception below.
> So container.workDir always needs to be saved in the NM state store.
> {code:title=ContainerImpl.java}
> static class ResourceLocalizedWhileRunningTransition
>     extends ContainerTransition {
> ...
>   String linkFile = new Path(container.workDir, link).toString();
> ...
> {code}
> {code}
> java.lang.IllegalArgumentException: Can not create a Path from a null string
> at org.apache.hadoop.fs.Path.checkPathArg(Path.java:159)
> at org.apache.hadoop.fs.Path.<init>(Path.java:175)
> at org.apache.hadoop.fs.Path.<init>(Path.java:110)
> ... ...
> {code}
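A minimal sketch of the behavior the issue argues for, persisting the work dir unconditionally instead of only when a retry context is set; StateStore and the method names below are stand-ins for the real NMStateStoreService API:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the proposed fix: always persist the container work dir in
// the NM state store so it survives an NM restart, regardless of the
// container's retry policy. Without this, container.workDir recovers as
// null and new Path(container.workDir, link) throws.
public class WorkDirRecovery {
    static class StateStore {
        final Map<String, String> workDirs = new HashMap<>();
        void storeContainerWorkDir(String id, String dir) { workDirs.put(id, dir); }
        String recoverWorkDir(String id) { return workDirs.get(id); }
    }

    // Before the fix this store call was guarded by isRetryContextSet();
    // here it is unconditional.
    static void recordContainerWorkDir(StateStore store, String containerId,
                                       String workDir) {
        store.storeContainerWorkDir(containerId, workDir);
    }

    public static void main(String[] args) {
        StateStore store = new StateStore();
        recordContainerWorkDir(store, "container_01", "/tmp/nm-local/container_01");
        // Simulated NM restart: the work dir must be recoverable, never null.
        if (store.recoverWorkDir("container_01") == null)
            throw new AssertionError("work dir lost across restart");
        System.out.println(store.recoverWorkDir("container_01"));
    }
}
```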
[jira] [Commented] (YARN-6589) Recover all resources when NM restart
[ https://issues.apache.org/jira/browse/YARN-6589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462233#comment-16462233 ] Yang Wang commented on YARN-6589:
-
ContainerImpl#getResource() has been changed to read the resource from the containerTokenIdentifier, and the containerTokenIdentifier is recovered correctly. Closing this JIRA as Won't Fix.

> Recover all resources when NM restart
> -
>
> Key: YARN-6589
> URL: https://issues.apache.org/jira/browse/YARN-6589
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Yang Wang
> Assignee: Yang Wang
> Priority: Blocker
> Attachments: YARN-6589-YARN-3926.001.patch, YARN-6589.001.patch, YARN-6589.002.patch
>
> When the NM restarts, containers are recovered. However, only the memory and vcores in the capability are recovered; all resource types need to be recovered.
> {code:title=ContainerImpl.java}
> // resource capability had been updated before NM was down
> this.resource =
>     Resource.newInstance(recoveredCapability.getMemorySize(),
>         recoveredCapability.getVirtualCores());
> {code}
> It should be like this.
> {code:title=ContainerImpl.java}
> // resource capability had been updated before NM was down
> // need to recover all resource types, not only memory and vcores
> this.resource = Resources.clone(recoveredCapability);
> {code}
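To see why the clone matters, here is a self-contained model in which a resource is just a map from resource-type name to value: rebuilding from only memory and vcores silently drops extended resource types, while a clone keeps them. The map-based classes are illustrative stand-ins, not the real org.apache.hadoop.yarn Resource/Resources API:

```java
import java.util.HashMap;
import java.util.Map;

// Models the bug quoted above: newInstance(mem, vcores) reconstructs a
// resource from two fields only, losing any extended resource types
// (e.g. a GPU count) that the recovered capability carried; a full clone
// preserves every type.
public class RecoverAllResources {
    static Map<String, Long> newInstance(long memMB, long vcores) {
        Map<String, Long> r = new HashMap<>();
        r.put("memory-mb", memMB);
        r.put("vcores", vcores);
        return r;                          // extended resources lost
    }

    static Map<String, Long> cloneOf(Map<String, Long> capability) {
        return new HashMap<>(capability);  // all resource types preserved
    }

    public static void main(String[] args) {
        Map<String, Long> recovered = new HashMap<>();
        recovered.put("memory-mb", 2048L);
        recovered.put("vcores", 2L);
        recovered.put("yarn.io/gpu", 1L);  // extended resource type

        if (newInstance(2048, 2).containsKey("yarn.io/gpu"))
            throw new AssertionError("newInstance should have dropped it");
        if (!cloneOf(recovered).containsKey("yarn.io/gpu"))
            throw new AssertionError("clone must keep all resource types");
        System.out.println("clone keeps " + cloneOf(recovered).size() + " resource types");
    }
}
```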
[jira] [Updated] (YARN-6589) Recover all resources when NM restart
[ https://issues.apache.org/jira/browse/YARN-6589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6589:
Release Note: (was: ContainerImpl#getResource() has been changed to get from containerTokenIdentifier and containerTokenIdentifier could be recovered correctly. Just close this jira as Won't Fix)
[jira] [Updated] (YARN-6578) Return container resource utilization from NM ContainerStatus call
[ https://issues.apache.org/jira/browse/YARN-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6578:
Attachment: YARN-6578.002.patch

> Return container resource utilization from NM ContainerStatus call
> --
>
> Key: YARN-6578
> URL: https://issues.apache.org/jira/browse/YARN-6578
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Yang Wang
> Assignee: Yang Wang
> Priority: Major
> Attachments: YARN-6578.001.patch, YARN-6578.002.patch
>
> When the ApplicationMaster wants to change (increase/decrease) the resources of an allocated container, resource utilization is an important input to that decision. So when the AM calls NMClient.getContainerStatus, resource utilization should be returned. Container resource utilization also needs to be reported to the RM to enable better scheduling.
[jira] [Updated] (YARN-6578) Return container resource utilization from NM ContainerStatus call
[ https://issues.apache.org/jira/browse/YARN-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6578:
Description:
When the ApplicationMaster wants to change (increase/decrease) the resources of an allocated container, resource utilization is an important input to that decision. So when the AM calls NMClient.getContainerStatus, resource utilization should be returned. Container resource utilization also needs to be reported to the RM to enable better scheduling. So put resource utilization in ContainerStatus.

was:
When the ApplicationMaster wants to change (increase/decrease) the resources of an allocated container, resource utilization is an important input to that decision. So when the AM calls NMClient.getContainerStatus, resource utilization should be returned. Container resource utilization also needs to be reported to the RM to enable better scheduling.
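A sketch of how an AM might consume utilization if it were carried on ContainerStatus as proposed; all types below are simplified stand-ins for ContainerStatus/ResourceUtilization, and the 50% threshold is an arbitrary illustration, not anything from the patch:

```java
// Toy AM-side decision logic: shrink a container whose measured physical
// memory usage is well below its allocation. The point is only that a
// utilization field on the status makes this decision possible.
public class UtilizationDrivenResize {
    static class ResourceUtilization {
        final int pmemMB;
        final float cpuVcores;
        ResourceUtilization(int pmemMB, float cpuVcores) {
            this.pmemMB = pmemMB;
            this.cpuVcores = cpuVcores;
        }
    }

    static class ContainerStatus {
        final int allocatedMemMB;
        final ResourceUtilization utilization;
        ContainerStatus(int allocatedMemMB, ResourceUtilization u) {
            this.allocatedMemMB = allocatedMemMB;
            this.utilization = u;
        }
    }

    // Decrease when sustained usage is under half the allocation
    // (illustrative threshold).
    static boolean shouldDecrease(ContainerStatus s) {
        return s.utilization.pmemMB < s.allocatedMemMB * 0.5;
    }

    public static void main(String[] args) {
        ContainerStatus idle = new ContainerStatus(4096, new ResourceUtilization(1024, 0.2f));
        ContainerStatus busy = new ContainerStatus(4096, new ResourceUtilization(3800, 3.5f));
        if (!shouldDecrease(idle) || shouldDecrease(busy)) throw new AssertionError();
        System.out.println("decrease idle container: " + shouldDecrease(idle));
    }
}
```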
[jira] [Commented] (YARN-6578) Return container resource utilization from NM ContainerStatus call
[ https://issues.apache.org/jira/browse/YARN-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465620#comment-16465620 ] Yang Wang commented on YARN-6578:
-
[~cheersyang], thanks for your comment. I have fixed the findbugs issues. The failed UT seems to be another issue, [YARN-8244|https://issues.apache.org/jira/browse/YARN-8244]. The checkstyle issues do not need to be fixed: like the other metric variables, vMemMBsStat and vMemMBQuantiles can be public in ContainerMetrics.java.
[jira] [Updated] (YARN-6578) Return container resource utilization from NM ContainerStatus call
[ https://issues.apache.org/jira/browse/YARN-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6578:
Attachment: YARN-6578.003.patch
[jira] [Commented] (YARN-6578) Return container resource utilization from NM ContainerStatus call
[ https://issues.apache.org/jira/browse/YARN-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482142#comment-16482142 ] Yang Wang commented on YARN-6578:
-
[~Naganarasimha] Thanks for your comment. Currently we just return pmem/vmem/vcores in ContainerStatus#getUtilization. As you mentioned, do we need to make ResourceUtilization extensible like Resource? Getting the utilization of extensible resources (gpu/fpga) is not as easy as pmem/vmem/vcores, and in most use cases (scheduling opportunistic containers or increasing/decreasing container resources) the utilization of pmem/vmem/vcores is enough.
[jira] [Created] (YARN-8331) Race condition in NM container launched after done
Yang Wang created YARN-8331:
---
Summary: Race condition in NM container launched after done
Key: YARN-8331
URL: https://issues.apache.org/jira/browse/YARN-8331
Project: Hadoop YARN
Issue Type: Bug
Reporter: Yang Wang

While a container is launching in ContainerLaunch#launchContainer (state SCHEDULED), a kill event is sent to the container, driving it SCHEDULED->KILLING->DONE. ContainerLaunch then still sends the CONTAINER_LAUNCHED event and starts the container processes. These orphaned container processes are never cleaned up.

{code:java}
2018-05-21 13:11:56,114 INFO [Thread-11] nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(94)) - USER=nobody OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_0_ CONTAINERID=container_0__01_00
2018-05-21 13:11:56,114 INFO [NM ContainerManager dispatcher] application.ApplicationImpl (ApplicationImpl.java:handle(632)) - Application application_0_ transitioned from NEW to INITING
2018-05-21 13:11:56,114 INFO [NM ContainerManager dispatcher] application.ApplicationImpl (ApplicationImpl.java:transition(446)) - Adding container_0__01_00 to application application_0_
2018-05-21 13:11:56,118 INFO [NM ContainerManager dispatcher] application.ApplicationImpl (ApplicationImpl.java:handle(632)) - Application application_0_ transitioned from INITING to RUNNING
2018-05-21 13:11:56,119 INFO [NM ContainerManager dispatcher] container.ContainerImpl (ContainerImpl.java:handle(2111)) - Container container_0__01_00 transitioned from NEW to SCHEDULED
2018-05-21 13:11:56,119 INFO [NM ContainerManager dispatcher] containermanager.AuxServices (AuxServices.java:handle(220)) - Got event CONTAINER_INIT for appId application_0_
2018-05-21 13:11:56,119 INFO [NM ContainerManager dispatcher] scheduler.ContainerScheduler (ContainerScheduler.java:startContainer(504)) - Starting container [container_0__01_00]
2018-05-21 13:11:56,226 INFO [NM ContainerManager dispatcher] container.ContainerImpl (ContainerImpl.java:handle(2111)) - Container container_0__01_00 transitioned from SCHEDULED to KILLING
2018-05-21 13:11:56,227 INFO [NM ContainerManager dispatcher] containermanager.TestContainerManager (BaseContainerManagerTest.java:delete(287)) - Psuedo delete: user - nobody, type - FILE
2018-05-21 13:11:56,227 INFO [NM ContainerManager dispatcher] nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(94)) - USER=nobody OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_0_ CONTAINERID=container_0__01_00
2018-05-21 13:11:56,238 INFO [NM ContainerManager dispatcher] container.ContainerImpl (ContainerImpl.java:handle(2111)) - Container container_0__01_00 transitioned from KILLING to DONE
2018-05-21 13:11:56,238 INFO [NM ContainerManager dispatcher] application.ApplicationImpl (ApplicationImpl.java:transition(489)) - Removing container_0__01_00 from application application_0_
2018-05-21 13:11:56,239 INFO [NM ContainerManager dispatcher] monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:onStopMonitoringContainer(932)) - Stopping resource-monitoring for container_0__01_00
2018-05-21 13:11:56,239 INFO [NM ContainerManager dispatcher] containermanager.AuxServices (AuxServices.java:handle(220)) - Got event CONTAINER_STOP for appId application_0_
2018-05-21 13:11:56,274 WARN [NM ContainerManager dispatcher] container.ContainerImpl (ContainerImpl.java:handle(2106)) - Can't handle this event at current state: Current: [DONE], eventType: [CONTAINER_LAUNCHED], container: [container_0__01_00]
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CONTAINER_LAUNCHED at DONE
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:2104)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:104)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1525)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1518)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at org.apache.hadoop.yarn.event.AsyncDispatch
{code}
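The race above can be modeled with a tiny state machine in which the launch path checks the container's state before actually starting the process; this is an illustrative sketch of one way to avoid the orphaned process, not the actual ContainerImpl/ContainerLaunch fix:

```java
// Toy model of the race: a kill can drive the container
// SCHEDULED -> KILLING -> DONE before ContainerLaunch fires
// CONTAINER_LAUNCHED. Making the launch path re-check state under the
// same lock the kill path uses means a container killed mid-launch is
// never exec'd, so no process is leaked.
public class LaunchAfterDoneGuard {
    enum State { SCHEDULED, KILLING, DONE, RUNNING }

    private State state = State.SCHEDULED;

    synchronized void kill() {
        if (state == State.SCHEDULED) state = State.KILLING;
        if (state == State.KILLING) state = State.DONE;
    }

    // Returns false (skip starting the process) if the container was
    // already killed; otherwise transitions to RUNNING.
    synchronized boolean tryLaunch() {
        if (state != State.SCHEDULED) return false;  // killed meanwhile
        state = State.RUNNING;
        return true;
    }

    public static void main(String[] args) {
        LaunchAfterDoneGuard c = new LaunchAfterDoneGuard();
        c.kill();  // the kill wins the race
        if (c.tryLaunch()) throw new AssertionError("must not launch after DONE");
        System.out.println("launch skipped, no orphan process");
    }
}
```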
[jira] [Updated] (YARN-6589) Recover all resources when NM restart
[ https://issues.apache.org/jira/browse/YARN-6589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6589: Attachment: YARN-6589.002.patch The constructor in ContainerImpl has changed, so we no longer need to recover the resource here: the resource is now obtained from the containerTokenIdentifier, which is recovered properly. So I updated the patch to just add a test for this case. > Recover all resources when NM restart > - > > Key: YARN-6589 > URL: https://issues.apache.org/jira/browse/YARN-6589 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Yang Wang >Assignee: Yang Wang >Priority: Blocker > Attachments: YARN-6589-YARN-3926.001.patch, YARN-6589.001.patch, > YARN-6589.002.patch > > > When the NM restarts, containers are recovered. However, only memory and > vcores in the capability are recovered; all resource types need to be > recovered. > {code:title=ContainerImpl.java} > // resource capability had been updated before NM was down > this.resource = > Resource.newInstance(recoveredCapability.getMemorySize(), > recoveredCapability.getVirtualCores()); > {code} > It should be like this. > {code:title=ContainerImpl.java} > // resource capability had been updated before NM was down > // need to recover all resources, not only > this.resource = Resources.clone(recoveredCapability); > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
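The difference between the two recovery snippets above can be illustrated with a simplified stand-in for YARN's Resource. The class and method names below are illustrative, not the real Hadoop API: rebuilding from memory and vcores alone drops any extended resource types (e.g. GPUs from the YARN-3926 work), while a full clone preserves them.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for YARN's Resource (illustrative, not the Hadoop API).
// All resource values live in one map so the loss is easy to see.
public class ResourceRecoverySketch {
    final Map<String, Long> values = new HashMap<>();

    // Mirrors Resource.newInstance(memory, vcores): only two types survive.
    static ResourceRecoverySketch newInstance(long memMB, long vcores) {
        ResourceRecoverySketch r = new ResourceRecoverySketch();
        r.values.put("memory-mb", memMB);
        r.values.put("vcores", vcores);
        return r;
    }

    // Mirrors Resources.clone(...): every resource type is preserved.
    ResourceRecoverySketch cloneResource() {
        ResourceRecoverySketch r = new ResourceRecoverySketch();
        r.values.putAll(values);
        return r;
    }
}
```

With a recovered capability carrying an extended type such as a GPU count, the newInstance-style rebuild silently discards it, which is exactly why the patch switches to a clone.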
[jira] [Created] (YARN-7647) NM print inappropriate error log when node-labels is enabled
Yang Wang created YARN-7647: --- Summary: NM print inappropriate error log when node-labels is enabled Key: YARN-7647 URL: https://issues.apache.org/jira/browse/YARN-7647 Project: Hadoop YARN Issue Type: Bug Reporter: Yang Wang {code:title=NodeStatusUpdaterImpl.java} ... ... if (response.getAreNodeLabelsAcceptedByRM() && LOG.isDebugEnabled()) { LOG.debug("Node Labels {" + StringUtils.join(",", previousNodeLabels) + "} were Accepted by RM "); } else { // case where updated labels from NodeLabelsProvider is sent to RM and // RM rejected the labels LOG.error( "NM node labels {" + StringUtils.join(",", previousNodeLabels) + "} were not accepted by RM and message from RM : " + response.getDiagnosticsMessage()); } ... ... {code} When LOG.isDebugEnabled() is false, the NM always takes the else branch and prints the error log, even when the node labels were accepted by the RM. It is an obvious error and quite misleading. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
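The faulty condition reduces to a two-boolean truth table. The sketch below is a simplified stand-in, not the actual NodeStatusUpdaterImpl code or patch: it models only whether the error branch is taken, and shows how moving the LOG.isDebugEnabled() check inside the "accepted" branch fixes the control flow.

```java
// Simplified stand-in for the NodeStatusUpdaterImpl branch; models only
// whether the error path is taken. Not the actual Hadoop code or patch.
public class LogGuardSketch {

    // Buggy shape: "accepted && debugEnabled" sends accepted-but-not-debug
    // runs into the error branch, so the error fires on every heartbeat
    // whenever debug logging is off.
    static boolean logsErrorBuggy(boolean accepted, boolean debugEnabled) {
        if (accepted && debugEnabled) {
            return false; // LOG.debug("... were Accepted by RM")
        } else {
            return true;  // LOG.error("... were not accepted by RM ...")
        }
    }

    // Fixed shape: branch on acceptance first; the debug-level check guards
    // only the debug message itself.
    static boolean logsErrorFixed(boolean accepted, boolean debugEnabled) {
        if (accepted) {
            if (debugEnabled) {
                // LOG.debug("... were Accepted by RM")
            }
            return false;
        }
        return true;      // LOG.error("... were not accepted by RM ...")
    }
}
```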
[jira] [Assigned] (YARN-7647) NM print inappropriate error log when node-labels is enabled
[ https://issues.apache.org/jira/browse/YARN-7647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang reassigned YARN-7647: --- Assignee: Yang Wang > NM print inappropriate error log when node-labels is enabled > > > Key: YARN-7647 > URL: https://issues.apache.org/jira/browse/YARN-7647 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Yang Wang > Attachments: YARN-7647.001.patch > > > {code:title=NodeStatusUpdaterImpl.java} > ... ... > if (response.getAreNodeLabelsAcceptedByRM() && LOG.isDebugEnabled()) { > LOG.debug("Node Labels {" + StringUtils.join(",", > previousNodeLabels) > + "} were Accepted by RM "); > } else { > // case where updated labels from NodeLabelsProvider is sent to RM > and > // RM rejected the labels > LOG.error( > "NM node labels {" + StringUtils.join(",", previousNodeLabels) > + "} were not accepted by RM and message from RM : " > + response.getDiagnosticsMessage()); > } > ... ... > {code} > When LOG.isDebugEnabled() is false, NM will always print error log. It is an > obvious error and is so misleading. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7647) NM print inappropriate error log when node-labels is enabled
[ https://issues.apache.org/jira/browse/YARN-7647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-7647: Attachment: YARN-7647.001.patch > NM print inappropriate error log when node-labels is enabled > > > Key: YARN-7647 > URL: https://issues.apache.org/jira/browse/YARN-7647 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Yang Wang > Attachments: YARN-7647.001.patch > > > {code:title=NodeStatusUpdaterImpl.java} > ... ... > if (response.getAreNodeLabelsAcceptedByRM() && LOG.isDebugEnabled()) { > LOG.debug("Node Labels {" + StringUtils.join(",", > previousNodeLabels) > + "} were Accepted by RM "); > } else { > // case where updated labels from NodeLabelsProvider is sent to RM > and > // RM rejected the labels > LOG.error( > "NM node labels {" + StringUtils.join(",", previousNodeLabels) > + "} were not accepted by RM and message from RM : " > + response.getDiagnosticsMessage()); > } > ... ... > {code} > When LOG.isDebugEnabled() is false, NM will always print error log. It is an > obvious error and is so misleading. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7659) NodeManager metrics return wrong value after update resource
Yang Wang created YARN-7659: --- Summary: NodeManager metrics return wrong value after update resource Key: YARN-7659 URL: https://issues.apache.org/jira/browse/YARN-7659 Project: Hadoop YARN Issue Type: Bug Reporter: Yang Wang {code:title=NodeManagerMetrics.java} public void addResource(Resource res) { availableMB = availableMB + res.getMemorySize(); availableGB.incr((int)Math.floor(availableMB/1024d)); availableVCores.incr(res.getVirtualCores()); } {code} When the node resource is updated through the RM-NM heartbeat, the NM metrics report wrong values. The root cause is that the new resource has already been accumulated into availableMB, so availableGB must not be incremented by the full running total again. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
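The arithmetic error can be shown without any Hadoop classes. The sketch below is a simplified stand-in for NodeManagerMetrics (plain fields instead of metrics2 gauges; the actual YARN patch may differ): the buggy path increments the GB gauge by the running total on every update, while the fixed path increments only by the delta.

```java
// Simplified stand-in for NodeManagerMetrics; plain fields instead of
// metrics2 gauges. Illustrative only, not the actual Hadoop code or patch.
public class AvailableGbSketch {
    long availableMB = 0; // running total in MB
    int availableGB = 0;  // gauge, updated by increments

    // Buggy pattern: incr by floor(total/1024) on every update, so each
    // call adds the whole running total, not just the new resource.
    void addResourceBuggy(long resMB) {
        availableMB += resMB;
        availableGB += (int) Math.floor(availableMB / 1024d);
    }

    // Fixed pattern: increment by the change in floor(total/1024) so the
    // gauge always equals floor(availableMB/1024).
    void addResourceFixed(long resMB) {
        int before = (int) Math.floor(availableMB / 1024d);
        availableMB += resMB;
        availableGB += (int) Math.floor(availableMB / 1024d) - before;
    }
}
```

Two consecutive 2 GB updates drive the buggy gauge to 6 GB (2 + 4) while the fixed gauge correctly reads 4 GB.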
[jira] [Created] (YARN-7660) NodeManager metrics return wrong value after update node resource
Yang Wang created YARN-7660: --- Summary: NodeManager metrics return wrong value after update node resource Key: YARN-7660 URL: https://issues.apache.org/jira/browse/YARN-7660 Project: Hadoop YARN Issue Type: Bug Reporter: Yang Wang {code:title=NodeManagerMetrics.java} public void addResource(Resource res) { availableMB = availableMB + res.getMemorySize(); availableGB.incr((int)Math.floor(availableMB/1024d)); availableVCores.incr(res.getVirtualCores()); } {code} When the node resource is updated through the RM-NM heartbeat, the NM metrics report wrong values. The root cause is that the new resource has already been accumulated into availableMB, so availableGB must not be incremented by the full running total again. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7661) NodeManager metrics return wrong value after update node resource
Yang Wang created YARN-7661: --- Summary: NodeManager metrics return wrong value after update node resource Key: YARN-7661 URL: https://issues.apache.org/jira/browse/YARN-7661 Project: Hadoop YARN Issue Type: Bug Reporter: Yang Wang {code:title=NodeManagerMetrics.java} public void addResource(Resource res) { availableMB = availableMB + res.getMemorySize(); availableGB.incr((int)Math.floor(availableMB/1024d)); availableVCores.incr(res.getVirtualCores()); } {code} When the node resource is updated through the RM-NM heartbeat, the NM metrics report wrong values. The root cause is that the new resource has already been accumulated into availableMB, so availableGB must not be incremented by the full running total again. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-7661) NodeManager metrics return wrong value after update node resource
[ https://issues.apache.org/jira/browse/YARN-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang reassigned YARN-7661: --- Assignee: Yang Wang > NodeManager metrics return wrong value after update node resource > - > > Key: YARN-7661 > URL: https://issues.apache.org/jira/browse/YARN-7661 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Yang Wang > > {code:title=NodeManagerMetrics.java} > public void addResource(Resource res) { > availableMB = availableMB + res.getMemorySize(); > availableGB.incr((int)Math.floor(availableMB/1024d)); > availableVCores.incr(res.getVirtualCores()); > } > {code} > When the node resource was updated through RM-NM heartbeat, the NM metric > will get wrong value. > The root cause of this issue is that new resource has been added to > availableMB, so not needed to increase for availableGB again. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7661) NodeManager metrics return wrong value after update node resource
[ https://issues.apache.org/jira/browse/YARN-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-7661: Attachment: YARN-7661.001.patch Attach a patch to resolve this issue. > NodeManager metrics return wrong value after update node resource > - > > Key: YARN-7661 > URL: https://issues.apache.org/jira/browse/YARN-7661 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Yang Wang > Attachments: YARN-7661.001.patch > > > {code:title=NodeManagerMetrics.java} > public void addResource(Resource res) { > availableMB = availableMB + res.getMemorySize(); > availableGB.incr((int)Math.floor(availableMB/1024d)); > availableVCores.incr(res.getVirtualCores()); > } > {code} > When the node resource was updated through RM-NM heartbeat, the NM metric > will get wrong value. > The root cause of this issue is that new resource has been added to > availableMB, so not needed to increase for availableGB again. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7660) NM node resource should be updated through heartbeat when rmadmin updateNodeResource execute successfully
[ https://issues.apache.org/jira/browse/YARN-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-7660: Summary: NM node resource should be updated through heartbeat when rmadmin updateNodeResource execute successfully (was: NodeManager metrics return wrong value after update node resource) > NM node resource should be updated through heartbeat when rmadmin > updateNodeResource execute successfully > - > > Key: YARN-7660 > URL: https://issues.apache.org/jira/browse/YARN-7660 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang > > {code:title=NodeManagerMetrics.java} > public void addResource(Resource res) { > availableMB = availableMB + res.getMemorySize(); > availableGB.incr((int)Math.floor(availableMB/1024d)); > availableVCores.incr(res.getVirtualCores()); > } > {code} > When the node resource was updated through RM-NM heartbeat, the NM metric > will get wrong value. > The root cause of this issue is that new resource has been added to > availableMB, so not needed to increase for availableGB again. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7660) NM node resource should be updated through heartbeat when rmadmin updateNodeResource execute successfully
[ https://issues.apache.org/jira/browse/YARN-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-7660: Description: When yarn rmadmin -updateNodeResource is used to update the node resource and executes successfully, the new capability should be sent to the NM through the RM-NM heartbeat. 1. NM jmx metrics need to be updated 2. NM cgroup quota need to be updated was: {code:title=NodeManagerMetrics.java} public void addResource(Resource res) { availableMB = availableMB + res.getMemorySize(); availableGB.incr((int)Math.floor(availableMB/1024d)); availableVCores.incr(res.getVirtualCores()); } {code} When the node resource was updated through RM-NM heartbeat, the NM metric will get wrong value. The root cause of this issue is that new resource has been added to availableMB, so not needed to increase for availableGB again. > NM node resource should be updated through heartbeat when rmadmin > updateNodeResource execute successfully > - > > Key: YARN-7660 > URL: https://issues.apache.org/jira/browse/YARN-7660 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang > > When yarn rmadmin -updateNodeResource is used to update the node resource and > executes successfully, the new capability should be sent to the NM through the > RM-NM heartbeat. > 1. NM jmx metrics need to be updated > 2. NM cgroup quota need to be updated -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7661) NodeManager metrics return wrong value after update node resource
[ https://issues.apache.org/jira/browse/YARN-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294470#comment-16294470 ] Yang Wang commented on YARN-7661: - [~jlowe] Thanks for your comment. I have fixed the test and updated the patch. > NodeManager metrics return wrong value after update node resource > - > > Key: YARN-7661 > URL: https://issues.apache.org/jira/browse/YARN-7661 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Yang Wang >Assignee: Yang Wang > Attachments: YARN-7661.001.patch > > > {code:title=NodeManagerMetrics.java} > public void addResource(Resource res) { > availableMB = availableMB + res.getMemorySize(); > availableGB.incr((int)Math.floor(availableMB/1024d)); > availableVCores.incr(res.getVirtualCores()); > } > {code} > When the node resource was updated through RM-NM heartbeat, the NM metric > will get wrong value. > The root cause of this issue is that new resource has been added to > availableMB, so not needed to increase for availableGB again. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7661) NodeManager metrics return wrong value after update node resource
[ https://issues.apache.org/jira/browse/YARN-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-7661: Attachment: YARN-7661.002.patch > NodeManager metrics return wrong value after update node resource > - > > Key: YARN-7661 > URL: https://issues.apache.org/jira/browse/YARN-7661 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Yang Wang >Assignee: Yang Wang > Attachments: YARN-7661.001.patch, YARN-7661.002.patch > > > {code:title=NodeManagerMetrics.java} > public void addResource(Resource res) { > availableMB = availableMB + res.getMemorySize(); > availableGB.incr((int)Math.floor(availableMB/1024d)); > availableVCores.incr(res.getVirtualCores()); > } > {code} > When the node resource was updated through RM-NM heartbeat, the NM metric > will get wrong value. > The root cause of this issue is that new resource has been added to > availableMB, so not needed to increase for availableGB again. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7661) NodeManager metrics return wrong value after update node resource
[ https://issues.apache.org/jira/browse/YARN-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16296229#comment-16296229 ] Yang Wang commented on YARN-7661: - [~jlowe], thanks for your review and commit. > NodeManager metrics return wrong value after update node resource > - > > Key: YARN-7661 > URL: https://issues.apache.org/jira/browse/YARN-7661 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Yang Wang >Assignee: Yang Wang > Fix For: 3.1.0, 2.10.0, 2.9.1, 3.0.1, 2.8.4, 2.7.6 > > Attachments: YARN-7661.001.patch, YARN-7661.002.patch > > > {code:title=NodeManagerMetrics.java} > public void addResource(Resource res) { > availableMB = availableMB + res.getMemorySize(); > availableGB.incr((int)Math.floor(availableMB/1024d)); > availableVCores.incr(res.getVirtualCores()); > } > {code} > When the node resource was updated through RM-NM heartbeat, the NM metric > will get wrong value. > The root cause of this issue is that new resource has been added to > availableMB, so not needed to increase for availableGB again. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5621) Support LinuxContainerExecutor to create symlinks for continuously localized resources
[ https://issues.apache.org/jira/browse/YARN-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114150#comment-16114150 ] Yang Wang commented on YARN-5621: - {code:title=LinuxContainerExecutor.java} protected void createSymlinkAsUser(String user, File privateScriptFile, String userScriptFile) throws PrivilegedOperationException { String runAsUser = getRunAsUser(user); ... ... {code} I think we should use containerUser instead of runAsUser here, because it may cause "Invalid command" in container-executor when getRunAsUser returns the nonsecureLocalUser. > Support LinuxContainerExecutor to create symlinks for continuously localized > resources > -- > > Key: YARN-5621 > URL: https://issues.apache.org/jira/browse/YARN-5621 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Jian He >Assignee: Jian He > Labels: oct16-hard > Attachments: YARN-5621.1.patch, YARN-5621.2.patch, YARN-5621.3.patch, > YARN-5621.4.patch, YARN-5621.5.patch > > > When new resources are localized, new symlink needs to be created for the > localized resource. This is the change for the LinuxContainerExecutor to > create the symlinks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
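A minimal sketch of the distinction (illustrative signature, not the real LinuxContainerExecutor API): in nonsecure mode, getRunAsUser collapses every container user to the configured local user, so an operation that must be attributed to the submitting user needs the container user instead.

```java
// Illustrative reduction, not the real LinuxContainerExecutor API: in
// nonsecure mode the effective run-as user is the configured local user
// (default "nobody"), regardless of which user submitted the container.
public class LceUserSketch {
    static String getRunAsUser(String containerUser, boolean secureMode,
                               String nonsecureLocalUser) {
        return secureMode ? containerUser : nonsecureLocalUser;
    }
}
```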
[jira] [Comment Edited] (YARN-5621) Support LinuxContainerExecutor to create symlinks for continuously localized resources
[ https://issues.apache.org/jira/browse/YARN-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114150#comment-16114150 ] Yang Wang edited comment on YARN-5621 at 8/4/17 9:08 AM: - {code:title=LinuxContainerExecutor.java} protected void createSymlinkAsUser(String user, File privateScriptFile, String userScriptFile) throws PrivilegedOperationException { String runAsUser = getRunAsUser(user); ... ... {code} Hi,[~jianhe] I think we should use containerUser instead of runAsUser here. Because it may cause "Invalid command" in container-executor when getRunAsUser return nonsecureLocalUser. was (Author: fly_in_gis): {code:title=LinuxContainerExecutor.java} protected void createSymlinkAsUser(String user, File privateScriptFile, String userScriptFile) throws PrivilegedOperationException { String runAsUser = getRunAsUser(user); ... ... {code} I think we should use containerUser instead of runAsUser here. Because it may cause "Invalid command" in container-executor when getRunAsUser return nonsecureLocalUser. > Support LinuxContainerExecutor to create symlinks for continuously localized > resources > -- > > Key: YARN-5621 > URL: https://issues.apache.org/jira/browse/YARN-5621 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Jian He >Assignee: Jian He > Labels: oct16-hard > Attachments: YARN-5621.1.patch, YARN-5621.2.patch, YARN-5621.3.patch, > YARN-5621.4.patch, YARN-5621.5.patch > > > When new resources are localized, new symlink needs to be created for the > localized resource. This is the change for the LinuxContainerExecutor to > create the symlinks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-6951) Fix debug log when Resource handler chain is enabled
Yang Wang created YARN-6951: --- Summary: Fix debug log when Resource handler chain is enabled Key: YARN-6951 URL: https://issues.apache.org/jira/browse/YARN-6951 Project: Hadoop YARN Issue Type: Bug Reporter: Yang Wang {code title=LinuxContainerExecutor.java} ... ... if (LOG.isDebugEnabled()) { LOG.debug("Resource handler chain enabled = " + (resourceHandlerChain == null)); } ... ... {code} I think it is just a typo. When resourceHandlerChain is not null, it should print the log "Resource handler chain enabled = true". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
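The typo inverts the meaning of the message. A minimal sketch of the condition (illustrative, not the actual LinuxContainerExecutor code):

```java
// Illustrative reduction of the condition, not the actual Hadoop code:
// the debug message should report "enabled = true" when the chain is
// non-null, but the original expression reports the opposite.
public class ChainEnabledSketch {
    static boolean enabledBuggy(Object resourceHandlerChain) {
        return resourceHandlerChain == null; // typo: inverted
    }
    static boolean enabledFixed(Object resourceHandlerChain) {
        return resourceHandlerChain != null;
    }
}
```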
[jira] [Updated] (YARN-6951) Fix debug log when Resource handler chain is enabled
[ https://issues.apache.org/jira/browse/YARN-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6951: Description: {code:title=LinuxContainerExecutor.java} ... ... if (LOG.isDebugEnabled()) { LOG.debug("Resource handler chain enabled = " + (resourceHandlerChain == null)); } ... ... {code} I think it is just a typo.When resourceHandlerChain is not null, print the log "Resource handler chain enabled = true". was: {code title=LinuxContainerExecutor.java} ... ... if (LOG.isDebugEnabled()) { LOG.debug("Resource handler chain enabled = " + (resourceHandlerChain == null)); } ... ... {code} I think it is just a typo.When resourceHandlerChain is not null, print the log "Resource handler chain enabled = true". > Fix debug log when Resource handler chain is enabled > > > Key: YARN-6951 > URL: https://issues.apache.org/jira/browse/YARN-6951 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang > > {code:title=LinuxContainerExecutor.java} > ... ... > if (LOG.isDebugEnabled()) { > LOG.debug("Resource handler chain enabled = " + (resourceHandlerChain > == null)); > } > ... ... > {code} > I think it is just a typo.When resourceHandlerChain is not null, print the > log "Resource handler chain enabled = true". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6951) Fix debug log when Resource handler chain is enabled
[ https://issues.apache.org/jira/browse/YARN-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6951: Attachment: YARN-6951.001.patch > Fix debug log when Resource handler chain is enabled > > > Key: YARN-6951 > URL: https://issues.apache.org/jira/browse/YARN-6951 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang > Attachments: YARN-6951.001.patch > > > {code:title=LinuxContainerExecutor.java} > ... ... > if (LOG.isDebugEnabled()) { > LOG.debug("Resource handler chain enabled = " + (resourceHandlerChain > == null)); > } > ... ... > {code} > I think it is just a typo.When resourceHandlerChain is not null, print the > log "Resource handler chain enabled = true". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-6951) Fix debug log when Resource handler chain is enabled
[ https://issues.apache.org/jira/browse/YARN-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang reassigned YARN-6951: --- Assignee: Yang Wang > Fix debug log when Resource handler chain is enabled > > > Key: YARN-6951 > URL: https://issues.apache.org/jira/browse/YARN-6951 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Yang Wang > Attachments: YARN-6951.001.patch > > > {code:title=LinuxContainerExecutor.java} > ... ... > if (LOG.isDebugEnabled()) { > LOG.debug("Resource handler chain enabled = " + (resourceHandlerChain > == null)); > } > ... ... > {code} > I think it is just a typo.When resourceHandlerChain is not null, print the > log "Resource handler chain enabled = true". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6212) NodeManager metrics returning wrong negative values
[ https://issues.apache.org/jira/browse/YARN-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16116452#comment-16116452 ] Yang Wang commented on YARN-6212: - Hi, Miklos Szegedi I'm afraid this JIRA is not a duplicate of YARN-3933. The primary cause of the negative values is that metrics do not recover properly when the NM restarts. *AllocatedContainers,ContainersLaunched,AllocatedGB,AvailableGB,AllocatedVCores,AvailableVCores* in metrics need to recover when the NM restarts. This should be done in ContainerManagerImpl#recoverContainer. The scenario can be reproduced by the following steps: # Make sure YarnConfiguration.NM_RECOVERY_ENABLED=true,YarnConfiguration.NM_RECOVERY_SUPERVISED=true in NM # Submit an application and keep it running # Restart NM # Stop the application # Now you get the negative values > NodeManager metrics returning wrong negative values > --- > > Key: YARN-6212 > URL: https://issues.apache.org/jira/browse/YARN-6212 > Project: Hadoop YARN > Issue Type: Bug > Components: metrics >Affects Versions: 2.7.3 >Reporter: Abhishek Shivanna > > It looks like the metrics returned by the NodeManager have negative values > for metrics that never should be negative. 
Here is the output from the NM endpoint > {noformat} > /jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics > {noformat} > {noformat} > { > "beans" : [ { > "name" : "Hadoop:service=NodeManager,name=NodeManagerMetrics", > "modelerType" : "NodeManagerMetrics", > "tag.Context" : "yarn", > "tag.Hostname" : "", > "ContainersLaunched" : 707, > "ContainersCompleted" : 9, > "ContainersFailed" : 124, > "ContainersKilled" : 579, > "ContainersIniting" : 0, > "ContainersRunning" : 19, > "AllocatedGB" : -26, > "AllocatedContainers" : -5, > "AvailableGB" : 252, > "AllocatedVCores" : -5, > "AvailableVCores" : 101, > "ContainerLaunchDurationNumOps" : 718, > "ContainerLaunchDurationAvgTime" : 18.0 > } ] > } > {noformat} > Is there any circumstance under which the value for AllocatedGB, > AllocatedContainers and AllocatedVCores go below 0? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-6212) NodeManager metrics returning wrong negative values
[ https://issues.apache.org/jira/browse/YARN-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16116452#comment-16116452 ] Yang Wang edited comment on YARN-6212 at 8/8/17 2:25 AM: - Hi, [~miklos.szeg...@cloudera.com] I'm afraid this JIRA is not a duplicate of YARN-3933. The primary cause of negative values is that metrics do not recover properly when NM restart. *AllocatedContainers,ContainersLaunched,AllocatedGB,AvailableGB,AllocatedVCores,AvailableVCores* in metrics need to recover when NM restart. This should be done in ContainerManagerImpl#recoverContainer. The scenario could be reproduction by the following steps: # Make sure YarnConfiguration.NM_RECOVERY_ENABLED=true,YarnConfiguration.NM_RECOVERY_SUPERVISED=true in NM # Submit an application and keep running # Restart NM # Stop the application # Now you get the negative values was (Author: fly_in_gis): Hi, Miklos Szegedi I'm afraid this JIRA is not a duplicate of YARN-3933. The primary cause of negative values is that metrics do not recover properly when NM restart. *AllocatedContainers,ContainersLaunched,AllocatedGB,AvailableGB,AllocatedVCores,AvailableVCores* in metrics need to recover when NM restart. This should be done in ContainerManagerImpl#recoverContainer. The scenario could be reproduction by the following steps: # Make sure YarnConfiguration.NM_RECOVERY_ENABLED=true,YarnConfiguration.NM_RECOVERY_SUPERVISED=true in NM # Submit an application and keep running # Restart NM # Stop the application # Now you get the negative values > NodeManager metrics returning wrong negative values > --- > > Key: YARN-6212 > URL: https://issues.apache.org/jira/browse/YARN-6212 > Project: Hadoop YARN > Issue Type: Bug > Components: metrics >Affects Versions: 2.7.3 >Reporter: Abhishek Shivanna > > It looks like the metrics returned by the NodeManager have negative values > for metrics that never should be negative. 
Here is an output form NM endpoint > {noformat} > /jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics > {noformat} > {noformat} > { > "beans" : [ { > "name" : "Hadoop:service=NodeManager,name=NodeManagerMetrics", > "modelerType" : "NodeManagerMetrics", > "tag.Context" : "yarn", > "tag.Hostname" : "", > "ContainersLaunched" : 707, > "ContainersCompleted" : 9, > "ContainersFailed" : 124, > "ContainersKilled" : 579, > "ContainersIniting" : 0, > "ContainersRunning" : 19, > "AllocatedGB" : -26, > "AllocatedContainers" : -5, > "AvailableGB" : 252, > "AllocatedVCores" : -5, > "AvailableVCores" : 101, > "ContainerLaunchDurationNumOps" : 718, > "ContainerLaunchDurationAvgTime" : 18.0 > } ] > } > {noformat} > Is there any circumstance under which the value for AllocatedGB, > AllocatedContainers and AllocatedVCores go below 0? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-6966) NodeManager metrics may returning wrong negative values when after restart
Yang Wang created YARN-6966: --- Summary: NodeManager metrics may returning wrong negative values when after restart Key: YARN-6966 URL: https://issues.apache.org/jira/browse/YARN-6966 Project: Hadoop YARN Issue Type: Bug Reporter: Yang Wang Just as YARN-6212. However, I think it is not a duplicate of YARN-3933. The primary cause of the negative values is that metrics do not recover properly when the NM restarts. AllocatedContainers,ContainersLaunched,AllocatedGB,AvailableGB,AllocatedVCores,AvailableVCores in metrics also need to recover when the NM restarts. This should be done in ContainerManagerImpl#recoverContainer. The scenario can be reproduced by the following steps: # Make sure YarnConfiguration.NM_RECOVERY_ENABLED=true,YarnConfiguration.NM_RECOVERY_SUPERVISED=true in NM # Submit an application and keep it running # Restart NM # Stop the application # Now you get the negative values {code} /jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics {code} {code} { name: "Hadoop:service=NodeManager,name=NodeManagerMetrics", modelerType: "NodeManagerMetrics", tag.Context: "yarn", tag.Hostname: "hadoop.com", ContainersLaunched: 0, ContainersCompleted: 0, ContainersFailed: 2, ContainersKilled: 0, ContainersIniting: 0, ContainersRunning: 0, AllocatedGB: 0, AllocatedContainers: -2, AvailableGB: 160, AllocatedVCores: -11, AvailableVCores: 3611, ContainerLaunchDurationNumOps: 2, ContainerLaunchDurationAvgTime: 6, BadLocalDirs: 0, BadLogDirs: 0, GoodLocalDirsDiskUtilizationPerc: 2, GoodLogDirsDiskUtilizationPerc: 2 } {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
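The negative values follow directly from asymmetric bookkeeping: after a restart the gauges start from zero, recovered containers are never re-allocated in the metrics, yet their completion still decrements them. The sketch below is a simplified stand-in (a plain counter instead of NodeManagerMetrics; the recoverContainer hook mirrors what ContainerManagerImpl#recoverContainer would need to do, not the actual patch):

```java
// Simplified stand-in for the NM metrics lifecycle across a restart.
// Illustrative only; not the NodeManagerMetrics API or the actual patch.
public class RestartMetricsSketch {
    int allocatedContainers = 0; // gauge, reset to 0 on NM restart

    void allocate() { allocatedContainers++; } // container started
    void release()  { allocatedContainers--; } // container finished

    // Hypothetical recovery hook: re-apply the allocation for each container
    // that was still live when the NM went down.
    void recoverContainer(boolean recoverMetrics) {
        if (recoverMetrics) {
            allocate();
        }
    }
}
```

Without the recovery step, the first completion of a recovered container drives the gauge to -1; with it, the gauge returns cleanly to 0.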
[jira] [Updated] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart
[ https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6966:
Summary: NodeManager metrics may return wrong negative values when NM restart (was: NodeManager metrics may returning wrong negative values when after restart)
[jira] [Updated] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart
[ https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6966:
Attachment: YARN-6966.001.patch
[jira] [Updated] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart
[ https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6966:
Attachment: YARN-6966.002.patch
Updated the patch.
[jira] [Assigned] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart
[ https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang reassigned YARN-6966:
---
Assignee: Yang Wang
[jira] [Assigned] (YARN-6589) Recover all resources when NM restart
[ https://issues.apache.org/jira/browse/YARN-6589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang reassigned YARN-6589:
---
Assignee: Yang Wang

> Recover all resources when NM restart
> -
>
> Key: YARN-6589
> URL: https://issues.apache.org/jira/browse/YARN-6589
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Yang Wang
> Assignee: Yang Wang
>
> When the NM restarts, containers are recovered. However, only the memory and vcores in the capability are recovered; all resource types need to be recovered.
> {code:title=ContainerImpl.java}
> // resource capability had been updated before NM was down
> this.resource =
>     Resource.newInstance(recoveredCapability.getMemorySize(),
>         recoveredCapability.getVirtualCores());
> {code}
> It should be:
> {code:title=ContainerImpl.java}
> // resource capability had been updated before NM was down;
> // need to recover all resource types, not only memory and vcores
> this.resource = Resources.clone(recoveredCapability);
> {code}
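The difference between the two snippets above can be sketched in plain Java with a stand-in for YARN's Resource. The ResourceSketch class and the "yarn.io/gpu" key below are illustrative assumptions, not Hadoop APIs: copying only memory and vcores silently drops any extended resource type, while a full clone preserves every entry.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for YARN's Resource: besides memory and vcores,
// a resource may carry extended resource types (e.g. "yarn.io/gpu").
class ResourceSketch {
    Map<String, Long> values = new HashMap<>();

    static ResourceSketch of(long memMb, long vcores) {
        ResourceSketch r = new ResourceSketch();
        r.values.put("memory-mb", memMb);
        r.values.put("vcores", vcores);
        return r;
    }

    // Mirrors the buggy recovery path: only memory and vcores survive.
    static ResourceSketch newInstance(long memMb, long vcores) {
        return of(memMb, vcores);
    }

    // Mirrors the Resources.clone fix: every resource type survives.
    static ResourceSketch copyOf(ResourceSketch src) {
        ResourceSketch r = new ResourceSketch();
        r.values.putAll(src.values);
        return r;
    }
}

public class RecoverAllResourcesDemo {
    public static void main(String[] args) {
        ResourceSketch recovered = ResourceSketch.of(4096, 2);
        recovered.values.put("yarn.io/gpu", 1L); // extended resource set before NM went down

        ResourceSketch partial = ResourceSketch.newInstance(4096, 2);
        ResourceSketch full = ResourceSketch.copyOf(recovered);

        System.out.println(partial.values.containsKey("yarn.io/gpu")); // false: GPU lost
        System.out.println(full.values.containsKey("yarn.io/gpu"));    // true: GPU kept
    }
}
```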
[jira] [Updated] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-4166:
Attachment: (was: YARN-4166-branch2.8-001.patch)

> Support changing container cpu resource
> ---
>
> Key: YARN-4166
> URL: https://issues.apache.org/jira/browse/YARN-4166
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: api, nodemanager, resourcemanager
> Affects Versions: 2.8.0, 3.0.0-alpha2
> Reporter: Jian He
> Assignee: Yang Wang
> Attachments: YARN-4166.001.patch, YARN-4166.002.patch, YARN-4166.003.patch, YARN-4166.004.patch
>
> Memory resizing is now supported; we need to support the same for cpu.
[jira] [Updated] (YARN-6966) NodeManager metrics may return wrong negative values when NM restart
[ https://issues.apache.org/jira/browse/YARN-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6966:
Attachment: YARN-6966.003.patch
[jira] [Updated] (YARN-6589) Recover all resources when NM restart
[ https://issues.apache.org/jira/browse/YARN-6589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6589:
Attachment: YARN-6589-YARN-3926.001.patch
[jira] [Assigned] (YARN-6578) Return container resource utilization from NM ContainerStatus call
[ https://issues.apache.org/jira/browse/YARN-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang reassigned YARN-6578:
---
Assignee: Yang Wang

> Return container resource utilization from NM ContainerStatus call
> --
>
> Key: YARN-6578
> URL: https://issues.apache.org/jira/browse/YARN-6578
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Yang Wang
> Assignee: Yang Wang
> Priority: Major
> Attachments: YARN-6578.001.patch
>
> When the ApplicationMaster wants to change (increase/decrease) the resources of an allocated container, resource utilization is an important reference indicator for the decision. So when the AM calls NMClient.getContainerStatus, the resource utilization needs to be returned.
[jira] [Updated] (YARN-6578) Return container resource utilization from NM ContainerStatus call
[ https://issues.apache.org/jira/browse/YARN-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6578:
Description: When the ApplicationMaster wants to change (increase/decrease) the resources of an allocated container, resource utilization is an important reference indicator for the decision. So when the AM calls NMClient.getContainerStatus, the resource utilization needs to be returned. The container resource utilization also needs to be reported to the RM to enable better scheduling.
(was: When the applicationMaster wants to change(increase/decrease) resources of an allocated container, resource utilization is an important reference indicator for decision making. So, when AM call NMClient.getContainerStatus, resource utilization needs to be returned.)
[jira] [Created] (YARN-8984) OutstandingSchedRequests in AMRMClient could not be removed when AllocationTags is null or empty
Yang Wang created YARN-8984:
---
Summary: OutstandingSchedRequests in AMRMClient could not be removed when AllocationTags is null or empty
Key: YARN-8984
URL: https://issues.apache.org/jira/browse/YARN-8984
Project: Hadoop YARN
Issue Type: Bug
Reporter: Yang Wang

In AMRMClient, entries in outstandingSchedRequests should be removed or decreased when a container is allocated. However, this does not work when the allocation tags are null or empty.
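A minimal sketch of the leak (all names below are hypothetical; the real bookkeeping lives in AMRMClient and is keyed by the request's allocation tags): if a request submitted with null tags and a container allocated with empty tags are not normalized to the same map key, the outstanding entry is never decremented and the request is re-sent on resync. Normalizing null to an empty set makes the add and remove sides agree.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class OutstandingRequestsDemo {
    // Hypothetical analogue of AMRMClient's outstanding scheduling-request
    // table, keyed by allocation tags; values are pending request counts.
    static Map<Set<String>, Integer> outstanding = new HashMap<>();

    // Treat null tags and empty tags as the same key.
    static Set<String> normalize(Set<String> tags) {
        return tags == null ? Collections.<String>emptySet() : tags;
    }

    static void addRequest(Set<String> tags) {
        outstanding.merge(normalize(tags), 1, Integer::sum);
    }

    static void onContainerAllocated(Set<String> tags) {
        // Decrement the pending count; drop the entry when it reaches zero.
        outstanding.computeIfPresent(normalize(tags), (k, n) -> n > 1 ? n - 1 : null);
    }

    public static void main(String[] args) {
        addRequest(null);                             // request carries no allocation tags
        onContainerAllocated(Collections.emptySet()); // allocated container reports empty tags
        System.out.println(outstanding.isEmpty());    // true: nothing leaks
    }
}
```

Without the normalize step, addRequest(null) and onContainerAllocated(emptySet) would touch different keys and the entry would stay in the table forever, which is the leak this issue describes.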
[jira] [Assigned] (YARN-8984) OutstandingSchedRequests in AMRMClient could not be removed when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang reassigned YARN-8984:
---
Assignee: Yang Wang
[jira] [Updated] (YARN-8984) OutstandingSchedRequests in AMRMClient could not be removed when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-8984:
Attachment: YARN-8984-001.patch
[jira] [Commented] (YARN-8984) OutstandingSchedRequests in AMRMClient could not be removed when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677649#comment-16677649 ] Yang Wang commented on YARN-8984:
This could be a critical bug during resync: all the outstanding scheduling requests with empty allocation tags will be sent again. In a big cluster, when the active RM switches, the new RM will receive a flood of requests. [~cheersyang] Could you please take a look?
[jira] [Commented] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677802#comment-16677802 ] Yang Wang commented on YARN-8984:
Hi [~cheersyang], I have tried to move the test to TestAMRMClientPlacementConstraints and found that the case fails, because containers cannot be allocated when allocationTags is empty. I think that is a separate issue in the placement processor.
[jira] [Updated] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-8984:
Attachment: YARN-8984-002.patch
[jira] [Commented] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678081#comment-16678081 ] Yang Wang commented on YARN-8984:
There is no difference between putting the test in a separate class and in TestAMRMClientPlacementConstraints. When YarnConfiguration.RM_PLACEMENT_CONSTRAINTS_HANDLER is set to scheduler, we cannot get rejectedSchedulingRequests from the AllocateResponse; it is not set by the capacity scheduler. So I added another test in TestAMRMClientPlacementConstraints. [~cheersyang] Please help review.
[jira] [Updated] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-8984:
Attachment: YARN-8984-003.patch
[jira] [Commented] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679208#comment-16679208 ] Yang Wang commented on YARN-8984:
[~cheersyang], I do not think it will throw an NPE when setAllocationTags is called with null. ContainerPBImpl#getAllocationTags() returns a new empty HashSet when the tags are null, and SchedulingRequestPBImpl#getAllocationTags() does the same, so the null check is not necessary. Btw, putting/getting a null key in a HashMap does not throw an NPE.
[~botong], thanks for your reply. The allocation tags in the SchedulingRequest in AMRMClient are empty, so the RM will not set any tags for the allocated containers.
[~kkaranasos], thanks for your reply. You are right that SchedulingRequests are used for placement constraints. However, that does not mean we have to set allocation tags on every SchedulingRequest. We use SchedulingRequest instead of ResourceRequest in our computing framework to allocate resources, which is how we hit this issue.
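The HashMap remark above is easy to verify: java.util.HashMap explicitly permits a single null key, so neither put nor get with null throws an exception.

```java
import java.util.HashMap;

public class NullKeyDemo {
    public static void main(String[] args) {
        HashMap<String, Integer> m = new HashMap<>();
        m.put(null, 1);                  // HashMap allows one null key
        System.out.println(m.get(null)); // prints 1, no NPE
        System.out.println(m.get("x"));  // prints null for an absent key
    }
}
```

Note this is specific to HashMap; other Map implementations such as Hashtable and ConcurrentHashMap do reject null keys.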
[jira] [Updated] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-8984:
Attachment: YARN-8984-004.patch
[jira] [Commented] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680902#comment-16680902 ] Yang Wang commented on YARN-8984:
[~botong], [~kkaranasos] Thanks for your replies. I have added the null check for AllocationTag. Please help review.
[jira] [Updated] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-8984:
Attachment: YARN-8984-005.patch
[jira] [Commented] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16681142#comment-16681142 ] Yang Wang commented on YARN-8984:
[~cheersyang], thanks for your comments. I have added a test to verify the three cases; they all map to the same empty-HashSet key of outstandingSchedRequests.
[jira] [Commented] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694602#comment-16694602 ] Yang Wang commented on YARN-8984:
Hi [~kkaranasos], [~botong], [~asuresh], could you please take a look at this patch? It is very important when using SchedulingRequest instead of ResourceRequest.
[jira] [Commented] (YARN-8984) AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty
[ https://issues.apache.org/jira/browse/YARN-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695741#comment-16695741 ] Yang Wang commented on YARN-8984: - [~cheersyang] [~kkaranasos] Thanks for all your reviews and the commit. > AMRMClient#OutstandingSchedRequests leaks when AllocationTags is null or empty > -- > > Key: YARN-8984 > URL: https://issues.apache.org/jira/browse/YARN-8984 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Yang Wang >Priority: Critical > Fix For: 3.3.0, 3.2.1 > > Attachments: YARN-8984-001.patch, YARN-8984-002.patch, > YARN-8984-003.patch, YARN-8984-004.patch, YARN-8984-005.patch > > > In AMRMClient, outstandingSchedRequests should be removed or decreased when > container allocated. However, it could not work when allocation tag is null > or empty.
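The leak described in YARN-8984 can be illustrated with a minimal sketch. This is not the actual AMRMClient code; the class and field names below are simplified stand-ins for the real `outstandingSchedRequests` structure, which is keyed by allocation tags. The point it shows: requests with null or empty tags all normalize to the same empty-set key, so their entries accumulate under one bucket and are never matched and removed when containers are allocated.

```java
import java.util.*;

// Simplified model of outstanding SchedulingRequests keyed by allocation
// tags. Null and empty tag sets collapse onto one shared empty-set key,
// so their entries can pile up without being matched and decremented.
public class OutstandingSchedRequestsSketch {
  static Map<Set<String>, List<String>> outstanding = new HashMap<>();

  static void addRequest(String requestId, Set<String> allocationTags) {
    Set<String> key = (allocationTags == null)
        ? new HashSet<>() : new HashSet<>(allocationTags);
    outstanding.computeIfAbsent(key, k -> new ArrayList<>()).add(requestId);
  }

  public static void main(String[] args) {
    addRequest("req-1", null);                    // null tags
    addRequest("req-2", Collections.emptySet());  // empty tags
    addRequest("req-3", Set.of("hbase"));         // a real tag

    // Both tag-less requests share the single empty-set key.
    System.out.println(outstanding.get(Collections.emptySet()).size()); // 2
    System.out.println(outstanding.size());                             // 2
  }
}
```

The fix in the patch, as the comments describe it, is to remove or decrease these entries on allocation even when the tag set is null or empty, rather than relying on tag-based matching alone.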
[jira] [Commented] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15972334#comment-15972334 ] Yang Wang commented on YARN-4166: - Hi [~Naganarasimha], are you still working on this? Could you share your progress, please? > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: Jian He >Assignee: Naganarasimha G R > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Commented] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15974130#comment-15974130 ] Yang Wang commented on YARN-4166: - We want to use container resizing (YARN-1197) in production ASAP, and I already have a patch for this JIRA. [~Naganarasimha], would you mind taking a look? > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: Jian He >Assignee: Naganarasimha G R > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Issue Comment Deleted] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-4166: Comment: was deleted (was: We want to use container resize(YARN-1197) in production ASAP. And I already have a patch for this JIRA. [~Naganarasimha] Would you mind take a look?) > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: Jian He >Assignee: Naganarasimha G R > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Commented] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15974513#comment-15974513 ] Yang Wang commented on YARN-4166: - [~Naganarasimha] Sorry, I cannot upload a patch. Could you give me the permission? Also, the Hadoop version of our production environment is 2.8, so the patch is for branch-2.8. > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: Jian He >Assignee: Naganarasimha G R > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Updated] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-4166: Attachment: YARN-4166-branch2.8-001.patch > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: Jian He >Assignee: Naganarasimha G R > Attachments: YARN-4166-branch2.8-001.patch > > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Updated] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-4166: Attachment: (was: YARN-4166-branch2.8-001.patch) > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: Jian He >Assignee: Naganarasimha G R > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Updated] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-4166: Attachment: YARN-4166-branch2.8-001.patch > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: Jian He >Assignee: Naganarasimha G R > Attachments: YARN-4166-branch2.8-001.patch > > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Commented] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15974605#comment-15974605 ] Yang Wang commented on YARN-4166: - Uploaded a patch for branch-2.8. > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Reporter: Jian He >Assignee: Naganarasimha G R > Attachments: YARN-4166-branch2.8-001.patch > > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Updated] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-4166: Attachment: YARN-4166.001.patch > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Affects Versions: 2.8.0, 3.0.0-alpha2 >Reporter: Jian He >Assignee: Naganarasimha G R > Attachments: YARN-4166.001.patch, YARN-4166-branch2.8-001.patch > > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Commented] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976485#comment-15976485 ] Yang Wang commented on YARN-4166: - [~Naganarasimha] Thanks for your help. I have already uploaded a patch for trunk. > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Affects Versions: 2.8.0, 3.0.0-alpha2 >Reporter: Jian He >Assignee: Naganarasimha G R > Attachments: YARN-4166.001.patch, YARN-4166-branch2.8-001.patch > > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Updated] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-4166: Attachment: YARN-4166.002.patch > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Affects Versions: 2.8.0, 3.0.0-alpha2 >Reporter: Jian He >Assignee: Naganarasimha G R > Attachments: YARN-4166.001.patch, YARN-4166.002.patch, > YARN-4166-branch2.8-001.patch > > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Commented] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15977015#comment-15977015 ] Yang Wang commented on YARN-4166: - Sure, I have fixed the red flags. The Findbugs red flag has nothing to do with this patch. > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Affects Versions: 2.8.0, 3.0.0-alpha2 >Reporter: Jian He >Assignee: Naganarasimha G R > Attachments: YARN-4166.001.patch, YARN-4166.002.patch, > YARN-4166-branch2.8-001.patch > > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Commented] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15986157#comment-15986157 ] Yang Wang commented on YARN-4166: - [~Naganarasimha], thanks for your comments on this patch; I will update it ASAP. > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Affects Versions: 2.8.0, 3.0.0-alpha2 >Reporter: Jian He >Assignee: Yang Wang > Attachments: YARN-4166.001.patch, YARN-4166.002.patch, > YARN-4166-branch2.8-001.patch > > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Updated] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-4166: Attachment: YARN-4166.003.patch > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Affects Versions: 2.8.0, 3.0.0-alpha2 >Reporter: Jian He >Assignee: Yang Wang > Attachments: YARN-4166.001.patch, YARN-4166.002.patch, > YARN-4166.003.patch, YARN-4166-branch2.8-001.patch > > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Updated] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-4166: Attachment: YARN-4166.004.patch > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Affects Versions: 2.8.0, 3.0.0-alpha2 >Reporter: Jian He >Assignee: Yang Wang > Attachments: YARN-4166.001.patch, YARN-4166.002.patch, > YARN-4166.003.patch, YARN-4166.004.patch, YARN-4166-branch2.8-001.patch > > > Memory resizing is now supported, we need to support the same for cpu.
[jira] [Commented] (YARN-4166) Support changing container cpu resource
[ https://issues.apache.org/jira/browse/YARN-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996225#comment-15996225 ] Yang Wang commented on YARN-4166: - Updated the patch according to [~Naganarasimha]'s suggestions: # *updateContainerResource* is now abstract; *DefaultContainerExecutor* has an empty implementation. # *ResourceHandlerException* will be thrown when the container resource update fails. # If *ContainerExecutor.updateContainerResource* fails, the container resource change still needs to be persisted for recovery. # CGroupsCpuResourceHandlerImpl will not invoke *cGroupsHandler.deleteCGroup* in updateContainerResource; the ResourceHandlerException will be caught in ContainerManagerImpl and the container added to failedContainers. # Changed the signature to *void updateContainerResource*. # Added the test *testContainerManager.testUpdateContainerResourceFailed*. # Fixed the checkstyle issue. > Support changing container cpu resource > --- > > Key: YARN-4166 > URL: https://issues.apache.org/jira/browse/YARN-4166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, nodemanager, resourcemanager >Affects Versions: 2.8.0, 3.0.0-alpha2 >Reporter: Jian He >Assignee: Yang Wang > Attachments: YARN-4166.001.patch, YARN-4166.002.patch, > YARN-4166.003.patch, YARN-4166.004.patch, YARN-4166-branch2.8-001.patch > > > Memory resizing is now supported, we need to support the same for cpu.
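Item 1 and item 2 of the list above can be sketched roughly as follows. This is a simplified illustration, not the actual ContainerExecutor/ResourceHandler API: the class names and the plain `Exception` standing in for *ResourceHandlerException* are assumptions for the sketch. It shows the shape of the design: an abstract update hook on the base executor, an empty implementation in the default executor, and a cgroups-based executor that signals failure by throwing.

```java
// Simplified sketch of the executor hierarchy described in the comment.
// Names are illustrative; a plain Exception stands in for the
// ResourceHandlerException mentioned above.
public class ExecutorSketch {
  abstract static class Executor {
    // Item 1: abstract on the base class, so every executor must decide.
    abstract void updateContainerResource(String containerId, int vcores)
        throws Exception;
  }

  static class DefaultExecutor extends Executor {
    // Item 1: the default executor does nothing (no cgroups to update).
    @Override
    void updateContainerResource(String containerId, int vcores) { /* no-op */ }
  }

  static class CgroupsExecutor extends Executor {
    // Item 2: a failed update surfaces as an exception, which the caller
    // (ContainerManagerImpl in the real code) catches and records.
    @Override
    void updateContainerResource(String containerId, int vcores)
        throws Exception {
      if (vcores <= 0) {
        throw new Exception("resource update failed for " + containerId);
      }
      // ... in the real executor, write the new cpu quota to the
      // container's cgroup here ...
    }
  }

  public static void main(String[] args) throws Exception {
    new DefaultExecutor().updateContainerResource("c1", 2);   // succeeds, no-op
    try {
      new CgroupsExecutor().updateContainerResource("c1", 0); // fails
    } catch (Exception e) {
      System.out.println("caught: " + e.getMessage());
    }
  }
}
```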
[jira] [Created] (YARN-6578) Return container resource utilization from NM ContainerStatus call
Yang Wang created YARN-6578: --- Summary: Return container resource utilization from NM ContainerStatus call Key: YARN-6578 URL: https://issues.apache.org/jira/browse/YARN-6578 Project: Hadoop YARN Issue Type: New Feature Reporter: Yang Wang When the ApplicationMaster wants to change (increase/decrease) the resources of an allocated container, resource utilization is an important reference indicator for decision making. So, when the AM calls NMClient.getContainerStatus, resource utilization needs to be returned.
[jira] [Commented] (YARN-6578) Return container resource utilization from NM ContainerStatus call
[ https://issues.apache.org/jira/browse/YARN-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004415#comment-16004415 ] Yang Wang commented on YARN-6578: - [~Naganarasimha], thanks for your reply. I plan to get the usage from ContainerMetrics and return it in ContainerStatus. If you are worried that this will make the NM heartbeat bigger, we could set the utilization only in the response of NMClient.getContainerStatus. {code} ContainerImpl.cloneAndGetContainerStatus() ... ContainerMetrics metrics = ContainerMetrics.getContainerMetrics(this.containerId); if (metrics != null) { status.setUtilization(ResourceUtilization .newInstance((int) metrics.pMemMBsStat.lastStat().mean(), 0, (float) metrics.cpuCoreUsagePercent.lastStat().mean())); } else { status.setUtilization(ResourceUtilization.newInstance(0, 0, 0)); } ... {code} > Return container resource utilization from NM ContainerStatus call > -- > > Key: YARN-6578 > URL: https://issues.apache.org/jira/browse/YARN-6578 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Yang Wang > > When the applicationMaster wants to change(increase/decrease) resources of an > allocated container, resource utilization is an important reference indicator > for decision making. So, when AM call NMClient.getContainerStatus, resource > utilization needs to be returned.
[jira] [Updated] (YARN-6578) Return container resource utilization from NM ContainerStatus call
[ https://issues.apache.org/jira/browse/YARN-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6578: Attachment: YARN-6578.001.patch > Return container resource utilization from NM ContainerStatus call > -- > > Key: YARN-6578 > URL: https://issues.apache.org/jira/browse/YARN-6578 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Yang Wang > Attachments: YARN-6578.001.patch > > > When the applicationMaster wants to change(increase/decrease) resources of an > allocated container, resource utilization is an important reference indicator > for decision making. So, when AM call NMClient.getContainerStatus, resource > utilization needs to be returned.
[jira] [Commented] (YARN-6578) Return container resource utilization from NM ContainerStatus call
[ https://issues.apache.org/jira/browse/YARN-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004749#comment-16004749 ] Yang Wang commented on YARN-6578: - [~Naganarasimha], I have uploaded a WIP patch. > Return container resource utilization from NM ContainerStatus call > -- > > Key: YARN-6578 > URL: https://issues.apache.org/jira/browse/YARN-6578 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Yang Wang > Attachments: YARN-6578.001.patch > > > When the applicationMaster wants to change(increase/decrease) resources of an > allocated container, resource utilization is an important reference indicator > for decision making. So, when AM call NMClient.getContainerStatus, resource > utilization needs to be returned.
[jira] [Created] (YARN-6589) Recover all resources when NM restart
Yang Wang created YARN-6589: --- Summary: Recover all resources when NM restart Key: YARN-6589 URL: https://issues.apache.org/jira/browse/YARN-6589 Project: Hadoop YARN Issue Type: Bug Reporter: Yang Wang When the NM restarts, containers are recovered. However, only the memory and vcores of the capability are recovered; all resource types need to be recovered. {code:title=ContainerImpl.java} // resource capability had been updated before NM was down this.resource = Resource.newInstance(recoveredCapability.getMemorySize(), recoveredCapability.getVirtualCores()); {code} It should be like this: {code:title=ContainerImpl.java} // resource capability had been updated before NM was down // need to recover all resources, not only memory and vcores this.resource = Resources.clone(recoveredCapability); {code}
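The effect of the YARN-6589 bug can be shown with a small self-contained sketch. This is not YARN's `Resource` API; the map-based `cloneAll`/`cloneMemVcoresOnly` helpers below are hypothetical stand-ins for `Resources.clone(...)` and `Resource.newInstance(mem, vcores)`. It demonstrates why rebuilding a resource from only memory and vcores silently drops any extended resource types (for example GPUs), while a full clone preserves them.

```java
import java.util.HashMap;
import java.util.Map;

// Model resources as a name -> amount map. Rebuilding from just memory
// and vcores (the buggy recovery path) loses extended resource types;
// cloning the whole thing (the fix) keeps them.
public class RecoverResourceSketch {
  // Analogous to Resources.clone(recoveredCapability).
  static Map<String, Long> cloneAll(Map<String, Long> recovered) {
    return new HashMap<>(recovered);
  }

  // Analogous to Resource.newInstance(getMemorySize(), getVirtualCores()).
  static Map<String, Long> cloneMemVcoresOnly(Map<String, Long> recovered) {
    Map<String, Long> r = new HashMap<>();
    r.put("memory-mb", recovered.get("memory-mb"));
    r.put("vcores", recovered.get("vcores"));
    return r;
  }

  public static void main(String[] args) {
    Map<String, Long> recovered = Map.of(
        "memory-mb", 4096L, "vcores", 2L, "yarn.io/gpu", 1L);
    System.out.println(
        cloneMemVcoresOnly(recovered).containsKey("yarn.io/gpu")); // false
    System.out.println(
        cloneAll(recovered).containsKey("yarn.io/gpu"));           // true
  }
}
```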
[jira] [Created] (YARN-6630) Container worker dir could not recover when NM restart
Yang Wang created YARN-6630: --- Summary: Container worker dir could not recover when NM restart Key: YARN-6630 URL: https://issues.apache.org/jira/browse/YARN-6630 Project: Hadoop YARN Issue Type: Bug Reporter: Yang Wang When ContainerRetryPolicy is NEVER_RETRY, the container work dir will not be saved in the NM state store. Then, when the NM restarts, container.workDir is null, which may cause other exceptions. {code:title=ContainerLaunch.java} ... private void recordContainerWorkDir(ContainerId containerId, String workDir) throws IOException{ container.setWorkDir(workDir); if (container.isRetryContextSet()) { context.getNMStateStore().storeContainerWorkDir(containerId, workDir); } } {code} {code:title=ContainerImpl.java} static class ResourceLocalizedWhileRunningTransition extends ContainerTransition { ... String linkFile = new Path(container.workDir, link).toString(); ... {code}
[jira] [Updated] (YARN-6630) Container worker dir could not recover when NM restart
[ https://issues.apache.org/jira/browse/YARN-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6630: Description: When yarn.nodemanager.recovery.enabled is true and ContainerRetryPolicy is NEVER_RETRY, container worker dir will not be saved in NM state store. {code:title=ContainerLaunch.java} ... private void recordContainerWorkDir(ContainerId containerId, String workDir) throws IOException{ container.setWorkDir(workDir); if (container.isRetryContextSet()) { context.getNMStateStore().storeContainerWorkDir(containerId, workDir); } } {code} Then NM restarts, container.workDir is null, and may cause other exceptions. {code:title=ContainerImpl.java} static class ResourceLocalizedWhileRunningTransition extends ContainerTransition { ... String linkFile = new Path(container.workDir, link).toString(); ... {code} {code} java.lang.IllegalArgumentException: Can not create a Path from a null string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:159) at org.apache.hadoop.fs.Path.(Path.java:175) at org.apache.hadoop.fs.Path.(Path.java:110) ... ... {code} was: When ContainerRetryPolicy is NEVER_RETRY, container worker dir will not be saved in NM state store. Then NM restarts, container.workDir is null, and may cause other exceptions. {code:title=ContainerLaunch.java} ... private void recordContainerWorkDir(ContainerId containerId, String workDir) throws IOException{ container.setWorkDir(workDir); if (container.isRetryContextSet()) { context.getNMStateStore().storeContainerWorkDir(containerId, workDir); } } {code} {code:title=ContainerImpl.java} static class ResourceLocalizedWhileRunningTransition extends ContainerTransition { ... String linkFile = new Path(container.workDir, link).toString(); ... 
{code} > Container worker dir could not recover when NM restart > -- > > Key: YARN-6630 > URL: https://issues.apache.org/jira/browse/YARN-6630 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang > > When yarn.nodemanager.recovery.enabled is true and ContainerRetryPolicy is > NEVER_RETRY, container worker dir will not be saved in NM state store. > {code:title=ContainerLaunch.java} > ... > private void recordContainerWorkDir(ContainerId containerId, > String workDir) throws IOException{ > container.setWorkDir(workDir); > if (container.isRetryContextSet()) { > context.getNMStateStore().storeContainerWorkDir(containerId, workDir); > } > } > {code} > Then NM restarts, container.workDir is null, and may cause other exceptions. > {code:title=ContainerImpl.java} > static class ResourceLocalizedWhileRunningTransition > extends ContainerTransition { > ... > String linkFile = new Path(container.workDir, link).toString(); > ... > {code} > {code} > java.lang.IllegalArgumentException: Can not create a Path from a null string > at org.apache.hadoop.fs.Path.checkPathArg(Path.java:159) > at org.apache.hadoop.fs.Path.(Path.java:175) > at org.apache.hadoop.fs.Path.(Path.java:110) > ... ... > {code}
[jira] [Commented] (YARN-6630) Container worker dir could not recover when NM restart
[ https://issues.apache.org/jira/browse/YARN-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16022445#comment-16022445 ] Yang Wang commented on YARN-6630: - When yarn.nodemanager.recovery.enabled is true, the NM will not clear any work dirs. However, container.workDir is not recovered and is null. > Container worker dir could not recover when NM restart > -- > > Key: YARN-6630 > URL: https://issues.apache.org/jira/browse/YARN-6630 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang > > When yarn.nodemanager.recovery.enabled is true and ContainerRetryPolicy is > NEVER_RETRY, container worker dir will not be saved in NM state store. > {code:title=ContainerLaunch.java} > ... > private void recordContainerWorkDir(ContainerId containerId, > String workDir) throws IOException{ > container.setWorkDir(workDir); > if (container.isRetryContextSet()) { > context.getNMStateStore().storeContainerWorkDir(containerId, workDir); > } > } > {code} > Then NM restarts, container.workDir is null, and may cause other exceptions. > {code:title=ContainerImpl.java} > static class ResourceLocalizedWhileRunningTransition > extends ContainerTransition { > ... > String linkFile = new Path(container.workDir, link).toString(); > ... > {code} > {code} > java.lang.IllegalArgumentException: Can not create a Path from a null string > at org.apache.hadoop.fs.Path.checkPathArg(Path.java:159) > at org.apache.hadoop.fs.Path.(Path.java:175) > at org.apache.hadoop.fs.Path.(Path.java:110) > ... ... > {code}
[jira] [Updated] (YARN-6630) Container worker dir could not recover when NM restart
[ https://issues.apache.org/jira/browse/YARN-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6630: Attachment: YARN-6630.001.patch > Container worker dir could not recover when NM restart > -- > > Key: YARN-6630 > URL: https://issues.apache.org/jira/browse/YARN-6630 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang > Attachments: YARN-6630.001.patch > > > When yarn.nodemanager.recovery.enabled is true and ContainerRetryPolicy is > NEVER_RETRY, container worker dir will not be saved in NM state store. > {code:title=ContainerLaunch.java} > ... > private void recordContainerWorkDir(ContainerId containerId, > String workDir) throws IOException{ > container.setWorkDir(workDir); > if (container.isRetryContextSet()) { > context.getNMStateStore().storeContainerWorkDir(containerId, workDir); > } > } > {code} > Then NM restarts, container.workDir is null, and may cause other exceptions. > {code:title=ContainerImpl.java} > static class ResourceLocalizedWhileRunningTransition > extends ContainerTransition { > ... > String linkFile = new Path(container.workDir, link).toString(); > ... > {code} > {code} > java.lang.IllegalArgumentException: Can not create a Path from a null string > at org.apache.hadoop.fs.Path.checkPathArg(Path.java:159) > at org.apache.hadoop.fs.Path.(Path.java:175) > at org.apache.hadoop.fs.Path.(Path.java:110) > ... ... > {code}
[jira] [Commented] (YARN-6630) Container worker dir could not recover when NM restart
[ https://issues.apache.org/jira/browse/YARN-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024097#comment-16024097 ] Yang Wang commented on YARN-6630: - Hi, [~jianhe], could you help review the patch? We have already hit this problem: after an NM restart, we sent a resource localization request while the container was running (YARN-1503), and the NM failed because of the following exception. Also, anywhere that uses *container.workDir* may cause a NullPointerException. {code} java.lang.IllegalArgumentException: Can not create a Path from a null string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:159) at org.apache.hadoop.fs.Path.(Path.java:175) at org.apache.hadoop.fs.Path.(Path.java:110) ... ... {code} > Container worker dir could not recover when NM restart > -- > > Key: YARN-6630 > URL: https://issues.apache.org/jira/browse/YARN-6630 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang > Attachments: YARN-6630.001.patch > > > When yarn.nodemanager.recovery.enabled is true and ContainerRetryPolicy is > NEVER_RETRY, container worker dir will not be saved in NM state store. > {code:title=ContainerLaunch.java} > ... > private void recordContainerWorkDir(ContainerId containerId, > String workDir) throws IOException{ > container.setWorkDir(workDir); > if (container.isRetryContextSet()) { > context.getNMStateStore().storeContainerWorkDir(containerId, workDir); > } > } > {code} > Then NM restarts, container.workDir is null, and may cause other exceptions. > {code:title=ContainerImpl.java} > static class ResourceLocalizedWhileRunningTransition > extends ContainerTransition { > ... > String linkFile = new Path(container.workDir, link).toString(); > ... > {code} > {code} > java.lang.IllegalArgumentException: Can not create a Path from a null string > at org.apache.hadoop.fs.Path.checkPathArg(Path.java:159) > at org.apache.hadoop.fs.Path.(Path.java:175) > at org.apache.hadoop.fs.Path.(Path.java:110) > ... ...
> {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6630) Container worker dir could not recover when NM restart
[ https://issues.apache.org/jira/browse/YARN-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025952#comment-16025952 ] Yang Wang commented on YARN-6630: - Yes, yarn.nodemanager.recovery.enabled=true and ContainerRetryPolicy=NEVER_RETRY are not contradictory. My point is that container.workDir always needs to be saved in the NM state store, regardless of the ContainerRetryPolicy.
[jira] [Commented] (YARN-6630) Container worker dir could not recover when NM restart
[ https://issues.apache.org/jira/browse/YARN-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025973#comment-16025973 ] Yang Wang commented on YARN-6630: - Even when ContainerRetryPolicy is NEVER_RETRY, container.workDir still needs to be saved in the NM state store. Otherwise it cannot be recovered and is null after an NM restart. {quote} We have already hit this problem: after an NM restart, sending a resource localization request while the container is running (YARN-1503) makes the NM fail with the following exception. More generally, any code that uses container.workDir may hit a NullPointerException. {code} java.lang.IllegalArgumentException: Can not create a Path from a null string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:159) at org.apache.hadoop.fs.Path.<init>(Path.java:175) at org.apache.hadoop.fs.Path.<init>(Path.java:110) ... ... {code} {quote}
[jira] [Updated] (YARN-6630) Container worker dir could not recover when NM restart
[ https://issues.apache.org/jira/browse/YARN-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6630: Description: When ContainerRetryPolicy is NEVER_RETRY, the container worker dir will not be saved in the NM state store. {code:title=ContainerLaunch.java} ... private void recordContainerWorkDir(ContainerId containerId, String workDir) throws IOException{ container.setWorkDir(workDir); if (container.isRetryContextSet()) { context.getNMStateStore().storeContainerWorkDir(containerId, workDir); } } {code} Then, after the NM restarts, container.workDir cannot be recovered and is null, which may cause other exceptions. We have already hit this problem: after an NM restart, sending a resource localization request while the container is running (YARN-1503) makes the NM fail with the following exception. So container.workDir always needs to be saved in the NM state store. {code:title=ContainerImpl.java} static class ResourceLocalizedWhileRunningTransition extends ContainerTransition { ... String linkFile = new Path(container.workDir, link).toString(); ... {code} {code} java.lang.IllegalArgumentException: Can not create a Path from a null string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:159) at org.apache.hadoop.fs.Path.<init>(Path.java:175) at org.apache.hadoop.fs.Path.<init>(Path.java:110) ... ... {code} was: When yarn.nodemanager.recovery.enabled is true and ContainerRetryPolicy is NEVER_RETRY, the container worker dir will not be saved in the NM state store. {code:title=ContainerLaunch.java} ... private void recordContainerWorkDir(ContainerId containerId, String workDir) throws IOException{ container.setWorkDir(workDir); if (container.isRetryContextSet()) { context.getNMStateStore().storeContainerWorkDir(containerId, workDir); } } {code} Then, after the NM restarts, container.workDir is null, which may cause other exceptions. {code:title=ContainerImpl.java} static class ResourceLocalizedWhileRunningTransition extends ContainerTransition { ... String linkFile = new Path(container.workDir, link).toString(); ... {code} {code} java.lang.IllegalArgumentException: Can not create a Path from a null string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:159) at org.apache.hadoop.fs.Path.<init>(Path.java:175) at org.apache.hadoop.fs.Path.<init>(Path.java:110) ... ... {code}
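The fix described in this issue, persisting the container work dir unconditionally instead of only when a retry context is set, can be sketched with simplified stand-in classes. The names below (WorkDirRecoverySketch, StateStore, recordBuggy, recordFixed) are hypothetical and are not the actual YARN NMStateStoreService API; this is only an illustration of the before/after behavior.

```java
import java.util.HashMap;
import java.util.Map;

public class WorkDirRecoverySketch {

    // Simplified stand-in for the NM state store: persists per-container metadata.
    static class StateStore {
        final Map<String, String> workDirs = new HashMap<>();
        void storeContainerWorkDir(String containerId, String workDir) {
            workDirs.put(containerId, workDir);
        }
        String recoverWorkDir(String containerId) {
            return workDirs.get(containerId);  // null if it was never stored
        }
    }

    // Buggy behavior: the work dir is stored only when a retry context is set,
    // so NEVER_RETRY containers lose it across an NM restart.
    static void recordBuggy(StateStore store, String containerId, String workDir,
                            boolean retryContextSet) {
        if (retryContextSet) {
            store.storeContainerWorkDir(containerId, workDir);
        }
    }

    // Fixed behavior: always store the work dir, independent of the retry policy.
    static void recordFixed(StateStore store, String containerId, String workDir) {
        store.storeContainerWorkDir(containerId, workDir);
    }

    public static void main(String[] args) {
        StateStore buggy = new StateStore();
        recordBuggy(buggy, "container_1", "/data/yarn/nm-local-dir/container_1", false);
        // After a simulated restart, the recovered work dir is null, which is
        // exactly what later makes new Path(container.workDir, link) fail.
        System.out.println("buggy recovered: " + buggy.recoverWorkDir("container_1"));

        StateStore fixed = new StateStore();
        recordFixed(fixed, "container_1", "/data/yarn/nm-local-dir/container_1");
        System.out.println("fixed recovered: " + fixed.recoverWorkDir("container_1"));
    }
}
```

The sketch shows why the bug only bites NEVER_RETRY containers: the store call is gated on the retry context, while recovery after restart needs the work dir for every running container.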
[jira] [Updated] (YARN-6589) Recover all resources when NM restart
[ https://issues.apache.org/jira/browse/YARN-6589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6589: Attachment: YARN-6589.001.patch > Recover all resources when NM restart > - > > Key: YARN-6589 > URL: https://issues.apache.org/jira/browse/YARN-6589 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Yang Wang >Assignee: Yang Wang >Priority: Blocker > Attachments: YARN-6589.001.patch, YARN-6589-YARN-3926.001.patch > > > When the NM restarts, containers are recovered. However, only memory and > vcores in the capability are recovered; all resource types need to be recovered. > {code:title=ContainerImpl.java} > // resource capability had been updated before NM was down > this.resource = > Resource.newInstance(recoveredCapability.getMemorySize(), > recoveredCapability.getVirtualCores()); > {code} > It should be like this: > {code:title=ContainerImpl.java} > // resource capability had been updated before NM was down > // need to recover all resource types, not only memory and vcores > this.resource = Resources.clone(recoveredCapability); > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6589) Recover all resources when NM restart
[ https://issues.apache.org/jira/browse/YARN-6589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172660#comment-16172660 ] Yang Wang commented on YARN-6589: - Thanks for your comment, [~leftnoteasy]. ContainerImpl.java on trunk has since been changed, and I think this bug has been fixed there. I have just updated the test.
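The one-line fix proposed in YARN-6589, cloning the whole recovered capability instead of rebuilding it from memory and vcores, can be illustrated with a simplified stand-in resource type. SimpleResource and the method names here are hypothetical, not the real org.apache.hadoop.yarn.api.records.Resource or Resources API; the sketch only demonstrates why copying two fields silently drops extended resource types.

```java
import java.util.HashMap;
import java.util.Map;

public class ResourceRecoverySketch {

    // Simplified stand-in for a YARN resource capability: named types -> amounts.
    static class SimpleResource {
        final Map<String, Long> values = new HashMap<>();
        SimpleResource set(String name, long amount) {
            values.put(name, amount);
            return this;
        }
        long get(String name) {
            return values.getOrDefault(name, 0L);
        }
    }

    // Buggy recovery: rebuilds the capability from memory and vcores only,
    // silently dropping any extended resource types (GPUs, FPGAs, ...).
    static SimpleResource recoverBuggy(SimpleResource recovered) {
        return new SimpleResource()
                .set("memory-mb", recovered.get("memory-mb"))
                .set("vcores", recovered.get("vcores"));
    }

    // Fixed recovery: copy every resource type, analogous in spirit to
    // Resources.clone(recoveredCapability) in the patch.
    static SimpleResource recoverFixed(SimpleResource recovered) {
        SimpleResource clone = new SimpleResource();
        clone.values.putAll(recovered.values);
        return clone;
    }

    public static void main(String[] args) {
        SimpleResource capability = new SimpleResource()
                .set("memory-mb", 4096).set("vcores", 2).set("yarn.io/gpu", 1);
        System.out.println("buggy gpu: " + recoverBuggy(capability).get("yarn.io/gpu")); // 0, lost
        System.out.println("fixed gpu: " + recoverFixed(capability).get("yarn.io/gpu")); // 1
    }
}
```

Cloning the full map keeps recovery future-proof: any resource type added later is preserved across an NM restart without touching the recovery path again.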
[jira] [Updated] (YARN-6630) Container worker dir could not recover when NM restart
[ https://issues.apache.org/jira/browse/YARN-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Wang updated YARN-6630: Attachment: YARN-6630.002.patch > Container worker dir could not recover when NM restart > -- > > Key: YARN-6630 > URL: https://issues.apache.org/jira/browse/YARN-6630 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yang Wang >Assignee: Yang Wang > Attachments: YARN-6630.001.patch, YARN-6630.002.patch
[jira] [Commented] (YARN-6630) Container worker dir could not recover when NM restart
[ https://issues.apache.org/jira/browse/YARN-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172698#comment-16172698 ] Yang Wang commented on YARN-6630: - Thanks for your comments, [~djp]. I have updated the patch and rebased it on trunk.