[jira] [Commented] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport and UserInfo should based on total-used-resources

2015-07-22 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637452#comment-14637452
 ] 

Bibin A Chundatt commented on YARN-3932:


Thanks [~leftnoteasy] for the review and commit.

 SchedulerApplicationAttempt#getResourceUsageReport and UserInfo should based 
 on total-used-resources
 

 Key: YARN-3932
 URL: https://issues.apache.org/jira/browse/YARN-3932
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Fix For: 2.8.0

 Attachments: 0001-YARN-3932.patch, 0002-YARN-3932.patch, 
 0003-YARN-3932.patch, 0004-YARN-3932.patch, 0005-YARN-3932.patch, 
 0006-YARN-3932.patch, ApplicationReport.jpg, TestResult.jpg


 Application Resource Report is shown wrong when a node label is used.
 1. Submit an application with a NodeLabel
 2. Check the RM UI for resources used
 Allocated CPU VCores and Allocated Memory MB are always {{zero}}
 {code}
   public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
     AggregateAppResourceUsage runningResourceUsage =
         getRunningAggregateAppResourceUsage();
     Resource usedResourceClone =
         Resources.clone(attemptResourceUsage.getUsed());
     Resource reservedResourceClone =
         Resources.clone(attemptResourceUsage.getReserved());
     return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
         reservedContainers.size(), usedResourceClone, reservedResourceClone,
         Resources.add(usedResourceClone, reservedResourceClone),
         runningResourceUsage.getMemorySeconds(),
         runningResourceUsage.getVcoreSeconds());
   }
 {code}
 should be {{attemptResourceUsage.getUsed(label)}}
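A minimal sketch of the direction suggested above: read usage per partition (or aggregate over every partition the attempt uses) instead of only the default partition. This is illustrative only; {{getNodePartitionsSet()}} is an assumed accessor here, not the committed fix:
{code}
// Illustrative: sum used/reserved resources over every partition the attempt
// has usage in, rather than attemptResourceUsage.getUsed() (default partition).
Resource usedResourceClone = Resources.createResource(0, 0);
Resource reservedResourceClone = Resources.createResource(0, 0);
for (String partition : attemptResourceUsage.getNodePartitionsSet()) {
  Resources.addTo(usedResourceClone, attemptResourceUsage.getUsed(partition));
  Resources.addTo(reservedResourceClone,
      attemptResourceUsage.getReserved(partition));
}
{code}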



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3971) Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery

2015-07-24 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-3971:
--

 Summary: Skip 
RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel 
recovery
 Key: YARN-3971
 URL: https://issues.apache.org/jira/browse/YARN-3971
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical


Steps to reproduce
# Create labels x,y
# Delete labels x,y
# Create labels x,y again, and also add the capacity-scheduler xml configuration for labels x and y
# Restart RM

Both RMs will become Standby, since the below exception is thrown from {{FileSystemNodeLabelsStore#recover}}:
{code}
2015-07-23 14:03:33,627 INFO org.apache.hadoop.service.AbstractService: Service 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in state 
STARTED; cause: java.io.IOException: Cannot remove label=x, because queue=a1 is 
using this label. Please remove label on queue before remove the label
java.io.IOException: Cannot remove label=x, because queue=a1 is using this 
label. Please remove label on queue before remove the label
at 
org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.checkRemoveFromClusterNodeLabelsOfQueue(RMNodeLabelsManager.java:104)
at 
org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.removeFromClusterNodeLabels(RMNodeLabelsManager.java:118)
at 
org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:221)
at 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:232)
at 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:245)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at 
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:587)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:964)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1005)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1001)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1001)
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:312)
at 
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
at 
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:832)
at 
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:422)
at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)

{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3971) Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery

2015-07-24 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3971:
---
Attachment: 0001-YARN-3971.patch

{quote}
1) Don't check {{checkRemoveFromClusterNodeLabelsOfQueue}} when replaying edit 
logs
2) To do 1), you may need to create a local flag in CommonNodeLabelsManager. 
When doing {{initNodeLabelStore}}, the flag is true to indicate it's recovering.
{quote}


[~leftnoteasy] Handled recovery as per your offline comments.
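A rough sketch of the flag-based approach described in the quote above. The flag name and its exact placement are illustrative assumptions, not necessarily what the attached patch does:
{code}
// In CommonNodeLabelsManager (illustrative):
private boolean duringRecovery = false;   // hypothetical flag name

protected void initNodeLabelStore(Configuration conf) throws Exception {
  duringRecovery = true;
  try {
    store.recover();   // replay of the node-label mirror/edit log
  } finally {
    duringRecovery = false;
  }
}

// RMNodeLabelsManager#removeFromClusterNodeLabels can then skip the queue
// check while replaying:
if (!duringRecovery) {
  checkRemoveFromClusterNodeLabelsOfQueue(labelsToRemove);
}
{code}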


 Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel 
 recovery
 --

 Key: YARN-3971
 URL: https://issues.apache.org/jira/browse/YARN-3971
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
 Attachments: 0001-YARN-3971.patch


 Steps to reproduce
 # Create labels x,y
 # Delete labels x,y
 # Create labels x,y again, and also add the capacity-scheduler xml configuration for labels x and y
 # Restart RM

 Both RMs will become Standby, since the below exception is thrown from {{FileSystemNodeLabelsStore#recover}}:
 {code}
 2015-07-23 14:03:33,627 INFO org.apache.hadoop.service.AbstractService: 
 Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in 
 state STARTED; cause: java.io.IOException: Cannot remove label=x, because 
 queue=a1 is using this label. Please remove label on queue before remove the 
 label
 java.io.IOException: Cannot remove label=x, because queue=a1 is using this 
 label. Please remove label on queue before remove the label
 at 
 org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.checkRemoveFromClusterNodeLabelsOfQueue(RMNodeLabelsManager.java:104)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.removeFromClusterNodeLabels(RMNodeLabelsManager.java:118)
 at 
 org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:221)
 at 
 org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:232)
 at 
 org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:245)
 at 
 org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
 at 
 org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:587)
 at 
 org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:964)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1005)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1001)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1001)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:312)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:832)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:422)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3940) Application moveToQueue should check NodeLabel permission

2015-07-23 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3940:
---
Attachment: 0001-YARN-3940.patch

[~leftnoteasy] Initial patch attached for review.
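The check being proposed, roughly: before moving the application, validate that the target queue can access the application's node-label expression. This is a sketch under assumptions (the variable names and exact hook point are illustrative, not the attached patch):
{code}
// Illustrative: reject moveApplicationAcrossQueues when the target queue
// cannot access the app's node-label expression.
String labelExpression =
    rmApp.getApplicationSubmissionContext().getNodeLabelExpression();
Set<String> targetQueueLabels = targetQueue.getAccessibleNodeLabels();
if (labelExpression != null && !labelExpression.isEmpty()
    && !targetQueueLabels.contains(RMNodeLabelsManager.ANY)
    && !targetQueueLabels.contains(labelExpression)) {
  throw new YarnException("Cannot move application to queue "
      + targetQueue.getQueueName()
      + ": queue does not have access to label " + labelExpression);
}
{code}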

 Application moveToQueue should check NodeLabel permission 
 --

 Key: YARN-3940
 URL: https://issues.apache.org/jira/browse/YARN-3940
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: 0001-YARN-3940.patch


 Configure capacity scheduler
 Configure node labels and submit an application with {{queue=A Label=X}}
 Move the application to queue {{B}}, which does not have access to label x
 {code}
 2015-07-20 19:46:19,626 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
  Application attempt appattempt_1437385548409_0005_01 released container 
 container_e08_1437385548409_0005_01_02 on node: host: 
 host-10-19-92-117:64318 #containers=1 available=memory:2560, vCores:15 
 used=memory:512, vCores:1 with event: KILL
 2015-07-20 19:46:20,970 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
 Invalid resource ask by application appattempt_1437385548409_0005_01
 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
 resource request, queue=b1 doesn't have permission to access all labels in 
 resource request. labelExpression of resource request=x. Queue labels=y
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:250)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:106)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:515)
 at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
 at 
 org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168)
 {code}
 The same exception will be thrown until *heartbeat timeout*,
 and then the application state will be updated to *FAILED*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3940) Application moveToQueue should check NodeLabel permission

2015-07-26 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3940:
---
Attachment: 0002-YARN-3940.patch

Attaching patch after handling testcase failure

 Application moveToQueue should check NodeLabel permission 
 --

 Key: YARN-3940
 URL: https://issues.apache.org/jira/browse/YARN-3940
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: 0001-YARN-3940.patch, 0002-YARN-3940.patch


 Configure capacity scheduler
 Configure node labels and submit an application with {{queue=A Label=X}}
 Move the application to queue {{B}}, which does not have access to label x
 {code}
 2015-07-20 19:46:19,626 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
  Application attempt appattempt_1437385548409_0005_01 released container 
 container_e08_1437385548409_0005_01_02 on node: host: 
 host-10-19-92-117:64318 #containers=1 available=memory:2560, vCores:15 
 used=memory:512, vCores:1 with event: KILL
 2015-07-20 19:46:20,970 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
 Invalid resource ask by application appattempt_1437385548409_0005_01
 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
 resource request, queue=b1 doesn't have permission to access all labels in 
 resource request. labelExpression of resource request=x. Queue labels=y
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:250)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:106)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:515)
 at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
 at 
 org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168)
 {code}
 The same exception will be thrown until *heartbeat timeout*,
 and then the application state will be updated to *FAILED*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3963) AddNodeLabel on duplicate label addition shows success

2015-07-26 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3963:
---
Attachment: 0002-YARN-3963.patch

Attaching patch with testcase for review

 AddNodeLabel on duplicate label addition shows success 
 ---

 Key: YARN-3963
 URL: https://issues.apache.org/jira/browse/YARN-3963
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor
 Attachments: 0001-YARN-3963.patch, 0002-YARN-3963.patch


 Currently, as per the code in {{CommonNodeLabelManager#addToClusterNodeLabels}},
 when we add the same nodelabel again no event is fired, so no update is done.

 {noformat}
 ./yarn rmadmin -addToClusterNodeLabels x
 ./yarn rmadmin -addToClusterNodeLabels "x(exclusive=true)"
 ./yarn rmadmin -addToClusterNodeLabels "x(exclusive=false)"
 {noformat}
 All these commands report success when applied again through the CLI.
  
 {code}
 2015-07-22 21:16:57,779 INFO 
 org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 
 [z:exclusivity=true]
 2015-07-22 21:16:57,779 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
 IP=10.19.92.117 OPERATION=addToClusterNodeLabelsTARGET=AdminService   
   RESULT=SUCCESS
 2015-07-22 21:17:06,431 INFO 
 org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 
 [z:exclusivity=false]
 2015-07-22 21:17:06,431 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
 IP=IP OPERATION=addToClusterNodeLabelsTARGET=AdminService 
 RESULT=SUCCESS
  {code}
 Also, since changing exclusive=true to false is not supported, reporting success is misleading.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3838) Rest API failing when ip configured in RM address in secure https mode

2015-07-21 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635534#comment-14635534
 ] 

Bibin A Chundatt commented on YARN-3838:


Hi [~xgong], any comments on this?

 Rest API failing when ip configured in RM address in secure https mode
 --

 Key: YARN-3838
 URL: https://issues.apache.org/jira/browse/YARN-3838
 Project: Hadoop YARN
  Issue Type: Bug
  Components: webapp
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
 Attachments: 0001-HADOOP-12096.patch, 0001-YARN-3810.patch, 
 0001-YARN-3838.patch, 0002-YARN-3810.patch, 0002-YARN-3838.patch


 Steps to reproduce
 ===
 1. Configure hadoop.http.authentication.kerberos.principal as below
 {code:xml}
   <property>
     <name>hadoop.http.authentication.kerberos.principal</name>
     <value>HTTP/_h...@hadoop.com</value>
   </property>
 {code}
 2. In the RM web address, also configure the IP
 3. Start up the RM
 Call the REST API for the RM: {{curl -i -k --insecure --negotiate -u : https://<IP>/ws/v1/cluster/info}}
 *Actual*
 REST API failing
 {code}
 2015-06-16 19:03:49,845 DEBUG 
 org.apache.hadoop.security.authentication.server.AuthenticationFilter: 
 Authentication exception: GSSException: No valid credentials provided 
 (Mechanism level: Failed to find any Kerberos credentails)
 org.apache.hadoop.security.authentication.client.AuthenticationException: 
 GSSException: No valid credentials provided (Mechanism level: Failed to find 
 any Kerberos credentails)
   at 
 org.apache.hadoop.security.authentication.server.KerberosAuthenticationHandler.authenticate(KerberosAuthenticationHandler.java:399)
   at 
 org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationHandler.authenticate(DelegationTokenAuthenticationHandler.java:348)
   at 
 org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:519)
   at 
 org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:82)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel

2015-07-21 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3932:
---
Attachment: 0006-YARN-3932.patch

Hi [~leftnoteasy], attaching the patch again to retrigger CI and for review.

 SchedulerApplicationAttempt#getResourceUsageReport should be based on 
 NodeLabel
 ---

 Key: YARN-3932
 URL: https://issues.apache.org/jira/browse/YARN-3932
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: 0001-YARN-3932.patch, 0002-YARN-3932.patch, 
 0003-YARN-3932.patch, 0004-YARN-3932.patch, 0005-YARN-3932.patch, 
 0006-YARN-3932.patch, ApplicationReport.jpg, TestResult.jpg


 Application Resource Report is shown wrong when a node label is used.
 1. Submit an application with a NodeLabel
 2. Check the RM UI for resources used
 Allocated CPU VCores and Allocated Memory MB are always {{zero}}
 {code}
   public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
     AggregateAppResourceUsage runningResourceUsage =
         getRunningAggregateAppResourceUsage();
     Resource usedResourceClone =
         Resources.clone(attemptResourceUsage.getUsed());
     Resource reservedResourceClone =
         Resources.clone(attemptResourceUsage.getReserved());
     return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
         reservedContainers.size(), usedResourceClone, reservedResourceClone,
         Resources.add(usedResourceClone, reservedResourceClone),
         runningResourceUsage.getMemorySeconds(),
         runningResourceUsage.getVcoreSeconds());
   }
 {code}
 should be {{attemptResourceUsage.getUsed(label)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel

2015-07-21 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635727#comment-14635727
 ] 

Bibin A Chundatt commented on YARN-3932:


[~leftnoteasy] The checkstyle issue is a pre-existing one and the test failures are 
unrelated to this patch.

 SchedulerApplicationAttempt#getResourceUsageReport should be based on 
 NodeLabel
 ---

 Key: YARN-3932
 URL: https://issues.apache.org/jira/browse/YARN-3932
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: 0001-YARN-3932.patch, 0002-YARN-3932.patch, 
 0003-YARN-3932.patch, 0004-YARN-3932.patch, 0005-YARN-3932.patch, 
 0006-YARN-3932.patch, ApplicationReport.jpg, TestResult.jpg


 Application Resource Report is shown wrong when a node label is used.
 1. Submit an application with a NodeLabel
 2. Check the RM UI for resources used
 Allocated CPU VCores and Allocated Memory MB are always {{zero}}
 {code}
   public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
     AggregateAppResourceUsage runningResourceUsage =
         getRunningAggregateAppResourceUsage();
     Resource usedResourceClone =
         Resources.clone(attemptResourceUsage.getUsed());
     Resource reservedResourceClone =
         Resources.clone(attemptResourceUsage.getReserved());
     return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
         reservedContainers.size(), usedResourceClone, reservedResourceClone,
         Resources.add(usedResourceClone, reservedResourceClone),
         runningResourceUsage.getMemorySeconds(),
         runningResourceUsage.getVcoreSeconds());
   }
 {code}
 should be {{attemptResourceUsage.getUsed(label)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3971) Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery

2015-07-25 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3971:
---
Attachment: 0002-YARN-3971.patch

Attaching updated patch with testcase.
[~leftnoteasy] Please review the attached patch.

 Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel 
 recovery
 --

 Key: YARN-3971
 URL: https://issues.apache.org/jira/browse/YARN-3971
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
 Attachments: 0001-YARN-3971.patch, 0002-YARN-3971.patch


 Steps to reproduce
 # Create labels x,y
 # Delete labels x,y
 # Create labels x,y again, and also add the capacity-scheduler xml configuration for labels x and y
 # Restart RM

 Both RMs will become Standby, since the below exception is thrown from {{FileSystemNodeLabelsStore#recover}}:
 {code}
 2015-07-23 14:03:33,627 INFO org.apache.hadoop.service.AbstractService: 
 Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in 
 state STARTED; cause: java.io.IOException: Cannot remove label=x, because 
 queue=a1 is using this label. Please remove label on queue before remove the 
 label
 java.io.IOException: Cannot remove label=x, because queue=a1 is using this 
 label. Please remove label on queue before remove the label
 at 
 org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.checkRemoveFromClusterNodeLabelsOfQueue(RMNodeLabelsManager.java:104)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.removeFromClusterNodeLabels(RMNodeLabelsManager.java:118)
 at 
 org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:221)
 at 
 org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:232)
 at 
 org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:245)
 at 
 org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
 at 
 org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:587)
 at 
 org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:964)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1005)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1001)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1001)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:312)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:832)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:422)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3971) Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery

2015-07-25 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3971:
---
Attachment: 0003-YARN-3971.patch

The testcase failure is unrelated; verified locally that the testcase passes.
Fixed checkstyle.


 Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel 
 recovery
 --

 Key: YARN-3971
 URL: https://issues.apache.org/jira/browse/YARN-3971
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
 Attachments: 0001-YARN-3971.patch, 0002-YARN-3971.patch, 
 0003-YARN-3971.patch


 Steps to reproduce
 # Create labels x,y
 # Delete labels x,y
 # Create labels x,y again, and also add the capacity-scheduler xml configuration for labels x and y
 # Restart RM

 Both RMs will become Standby, since the below exception is thrown from {{FileSystemNodeLabelsStore#recover}}:
 {code}
 2015-07-23 14:03:33,627 INFO org.apache.hadoop.service.AbstractService: 
 Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in 
 state STARTED; cause: java.io.IOException: Cannot remove label=x, because 
 queue=a1 is using this label. Please remove label on queue before remove the 
 label
 java.io.IOException: Cannot remove label=x, because queue=a1 is using this 
 label. Please remove label on queue before remove the label
 at 
 org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.checkRemoveFromClusterNodeLabelsOfQueue(RMNodeLabelsManager.java:104)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.removeFromClusterNodeLabels(RMNodeLabelsManager.java:118)
 at 
 org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:221)
 at 
 org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:232)
 at 
 org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:245)
 at 
 org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
 at 
 org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:587)
 at 
 org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:964)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1005)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1001)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1001)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:312)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:832)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:422)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3884) RMContainerImpl transition from RESERVED to KILL apphistory status not updated

2015-07-25 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3884:
---
Component/s: timelineserver

 RMContainerImpl transition from RESERVED to KILL apphistory status not updated
 --

 Key: YARN-3884
 URL: https://issues.apache.org/jira/browse/YARN-3884
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
 Environment: Suse11 Sp3
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: 0001-YARN-3884.patch, Apphistory Container Status.jpg, 
 Elapsed Time.jpg, Test Result-Container status.jpg


 Setup
 ===
 1 NM 3072 16 cores each
 Steps to reproduce
 ===
 1. Submit apps to Queue 1 with 512 MB, 1 core
 2. Submit apps to Queue 2 with 512 MB and 5 cores
 Lots of containers get reserved and unreserved in this case
 {code}
 2015-07-02 20:45:31,169 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
 container_e24_1435849994778_0002_01_13 Container Transitioned from NEW to 
 RESERVED
 2015-07-02 20:45:31,170 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 Reserved container  application=application_1435849994778_0002 
 resource=memory:512, vCores:5 queue=QueueA: capacity=0.4, 
 absoluteCapacity=0.4, usedResources=memory:2560, vCores:21, 
 usedCapacity=1.6410257, absoluteUsedCapacity=0.65625, numApps=1, 
 numContainers=5 usedCapacity=1.6410257 absoluteUsedCapacity=0.65625 
 used=memory:2560, vCores:21 cluster=memory:6144, vCores:32
 2015-07-02 20:45:31,170 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
 Re-sorting assigned queue: root.QueueA stats: QueueA: capacity=0.4, 
 absoluteCapacity=0.4, usedResources=memory:3072, vCores:26, 
 usedCapacity=2.0317461, absoluteUsedCapacity=0.8125, numApps=1, 
 numContainers=6
 2015-07-02 20:45:31,170 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
 assignedContainer queue=root usedCapacity=0.96875 
 absoluteUsedCapacity=0.96875 used=memory:5632, vCores:31 
 cluster=memory:6144, vCores:32
 2015-07-02 20:45:31,191 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
 container_e24_1435849994778_0001_01_14 Container Transitioned from NEW to 
 ALLOCATED
 2015-07-02 20:45:31,191 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
 OPERATION=AM Allocated ContainerTARGET=SchedulerApp 
 RESULT=SUCCESS  APPID=application_1435849994778_0001
 CONTAINERID=container_e24_1435849994778_0001_01_14
 2015-07-02 20:45:31,191 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: 
 Assigned container container_e24_1435849994778_0001_01_14 of capacity 
 memory:512, vCores:1 on host host-10-19-92-117:64318, which has 6 
 containers, memory:3072, vCores:14 used and memory:0, vCores:2 available 
 after allocation
 2015-07-02 20:45:31,191 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 assignedContainer application attempt=appattempt_1435849994778_0001_01 
 container=Container: [ContainerId: 
 container_e24_1435849994778_0001_01_14, NodeId: host-10-19-92-117:64318, 
 NodeHttpAddress: host-10-19-92-117:65321, Resource: memory:512, vCores:1, 
 Priority: 20, Token: null, ] queue=default: capacity=0.2, 
 absoluteCapacity=0.2, usedResources=memory:2560, vCores:5, 
 usedCapacity=2.0846906, absoluteUsedCapacity=0.4166, numApps=1, 
 numContainers=5 clusterResource=memory:6144, vCores:32
 2015-07-02 20:45:31,191 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
 Re-sorting assigned queue: root.default stats: default: capacity=0.2, 
 absoluteCapacity=0.2, usedResources=memory:3072, vCores:6, 
 usedCapacity=2.5016286, absoluteUsedCapacity=0.5, numApps=1, numContainers=6
 2015-07-02 20:45:31,191 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
 assignedContainer queue=root usedCapacity=1.0 absoluteUsedCapacity=1.0 
 used=memory:6144, vCores:32 cluster=memory:6144, vCores:32
 2015-07-02 20:45:32,143 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
 container_e24_1435849994778_0001_01_14 Container Transitioned from 
 ALLOCATED to ACQUIRED
 2015-07-02 20:45:32,174 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
  Trying to fulfill reservation for application application_1435849994778_0002 
 on node: host-10-19-92-143:64318
 2015-07-02 20:45:32,174 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 Reserved container  application=application_1435849994778_0002 
 

[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()

2015-07-25 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641743#comment-14641743
 ] 

Bibin A Chundatt commented on YARN-3893:


{quote}
Instead of checking for exception message in test, can you check for 
ServiceFailedException
{quote}
The same is already verified in many testcases using messages.

{quote}
Can you add a verification in the test to check whether active services were 
stopped ?
{quote}

IMO it's not required.
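For reference, the kind of assertion being discussed might look roughly like the fragment below. This is a hypothetical sketch (the {{adminService}} handle and request object come from the surrounding test setup), not code from the attached patch:
{code}
// Hypothetical test fragment: transitionToActive should surface a
// ServiceFailedException when refreshAll() fails, and the RM should not
// end up in the Active state.
try {
  adminService.transitionToActive(requestInfo);
  Assert.fail("transitionToActive should have thrown ServiceFailedException");
} catch (ServiceFailedException e) {
  // expected
}
Assert.assertEquals(HAServiceProtocol.HAServiceState.STANDBY,
    adminService.getServiceStatus().getState());
{code}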

 Both RM in active state when Admin#transitionToActive failure from refeshAll()
 --

 Key: YARN-3893
 URL: https://issues.apache.org/jira/browse/YARN-3893
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
 Attachments: 0001-YARN-3893.patch, 0002-YARN-3893.patch, 
 0003-YARN-3893.patch, 0004-YARN-3893.patch, yarn-site.xml


 Cases that can cause this:
 # Capacity scheduler xml is wrongly configured during switch
 # Refresh ACL failure due to configuration
 # Refresh User group failure due to configuration
 Continuously, both RMs will try to become active
 {code}
 dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin
  ./yarn rmadmin  -getServiceState rm1
 15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 active
 dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin
  ./yarn rmadmin  -getServiceState rm2
 15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 active
 {code}
 # Both Web UI active
 # Status shown as active for both RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3963) AddNodeLabel on duplicate label addition shows success

2015-07-23 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3963:
---
Description: 
Currently, as per the code in {{CommonNodeLabelManager#addToClusterNodeLabels}},
when we add the same nodelabel again no event is fired, so no update is done.

{noformat}
./yarn rmadmin -addToClusterNodeLabels x
./yarn rmadmin -addToClusterNodeLabels "x(exclusive=true)"
./yarn rmadmin -addToClusterNodeLabels "x(exclusive=false)"
{noformat}

All these commands report success when applied again through the CLI.
 
{code}
2015-07-22 21:16:57,779 INFO 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 
[z:exclusivity=true]
2015-07-22 21:16:57,779 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
IP=10.19.92.117 OPERATION=addToClusterNodeLabelsTARGET=AdminService 
RESULT=SUCCESS
2015-07-22 21:17:06,431 INFO 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 
[z:exclusivity=false]
2015-07-22 21:17:06,431 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
IP=IP OPERATION=addToClusterNodeLabelsTARGET=AdminService 
RESULT=SUCCESS
 {code}

Also, since changing exclusive=true to false is not supported, reporting success is misleading.

  was:
Currently as per the code in {{CommonNodeLabelManager#addToClusterNodeLabels}} 
when we add same nodelabel again event will not be fired so no updation is 
done. 
 
{noformat}
./yarn rmadmin –addToClusterNodeLabels x
./yarn rmadmin –addToClusterNodeLabels “x(exclusive=true)”
./yarn rmadmin –addToClusterNodeLabels “x(exclusive=false)”
 {noformat}


All these commands will give success when applied again through CLI again
 
{code}
2015-07-22 21:16:57,779 INFO 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 
[z:exclusivity=true]
2015-07-22 21:16:57,779 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
IP=10.19.92.117 OPERATION=addToClusterNodeLabelsTARGET=AdminService 
RESULT=SUCCESS
2015-07-22 21:17:06,431 INFO 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 
[z:exclusivity=false]
2015-07-22 21:17:06,431 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
IP=IP OPERATION=addToClusterNodeLabelsTARGET=AdminService 
RESULT=SUCCESS
 {code}





 AddNodeLabel on duplicate label addition shows success 
 ---

 Key: YARN-3963
 URL: https://issues.apache.org/jira/browse/YARN-3963
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor

 Currently, as per the code in {{CommonNodeLabelManager#addToClusterNodeLabels}},
 when we add the same nodelabel again no event is fired, so no update is done.

 {noformat}
 ./yarn rmadmin -addToClusterNodeLabels x
 ./yarn rmadmin -addToClusterNodeLabels "x(exclusive=true)"
 ./yarn rmadmin -addToClusterNodeLabels "x(exclusive=false)"
 {noformat}
 All these commands report success when applied again through the CLI.
  
 {code}
 2015-07-22 21:16:57,779 INFO 
 org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 
 [z:exclusivity=true]
 2015-07-22 21:16:57,779 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
 IP=10.19.92.117 OPERATION=addToClusterNodeLabelsTARGET=AdminService   
   RESULT=SUCCESS
 2015-07-22 21:17:06,431 INFO 
 org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 
 [z:exclusivity=false]
 2015-07-22 21:17:06,431 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
 IP=IP OPERATION=addToClusterNodeLabelsTARGET=AdminService 
 RESULT=SUCCESS
  {code}
 Also, since changing exclusive=true to false is not supported, reporting success is misleading.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3963) AddNodeLabel on duplicate label addition shows success

2015-07-23 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638238#comment-14638238
 ] 

Bibin A Chundatt commented on YARN-3963:


[~sunilg]
{quote}
I feel it is better to throw back an error from NodeLabelManager when a 
duplication occurs.
{quote}
I do support this. Any impact in distributed mode?

 AddNodeLabel on duplicate label addition shows success 
 ---

 Key: YARN-3963
 URL: https://issues.apache.org/jira/browse/YARN-3963
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor

 Currently, as per the code in {{CommonNodeLabelManager#addToClusterNodeLabels}},
 when we add the same nodelabel again no event is fired, so no update is done.

 {noformat}
 ./yarn rmadmin -addToClusterNodeLabels x
 ./yarn rmadmin -addToClusterNodeLabels "x(exclusive=true)"
 ./yarn rmadmin -addToClusterNodeLabels "x(exclusive=false)"
 {noformat}
 All these commands report success when applied again through the CLI.
  
 {code}
 2015-07-22 21:16:57,779 INFO 
 org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 
 [z:exclusivity=true]
 2015-07-22 21:16:57,779 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
 IP=10.19.92.117 OPERATION=addToClusterNodeLabelsTARGET=AdminService   
   RESULT=SUCCESS
 2015-07-22 21:17:06,431 INFO 
 org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 
 [z:exclusivity=false]
 2015-07-22 21:17:06,431 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
 IP=IP OPERATION=addToClusterNodeLabelsTARGET=AdminService 
 RESULT=SUCCESS
  {code}
 Also, since changing exclusive=true to false is not supported, reporting success is misleading.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3963) AddNodeLabel on duplicate label addition shows success

2015-07-22 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638211#comment-14638211
 ] 

Bibin A Chundatt commented on YARN-3963:


Exclusivity update for a label from true to false is not supported.
IMO we can handle this in two ways:

# Add validation in {{CommonNodeLabelsManager#addToCluserNodeLabels}} when a 
duplicate is added, and report an exception to the console (see the sketch after this comment)
# Improve the logs in RM as below

{code}
if (null != dispatcher && !newLabels.isEmpty()) {
  dispatcher.getEventHandler().handle(
      new StoreNewClusterNodeLabels(newLabels));
  LOG.info("Add labels: [" + StringUtils.join(labels.iterator(), ",") + "]");
} else {
  LOG.info("Skipped labels: ["
      + StringUtils.join(skippedlabels.iterator(), ",") + "]");
}
{code}

Any comments?
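For option 1, the validation could look roughly like the following. This is a sketch only; the lookup collection and accessor names are assumptions, not the eventual patch:
{code}
// Illustrative: reject an add request when a label with the same name already
// exists but with different exclusivity.
for (NodeLabel label : labels) {
  NodeLabel existing = existingLabels.get(label.getName());  // hypothetical lookup
  if (existing != null && existing.isExclusive() != label.isExclusive()) {
    throw new IOException("Label " + label.getName() + " already exists with"
        + " exclusivity=" + existing.isExclusive()
        + "; changing exclusivity is not supported");
  }
}
{code}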

 AddNodeLabel on duplicate label addition shows success 
 ---

 Key: YARN-3963
 URL: https://issues.apache.org/jira/browse/YARN-3963
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor

 Currently, as per the code in {{CommonNodeLabelManager#addToClusterNodeLabels}},
 when we add the same nodelabel again no event is fired, so no update is done.

 {noformat}
 ./yarn rmadmin -addToClusterNodeLabels x
 ./yarn rmadmin -addToClusterNodeLabels "x(exclusive=true)"
 ./yarn rmadmin -addToClusterNodeLabels "x(exclusive=false)"
 {noformat}
 All these commands report success when applied again through the CLI.
  
 {code}
 2015-07-22 21:16:57,779 INFO 
 org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 
 [z:exclusivity=true]
 2015-07-22 21:16:57,779 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
 IP=10.19.92.117 OPERATION=addToClusterNodeLabelsTARGET=AdminService   
   RESULT=SUCCESS
 2015-07-22 21:17:06,431 INFO 
 org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 
 [z:exclusivity=false]
 2015-07-22 21:17:06,431 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
 IP=IP OPERATION=addToClusterNodeLabelsTARGET=AdminService 
 RESULT=SUCCESS
  {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3963) AddNodeLabel on duplicate label addition shows success

2015-07-22 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3963:
---
Description: 
Currently, as per the code in {{CommonNodeLabelManager#addToClusterNodeLabels}},
when we add the same nodelabel again no event is fired, so no update is done.

{noformat}
./yarn rmadmin -addToClusterNodeLabels x
./yarn rmadmin -addToClusterNodeLabels "x(exclusive=true)"
./yarn rmadmin -addToClusterNodeLabels "x(exclusive=false)"
{noformat}

All these commands report success when applied again through the CLI.
 
{code}
2015-07-22 21:16:57,779 INFO 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 
[z:exclusivity=true]
2015-07-22 21:16:57,779 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
IP=10.19.92.117 OPERATION=addToClusterNodeLabelsTARGET=AdminService 
RESULT=SUCCESS
2015-07-22 21:17:06,431 INFO 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 
[z:exclusivity=false]
2015-07-22 21:17:06,431 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
IP=IP OPERATION=addToClusterNodeLabelsTARGET=AdminService 
RESULT=SUCCESS
 {code}




  was:
Currently as per the code in {{CommonNodeLabelManager#addToClusterNodeLabels}} 
when we add same nodelabel again event will not be fired so no updation is 
done. 
 
{noformat}
./yarn rmadmin –addToClusterNodeLabels x
./yarn rmadmin –addToClusterNodeLabels “x(exclusive=true)”
./yarn rmadmin –addToClusterNodeLabels “x(exclusive=false)”
 {noformat}


All these commands will give success when applied through CLI again
 
{code}
2015-07-22 21:16:57,779 INFO 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 
[z:exclusivity=true]
2015-07-22 21:16:57,779 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
IP=10.19.92.117 OPERATION=addToClusterNodeLabelsTARGET=AdminService 
RESULT=SUCCESS
2015-07-22 21:17:06,431 INFO 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 
[z:exclusivity=false]
2015-07-22 21:17:06,431 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
IP=IP OPERATION=addToClusterNodeLabelsTARGET=AdminService 
RESULT=SUCCESS
 {code}





 AddNodeLabel on duplicate label addition shows success 
 ---

 Key: YARN-3963
 URL: https://issues.apache.org/jira/browse/YARN-3963
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor

 Currently, as per the code in {{CommonNodeLabelManager#addToClusterNodeLabels}},
 when we add the same nodelabel again no event is fired, so no update is done.

 {noformat}
 ./yarn rmadmin -addToClusterNodeLabels x
 ./yarn rmadmin -addToClusterNodeLabels "x(exclusive=true)"
 ./yarn rmadmin -addToClusterNodeLabels "x(exclusive=false)"
 {noformat}
 All these commands report success when applied again through the CLI.
  
 {code}
 2015-07-22 21:16:57,779 INFO 
 org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 
 [z:exclusivity=true]
 2015-07-22 21:16:57,779 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
 IP=10.19.92.117 OPERATION=addToClusterNodeLabelsTARGET=AdminService   
   RESULT=SUCCESS
 2015-07-22 21:17:06,431 INFO 
 org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 
 [z:exclusivity=false]
 2015-07-22 21:17:06,431 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
 IP=IP OPERATION=addToClusterNodeLabelsTARGET=AdminService 
 RESULT=SUCCESS
  {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3963) AddNodeLabel on duplicate label addition shows success

2015-07-22 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-3963:
--

 Summary: AddNodeLabel on duplicate label addition shows success 
 Key: YARN-3963
 URL: https://issues.apache.org/jira/browse/YARN-3963
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor


Currently, as per the code in {{CommonNodeLabelManager#addToClusterNodeLabels}},
when we add the same nodelabel again no event is fired, so no update is done.

{noformat}
./yarn rmadmin -addToClusterNodeLabels x
./yarn rmadmin -addToClusterNodeLabels "x(exclusive=true)"
./yarn rmadmin -addToClusterNodeLabels "x(exclusive=false)"
{noformat}

All these commands report success when applied again through the CLI.
 
{code}
2015-07-22 21:16:57,779 INFO 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 
[z:exclusivity=true]
2015-07-22 21:16:57,779 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
IP=10.19.92.117 OPERATION=addToClusterNodeLabelsTARGET=AdminService 
RESULT=SUCCESS
2015-07-22 21:17:06,431 INFO 
org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 
[z:exclusivity=false]
2015-07-22 21:17:06,431 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
IP=IP OPERATION=addToClusterNodeLabelsTARGET=AdminService 
RESULT=SUCCESS
 {code}






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3963) AddNodeLabel on duplicate label addition shows success

2015-07-23 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt reassigned YARN-3963:
--

Assignee: Bibin A Chundatt  (was: Sunil G)

 AddNodeLabel on duplicate label addition shows success 
 ---

 Key: YARN-3963
 URL: https://issues.apache.org/jira/browse/YARN-3963
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor

 Currently, as per the code in {{CommonNodeLabelManager#addToClusterNodeLabels}},
 when we add the same nodelabel again no event is fired, so no update is done.

 {noformat}
 ./yarn rmadmin -addToClusterNodeLabels x
 ./yarn rmadmin -addToClusterNodeLabels "x(exclusive=true)"
 ./yarn rmadmin -addToClusterNodeLabels "x(exclusive=false)"
 {noformat}
 All these commands report success when applied again through the CLI.
  
 {code}
 2015-07-22 21:16:57,779 INFO 
 org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 
 [z:exclusivity=true]
 2015-07-22 21:16:57,779 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
 IP=10.19.92.117 OPERATION=addToClusterNodeLabelsTARGET=AdminService   
   RESULT=SUCCESS
 2015-07-22 21:17:06,431 INFO 
 org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager: Add labels: 
 [z:exclusivity=false]
 2015-07-22 21:17:06,431 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
 IP=IP OPERATION=addToClusterNodeLabelsTARGET=AdminService 
 RESULT=SUCCESS
  {code}
 Also, since changing exclusive=true to false is not supported, reporting success is misleading.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel

2015-07-16 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630090#comment-14630090
 ] 

Bibin A Chundatt commented on YARN-3932:


Hi [~leftnoteasy], I think we should iterate over {{liveContainers}} and get the sum of 
resources used. Any thoughts?
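A minimal sketch of that idea, assuming the usual {{liveContainers}} map in SchedulerApplicationAttempt (illustrative only, not a patch):
{code}
// Illustrative: derive current usage by summing the allocated resources of the
// attempt's live containers instead of reading a single-partition counter.
Resource usedResourceClone = Resources.createResource(0, 0);
for (RMContainer rmContainer : liveContainers.values()) {
  Resources.addTo(usedResourceClone, rmContainer.getAllocatedResource());
}
{code}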

 SchedulerApplicationAttempt#getResourceUsageReport should be based on 
 NodeLabel
 ---

 Key: YARN-3932
 URL: https://issues.apache.org/jira/browse/YARN-3932
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: ApplicationReport.jpg


 Application Resource Report is shown wrong when a node label is used.
 1. Submit an application with a NodeLabel
 2. Check the RM UI for resources used
 Allocated CPU VCores and Allocated Memory MB are always {{zero}}
 {code}
   public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
     AggregateAppResourceUsage runningResourceUsage =
         getRunningAggregateAppResourceUsage();
     Resource usedResourceClone =
         Resources.clone(attemptResourceUsage.getUsed());
     Resource reservedResourceClone =
         Resources.clone(attemptResourceUsage.getReserved());
     return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
         reservedContainers.size(), usedResourceClone, reservedResourceClone,
         Resources.add(usedResourceClone, reservedResourceClone),
         runningResourceUsage.getMemorySeconds(),
         runningResourceUsage.getVcoreSeconds());
   }
 {code}
 should be {{attemptResourceUsage.getUsed(label)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3938) AM Resources for leaf queues zero when DEFAULT PARTITION resource is zero with NodeLabel

2015-07-17 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631317#comment-14631317
 ] 

Bibin A Chundatt commented on YARN-3938:


Hi [~leftnoteasy]. As I understand it, 
{{labelManager.getResourceByLabel(RMNodeLabelsManager.NO_LABEL, 
clusterResource)}} will return {{0}}, which is the reason it goes wrong. Please 
correct me if I am wrong. Any thoughts?
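A sketch of one possible direction, i.e. computing the queue's absolute-capacity resource per accessible partition instead of only the DEFAULT partition. The per-partition accessors used here are assumptions for illustration, not an agreed fix:
{code}
// Illustrative: aggregate the absolute-capacity resource over the partitions
// the queue has capacities configured for, so it is non-zero even when the
// DEFAULT partition resource is zero.
Resource absCapacityResource = Resources.createResource(0, 0);
for (String partition : queueCapacities.getExistingNodeLabels()) {
  Resources.addTo(absCapacityResource,
      Resources.multiplyAndNormalizeUp(resourceCalculator,
          labelManager.getResourceByLabel(partition, clusterResource),
          queueCapacities.getAbsoluteCapacity(partition), minimumAllocation));
}
{code}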


 AM Resources for leaf queues zero when DEFAULT PARTITION resource is zero 
 with NodeLabel
 

 Key: YARN-3938
 URL: https://issues.apache.org/jira/browse/YARN-3938
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
 Attachments: Am limit for subqueue.jpg


 In the case of a leaf queue, the AM resource calculation is based on 
 {{absoluteCapacityResource}}. Below is the calculation of the absolute capacity in
 {{LeafQueue#updateAbsoluteCapacityResource()}}
 {code}
   private void updateAbsoluteCapacityResource(Resource clusterResource) {
     absoluteCapacityResource =
         Resources.multiplyAndNormalizeUp(resourceCalculator, labelManager
             .getResourceByLabel(RMNodeLabelsManager.NO_LABEL, clusterResource),
             queueCapacities.getAbsoluteCapacity(), minimumAllocation);
   }
 {code}
 If the DEFAULT partition resource is zero, then for all leaf queues the resource for the AM 
 will be zero.
 A snapshot is also attached for the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel

2015-07-16 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3932:
---
Attachment: ApplicationReport.jpg

 SchedulerApplicationAttempt#getResourceUsageReport should be based on 
 NodeLabel
 ---

 Key: YARN-3932
 URL: https://issues.apache.org/jira/browse/YARN-3932
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: ApplicationReport.jpg


 Application Resource Report shown wrong when node Label is used.
 1.Submit application with NodeLabel
 2.Check RM UI for resources used 
 Allocated CPU VCores and Allocated Memory MB is always {{zero}}
 {code}
  public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
 AggregateAppResourceUsage runningResourceUsage =
 getRunningAggregateAppResourceUsage();
 Resource usedResourceClone =
 Resources.clone(attemptResourceUsage.getUsed());
 Resource reservedResourceClone =
 Resources.clone(attemptResourceUsage.getReserved());
 return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
 reservedContainers.size(), usedResourceClone, reservedResourceClone,
 Resources.add(usedResourceClone, reservedResourceClone),
 runningResourceUsage.getMemorySeconds(),
 runningResourceUsage.getVcoreSeconds());
   }
 {code}
 should be {{attemptResourceUsage.getUsed(label)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel

2015-07-16 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-3932:
--

 Summary: SchedulerApplicationAttempt#getResourceUsageReport should 
be based on NodeLabel
 Key: YARN-3932
 URL: https://issues.apache.org/jira/browse/YARN-3932
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt


Application Resource Report shown wrong when node Label is used.


1.Submit application with NodeLabel
2.Check RM UI for resources used 
Allocated CPU VCores and Allocated Memory MB is always {{zero}}

{code}
 public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
AggregateAppResourceUsage runningResourceUsage =
getRunningAggregateAppResourceUsage();
Resource usedResourceClone =
Resources.clone(attemptResourceUsage.getUsed());
Resource reservedResourceClone =
Resources.clone(attemptResourceUsage.getReserved());
return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
reservedContainers.size(), usedResourceClone, reservedResourceClone,
Resources.add(usedResourceClone, reservedResourceClone),
runningResourceUsage.getMemorySeconds(),
runningResourceUsage.getVcoreSeconds());
  }
{code}
should be {{attemptResourceUsage.getUsed(label)}}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()

2015-07-14 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3893:
---
Attachment: 0001-YARN-3893.patch

[~sunilg], [~varun_saxena] and [~xgong], thanks a lot for the comments.
Please review.

 Both RM in active state when Admin#transitionToActive failure from refeshAll()
 --

 Key: YARN-3893
 URL: https://issues.apache.org/jira/browse/YARN-3893
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
 Attachments: 0001-YARN-3893.patch, yarn-site.xml


 Cases that can cause this.
 # Capacity scheduler xml is wrongly configured during switch
 # Refresh ACL failure due to configuration
 # Refresh User group failure due to configuration
 Continuously both RM will try to be active
 {code}
 dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin
  ./yarn rmadmin  -getServiceState rm1
 15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 active
 dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin
  ./yarn rmadmin  -getServiceState rm2
 15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 active
 {code}
 # Both Web UI active
 # Status shown as active for both RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()

2015-07-15 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3893:
---
Attachment: 0003-YARN-3893.patch

 Both RM in active state when Admin#transitionToActive failure from refeshAll()
 --

 Key: YARN-3893
 URL: https://issues.apache.org/jira/browse/YARN-3893
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
 Attachments: 0001-YARN-3893.patch, 0002-YARN-3893.patch, 
 0003-YARN-3893.patch, yarn-site.xml


 Cases that can cause this.
 # Capacity scheduler xml is wrongly configured during switch
 # Refresh ACL failure due to configuration
 # Refresh User group failure due to configuration
 Continuously both RM will try to be active
 {code}
 dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin
  ./yarn rmadmin  -getServiceState rm1
 15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 active
 dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin
  ./yarn rmadmin  -getServiceState rm2
 15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 active
 {code}
 # Both Web UI active
 # Status shown as active for both RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()

2015-07-14 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3893:
---
Attachment: 0002-YARN-3893.patch

 Both RM in active state when Admin#transitionToActive failure from refeshAll()
 --

 Key: YARN-3893
 URL: https://issues.apache.org/jira/browse/YARN-3893
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
 Attachments: 0001-YARN-3893.patch, 0002-YARN-3893.patch, yarn-site.xml


 Cases that can cause this.
 # Capacity scheduler xml is wrongly configured during switch
 # Refresh ACL failure due to configuration
 # Refresh User group failure due to configuration
 Continuously both RM will try to be active
 {code}
 dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin
  ./yarn rmadmin  -getServiceState rm1
 15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 active
 dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin
  ./yarn rmadmin  -getServiceState rm2
 15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 active
 {code}
 # Both Web UI active
 # Status shown as active for both RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel

2015-07-17 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3932:
---
Attachment: TestResult.jpg

 SchedulerApplicationAttempt#getResourceUsageReport should be based on 
 NodeLabel
 ---

 Key: YARN-3932
 URL: https://issues.apache.org/jira/browse/YARN-3932
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: 0001-YARN-3932.patch, ApplicationReport.jpg, 
 TestResult.jpg


 Application Resource Report shown wrong when node Label is used.
 1.Submit application with NodeLabel
 2.Check RM UI for resources used 
 Allocated CPU VCores and Allocated Memory MB is always {{zero}}
 {code}
  public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
 AggregateAppResourceUsage runningResourceUsage =
 getRunningAggregateAppResourceUsage();
 Resource usedResourceClone =
 Resources.clone(attemptResourceUsage.getUsed());
 Resource reservedResourceClone =
 Resources.clone(attemptResourceUsage.getReserved());
 return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
 reservedContainers.size(), usedResourceClone, reservedResourceClone,
 Resources.add(usedResourceClone, reservedResourceClone),
 runningResourceUsage.getMemorySeconds(),
 runningResourceUsage.getVcoreSeconds());
   }
 {code}
 should be {{attemptResourceUsage.getUsed(label)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel

2015-07-17 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3932:
---
Attachment: 0001-YARN-3932.patch

 SchedulerApplicationAttempt#getResourceUsageReport should be based on 
 NodeLabel
 ---

 Key: YARN-3932
 URL: https://issues.apache.org/jira/browse/YARN-3932
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: 0001-YARN-3932.patch, ApplicationReport.jpg


 Application Resource Report shown wrong when node Label is used.
 1.Submit application with NodeLabel
 2.Check RM UI for resources used 
 Allocated CPU VCores and Allocated Memory MB is always {{zero}}
 {code}
  public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
 AggregateAppResourceUsage runningResourceUsage =
 getRunningAggregateAppResourceUsage();
 Resource usedResourceClone =
 Resources.clone(attemptResourceUsage.getUsed());
 Resource reservedResourceClone =
 Resources.clone(attemptResourceUsage.getReserved());
 return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
 reservedContainers.size(), usedResourceClone, reservedResourceClone,
 Resources.add(usedResourceClone, reservedResourceClone),
 runningResourceUsage.getMemorySeconds(),
 runningResourceUsage.getVcoreSeconds());
   }
 {code}
 should be {{attemptResourceUsage.getUsed(label)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel

2015-07-17 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630877#comment-14630877
 ] 

Bibin A Chundatt commented on YARN-3932:


[~leftnoteasy] used {{attemptResourceUsage.getAllUsed()}}, a method that is already available.
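
In other words, the relevant line becomes something like the following (sketch only; see the attached patch for the actual diff):
{code}
// Clone the aggregate usage across all partitions instead of only the
// NO_LABEL bucket.
Resource usedResourceClone =
    Resources.clone(attemptResourceUsage.getAllUsed());
{code}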

 SchedulerApplicationAttempt#getResourceUsageReport should be based on 
 NodeLabel
 ---

 Key: YARN-3932
 URL: https://issues.apache.org/jira/browse/YARN-3932
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: 0001-YARN-3932.patch, ApplicationReport.jpg


 Application Resource Report shown wrong when node Label is used.
 1.Submit application with NodeLabel
 2.Check RM UI for resources used 
 Allocated CPU VCores and Allocated Memory MB is always {{zero}}
 {code}
  public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
 AggregateAppResourceUsage runningResourceUsage =
 getRunningAggregateAppResourceUsage();
 Resource usedResourceClone =
 Resources.clone(attemptResourceUsage.getUsed());
 Resource reservedResourceClone =
 Resources.clone(attemptResourceUsage.getReserved());
 return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
 reservedContainers.size(), usedResourceClone, reservedResourceClone,
 Resources.add(usedResourceClone, reservedResourceClone),
 runningResourceUsage.getMemorySeconds(),
 runningResourceUsage.getVcoreSeconds());
   }
 {code}
 should be {{attemptResourceUsage.getUsed(label)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3938) AM Resources for leaf queues zero when DEFAULT PARTITION resource is zero with NodeLabel

2015-07-17 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3938:
---
Summary: AM Resources for leaf queues zero when DEFAULT PARTITION resource 
is zero with NodeLabel  (was: AM Resources for leaf queues zero when DEFAULT 
PARTITION resource is zero)

 AM Resources for leaf queues zero when DEFAULT PARTITION resource is zero 
 with NodeLabel
 

 Key: YARN-3938
 URL: https://issues.apache.org/jira/browse/YARN-3938
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
 Attachments: Am limit for subqueue.jpg


 In case of leaf queue  the AM resource calculation is based on 
 {{absoluteCapacityResource}}. Below is the calculation for absolute capacity
 {{LeafQueue#updateAbsoluteCapacityResource()}}
 {code}
   private void updateAbsoluteCapacityResource(Resource clusterResource) {
 absoluteCapacityResource =
 Resources.multiplyAndNormalizeUp(resourceCalculator, labelManager
 .getResourceByLabel(RMNodeLabelsManager.NO_LABEL, 
 clusterResource),
 queueCapacities.getAbsoluteCapacity(), minimumAllocation);
   }
 {code}
 If default partition resource is zero for all Leaf queue the resource for AM 
 will be zero
 Snapshot also attached for the same



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3938) AM Resources for leaf queues zero when DEFAULT PARTITION resource is zero

2015-07-17 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-3938:
--

 Summary: AM Resources for leaf queues zero when DEFAULT PARTITION 
resource is zero
 Key: YARN-3938
 URL: https://issues.apache.org/jira/browse/YARN-3938
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical


In case of leaf queue  the AM resource calculation is based on 
{{absoluteCapacityResource}}. Below is the calculation for absolute capacity

{{LeafQueue#updateAbsoluteCapacityResource()}}


{code}
  private void updateAbsoluteCapacityResource(Resource clusterResource) {
absoluteCapacityResource =
Resources.multiplyAndNormalizeUp(resourceCalculator, labelManager
.getResourceByLabel(RMNodeLabelsManager.NO_LABEL, clusterResource),
queueCapacities.getAbsoluteCapacity(), minimumAllocation);
  }
{code}

If default partition resource is zero for all Leaf queue the resource for AM 
will be zero

Snapshot also attached for the same



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3938) AM Resources for leaf queues zero when DEFAULT PARTITION resource is zero

2015-07-17 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3938:
---
Attachment: Am limit for subqueue.jpg

 AM Resources for leaf queues zero when DEFAULT PARTITION resource is zero
 -

 Key: YARN-3938
 URL: https://issues.apache.org/jira/browse/YARN-3938
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
 Attachments: Am limit for subqueue.jpg


 In case of leaf queue  the AM resource calculation is based on 
 {{absoluteCapacityResource}}. Below is the calculation for absolute capacity
 {{LeafQueue#updateAbsoluteCapacityResource()}}
 {code}
   private void updateAbsoluteCapacityResource(Resource clusterResource) {
 absoluteCapacityResource =
 Resources.multiplyAndNormalizeUp(resourceCalculator, labelManager
 .getResourceByLabel(RMNodeLabelsManager.NO_LABEL, 
 clusterResource),
 queueCapacities.getAbsoluteCapacity(), minimumAllocation);
   }
 {code}
 If default partition resource is zero for all Leaf queue the resource for AM 
 will be zero
 Snapshot also attached for the same



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel

2015-07-19 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3932:
---
Attachment: 0002-YARN-3932.patch

Attaching patch with test case

 SchedulerApplicationAttempt#getResourceUsageReport should be based on 
 NodeLabel
 ---

 Key: YARN-3932
 URL: https://issues.apache.org/jira/browse/YARN-3932
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: 0001-YARN-3932.patch, 0002-YARN-3932.patch, 
 ApplicationReport.jpg, TestResult.jpg


 Application Resource Report shown wrong when node Label is used.
 1.Submit application with NodeLabel
 2.Check RM UI for resources used 
 Allocated CPU VCores and Allocated Memory MB is always {{zero}}
 {code}
  public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
 AggregateAppResourceUsage runningResourceUsage =
 getRunningAggregateAppResourceUsage();
 Resource usedResourceClone =
 Resources.clone(attemptResourceUsage.getUsed());
 Resource reservedResourceClone =
 Resources.clone(attemptResourceUsage.getReserved());
 return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
 reservedContainers.size(), usedResourceClone, reservedResourceClone,
 Resources.add(usedResourceClone, reservedResourceClone),
 runningResourceUsage.getMemorySeconds(),
 runningResourceUsage.getVcoreSeconds());
   }
 {code}
 should be {{attemptResourceUsage.getUsed(label)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3939) CS schedule Userlevel resource usage shown is zero for NodeLabel partition

2015-07-20 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3939:
---
Attachment: 0001-YARN-3939.patch

 CS schedule Userlevel resource usage shown is zero for NodeLabel partition
 --

 Key: YARN-3939
 URL: https://issues.apache.org/jira/browse/YARN-3939
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
 Environment: Suse 11 SP3 
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: 0001-YARN-3939.patch, ResourceUsage.jpg


 Submit application to queue and particular node label partition
 Check the resource usage for particular user in queue
 {{LeafQueue#getUsers()}}
 {code}
   public synchronized ArrayList<UserInfo> getUsers() {
 ArrayList<UserInfo> usersToReturn = new ArrayList<UserInfo>();
 for (Map.Entry<String, User> entry : users.entrySet()) {
   User user = entry.getValue();
   usersToReturn.add(new UserInfo(entry.getKey(), Resources.clone(user
   .getUsed()), user.getActiveApplications(), user
   .getPendingApplications(), Resources.clone(user
   .getConsumedAMResources()), Resources.clone(user
   .getUserResourceLimit())));
 }
 return usersToReturn;
   }
 {code}
 Should get usage for particular user and label



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3939) CS schedule Userlevel resource usage shown is zero for NodeLabel partition

2015-07-20 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3939:
---
Attachment: (was: 0001-YARN-3939.patch)

 CS schedule Userlevel resource usage shown is zero for NodeLabel partition
 --

 Key: YARN-3939
 URL: https://issues.apache.org/jira/browse/YARN-3939
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
 Environment: Suse 11 SP3 
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: ResourceUsage.jpg


 Submit application to queue and particular node label partition
 Check the resource usage for particular user in queue
 {{LeafQueue#getUsers()}}
 {code}
   public synchronized ArrayList<UserInfo> getUsers() {
 ArrayList<UserInfo> usersToReturn = new ArrayList<UserInfo>();
 for (Map.Entry<String, User> entry : users.entrySet()) {
   User user = entry.getValue();
   usersToReturn.add(new UserInfo(entry.getKey(), Resources.clone(user
   .getUsed()), user.getActiveApplications(), user
   .getPendingApplications(), Resources.clone(user
   .getConsumedAMResources()), Resources.clone(user
   .getUserResourceLimit())));
 }
 return usersToReturn;
   }
 {code}
 Should get usage for particular user and label



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3939) CS schedule Userlevel resource usage shown is zero for NodeLabel partition

2015-07-20 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3939:
---
Attachment: ResourceUsage.jpg

 CS schedule Userlevel resource usage shown is zero for NodeLabel partition
 --

 Key: YARN-3939
 URL: https://issues.apache.org/jira/browse/YARN-3939
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
 Environment: Suse 11 SP3 
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: ResourceUsage.jpg


 Submit application to queue and particular node label partition
 Check the resource usage for particular user in queue
 {{LeafQueue#getUsers()}}
 {code}
   public synchronized ArrayList<UserInfo> getUsers() {
 ArrayList<UserInfo> usersToReturn = new ArrayList<UserInfo>();
 for (Map.Entry<String, User> entry : users.entrySet()) {
   User user = entry.getValue();
   usersToReturn.add(new UserInfo(entry.getKey(), Resources.clone(user
   .getUsed()), user.getActiveApplications(), user
   .getPendingApplications(), Resources.clone(user
   .getConsumedAMResources()), Resources.clone(user
   .getUserResourceLimit())));
 }
 return usersToReturn;
   }
 {code}
 Should get usage for particular user and label



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3939) CS schedule Userlevel resource usage shown is zero for NodeLabel partition

2015-07-20 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-3939:
--

 Summary: CS schedule Userlevel resource usage shown is zero for 
NodeLabel partition
 Key: YARN-3939
 URL: https://issues.apache.org/jira/browse/YARN-3939
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
 Environment: Suse 11 SP3 
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: ResourceUsage.jpg

Submit application to queue and particular node label partition
Check the resource usage for particular user in queue

{{LeafQueue#getUsers()}}
{code}
  public synchronized ArrayList<UserInfo> getUsers() {
ArrayList<UserInfo> usersToReturn = new ArrayList<UserInfo>();
for (Map.Entry<String, User> entry : users.entrySet()) {
  User user = entry.getValue();
  usersToReturn.add(new UserInfo(entry.getKey(), Resources.clone(user
  .getUsed()), user.getActiveApplications(), user
  .getPendingApplications(), Resources.clone(user
  .getConsumedAMResources()), Resources.clone(user
  .getUserResourceLimit())));
}
return usersToReturn;
  }
{code}

Should get usage for particular user and label
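
A rough sketch of the direction (illustrative only; the per-partition accessor {{user.getUsed(partition)}} is an assumption for the example, not the actual patch):
{code}
// Sketch: report the usage recorded against the requested partition
// instead of only the NO_LABEL bucket.
for (Map.Entry<String, User> entry : users.entrySet()) {
  User user = entry.getValue();
  usersToReturn.add(new UserInfo(entry.getKey(),
      Resources.clone(user.getUsed(partition)),   // assumed accessor
      user.getActiveApplications(), user.getPendingApplications(),
      Resources.clone(user.getConsumedAMResources()),
      Resources.clone(user.getUserResourceLimit())));
}
{code}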



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3939) CS schedule Userlevel resource usage shown is zero for NodeLabel partition

2015-07-20 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3939:
---
Attachment: 0001-YARN-3939.patch

 CS schedule Userlevel resource usage shown is zero for NodeLabel partition
 --

 Key: YARN-3939
 URL: https://issues.apache.org/jira/browse/YARN-3939
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
 Environment: Suse 11 SP3 
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: 0001-YARN-3939.patch, ResourceUsage.jpg


 Submit application to queue and particular node label partition
 Check the resource usage for particular user in queue
 {{LeafQueue#getUsers()}}
 {code}
   public synchronized ArrayList<UserInfo> getUsers() {
 ArrayList<UserInfo> usersToReturn = new ArrayList<UserInfo>();
 for (Map.Entry<String, User> entry : users.entrySet()) {
   User user = entry.getValue();
   usersToReturn.add(new UserInfo(entry.getKey(), Resources.clone(user
   .getUsed()), user.getActiveApplications(), user
   .getPendingApplications(), Resources.clone(user
   .getConsumedAMResources()), Resources.clone(user
   .getUserResourceLimit())));
 }
 return usersToReturn;
   }
 {code}
 Should get usage for particular user and label



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3940) Application moveToQueue should check NodeLabel permission

2015-07-20 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-3940:
--

 Summary: Application moveToQueue should check NodeLabel permission 
 Key: YARN-3940
 URL: https://issues.apache.org/jira/browse/YARN-3940
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Suse 11 Sp3
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt


Configure capacity scheduler 
Configure node label and submit application {{queue=A Label=X}}
Move application to queue {{B}}, where x does not have access

{code}
2015-07-20 19:46:19,626 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Application attempt appattempt_1437385548409_0005_01 released container 
container_e08_1437385548409_0005_01_02 on node: host: 
host-10-19-92-117:64318 #containers=1 available=memory:2560, vCores:15 
used=memory:512, vCores:1 with event: KILL
2015-07-20 19:46:20,970 WARN 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid 
resource ask by application appattempt_1437385548409_0005_01
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
resource request, queue=b1 doesn't have permission to access all labels in 
resource request. labelExpression of resource request=x. Queue labels=y
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:250)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:106)
at 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:515)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168)

{code}

Same exception will be thrown till *heartbeat timeout*
Then application state will be updated to *FAILED*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()

2015-07-13 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625304#comment-14625304
 ] 

Bibin A Chundatt commented on YARN-3893:


[~varun_saxena] and [~sunilg], we only need to call {{rm.transitionToStandby(false)}} on exception, since it handles the transition to standby in the RM context and stops the active services without reinitializing the queues.
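
Roughly along these lines (simplified sketch of the idea, not the attached patch):
{code}
// Sketch: if refreshAll() fails after the RM has become active, fall
// back to standby so both RMs do not end up reporting active.
try {
  refreshAll();
} catch (Exception e) {
  LOG.warn("refreshAll() failed during transition to active", e);
  // false: stop active services without re-initializing them
  rm.transitionToStandby(false);
  throw new ServiceFailedException(
      "Error on refreshAll during transition to active", e);
}
{code}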

 Both RM in active state when Admin#transitionToActive failure from refeshAll()
 --

 Key: YARN-3893
 URL: https://issues.apache.org/jira/browse/YARN-3893
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
 Attachments: yarn-site.xml


 Cases that can cause this.
 # Capacity scheduler xml is wrongly configured during switch
 # Refresh ACL failure due to configuration
 # Refresh User group failure due to configuration
 Continuously both RM will try to be active
 {code}
 dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin
  ./yarn rmadmin  -getServiceState rm1
 15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 active
 dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin
  ./yarn rmadmin  -getServiceState rm2
 15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 active
 {code}
 # Both Web UI active
 # Status shown as active for both RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3884) RMContainerImpl transition from RESERVED to KILL apphistory status not updated

2015-07-13 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625307#comment-14625307
 ] 

Bibin A Chundatt commented on YARN-3884:


Please review the attached patch

 RMContainerImpl transition from RESERVED to KILL apphistory status not updated
 --

 Key: YARN-3884
 URL: https://issues.apache.org/jira/browse/YARN-3884
 Project: Hadoop YARN
  Issue Type: Bug
 Environment: Suse11 Sp3
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: 0001-YARN-3884.patch, Apphistory Container Status.jpg, 
 Elapsed Time.jpg, Test Result-Container status.jpg


 Setup
 ===
 1 NM 3072 16 cores each
 Steps to reproduce
 ===
 1.Submit apps  to Queue 1 with 512 mb 1 core
 2.Submit apps  to Queue 2 with 512 mb and 5 core
 lots of containers get reserved and unreserved in this case 
 {code}
 2015-07-02 20:45:31,169 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
 container_e24_1435849994778_0002_01_13 Container Transitioned from NEW to 
 RESERVED
 2015-07-02 20:45:31,170 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 Reserved container  application=application_1435849994778_0002 
 resource=memory:512, vCores:5 queue=QueueA: capacity=0.4, 
 absoluteCapacity=0.4, usedResources=memory:2560, vCores:21, 
 usedCapacity=1.6410257, absoluteUsedCapacity=0.65625, numApps=1, 
 numContainers=5 usedCapacity=1.6410257 absoluteUsedCapacity=0.65625 
 used=memory:2560, vCores:21 cluster=memory:6144, vCores:32
 2015-07-02 20:45:31,170 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
 Re-sorting assigned queue: root.QueueA stats: QueueA: capacity=0.4, 
 absoluteCapacity=0.4, usedResources=memory:3072, vCores:26, 
 usedCapacity=2.0317461, absoluteUsedCapacity=0.8125, numApps=1, 
 numContainers=6
 2015-07-02 20:45:31,170 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
 assignedContainer queue=root usedCapacity=0.96875 
 absoluteUsedCapacity=0.96875 used=memory:5632, vCores:31 
 cluster=memory:6144, vCores:32
 2015-07-02 20:45:31,191 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
 container_e24_1435849994778_0001_01_14 Container Transitioned from NEW to 
 ALLOCATED
 2015-07-02 20:45:31,191 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
 OPERATION=AM Allocated ContainerTARGET=SchedulerApp 
 RESULT=SUCCESS  APPID=application_1435849994778_0001
 CONTAINERID=container_e24_1435849994778_0001_01_14
 2015-07-02 20:45:31,191 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: 
 Assigned container container_e24_1435849994778_0001_01_14 of capacity 
 memory:512, vCores:1 on host host-10-19-92-117:64318, which has 6 
 containers, memory:3072, vCores:14 used and memory:0, vCores:2 available 
 after allocation
 2015-07-02 20:45:31,191 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 assignedContainer application attempt=appattempt_1435849994778_0001_01 
 container=Container: [ContainerId: 
 container_e24_1435849994778_0001_01_14, NodeId: host-10-19-92-117:64318, 
 NodeHttpAddress: host-10-19-92-117:65321, Resource: memory:512, vCores:1, 
 Priority: 20, Token: null, ] queue=default: capacity=0.2, 
 absoluteCapacity=0.2, usedResources=memory:2560, vCores:5, 
 usedCapacity=2.0846906, absoluteUsedCapacity=0.4166, numApps=1, 
 numContainers=5 clusterResource=memory:6144, vCores:32
 2015-07-02 20:45:31,191 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
 Re-sorting assigned queue: root.default stats: default: capacity=0.2, 
 absoluteCapacity=0.2, usedResources=memory:3072, vCores:6, 
 usedCapacity=2.5016286, absoluteUsedCapacity=0.5, numApps=1, numContainers=6
 2015-07-02 20:45:31,191 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
 assignedContainer queue=root usedCapacity=1.0 absoluteUsedCapacity=1.0 
 used=memory:6144, vCores:32 cluster=memory:6144, vCores:32
 2015-07-02 20:45:32,143 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
 container_e24_1435849994778_0001_01_14 Container Transitioned from 
 ALLOCATED to ACQUIRED
 2015-07-02 20:45:32,174 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
  Trying to fulfill reservation for application application_1435849994778_0002 
 on node: host-10-19-92-143:64318
 2015-07-02 20:45:32,174 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 Reserved container  

[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()

2015-07-13 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624246#comment-14624246
 ] 

Bibin A Chundatt commented on YARN-3893:


Thanks [~varun_saxena] and [~sunilg]. The first option looks good and is easier to implement. Both RMs could end up in standby state, but it still looks like the best option.

 Both RM in active state when Admin#transitionToActive failure from refeshAll()
 --

 Key: YARN-3893
 URL: https://issues.apache.org/jira/browse/YARN-3893
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
 Attachments: yarn-site.xml


 Cases that can cause this.
 # Capacity scheduler xml is wrongly configured during switch
 # Refresh ACL failure due to configuration
 # Refresh User group failure due to configuration
 Continuously both RM will try to be active
 {code}
 dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin
  ./yarn rmadmin  -getServiceState rm1
 15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 active
 dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin
  ./yarn rmadmin  -getServiceState rm2
 15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 active
 {code}
 # Both Web UI active
 # Status shown as active for both RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3894) RM startup should fail for wrong CS xml NodeLabel capacity configuration

2015-07-13 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624250#comment-14624250
 ] 

Bibin A Chundatt commented on YARN-3894:


[~sunilg] and [~leftnoteasy], thanks a lot for the review and commit

 RM startup should fail for wrong CS xml NodeLabel capacity configuration 
 -

 Key: YARN-3894
 URL: https://issues.apache.org/jira/browse/YARN-3894
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
 Fix For: 2.8.0

 Attachments: 0001-YARN-3894.patch, 0002-YARN-3894.patch, 
 capacity-scheduler.xml


 Currently in the Capacity Scheduler, when the capacity configuration is wrong,
 the RM will shut down, but not in case of a NodeLabels capacity mismatch
 In {{CapacityScheduler#initializeQueues}}
 {code}
   private void initializeQueues(CapacitySchedulerConfiguration conf)
 throws IOException {   
 root = 
 parseQueue(this, conf, null, CapacitySchedulerConfiguration.ROOT, 
 queues, queues, noop);
 labelManager.reinitializeQueueLabels(getQueueToLabels());
 root = 
 parseQueue(this, conf, null, CapacitySchedulerConfiguration.ROOT, 
 queues, queues, noop);
 LOG.info("Initialized root queue " + root);
 initializeQueueMappings();
 setQueueAcls(authorizer, queues);
   }
 {code}
 {{labelManager}} is initialized from the queues, and the label-level capacity
 check happens in {{parseQueue}}. So during the initial {{parseQueue}} the label
 list will be empty.
 *Steps to reproduce*
 # Configure RM with capacity scheduler
 # Add one or two node labels from rmadmin
 # Configure the capacity xml with the node label, but with a wrong capacity
 configuration for an already added label
 # Restart both RMs
 # Check that on service init of the capacity scheduler the node label list is populated
 *Expected*
 RM should not start 
 *Current exception on reintialize check*
 {code}
 2015-07-07 19:18:25,655 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
  Initialized queue: default: capacity=0.5, absoluteCapacity=0.5, 
 usedResources=memory:0, vCores:0, usedCapacity=0.0, 
 absoluteUsedCapacity=0.0, numApps=0, numContainers=0
 2015-07-07 19:18:25,656 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService: Exception refresh 
 queues.
 java.io.IOException: Failed to re-init queues
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:383)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:376)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:605)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:314)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
 Caused by: java.lang.IllegalArgumentException: Illegal capacity of 0.5 for 
 children of queue root for label=node2
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.setChildQueues(ParentQueue.java:159)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:639)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:503)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:379)
 ... 8 more
 2015-07-07 19:18:25,656 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
 OPERATION=refreshQueues TARGET=AdminService RESULT=FAILURE  
 DESCRIPTION=Exception refresh queues.   PERMISSIONS=
 2015-07-07 19:18:25,656 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf   
 OPERATION=transitionToActiveTARGET=RMHAProtocolService  
 RESULT=FAILURE  DESCRIPTION=Exception transitioning to active   PERMISSIONS=
 2015-07-07 19:18:25,656 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election
 org.apache.hadoop.ha.ServiceFailedException: RM could not 

[jira] [Commented] (YARN-3939) CS schedule Userlevel resource usage shown is zero for NodeLabel partition

2015-07-20 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633894#comment-14633894
 ] 

Bibin A Chundatt commented on YARN-3939:


[~leftnoteasy] will add the fix in YARN-3932

 CS schedule Userlevel resource usage shown is zero for NodeLabel partition
 --

 Key: YARN-3939
 URL: https://issues.apache.org/jira/browse/YARN-3939
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
 Environment: Suse 11 SP3 
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: 0001-YARN-3939.patch, ResourceUsage.jpg


 Submit application to queue and particular node label partition
 Check the resource usage for particular user in queue
 {{LeafQueue#getUsers()}}
 {code}
   public synchronized ArrayList<UserInfo> getUsers() {
 ArrayList<UserInfo> usersToReturn = new ArrayList<UserInfo>();
 for (Map.Entry<String, User> entry : users.entrySet()) {
   User user = entry.getValue();
   usersToReturn.add(new UserInfo(entry.getKey(), Resources.clone(user
   .getUsed()), user.getActiveApplications(), user
   .getPendingApplications(), Resources.clone(user
   .getConsumedAMResources()), Resources.clone(user
   .getUserResourceLimit())));
 }
 return usersToReturn;
   }
 {code}
 Should get usage for particular user and label



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel

2015-07-20 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3932:
---
Attachment: 0004-YARN-3932.patch

Attaching patch with the YARN-3939 merge and a test case for the same

 SchedulerApplicationAttempt#getResourceUsageReport should be based on 
 NodeLabel
 ---

 Key: YARN-3932
 URL: https://issues.apache.org/jira/browse/YARN-3932
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: 0001-YARN-3932.patch, 0002-YARN-3932.patch, 
 0003-YARN-3932.patch, 0004-YARN-3932.patch, ApplicationReport.jpg, 
 TestResult.jpg


 Application Resource Report shown wrong when node Label is used.
 1.Submit application with NodeLabel
 2.Check RM UI for resources used 
 Allocated CPU VCores and Allocated Memory MB is always {{zero}}
 {code}
  public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
 AggregateAppResourceUsage runningResourceUsage =
 getRunningAggregateAppResourceUsage();
 Resource usedResourceClone =
 Resources.clone(attemptResourceUsage.getUsed());
 Resource reservedResourceClone =
 Resources.clone(attemptResourceUsage.getReserved());
 return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
 reservedContainers.size(), usedResourceClone, reservedResourceClone,
 Resources.add(usedResourceClone, reservedResourceClone),
 runningResourceUsage.getMemorySeconds(),
 runningResourceUsage.getVcoreSeconds());
   }
 {code}
 should be {{attemptResourceUsage.getUsed(label)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel

2015-07-20 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3932:
---
Attachment: 0003-YARN-3932.patch

Attaching patch after merging YARN-3939

 SchedulerApplicationAttempt#getResourceUsageReport should be based on 
 NodeLabel
 ---

 Key: YARN-3932
 URL: https://issues.apache.org/jira/browse/YARN-3932
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: 0001-YARN-3932.patch, 0002-YARN-3932.patch, 
 0003-YARN-3932.patch, ApplicationReport.jpg, TestResult.jpg


 Application Resource Report shown wrong when node Label is used.
 1.Submit application with NodeLabel
 2.Check RM UI for resources used 
 Allocated CPU VCores and Allocated Memory MB is always {{zero}}
 {code}
  public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
 AggregateAppResourceUsage runningResourceUsage =
 getRunningAggregateAppResourceUsage();
 Resource usedResourceClone =
 Resources.clone(attemptResourceUsage.getUsed());
 Resource reservedResourceClone =
 Resources.clone(attemptResourceUsage.getReserved());
 return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
 reservedContainers.size(), usedResourceClone, reservedResourceClone,
 Resources.add(usedResourceClone, reservedResourceClone),
 runningResourceUsage.getMemorySeconds(),
 runningResourceUsage.getVcoreSeconds());
   }
 {code}
 should be {{attemptResourceUsage.getUsed(label)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3940) Application moveToQueue should check NodeLabel permission

2015-07-20 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633870#comment-14633870
 ] 

Bibin A Chundatt commented on YARN-3940:


[~leftnoteasy] IMO, since the destination queue does not have permission for the label, we should not allow the move. *Currently it gives a success message*. We can validate the permission of the specified node label {{X}} in {{CapacityScheduler#moveApplication}} for the destination queue, as sketched below. Is that OK? Please correct me if I am wrong.
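
For example (illustrative sketch only; {{targetQueue}} and {{appLabelExpression}} are assumed names, and the final patch may check this differently):
{code}
// Sketch: reject the move when the destination queue cannot access the
// label the application was submitted with.
Set<String> targetQueueLabels = targetQueue.getAccessibleNodeLabels();
if (appLabelExpression != null && !appLabelExpression.isEmpty()
    && !targetQueueLabels.contains(RMNodeLabelsManager.ANY)
    && !targetQueueLabels.contains(appLabelExpression)) {
  throw new YarnException("Move rejected: queue " + targetQueue.getQueueName()
      + " cannot access label " + appLabelExpression);
}
{code}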



 Application moveToQueue should check NodeLabel permission 
 --

 Key: YARN-3940
 URL: https://issues.apache.org/jira/browse/YARN-3940
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt

 Configure capacity scheduler 
 Configure node label and submit application {{queue=A Label=X}}
 Move application to queue {{B}}, which does not have access to x
 {code}
 2015-07-20 19:46:19,626 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
  Application attempt appattempt_1437385548409_0005_01 released container 
 container_e08_1437385548409_0005_01_02 on node: host: 
 host-10-19-92-117:64318 #containers=1 available=memory:2560, vCores:15 
 used=memory:512, vCores:1 with event: KILL
 2015-07-20 19:46:20,970 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
 Invalid resource ask by application appattempt_1437385548409_0005_01
 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
 resource request, queue=b1 doesn't have permission to access all labels in 
 resource request. labelExpression of resource request=x. Queue labels=y
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:250)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:106)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:515)
 at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
 at 
 org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168)
 {code}
 Same exception will be thrown till *heartbeat timeout*
 Then application state will be updated to *FAILED*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3940) Application moveToQueue should check NodeLabel permission

2015-07-20 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3940:
---
Description: 
Configure capacity scheduler 
Configure node label and submit application {{queue=A Label=X}}
Move application to queue {{B}}, which does not have access to x

{code}
2015-07-20 19:46:19,626 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Application attempt appattempt_1437385548409_0005_01 released container 
container_e08_1437385548409_0005_01_02 on node: host: 
host-10-19-92-117:64318 #containers=1 available=memory:2560, vCores:15 
used=memory:512, vCores:1 with event: KILL
2015-07-20 19:46:20,970 WARN 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid 
resource ask by application appattempt_1437385548409_0005_01
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
resource request, queue=b1 doesn't have permission to access all labels in 
resource request. labelExpression of resource request=x. Queue labels=y
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:250)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:106)
at 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:515)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168)

{code}

Same exception will be thrown till *heartbeat timeout*
Then application state will be updated to *FAILED*

  was:
Configure capacity scheduler 
Configure node label and submit application {{queue=A Label=X}}
Move application to queue {{B}}, where x does not have access

{code}
2015-07-20 19:46:19,626 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Application attempt appattempt_1437385548409_0005_01 released container 
container_e08_1437385548409_0005_01_02 on node: host: 
host-10-19-92-117:64318 #containers=1 available=memory:2560, vCores:15 
used=memory:512, vCores:1 with event: KILL
2015-07-20 19:46:20,970 WARN 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid 
resource ask by application appattempt_1437385548409_0005_01
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
resource request, queue=b1 doesn't have permission to access all labels in 
resource request. labelExpression of resource request=x. Queue labels=y
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:250)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:106)
at 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:515)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174)
at 

[jira] [Commented] (YARN-3939) CS schedule Userlevel resource usage shown is zero for NodeLabel partition

2015-07-20 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633873#comment-14633873
 ] 

Bibin A Chundatt commented on YARN-3939:


Hi [~leftnoteasy]. The fix is in two different areas; that is the reason I raised a separate issue.

 CS schedule Userlevel resource usage shown is zero for NodeLabel partition
 --

 Key: YARN-3939
 URL: https://issues.apache.org/jira/browse/YARN-3939
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
 Environment: Suse 11 SP3 
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: 0001-YARN-3939.patch, ResourceUsage.jpg


 Submit an application to a queue and a particular node label partition.
 Check the resource usage for a particular user in the queue via {{LeafQueue#getUsers()}}:
 {code}
   public synchronized ArrayList<UserInfo> getUsers() {
     ArrayList<UserInfo> usersToReturn = new ArrayList<UserInfo>();
     for (Map.Entry<String, User> entry : users.entrySet()) {
       User user = entry.getValue();
       usersToReturn.add(new UserInfo(entry.getKey(), Resources.clone(user
           .getUsed()), user.getActiveApplications(), user
           .getPendingApplications(), Resources.clone(user
           .getConsumedAMResources()), Resources.clone(user
           .getUserResourceLimit())));
     }
     return usersToReturn;
   }
 {code}
 It should return the usage for the particular user and label.
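
For illustration, here is a minimal, self-contained sketch of the direction the fix takes: track and report usage per node-label partition instead of a single total. The {{Resource}}, {{User}} and {{getUsed(label)}} names below are simplified stand-ins, not the actual YARN classes or the committed patch.

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal stand-ins for the YARN types involved, for illustration only.
class Resource {
  final long memory; final int vcores;
  Resource(long memory, int vcores) { this.memory = memory; this.vcores = vcores; }
  public String toString() { return "<memory:" + memory + ", vCores:" + vcores + ">"; }
}

class User {
  // Usage tracked per node-label partition instead of a single total.
  private final Map<String, Resource> usedByLabel = new HashMap<>();
  void setUsed(String label, Resource r) { usedByLabel.put(label, r); }
  // Hypothetical label-aware accessor, mirroring the idea of getUsed(label).
  Resource getUsed(String label) {
    return usedByLabel.getOrDefault(label, new Resource(0, 0));
  }
}

public class PerLabelUserInfoSketch {
  // Sketch of getUsers(): report usage for the requested partition,
  // not only the label-unaware total.
  static List<String> getUsers(Map<String, User> users, String partition) {
    List<String> report = new ArrayList<>();
    for (Map.Entry<String, User> e : users.entrySet()) {
      report.add(e.getKey() + " used on '" + partition + "': "
          + e.getValue().getUsed(partition));
    }
    return report;
  }

  public static void main(String[] args) {
    Map<String, User> users = new HashMap<>();
    User u = new User();
    u.setUsed("x", new Resource(4096, 4));   // usage on partition x
    users.put("bibin", u);
    // With label-unaware reporting this user would show zero usage.
    System.out.println(getUsers(users, "x"));
  }
}
{code}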



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3940) Application moveToQueue should check NodeLabel permission

2015-07-20 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633881#comment-14633881
 ] 

Bibin A Chundatt commented on YARN-3940:


[~leftnoteasy] Thanks for the comment; I will upload the patch soon. A rough sketch of the intended check is below.
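
For illustration only, a minimal sketch of the kind of check that needs to run before the move is allowed: reject the move when the application's label expression is not accessible from the target queue. The class and method names are hypothetical stand-ins, not the actual YARN code or the attached patch.

{code}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustration only: validate node-label access before moving an application,
// so the move is rejected up front instead of the AM asking forever.
public class MoveToQueueLabelCheck {

  static void checkQueueLabelAccess(String appLabelExpression,
      Set<String> targetQueueLabels) {
    // An empty expression means DEFAULT_PARTITION, which every queue can use.
    if (appLabelExpression == null || appLabelExpression.isEmpty()) {
      return;
    }
    if (!targetQueueLabels.contains(appLabelExpression)) {
      throw new IllegalArgumentException("Cannot move application: target queue"
          + " cannot access label '" + appLabelExpression + "', queue labels="
          + targetQueueLabels);
    }
  }

  public static void main(String[] args) {
    Set<String> queueBLabels = new HashSet<>(Arrays.asList("y"));
    try {
      checkQueueLabelAccess("x", queueBLabels); // app was submitted with label x
    } catch (IllegalArgumentException e) {
      // The move is rejected immediately instead of failing every allocate() call.
      System.out.println(e.getMessage());
    }
  }
}
{code}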

 Application moveToQueue should check NodeLabel permission 
 --

 Key: YARN-3940
 URL: https://issues.apache.org/jira/browse/YARN-3940
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt

 Configure capacity scheduler 
 Configure node labels and submit an application with {{queue=A Label=X}}
 Move the application to queue {{B}}, which does not have access to label x
 {code}
 2015-07-20 19:46:19,626 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
  Application attempt appattempt_1437385548409_0005_01 released container 
 container_e08_1437385548409_0005_01_02 on node: host: 
 host-10-19-92-117:64318 #containers=1 available=memory:2560, vCores:15 
 used=memory:512, vCores:1 with event: KILL
 2015-07-20 19:46:20,970 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
 Invalid resource ask by application appattempt_1437385548409_0005_01
 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
 resource request, queue=b1 doesn't have permission to access all labels in 
 resource request. labelExpression of resource request=x. Queue labels=y
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:250)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:106)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:515)
 at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
 at 
 org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168)
 {code}
 The same exception will be thrown until the *heartbeat timeout*, and the application state will then be updated to *FAILED*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel

2015-07-20 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633995#comment-14633995
 ] 

Bibin A Chundatt commented on YARN-3932:


The CheckStyle issue is about {{LeafQueue}} having more than 2000 lines; it was the same before this patch.

 SchedulerApplicationAttempt#getResourceUsageReport should be based on 
 NodeLabel
 ---

 Key: YARN-3932
 URL: https://issues.apache.org/jira/browse/YARN-3932
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: 0001-YARN-3932.patch, 0002-YARN-3932.patch, 
 0003-YARN-3932.patch, 0004-YARN-3932.patch, ApplicationReport.jpg, 
 TestResult.jpg


 Application Resource Report shown wrong when node Label is used.
 1.Submit application with NodeLabel
 2.Check RM UI for resources used 
 Allocated CPU VCores and Allocated Memory MB is always {{zero}}
 {code}
  public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
 AggregateAppResourceUsage runningResourceUsage =
 getRunningAggregateAppResourceUsage();
 Resource usedResourceClone =
 Resources.clone(attemptResourceUsage.getUsed());
 Resource reservedResourceClone =
 Resources.clone(attemptResourceUsage.getReserved());
 return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
 reservedContainers.size(), usedResourceClone, reservedResourceClone,
 Resources.add(usedResourceClone, reservedResourceClone),
 runningResourceUsage.getMemorySeconds(),
 runningResourceUsage.getVcoreSeconds());
   }
 {code}
 should be {{attemptResourceUsage.getUsed(label)}}
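
As a simplified illustration of the problem (not the actual patch), the sketch below models per-partition usage with a plain map: reading the DEFAULT_PARTITION entry always yields zero for an application that ran on label x, while the label-aware lookup, i.e. the equivalent of {{attemptResourceUsage.getUsed(label)}}, returns the real usage.

{code}
import java.util.HashMap;
import java.util.Map;

// Illustration of why the report shows zero: usage is tracked per partition,
// but the report reads only the DEFAULT_PARTITION ("") entry.
public class PerLabelUsageReportSketch {
  static final String DEFAULT_PARTITION = "";

  static long getUsedMemory(Map<String, Long> usedMemByLabel, String label) {
    return usedMemByLabel.getOrDefault(label, 0L);
  }

  public static void main(String[] args) {
    Map<String, Long> usedMemByLabel = new HashMap<>();
    usedMemByLabel.put("x", 8192L);  // all containers ran on partition x

    // Label-unaware report (what the RM UI effectively showed): always 0.
    System.out.println("default partition used = "
        + getUsedMemory(usedMemByLabel, DEFAULT_PARTITION));

    // Label-aware report, the equivalent of getUsed(label): 8192.
    System.out.println("partition x used = "
        + getUsedMemory(usedMemByLabel, "x"));
  }
}
{code}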



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel

2015-07-20 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14634423#comment-14634423
 ] 

Bibin A Chundatt commented on YARN-3932:


The test 
{{org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.TestRMAppAttemptTransitions.testGetClientToken}}
 is skipped; as far as I know, it is not related to this patch.

 SchedulerApplicationAttempt#getResourceUsageReport should be based on 
 NodeLabel
 ---

 Key: YARN-3932
 URL: https://issues.apache.org/jira/browse/YARN-3932
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: 0001-YARN-3932.patch, 0002-YARN-3932.patch, 
 0003-YARN-3932.patch, 0004-YARN-3932.patch, ApplicationReport.jpg, 
 TestResult.jpg


 Application Resource Report shown wrong when node Label is used.
 1.Submit application with NodeLabel
 2.Check RM UI for resources used 
 Allocated CPU VCores and Allocated Memory MB is always {{zero}}
 {code}
  public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
 AggregateAppResourceUsage runningResourceUsage =
 getRunningAggregateAppResourceUsage();
 Resource usedResourceClone =
 Resources.clone(attemptResourceUsage.getUsed());
 Resource reservedResourceClone =
 Resources.clone(attemptResourceUsage.getReserved());
 return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
 reservedContainers.size(), usedResourceClone, reservedResourceClone,
 Resources.add(usedResourceClone, reservedResourceClone),
 runningResourceUsage.getMemorySeconds(),
 runningResourceUsage.getVcoreSeconds());
   }
 {code}
 should be {{attemptResourceUsage.getUsed(label)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3932) SchedulerApplicationAttempt#getResourceUsageReport should be based on NodeLabel

2015-07-21 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3932:
---
Attachment: 0005-YARN-3932.patch

Attaching patch after whitespace fix

 SchedulerApplicationAttempt#getResourceUsageReport should be based on 
 NodeLabel
 ---

 Key: YARN-3932
 URL: https://issues.apache.org/jira/browse/YARN-3932
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
 Attachments: 0001-YARN-3932.patch, 0002-YARN-3932.patch, 
 0003-YARN-3932.patch, 0004-YARN-3932.patch, 0005-YARN-3932.patch, 
 ApplicationReport.jpg, TestResult.jpg


 Application Resource Report shown wrong when node Label is used.
 1.Submit application with NodeLabel
 2.Check RM UI for resources used 
 Allocated CPU VCores and Allocated Memory MB is always {{zero}}
 {code}
  public synchronized ApplicationResourceUsageReport getResourceUsageReport() {
 AggregateAppResourceUsage runningResourceUsage =
 getRunningAggregateAppResourceUsage();
 Resource usedResourceClone =
 Resources.clone(attemptResourceUsage.getUsed());
 Resource reservedResourceClone =
 Resources.clone(attemptResourceUsage.getReserved());
 return ApplicationResourceUsageReport.newInstance(liveContainers.size(),
 reservedContainers.size(), usedResourceClone, reservedResourceClone,
 Resources.add(usedResourceClone, reservedResourceClone),
 runningResourceUsage.getMemorySeconds(),
 runningResourceUsage.getVcoreSeconds());
   }
 {code}
 should be {{attemptResourceUsage.getUsed(label)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()

2015-07-15 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3893:
---
Attachment: 0004-YARN-3893.patch

Attaching a patch after updating the comment and adding a test case. A sketch of the intended rollback behaviour is below.
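
For illustration, a minimal sketch (with stand-in names, not the actual RM code) of the behaviour the patch aims for: if the refresh step fails while transitioning to active, roll back to standby and rethrow, so the failover controller sees the failure instead of ending up with two active RMs.

{code}
// Illustration only: if the refresh step fails while becoming active,
// roll back to standby and rethrow so the caller sees the failure,
// instead of leaving this RM half-active alongside the other one.
public class TransitionToActiveSketch {
  enum HAState { STANDBY, ACTIVE }

  private HAState state = HAState.STANDBY;

  void refreshAll() throws Exception {
    // Stands in for refreshing queues/ACLs/user-groups from configuration.
    throw new Exception("capacity-scheduler.xml is invalid");
  }

  void startActiveServices() { state = HAState.ACTIVE; }

  void stopActiveServices() { state = HAState.STANDBY; }

  synchronized void transitionToActive() throws Exception {
    startActiveServices();
    try {
      refreshAll();
    } catch (Exception e) {
      // Roll back instead of staying active with broken configuration.
      stopActiveServices();
      throw new Exception("transitionToActive failed, staying standby", e);
    }
  }

  public static void main(String[] args) {
    TransitionToActiveSketch rm = new TransitionToActiveSketch();
    try {
      rm.transitionToActive();
    } catch (Exception e) {
      System.out.println(e.getMessage() + " (state=" + rm.state + ")");
    }
  }
}
{code}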

 Both RM in active state when Admin#transitionToActive failure from refeshAll()
 --

 Key: YARN-3893
 URL: https://issues.apache.org/jira/browse/YARN-3893
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
 Attachments: 0001-YARN-3893.patch, 0002-YARN-3893.patch, 
 0003-YARN-3893.patch, 0004-YARN-3893.patch, yarn-site.xml


 Cases that can cause this.
 # Capacity scheduler xml is wrongly configured during switch
 # Refresh ACL failure due to configuration
 # Refresh User group failure due to configuration
 Both RMs will continuously try to become active
 {code}
 dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin
  ./yarn rmadmin  -getServiceState rm1
 15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 active
 dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin
  ./yarn rmadmin  -getServiceState rm2
 15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 active
 {code}
 # Both Web UI active
 # Status shown as active for both RM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4302) SLS not able start due to NPE in SchedulerApplicationAttempt#getResourceUsageReport

2015-10-27 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4302:
---
Attachment: 0001-YARN-4302.patch

This is an impact of YARN-4285.
Attaching a patch for the same; please review. A minimal sketch of the defensive pattern is below.
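
For context, the NPE comes from {{Resources.clone()}} being handed a null {{Resource}} via the SLS scheduler wrapper. Below is only a minimal, generic sketch of the defensive pattern (a null input falls back to a zero value); it uses plain arrays as stand-ins and is not the actual patch.

{code}
// Illustration of the defensive pattern: never hand a null resource to clone(),
// fall back to an all-zero resource instead (names here are stand-ins).
public class NullSafeUsageSketch {
  static final long[] ZERO = {0L, 0L}; // {memoryMB, vcores}

  static long[] cloneResource(long[] r) {
    // Cloning a null resource is what threw the NPE in the report path.
    long[] source = (r != null) ? r : ZERO;
    return new long[] {source[0], source[1]};
  }

  public static void main(String[] args) {
    long[] cached = null; // e.g. usage not yet published by the scheduler wrapper
    long[] used = cloneResource(cached);
    System.out.println("used memory=" + used[0] + " vcores=" + used[1]);
  }
}
{code}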

> SLS not able start due to NPE in 
> SchedulerApplicationAttempt#getResourceUsageReport
> ---
>
> Key: YARN-4302
> URL: https://issues.apache.org/jira/browse/YARN-4302
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4302.patch
>
>
> Configure the samples from tools/sls
> yarn-site.xml
> capacityscheduler.xml
> sls-runner.xml
> to /etc/hadoop
> Start sls using
>  
> bin/slsrun.sh --input-rumen=sample-data/2jobs2min-rumen-jh.json 
> --output-dir=out
> {noformat}
> 15/10/27 14:43:36 ERROR resourcemanager.ResourceManager: Error in handling 
> event type ATTEMPT_ADDED for applicationAttempt application_1445937212593_0001
> java.lang.NullPointerException
> at org.apache.hadoop.yarn.util.resource.Resources.clone(Resources.java:117)
> at org.apache.hadoop.yarn.util.resource.Resources.multiply(Resources.java:151)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceUsageReport(SchedulerApplicationAttempt.java:692)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getAppResourceUsageReport(AbstractYarnScheduler.java:326)
> at 
> org.apache.hadoop.yarn.sls.scheduler.ResourceSchedulerWrapper.getAppResourceUsageReport(ResourceSchedulerWrapper.java:912)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.getAggregateAppResourceUsage(RMAppAttemptMetrics.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.storeNewApplicationAttempt(RMStateStore.java:819)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.storeAttempt(RMAppAttemptImpl.java:2011)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$2700(RMAppAttemptImpl.java:109)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ScheduleTransition.transition(RMAppAttemptImpl.java:1021)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ScheduleTransition.transition(RMAppAttemptImpl.java:974)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:839)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:820)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:801)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4302) SLS not able starting

2015-10-27 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-4302:
--

 Summary: SLS not able starting 
 Key: YARN-4302
 URL: https://issues.apache.org/jira/browse/YARN-4302
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt


Configure the samples from tools/sls
yarn-site.xml
capacityscheduler.xml
sls-runner.xml

{quote}
15/10/27 14:43:36 ERROR resourcemanager.ResourceManager: Error in handling 
event type ATTEMPT_ADDED for applicationAttempt application_1445937212593_0001
java.lang.NullPointerException
at org.apache.hadoop.yarn.util.resource.Resources.clone(Resources.java:117)
at org.apache.hadoop.yarn.util.resource.Resources.multiply(Resources.java:151)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceUsageReport(SchedulerApplicationAttempt.java:692)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getAppResourceUsageReport(AbstractYarnScheduler.java:326)
at 
org.apache.hadoop.yarn.sls.scheduler.ResourceSchedulerWrapper.getAppResourceUsageReport(ResourceSchedulerWrapper.java:912)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.getAggregateAppResourceUsage(RMAppAttemptMetrics.java:121)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.storeNewApplicationAttempt(RMStateStore.java:819)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.storeAttempt(RMAppAttemptImpl.java:2011)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$2700(RMAppAttemptImpl.java:109)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ScheduleTransition.transition(RMAppAttemptImpl.java:1021)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ScheduleTransition.transition(RMAppAttemptImpl.java:974)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:839)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:820)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:801)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
at java.lang.Thread.run(Thread.java:745)

{quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4302) SLS not able start due to NPE in SchedulerApplicationAttempt#getResourceUsageReport

2015-10-27 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4302:
---
Description: 
Configure the samples from tools/sls

yarn-site.xml
capacityscheduler.xml
sls-runner.xml

to /etc/hadoop

Start sls using
 
bin/slsrun.sh --input-rumen=sample-data/2jobs2min-rumen-jh.json --output-dir=out


{noformat}
15/10/27 14:43:36 ERROR resourcemanager.ResourceManager: Error in handling 
event type ATTEMPT_ADDED for applicationAttempt application_1445937212593_0001
java.lang.NullPointerException
at org.apache.hadoop.yarn.util.resource.Resources.clone(Resources.java:117)
at org.apache.hadoop.yarn.util.resource.Resources.multiply(Resources.java:151)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceUsageReport(SchedulerApplicationAttempt.java:692)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getAppResourceUsageReport(AbstractYarnScheduler.java:326)
at 
org.apache.hadoop.yarn.sls.scheduler.ResourceSchedulerWrapper.getAppResourceUsageReport(ResourceSchedulerWrapper.java:912)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.getAggregateAppResourceUsage(RMAppAttemptMetrics.java:121)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.storeNewApplicationAttempt(RMStateStore.java:819)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.storeAttempt(RMAppAttemptImpl.java:2011)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$2700(RMAppAttemptImpl.java:109)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ScheduleTransition.transition(RMAppAttemptImpl.java:1021)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ScheduleTransition.transition(RMAppAttemptImpl.java:974)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:839)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:820)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:801)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
at java.lang.Thread.run(Thread.java:745)

{noformat}


  was:
Configure the samples from tools/sls
yarn-site.xml
capacityscheduler.xml
sls-runner.xml

{quote}
15/10/27 14:43:36 ERROR resourcemanager.ResourceManager: Error in handling 
event type ATTEMPT_ADDED for applicationAttempt application_1445937212593_0001
java.lang.NullPointerException
at org.apache.hadoop.yarn.util.resource.Resources.clone(Resources.java:117)
at org.apache.hadoop.yarn.util.resource.Resources.multiply(Resources.java:151)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceUsageReport(SchedulerApplicationAttempt.java:692)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getAppResourceUsageReport(AbstractYarnScheduler.java:326)
at 
org.apache.hadoop.yarn.sls.scheduler.ResourceSchedulerWrapper.getAppResourceUsageReport(ResourceSchedulerWrapper.java:912)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.getAggregateAppResourceUsage(RMAppAttemptMetrics.java:121)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.storeNewApplicationAttempt(RMStateStore.java:819)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.storeAttempt(RMAppAttemptImpl.java:2011)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$2700(RMAppAttemptImpl.java:109)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ScheduleTransition.transition(RMAppAttemptImpl.java:1021)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ScheduleTransition.transition(RMAppAttemptImpl.java:974)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 

[jira] [Updated] (YARN-4302) SLS not able start

2015-10-27 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4302:
---
Summary: SLS not able start  (was: SLS not able starting )

> SLS not able start
> --
>
> Key: YARN-4302
> URL: https://issues.apache.org/jira/browse/YARN-4302
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>
> Configure the samples from tools/sls
> yarn-site.xml
> capacityscheduler.xml
> sls-runner.xml
> {quote}
> 15/10/27 14:43:36 ERROR resourcemanager.ResourceManager: Error in handling 
> event type ATTEMPT_ADDED for applicationAttempt application_1445937212593_0001
> java.lang.NullPointerException
> at org.apache.hadoop.yarn.util.resource.Resources.clone(Resources.java:117)
> at org.apache.hadoop.yarn.util.resource.Resources.multiply(Resources.java:151)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceUsageReport(SchedulerApplicationAttempt.java:692)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getAppResourceUsageReport(AbstractYarnScheduler.java:326)
> at 
> org.apache.hadoop.yarn.sls.scheduler.ResourceSchedulerWrapper.getAppResourceUsageReport(ResourceSchedulerWrapper.java:912)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.getAggregateAppResourceUsage(RMAppAttemptMetrics.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.storeNewApplicationAttempt(RMStateStore.java:819)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.storeAttempt(RMAppAttemptImpl.java:2011)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$2700(RMAppAttemptImpl.java:109)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ScheduleTransition.transition(RMAppAttemptImpl.java:1021)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ScheduleTransition.transition(RMAppAttemptImpl.java:974)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:839)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:820)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:801)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
> at java.lang.Thread.run(Thread.java:745)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4302) SLS not able start due to NPE in SchedulerApplicationAttempt#getResourceUsageReport

2015-10-27 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4302:
---
Attachment: 0001-YARN-4302.patch

Uploading the patch again to trigger CI.
The test failure is not related to the attached patch; a local run in Eclipse passes.

> SLS not able start due to NPE in 
> SchedulerApplicationAttempt#getResourceUsageReport
> ---
>
> Key: YARN-4302
> URL: https://issues.apache.org/jira/browse/YARN-4302
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4302.patch, 0001-YARN-4302.patch
>
>
> Configure the samples from tools/sls
> yarn-site.xml
> capacityscheduler.xml
> sls-runner.xml
> to /etc/hadoop
> Start sls using
>  
> bin/slsrun.sh --input-rumen=sample-data/2jobs2min-rumen-jh.json 
> --output-dir=out
> {noformat}
> 15/10/27 14:43:36 ERROR resourcemanager.ResourceManager: Error in handling 
> event type ATTEMPT_ADDED for applicationAttempt application_1445937212593_0001
> java.lang.NullPointerException
> at org.apache.hadoop.yarn.util.resource.Resources.clone(Resources.java:117)
> at org.apache.hadoop.yarn.util.resource.Resources.multiply(Resources.java:151)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceUsageReport(SchedulerApplicationAttempt.java:692)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getAppResourceUsageReport(AbstractYarnScheduler.java:326)
> at 
> org.apache.hadoop.yarn.sls.scheduler.ResourceSchedulerWrapper.getAppResourceUsageReport(ResourceSchedulerWrapper.java:912)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.getAggregateAppResourceUsage(RMAppAttemptMetrics.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.storeNewApplicationAttempt(RMStateStore.java:819)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.storeAttempt(RMAppAttemptImpl.java:2011)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$2700(RMAppAttemptImpl.java:109)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ScheduleTransition.transition(RMAppAttemptImpl.java:1021)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ScheduleTransition.transition(RMAppAttemptImpl.java:974)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:839)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:820)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:801)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4304) AM max resource configuration per partition need not be displayed properly in UI

2015-10-27 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976309#comment-14976309
 ] 

Bibin A Chundatt commented on YARN-4304:


Hi [~sunilg],
The cluster metrics also need to be updated along with the scheduler page.
Currently {{Total Memory}} and {{Total Vcores}} in the cluster metrics show only the DEFAULT_PARTITION resources. Should I raise a separate JIRA for the same?

> AM max resource configuration per partition need not be displayed properly in 
> UI
> 
>
> Key: YARN-4304
> URL: https://issues.apache.org/jira/browse/YARN-4304
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Affects Versions: 2.7.1
>Reporter: Sunil G
>Assignee: Sunil G
>
> As we are supporting per-partition level max AM resource percentage 
> configuration, UI also need to display correct configurations related to 
> same. Current UI still shows am-resource percentage per queue level. This is 
> to be updated correctly when label config is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4304) AM max resource configuration per partition need not be displayed properly in UI

2015-10-27 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976483#comment-14976483
 ] 

Bibin A Chundatt commented on YARN-4304:


Not a problem at all.
[~sunilg], please do consider the metrics too.

> AM max resource configuration per partition need not be displayed properly in 
> UI
> 
>
> Key: YARN-4304
> URL: https://issues.apache.org/jira/browse/YARN-4304
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Affects Versions: 2.7.1
>Reporter: Sunil G
>Assignee: Sunil G
>
> As we are supporting per-partition level max AM resource percentage 
> configuration, UI also need to display correct configurations related to 
> same. Current UI still shows am-resource percentage per queue level. This is 
> to be updated correctly when label config is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4315) NaN in Queue percentage for cluster apps page

2015-10-29 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4315:
---
Attachment: Snap1.jpg

> NaN in Queue percentage for cluster apps page
> -
>
> Key: YARN-4315
> URL: https://issues.apache.org/jira/browse/YARN-4315
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: Snap1.jpg
>
>
> Steps to reproduce
> Submit application 
> Switch RM and check the percentage of queue usage
> Queue percentage shown as NaN



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4315) NaN in Queue percentage for cluster apps page

2015-10-29 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-4315:
--

 Summary: NaN in Queue percentage for cluster apps page
 Key: YARN-4315
 URL: https://issues.apache.org/jira/browse/YARN-4315
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor


Steps to reproduce

Submit application 
Switch RM and check the percentage of queue usage


Queue percentage shown as NaN
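
For illustration (not the actual web UI code), the NaN is the classic result of dividing zero used resources by a zero cluster total right after the switchover, before the new active RM has re-registered the nodes. A minimal sketch of the symptom and a guard:

{code}
// Illustration: right after an RM switchover the cluster totals can still be 0,
// and 0.0f / 0.0f is NaN, which is what the apps page rendered.
public class QueuePercentageSketch {
  static float queueUsedPercentage(float usedMemory, float clusterMemory) {
    if (clusterMemory <= 0f) {
      return 0f; // guard: report 0% until cluster resources are registered
    }
    return usedMemory / clusterMemory * 100f;
  }

  public static void main(String[] args) {
    System.out.println(0f / 0f);                          // NaN, the reported symptom
    System.out.println(queueUsedPercentage(0f, 0f));      // 0.0 with the guard
    System.out.println(queueUsedPercentage(512f, 4096f)); // 12.5
  }
}
{code}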



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery

2015-11-13 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4216:
---
Attachment: 0001-YARN-4216.patch

Attaching patch for review

> Container logs not shown for newly assigned containers  after NM  recovery
> --
>
> Key: YARN-4216
> URL: https://issues.apache.org/jira/browse/YARN-4216
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, nodemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4216.patch, NMLog, ScreenshotFolder.png, 
> yarn-site.xml
>
>
> Steps to reproduce
> # Start 2 nodemanagers  with NM recovery enabled
> # Submit pi job with 20 maps 
> # Once 5 maps get completed on NM 1, stop the NM (yarn daemon stop nodemanager)
> (logs of all completed containers get aggregated to HDFS)
> # Now start NM1 again and wait for job completion
> *The newly assigned container logs on NM1 are not shown*
> *HDFS log dir state*
> # When logs are aggregated to HDFS during the stop, the file has the name (localhost_38153)
> # On log aggregation after restarting the NM, the newly assigned container logs get 
> uploaded with the name (localhost_38153.tmp)
> In the history server, the logs are not shown for the new task attempts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-11-16 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007967#comment-15007967
 ] 

Bibin A Chundatt commented on YARN-4140:


+1 for merging this into 2.7.

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Fix For: 2.8.0
>
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch, 
> 0009-YARN-4140.patch, 0010-YARN-4140.patch, 0011-YARN-4140.patch, 
> 0012-YARN-4140.patch, 0013-YARN-4140.patch, 0014-YARN-4140.patch
>
>
> While running an application on a node label partition, I found that the 
> application execution time is delayed by 5–10 minutes for 500 containers. 
> Of 3 machines in total, 2 were in the same partition and the application was submitted to that partition.
> After enabling debug logging I was able to find the following:
> # From the AM, the container ask is for OFF-SWITCH.
> # The RM allocates all containers as NODE_LOCAL, as shown in the logs below.
> # Since I had about 500 containers, it took about 6 minutes 
> to allocate the 1st map after the AM allocation.
> # Tested with about 1K maps using the PI job, it took 17 minutes to allocate the next 
> container after the AM allocation.
> Only once 500 container allocations have been done at NODE_LOCAL is the next container 
> allocation done at OFF_SWITCH.
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
> {code}
> 

[jira] [Created] (YARN-4362) Too many preemption activity when nodelabels are non exclusive

2015-11-16 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-4362:
--

 Summary: Too many preemption activity when nodelabels are non 
exclusive
 Key: YARN-4362
 URL: https://issues.apache.org/jira/browse/YARN-4362
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Priority: Critical


Steps to reproduce
===
1. Configure an HA cluster with 6 nodes and 3 partitions (1, 2, 3), all non-exclusive.

*Partition configuration is as follows*

NMs 1 and 2 mapped to label 1
NM 3 mapped to label 2
NMs 4 and 5 mapped to label 3
NM 6 in the DEFAULT partition

In the capacity scheduler, the queues are linked only to partitions 1 and 3.
NM 3 with label 2 is a backup node; whenever required, its label is changed to whichever partition needs it.

Submit an application/job with 200 containers to the default queue.
All containers that get assigned to partition 2 get preempted.

The application/map task execution takes more time, since 30-40 tasks get assigned to partition 2, then get preempted, and all of them need to be relaunched.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4362) Too many preemption activity when nodelabels are non exclusive

2015-11-16 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4362:
---
Attachment: capacity-scheduler.xml
Preemptedpartition.log
ProportionalPolicy.log
ProportionalDefaultQueue.log

> Too many preemption activity when nodelabels are non exclusive
> --
>
> Key: YARN-4362
> URL: https://issues.apache.org/jira/browse/YARN-4362
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Varun Saxena
>Priority: Critical
> Attachments: Preemptedpartition.log, ProportionalDefaultQueue.log, 
> ProportionalPolicy.log, capacity-scheduler.xml
>
>
> Steps to reproduce
> ===
> 1. Configure an HA cluster with 6 nodes and 3 partitions (1, 2, 3), all non-exclusive.
> *Partition configuration is as follows*
> NMs 1 and 2 mapped to label 1
> NM 3 mapped to label 2
> NMs 4 and 5 mapped to label 3
> NM 6 in the DEFAULT partition
> In the capacity scheduler, the queues are linked only to partitions 1 and 3.
> NM 3 with label 2 is a backup node; whenever required, its label is changed to 
> whichever partition needs it.
> Submit an application/job with 200 containers to the default queue.
> All containers that get assigned to partition 2 get preempted.
> The application/map task execution takes more time, since 30-40 tasks get 
> assigned to partition 2, then get preempted, and all of them need to be 
> relaunched.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4362) Too many preemption activity when nodelabels are non exclusive

2015-11-16 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15008072#comment-15008072
 ] 

Bibin A Chundatt commented on YARN-4362:


Attached the logs and the xml.

It looks like the guaranteed resource of the default queue for partition 2 will always be zero, so any container assigned to partition 2 gets preempted by ProportionalCapacityPreemptionPolicy even when no other application is running.

We should restrict assigning to partition 2; a small sketch of the arithmetic is below.
Thoughts?
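
A minimal sketch of the arithmetic behind that observation, with made-up numbers and stand-in names (not the actual ProportionalCapacityPreemptionPolicy code): with zero guaranteed capacity on partition 2, everything the queue uses there is over its guarantee and therefore a preemption target.

{code}
// Illustration of the arithmetic: with 0% guaranteed capacity for the default
// queue on partition 2, anything running there is "over capacity" and the
// proportional preemption policy will target it.
public class ZeroGuaranteedPreemptionSketch {
  static long preemptable(long usedOnPartition, long guaranteedOnPartition) {
    return Math.max(0, usedOnPartition - guaranteedOnPartition);
  }

  public static void main(String[] args) {
    long usedMB = 40 * 1024;      // ~40 map containers landed on partition 2
    long guaranteedMB = 0;        // queue has no capacity configured for label 2
    System.out.println("to preempt = " + preemptable(usedMB, guaranteedMB) + " MB");
  }
}
{code}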

> Too many preemption activity when nodelabels are non exclusive
> --
>
> Key: YARN-4362
> URL: https://issues.apache.org/jira/browse/YARN-4362
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Varun Saxena
>Priority: Critical
> Attachments: Preemptedpartition.log, ProportionalDefaultQueue.log, 
> ProportionalPolicy.log, capacity-scheduler.xml
>
>
> Steps to reproduce
> ===
> 1. Configure an HA cluster with 6 nodes and 3 partitions (1, 2, 3), all non-exclusive.
> *Partition configuration is as follows*
> NMs 1 and 2 mapped to label 1
> NM 3 mapped to label 2
> NMs 4 and 5 mapped to label 3
> NM 6 in the DEFAULT partition
> In the capacity scheduler, the queues are linked only to partitions 1 and 3.
> NM 3 with label 2 is a backup node; whenever required, its label is changed to 
> whichever partition needs it.
> Submit an application/job with 200 containers to the default queue.
> All containers that get assigned to partition 2 get preempted.
> The application/map task execution takes more time, since 30-40 tasks get 
> assigned to partition 2, then get preempted, and all of them need to be 
> relaunched.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4315) NaN in Queue percentage for cluster apps page

2015-11-02 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986699#comment-14986699
 ] 

Bibin A Chundatt commented on YARN-4315:


It happens only during RM switchover, and it is temporary.

> NaN in Queue percentage for cluster apps page
> -
>
> Key: YARN-4315
> URL: https://issues.apache.org/jira/browse/YARN-4315
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: Snap1.jpg
>
>
> Steps to reproduce
> Submit application 
> Switch RM and check the percentage of queue usage
> Queue percentage shown as NaN



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4269) Log aggregation should not swallow the exception during close()

2015-10-18 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962867#comment-14962867
 ] 

Bibin A Chundatt commented on YARN-4269:


[~lichangleo],
could {{IOUtils#cleanup}} be used here instead? A small sketch of the trade-off is below.
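
To make the trade-off concrete, a small sketch, assuming hadoop-common and commons-logging are on the classpath: {{IOUtils.cleanup(Log, Closeable...)}} logs and swallows the {{close()}} failure (fine for an abort path), whereas calling {{close()}} directly lets the exception surface, which is what this JIRA asks for on the normal path so a truncated aggregated log file is not silently accepted.

{code}
import java.io.Closeable;
import java.io.IOException;
import java.io.OutputStream;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.IOUtils;

// Sketch of the trade-off being discussed: IOUtils.cleanup() logs and swallows
// close() failures (acceptable on an abort path), while a direct close() lets
// the exception propagate so a partial file is not silently accepted.
public class CloseHandlingSketch {
  private static final Log LOG = LogFactory.getLog(CloseHandlingSketch.class);

  static class FailingStream extends OutputStream {
    @Override public void write(int b) {}
    @Override public void close() throws IOException {
      throw new IOException("lost lease while flushing aggregated log");
    }
  }

  public static void main(String[] args) {
    Closeable writer = new FailingStream();

    // Abort/error path: best-effort cleanup, exception only logged.
    IOUtils.cleanup(LOG, writer);

    // Success path: surface the failure instead of swallowing it.
    try {
      new FailingStream().close();
    } catch (IOException e) {
      System.out.println("close() failed and was reported: " + e.getMessage());
    }
  }
}
{code}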

> Log aggregation should not swallow the exception during close()
> ---
>
> Key: YARN-4269
> URL: https://issues.apache.org/jira/browse/YARN-4269
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4269.patch
>
>
> the log aggregation thread ignores exception thrown by close(). It shouldn't 
> be ignored, since the file content may be missing or partial.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4155) TestLogAggregationService.testLogAggregationServiceWithInterval failing

2015-10-13 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4155:
---
Attachment: 0004-YARN-4155.patch

[~ste...@apache.org] Thanks for reviewing the patch. Attaching an updated patch.

> TestLogAggregationService.testLogAggregationServiceWithInterval failing
> ---
>
> Key: YARN-4155
> URL: https://issues.apache.org/jira/browse/YARN-4155
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0
> Environment: Jenkins
>Reporter: Steve Loughran
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4155.patch, 0001-YARN-4155.patch, 
> 0003-YARN-4155.patch, 0004-YARN-4155.patch
>
>
> Test failing on Jenkins: 
> {{TestLogAggregationService.testLogAggregationServiceWithInterval}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4250) NPE in AppSchedulingInfo#isRequestLabelChanged

2015-10-13 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4250:
---
Attachment: YARN-4250-004.patch

Updating the patch since the precommit build is failing.

[~rohithsharma]/[~sunilg], on second thought, yes, it is confusing, so I have attached a patch with comments and with the check split as well. A null-safe sketch of the comparison is below.
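
For illustration only (stand-in names, not the attached patch), a null-safe version of the kind of comparison {{isRequestLabelChanged}} needs, treating a null label expression the same as an empty one:

{code}
import java.util.Objects;

// Illustration only: the NPE comes from comparing label expressions when one of
// the requests carries a null expression; a null-safe equality check avoids it.
public class RequestLabelChangeSketch {

  static boolean isRequestLabelChanged(String oldLabelExpression,
      String newLabelExpression) {
    // Treat null and "" both as "no label requested" (DEFAULT_PARTITION).
    String oldLabel = (oldLabelExpression == null) ? "" : oldLabelExpression;
    String newLabel = (newLabelExpression == null) ? "" : newLabelExpression;
    return !Objects.equals(oldLabel, newLabel);
  }

  public static void main(String[] args) {
    System.out.println(isRequestLabelChanged(null, "x"));  // true
    System.out.println(isRequestLabelChanged(null, null)); // false
    System.out.println(isRequestLabelChanged("", null));   // false
  }
}
{code}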





> NPE in AppSchedulingInfo#isRequestLabelChanged
> --
>
> Key: YARN-4250
> URL: https://issues.apache.org/jira/browse/YARN-4250
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Brahma Reddy Battula
>Assignee: Brahma Reddy Battula
>Priority: Blocker
> Attachments: YARN-4250-002.patch, YARN-4250-003.patch, 
> YARN-4250-004.patch, YARN-4250.patch
>
>
>  *Trace* 
> {noformat}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.isRequestLabelChanged(AppSchedulingInfo.java:420)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updateResourceRequests(AppSchedulingInfo.java:342)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.updateResourceRequests(SchedulerApplicationAttempt.java:300)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.allocate(FifoScheduler.java:350)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart$MyFifoScheduler.allocate(TestAMRMClientOnRMRestart.java:544)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:507)
>   at 
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:277)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart.testAMRMClientResendsRequestsOnRMRestart(TestAMRMClientOnRMRestart.java:187)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>   at java.lang.reflect.Method.invoke(Unknown Source)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4250) NPE in AppSchedulingInfo#isRequestLabelChanged

2015-10-13 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956241#comment-14956241
 ] 

Bibin A Chundatt commented on YARN-4250:


Rohith, I feel it is fine.

> NPE in AppSchedulingInfo#isRequestLabelChanged
> --
>
> Key: YARN-4250
> URL: https://issues.apache.org/jira/browse/YARN-4250
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Brahma Reddy Battula
>Assignee: Brahma Reddy Battula
>Priority: Blocker
> Attachments: YARN-4250-002.patch, YARN-4250-003.patch, YARN-4250.patch
>
>
>  *Trace* 
> {noformat}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.isRequestLabelChanged(AppSchedulingInfo.java:420)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updateResourceRequests(AppSchedulingInfo.java:342)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.updateResourceRequests(SchedulerApplicationAttempt.java:300)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.allocate(FifoScheduler.java:350)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart$MyFifoScheduler.allocate(TestAMRMClientOnRMRestart.java:544)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:507)
>   at 
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:277)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart.testAMRMClientResendsRequestsOnRMRestart(TestAMRMClientOnRMRestart.java:187)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>   at java.lang.reflect.Method.invoke(Unknown Source)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4254) ApplicationAttempt stuck for ever due to UnknowHostexception

2015-10-13 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956132#comment-14956132
 ] 

Bibin A Chundatt commented on YARN-4254:


But this will only solve this particular case; I am not sure whether anything other than resource-related issues could cause the AM to not get a container.

> ApplicationAttempt stuck for ever due to UnknowHostexception
> 
>
> Key: YARN-4254
> URL: https://issues.apache.org/jira/browse/YARN-4254
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4254.patch, Logs.txt, Test.patch
>
>
> Scenario
> ===
> 1. RM HA and 5 NMs available in cluster and are working fine 
> 2. Add one more NM to the same cluster but RM /etc/hosts not updated.
> 3. Submit application to the same cluster
> If the AM gets allocated to the newly added NM, the *application attempt will get 
> stuck forever*. The user will not get to know why this happened.
> Impact
> 1. RM logs get overloaded with the exception.
> 2. The application gets stuck forever.
> Handling suggestion: YARN-261 allows failing an application attempt.
> If we fail it, the next attempt could get assigned to another NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4228) FileSystemRMStateStore use IOUtils#close instead of fs#close

2015-10-06 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14946238#comment-14946238
 ] 

Bibin A Chundatt commented on YARN-4228:


[~rohithsharma],
thanks for looking into this issue. The test failures are not related; YARN-3342 is already open for the same.

> FileSystemRMStateStore use IOUtils#close instead of fs#close
> 
>
> Key: YARN-4228
> URL: https://issues.apache.org/jira/browse/YARN-4228
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: 0001-YARN-4228.patch
>
>
> NPE on {{FileSystemRMStateStore#closeWithRetries}} when active service 
> initialization fails on rm start up
> {noformat}
> 2015-10-05 19:56:38,626 INFO org.apache.hadoop.service.AbstractService: 
> Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore 
> failed in state STOPPED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore$13.run(FileSystemRMStateStore.java:721)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore$13.run(FileSystemRMStateStore.java:718)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore$FSAction.runWithRetries(FileSystemRMStateStore.java:734)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.closeWithRetries(FileSystemRMStateStore.java:718)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.closeInternal(FileSystemRMStateStore.java:169)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.serviceStop(RMStateStore.java:618)
> at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
> at 
> org.apache.hadoop.service.AbstractService.close(AbstractService.java:250)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:609)
> at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
> at 
> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
> at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:965)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1195)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4232) TopCLI console shows exceptions for help command

2015-10-07 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-4232:
--

 Summary: TopCLI console shows exceptions  for help command
 Key: YARN-4232
 URL: https://issues.apache.org/jira/browse/YARN-4232
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor


*Steps to reproduce*

Start Top command in YARN in HA mode
./yarn top

{noformat}
usage: yarn top
 -cols  Number of columns on the terminal
 -delay The refresh delay(in seconds), default is 3 seconds
 -help   Print usage; for help while the tool is running press 'h'
 + Enter
 -queuesComma separated list of queues to restrict applications
 -rows  Number of rows on the terminal
 -types Comma separated list of types to restrict applications,
 case sensitive(though the display is lower case)
 -users Comma separated list of users to restrict applications

{noformat}

Press *'h' + Enter* (for help, as suggested by the usage message) while the top tool 
is running

Exception is thrown in console continuously
{noformat}
15/10/07 14:59:28 ERROR cli.TopCLI: Could not fetch RM start time
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:204)
at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at java.net.Socket.connect(Socket.java:538)
at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at 
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1168)
at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1104)
at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:998)
at 
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:932)
at 
org.apache.hadoop.yarn.client.cli.TopCLI.getRMStartTime(TopCLI.java:742)
at org.apache.hadoop.yarn.client.cli.TopCLI.run(TopCLI.java:467)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.yarn.client.cli.TopCLI.main(TopCLI.java:420)

{noformat}





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4232) TopCLI console shows exceptions for help command

2015-10-07 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4232:
---
Attachment: 0001-YARN-4232.patch

Currently HA mode is not supported: the HTTP request for the RM start time gets 
submitted to the default IP and port. The patch clears the screen and shows the 
help message when that request fails.

Attaching a patch for the same.
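
As a rough, standalone illustration of the failure-handling idea (not the attached 
patch), the sketch below probes the RM cluster-info REST endpoint and returns -1 
instead of logging a stack trace when the RM web address is unreachable; the URL, 
timeouts and the {{startedOn}} JSON field are assumptions made for this sketch.

{code}
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative probe: fetch the RM start time, or -1 if the RM is unreachable. */
public class RmStartTimeProbe {

  private static final Pattern STARTED_ON =
      Pattern.compile("\"startedOn\"\\s*:\\s*(\\d+)");

  static long fetchRmStartTime(String clusterInfoUrl) {
    HttpURLConnection conn = null;
    try {
      conn = (HttpURLConnection) new URL(clusterInfoUrl).openConnection();
      conn.setConnectTimeout(5000);
      conn.setReadTimeout(5000);
      if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) {
        return -1L;                       // fall back quietly, no stack trace
      }
      try (InputStream in = conn.getInputStream();
           Scanner s = new Scanner(in, StandardCharsets.UTF_8.name())
               .useDelimiter("\\A")) {
        Matcher m = STARTED_ON.matcher(s.hasNext() ? s.next() : "");
        return m.find() ? Long.parseLong(m.group(1)) : -1L;
      }
    } catch (IOException e) {
      // Connection refused / unknown host (e.g. HA without a resolvable default
      // address): skip the start-time display instead of flooding the console.
      return -1L;
    } finally {
      if (conn != null) {
        conn.disconnect();
      }
    }
  }

  public static void main(String[] args) {
    long start = fetchRmStartTime("http://rm-host:8088/ws/v1/cluster/info");
    System.out.println(start < 0 ? "RM start time unavailable"
                                 : "RM started at " + start);
  }
}
{code}

The caller can then clear the screen and print the help text once when -1 is 
returned, instead of repeating the ConnectException on every refresh.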

> TopCLI console shows exceptions  for help command
> -
>
> Key: YARN-4232
> URL: https://issues.apache.org/jira/browse/YARN-4232
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: 0001-YARN-4232.patch
>
>
> *Steps to reproduce*
> Start Top command in YARN in HA mode
> ./yarn top
> {noformat}
> usage: yarn top
>  -cols  Number of columns on the terminal
>  -delay The refresh delay(in seconds), default is 3 seconds
>  -help   Print usage; for help while the tool is running press 'h'
>  + Enter
>  -queuesComma separated list of queues to restrict applications
>  -rows  Number of rows on the terminal
>  -types Comma separated list of types to restrict applications,
>  case sensitive(though the display is lower case)
>  -users Comma separated list of users to restrict applications
> {noformat}
> Execute *for help while the tool is running press 'h'  + Enter* while top 
> tool is running
> Exception is thrown in console continuously
> {noformat}
> 15/10/07 14:59:28 ERROR cli.TopCLI: Could not fetch RM start time
> java.net.ConnectException: Connection refused
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
> at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:204)
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at java.net.Socket.connect(Socket.java:538)
> at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
> at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
> at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
> at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
> at sun.net.www.http.HttpClient.New(HttpClient.java:308)
> at sun.net.www.http.HttpClient.New(HttpClient.java:326)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1168)
> at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1104)
> at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:998)
> at 
> sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:932)
> at 
> org.apache.hadoop.yarn.client.cli.TopCLI.getRMStartTime(TopCLI.java:742)
> at org.apache.hadoop.yarn.client.cli.TopCLI.run(TopCLI.java:467)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> at org.apache.hadoop.yarn.client.cli.TopCLI.main(TopCLI.java:420)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4029) Update LogAggregationStatus to store on finish

2015-10-10 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4029:
---
Attachment: (was: 0003-YARN-4029.patch)

> Update LogAggregationStatus to store on finish
> --
>
> Key: YARN-4029
> URL: https://issues.apache.org/jira/browse/YARN-4029
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4029.patch, 0002-YARN-4029.patch, Image.jpg
>
>
> Currently the log aggregation status is not getting updated to Store. When RM 
> is restarted will show NOT_START. 
> Steps to reproduce
> 
> 1.Submit mapreduce application
> 2.Wait for completion
> 3.Once application is completed switch RM
> *Log Aggregation Status* are changing
> *Log Aggregation Status* from SUCCESS to NOT_START



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4029) Update LogAggregationStatus to store on finish

2015-10-10 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4029:
---
Attachment: 0003-YARN-4029.patch

Attaching the same patch again to trigger Jenkins

> Update LogAggregationStatus to store on finish
> --
>
> Key: YARN-4029
> URL: https://issues.apache.org/jira/browse/YARN-4029
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4029.patch, 0002-YARN-4029.patch, 
> 0003-YARN-4029.patch, Image.jpg
>
>
> Currently the log aggregation status is not getting updated to Store. When RM 
> is restarted will show NOT_START. 
> Steps to reproduce
> 
> 1.Submit mapreduce application
> 2.Wait for completion
> 3.Once application is completed switch RM
> *Log Aggregation Status* are changing
> *Log Aggregation Status* from SUCCESS to NOT_START



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4250) NPE in AppSchedulingInfo#isRequestLabelChanged

2015-10-10 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951959#comment-14951959
 ] 

Bibin A Chundatt commented on YARN-4250:


Small correction to my earlier comment.
In {{ApplicationMasterService#allocate}} null labels are getting changed to 
NO_LABEL, so the label *should not be null* in requests after the 
{{RMServerUtils#normalizeAndValidateRequests}} call quoted in my previous comment.


Also, if you agree, please update the defect description too.

> NPE in AppSchedulingInfo#isRequestLabelChanged
> --
>
> Key: YARN-4250
> URL: https://issues.apache.org/jira/browse/YARN-4250
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Brahma Reddy Battula
>Assignee: Brahma Reddy Battula
>Priority: Critical
> Attachments: YARN-4250.patch
>
>
>  *Trace* 
> {noformat}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.isRequestLabelChanged(AppSchedulingInfo.java:420)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updateResourceRequests(AppSchedulingInfo.java:342)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.updateResourceRequests(SchedulerApplicationAttempt.java:300)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.allocate(FifoScheduler.java:350)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart$MyFifoScheduler.allocate(TestAMRMClientOnRMRestart.java:544)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:507)
>   at 
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:277)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart.testAMRMClientResendsRequestsOnRMRestart(TestAMRMClientOnRMRestart.java:187)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>   at java.lang.reflect.Method.invoke(Unknown Source)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4250) NPE in AppSchedulingInfo#isRequestLabelChanged

2015-10-10 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951954#comment-14951954
 ] 

Bibin A Chundatt commented on YARN-4250:


Hi [~brahmareddy]

Thanks for reporting this. 

In {{ApplicationMasterService#allocate}} null labels are getting changed to 
NO_LABEL, so the label should be null in requests after the call below.
{code}
RMServerUtils.normalizeAndValidateRequests(ask,
maximumCapacity, app.getQueue(),
rScheduler, rmContext);
{code}

But to my understanding the fix should be in {{TestAMRMClientOnRMRestart#allocate}}:

{code}
  List<ResourceRequest> askCopy = new ArrayList<ResourceRequest>();
  for (ResourceRequest req : ask) {
ResourceRequest reqCopy =
ResourceRequest.newInstance(req.getPriority(),
req.getResourceName(), req.getCapability(),
req.getNumContainers(), req.getRelaxLocality());
askCopy.add(reqCopy);
  }
{code}

When reqCopy is created, the label expression is not copied. The copy should 
instead be created as below:

{code}
ResourceRequest reqCopy =
ResourceRequest.newInstance(req.getPriority(),
req.getResourceName(), req.getCapability(),
req.getNumContainers(), 
req.getRelaxLocality(),req.getNodeLabelExpression());
{code}

Could you please check again with the above change?

> NPE in AppSchedulingInfo#isRequestLabelChanged
> --
>
> Key: YARN-4250
> URL: https://issues.apache.org/jira/browse/YARN-4250
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Brahma Reddy Battula
>Assignee: Brahma Reddy Battula
>Priority: Critical
> Attachments: YARN-4250.patch
>
>
>  *Trace* 
> {noformat}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.isRequestLabelChanged(AppSchedulingInfo.java:420)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updateResourceRequests(AppSchedulingInfo.java:342)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.updateResourceRequests(SchedulerApplicationAttempt.java:300)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.allocate(FifoScheduler.java:350)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart$MyFifoScheduler.allocate(TestAMRMClientOnRMRestart.java:544)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:507)
>   at 
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:277)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart.testAMRMClientResendsRequestsOnRMRestart(TestAMRMClientOnRMRestart.java:187)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>   at java.lang.reflect.Method.invoke(Unknown Source)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4029) Update LogAggregationStatus to store on finish

2015-10-10 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952129#comment-14952129
 ] 

Bibin A Chundatt commented on YARN-4029:


The testcase failure is not related to this patch; YARN-3342 tracks the same 
testcase failure.
The checkstyle warning can be skipped.
Please review the attached patch.

> Update LogAggregationStatus to store on finish
> --
>
> Key: YARN-4029
> URL: https://issues.apache.org/jira/browse/YARN-4029
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4029.patch, 0002-YARN-4029.patch, 
> 0003-YARN-4029.patch, Image.jpg
>
>
> Currently the log aggregation status is not getting updated to Store. When RM 
> is restarted will show NOT_START. 
> Steps to reproduce
> 
> 1.Submit mapreduce application
> 2.Wait for completion
> 3.Once application is completed switch RM
> *Log Aggregation Status* are changing
> *Log Aggregation Status* from SUCCESS to NOT_START



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4254) ApplicationAttempt stuck for ever due to UnknowHostexception

2015-10-12 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-4254:
--

 Summary: ApplicationAttempt stuck for ever due to 
UnknowHostexception
 Key: YARN-4254
 URL: https://issues.apache.org/jira/browse/YARN-4254
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt


Scenario
===
1. RM HA and 5 NMs available in the cluster and working fine
2. Add one more NM to the same cluster, but RM /etc/hosts is not updated.
3. Submit an application to the same cluster

If the AM gets allocated to the newly added NM, the *application attempt will get 
stuck forever*. The user will not get to know why this happened.

Impact

1. RM logs get overloaded with the exception
2. The application gets stuck forever.

Handling suggestion: YARN-261 allows failing an application attempt.
If we fail this attempt, the next attempt could get assigned to another NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4250) NPE in AppSchedulingInfo#isRequestLabelChanged

2015-10-11 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952626#comment-14952626
 ] 

Bibin A Chundatt commented on YARN-4250:


Hi [~brahmareddy]

The null check in YARN-4250.patch is incomplete:

{code}
+return ((null != requestOneLabelExp) && !(requestOneLabelExp
+.equals(requestTwoLabelExp)))
+|| ((null == requestOneLabelExp) && (null != requestTwoLabelExp));
{code}

The check should be done as above.
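
Just for reference, the same null-safe comparison can be expressed with 
{{java.util.Objects#equals}}; this is only a standalone sketch with made-up names, 
not the patch itself.

{code}
import java.util.Objects;

public class LabelCompare {
  /** True when the two node-label expressions differ, treating null as a value. */
  static boolean isLabelChanged(String labelOne, String labelTwo) {
    // Objects.equals(a, b) is null-safe, so a single call covers both branches
    // of the expression above (non-null vs. null on either side).
    return !Objects.equals(labelOne, labelTwo);
  }

  public static void main(String[] args) {
    System.out.println(isLabelChanged(null, null)); // false
    System.out.println(isLabelChanged(null, "x"));  // true
    System.out.println(isLabelChanged("x", "x"));   // false
    System.out.println(isLabelChanged("x", "y"));   // true
  }
}
{code}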

> NPE in AppSchedulingInfo#isRequestLabelChanged
> --
>
> Key: YARN-4250
> URL: https://issues.apache.org/jira/browse/YARN-4250
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Brahma Reddy Battula
>Assignee: Brahma Reddy Battula
>Priority: Critical
> Attachments: YARN-4250-002.patch, YARN-4250.patch
>
>
>  *Trace* 
> {noformat}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.isRequestLabelChanged(AppSchedulingInfo.java:420)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.updateResourceRequests(AppSchedulingInfo.java:342)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.updateResourceRequests(SchedulerApplicationAttempt.java:300)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.allocate(FifoScheduler.java:350)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart$MyFifoScheduler.allocate(TestAMRMClientOnRMRestart.java:544)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:507)
>   at 
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:277)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart.testAMRMClientResendsRequestsOnRMRestart(TestAMRMClientOnRMRestart.java:187)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>   at java.lang.reflect.Method.invoke(Unknown Source)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4254) ApplicationAttempt stuck for ever due to UnknowHostexception

2015-10-12 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4254:
---
Attachment: 0001-YARN-4254.patch

Attaching patch for review

> ApplicationAttempt stuck for ever due to UnknowHostexception
> 
>
> Key: YARN-4254
> URL: https://issues.apache.org/jira/browse/YARN-4254
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4254.patch
>
>
> Scenario
> ===
> 1. RM HA and 5 NMs available in cluster and are working fine 
> 2. Add one more NM to the same cluster but RM /etc/hosts not updated.
> 3. Submit application to the same cluster
> If Am get allocated to the newly added NM the *application attempt will get 
> stuck for ever*.User will not get to know why the same happened.
> Impact
> 1.RM logs gets overloaded with exception
> 2.Application gets stuck for ever.
> Handling suggestion YARN-261 allows for Fail application attempt .
> If we fail the same next attempt could get assigned to another NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4254) ApplicationAttempt stuck for ever due to UnknowHostexception

2015-10-12 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14953146#comment-14953146
 ] 

Bibin A Chundatt commented on YARN-4254:


[~jlowe]
Thanks for looking into the issue.
I had cancelled the patch; sorry, I forgot to mention that in the JIRA.
Looking further, I got to know that the retry is for the DNS-related case, but the 
attempt should give up after a fixed period of time.

For this jira
{quote}
Would it make more sense if the RM simply refused to accept nodemanagers into 
the cluster that are unresolvable?
{quote}

This solution sounds good.

Also

{quote}
Also the fact that we try forever seems broken to me. We should be giving up at 
some point and failing the attempt, whether that be due to unknown host 
exceptions or other persistent errors.
{quote}
Will try to find out further why the timeout is not happening, or whether such a 
timeout is simply not available.
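
To illustrate the kind of bounded retry meant above (only a sketch; the attempt 
count, sleep interval and host name are made-up values, and the real change would 
sit in the RM's container-token path, not in a helper like this):

{code}
import java.net.InetAddress;
import java.net.UnknownHostException;

public class BoundedResolve {
  /** Tries to resolve a host a fixed number of times, then gives up. */
  static InetAddress resolveOrGiveUp(String host, int maxAttempts, long sleepMs)
      throws UnknownHostException, InterruptedException {
    UnknownHostException last = new UnknownHostException(host);
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return InetAddress.getByName(host);   // succeeds once DNS knows the host
      } catch (UnknownHostException e) {
        last = e;                              // transient DNS hiccup: retry
        if (attempt < maxAttempts) {
          Thread.sleep(sleepMs);
        }
      }
    }
    // A persistent failure is surfaced so the attempt can be failed,
    // instead of hanging forever on the same exception.
    throw last;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(resolveOrGiveUp("localhost", 3, 1000L));
  }
}
{code}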





> ApplicationAttempt stuck for ever due to UnknowHostexception
> 
>
> Key: YARN-4254
> URL: https://issues.apache.org/jira/browse/YARN-4254
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4254.patch
>
>
> Scenario
> ===
> 1. RM HA and 5 NMs available in cluster and are working fine 
> 2. Add one more NM to the same cluster but RM /etc/hosts not updated.
> 3. Submit application to the same cluster
> If Am get allocated to the newly added NM the *application attempt will get 
> stuck for ever*.User will not get to know why the same happened.
> Impact
> 1.RM logs gets overloaded with exception
> 2.Application gets stuck for ever.
> Handling suggestion YARN-261 allows for Fail application attempt .
> If we fail the same next attempt could get assigned to another NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4254) ApplicationAttempt stuck for ever due to UnknowHostexception

2015-10-12 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14953368#comment-14953368
 ] 

Bibin A Chundatt commented on YARN-4254:


{quote}
 Did it really try forever?
{quote}
It did get stuck for more than 48 hours, and 4-5 log files were filled with the 
same exception.


> ApplicationAttempt stuck for ever due to UnknowHostexception
> 
>
> Key: YARN-4254
> URL: https://issues.apache.org/jira/browse/YARN-4254
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4254.patch
>
>
> Scenario
> ===
> 1. RM HA and 5 NMs available in cluster and are working fine 
> 2. Add one more NM to the same cluster but RM /etc/hosts not updated.
> 3. Submit application to the same cluster
> If Am get allocated to the newly added NM the *application attempt will get 
> stuck for ever*.User will not get to know why the same happened.
> Impact
> 1.RM logs gets overloaded with exception
> 2.Application gets stuck for ever.
> Handling suggestion YARN-261 allows for Fail application attempt .
> If we fail the same next attempt could get assigned to another NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4232) TopCLI console shows exceptions for help command

2015-10-07 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14946658#comment-14946658
 ] 

Bibin A Chundatt commented on YARN-4232:


The testcase failure is not related to this jira patch

> TopCLI console shows exceptions  for help command
> -
>
> Key: YARN-4232
> URL: https://issues.apache.org/jira/browse/YARN-4232
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: 0001-YARN-4232.patch
>
>
> *Steps to reproduce*
> Start Top command in YARN in HA mode
> ./yarn top
> {noformat}
> usage: yarn top
>  -cols  Number of columns on the terminal
>  -delay The refresh delay(in seconds), default is 3 seconds
>  -help   Print usage; for help while the tool is running press 'h'
>  + Enter
>  -queuesComma separated list of queues to restrict applications
>  -rows  Number of rows on the terminal
>  -types Comma separated list of types to restrict applications,
>  case sensitive(though the display is lower case)
>  -users Comma separated list of users to restrict applications
> {noformat}
> Execute *for help while the tool is running press 'h'  + Enter* while top 
> tool is running
> Exception is thrown in console continuously
> {noformat}
> 15/10/07 14:59:28 ERROR cli.TopCLI: Could not fetch RM start time
> java.net.ConnectException: Connection refused
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
> at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:204)
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at java.net.Socket.connect(Socket.java:538)
> at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
> at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
> at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
> at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
> at sun.net.www.http.HttpClient.New(HttpClient.java:308)
> at sun.net.www.http.HttpClient.New(HttpClient.java:326)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1168)
> at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1104)
> at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:998)
> at 
> sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:932)
> at 
> org.apache.hadoop.yarn.client.cli.TopCLI.getRMStartTime(TopCLI.java:742)
> at org.apache.hadoop.yarn.client.cli.TopCLI.run(TopCLI.java:467)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> at org.apache.hadoop.yarn.client.cli.TopCLI.main(TopCLI.java:420)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4254) ApplicationAttempt stuck for ever due to UnknowHostexception

2015-10-13 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955144#comment-14955144
 ] 

Bibin A Chundatt commented on YARN-4254:


{quote}
 we could handle the exception while trying to create container token and then 
remove from newlyAllocatedContainers list. 
{quote}
But we should be removing it only after some time period, right? Immediate removal 
will skip the DNS retry logic, I think.

> ApplicationAttempt stuck for ever due to UnknowHostexception
> 
>
> Key: YARN-4254
> URL: https://issues.apache.org/jira/browse/YARN-4254
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4254.patch, Logs.txt, Test.patch
>
>
> Scenario
> ===
> 1. RM HA and 5 NMs available in cluster and are working fine 
> 2. Add one more NM to the same cluster but RM /etc/hosts not updated.
> 3. Submit application to the same cluster
> If Am get allocated to the newly added NM the *application attempt will get 
> stuck for ever*.User will not get to know why the same happened.
> Impact
> 1.RM logs gets overloaded with exception
> 2.Application gets stuck for ever.
> Handling suggestion YARN-261 allows for Fail application attempt .
> If we fail the same next attempt could get assigned to another NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4228) FileSystemRMStateStore use IOUtils on fs#close

2015-10-05 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-4228:
--

 Summary: FileSystemRMStateStore use IOUtils on fs#close
 Key: YARN-4228
 URL: https://issues.apache.org/jira/browse/YARN-4228
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor


NPE on {{FileSystemRMStateStore#closeWithRetries}} when active service 
initialization fails on rm start up

{noformat}
2015-10-05 19:56:38,626 INFO org.apache.hadoop.service.AbstractService: Service 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore failed in 
state STOPPED; cause: java.lang.NullPointerException
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore$13.run(FileSystemRMStateStore.java:721)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore$13.run(FileSystemRMStateStore.java:718)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore$FSAction.runWithRetries(FileSystemRMStateStore.java:734)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.closeWithRetries(FileSystemRMStateStore.java:718)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.closeInternal(FileSystemRMStateStore.java:169)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.serviceStop(RMStateStore.java:618)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
at org.apache.hadoop.service.AbstractService.close(AbstractService.java:250)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:609)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
at 
org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
at 
org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:965)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1195)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4228) FileSystemRMStateStore use IOUtils on fs#close

2015-10-06 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4228:
---
Attachment: 0001-YARN-4228.patch

Attaching patch for the same
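
For clarity, a minimal sketch of what the summary describes (not necessarily the 
exact attached patch): closing the {{FileSystem}} through the null-safe 
{{IOUtils#closeStream}} helper, so a fs that was never initialised (as in the NPE 
in the description) becomes a no-op.

{code}
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IOUtils;

public class SafeClose {
  private FileSystem fs;   // may still be null if active-service init failed

  void closeInternal() {
    // IOUtils.closeStream ignores a null argument and swallows the IOException
    // thrown on close, so no extra null check or retry wrapper is needed here.
    IOUtils.closeStream(fs);
  }
}
{code}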

> FileSystemRMStateStore use IOUtils on fs#close
> --
>
> Key: YARN-4228
> URL: https://issues.apache.org/jira/browse/YARN-4228
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: 0001-YARN-4228.patch
>
>
> NPE on {{FileSystemRMStateStore#closeWithRetries}} when active service 
> initialization fails on rm start up
> {noformat}
> 2015-10-05 19:56:38,626 INFO org.apache.hadoop.service.AbstractService: 
> Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore 
> failed in state STOPPED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore$13.run(FileSystemRMStateStore.java:721)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore$13.run(FileSystemRMStateStore.java:718)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore$FSAction.runWithRetries(FileSystemRMStateStore.java:734)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.closeWithRetries(FileSystemRMStateStore.java:718)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.closeInternal(FileSystemRMStateStore.java:169)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.serviceStop(RMStateStore.java:618)
> at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
> at 
> org.apache.hadoop.service.AbstractService.close(AbstractService.java:250)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:609)
> at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
> at 
> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
> at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:965)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1195)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4216) Container logs not shown for newly assigned containers after NM recovery

2015-10-05 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944524#comment-14944524
 ] 

Bibin A Chundatt commented on YARN-4216:


{quote}
 That is intentional. Decommission + nm restart doesn't make sense to me. 
Either we are decommissioning a node and don't expect it to return, or we are 
going to restart it and expect it to return shortly.
{quote}
For a *rolling upgrade* the same scenario can happen *( decommission (logs 
upload) --> upgrade --> start NM --> new container assignment --> on finish log 
upload )* and container log loss happens. Appending logs during aggregation could 
be one solution in this case, right?

> Container logs not shown for newly assigned containers  after NM  recovery
> --
>
> Key: YARN-4216
> URL: https://issues.apache.org/jira/browse/YARN-4216
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation, nodemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: NMLog, ScreenshotFolder.png, yarn-site.xml
>
>
> Steps to reproduce
> # Start 2 nodemanagers  with NM recovery enabled
> # Submit pi job with 20 maps 
> # Once 5 maps gets completed in NM 1 stop NM (yarn daemon stop nodemanager)
> (Logs of all completed container gets aggregated to HDFS)
> # Now start  the NM1 again and wait for job completion
> *The newly assigned container logs on NM1 are not shown*
> *hdfs log dir state*
> # When logs are aggregated to HDFS during stop its with NAME (localhost_38153)
> # On log aggregation after starting NM the newly assigned container logs gets 
> uploaded with name  (localhost_38153.tmp) 
> In the History server the logs are not shown for the new task attempts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4228) FileSystemRMStateStore use IOUtils.close instead of fs#close

2015-10-06 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4228:
---
Summary: FileSystemRMStateStore use IOUtils.close instead of fs#close  
(was: FileSystemRMStateStore use IOUtils on fs#close)

> FileSystemRMStateStore use IOUtils.close instead of fs#close
> 
>
> Key: YARN-4228
> URL: https://issues.apache.org/jira/browse/YARN-4228
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: 0001-YARN-4228.patch
>
>
> NPE on {{FileSystemRMStateStore#closeWithRetries}} when active service 
> initialization fails on rm start up
> {noformat}
> 2015-10-05 19:56:38,626 INFO org.apache.hadoop.service.AbstractService: 
> Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore 
> failed in state STOPPED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore$13.run(FileSystemRMStateStore.java:721)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore$13.run(FileSystemRMStateStore.java:718)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore$FSAction.runWithRetries(FileSystemRMStateStore.java:734)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.closeWithRetries(FileSystemRMStateStore.java:718)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.closeInternal(FileSystemRMStateStore.java:169)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.serviceStop(RMStateStore.java:618)
> at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
> at 
> org.apache.hadoop.service.AbstractService.close(AbstractService.java:250)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:609)
> at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
> at 
> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
> at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:965)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1195)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4228) FileSystemRMStateStore use IOUtils#close instead of fs#close

2015-10-06 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4228:
---
Summary: FileSystemRMStateStore use IOUtils#close instead of fs#close  
(was: FileSystemRMStateStore use IOUtils.close instead of fs#close)

> FileSystemRMStateStore use IOUtils#close instead of fs#close
> 
>
> Key: YARN-4228
> URL: https://issues.apache.org/jira/browse/YARN-4228
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: 0001-YARN-4228.patch
>
>
> NPE on {{FileSystemRMStateStore#closeWithRetries}} when active service 
> initialization fails on rm start up
> {noformat}
> 2015-10-05 19:56:38,626 INFO org.apache.hadoop.service.AbstractService: 
> Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore 
> failed in state STOPPED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore$13.run(FileSystemRMStateStore.java:721)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore$13.run(FileSystemRMStateStore.java:718)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore$FSAction.runWithRetries(FileSystemRMStateStore.java:734)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.closeWithRetries(FileSystemRMStateStore.java:718)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.closeInternal(FileSystemRMStateStore.java:169)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.serviceStop(RMStateStore.java:618)
> at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
> at 
> org.apache.hadoop.service.AbstractService.close(AbstractService.java:250)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:609)
> at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
> at 
> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
> at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:965)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:256)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1195)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-10-08 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949747#comment-14949747
 ] 

Bibin A Chundatt commented on YARN-4140:


Hi [~leftnoteasy]

Thanks for looking into it. The release audit warning is not related to the 
current patch:

{noformat}
/home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/fuse-dfs/util/tree.h
Lines that start with ? in the release audit  report indicate files that do 
not have an Apache license header.
{noformat}

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch, 
> 0009-YARN-4140.patch, 0010-YARN-4140.patch, 0011-YARN-4140.patch, 
> 0012-YARN-4140.patch, 0013-YARN-4140.patch, 0014-YARN-4140.patch
>
>
> Trying to run application on Nodelabel partition I  found that the 
> application execution time is delayed by 5 – 10 min for 500 containers . 
> Total 3 machines 2 machines were in same partition and app submitted to same.
> After enabling debug was able to find the below
> # From AM the container ask is for OFF-SWITCH
> # RM allocating all containers to NODE_LOCAL as shown in logs below.
> # So since I was having about 500 containers time taken was about – 6 minutes 
> to allocate 1st map after AM allocation.
> # Tested with about 1K maps using PI job took 17 minutes to allocate  next 
> container after AM allocation
> Once 500 container allocation on NODE_LOCAL is done the next container 
> allocation is done on OFF_SWITCH
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> 

[jira] [Commented] (YARN-4155) TestLogAggregationService.testLogAggregationServiceWithInterval failing

2015-10-09 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950006#comment-14950006
 ] 

Bibin A Chundatt commented on YARN-4155:


Hi [~rohithsharma]/[~ste...@apache.org]

Could you please review the attached patch.

> TestLogAggregationService.testLogAggregationServiceWithInterval failing
> ---
>
> Key: YARN-4155
> URL: https://issues.apache.org/jira/browse/YARN-4155
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0
> Environment: Jenkins
>Reporter: Steve Loughran
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4155.patch, 0001-YARN-4155.patch, 
> 0003-YARN-4155.patch
>
>
> Test failing on Jenkins: 
> {{TestLogAggregationService.testLogAggregationServiceWithInterval}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4140) RM container allocation delayed incase of app submitted to Nodelabel partition

2015-10-09 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951554#comment-14951554
 ] 

Bibin A Chundatt commented on YARN-4140:


Thank you [~leftnoteasy] for the review and commit

> RM container allocation delayed incase of app submitted to Nodelabel partition
> --
>
> Key: YARN-4140
> URL: https://issues.apache.org/jira/browse/YARN-4140
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Fix For: 2.8.0
>
> Attachments: 0001-YARN-4140.patch, 0002-YARN-4140.patch, 
> 0003-YARN-4140.patch, 0004-YARN-4140.patch, 0005-YARN-4140.patch, 
> 0006-YARN-4140.patch, 0007-YARN-4140.patch, 0008-YARN-4140.patch, 
> 0009-YARN-4140.patch, 0010-YARN-4140.patch, 0011-YARN-4140.patch, 
> 0012-YARN-4140.patch, 0013-YARN-4140.patch, 0014-YARN-4140.patch
>
>
> Trying to run application on Nodelabel partition I  found that the 
> application execution time is delayed by 5 – 10 min for 500 containers . 
> Total 3 machines 2 machines were in same partition and app submitted to same.
> After enabling debug was able to find the below
> # From AM the container ask is for OFF-SWITCH
> # RM allocating all containers to NODE_LOCAL as shown in logs below.
> # So since I was having about 500 containers time taken was about – 6 minutes 
> to allocate 1st map after AM allocation.
> # Tested with about 1K maps using PI job took 17 minutes to allocate  next 
> container after AM allocation
> Once 500 container allocation on NODE_LOCAL is done the next container 
> allocation is done on OFF_SWITCH
> {code}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> /default-rack, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: *, Relax 
> Locality: true, Node Label Expression: 3}
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-143, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  showRequests: application=application_1441791998224_0001 request={Priority: 
> 20, Capability: , # Containers: 500, Location: 
> host-10-19-92-117, Relax Locality: true, Node Label Expression: }
> 2015-09-09 15:21:58,954 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
>  
> {code}
> 2015-09-09 14:35:45,467 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:45,831 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,469 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> 2015-09-09 14:35:46,832 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Assigned to queue: root.b.b1 stats: b1: capacity=1.0, absoluteCapacity=0.5, 
> usedResources=, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=1 -->  vCores:0>, NODE_LOCAL
> {code}
> {code}
> 

[jira] [Updated] (YARN-4254) ApplicationAttempt stuck for ever due to UnknowHostexception

2015-10-12 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4254:
---
Attachment: Logs.txt

> ApplicationAttempt stuck for ever due to UnknowHostexception
> 
>
> Key: YARN-4254
> URL: https://issues.apache.org/jira/browse/YARN-4254
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4254.patch, Logs.txt
>
>
> Scenario
> ===
> 1. RM HA and 5 NMs available in cluster and are working fine 
> 2. Add one more NM to the same cluster but RM /etc/hosts not updated.
> 3. Submit application to the same cluster
> If Am get allocated to the newly added NM the *application attempt will get 
> stuck for ever*.User will not get to know why the same happened.
> Impact
> 1.RM logs gets overloaded with exception
> 2.Application gets stuck for ever.
> Handling suggestion YARN-261 allows for Fail application attempt .
> If we fail the same next attempt could get assigned to another NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4254) ApplicationAttempt stuck for ever due to UnknowHostexception

2015-10-12 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-4254:
---
Attachment: Test.patch

> ApplicationAttempt stuck for ever due to UnknowHostexception
> 
>
> Key: YARN-4254
> URL: https://issues.apache.org/jira/browse/YARN-4254
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4254.patch, Logs.txt, Test.patch
>
>
> Scenario
> ===
> 1. RM HA and 5 NMs available in cluster and are working fine 
> 2. Add one more NM to the same cluster but RM /etc/hosts not updated.
> 3. Submit application to the same cluster
> If Am get allocated to the newly added NM the *application attempt will get 
> stuck for ever*.User will not get to know why the same happened.
> Impact
> 1.RM logs gets overloaded with exception
> 2.Application gets stuck for ever.
> Handling suggestion YARN-261 allows for Fail application attempt .
> If we fail the same next attempt could get assigned to another NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4254) ApplicationAttempt stuck for ever due to UnknowHostexception

2015-10-12 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954273#comment-14954273
 ] 

Bibin A Chundatt commented on YARN-4254:


Hi [~sunilg]
Attached a test patch to reproduce the issue. The AM attempt remains stuck in the 
scheduled state.

> ApplicationAttempt stuck for ever due to UnknowHostexception
> 
>
> Key: YARN-4254
> URL: https://issues.apache.org/jira/browse/YARN-4254
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4254.patch, Logs.txt, Test.patch
>
>
> Scenario
> ===
> 1. RM HA and 5 NMs available in cluster and are working fine 
> 2. Add one more NM to the same cluster but RM /etc/hosts not updated.
> 3. Submit application to the same cluster
> If Am get allocated to the newly added NM the *application attempt will get 
> stuck for ever*.User will not get to know why the same happened.
> Impact
> 1.RM logs gets overloaded with exception
> 2.Application gets stuck for ever.
> Handling suggestion YARN-261 allows for Fail application attempt .
> If we fail the same next attempt could get assigned to another NM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

