[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585709#comment-14585709 ] Bibin A Chundatt commented on YARN-3789:

[~devaraj.k] Thanks for the review. It seems you reviewed based on 0003-YARN-3789. In {{activateApplications()}} the check is against the new AM limit that would apply if the application were activated, which is why I updated the logs like:
{code}
LOG.info("Not activating " + applicationId + ". If application activated, usedAMResource "
    + amIfStarted + " will exceed amLimit " + amLimit);
{code}
If 0004-YARN-3789 is still confusing I will surely update the log message. I will also handle the unused imports as part of the next patch after your reply.

Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
--
Key: YARN-3789
URL: https://issues.apache.org/jira/browse/YARN-3789
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor
Attachments: 0001-YARN-3789.patch, 0002-YARN-3789.patch, 0003-YARN-3789.patch, 0004-YARN-3789.patch

Duplicate logging from the resource manager during the AM limit check for each application:
{code}
2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit
2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit
2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit
2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit
2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit
2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit
{code}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
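For context, a minimal sketch of where this single log line would fire during activation. This is not the actual LeafQueue code; the names ({{pendingApplications}}, {{amLimit}}, {{queueUsage}}, {{activate}}) are assumptions based on the discussion:
{code}
// Sketch: log once per application whose AM, if started, would exceed the limit.
private synchronized void activateApplications() {
  for (FiCaSchedulerApp application : pendingApplications) {
    Resource amIfStarted =
        Resources.add(application.getAMResource(), queueUsage.getAMUsed());
    if (!Resources.fitsIn(amIfStarted, amLimit)) {
      LOG.info("Not activating " + application.getApplicationId()
          + ". If application activated, usedAMResource " + amIfStarted
          + " will exceed amLimit " + amLimit);
      continue; // skip activation; the message is logged exactly once per app
    }
    activate(application); // hypothetical helper
  }
}
{code}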
[jira] [Updated] (YARN-1042) add ability to specify affinity/anti-affinity in container requests
[ https://issues.apache.org/jira/browse/YARN-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-1042:

Attachment: YARN-1042.002.patch

add ability to specify affinity/anti-affinity in container requests
--
Key: YARN-1042
URL: https://issues.apache.org/jira/browse/YARN-1042
Project: Hadoop YARN
Issue Type: Sub-task
Components: resourcemanager
Affects Versions: 3.0.0
Reporter: Steve Loughran
Assignee: Arun C Murthy
Attachments: YARN-1042-demo.patch, YARN-1042-design-doc.pdf, YARN-1042.001.patch, YARN-1042.002.patch

Container requests to the AM should be able to request anti-affinity to ensure that things like Region Servers don't come up on the same failure zones. Similarly, you may want to specify affinity to the same host or rack without specifying which specific host/rack. Example: bringing up a small Giraph cluster in a large YARN cluster would benefit from having the processes in the same rack purely for bandwidth reasons.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585660#comment-14585660 ] Bibin A Chundatt commented on YARN-3804:

Can we check for AccessControlException in {{ActiveStandbyElector#becomeActive()}} and send an event to shut down?

Both RM are on standBy state when kerberos user not in yarn.admin.acl
--
Key: YARN-3804
URL: https://issues.apache.org/jira/browse/YARN-3804
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Environment: Suse 11 Sp3, 2 RM, Secure
Reporter: Bibin A Chundatt

Steps to reproduce
1. Configure the cluster in secure mode
2. On the RM, configure yarn.admin.acl=dsperf
3. Configure yarn.resourcemanager.principal=yarn
4. Start both RMs

Both RMs will be in standby forever.
{code}
2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized user PERMISSIONS=
2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
	at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
	at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
	at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
	at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
	... 4 more
Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls'
	at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230)
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465)
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295)
	... 5 more
Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls'
	at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182)
	at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148)
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223)
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228)
	... 7 more
{code}

*Analysis*
On each attempt to switch to Active, refreshAdminAcls is called, but the ACL permission is not available for the user. The RM retries the switch to Active infinitely, and {{ActiveStandbyElector#becomeActive()}} always returns false.

*Expected*
RM should get a shutdown event after a few retries, or even at the first attempt, since the user it retries refreshAdminAcls as can never be updated at runtime.
*States from commands*
./yarn rmadmin -getServiceState rm2 : *standby*
./yarn rmadmin -getServiceState rm1 : *standby*
./yarn rmadmin -checkHealth rm1 : *echo $? = 0*
./yarn rmadmin -checkHealth rm2 : *echo $? = 0*
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
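A minimal sketch of the shutdown idea suggested in the comment above, under stated assumptions: {{hasCause}} is a hypothetical helper, and the fatal-event plumbing is modeled loosely on the RM dispatcher rather than copied from the actual Hadoop code:
{code}
// Sketch: treat AccessControlException during transition-to-active as fatal
// instead of retrying forever, since the admin ACL cannot change at runtime.
try {
  rmAdminService.transitionToActive(requestInfo);
} catch (Exception e) {
  if (hasCause(e, AccessControlException.class)) { // hasCause is hypothetical
    // Deliver a fatal event so the RM shuts down rather than staying standby.
    rmContext.getDispatcher().getEventHandler().handle(
        new RMFatalEvent(RMFatalEventType.TRANSITION_TO_ACTIVE_FAILED, e));
    return;
  }
  throw new ServiceFailedException("RM could not transition to Active", e);
}
{code}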
[jira] [Commented] (YARN-3792) Test case failures in TestDistributedShell and some issue fixes related to ATSV2
[ https://issues.apache.org/jira/browse/YARN-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585676#comment-14585676 ] Naganarasimha G R commented on YARN-3792:

* The reported test-case failure is not due to this patch; YARN-3790 has already been raised to address it.
* The whitespace warning is not caused by this patch.
* The findbugs alert is incorrect; the report has no issues.
The patch is ready for review!

Test case failures in TestDistributedShell and some issue fixes related to ATSV2
--
Key: YARN-3792
URL: https://issues.apache.org/jira/browse/YARN-3792
Project: Hadoop YARN
Issue Type: Sub-task
Components: timelineserver
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R
Attachments: YARN-3792-YARN-2928.001.patch

# Encountered [testcase failures|https://builds.apache.org/job/PreCommit-YARN-Build/8233/testReport/] which were happening even without the patch modifications in YARN-3044: TestDistributedShell.testDSShellWithoutDomainV2CustomizedFlow, TestDistributedShell.testDSShellWithoutDomainV2DefaultFlow, TestDistributedShellWithNodeLabels.testDSShellWithNodeLabelExpression
# Remove unused {{enableATSV1}} in TestDistributedShell.
# Container metrics need to be published only for the v2 test cases of TestDistributedShell.
# A NullPointerException was thrown in TimelineClientImpl.constructResURI when the aux service was not configured and {{TimelineClient.putObjects}} was invoked.
# Race condition between publishing the application events and the test case's verification of the RM's ApplicationFinished timeline events.
# Application tags are converted to lowercase in ApplicationSubmissionContextPBImpl, hence RMTimelineCollector was not able to detect the custom flow details of the app.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
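A hedged illustration of the NullPointerException fix in item 4. This is hypothetical code, not the actual TimelineClientImpl internals:
{code}
// Hypothetical guard: fail fast with a clear error when the timeline service
// address was never configured, instead of letting URI construction throw NPE.
if (timelineServiceAddress == null) {
  throw new YarnException(
      "Timeline service address is not configured; cannot put objects");
}
URI resURI = URI.create(scheme + timelineServiceAddress + path);
{code}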
[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585664#comment-14585664 ] Rohith commented on YARN-3789:

+1 (non-binding)
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
Bibin A Chundatt created YARN-3804:

Summary: Both RM are on standBy state when kerberos user not in yarn.admin.acl
Key: YARN-3804
URL: https://issues.apache.org/jira/browse/YARN-3804
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Environment: Suse 11 Sp3, 2 RM, Secure
Reporter: Bibin A Chundatt
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585735#comment-14585735 ] Devaraj K commented on YARN-3789:

I had looked at 0003-YARN-3789.patch previously; sorry for that. I think the latest patch still has the same issue with the message that I mentioned.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585685#comment-14585685 ] Devaraj K commented on YARN-3789:

Thanks [~bibinchundatt] for the patch.
1.
{code:xml}
+      LOG.info("Not activating " + applicationId
+          + " if application activated usedAMResource " + amIfStarted
+          + " exceeds amLimit " + amLimit);
{code}
{code:xml}
+      LOG.info("Not activating " + applicationId + " for user " + user
+          + " if application activated usedUserAMResource "
+          + userAmIfStarted + " exceeds userAmLimit " + userAMLimit);
{code}
These logs are still confusing (at least to me); can you make them something like this, or anything better:
{code:xml}
Not activating application <applicationId> as amIfStarted: <amIfStarted> exceeds amLimit: <amLimit>.
{code}
{code:xml}
Not activating application <applicationId> for user: <user> as amIfStarted: <amIfStarted> exceeds userAmLimit: <userAMLimit>.
{code}
2. Can you also remove the unused imports in the same file LeafQueue.java as part of this patch?
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
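For concreteness, the suggested wording rendered as log statements; a sketch assuming the variable names used elsewhere in the thread:
{code}
LOG.info("Not activating application " + applicationId + " as amIfStarted: "
    + amIfStarted + " exceeds amLimit: " + amLimit + ".");
LOG.info("Not activating application " + applicationId + " for user: " + user
    + " as amIfStarted: " + userAmIfStarted + " exceeds userAmLimit: "
    + userAMLimit + ".");
{code}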
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585855#comment-14585855 ] Tsuyoshi Ozawa commented on YARN-3798:

Thanks [~zxu] for your explanation. I also traced application_1433764310492_7152 in the log, but the application was not removed from RMStateStore. It means application_1433764310492_7152 and appattempt_1433764310492_7152_* are not visible without removing the znodes. [~bibinchundatt] what ZK version are you using? BTW, I found an improvement point: when the error code is CONNECTIONLOSS or OPERATIONTIMEOUT, ZKRMStateStore closes the current connection and tries to create a new connection before retrying. This shouldn't be done in general; we should just wait for a SyncConnected event until a timeout occurs. Otherwise the current code looks good to me.

RM shutdown with NoNode exception while updating appAttempt on zk
--
Key: YARN-3798
URL: https://issues.apache.org/jira/browse/YARN-3798
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Environment: Suse 11 Sp3
Reporter: Bibin A Chundatt
Assignee: Varun Saxena
Attachments: RM.log

RM going down with NoNode exception during create of znode for appattempt.
*Please find the exception logs*
{code}
2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected
2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored
2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
	at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405)
	at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260)
	at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
	at java.lang.Thread.run(Thread.java:745)
2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up!
2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
{code}
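A minimal sketch of the suggested behavior, assuming direct access to the ZooKeeper handle; this is illustrative, not the actual ZKRMStateStore code:
{code}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

// On CONNECTIONLOSS/OPERATIONTIMEOUT, wait for the existing session to emit
// SyncConnected instead of closing it and opening a brand-new connection.
final CountDownLatch reconnected = new CountDownLatch(1);
Watcher watcher = new Watcher() {
  @Override
  public void process(WatchedEvent event) {
    if (event.getState() == Event.KeeperState.SyncConnected) {
      reconnected.countDown();
    }
  }
};
// ... register the watcher on the existing ZooKeeper client, then:
if (!reconnected.await(zkSessionTimeoutMs, TimeUnit.MILLISECONDS)) {
  // Only after the session timeout elapses treat the session as dead
  // and create a new connection.
}
{code}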
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585897#comment-14585897 ] Tsuyoshi Ozawa commented on YARN-3798:

5. [ZKRMStateStore] ZKRMStateStore uses the new connection while the old connection is still alive. An old view can be seen from the new connection since it's another client; in this case, there is no guarantee.
6. When the old session is expired, all updates by the old session can be seen because of virtual synchrony.
[jira] [Updated] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3789:

Attachment: 0005-YARN-3789.patch

[~devaraj.k] I have removed the unused {{import org.apache.hadoop.yarn.server.resourcemanager.RMContext;}} and also updated the log as per your comments.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585891#comment-14585891 ] Tsuyoshi Ozawa commented on YARN-3798:

{quote} while current code looks good to me. {quote}
I found one corner case where the current code doesn't work correctly:
1. [ZKRMStateStore] Receiving CONNECTIONLOSS or OPERATIONTIMEOUT in ZKRMStateStore#runWithRetries.
2. [ZKRMStateStore] Failing to zkClient.close() in ZKRMStateStore#createConnection, but the IOException is ignored.
3. [ZK Server] Failing to accept the close() request. The previous session is still alive.
4. [ZKRMStateStore] Creating a new connection in ZKRMStateStore#createConnection.
In this case, the correct fix is to wait for SESSIONEXPIRED or SESSIONMOVED.
[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586174#comment-14586174 ] Bibin A Chundatt commented on YARN-3789:

Looked at the pre-commit build results: the checkstyle issue is pre-existing, and the test failure is due to YARN-3790.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586161#comment-14586161 ] Hadoop QA commented on YARN-3789:

| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 15m 59s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 7m 33s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 32s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 0m 46s | The applied patch generated 1 new checkstyle issues (total was 153, now 151). |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 37s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 25s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests | 50m 52s | Tests failed in hadoop-yarn-server-resourcemanager. |
| | | 88m 41s | |
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12739612/0005-YARN-3789.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 4c5da9b |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8251/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8251/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8251/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8251/console |

This message was automatically generated.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586139#comment-14586139 ] MENG DING commented on YARN-1197:

Had a very good discussion with [~leftnoteasy] at the Hadoop Summit. We all agreed that, due to the complexity of the current design, it is worthwhile to revisit the idea of increasing and decreasing container size both through the Resource Manager; that would at least eliminate the need for the token-expiration logic, and also eliminate the need for the AM-NM protocol and APIs. I am currently working on the new design, and will post it for review when it is ready.

Support changing resources of an allocated container
--
Key: YARN-1197
URL: https://issues.apache.org/jira/browse/YARN-1197
Project: Hadoop YARN
Issue Type: Task
Components: api, nodemanager, resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Wangda Tan
Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf

The current YARN resource management logic assumes the resource allocated to a container is fixed during its lifetime. When users want to change the resources of an allocated container, the only way is to release it and allocate a new container with the expected size. Allowing run-time changes to the resources of an allocated container will give us better control of resource usage on the application side.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3792) Test case failures in TestDistributedShell and some issue fixes related to ATSV2
[ https://issues.apache.org/jira/browse/YARN-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586128#comment-14586128 ] Junping Du commented on YARN-3792:

Thanks [~Naganarasimha] for delivering the patch to fix it! Will review your patch soon.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1042) add ability to specify affinity/anti-affinity in container requests
[ https://issues.apache.org/jira/browse/YARN-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585494#comment-14585494 ] Weiwei Yang commented on YARN-1042:

Hi Steve. From your comments, I think you want the YARN API to support requesting containers with given rules per request. I guess you want the API to look like this: in AMRMClient.java, ContainerRequest supports the following arguments:
* Resource capability
* String[] nodes
* String[] racks
* Priority priority
* boolean relaxLocality
* String nodeLabelExpression
* *ContainerAllocateRule containerAllocateRule*

The last one (in bold text) is the new argument for you to specify a particular rule (we will discuss rule details later). The problem here is that the RM needs to know which application (or role, in Slider's context) the rule applies to; only then can the RM assign containers obeying that rule when dealing with requests coming from the same application. However, if you only specify the rule per container request, how can the RM know which containers need to be considered when it applies the rule? Let me give an example to explain:
{code}
ContainerRequest containerReq1 = new ContainerRequest(capability1, nodes, racks, priority, affinityRequiredRule);
amClient.addContainerRequest(containerReq1);
AllocateResponse allocResponse = amClient.allocate(0.1f);
{code}
The allocate request the AM sends to the RM only tells the RM that these container requests need to use affinityRequiredRule, but the RM does not know which containers this request is affine with, so the RM cannot apply the rule during allocation. This is the reason why I propose to register the mapping
{code}
application - allocation-rule
{code}
when the client submits the application, and keep it in the RM context, so the RM can apply the rule when a request comes from the AM.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
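To make the proposal concrete, here is a sketch of the registration idea. {{setAllocationRule}} and {{ContainerAllocateRule}} are hypothetical APIs from this proposal, not existing YARN classes:
{code}
// Hypothetical API sketch: register the application -> allocation-rule mapping
// at submission time, so the RM can enforce it for every later AM request.
YarnClientApplication app = yarnClient.createApplication();
ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
appContext.setAllocationRule(ContainerAllocateRule.antiAffinity()); // proposed API
yarnClient.submitApplication(appContext);
{code}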
[jira] [Commented] (YARN-3711) Documentation of ResourceManager HA should explain configurations about listen addresses
[ https://issues.apache.org/jira/browse/YARN-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587389#comment-14587389 ] Masatake Iwasaki commented on YARN-3711:

Thanks, [~ozawa]!

Documentation of ResourceManager HA should explain configurations about listen addresses
--
Key: YARN-3711
URL: https://issues.apache.org/jira/browse/YARN-3711
Project: Hadoop YARN
Issue Type: Sub-task
Components: documentation
Reporter: Masatake Iwasaki
Assignee: Masatake Iwasaki
Priority: Minor
Fix For: 2.7.1
Attachments: YARN-3711.002.patch, YARN-3711.003.patch

There should be an explanation about the webapp address in addition to the RPC address. The AM proxy filter needs an explicit definition of {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} to get proper addresses in RM-HA mode.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3806) Proposal of Generic Scheduling Framework for YARN
Wei Shao created YARN-3806:

Summary: Proposal of Generic Scheduling Framework for YARN
Key: YARN-3806
URL: https://issues.apache.org/jira/browse/YARN-3806
Project: Hadoop YARN
Issue Type: Improvement
Components: scheduler
Reporter: Wei Shao

Currently, a typical YARN cluster runs many different kinds of applications: production applications, ad hoc user applications, long running services and so on. Different YARN scheduling policies may be suitable for different applications. For example, capacity scheduling can manage production applications well since an application can get a guaranteed resource share, and fair scheduling can manage ad hoc user applications well since it can enforce fairness among users. However, the current YARN scheduling framework doesn't have a mechanism for multiple scheduling policies to work hierarchically in one cluster. YARN-3306 talked about many issues of today's YARN scheduling framework and proposed a per-queue policy-driven framework. In detail, it supported different scheduling policies for leaf queues. However, support of different scheduling policies for upper-level queues has not been seriously considered yet. A generic scheduling framework is proposed here to address these limitations. It supports different policies for any queue consistently. The proposal tries to solve many other issues in the current YARN scheduling framework as well.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3808) Proposal of Time Based Fair Scheduling for YARN
[ https://issues.apache.org/jira/browse/YARN-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Shao updated YARN-3808:

Attachment: ProposalOfTimeBasedFairSchedulingForYARN-V1.0.pdf

Proposal of Time Based Fair Scheduling for YARN
--
Key: YARN-3808
URL: https://issues.apache.org/jira/browse/YARN-3808
Project: Hadoop YARN
Issue Type: Improvement
Components: fairscheduler, scheduler
Reporter: Wei Shao
Attachments: ProposalOfTimeBasedFairSchedulingForYARN-V1.0.pdf

This proposal discusses the issues with the YARN fair scheduling policy and tries to solve them through YARN-3806 and a new scheduling policy called time-based fair scheduling. The time-based fair scheduling policy is proposed to enforce time-based fairness among users. For example, if two users share the cluster weekly, each user's fair share is half of the cluster per week. In a particular week, if the first user has used the whole cluster for the first half of the week, then in the second half of the week the second user will always have priority to use cluster resources, since the first user has already used up its fair share of the cluster.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
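As a hedged illustration of the arithmetic (hypothetical code, not taken from the proposal): with a one-week window and two users, priority goes to whichever user has consumed less of its time-based fair share:
{code}
// Sketch: time-based fairness over a weekly window (all names hypothetical).
long windowSeconds = 7L * 24 * 60 * 60;                       // one week
double fairShare = clusterResourceSeconds(windowSeconds) / 2; // half each
double usedA = usageInWindow("userA"); // resource-seconds consumed so far
double usedB = usageInWindow("userB");
// If userA already consumed its half-cluster-week, userB wins until the
// window rolls over, matching the example in the description.
String prioritized = (fairShare - usedA) >= (fairShare - usedB) ? "userA" : "userB";
{code}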
[jira] [Created] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
Jun Gong created YARN-3809:

Summary: Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
Key: YARN-3809
URL: https://issues.apache.org/jira/browse/YARN-3809
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Reporter: Jun Gong
Assignee: Jun Gong

ApplicationMasterLauncher creates a thread pool of size 10 to handle AMLauncherEventType events (LAUNCH and CLEANUP). In our cluster, there were many NMs with 10+ AMs running on them, and one NM shut down for some reason. After the RM marked the NM as LOST, it cleaned up the AMs running on it, so ApplicationMasterLauncher had to handle these 10+ CLEANUP events. The thread pool filled up, and all threads hung in {{containerMgrProxy.stopContainers(stopRequest)}} because the NM was down; the default RPC timeout is 15 minutes. That means for 15 minutes ApplicationMasterLauncher could not handle new events such as LAUNCH, so new attempts failed to launch because of the timeout.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
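One possible mitigation, sketched under assumptions (the configuration key and default shown are assumptions here; the fix actually chosen in this JIRA may differ): size the launcher pool from configuration so a burst of CLEANUP events cannot block LAUNCH handling for the full RPC timeout.
{code}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch: configurable pool size instead of a hard-coded 10.
int threadCount = conf.getInt(
    "yarn.resourcemanager.amlauncher.thread-count", 50); // assumed key/default
ThreadPoolExecutor launcherPool = new ThreadPoolExecutor(
    threadCount, threadCount, 1, TimeUnit.HOURS,
    new LinkedBlockingQueue<Runnable>());
{code}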
[jira] [Updated] (YARN-3706) Generalize native HBase writer for additional tables
[ https://issues.apache.org/jira/browse/YARN-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joep Rottinghuis updated YARN-3706: --- Attachment: YARN-3706-YARN-2928.013.patch Thanks for the review [~sjlee0], all your comments are addressed in the latest patch (YARN-3706-YARN-2928.013.patch). I have not yet tested this on a (semi) real cluster aside from the unit test. I plan to do this in the next couple of days, preferably on the real test cluster we used for previous benchmarking and preferably against the same input set in order to confirm execution times are comparable (or better). Generalize native HBase writer for additional tables Key: YARN-3706 URL: https://issues.apache.org/jira/browse/YARN-3706 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Joep Rottinghuis Assignee: Joep Rottinghuis Priority: Minor Attachments: YARN-3706-YARN-2928.001.patch, YARN-3706-YARN-2928.010.patch, YARN-3706-YARN-2928.011.patch, YARN-3706-YARN-2928.012.patch, YARN-3706-YARN-2928.013.patch, YARN-3726-YARN-2928.002.patch, YARN-3726-YARN-2928.003.patch, YARN-3726-YARN-2928.004.patch, YARN-3726-YARN-2928.005.patch, YARN-3726-YARN-2928.006.patch, YARN-3726-YARN-2928.007.patch, YARN-3726-YARN-2928.008.patch, YARN-3726-YARN-2928.009.patch When reviewing YARN-3411 we noticed that we could change the class hierarchy a little in order to accommodate additional tables easily. In order to get ready for benchmark testing we left the original layout in place, as performance would not be impacted by the code hierarchy. Here is a separate jira to address the hierarchy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3801) [JDK-8][YARN-2928] Exclude jdk.tools from hbase-client and hbase-testing-util
[ https://issues.apache.org/jira/browse/YARN-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587233#comment-14587233 ] Sangjin Lee commented on YARN-3801:

+1 from me. Although it's a bit early for JDK 8, we can have this proactively on our feature branch. We can remove the exclusion rules later if and when we decide to move to an HBase version that doesn't bring in conflicting versions. Folks, let me know if you have any objection, or I'll merge it to our feature branch soon.

[JDK-8][YARN-2928] Exclude jdk.tools from hbase-client and hbase-testing-util
--
Key: YARN-3801
URL: https://issues.apache.org/jira/browse/YARN-3801
Project: Hadoop YARN
Issue Type: Sub-task
Components: timelineserver
Reporter: Tsuyoshi Ozawa
Assignee: Tsuyoshi Ozawa
Attachments: YARN-3801.001.patch

timelineservice depends on hbase-client and hbase-testing-util, and they depend on jdk.tools:1.7. This causes Hadoop compilation to fail with JDK 8.
{quote}
[WARNING] Dependency convergence error for jdk.tools:jdk.tools:1.8 paths to dependency are:
+-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT
  +-org.apache.hadoop:hadoop-annotations:3.0.0-SNAPSHOT
    +-jdk.tools:jdk.tools:1.8
and
+-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT
  +-org.apache.hbase:hbase-client:1.0.1
    +-org.apache.hbase:hbase-annotations:1.0.1
      +-jdk.tools:jdk.tools:1.7
and
+-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT
  +-org.apache.hbase:hbase-testing-util:1.0.1
    +-org.apache.hbase:hbase-annotations:1.0.1
      +-jdk.tools:jdk.tools:1.7
[WARNING] Rule 0: org.apache.maven.plugins.enforcer.DependencyConvergence failed with message: Failed while enforcing releasability. The error(s) are [ the same dependency convergence errors as above ]
{quote}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
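The exclusion itself is a standard Maven pattern; a sketch of what the patch likely adds to the timelineservice pom (coordinates taken from the error output above, exact form of the patch assumed):
{code:xml}
<!-- Exclude jdk.tools pulled in transitively via hbase-annotations so the
     enforcer's dependency-convergence check passes under JDK 8. -->
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-client</artifactId>
  <exclusions>
    <exclusion>
      <groupId>jdk.tools</groupId>
      <artifactId>jdk.tools</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{code}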
[jira] [Commented] (YARN-3711) Documentation of ResourceManager HA should explain about webapp address configuration
[ https://issues.apache.org/jira/browse/YARN-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587262#comment-14587262 ] Tsuyoshi Ozawa commented on YARN-3711:

+1, committing this shortly.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
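For reference, the properties the documentation change explains, shown with example values (host names here are placeholders):
{code:xml}
<!-- Explicit per-RM webapp addresses needed by the AM proxy filter in HA mode. -->
<property>
  <name>yarn.resourcemanager.webapp.address.rm1</name>
  <value>rm1.example.com:8088</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address.rm2</name>
  <value>rm2.example.com:8088</value>
</property>
{code}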
[jira] [Updated] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3798: - Affects Version/s: 2.7.0 RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Attachments: RM.log RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-06-09 10:09:44,887 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at
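Background for readers of the trace above: a NoNodeException from a ZooKeeper multi() create typically means a parent znode of the path being created is missing. Below is a self-contained sketch of the usual defensive pattern, using only the stock ZooKeeper client API; it is illustrative only and is not the ZKRMStateStore code or the eventual fix:
{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public final class ZkPaths {
  private ZkPaths() {}

  /** Create each missing ancestor of 'path' so a later create cannot fail with NoNode. */
  public static void ensureParents(ZooKeeper zk, String path)
      throws KeeperException, InterruptedException {
    int pos = 1;
    while ((pos = path.indexOf('/', pos)) != -1) {
      String ancestor = path.substring(0, pos);
      if (zk.exists(ancestor, false) == null) {
        try {
          zk.create(ancestor, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
              CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException ignored) {
          // Raced with another creator; that is fine.
        }
      }
      pos++;
    }
  }
}
{code}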
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587275#comment-14587275 ] Bibin A Chundatt commented on YARN-3804: [~vinodkv] {quote}Allow the daemon user to do the refresh irrespective of what admin configures{quote} sounds better to me. Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical Steps to reproduce 1. Configure the cluster in secure mode 2. On the RM, configure yarn.admin.acl=dsperf 3. Configure yarn.resourcemanager.principal=yarn 4. Start both RMs Both RMs will be in Standby forever {code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized userPERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 7 more {code} *Analysis* On each RM attempt to switch to Active, refreshAdminAcls is called, and the ACL permission is not available for the user. The switch to Active is retried infinitely, and {{ActiveStandbyElector#becomeActive()}} always returns false. *Expected* The RM should get a shutdown event after a few retries, or even at the first attempt, since the user it retries refreshAdminAcls as can never be updated at runtime. 
*States from commands* ./yarn rmadmin -getServiceState rm2 *standby* ./yarn rmadmin -getServiceState rm1 *standby* ./yarn rmadmin -checkHealth rm1 *echo $? = 0* ./yarn rmadmin -checkHealth rm2 *echo $? = 0* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers
[ https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587297#comment-14587297 ] Xuan Gong commented on YARN-433: Remove LaunchedTransition from RMContainerImpl and move it to RMNodeImpl, where it will be called when the RMNode catches up with the ContainerStatus. When RM is catching up with node updates then it should not expire acquired containers -- Key: YARN-433 URL: https://issues.apache.org/jira/browse/YARN-433 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-433.1.patch RM expires containers that are not launched within some time of being allocated. The default is 10 mins. When an RM is not keeping up with node updates, it may not be aware of newly launched containers. If the expire thread fires for such containers, the RM can expire them even though they may have launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3709) RM Web UI AM link shown before MRAppMaster launch
[ https://issues.apache.org/jira/browse/YARN-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587391#comment-14587391 ] Weiwei Yang commented on YARN-3709: --- I don't think this is a bug. Once an application is ACCEPTED, you can click the ApplicationMaster link to track its progress. Before it starts to run, it shows something like: YarnApplicationState: ACCEPTED: waiting for AM container to be allocated, launched and register with RM. FinalStatus Reported by AM: Application has not completed yet. Started: Mon Jun 15 20:16:16 -0700 2015 Elapsed: 3mins, 21sec That gives you information about the current status of the application and how long it has been waiting. That is useful, isn't it? RM Web UI AM link shown before MRAppMaster launch - Key: YARN-3709 URL: https://issues.apache.org/jira/browse/YARN-3709 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Priority: Minor Attachments: ApplicationMasterLink.png Steps to reproduce === 1. Configure an HA setup with 2 NMs 2. AM allocated memory 1024 MB in CS 3. Submit 5 pi jobs in parallel 4. 2 AMs run in parallel *Expected:* The Tracking URL/AM link should be shown only for running applications *Actual:* The *ApplicationMaster* link is shown for all 5 applications; for applications with no AM assigned, the Tracking URL should be shown as *UNASSIGNED* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587402#comment-14587402 ] Devaraj K commented on YARN-3789: - Thanks [~bibinchundatt] for the updated patch. It looks good to me. [~rohithsharma], do you have any comments on the latest patch? Refactor logs for LeafQueue#activateApplications() to remove duplicate logging -- Key: YARN-3789 URL: https://issues.apache.org/jira/browse/YARN-3789 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Attachments: 0001-YARN-3789.patch, 0002-YARN-3789.patch, 0003-YARN-3789.patch, 0004-YARN-3789.patch, 0005-YARN-3789.patch Duplicate logging from resource manager during am limit check for each application {code} 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3807) Proposal of Guaranteed Capacity Scheduling for YARN
[ https://issues.apache.org/jira/browse/YARN-3807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Shao updated YARN-3807: --- Description: This proposal talks about limitations of the YARN scheduling policies for SLA applications, and tries to solve them by YARN-3806 and a new scheduling policy called guaranteed capacity scheduling. Guaranteed capacity scheduling guarantees to applications that they can get resources under a specified capacity cap in a totally predictable manner. The application can meet its SLA more easily since it is self-contained in the shared cluster - external uncertainties are eliminated. was: This proposal talks about limitations of the YARN scheduling policies for SLA applications, and tries to solve them by [Link] and a new scheduling policy called guaranteed capacity scheduling. Guaranteed capacity scheduling guarantees to applications that they can get resources under a specified capacity cap in a totally predictable manner. The application can meet its SLA more easily since it is self-contained in the shared cluster - external uncertainties are eliminated. Proposal of Guaranteed Capacity Scheduling for YARN --- Key: YARN-3807 URL: https://issues.apache.org/jira/browse/YARN-3807 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, fairscheduler Reporter: Wei Shao This proposal talks about limitations of the YARN scheduling policies for SLA applications, and tries to solve them by YARN-3806 and a new scheduling policy called guaranteed capacity scheduling. Guaranteed capacity scheduling guarantees to applications that they can get resources under a specified capacity cap in a totally predictable manner. The application can meet its SLA more easily since it is self-contained in the shared cluster - external uncertainties are eliminated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3806) Proposal of Generic Scheduling Framework for YARN
[ https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Shao updated YARN-3806: --- Attachment: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf Proposal of Generic Scheduling Framework for YARN - Key: YARN-3806 URL: https://issues.apache.org/jira/browse/YARN-3806 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Wei Shao Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf Currently, a typical YARN cluster runs many different kinds of applications: production applications, ad hoc user applications, long-running services and so on. Different YARN scheduling policies may be suitable for different applications. For example, capacity scheduling can manage production applications well since applications can get a guaranteed resource share, and fair scheduling can manage ad hoc user applications well since it can enforce fairness among users. However, the current YARN scheduling framework doesn't have a mechanism for multiple scheduling policies to work hierarchically in one cluster. YARN-3306 talked about many issues of today's YARN scheduling framework, and proposed a per-queue policy-driven framework. In detail, it supported different scheduling policies for leaf queues. However, support for different scheduling policies in upper-level queues has not been seriously considered yet. A generic scheduling framework is proposed here to address these limitations. It supports different policies for any queue consistently. The proposal tries to solve many other issues in the current YARN scheduling framework as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
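As a concrete, if speculative, illustration of "different policies for any queue": a tiny hypothetical per-queue policy hook in Java. None of these types come from the attached proposal; they are assumptions made up for this sketch.
{code}
// Illustrative only: a hypothetical per-queue policy hook, not the proposal's API.
import java.util.Comparator;

interface SchedulableQueue {
  /** Current usage divided by the queue's fair share; lower means more starved. */
  double usageOverFairShare();
}

interface QueueOrderingPolicy<Q extends SchedulableQueue> {
  /** Order the children of a parent queue for the next allocation attempt. */
  Comparator<Q> childComparator();
}

final class FairOrdering<Q extends SchedulableQueue> implements QueueOrderingPolicy<Q> {
  @Override
  public Comparator<Q> childComparator() {
    // Least usage relative to fair share goes first.
    return Comparator.comparingDouble(SchedulableQueue::usageOverFairShare);
  }
}
{code}
The point of such a hook is that each parent queue, not just each leaf, could carry its own ordering policy, which is the hierarchical capability the proposal argues is missing.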
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587465#comment-14587465 ] Tsuyoshi Ozawa commented on YARN-3798: -- Thanks! RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Blocker Attachments: RM.log RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 
2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at
[jira] [Updated] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3798: - Priority: Blocker (was: Major) RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Blocker Attachments: RM.log RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-06-09 10:09:44,887 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at
[jira] [Updated] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3798: - Target Version/s: 2.7.1 RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Attachments: RM.log RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-06-09 10:09:44,887 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at
[jira] [Commented] (YARN-1983) Support heterogeneous container types at runtime on YARN
[ https://issues.apache.org/jira/browse/YARN-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587370#comment-14587370 ] Chun Chen commented on YARN-1983: - [~vinodkv], according to your suggestion, I propose the following change: 1. Allow NM_CE to specify a comma-separated list of CE classes. 2. Allow the user to specify an env named NM_CLIENT_CE in the CLC. If the value of NM_CLIENT_CE is one of the CE classes configured previously, choose that one to execute the container; otherwise throw an exception. 3. If the user specifies only one CE class in NM_CE, ignore NM_CLIENT_CE in the env of the CLC and always use that one to execute containers. 4. If the user specifies multiple classes in NM_CE, they have to configure a default CE named NM_DEFAULT_CE in yarn-site.xml in case the env NM_CLIENT_CE is not specified when submitting containers. NM_CE=yarn.nodemanager.container-executor.class NM_CLIENT_CE=yarn.nodemanager.client.container-executor.class NM_DEFAULT_CE=yarn.nodemanager.default.container-executor.class Support heterogeneous container types at runtime on YARN Key: YARN-1983 URL: https://issues.apache.org/jira/browse/YARN-1983 Project: Hadoop YARN Issue Type: Improvement Reporter: Junping Du Attachments: YARN-1983.2.patch, YARN-1983.patch Different container types (default, LXC, docker, VM box, etc.) have different semantics on isolation of security, namespace/env, performance, etc. Per discussions in YARN-1964, we have some good thoughts on supporting different types of containers running on YARN, specified by the application at runtime, which largely enhances YARN's flexibility to meet heterogeneous apps' requirements on isolation at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
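Chun Chen's four rules above translate fairly directly into a lookup routine. A hedged Java sketch follows; the three configuration keys are quoted from the comment, while the class, method, and everything else are hypothetical illustration, not the patch:
{code}
// Hypothetical sketch of the proposed selection rules; not the actual patch.
// Assumes NM_CE (yarn.nodemanager.container-executor.class) is set.
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class ContainerExecutorSelector {
  static String selectExecutorClass(Map<String, String> nmConf,
                                    Map<String, String> containerEnv) {
    List<String> configured = Arrays.asList(
        nmConf.getOrDefault("yarn.nodemanager.container-executor.class", "").split(","));
    if (configured.size() == 1) {
      // Rule 3: a single configured CE always wins; the per-container env is ignored.
      return configured.get(0);
    }
    String requested =
        containerEnv.get("yarn.nodemanager.client.container-executor.class");
    if (requested == null) {
      // Rule 4: multiple CEs require an explicit default in yarn-site.xml.
      String def = nmConf.get("yarn.nodemanager.default.container-executor.class");
      if (def == null) {
        throw new IllegalStateException("No default container executor configured");
      }
      return def;
    }
    if (!configured.contains(requested)) {
      // Rule 2: requests outside the configured list are rejected.
      throw new IllegalArgumentException("Unknown container executor: " + requested);
    }
    return requested;
  }
}
{code}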
[jira] [Created] (YARN-3807) Proposal of Guaranteed Capacity Scheduling for YARN
Wei Shao created YARN-3807: -- Summary: Proposal of Guaranteed Capacity Scheduling for YARN Key: YARN-3807 URL: https://issues.apache.org/jira/browse/YARN-3807 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, fairscheduler Reporter: Wei Shao This proposal talks about limitations of the YARN scheduling policies for SLA applications, and tries to solve them by [Link] and a new scheduling policy called guaranteed capacity scheduling. Guaranteed capacity scheduling guarantees to applications that they can get resources under a specified capacity cap in a totally predictable manner. The application can meet its SLA more easily since it is self-contained in the shared cluster - external uncertainties are eliminated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3711) Documentation of ResourceManager HA should explain configurations about listen addresses
[ https://issues.apache.org/jira/browse/YARN-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587346#comment-14587346 ] Tsuyoshi Ozawa commented on YARN-3711: -- Committed this to trunk, branch-2, and branch-2.7. Thanks [~iwasakims] for your contribution. Documentation of ResourceManager HA should explain configurations about listen addresses Key: YARN-3711 URL: https://issues.apache.org/jira/browse/YARN-3711 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Fix For: 2.7.1 Attachments: YARN-3711.002.patch, YARN-3711.003.patch There should be explanation about webapp address in addition to RPC address. AM proxy filter needs explicit definition of {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} to get proper addresses in RM-HA mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587412#comment-14587412 ] Rohith commented on YARN-3789: -- Looks good to me too. Refactor logs for LeafQueue#activateApplications() to remove duplicate logging -- Key: YARN-3789 URL: https://issues.apache.org/jira/browse/YARN-3789 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Attachments: 0001-YARN-3789.patch, 0002-YARN-3789.patch, 0003-YARN-3789.patch, 0004-YARN-3789.patch, 0005-YARN-3789.patch Duplicate logging from resource manager during am limit check for each application {code} 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3808) Proposal of Time Based Fair Scheduling for YARN
Wei Shao created YARN-3808: -- Summary: Proposal of Time Based Fair Scheduling for YARN Key: YARN-3808 URL: https://issues.apache.org/jira/browse/YARN-3808 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler, scheduler Reporter: Wei Shao This proposal talks about the issues of the YARN fair scheduling policy, and tries to solve them by YARN-3806 and a new scheduling policy called time-based fair scheduling. The time-based fair scheduling policy is proposed to enforce time-based fairness among users. For example, if two users share the cluster weekly, each user's fair share is half of the cluster per week. In a particular week, if the first user has used the whole cluster for the first half of the week, then in the second half of the week the second user will always have priority to use cluster resources, since the first user has already used up its fair share of the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
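To make the weekly example concrete, here is a tiny bookkeeping sketch. It is hypothetical, not the proposal's algorithm: track each user's resource-time consumed in the window and prioritize whoever has consumed the smallest fraction of their share.
{code}
// Hypothetical illustration of time-based fair share accounting.
import java.util.HashMap;
import java.util.Map;

final class TimeBasedFairShare {
  private final Map<String, Double> usedResourceHours = new HashMap<>();
  private final double windowResourceHours; // e.g. clusterSize * 168h for a week

  TimeBasedFairShare(double windowResourceHours) {
    this.windowResourceHours = windowResourceHours;
  }

  /** Record that 'user' consumed 'resourceHours' of the cluster in this window. */
  void charge(String user, double resourceHours) {
    usedResourceHours.merge(user, resourceHours, Double::sum);
  }

  /** Fraction of the user's time-based fair share already consumed this window. */
  double shareConsumed(String user, int numUsers) {
    double fairShare = windowResourceHours / numUsers;
    return usedResourceHours.getOrDefault(user, 0.0) / fairShare;
  }
}
{code}
With two users sharing a 10-node cluster weekly (windowResourceHours = 10 * 168), once user A's shareConsumed(...) reaches 1.0 mid-week, the scheduler would favor user B for the remainder of the window, matching the example above.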
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587241#comment-14587241 ] Bibin A Chundatt commented on YARN-3798: [~ozawa] We are using Hadoop 2.7.0 and ZK 3.5.0. RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Attachments: RM.log RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 
2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at
[jira] [Updated] (YARN-3711) Documentation of ResourceManager HA should explain configurations about listen addresses
[ https://issues.apache.org/jira/browse/YARN-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3711: - Summary: Documentation of ResourceManager HA should explain configurations about listen addresses (was: Documentation of ResourceManager HA should explain about webapp address configuration) Documentation of ResourceManager HA should explain configurations about listen addresses Key: YARN-3711 URL: https://issues.apache.org/jira/browse/YARN-3711 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3711.002.patch, YARN-3711.003.patch There should be explanation about webapp address in addition to RPC address. AM proxy filter needs explicit definition of {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} to get proper addresses in RM-HA mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3711) Documentation of ResourceManager HA should explain configurations about listen addresses
[ https://issues.apache.org/jira/browse/YARN-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587292#comment-14587292 ] Hudson commented on YARN-3711: -- FAILURE: Integrated in Hadoop-trunk-Commit #8023 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8023/]) YARN-3711. Documentation of ResourceManager HA should explain configurations about listen addresses. Contributed by Masatake Iwasaki. (ozawa: rev e8c514373f2d258663497a33ffb3b231d0743b57) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerHA.md * hadoop-yarn-project/CHANGES.txt Documentation of ResourceManager HA should explain configurations about listen addresses Key: YARN-3711 URL: https://issues.apache.org/jira/browse/YARN-3711 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3711.002.patch, YARN-3711.003.patch There should be explanation about webapp address in addition to RPC address. AM proxy filter needs explicit definition of {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} to get proper addresses in RM-HA mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3807) Proposal of Guaranteed Capacity Scheduling for YARN
[ https://issues.apache.org/jira/browse/YARN-3807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Shao updated YARN-3807: --- Attachment: ProposalOfGuaranteedCapacitySchedulingForYARN-V1.0.pdf Proposal of Guaranteed Capacity Scheduling for YARN --- Key: YARN-3807 URL: https://issues.apache.org/jira/browse/YARN-3807 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, fairscheduler Reporter: Wei Shao Attachments: ProposalOfGuaranteedCapacitySchedulingForYARN-V1.0.pdf This proposal talks about limitations of the YARN scheduling policies for SLA applications, and tries to solve them by YARN-3806 and the new scheduling policy called guaranteed capacity scheduling. Guaranteed capacity scheduling makes guarantee to the applications that they can get resources under specified capacity cap in totally predictable manner. The application can meet SLA more easily since it is self-contained in the shared cluster - external uncertainties are eliminated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586249#comment-14586249 ] Sandy Ryza commented on YARN-1197: -- Sorry, I've been quiet here for a while, but I'd be concerned about a design that requires going through the ResourceManager for decreases. If I understand correctly, this would be a considerable hit to performance, which could be prohibitive for frameworks like Spark that might use container resizing for allocating per-task resources. Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes the resource allocated to a container is fixed during its lifetime. When users want to change the resource of an allocated container, the only way is to release it and allocate a new container with the expected size. Allowing run-time changes to the resources of an allocated container will give us better control of resource usage on the application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586455#comment-14586455 ] Vinod Kumar Vavilapalli commented on YARN-1197: --- bq. We all agreed that due to the complexity of the current design, it is worthwhile to revisit the idea of increasing and decreasing container size both through Resource Manager +1 for this idea. Letting this go through the NodeManager directly adds too much complexity and difficult-to-understand semantics for application writers. bq. If I understand correctly, this would be a considerable hit to performance [~sandyr], as I understand it, going through the NM is in fact the worse solution w.r.t. allocation throughput. Going through the RM directly is better, as the RM will immediately know that the resource is available for future allocations - the decrease on the NM can happen offline. The control flow I expect is: - the framework/app decides it doesn't need that many resources anymore; by this time, the container should already have given up the physical resources it doesn't need - it informs the RM about the required decrement - the RM informs the NM to resize the container (cgroups etc.) Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes the resource allocated to a container is fixed during its lifetime. When users want to change the resource of an allocated container, the only way is to release it and allocate a new container with the expected size. Allowing run-time changes to the resources of an allocated container will give us better control of resource usage on the application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
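To make the three-step flow above easier to follow, here is a deliberately hypothetical sketch; RMClient and its decreaseContainer method are stand-ins invented for illustration, not YARN protocol records:
{code}
// Purely schematic; these interfaces are hypothetical, not the YARN API.
interface RMClient {
  void decreaseContainer(String containerId, int newMemoryMb);
}

final class AppMaster {
  private final RMClient rm;

  AppMaster(RMClient rm) {
    this.rm = rm;
  }

  void shrink(String containerId, int newMemoryMb) {
    // 1. The app first stops using the memory it is giving back.
    // 2. It then tells the RM, which can re-offer the freed resource immediately.
    // 3. The RM later tells the NM to tighten enforcement (cgroups) offline.
    rm.decreaseContainer(containerId, newMemoryMb);
  }
}
{code}
The design point is the ordering: the freed resource is reported to the RM first, so it is schedulable immediately, while the NM-side enforcement change can lag behind.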
[jira] [Assigned] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena reassigned YARN-3804: -- Assignee: Varun Saxena Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical Steps to reproduce 1. Configure the cluster in secure mode 2. On the RM, configure yarn.admin.acl=dsperf 3. Configure yarn.resourcemanager.principal=yarn 4. Start both RMs Both RMs will be in Standby forever {code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized userPERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 7 more {code} *Analysis* On each RM attempt to switch to Active, refreshAdminAcls is called, and the ACL permission is not available for the user. The switch to Active is retried infinitely, and {{ActiveStandbyElector#becomeActive()}} always returns false. *Expected* The RM should get a shutdown event after a few retries, or even at the first attempt, since the user it retries refreshAdminAcls as can never be updated at runtime. *States from commands* ./yarn rmadmin -getServiceState rm2 *standby* ./yarn rmadmin -getServiceState rm1 *standby* ./yarn rmadmin -checkHealth rm1 *echo $? = 0* ./yarn rmadmin -checkHealth rm2 *echo $? = 0* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-3804: -- Priority: Critical (was: Major) Target Version/s: 2.8.0, 2.7.1 Seems like a critical issue to me. Two options: # Fail correctly and assume that the admin adds the yarn user explicitly if it needs to work. # Allow the daemon user to do the refresh irrespective of what the admin configures I get a feeling (2) is better. Thoughts? /cc [~leftnoteasy], [~jianhe] Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Priority: Critical Steps to reproduce 1. Configure the cluster in secure mode 2. On the RM, configure yarn.admin.acl=dsperf 3. Configure yarn.resourcemanager.principal=yarn 4. Start both RMs Both RMs will be in Standby forever {code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized userPERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 7 more {code} *Analysis* On each RM attempt to switch to Active, refreshAdminAcls is called, and the ACL permission is not available for the user. The switch to Active is retried infinitely, and {{ActiveStandbyElector#becomeActive()}} always returns false. *Expected* The RM should get a shutdown event after a few retries, or even at the first attempt, since the user it retries refreshAdminAcls as can never be updated at runtime. *States from commands* ./yarn rmadmin -getServiceState rm2 *standby* ./yarn rmadmin -getServiceState rm1 *standby* ./yarn rmadmin -checkHealth rm1 *echo $? = 0* ./yarn rmadmin -checkHealth rm2 *echo $? = 0* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
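A hedged sketch of what option (2) above could look like; AdminAclChecker and its fields are hypothetical names for illustration, not the AdminService code or the eventual patch:
{code}
// Illustrative sketch of option (2); not the actual AdminService code.
import java.util.Set;

final class AdminAclChecker {
  private final String daemonUser;    // the user the RM runs as, e.g. "yarn"
  private final Set<String> adminAcl; // parsed from yarn.admin.acl

  AdminAclChecker(String daemonUser, Set<String> adminAcl) {
    this.daemonUser = daemonUser;
    this.adminAcl = adminAcl;
  }

  boolean isAdmin(String caller) {
    // The daemon user must always be able to run internal refreshes such as
    // refreshAdminAcls during the standby-to-active transition; otherwise a
    // misconfigured ACL leaves both RMs in standby forever, as reported here.
    return caller.equals(daemonUser) || adminAcl.contains(caller);
  }
}
{code}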
[jira] [Commented] (YARN-1012) Report NM aggregated container resource utilization in heartbeat
[ https://issues.apache.org/jira/browse/YARN-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586451#comment-14586451 ] Hadoop QA commented on YARN-1012: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 19m 43s | Pre-patch trunk has 3 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:green}+1{color} | javac | 7m 43s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 55s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 2m 31s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 38s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 5m 25s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 26s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 1m 59s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 0m 24s | Tests passed in hadoop-yarn-server-common. | | {color:green}+1{color} | yarn tests | 6m 9s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 56m 56s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12739650/YARN-1012-8.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 32ffda1 | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8253/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-common.html | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8253/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8253/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8253/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8253/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8253/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8253/console | This message was automatically generated. 
Report NM aggregated container resource utilization in heartbeat Key: YARN-1012 URL: https://issues.apache.org/jira/browse/YARN-1012 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.7.0 Reporter: Arun C Murthy Assignee: Inigo Goiri Attachments: YARN-1012-1.patch, YARN-1012-2.patch, YARN-1012-3.patch, YARN-1012-4.patch, YARN-1012-5.patch, YARN-1012-6.patch, YARN-1012-7.patch, YARN-1012-8.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3802) Two RMNodes for the same NodeId are used in RM sometimes after NM is reconnected.
[ https://issues.apache.org/jira/browse/YARN-3802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586507#comment-14586507 ] Xuan Gong commented on YARN-3802: - [~zxu] The patch looks good overall. One nit: Could you fix the comment, too? {code} // Only add new node if old state is RUNNING {code} Two RMNodes for the same NodeId are used in RM sometimes after NM is reconnected. - Key: YARN-3802 URL: https://issues.apache.org/jira/browse/YARN-3802 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-3802.000.patch Two RMNodes for the same NodeId are used in RM sometimes after NM is reconnected. The Scheduler and the RMContext sometimes use different RMNode references for the same NodeId after an NM is reconnected, which is not correct. The Scheduler and the RMContext should always use the same RMNode reference for the same NodeId. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3706) Generalize native HBase writer for additional tables
[ https://issues.apache.org/jira/browse/YARN-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586363#comment-14586363 ] Joep Rottinghuis commented on YARN-3706: This patch is ready for review. Generalize native HBase writer for additional tables Key: YARN-3706 URL: https://issues.apache.org/jira/browse/YARN-3706 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Joep Rottinghuis Assignee: Joep Rottinghuis Priority: Minor Attachments: YARN-3706-YARN-2928.001.patch, YARN-3706-YARN-2928.010.patch, YARN-3706-YARN-2928.011.patch, YARN-3706-YARN-2928.012.patch, YARN-3726-YARN-2928.002.patch, YARN-3726-YARN-2928.003.patch, YARN-3726-YARN-2928.004.patch, YARN-3726-YARN-2928.005.patch, YARN-3726-YARN-2928.006.patch, YARN-3726-YARN-2928.007.patch, YARN-3726-YARN-2928.008.patch, YARN-3726-YARN-2928.009.patch When reviewing YARN-3411 we noticed that we could change the class hierarchy a little in order to accommodate additional tables easily. In order to get ready for benchmark testing we left the original layout in place, as performance would not be impacted by the code hierarchy. Here is a separate jira to address the hierarchy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586307#comment-14586307 ] Varun Saxena commented on YARN-3798: ZK version is {{3.5}} I think RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Attachments: RM.log RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-06-09 10:09:44,887 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at
[jira] [Commented] (YARN-3714) AM proxy filter can not get proper default proxy address if RM-HA is enabled
[ https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586263#comment-14586263 ] Hadoop QA commented on YARN-3714: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 22m 43s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 9m 15s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 11m 13s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 26s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 2m 35s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 50s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 40s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 43s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 27s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 2m 15s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 0m 22s | Tests passed in hadoop-yarn-server-web-proxy. | | | | 56m 32s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12739513/YARN-3714.003.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 4c5da9b | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/8252/artifact/patchprocess/whitespace.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8252/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8252/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-web-proxy test log | https://builds.apache.org/job/PreCommit-YARN-Build/8252/artifact/patchprocess/testrun_hadoop-yarn-server-web-proxy.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8252/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8252/console | This message was automatically generated. AM proxy filter can not get proper default proxy address if RM-HA is enabled Key: YARN-3714 URL: https://issues.apache.org/jira/browse/YARN-3714 Project: Hadoop YARN Issue Type: Bug Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3714.001.patch, YARN-3714.002.patch, YARN-3714.003.patch Default proxy address could not be got without setting {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586360#comment-14586360 ] Varun Saxena commented on YARN-3798: [~ozawa], thanks for your explanation. This specific log scenario (logs attached with the JIRA) looks like a ZooKeeper issue. We unfortunately lost the ZooKeeper logs, otherwise we could have confirmed it, and we have been unable to reproduce it since then :( As you explained, consistent data is guaranteed if a single ZooKeeper object is used. The scenario you explained above, though, is a good catch, and I think we can fix it. RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Attachments: RM.log RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at
[jira] [Commented] (YARN-3714) AM proxy filter can not get proper default proxy address if RM-HA is enabled
[ https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586377#comment-14586377 ] Xuan Gong commented on YARN-3714: - bq. HAUtil.verifyAndSetConfiguration works only on the RM node. AMs running in slave nodes also need to know the RM webapp addresses. Thanks for the explanation. That makes sense. The patch looks good overall. One small nit: in {code} public static List<String> getRMHAWebappAddresses( final YarnConfiguration conf) { {code} we could check whether RM_WEBAPP_ADDRESS has been set with RM-ids. If not, we only need to check whether RM_HOSTNAME has been set with RM-ids, instead of calling {code} HAUtil.verifyAndSetRMHAIdsList(conf); {code} ? AM proxy filter can not get proper default proxy address if RM-HA is enabled Key: YARN-3714 URL: https://issues.apache.org/jira/browse/YARN-3714 Project: Hadoop YARN Issue Type: Bug Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3714.001.patch, YARN-3714.002.patch, YARN-3714.003.patch Default proxy address could not be got without setting {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
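To make the suggestion concrete, a rough sketch of the proposed short-circuit; the control flow is an assumption about the patch, though the configuration keys and {{HAUtil}} helpers used are standard ones:
{code}
// Sketch only: prefer the per-RM webapp address, and fall back to the per-RM
// hostname plus the default webapp port when no explicit address is set.
List<String> addresses = new ArrayList<String>();
for (String rmId : HAUtil.getRMHAIds(conf)) {
  String addr = conf.get(
      HAUtil.addSuffix(YarnConfiguration.RM_WEBAPP_ADDRESS, rmId));
  if (addr == null) {
    String host = conf.get(
        HAUtil.addSuffix(YarnConfiguration.RM_HOSTNAME, rmId));
    if (host != null) {
      addr = host + ":" + YarnConfiguration.DEFAULT_RM_WEBAPP_PORT;
    }
  }
  if (addr != null) {
    addresses.add(addr);
  }
}
{code}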
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586281#comment-14586281 ] Wangda Tan commented on YARN-1197: -- [~sandyr], Thanks for coming back :). I'm not very sure what the performance issue you mentioned is if decreases go to the RM; what's the expected (ideal) delay in your mind for Spark releasing resources? Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1012) Report NM aggregated container resource utilization in heartbeat
[ https://issues.apache.org/jira/browse/YARN-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Inigo Goiri updated YARN-1012: -- Attachment: YARN-1012-8.patch Report aggregated utilization. Report NM aggregated container resource utilization in heartbeat Key: YARN-1012 URL: https://issues.apache.org/jira/browse/YARN-1012 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.7.0 Reporter: Arun C Murthy Assignee: Inigo Goiri Attachments: YARN-1012-1.patch, YARN-1012-2.patch, YARN-1012-3.patch, YARN-1012-4.patch, YARN-1012-5.patch, YARN-1012-6.patch, YARN-1012-7.patch, YARN-1012-8.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586645#comment-14586645 ] Xuan Gong commented on YARN-3804: - I am OK with that. In transitionToActive(), we are re-using all the refresh* code; if we choose option 2, we need to refactor all the refresh* functions. Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical
Steps to reproduce
1. Configure cluster in secure mode
2. On RM Configure yarn.admin.acl=dsperf
3. Configure yarn.resourcemanager.principal=yarn
4. Start both RMs
Both RMs will be in Standby forever
{code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized user PERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 7 more {code}
*Analysis*
On each attempt by the RM to switch to Active, refreshAcl is called, but the ACL permission is not available for the user. The switch to Active is therefore retried infinitely, with {{ActiveStandbyElector#becomeActive()}} always returning false.
*Expected*
The RM should get a shutdown event after a few retries, or even on the first attempt, since the user it retries refreshAcl as can never be updated at runtime.
*States from commands*
./yarn rmadmin -getServiceState rm2 *standby*
./yarn rmadmin -getServiceState rm1 *standby*
./yarn rmadmin -checkHealth rm1 *echo $? = 0*
./yarn rmadmin -checkHealth rm2 *echo $? = 0*
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3803) Application hangs after more than one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586582#comment-14586582 ] Karthik Kambatla commented on YARN-3803: This seems like a serious issue. Any reason for marking it Minor? Application hangs after more than one localization attempt fails on the same NM --- Key: YARN-3803 URL: https://issues.apache.org/jira/browse/YARN-3803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0, 2.5.1 Reporter: Yuliya Feldman Assignee: Yuliya Feldman Priority: Minor In a sandbox (single node) environment with LinuxContainerExecutor, when the first application localization attempt fails, the second attempt cannot proceed, and the application subsequently hangs until the RM kills it as non-responding. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586610#comment-14586610 ] Jian He commented on YARN-3804: --- +1 for 2). Not much point in having the RM depend on the admin ACL to do the transition for itself. [~kasha], [~xgong], sounds good? Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical
Steps to reproduce
1. Configure cluster in secure mode
2. On RM Configure yarn.admin.acl=dsperf
3. Configure yarn.resourcemanager.principal=yarn
4. Start both RMs
Both RMs will be in Standby forever
{code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized user PERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 7 more {code}
*Analysis*
On each attempt by the RM to switch to Active, refreshAcl is called, but the ACL permission is not available for the user. The switch to Active is therefore retried infinitely, with {{ActiveStandbyElector#becomeActive()}} always returning false.
*Expected*
The RM should get a shutdown event after a few retries, or even on the first attempt, since the user it retries refreshAcl as can never be updated at runtime.
*States from commands*
./yarn rmadmin -getServiceState rm2 *standby*
./yarn rmadmin -getServiceState rm1 *standby*
./yarn rmadmin -checkHealth rm1 *echo $? = 0*
./yarn rmadmin -checkHealth rm2 *echo $? = 0*
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586676#comment-14586676 ] Wangda Tan commented on YARN-3804: -- There's an inconsistent check in the current code path:
- AdminService.checkAccess uses YarnAuthorizationProvider to do the check; its default implementation, {{ConfiguredYarnAuthorizer}}, uses the configured {{yarn.admin.acl}}.
- ClientRMService.checkAccess uses AdminACLsManager, which uses the configured {{yarn.admin.acl}} + {{daemon_user}}.
I think we should fix the inconsistency issue; 2) will be covered if we make both of them allow {{daemon_user}}.
Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical
Steps to reproduce
1. Configure cluster in secure mode
2. On RM Configure yarn.admin.acl=dsperf
3. Configure yarn.resourcemanager.principal=yarn
4. Start both RMs
Both RMs will be in Standby forever
{code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized user PERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ...
7 more {code}
*Analysis*
On each attempt by the RM to switch to Active, refreshAcl is called, but the ACL permission is not available for the user. The switch to Active is therefore retried infinitely, with {{ActiveStandbyElector#becomeActive()}} always returning false.
*Expected*
The RM should get a shutdown event after a few retries, or even on the first attempt, since the user it retries refreshAcl as can never be updated at runtime.
*States from commands*
./yarn rmadmin -getServiceState rm2 *standby*
./yarn rmadmin -getServiceState rm1 *standby*
./yarn rmadmin -checkHealth rm1 *echo $? = 0*
./yarn rmadmin -checkHealth rm2 *echo $? = 0*
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
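For illustration, the daemon-user allowance discussed above could look like the following sketch; it mirrors what {{AdminACLsManager}} effectively does, but it is an assumption, not the committed fix:
{code}
// Sketch (assumption): treat the RM's own login user as an admin in addition
// to the configured yarn.admin.acl, so the RM can refresh its own ACLs.
static boolean isAdmin(Configuration conf, UserGroupInformation caller)
    throws IOException {
  AccessControlList adminAcl = new AccessControlList(
      conf.get(YarnConfiguration.YARN_ADMIN_ACL,
          YarnConfiguration.DEFAULT_YARN_ADMIN_ACL));
  adminAcl.addUser(UserGroupInformation.getLoginUser().getShortUserName());
  return adminAcl.isUserAllowed(caller);
}
{code}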
[jira] [Commented] (YARN-3803) Application hangs after more than one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586675#comment-14586675 ] Yuliya Feldman commented on YARN-3803: -- [~kasha] It happens only if you have a single node (at least in my testing), since the AM's 2nd+ attempt will happen on the same node. Though I was debating whether to make it Major or not; I can change it to Major. I will post a patch with the fix later today. Application hangs after more than one localization attempt fails on the same NM --- Key: YARN-3803 URL: https://issues.apache.org/jira/browse/YARN-3803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0, 2.5.1 Reporter: Yuliya Feldman Assignee: Yuliya Feldman Priority: Minor In a sandbox (single node) environment with LinuxContainerExecutor, when the first application localization attempt fails, the second attempt cannot proceed, and the application subsequently hangs until the RM kills it as non-responding. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3469) ZKRMStateStore: Avoid setting watches that are not required
[ https://issues.apache.org/jira/browse/YARN-3469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586586#comment-14586586 ] Karthik Kambatla commented on YARN-3469: 2.8 uses Curator and all watch handling is now implicit. ZKRMStateStore: Avoid setting watches that are not required --- Key: YARN-3469 URL: https://issues.apache.org/jira/browse/YARN-3469 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Priority: Minor Fix For: 2.7.1 Attachments: YARN-3469.01.patch In ZKRMStateStore, most operations (e.g. getDataWithRetries) set watches on znodes. Large numbers of watches will cause problems such as [ZOOKEEPER-706: large numbers of watches can cause session re-establishment to fail|https://issues.apache.org/jira/browse/ZOOKEEPER-706]. Although there is a workaround of setting jute.maxbuffer to a larger value, we would need to keep adjusting this value as more apps and attempts are stored in ZK. And those watches are useless now; it might be better not to set watches at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
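The gist of the change in plain ZooKeeper API terms (illustrative only; {{zkClient}} and {{path}} are assumed variables, and the actual patch works through the ZKRMStateStore wrapper methods):
{code}
// Passing watch=false registers no watch for this read, so session
// re-establishment does not have to replay a large watch set (ZOOKEEPER-706).
Stat stat = new Stat();
byte[] data = zkClient.getData(path, false /* watch */, stat);
{code}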
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587247#comment-14587247 ] Tsuyoshi Ozawa commented on YARN-3798: -- Thank you for sharing, Bibin. Marking this as a blocker for 2.7.1. BTW, this problem looks to be solved in 2.8 and later, which use Curator. RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Blocker Attachments: RM.log RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at
[jira] [Updated] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers
[ https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-433: --- Attachment: YARN-433.1.patch When RM is catching up with node updates then it should not expire acquired containers -- Key: YARN-433 URL: https://issues.apache.org/jira/browse/YARN-433 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-433.1.patch RM expires containers that are not launched within some time of being allocated. The default is 10mins. When an RM is not keeping up with node updates then it may not be aware of new launched containers. If the expire thread fires for such containers then the RM can expire them even though they may have launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
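For reference, the 10-minute window mentioned in the description comes from the container allocation expiry interval; the key below is believed to be the standard {{YarnConfiguration}} constant (worth double-checking against the release in use), and the value is milliseconds:
{code}
// Default allocation expiry: containers must be launched within 10 minutes
// (600000 ms) of being allocated, or the RM reclaims them.
conf.setLong(YarnConfiguration.RM_CONTAINER_ALLOC_EXPIRY_INTERVAL_MS, 600000L);
{code}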
[jira] [Commented] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers
[ https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587425#comment-14587425 ] Hadoop QA commented on YARN-433: \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 13s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 39s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 39s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 47s | The applied patch generated 2 new checkstyle issues (total was 129, now 131). | | {color:red}-1{color} | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 35s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 36s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 23s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 50m 52s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 89m 11s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.TestRMNodeTransitions | | | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12739752/YARN-433.1.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / e8c5143 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8256/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/8256/artifact/patchprocess/whitespace.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8256/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8256/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8256/console | This message was automatically generated. When RM is catching up with node updates then it should not expire acquired containers -- Key: YARN-433 URL: https://issues.apache.org/jira/browse/YARN-433 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-433.1.patch RM expires containers that are not launched within some time of being allocated. The default is 10mins. When an RM is not keeping up with node updates then it may not be aware of new launched containers. If the expire thread fires for such containers then the RM can expire them even though they may have launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3801) [JDK-8][YARN-2928] Exclude jdk.tools from hbase-client and hbase-testing-util
[ https://issues.apache.org/jira/browse/YARN-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587424#comment-14587424 ] Zhijie Shen commented on YARN-3801: --- +1, we'd better fix Java 8 issues before merging branch YARN-2928 back to trunk. HADOOP-11090 is targeting 2.8. [JDK-8][YARN-2928] Exclude jdk.tools from hbase-client and hbase-testing-util - Key: YARN-3801 URL: https://issues.apache.org/jira/browse/YARN-3801 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Tsuyoshi Ozawa Assignee: Tsuyoshi Ozawa Attachments: YARN-3801.001.patch timelineservice depends on hbase-client and hbase-testing-util, and they depend on jdk.tools:1.7. This causes Hadoop to fail to compile with JDK 8. {quote} [WARNING] Dependency convergence error for jdk.tools:jdk.tools:1.8 paths to dependency are: +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hadoop:hadoop-annotations:3.0.0-SNAPSHOT +-jdk.tools:jdk.tools:1.8 and +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hbase:hbase-client:1.0.1 +-org.apache.hbase:hbase-annotations:1.0.1 +-jdk.tools:jdk.tools:1.7 and +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hbase:hbase-testing-util:1.0.1 +-org.apache.hbase:hbase-annotations:1.0.1 +-jdk.tools:jdk.tools:1.7 [WARNING] Rule 0: org.apache.maven.plugins.enforcer.DependencyConvergence failed with message: Failed while enforcing releasability the error(s) are [ Dependency convergence error for jdk.tools:jdk.tools:1.8 paths to dependency are: +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hadoop:hadoop-annotations:3.0.0-SNAPSHOT +-jdk.tools:jdk.tools:1.8 and +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hbase:hbase-client:1.0.1 +-org.apache.hbase:hbase-annotations:1.0.1 +-jdk.tools:jdk.tools:1.7 and +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hbase:hbase-testing-util:1.0.1 +-org.apache.hbase:hbase-annotations:1.0.1 +-jdk.tools:jdk.tools:1.7 {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587474#comment-14587474 ] Jun Gong commented on YARN-3809: How about setting a larger thread pool size in ApplicationMasterLauncher, or making the size configurable? Failed to launch new attempts because ApplicationMasterLauncher's threads all hang -- Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong ApplicationMasterLauncher creates a thread pool of size 10 to handle AMLauncherEventType (LAUNCH and CLEANUP). In our cluster, there were many NMs with 10+ AMs running on them, and one NM shut down for some reason. After the RM marked the NM as LOST, it cleaned up the AMs running on it, so ApplicationMasterLauncher had to handle these 10+ CLEANUP events. ApplicationMasterLauncher's thread pool filled up, and all its threads hung in containerMgrProxy.stopContainers(stopRequest) because the NM was down and the default RPC timeout is 15 minutes. This means that for 15 minutes ApplicationMasterLauncher could not handle new events such as LAUNCH, so new attempts failed to launch because of the timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
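A minimal sketch of the configurable-size suggestion; the configuration key below is an illustrative assumption, not an existing constant, and {{conf}} is an assumed Configuration:
{code}
// Size the launcher pool from configuration instead of hard-coding 10 threads.
// The key name here is illustrative only.
int poolSize = conf.getInt("yarn.resourcemanager.amlauncher.thread-count", 10);
ThreadPoolExecutor launcherPool = new ThreadPoolExecutor(
    poolSize, poolSize, 1, TimeUnit.HOURS,
    new LinkedBlockingQueue<Runnable>());
{code}
A larger pool only shortens the stall, though: threads can still block for the full RPC timeout on a dead NM, so lowering the retry/timeout policy for CLEANUP calls would be a complementary option.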
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586694#comment-14586694 ] Karthik Kambatla commented on YARN-3804: On board with the suggestions here. Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical
Steps to reproduce
1. Configure cluster in secure mode
2. On RM Configure yarn.admin.acl=dsperf
3. Configure yarn.resourcemanager.principal=yarn
4. Start both RMs
Both RMs will be in Standby forever
{code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized user PERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 7 more {code}
*Analysis*
On each attempt by the RM to switch to Active, refreshAcl is called, but the ACL permission is not available for the user. The switch to Active is therefore retried infinitely, with {{ActiveStandbyElector#becomeActive()}} always returning false.
*Expected*
The RM should get a shutdown event after a few retries, or even on the first attempt, since the user it retries refreshAcl as can never be updated at runtime.
*States from commands*
./yarn rmadmin -getServiceState rm2 *standby*
./yarn rmadmin -getServiceState rm1 *standby*
./yarn rmadmin -checkHealth rm1 *echo $? = 0*
./yarn rmadmin -checkHealth rm2 *echo $? = 0*
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3803) Application hangs after more than one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586731#comment-14586731 ] Karthik Kambatla commented on YARN-3803: We have this other issue because of which multiple AMs for the same app get assigned to the same node, so this could be a pretty serious issue. Application hangs after more than one localization attempt fails on the same NM --- Key: YARN-3803 URL: https://issues.apache.org/jira/browse/YARN-3803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0, 2.5.1 Reporter: Yuliya Feldman Assignee: Yuliya Feldman Priority: Minor In a sandbox (single node) environment with LinuxContainerExecutor, when the first application localization attempt fails, the second attempt cannot proceed, and the application subsequently hangs until the RM kills it as non-responding. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586999#comment-14586999 ] MENG DING commented on YARN-1197: - [~sandyr], Yes. The key assumption is that by the time the Application Master requests a resource decrease from the RM for a particular container, that container should have already reduced its resource usage. Therefore, the RM can immediately allocate the resource to others. So to summarize the main idea:
* Both container resource increase and decrease requests go through the RM. This eliminates the race condition where a decrease for a container takes place while an increase for the same container is in progress.
* There is no need for an AM-NM protocol anymore. This greatly simplifies the logic for application writers.
* A resource decrease can happen immediately in the RM, and the actual enforcement/monitoring of the decrease can happen offline, as mentioned by Vinod.
* A resource increase, on the other hand, needs more thought.
** In the current design, the RM gives out an increase token to be used by the AM to initiate the increase on the NM. There is no need for this: the RM can notify the NM of the increase through the RM-NM heartbeat response.
** The RM still needs to wait for an acknowledgement from the NM to confirm that the increase is done before sending out a response to the AM. This will take two heartbeat cycles, but that is not much worse than giving out a token to the AM first and then letting the AM initiate the increase.
** Since the RM needs to wait for an acknowledgement from the NM to confirm the increase, we must handle cases such as timeout, NM restart/recovery, etc. So we probably still need a container increase token, and token expiration logic for this purpose, but the token will be sent to the NM through the RM-NM heartbeat protocol. (I am still working out the details)
Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
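As a toy model of the accounting described above (not YARN code; all names are invented for illustration), a decrease frees capacity immediately while an increase stays pending until the NM acknowledges it over the heartbeat:
{code}
// Toy model only: the asymmetry between decrease and increase handling.
class ContainerResizeAccounting {
  long allocatedMb;        // resource currently charged to containers
  long pendingIncreaseMb;  // granted increases awaiting NM acknowledgement

  void onDecreaseRequest(long deltaMb) {
    // Safe to free right away: the container has already shrunk its usage.
    allocatedMb -= deltaMb;
  }

  void onIncreaseGranted(long deltaMb) {
    pendingIncreaseMb += deltaMb;  // reserved; NM is told via heartbeat
  }

  void onNmAck(long deltaMb) {
    // Confirmed roughly two heartbeat cycles after the grant.
    pendingIncreaseMb -= deltaMb;
    allocatedMb += deltaMb;
  }
}
{code}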
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586687#comment-14586687 ] Sandy Ryza commented on YARN-1197: -- bq. Going through RM directly is better as the RM will immediately know that the resource is available for future allocations Is the idea that the RM would make allocations using the space before receiving acknowledgement from the NodeManager that it has resized the container (adjusted cgroups)? Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2768) optimize FSAppAttempt.updateDemand by avoiding clone of Resource which takes 85% of computing time of update thread
[ https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586733#comment-14586733 ] Karthik Kambatla commented on YARN-2768: Thanks for the clarification, [~zhiguohong]. Let me take a closer look at the patch and provide review comments. optimize FSAppAttempt.updateDemand by avoiding clone of Resource which takes 85% of computing time of update thread Key: YARN-2768 URL: https://issues.apache.org/jira/browse/YARN-2768 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2768.patch, profiling_FairScheduler_update.png See the attached picture of the profiling result. The clone of the Resource object within Resources.multiply() takes up **85%** (19.2 / 22.6) of the CPU time of the function FairScheduler.update(). The code of FSAppAttempt.updateDemand: {code} public void updateDemand() { demand = Resources.createResource(0); // Demand is current consumption plus outstanding requests Resources.addTo(demand, app.getCurrentConsumption()); // Add up outstanding resource requests synchronized (app) { for (Priority p : app.getPriorities()) { for (ResourceRequest r : app.getResourceRequests(p).values()) { Resource total = Resources.multiply(r.getCapability(), r.getNumContainers()); Resources.addTo(demand, total); } } } } {code} The code of Resources.multiply: {code} public static Resource multiply(Resource lhs, double by) { return multiplyTo(clone(lhs), by); } {code} The clone could be skipped by directly updating the value of this.demand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
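For illustration, here is one way the suggested change could look: a drop-in rewrite of updateDemand() that accumulates into this.demand in place, so no temporary Resource is cloned per request. This is a sketch written against the Hadoop 2.x Resource API (getMemory()/getVirtualCores()), not code taken from the attached patch.
{code}
public void updateDemand() {
  demand = Resources.createResource(0);
  // Demand is current consumption plus outstanding requests
  Resources.addTo(demand, app.getCurrentConsumption());
  // Add up outstanding resource requests without cloning per request
  synchronized (app) {
    for (Priority p : app.getPriorities()) {
      for (ResourceRequest r : app.getResourceRequests(p).values()) {
        Resource capability = r.getCapability();
        int n = r.getNumContainers();
        // Equivalent to addTo(demand, multiply(capability, n)), minus the clone
        demand.setMemory(demand.getMemory() + capability.getMemory() * n);
        demand.setVirtualCores(
            demand.getVirtualCores() + capability.getVirtualCores() * n);
      }
    }
  }
}
{code}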
[jira] [Commented] (YARN-3706) Generalize native HBase writer for additional tables
[ https://issues.apache.org/jira/browse/YARN-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586980#comment-14586980 ] Sangjin Lee commented on YARN-3706: --- I took a look at the latest patch, and it looks pretty good overall. I do have a few comments. Also, at a high level, were you able to run some tests using a pseudo-distributed cluster to verify that it still works as before? If not, it'd be great if you could try that out. (BaseTable.java) - l.54: nit: space (RowKey.java/EntityRowKey.java) - I'm not 100% sure of the value of the inheritance model here. The getRowKey() and the getRowKeyPrefix() methods are not common across the supposed subtypes (as the arguments change from table to table). If method contracts are not shared among the subtypes, there is little commonality among them. In other words, you will not be able to use the type {{RowKeyEntityTable}} in code; you'll always have to use {{EntityRowKey}}. Also, it's not like they have to implement common instance methods. Does that warrant the inheritance model, then? Are you considering adding real inherited (instance) methods later? (Separator.java) - l.232: Although this gives you a nice way of combining both methods, I'm thinking it is OK to provide a separate implementation for the array argument. How often can this method be invoked? If it can be invoked often, it may cause Lists to be created unnecessarily. (TimelineEntitySchemaConstants.java) - l.62: nit: spacing Generalize native HBase writer for additional tables Key: YARN-3706 URL: https://issues.apache.org/jira/browse/YARN-3706 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Joep Rottinghuis Assignee: Joep Rottinghuis Priority: Minor Attachments: YARN-3706-YARN-2928.001.patch, YARN-3706-YARN-2928.010.patch, YARN-3706-YARN-2928.011.patch, YARN-3706-YARN-2928.012.patch, YARN-3726-YARN-2928.002.patch, YARN-3726-YARN-2928.003.patch, YARN-3726-YARN-2928.004.patch, YARN-3726-YARN-2928.005.patch, YARN-3726-YARN-2928.006.patch, YARN-3726-YARN-2928.007.patch, YARN-3726-YARN-2928.008.patch, YARN-3726-YARN-2928.009.patch When reviewing YARN-3411 we noticed that we could change the class hierarchy a little in order to accommodate additional tables easily. In order to get ready for benchmark testing we left the original layout in place, as performance would not be impacted by the code hierarchy. Here is a separate jira to address the hierarchy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
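To illustrate the alternative the review hints at: since the getRowKey() signatures differ from table to table, a per-table static factory carries the same information without an inheritance relationship that shares no contract. A minimal hypothetical sketch, where the separator and field list are placeholders and not the patch's actual row key layout:
{code}
import java.nio.charset.StandardCharsets;

// Hypothetical per-table row key factory; no shared supertype is needed
// because no method contract is actually shared across tables.
public final class EntityRowKey {
  private static final String SEP = "!"; // stand-in for Separator.QUALIFIERS

  private EntityRowKey() {}

  public static byte[] getRowKey(String clusterId, String userId,
      String flowId, long flowRunId, String appId, String entityType,
      String entityId) {
    String key = String.join(SEP, clusterId, userId, flowId,
        Long.toString(flowRunId), appId, entityType, entityId);
    return key.getBytes(StandardCharsets.UTF_8);
  }
}
{code}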
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587098#comment-14587098 ] Wangda Tan commented on YARN-1197: -- [~sandyr], I think increasing via AM-NM and via RM-NM are in a very similar range of delay (multiple seconds for now). a. AM-NM needs 3 stages: 1) AM gets the increase token from the RM 2) AM sends the increase token to the NM 3) AM polls the NM about the increase status (because we cannot assume the increase can be done on the NM side very fast) b. RM-NM needs 4 stages: 1) RM sends the increase token back to the NM 2) NM does the increase locally 3) NM reports back to the RM when the increase is done 4) RM reports the completed increase to the AM Solution b has an additional RM-NM heartbeat interval. Benefits of b (some of them also mentioned by Meng): - Simpler for the AM: it only needs to know when the increase is done, and doesn't need to receive a token and submit to/poll the NM. - Creates a consistent way for applications to increase/decrease containers. - Recovery is simpler: the AM only learns about an increase when it's finished, so we only need to handle recovery of 2 components (NM/RM) instead of 3 (NM/RM/AM). Before we have a fast scheduling design/plan (I don't think we can support millisecond-level scheduling for now; too-frequent AM heartbeating will overload the RM), I don't think adding an additional NM-RM heartbeat interval is a big problem. Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587127#comment-14587127 ] Sandy Ryza commented on YARN-1197: -- Option (a) can complete in the low hundreds of milliseconds if the cluster is tuned properly, independent of cluster size. 1) Submit the increase request to the RM. Poll the RM 100 milliseconds later, after the continuous scheduling thread has run, to pick up the increase token. 2) Send the increase token to the NM. Why does the AM need to poll the NM about the increase status before taking action? Does the NM need to do anything other than update its tracking of the resources allotted to the container? Also, it's not unlikely that schedulers will be improved to return the increase token on the same heartbeat on which it's requested. So this could all happen in 2 RPCs + a scheduler decision, with no additional wait time. Anything more than this is probably prohibitively expensive for a framework like Spark that wants to submit an increase request before running each task. Would option (b) ever be able to achieve this kind of latency? Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3803) Application hangs after more than one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman resolved YARN-3803. -- Resolution: Not A Problem I apologize for this one. It is not an issue in the branches I mentioned; we just had duplicates handled incorrectly. Application hangs after more than one localization attempt fails on the same NM --- Key: YARN-3803 URL: https://issues.apache.org/jira/browse/YARN-3803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0, 2.5.1 Reporter: Yuliya Feldman Assignee: Yuliya Feldman In the sandbox (single-node) environment with LinuxContainerExecutor, when the first Application Localization attempt fails, the second attempt cannot proceed, and the application subsequently hangs until the RM kills it as non-responding. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3174) Consolidate the NodeManager documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587128#comment-14587128 ] Masatake Iwasaki commented on YARN-3174: NodeManager.md currently covers only the health checker and is not linked from site.xml. I am going to move the contents of NodeManagerRestart.md into NodeManager.md and update the site index. Consolidate the NodeManager documentation into one -- Key: YARN-3174 URL: https://issues.apache.org/jira/browse/YARN-3174 Project: Hadoop YARN Issue Type: Improvement Reporter: Allen Wittenauer Assignee: Masatake Iwasaki We really don't need a different document for every individual nodemanager feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587146#comment-14587146 ] Wangda Tan commented on YARN-1197: -- [~sandyr], Thanks for replying. bq. Why does the AM need to poll the NM about increase status before taking action? Does the NM need to do anything other than update its tracking of the resources allotted to the container? Yes, the NM only needs to update its tracking of the resource and the cgroups. We cannot assume this happens immediately, so we cannot complete the container increase in the same RPC. This is the same as startContainer: even though launching a container is fast in most cases, the AM needs to poll the NM after invoking startContainer. bq. Would option (b) ever be able to achieve this kind of latency? Even considering all current/future optimizations, such as continuous scheduling or the scheduler making the decision on the same AM-RM heartbeat, (b) needs one more NM-RM heartbeat interval. I agree with you that it could be hundreds of milliseconds for (a) vs. multiple seconds for (b) when the cluster is idle. But I'm wondering whether we really need to add this complexity to the AM before we have the mature optimizations listed above. Also, if the cluster is busier, we cannot guarantee low delay either. I tend to do (b) now since it's simpler for app developers to use this feature; I'm open to adding an AM-NM channel once the YARN scheduler supports fast scheduling better. Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
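For context on what the NM-side update involves, here is a minimal sketch assuming cgroup v1 knob names (memory.limit_in_bytes); it is illustrative and not taken from any YARN patch. The write itself is quick, but the NM processes resize requests asynchronously, which is why the AM (or RM) has to poll or wait for an acknowledgement rather than assume completion within the same RPC.
{code}
import java.io.FileWriter;
import java.io.IOException;

// Hypothetical sketch of the NM-side cgroups update for a memory increase:
// rewrite the container's cgroup limit, then update internal bookkeeping.
public final class CgroupResizeSketch {
  private CgroupResizeSketch() {}

  static void applyMemoryLimit(String containerCgroupPath, long newLimitBytes)
      throws IOException {
    // cgroup v1 memory controller knob; the path layout is an assumption.
    try (FileWriter w = new FileWriter(
        containerCgroupPath + "/memory.limit_in_bytes")) {
      w.write(Long.toString(newLimitBytes));
    }
  }
}
{code}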
[jira] [Updated] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3790: Component/s: fairscheduler TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler Key: YARN-3790 URL: https://issues.apache.org/jira/browse/YARN-3790 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, test Reporter: Rohith Assignee: zhihai xu Attachments: YARN-3790.000.patch Failure trace is as follows {noformat} Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) Time elapsed: 6.502 sec FAILURE! java.lang.AssertionError: expected:6144 but was:8192 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3789: --- Attachment: 0004-YARN-3789.patch Refactor logs for LeafQueue#activateApplications() to remove duplicate logging -- Key: YARN-3789 URL: https://issues.apache.org/jira/browse/YARN-3789 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Attachments: 0001-YARN-3789.patch, 0002-YARN-3789.patch, 0003-YARN-3789.patch, 0004-YARN-3789.patch Duplicate logging from resource manager during am limit check for each application {code} 015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit 2015-06-09 17:32:40,019 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: not starting application as amIfStarted exceeds amLimit {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585471#comment-14585471 ] zhihai xu commented on YARN-3790: - [~rohithsharma] thanks for the review, yes, I just updated the component to FairScheduler. TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler Key: YARN-3790 URL: https://issues.apache.org/jira/browse/YARN-3790 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, test Reporter: Rohith Assignee: zhihai xu Attachments: YARN-3790.000.patch Failure trace is as follows {noformat} Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) Time elapsed: 6.502 sec FAILURE! java.lang.AssertionError: expected:6144 but was:8192 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3803) Application hangs after more than one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-3803: - Priority: Major (was: Minor) Application hangs after more than one localization attempt fails on the same NM --- Key: YARN-3803 URL: https://issues.apache.org/jira/browse/YARN-3803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0, 2.5.1 Reporter: Yuliya Feldman Assignee: Yuliya Feldman In the sandbox (single-node) environment with LinuxContainerExecutor, when the first Application Localization attempt fails, the second attempt cannot proceed, and the application subsequently hangs until the RM kills it as non-responding. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3803) Application hangs after more than one localization attempt fails on the same NM
[ https://issues.apache.org/jira/browse/YARN-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587025#comment-14587025 ] Yuliya Feldman commented on YARN-3803: -- Changed to Major Application hangs after more than one localization attempt fails on the same NM --- Key: YARN-3803 URL: https://issues.apache.org/jira/browse/YARN-3803 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0, 2.5.1 Reporter: Yuliya Feldman Assignee: Yuliya Feldman In the sandbox (single-node) environment with LinuxContainerExecutor, when the first Application Localization attempt fails, the second attempt cannot proceed, and the application subsequently hangs until the RM kills it as non-responding. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587044#comment-14587044 ] Tsuyoshi Ozawa commented on YARN-3798: -- [~varun_saxena] Thanks for your help. In addition to the ZooKeeper version, could you share the Hadoop version? Is it 2.7.0? If it's 2.7.0, we can mark this issue as a blocker for the 2.7.1 release. {quote} We unfortunately lost the zookeeper logs. {quote} The ZooKeeper log entry for a failed ZooKeeper#close() is emitted only in DEBUG mode, so it's a bit difficult to get. BTW, can I work with you to fix the corner case? I'd appreciate it if you could help me back-port the fix to the branch you're using. RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Attachments: RM.log RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587063#comment-14587063 ] Vinod Kumar Vavilapalli commented on YARN-1197: --- The details look good. Let's make sure we handle RM, AM and NM restarts correctly. Also, let's design the RM-NM protocol to be generic and common enough for regular launch/stop and increase/decrease. Tx again for driving this! Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587072#comment-14587072 ] Sandy Ryza commented on YARN-1197: -- bq. RM still needs to wait for an acknowledgement from NM to confirm that the increase is done before sending out response to AM. This will take two heartbeat cycles, but this is not much worse than giving out a token to AM first, and then letting AM initiating the increase. I would argue that waiting for an NM-RM heartbeat is much worse than waiting for an AM-RM heartbeat. With continuous scheduling, the RM can make decisions in millisecond time, and the AM can regulate its heartbeats according to the application's needs to get fast responses. If an NM-RM heartbeat is involved, the application is at the mercy of the cluster settings, which should be in the multi-second range for large clusters. Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3174) Consolidate the NodeManager documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki reassigned YARN-3174: -- Assignee: Masatake Iwasaki Consolidate the NodeManager documentation into one -- Key: YARN-3174 URL: https://issues.apache.org/jira/browse/YARN-3174 Project: Hadoop YARN Issue Type: Improvement Reporter: Allen Wittenauer Assignee: Masatake Iwasaki We really don't need a different document for every individual nodemanager feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3714) AM proxy filter can not get proper default proxy address if RM-HA is enabled
[ https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated YARN-3714: --- Attachment: YARN-3714.004.patch bq. We could check whether RM_WEBAPP_ADDRESS has been set with RM-ids. If not, we only need to check whether RM_HOSTNAME has been set with RM-ids instead of calling Yeah. On rethinking it, using HAUtil#verifyAndSetRMHAIdsList, which updates the conf, is not safe. I attached 004. AM proxy filter can not get proper default proxy address if RM-HA is enabled Key: YARN-3714 URL: https://issues.apache.org/jira/browse/YARN-3714 Project: Hadoop YARN Issue Type: Bug Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3714.001.patch, YARN-3714.002.patch, YARN-3714.003.patch, YARN-3714.004.patch Default proxy address could not be got without setting {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
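For illustration, the per-RM-id lookup described in the quoted suggestion could look like the following. HAUtil#addSuffix and the YarnConfiguration constants are real APIs, but the fallback behavior here (appending the default webapp port to the per-id hostname) is an assumption for the sketch, not necessarily what the 004 patch does.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.HAUtil;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Hypothetical sketch: prefer an explicit
// yarn.resourcemanager.webapp.address.<rm-id>, and fall back to
// yarn.resourcemanager.hostname.<rm-id> plus the default webapp port.
public final class WebAppAddressSketch {
  private WebAppAddressSketch() {}

  static String webAppAddressFor(Configuration conf, String rmId) {
    String addr = conf.get(
        HAUtil.addSuffix(YarnConfiguration.RM_WEBAPP_ADDRESS, rmId));
    if (addr == null) {
      String host = conf.get(
          HAUtil.addSuffix(YarnConfiguration.RM_HOSTNAME, rmId));
      if (host != null) {
        addr = host + ":" + YarnConfiguration.DEFAULT_RM_WEBAPP_PORT;
      }
    }
    return addr; // may be null if neither key is set for this rm-id
  }
}
{code}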
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587067#comment-14587067 ] Sandy Ryza commented on YARN-1197: -- Is my understanding correct that the broader plan is to move stopping containers out of the AM-NM protocol? Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3174) Consolidate the NodeManager documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587094#comment-14587094 ] Masatake Iwasaki commented on YARN-3174: There are 4 files referring to the NodeManager under hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown. * NodeManager.md * NodeManagerCgroups.md * NodeManagerRest.md * NodeManagerRestart.md NodeManagerCgroups.md: It is not a doc about the NodeManager as a whole; the file name is just not appropriate. It describes a feature supported only by LinuxContainerExecutor. Even if CGroups is supported by other modules in the future, it might not be specific to the NodeManager. NodeManagerRest.md: This is relatively big, and it is reasonable for it to be an independent page, the same as the other REST API docs. Consolidate the NodeManager documentation into one -- Key: YARN-3174 URL: https://issues.apache.org/jira/browse/YARN-3174 Project: Hadoop YARN Issue Type: Improvement Reporter: Allen Wittenauer Assignee: Masatake Iwasaki We really don't need a different document for every individual nodemanager feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3779) Aggregated Logs Deletion doesn't work after refreshing Log Retention Settings in secure cluster
[ https://issues.apache.org/jira/browse/YARN-3779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587165#comment-14587165 ] Zhijie Shen commented on YARN-3779: --- [~varun_saxena], do you know why the ugi is still the same but Kerberos authentication fails? Aggregated Logs Deletion doesn't work after refreshing Log Retention Settings in secure cluster -- Key: YARN-3779 URL: https://issues.apache.org/jira/browse/YARN-3779 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Environment: mrV2, secure mode Reporter: Zhang Wei Assignee: Varun Saxena Priority: Critical Attachments: YARN-3779.01.patch, YARN-3779.02.patch, log_aggr_deletion_on_refresh_error.log, log_aggr_deletion_on_refresh_fix.log {{GSSException}} is thrown every time log aggregation deletion is attempted after executing bin/mapred hsadmin -refreshLogRetentionSettings in a secure cluster. The problem can be reproduced by the following steps: 1. Start up the historyserver in a secure cluster. 2. Log deletion happens as expected. 3. Execute the {{mapred hsadmin -refreshLogRetentionSettings}} command to refresh the configuration value. 4. All subsequent attempts at log deletion fail with {{GSSException}}. The following exception can be found in the historyserver's log if log deletion is enabled. {noformat} 2015-06-04 14:14:40,070 | ERROR | Timer-3 | Error reading root log dir this deletion attempt is being aborted | AggregatedLogDeletionService.java:127 java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: vm-31/9.91.12.31; destination host is: vm-33:25000; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764) at org.apache.hadoop.ipc.Client.call(Client.java:1414) at org.apache.hadoop.ipc.Client.call(Client.java:1363) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy9.getListing(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:519) at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy10.getListing(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1767) at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1750) at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:691) at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102) at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:753) at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:749) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:749) at org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService$LogDeletionTask.run(AggregatedLogDeletionService.java:68) at java.util.TimerThread.mainLoop(Timer.java:555) at 
java.util.TimerThread.run(Timer.java:505) Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:677) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1641) at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:640) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:724) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462) at
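The symptom above (deletion works until a refresh, then every attempt fails with a missing-TGT GSSException) is the classic sign of a refreshed deletion task running without re-asserted Kerberos credentials. Purely as an illustrative sketch of the usual fix pattern, and assuming the task runs as the login user from a keytab (the names here are hypothetical, not taken from the attached patches):
{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

// Hypothetical sketch: refresh the login user's TGT before the deletion
// task touches HDFS again after a settings refresh.
public final class ReloginSketch {
  private ReloginSketch() {}

  static FileSystem remoteFsWithFreshTgt(Path remoteRootLogDir,
      Configuration conf) throws IOException {
    UserGroupInformation loginUser = UserGroupInformation.getLoginUser();
    // Renews the TGT from the keytab if needed; a no-op in non-secure mode.
    loginUser.checkTGTAndReloginFromKeytab();
    return remoteRootLogDir.getFileSystem(conf);
  }
}
{code}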
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587168#comment-14587168 ] Sandy Ryza commented on YARN-1197: -- bq. If you consider all now/future optimizations, such as continous-scheduling / scheduler make decision at same AM-RM heart-beat. (b) needs one more NM-RM heart-beat interval. I agree with you, it could be hundreds of milli-seconds (a) vs. multi-seconds (b). when the cluster is idle. To clarify: with proper tuning, we can currently get low hundreds of milliseconds without adding any new scheduler features. With the new scheduler feature I'm imagining, we'd only be limited by the RPC + scheduler time, so we could get 10s of milliseconds with proper tuning. Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587174#comment-14587174 ] Sandy Ryza commented on YARN-1197: -- Regarding complexity in the AM, the NMClient utility so far has been an API that's fairly easy for app developers to interact with. I've used it more than once and had no issues. Would we not be able to handle most of the additional complexity behind it? Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3714) AM proxy filter can not get proper default proxy address if RM-HA is enabled
[ https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587181#comment-14587181 ] Hadoop QA commented on YARN-3714: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 59s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 41s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 48s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 13s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 32s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 14s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 2m 2s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 0m 24s | Tests passed in hadoop-yarn-server-web-proxy. | | | | 42m 54s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12739710/YARN-3714.004.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 2cb09e9 | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8254/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-web-proxy test log | https://builds.apache.org/job/PreCommit-YARN-Build/8254/artifact/patchprocess/testrun_hadoop-yarn-server-web-proxy.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8254/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8254/console | This message was automatically generated. AM proxy filter can not get proper default proxy address if RM-HA is enabled Key: YARN-3714 URL: https://issues.apache.org/jira/browse/YARN-3714 Project: Hadoop YARN Issue Type: Bug Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3714.001.patch, YARN-3714.002.patch, YARN-3714.003.patch, YARN-3714.004.patch Default proxy address could not be got without setting {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3792) Test case failures in TestDistributedShell and some issue fixes related to ATSV2
[ https://issues.apache.org/jira/browse/YARN-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587183#comment-14587183 ] Sangjin Lee commented on YARN-3792: --- Thanks [~Naganarasimha] for identifying the issues and providing a patch! I applied the patch on top of the current YARN-2928 branch, rebuilt, and ran the TestDistributedShell test locally. I still see one test failing: {noformat} --- T E S T S --- Running org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell Tests run: 13, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 581.546 sec FAILURE! - in org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell testDSShellWithoutDomainV2CustomizedFlow(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) Time elapsed: 29.651 sec FAILURE! java.lang.AssertionError: Application finished event should be published atleast once expected:1 but was:0 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.verifyStringExistsSpecifiedTimes(TestDistributedShell.java:483) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.checkTimelineV2(TestDistributedShell.java:431) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShell(TestDistributedShell.java:323) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithoutDomainV2CustomizedFlow(TestDistributedShell.java:209) Results : Failed tests: TestDistributedShell.testDSShellWithoutDomainV2CustomizedFlow:209-testDSShell:323-checkTimelineV2:431-verifyStringExistsSpecifiedTimes:483 Application finished event should be published atleast once expected:1 but was:0 Tests run: 13, Failures: 1, Errors: 0, Skipped: 0 {noformat} Have you seen this? Could you kindly look into that? I'll also see if this is reproducible on my end. Some quick comments: (TestDistributedShell.java) - l.71-75: Is this comment necessary here? I'm not sure if we want to add a generic comment like this to a specific test... - l.106: Are the checks for null necessary? I thought that the test name was populated by junit and made available to test methods. Do things fail if we do not check for null? - l.376: I don't really like the sleep call as it is not completely deterministic; could there be a way to make this completely deterministic (using things like CountDownLatch, etc.)? (TimelineClientImpl.java) - l.385: nit: the C-style conditional check is not necessary; I would suggest a more natural check of {{(timelineServiceAddress == null)}} (ContainersMonitorImpl.java) - l.96: It is unrelated to this patch itself, but should we rename the variable name threadPool? It is a completely generic name. We should rename it to something like timelineWriterThreadPool or something to that effect. Let me know if you have a suggestion. 
Test case failures in TestDistributedShell and some issue fixes related to ATSV2 Key: YARN-3792 URL: https://issues.apache.org/jira/browse/YARN-3792 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Naganarasimha G R Assignee: Naganarasimha G R Attachments: YARN-3792-YARN-2928.001.patch # Encountered [testcase failures|https://builds.apache.org/job/PreCommit-YARN-Build/8233/testReport/] which were happening even without the patch modifications in YARN-3044: TestDistributedShell.testDSShellWithoutDomainV2CustomizedFlow TestDistributedShell.testDSShellWithoutDomainV2DefaultFlow TestDistributedShellWithNodeLabels.testDSShellWithNodeLabelExpression # Remove unused {{enableATSV1}} in TestDistributedShell # Container metrics need to be published only for v2 test cases of TestDistributedShell # A NullPointerException was thrown in TimelineClientImpl.constructResURI when the aux service was not configured and {{TimelineClient.putObjects}} was invoked # Race condition between the Application events being published and the test case verification of the RM's ApplicationFinished Timeline Events # Application tags were converted to lowercase in ApplicationSubmissionContextPBImpl, hence RMTimelineCollector was not able to detect the custom flow details of the app -- This message was sent by Atlassian JIRA (v6.3.4#6332)
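On the review's CountDownLatch suggestion, here is a minimal self-contained sketch of the idea; the names are hypothetical, and the real test would wire the countdown into its timeline writer stub rather than a standalone class.
{code}
import static org.junit.Assert.assertTrue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: await the publish event instead of sleeping.
public class PublishLatchSketch {
  final CountDownLatch published = new CountDownLatch(1);

  // The stubbed timeline writer would call this from its write path when the
  // ApplicationFinished event is written.
  void onApplicationFinishedWritten() {
    published.countDown();
  }

  void verifyPublished() throws InterruptedException {
    assertTrue("Application finished event should be published at least once",
        published.await(30, TimeUnit.SECONDS));
  }
}
{code}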
[jira] [Updated] (YARN-3714) AM proxy filter can not get RM webapp address from yarn.resourcemanager.hostname.rm-id
[ https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated YARN-3714: --- Summary: AM proxy filter can not get RM webapp address from yarn.resourcemanager.hostname.rm-id (was: AM proxy filter can not get proper default proxy address if RM-HA is enabled) AM proxy filter can not get RM webapp address from yarn.resourcemanager.hostname.rm-id -- Key: YARN-3714 URL: https://issues.apache.org/jira/browse/YARN-3714 Project: Hadoop YARN Issue Type: Bug Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-3714.001.patch, YARN-3714.002.patch, YARN-3714.003.patch, YARN-3714.004.patch Default proxy address could not be got without setting {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3174) Consolidate the NodeManager documentation into one
[ https://issues.apache.org/jira/browse/YARN-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated YARN-3174: --- Attachment: YARN-3174.001.patch Consolidate the NodeManager documentation into one -- Key: YARN-3174 URL: https://issues.apache.org/jira/browse/YARN-3174 Project: Hadoop YARN Issue Type: Improvement Reporter: Allen Wittenauer Assignee: Masatake Iwasaki Attachments: YARN-3174.001.patch We really don't need a different document for every individual nodemanager feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587196#comment-14587196 ] Wangda Tan commented on YARN-1197: -- bq. To clarify: with proper tuning, we can currently get low hundreds of milliseconds without adding any new scheduler features. With the new scheduler feature I'm imagining, we'd only be limited by the RPC + scheduler time, so we could get 10s of milliseconds with proper tuning. I think this assumes the cluster is quite idle. I understand the low latency could be achieved, but it's not guaranteed since we don't support oversubscription, etc. If you assume the cluster is very idle, one solution might be holding more resources from the beginning instead of increasing. In a real environment, I think the expected delay will still be at the seconds level. From YARN's perspective, (b) handles most of the logic within the YARN daemons (instead of the AM), so we don't need to consider inconsistent state between the RM and AM when doing recovery; that is really why I prefer it :). I'm not against doing (a), but I prefer to do it once we have a solid foundation for fast scheduling. I'm not sure whether any resource management platform in production supports that; some research systems such as Sparrow use a quite different protocol/approach than YARN. I expect there are still some TODO items before YARN gets guaranteed fast scheduling. Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)