[jira] [Commented] (YARN-3528) Tests with 12345 as hard-coded port break jenkins
[ https://issues.apache.org/jira/browse/YARN-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590409#comment-14590409 ] Brahma Reddy Battula commented on YARN-3528: Sorry for the delay. Based on the above proposal, I attached an initial patch. There are actually many classes with hard-coded ports, so I think we need a plan to fix all of them (any thoughts on how to plan this? Should they all go under this JIRA, or should we track them at the project level?). The classes mentioned in this JIRA are addressed for now; I will update the final patch based on your inputs. Tests with 12345 as hard-coded port break jenkins - Key: YARN-3528 URL: https://issues.apache.org/jira/browse/YARN-3528 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0 Environment: ASF Jenkins Reporter: Steve Loughran Assignee: Brahma Reddy Battula Priority: Blocker Labels: test Attachments: YARN-3528.patch A lot of the YARN tests have hard-coded the port 12345 for their services to come up on. This makes it impossible to have scheduled or precommit tests run consistently on the ASF jenkins hosts. Instead the tests fail regularly and appear to get ignored completely. A quick grep of 12345 shows up many places in the test suite where this practice has developed. * All {{BaseContainerManagerTest}} subclasses * {{TestNodeManagerShutdown}} * {{TestContainerManager}} + others This needs to be addressed through port scanning and dynamic port allocation. Please can someone do this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
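A common way for tests to avoid a hard-coded port is to bind a socket to port 0 and let the OS hand back a free ephemeral port. A minimal sketch of that idea (the helper class and method names here are illustrative, not YARN's actual test utilities):
{code}
import java.io.IOException;
import java.net.ServerSocket;

// Hypothetical helper for tests; YARN's real fix may use its own utilities.
public final class FreePortFinder {

  private FreePortFinder() {
  }

  /**
   * Asks the OS for an unused ephemeral port by binding to port 0, then
   * releases it so the test service can bind to the returned port.
   * There is a small race between closing this socket and re-binding.
   */
  public static int findFreePort() throws IOException {
    try (ServerSocket socket = new ServerSocket(0)) {
      socket.setReuseAddress(true);
      return socket.getLocalPort();
    }
  }
}
{code}
A test could then build its service address as {{"127.0.0.1:" + FreePortFinder.findFreePort()}} instead of the fixed 12345.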
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590318#comment-14590318 ] Tsuyoshi Ozawa commented on YARN-3798: -- Sorry for the delay. I took a time to investigate the behaviour of ZooKeeper yesterday. Now I'm checking the comment by Varun. RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Blocker Attachments: RM.log, YARN-3798-branch-2.7.patch RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590208#comment-14590208 ] Xuan Gong commented on YARN-3804: - +1 LGTM. Will commit Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical Attachments: YARN-3804.01.patch, YARN-3804.02.patch, YARN-3804.03.patch, YARN-3804.04.patch, YARN-3804.05.patch Steps to reproduce 1. Configure cluster in secure mode 2. On RM Configure yarn.admin.acl=dsperf 3. Configure in arn.resourcemanager.principal=yarn 4. Start Both RM Both RM will be in Standby forever {code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized userPERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 7 more {code} *Analysis* On each RM attempt to switch to Active refreshACl is called and acl permission not available for the user Infinite retry for the same switch to Active and always false returned from {{ActiveStandbyElector#becomeActive()}} *Expected* RM should get shutdown event after few retry or even at first attempt Since at runtime user from which it retries for refreshacl can never be updated. 
*States from commands* ./yarn rmadmin -getServiceState rm2 *standby* ./yarn rmadmin -getServiceState rm1 *standby* ./yarn rmadmin -checkHealth rm1 *echo $? = 0* ./yarn rmadmin -checkHealth rm2 *echo $? = 0* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
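To illustrate the analysis above: with {{yarn.admin.acl=dsperf}}, the admin ACL check rejects the RM's own kerberos principal ({{yarn}}), so {{refreshAdminAcls}} inside {{transitionToActive}} fails on every attempt. A hedged, self-contained sketch of that kind of check (not the actual AdminService code):
{code}
import java.io.IOException;

import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.authorize.AccessControlList;

// Illustrative only: shows why the check fails when the RM principal's
// short name is not covered by yarn.admin.acl.
public class AdminAclCheckExample {
  public static void main(String[] args) throws IOException {
    AccessControlList adminAcl = new AccessControlList("dsperf"); // yarn.admin.acl
    UserGroupInformation caller = UserGroupInformation.getCurrentUser(); // e.g. "yarn"
    // Prints false unless the caller is "dsperf" or in one of its groups,
    // which is why every transition-to-active attempt is rejected.
    System.out.println("admin access allowed: " + adminAcl.isUserAllowed(caller));
  }
}
{code}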
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590309#comment-14590309 ] Tsuyoshi Ozawa commented on YARN-3798: -- [~vinodkv] thank you for taking a look at this issue. {quote} If my understanding is correct, someone should edit the title. {quote} Sure. {quote} Coming to the patch: By definition, CONNECTIONLOSS also means that we should recreate the connection? {quote} IIUC, we should not recreate the connection when CONNECTIONLOSS happens; by definition the ZooKeeper client tries to reconnect automatically, since it is a recoverable error. This is described in the ZooKeeper wiki (http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling), and Curator does the same thing (see the sketch after this entry). {quote} Recoverable errors: the disconnected event, connection timed out, and the connection loss exception are examples of recoverable errors, they indicate a problem that happened, but the ZooKeeper handle is still valid and future operations will succeed once the ZooKeeper library can reestablish its connection to ZooKeeper. The ZooKeeper library does try to recover the connection, so the handle should not be closed on a recoverable error, but the application must deal with the transient error. {quote} {quote} 2. (ZKRMStateStore) Failing to zkClient.close() in ZKRMStateStore#createConnection, but IOException is ignored. I think this should be fixed in ZooKeeper. No amount of patching in YARN will fix this. {quote} I took a deeper look at the code of ZooKeeper#close. I found the IOException is not the cause. However, the way our error handling works leads to this phenomenon, as follows: # (ZKRMStateStore) CONNECTIONLOSS happens - closeZkClients is called inside createConnection. # (ZooKeeper client in ZKRMStateStore) submitRequest - wait() for the close() packet to finish. # (ZooKeeper client SendThread) An exception happens because of the timeout - the close() packet is cleaned up. The reply header of the packet carries CONNECTIONLOSS again, and the caller of close() is notified. # (ZooKeeper client in ZKRMStateStore) return to closeZkClients(). # (ZKRMStateStore) createConnection() continues normally. I think the error handling when CONNECTIONLOSS happens and the connection management on the YARN side are wrong, as described above. We should fix it on our side. Please correct me if I'm wrong. RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Blocker Attachments: RM.log, YARN-3798-branch-2.7.patch RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275)
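Following the error-handling guidance quoted above (CONNECTIONLOSS is recoverable and the existing ZooKeeper handle stays valid), a retry would reuse the same handle rather than closing it and creating a new connection. A minimal sketch under that assumption, using a plain ZooKeeper client rather than the actual ZKRMStateStore retry code; the retry count and sleep are placeholders:
{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ConnectionLossRetryExample {
  static void createWithRetries(ZooKeeper zk, String path, byte[] data, int numRetries)
      throws KeeperException, InterruptedException {
    for (int attempt = 1; ; attempt++) {
      try {
        zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        return;
      } catch (KeeperException.ConnectionLossException e) {
        // Recoverable: the client library reconnects on its own, so keep the
        // same ZooKeeper handle and retry after a short pause.
        if (attempt >= numRetries) {
          throw e;
        }
        Thread.sleep(1000L);
      }
      // SessionExpiredException (deliberately not caught here) is the
      // unrecoverable case where a new handle really is required.
    }
  }
}
{code}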
[jira] [Updated] (YARN-3047) [Data Serving] Set up ATS reader with basic request serving structure and lifecycle
[ https://issues.apache.org/jira/browse/YARN-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3047: --- Attachment: YARN-3047-YARN-2928.08.patch [Data Serving] Set up ATS reader with basic request serving structure and lifecycle --- Key: YARN-3047 URL: https://issues.apache.org/jira/browse/YARN-3047 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Varun Saxena Labels: BB2015-05-TBR Attachments: Timeline_Reader(draft).pdf, YARN-3047-YARN-2928.08.patch, YARN-3047.001.patch, YARN-3047.003.patch, YARN-3047.005.patch, YARN-3047.006.patch, YARN-3047.007.patch, YARN-3047.02.patch, YARN-3047.04.patch Per design in YARN-2938, set up the ATS reader as a service and implement the basic structure as a service. It includes lifecycle management, request serving, and so on. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3714) AM proxy filter can not get RM webapp address from yarn.resourcemanager.hostname.rm-id
[ https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590243#comment-14590243 ] Hudson commented on YARN-3714: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #229 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/229/]) YARN-3714. AM proxy filter can not get RM webapp address from (xgong: rev e27d5a13b0623e3eb43ac773eccd082b9d6fa9d0) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/amfilter/TestAmFilterInitializer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/RMHAUtils.java * hadoop-yarn-project/CHANGES.txt AM proxy filter can not get RM webapp address from yarn.resourcemanager.hostname.rm-id -- Key: YARN-3714 URL: https://issues.apache.org/jira/browse/YARN-3714 Project: Hadoop YARN Issue Type: Bug Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Fix For: 2.8.0 Attachments: YARN-3714.001.patch, YARN-3714.002.patch, YARN-3714.003.patch, YARN-3714.004.patch Default proxy address could not be got without setting {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
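For context on the fix: when only {{yarn.resourcemanager.hostname.<rm-id>}} is set, the proxy has to derive the webapp address from that hostname plus the default webapp port instead of reading it directly. A hedged sketch of such a derivation (the key names follow the standard YARN pattern, but the helper itself is illustrative, not the patched RMHAUtils code):
{code}
import org.apache.hadoop.conf.Configuration;

// Illustrative derivation of a per-RM webapp address when only the per-RM
// hostname is configured; 8088 is the stock default HTTP webapp port.
public class RmWebAppAddressExample {
  static String deriveWebAppAddress(Configuration conf, String rmId) {
    String explicit = conf.get("yarn.resourcemanager.webapp.address." + rmId);
    if (explicit != null) {
      return explicit; // explicitly configured, nothing to derive
    }
    String host = conf.get("yarn.resourcemanager.hostname." + rmId);
    if (host == null) {
      return null; // neither key set; caller must fall back further
    }
    return host + ":8088";
  }
}
{code}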
[jira] [Updated] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Grandl updated YARN-3819: Attachment: YARN-3819-2.patch Updates to DummyResourceCalculatorPlugin.java Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3819-1.patch, YARN-3819-2.patch In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2745) Extend YARN to support multi-resource packing of tasks
[ https://issues.apache.org/jira/browse/YARN-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590289#comment-14590289 ] Allen Wittenauer commented on YARN-2745: How much of this is actually YARN specific though? YARN-3819 and YARN-3820 seem like things that HDFS should care about too. It seems extremely shortsighted not to commit the collection parts into common. Extend YARN to support multi-resource packing of tasks -- Key: YARN-2745 URL: https://issues.apache.org/jira/browse/YARN-2745 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, scheduler Reporter: Robert Grandl Assignee: Robert Grandl Attachments: sigcomm_14_tetris_talk.pptx, tetris_design_doc.docx, tetris_paper.pdf In this umbrella JIRA we propose an extension to existing scheduling techniques, which accounts for all resources used by a task (CPU, memory, disk, network) and it is able to achieve three competing objectives: fairness, improve cluster utilization and reduces average job completion time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3820) Collect disks usages on the node
[ https://issues.apache.org/jira/browse/YARN-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Grandl updated YARN-3820: Attachment: YARN-3820-1.patch Added first cut patch Collect disks usages on the node Key: YARN-3820 URL: https://issues.apache.org/jira/browse/YARN-3820 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3820-1.patch In this JIRA we propose to collect disks usages on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3811) NM restarts could lead to app failures
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590164#comment-14590164 ] Vinod Kumar Vavilapalli commented on YARN-3811: --- bq. We should also consider graceful NM decommission. For graceful decommission, the RM should refrain from assigning more tasks to the node in question. Should we also prevent AMs that have already been assigned this node from starting new containers? In that case, I guess we would not be throwing NMNotYetReadyException, but another YarnException - NMShuttingDownException? [~kasha], we could. Let's file a separate JIRA? bq. we should just avoid opening or processing the client port until we've registered with the RM if it's really a problem in practice [~jlowe], this is not possible to do, as the NM needs to report the RPC server port during registration - so the server start should happen before registration. bq. 2. For NM restart with no recovery support, startContainer will fail anyways because the NMToken is not valid. bq. 3. For work-preserving RM restart, containers launched before NM re-register can be recovered on RM when NM sends the container status across. startContainer call after re-register will fail because the NMToken is not valid. [~jianhe], these two errors will be much harder for apps to process and react to than the current named exception. Further, things like auxiliary services are also not yet set up by the time the RPC server starts, and depending on how the service order changes over time, users may get different types of errors. Overall, I am in favor of keeping the named exception, with clients explicitly retrying. NM restarts could lead to app failures -- Key: YARN-3811 URL: https://issues.apache.org/jira/browse/YARN-3811 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Consider the following scenario: 1. RM assigns a container on node N to an app A. 2. Node N is restarted 3. A tries to launch container on node N. 3 could lead to an NMNotYetReadyException depending on whether NM N has registered with the RM. In MR, this is considered a task attempt failure. A few of these could lead to a task/job failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
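To make the preference above concrete (keep the named exception and have clients retry), a client-side loop might look roughly like the following. This is a simplified sketch; the real NM client proxy uses Hadoop's retry-policy machinery rather than a hand-written loop, and the retry count and sleep here are placeholders:
{code}
import java.util.concurrent.Callable;

import org.apache.hadoop.yarn.exceptions.NMNotYetReadyException;

public class NmRetryExample {
  static <T> T callWithRetry(Callable<T> nmCall, int maxAttempts) throws Exception {
    for (int attempt = 1; ; attempt++) {
      try {
        return nmCall.call();
      } catch (NMNotYetReadyException e) {
        // The NM is up but has not registered with the RM yet; this is
        // transient by design, so back off and retry instead of failing
        // the task attempt.
        if (attempt >= maxAttempts) {
          throw e;
        }
        Thread.sleep(500L);
      }
    }
  }
}
{code}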
[jira] [Commented] (YARN-3811) NM restarts could lead to app failures
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590207#comment-14590207 ] Jason Lowe commented on YARN-3811: -- bq. this is not possible to do as the NM needs to report the RPC server port during registration - so, server start should happen before registration. Yes, but that's a limitation in the RPC layer. If we could bind the server before we start it then we could know the port, register with the RM, then start the server. IMHO the RPC layer should support this, but I understand we'll have to work around the lack of that in the interim. I think we all can agree the retry exception is just a hack being used because we can't keep the client service from serving too soon. NM restarts could lead to app failures -- Key: YARN-3811 URL: https://issues.apache.org/jira/browse/YARN-3811 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Consider the following scenario: 1. RM assigns a container on node N to an app A. 2. Node N is restarted 3. A tries to launch container on node N. 3 could lead to an NMNotYetReadyException depending on whether NM N has registered with the RM. In MR, this is considered a task attempt failure. A few of these could lead to a task/job failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3148) Allow CORS related headers to passthrough in WebAppProxyServlet
[ https://issues.apache.org/jira/browse/YARN-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590237#comment-14590237 ] Hudson commented on YARN-3148: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #229 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/229/]) YARN-3148. Allow CORS related headers to passthrough in (devaraj: rev ebb9a82519c622bb898e1eec5798c2298c726694) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/main/java/org/apache/hadoop/yarn/server/webproxy/WebAppProxyServlet.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/TestWebAppProxyServlet.java * hadoop-yarn-project/CHANGES.txt Allow CORS related headers to passthrough in WebAppProxyServlet --- Key: YARN-3148 URL: https://issues.apache.org/jira/browse/YARN-3148 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Prakash Ramachandran Assignee: Varun Saxena Fix For: 2.8.0 Attachments: YARN-3148.001.patch, YARN-3148.02.patch, YARN-3148.03.patch, YARN-3148.04.patch currently the WebAppProxyServlet filters the request headers as defined by passThroughHeaders. Tez UI is building a webapp which using rest api to fetch data from the am via the rm tracking url. for this purpose it would be nice to have additional headers allowed especially the ones related to CORS. A few of them that would help are * Origin * Access-Control-Request-Method * Access-Control-Request-Headers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
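The change described amounts to adding the CORS request headers to the proxy's pass-through set; a hedged sketch of copying such headers from the incoming request to the outgoing connection (a simplified stand-in for the servlet's actual passThroughHeaders handling):
{code}
import java.net.HttpURLConnection;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import javax.servlet.http.HttpServletRequest;

public class CorsPassThroughExample {
  private static final Set<String> PASS_THROUGH_HEADERS = new HashSet<>(Arrays.asList(
      "User-Agent", "Accept", "Accept-Encoding", "Accept-Language", "Accept-Charset",
      // headers added so CORS preflight and actual requests survive the proxy
      "Origin", "Access-Control-Request-Method", "Access-Control-Request-Headers"));

  static void copyHeaders(HttpServletRequest req, HttpURLConnection conn) {
    for (String header : PASS_THROUGH_HEADERS) {
      String value = req.getHeader(header);
      if (value != null) {
        conn.setRequestProperty(header, value);
      }
    }
  }
}
{code}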
[jira] [Commented] (YARN-3617) Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1
[ https://issues.apache.org/jira/browse/YARN-3617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590375#comment-14590375 ] Hudson commented on YARN-3617: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #2159 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2159/]) YARN-3617. Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning (devaraj: rev 318d2cde7cb5c05a5f87c4ee967446bb60d28ae4) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java * hadoop-yarn-project/CHANGES.txt Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1 - Key: YARN-3617 URL: https://issues.apache.org/jira/browse/YARN-3617 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Environment: Windows 7 x64 SP1 Reporter: Georg Berendt Assignee: J.Andreina Priority: Minor Fix For: 2.8.0 Attachments: YARN-3617.1.patch Original Estimate: 1h Remaining Estimate: 1h In the class 'WindowsResourceCalculatorPlugin.java' of the YARN project, there is an unused variable for CPU frequency. /** {@inheritDoc} */ @Override public long getCpuFrequency() { refreshIfNeeded(); return -1; } Please change '-1' to use 'cpuFrequencyKhz'. org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java -- This message was sent by Atlassian JIRA (v6.3.4#6332)
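The fix requested in the description amounts to returning the already-computed field instead of the constant. A sketch of the corrected method, assuming the field is named {{cpuFrequencyKhz}} as stated above (shown as a fragment of WindowsResourceCalculatorPlugin, not a standalone class):
{code}
  /** {@inheritDoc} */
  @Override
  public long getCpuFrequency() {
    refreshIfNeeded();
    return cpuFrequencyKhz; // previously hard-coded to -1
  }
{code}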
[jira] [Commented] (YARN-3714) AM proxy filter can not get RM webapp address from yarn.resourcemanager.hostname.rm-id
[ https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590379#comment-14590379 ] Hudson commented on YARN-3714: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #2159 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2159/]) YARN-3714. AM proxy filter can not get RM webapp address from (xgong: rev e27d5a13b0623e3eb43ac773eccd082b9d6fa9d0) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/RMHAUtils.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/amfilter/TestAmFilterInitializer.java AM proxy filter can not get RM webapp address from yarn.resourcemanager.hostname.rm-id -- Key: YARN-3714 URL: https://issues.apache.org/jira/browse/YARN-3714 Project: Hadoop YARN Issue Type: Bug Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Fix For: 2.8.0 Attachments: YARN-3714.001.patch, YARN-3714.002.patch, YARN-3714.003.patch, YARN-3714.004.patch Default proxy address could not be got without setting {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3148) Allow CORS related headers to passthrough in WebAppProxyServlet
[ https://issues.apache.org/jira/browse/YARN-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590373#comment-14590373 ] Hudson commented on YARN-3148: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #2159 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2159/]) YARN-3148. Allow CORS related headers to passthrough in (devaraj: rev ebb9a82519c622bb898e1eec5798c2298c726694) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/TestWebAppProxyServlet.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/main/java/org/apache/hadoop/yarn/server/webproxy/WebAppProxyServlet.java Allow CORS related headers to passthrough in WebAppProxyServlet --- Key: YARN-3148 URL: https://issues.apache.org/jira/browse/YARN-3148 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Prakash Ramachandran Assignee: Varun Saxena Fix For: 2.8.0 Attachments: YARN-3148.001.patch, YARN-3148.02.patch, YARN-3148.03.patch, YARN-3148.04.patch currently the WebAppProxyServlet filters the request headers as defined by passThroughHeaders. Tez UI is building a webapp which using rest api to fetch data from the am via the rm tracking url. for this purpose it would be nice to have additional headers allowed especially the ones related to CORS. A few of them that would help are * Origin * Access-Control-Request-Method * Access-Control-Request-Headers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3820) Collect disks usages on the node
[ https://issues.apache.org/jira/browse/YARN-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590297#comment-14590297 ] Robert Grandl commented on YARN-3820: - [~srikanthkandula] and I are proposing to collect disk usage on a node. This is part of a larger effort on multi-resource scheduling. Currently, YARN does not have any mechanism to monitor the number of bytes read from or written to the disks. Collect disks usages on the node Key: YARN-3820 URL: https://issues.apache.org/jira/browse/YARN-3820 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util In this JIRA we propose to collect disks usages on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
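One plausible way to collect node-level disk usage on Linux is to parse /proc/diskstats and convert the sector counters into bytes. A minimal sketch under that assumption (the actual patch may read different counters or go through a ResourceCalculatorPlugin method instead):
{code}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Field 6 is sectors read and field 10 is sectors written in the documented
// /proc/diskstats format; these counters use a 512-byte sector convention.
public class DiskStatsExample {
  public static void main(String[] args) throws IOException {
    long sectorsRead = 0;
    long sectorsWritten = 0;
    List<String> lines =
        Files.readAllLines(Paths.get("/proc/diskstats"), StandardCharsets.UTF_8);
    for (String line : lines) {
      String[] f = line.trim().split("\\s+");
      if (f.length < 10 || f[2].startsWith("loop") || f[2].startsWith("ram")) {
        continue; // skip pseudo devices
      }
      // Note: this naively sums every listed device, including partitions.
      sectorsRead += Long.parseLong(f[5]);
      sectorsWritten += Long.parseLong(f[9]);
    }
    System.out.println("bytes read:    " + sectorsRead * 512L);
    System.out.println("bytes written: " + sectorsWritten * 512L);
  }
}
{code}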
[jira] [Commented] (YARN-3617) Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1
[ https://issues.apache.org/jira/browse/YARN-3617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590345#comment-14590345 ] Hudson commented on YARN-3617: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2177 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2177/]) YARN-3617. Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning (devaraj: rev 318d2cde7cb5c05a5f87c4ee967446bb60d28ae4) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java * hadoop-yarn-project/CHANGES.txt Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1 - Key: YARN-3617 URL: https://issues.apache.org/jira/browse/YARN-3617 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Environment: Windows 7 x64 SP1 Reporter: Georg Berendt Assignee: J.Andreina Priority: Minor Fix For: 2.8.0 Attachments: YARN-3617.1.patch Original Estimate: 1h Remaining Estimate: 1h In the class 'WindowsResourceCalculatorPlugin.java' of the YARN project, there is an unused variable for CPU frequency. /** {@inheritDoc} */ @Override public long getCpuFrequency() { refreshIfNeeded(); return -1; } Please change '-1' to use 'cpuFrequencyKhz'. org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3714) AM proxy filter can not get RM webapp address from yarn.resourcemanager.hostname.rm-id
[ https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590349#comment-14590349 ] Hudson commented on YARN-3714: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2177 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2177/]) YARN-3714. AM proxy filter can not get RM webapp address from (xgong: rev e27d5a13b0623e3eb43ac773eccd082b9d6fa9d0) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/RMHAUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/amfilter/TestAmFilterInitializer.java * hadoop-yarn-project/CHANGES.txt AM proxy filter can not get RM webapp address from yarn.resourcemanager.hostname.rm-id -- Key: YARN-3714 URL: https://issues.apache.org/jira/browse/YARN-3714 Project: Hadoop YARN Issue Type: Bug Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Priority: Minor Fix For: 2.8.0 Attachments: YARN-3714.001.patch, YARN-3714.002.patch, YARN-3714.003.patch, YARN-3714.004.patch Default proxy address could not be got without setting {{yarn.resourcemanager.webapp.address._rm-id_}} and/or {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3148) Allow CORS related headers to passthrough in WebAppProxyServlet
[ https://issues.apache.org/jira/browse/YARN-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590343#comment-14590343 ] Hudson commented on YARN-3148: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2177 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2177/]) YARN-3148. Allow CORS related headers to passthrough in (devaraj: rev ebb9a82519c622bb898e1eec5798c2298c726694) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/main/java/org/apache/hadoop/yarn/server/webproxy/WebAppProxyServlet.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/TestWebAppProxyServlet.java Allow CORS related headers to passthrough in WebAppProxyServlet --- Key: YARN-3148 URL: https://issues.apache.org/jira/browse/YARN-3148 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Prakash Ramachandran Assignee: Varun Saxena Fix For: 2.8.0 Attachments: YARN-3148.001.patch, YARN-3148.02.patch, YARN-3148.03.patch, YARN-3148.04.patch currently the WebAppProxyServlet filters the request headers as defined by passThroughHeaders. Tez UI is building a webapp which using rest api to fetch data from the am via the rm tracking url. for this purpose it would be nice to have additional headers allowed especially the ones related to CORS. A few of them that would help are * Origin * Access-Control-Request-Method * Access-Control-Request-Headers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Grandl updated YARN-3819: Flags: Patch Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590172#comment-14590172 ] Vinod Kumar Vavilapalli commented on YARN-3798: --- [~ozawa], bumping for my comments and those from [~varun_saxena] and to figure out if I should hold 2.7.1 for this. RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Blocker Attachments: RM.log, YARN-3798-branch-2.7.patch RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923)
[jira] [Commented] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590217#comment-14590217 ] Robert Grandl commented on YARN-3819: - [~srikanthkandula] and I are proposing to collect the network usage on a node. This is part of a larger effort on multi-resource scheduling. Previous efforts that collect network usage per container are not enough for multi-resource scheduling, since they cannot capture other traffic activity on the node, such as ingestion or evacuation. Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3819-1.patch In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
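On Linux, node-wide network usage (including the ingestion and evacuation traffic that per-container accounting misses) can be read from /proc/net/dev. A minimal sketch under that assumption; the actual patch may expose the counters through a ResourceCalculatorPlugin method instead:
{code}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Sums bytes received/transmitted across all non-loopback interfaces.
public class NetDevExample {
  public static void main(String[] args) throws IOException {
    long rxBytes = 0;
    long txBytes = 0;
    List<String> lines =
        Files.readAllLines(Paths.get("/proc/net/dev"), StandardCharsets.UTF_8);
    for (String line : lines) {
      int colon = line.indexOf(':');
      if (colon < 0) {
        continue; // skip the two header lines
      }
      String iface = line.substring(0, colon).trim();
      if (iface.equals("lo")) {
        continue; // ignore loopback traffic
      }
      String[] f = line.substring(colon + 1).trim().split("\\s+");
      rxBytes += Long.parseLong(f[0]); // received bytes
      txBytes += Long.parseLong(f[8]); // transmitted bytes
    }
    System.out.println("rx bytes: " + rxBytes + ", tx bytes: " + txBytes);
  }
}
{code}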
[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590484#comment-14590484 ] zhihai xu commented on YARN-3591: - Hi [~vvasudev], thanks for the explanation. IMHO, if we want the LocalDirsHandlerService to be the central place for the state of the local dirs, doing it in {{DirsChangeListener#onDirsChanged}} would be better. IIUC, that is also your suggestion. The benefits of doing this are: 1. Better performance, because the check runs only when some dirs become bad, which should happen rarely, rather than on every localization request. 2. It will also help with the zombie files lying in the various paths that [~lavkesh] found, a similar issue to YARN-2624. 3. {{checkLocalizedResources}}/{{removeResource}} called by {{onDirsChanged}} will be done inside {{LocalDirsHandlerService#checkDirs}} without any delay. Resource Localisation on a bad disk causes subsequent containers failure - Key: YARN-3591 URL: https://issues.apache.org/jira/browse/YARN-3591 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Lavkesh Lahngir Assignee: Lavkesh Lahngir Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch, YARN-3591.5.patch It happens when a resource is localised on a disk and, after localisation, that disk goes bad. NM keeps paths for localised resources in memory. At the time of a resource request, isResourcePresent(rsrc) will be called, which calls file.exists() on the localised path. In some cases when the disk has gone bad, inodes are still cached and file.exists() returns true. But at the time of reading, the file will not open. Note: file.exists() actually calls stat64 natively, which returns true because it was able to find inode information from the OS. A proposal is to call file.list() on the parent path of the resource, which will call open() natively. If the disk is good it should return an array of paths with length at least 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
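A minimal sketch of the check proposed in the description: call list() on the parent directory so the kernel actually has to open it, instead of relying on the cached answer from exists(). The class and method names are placeholders, not the actual NM localization code:
{code}
import java.io.File;

public class LocalizedResourceCheckExample {
  static boolean isResourcePresent(File localizedPath) {
    File parent = localizedPath.getParentFile();
    if (parent == null) {
      return false;
    }
    // list() forces an open()/readdir() on the parent directory, so a disk
    // that only "exists" through cached inode data is detected as bad here,
    // unlike localizedPath.exists(), which can return a stale true.
    String[] entries = parent.list();
    if (entries == null || entries.length == 0) {
      return false;
    }
    for (String name : entries) {
      if (name.equals(localizedPath.getName())) {
        return true;
      }
    }
    return false;
  }
}
{code}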
[jira] [Updated] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Grandl updated YARN-3819: Attachment: YARN-3819-1.patch Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3819-1.patch In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3811) NM restarts could lead to app failures
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590319#comment-14590319 ] Jian He commented on YARN-3811: --- bq. this is not possible to do as the NM needs to report the RPC server port during registration - so, server start should happen before registration. For RM work-preserving restart, this is not a problem as the NM remains as-is. For NM restart with no recovery, all outstanding containers allocated on this node are killed anyway. For NM work-preserving restart, I found the code already makes sure everything starts before the containerManager server is started: {code} if (delayedRpcServerStart) { waitForRecoveredContainers(); server.start(); {code} Overall, I think it's fine to add a client retry fix in 2.7.1; but long term I'd like to revisit this, maybe I am still missing something. NM restarts could lead to app failures -- Key: YARN-3811 URL: https://issues.apache.org/jira/browse/YARN-3811 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Consider the following scenario: 1. RM assigns a container on node N to an app A. 2. Node N is restarted 3. A tries to launch container on node N. 3 could lead to an NMNotYetReadyException depending on whether NM N has registered with the RM. In MR, this is considered a task attempt failure. A few of these could lead to a task/job failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590472#comment-14590472 ] Lei Guo commented on YARN-3819: --- For multiple resource scheduling, we may have different resource types, not just CPU/disk/network. Even for network, we may need other attributes instead of just read and write. It's better to have some generic framework in RM/NM and collect data via plug-ins. Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3819-1.patch, YARN-3819-2.patch In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
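The plug-in idea above could take a shape roughly like the following hypothetical interface; the names and metrics are invented here purely to illustrate the suggestion, nothing like this exists in YARN today:
{code}
// Hypothetical plug-in contract for node resource collectors.
public interface NodeResourceCollector {
  /** A short identifier such as "network", "disk", or "gpu". */
  String getResourceName();

  /** Refreshes counters from the OS; called periodically by the NM. */
  void refresh();

  /** Returns the named metric (e.g. "rxBytes", "sectorsWritten"), or -1 if unknown. */
  long getMetric(String metricName);
}
{code}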
[jira] [Updated] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Grandl updated YARN-3819: Attachment: YARN-3819-3.patch Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3819-1.patch, YARN-3819-2.patch, YARN-3819-3.patch In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590581#comment-14590581 ] Xuan Gong commented on YARN-3804: - [~varun_saxena] Looks like the patch does not apply for 2.7. Could you provide a patch for branch-2.7, please ? Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical Attachments: YARN-3804.01.patch, YARN-3804.02.patch, YARN-3804.03.patch, YARN-3804.04.patch, YARN-3804.05.patch Steps to reproduce 1. Configure cluster in secure mode 2. On RM Configure yarn.admin.acl=dsperf 3. Configure in arn.resourcemanager.principal=yarn 4. Start Both RM Both RM will be in Standby forever {code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized userPERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 
7 more {code} *Analysis* On each RM attempt to switch to Active refreshACl is called and acl permission not available for the user Infinite retry for the same switch to Active and always false returned from {{ActiveStandbyElector#becomeActive()}} *Expected* RM should get shutdown event after few retry or even at first attempt Since at runtime user from which it retries for refreshacl can never be updated. *States from commands* ./yarn rmadmin -getServiceState rm2 *standby* ./yarn rmadmin -getServiceState rm1 *standby* ./yarn rmadmin -checkHealth rm1 *echo $? = 0* ./yarn rmadmin -checkHealth rm2 *echo $? = 0* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2801) Documentation development for Node labels requirment
[ https://issues.apache.org/jira/browse/YARN-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2801: - Attachment: (was: YARN-2801.md) Documentation development for Node labels requirment Key: YARN-2801 URL: https://issues.apache.org/jira/browse/YARN-2801 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Gururaj Shetty Assignee: Wangda Tan Documentation needs to be developed for the node label requirements. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3806) Proposal of Generic Scheduling Framework for YARN
[ https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590827#comment-14590827 ] Karthik Kambatla commented on YARN-3806: FairScheduler supports per-queue policy. Folks could always implement their own policies. YARN-3306 aims to generalize this, starting with leaf queues, so we have a single scheduler. Proposal of Generic Scheduling Framework for YARN - Key: YARN-3806 URL: https://issues.apache.org/jira/browse/YARN-3806 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Wei Shao Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf, ProposalOfGenericSchedulingFrameworkForYARN-V1.1.pdf Currently, a typical YARN cluster runs many different kinds of applications: production applications, ad hoc user applications, long running services and so on. Different YARN scheduling policies may be suitable for different applications. For example, capacity scheduling can manage production applications well since application can get guaranteed resource share, fair scheduling can manage ad hoc user applications well since it can enforce fairness among users. However, current YARN scheduling framework doesn’t have a mechanism for multiple scheduling policies work hierarchically in one cluster. YARN-3306 talked about many issues of today’s YARN scheduling framework, and proposed a per-queue policy driven framework. In detail, it supported different scheduling policies for leaf queues. However, support of different scheduling policies for upper level queues is not seriously considered yet. A generic scheduling framework is proposed here to address these limitations. It supports different policies for any queue consistently. The proposal tries to solve many other issues in current YARN scheduling framework as well. Two new proposed scheduling policies YARN-3807 YARN-3808 are based on generic scheduling framework brought up in this proposal. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590843#comment-14590843 ] Xuan Gong commented on YARN-3804: - Committed into trunk/branch-2/branch-2.7. Thanks, [~varun_saxena]. Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical Fix For: 2.7.1 Attachments: YARN-3804.01.patch, YARN-3804.02.patch, YARN-3804.03.patch, YARN-3804.04.patch, YARN-3804.05.patch, YARN-3804.branch-2.7.patch Steps to reproduce 1. Configure cluster in secure mode 2. On RM Configure yarn.admin.acl=dsperf 3. Configure in arn.resourcemanager.principal=yarn 4. Start Both RM Both RM will be in Standby forever {code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized userPERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 
7 more {code} *Analysis* On each attempt to transition to Active, the RM calls refreshAdminAcls, but the ACL does not grant that permission to the user. The RM retries the transition to Active indefinitely, and {{ActiveStandbyElector#becomeActive()}} always returns false. *Expected* The RM should receive a shutdown event after a few retries, or even on the first attempt, since the user under which refreshAdminAcls is retried can never change at runtime. *States from commands* ./yarn rmadmin -getServiceState rm2 *standby* ./yarn rmadmin -getServiceState rm1 *standby* ./yarn rmadmin -checkHealth rm1 *echo $? = 0* ./yarn rmadmin -checkHealth rm2 *echo $? = 0* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
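For reference, a minimal sketch of one way to avoid the lock-out described above: make sure the ACL built from yarn.admin.acl always includes the daemon user the RM runs as. This is only an illustration of the idea under discussion, not the YARN-3804 patch itself; the class name below is invented, while the Hadoop APIs used (AccessControlList, UserGroupInformation, YarnConfiguration) are real.
{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.authorize.AccessControlList;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AdminAclSketch {
  // Build the admin ACL from yarn.admin.acl and, as a safety net, add the
  // daemon user the RM is running as, so internal calls such as
  // refreshAdminAcls cannot block the transition to Active.
  static AccessControlList buildAdminAcl(Configuration conf) throws IOException {
    AccessControlList acl = new AccessControlList(
        conf.get(YarnConfiguration.YARN_ADMIN_ACL,
            YarnConfiguration.DEFAULT_YARN_ADMIN_ACL));
    acl.addUser(UserGroupInformation.getCurrentUser().getShortUserName());
    return acl;
  }
}
{code}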
[jira] [Updated] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-3804: Attachment: YARN-3804.branch-2.7.patch Upload a same patch but can apply to branch-2.7 Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical Attachments: YARN-3804.01.patch, YARN-3804.02.patch, YARN-3804.03.patch, YARN-3804.04.patch, YARN-3804.05.patch, YARN-3804.branch-2.7.patch Steps to reproduce 1. Configure cluster in secure mode 2. On RM Configure yarn.admin.acl=dsperf 3. Configure in arn.resourcemanager.principal=yarn 4. Start Both RM Both RM will be in Standby forever {code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized userPERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 
7 more {code} *Analysis* On each attempt to transition to Active, the RM calls refreshAdminAcls, but the ACL does not grant that permission to the user. The RM retries the transition to Active indefinitely, and {{ActiveStandbyElector#becomeActive()}} always returns false. *Expected* The RM should receive a shutdown event after a few retries, or even on the first attempt, since the user under which refreshAdminAcls is retried can never change at runtime. *States from commands* ./yarn rmadmin -getServiceState rm2 *standby* ./yarn rmadmin -getServiceState rm1 *standby* ./yarn rmadmin -checkHealth rm1 *echo $? = 0* ./yarn rmadmin -checkHealth rm2 *echo $? = 0* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3806) Proposal of Generic Scheduling Framework for YARN
[ https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Shao updated YARN-3806: --- Description: Currently, a typical YARN cluster runs many different kinds of applications: production applications, ad hoc user applications, long running services and so on. Different YARN scheduling policies may be suitable for different applications. For example, capacity scheduling can manage production applications well since application can get guaranteed resource share, fair scheduling can manage ad hoc user applications well since it can enforce fairness among users. However, current YARN scheduling framework doesn’t have a mechanism for multiple scheduling policies work hierarchically in one cluster. YARN-3306 talked about many issues of today’s YARN scheduling framework, and proposed a per-queue policy driven framework. In detail, it supported different scheduling policies for leaf queues. However, support of different scheduling policies for upper level queues is not seriously considered yet. A generic scheduling framework is proposed here to address these limitations. It supports different policies (fair, capacity, fifo and so on) for any queue consistently. The proposal tries to solve many other issues in current YARN scheduling framework as well. Two new proposed scheduling policies YARN-3807 YARN-3808 are based on generic scheduling framework brought up in this proposal. was: Currently, a typical YARN cluster runs many different kinds of applications: production applications, ad hoc user applications, long running services and so on. Different YARN scheduling policies may be suitable for different applications. For example, capacity scheduling can manage production applications well since application can get guaranteed resource share, fair scheduling can manage ad hoc user applications well since it can enforce fairness among users. However, current YARN scheduling framework doesn’t have a mechanism for multiple scheduling policies work hierarchically in one cluster. YARN-3306 talked about many issues of today’s YARN scheduling framework, and proposed a per-queue policy driven framework. In detail, it supported different scheduling policies for leaf queues. However, support of different scheduling policies for upper level queues is not seriously considered yet. A generic scheduling framework is proposed here to address these limitations. It supports different policies for any queue consistently. The proposal tries to solve many other issues in current YARN scheduling framework as well. Two new proposed scheduling policies YARN-3807 YARN-3808 are based on generic scheduling framework brought up in this proposal. Proposal of Generic Scheduling Framework for YARN - Key: YARN-3806 URL: https://issues.apache.org/jira/browse/YARN-3806 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Wei Shao Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf, ProposalOfGenericSchedulingFrameworkForYARN-V1.1.pdf Currently, a typical YARN cluster runs many different kinds of applications: production applications, ad hoc user applications, long running services and so on. Different YARN scheduling policies may be suitable for different applications. For example, capacity scheduling can manage production applications well since application can get guaranteed resource share, fair scheduling can manage ad hoc user applications well since it can enforce fairness among users. 
However, current YARN scheduling framework doesn’t have a mechanism for multiple scheduling policies work hierarchically in one cluster. YARN-3306 talked about many issues of today’s YARN scheduling framework, and proposed a per-queue policy driven framework. In detail, it supported different scheduling policies for leaf queues. However, support of different scheduling policies for upper level queues is not seriously considered yet. A generic scheduling framework is proposed here to address these limitations. It supports different policies (fair, capacity, fifo and so on) for any queue consistently. The proposal tries to solve many other issues in current YARN scheduling framework as well. Two new proposed scheduling policies YARN-3807 YARN-3808 are based on generic scheduling framework brought up in this proposal. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590914#comment-14590914 ] Junping Du commented on YARN-3045: -- Hi [~Naganarasimha], given that YARN-3044 is already in, would you mind updating the patch here? Thanks! [Event producers] Implement NM writing container lifecycle events to ATS Key: YARN-3045 URL: https://issues.apache.org/jira/browse/YARN-3045 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Naganarasimha G R Labels: BB2015-05-TBR Attachments: YARN-3045-YARN-2928.002.patch, YARN-3045-YARN-2928.003.patch, YARN-3045.20150420-1.patch Per design in YARN-2928, implement NM writing container lifecycle events and container system metrics to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590852#comment-14590852 ] Karthik Kambatla commented on YARN-3809: Or, could we make it so we don't wait as long as 15 minutes? Failed to launch new attempts because ApplicationMasterLauncher's threads all hang -- Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3809.01.patch ApplicationMasterLauncher creates a thread pool of size 10 to handle AMLauncherEventType events (LAUNCH and CLEANUP). In our cluster there were many NMs with 10+ AMs running on them, and one NM shut down for some reason. After the RM marked the NM as LOST, it cleaned up the AMs running on it, so ApplicationMasterLauncher had to handle these 10+ CLEANUP events. Its thread pool filled up, and all the threads hung in containerMgrProxy.stopContainers(stopRequest) because the NM was down and the default RPC timeout is 15 minutes. That means that for 15 minutes ApplicationMasterLauncher could not handle new events such as LAUNCH, so new attempts failed to launch because of the timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590849#comment-14590849 ] Karthik Kambatla commented on YARN-3809: Shouldn't the number of threads in the pool be at least as big as the maximum number of apps that could run on a node? By making it configurable, how do we expect the admins to pick this number? Just pick an arbitrarily high value? Failed to launch new attempts because ApplicationMasterLauncher's threads all hang -- Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3809.01.patch ApplicationMasterLauncher creates a thread pool of size 10 to handle AMLauncherEventType events (LAUNCH and CLEANUP). In our cluster there were many NMs with 10+ AMs running on them, and one NM shut down for some reason. After the RM marked the NM as LOST, it cleaned up the AMs running on it, so ApplicationMasterLauncher had to handle these 10+ CLEANUP events. Its thread pool filled up, and all the threads hung in containerMgrProxy.stopContainers(stopRequest) because the NM was down and the default RPC timeout is 15 minutes. That means that for 15 minutes ApplicationMasterLauncher could not handle new events such as LAUNCH, so new attempts failed to launch because of the timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
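To make the trade-off above concrete, here is a hedged sketch of what a configurable launcher pool would look like; the configuration key and class name are hypothetical (YARN-3809 is still debating whether such a knob, a larger default, or a shorter wait is the right answer):
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.conf.Configuration;

public class AMLauncherPoolSketch {
  // Hypothetical key, used only for illustration.
  static final String LAUNCHER_THREAD_COUNT = "yarn.resourcemanager.amlauncher.thread-count";
  static final int DEFAULT_LAUNCHER_THREAD_COUNT = 10;

  static ExecutorService createLauncherPool(Configuration conf) {
    int size = conf.getInt(LAUNCHER_THREAD_COUNT, DEFAULT_LAUNCHER_THREAD_COUNT);
    // Each LAUNCH/CLEANUP event runs as one task. If stopContainers() blocks
    // on a dead NM for the full RPC timeout, only `size` events can be stuck
    // at once before new LAUNCH events start queueing behind them.
    return Executors.newFixedThreadPool(size);
  }
}
{code}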
[jira] [Commented] (YARN-3812) TestRollingLevelDBTimelineStore fails in trunk due to HADOOP-11347
[ https://issues.apache.org/jira/browse/YARN-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590893#comment-14590893 ] Colin Patrick McCabe commented on YARN-3812: bq. Yes, ImmutableFsPermission should not be overriding applyUMask since the method does not actually mutate the object. readFields does mutate and therefore is appropriate for preventing invocation for constant objects. I agree. It seems that {{FsPermission#ImmutablePermission}} is incorrectly overriding {{FsPermission#applyUMask}}. There is no reason to override this method since it doesn't modify the {{FsPermission}}. The right fix should be to simply stop overriding that method. Do you want to move the JIRA over to Hadoop-common and post a patch for that? TestRollingLevelDBTimelineStore fails in trunk due to HADOOP-11347 -- Key: YARN-3812 URL: https://issues.apache.org/jira/browse/YARN-3812 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 3.0.0 Reporter: Robert Kanter Assignee: Bibin A Chundatt Attachments: 0001-YARN-3812.patch {{TestRollingLevelDBTimelineStore}} is failing with the below errors in trunk. I did a git bisect and found that it was due to HADOOP-11347, which changed something with umasks in {{FsPermission}}. {noformat} Running org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore Tests run: 16, Failures: 0, Errors: 16, Skipped: 0, Time elapsed: 2.65 sec FAILURE! - in org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore testGetDomains(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore) Time elapsed: 1.533 sec ERROR! java.lang.UnsupportedOperationException: null at org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380) at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200) at org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65) testRelatingToNonExistingEntity(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore) Time elapsed: 0.085 sec ERROR! 
java.lang.UnsupportedOperationException: null at org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380) at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200) at org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65) testValidateConfig(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore) Time elapsed: 0.07 sec ERROR! java.lang.UnsupportedOperationException: null at org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380) at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207) at
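For context, a hedged, simplified sketch of what the fix discussed above amounts to. FsPermission#applyUMask returns a new FsPermission and never mutates the receiver, so an immutable subclass has nothing to protect there and should simply stop overriding it; readFields does mutate and remains a legitimate method to block. This is an illustration, not the actual Hadoop source.
{code}
import java.io.DataInput;
import java.io.IOException;
import org.apache.hadoop.fs.permission.FsPermission;

// Simplified stand-in for the ImmutableFsPermission inner class.
class ImmutableFsPermissionSketch extends FsPermission {
  ImmutableFsPermissionSketch(short mode) {
    super(mode);
  }

  // Note: applyUMask() is intentionally NOT overridden. The inherited method
  // returns a new FsPermission and leaves this object untouched, so throwing
  // UnsupportedOperationException there (the current behaviour) is what breaks
  // RawLocalFileSystem.mkdirs() and the test above.

  @Override
  public void readFields(DataInput in) throws IOException {
    // readFields() really does mutate the permission, so blocking it on an
    // immutable instance is still appropriate.
    throw new UnsupportedOperationException();
  }
}
{code}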
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590747#comment-14590747 ] Eric Payne commented on YARN-2902: -- Hi [~varun_saxena]. Thank you very much for working on and fixing this issue. We are looking forward to your next patch. Do you have an ETA for when that might be? Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2745) Extend YARN to support multi-resource packing of tasks
[ https://issues.apache.org/jira/browse/YARN-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590858#comment-14590858 ] Karthik Kambatla commented on YARN-2745: YARN-3332 tracks the work required to move all this collection from within Yarn to a service that HDFS could also use. We are just getting the collection bits in first, and plan to consolidate and move things around after. Extend YARN to support multi-resource packing of tasks -- Key: YARN-2745 URL: https://issues.apache.org/jira/browse/YARN-2745 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, scheduler Reporter: Robert Grandl Assignee: Robert Grandl Attachments: sigcomm_14_tetris_talk.pptx, tetris_design_doc.docx, tetris_paper.pdf In this umbrella JIRA we propose an extension to existing scheduling techniques, which accounts for all resources used by a task (CPU, memory, disk, network) and it is able to achieve three competing objectives: fairness, improve cluster utilization and reduces average job completion time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3798) ZKRMStateStore shouldn't create new session without occurrence of SESSIONEXPIRED
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3798: - Summary: ZKRMStateStore shouldn't create new session without occurrance of SESSIONEXPIED (was: RM shutdown with NoNode exception while updating appAttempt on zk) ZKRMStateStore shouldn't create new session without occurrance of SESSIONEXPIED --- Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Blocker Attachments: RM.log, YARN-3798-branch-2.7.002.patch, YARN-3798-branch-2.7.patch RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at
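The new summary captures a ZooKeeper session-handling rule: a Disconnected event is transient and the client should simply wait to reconnect with the same session, whereas only an Expired event justifies building a new session. A hedged, generic illustration of that rule (not the ZKRMStateStore code itself):
{code}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

class SessionWatcherSketch implements Watcher {
  @Override
  public void process(WatchedEvent event) {
    switch (event.getState()) {
      case SyncConnected:
        // Connected, or reconnected with the SAME session: safe to retry pending ops.
        break;
      case Disconnected:
        // Transient network problem: do NOT create a new session here,
        // just wait for the client to reconnect.
        break;
      case Expired:
        // Only now is it correct to discard the old handle and start a new session.
        break;
      default:
        break;
    }
  }
}
{code}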
[jira] [Updated] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers
[ https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-433: --- Attachment: YARN-433.2.patch Fix the testcase failure When RM is catching up with node updates then it should not expire acquired containers -- Key: YARN-433 URL: https://issues.apache.org/jira/browse/YARN-433 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-433.1.patch, YARN-433.2.patch RM expires containers that are not launched within some time of being allocated. The default is 10mins. When an RM is not keeping up with node updates then it may not be aware of new launched containers. If the expire thread fires for such containers then the RM can expire them even though they may have launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3617) Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1
[ https://issues.apache.org/jira/browse/YARN-3617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590239#comment-14590239 ] Hudson commented on YARN-3617: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #229 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/229/]) YARN-3617. Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning (devaraj: rev 318d2cde7cb5c05a5f87c4ee967446bb60d28ae4) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java * hadoop-yarn-project/CHANGES.txt Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1 - Key: YARN-3617 URL: https://issues.apache.org/jira/browse/YARN-3617 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Environment: Windows 7 x64 SP1 Reporter: Georg Berendt Assignee: J.Andreina Priority: Minor Fix For: 2.8.0 Attachments: YARN-3617.1.patch Original Estimate: 1h Remaining Estimate: 1h In the class 'WindowsResourceCalculatorPlugin.java' of the YARN project, there is an unused variable for CPU frequency. /** {@inheritDoc} */ @Override public long getCpuFrequency() { refreshIfNeeded(); return -1; } Please change '-1' to use 'cpuFrequencyKhz'. org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java -- This message was sent by Atlassian JIRA (v6.3.4#6332)
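A hedged sketch of the one-line change that was requested, assuming (per the description) that the class already maintains a cpuFrequencyKhz field populated by refreshIfNeeded(); shown as a fragment matching the method quoted above rather than a complete class:
{code}
/** {@inheritDoc} */
@Override
public long getCpuFrequency() {
  refreshIfNeeded();
  return cpuFrequencyKhz; // previously hard-coded to -1
}
{code}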
[jira] [Created] (YARN-3820) Collect disks usages on the node
Robert Grandl created YARN-3820: --- Summary: Collect disks usages on the node Key: YARN-3820 URL: https://issues.apache.org/jira/browse/YARN-3820 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl In this JIRA we propose to collect disks usages on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3528) Tests with 12345 as hard-coded port break jenkins
[ https://issues.apache.org/jira/browse/YARN-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3528: --- Attachment: YARN-3528.patch Tests with 12345 as hard-coded port break jenkins - Key: YARN-3528 URL: https://issues.apache.org/jira/browse/YARN-3528 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0 Environment: ASF Jenkins Reporter: Steve Loughran Assignee: Brahma Reddy Battula Priority: Blocker Labels: test Attachments: YARN-3528.patch A lot of the YARN tests have hard-coded the port 12345 for their services to come up on. This makes it impossible to have scheduled or precommit tests to run consistently on the ASF jenkins hosts. Instead the tests fail regularly and appear to get ignored completely. A quick grep of 12345 shows up many places in the test suite where this practise has developed. * All {{BaseContainerManagerTest}} subclasses * {{TestNodeManagerShutdown}} * {{TestContainerManager}} + others This needs to be addressed through portscanning and dynamic port allocation. Please can someone do this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590555#comment-14590555 ] Robert Grandl commented on YARN-3819: - Short description of this JIRA: We process the /proc/net/dev file, which reports, for every network interface present on the node, the cumulative number of bytes read and written. We aggregate these numbers across all interfaces except loopback. We verified that this file exists on the following Linux kernel versions: Linux 3.2.0, Linux 2.6.32, Linux 3.13.0. Further searching on the web suggests people are already using and recommending this file for extracting network read/write byte counters. Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3819-1.patch, YARN-3819-2.patch, YARN-3819-3.patch In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
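A minimal, hedged sketch of the parsing described above. It assumes the standard /proc/net/dev layout, where the first counter after the interface name is receive bytes and the ninth is transmit bytes; the class name is invented and this is not the attached patch:
{code}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ProcNetDevSketch {
  /** Returns {totalBytesRead, totalBytesWritten} summed over all non-loopback interfaces. */
  static long[] readNetworkCounters() throws IOException {
    long read = 0;
    long written = 0;
    try (BufferedReader br = new BufferedReader(new FileReader("/proc/net/dev"))) {
      String line;
      while ((line = br.readLine()) != null) {
        int colon = line.indexOf(':');
        if (colon < 0) {
          continue; // header lines
        }
        String iface = line.substring(0, colon).trim();
        if (iface.equals("lo")) {
          continue; // skip loopback, as described in the comment above
        }
        String[] fields = line.substring(colon + 1).trim().split("\\s+");
        read += Long.parseLong(fields[0]);    // receive bytes
        written += Long.parseLong(fields[8]); // transmit bytes
      }
    }
    return new long[] {read, written};
  }
}
{code}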
[jira] [Updated] (YARN-2801) Documentation development for Node labels requirment
[ https://issues.apache.org/jira/browse/YARN-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2801: - Attachment: YARN-2801.1.patch Fixed a bunch of formatting issues and reattached patch. Documentation development for Node labels requirment Key: YARN-2801 URL: https://issues.apache.org/jira/browse/YARN-2801 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Gururaj Shetty Assignee: Wangda Tan Attachments: YARN-2801.1.patch Documentation needs to be developed for the node label requirements. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3528) Tests with 12345 as hard-coded port break jenkins
[ https://issues.apache.org/jira/browse/YARN-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590562#comment-14590562 ] Varun Saxena commented on YARN-3528: Thanks for updating the patch [~brahmareddy]. A few comments. # Move {{ServerSocketUtil}} to {{org.apache.hadoop.net}} instead of having it in {{org.apache.hadoop.fs}}. # As this is a utility class which might be used elsewhere as well, can we pass an initial port into {{getPort()}} and try that port first before randomizing? We can use this instead of using 49152 every time. # We can probably pass the number of retries as well instead of fixing it at 10. Let the caller decide. Thoughts? # Replace {{System.out.println}} with {{LOG}}. # In the catch block there is no need to create a new {{IOException}} wrapping the caught exception. You can rethrow the caught exception directly, as it is also an {{IOException}} and you are not adding any additional information. # What's the point of having {{getFreePort()}}? You can write 0 directly in the code instead of calling this function, or use a constant. Thoughts? # If this class is to be used only by tests, can we move it to the test folder? # In {{TestNodeManagerShutdown}}, catching {{RuntimeException}} is unnecessary. # In {{TestNodeManagerShutdown#startContainer}}, if an exception is thrown (i.e. no free port is available), the code simply continues on to call {{rpc.getProxy()}} with a {{null}} containerManagerBindAddress. We can probably throw an exception so that the test fails at the correct location. # The line below can be removed from {{BaseContainerManagerTest}}: {code} // String bindAddress = 0.0.0.0:12345; {code} Tests with 12345 as hard-coded port break jenkins - Key: YARN-3528 URL: https://issues.apache.org/jira/browse/YARN-3528 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0 Environment: ASF Jenkins Reporter: Steve Loughran Assignee: Brahma Reddy Battula Priority: Blocker Labels: test Attachments: YARN-3528.patch A lot of the YARN tests have hard-coded the port 12345 for their services to come up on. This makes it impossible to have scheduled or precommit tests to run consistently on the ASF jenkins hosts. Instead the tests fail regularly and appear to get ignored completely. A quick grep of 12345 shows up many places in the test suite where this practise has developed. * All {{BaseContainerManagerTest}} subclasses * {{TestNodeManagerShutdown}} * {{TestContainerManager}} + others This needs to be addressed through portscanning and dynamic port allocation. Please can someone do this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
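To make review comments 2 and 3 above concrete, here is a hedged sketch of a getPort(preferredPort, retries) utility that tries the caller's preferred port first and then falls back to random ports in the dynamic range, with the retry count chosen by the caller. The names and package placement are illustrative only and do not reflect the final ServerSocketUtil.
{code}
import java.io.IOException;
import java.net.ServerSocket;
import java.util.Random;

public class ServerSocketUtilSketch {
  private static final Random RANDOM = new Random();

  /** Try preferredPort first, then up to `retries` random ports in the dynamic range. */
  public static int getPort(int preferredPort, int retries) throws IOException {
    int port = preferredPort;
    for (int i = 0; i <= retries; i++) {
      try (ServerSocket socket = new ServerSocket(port)) {
        return socket.getLocalPort();
      } catch (IOException e) {
        // Port is busy: pick a random port in the 49152-65535 range and retry.
        port = 49152 + RANDOM.nextInt(65535 - 49152);
      }
    }
    throw new IOException("Could not find a free port after " + retries + " retries");
  }
}
{code}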
[jira] [Commented] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590587#comment-14590587 ] Srikanth Kandula commented on YARN-3819: [~grey] The patch does have the generic component, in that it needs /proc/net... It would be possible to expose whatever additional fields end up being needed by schedulers or monitors. We only expose a first cut of them (total read/ written). Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3819-1.patch, YARN-3819-2.patch, YARN-3819-3.patch In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590600#comment-14590600 ] Lei Guo commented on YARN-3819: --- Thanks for the explanation. My concern is mainly that we will need to update the code whenever new resource information is needed for scheduling purposes. If we had a generic framework, an integration developer could write a script to feed information into the NM, and the RM could then schedule based on that; this is part of my comment in YARN-3332, https://issues.apache.org/jira/browse/YARN-3332?focusedCommentId=14355923page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14355923 Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3819-1.patch, YARN-3819-2.patch, YARN-3819-3.patch In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3244) Add user specified information for clean-up container in ApplicationSubmissionContext
[ https://issues.apache.org/jira/browse/YARN-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-3244: Attachment: YARN-3244.5.patch rebase the patch Add user specified information for clean-up container in ApplicationSubmissionContext - Key: YARN-3244 URL: https://issues.apache.org/jira/browse/YARN-3244 Project: Hadoop YARN Issue Type: Sub-task Components: api Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3244.1.patch, YARN-3244.2.patch, YARN-3244.3.patch, YARN-3244.4.patch, YARN-3244.5.patch To launch user-specified clean up container, users need to provide proper informations to YARN. It should at least have following properties: * A flag to indicate whether needs to launch the clean-up container * A time-out period to indicate how long the clean-up container can run * maxRetry times * containerLaunchContext for clean-up container -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3656) LowCost: A Cost-Based Placement Agent for YARN Reservations
[ https://issues.apache.org/jira/browse/YARN-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590699#comment-14590699 ] Carlo Curino commented on YARN-3656: Patch looks good to me (as you followed all previous rounds of advises we discussed). Looking at it, I would argue that the structure of a solution you provide, i.e., decomposing the placement agent into various sub-routines and a choice of policies for time-range selection (IStageEarliestStart) , and single-atom allocation (IStageAllocator) is quite elegant. I would propose, in fact, to gut the existing GreedyReservationAgent and turn it into a simple configuration class that selects the right set of policies for IStageAllocator, IStageEarliestStart and iteratively run over those. This would be cleaner, and make it easier for people to improve on specific sub-set of this problem. If you agree with this, you must ensure the behavior is identical to the current GreedyReservationAgent. You should be able to do this fairly easily given the infrastructure you have for your agents and the testing harnesses you have in place. My 2 cents... LowCost: A Cost-Based Placement Agent for YARN Reservations --- Key: YARN-3656 URL: https://issues.apache.org/jira/browse/YARN-3656 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Ishai Menache Assignee: Jonathan Yaniv Labels: capacity-scheduler, resourcemanager Attachments: LowCostRayonExternal.pdf, YARN-3656-v1.patch, lowcostrayonexternal_v2.pdf YARN-1051 enables SLA support by allowing users to reserve cluster capacity ahead of time. YARN-1710 introduced a greedy agent for placing user reservations. The greedy agent makes fast placement decisions but at the cost of ignoring the cluster committed resources, which might result in blocking the cluster resources for certain periods of time, and in turn rejecting some arriving jobs. We propose LowCost – a new cost-based planning algorithm. LowCost “spreads” the demand of the job throughout the allowed time-window according to a global, load-based cost function. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
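A hedged sketch of the decomposition described above: a single agent configured with a time-range-selection policy and a per-stage allocation policy, so the greedy agent becomes just one choice of the two. Only the names IStageEarliestStart and IStageAllocator come from the discussion; the signatures and types below are invented for illustration and differ from the YARN-3656 patch.
{code}
import java.util.ArrayList;
import java.util.List;

// Illustrative shapes only; the real interfaces in the patch differ.
interface IStageEarliestStart {
  long pickEarliestStart(Object stage, long jobArrival, long jobDeadline);
}

interface IStageAllocator {
  Object allocateStage(Object stage, long earliestStart, long deadline);
}

class ConfigurableReservationAgentSketch {
  private final IStageEarliestStart startPolicy;
  private final IStageAllocator allocator;

  ConfigurableReservationAgentSketch(IStageEarliestStart startPolicy,
      IStageAllocator allocator) {
    this.startPolicy = startPolicy;
    this.allocator = allocator;
  }

  // A greedy agent and a cost-based (LowCost) agent would share this loop and
  // differ only in which policy implementations they are configured with.
  List<Object> place(List<Object> stages, long arrival, long deadline) {
    List<Object> allocations = new ArrayList<Object>();
    for (Object stage : stages) {
      long start = startPolicy.pickEarliestStart(stage, arrival, deadline);
      allocations.add(allocator.allocateStage(stage, start, deadline));
    }
    return allocations;
  }
}
{code}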
[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590576#comment-14590576 ] Hadoop QA commented on YARN-3591: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 14s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 46s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 52s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 36s | The applied patch generated 2 new checkstyle issues (total was 172, now 174). | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 35s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 13s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 6m 11s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 44m 29s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12740121/YARN-3591.5.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 6e3fcff | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8272/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8272/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8272/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8272/console | This message was automatically generated. Resource Localisation on a bad disk causes subsequent containers failure - Key: YARN-3591 URL: https://issues.apache.org/jira/browse/YARN-3591 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Lavkesh Lahngir Assignee: Lavkesh Lahngir Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch, YARN-3591.5.patch It happens when a resource is localised on the disk, after localising that disk has gone bad. NM keeps paths for localised resources in memory. At the time of resource request isResourcePresent(rsrc) will be called which calls file.exists() on the localised path. In some cases when disk has gone bad, inodes are stilled cached and file.exists() returns true. But at the time of reading, file will not open. Note: file.exists() actually calls stat64 natively which returns true because it was able to find inode information from the OS. A proposal is to call file.list() on the parent path of the resource, which will call open() natively. 
If the disk is good, it should return an array of paths with length at least 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
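As a hedged illustration of the proposal in the description (not the attached patch): instead of trusting file.exists(), list the parent directory, which forces a real open() on the directory and fails on a bad disk.
{code}
import java.io.File;

public class LocalResourceCheckSketch {
  /**
   * Returns true only if the parent directory of the localized path can
   * actually be read. On a bad disk, file.exists() may still return true from
   * cached inode data, while File.list() forces an open() and returns null.
   */
  static boolean isResourcePresent(File localizedPath) {
    File parent = localizedPath.getParentFile();
    if (parent == null) {
      return false;
    }
    String[] entries = parent.list();
    return entries != null && entries.length >= 1 && localizedPath.exists();
  }
}
{code}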
[jira] [Updated] (YARN-3820) Collect disks usages on the node
[ https://issues.apache.org/jira/browse/YARN-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Grandl updated YARN-3820: Attachment: YARN-3820-2.patch Collect disks usages on the node Key: YARN-3820 URL: https://issues.apache.org/jira/browse/YARN-3820 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3820-1.patch, YARN-3820-2.patch In this JIRA we propose to collect disks usages on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3244) Add user specified information for clean-up container in ApplicationSubmissionContext
[ https://issues.apache.org/jira/browse/YARN-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590702#comment-14590702 ] Xuan Gong commented on YARN-3244: - [~jianhe] Please take a look. Add user specified information for clean-up container in ApplicationSubmissionContext - Key: YARN-3244 URL: https://issues.apache.org/jira/browse/YARN-3244 Project: Hadoop YARN Issue Type: Sub-task Components: api Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3244.1.patch, YARN-3244.2.patch, YARN-3244.3.patch, YARN-3244.4.patch, YARN-3244.5.patch To launch user-specified clean up container, users need to provide proper informations to YARN. It should at least have following properties: * A flag to indicate whether needs to launch the clean-up container * A time-out period to indicate how long the clean-up container can run * maxRetry times * containerLaunchContext for clean-up container -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2801) Documentation development for Node labels requirment
[ https://issues.apache.org/jira/browse/YARN-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2801: - Attachment: YARN-2801.md Attached initial version for review. Documentation development for Node labels requirment Key: YARN-2801 URL: https://issues.apache.org/jira/browse/YARN-2801 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Gururaj Shetty Assignee: Wangda Tan Attachments: YARN-2801.md Documentation needs to be developed for the node label requirements. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591120#comment-14591120 ] Ray Chiang commented on YARN-3069: -- I found one more important mismatch in the existing file. XML Property: yarn.scheduler.maximum-allocation-vcores XML Value:32 Config Name: DEFAULT_RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES Config Value: 4 The Config value comes from YARN-193 and the default xml property comes from YARN-2. Should we keep it this way or should one of the values get updated? Document missing properties in yarn-default.xml --- Key: YARN-3069 URL: https://issues.apache.org/jira/browse/YARN-3069 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Ray Chiang Assignee: Ray Chiang Labels: BB2015-05-TBR, supportability Attachments: YARN-3069.001.patch, YARN-3069.002.patch, YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch The following properties are currently not defined in yarn-default.xml. These properties should either be A) documented in yarn-default.xml OR B) listed as an exception (with comments, e.g. for internal use) in the TestYarnConfigurationFields unit test Any comments for any of the properties below are welcome. org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore security.applicationhistory.protocol.acl yarn.app.container.log.backups yarn.app.container.log.dir yarn.app.container.log.filesize yarn.client.app-submission.poll-interval yarn.client.application-client-protocol.poll-timeout-ms yarn.is.minicluster yarn.log.server.url yarn.minicluster.control-resource-monitoring yarn.minicluster.fixed.ports yarn.minicluster.use-rpc yarn.node-labels.fs-store.retry-policy-spec yarn.node-labels.fs-store.root-dir yarn.node-labels.manager-class yarn.nodemanager.container-executor.os.sched.priority.adjustment yarn.nodemanager.container-monitor.process-tree.class yarn.nodemanager.disk-health-checker.enable yarn.nodemanager.docker-container-executor.image-name yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms yarn.nodemanager.linux-container-executor.group yarn.nodemanager.log.deletion-threads-count yarn.nodemanager.user-home-dir yarn.nodemanager.webapp.https.address yarn.nodemanager.webapp.spnego-keytab-file yarn.nodemanager.webapp.spnego-principal yarn.nodemanager.windows-secure-container-executor.group yarn.resourcemanager.configuration.file-system-based-store yarn.resourcemanager.delegation-token-renewer.thread-count yarn.resourcemanager.delegation.key.update-interval yarn.resourcemanager.delegation.token.max-lifetime yarn.resourcemanager.delegation.token.renew-interval yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size yarn.resourcemanager.metrics.runtime.buckets yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs yarn.resourcemanager.reservation-system.class yarn.resourcemanager.reservation-system.enable yarn.resourcemanager.reservation-system.plan.follower yarn.resourcemanager.reservation-system.planfollower.time-step yarn.resourcemanager.rm.container-allocation.expiry-interval-ms yarn.resourcemanager.webapp.spnego-keytab-file yarn.resourcemanager.webapp.spnego-principal yarn.scheduler.include-port-in-node-name yarn.timeline-service.delegation.key.update-interval yarn.timeline-service.delegation.token.max-lifetime 
yarn.timeline-service.delegation.token.renew-interval yarn.timeline-service.generic-application-history.enabled yarn.timeline-service.generic-application-history.fs-history-store.compression-type yarn.timeline-service.generic-application-history.fs-history-store.uri yarn.timeline-service.generic-application-history.store-class yarn.timeline-service.http-cross-origin.enabled yarn.tracking.url.generator -- This message was sent by Atlassian JIRA (v6.3.4#6332)
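To make the mismatch above concrete, a hedged snippet showing the two defaults in play. The constant names are real YarnConfiguration fields; which value a cluster actually sees (32 from yarn-default.xml versus the class default of 4) depends on whether yarn-default.xml is on the classpath, which is exactly the ambiguity being discussed.
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class MaxVcoresDefaultSketch {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // With yarn-default.xml on the classpath this reads the XML value (32);
    // the class constant is only the fallback when the XML default is absent.
    int maxVcores = conf.getInt(
        YarnConfiguration.RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES,
        YarnConfiguration.DEFAULT_RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES);
    System.out.println("effective yarn.scheduler.maximum-allocation-vcores = " + maxVcores);
    System.out.println("class default = "
        + YarnConfiguration.DEFAULT_RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES);
  }
}
{code}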
[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591165#comment-14591165 ] Akira AJISAKA commented on YARN-3069: - Nice catch! I'm thinking we can discuss the issue about the mismatch in a separate jira. Document missing properties in yarn-default.xml --- Key: YARN-3069 URL: https://issues.apache.org/jira/browse/YARN-3069 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Ray Chiang Assignee: Ray Chiang Labels: BB2015-05-TBR, supportability Attachments: YARN-3069.001.patch, YARN-3069.002.patch, YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch The following properties are currently not defined in yarn-default.xml. These properties should either be A) documented in yarn-default.xml OR B) listed as an exception (with comments, e.g. for internal use) in the TestYarnConfigurationFields unit test Any comments for any of the properties below are welcome. org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore security.applicationhistory.protocol.acl yarn.app.container.log.backups yarn.app.container.log.dir yarn.app.container.log.filesize yarn.client.app-submission.poll-interval yarn.client.application-client-protocol.poll-timeout-ms yarn.is.minicluster yarn.log.server.url yarn.minicluster.control-resource-monitoring yarn.minicluster.fixed.ports yarn.minicluster.use-rpc yarn.node-labels.fs-store.retry-policy-spec yarn.node-labels.fs-store.root-dir yarn.node-labels.manager-class yarn.nodemanager.container-executor.os.sched.priority.adjustment yarn.nodemanager.container-monitor.process-tree.class yarn.nodemanager.disk-health-checker.enable yarn.nodemanager.docker-container-executor.image-name yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms yarn.nodemanager.linux-container-executor.group yarn.nodemanager.log.deletion-threads-count yarn.nodemanager.user-home-dir yarn.nodemanager.webapp.https.address yarn.nodemanager.webapp.spnego-keytab-file yarn.nodemanager.webapp.spnego-principal yarn.nodemanager.windows-secure-container-executor.group yarn.resourcemanager.configuration.file-system-based-store yarn.resourcemanager.delegation-token-renewer.thread-count yarn.resourcemanager.delegation.key.update-interval yarn.resourcemanager.delegation.token.max-lifetime yarn.resourcemanager.delegation.token.renew-interval yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size yarn.resourcemanager.metrics.runtime.buckets yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs yarn.resourcemanager.reservation-system.class yarn.resourcemanager.reservation-system.enable yarn.resourcemanager.reservation-system.plan.follower yarn.resourcemanager.reservation-system.planfollower.time-step yarn.resourcemanager.rm.container-allocation.expiry-interval-ms yarn.resourcemanager.webapp.spnego-keytab-file yarn.resourcemanager.webapp.spnego-principal yarn.scheduler.include-port-in-node-name yarn.timeline-service.delegation.key.update-interval yarn.timeline-service.delegation.token.max-lifetime yarn.timeline-service.delegation.token.renew-interval yarn.timeline-service.generic-application-history.enabled yarn.timeline-service.generic-application-history.fs-history-store.compression-type yarn.timeline-service.generic-application-history.fs-history-store.uri 
yarn.timeline-service.generic-application-history.store-class yarn.timeline-service.http-cross-origin.enabled yarn.tracking.url.generator -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3821) Scheduler spams log with messages at INFO level
Wilfred Spiegelenburg created YARN-3821: --- Summary: Scheduler spams log with messages at INFO level Key: YARN-3821 URL: https://issues.apache.org/jira/browse/YARN-3821 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, fairscheduler Affects Versions: 2.8.0 Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg Priority: Minor The schedulers spam the logs with messages that do not provide any actionable information. No action is taken in the code and there is nothing that needs to be done from an administrative point of view. Even after the message improvements from YARN-3197 and YARN-3495, administrators get confused and ask what needs to be done to prevent the log spam. Moving these messages to debug level makes far more sense. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
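A hedged sketch of the kind of change being proposed, using the commons-logging idiom common in YARN at the time; the class, method, and message text below are made up for illustration:
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class SchedulerLoggingSketch {
  private static final Log LOG = LogFactory.getLog(SchedulerLoggingSketch.class);

  void onNodeHeartbeat(String nodeId) {
    // Before: LOG.info(...) on every heartbeat, which floods the RM log.
    // After: only emit the message when debug logging is explicitly enabled.
    if (LOG.isDebugEnabled()) {
      LOG.debug("Node " + nodeId + " heartbeat handled; no containers assigned");
    }
  }
}
{code}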
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591194#comment-14591194 ] Varun Saxena commented on YARN-3804: Thanks for the review and commit [~xgong] Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical Fix For: 2.7.1 Attachments: YARN-3804.01.patch, YARN-3804.02.patch, YARN-3804.03.patch, YARN-3804.04.patch, YARN-3804.05.patch, YARN-3804.branch-2.7.patch Steps to reproduce 1. Configure cluster in secure mode 2. On RM Configure yarn.admin.acl=dsperf 3. Configure in arn.resourcemanager.principal=yarn 4. Start Both RM Both RM will be in Standby forever {code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized userPERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 
7 more {code} *Analysis* On each RM attempt to switch to Active, refreshAdminAcls is called, but the ACL permission is not available for the user. The switch to Active is retried indefinitely, and {{ActiveStandbyElector#becomeActive()}} always returns false. *Expected* The RM should get a shutdown event after a few retries, or even at the first attempt, since the user it retries refreshAdminAcls as can never be updated at runtime. *States from commands* ./yarn rmadmin -getServiceState rm2 *standby* ./yarn rmadmin -getServiceState rm1 *standby* ./yarn rmadmin -checkHealth rm1 *echo $? = 0* ./yarn rmadmin -checkHealth rm2 *echo $? = 0* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591255#comment-14591255 ] Ray Chiang commented on YARN-3069: -- Created YARN-3823. Document missing properties in yarn-default.xml --- Key: YARN-3069 URL: https://issues.apache.org/jira/browse/YARN-3069 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Ray Chiang Assignee: Ray Chiang Labels: BB2015-05-TBR, supportability Attachments: YARN-3069.001.patch, YARN-3069.002.patch, YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch The following properties are currently not defined in yarn-default.xml. These properties should either be A) documented in yarn-default.xml OR B) listed as an exception (with comments, e.g. for internal use) in the TestYarnConfigurationFields unit test Any comments for any of the properties below are welcome. org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore security.applicationhistory.protocol.acl yarn.app.container.log.backups yarn.app.container.log.dir yarn.app.container.log.filesize yarn.client.app-submission.poll-interval yarn.client.application-client-protocol.poll-timeout-ms yarn.is.minicluster yarn.log.server.url yarn.minicluster.control-resource-monitoring yarn.minicluster.fixed.ports yarn.minicluster.use-rpc yarn.node-labels.fs-store.retry-policy-spec yarn.node-labels.fs-store.root-dir yarn.node-labels.manager-class yarn.nodemanager.container-executor.os.sched.priority.adjustment yarn.nodemanager.container-monitor.process-tree.class yarn.nodemanager.disk-health-checker.enable yarn.nodemanager.docker-container-executor.image-name yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms yarn.nodemanager.linux-container-executor.group yarn.nodemanager.log.deletion-threads-count yarn.nodemanager.user-home-dir yarn.nodemanager.webapp.https.address yarn.nodemanager.webapp.spnego-keytab-file yarn.nodemanager.webapp.spnego-principal yarn.nodemanager.windows-secure-container-executor.group yarn.resourcemanager.configuration.file-system-based-store yarn.resourcemanager.delegation-token-renewer.thread-count yarn.resourcemanager.delegation.key.update-interval yarn.resourcemanager.delegation.token.max-lifetime yarn.resourcemanager.delegation.token.renew-interval yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size yarn.resourcemanager.metrics.runtime.buckets yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs yarn.resourcemanager.reservation-system.class yarn.resourcemanager.reservation-system.enable yarn.resourcemanager.reservation-system.plan.follower yarn.resourcemanager.reservation-system.planfollower.time-step yarn.resourcemanager.rm.container-allocation.expiry-interval-ms yarn.resourcemanager.webapp.spnego-keytab-file yarn.resourcemanager.webapp.spnego-principal yarn.scheduler.include-port-in-node-name yarn.timeline-service.delegation.key.update-interval yarn.timeline-service.delegation.token.max-lifetime yarn.timeline-service.delegation.token.renew-interval yarn.timeline-service.generic-application-history.enabled yarn.timeline-service.generic-application-history.fs-history-store.compression-type yarn.timeline-service.generic-application-history.fs-history-store.uri yarn.timeline-service.generic-application-history.store-class 
yarn.timeline-service.http-cross-origin.enabled yarn.tracking.url.generator -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3823) Fix mismatch in default values for yarn.scheduler.maximum-allocation-vcores property
Ray Chiang created YARN-3823: Summary: Fix mismatch in default values for yarn.scheduler.maximum-allocation-vcores property Key: YARN-3823 URL: https://issues.apache.org/jira/browse/YARN-3823 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Reporter: Ray Chiang Assignee: Ray Chiang Priority: Minor In yarn-default.xml, the property is defined as: XML Property: yarn.scheduler.maximum-allocation-vcores XML Value: 32 In YarnConfiguration.java the corresponding member variable is defined as: Config Name: DEFAULT_RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES Config Value: 4 The Config value comes from YARN-193 and the default xml property comes from YARN-2. Should we keep it this way or should one of the values get updated? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
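To make the mismatch concrete, a side-by-side sketch of the two definitions as described above (formatting illustrative, not the exact source lines):

{code}
// In yarn-default.xml:
//   <property>
//     <name>yarn.scheduler.maximum-allocation-vcores</name>
//     <value>32</value>
//   </property>
//
// In YarnConfiguration.java:
//   public static final int DEFAULT_RM_SCHEDULER_MAXIMUM_ALLOCATION_VCORES = 4;
//
// Configuration.getInt(key, DEFAULT_...) only falls back to the Java constant when
// the key is absent, so the xml value (32) normally wins whenever yarn-default.xml
// is on the classpath -- hence the question of which of the two should be changed.
{code}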
[jira] [Created] (YARN-3825) Add automatic search of default Configuration variables to TestConfigurationFieldsBase
Ray Chiang created YARN-3825: Summary: Add automatic search of default Configuration variables to TestConfigurationFieldsBase Key: YARN-3825 URL: https://issues.apache.org/jira/browse/YARN-3825 Project: Hadoop YARN Issue Type: Test Components: test Affects Versions: 2.7.0 Reporter: Ray Chiang Assignee: Ray Chiang Add functionality so that, given a Configuration variable FOO, the test at least checks the value in the xml file against DEFAULT_FOO. Without waivers and a mapping for exceptions, this can probably never be a test method that generates actual errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
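A minimal sketch of the kind of check being proposed, assuming the usual YarnConfiguration convention that a public String member FOO holds the property key and DEFAULT_FOO holds the in-code default. This is an illustration only, not the TestConfigurationFieldsBase code, and it requires yarn-default.xml on the classpath:

{code}
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class DefaultValueCheckSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(false);
    conf.addResource("yarn-default.xml");            // xml-declared defaults
    Class<?> c = YarnConfiguration.class;
    for (Field f : c.getFields()) {
      // Only consider public static String members that look like property keys.
      if (!Modifier.isStatic(f.getModifiers()) || f.getType() != String.class
          || f.getName().startsWith("DEFAULT_")) {
        continue;
      }
      String key = (String) f.get(null);              // e.g. "yarn.scheduler.maximum-allocation-vcores"
      Field defField;
      try {
        defField = c.getField("DEFAULT_" + f.getName());
      } catch (NoSuchFieldException e) {
        continue;                                     // no DEFAULT_FOO counterpart
      }
      String xmlValue = conf.get(key);                // value from yarn-default.xml, if any
      Object codeDefault = defField.get(null);
      if (xmlValue != null && codeDefault != null
          && !xmlValue.equals(String.valueOf(codeDefault))) {
        System.out.println("Mismatch for " + key + ": xml=" + xmlValue
            + " code=" + codeDefault);
      }
    }
  }
}
{code}

As the description notes, without a waiver list this can only report mismatches (such as the YARN-3823 vcores case) rather than fail the build.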
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591285#comment-14591285 ] Varun Saxena commented on YARN-2902: Yeah this is pending for a long time. I have to primarily update tests. Got taken over by other tasks. Don't have any 2.7.1 related JIRAs' with me right now so will update a patch by this weekend. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-3069: - Attachment: YARN-3069.012.patch - Implement Akira's last 3 comments - First version including fixes from HADOOP-12101 -- Fix default in yarn.nodemanager.env-whitelist to match -- Fix spacing in two other properties to match Document missing properties in yarn-default.xml --- Key: YARN-3069 URL: https://issues.apache.org/jira/browse/YARN-3069 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Ray Chiang Assignee: Ray Chiang Labels: BB2015-05-TBR, supportability Attachments: YARN-3069.001.patch, YARN-3069.002.patch, YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch, YARN-3069.012.patch The following properties are currently not defined in yarn-default.xml. These properties should either be A) documented in yarn-default.xml OR B) listed as an exception (with comments, e.g. for internal use) in the TestYarnConfigurationFields unit test Any comments for any of the properties below are welcome. org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore security.applicationhistory.protocol.acl yarn.app.container.log.backups yarn.app.container.log.dir yarn.app.container.log.filesize yarn.client.app-submission.poll-interval yarn.client.application-client-protocol.poll-timeout-ms yarn.is.minicluster yarn.log.server.url yarn.minicluster.control-resource-monitoring yarn.minicluster.fixed.ports yarn.minicluster.use-rpc yarn.node-labels.fs-store.retry-policy-spec yarn.node-labels.fs-store.root-dir yarn.node-labels.manager-class yarn.nodemanager.container-executor.os.sched.priority.adjustment yarn.nodemanager.container-monitor.process-tree.class yarn.nodemanager.disk-health-checker.enable yarn.nodemanager.docker-container-executor.image-name yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms yarn.nodemanager.linux-container-executor.group yarn.nodemanager.log.deletion-threads-count yarn.nodemanager.user-home-dir yarn.nodemanager.webapp.https.address yarn.nodemanager.webapp.spnego-keytab-file yarn.nodemanager.webapp.spnego-principal yarn.nodemanager.windows-secure-container-executor.group yarn.resourcemanager.configuration.file-system-based-store yarn.resourcemanager.delegation-token-renewer.thread-count yarn.resourcemanager.delegation.key.update-interval yarn.resourcemanager.delegation.token.max-lifetime yarn.resourcemanager.delegation.token.renew-interval yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size yarn.resourcemanager.metrics.runtime.buckets yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs yarn.resourcemanager.reservation-system.class yarn.resourcemanager.reservation-system.enable yarn.resourcemanager.reservation-system.plan.follower yarn.resourcemanager.reservation-system.planfollower.time-step yarn.resourcemanager.rm.container-allocation.expiry-interval-ms yarn.resourcemanager.webapp.spnego-keytab-file yarn.resourcemanager.webapp.spnego-principal yarn.scheduler.include-port-in-node-name yarn.timeline-service.delegation.key.update-interval yarn.timeline-service.delegation.token.max-lifetime yarn.timeline-service.delegation.token.renew-interval yarn.timeline-service.generic-application-history.enabled 
yarn.timeline-service.generic-application-history.fs-history-store.compression-type yarn.timeline-service.generic-application-history.fs-history-store.uri yarn.timeline-service.generic-application-history.store-class yarn.timeline-service.http-cross-origin.enabled yarn.tracking.url.generator -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3824) Fix two minor nits in member variable properties of YarnConfiguration
Ray Chiang created YARN-3824: Summary: Fix two minor nits in member variable properties of YarnConfiguration Key: YARN-3824 URL: https://issues.apache.org/jira/browse/YARN-3824 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.0 Reporter: Ray Chiang Assignee: Ray Chiang Priority: Trivial Attachments: YARN-3824.001.patch Two nitpicks that could be cleaned up easily: - DEFAULT_YARN_INTERMEDIATE_DATA_ENCRYPTION is defined as a java.lang.Boolean instead of a boolean primitive - DEFAULT_RM_PROXY_USER_PRIVILEGES_ENABLED is missing the final keyword -- This message was sent by Atlassian JIRA (v6.3.4#6332)
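A before/after sketch of the two nits; the false literals are placeholders for whatever defaults YarnConfiguration actually assigns:

{code}
public class YarnConfigurationNitsSketch {
  // Before (as described above):
  //   public static final Boolean DEFAULT_YARN_INTERMEDIATE_DATA_ENCRYPTION = ...; // boxed Boolean
  //   public static boolean DEFAULT_RM_PROXY_USER_PRIVILEGES_ENABLED = ...;        // missing final
  //
  // After the proposed cleanup: primitive boolean and final on both constants.
  public static final boolean DEFAULT_YARN_INTERMEDIATE_DATA_ENCRYPTION = false;
  public static final boolean DEFAULT_RM_PROXY_USER_PRIVILEGES_ENABLED = false;
}
{code}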
[jira] [Updated] (YARN-3824) Fix two minor nits in member variable properties of YarnConfiguration
[ https://issues.apache.org/jira/browse/YARN-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-3824: - Attachment: YARN-3824.001.patch Fix two minor nits in member variable properties of YarnConfiguration - Key: YARN-3824 URL: https://issues.apache.org/jira/browse/YARN-3824 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.0 Reporter: Ray Chiang Assignee: Ray Chiang Priority: Trivial Labels: newbie Attachments: YARN-3824.001.patch Two nitpicks that could be cleaned up easily: - DEFAULT_YARN_INTERMEDIATE_DATA_ENCRYPTION is defined as a java.lang.Boolean instead of a boolean primitive - DEFAULT_RM_PROXY_USER_PRIVILEGES_ENABLED is missing the final keyword -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591295#comment-14591295 ] Ray Chiang commented on YARN-3069: -- Forgot to mention the above two comments are for the .012 patch, coming next. Document missing properties in yarn-default.xml --- Key: YARN-3069 URL: https://issues.apache.org/jira/browse/YARN-3069 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Ray Chiang Assignee: Ray Chiang Labels: BB2015-05-TBR, supportability Attachments: YARN-3069.001.patch, YARN-3069.002.patch, YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch The following properties are currently not defined in yarn-default.xml. These properties should either be A) documented in yarn-default.xml OR B) listed as an exception (with comments, e.g. for internal use) in the TestYarnConfigurationFields unit test Any comments for any of the properties below are welcome. org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore security.applicationhistory.protocol.acl yarn.app.container.log.backups yarn.app.container.log.dir yarn.app.container.log.filesize yarn.client.app-submission.poll-interval yarn.client.application-client-protocol.poll-timeout-ms yarn.is.minicluster yarn.log.server.url yarn.minicluster.control-resource-monitoring yarn.minicluster.fixed.ports yarn.minicluster.use-rpc yarn.node-labels.fs-store.retry-policy-spec yarn.node-labels.fs-store.root-dir yarn.node-labels.manager-class yarn.nodemanager.container-executor.os.sched.priority.adjustment yarn.nodemanager.container-monitor.process-tree.class yarn.nodemanager.disk-health-checker.enable yarn.nodemanager.docker-container-executor.image-name yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms yarn.nodemanager.linux-container-executor.group yarn.nodemanager.log.deletion-threads-count yarn.nodemanager.user-home-dir yarn.nodemanager.webapp.https.address yarn.nodemanager.webapp.spnego-keytab-file yarn.nodemanager.webapp.spnego-principal yarn.nodemanager.windows-secure-container-executor.group yarn.resourcemanager.configuration.file-system-based-store yarn.resourcemanager.delegation-token-renewer.thread-count yarn.resourcemanager.delegation.key.update-interval yarn.resourcemanager.delegation.token.max-lifetime yarn.resourcemanager.delegation.token.renew-interval yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size yarn.resourcemanager.metrics.runtime.buckets yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs yarn.resourcemanager.reservation-system.class yarn.resourcemanager.reservation-system.enable yarn.resourcemanager.reservation-system.plan.follower yarn.resourcemanager.reservation-system.planfollower.time-step yarn.resourcemanager.rm.container-allocation.expiry-interval-ms yarn.resourcemanager.webapp.spnego-keytab-file yarn.resourcemanager.webapp.spnego-principal yarn.scheduler.include-port-in-node-name yarn.timeline-service.delegation.key.update-interval yarn.timeline-service.delegation.token.max-lifetime yarn.timeline-service.delegation.token.renew-interval yarn.timeline-service.generic-application-history.enabled yarn.timeline-service.generic-application-history.fs-history-store.compression-type yarn.timeline-service.generic-application-history.fs-history-store.uri 
yarn.timeline-service.generic-application-history.store-class yarn.timeline-service.http-cross-origin.enabled yarn.tracking.url.generator -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1197) Support changing resources of an allocated container
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591055#comment-14591055 ] Sandy Ryza commented on YARN-1197: -- The latest proposal makes sense to me as well. Thanks [~wangda] and [~mding]! Support changing resources of an allocated container Key: YARN-1197 URL: https://issues.apache.org/jira/browse/YARN-1197 Project: Hadoop YARN Issue Type: Task Components: api, nodemanager, resourcemanager Affects Versions: 2.1.0-beta Reporter: Wangda Tan Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, YARN-1197_Design.pdf The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size. Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591075#comment-14591075 ] Hadoop QA commented on YARN-3819: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 17m 4s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:red}-1{color} | javac | 3m 3s | The patch appears to cause the build to fail. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12740201/YARN-3819-3.patch | | Optional Tests | javac unit findbugs checkstyle javadoc | | git revision | trunk / cc43288 | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8274/console | This message was automatically generated. Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3819-1.patch, YARN-3819-2.patch, YARN-3819-3.patch In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591143#comment-14591143 ] Hudson commented on YARN-3804: -- FAILURE: Integrated in Hadoop-trunk-Commit #8035 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8035/]) YARN-3804. Both RM are on standBy state when kerberos user not in yarn.admin.acl. Contributed by Varun Saxena (xgong: rev a826d432f9b45550cc5ab79ef63ca39b176dabb2) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMAdminService.java Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical Fix For: 2.7.1 Attachments: YARN-3804.01.patch, YARN-3804.02.patch, YARN-3804.03.patch, YARN-3804.04.patch, YARN-3804.05.patch, YARN-3804.branch-2.7.patch Steps to reproduce 1. Configure cluster in secure mode 2. On RM Configure yarn.admin.acl=dsperf 3. Configure in arn.resourcemanager.principal=yarn 4. Start Both RM Both RM will be in Standby forever {code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized userPERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 
5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 7 more {code} *Analysis* On each RM attempt to switch to Active refreshACl is called and acl permission not available for the user Infinite retry for the same switch to Active and always false returned from {{ActiveStandbyElector#becomeActive()}} *Expected* RM should get shutdown event after few retry or even at first attempt Since at runtime user from which it retries for refreshacl can never be updated. *States from commands* ./yarn rmadmin -getServiceState rm2 *standby* ./yarn rmadmin -getServiceState rm1 *standby* ./yarn rmadmin -checkHealth rm1 *echo $? = 0* ./yarn rmadmin -checkHealth rm2 *echo $? = 0* -- This message was sent by
[jira] [Updated] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Grandl updated YARN-3819: Attachment: YARN-3819-4.patch Updated patch to address the failure and the whitespaces Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3819-1.patch, YARN-3819-2.patch, YARN-3819-3.patch, YARN-3819-4.patch In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591193#comment-14591193 ] Varun Saxena commented on YARN-3804: [~xgong], thanks for updating branch-2.7 patch. Didn't notice your comment due to time difference. Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical Fix For: 2.7.1 Attachments: YARN-3804.01.patch, YARN-3804.02.patch, YARN-3804.03.patch, YARN-3804.04.patch, YARN-3804.05.patch, YARN-3804.branch-2.7.patch Steps to reproduce 1. Configure cluster in secure mode 2. On RM Configure yarn.admin.acl=dsperf 3. Configure in arn.resourcemanager.principal=yarn 4. Start Both RM Both RM will be in Standby forever {code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized userPERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 
7 more {code} *Analysis* On each RM attempt to switch to Active, refreshAdminAcls is called, but the ACL permission is not available for the user. The switch to Active is retried indefinitely, and {{ActiveStandbyElector#becomeActive()}} always returns false. *Expected* The RM should get a shutdown event after a few retries, or even at the first attempt, since the user it retries refreshAdminAcls as can never be updated at runtime. *States from commands* ./yarn rmadmin -getServiceState rm2 *standby* ./yarn rmadmin -getServiceState rm1 *standby* ./yarn rmadmin -checkHealth rm1 *echo $? = 0* ./yarn rmadmin -checkHealth rm2 *echo $? = 0* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591093#comment-14591093 ] Jun Gong commented on YARN-3809: [~devaraj.k] and [~kasha], thank you for the comments and suggestions. {quote} Shouldn't the number of threads in the pool be at least as big as the maximum number of apps that could run on a node? By making it configurable, how do we expect the admins to pick this number? Just pick an arbitrarily high value? {quote} Threads in the pool are just launching/stopping AMs, so it would be better if the number of threads in the pool is at least as big as the maximum number of AMs that could run on a node. Although we cannot know the maximum value for all clusters in advance, a larger value will let the launcher deal with AMLauncher events faster. Admins could just pick the default value and adjust it if they find it a little small. {quote} Or, could we make it so we don't wait as long as 15 minutes? {quote} Yes, we could make it shorter. I think we also need a larger thread pool so that it can deal with more events at the same time. Failed to launch new attempts because ApplicationMasterLauncher's threads all hang -- Key: YARN-3809 URL: https://issues.apache.org/jira/browse/YARN-3809 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3809.01.patch ApplicationMasterLauncher creates a thread pool of size 10 to deal with AMLauncherEventType (LAUNCH and CLEANUP). In our cluster, there were many NMs with 10+ AMs running on them, and one NM shut down for some reason. After the RM marked the NM LOST, it cleaned up the AMs running on it. ApplicationMasterLauncher then needed to handle these 10+ CLEANUP events. ApplicationMasterLauncher's thread pool filled up, and all its threads hung in containerMgrProxy.stopContainers(stopRequest) because the NM was down; the default RPC timeout is 15 minutes. This means that for 15 minutes ApplicationMasterLauncher could not handle new events such as LAUNCH, so new attempts failed to launch because of the timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
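A rough sketch of the configurable-pool idea discussed above; the config key, default value, and class name here are hypothetical, not the names used in the actual patch:

{code}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

public class LauncherPoolSketch {
  // Hypothetical key and default for illustration only.
  static final String LAUNCHER_THREAD_COUNT = "yarn.resourcemanager.amlauncher.thread-count";
  static final int DEFAULT_LAUNCHER_THREAD_COUNT = 50;

  static ThreadPoolExecutor createLauncherPool(Configuration conf) {
    int threads = conf.getInt(LAUNCHER_THREAD_COUNT, DEFAULT_LAUNCHER_THREAD_COUNT);
    // Unbounded queue with a fixed number of workers: a slow stopContainers() call to a
    // dead NM can still block one worker for the full RPC timeout, but with more workers
    // other LAUNCH/CLEANUP events keep flowing instead of queuing behind it.
    return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
        new LinkedBlockingQueue<Runnable>());
  }
}
{code}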
[jira] [Updated] (YARN-3821) Scheduler spams log with messages at INFO level
[ https://issues.apache.org/jira/browse/YARN-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-3821: Attachment: YARN-3821.patch Moving messages to debug level. No tests added since this is just a log level change. Scheduler spams log with messages at INFO level --- Key: YARN-3821 URL: https://issues.apache.org/jira/browse/YARN-3821 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, fairscheduler Affects Versions: 2.8.0 Reporter: Wilfred Spiegelenburg Assignee: Wilfred Spiegelenburg Priority: Minor Attachments: YARN-3821.patch The schedulers spam the logs with messages that do not provide any actionable information. No action is taken in the code, and there is nothing that needs to be done from an administrative point of view. Even after the message improvements from YARN-3197 and YARN-3495, administrators get confused and ask what needs to be done to prevent the log spam. Moving the messages to the debug log level makes far more sense. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
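The change itself is a mechanical pattern applied to each offending log statement; a sketch of the pattern on one message (the message text is illustrative, not taken from the patch):

{code}
// Before: emitted on every scheduler allocation pass, flooding the RM log at INFO.
LOG.info("Trying to fulfill reservation for application " + appId + " on node " + node);

// After: same message, only built and emitted when debug logging is enabled.
if (LOG.isDebugEnabled()) {
  LOG.debug("Trying to fulfill reservation for application " + appId + " on node " + node);
}
{code}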
[jira] [Updated] (YARN-3806) Proposal of Generic Scheduling Framework for YARN
[ https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Shao updated YARN-3806: --- Attachment: ProposalOfGenericSchedulingFrameworkForYARN-V1.2.pdf Proposal of Generic Scheduling Framework for YARN - Key: YARN-3806 URL: https://issues.apache.org/jira/browse/YARN-3806 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Wei Shao Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.2.pdf Currently, a typical YARN cluster runs many different kinds of applications: production applications, ad hoc user applications, long running services and so on. Different YARN scheduling policies may be suitable for different applications. For example, capacity scheduling can manage production applications well since application can get guaranteed resource share, fair scheduling can manage ad hoc user applications well since it can enforce fairness among users. However, current YARN scheduling framework doesn’t have a mechanism for multiple scheduling policies work hierarchically in one cluster. YARN-3306 talked about many issues of today’s YARN scheduling framework, and proposed a per-queue policy driven framework. In detail, it supported different scheduling policies for leaf queues. However, support of different scheduling policies for upper level queues is not seriously considered yet. A generic scheduling framework is proposed here to address these limitations. It supports different policies (fair, capacity, fifo and so on) for any queue consistently. The proposal tries to solve many other issues in current YARN scheduling framework as well. Two new proposed scheduling policies YARN-3807 YARN-3808 are based on generic scheduling framework brought up in this proposal. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3806) Proposal of Generic Scheduling Framework for YARN
[ https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Shao updated YARN-3806: --- Attachment: (was: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf) Proposal of Generic Scheduling Framework for YARN - Key: YARN-3806 URL: https://issues.apache.org/jira/browse/YARN-3806 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Wei Shao Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.2.pdf Currently, a typical YARN cluster runs many different kinds of applications: production applications, ad hoc user applications, long running services and so on. Different YARN scheduling policies may be suitable for different applications. For example, capacity scheduling can manage production applications well since application can get guaranteed resource share, fair scheduling can manage ad hoc user applications well since it can enforce fairness among users. However, current YARN scheduling framework doesn’t have a mechanism for multiple scheduling policies work hierarchically in one cluster. YARN-3306 talked about many issues of today’s YARN scheduling framework, and proposed a per-queue policy driven framework. In detail, it supported different scheduling policies for leaf queues. However, support of different scheduling policies for upper level queues is not seriously considered yet. A generic scheduling framework is proposed here to address these limitations. It supports different policies (fair, capacity, fifo and so on) for any queue consistently. The proposal tries to solve many other issues in current YARN scheduling framework as well. Two new proposed scheduling policies YARN-3807 YARN-3808 are based on generic scheduling framework brought up in this proposal. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3806) Proposal of Generic Scheduling Framework for YARN
[ https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Shao updated YARN-3806: --- Attachment: (was: ProposalOfGenericSchedulingFrameworkForYARN-V1.1.pdf) Proposal of Generic Scheduling Framework for YARN - Key: YARN-3806 URL: https://issues.apache.org/jira/browse/YARN-3806 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Wei Shao Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.2.pdf Currently, a typical YARN cluster runs many different kinds of applications: production applications, ad hoc user applications, long running services and so on. Different YARN scheduling policies may be suitable for different applications. For example, capacity scheduling can manage production applications well since application can get guaranteed resource share, fair scheduling can manage ad hoc user applications well since it can enforce fairness among users. However, current YARN scheduling framework doesn’t have a mechanism for multiple scheduling policies work hierarchically in one cluster. YARN-3306 talked about many issues of today’s YARN scheduling framework, and proposed a per-queue policy driven framework. In detail, it supported different scheduling policies for leaf queues. However, support of different scheduling policies for upper level queues is not seriously considered yet. A generic scheduling framework is proposed here to address these limitations. It supports different policies (fair, capacity, fifo and so on) for any queue consistently. The proposal tries to solve many other issues in current YARN scheduling framework as well. Two new proposed scheduling policies YARN-3807 YARN-3808 are based on generic scheduling framework brought up in this proposal. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3801) [JDK-8][YARN-2928] Exclude jdk.tools from hbase-client and hbase-testing-util
[ https://issues.apache.org/jira/browse/YARN-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591189#comment-14591189 ] Tsuyoshi Ozawa commented on YARN-3801: -- Thanks Sangjin, Zhijie, and Sean for your reviews! [JDK-8][YARN-2928] Exclude jdk.tools from hbase-client and hbase-testing-util - Key: YARN-3801 URL: https://issues.apache.org/jira/browse/YARN-3801 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Tsuyoshi Ozawa Assignee: Tsuyoshi Ozawa Fix For: YARN-2928 Attachments: YARN-3801.001.patch timelineservice depends on hbase-client and hbase-testing-util, and they dpend on jdk.tools:1.7. This leads to fail to compile hadoop with JDK8. {quote} [WARNING] Dependency convergence error for jdk.tools:jdk.tools:1.8 paths to dependency are: +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hadoop:hadoop-annotations:3.0.0-SNAPSHOT +-jdk.tools:jdk.tools:1.8 and +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hbase:hbase-client:1.0.1 +-org.apache.hbase:hbase-annotations:1.0.1 +-jdk.tools:jdk.tools:1.7 and +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hbase:hbase-testing-util:1.0.1 +-org.apache.hbase:hbase-annotations:1.0.1 +-jdk.tools:jdk.tools:1.7 [WARNING] Rule 0: org.apache.maven.plugins.enforcer.DependencyConvergence failed with message: Failed while enforcing releasability the error(s) are [ Dependency convergence error for jdk.tools:jdk.tools:1.8 paths to dependency are: +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hadoop:hadoop-annotations:3.0.0-SNAPSHOT +-jdk.tools:jdk.tools:1.8 and +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hbase:hbase-client:1.0.1 +-org.apache.hbase:hbase-annotations:1.0.1 +-jdk.tools:jdk.tools:1.7 and +-org.apache.hadoop:hadoop-yarn-server-timelineservice:3.0.0-SNAPSHOT +-org.apache.hbase:hbase-testing-util:1.0.1 +-org.apache.hbase:hbase-annotations:1.0.1 +-jdk.tools:jdk.tools:1.7 {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3812) TestRollingLevelDBTimelineStore fails in trunk due to HADOOP-11347
[ https://issues.apache.org/jira/browse/YARN-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3812: --- Attachment: 0002-YARN-3812.patch Thank you for the review comments. As per the comments, I have updated ImmutableFsPermission. Please review the patch. All tests in the TestRollingLevelDB class are passing. TestRollingLevelDBTimelineStore fails in trunk due to HADOOP-11347 -- Key: YARN-3812 URL: https://issues.apache.org/jira/browse/YARN-3812 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 3.0.0 Reporter: Robert Kanter Assignee: Bibin A Chundatt Attachments: 0001-YARN-3812.patch, 0002-YARN-3812.patch {{TestRollingLevelDBTimelineStore}} is failing with the below errors in trunk. I did a git bisect and found that it was due to HADOOP-11347, which changed something with umasks in {{FsPermission}}. {noformat} Running org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore Tests run: 16, Failures: 0, Errors: 16, Skipped: 0, Time elapsed: 2.65 sec FAILURE! - in org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore testGetDomains(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore) Time elapsed: 1.533 sec ERROR! java.lang.UnsupportedOperationException: null at org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380) at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200) at org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65) testRelatingToNonExistingEntity(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore) Time elapsed: 0.085 sec ERROR! java.lang.UnsupportedOperationException: null at org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380) at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200) at org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65) testValidateConfig(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore) Time elapsed: 0.07 sec ERROR! 
java.lang.UnsupportedOperationException: null at org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380) at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207) at org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200) at org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at
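The failure comes from {{RawLocalFileSystem}} calling {{applyUMask}} on an {{ImmutableFsPermission}}, which throws. A minimal sketch of one way around it, handing the filesystem a mutable copy of the permission; this is an assumption about the direction of the fix, not necessarily what the attached 0002 patch does:

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class MkdirsPermissionSketch {
  // Pass a fresh FsPermission copy instead of a shared immutable constant so that
  // the local filesystem's umask handling operates on a mutable instance.
  static void mkdirsWithCopy(FileSystem fs, Path dir, FsPermission template)
      throws IOException {
    FsPermission mutable = new FsPermission(template);   // copy constructor
    fs.mkdirs(dir, mutable);
  }
}
{code}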
[jira] [Created] (YARN-3822) Scalability validation of RM writing app/attempt/container lifecycle events
Zhijie Shen created YARN-3822: - Summary: Scalability validation of RM writing app/attempt/container lifecycle events Key: YARN-3822 URL: https://issues.apache.org/jira/browse/YARN-3822 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, timelineserver Reporter: Zhijie Shen Assignee: Naganarasimha G R We need to test how scalable the RM metrics publisher is. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590133#comment-14590133 ] Zhijie Shen commented on YARN-3051: --- [~sjlee0], thanks for chiming in. Varun, Li and I recently had an offline discussion. In general, we agreed on focusing on the storage-oriented interface (raw data query) together with an FS implementation of it in this jira, and on spinning off the changes for the user-oriented interface, the web front wire-up, and the single reader daemon setup, dealing with them separately. The rationale is to roll out the reader interface fast, so we can work on the HBase/Phoenix implementation and the web front wire-up against a commonly agreed interface in parallel. What do you think of the plan? bq. It's already doing that to some extent, and we should push that some more. For instance, it might be helpful to create Context. Context is useful. Instead of creating a new one, maybe we can reuse the existing Context, which hosts more content than the reader needs. We just need to let the reader put/get the required information to/from it. bq. In essence, one way to look at it is that a query onto the storage is really (context) + (predicate/filters) + (contents to retrieve). Then we could consolidate arguments into these coarse-grained things. +1, LGTM, but I think that is for the query that searches for a set of qualified entities, right? For fetching a single entity, the query may look like (context) + (entity identifier) + (contents to retrieve). Another issue I want to raise is that, after our performance evaluation, we agreed on using HBase for raw data and Phoenix for aggregated data. It implies that we need to use HBase to implement the APIs for the raw entities, while using Phoenix to implement the APIs for the aggregated data. [Storage abstraction] Create backing storage read interface for ATS readers --- Key: YARN-3051 URL: https://issues.apache.org/jira/browse/YARN-3051 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Varun Saxena Attachments: YARN-3051-YARN-2928.003.patch, YARN-3051-YARN-2928.03.patch, YARN-3051-YARN-2928.04.patch, YARN-3051.wip.02.YARN-2928.patch, YARN-3051.wip.patch, YARN-3051_temp.patch Per design in YARN-2928, create backing storage read interface that can be implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
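To make the "(context) + (predicate/filters) + (contents to retrieve)" and "(context) + (entity identifier) + (contents to retrieve)" shapes concrete, a hypothetical sketch; none of these type or method names are the ones introduced by the patch:

{code}
import java.io.IOException;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// All types here are placeholders standing in for the real reader API classes.
public interface TimelineReaderSketch {
  enum Field { INFO, CONFIGS, METRICS, EVENTS, RELATES_TO }

  final class ReaderContext {            // cluster/user/flow/app identifying info
    public String clusterId, userId, flowId, appId;
  }

  final class EntityFilters {            // predicates evaluated by the storage layer
    public Map<String, Object> infoFilters;
    public Long createdTimeBegin, createdTimeEnd;
  }

  final class TimelineEntity {           // minimal stand-in for the returned entity
    public String type, id;
  }

  // (context) + (entity identifier) + (contents to retrieve)
  TimelineEntity getEntity(ReaderContext context, String entityType,
      String entityId, EnumSet<Field> fieldsToRetrieve) throws IOException;

  // (context) + (predicate/filters) + (contents to retrieve)
  Set<TimelineEntity> getEntities(ReaderContext context, String entityType,
      EntityFilters filters, EnumSet<Field> fieldsToRetrieve) throws IOException;
}
{code}

Under the split discussed above, an FS-backed class would implement this shape first, with the HBase implementation (raw entities) and Phoenix implementation (aggregated data) following against the same interface.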
[jira] [Updated] (YARN-3813) Support Application timeout feature in YARN.
[ https://issues.apache.org/jira/browse/YARN-3813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-3813: - Component/s: scheduler Fix Version/s: (was: 2.8.0) Support Application timeout feature in YARN. - Key: YARN-3813 URL: https://issues.apache.org/jira/browse/YARN-3813 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Reporter: nijel It would be useful to support an application timeout in YARN. Some use cases do not care about the output of an application if it does not complete within a specific time. *Background:* The requirement is to show the CDR statistics of the last few minutes, say every 5 minutes. The same job runs continuously with a different dataset, so one job is started every 5 minutes. The estimated time for this task is 2 minutes or less. If the application does not complete in the given time, its output is not useful. *Proposal* The idea is to support an application timeout, where a timeout parameter is given while submitting the job. Here, the user expects the application to finish (complete or be killed) within the given time. One option is to move this logic to the application client (which submits the job), but it would be nicer as generic logic, which would also be more robust. Kindly provide your suggestions/opinions on this feature. If it sounds good, I will update the design doc and prototype patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
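As a concrete illustration of the client-side option mentioned in the proposal, a minimal sketch using the existing YarnClient API; the class and method names of the sketch itself are made up, and the 5-second polling interval is arbitrary:

{code}
import java.io.IOException;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.exceptions.YarnException;

public class ClientSideTimeoutSketch {
  /** Poll the application and kill it if it is still running after timeoutMs. */
  static void enforceTimeout(YarnClient client, ApplicationId appId, long timeoutMs)
      throws IOException, YarnException, InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      YarnApplicationState state =
          client.getApplicationReport(appId).getYarnApplicationState();
      if (state == YarnApplicationState.FINISHED
          || state == YarnApplicationState.FAILED
          || state == YarnApplicationState.KILLED) {
        return;                           // finished within the timeout
      }
      Thread.sleep(5000);                 // poll every 5 seconds
    }
    client.killApplication(appId);        // still running: the output is no longer useful
  }
}
{code}

A server-side (generic) version would enforce the same deadline inside the RM, which is the more robust variant the proposal asks about.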
[jira] [Created] (YARN-3819) Collect network usage on the node
Robert Grandl created YARN-3819: --- Summary: Collect network usage on the node Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Reporter: Robert Grandl Assignee: Robert Grandl In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3820) Collect disks usages on the node
[ https://issues.apache.org/jira/browse/YARN-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590533#comment-14590533 ] Robert Grandl commented on YARN-3820: - Short description: This JIRA collects bytes read/written from/to disks in Linux. Step 1: We exploit the /proc/diskstats counters, extract the number of sectors read/written for every disk, and return the aggregation of these counters among all the disks. Step 2: To convert sectors into bytes, for every disk, we extract the sector size from /sys/block/diskName/queue/hw_sector_size. Step 3: Finally, by multiplying the number of sectors from Step 1 with the sector size from Step 2, we compute the number of bytes. We tested the existence of these files in the following Linux kernel versions: Linux 3.2.0 Linux 2.6.32 Linux 3.13.0 Also, from further searching on the web, it seems people use/recommend these files for extracting disk read/write counters. Collect disks usages on the node Key: YARN-3820 URL: https://issues.apache.org/jira/browse/YARN-3820 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3820-1.patch In this JIRA we propose to collect disk usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
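A minimal sketch of Steps 1-3, reading /proc/diskstats and /sys/block/<disk>/queue/hw_sector_size; the field positions follow the standard /proc/diskstats layout, and the partition filtering here is deliberately crude (the actual patch may differ):

{code}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DiskBytesSketch {
  /** Returns {bytesRead, bytesWritten} aggregated over whole disks. */
  public static long[] readAndWrittenBytes() throws IOException {
    long readBytes = 0, writtenBytes = 0;
    try (BufferedReader r = new BufferedReader(new FileReader("/proc/diskstats"))) {
      String line;
      while ((line = r.readLine()) != null) {
        // After splitting: f[0]=major, f[1]=minor, f[2]=device,
        // f[5]=sectors read, f[9]=sectors written.
        String[] f = line.trim().split("\\s+");
        if (f.length < 14) {
          continue;
        }
        String dev = f[2];
        if (dev.matches(".*\\d+$")) {
          continue;                       // crude skip of partitions (sda1, ...) to avoid double counting
        }
        long sectorSize = sectorSize(dev);                    // Step 2
        readBytes    += Long.parseLong(f[5]) * sectorSize;    // Step 1 + Step 3
        writtenBytes += Long.parseLong(f[9]) * sectorSize;
      }
    }
    return new long[] { readBytes, writtenBytes };
  }

  private static long sectorSize(String dev) throws IOException {
    String path = "/sys/block/" + dev + "/queue/hw_sector_size";
    return Long.parseLong(new String(Files.readAllBytes(Paths.get(path))).trim());
  }
}
{code}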
[jira] [Commented] (YARN-3819) Collect network usage on the node
[ https://issues.apache.org/jira/browse/YARN-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590553#comment-14590553 ] Robert Grandl commented on YARN-3819: - [~grey], YARN-2745 is an effort to schedule multiple resources. The resources taken into account are CPU/Memory/Disk/Network. For fungible resources such as disk and network, the counters required are the total number of bytes read/written from/to disk/network. This JIRA extends the ResourceCalculatorPlugin, which is able to extract the amount of available CPU and memory on a node. YARN-1012 already uses this information and aggregates it in the heartbeat from the NM to the RM. Collect network usage on the node - Key: YARN-3819 URL: https://issues.apache.org/jira/browse/YARN-3819 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3819-1.patch, YARN-3819-2.patch, YARN-3819-3.patch In this JIRA we propose to collect the network usage on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
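For the network counterpart, a sketch assuming the per-interface byte counters come from /proc/net/dev, which is a common source on the kernels listed for YARN-3820; the actual patch may read different files or filter interfaces differently:

{code}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class NetworkBytesSketch {
  /** Returns {bytesReceived, bytesSent} aggregated over non-loopback interfaces. */
  public static long[] receivedAndSentBytes() throws IOException {
    long rx = 0, tx = 0;
    try (BufferedReader r = new BufferedReader(new FileReader("/proc/net/dev"))) {
      String line;
      while ((line = r.readLine()) != null) {
        int colon = line.indexOf(':');
        if (colon < 0) {
          continue;                              // skip the two header lines
        }
        String iface = line.substring(0, colon).trim();
        if (iface.equals("lo")) {
          continue;                              // ignore loopback traffic
        }
        String[] f = line.substring(colon + 1).trim().split("\\s+");
        rx += Long.parseLong(f[0]);              // receive bytes (first column)
        tx += Long.parseLong(f[8]);              // transmit bytes (ninth column)
      }
    }
    return new long[] { rx, tx };
  }
}
{code}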
[jira] [Updated] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3798: - Attachment: YARN-3798-branch-2.7.002.patch Attaching v2 patch to handle SESSIONMOVED correctly. RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Blocker Attachments: RM.log, YARN-3798-branch-2.7.002.patch, YARN-3798-branch-2.7.patch RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at
[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590574#comment-14590574 ] Tsuyoshi Ozawa commented on YARN-3798: -- [~varun_saxena] thank you for your review. It helps me a lot. {quote} Moreover, there can be a case when a particular zookeeper server is forever down. In this case also, we will keep on getting ConnectionLoss IIUC till retries exhaust. {quote} It looks like retries would be exhausted, but they aren't: reconnection to other ZooKeeper servers is done in ClientCnxn#startConnect on the main thread of ZooKeeper's client. Please note that a session is not the same as a connection in ZooKeeper. What we can do is retry with the current ZooKeeper client. I also noticed that we shouldn't create a new session when SESSIONMOVED occurs. Updating the patch soon. {quote} So to handle these cases, I think we should retry with new connection atleast once. Thoughts ? {quote} I think we shouldn't create a new ZooKeeper session unless SESSIONEXPIRED occurs; from http://wiki.apache.org/hadoop/ZooKeeper/FAQ : {quote} Only create a new session when you are notified of session expiration (mandatory) {quote} RM shutdown with NoNode exception while updating appAttempt on zk - Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Blocker Attachments: RM.log, YARN-3798-branch-2.7.patch RM going down with NoNode exception during create of znode for appattempt *Please find the exception logs* {code} 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. 
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at
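To illustrate the retry policy discussed in the comment above (retry with the same ZooKeeper handle on recoverable errors, and only give up the session on SESSIONEXPIRED), here is a minimal standalone sketch. The class name, retry limit, and interval below are illustrative assumptions, not the actual ZKRMStateStore code.
{code}
// Illustrative sketch only: retry a ZK operation with the SAME ZooKeeper
// handle on recoverable errors, and only surrender the session when
// SESSIONEXPIRED is reported.
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.KeeperException.Code;

abstract class ZkRetrySketch<T> {

  private static final int MAX_RETRIES = 10;          // hypothetical limit
  private static final long RETRY_INTERVAL_MS = 1000; // hypothetical interval

  // The ZK operation to retry, e.g. a multi() or create().
  abstract T run() throws KeeperException, InterruptedException;

  T runWithRetries() throws Exception {
    for (int retry = 0; ; retry++) {
      try {
        return run();
      } catch (KeeperException ke) {
        Code code = ke.code();
        if (code == Code.SESSIONEXPIRED) {
          // Only here would a brand-new ZooKeeper session be justified,
          // per the ZooKeeper FAQ quoted above.
          throw ke;
        }
        boolean recoverable = code == Code.CONNECTIONLOSS
            || code == Code.OPERATIONTIMEOUT
            || code == Code.SESSIONMOVED;
        if (!recoverable || retry >= MAX_RETRIES) {
          throw ke;
        }
        // The client library reconnects to another quorum member on its own
        // (ClientCnxn), so we simply wait and retry with the same handle.
        Thread.sleep(RETRY_INTERVAL_MS);
      }
    }
  }
}
{code}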
[jira] [Commented] (YARN-3528) Tests with 12345 as hard-coded port break jenkins
[ https://issues.apache.org/jira/browse/YARN-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590571#comment-14590571 ] Varun Saxena commented on YARN-3528: 9. In TestNodeManagerShutdown#startContainer, if an exception is thrown (i.e. no free port is available), the code simply continues on to make a call to rpc.getProxy() with a null containerManagerBindAddress. We can probably throw an exception so that the test fails at the correct location. Tests with 12345 as hard-coded port break jenkins - Key: YARN-3528 URL: https://issues.apache.org/jira/browse/YARN-3528 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0 Environment: ASF Jenkins Reporter: Steve Loughran Assignee: Brahma Reddy Battula Priority: Blocker Labels: test Attachments: YARN-3528.patch A lot of the YARN tests have hard-coded the port 12345 for their services to come up on. This makes it impossible to have scheduled or precommit tests to run consistently on the ASF jenkins hosts. Instead the tests fail regularly and appear to get ignored completely. A quick grep of 12345 shows up many places in the test suite where this practice has developed. * All {{BaseContainerManagerTest}} subclasses * {{TestNodeManagerShutdown}} * {{TestContainerManager}} + others This needs to be addressed through portscanning and dynamic port allocation. Please can someone do this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
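For reference, one common way to replace a hard-coded port such as 12345 in tests is to let the OS pick a currently free ephemeral port. A minimal sketch follows; the helper class name is made up and not taken from the attached patch. As noted in the comment above, the caller should fail the test explicitly if no port can be obtained instead of continuing with a null bind address.
{code}
// Minimal sketch of dynamic port allocation for tests; the helper name is
// made up and not taken from the attached patch.
import java.io.IOException;
import java.net.ServerSocket;

public final class FreePortUtil {
  private FreePortUtil() {
  }

  /** Ask the OS for a currently free TCP port instead of hard-coding 12345. */
  public static int findFreePort() throws IOException {
    try (ServerSocket socket = new ServerSocket(0)) {
      return socket.getLocalPort();
    }
  }
}
{code}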
[jira] [Updated] (YARN-3820) Collect disks usages on the node
[ https://issues.apache.org/jira/browse/YARN-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Grandl updated YARN-3820: Attachment: YARN-3820-3.patch Removed whitespace; same build crash as with YARN-3819-3.patch. Collect disks usages on the node Key: YARN-3820 URL: https://issues.apache.org/jira/browse/YARN-3820 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 3.0.0 Reporter: Robert Grandl Assignee: Robert Grandl Labels: yarn-common, yarn-util Attachments: YARN-3820-1.patch, YARN-3820-2.patch, YARN-3820-3.patch In this JIRA we propose to collect disks usages on a node. This JIRA is part of a larger effort of monitoring resource usages on the nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591288#comment-14591288 ] Ray Chiang commented on YARN-3069: -- Wrote up the code for automatic default checking at HADOOP-12101. Ran the automatic checking with the following results: - yarn-default.xml has 15 properties that do not match the default Config value -- Filed one bug at YARN-3823 -- Remaining 14 are due to variable references like ${yarn.resourcemanager.hostname} or a documented -1 value like yarn.nodemanager.resource.memory-mb. - Configuration(s) have 67 properties with no corresponding default member variable. These will need to be verified manually. -- Will document as a separate comment. - yarn-default.xml has 6 properties with empty values -- Nothing to compare - yarn-default.xml has 135 properties which match a corresponding Config variable -- No need to compare Document missing properties in yarn-default.xml --- Key: YARN-3069 URL: https://issues.apache.org/jira/browse/YARN-3069 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Ray Chiang Assignee: Ray Chiang Labels: BB2015-05-TBR, supportability Attachments: YARN-3069.001.patch, YARN-3069.002.patch, YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch The following properties are currently not defined in yarn-default.xml. These properties should either be A) documented in yarn-default.xml OR B) listed as an exception (with comments, e.g. for internal use) in the TestYarnConfigurationFields unit test Any comments for any of the properties below are welcome. org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore security.applicationhistory.protocol.acl yarn.app.container.log.backups yarn.app.container.log.dir yarn.app.container.log.filesize yarn.client.app-submission.poll-interval yarn.client.application-client-protocol.poll-timeout-ms yarn.is.minicluster yarn.log.server.url yarn.minicluster.control-resource-monitoring yarn.minicluster.fixed.ports yarn.minicluster.use-rpc yarn.node-labels.fs-store.retry-policy-spec yarn.node-labels.fs-store.root-dir yarn.node-labels.manager-class yarn.nodemanager.container-executor.os.sched.priority.adjustment yarn.nodemanager.container-monitor.process-tree.class yarn.nodemanager.disk-health-checker.enable yarn.nodemanager.docker-container-executor.image-name yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms yarn.nodemanager.linux-container-executor.group yarn.nodemanager.log.deletion-threads-count yarn.nodemanager.user-home-dir yarn.nodemanager.webapp.https.address yarn.nodemanager.webapp.spnego-keytab-file yarn.nodemanager.webapp.spnego-principal yarn.nodemanager.windows-secure-container-executor.group yarn.resourcemanager.configuration.file-system-based-store yarn.resourcemanager.delegation-token-renewer.thread-count yarn.resourcemanager.delegation.key.update-interval yarn.resourcemanager.delegation.token.max-lifetime yarn.resourcemanager.delegation.token.renew-interval yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size yarn.resourcemanager.metrics.runtime.buckets yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs yarn.resourcemanager.reservation-system.class yarn.resourcemanager.reservation-system.enable 
yarn.resourcemanager.reservation-system.plan.follower yarn.resourcemanager.reservation-system.planfollower.time-step yarn.resourcemanager.rm.container-allocation.expiry-interval-ms yarn.resourcemanager.webapp.spnego-keytab-file yarn.resourcemanager.webapp.spnego-principal yarn.scheduler.include-port-in-node-name yarn.timeline-service.delegation.key.update-interval yarn.timeline-service.delegation.token.max-lifetime yarn.timeline-service.delegation.token.renew-interval yarn.timeline-service.generic-application-history.enabled yarn.timeline-service.generic-application-history.fs-history-store.compression-type yarn.timeline-service.generic-application-history.fs-history-store.uri yarn.timeline-service.generic-application-history.store-class yarn.timeline-service.http-cross-origin.enabled yarn.tracking.url.generator -- This message was sent by Atlassian JIRA (v6.3.4#6332)
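As a rough illustration of what such an automated cross-check can look like (this is a sketch, not the HADOOP-12101 tool), one can load yarn-default.xml into a Configuration and compare each documented value against the matching DEFAULT_* constant found by reflection, following the FOO / DEFAULT_FOO naming convention mentioned in these comments:
{code}
// Sketch only (not the HADOOP-12101 tool): compare yarn-default.xml entries
// against YarnConfiguration DEFAULT_* constants found via reflection.
// Non-key String constants (e.g. prefixes) will show up as noise here.
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class DefaultValueCheckSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(false);
    conf.addResource("yarn-default.xml");

    Class<?> c = YarnConfiguration.class;
    for (Field keyField : c.getDeclaredFields()) {
      // Only consider public static String constants holding property keys.
      if (!Modifier.isStatic(keyField.getModifiers())
          || !Modifier.isPublic(keyField.getModifiers())
          || keyField.getType() != String.class
          || keyField.getName().startsWith("DEFAULT_")) {
        continue;
      }
      String key = (String) keyField.get(null);
      String documented = conf.get(key);
      Field defField;
      try {
        // Convention under test: property key FOO pairs with member DEFAULT_FOO.
        defField = c.getDeclaredField("DEFAULT_" + keyField.getName());
      } catch (NoSuchFieldException e) {
        System.out.println("No default member variable for " + key);
        continue;
      }
      Object codeDefault = defField.get(null);
      if (documented == null) {
        System.out.println("Not documented in yarn-default.xml: " + key);
      } else if (codeDefault != null
          && !documented.equals(String.valueOf(codeDefault))) {
        System.out.println("Mismatch for " + key + ": xml=" + documented
            + " code=" + codeDefault);
      }
    }
  }
}
{code}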
[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591294#comment-14591294 ] Ray Chiang commented on YARN-3069: -- Most of the manual verification were in the following categories: - Hardcoded value - Not using DEFAULT_FOO for FOO member variable naming convention - No default value at all - Variable is used indirectly Manual verification specifics: CLIENT_FAILOVER_MAX_ATTEMPTS - Hardcoded default to -1 in RMProxy CLIENT_FAILOVER_SLEEPTIME_BASE_MS CLIENT_FAILOVER_SLEEPTIME_MAX_MS - Defaults to RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_MS or DEFAULT_RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_MS DEBUG_NM_DELETE_DELAY_SEC - Hardcoded default to 0 in DeletionService FS_NODE_LABELS_STORE_ROOT_DIR - Defaults to FileSystemNodeLabelsStore#getDefaultFSNodeLabelsRootDir() return value FS_RM_STATE_STORE_URI - No default value anywhere IS_MINI_YARN_CLUSTER - Hardcoded to false in Client, MRApps, ResourceManager NM_AUX_SERVICES - No default value anywhere. Maybe whatever Configuration#getStringCollection() returns. NM_BIND_HOST - No default value anywhere NM_CONTAINER_EXECUTOR - Hardcoded to DefaultContainerExecutor.class in NodeManager NM_CONTAINER_LOCALIZER_JAVA_OPTS_KEY - Defaults to YarnConfiguration.NM_CONTAINER_LOCALIZER_JAVA_OPTS_DEFAULT in ContainerLocalizer NM_CONTAINER_MON_PROCESS_TREE NM_CONTAINER_MON_RESOURCE_CALCULATOR - Hardcoded to null in ContainersMonitorImpl NM_DISK_HEALTH_CHECK_ENABLE - Hardcoded to true in LocalDirsHanderService NM_DOCKER_CONTAINER_EXECUTOR_EXEC_NAME - Defaults to unconventional name YarnConfiguration.NM_DEFAULT_DOCKER_CONTAINER_EXECUTOR_EXEC_NAME NM_DOCKER_CONTAINER_EXECUTOR_IMAGE_NAME - No default value anywhere NM_HEALTH_CHECK_SCRIPT_OPTS - Defaults to empty String array in NodeManager NM_HEALTH_CHECK_SCRIPT_PATH - No default value anywhere NM_KEYTAB - Defaults to YarnConfiguration.NM_PRINCIPAL NM_LINUX_CONTAINER_CGROUPS_HIERARCHY - Hardcoded to /hadoop-yarn in CGroupsHandlerImpl and CgroupsLCEResourcesHandler NM_LINUX_CONTAINER_CGROUPS_MOUNT - Hardcoded to false in CGroupsHandlerImpl and CgroupsLCEResourcesHandler NM_LINUX_CONTAINER_CGROUPS_MOUNT_PATH - Hardcoded to null in CGroupsHandlerImpl and CgroupsLCEResourcesHandler NM_LINUX_CONTAINER_EXECUTOR_PATH - Defaults to internal variable defaultPath (which looks to be based off HADOOP_YARN_HOME environment) NM_LINUX_CONTAINER_GROUP - Not used anywhere NM_LINUX_CONTAINER_RESOURCES_HANDLER - Hardcoded to DefaultLCEResourcesHandler.class in LinuxContainerExecutor NM_LOG_DELETION_THREADS_COUNT - Defaults to unconventional name YarnConfiguration.DEFAULT_NM_LOG_DELETE_THREAD_COUNT NM_NONSECURE_MODE_LOCAL_USER_KEY - Defaults to unconventional name YarnConfiguration.DEFAULT_NM_NONSECURE_MODE_LOCAL_USER NM_NONSECURE_MODE_USER_PATTERN_KEY - Defaults to unconventional name YarnConfiguration.DEFAULT_NM_NONSECURE_MODE_USER_PATTERN NM_PRINCIPAL - Is the default value for YarnConfiguration.NM_KEYTAB NM_RECOVERY_DIR - No default value anywhere NM_SYSTEM_RESERVED_PMEM_MB - Hardcoded to -1 in NodeManagerHardwareUtils NM_WEBAPP_SPNEGO_KEYTAB_FILE_KEY NM_WEBAPP_SPNEGO_USER_NAME_KEY - No default value anywhere NM_WINDOWS_SECURE_CONTAINER_GROUP - No default value anywhere PROXY_KEYTAB PROXY_PRINCIPAL - No default value anywhere RECOVERY_ENABLED - Defaults to YarnConfiguration.DEFAULT_NM_NONSECURE_MODE_USER_PATTERN in ResourceManager RM_BIND_HOST - No default value anywhere RM_CLUSTER_ID - No default value anywhere RM_DELEGATION_KEY_UPDATE_INTERVAL_KEY - 
Defaults to YarnConfiguration.RM_DELEGATION_KEY_UPDATE_INTERVAL_DEFAULT in RMSecretManagerService RM_DELEGATION_TOKEN_MAX_LIFETIME_KEY - Defaults to YarnConfiguration.RM_DELEGATION_TOKEN_MAX_LIFETIME_DEFAULT in RMSecretManagerService RM_DELEGATION_TOKEN_RENEW_INTERVAL_KEY - Defaults to YarnConfiguration.RM_DELEGATION_TOKEN_RENEW_INTERVAL_DEFAULT in RMSecretManagerService RM_HA_ID - Defaults to values from RM_HA_IDS RM_HA_IDS - No default value, but gets validation in HAUtil#verifyAndSetRMHAIdsList() RM_HOSTNAME - Defaults to internal variable RMId in HAUtils RM_KEYTAB - Defaults to YarnConfiguration.RM_PRINCIPAL RM_LEVELDB_STORE_PATH - No default value anywhere RM_PRINCIPAL - Default value for RM_KEYTAB RM_PROXY_USER_PRIVILEGES_ENABLED - Defaults to YarnConfiguration.DEFAULT_RM_PROXY_USER_PRIVILEGES_ENABLED. Needs final keyword added. RM_RESERVATION_SYSTEM_CLASS - Defaults to AbstractReservationSystem#getDefaultReservationSystem(scheduler) RM_RESERVATION_SYSTEM_PLAN_FOLLOWER - Defaults to AbstractReservationSystem.getDefaultPlanFollower() RM_SCHEDULER_INCLUDE_PORT_IN_NODE_NAME - Unconventional default YarnConfiguration.DEFAULT_RM_SCHEDULER_USE_PORT_FOR_NODE_NAME RM_SCHEDULER_MONITOR_POLICIES - Defaults to an SchedulingEditPolicy.class as an Interface RM_STORE - Hardcoded to MemoryRMStateStore.class in
[jira] [Commented] (YARN-3148) Allow CORS related headers to passthrough in WebAppProxyServlet
[ https://issues.apache.org/jira/browse/YARN-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589690#comment-14589690 ] Varun Saxena commented on YARN-3148: Thanks [~devaraj.k] for the commit Allow CORS related headers to passthrough in WebAppProxyServlet --- Key: YARN-3148 URL: https://issues.apache.org/jira/browse/YARN-3148 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Prakash Ramachandran Assignee: Varun Saxena Fix For: 2.8.0 Attachments: YARN-3148.001.patch, YARN-3148.02.patch, YARN-3148.03.patch, YARN-3148.04.patch currently the WebAppProxyServlet filters the request headers as defined by passThroughHeaders. Tez UI is building a webapp which using rest api to fetch data from the am via the rm tracking url. for this purpose it would be nice to have additional headers allowed especially the ones related to CORS. A few of them that would help are * Origin * Access-Control-Request-Method * Access-Control-Request-Headers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
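A minimal sketch of the idea behind this change (illustrative names, not the committed WebAppProxyServlet code): keep a whitelist of pass-through headers that also includes the CORS request headers, and copy only those from the incoming request onto the proxied request.
{code}
// Illustrative sketch, not the committed WebAppProxyServlet code: copy only
// whitelisted headers (including the CORS request headers named in this
// JIRA) from the incoming request to the proxied request.
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import javax.servlet.http.HttpServletRequest;

final class PassThroughHeadersSketch {
  private static final Set<String> PASS_THROUGH_HEADERS = new HashSet<>(
      Arrays.asList("User-Agent", "Accept", "Accept-Encoding",
          "Accept-Language", "Accept-Charset",
          // CORS-related headers requested in this JIRA:
          "Origin", "Access-Control-Request-Method",
          "Access-Control-Request-Headers"));

  static Map<String, String> selectHeaders(HttpServletRequest req) {
    Map<String, String> selected = new HashMap<>();
    for (String name : PASS_THROUGH_HEADERS) {
      String value = req.getHeader(name); // header lookup is case-insensitive
      if (value != null) {
        selected.put(name, value);
      }
    }
    return selected;
  }
}
{code}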
[jira] [Updated] (YARN-3706) Generalize native HBase writer for additional tables
[ https://issues.apache.org/jira/browse/YARN-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joep Rottinghuis updated YARN-3706: --- Attachment: YARN-3706-YARN-2928.015.patch YARN-3706-YARN-2928.015.patch local runs in pseudo distributed mode work moved Entity* classes to o.a.h.y.timelineservice.storage.entity Generalize native HBase writer for additional tables Key: YARN-3706 URL: https://issues.apache.org/jira/browse/YARN-3706 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Joep Rottinghuis Assignee: Joep Rottinghuis Priority: Minor Attachments: YARN-3706-YARN-2928.001.patch, YARN-3706-YARN-2928.010.patch, YARN-3706-YARN-2928.011.patch, YARN-3706-YARN-2928.012.patch, YARN-3706-YARN-2928.013.patch, YARN-3706-YARN-2928.014.patch, YARN-3706-YARN-2928.015.patch, YARN-3726-YARN-2928.002.patch, YARN-3726-YARN-2928.003.patch, YARN-3726-YARN-2928.004.patch, YARN-3726-YARN-2928.005.patch, YARN-3726-YARN-2928.006.patch, YARN-3726-YARN-2928.007.patch, YARN-3726-YARN-2928.008.patch, YARN-3726-YARN-2928.009.patch When reviewing YARN-3411 we noticed that we could change the class hierarchy a little in order to accommodate additional tables easily. In order to get ready for benchmark testing we left the original layout in place, as performance would not be impacted by the code hierarchy. Here is a separate jira to address the hierarchy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl
[ https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589686#comment-14589686 ] Varun Saxena commented on YARN-3804: Test failure unrelated. YARN-3790 already filed for it Both RM are on standBy state when kerberos user not in yarn.admin.acl - Key: YARN-3804 URL: https://issues.apache.org/jira/browse/YARN-3804 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3, 2 RM, Secure Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Critical Attachments: YARN-3804.01.patch, YARN-3804.02.patch, YARN-3804.03.patch, YARN-3804.04.patch, YARN-3804.05.patch Steps to reproduce 1. Configure cluster in secure mode 2. On RM Configure yarn.admin.acl=dsperf 3. Configure in arn.resourcemanager.principal=yarn 4. Start Both RM Both RM will be in Standby forever {code} 2015-06-15 12:20:21,556 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE DESCRIPTION=Unauthorized userPERMISSIONS= 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518) Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295) ... 5 more Caused by: org.apache.hadoop.security.AccessControlException: User yarn doesn't have permission to call 'refreshAdminAcls' at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228) ... 
7 more {code} *Analysis* On each attempt to transition the RM to Active, refreshAdminAcls is called, and the configured ACL does not grant permission to the user. This causes an infinite retry of the transition to Active, with {{ActiveStandbyElector#becomeActive()}} always returning false. *Expected* The RM should get a shutdown event after a few retries, or even at the first attempt, since the user with which it retries refreshAdminAcls can never change at runtime. *States from commands* ./yarn rmadmin -getServiceState rm2 *standby* ./yarn rmadmin -getServiceState rm1 *standby* ./yarn rmadmin -checkHealth rm1 *echo $? = 0* ./yarn rmadmin -checkHealth rm2 *echo $? = 0* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
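For context, the check that fails here boils down to testing the calling user against the configured yarn.admin.acl. A simplified sketch of that kind of check (not the actual RMServerUtils code) using Hadoop's AccessControlList:
{code}
// Simplified sketch of an admin-ACL check (not the actual RMServerUtils code).
// With yarn.admin.acl=dsperf and the RM principal mapping to user "yarn",
// isUserAllowed() returns false, which is what drives the failure above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.authorize.AccessControlList;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

final class AdminAclCheckSketch {
  static boolean isAdmin(Configuration conf, UserGroupInformation caller) {
    AccessControlList adminAcl = new AccessControlList(conf.get(
        YarnConfiguration.YARN_ADMIN_ACL,
        YarnConfiguration.DEFAULT_YARN_ADMIN_ACL));
    return adminAcl.isUserAllowed(caller);
  }
}
{code}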
[jira] [Commented] (YARN-3148) Allow CORS related headers to passthrough in WebAppProxyServlet
[ https://issues.apache.org/jira/browse/YARN-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589694#comment-14589694 ] Hudson commented on YARN-3148: -- FAILURE: Integrated in Hadoop-trunk-Commit #8033 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8033/]) YARN-3148. Allow CORS related headers to passthrough in (devaraj: rev ebb9a82519c622bb898e1eec5798c2298c726694) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/TestWebAppProxyServlet.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/main/java/org/apache/hadoop/yarn/server/webproxy/WebAppProxyServlet.java Allow CORS related headers to passthrough in WebAppProxyServlet --- Key: YARN-3148 URL: https://issues.apache.org/jira/browse/YARN-3148 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Prakash Ramachandran Assignee: Varun Saxena Fix For: 2.8.0 Attachments: YARN-3148.001.patch, YARN-3148.02.patch, YARN-3148.03.patch, YARN-3148.04.patch currently the WebAppProxyServlet filters the request headers as defined by passThroughHeaders. Tez UI is building a webapp which using rest api to fetch data from the am via the rm tracking url. for this purpose it would be nice to have additional headers allowed especially the ones related to CORS. A few of them that would help are * Origin * Access-Control-Request-Method * Access-Control-Request-Headers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3617) Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1
[ https://issues.apache.org/jira/browse/YARN-3617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589695#comment-14589695 ] Hudson commented on YARN-3617: -- FAILURE: Integrated in Hadoop-trunk-Commit #8033 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8033/]) YARN-3617. Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning (devaraj: rev 318d2cde7cb5c05a5f87c4ee967446bb60d28ae4) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1 - Key: YARN-3617 URL: https://issues.apache.org/jira/browse/YARN-3617 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Environment: Windows 7 x64 SP1 Reporter: Georg Berendt Assignee: J.Andreina Priority: Minor Fix For: 2.8.0 Attachments: YARN-3617.1.patch Original Estimate: 1h Remaining Estimate: 1h In the class 'WindowsResourceCalculatorPlugin.java' of the YARN project, there is an unused variable for CPU frequency. /** {@inheritDoc} */ @Override public long getCpuFrequency() { refreshIfNeeded(); return -1; } Please change '-1' to use 'cpuFrequencyKhz'. org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java -- This message was sent by Atlassian JIRA (v6.3.4#6332)
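The fix itself is small; as the report asks, the getter should return the already-populated field instead of the hard-coded constant, roughly:
{code}
// Sketch of the change requested in the description: return the refreshed
// field instead of the hard-coded -1.
/** {@inheritDoc} */
@Override
public long getCpuFrequency() {
  refreshIfNeeded();
  return cpuFrequencyKhz;
}
{code}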
[jira] [Updated] (YARN-3047) [Data Serving] Set up ATS reader with basic request serving structure and lifecycle
[ https://issues.apache.org/jira/browse/YARN-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3047: --- Target Version/s: YARN-2928 [Data Serving] Set up ATS reader with basic request serving structure and lifecycle --- Key: YARN-3047 URL: https://issues.apache.org/jira/browse/YARN-3047 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Varun Saxena Labels: BB2015-05-TBR Attachments: Timeline_Reader(draft).pdf, YARN-3047.001.patch, YARN-3047.003.patch, YARN-3047.005.patch, YARN-3047.006.patch, YARN-3047.007.patch, YARN-3047.02.patch, YARN-3047.04.patch Per design in YARN-2938, set up the ATS reader as a service and implement the basic structure as a service. It includes lifecycle management, request serving, and so on. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3706) Generalize native HBase writer for additional tables
[ https://issues.apache.org/jira/browse/YARN-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589354#comment-14589354 ] Joep Rottinghuis commented on YARN-3706: Fixed code, now have successful run: {noformat} 15/06/16 22:57:15 INFO mapreduce.Job: Counters: 23 File System Counters FILE: Number of bytes read=1651635 FILE: Number of bytes written=1927484 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=0 HDFS: Number of bytes written=0 HDFS: Number of read operations=0 HDFS: Number of large read operations=0 HDFS: Number of write operations=0 Map-Reduce Framework Map input records=1 Map output records=0 Input split bytes=48 Spilled Records=0 Failed Shuffles=0 Merged Map outputs=0 GC time elapsed (ms)=18 Total committed heap usage (bytes)=325058560 org.apache.hadoop.mapred.TimelineServicePerformanceV2$PerfCounters TIMELINE_SERVICE_WRITE_COUNTER=230 TIMELINE_SERVICE_WRITE_KBS=230 TIMELINE_SERVICE_WRITE_TIME=66 File Input Format Counters Bytes Read=0 File Output Format Counters Bytes Written=0 TRANSACTION RATE (per mapper): 3484.848484848485 ops/s IO RATE (per mapper): 3484.848484848485 KB/s TRANSACTION RATE (total): 3484.848484848485 ops/s IO RATE (total): 3484.848484848485 KB/s {noformat} and an individual history file: {noformat} 15/06/16 22:58:06 INFO mapreduce.Job: Job job_local1358267884_0001 completed successfully 15/06/16 22:58:06 INFO mapreduce.Job: Counters: 22 File System Counters FILE: Number of bytes read=1651635 FILE: Number of bytes written=1927113 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=141020 HDFS: Number of bytes written=0 HDFS: Number of read operations=3 HDFS: Number of large read operations=0 HDFS: Number of write operations=0 Map-Reduce Framework Map input records=1 Map output records=0 Input split bytes=48 Spilled Records=0 Failed Shuffles=0 Merged Map outputs=0 GC time elapsed (ms)=63 Total committed heap usage (bytes)=460324864 org.apache.hadoop.mapred.TimelineServicePerformanceV2$PerfCounters TIMELINE_SERVICE_WRITE_COUNTER=25 TIMELINE_SERVICE_WRITE_TIME=145 File Input Format Counters Bytes Read=0 File Output Format Counters Bytes Written=0 TRANSACTION RATE (per mapper): 172.41379310344828 ops/s IO RATE (per mapper): 0.0 KB/s TRANSACTION RATE (total): 172.41379310344828 ops/s IO RATE (total): 0.0 KB/s {noformat} Will make change suggested by [~zjshen], run another test run and upload a new patch. Generalize native HBase writer for additional tables Key: YARN-3706 URL: https://issues.apache.org/jira/browse/YARN-3706 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Joep Rottinghuis Assignee: Joep Rottinghuis Priority: Minor Attachments: YARN-3706-YARN-2928.001.patch, YARN-3706-YARN-2928.010.patch, YARN-3706-YARN-2928.011.patch, YARN-3706-YARN-2928.012.patch, YARN-3706-YARN-2928.013.patch, YARN-3706-YARN-2928.014.patch, YARN-3726-YARN-2928.002.patch, YARN-3726-YARN-2928.003.patch, YARN-3726-YARN-2928.004.patch, YARN-3726-YARN-2928.005.patch, YARN-3726-YARN-2928.006.patch, YARN-3726-YARN-2928.007.patch, YARN-3726-YARN-2928.008.patch, YARN-3726-YARN-2928.009.patch When reviewing YARN-3411 we noticed that we could change the class hierarchy a little in order to accommodate additional tables easily. 
In order to get ready for benchmark testing we left the original layout in place, as performance would not be impacted by the code hierarchy. Here is a separate jira to address the hierarchy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
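The rates reported in the benchmark output above follow directly from the perf counters: for example, 230 write operations over 66 ms of write time give roughly 3484.85 ops/s. A trivial sketch of that computation (class and method names are illustrative):
{code}
// Trivial sketch of how the reported rates derive from the perf counters
// (names are illustrative): ops divided by write time in seconds.
final class RateSketch {
  static double opsPerSecond(long writeCounter, long writeTimeMs) {
    return writeCounter * 1000.0 / writeTimeMs;
  }

  public static void main(String[] args) {
    // 230 ops over 66 ms -> ~3484.85 ops/s, matching the first run above.
    System.out.println(opsPerSecond(230, 66));
  }
}
{code}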
[jira] [Updated] (YARN-3617) Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1
[ https://issues.apache.org/jira/browse/YARN-3617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-3617: Summary: Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1 (was: Fix unused variable to get CPU frequency on Windows systems) Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1 - Key: YARN-3617 URL: https://issues.apache.org/jira/browse/YARN-3617 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.0 Environment: Windows 7 x64 SP1 Reporter: Georg Berendt Assignee: J.Andreina Priority: Minor Attachments: YARN-3617.1.patch Original Estimate: 1h Remaining Estimate: 1h In the class 'WindowsResourceCalculatorPlugin.java' of the YARN project, there is an unused variable for CPU frequency. /** {@inheritDoc} */ @Override public long getCpuFrequency() { refreshIfNeeded(); return -1; } Please change '-1' to use 'cpuFrequencyKhz'. org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3617) Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1
[ https://issues.apache.org/jira/browse/YARN-3617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-3617: Component/s: (was: yarn) Flags: (was: Patch) Hadoop Flags: Reviewed +1, will commit it shortly. Fix WindowsResourceCalculatorPlugin.getCpuFrequency() returning always -1 - Key: YARN-3617 URL: https://issues.apache.org/jira/browse/YARN-3617 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.0 Environment: Windows 7 x64 SP1 Reporter: Georg Berendt Assignee: J.Andreina Priority: Minor Attachments: YARN-3617.1.patch Original Estimate: 1h Remaining Estimate: 1h In the class 'WindowsResourceCalculatorPlugin.java' of the YARN project, there is an unused variable for CPU frequency. /** {@inheritDoc} */ @Override public long getCpuFrequency() { refreshIfNeeded(); return -1; } Please change '-1' to use 'cpuFrequencyKhz'. org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java -- This message was sent by Atlassian JIRA (v6.3.4#6332)