[jira] [Commented] (YARN-3459) TestLog4jWarningErrorMetricsAppender breaks in trunk
[ https://issues.apache.org/jira/browse/YARN-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484798#comment-14484798 ]

Hadoop QA commented on YARN-3459:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12723835/apache-yarn-3459.0.patch
against trunk revision ab04ff9.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7251//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7251//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7251//console

This message is automatically generated.

TestLog4jWarningErrorMetricsAppender breaks in trunk
----------------------------------------------------

Key: YARN-3459
URL: https://issues.apache.org/jira/browse/YARN-3459
Project: Hadoop YARN
Issue Type: Bug
Reporter: Li Lu
Assignee: Li Lu
Priority: Blocker
Fix For: 2.7.0
Attachments: apache-yarn-3459.0.patch

TestLog4jWarningErrorMetricsAppender fails with the following message:
{code}
Running org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender
Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.214 sec <<< FAILURE! - in org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender
testPurge(org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender)  Time elapsed: 2.01 sec  <<< FAILURE!
java.lang.AssertionError: expected:<0> but was:<1>
    at org.junit.Assert.fail(Assert.java:88)
    at org.junit.Assert.failNotEquals(Assert.java:743)
    at org.junit.Assert.assertEquals(Assert.java:118)
    at org.junit.Assert.assertEquals(Assert.java:555)
    at org.junit.Assert.assertEquals(Assert.java:542)
    at org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender.testPurge(TestLog4jWarningErrorMetricsAppender.java:89)
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
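For context on why a purge test like this can fail intermittently: the assertion {{expected:<0> but was:<1>}} means one buffered message survived the purge window. A minimal JUnit sketch of the timing-sensitive pattern (illustrative names only, not the actual TestLog4jWarningErrorMetricsAppender code):
{code}
import static org.junit.Assert.assertEquals;

import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicInteger;

import org.junit.Test;

public class PurgeTimingSketch {

  @Test
  public void testPurge() throws Exception {
    final AtomicInteger messages = new AtomicInteger();
    messages.incrementAndGet(); // one "error" message captured

    // Background purge, like an appender's cleanup task: clears the
    // counter 1 second from now.
    Timer timer = new Timer(true);
    timer.schedule(new TimerTask() {
      @Override
      public void run() {
        messages.set(0);
      }
    }, 1000);

    // Sleeping "just past" the purge interval is the fragile part: if the
    // timer thread fires late, the count is still 1 and the assert fails
    // with "expected:<0> but was:<1>", as in the report above.
    Thread.sleep(1100);
    assertEquals(0, messages.get());
  }
}
{code}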
[jira] [Updated] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage
[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhijie Shen updated YARN-3391:
------------------------------
    Attachment: YARN-3391.4.patch

Clearly define flow ID/ flow run / flow version in API and storage
-------------------------------------------------------------------

Key: YARN-3391
URL: https://issues.apache.org/jira/browse/YARN-3391
Project: Hadoop YARN
Issue Type: Sub-task
Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen
Attachments: YARN-3391.1.patch, YARN-3391.2.patch, YARN-3391.3.patch, YARN-3391.4.patch

To continue the discussion in YARN-3040, let's figure out the best way to describe the flow. Some key issues that we need to conclude on:
- How do we include the flow version in the context so that it gets passed into the collector and to the storage eventually?
- Should the flow run id be a number as opposed to a generic string?
- Default behavior for the flow run id if it is missing (i.e. the client did not set it)
- How do we handle flow attributes in case of nested levels of flows?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3457) NPE when NodeManager.serviceInit fails and stopRecoveryStore called
[ https://issues.apache.org/jira/browse/YARN-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484813#comment-14484813 ]

Hadoop QA commented on YARN-3457:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12723815/YARN-3457.001.patch
against trunk revision ab04ff9.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7252//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7252//console

This message is automatically generated.

NPE when NodeManager.serviceInit fails and stopRecoveryStore called
-------------------------------------------------------------------

Key: YARN-3457
URL: https://issues.apache.org/jira/browse/YARN-3457
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor
Attachments: YARN-3457.001.patch

When NodeManager service init fails, a null pointer exception is thrown during stopRecoveryStore:
{code}
@Override
protected void serviceInit(Configuration conf) throws Exception {
  ..
  try {
    exec.init();
  } catch (IOException e) {
    throw new YarnRuntimeException("Failed to initialize container executor", e);
  }
  this.context = createNMContext(containerTokenSecretManager,
      nmTokenSecretManager, nmStore);
{code}
context is null when service init fails:
{code}
private void stopRecoveryStore() throws IOException {
  nmStore.stop();
  if (context.getDecommissioned() && nmStore.canRecover()) {
    ..
  }
}
{code}
Null pointer exception thrown:
{quote}
2015-04-07 17:31:45,807 WARN org.apache.hadoop.service.AbstractService: When stopping the service NodeManager : java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:168)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:280)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:484)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:534)
{quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
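A minimal sketch of the kind of null guard such a fix needs, reusing the method and field names from the snippets above (the actual YARN-3457.001.patch may structure it differently):
{code}
private void stopRecoveryStore() throws IOException {
  if (nmStore == null) {
    // serviceInit failed before the recovery store was created
    return;
  }
  nmStore.stop();
  // context is assigned late in serviceInit, so it can still be null here
  // even when nmStore exists.
  if (context != null && context.getDecommissioned() && nmStore.canRecover()) {
    ..
  }
}
{code}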
[jira] [Commented] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage
[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484815#comment-14484815 ]

Zhijie Shen commented on YARN-3391:
-----------------------------------

I created a new patch:

bq. So in general, I think we should use as much javadoc comments instead of inline comments for public APIs.

Move the comments into TimelineUtils and make them javadoc.

bq. We should add more info to LOG.warn messages, at least to tell user flow run should be numeric.

Improve the warn message.

bq. In addition, do we need to check negative value for flow run here?

According to Sangjin's given example, we usually want to identify a flow run by timestamp, which theoretically can be negative to represent sometime before 1970.

Clearly define flow ID/ flow run / flow version in API and storage
-------------------------------------------------------------------

Key: YARN-3391
URL: https://issues.apache.org/jira/browse/YARN-3391
Project: Hadoop YARN
Issue Type: Sub-task
Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen
Attachments: YARN-3391.1.patch, YARN-3391.2.patch, YARN-3391.3.patch, YARN-3391.4.patch

To continue the discussion in YARN-3040, let's figure out the best way to describe the flow. Some key issues that we need to conclude on:
- How do we include the flow version in the context so that it gets passed into the collector and to the storage eventually?
- Should the flow run id be a number as opposed to a generic string?
- Default behavior for the flow run id if it is missing (i.e. the client did not set it)
- How do we handle flow attributes in case of nested levels of flows?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
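As a sketch of the numeric-validation behavior being discussed (the method name, LOG field, and rethrow behavior are illustrative assumptions, not the actual TimelineUtils API):
{code}
public static long parseFlowRunId(String flowRunIdStr) {
  try {
    // Negative values are deliberately accepted: a flow run is typically
    // identified by a timestamp, which can in principle be negative for
    // instants before 1970.
    return Long.parseLong(flowRunIdStr);
  } catch (NumberFormatException e) {
    // Tell the user explicitly that the flow run id must be numeric.
    LOG.warn("Flow run id should be a numeric value (e.g. a timestamp), but got: "
        + flowRunIdStr);
    throw e;
  }
}
{code}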
[jira] [Commented] (YARN-3457) NPE when NodeManager.serviceInit fails and stopRecoveryStore called
[ https://issues.apache.org/jira/browse/YARN-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484845#comment-14484845 ]

Tsuyoshi Ozawa commented on YARN-3457:
--------------------------------------

+1, committing this shortly.

NPE when NodeManager.serviceInit fails and stopRecoveryStore called
-------------------------------------------------------------------

Key: YARN-3457
URL: https://issues.apache.org/jira/browse/YARN-3457
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor
Attachments: YARN-3457.001.patch

When NodeManager service init fails, a null pointer exception is thrown during stopRecoveryStore:
{code}
@Override
protected void serviceInit(Configuration conf) throws Exception {
  ..
  try {
    exec.init();
  } catch (IOException e) {
    throw new YarnRuntimeException("Failed to initialize container executor", e);
  }
  this.context = createNMContext(containerTokenSecretManager,
      nmTokenSecretManager, nmStore);
{code}
context is null when service init fails:
{code}
private void stopRecoveryStore() throws IOException {
  nmStore.stop();
  if (context.getDecommissioned() && nmStore.canRecover()) {
    ..
  }
}
{code}
Null pointer exception thrown:
{quote}
2015-04-07 17:31:45,807 WARN org.apache.hadoop.service.AbstractService: When stopping the service NodeManager : java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:168)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:280)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:484)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:534)
{quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3225) New parameter or CLI for decommissioning node gracefully in RMAdmin CLI
[ https://issues.apache.org/jira/browse/YARN-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485071#comment-14485071 ]

Devaraj K commented on YARN-3225:
---------------------------------

Thanks [~djp] for your review.

bq. I think we should support a case where the Admin wants nodes to get decommissioned whenever all apps on these nodes finish. If so, shall we support a negative value (any one, or some special one, like -1) to specify this case?

If the user wants to achieve this, they can give some larger timeout value and wait for all nodes to get decommissioned gracefully (without forcing). Do we really need to provide special handling for this case?

bq. For NORMAL, shall we use "Decommission nodes in normal (old) way" instead, or something simpler - "Decommission nodes"?

I feel "Decommission nodes in normal way" would be ok, no need to mention the 'old' term. What is your opinion on this?

bq. IMO, the methods inside a class shouldn't be more public than the class itself? If we don't expect other projects to use the class, we always don't expect some methods to get used. The same problem happens in an old API, RefreshNodeRequest.java. I think we may need to fix both?

I agree, I will fix both of them.

bq. Why do we need this change? recordFactory.newRecordInstance(RefreshNodesRequest.class) will return something with DecommissionType.NORMAL as default. No?

It will not make any difference because NORMAL is the default. I made this change to make it consistent with the other decommission types.

New parameter or CLI for decommissioning node gracefully in RMAdmin CLI
------------------------------------------------------------------------

Key: YARN-3225
URL: https://issues.apache.org/jira/browse/YARN-3225
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Junping Du
Assignee: Devaraj K
Attachments: YARN-3225-1.patch, YARN-3225-2.patch, YARN-3225-3.patch, YARN-3225.patch, YARN-914.patch

A new CLI (or an existing CLI with parameters) should put each node on the decommission list into decommissioning status and track a timeout, terminating the nodes that haven't finished by then.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
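To make the discussion concrete, a rough sketch of how a graceful option might map onto the request types mentioned above (RefreshNodesRequest, DecommissionType, recordFactory); the setter name and the CLI wiring are assumptions, not the final patch:
{code}
RefreshNodesRequest request =
    recordFactory.newRecordInstance(RefreshNodesRequest.class);
if (graceful) {
  // Nodes drain their running apps first; a tracked timeout decides when
  // any stragglers are decommissioned forcefully.
  request.setDecommissionType(DecommissionType.GRACEFUL);
} else {
  // NORMAL is already the default; setting it explicitly just keeps the
  // call sites consistent across decommission types, as argued above.
  request.setDecommissionType(DecommissionType.NORMAL);
}
adminProtocol.refreshNodes(request);
{code}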
[jira] [Commented] (YARN-3462) Patches applied for YARN-2424 are inconsistent between trunk and branch-2
[ https://issues.apache.org/jira/browse/YARN-3462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485151#comment-14485151 ]

Naganarasimha G R commented on YARN-3462:
-----------------------------------------

[~qwertymaniac]/[~aw] Can you guys take a look at this patch?

Patches applied for YARN-2424 are inconsistent between trunk and branch-2
--------------------------------------------------------------------------

Key: YARN-3462
URL: https://issues.apache.org/jira/browse/YARN-3462
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Sidharta Seethana
Assignee: Naganarasimha G R
Attachments: YARN-3462.20150508-1.patch

It looks like the changes for YARN-2424 are not the same for trunk (commit 7e75226e68715c3eca9d346c8eaf2f265aa70d23) and branch-2 (commit 5d965f2f3cf97a87603720948aacd4f7877d73c4). Branch-2 has a missing warning and the documentation is a bit different as well.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3110) Few issues in ApplicationHistory web ui
[ https://issues.apache.org/jira/browse/YARN-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485118#comment-14485118 ]

Hudson commented on YARN-3110:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #148 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/148/])
YARN-3110. Few issues in ApplicationHistory web ui. Contributed by Naganarasimha G R (xgong: rev 19a4feaf6fcf42ebbfe98b8a7153ade96d37fb14)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppAttemptBlock.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java

Few issues in ApplicationHistory web ui
---------------------------------------

Key: YARN-3110
URL: https://issues.apache.org/jira/browse/YARN-3110
Project: Hadoop YARN
Issue Type: Sub-task
Components: applications, timelineserver
Affects Versions: 2.6.0
Reporter: Bibin A Chundatt
Assignee: Naganarasimha G R
Priority: Minor
Fix For: 2.8.0
Attachments: YARN-3110.20150209-1.patch, YARN-3110.20150315-1.patch, YARN-3110.20150406-1.patch

Application state and History link are wrong when the application is in unassigned state.
1. Configure the capacity scheduler with queue size as 1 and Absolute Max Capacity: 10.0% (current application state is Accepted and Unassigned on the resource manager side).
2. Submit an application to the queue and check the state and link in Application history.
State = null and the History link is shown as N/A in the applicationhistory page.
Kill the same application. In the timeline server logs, the below is shown when selecting the application link.
{quote}
2015-01-29 15:39:50,956 ERROR org.apache.hadoop.yarn.webapp.View: Failed to read the AM container of the application attempt appattempt_1422467063659_0007_01.
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getContainer(ApplicationHistoryManagerOnTimelineStore.java:162)
    at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getAMContainer(ApplicationHistoryManagerOnTimelineStore.java:184)
    at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:160)
    at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:157)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.yarn.server.webapp.AppBlock.render(AppBlock.java:156)
    at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:67)
    at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77)
    at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
    at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
    at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117)
    at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845)
    at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56)
    at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
    at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
    at org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSController.app(AHSController.java:38)
    at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
    at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
    at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
    at
{quote}
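Judging from the stack trace, the fix has to tolerate attempts that never received an AM container. A hedged sketch of that defensive shape (the rendering call and helper are illustrative, not the actual AppBlock code):
{code}
// getContainer may return null when the app was never assigned an AM container.
Container amContainer = getContainer(containerId, app);
if (amContainer == null) {
  // An ACCEPTED/unassigned application has no AM container yet; show a
  // placeholder instead of letting the whole page die with an NPE.
  html.p()._("No AM container information available for this attempt.")._();
} else {
  // render the AM container link as before
  ..
}
{code}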
[jira] [Commented] (YARN-3294) Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period
[ https://issues.apache.org/jira/browse/YARN-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485122#comment-14485122 ]

Hudson commented on YARN-3294:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #148 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/148/])
YARN-3294. Allow dumping of Capacity Scheduler debug logs via web UI for (xgong: rev d27e9241e8676a0edb2d35453cac5f9495fcd605)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/AdHocLogDumper.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestAdHocLogDumper.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* hadoop-yarn-project/CHANGES.txt

Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period
---------------------------------------------------------------------------------

Key: YARN-3294
URL: https://issues.apache.org/jira/browse/YARN-3294
Project: Hadoop YARN
Issue Type: Improvement
Components: capacityscheduler
Reporter: Varun Vasudev
Assignee: Varun Vasudev
Fix For: 2.8.0
Attachments: Screen Shot 2015-03-12 at 8.51.25 PM.png, apache-yarn-3294.0.patch, apache-yarn-3294.1.patch, apache-yarn-3294.2.patch, apache-yarn-3294.3.patch, apache-yarn-3294.4.patch

It would be nice to have a button on the web UI that would allow dumping of debug logs for just the capacity scheduler for a fixed period of time (1 min, 5 min or so) in a separate log file. It would be useful when debugging scheduler behavior without affecting the rest of the resourcemanager.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
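An illustrative log4j sketch of "ad hoc" debug dumping in the spirit of the AdHocLogDumper added here: raise one logger to DEBUG, write to a separate file, and restore the old level after a fixed period. This is an assumption about the mechanism, not the actual AdHocLogDumper code:
{code}
import java.util.Timer;
import java.util.TimerTask;

import org.apache.log4j.FileAppender;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;

public class AdHocDebugSketch {
  public static void dumpFor(String loggerName, String file, long millis)
      throws Exception {
    final Logger logger = Logger.getLogger(loggerName);
    final Level oldLevel = logger.getLevel();
    final FileAppender appender =
        new FileAppender(new PatternLayout("%d %p %c: %m%n"), file);
    appender.setThreshold(Level.DEBUG);
    logger.addAppender(appender);
    logger.setLevel(Level.DEBUG);

    // Restore everything once the requested window elapses, so the extra
    // logging cannot affect the ResourceManager indefinitely.
    new Timer(true).schedule(new TimerTask() {
      @Override
      public void run() {
        logger.setLevel(oldLevel);
        logger.removeAppender(appender);
        appender.close();
      }
    }, millis);
  }
}
{code}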
[jira] [Commented] (YARN-3429) TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
[ https://issues.apache.org/jira/browse/YARN-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485127#comment-14485127 ]

Hudson commented on YARN-3429:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #148 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/148/])
YARN-3429. Fix incorrect CHANGES.txt (rkanter: rev 5b8a3ae366294aec492f69f1a429aa7fce5d13be)
* hadoop-yarn-project/CHANGES.txt

TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
----------------------------------------------------------------------------------------

Key: YARN-3429
URL: https://issues.apache.org/jira/browse/YARN-3429
Project: Hadoop YARN
Issue Type: Bug
Components: test
Reporter: zhihai xu
Assignee: zhihai xu
Fix For: 2.8.0
Attachments: YARN-3429.000.patch

TestAMRMTokens.testTokenExpiry fails intermittently with the error message "Invalid AMRMToken from appattempt_1427804754787_0001_01". The error logs are at https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3457) NPE when NodeManager.serviceInit fails and stopRecoveryStore called
[ https://issues.apache.org/jira/browse/YARN-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485120#comment-14485120 ]

Hudson commented on YARN-3457:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #148 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/148/])
YARN-3457. NPE when NodeManager.serviceInit fails and stopRecoveryStore called. Contributed by Bibin A Chundatt. (ozawa: rev dd852f5b8c8fe9e52d15987605f36b5b60f02701)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* hadoop-yarn-project/CHANGES.txt

NPE when NodeManager.serviceInit fails and stopRecoveryStore called
-------------------------------------------------------------------

Key: YARN-3457
URL: https://issues.apache.org/jira/browse/YARN-3457
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor
Fix For: 2.8.0
Attachments: YARN-3457.001.patch

When NodeManager service init fails, a null pointer exception is thrown during stopRecoveryStore:
{code}
@Override
protected void serviceInit(Configuration conf) throws Exception {
  ..
  try {
    exec.init();
  } catch (IOException e) {
    throw new YarnRuntimeException("Failed to initialize container executor", e);
  }
  this.context = createNMContext(containerTokenSecretManager,
      nmTokenSecretManager, nmStore);
{code}
context is null when service init fails:
{code}
private void stopRecoveryStore() throws IOException {
  nmStore.stop();
  if (context.getDecommissioned() && nmStore.canRecover()) {
    ..
  }
}
{code}
Null pointer exception thrown:
{quote}
2015-04-07 17:31:45,807 WARN org.apache.hadoop.service.AbstractService: When stopping the service NodeManager : java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:168)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:280)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:484)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:534)
{quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3429) TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
[ https://issues.apache.org/jira/browse/YARN-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485114#comment-14485114 ]

Hudson commented on YARN-3429:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #2089 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2089/])
YARN-3429. Fix incorrect CHANGES.txt (rkanter: rev 5b8a3ae366294aec492f69f1a429aa7fce5d13be)
* hadoop-yarn-project/CHANGES.txt

TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
----------------------------------------------------------------------------------------

Key: YARN-3429
URL: https://issues.apache.org/jira/browse/YARN-3429
Project: Hadoop YARN
Issue Type: Bug
Components: test
Reporter: zhihai xu
Assignee: zhihai xu
Fix For: 2.8.0
Attachments: YARN-3429.000.patch

TestAMRMTokens.testTokenExpiry fails intermittently with the error message "Invalid AMRMToken from appattempt_1427804754787_0001_01". The error logs are at https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3110) Few issues in ApplicationHistory web ui
[ https://issues.apache.org/jira/browse/YARN-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485045#comment-14485045 ]

Hudson commented on YARN-3110:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #157 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/157/])
YARN-3110. Few issues in ApplicationHistory web ui. Contributed by Naganarasimha G R (xgong: rev 19a4feaf6fcf42ebbfe98b8a7153ade96d37fb14)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppAttemptBlock.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java

Few issues in ApplicationHistory web ui
---------------------------------------

Key: YARN-3110
URL: https://issues.apache.org/jira/browse/YARN-3110
Project: Hadoop YARN
Issue Type: Sub-task
Components: applications, timelineserver
Affects Versions: 2.6.0
Reporter: Bibin A Chundatt
Assignee: Naganarasimha G R
Priority: Minor
Fix For: 2.8.0
Attachments: YARN-3110.20150209-1.patch, YARN-3110.20150315-1.patch, YARN-3110.20150406-1.patch

Application state and History link are wrong when the application is in unassigned state.
1. Configure the capacity scheduler with queue size as 1 and Absolute Max Capacity: 10.0% (current application state is Accepted and Unassigned on the resource manager side).
2. Submit an application to the queue and check the state and link in Application history.
State = null and the History link is shown as N/A in the applicationhistory page.
Kill the same application. In the timeline server logs, the below is shown when selecting the application link.
{quote}
2015-01-29 15:39:50,956 ERROR org.apache.hadoop.yarn.webapp.View: Failed to read the AM container of the application attempt appattempt_1422467063659_0007_01.
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getContainer(ApplicationHistoryManagerOnTimelineStore.java:162)
    at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getAMContainer(ApplicationHistoryManagerOnTimelineStore.java:184)
    at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:160)
    at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:157)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.yarn.server.webapp.AppBlock.render(AppBlock.java:156)
    at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:67)
    at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77)
    at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
    at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
    at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117)
    at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845)
    at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56)
    at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
    at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
    at org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSController.app(AHSController.java:38)
    at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
    at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
    at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
    at
{quote}
[jira] [Commented] (YARN-3294) Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period
[ https://issues.apache.org/jira/browse/YARN-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485049#comment-14485049 ]

Hudson commented on YARN-3294:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #157 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/157/])
YARN-3294. Allow dumping of Capacity Scheduler debug logs via web UI for (xgong: rev d27e9241e8676a0edb2d35453cac5f9495fcd605)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestAdHocLogDumper.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/AdHocLogDumper.java

Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period
---------------------------------------------------------------------------------

Key: YARN-3294
URL: https://issues.apache.org/jira/browse/YARN-3294
Project: Hadoop YARN
Issue Type: Improvement
Components: capacityscheduler
Reporter: Varun Vasudev
Assignee: Varun Vasudev
Fix For: 2.8.0
Attachments: Screen Shot 2015-03-12 at 8.51.25 PM.png, apache-yarn-3294.0.patch, apache-yarn-3294.1.patch, apache-yarn-3294.2.patch, apache-yarn-3294.3.patch, apache-yarn-3294.4.patch

It would be nice to have a button on the web UI that would allow dumping of debug logs for just the capacity scheduler for a fixed period of time (1 min, 5 min or so) in a separate log file. It would be useful when debugging scheduler behavior without affecting the rest of the resourcemanager.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3457) NPE when NodeManager.serviceInit fails and stopRecoveryStore called
[ https://issues.apache.org/jira/browse/YARN-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485047#comment-14485047 ]

Hudson commented on YARN-3457:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #157 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/157/])
YARN-3457. NPE when NodeManager.serviceInit fails and stopRecoveryStore called. Contributed by Bibin A Chundatt. (ozawa: rev dd852f5b8c8fe9e52d15987605f36b5b60f02701)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java

NPE when NodeManager.serviceInit fails and stopRecoveryStore called
-------------------------------------------------------------------

Key: YARN-3457
URL: https://issues.apache.org/jira/browse/YARN-3457
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor
Fix For: 2.8.0
Attachments: YARN-3457.001.patch

When NodeManager service init fails, a null pointer exception is thrown during stopRecoveryStore:
{code}
@Override
protected void serviceInit(Configuration conf) throws Exception {
  ..
  try {
    exec.init();
  } catch (IOException e) {
    throw new YarnRuntimeException("Failed to initialize container executor", e);
  }
  this.context = createNMContext(containerTokenSecretManager,
      nmTokenSecretManager, nmStore);
{code}
context is null when service init fails:
{code}
private void stopRecoveryStore() throws IOException {
  nmStore.stop();
  if (context.getDecommissioned() && nmStore.canRecover()) {
    ..
  }
}
{code}
Null pointer exception thrown:
{quote}
2015-04-07 17:31:45,807 WARN org.apache.hadoop.service.AbstractService: When stopping the service NodeManager : java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:168)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:280)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:484)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:534)
{quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3429) TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
[ https://issues.apache.org/jira/browse/YARN-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485054#comment-14485054 ]

Hudson commented on YARN-3429:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #157 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/157/])
YARN-3429. Fix incorrect CHANGES.txt (rkanter: rev 5b8a3ae366294aec492f69f1a429aa7fce5d13be)
* hadoop-yarn-project/CHANGES.txt

TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
----------------------------------------------------------------------------------------

Key: YARN-3429
URL: https://issues.apache.org/jira/browse/YARN-3429
Project: Hadoop YARN
Issue Type: Bug
Components: test
Reporter: zhihai xu
Assignee: zhihai xu
Fix For: 2.8.0
Attachments: YARN-3429.000.patch

TestAMRMTokens.testTokenExpiry fails intermittently with the error message "Invalid AMRMToken from appattempt_1427804754787_0001_01". The error logs are at https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3294) Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period
[ https://issues.apache.org/jira/browse/YARN-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485142#comment-14485142 ]

Hudson commented on YARN-3294:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #891 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/891/])
YARN-3294. Allow dumping of Capacity Scheduler debug logs via web UI for (xgong: rev d27e9241e8676a0edb2d35453cac5f9495fcd605)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestAdHocLogDumper.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/AdHocLogDumper.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* hadoop-yarn-project/CHANGES.txt

Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period
---------------------------------------------------------------------------------

Key: YARN-3294
URL: https://issues.apache.org/jira/browse/YARN-3294
Project: Hadoop YARN
Issue Type: Improvement
Components: capacityscheduler
Reporter: Varun Vasudev
Assignee: Varun Vasudev
Fix For: 2.8.0
Attachments: Screen Shot 2015-03-12 at 8.51.25 PM.png, apache-yarn-3294.0.patch, apache-yarn-3294.1.patch, apache-yarn-3294.2.patch, apache-yarn-3294.3.patch, apache-yarn-3294.4.patch

It would be nice to have a button on the web UI that would allow dumping of debug logs for just the capacity scheduler for a fixed period of time (1 min, 5 min or so) in a separate log file. It would be useful when debugging scheduler behavior without affecting the rest of the resourcemanager.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3110) Few issues in ApplicationHistory web ui
[ https://issues.apache.org/jira/browse/YARN-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485138#comment-14485138 ]

Hudson commented on YARN-3110:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #891 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/891/])
YARN-3110. Few issues in ApplicationHistory web ui. Contributed by Naganarasimha G R (xgong: rev 19a4feaf6fcf42ebbfe98b8a7153ade96d37fb14)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppAttemptBlock.java

Few issues in ApplicationHistory web ui
---------------------------------------

Key: YARN-3110
URL: https://issues.apache.org/jira/browse/YARN-3110
Project: Hadoop YARN
Issue Type: Sub-task
Components: applications, timelineserver
Affects Versions: 2.6.0
Reporter: Bibin A Chundatt
Assignee: Naganarasimha G R
Priority: Minor
Fix For: 2.8.0
Attachments: YARN-3110.20150209-1.patch, YARN-3110.20150315-1.patch, YARN-3110.20150406-1.patch

Application state and History link are wrong when the application is in unassigned state.
1. Configure the capacity scheduler with queue size as 1 and Absolute Max Capacity: 10.0% (current application state is Accepted and Unassigned on the resource manager side).
2. Submit an application to the queue and check the state and link in Application history.
State = null and the History link is shown as N/A in the applicationhistory page.
Kill the same application. In the timeline server logs, the below is shown when selecting the application link.
{quote}
2015-01-29 15:39:50,956 ERROR org.apache.hadoop.yarn.webapp.View: Failed to read the AM container of the application attempt appattempt_1422467063659_0007_01.
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getContainer(ApplicationHistoryManagerOnTimelineStore.java:162)
    at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getAMContainer(ApplicationHistoryManagerOnTimelineStore.java:184)
    at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:160)
    at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:157)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.yarn.server.webapp.AppBlock.render(AppBlock.java:156)
    at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:67)
    at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77)
    at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
    at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
    at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117)
    at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845)
    at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56)
    at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
    at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
    at org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSController.app(AHSController.java:38)
    at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
    at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
    at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
    at
{quote}
[jira] [Commented] (YARN-3457) NPE when NodeManager.serviceInit fails and stopRecoveryStore called
[ https://issues.apache.org/jira/browse/YARN-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485140#comment-14485140 ]

Hudson commented on YARN-3457:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #891 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/891/])
YARN-3457. NPE when NodeManager.serviceInit fails and stopRecoveryStore called. Contributed by Bibin A Chundatt. (ozawa: rev dd852f5b8c8fe9e52d15987605f36b5b60f02701)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* hadoop-yarn-project/CHANGES.txt

NPE when NodeManager.serviceInit fails and stopRecoveryStore called
-------------------------------------------------------------------

Key: YARN-3457
URL: https://issues.apache.org/jira/browse/YARN-3457
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor
Fix For: 2.8.0
Attachments: YARN-3457.001.patch

When NodeManager service init fails, a null pointer exception is thrown during stopRecoveryStore:
{code}
@Override
protected void serviceInit(Configuration conf) throws Exception {
  ..
  try {
    exec.init();
  } catch (IOException e) {
    throw new YarnRuntimeException("Failed to initialize container executor", e);
  }
  this.context = createNMContext(containerTokenSecretManager,
      nmTokenSecretManager, nmStore);
{code}
context is null when service init fails:
{code}
private void stopRecoveryStore() throws IOException {
  nmStore.stop();
  if (context.getDecommissioned() && nmStore.canRecover()) {
    ..
  }
}
{code}
Null pointer exception thrown:
{quote}
2015-04-07 17:31:45,807 WARN org.apache.hadoop.service.AbstractService: When stopping the service NodeManager : java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:168)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:280)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:484)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:534)
{quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3429) TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
[ https://issues.apache.org/jira/browse/YARN-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485147#comment-14485147 ]

Hudson commented on YARN-3429:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #891 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/891/])
YARN-3429. Fix incorrect CHANGES.txt (rkanter: rev 5b8a3ae366294aec492f69f1a429aa7fce5d13be)
* hadoop-yarn-project/CHANGES.txt

TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
----------------------------------------------------------------------------------------

Key: YARN-3429
URL: https://issues.apache.org/jira/browse/YARN-3429
Project: Hadoop YARN
Issue Type: Bug
Components: test
Reporter: zhihai xu
Assignee: zhihai xu
Fix For: 2.8.0
Attachments: YARN-3429.000.patch

TestAMRMTokens.testTokenExpiry fails intermittently with the error message "Invalid AMRMToken from appattempt_1427804754787_0001_01". The error logs are at https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3110) Few issues in ApplicationHistory web ui
[ https://issues.apache.org/jira/browse/YARN-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485105#comment-14485105 ]

Hudson commented on YARN-3110:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #2089 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2089/])
YARN-3110. Few issues in ApplicationHistory web ui. Contributed by Naganarasimha G R (xgong: rev 19a4feaf6fcf42ebbfe98b8a7153ade96d37fb14)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppAttemptBlock.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java

Few issues in ApplicationHistory web ui
---------------------------------------

Key: YARN-3110
URL: https://issues.apache.org/jira/browse/YARN-3110
Project: Hadoop YARN
Issue Type: Sub-task
Components: applications, timelineserver
Affects Versions: 2.6.0
Reporter: Bibin A Chundatt
Assignee: Naganarasimha G R
Priority: Minor
Fix For: 2.8.0
Attachments: YARN-3110.20150209-1.patch, YARN-3110.20150315-1.patch, YARN-3110.20150406-1.patch

Application state and History link are wrong when the application is in unassigned state.
1. Configure the capacity scheduler with queue size as 1 and Absolute Max Capacity: 10.0% (current application state is Accepted and Unassigned on the resource manager side).
2. Submit an application to the queue and check the state and link in Application history.
State = null and the History link is shown as N/A in the applicationhistory page.
Kill the same application. In the timeline server logs, the below is shown when selecting the application link.
{quote}
2015-01-29 15:39:50,956 ERROR org.apache.hadoop.yarn.webapp.View: Failed to read the AM container of the application attempt appattempt_1422467063659_0007_01.
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getContainer(ApplicationHistoryManagerOnTimelineStore.java:162)
    at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getAMContainer(ApplicationHistoryManagerOnTimelineStore.java:184)
    at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:160)
    at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:157)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.yarn.server.webapp.AppBlock.render(AppBlock.java:156)
    at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:67)
    at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77)
    at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
    at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
    at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117)
    at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845)
    at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56)
    at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
    at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
    at org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSController.app(AHSController.java:38)
    at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
    at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
    at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
    at
{quote}
[jira] [Commented] (YARN-3457) NPE when NodeManager.serviceInit fails and stopRecoveryStore called
[ https://issues.apache.org/jira/browse/YARN-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485107#comment-14485107 ]

Hudson commented on YARN-3457:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #2089 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2089/])
YARN-3457. NPE when NodeManager.serviceInit fails and stopRecoveryStore called. Contributed by Bibin A Chundatt. (ozawa: rev dd852f5b8c8fe9e52d15987605f36b5b60f02701)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java

NPE when NodeManager.serviceInit fails and stopRecoveryStore called
-------------------------------------------------------------------

Key: YARN-3457
URL: https://issues.apache.org/jira/browse/YARN-3457
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor
Fix For: 2.8.0
Attachments: YARN-3457.001.patch

When NodeManager service init fails, a null pointer exception is thrown during stopRecoveryStore:
{code}
@Override
protected void serviceInit(Configuration conf) throws Exception {
  ..
  try {
    exec.init();
  } catch (IOException e) {
    throw new YarnRuntimeException("Failed to initialize container executor", e);
  }
  this.context = createNMContext(containerTokenSecretManager,
      nmTokenSecretManager, nmStore);
{code}
context is null when service init fails:
{code}
private void stopRecoveryStore() throws IOException {
  nmStore.stop();
  if (context.getDecommissioned() && nmStore.canRecover()) {
    ..
  }
}
{code}
Null pointer exception thrown:
{quote}
2015-04-07 17:31:45,807 WARN org.apache.hadoop.service.AbstractService: When stopping the service NodeManager : java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:168)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:280)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:484)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:534)
{quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3294) Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period
[ https://issues.apache.org/jira/browse/YARN-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485109#comment-14485109 ]

Hudson commented on YARN-3294:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #2089 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2089/])
YARN-3294. Allow dumping of Capacity Scheduler debug logs via web UI for (xgong: rev d27e9241e8676a0edb2d35453cac5f9495fcd605)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestAdHocLogDumper.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/AdHocLogDumper.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java
* hadoop-yarn-project/CHANGES.txt

Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period
---------------------------------------------------------------------------------

Key: YARN-3294
URL: https://issues.apache.org/jira/browse/YARN-3294
Project: Hadoop YARN
Issue Type: Improvement
Components: capacityscheduler
Reporter: Varun Vasudev
Assignee: Varun Vasudev
Fix For: 2.8.0
Attachments: Screen Shot 2015-03-12 at 8.51.25 PM.png, apache-yarn-3294.0.patch, apache-yarn-3294.1.patch, apache-yarn-3294.2.patch, apache-yarn-3294.3.patch, apache-yarn-3294.4.patch

It would be nice to have a button on the web UI that would allow dumping of debug logs for just the capacity scheduler for a fixed period of time (1 min, 5 min or so) in a separate log file. It would be useful when debugging scheduler behavior without affecting the rest of the resourcemanager.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3326) ReST support for getLabelsToNodes
[ https://issues.apache.org/jira/browse/YARN-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485159#comment-14485159 ]

Naganarasimha G R commented on YARN-3326:
-----------------------------------------

Hi [~ozawa],
The test case failure is not related to this issue, and a separate JIRA has already been raised for it (YARN-2871).

ReST support for getLabelsToNodes
---------------------------------

Key: YARN-3326
URL: https://issues.apache.org/jira/browse/YARN-3326
Project: Hadoop YARN
Issue Type: Sub-task
Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R
Priority: Minor
Attachments: YARN-3326.20150310-1.patch, YARN-3326.20150407-1.patch, YARN-3326.20150408-1.patch

REST support to retrieve the LabelsToNodes mapping.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3457) NPE when NodeManager.serviceInit fails and stopRecoveryStore called
[ https://issues.apache.org/jira/browse/YARN-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484931#comment-14484931 ] Bibin A Chundatt commented on YARN-3457: Thank you [~ozawa] for checking and committing the patch. NPE when NodeManager.serviceInit fails and stopRecoveryStore called --- Key: YARN-3457 URL: https://issues.apache.org/jira/browse/YARN-3457 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Fix For: 2.8.0 Attachments: YARN-3457.001.patch When NodeManager service init fails, a null pointer exception is thrown during stopRecoveryStore
{code}
@Override
protected void serviceInit(Configuration conf) throws Exception {
  ..
  try {
    exec.init();
  } catch (IOException e) {
    throw new YarnRuntimeException("Failed to initialize container executor", e);
  }
  this.context = createNMContext(containerTokenSecretManager,
      nmTokenSecretManager, nmStore);
{code}
context is null when service init fails
{code}
private void stopRecoveryStore() throws IOException {
  nmStore.stop();
  if (context.getDecommissioned() && nmStore.canRecover()) {
    ..
  }
}
{code}
Null pointer exception thrown {quote} 2015-04-07 17:31:45,807 WARN org.apache.hadoop.service.AbstractService: When stopping the service NodeManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:168) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:280) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:484) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:534) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3464: Description: Race condition in LocalizerRunner causes container localization timeout. Currently LocalizerRunner will kill the ContainerLocalizer when the pending list of LocalizerResourceRequestEvents is empty.
{code}
} else if (pending.isEmpty()) {
  action = LocalizerAction.DIE;
}
{code}
If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the ContainerLocalizer due to the empty pending list, this LocalizerResourceRequestEvent will never be handled. Without the ContainerLocalizer, LocalizerRunner#update will never be called. The container will stay in the LOCALIZING state until the container is killed by the AM due to TASK_TIMEOUT. was: Race condition in LocalizerRunner causes container localization timeout. Currently LocalizerRunner will kill the ContainerLocalizer when the pending list of LocalizerResourceRequestEvents is empty.
{code}
} else if (pending.isEmpty()) {
  action = LocalizerAction.DIE;
}
{code}
If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the ContainerLocalizer due to the empty pending list, this LocalizerResourceRequestEvent will never be handled. The container will stay in the LOCALIZING state until the container is killed by the AM due to TASK_TIMEOUT. Race condition in LocalizerRunner causes container localization timeout. Key: YARN-3464 URL: https://issues.apache.org/jira/browse/YARN-3464 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Race condition in LocalizerRunner causes container localization timeout. Currently LocalizerRunner will kill the ContainerLocalizer when the pending list of LocalizerResourceRequestEvents is empty.
{code}
} else if (pending.isEmpty()) {
  action = LocalizerAction.DIE;
}
{code}
If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the ContainerLocalizer due to the empty pending list, this LocalizerResourceRequestEvent will never be handled. Without the ContainerLocalizer, LocalizerRunner#update will never be called. The container will stay in the LOCALIZING state until the container is killed by the AM due to TASK_TIMEOUT. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
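To make the race concrete, here is a sketch of one way to close it, assuming the DIE decision and the enqueue path can share the pending list's lock; the structure and the dying flag are hypothetical, not taken from a posted patch:
{code}
// Hypothetical sketch: decide DIE and reject late arrivals atomically.
private LocalizerAction checkPending() {
  synchronized (pending) {
    if (pending.isEmpty()) {
      dying = true;                 // hypothetical flag, checked below
      return LocalizerAction.DIE;
    }
    return LocalizerAction.LIVE;
  }
}

public void addResource(LocalizerResourceRequestEvent event) {
  synchronized (pending) {
    if (dying) {
      // Too late for this runner; the caller must start a new
      // ContainerLocalizer instead of silently dropping the request.
      throw new IllegalStateException("LocalizerRunner is shutting down");
    }
    pending.add(event);
  }
}
{code}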
[jira] [Created] (YARN-3465) use LinkedHashMap to keep the order of LocalResourceRequest in ContainerImpl
zhihai xu created YARN-3465: --- Summary: use LinkedHashMap to keep the order of LocalResourceRequest in ContainerImpl Key: YARN-3465 URL: https://issues.apache.org/jira/browse/YARN-3465 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu use LinkedHashMap to keep the order of LocalResourceRequest in ContainerImpl -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3457) NPE when NodeManager.serviceInit fails and stopRecoveryStore called
[ https://issues.apache.org/jira/browse/YARN-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484858#comment-14484858 ] Hudson commented on YARN-3457: -- FAILURE: Integrated in Hadoop-trunk-Commit #7531 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7531/]) YARN-3457. NPE when NodeManager.serviceInit fails and stopRecoveryStore called. Contributed by Bibin A Chundatt. (ozawa: rev dd852f5b8c8fe9e52d15987605f36b5b60f02701) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java NPE when NodeManager.serviceInit fails and stopRecoveryStore called --- Key: YARN-3457 URL: https://issues.apache.org/jira/browse/YARN-3457 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Fix For: 2.8.0 Attachments: YARN-3457.001.patch When NodeManager service init fails, a null pointer exception is thrown during stopRecoveryStore
{code}
@Override
protected void serviceInit(Configuration conf) throws Exception {
  ..
  try {
    exec.init();
  } catch (IOException e) {
    throw new YarnRuntimeException("Failed to initialize container executor", e);
  }
  this.context = createNMContext(containerTokenSecretManager,
      nmTokenSecretManager, nmStore);
{code}
context is null when service init fails
{code}
private void stopRecoveryStore() throws IOException {
  nmStore.stop();
  if (context.getDecommissioned() && nmStore.canRecover()) {
    ..
  }
}
{code}
Null pointer exception thrown {quote} 2015-04-07 17:31:45,807 WARN org.apache.hadoop.service.AbstractService: When stopping the service NodeManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:168) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:280) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:484) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:534) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
zhihai xu created YARN-3464: --- Summary: Race condition in LocalizerRunner causes container localization timeout. Key: YARN-3464 URL: https://issues.apache.org/jira/browse/YARN-3464 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Race condition in LocalizerRunner causes container localization timeout. Currently LocalizerRunner will kill the ContainerLocalizer when the pending list of LocalizerResourceRequestEvents is empty.
{code}
} else if (pending.isEmpty()) {
  action = LocalizerAction.DIE;
}
{code}
If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the ContainerLocalizer due to the empty pending list, this LocalizerResourceRequestEvent will never be handled. The container will stay in the LOCALIZING state until the container is killed by the AM due to TASK_TIMEOUT. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3429) TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
[ https://issues.apache.org/jira/browse/YARN-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484917#comment-14484917 ] Ravi Prakash commented on YARN-3429: You may have inadvertently used the wrong JIRA number in your commit [~rkanter]. It ought to be YARN-3429 (instead of YARN-2429); I see comments on YARN-2429. TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken Key: YARN-3429 URL: https://issues.apache.org/jira/browse/YARN-3429 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.8.0 Attachments: YARN-3429.000.patch TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken from appattempt_1427804754787_0001_01 The error logs are at https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484920#comment-14484920 ] zhihai xu commented on YARN-3464: - This issue only happens for PRIVATE/APPLICATION resource localization. We saw this issue happen when PRIVATE LocalizerResourceRequestEvents interleaved with PUBLIC LocalizerResourceRequestEvents in the following order: PRIVATE1, PRIVATE2, ..., PRIVATEm, PUBLIC1, PUBLIC2, ..., PUBLICn, PRIVATEm+1, PRIVATEm+2. The last two PRIVATE LocalizerResourceRequestEvents are added after all previous m PRIVATE LocalizerResourceRequestEvents are LOCALIZED, due to the delay in processing the n PUBLIC LocalizerResourceRequestEvents. Then the container will stay in the LOCALIZING state until it is killed by the AM. Race condition in LocalizerRunner causes container localization timeout. Key: YARN-3464 URL: https://issues.apache.org/jira/browse/YARN-3464 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Race condition in LocalizerRunner causes container localization timeout. Currently LocalizerRunner will kill the ContainerLocalizer when the pending list of LocalizerResourceRequestEvents is empty.
{code}
} else if (pending.isEmpty()) {
  action = LocalizerAction.DIE;
}
{code}
If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the ContainerLocalizer due to the empty pending list, this LocalizerResourceRequestEvent will never be handled. Without the ContainerLocalizer, LocalizerRunner#update will never be called. The container will stay in the LOCALIZING state until the container is killed by the AM due to TASK_TIMEOUT. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3293) Track and display capacity scheduler health metrics in web UI
[ https://issues.apache.org/jira/browse/YARN-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485466#comment-14485466 ] Craig Welch commented on YARN-3293: --- Overall +1 looks good to me. One additional thing occurred to me when looking it over again - I think that CapacitySchedulerHealthInfo in the web dao is, for the most part, cross-scheduler. Does it make sense to factor most of it up into a generalized SchedulerHealthInfo with all the common pieces and extend it (to CapacitySchedulerHealthInfo) just for the CS specific constructor? Track and display capacity scheduler health metrics in web UI - Key: YARN-3293 URL: https://issues.apache.org/jira/browse/YARN-3293 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: Screen Shot 2015-03-30 at 4.30.14 PM.png, apache-yarn-3293.0.patch, apache-yarn-3293.1.patch, apache-yarn-3293.2.patch, apache-yarn-3293.4.patch, apache-yarn-3293.5.patch, apache-yarn-3293.6.patch It would be good to display metrics that let users know about the health of the capacity scheduler in the web UI. Today it is hard to get an idea if the capacity scheduler is functioning correctly. Metrics such as the time for the last allocation, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2003) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side]
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485392#comment-14485392 ] Sunil G commented on YARN-2003: --- Findbugs warnings are not related. Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side] -- Key: YARN-2003 URL: https://issues.apache.org/jira/browse/YARN-2003 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2003.patch, 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 0006-YARN-2003.patch AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from Submission Context and store. Later this can be used by Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3388) Allocation in LeafQueue could get stuck because DRF calculator isn't well supported when computing user-limit
[ https://issues.apache.org/jira/browse/YARN-3388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485413#comment-14485413 ] Nathan Roberts commented on YARN-3388: -- Test failures don't appear related to patch. Ran failing tests locally and they pass. Allocation in LeafQueue could get stuck because DRF calculator isn't well supported when computing user-limit - Key: YARN-3388 URL: https://issues.apache.org/jira/browse/YARN-3388 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Attachments: YARN-3388-v0.patch, YARN-3388-v1.patch When there are multiple active users in a queue, it should be possible for those users to make use of capacity up-to max_capacity (or close). The resources should be fairly distributed among the active users in the queue. This works pretty well when there is a single resource being scheduled. However, when there are multiple resources the situation gets more complex and the current algorithm tends to get stuck at Capacity. Example illustrated in subsequent comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3466) RM nodes web page does not sort by node HTTP address or containers
[ https://issues.apache.org/jira/browse/YARN-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485451#comment-14485451 ] Jason Lowe commented on YARN-3466: -- This was caused by YARN-2943. A new column was added at the beginning of the table but table indices in the sorting metadata for the javascript were not updated accordingly. RM nodes web page does not sort by node HTTP address or containers -- Key: YARN-3466 URL: https://issues.apache.org/jira/browse/YARN-3466 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe The ResourceManager does not support sorting by the node HTTP address nor the container count columns on the cluster nodes page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
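For illustration, the YARN web apps typically emit the jQuery DataTables config as a JS string built in Java; a hypothetical sketch of the index shift (the real columns and indices are in YARN-3466.001.patch):
{code}
// Hypothetical sketch: a node-labels column inserted near the front of the
// table shifts every aTargets index after it by one; stale indices silently
// break column sorting.
StringBuilder nodesTableInit = new StringBuilder()
    .append("{aoColumnDefs: [")
    .append("{'sType': 'title-numeric', 'aTargets': [6, 7]},") // was [5, 6]
    .append("{'bSearchable': false, 'aTargets': [8]}")          // was [7]
    .append("]}");
{code}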
[jira] [Updated] (YARN-3466) RM nodes web page does not sort by node HTTP address or containers
[ https://issues.apache.org/jira/browse/YARN-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-3466: - Attachment: YARN-3466.001.patch Patch to bump the column indices to take into account the new node label column. This also restores the formatting of the code where the columns are defined so it's easier to see the column order and count them. [~leftnoteasy] or [~jianhe] please review. It would be nice to get this into 2.7. RM nodes web page does not sort by node HTTP address or containers -- Key: YARN-3466 URL: https://issues.apache.org/jira/browse/YARN-3466 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-3466.001.patch The ResourceManager does not support sorting by the node HTTP address nor the container count columns on the cluster nodes page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3448) Add Rolling Time To Lives Level DB Plugin Capabilities
[ https://issues.apache.org/jira/browse/YARN-3448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485446#comment-14485446 ] Zhijie Shen commented on YARN-3448: --- bq. In fact, all rolling dbs from now until ttl may be active. Yeah, actually this is the point I'd like to make. For example, if ttl = 10h and rolling period = 1h, we will have 10 active rolling dbs. Though dbs 2 - 10 are not current, they can't be deleted because they contain data that is still alive. Only rolling dbs from 11 and so on will be deleted. If, while ttl = 10h, we change the rolling period to 10h, we will only have 1 active 10h rolling db, and its size should be equivalent to the prior 10 1h rolling dbs. Therefore, my point is that if the rolling period is smaller than ttl, we still need to keep all the data alive; it's not necessary to separate the data into multiple dbs rather than keeping it together in the current db. One benefit I can think of for the multiple-rolling-db approach (as well as different dbs for different data types) is to increase concurrency. However, I didn't see us having multiple threads to write different dbs concurrently. Add Rolling Time To Lives Level DB Plugin Capabilities -- Key: YARN-3448 URL: https://issues.apache.org/jira/browse/YARN-3448 Project: Hadoop YARN Issue Type: Improvement Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-3448.1.patch, YARN-3448.2.patch, YARN-3448.3.patch For large applications, the majority of the time in LeveldbTimelineStore is spent deleting old entities one record at a time. An exclusive write lock is held during the entire deletion phase, which in practice can be hours. If we are to relax some of the consistency constraints, other performance enhancing techniques can be employed to maximize the throughput and minimize locking time. Split the 5 sections of the leveldb database (domain, owner, start time, entity, index) into 5 separate databases. This allows each database to maximize the read cache effectiveness based on the unique usage patterns of each database. With 5 separate databases each lookup is much faster. This can also help with I/O to have the entity and index databases on separate disks. Rolling DBs for entity and index DBs. 99.9% of the data are in these two sections, at a 4:1 ratio (index to entity), at least for tez. We replace DB record removal with file system removal if we create a rolling set of databases that age out and can be efficiently removed. To do this we must place a constraint to always place an entity's events into its correct rolling db instance based on start time. This allows us to stitch the data back together while reading, with artificial paging. Relax the synchronous writes constraints. If we are willing to accept losing some records that were not flushed by the operating system during a crash, we can use async writes that can be much faster. Prefer sequential writes. Sequential writes can be several times faster than random writes. Spend some small effort arranging the writes in such a way that will trend towards sequential write performance over random write performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
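To make the rolling-instance idea concrete, a small sketch under stated assumptions: a fixed 1h rolling period and a db name derived from the period start; the class and method names are hypothetical:
{code}
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of bucketing entities into rolling db instances.
class RollingDbSketch {
  static final long PERIOD_MS = TimeUnit.HOURS.toMillis(1);

  // All events of an entity land in the instance owning its start time,
  // so readers can stitch instances back together in time order.
  static String dbNameFor(long entityStartTime) {
    long periodStart = (entityStartTime / PERIOD_MS) * PERIOD_MS;
    return "entitydb-" + periodStart;
  }

  // TTL enforcement becomes whole-instance (file system) removal: an
  // instance is deletable only once every record in it has aged out.
  static boolean deletable(long periodStart, long now, long ttlMs) {
    return periodStart + PERIOD_MS < now - ttlMs;
  }
}
{code}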
[jira] [Updated] (YARN-3466) RM nodes web page does not sort by node HTTP address or containers
[ https://issues.apache.org/jira/browse/YARN-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-3466: - Affects Version/s: (was: 2.6.0) 2.7.0 RM nodes web page does not sort by node HTTP address or containers -- Key: YARN-3466 URL: https://issues.apache.org/jira/browse/YARN-3466 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.7.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-3466.001.patch The ResourceManager does not support sorting by the node HTTP address nor the container count columns on the cluster nodes page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2890) MiniMRYarnCluster should turn on timeline service if configured to do so
[ https://issues.apache.org/jira/browse/YARN-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485303#comment-14485303 ] Mit Desai commented on YARN-2890: - [~hitesh] any comments on the latest patch? MiniMRYarnCluster should turn on timeline service if configured to do so Key: YARN-2890 URL: https://issues.apache.org/jira/browse/YARN-2890 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Mit Desai Attachments: YARN-2890.1.patch, YARN-2890.2.patch, YARN-2890.3.patch, YARN-2890.4.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch, YARN-2890.patch Currently the MiniMRYarnCluster does not consider the configuration value for enabling timeline service before starting. The MiniYarnCluster should only start the timeline service if it is configured to do so. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications.
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485551#comment-14485551 ] Junping Du commented on YARN-2637: -- Hi [~cwelch] and [~jianhe], I think MAPREDUCE-6189 could be related to this patch. Can you take a look at it? Thanks! maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications. Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Fix For: 2.7.0 Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, YARN-2637.16.patch, YARN-2637.17.patch, YARN-2637.18.patch, YARN-2637.19.patch, YARN-2637.2.patch, YARN-2637.20.patch, YARN-2637.21.patch, YARN-2637.22.patch, YARN-2637.23.patch, YARN-2637.25.patch, YARN-2637.26.patch, YARN-2637.27.patch, YARN-2637.28.patch, YARN-2637.29.patch, YARN-2637.30.patch, YARN-2637.31.patch, YARN-2637.32.patch, YARN-2637.36.patch, YARN-2637.38.patch, YARN-2637.39.patch, YARN-2637.40.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch Currently, the number of AMs in a leaf queue is calculated in the following way:
{code}
max_am_resource = queue_max_capacity * maximum_am_resource_percent
#max_am_number = max_am_resource / minimum_allocation
#max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor
{code}
And when a new application is submitted to the RM, it checks whether an app can be activated in the following way:
{code}
for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator(); i.hasNext();) {
  FiCaSchedulerApp application = i.next();
  // Check queue limit
  if (getNumActiveApplications() >= getMaximumActiveApplications()) {
    break;
  }
  // Check user limit
  User user = getUser(application.getUser());
  if (user.getActiveApplications() < getMaximumActiveApplicationsPerUser()) {
    user.activateApplication();
    activeApplications.add(application);
    i.remove();
    LOG.info("Application " + application.getApplicationId()
        + " from user: " + application.getUser()
        + " activated in queue: " + getQueueName());
  }
}
{code}
For example: if a queue has capacity = 1G and max_am_resource_percent = 0.2, the maximum resource that AMs can use is 200M. Assuming minimum_allocation = 1M, 200 AMs can be launched. If each AM actually uses 5M (> minimum_allocation), all apps can still be activated, and they will occupy all resources of the queue instead of only max_am_resource_percent of it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
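Spelling out the arithmetic of the example above (sketch only; the numbers mirror the description, not scheduler code):
{code}
// Worked numbers from the example above:
long queueCapacityMb = 1000;                           // ~1G queue
double maxAmResourcePercent = 0.2;
long minAllocationMb = 1;
long maxAmResourceMb =
    (long) (queueCapacityMb * maxAmResourcePercent);   // 200 MB for AMs
long maxAmCount = maxAmResourceMb / minAllocationMb;   // 200 AMs activatable
long actualAmMb = 5;                                   // each AM really uses 5 MB
long amUsageMb = maxAmCount * actualAmMb;              // 1000 MB: the whole queue
{code}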
[jira] [Commented] (YARN-3293) Track and display capacity scheduler health metrics in web UI
[ https://issues.apache.org/jira/browse/YARN-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485557#comment-14485557 ] Craig Welch commented on YARN-3293: --- Your call, I think it's also fine to wait to do this until we do FairScheduler integration when we are clear on exactly what needs to happen (it may be premature to do it now, not entirely sure), but ultimately I think as much as can be shared should be. Track and display capacity scheduler health metrics in web UI - Key: YARN-3293 URL: https://issues.apache.org/jira/browse/YARN-3293 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: Screen Shot 2015-03-30 at 4.30.14 PM.png, apache-yarn-3293.0.patch, apache-yarn-3293.1.patch, apache-yarn-3293.2.patch, apache-yarn-3293.4.patch, apache-yarn-3293.5.patch, apache-yarn-3293.6.patch It would be good to display metrics that let users know about the health of the capacity scheduler in the web UI. Today it is hard to get an idea if the capacity scheduler is functioning correctly. Metrics such as the time for the last allocation, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485566#comment-14485566 ] Karthik Kambatla commented on YARN-3464: I have been investigating a similar issue. Initially I thought of the same race, but I am not sure that alone solves the issue. Looking at the code closely, I don't see any resources being removed from pending. So, pending shouldn't be empty after some of the resources have been downloaded. Related: YARN-3024 increases the frequency of this issue. Race condition in LocalizerRunner causes container localization timeout. Key: YARN-3464 URL: https://issues.apache.org/jira/browse/YARN-3464 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Race condition in LocalizerRunner causes container localization timeout. Currently LocalizerRunner will kill the ContainerLocalizer when the pending list of LocalizerResourceRequestEvents is empty.
{code}
} else if (pending.isEmpty()) {
  action = LocalizerAction.DIE;
}
{code}
If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the ContainerLocalizer due to the empty pending list, this LocalizerResourceRequestEvent will never be handled. Without the ContainerLocalizer, LocalizerRunner#update will never be called. The container will stay in the LOCALIZING state until the container is killed by the AM due to TASK_TIMEOUT. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3361) CapacityScheduler side changes to support non-exclusive node labels
[ https://issues.apache.org/jira/browse/YARN-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3361: - Attachment: YARN-3361.3.patch Thanks for your comments, [~vinodkv]/[~jianhe]: * Main code comments from Vinod: * bq. checkNodeLabelExpression: NPEs on labelExpression can happen? No, I removed the checks. bq. FiCaSchedulerNode: exclusive, setters, getters - exclusivePartition They're not used by anybody; removed. bq. ExclusiveType renames Done bq. AbstractCSQueue: 1. Change to nodePartitionToLookAt: Done 2. Now all queues check needResource 3. Renamed to hasPendingResourceRequest as suggested by Jian bq. checkResourceRequestMatchingNodeLabel can be moved into the application? Moved to SchedulerUtils bq. checkResourceRequestMatchingNodeLabel nodeLabelToLookAt arg is not used anywhere else. Done (merged it into SchedulerUtils.checkResourceRequestMatchingNodePartition) bq. addNonExclusiveSchedulingOpportunity Renamed to reset/addMissedNonPartitionedRequestSchedulingOpportunity bq. It seems like we are not putting absolute max-capacities on the individual queues when not-respecting-partitions. Describe why? Similarly, describe as to why user-limit-factor is ignored in the not-respecting-partitions mode. Done * Test code comments from Vinod: * bq. testNonExclusiveNodeLabelsAllocationIgnoreAppSubmitOrder Done bq. testNonExclusiveNodeLabelsAllocationIgnorePriority Renamed to testPreferenceOfNeedyPrioritiesUnderSameAppTowardsNodePartitions bq. Actually, now that I rename it that way, this may not be the right behavior. Not respecting priorities within an app can result in scheduling deadlocks: This will not lead to deadlock, because we separately count resource usage under each partition; priority=1 goes first on partition=y before priority=0 is fully satisfied only because priority=1 is the lowest priority that asks for partition=y. bq. testLabeledResourceRequestsGetPreferrenceInHierarchyOfQueue Renamed to testPreferenceOfQueuesTowardsNodePartitions bq. testNonLabeledQueueUsesLabeledResource Done bq. Let's move all these node-label related tests into their own test-case. Moved to TestNodeLabelContainerAllocation Added more tests: 1. Added testAMContainerAllocationWillAlwaysBeExclusive to make sure the AM will always be exclusive. 2. Added testQueueMaxCapacitiesWillNotBeHonoredWhenNotRespectingExclusivity to make sure max-capacities on individual queues are ignored when doing ignore-exclusivity allocation. * Main code comments from Jian: * bq. Merge queue#needResource and application#needResource Done; moved the common implementation to SchedulerUtils.hasPendingResourceRequest bq. Merge queue#needResource and application#needResource Done bq. Some methods like canAssignToThisQueue where both nodeLabels and exclusiveType are passed, it may be simplified by passing the current partitionToAllocate to simplify the internal if/else check. Actually, it will not simplify the logic much; I checked, and there are only a few places that can leverage nodePartitionToLookAt. I prefer to keep the semantics of SchedulingMode. bq. The following may be incorrect, as the current request may not be the AM container request, though null == rmAppAttempt.getMasterContainer() I understand masterContainer could be asynchronously initialized in RMApp, but the interval can be ignored; doing the null check here makes sure the AM container doesn't get allocated. bq. below if/else can be avoided if passing the nodePartition into queueCapacities.getAbsoluteCapacity(nodePartition), Done bq. the second limit won’t be hit? 
Yeah, it will not be hit, but setting it to maxUserLimit will enhance readability. bq. nonExclusiveSchedulingOpportunities#setCount - add(Priority) Done Attached new patch (ver.3) CapacityScheduler side changes to support non-exclusive node labels --- Key: YARN-3361 URL: https://issues.apache.org/jira/browse/YARN-3361 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3361.1.patch, YARN-3361.2.patch, YARN-3361.3.patch According to the design doc attached in YARN-3214, we need to implement the following logic in CapacityScheduler: 1) When allocating a resource request with no node-label specified, it should get preferentially allocated to nodes without labels. 2) When there are available resources on a node with a label, they can be used by applications in the following order: - Applications under queues which can access the label and ask for the same labeled resource. - Applications under queues which can access the label and ask for non-labeled resources. - Applications under queues which cannot access the label and ask for non-labeled resources. 3) Expose necessary information that can be used by preemption
[jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
[ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485607#comment-14485607 ] Zhijie Shen commented on YARN-3051: --- bq. My sense is that it should be fine to use the same time window for all metrics. Makes sense to me too. bq. Or we have to be handle it as part of a single query ? The result will just include the entity identifiers of the related entities. We then issue a separate query to pull the detailed info of each related entity. This also prevents the response from being nested. Otherwise, one entity is related to another, which is consequently related to yet another, and the response will be too big. And if A is related to B, and B is then related to A, JAX-RS will find the cyclic dependency and throw an exception. [Storage abstraction] Create backing storage read interface for ATS readers --- Key: YARN-3051 URL: https://issues.apache.org/jira/browse/YARN-3051 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Varun Saxena Attachments: YARN-3051_temp.patch Per design in YARN-2928, create a backing storage read interface that can be implemented by multiple backing storage implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
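A sketch of the two-step read described above, with a hypothetical reader method (defining the real read interface is what YARN-3051 is about):
{code}
// Hypothetical sketch: responses stay flat by returning only identifiers
// for related entities; details come from follow-up queries.
TimelineEntity primary = reader.getEntity(entityType, entityId);
for (Map.Entry<String, Set<String>> rel
    : primary.getRelatedEntities().entrySet()) {
  for (String relatedId : rel.getValue()) {
    // A separate query per related entity; nothing is nested, so a cycle
    // like A -> B -> A can never be expanded into the response.
    TimelineEntity related = reader.getEntity(rel.getKey(), relatedId);
  }
}
{code}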
[jira] [Commented] (YARN-3225) New parameter or CLI for decommissioning node gracefully in RMAdmin CLI
[ https://issues.apache.org/jira/browse/YARN-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485199#comment-14485199 ] Junping Du commented on YARN-3225: -- Thanks [~devaraj.k] for replying. bq. If the user wants to achieve this, they can give some larger timeout value and wait for all nodes to get decommissioned gracefully(without forceful). Do we really need to provide special handling for this case? It would be great if we can support this case, because then users don't have to think up a large timeout value for an important job without knowing when it will end. Given this is a trivial effort compared with what you have already achieved, we'd better do it here instead of filing a separate JIRA. What do you think? bq. I feel Decommission nodes in normal way would be ok, no need to mention the 'old' term. What is your opinion on this? Yes. That sounds good. My previous point was not to mention decommissioning for the normal/previous decommission process, to get rid of any confusion. New parameter or CLI for decommissioning node gracefully in RMAdmin CLI --- Key: YARN-3225 URL: https://issues.apache.org/jira/browse/YARN-3225 Project: Hadoop YARN Issue Type: Sub-task Reporter: Junping Du Assignee: Devaraj K Attachments: YARN-3225-1.patch, YARN-3225-2.patch, YARN-3225-3.patch, YARN-3225.patch, YARN-914.patch New CLI (or existing CLI with parameters) should put each node on the decommission list into decommissioning status and track the timeout to terminate the nodes that haven't finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3127) Apphistory url crashes when RM switches with ATS enabled
[ https://issues.apache.org/jira/browse/YARN-3127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485235#comment-14485235 ] Xuan Gong commented on YARN-3127: - [~Naganarasimha] Thanks for working on this. I will take a look shortly. Apphistory url crashes when RM switches with ATS enabled Key: YARN-3127 URL: https://issues.apache.org/jira/browse/YARN-3127 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.6.0 Environment: RM HA with ATS Reporter: Bibin A Chundatt Assignee: Naganarasimha G R Attachments: YARN-3127.20150213-1.patch, YARN-3127.20150329-1.patch 1. Start RM with HA and ATS configured and run some yarn applications 2. Once applications are finished successfully, start the timeline server 3. Now failover HA from active to standby 4. Access the timeline server URL IP:PORT/applicationhistory Result: Application history URL fails with the below info {quote} 2015-02-03 20:28:09,511 ERROR org.apache.hadoop.yarn.webapp.View: Failed to read the applications. java.lang.reflect.UndeclaredThrowableException at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1643) at org.apache.hadoop.yarn.server.webapp.AppsBlock.render(AppsBlock.java:80) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:67) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) ... Caused by: org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: The entity for application attempt appattempt_1422972608379_0001_01 doesn't exist in the timeline store at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getApplicationAttempt(ApplicationHistoryManagerOnTimelineStore.java:151) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.generateApplicationReport(ApplicationHistoryManagerOnTimelineStore.java:499) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getAllApplications(ApplicationHistoryManagerOnTimelineStore.java:108) at org.apache.hadoop.yarn.server.webapp.AppsBlock$1.run(AppsBlock.java:84) at org.apache.hadoop.yarn.server.webapp.AppsBlock$1.run(AppsBlock.java:81) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) ... 51 more 2015-02-03 20:28:09,512 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /applicationhistory org.apache.hadoop.yarn.webapp.WebAppException: Error rendering block: nestLevel=6 expected 5 at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77) {quote} Behaviour with AHS with a file based history store: - Apphistory url is working - No attempt entries are shown for each application. Based on initial analysis, when RM switches, application attempts from the state store are not replayed but only applications are. So when the /applicationhistory url is accessed, it tries for all attempt ids and fails -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3348) Add a 'yarn top' tool to help understand cluster usage
[ https://issues.apache.org/jira/browse/YARN-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485242#comment-14485242 ] Varun Vasudev commented on YARN-3348: - The last line "Moved the cache to YarnClientImpl where the hashcode doesn't show up" should be "Moved the cache to YarnClientImpl where the hashcode issue doesn't show up" Add a 'yarn top' tool to help understand cluster usage -- Key: YARN-3348 URL: https://issues.apache.org/jira/browse/YARN-3348 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-3348.0.patch, apache-yarn-3348.1.patch, apache-yarn-3348.2.patch It would be helpful to have a 'yarn top' tool that would allow administrators to understand which apps are consuming resources. Ideally the tool would allow you to filter by queue, user, maybe labels, etc and show you statistics on container allocation across the cluster to find out which apps are consuming the most resources on the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage
[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485237#comment-14485237 ] Junping Du commented on YARN-3391: -- Thanks [~zjshen] for updating the patch! bq. According to Sangjin's given example, we usually want to identify a flow run by timestamp, which theoretically can be negative to represent sometime before 1970. Except for time travel, I don't believe any flow run running on hadoop and the new timeline service should happen before 1970. :) Anyway, we do have some practice of checking timestamp > 0 (like: MetricsRecordImpl), but in more cases it sounds like we didn't do this negative check for timestamps. Given this, I am fine with not checking here. The v4 patch looks good to me. [~sjlee0], [~vrushalic] and [~jrottinghuis], any additional comments on the patch? Clearly define flow ID/ flow run / flow version in API and storage -- Key: YARN-3391 URL: https://issues.apache.org/jira/browse/YARN-3391 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3391.1.patch, YARN-3391.2.patch, YARN-3391.3.patch, YARN-3391.4.patch To continue the discussion in YARN-3040, let's figure out the best way to describe the flow. Some key issues that we need to conclude on: - How do we include the flow version in the context so that it gets passed into the collector and to the storage eventually? - Flow run id should be a number as opposed to a generic string? - Default behavior for the flow run id if it is missing (i.e. client did not set it) - How do we handle flow attributes in case of nested levels of flows? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3348) Add a 'yarn top' tool to help understand cluster usage
[ https://issues.apache.org/jira/browse/YARN-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-3348: Attachment: apache-yarn-3348.2.patch Thanks for the reviews [~aw] and [~jianhe]. bq. Why are we doing this manipulation here and not in the Java code? I get different values when I run the command in the yarn script vs spawn it via Java. From Java, I get lower values - 80x24, whereas the yarn script gives me 204x44. bq. backticks are antiquated in modern bash. Use $() construction Fixed. bq. What happens if tput gives you zero or an error because you are on a non-addressable terminal? (You can generally simulate this by unset TERM or equivalent env var) Thank you for pointing this out. I hadn't considered it. I've added additional checks in the script. If the values can't be determined either by the script or by the Java code, it sets it to 80x24. bq. “Unable to fetach cluster metrics” - typo Fixed. bq. exceeding 80 Column limit, Fixed. bq. the -rows, -cols options seems not having effect on my screen when I tried it, could you double check ? I found an issue with the cols option which I've fixed. Can you please try it again? bq. the ‘yarn top’ output is repeatedly showing up on terminal every $delay seconds. it’ll be better to only show that only once. I didn't understand this - do you mean that it shouldn't auto-refresh? bq. Does the patch only show root queue info ? should we show all queues info ? Queues can be specified as a comma separated string using the -queues option. By default, it shows information for the root queue. bq. “F + Enter : Select sort field” ; may be use ’S’ for sorting ? Fixed. bq. “Memory seconds(in GBseconds” - missing “)” Fixed {quote} It seems a bit odd to have this method in a public API record. Do you know why hashcode is not correct without this method ? Or we can just type cast it to GetApplicationsRequestPBImpl and use the method from there. // need this otherwise the hashcode doesn't get generated correctly request.initAllFields(); for the caching in ClientRMService. Do you think we can do the cache on client side ? that’ll save RPCs, especially if we have many top commands running on client side. {quote} Fixed. Moved the cache to YarnClientImpl where the hashcode doesn't show up. As to why it wasn't correct - I suspect it might be to do with lazy initialization but I'm not sure. Add a 'yarn top' tool to help understand cluster usage -- Key: YARN-3348 URL: https://issues.apache.org/jira/browse/YARN-3348 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-3348.0.patch, apache-yarn-3348.1.patch, apache-yarn-3348.2.patch It would be helpful to have a 'yarn top' tool that would allow administrators to understand which apps are consuming resources. Ideally the tool would allow you to filter by queue, user, maybe labels, etc and show you statistics on container allocation across the cluster to find out which apps are consuming the most resources on the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3293) Track and display capacity scheduler health metrics in web UI
[ https://issues.apache.org/jira/browse/YARN-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-3293: Attachment: apache-yarn-3293.6.patch Uploaded a new patch with getters so that findbugs doesn't complain. Track and display capacity scheduler health metrics in web UI - Key: YARN-3293 URL: https://issues.apache.org/jira/browse/YARN-3293 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: Screen Shot 2015-03-30 at 4.30.14 PM.png, apache-yarn-3293.0.patch, apache-yarn-3293.1.patch, apache-yarn-3293.2.patch, apache-yarn-3293.4.patch, apache-yarn-3293.5.patch, apache-yarn-3293.6.patch It would be good to display metrics that let users know about the health of the capacity scheduler in the web UI. Today it is hard to get an idea if the capacity scheduler is functioning correctly. Metrics such as the time for the last allocation, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3347) Improve YARN log command to get AMContainer logs as well as running containers logs
[ https://issues.apache.org/jira/browse/YARN-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485295#comment-14485295 ] Junping Du commented on YARN-3347: -- Hi [~xgong], thanks for reporting this issue and delivering a patch to fix it! This looks like a very helpful feature for troubleshooting. I went through the patch quickly and have some comments so far:
{code}
+Option amOption = new Option(AM_CONTAINER_OPTION, true,
+    "Prints the AM Container logs for this application. "
+    + "Specify comma-separated value to get logs for related AM Container. "
+    + "To get logs for all AM Containers, use -am ALL. "
+    + "To get logs for the latest AM Container, use -am -1. "
+    + "By default, it will only print out syslog. Work with -logFiles "
+    + "to get other logs");
{code}
For the comma-separated value, do we mean the attempt number? If so, maybe we should describe it more explicitly here? Also, can we use 0 (instead of -1) for the AM container of the latest attempt? If so, all negative values here would be illegal.
{code}
+if (getConf().getBoolean(YarnConfiguration.APPLICATION_HISTORY_ENABLED,
+    YarnConfiguration.DEFAULT_APPLICATION_HISTORY_ENABLED)) {
+  System.out.println("Please enable the application history service. Or ");
+}
{code}
Missing ! before getConf()? In the method printAMContainerLogsForRunningApplication(),
{code}
+boolean printAll = amContainers.contains("ALL");
+
+for (int i = 0; i < amContainersInfo.length(); i++) {
+  boolean printThis = amContainers.contains(Integer.toString(i + 1))
+      || (i == (amContainersInfo.length() - 1)
+          && amContainers.contains(Integer.toString(-1)));
+  if (printAll || printThis) {
+    String nodeHttpAddress =
+        amContainersInfo.getJSONObject(i).getString("nodeHttpAddress");
+    String containerId =
+        amContainersInfo.getJSONObject(i).getString("containerId");
+    String nodeId = amContainersInfo.getJSONObject(i).getString("nodeId");
+    if (nodeHttpAddress != null && containerId != null
+        && !nodeHttpAddress.isEmpty() && !containerId.isEmpty()) {
+      printContainerLogsFromRunningApplication(conf, appId, containerId,
+          nodeHttpAddress, nodeId, logFiles, logCliHelper, appOwner);
+    }
+  }
+}
+return 0;
+}
{code}
Sounds like we are re-ordering the sequence of the user's input, which seems unnecessary to me. I would suggest keeping the order of the user's input, or it could confuse people. Also, the logic here sounds not quite straightforward; I would expect something simpler, like the pseudo code below:
{code}
if (printAll) {
  go through amContainersInfo and print
}
for (amContainer : amContainers) {
  amContainer == -1 ? print amContainersInfo(last-one)
                    : print amContainersInfo(amContainer - 1);
}
{code}
Also, the method run(String[] args) looks very complex now. Can we do some refactoring there and put some comments inline? Improve YARN log command to get AMContainer logs as well as running containers logs --- Key: YARN-3347 URL: https://issues.apache.org/jira/browse/YARN-3347 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3347.1.patch, YARN-3347.1.rebase.patch, YARN-3347.2.patch, YARN-3347.2.rebase.patch Right now, we could specify applicationId, node http address and container ID to get the specific container log. Or we could only specify applicationId to get all the container logs. It is very hard for the users to get logs for the AM container since the AMContainer logs have more useful information. Users need to know the AMContainer's container ID and related Node http address. 
We could improve the YARN Log Command to allow users to get AMContainer logs directly -- This message was sent by Atlassian JIRA (v6.3.4#6332)
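A Java rendering of the pseudo code Junping suggests above; printAMContainer is a hypothetical helper, and this is a sketch rather than code from the patch:
{code}
// Sketch: print in the order the user gave, with -1 meaning the latest
// attempt's AM container. Assumes amContainersInfo is ordered by attempt.
if (printAll) {
  for (int i = 0; i < amContainersInfo.length(); i++) {
    printAMContainer(amContainersInfo.getJSONObject(i));
  }
} else {
  for (String amContainer : amContainers) {
    int attempt = Integer.parseInt(amContainer.trim());
    int index = (attempt == -1)
        ? amContainersInfo.length() - 1   // latest attempt
        : attempt - 1;                    // attempt numbers are 1-based
    printAMContainer(amContainersInfo.getJSONObject(index));
  }
}
{code}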
[jira] [Commented] (YARN-3466) Fix RM nodes web page to sort by node HTTP-address, #containers and node-label column
[ https://issues.apache.org/jira/browse/YARN-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485717#comment-14485717 ] Wangda Tan commented on YARN-3466: -- Updated title and description, added node-label column to reflect changes in the patch. Fix RM nodes web page to sort by node HTTP-address, #containers and node-label column - Key: YARN-3466 URL: https://issues.apache.org/jira/browse/YARN-3466 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.7.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-3466.001.patch The ResourceManager does not support sorting by the node HTTP address, container count and node label column on the cluster nodes page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3466) Fix RM nodes web page to sort by node HTTP-address, #containers and node-label column
[ https://issues.apache.org/jira/browse/YARN-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3466: - Description: The ResourceManager does not support sorting by the node HTTP address, container count and node label column on the cluster nodes page. (was: The ResourceManager does not support sorting by the node HTTP address nor the container count columns on the cluster nodes page. ) Fix RM nodes web page to sort by node HTTP-address, #containers and node-label column - Key: YARN-3466 URL: https://issues.apache.org/jira/browse/YARN-3466 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.7.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-3466.001.patch The ResourceManager does not support sorting by the node HTTP address, container count and node label column on the cluster nodes page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3429) TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
[ https://issues.apache.org/jira/browse/YARN-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485733#comment-14485733 ] Hudson commented on YARN-3429: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2107 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2107/]) YARN-3429. Fix incorrect CHANGES.txt (rkanter: rev 5b8a3ae366294aec492f69f1a429aa7fce5d13be) * hadoop-yarn-project/CHANGES.txt TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken Key: YARN-3429 URL: https://issues.apache.org/jira/browse/YARN-3429 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.8.0 Attachments: YARN-3429.000.patch TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken from appattempt_1427804754787_0001_01 The error logs is at https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3467) Expose allocatedMB, allocatedVCores, and runningContainers metrics on running Applications on RM Web UI
Anthony Rojas created YARN-3467: --- Summary: Expose allocatedMB, allocatedVCores, and runningContainers metrics on running Applications on RM Web UI Key: YARN-3467 URL: https://issues.apache.org/jira/browse/YARN-3467 Project: Hadoop YARN Issue Type: New Feature Components: webapp, yarn Affects Versions: 2.5.0 Reporter: Anthony Rojas Priority: Minor The YARN REST API can report on the following properties: *allocatedMB*: The sum of memory in MB allocated to the application's running containers *allocatedVCores*: The sum of virtual cores allocated to the application's running containers *runningContainers*: The number of containers currently running for the application Currently, the RM Web UI does not report on these items (at least I couldn't find any entries within the Web UI). It would be useful for YARN Application and Resource troubleshooting to have these properties and their corresponding values exposed on the RM WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3294) Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period
[ https://issues.apache.org/jira/browse/YARN-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485728#comment-14485728 ] Hudson commented on YARN-3294: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2107 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2107/]) YARN-3294. Allow dumping of Capacity Scheduler debug logs via web UI for (xgong: rev d27e9241e8676a0edb2d35453cac5f9495fcd605) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/AdHocLogDumper.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestAdHocLogDumper.java Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period - Key: YARN-3294 URL: https://issues.apache.org/jira/browse/YARN-3294 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: Screen Shot 2015-03-12 at 8.51.25 PM.png, apache-yarn-3294.0.patch, apache-yarn-3294.1.patch, apache-yarn-3294.2.patch, apache-yarn-3294.3.patch, apache-yarn-3294.4.patch It would be nice to have a button on the web UI that would allow dumping of debug logs for just the capacity scheduler for a fixed period of time (1 min, 5 min or so) in a separate log file. It would be useful when debugging scheduler behavior without affecting the rest of the resourcemanager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3457) NPE when NodeManager.serviceInit fails and stopRecoveryStore called
[ https://issues.apache.org/jira/browse/YARN-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485726#comment-14485726 ] Hudson commented on YARN-3457: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2107 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2107/]) YARN-3457. NPE when NodeManager.serviceInit fails and stopRecoveryStore called. Contributed by Bibin A Chundatt. (ozawa: rev dd852f5b8c8fe9e52d15987605f36b5b60f02701) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java * hadoop-yarn-project/CHANGES.txt NPE when NodeManager.serviceInit fails and stopRecoveryStore called --- Key: YARN-3457 URL: https://issues.apache.org/jira/browse/YARN-3457 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Fix For: 2.8.0 Attachments: YARN-3457.001.patch When NodeManager service init fails, a null pointer exception is thrown during stopRecoveryStore {code} @Override protected void serviceInit(Configuration conf) throws Exception { .. try { exec.init(); } catch (IOException e) { throw new YarnRuntimeException("Failed to initialize container executor", e); } this.context = createNMContext(containerTokenSecretManager, nmTokenSecretManager, nmStore); {code} context is null when service init fails {code} private void stopRecoveryStore() throws IOException { nmStore.stop(); if (context.getDecommissioned() && nmStore.canRecover()) { .. } } {code} Null pointer exception thrown {quote} 2015-04-07 17:31:45,807 WARN org.apache.hadoop.service.AbstractService: When stopping the service NodeManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:168) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:280) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:484) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:534) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
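A minimal sketch of the kind of guard the fix needs (an assumed shape for illustration, not necessarily the committed patch): stopRecoveryStore has to tolerate a context that was never created because serviceInit failed early.
{code}
// Guard against serviceInit having failed before createNMContext ran.
private void stopRecoveryStore() throws IOException {
  nmStore.stop();
  if (context != null && context.getDecommissioned() && nmStore.canRecover()) {
    // ... remove recovery state as before ...
  }
}
{code}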
[jira] [Commented] (YARN-3110) Few issues in ApplicationHistory web ui
[ https://issues.apache.org/jira/browse/YARN-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485724#comment-14485724 ] Hudson commented on YARN-3110: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2107 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2107/]) YARN-3110. Few issues in ApplicationHistory web ui. Contributed by Naganarasimha G R (xgong: rev 19a4feaf6fcf42ebbfe98b8a7153ade96d37fb14) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppAttemptBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java * hadoop-yarn-project/CHANGES.txt Few issues in ApplicationHistory web ui --- Key: YARN-3110 URL: https://issues.apache.org/jira/browse/YARN-3110 Project: Hadoop YARN Issue Type: Sub-task Components: applications, timelineserver Affects Versions: 2.6.0 Reporter: Bibin A Chundatt Assignee: Naganarasimha G R Priority: Minor Fix For: 2.8.0 Attachments: YARN-3110.20150209-1.patch, YARN-3110.20150315-1.patch, YARN-3110.20150406-1.patch Application state and History link wrong when Application is in unassigned state 1. Configure capacity scheduler with queue size as 1 and max Absolute Max Capacity: 10.0% (current application state is Accepted and Unassigned from the resource manager side) 2. Submit application to the queue and check the state and link in Application history State = null and History link shown as N/A on the applicationhistory page Kill the same application. In the timeline server logs the below is shown when selecting the application link. {quote} 2015-01-29 15:39:50,956 ERROR org.apache.hadoop.yarn.webapp.View: Failed to read the AM container of the application attempt appattempt_1422467063659_0007_01. 
java.lang.NullPointerException at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getContainer(ApplicationHistoryManagerOnTimelineStore.java:162) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getAMContainer(ApplicationHistoryManagerOnTimelineStore.java:184) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:160) at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:157) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.yarn.server.webapp.AppBlock.render(AppBlock.java:156) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:67) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117) at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845) at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56) at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212) at org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSController.app(AHSController.java:38) at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263) at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178) at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at
[jira] [Commented] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration
[ https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485741#comment-14485741 ] Jian He commented on YARN-3136: --- [~sunilg], sorry for the late response. We can suppress the findbugs warning, given it's a non-issue. I found the below synchronization is added in the newest patch; I think it's not necessary? {code} synchronized (this) { appImpl = this.rmContext.getRMApps().get(appId); amContainerId = rmContext.getRMApps().get(appId) .getCurrentAppAttempt().getMasterContainer().getId(); } {code} getTransferredContainers can be a bottleneck during AM registration --- Key: YARN-3136 URL: https://issues.apache.org/jira/browse/YARN-3136 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Sunil G Attachments: 0001-YARN-3136.patch, 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, 0008-YARN-3136.patch, 0009-YARN-3136.patch While examining RM stack traces on a busy cluster I noticed a pattern of AMs stuck waiting for the scheduler lock trying to call getTransferredContainers. The scheduler lock is highly contended, especially on a large cluster with many nodes heartbeating, and it would be nice if we could find a way to eliminate the need to grab this lock during this call. We've already done similar work during AM allocate calls to make sure they don't needlessly grab the scheduler lock, and it would be good to do so here as well, if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
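A minimal sketch of the unsynchronized version being suggested, assuming RMContext#getRMApps is backed by a concurrent map (declarations simplified for illustration):
{code}
// A plain read from a ConcurrentMap needs no extra lock; reusing the first
// lookup also avoids fetching the same app twice.
RMApp appImpl = this.rmContext.getRMApps().get(appId);
ContainerId amContainerId =
    appImpl.getCurrentAppAttempt().getMasterContainer().getId();
{code}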
[jira] [Updated] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daryn Sharp updated YARN-3055: -- Attachment: YARN-3055.patch Haven't had a chance to run findbugs. Might grumble about sync dttr.applicationIds. Will check this afternoon. The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Yi Liu Priority: Blocker Attachments: YARN-3055.001.patch, YARN-3055.002.patch, YARN-3055.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token, and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, also we should not remove it from {{allTokens}}. Otherwise the existing submitted applications which share this token will not get renewed any more, and for newly submitted applications which share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3. And they share the token1. See the following scenario: *1).* app1 is submitted firstly, then app2, and then app3. In this case, there is only one token renewal timer for token1, and it is scheduled when app1 is submitted. *2).* app1 is finished, then the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3429) TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
[ https://issues.apache.org/jira/browse/YARN-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485751#comment-14485751 ] Robert Kanter commented on YARN-3429: - Ya, sorry about that; I only noticed yesterday, and so I fixed CHANGES.txt to say YARN-3429. Unfortunately, I can't fix the git message or the Hudson comments in YARN-2429. TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken Key: YARN-3429 URL: https://issues.apache.org/jira/browse/YARN-3429 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.8.0 Attachments: YARN-3429.000.patch TestAMRMTokens.testTokenExpiry fails intermittently with error message: Invalid AMRMToken from appattempt_1427804754787_0001_01 The error log is at https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485775#comment-14485775 ] zhihai xu commented on YARN-3464: - [~kasha], thanks for the information. I just looked at YARN-3024; yes, it will make this issue happen more frequently. Before YARN-3024, private resources were localized one by one. The next one won't start until the current one finishes localization, so private resource localization takes longer. With YARN-3024, the localization is done in parallel, and multiple files can be localized at the same time. The chance of the ContainerLocalizer being killed while the last two PRIVATE LocalizerResourceRequestEvents are added is bigger. Yes, your suggestion is also what I thought. Race condition in LocalizerRunner causes container localization timeout. Key: YARN-3464 URL: https://issues.apache.org/jira/browse/YARN-3464 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Race condition in LocalizerRunner causes container localization timeout. Currently LocalizerRunner will kill the ContainerLocalizer when the pending list for LocalizerResourceRequestEvent is empty. {code} } else if (pending.isEmpty()) { action = LocalizerAction.DIE; } {code} If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the ContainerLocalizer due to an empty pending list, this LocalizerResourceRequestEvent will never be handled. Without ContainerLocalizer, LocalizerRunner#update will never be called. The container will stay in LOCALIZING state, until the container is killed by the AM due to TASK_TIMEOUT. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage
[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485781#comment-14485781 ] Vinod Kumar Vavilapalli commented on YARN-3391: --- A cosmetic suggestion: flow_run - flow_run_name or flow_run_id ? Clearly define flow ID/ flow run / flow version in API and storage -- Key: YARN-3391 URL: https://issues.apache.org/jira/browse/YARN-3391 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3391.1.patch, YARN-3391.2.patch, YARN-3391.3.patch, YARN-3391.4.patch To continue the discussion in YARN-3040, let's figure out the best way to describe the flow. Some key issues that we need to conclude on: - How do we include the flow version in the context so that it gets passed into the collector and to the storage eventually? - Flow run id should be a number as opposed to a generic string? - Default behavior for the flow run id if it is missing (i.e. client did not set it) - How do we handle flow attributes in case of nested levels of flows? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage
[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485791#comment-14485791 ] Vrushali C commented on YARN-3391: -- [~vinodkv] , +1 for flow_run to be called as flow_run_id. It's a number (epoch timestamp). If we call it flow_run_name, that makes it sound like it's a string. Clearly define flow ID/ flow run / flow version in API and storage -- Key: YARN-3391 URL: https://issues.apache.org/jira/browse/YARN-3391 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3391.1.patch, YARN-3391.2.patch, YARN-3391.3.patch, YARN-3391.4.patch To continue the discussion in YARN-3040, let's figure out the best way to describe the flow. Some key issues that we need to conclude on: - How do we include the flow version in the context so that it gets passed into the collector and to the storage eventually? - Flow run id should be a number as opposed to a generic string? - Default behavior for the flow run id if it is missing (i.e. client did not set it) - How do we handle flow attributes in case of nested levels of flows? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
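A minimal illustration of the point above, under the assumption stated in the comment that a flow run id is an epoch timestamp; the variable name is hypothetical:
{code}
// A flow run id as an epoch timestamp: a numeric long, not a string.
long flowRunId = System.currentTimeMillis();
{code}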
[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485797#comment-14485797 ] Naganarasimha G R commented on YARN-3044: - Thanks for the review comments [~zjshen], bq. Can we use ContainerEntity. The events from RM are RM__EVENT, and those from NM are NM__EVENT. This approach should be fine, will update in the next patch. bq. I think we may overestimate the performance impact of writing NM lifecycles. Perhaps a more reasonable performance metric is {{cost of writing lifecycle events per container / cost of managing lifecycle per container * 100%}}. For example, if it is 2%, I guess it will probably be acceptable. Well, true, we might be underestimating the RM's ability to handle publishing of Container entities. But currently I have anyway made it configurable to publish Container entities from the RM side, and while measuring performance we can enable this and check the performance; if fine, then we can totally disable this configuration check and make the RM publish always. Your opinion? bq. I'm not sure if I understand this part correctly, but I incline that system timeline data (RM/NM) is controlled by cluster config and per cluster, while application data is controlled by framework or even per-application config. It may have some problem if the user is able to change the former config. For example, he can hide its application information from cluster admin. Maybe I didn't get this correctly: is it that you intend to say that framework/cluster config (which can impact the application execution) should be logged by RM/NM and other application-specific config can be logged by the AM? bq. Do you mean we should keep yarn.resourcemanager.system-metrics-publisher.enabled to control RM SMP, and create yarn.nodemanager.system-metrics-publisher.enabled to control NM SMP? No, I meant this comment of [~djp] {{We can have different entity types, e.g. NM_CONTAINER_EVENT, RM_CONTAINER_EVENT, for containers' event get posted from NM or RM then we can fully understand how the world could be different from NM and RM (i.e. start time, end time, etc.}} {{However, we can disable RM-side posting work in production environment by default.}} [Event producers] Implement RM writing app lifecycle events to ATS -- Key: YARN-3044 URL: https://issues.apache.org/jira/browse/YARN-3044 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Naganarasimha G R Attachments: YARN-3044.20150325-1.patch, YARN-3044.20150406-1.patch Per design in YARN-2928, implement RM writing app lifecycle events to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
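For reference, the RM-side switch discussed above already exists; a minimal sketch of toggling it programmatically (the NM-side equivalent is only being proposed in this thread and is an assumption, not an existing property):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Existing RM-side switch for the SystemMetricsPublisher.
Configuration conf = new YarnConfiguration();
conf.setBoolean("yarn.resourcemanager.system-metrics-publisher.enabled", true);
{code}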
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485794#comment-14485794 ] zhihai xu commented on YARN-3464: - I also created another JIRA, YARN-3465, which can help this issue and make sure localization happens in the correct order: PUBLIC, PRIVATE and APPLICATION. The issue in my case is also because the PRIVATE LocalResourceRequest is reordered to first and the APPLICATION LocalResourceRequest is reordered to last, with the PUBLIC LocalResourceRequest in the middle, which adds delay for the APPLICATION LocalResourceRequest. Because the entrySet order of a HashMap is not fixed, a LinkedHashMap should be used. Race condition in LocalizerRunner causes container localization timeout. Key: YARN-3464 URL: https://issues.apache.org/jira/browse/YARN-3464 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Race condition in LocalizerRunner causes container localization timeout. Currently LocalizerRunner will kill the ContainerLocalizer when the pending list for LocalizerResourceRequestEvent is empty. {code} } else if (pending.isEmpty()) { action = LocalizerAction.DIE; } {code} If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the ContainerLocalizer due to an empty pending list, this LocalizerResourceRequestEvent will never be handled. Without ContainerLocalizer, LocalizerRunner#update will never be called. The container will stay in LOCALIZING state, until the container is killed by the AM due to TASK_TIMEOUT. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
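A minimal sketch of the ordering point above (hypothetical key names, not the actual NM code): iteration over a HashMap does not preserve insertion order, while a LinkedHashMap does.
{code}
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class OrderDemo {
  public static void main(String[] args) {
    // Insertion order: PUBLIC, PRIVATE, APPLICATION.
    Map<String, String> hash = new HashMap<>();
    Map<String, String> linked = new LinkedHashMap<>();
    for (String v : new String[] {"PUBLIC", "PRIVATE", "APPLICATION"}) {
      hash.put(v, v);
      linked.put(v, v);
    }
    // HashMap iteration order depends on hashing and may differ from insertion.
    System.out.println(hash.keySet());
    // LinkedHashMap always iterates in insertion order.
    System.out.println(linked.keySet()); // [PUBLIC, PRIVATE, APPLICATION]
  }
}
{code}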
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485798#comment-14485798 ] Thomas Graves commented on YARN-3434: - [~wangda] YARN-3243 fixes part of the problem with the max capacities, but it doesn't solve the user limit side of it. The user limit check is never done again. I'll have a patch up for this shortly. I would appreciate it if you could take a look and give me feedback. Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves ULF was set to 1.0. User was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485798#comment-14485798 ] Thomas Graves edited comment on YARN-3434 at 4/8/15 6:59 PM: - [~wangda] YARN-3243 fixes part of the problem with the max capacities, but it doesn't solve the user limit side of it. The user limit check is never done again in assignContainer() if it skipped the checks in assignContainers() based on reservations but then is allowed by shouldAllocOrReserveNewContainer. I'll have a patch up for this shortly. I would appreciate it if you could take a look and give me feedback. was (Author: tgraves): [~wangda] YARN-3243 fixes part of the problem with the max capacities, but it doesn't solve the user limit side of it. The user limit check is never done again. I'll have a patch up for this shortly. I would appreciate it if you could take a look and give me feedback. Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves ULF was set to 1.0. User was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3467) Expose allocatedMB, allocatedVCores, and runningContainers metrics on running Applications in RM Web UI
[ https://issues.apache.org/jira/browse/YARN-3467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Rojas updated YARN-3467: Summary: Expose allocatedMB, allocatedVCores, and runningContainers metrics on running Applications in RM Web UI (was: Expose allocatedMB, allocatedVCores, and runningContainers metrics on running Applications on RM Web UI) Expose allocatedMB, allocatedVCores, and runningContainers metrics on running Applications in RM Web UI --- Key: YARN-3467 URL: https://issues.apache.org/jira/browse/YARN-3467 Project: Hadoop YARN Issue Type: New Feature Components: webapp, yarn Affects Versions: 2.5.0 Reporter: Anthony Rojas Priority: Minor The YARN REST API can report on the following properties: *allocatedMB*: The sum of memory in MB allocated to the application's running containers *allocatedVCores*: The sum of virtual cores allocated to the application's running containers *runningContainers*: The number of containers currently running for the application Currently, the RM Web UI does not report on these items (at least I couldn't find any entries within the Web UI). It would be useful for YARN Application and Resource troubleshooting to have these properties and their corresponding values exposed on the RM WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3467) Expose allocatedMB, allocatedVCores, and runningContainers metrics on running Applications on RM Web UI
[ https://issues.apache.org/jira/browse/YARN-3467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485792#comment-14485792 ] Rohith commented on YARN-3467: -- I think the ApplicationAttempt page would give this information. This page is very helpful for debugging the application. Would you have a look into this page? Expose allocatedMB, allocatedVCores, and runningContainers metrics on running Applications on RM Web UI --- Key: YARN-3467 URL: https://issues.apache.org/jira/browse/YARN-3467 Project: Hadoop YARN Issue Type: New Feature Components: webapp, yarn Affects Versions: 2.5.0 Reporter: Anthony Rojas Priority: Minor The YARN REST API can report on the following properties: *allocatedMB*: The sum of memory in MB allocated to the application's running containers *allocatedVCores*: The sum of virtual cores allocated to the application's running containers *runningContainers*: The number of containers currently running for the application Currently, the RM Web UI does not report on these items (at least I couldn't find any entries within the Web UI). It would be useful for YARN Application and Resource troubleshooting to have these properties and their corresponding values exposed on the RM WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-3434: Attachment: YARN-3434.patch Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-3434.patch ULF was set to 1.0. User was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485818#comment-14485818 ] Karthik Kambatla commented on YARN-3464: We can maybe discuss this more on YARN-3465, but I don't think having it sorted is necessary. The container cannot be started until all the resources are localized, so the order of their downloads shouldn't matter as long as they all get localized. Race condition in LocalizerRunner causes container localization timeout. Key: YARN-3464 URL: https://issues.apache.org/jira/browse/YARN-3464 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Race condition in LocalizerRunner causes container localization timeout. Currently LocalizerRunner will kill the ContainerLocalizer when the pending list for LocalizerResourceRequestEvent is empty. {code} } else if (pending.isEmpty()) { action = LocalizerAction.DIE; } {code} If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the ContainerLocalizer due to an empty pending list, this LocalizerResourceRequestEvent will never be handled. Without ContainerLocalizer, LocalizerRunner#update will never be called. The container will stay in LOCALIZING state, until the container is killed by the AM due to TASK_TIMEOUT. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2423) TimelineClient should wrap all GET APIs to facilitate Java users
[ https://issues.apache.org/jira/browse/YARN-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485827#comment-14485827 ] Robert Kanter commented on YARN-2423: - Thanks for the comments Steve; those definitely sound like good suggestions. However, I'm not going to spend time updating the patch again if we're not going to actually commit this, and it seems like we're not. If that ever changes, I'll make sure to incorporate them though. TimelineClient should wrap all GET APIs to facilitate Java users Key: YARN-2423 URL: https://issues.apache.org/jira/browse/YARN-2423 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Robert Kanter Attachments: YARN-2423.004.patch, YARN-2423.005.patch, YARN-2423.006.patch, YARN-2423.007.patch, YARN-2423.patch, YARN-2423.patch, YARN-2423.patch TimelineClient provides the Java method to put timeline entities. It's also good to wrap over all GET APIs (both entity and domain), and deserialize the json response into Java POJO objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485834#comment-14485834 ] Thomas Graves commented on YARN-3434: - Note I had a reproducible test case for this. Set userlimit% to 100%, user limit factor to 1. 15 nodes, 20GB each. 1 queue configured for capacity 70, the 2nd queue configured for capacity 30. I started a sleep job needing 10 containers of 12GB each in the first queue. I then started a second job in the 2nd queue that needed 25 containers of 12GB each; the second job got containers but then had to reserve others while waiting for the first job to release some. Without this change, when the first job started releasing containers the second job would grab them and go over the user limit. With this fix it stayed within the user limit. Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-3434.patch ULF was set to 1.0. User was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration
[ https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485396#comment-14485396 ] Sunil G commented on YARN-3136: --- Hi [~jlowe] and [~jianhe], could you please have a look at the comment above? getTransferredContainers can be a bottleneck during AM registration --- Key: YARN-3136 URL: https://issues.apache.org/jira/browse/YARN-3136 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Sunil G Attachments: 0001-YARN-3136.patch, 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, 0008-YARN-3136.patch, 0009-YARN-3136.patch While examining RM stack traces on a busy cluster I noticed a pattern of AMs stuck waiting for the scheduler lock trying to call getTransferredContainers. The scheduler lock is highly contended, especially on a large cluster with many nodes heartbeating, and it would be nice if we could find a way to eliminate the need to grab this lock during this call. We've already done similar work during AM allocate calls to make sure they don't needlessly grab the scheduler lock, and it would be good to do so here as well, if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3466) RM nodes web page does not sort by node HTTP address or containers
Jason Lowe created YARN-3466: Summary: RM nodes web page does not sort by node HTTP address or containers Key: YARN-3466 URL: https://issues.apache.org/jira/browse/YARN-3466 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe The ResourceManager does not support sorting by the node HTTP address nor the container count columns on the cluster nodes page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485461#comment-14485461 ] Daryn Sharp commented on YARN-3055: --- bq. It does seem odd to get the expiration date by renewing the token The expiration is metadata associated with the token that is only known to the token issuer's secret manager. The correct fix is for the renewer to not reschedule if the next expiration is the same as the last. The bug wasn't a real priority when tokens weren't renewed forever. If we regress to renewing forever, then it does become a problem. bq. I think currently the sub-job won't kill the overall workflow. Correct, I misread in my haste. It's rather the opposite: sub-jobs can override the original job's request to cancel the tokens. bq. I think overall the current patch will work, other than few comments I have. It works but not in a desirable way. Jason posted my patch that we use internally on YARN-3439, which is duped to this jira. I'm updating it to handle the proxy refresh cases and will post shortly. The current semantics of the conf setting and the 2.x changes have been nothing but production blockers. Ref counting will solve this once and for all. The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Yi Liu Priority: Blocker Attachments: YARN-3055.001.patch, YARN-3055.002.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token, and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, also we should not remove it from {{allTokens}}. Otherwise the existing submitted applications which share this token will not get renewed any more, and for newly submitted applications which share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3. And they share the token1. See the following scenario: *1).* app1 is submitted firstly, then app2, and then app3. In this case, there is only one token renewal timer for token1, and it is scheduled when app1 is submitted. *2).* app1 is finished, then the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
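A minimal sketch of the ref-counting idea mentioned above, under assumed names (the map, helper methods and overall shape are hypothetical, not the actual DelegationTokenRenewer code): each token tracks the applications using it, and the timer and cancellation only go away with the last app.
{code}
// Hypothetical sketch: remove a token only when its last referencing app exits.
private final Map<Token<?>, Set<ApplicationId>> tokenToApps = new HashMap<>();

synchronized void removeApplicationFromRenewal(ApplicationId appId, Token<?> token) {
  Set<ApplicationId> apps = tokenToApps.get(token);
  if (apps == null) {
    return;
  }
  apps.remove(appId);
  if (apps.isEmpty()) {
    tokenToApps.remove(token);
    cancelTimerTask(token);   // assumed helper: stop the renewal timer
    cancelToken(token);       // assumed helper: cancel with the issuer
  }
  // Otherwise keep the timer running so the remaining apps still get renewals.
}
{code}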
[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485527#comment-14485527 ] Zhijie Shen commented on YARN-3044: --- Before screening the patch details, I have some high level comments: bq. IIUC you meant we will have RMContainerEntity having type as YARN_RM_CONTAINER and NMContainerEntity having type as YARN_NM_CONTAINER right ? Can we use ContainerEntity. The events from RM are RM__EVENT, and those from NM are NM__EVENT. bq. I'm very much concerned about the volume of writes that the RM collector would need to do, bq. I fully understand the concern from Sangjin Lee that RM may not afford tens of thousands of containers in a large cluster. I also think publishing all container lifecycle events from NM is likely to be a big cost in total, but I'd like to provide a point from another point of view. Say we have a big cluster that can afford 5,000 concurrent containers. RM has to maintain the lifecycle of these 5K containers, and I don't think a less powerful server can manage it, right? Assume we have such a powerful server to run the RM of a big cluster, will publishing lifecycle events be a big deal to the server? I'm not sure, but I can provide some hints. Now each container will write 2 events per lifecycle, and perhaps in the future we want to record each state transition, and result in ~10 events per lifecycle. Therefore, we have 10 * 5K lifecycle events, and they won't be written at the same moment because containers' lifecycles are usually async. Let's assume each container runs for 1h and lifecycle events are uniformly distributed; in each second, there will just be around 14 concurrent writes (10 * 5,000 events spread over 3,600 seconds) for a powerful server. I think we may overestimate the performance impact of writing NM lifecycles. Perhaps a more reasonable performance metric is {{cost of writing lifecycle events per container / cost of managing lifecycle per container * 100%}}. For example, if it is 2%, I guess it will probably be acceptable. bq. all configs will not be set as part of this so was there more planned for this from the framework side or each application needs to take care of this on their own to populate configuration information ? bq. In that sense, how about letting frameworks (namely AMs) write the configuration instead of RM? I'm not sure if I understand this part correctly, but I'm inclined to think that system timeline data (RM/NM) is controlled by cluster config and per cluster, while application data is controlled by framework or even per-application config. It may have some problem if the user is able to change the former config. For example, he can hide his application information from the cluster admin. bq. I have also incorporated the changes to support RMContainer metrics based on configuration (Junping's comments). Do you mean we should keep {{yarn.resourcemanager.system-metrics-publisher.enabled}} to control RM SMP, and create {{yarn.nodemanager.system-metrics-publisher.enabled}} to control NM SMP? [Event producers] Implement RM writing app lifecycle events to ATS -- Key: YARN-3044 URL: https://issues.apache.org/jira/browse/YARN-3044 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Naganarasimha G R Attachments: YARN-3044.20150325-1.patch, YARN-3044.20150406-1.patch Per design in YARN-2928, implement RM writing app lifecycle events to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3347) Improve YARN log command to get AMContainer logs as well as running containers logs
[ https://issues.apache.org/jira/browse/YARN-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485530#comment-14485530 ] Xuan Gong commented on YARN-3347: - Thanks for the review. bq. For comma-separated value, do we mean attempt number? If so, may be we should describe more explicitly here? Also, can we use 0 (instead of -1) for AM container of latest attempt. If so, all negative value here is illegal. Added. I prefer to use -1 for the latest AM Container. 0 in the list/array is the first element. bq. Missing ! before getConf()? Fixed bq. Sounds like we are re-order the sequence of user's input which seems unnecessary to me. I would suggest to keep order from user's input or it could confuse people. Fixed bq. Also, for method of run(String[] args), it looks very complexity for now. Can we do some refactor work there and put some comments inline? Yes, it indeed added some logic. Added some comments. Improve YARN log command to get AMContainer logs as well as running containers logs --- Key: YARN-3347 URL: https://issues.apache.org/jira/browse/YARN-3347 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3347.1.patch, YARN-3347.1.rebase.patch, YARN-3347.2.patch, YARN-3347.2.rebase.patch Right now, we could specify applicationId, node http address and container ID to get the specific container log. Or we could only specify applicationId to get all the container logs. It is very hard for the users to get logs for the AM container since the AMContainer logs have more useful information. Users need to know the AMContainer's container ID and related Node http address. We could improve the YARN Log Command to allow users to get AMContainer logs directly -- This message was sent by Atlassian JIRA (v6.3.4#6332)
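Based on the -1 convention discussed above, usage would presumably look something like the following; the exact option name comes from the patch under review and is an assumption here:
{code}
# Fetch AM container logs for the latest application attempt (assumed syntax).
yarn logs -applicationId application_1427804754787_0001 -am -1
# Fetch AM container logs for specific attempts via a comma-separated list.
yarn logs -applicationId application_1427804754787_0001 -am 1,2
{code}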
[jira] [Updated] (YARN-3347) Improve YARN log command to get AMContainer logs as well as running containers logs
[ https://issues.apache.org/jira/browse/YARN-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-3347: Attachment: YARN-3347.3.patch Improve YARN log command to get AMContainer logs as well as running containers logs --- Key: YARN-3347 URL: https://issues.apache.org/jira/browse/YARN-3347 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3347.1.patch, YARN-3347.1.rebase.patch, YARN-3347.2.patch, YARN-3347.2.rebase.patch, YARN-3347.3.patch Right now, we could specify applicationId, node http address and container ID to get the specific container log. Or we could only specify applicationId to get all the container logs. It is very hard for the users to get logs for the AM container since the AMContainer logs have more useful information. Users need to know the AMContainer's container ID and related Node http address. We could improve the YARN Log Command to allow users to get AMContainer logs directly -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3293) Track and display capacity scheduler health metrics in web UI
[ https://issues.apache.org/jira/browse/YARN-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485541#comment-14485541 ] Varun Vasudev commented on YARN-3293: - Thanks for the review Craig! I thought about it but I didn't get a chance to look at the FairScheduler page. It should be pretty easy to pull out the block into its own class. Track and display capacity scheduler health metrics in web UI - Key: YARN-3293 URL: https://issues.apache.org/jira/browse/YARN-3293 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: Screen Shot 2015-03-30 at 4.30.14 PM.png, apache-yarn-3293.0.patch, apache-yarn-3293.1.patch, apache-yarn-3293.2.patch, apache-yarn-3293.4.patch, apache-yarn-3293.5.patch, apache-yarn-3293.6.patch It would be good to display metrics that let users know about the health of the capacity scheduler in the web UI. Today it is hard to get an idea if the capacity scheduler is functioning correctly. Metrics such as the time for the last allocation, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3361) CapacityScheduler side changes to support non-exclusive node labels
[ https://issues.apache.org/jira/browse/YARN-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3361: - Attachment: YARN-3361.4.patch Attached patch fixed several naming issues. (ver.4) CapacityScheduler side changes to support non-exclusive node labels --- Key: YARN-3361 URL: https://issues.apache.org/jira/browse/YARN-3361 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3361.1.patch, YARN-3361.2.patch, YARN-3361.3.patch, YARN-3361.4.patch According to the design doc attached in YARN-3214, we need to implement the following logic in CapacityScheduler: 1) When allocating a resource request with no node-label specified, it should get preferentially allocated to nodes without labels. 2) When there are available resources on a node with a label, they can be used by applications in the following order: - Applications under queues which can access the label and ask for the same labeled resource. - Applications under queues which can access the label and ask for non-labeled resource. - Applications under queues which cannot access the label and ask for non-labeled resource. 3) Expose necessary information that can be used by preemption policy to make preemption decisions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3055) The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485638#comment-14485638 ] Jian He commented on YARN-3055: --- Sure, looking forward to your patch. bq. The correct fix is for the renewer to not reschedule if the next expiration is the same as the last. Sorry, didn't get what you mean. Mind clarifying more? The renew call after getting the new token is solely to retrieve the expiration date for the token. I found that, given the RM renews all tokens at once for each app on app submission, if renew rescheduling becomes a DOS problem, then the app submission situation may be much worse. The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer -- Key: YARN-3055 URL: https://issues.apache.org/jira/browse/YARN-3055 Project: Hadoop YARN Issue Type: Bug Components: security Reporter: Yi Liu Assignee: Yi Liu Priority: Blocker Attachments: YARN-3055.001.patch, YARN-3055.002.patch After YARN-2964, there is only one timer to renew the token if it's shared by jobs. In {{removeApplicationFromRenewal}}, when going to remove a token, and the token is shared by other jobs, we will not cancel the token. Meanwhile, we should not cancel the _timerTask_, also we should not remove it from {{allTokens}}. Otherwise the existing submitted applications which share this token will not get renewed any more, and for newly submitted applications which share this token, the token will be renewed immediately. For example, we have 3 applications: app1, app2, app3. And they share the token1. See the following scenario: *1).* app1 is submitted firstly, then app2, and then app3. In this case, there is only one token renewal timer for token1, and it is scheduled when app1 is submitted. *2).* app1 is finished, then the renewal timer is cancelled. token1 will not be renewed any more, but app2 and app3 still use it, so there is a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3459) TestLog4jWarningErrorMetricsAppender breaks in trunk
[ https://issues.apache.org/jira/browse/YARN-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3459: - Fix Version/s: (was: 2.7.0) 2.8.0 TestLog4jWarningErrorMetricsAppender breaks in trunk Key: YARN-3459 URL: https://issues.apache.org/jira/browse/YARN-3459 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Li Lu Priority: Blocker Fix For: 2.8.0 Attachments: apache-yarn-3459.0.patch TestLog4jWarningErrorMetricsAppender fails with the following message: {code} Running org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.214 sec FAILURE! - in org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender testPurge(org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender) Time elapsed: 2.01 sec FAILURE! java.lang.AssertionError: expected:0 but was:1 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender.testPurge(TestLog4jWarningErrorMetricsAppender.java:89) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3459) Fix failiure of TestLog4jWarningErrorMetricsAppender
[ https://issues.apache.org/jira/browse/YARN-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3459: - Assignee: Varun Vasudev (was: Li Lu) Fix failiure of TestLog4jWarningErrorMetricsAppender Key: YARN-3459 URL: https://issues.apache.org/jira/browse/YARN-3459 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Varun Vasudev Priority: Blocker Fix For: 2.8.0 Attachments: apache-yarn-3459.0.patch TestLog4jWarningErrorMetricsAppender fails with the following message: {code} Running org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.214 sec FAILURE! - in org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender testPurge(org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender) Time elapsed: 2.01 sec FAILURE! java.lang.AssertionError: expected:0 but was:1 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender.testPurge(TestLog4jWarningErrorMetricsAppender.java:89) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2684) FairScheduler should tolerate queue configuration changes across RM restarts
[ https://issues.apache.org/jira/browse/YARN-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485655#comment-14485655 ] Rohith commented on YARN-2684: -- [~kasha] kindly provide your thoughts on any more changes to be done as part of this JIRA. FairScheduler should tolerate queue configuration changes across RM restarts Key: YARN-2684 URL: https://issues.apache.org/jira/browse/YARN-2684 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Rohith Priority: Critical Attachments: 0001-YARN-2684.patch YARN-2308 fixes this issue for CS, this JIRA is to fix it for FS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2901) Add errors and warning metrics page to RM, NM web UI
[ https://issues.apache.org/jira/browse/YARN-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan reassigned YARN-2901: Assignee: Wangda Tan (was: Varun Vasudev) Add errors and warning metrics page to RM, NM web UI Key: YARN-2901 URL: https://issues.apache.org/jira/browse/YARN-2901 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Varun Vasudev Assignee: Wangda Tan Fix For: 2.8.0 Attachments: Exception collapsed.png, Exception expanded.jpg, Screen Shot 2015-03-19 at 7.40.02 PM.png, YARN-2901.addendem.1.patch, apache-yarn-2901.0.patch, apache-yarn-2901.1.patch, apache-yarn-2901.2.patch, apache-yarn-2901.3.patch, apache-yarn-2901.4.patch, apache-yarn-2901.5.patch It would be really useful to have statistics on the number of errors and warnings in the RM and NM web UI. I'm thinking about - 1. The number of errors and warnings in the past 5 min/1 hour/12 hours/day 2. The top 'n' (20?) most common exceptions in the past 5 min/1 hour/12 hours/day By errors and warnings I'm referring to the log level. I suspect we can probably achieve this by writing a custom appender? (I'm open to suggestions on alternate mechanisms for implementing this). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
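A minimal sketch of the custom-appender idea from the description, assuming log4j 1.x; the class and field names are hypothetical and this is not the actual Log4jWarningErrorMetricsAppender:
{code}
import java.util.concurrent.atomic.AtomicLong;
import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.Level;
import org.apache.log4j.spi.LoggingEvent;

// Counts WARN and ERROR events so a web UI could report them per time window.
public class WarnErrorCountingAppender extends AppenderSkeleton {
  private final AtomicLong errors = new AtomicLong();
  private final AtomicLong warnings = new AtomicLong();

  @Override
  protected void append(LoggingEvent event) {
    if (event.getLevel().equals(Level.ERROR)) {
      errors.incrementAndGet();
    } else if (event.getLevel().equals(Level.WARN)) {
      warnings.incrementAndGet();
    }
  }

  public long getErrorCount() { return errors.get(); }
  public long getWarningCount() { return warnings.get(); }

  @Override
  public void close() { }

  @Override
  public boolean requiresLayout() { return false; }
}
{code}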
[jira] [Commented] (YARN-3466) RM nodes web page does not sort by node HTTP address or containers
[ https://issues.apache.org/jira/browse/YARN-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485711#comment-14485711 ] Wangda Tan commented on YARN-3466: -- Tried in a local cluster, HTTP address, #containers and node-label sorting all work. +1. Pending Jenkins. RM nodes web page does not sort by node HTTP address or containers -- Key: YARN-3466 URL: https://issues.apache.org/jira/browse/YARN-3466 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.7.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-3466.001.patch The ResourceManager does not support sorting by the node HTTP address nor the container count columns on the cluster nodes page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3459) Fix failiure of TestLog4jWarningErrorMetricsAppender
[ https://issues.apache.org/jira/browse/YARN-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3459: - Summary: Fix failiure of TestLog4jWarningErrorMetricsAppender (was: TestLog4jWarningErrorMetricsAppender breaks in trunk) Fix failiure of TestLog4jWarningErrorMetricsAppender Key: YARN-3459 URL: https://issues.apache.org/jira/browse/YARN-3459 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Li Lu Priority: Blocker Fix For: 2.8.0 Attachments: apache-yarn-3459.0.patch TestLog4jWarningErrorMetricsAppender fails with the following message: {code} Running org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.214 sec FAILURE! - in org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender testPurge(org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender) Time elapsed: 2.01 sec FAILURE! java.lang.AssertionError: expected:0 but was:1 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender.testPurge(TestLog4jWarningErrorMetricsAppender.java:89) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2901) Add errors and warning metrics page to RM, NM web UI
[ https://issues.apache.org/jira/browse/YARN-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2901: - Assignee: Varun Vasudev (was: Wangda Tan) Add errors and warning metrics page to RM, NM web UI Key: YARN-2901 URL: https://issues.apache.org/jira/browse/YARN-2901 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: Exception collapsed.png, Exception expanded.jpg, Screen Shot 2015-03-19 at 7.40.02 PM.png, YARN-2901.addendem.1.patch, apache-yarn-2901.0.patch, apache-yarn-2901.1.patch, apache-yarn-2901.2.patch, apache-yarn-2901.3.patch, apache-yarn-2901.4.patch, apache-yarn-2901.5.patch It would be really useful to have statistics on the number of errors and warnings in the RM and NM web UI. I'm thinking about - 1. The number of errors and warnings in the past 5 min/1 hour/12 hours/day 2. The top 'n' (20?) most common exceptions in the past 5 min/1 hour/12 hours/day By errors and warnings I'm referring to the log level. I suspect we can probably achieve this by writing a custom appender? (I'm open to suggestions on alternate mechanisms for implementing this). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3459) Fix failure of TestLog4jWarningErrorMetricsAppender
[ https://issues.apache.org/jira/browse/YARN-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485667#comment-14485667 ] Hudson commented on YARN-3459: -- FAILURE: Integrated in Hadoop-trunk-Commit #7533 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7533/]) YARN-3459. Fix failure of TestLog4jWarningErrorMetricsAppender. (Varun Vasudev via wangda) (wangda: rev 7af086a515d573dc90ea4deec7f4e3f23622e0e8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestLog4jWarningErrorMetricsAppender.java * hadoop-yarn-project/CHANGES.txt Fix failure of TestLog4jWarningErrorMetricsAppender Key: YARN-3459 URL: https://issues.apache.org/jira/browse/YARN-3459 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Varun Vasudev Priority: Blocker Fix For: 2.8.0 Attachments: apache-yarn-3459.0.patch TestLog4jWarningErrorMetricsAppender fails with the following message: {code} Running org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.214 sec FAILURE! - in org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender testPurge(org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender) Time elapsed: 2.01 sec FAILURE! java.lang.AssertionError: expected:<0> but was:<1> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.util.TestLog4jWarningErrorMetricsAppender.testPurge(TestLog4jWarningErrorMetricsAppender.java:89) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
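As a general note on the flakiness above (a sketch only, not the committed fix; getErrorCounts() is a hypothetical accessor on the appender under test): assertions against a timer-driven purge are usually made robust by polling up to a deadline instead of asserting after a fixed sleep.
{code}
// Poll until the purge fires or a generous deadline passes, then assert.
long deadline = System.currentTimeMillis() + 10000;
while (appender.getErrorCounts().size() != 0
    && System.currentTimeMillis() < deadline) {
  Thread.sleep(100); // the purge runs on a timer; give it room to fire
}
Assert.assertEquals(0, appender.getErrorCounts().size());
{code}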
[jira] [Commented] (YARN-3110) Few issues in ApplicationHistory web ui
[ https://issues.apache.org/jira/browse/YARN-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485691#comment-14485691 ] Hudson commented on YARN-3110: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #158 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/158/]) YARN-3110. Few issues in ApplicationHistory web ui. Contributed by Naganarasimha G R (xgong: rev 19a4feaf6fcf42ebbfe98b8a7153ade96d37fb14) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppAttemptBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java Few issues in ApplicationHistory web ui --- Key: YARN-3110 URL: https://issues.apache.org/jira/browse/YARN-3110 Project: Hadoop YARN Issue Type: Sub-task Components: applications, timelineserver Affects Versions: 2.6.0 Reporter: Bibin A Chundatt Assignee: Naganarasimha G R Priority: Minor Fix For: 2.8.0 Attachments: YARN-3110.20150209-1.patch, YARN-3110.20150315-1.patch, YARN-3110.20150406-1.patch The application state and History link are wrong when the application is in the unassigned state: 1. Configure the capacity scheduler with a queue size of 1 and an Absolute Max Capacity of 10.0% (the application state is then Accepted and Unassigned on the resource manager side). 2. Submit an application to the queue and check the state and link in the application history. State = null and the History link is shown as N/A on the applicationhistory page. Kill the same application. When the application link is selected, the timeline server logs show: {quote} 2015-01-29 15:39:50,956 ERROR org.apache.hadoop.yarn.webapp.View: Failed to read the AM container of the application attempt appattempt_1422467063659_0007_01.
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getContainer(ApplicationHistoryManagerOnTimelineStore.java:162)
	at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getAMContainer(ApplicationHistoryManagerOnTimelineStore.java:184)
	at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:160)
	at org.apache.hadoop.yarn.server.webapp.AppBlock$3.run(AppBlock.java:157)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.yarn.server.webapp.AppBlock.render(AppBlock.java:156)
	at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:67)
	at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77)
	at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
	at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
	at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117)
	at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845)
	at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56)
	at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
	at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
	at org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSController.app(AHSController.java:38)
	at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
	at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
	at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
	at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
	at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
	at
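The trace above fails inside getContainer when the attempt has no AM container yet. A hedged sketch of the null-guard such a fix typically needs (the helper name and shape are illustrative, not the committed YARN-3110 patch):
{code}
import org.apache.hadoop.yarn.api.records.ApplicationAttemptReport;
import org.apache.hadoop.yarn.api.records.ContainerReport;

// If the application is still unassigned there is no AM container yet;
// return null so the web view can render N/A instead of throwing an NPE.
private ContainerReport getAMContainerSafely(ApplicationAttemptReport attempt) {
  if (attempt == null || attempt.getAMContainerId() == null) {
    return null;
  }
  return getContainer(attempt.getAMContainerId()); // existing lookup
}
{code}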
[jira] [Commented] (YARN-3457) NPE when NodeManager.serviceInit fails and stopRecoveryStore called
[ https://issues.apache.org/jira/browse/YARN-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485693#comment-14485693 ] Hudson commented on YARN-3457: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #158 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/158/]) YARN-3457. NPE when NodeManager.serviceInit fails and stopRecoveryStore called. Contributed by Bibin A Chundatt. (ozawa: rev dd852f5b8c8fe9e52d15987605f36b5b60f02701) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java NPE when NodeManager.serviceInit fails and stopRecoveryStore called --- Key: YARN-3457 URL: https://issues.apache.org/jira/browse/YARN-3457 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Fix For: 2.8.0 Attachments: YARN-3457.001.patch When NodeManager serviceInit fails, a NullPointerException is thrown from stopRecoveryStore:
{code}
@Override
protected void serviceInit(Configuration conf) throws Exception {
  ..
  try {
    exec.init();
  } catch (IOException e) {
    throw new YarnRuntimeException("Failed to initialize container executor", e);
  }
  this.context = createNMContext(containerTokenSecretManager,
      nmTokenSecretManager, nmStore);
{code}
context is null when serviceInit fails:
{code}
private void stopRecoveryStore() throws IOException {
  nmStore.stop();
  if (context.getDecommissioned() && nmStore.canRecover()) {
    ..
  }
}
{code}
The NullPointerException thrown:
{quote}
2015-04-07 17:31:45,807 WARN org.apache.hadoop.service.AbstractService: When stopping the service NodeManager : java.lang.NullPointerException
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:168)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:280)
	at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
	at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
	at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:484)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:534)
{quote}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
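A minimal sketch of the guard the report calls for (an assumed shape, not necessarily the committed YARN-3457 patch): skip the context-dependent cleanup when serviceInit failed before createNMContext() ran.
{code}
private void stopRecoveryStore() throws IOException {
  if (nmStore == null) {
    return; // serviceInit failed before the recovery store was created
  }
  nmStore.stop();
  // context is null when serviceInit failed before createNMContext() ran
  if (context != null && context.getDecommissioned() && nmStore.canRecover()) {
    // .. unchanged: remove the recovery state directory
  }
}
{code}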
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485687#comment-14485687 ] Karthik Kambatla commented on YARN-3464: bq. Looking at the code closely, I don't see any resources being removed from pending. So, pending shouldn't be empty after some of the resources have been downloaded. Never mind. findNextResource has a call to iterator.remove(). In any case, I think the right approach is to send an explicit event to the localizer to indicate we are done localizing all the resources. On receiving this, the localizer tracker sends the DIE action. Race condition in LocalizerRunner causes container localization timeout. Key: YARN-3464 URL: https://issues.apache.org/jira/browse/YARN-3464 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical A race condition in LocalizerRunner causes container localization to time out. Currently LocalizerRunner kills the ContainerLocalizer when the pending list of LocalizerResourceRequestEvents is empty:
{code}
} else if (pending.isEmpty()) {
  action = LocalizerAction.DIE;
}
{code}
If a LocalizerResourceRequestEvent is added after the LocalizerRunner has killed the ContainerLocalizer due to an empty pending list, that LocalizerResourceRequestEvent will never be handled. Without a ContainerLocalizer, LocalizerRunner#update is never called. The container stays in the LOCALIZING state until it is killed by the AM due to TASK_TIMEOUT. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
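A sketch of the direction Karthik proposes (the flag and the completion event are invented names for illustration): only issue DIE once completion has been signalled explicitly, so a request that races with a momentarily empty pending list is still served.
{code}
// Inside LocalizerRunner's heartbeat handling (illustrative shape only):
LocalizerAction action = LocalizerAction.LIVE;
if (allResourcesDone) {
  // Set only after an explicit "localization complete" event, so no
  // LocalizerResourceRequestEvent can arrive after this point.
  action = LocalizerAction.DIE;
} else if (pending.isEmpty()) {
  // An empty pending list is not proof that localization is finished:
  // keep the ContainerLocalizer alive and let it heartbeat again.
  action = LocalizerAction.LIVE;
}
{code}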
[jira] [Commented] (YARN-3429) TestAMRMTokens.testTokenExpiry fails intermittently with error message: Invalid AMRMToken
[ https://issues.apache.org/jira/browse/YARN-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485700#comment-14485700 ] Hudson commented on YARN-3429: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #158 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/158/]) YARN-3429. Fix incorrect CHANGES.txt (rkanter: rev 5b8a3ae366294aec492f69f1a429aa7fce5d13be) * hadoop-yarn-project/CHANGES.txt TestAMRMTokens.testTokenExpiry fails intermittently with error message: Invalid AMRMToken Key: YARN-3429 URL: https://issues.apache.org/jira/browse/YARN-3429 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.8.0 Attachments: YARN-3429.000.patch TestAMRMTokens.testTokenExpiry fails intermittently with error message: Invalid AMRMToken from appattempt_1427804754787_0001_01 The error log is at https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3294) Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period
[ https://issues.apache.org/jira/browse/YARN-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485695#comment-14485695 ] Hudson commented on YARN-3294: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #158 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/158/]) YARN-3294. Allow dumping of Capacity Scheduler debug logs via web UI for (xgong: rev d27e9241e8676a0edb2d35453cac5f9495fcd605) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestAdHocLogDumper.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/AdHocLogDumper.java Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period - Key: YARN-3294 URL: https://issues.apache.org/jira/browse/YARN-3294 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: Screen Shot 2015-03-12 at 8.51.25 PM.png, apache-yarn-3294.0.patch, apache-yarn-3294.1.patch, apache-yarn-3294.2.patch, apache-yarn-3294.3.patch, apache-yarn-3294.4.patch It would be nice to have a button on the web UI that would allow dumping of debug logs for just the capacity scheduler for a fixed period of time (1 min, 5 min or so) in a separate log file. It would be useful when debugging scheduler behavior without affecting the rest of the resourcemanager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
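For illustration, a hedged sketch of the ad-hoc dumping idea in log4j 1.x. This is not the actual AdHocLogDumper API; the class and method names here are assumptions. The idea: raise one logger to DEBUG with a dedicated file appender, then revert after the requested period.
{code}
import java.io.IOException;
import java.util.Timer;
import java.util.TimerTask;

import org.apache.log4j.FileAppender;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;

public class AdHocDebugDump {
  public static void dumpFor(String loggerName, String targetFile,
      long periodMs) throws IOException {
    final Logger logger = Logger.getLogger(loggerName);
    final Level previousLevel = logger.getLevel();
    final FileAppender appender =
        new FileAppender(new PatternLayout("%d{ISO8601} %p %c: %m%n"),
            targetFile);
    logger.addAppender(appender);
    logger.setLevel(Level.DEBUG); // one logger only; rest of the RM untouched
    new Timer("adhoc-log-dump", true).schedule(new TimerTask() {
      @Override
      public void run() {
        logger.setLevel(previousLevel); // restore the configured level
        logger.removeAppender(appender);
        appender.close();
      }
    }, periodMs);
  }
}
{code}
A web endpoint could invoke it as, say, AdHocDebugDump.dumpFor("org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity", "/tmp/cs-debug.log", 5 * 60 * 1000L).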
[jira] [Updated] (YARN-3466) Fix RM nodes web page to sort by node HTTP-address, #containers and node-label column
[ https://issues.apache.org/jira/browse/YARN-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3466: - Summary: Fix RM nodes web page to sort by node HTTP-address, #containers and node-label column (was: RM nodes web page does not sort by node HTTP address or containers) Fix RM nodes web page to sort by node HTTP-address, #containers and node-label column - Key: YARN-3466 URL: https://issues.apache.org/jira/browse/YARN-3466 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.7.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-3466.001.patch The ResourceManager does not support sorting by the node HTTP address or the container count columns on the cluster nodes page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2901) Add errors and warning metrics page to RM, NM web UI
[ https://issues.apache.org/jira/browse/YARN-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485714#comment-14485714 ] Hudson commented on YARN-2901: -- FAILURE: Integrated in Hadoop-trunk-Commit #7534 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7534/]) YARN-2901 addendum: Fixed findbugs warning caused by the previous patch (wangda: rev ba9ee22ca4ed2c5ff447b66b2e2dfe25f6880fe0) * hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml Add errors and warning metrics page to RM, NM web UI Key: YARN-2901 URL: https://issues.apache.org/jira/browse/YARN-2901 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: Exception collapsed.png, Exception expanded.jpg, Screen Shot 2015-03-19 at 7.40.02 PM.png, YARN-2901.addendem.1.patch, apache-yarn-2901.0.patch, apache-yarn-2901.1.patch, apache-yarn-2901.2.patch, apache-yarn-2901.3.patch, apache-yarn-2901.4.patch, apache-yarn-2901.5.patch It would be really useful to have statistics on the number of errors and warnings in the RM and NM web UI. I'm thinking about - 1. The number of errors and warnings in the past 5 min/1 hour/12 hours/day 2. The top 'n' (20?) most common exceptions in the past 5 min/1 hour/12 hours/day By errors and warnings I'm referring to the log level. I suspect we can probably achieve this by writing a custom appender? (I'm open to suggestions on alternate mechanisms for implementing this.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3448) Add Rolling Time To Lives Level DB Plugin Capabilities
[ https://issues.apache.org/jira/browse/YARN-3448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated YARN-3448: -- Attachment: YARN-3448.4.patch A couple more bug fixes in patch 4. Next I'll try out the index change you suggested above. Add Rolling Time To Lives Level DB Plugin Capabilities -- Key: YARN-3448 URL: https://issues.apache.org/jira/browse/YARN-3448 Project: Hadoop YARN Issue Type: Improvement Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-3448.1.patch, YARN-3448.2.patch, YARN-3448.3.patch, YARN-3448.4.patch For large applications, the majority of the time in LeveldbTimelineStore is spent deleting old entities one record at a time. An exclusive write lock is held during the entire deletion phase, which in practice can be hours. If we are willing to relax some of the consistency constraints, other performance-enhancing techniques can be employed to maximize throughput and minimize locking time. Split the 5 sections of the leveldb database (domain, owner, start time, entity, index) into 5 separate databases. This allows each database to maximize read-cache effectiveness based on its unique usage patterns, making each lookup much faster; it can also help with I/O to have the entity and index databases on separate disks. Use rolling DBs for the entity and index databases: 99.9% of the data is in these two sections, at roughly a 4:1 ratio (index to entity), at least for Tez. We can replace record-at-a-time DB removal with file system removal if we create a rolling set of databases that age out and can be removed efficiently. To do this we must always place an entity's events into the correct rolling DB instance based on start time, which lets us stitch the data back together while reading, with artificial paging. Relax the synchronous-write constraint: if we are willing to accept losing some records that were not flushed by the operating system during a crash, we can use async writes, which can be much faster. Prefer sequential writes: they can be several times faster than random writes, so spend some small effort arranging writes in a way that trends toward sequential write performance over random write performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
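To make the start-time routing constraint concrete, a small sketch, generic over the DB handle type; every name here is illustrative rather than the patch's API. Entities are routed to a rolling instance keyed by the bucket of their start time, so aging out becomes a whole-instance removal rather than record-at-a-time deletes.
{code}
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.function.Function;

public class RollingDbs<DB> {
  private final long rollPeriodMs;
  // bucket start time -> DB instance; ordered so reads can stitch the
  // buckets back together in time order (the "artificial paging" above)
  private final ConcurrentSkipListMap<Long, DB> instances =
      new ConcurrentSkipListMap<>();

  public RollingDbs(long rollPeriodMs) {
    this.rollPeriodMs = rollPeriodMs;
  }

  // All of an entity's events go to the bucket of the entity's start time,
  // never the current wall clock, so its data stays in one instance.
  public DB forStartTime(long entityStartTime, Function<Long, DB> open) {
    long bucket = (entityStartTime / rollPeriodMs) * rollPeriodMs;
    return instances.computeIfAbsent(bucket, open);
  }

  // Aging out old data is a cheap bulk removal of whole instances; a real
  // implementation would also close each DB and delete its files.
  public void evictOlderThan(long minStartTime) {
    instances.headMap((minStartTime / rollPeriodMs) * rollPeriodMs).clear();
  }
}
{code}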
[jira] [Updated] (YARN-3426) Add jdiff support to YARN
[ https://issues.apache.org/jira/browse/YARN-3426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-3426: Attachment: YARN-3426-040815.patch Fixed the problem in hadoop-annotate for our javadoc doclet (missing @Private tags on some methods). Uploaded the new patch with the new API XMLs. Add jdiff support to YARN - Key: YARN-3426 URL: https://issues.apache.org/jira/browse/YARN-3426 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Priority: Blocker Attachments: YARN-3426-040615-1.patch, YARN-3426-040615.patch, YARN-3426-040715.patch, YARN-3426-040815.patch Maybe we'd like to extend our current jdiff tool for hadoop-common and HDFS to YARN as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)