[jira] [Updated] (YARN-4392) ApplicationCreatedEvent event time resets after RM restart/failover

2015-12-01 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-4392:

Attachment: YARN-4392.2.patch

> ApplicationCreatedEvent event time resets after RM restart/failover
> ---
>
> Key: YARN-4392
> URL: https://issues.apache.org/jira/browse/YARN-4392
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Critical
> Attachments: YARN-4392-2015-11-24.patch, YARN-4392.1.patch, 
> YARN-4392.2.patch
>
>
> {code}2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - 
> Finished time 1437453994768 is ahead of started time 1440308399674 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437454008244 is ahead of started time 1440308399676 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444305171 is ahead of started time 1440308399653 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444293115 is ahead of started time 1440308399647 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444379645 is ahead of started time 1440308399656 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444361234 is ahead of started time 1440308399655 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444342029 is ahead of started time 1440308399654 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444323447 is ahead of started time 1440308399654 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143730006 is ahead of started time 1440308399660 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143715698 is ahead of started time 1440308399659 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143719060 is ahead of started time 1440308399658 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444393931 is ahead of started time 1440308399657
> {code}
> From the ATS logs, we see a large number of 'stale alert' messages 
> periodically.





[jira] [Commented] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart

2015-12-01 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034604#comment-15034604
 ] 

Daniel Templeton commented on YARN-4406:


Looking at the {{NodesListManager}}, it seems to me that 
{{ClusterMetrics.getMetrics().getDecomissionedNMs()}} is set to the size of the 
excludes list, while {{getUnusableNodes()}} returns the list of nodes that have 
been decommissioned since the last reboot.
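
To make that mismatch concrete, here is a minimal standalone sketch (all names hypothetical, not the actual {{NodesListManager}}/{{ClusterMetrics}} code): a counter seeded from the excludes file diverges from a list that is only populated by runtime deactivation events, which is exactly the "count is 1 but the list is empty" symptom after a restart.

{code:java}
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical names; only illustrates the divergence described above.
public class DecommissionMetricsSketch {
  private int decommissionedCount;                  // stands in for the ClusterMetrics gauge
  private final Set<String> unusableNodes = new HashSet<String>(); // stands in for getUnusableNodes()

  void refreshFromExcludesFile(Set<String> excludedHosts) {
    // Seeded from configuration alone, independent of runtime history.
    decommissionedCount = excludedHosts.size();
  }

  void onNodeDecommissioned(String host) {
    // Runtime path: both views stay consistent while the RM is up.
    unusableNodes.add(host);
    decommissionedCount = unusableNodes.size();
  }

  public static void main(String[] args) {
    DecommissionMetricsSketch rm = new DecommissionMetricsSketch();
    // Simulated RM restart: the excludes file still lists one host, but no
    // runtime deactivation event has been observed yet.
    rm.refreshFromExcludesFile(Collections.singleton("host1"));
    System.out.println("decommissioned=" + rm.decommissionedCount
        + ", unusableList=" + rm.unusableNodes.size()); // prints 1 and 0
  }
}
{code}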

> RM Web UI continues to show decommissioned nodes even after RM restart
> --
>
> Key: YARN-4406
> URL: https://issues.apache.org/jira/browse/YARN-4406
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Ray Chiang
>Priority: Minor
>
> If you start up a cluster, decommission a NodeManager, and restart the RM, 
> the decommissioned node list will still show a positive number (1 in the case 
> of 1 node) and if you click on the list, it will be empty.





[jira] [Commented] (YARN-4108) CapacityScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request

2015-12-01 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034633#comment-15034633
 ] 

Eric Payne commented on YARN-4108:
--

Thanks very much [~leftnoteasy] for creating this POC. Just a quick note: the 
patch no longer applies cleanly to trunk and branch-2.8.

> CapacityScheduler: Improve preemption to preempt only those containers that 
> would satisfy the incoming request
> --
>
> Key: YARN-4108
> URL: https://issues.apache.org/jira/browse/YARN-4108
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-4108-design-doc-v1.pdf, YARN-4108.poc.1.patch
>
>
> This is a sibling JIRA of YARN-2154. We should make sure container preemption 
> is more effective.
> *Requirements:*
> 1) Can handle the case of user-limit preemption
> 2) Can handle the case of resource placement requirements, such as: hard-locality 
> (I only want to use rack-1) / node-constraints (YARN-3409) / black-list (I 
> don't want to use rack1 and host\[1-3\])
> 3) Can handle preemption within a queue: cross-user preemption (YARN-2113), 
> cross-application preemption (such as priority-based (YARN-1963) / 
> fairness-based (YARN-3319)).





[jira] [Updated] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart

2015-12-01 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton updated YARN-4406:
---
Assignee: (was: Daniel Templeton)

> RM Web UI continues to show decommissioned nodes even after RM restart
> --
>
> Key: YARN-4406
> URL: https://issues.apache.org/jira/browse/YARN-4406
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Ray Chiang
>Priority: Minor
>
> If you start up a cluster, decommission a NodeManager, and restart the RM, 
> the decommissioned node list will still show a positive number (1 in the case 
> of 1 node) and if you click on the list, it will be empty.





[jira] [Assigned] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart

2015-12-01 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton reassigned YARN-4406:
--

Assignee: Daniel Templeton

> RM Web UI continues to show decommissioned nodes even after RM restart
> --
>
> Key: YARN-4406
> URL: https://issues.apache.org/jira/browse/YARN-4406
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Ray Chiang
>Assignee: Daniel Templeton
>Priority: Minor
>
> If you start up a cluster, decommission a NodeManager, and restart the RM, 
> the decommissioned node list will still show a positive number (1 in the case 
> of 1 node) and if you click on the list, it will be empty.





[jira] [Updated] (YARN-4405) Support node label store in non-appendable file system

2015-12-01 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-4405:
-
Attachment: YARN-4405.1.patch

Attached initial patch for review.

> Support node label store in non-appendable file system
> --
>
> Key: YARN-4405
> URL: https://issues.apache.org/jira/browse/YARN-4405
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, client, resourcemanager
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-4405.1.patch
>
>
> The existing node label file system store implementation uses append to write 
> edit logs. However, some file systems don't support append, so we need to add an 
> implementation that supports such non-appendable file systems as well.
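
This is not the attached YARN-4405.1.patch, just a minimal local-filesystem sketch (class name and on-disk format are made up) of the approach the description implies: rewrite a complete mirror of the label mappings on every mutation and atomically replace the previous file, so the store never needs append support from the underlying file system.

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Map;
import java.util.TreeMap;

public class NonAppendableLabelStoreSketch {
  private final Path mirror;
  private final Map<String, String> nodeToLabel = new TreeMap<String, String>();

  public NonAppendableLabelStoreSketch(Path mirror) {
    this.mirror = mirror;
  }

  public void replaceLabelOnNode(String node, String label) throws IOException {
    nodeToLabel.put(node, label);
    writeFullMirror();                              // full rewrite, no append
  }

  private void writeFullMirror() throws IOException {
    Path tmp = mirror.resolveSibling(mirror.getFileName() + ".tmp");
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, String> e : nodeToLabel.entrySet()) {
      sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
    }
    Files.write(tmp, sb.toString().getBytes(StandardCharsets.UTF_8));
    // Atomic replace keeps a consistent on-disk view even if the writer dies mid-write.
    Files.move(tmp, mirror, StandardCopyOption.REPLACE_EXISTING,
        StandardCopyOption.ATOMIC_MOVE);
  }

  public static void main(String[] args) throws IOException {
    NonAppendableLabelStoreSketch store =
        new NonAppendableLabelStoreSketch(Paths.get("node-labels.mirror"));
    store.replaceLabelOnNode("host1", "gpu");
  }
}
{code}

The trade-off of a full rewrite versus an edit log is extra write volume per mutation, which is usually acceptable for label mappings that change rarely.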





[jira] [Commented] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart

2015-12-01 Thread Kuhu Shukla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034480#comment-15034480
 ] 

Kuhu Shukla commented on YARN-4406:
---

[~rchiang] thank you for reporting this. I think the root cause is:

In {{updateMetricsForDeactivatedNode}}, the re-addition of a node does not 
decrement the decommissioned node count as expected. It looks at the previous 
state, and there is no switch case for decommissioned nodes.
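
A sketch of that fix direction under stated assumptions (the method shape, metric fields, and switch arms are hypothetical, not the actual {{RMNodeImpl}} code; as noted later in this thread, trunk's {{updateMetricsForRejoinedNode}} may already cover it): when a node rejoins, the metrics update has to switch on the node's previous state, and a missing {{DECOMMISSIONED}} case leaves the gauge stuck.

{code:java}
// Enum values mirror real NodeState names; everything else is illustrative.
enum NodeState { NEW, RUNNING, UNHEALTHY, DECOMMISSIONED, LOST, REBOOTED }

class RejoinMetricsSketch {
  private int decommissionedNodes;
  private int unhealthyNodes;
  private int lostNodes;

  void updateMetricsForRejoinedNode(NodeState previousState) {
    switch (previousState) {
      case LOST:
        lostNodes--;
        break;
      case UNHEALTHY:
        unhealthyNodes--;
        break;
      case DECOMMISSIONED:   // without this case the count never comes back down
        decommissionedNodes--;
        break;
      default:
        break;
    }
  }
}
{code}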

> RM Web UI continues to show decommissioned nodes even after RM restart
> --
>
> Key: YARN-4406
> URL: https://issues.apache.org/jira/browse/YARN-4406
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Ray Chiang
>Priority: Minor
>
> If you start up a cluster, decommission a NodeManager, and restart the RM, 
> the decommissioned node list will still show a positive number (1 in the case 
> of 1 node) and if you click on the list, it will be empty.





[jira] [Commented] (YARN-4407) Support Resource oversubscription in YARN scheduler

2015-12-01 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034576#comment-15034576
 ] 

Wangda Tan commented on YARN-4407:
--

[~xgong], is this a duplicate of YARN-1011/YARN-1013?

> Support Resource oversubscription in YARN scheduler
> ---
>
> Key: YARN-4407
> URL: https://issues.apache.org/jira/browse/YARN-4407
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>
> Many long-running services do not fully use their allocated resources all the 
> time. We could take advantage of temporarily unused resources to execute 
> low-priority jobs such as background analytics, etc. This would definitely improve 
> resource utilization.





[jira] [Commented] (YARN-4392) ApplicationCreatedEvent event time resets after RM restart/failover

2015-12-01 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034502#comment-15034502
 ] 

Xuan Gong commented on YARN-4392:
-

Thanks for the suggestion, [~jlowe]. That makes sense.

Uploaded a new patch to address the comments. Could you review it, please?

> ApplicationCreatedEvent event time resets after RM restart/failover
> ---
>
> Key: YARN-4392
> URL: https://issues.apache.org/jira/browse/YARN-4392
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Critical
> Attachments: YARN-4392-2015-11-24.patch, YARN-4392.1.patch, 
> YARN-4392.2.patch
>
>
> {code}2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - 
> Finished time 1437453994768 is ahead of started time 1440308399674 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437454008244 is ahead of started time 1440308399676 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444305171 is ahead of started time 1440308399653 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444293115 is ahead of started time 1440308399647 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444379645 is ahead of started time 1440308399656 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444361234 is ahead of started time 1440308399655 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444342029 is ahead of started time 1440308399654 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444323447 is ahead of started time 1440308399654 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143730006 is ahead of started time 1440308399660 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143715698 is ahead of started time 1440308399659 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143719060 is ahead of started time 1440308399658 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444393931 is ahead of started time 1440308399657
> {code}
> From the ATS logs, we see a large number of 'stale alert' messages 
> periodically.





[jira] [Created] (YARN-4407) Support Resource oversubscription in YARN scheduler

2015-12-01 Thread Xuan Gong (JIRA)
Xuan Gong created YARN-4407:
---

 Summary: Support Resource oversubscription in YARN scheduler
 Key: YARN-4407
 URL: https://issues.apache.org/jira/browse/YARN-4407
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Xuan Gong
Assignee: Xuan Gong


Many long-running services do not fully use their allocated resources all the 
time. We could take advantage of temporarily unused resources to execute 
low-priority jobs such as background analytics, etc. This would definitely improve 
resource utilization.





[jira] [Commented] (YARN-3769) Consider user limit when calculating total pending resource for preemption policy in Capacity Scheduler

2015-12-01 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034613#comment-15034613
 ] 

Sangjin Lee commented on YARN-3769:
---

I'll yield to [~djp] on which release this should go into (2.6.3 vs. 2.6.4).

> Consider user limit when calculating total pending resource for preemption 
> policy in Capacity Scheduler
> ---
>
> Key: YARN-3769
> URL: https://issues.apache.org/jira/browse/YARN-3769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0, 2.7.0, 2.8.0
>Reporter: Eric Payne
>Assignee: Eric Payne
> Fix For: 2.7.3
>
> Attachments: YARN-3769-branch-2.002.patch, 
> YARN-3769-branch-2.6.001.patch, YARN-3769-branch-2.7.002.patch, 
> YARN-3769-branch-2.7.003.patch, YARN-3769-branch-2.7.005.patch, 
> YARN-3769-branch-2.7.006.patch, YARN-3769-branch-2.7.007.patch, 
> YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch, 
> YARN-3769.003.patch, YARN-3769.004.patch, YARN-3769.005.patch
>
>
> We are seeing the preemption monitor preempting containers from queue A and 
> then seeing the capacity scheduler giving them immediately back to queue A. 
> This happens quite often and causes a lot of churn.





[jira] [Commented] (YARN-4398) Yarn recover functionality causes the cluster running slowly and the cluster usage rate is far below 100

2015-12-01 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034589#comment-15034589
 ] 

Daniel Templeton commented on YARN-4398:


Yep.  +1 (non-binding)

> Yarn recover functionality causes the cluster running slowly and the cluster 
> usage rate is far below 100
> 
>
> Key: YARN-4398
> URL: https://issues.apache.org/jira/browse/YARN-4398
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: NING DING
> Attachments: YARN-4398.2.patch, YARN-4398.3.patch
>
>
> In my Hadoop cluster, the ResourceManager recovery functionality is enabled 
> with FileSystemRMStateStore.
> I found this causes the YARN cluster to run slowly, and the cluster usage rate 
> is just 50% even when there are many pending apps. 
> The scenario is below.
> In thread A, RMAppImpl$RMAppNewlySavingTransition is calling the 
> storeNewApplication method defined in RMStateStore. This storeNewApplication 
> method is synchronized.
> {code:title=RMAppImpl.java|borderStyle=solid}
>   private static final class RMAppNewlySavingTransition extends 
> RMAppTransition {
> @Override
> public void transition(RMAppImpl app, RMAppEvent event) {
>   // If recovery is enabled then store the application information in a
>   // non-blocking call so make sure that RM has stored the information
>   // needed to restart the AM after RM restart without further client
>   // communication
>   LOG.info("Storing application with id " + app.applicationId);
>   app.rmContext.getStateStore().storeNewApplication(app);
> }
>   }
> {code}
> {code:title=RMStateStore.java|borderStyle=solid}
> public synchronized void storeNewApplication(RMApp app) {
> ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
> assert context instanceof ApplicationSubmissionContextPBImpl;
> ApplicationStateData appState =
> ApplicationStateData.newInstance(
> app.getSubmitTime(), app.getStartTime(), context, app.getUser());
> dispatcher.getEventHandler().handle(new RMStateStoreAppEvent(appState));
>   }
> {code}
> In thread B, FileSystemRMStateStore is calling the 
> storeApplicationStateInternal method. It's also synchronized.
> This storeApplicationStateInternal method saves an ApplicationStateData into 
> HDFS, and it normally takes 90~300 milliseconds in my Hadoop cluster.
> {code:title=FileSystemRMStateStore.java|borderStyle=solid}
> public synchronized void storeApplicationStateInternal(ApplicationId appId,
>   ApplicationStateData appStateDataPB) throws Exception {
> Path appDirPath = getAppDir(rmAppRoot, appId);
> mkdirsWithRetries(appDirPath);
> Path nodeCreatePath = getNodePath(appDirPath, appId.toString());
> LOG.info("Storing info for app: " + appId + " at: " + nodeCreatePath);
> byte[] appStateData = appStateDataPB.getProto().toByteArray();
> try {
>   // currently throw all exceptions. May need to respond differently for 
> HA
>   // based on whether we have lost the right to write to FS
>   writeFileWithRetries(nodeCreatePath, appStateData, true);
> } catch (Exception e) {
>   LOG.info("Error storing info for app: " + appId, e);
>   throw e;
> }
>   }
> {code}
> Suppose thread B enters the 
> FileSystemRMStateStore.storeApplicationStateInternal method first; then thread A 
> will be blocked for a while because of the synchronization. In the ResourceManager 
> there is only one RMStateStore instance; in my cluster it is of the 
> FileSystemRMStateStore type.
> Debugging the RMAppNewlySavingTransition.transition method, the thread stack 
> shows it is called from the AsyncDispatcher.dispatch method. This method's code is 
> as below. 
> {code:title=AsyncDispatcher.java|borderStyle=solid}
>   protected void dispatch(Event event) {
> //all events go thru this loop
> if (LOG.isDebugEnabled()) {
>   LOG.debug("Dispatching the event " + event.getClass().getName() + "."
>   + event.toString());
> }
> Class type = event.getType().getDeclaringClass();
> try{
>   EventHandler handler = eventDispatchers.get(type);
>   if(handler != null) {
> handler.handle(event);
>   } else {
> throw new Exception("No handler for registered for " + type);
>   }
> } catch (Throwable t) {
>   //TODO Maybe log the state of the queue
>   LOG.fatal("Error in dispatcher thread", t);
>   // If serviceStop is called, we should exit this thread gracefully.
>   if (exitOnDispatchException
>   && (ShutdownHookManager.get().isShutdownInProgress()) == false
>   && stopped == false) {
> Thread shutDownThread = 

[jira] [Commented] (YARN-4398) Yarn recover functionality causes the cluster running slowly and the cluster usage rate is far below 100

2015-12-01 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034619#comment-15034619
 ] 

Jian He commented on YARN-4398:
---

[~iceberg565], after you upload the patch, you can click the "Submit Patch" 
button, which will trigger Jenkins to run the unit tests.

> Yarn recover functionality causes the cluster running slowly and the cluster 
> usage rate is far below 100
> 
>
> Key: YARN-4398
> URL: https://issues.apache.org/jira/browse/YARN-4398
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: NING DING
> Attachments: YARN-4398.2.patch, YARN-4398.3.patch
>
>
> In my Hadoop cluster, the ResourceManager recovery functionality is enabled 
> with FileSystemRMStateStore.
> I found this causes the YARN cluster to run slowly, and the cluster usage rate 
> is just 50% even when there are many pending apps. 
> The scenario is below.
> In thread A, RMAppImpl$RMAppNewlySavingTransition is calling the 
> storeNewApplication method defined in RMStateStore. This storeNewApplication 
> method is synchronized.
> {code:title=RMAppImpl.java|borderStyle=solid}
>   private static final class RMAppNewlySavingTransition extends 
> RMAppTransition {
> @Override
> public void transition(RMAppImpl app, RMAppEvent event) {
>   // If recovery is enabled then store the application information in a
>   // non-blocking call so make sure that RM has stored the information
>   // needed to restart the AM after RM restart without further client
>   // communication
>   LOG.info("Storing application with id " + app.applicationId);
>   app.rmContext.getStateStore().storeNewApplication(app);
> }
>   }
> {code}
> {code:title=RMStateStore.java|borderStyle=solid}
> public synchronized void storeNewApplication(RMApp app) {
> ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
> assert context instanceof ApplicationSubmissionContextPBImpl;
> ApplicationStateData appState =
> ApplicationStateData.newInstance(
> app.getSubmitTime(), app.getStartTime(), context, app.getUser());
> dispatcher.getEventHandler().handle(new RMStateStoreAppEvent(appState));
>   }
> {code}
> In thread B, FileSystemRMStateStore is calling the 
> storeApplicationStateInternal method. It's also synchronized.
> This storeApplicationStateInternal method saves an ApplicationStateData into 
> HDFS, and it normally takes 90~300 milliseconds in my Hadoop cluster.
> {code:title=FileSystemRMStateStore.java|borderStyle=solid}
> public synchronized void storeApplicationStateInternal(ApplicationId appId,
>   ApplicationStateData appStateDataPB) throws Exception {
> Path appDirPath = getAppDir(rmAppRoot, appId);
> mkdirsWithRetries(appDirPath);
> Path nodeCreatePath = getNodePath(appDirPath, appId.toString());
> LOG.info("Storing info for app: " + appId + " at: " + nodeCreatePath);
> byte[] appStateData = appStateDataPB.getProto().toByteArray();
> try {
>   // currently throw all exceptions. May need to respond differently for 
> HA
>   // based on whether we have lost the right to write to FS
>   writeFileWithRetries(nodeCreatePath, appStateData, true);
> } catch (Exception e) {
>   LOG.info("Error storing info for app: " + appId, e);
>   throw e;
> }
>   }
> {code}
> Suppose thread B enters the 
> FileSystemRMStateStore.storeApplicationStateInternal method first; then thread A 
> will be blocked for a while because of the synchronization. In the ResourceManager 
> there is only one RMStateStore instance; in my cluster it is of the 
> FileSystemRMStateStore type.
> Debugging the RMAppNewlySavingTransition.transition method, the thread stack 
> shows it is called from the AsyncDispatcher.dispatch method. This method's code is 
> as below. 
> {code:title=AsyncDispatcher.java|borderStyle=solid}
>   protected void dispatch(Event event) {
> //all events go thru this loop
> if (LOG.isDebugEnabled()) {
>   LOG.debug("Dispatching the event " + event.getClass().getName() + "."
>   + event.toString());
> }
> Class type = event.getType().getDeclaringClass();
> try{
>   EventHandler handler = eventDispatchers.get(type);
>   if(handler != null) {
> handler.handle(event);
>   } else {
> throw new Exception("No handler for registered for " + type);
>   }
> } catch (Throwable t) {
>   //TODO Maybe log the state of the queue
>   LOG.fatal("Error in dispatcher thread", t);
>   // If serviceStop is called, we should exit this thread gracefully.
>   if (exitOnDispatchException
>   && 

[jira] [Commented] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart

2015-12-01 Thread Ray Chiang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034490#comment-15034490
 ] 

Ray Chiang commented on YARN-4406:
--

Thanks for letting me know.   I'll take a look at that.

> RM Web UI continues to show decommissioned nodes even after RM restart
> --
>
> Key: YARN-4406
> URL: https://issues.apache.org/jira/browse/YARN-4406
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Ray Chiang
>Priority: Minor
>
> If you start up a cluster, decommission a NodeManager, and restart the RM, 
> the decommissioned node list will still show a positive number (1 in the case 
> of 1 node) and if you click on the list, it will be empty.





[jira] [Commented] (YARN-4398) Yarn recover functionality causes the cluster running slowly and the cluster usage rate is far below 100

2015-12-01 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034614#comment-15034614
 ] 

Daniel Templeton commented on YARN-4398:


Actually, before I +1 that, [~iceberg565], did you run the full set of RM unit 
tests against the patch?

> Yarn recover functionality causes the cluster running slowly and the cluster 
> usage rate is far below 100
> 
>
> Key: YARN-4398
> URL: https://issues.apache.org/jira/browse/YARN-4398
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: NING DING
> Attachments: YARN-4398.2.patch, YARN-4398.3.patch
>
>
> In my Hadoop cluster, the ResourceManager recovery functionality is enabled 
> with FileSystemRMStateStore.
> I found this causes the YARN cluster to run slowly, and the cluster usage rate 
> is just 50% even when there are many pending apps. 
> The scenario is below.
> In thread A, RMAppImpl$RMAppNewlySavingTransition is calling the 
> storeNewApplication method defined in RMStateStore. This storeNewApplication 
> method is synchronized.
> {code:title=RMAppImpl.java|borderStyle=solid}
>   private static final class RMAppNewlySavingTransition extends 
> RMAppTransition {
> @Override
> public void transition(RMAppImpl app, RMAppEvent event) {
>   // If recovery is enabled then store the application information in a
>   // non-blocking call so make sure that RM has stored the information
>   // needed to restart the AM after RM restart without further client
>   // communication
>   LOG.info("Storing application with id " + app.applicationId);
>   app.rmContext.getStateStore().storeNewApplication(app);
> }
>   }
> {code}
> {code:title=RMStateStore.java|borderStyle=solid}
> public synchronized void storeNewApplication(RMApp app) {
> ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
> assert context instanceof ApplicationSubmissionContextPBImpl;
> ApplicationStateData appState =
> ApplicationStateData.newInstance(
> app.getSubmitTime(), app.getStartTime(), context, app.getUser());
> dispatcher.getEventHandler().handle(new RMStateStoreAppEvent(appState));
>   }
> {code}
> In thread B, FileSystemRMStateStore is calling the 
> storeApplicationStateInternal method. It's also synchronized.
> This storeApplicationStateInternal method saves an ApplicationStateData into 
> HDFS, and it normally takes 90~300 milliseconds in my Hadoop cluster.
> {code:title=FileSystemRMStateStore.java|borderStyle=solid}
> public synchronized void storeApplicationStateInternal(ApplicationId appId,
>   ApplicationStateData appStateDataPB) throws Exception {
> Path appDirPath = getAppDir(rmAppRoot, appId);
> mkdirsWithRetries(appDirPath);
> Path nodeCreatePath = getNodePath(appDirPath, appId.toString());
> LOG.info("Storing info for app: " + appId + " at: " + nodeCreatePath);
> byte[] appStateData = appStateDataPB.getProto().toByteArray();
> try {
>   // currently throw all exceptions. May need to respond differently for 
> HA
>   // based on whether we have lost the right to write to FS
>   writeFileWithRetries(nodeCreatePath, appStateData, true);
> } catch (Exception e) {
>   LOG.info("Error storing info for app: " + appId, e);
>   throw e;
> }
>   }
> {code}
> Suppose thread B enters the 
> FileSystemRMStateStore.storeApplicationStateInternal method first; then thread A 
> will be blocked for a while because of the synchronization. In the ResourceManager 
> there is only one RMStateStore instance; in my cluster it is of the 
> FileSystemRMStateStore type.
> Debugging the RMAppNewlySavingTransition.transition method, the thread stack 
> shows it is called from the AsyncDispatcher.dispatch method. This method's code is 
> as below. 
> {code:title=AsyncDispatcher.java|borderStyle=solid}
>   protected void dispatch(Event event) {
> //all events go thru this loop
> if (LOG.isDebugEnabled()) {
>   LOG.debug("Dispatching the event " + event.getClass().getName() + "."
>   + event.toString());
> }
> Class type = event.getType().getDeclaringClass();
> try{
>   EventHandler handler = eventDispatchers.get(type);
>   if(handler != null) {
> handler.handle(event);
>   } else {
> throw new Exception("No handler for registered for " + type);
>   }
> } catch (Throwable t) {
>   //TODO Maybe log the state of the queue
>   LOG.fatal("Error in dispatcher thread", t);
>   // If serviceStop is called, we should exit this thread gracefully.
>   if (exitOnDispatchException
>   && 

[jira] [Commented] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart

2015-12-01 Thread Ray Chiang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034647#comment-15034647
 ] 

Ray Chiang commented on YARN-4406:
--

So, would the right solution be that getUnusableNodes() should be "excludes 
list" aware?

> RM Web UI continues to show decommissioned nodes even after RM restart
> --
>
> Key: YARN-4406
> URL: https://issues.apache.org/jira/browse/YARN-4406
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Ray Chiang
>Priority: Minor
>
> If you start up a cluster, decommission a NodeManager, and restart the RM, 
> the decommissioned node list will still show a positive number (1 in the case 
> of 1 node) and if you click on the list, it will be empty.





[jira] [Commented] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart

2015-12-01 Thread Kuhu Shukla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034482#comment-15034482
 ] 

Kuhu Shukla commented on YARN-4406:
---

Spoke too soon. On trunk, {{updateMetricsForRejoinedNode}} should handle that.

> RM Web UI continues to show decommissioned nodes even after RM restart
> --
>
> Key: YARN-4406
> URL: https://issues.apache.org/jira/browse/YARN-4406
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Ray Chiang
>Priority: Minor
>
> If you start up a cluster, decommission a NodeManager, and restart the RM, 
> the decommissioned node list will still show a positive number (1 in the case 
> of 1 node) and if you click on the list, it will be empty.





[jira] [Assigned] (YARN-4392) ApplicationCreatedEvent event time resets after RM restart/failover

2015-12-01 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R reassigned YARN-4392:
---

Assignee: Naganarasimha G R  (was: Xuan Gong)

> ApplicationCreatedEvent event time resets after RM restart/failover
> ---
>
> Key: YARN-4392
> URL: https://issues.apache.org/jira/browse/YARN-4392
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0
>Reporter: Xuan Gong
>Assignee: Naganarasimha G R
>Priority: Critical
> Attachments: YARN-4392-2015-11-24.patch, YARN-4392.1.patch, 
> YARN-4392.2.patch
>
>
> {code}2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - 
> Finished time 1437453994768 is ahead of started time 1440308399674 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437454008244 is ahead of started time 1440308399676 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444305171 is ahead of started time 1440308399653 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444293115 is ahead of started time 1440308399647 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444379645 is ahead of started time 1440308399656 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444361234 is ahead of started time 1440308399655 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444342029 is ahead of started time 1440308399654 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444323447 is ahead of started time 1440308399654 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143730006 is ahead of started time 1440308399660 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143715698 is ahead of started time 1440308399659 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143719060 is ahead of started time 1440308399658 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444393931 is ahead of started time 1440308399657
> {code}
> From the ATS logs, we see a large number of 'stale alert' messages 
> periodically.





[jira] [Commented] (YARN-4340) Add "list" API to reservation system

2015-12-01 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034792#comment-15034792
 ] 

Subru Krishnan commented on YARN-4340:
--

Thanks [~seanpo03] for working on this. Kindly fix the test-patch warnings.

I went through the patch; please find my feedback below. 

First up, the patch is huge and will only get bigger, so I suggest splitting it 
into two: the RPC API changes and the REST API changes. I have reviewed only the 
RPC changes here.

Secondly, there are very few test cases. You need to add test cases at least for 
_TestInMemoryPlan, TestYarnClient, TestReservationInputValidator, 
TestPBImplRecords_, etc. Take a look at the other Reservation APIs for 
reference.

Lastly, *YarnClient* needs to be updated to add the new API so that it's 
consumable by clients.

Now to the new *listReservations* API in *ApplicationClientProtocol*. We should 
be very diligent as we are adding an externally visible API:
   * We should use _queue_ instead of _planName_ as _Plan_ is an internal data 
structure and we only expose queues to users
   * We should clearly explain in the Javadocs how the different params in 
*ReservationListRequest* influence the query, e.g. which are mandatory, 
which are optional, how specifying multiple params affects the result, etc.
   * We must explicitly call out that the *ResourceAllocation* we are returning in 
*ReservationListResponse* is based on the current state of the _Plan_ and that it 
can change for different reasons, like replanning, subject to the constraints 
of the user contract as described by *ReservationDefinition*

Minor comments for rest of the code:
   * Typos in argument names for _setPlanName_ and _setUser_ in 
*ReservationListRequest*
   * Link the _Plan/ReservationId and ResourceAllocation_ in the Javadocs of 
*ReservationListRequest*
   * The Javadoc of *ReservationListResponse* has a copy-paste typo; please 
update it
   * In *ReservationInfo*, all setters should be _@Private_ 
   * Link the _ReservationDefinition/ReservationId and ResourceAllocation_ in 
the Javadocs of *ReservationInfo*
   * We should reconcile *ResourceAllocation* with 
*ReservationAllocationStateProto* which we use for recovery as both look to 
contain the same information
   * Rename *ResourceAllocation* --> *ReservationResourceAllocation*
   * Again, call out explicitly in the Javadocs that the allocations are based 
on the current state of the _Plan_ and that they can change for different reasons, 
like replanning, subject to the constraints of the user contract as described by 
*ReservationDefinition*
   * In *ResourceAllocation*, all setters should be _@Private_ 
   * Link _Resource_ in the Javadocs of *ResourceAllocation*
   * In *ReservationListRequestPBImpl*, return null if the string field is not set, 
not an empty string (see the small sketch after this list)
   * In *ReservationListRequestPBImpl*, check for negative long values before 
setting. Use _ReservationDefinitionPBImpl_ as a reference
   * Align *ReservationListRequestPBImpl::getReservationInfo* with 
_ReservationRequestsPBImpl::getReservationResources_
   * Add existence checks for getters and null/negative check for setters for 
fields in *ReservationInfoPBImpl*
   * Align *ReservationInfoPBImpl::getResourceAllocations* with 
_ReservationRequestsPBImpl::getReservationResources_
   * In *ResourceAllocationPBImpl*, check for negative long values before 
setting
   * Revert the unrelated formatting changes in *ClientRMService*
   * In *ClientRMService::listReservations*, ACL check has to be done before 
querying the _Plan_
   * In *ClientRMService::listReservations*, refactor the translation of 
_Set_ into _List_ into a separate 
helper method. It may make sense to have it in *ReservationSystemUtil* as it can 
have other consumers in the future
   * In *InMemoryPlan::getReservations*, there are 2 read locks - the outer one 
is redundant and can be removed
   * Add null check for _user_ in *InMemoryPlan::getReservations*
   * In *ReservationInputValidator::validateReservationListRequest*, the 
validation seems to be common with other existing checks, so it would be better to 
extract a common private helper method and reuse it.
   * In *ReservationInputValidator::validateReservationListRequest*, we are 
only validating the _plan queue_ name - what about other params?
   * In *TestClientRMService*, add more _listReservations_ queries after 
submission and deletion. Also check a narrower time window rather than 0 - 
Long.MAX_VALUE.
   * In *TestClientRMService*, rename _lRequest_ and _lResponse_ with _request_ 
and _response_ respectively.
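
A tiny illustration of the two *PBImpl* points above (plain Java rather than the real protobuf-backed class; field and method names are assumed): getters report an unset string field as null instead of an empty string, and setters validate long values before storing them.

{code:java}
// Illustrative only; not the actual ReservationListRequestPBImpl.
public class ReservationListRequestSketch {
  private String queue;                       // unset is represented as null, never ""
  private long startTime = 0L;
  private long endTime = Long.MAX_VALUE;

  public String getQueue() {
    // Mirrors the "has"/empty check a PBImpl would do against its proto.
    return (queue == null || queue.isEmpty()) ? null : queue;
  }

  public void setQueue(String queue) {
    this.queue = queue;
  }

  public void setStartTime(long startTime) {
    if (startTime < 0) {
      throw new IllegalArgumentException("startTime must be non-negative: " + startTime);
    }
    this.startTime = startTime;
  }

  public void setEndTime(long endTime) {
    if (endTime < 0) {
      throw new IllegalArgumentException("endTime must be non-negative: " + endTime);
    }
    this.endTime = endTime;
  }

  public long getStartTime() { return startTime; }
  public long getEndTime() { return endTime; }
}
{code}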

> Add "list" API to reservation system
> 
>
> Key: YARN-4340
> URL: https://issues.apache.org/jira/browse/YARN-4340
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, fairscheduler, resourcemanager
>Reporter: Carlo Curino
>

[jira] [Commented] (YARN-4392) ApplicationCreatedEvent event time resets after RM restart/failover

2015-12-01 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034869#comment-15034869
 ] 

Naganarasimha G R commented on YARN-4392:
-

Hi [~xgong],
wrt YARN-4392.2.patch, is it required to send the app-created event to ATS 
during restore? Even before we store the app information in the RM state 
store, we would have already pushed this app-created event to ATS.

> ApplicationCreatedEvent event time resets after RM restart/failover
> ---
>
> Key: YARN-4392
> URL: https://issues.apache.org/jira/browse/YARN-4392
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0
>Reporter: Xuan Gong
>Assignee: Naganarasimha G R
>Priority: Critical
> Attachments: YARN-4392-2015-11-24.patch, YARN-4392.1.patch, 
> YARN-4392.2.patch
>
>
> {code}2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - 
> Finished time 1437453994768 is ahead of started time 1440308399674 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437454008244 is ahead of started time 1440308399676 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444305171 is ahead of started time 1440308399653 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444293115 is ahead of started time 1440308399647 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444379645 is ahead of started time 1440308399656 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444361234 is ahead of started time 1440308399655 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444342029 is ahead of started time 1440308399654 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444323447 is ahead of started time 1440308399654 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143730006 is ahead of started time 1440308399660 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143715698 is ahead of started time 1440308399659 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143719060 is ahead of started time 1440308399658 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444393931 is ahead of started time 1440308399657
> {code}
> From the ATS logs, we see a large number of 'stale alert' messages 
> periodically.





[jira] [Created] (YARN-4408) NodeManager still reports negative running containers

2015-12-01 Thread Robert Kanter (JIRA)
Robert Kanter created YARN-4408:
---

 Summary: NodeManager still reports negative running containers
 Key: YARN-4408
 URL: https://issues.apache.org/jira/browse/YARN-4408
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Robert Kanter
Assignee: Robert Kanter


YARN-1697 fixed a problem where the NodeManager metrics could report a negative 
number of running containers.  However, it missed a rare case where this can 
still happen.

YARN-1697 added a flag to indicate whether the container was actually launched 
({{LOCALIZED}} to {{RUNNING}}) or not ({{LOCALIZED}} to {{KILLING}}), which is 
then checked when transitioning from {{CONTAINER_CLEANEDUP_AFTER_KILL}} to 
{{DONE}} and from {{EXITED_WITH_FAILURE}} to {{DONE}}, so that the gauge is only 
decremented if we actually ran the container and incremented the gauge. However, 
this flag is not checked while transitioning from {{EXITED_WITH_SUCCESS}} to {{DONE}}.
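
A hedged sketch of the fix direction (field and method names are assumed; this is not the actual {{ContainerImpl}}/{{NodeManagerMetrics}} code): apply the same launched-flag guard on the {{EXITED_WITH_SUCCESS}} to {{DONE}} path that YARN-1697 added to the other paths, so the gauge is only decremented when it was actually incremented.

{code:java}
class ContainerDoneTransitionSketch {
  private int runningContainers;   // stands in for the NodeManager metrics gauge
  private boolean wasLaunched;     // set when LOCALIZED -> RUNNING actually happened

  void onLaunched() {
    wasLaunched = true;
    runningContainers++;
  }

  void onExitedWithSuccessToDone() {
    if (wasLaunched) {             // without this check the gauge can go negative
      runningContainers--;
    }
  }
}
{code}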





[jira] [Commented] (YARN-3769) Consider user limit when calculating total pending resource for preemption policy in Capacity Scheduler

2015-12-01 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034784#comment-15034784
 ] 

Junping Du commented on YARN-3769:
--

Hi [~sjlee0], given this is non-critical/non-blocker, can we make it to 2.6.4? 
Thanks!

> Consider user limit when calculating total pending resource for preemption 
> policy in Capacity Scheduler
> ---
>
> Key: YARN-3769
> URL: https://issues.apache.org/jira/browse/YARN-3769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0, 2.7.0, 2.8.0
>Reporter: Eric Payne
>Assignee: Eric Payne
> Fix For: 2.7.3
>
> Attachments: YARN-3769-branch-2.002.patch, 
> YARN-3769-branch-2.6.001.patch, YARN-3769-branch-2.7.002.patch, 
> YARN-3769-branch-2.7.003.patch, YARN-3769-branch-2.7.005.patch, 
> YARN-3769-branch-2.7.006.patch, YARN-3769-branch-2.7.007.patch, 
> YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch, 
> YARN-3769.003.patch, YARN-3769.004.patch, YARN-3769.005.patch
>
>
> We are seeing the preemption monitor preempting containers from queue A and 
> then seeing the capacity scheduler giving them immediately back to queue A. 
> This happens quite often and causes a lot of churn.





[jira] [Commented] (YARN-4248) REST API for submit/update/delete Reservations

2015-12-01 Thread Carlo Curino (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035000#comment-15035000
 ] 

Carlo Curino commented on YARN-4248:


[~chris.douglas] thanks for the review. I addressed your comments (3-6) in the 
attached revision, and I am running this again in a cluster to check that everything 
works. I tested (2) and confirm that the system correctly rejects the request without 
an NPE in the case of a missing parameter.

As you point out, for both (1) and (2) the behavior is consistent with the 
remainder of the REST API, but it could be improved. I think we should circle 
back and discuss in general how "rest-y" we want the YARN REST API to be. Based 
on that we could revise these new APIs, as well as the remainder.


> REST API for submit/update/delete Reservations
> --
>
> Key: YARN-4248
> URL: https://issues.apache.org/jira/browse/YARN-4248
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Carlo Curino
>Assignee: Carlo Curino
> Attachments: YARN-4248.2.patch, YARN-4248.3.patch, YARN-4248.patch
>
>
> This JIRA tracks work to extend the RMWebService to support REST APIs to 
> submit/update/delete reservations. This will ease integration with external 
> tools that are not java-based.





[jira] [Updated] (YARN-4108) CapacityScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request

2015-12-01 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-4108:
-
Attachment: (was: YARN-4108.poc.1.patch)

> CapacityScheduler: Improve preemption to preempt only those containers that 
> would satisfy the incoming request
> --
>
> Key: YARN-4108
> URL: https://issues.apache.org/jira/browse/YARN-4108
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-4108-design-doc-v1.pdf
>
>
> This is a sibling JIRA of YARN-2154. We should make sure container preemption 
> is more effective.
> *Requirements:*
> 1) Can handle the case of user-limit preemption
> 2) Can handle the case of resource placement requirements, such as: hard-locality 
> (I only want to use rack-1) / node-constraints (YARN-3409) / black-list (I 
> don't want to use rack1 and host\[1-3\])
> 3) Can handle preemption within a queue: cross-user preemption (YARN-2113), 
> cross-application preemption (such as priority-based (YARN-1963) / 
> fairness-based (YARN-3319)).





[jira] [Updated] (YARN-4108) CapacityScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request

2015-12-01 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-4108:
-
Attachment: YARN-4108.poc.1.patch

> CapacityScheduler: Improve preemption to preempt only those containers that 
> would satisfy the incoming request
> --
>
> Key: YARN-4108
> URL: https://issues.apache.org/jira/browse/YARN-4108
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-4108-design-doc-v1.pdf, YARN-4108.poc.1.patch
>
>
> This is a sibling JIRA of YARN-2154. We should make sure container preemption 
> is more effective.
> *Requirements:*
> 1) Can handle the case of user-limit preemption
> 2) Can handle the case of resource placement requirements, such as: hard-locality 
> (I only want to use rack-1) / node-constraints (YARN-3409) / black-list (I 
> don't want to use rack1 and host\[1-3\])
> 3) Can handle preemption within a queue: cross-user preemption (YARN-2113), 
> cross-application preemption (such as priority-based (YARN-1963) / 
> fairness-based (YARN-3319)).





[jira] [Commented] (YARN-4108) CapacityScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request

2015-12-01 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034766#comment-15034766
 ] 

Wangda Tan commented on YARN-4108:
--

Thanks [~eepayne], removed the stale patch and rebased the patch to the latest trunk.

> CapacityScheduler: Improve preemption to preempt only those containers that 
> would satisfy the incoming request
> --
>
> Key: YARN-4108
> URL: https://issues.apache.org/jira/browse/YARN-4108
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-4108-design-doc-v1.pdf, YARN-4108.poc.1.patch
>
>
> This is a sibling JIRA of YARN-2154. We should make sure container preemption 
> is more effective.
> *Requirements:*
> 1) Can handle the case of user-limit preemption
> 2) Can handle the case of resource placement requirements, such as: hard-locality 
> (I only want to use rack-1) / node-constraints (YARN-3409) / black-list (I 
> don't want to use rack1 and host\[1-3\])
> 3) Can handle preemption within a queue: cross-user preemption (YARN-2113), 
> cross-application preemption (such as priority-based (YARN-1963) / 
> fairness-based (YARN-3319)).





[jira] [Commented] (YARN-4392) ApplicationCreatedEvent event time resets after RM restart/failover

2015-12-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034805#comment-15034805
 ] 

Hadoop QA commented on YARN-4392:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 1s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 9m 
39s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 38s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 41s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
17s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 46s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
18s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
33s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 35s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
43s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 38s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 38s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 38s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 38s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 16s 
{color} | {color:red} Patch generated 1 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 (total was 114, now 115). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 45s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
18s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
41s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 33s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 62m 57s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 62m 23s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_85. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
27s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 148m 9s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_66 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
|   | hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions |
| JDK v1.7.0_85 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | 

[jira] [Commented] (YARN-4392) ApplicationCreatedEvent event time resets after RM restart/failover

2015-12-01 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034843#comment-15034843
 ] 

Naganarasimha G R commented on YARN-4392:
-

Thanks [~jlowe],
bq. If we take that approach I'm wondering if there may be cases where we are 
updating the app state before we know for certain that the ATS has received the 
event.
IIUC, I think you are pointing at the finish events. In the patch I had actually 
followed the approach of sending a synchronous push on the ATS side for finish 
events; this way we are sure the AppFinish event is sent out before we store the 
state of the app in the RM state store. But yes, this approach looks a little 
shaky; I just thought it might solve the issue.

bq. Moving the ATS app start notification out of the constructor and instead 
to that start transition allows us to construct an app and send it a recover 
event without triggering an ATS event. 
Yes, this is the same approach I had adopted in my YARN-3127 patch to avoid 
resending AppCreated events to ATS, and it was also required for YARN-4350.

If we are handling this issue here, shall I close YARN-3127?
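
For clarity, a sketch of the approach being discussed (names are assumptions; this is not the actual {{RMAppImpl}}/{{SystemMetricsPublisher}} code): publish the app-created event from the start transition using the original submit time, and keep the recover path free of ATS calls, so a restarted RM neither re-sends the event nor resets its timestamp.

{code:java}
class AppCreatedPublishSketch {
  interface MetricsPublisher {
    void appCreated(String appId, long createdTime);
  }

  private final String appId;
  private final long submitTime;       // recovered from the RM state store
  private final MetricsPublisher publisher;

  AppCreatedPublishSketch(String appId, long submitTime, MetricsPublisher publisher) {
    this.appId = appId;
    this.submitTime = submitTime;
    this.publisher = publisher;
    // No publisher call here: the constructor is also used on the recovery
    // path, where the event was already sent before the failover.
  }

  void onStartTransition() {
    // Only a genuinely new application reaches the START transition.
    publisher.appCreated(appId, submitTime);
  }

  void onRecoverTransition() {
    // Intentionally no ATS publication: the original event (with its original
    // timestamp) was emitted before the app was persisted to the state store.
  }
}
{code}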

> ApplicationCreatedEvent event time resets after RM restart/failover
> ---
>
> Key: YARN-4392
> URL: https://issues.apache.org/jira/browse/YARN-4392
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Critical
> Attachments: YARN-4392-2015-11-24.patch, YARN-4392.1.patch, 
> YARN-4392.2.patch
>
>
> {code}2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - 
> Finished time 1437453994768 is ahead of started time 1440308399674 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437454008244 is ahead of started time 1440308399676 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444305171 is ahead of started time 1440308399653 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444293115 is ahead of started time 1440308399647 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444379645 is ahead of started time 1440308399656 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444361234 is ahead of started time 1440308399655 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444342029 is ahead of started time 1440308399654 
> 2015-09-01 12:39:09,852 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444323447 is ahead of started time 1440308399654 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143730006 is ahead of started time 1440308399660 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143715698 is ahead of started time 1440308399659 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 143719060 is ahead of started time 1440308399658 
> 2015-09-01 12:39:09,853 WARN util.Times (Times.java:elapsed(53)) - Finished 
> time 1437444393931 is ahead of started time 1440308399657
> {code} . 
> From ATS logs, we would see a large amount of 'stale alerts' messages 
> periodically



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3769) Consider user limit when calculating total pending resource for preemption policy in Capacity Scheduler

2015-12-01 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034897#comment-15034897
 ] 

Sangjin Lee commented on YARN-3769:
---

I'm fine with 2.6.4.

> Consider user limit when calculating total pending resource for preemption 
> policy in Capacity Scheduler
> ---
>
> Key: YARN-3769
> URL: https://issues.apache.org/jira/browse/YARN-3769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0, 2.7.0, 2.8.0
>Reporter: Eric Payne
>Assignee: Eric Payne
> Fix For: 2.7.3
>
> Attachments: YARN-3769-branch-2.002.patch, 
> YARN-3769-branch-2.6.001.patch, YARN-3769-branch-2.7.002.patch, 
> YARN-3769-branch-2.7.003.patch, YARN-3769-branch-2.7.005.patch, 
> YARN-3769-branch-2.7.006.patch, YARN-3769-branch-2.7.007.patch, 
> YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch, 
> YARN-3769.003.patch, YARN-3769.004.patch, YARN-3769.005.patch
>
>
> We are seeing the preemption monitor preempting containers from queue A and 
> then seeing the capacity scheduler giving them immediately back to queue A. 
> This happens quite often and causes a lot of churn.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4398) Yarn recover functionality causes the cluster running slowly and the cluster usage rate is far below 100

2015-12-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034963#comment-15034963
 ] 

Hadoop QA commented on YARN-4398:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 9m 
49s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 5s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 49s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
35s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 28s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
35s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
13s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 13s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 17s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
23s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 4s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 3m 4s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 50s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 50s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
35s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 27s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
35s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s 
{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git 
apply --whitespace=fix. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
33s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 14s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 16s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 35s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 65m 28s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 36s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.7.0_85. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 65m 3s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_85. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
28s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 178m 3s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_66 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | 

[jira] [Updated] (YARN-3946) Allow fetching exact reason as to why a submitted app is in ACCEPTED state in CS

2015-12-01 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-3946:

Attachment: YARN-3946.v1.005.patch

Thanks for the comments [~wangda],
bq. When app goes to final state (FINISHED/KILLEd, etc.), should we simply set 
AMLaunchDiagnostics to null?
IIUC you are referring to RMAppAttemptImpl, right? If so, that was a mistake: while correcting based on your previous comment I missed reverting this part. In any case, as per your 4th comment, for the unmanaged-AM case I have updated it to null here.

bq. Why need two separate methods: 
updateDiagnosticsIfNotRunning/updateDiagnostics?
Maybe the names need to be better, but two methods are required because the status has to be updated only if the AM is not running. For example, the check-and-update variant is called in FiCaSchedulerApp.allocate, which runs whenever a container is assigned to an app, but we want to update the diagnostics only when the AM is not yet launched; it is used similarly in LeafQueue.assignContainers. In other cases we are sure the AM has not yet launched, so to avoid the unnecessary check (whether the AM is running) we have updateDiagnostics. Maybe I can name them {{checkAndUpdateAMContainerDiagnostics}} and {{updateAMContainerDiagnostics}}?
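A rough sketch of the two methods under the renaming proposed above (illustrative only, not the actual patch):

{code}
// Sketch only: one method updates unconditionally, the other first checks
// whether the AM container is already running.
public class AMDiagnosticsSketch {
  private volatile boolean amRunning;
  private volatile String amLaunchDiagnostics;

  // Used on paths where the AM may already be running (e.g. regular container
  // allocation): only update while the AM is not yet launched.
  public void checkAndUpdateAMContainerDiagnostics(String message) {
    if (!amRunning) {
      updateAMContainerDiagnostics(message);
    }
  }

  // Used where the caller already knows the AM has not been launched,
  // so the extra check can be skipped.
  public void updateAMContainerDiagnostics(String message) {
    this.amLaunchDiagnostics = message;
  }
}
{code}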

bq. Do you think is it better to rename AMState.PENDING to inactivated?
Yes, PENDING is not understandable to everyone; that is why the diagnostic message for {{PENDING}} is already set to *"Application is added to the scheduler and is not yet activated."* Maybe I can change it to {{Application is added to the scheduler but is not yet scheduled.}} Thoughts?

bq. Instead of setting AMLaunchDiagnostics to null when RMAppAttempt enters 
Scheduled state,do you think is it better to do that in RUNNING and 
FINAL_SAVING state? Unmanaged AM could skip the SCHEDULED state.
IMO I would prefer to set it only for unmanaged AMs in the *FINAL_SAVING state*, since we are already showing the *YarnApplicationState* as running and giving a description of it. If the diagnostics also show that the AM is launched and running, it becomes repetitive in the UI for normal (non-unmanaged-AM) apps.

bq. It will be also very usaful if you can update AM launch diagnostics when 
RMAppAttempt go to LAUNCHED state, 
Actually I had wrongly used AMContainerAllocatedTransition to reset the diagnostic message; my intention was to reset it only after the AM is launched and registered, which will be very useful for analyzing the state of the AM. I have introduced {{LAUNCHED}} and set it after AMLauncher sends the LAUNCHED event to the RMAppAttempt.

[~wangda] & [~jianhe],
Please review the latest patch.

> Allow fetching exact reason as to why a submitted app is in ACCEPTED state in 
> CS
> 
>
> Key: YARN-3946
> URL: https://issues.apache.org/jira/browse/YARN-3946
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.6.0
>Reporter: Sumit Nigam
>Assignee: Naganarasimha G R
> Attachments: 3946WebImages.zip, YARN-3946.v1.001.patch, 
> YARN-3946.v1.002.patch, YARN-3946.v1.003.Images.zip, YARN-3946.v1.003.patch, 
> YARN-3946.v1.004.patch, YARN-3946.v1.005.patch
>
>
> Currently there is no direct way to get the exact reason as to why a 
> submitted app is still in ACCEPTED state. It should be possible to know 
> through RM REST API as to what aspect is not being met - say, queue limits 
> being reached, or core/ memory requirement not being met, or AM limit being 
> reached, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-2883) Queuing of container requests in the NM

2015-12-01 Thread Konstantinos Karanasos (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantinos Karanasos reassigned YARN-2883:


Assignee: Konstantinos Karanasos

> Queuing of container requests in the NM
> ---
>
> Key: YARN-2883
> URL: https://issues.apache.org/jira/browse/YARN-2883
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Konstantinos Karanasos
>Assignee: Konstantinos Karanasos
>
> We propose to add a queue in each NM, where queueable container requests can 
> be held.
> Based on the available resources in the node and the containers in the queue, 
> the NM will decide when to allow the execution of a queued container.
> In order to ensure the instantaneous start of a guaranteed-start container, 
> the NM may decide to pre-empt/kill running queueable containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4408) NodeManager still reports negative running containers

2015-12-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035104#comment-15035104
 ] 

Hadoop QA commented on YARN-4408:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 
5s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
11s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 30s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
13s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 1s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
29s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 29s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 30s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
13s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
10s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 45s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 14s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.7.0_85. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
23s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 35m 18s {color} 
| {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12775166/YARN-4408.001.patch |
| JIRA Issue | YARN-4408 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  findbugs  checkstyle  |
| uname | Linux 508e15da309c 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | trunk / 3c4a34e |
| findbugs | v3.0.0 |
| JDK v1.7.0_85  Test Results | 

[jira] [Commented] (YARN-4309) Add debug information to application logs when a container fails

2015-12-01 Thread Sidharta Seethana (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035122#comment-15035122
 ] 

Sidharta Seethana commented on YARN-4309:
-

[~vvasudev], I took a look at the patch. A couple of comments:

* Could you clarify why the debugging information gathering in 
DockerContainerExecutor.writeLaunchEnv is not guarded by a config check? The 
new test you added uses DefaultContainerExecutor so it looks like this was 
missed. 
* There seem to be minor inconsistent line spacing issues in the new test 
function in TestContainerLaunch.java 

Apart from these, assuming it is safe to list user directory contents (as 
already discussed on this JIRA), the patch seems good to me.  Thanks for this 
patch - I expect the launch_container.sh copy to be particularly useful for 
debugging purposes.

> Add debug information to application logs when a container fails
> 
>
> Key: YARN-4309
> URL: https://issues.apache.org/jira/browse/YARN-4309
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: YARN-4309.001.patch, YARN-4309.002.patch, 
> YARN-4309.003.patch
>
>
> Sometimes when a container fails, it can be pretty hard to figure out why it 
> failed.
> My proposal is that if a container fails, we collect information about the 
> container local dir and dump it into the container log dir. Ideally, I'd like 
> to tar up the directory entirely, but I'm not sure of the security and space 
> implications of such a approach. At the very least, we can list all the files 
> in the container local dir, and dump the contents of launch_container.sh(into 
> the container log dir).
> When log aggregation occurs, all this information will automatically get 
> collected and make debugging such failures much easier.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4408) NodeManager still reports negative running containers

2015-12-01 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated YARN-4408:

Attachment: YARN-4408.001.patch

The patch adds the missing check. It also adds a unit test for this case, and for one of the other cases that was missing a test.
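For illustration, a minimal sketch of the kind of guard described in the issue (not the actual NodeManager code; the names are made up): the running-containers gauge is only decremented if the container was actually launched, with the same check applied on the {{EXITED_WITH_SUCCESS}} path that the other transitions already have.

{code}
// Sketch only: guard every gauge decrement on a "was launched" flag so the
// metric can never go negative.
public class ContainerMetricsSketch {
  private int runningContainers;
  private boolean wasLaunched;               // set on LOCALIZED -> RUNNING

  void onLaunched()           { wasLaunched = true; runningContainers++; }

  void onExitedWithSuccess()  { decrementIfLaunched(); }  // the previously unchecked path
  void onExitedWithFailure()  { decrementIfLaunched(); }
  void onCleanedUpAfterKill() { decrementIfLaunched(); }

  private void decrementIfLaunched() {
    if (wasLaunched) {                        // without this check the gauge can go negative
      runningContainers--;
      wasLaunched = false;
    }
  }
}
{code}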

> NodeManager still reports negative running containers
> -
>
> Key: YARN-4408
> URL: https://issues.apache.org/jira/browse/YARN-4408
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.4.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
> Attachments: YARN-4408.001.patch
>
>
> YARN-1697 fixed a problem where the NodeManager metrics could report a 
> negative number of running containers.  However, it missed a rare case where 
> this can still happen.
> YARN-1697 added a flag to indicate if the container was actually launched 
> ({{LOCALIZED}} to {{RUNNING}}) or not ({{LOCALIZED}} to {{KILLING}}), which 
> is then checked when transitioning from {{CONTAINER_CLEANEDUP_AFTER_KILL}} to 
> {{DONE}} and {{EXITED_WITH_FAILURE}} to {{DONE}} to only decrement the gauge 
> if we actually ran the container and incremented the gauge .  However, this 
> flag is not checked while transitioning from {{EXITED_WITH_SUCCESS}} to 
> {{DONE}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4405) Support node label store in non-appendable file system

2015-12-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035021#comment-15035021
 ] 

Hadoop QA commented on YARN-4405:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 
48s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 27s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 31s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
32s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 51s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
45s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
26s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 57s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 36s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 
1s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 40s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 40s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 44s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 44s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 35s 
{color} | {color:red} Patch generated 7 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn (total was 264, now 267). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 7s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
52s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s 
{color} | {color:red} The patch has 14 line(s) that end in whitespace. Use git 
apply --whitespace=fix. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 51s 
{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
introduced 1 new FindBugs issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 4s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 25s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 33s {color} 
| {color:red} hadoop-yarn-api in the patch failed with JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 23s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 68m 52s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 32s {color} 
| {color:red} hadoop-yarn-api in the patch failed with JDK v1.7.0_85. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 27s {color} 
| {color:red} hadoop-yarn-common in the patch failed with JDK v1.7.0_85. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 40s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_85. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
24s {color} | 

[jira] [Updated] (YARN-4348) ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding ZK's callback work correctly

2015-12-01 Thread Tsuyoshi Ozawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi Ozawa updated YARN-4348:
-
Summary: ZKRMStateStore.syncInternal shouldn't wait for sync completion for 
avoiding ZK's callback work correctly  (was: ZKRMStateStore.syncInternal should 
wait for zkResyncWaitTime instead of zkSessionTimeout)

> ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding 
> ZK's callback work correctly
> 
>
> Key: YARN-4348
> URL: https://issues.apache.org/jira/browse/YARN-4348
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.2, 2.6.2
>Reporter: Tsuyoshi Ozawa
>Assignee: Tsuyoshi Ozawa
>Priority: Blocker
> Attachments: YARN-4348-branch-2.7.002.patch, 
> YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, 
> YARN-4348.001.patch, YARN-4348.001.patch, log.txt
>
>
> Jian mentioned that the current internal ZK configuration of ZKRMStateStore 
> can cause a following situation:
> 1. syncInternal timeouts, 
> 2. but sync succeeded later on.
> We should use zkResyncWaitTime as the timeout value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period

2015-12-01 Thread Junping Du (JIRA)
Junping Du created YARN-4403:


 Summary: (AM/NM/Container)LivelinessMonitor should use monotonic 
time when calculating period
 Key: YARN-4403
 URL: https://issues.apache.org/jira/browse/YARN-4403
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical


Currently, (AM/NM/Container)LivelinessMonitor uses the current system time to calculate the expiry duration, which can be broken by settimeofday. We should use Time.monotonicNow() instead.
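A minimal sketch of the proposed change, assuming Hadoop's {{Time.monotonicNow()}} utility; this is illustrative only, not the actual monitor code:

{code}
// Sketch only: track heartbeats with a monotonic clock so that
// settimeofday/NTP jumps can neither trigger nor mask expiry.
import org.apache.hadoop.util.Time;

public class LivenessSketch {
  private final long expireIntervalMs;
  private volatile long lastHeartbeatMono;

  public LivenessSketch(long expireIntervalMs) {
    this.expireIntervalMs = expireIntervalMs;
    this.lastHeartbeatMono = Time.monotonicNow();  // not System.currentTimeMillis()
  }

  public void onHeartbeat() {
    lastHeartbeatMono = Time.monotonicNow();
  }

  public boolean isExpired() {
    return Time.monotonicNow() - lastHeartbeatMono > expireIntervalMs;
  }
}
{code}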



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding blocking ZK's event thread

2015-12-01 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034071#comment-15034071
 ] 

Tsuyoshi Ozawa commented on YARN-4348:
--

[~zxu] [~jianhe]
I'm rethinking [this comment|https://issues.apache.org/jira/browse/YARN-3798?focusedCommentId=14609769=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14609769] about making the sync callback wait for sync completion: it can cause [the lock problem described here|https://issues.apache.org/jira/browse/YARN-4348?focusedCommentId=15018159=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15018159].
 

To deal with this problem simply, we can just remove the barrier on the sync callback. This works because the ZK client's requests are sent to the ZK server in order, unless the ZK master server fails while the ZK connection is being recreated. Quorum sync (ZOOKEEPER-2136) is a good helper for dealing with that corner case.

What do you think?
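For illustration, a minimal sketch of the non-blocking idea, using ZooKeeper's asynchronous {{sync()}} call; this is not the actual ZKRMStateStore code, and the method name here is made up:

{code}
// Sketch only: issue the async sync() and return immediately instead of
// waiting on a latch, so the ZK event thread is never blocked; the callback
// only records the outcome.
import org.apache.zookeeper.AsyncCallback;
import org.apache.zookeeper.ZooKeeper;

public class NonBlockingSyncSketch {
  public static void syncWithoutBarrier(ZooKeeper zk, String rootPath) {
    zk.sync(rootPath, new AsyncCallback.VoidCallback() {
      @Override
      public void processResult(int rc, String path, Object ctx) {
        // no latch to count down; just log the result
        System.out.println("sync on " + path + " finished with rc=" + rc);
      }
    }, null);
    // no wait here: later requests are ordered after the sync by ZK anyway
  }
}
{code}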

> ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding 
> blocking ZK's event thread
> --
>
> Key: YARN-4348
> URL: https://issues.apache.org/jira/browse/YARN-4348
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.2, 2.6.2
>Reporter: Tsuyoshi Ozawa
>Assignee: Tsuyoshi Ozawa
>Priority: Blocker
> Attachments: YARN-4348-branch-2.7.002.patch, 
> YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, 
> YARN-4348.001.patch, YARN-4348.001.patch, log.txt
>
>
> Jian mentioned that the current internal ZK configuration of ZKRMStateStore 
> can cause a following situation:
> 1. syncInternal timeouts, 
> 2. but sync succeeded later on.
> We should use zkResyncWaitTime as the timeout value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3798) ZKRMStateStore shouldn't create new session without occurrance of SESSIONEXPIED

2015-12-01 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034045#comment-15034045
 ] 

Tsuyoshi Ozawa commented on YARN-3798:
--

[~jianhe] 

{quote}
If curator sync up the data it would be fine. Otherwise there could be a chance 
of lag like we discussed earlier. Truly I haven't tried Curator yet, probably 
some one can cross check this part.
{quote}

FYI, when Curator detects the same situation, it calls sync automatically in the {{doSyncForSuspendedConnection}} method of the Curator framework. Therefore, we don't need to call the sync operation in the trunk and branch-2.8 code.

> ZKRMStateStore shouldn't create new session without occurrance of 
> SESSIONEXPIED
> ---
>
> Key: YARN-3798
> URL: https://issues.apache.org/jira/browse/YARN-3798
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3
>Reporter: Bibin A Chundatt
>Assignee: Varun Saxena
>Priority: Blocker
> Fix For: 2.7.2, 2.6.2
>
> Attachments: RM.log, YARN-3798-2.7.002.patch, 
> YARN-3798-branch-2.6.01.patch, YARN-3798-branch-2.6.02.patch, 
> YARN-3798-branch-2.7.002.patch, YARN-3798-branch-2.7.003.patch, 
> YARN-3798-branch-2.7.004.patch, YARN-3798-branch-2.7.005.patch, 
> YARN-3798-branch-2.7.006.patch, YARN-3798-branch-2.7.patch
>
>
> RM going down with NoNode exception during create of znode for appattempt
> *Please find the exception logs*
> {code}
> 2015-06-09 10:09:44,732 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session connected
> 2015-06-09 10:09:44,732 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session restored
> 2015-06-09 10:09:44,886 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405)
>   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-06-09 10:09:44,887 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed 
> out ZK retries. Giving up!
> 

[jira] [Updated] (YARN-4348) ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding blocking ZK's event thread

2015-12-01 Thread Tsuyoshi Ozawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi Ozawa updated YARN-4348:
-
Summary: ZKRMStateStore.syncInternal shouldn't wait for sync completion for 
avoiding blocking ZK's event thread  (was: ZKRMStateStore.syncInternal 
shouldn't wait for sync completion for avoiding ZK's callback work correctly)

> ZKRMStateStore.syncInternal shouldn't wait for sync completion for avoiding 
> blocking ZK's event thread
> --
>
> Key: YARN-4348
> URL: https://issues.apache.org/jira/browse/YARN-4348
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.2, 2.6.2
>Reporter: Tsuyoshi Ozawa
>Assignee: Tsuyoshi Ozawa
>Priority: Blocker
> Attachments: YARN-4348-branch-2.7.002.patch, 
> YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, 
> YARN-4348.001.patch, YARN-4348.001.patch, log.txt
>
>
> Jian mentioned that the current internal ZK configuration of ZKRMStateStore 
> can cause a following situation:
> 1. syncInternal timeouts, 
> 2. but sync succeeded later on.
> We should use zkResyncWaitTime as the timeout value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4403) (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating period

2015-12-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-4403:
-
Target Version/s: 2.8.0

> (AM/NM/Container)LivelinessMonitor should use monotonic time when calculating 
> period
> 
>
> Key: YARN-4403
> URL: https://issues.apache.org/jira/browse/YARN-4403
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
>
> Currently, (AM/NM/Container)LivelinessMonitor use current system time to 
> calculate a duration of expire which could be broken by settimeofday. We 
> should use Time.monotonicNow() instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3862) Support for fetching specific configs and metrics based on prefixes

2015-12-01 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035351#comment-15035351
 ] 

Varun Saxena commented on YARN-3862:


Thanks Sangjin for the review and commit.
Thanks Li and Joep for the reviews.

> Support for fetching specific configs and metrics based on prefixes
> ---
>
> Key: YARN-3862
> URL: https://issues.apache.org/jira/browse/YARN-3862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>  Labels: yarn-2928-1st-milestone
> Fix For: YARN-2928
>
> Attachments: YARN-3862-YARN-2928.wip.01.patch, 
> YARN-3862-YARN-2928.wip.02.patch, YARN-3862-feature-YARN-2928.005.patch, 
> YARN-3862-feature-YARN-2928.04.patch, YARN-3862-feature-YARN-2928.wip.03.patch
>
>
> Currently, we will retrieve all the contents of the field if that field is 
> specified in the query API. In case of configs and metrics, this can become a 
> lot of data even though the user doesn't need it. So we need to provide a way 
> to query only a set of configs or metrics.
> As a comma spearated list of configs/metrics to be returned will be quite 
> cumbersome to specify, we have to support either of the following options :
> # Prefix match
> # Regex
> # Group the configs/metrics and query that group.
> We also need a facility to specify a metric time window to return metrics in 
> a that window. This may be useful in plotting graphs 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4333) Fair scheduler should support preemption within queue

2015-12-01 Thread Feng Yuan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Yuan updated YARN-4333:

Issue Type: Improvement  (was: Bug)

> Fair scheduler should support preemption within queue
> -
>
> Key: YARN-4333
> URL: https://issues.apache.org/jira/browse/YARN-4333
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.6.0
>Reporter: Tao Jie
>
> Now each app in fair scheduler is allocated its fairshare, however  fairshare 
> resource is not ensured even if fairSharePreemption is enabled.
> Consider: 
> 1, When the cluster is idle, we submit app1 to queueA,which takes maxResource 
> of queueA.  
> 2, Then the cluster becomes busy, but app1 does not release any resource, 
> queueA resource usage is over its fairshare
> 3, Then we submit app2(maybe with higher priority) to queueA. Now app2 has 
> its own fairshare, but could not obtain any resource, since queueA is still 
> over its fairshare and resource will not assign to queueA anymore. Also, 
> preemption is not triggered in this case.
> So we should allow preemption within queue, when app is starved for fairshare.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3862) Support for fetching specific configs and metrics based on prefixes

2015-12-01 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3862:
---
Attachment: (was: YARN-3862-feature-YARN-2928.05.patch)

> Support for fetching specific configs and metrics based on prefixes
> ---
>
> Key: YARN-3862
> URL: https://issues.apache.org/jira/browse/YARN-3862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-3862-YARN-2928.wip.01.patch, 
> YARN-3862-YARN-2928.wip.02.patch, YARN-3862-feature-YARN-2928.005.patch, 
> YARN-3862-feature-YARN-2928.04.patch, YARN-3862-feature-YARN-2928.wip.03.patch
>
>
> Currently, we will retrieve all the contents of the field if that field is 
> specified in the query API. In case of configs and metrics, this can become a 
> lot of data even though the user doesn't need it. So we need to provide a way 
> to query only a set of configs or metrics.
> As a comma spearated list of configs/metrics to be returned will be quite 
> cumbersome to specify, we have to support either of the following options :
> # Prefix match
> # Regex
> # Group the configs/metrics and query that group.
> We also need a facility to specify a metric time window to return metrics in 
> a that window. This may be useful in plotting graphs 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3862) Support for fetching specific configs and metrics based on prefixes

2015-12-01 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035308#comment-15035308
 ] 

Sangjin Lee commented on YARN-3862:
---

They do need to be fixed (sooner rather than later), but I'm OK with tackling all of these errors in a separate JIRA. Do you mind filing a task for cleaning up all the javadoc and checkstyle errors in the timelineservice code?

I'm +1 with the latest patch. I'll commit it shortly.

> Support for fetching specific configs and metrics based on prefixes
> ---
>
> Key: YARN-3862
> URL: https://issues.apache.org/jira/browse/YARN-3862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-3862-YARN-2928.wip.01.patch, 
> YARN-3862-YARN-2928.wip.02.patch, YARN-3862-feature-YARN-2928.005.patch, 
> YARN-3862-feature-YARN-2928.04.patch, YARN-3862-feature-YARN-2928.wip.03.patch
>
>
> Currently, we will retrieve all the contents of the field if that field is 
> specified in the query API. In case of configs and metrics, this can become a 
> lot of data even though the user doesn't need it. So we need to provide a way 
> to query only a set of configs or metrics.
> As a comma spearated list of configs/metrics to be returned will be quite 
> cumbersome to specify, we have to support either of the following options :
> # Prefix match
> # Regex
> # Group the configs/metrics and query that group.
> We also need a facility to specify a metric time window to return metrics in 
> a that window. This may be useful in plotting graphs 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3862) Support for fetching specific configs and metrics based on prefixes

2015-12-01 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035315#comment-15035315
 ] 

Varun Saxena commented on YARN-3862:


Filed YARN-4409

> Support for fetching specific configs and metrics based on prefixes
> ---
>
> Key: YARN-3862
> URL: https://issues.apache.org/jira/browse/YARN-3862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-3862-YARN-2928.wip.01.patch, 
> YARN-3862-YARN-2928.wip.02.patch, YARN-3862-feature-YARN-2928.005.patch, 
> YARN-3862-feature-YARN-2928.04.patch, YARN-3862-feature-YARN-2928.wip.03.patch
>
>
> Currently, we will retrieve all the contents of the field if that field is 
> specified in the query API. In case of configs and metrics, this can become a 
> lot of data even though the user doesn't need it. So we need to provide a way 
> to query only a set of configs or metrics.
> As a comma spearated list of configs/metrics to be returned will be quite 
> cumbersome to specify, we have to support either of the following options :
> # Prefix match
> # Regex
> # Group the configs/metrics and query that group.
> We also need a facility to specify a metric time window to return metrics in 
> a that window. This may be useful in plotting graphs 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4409) Fix javadoc and checkstyle issues in timelineservice code

2015-12-01 Thread Varun Saxena (JIRA)
Varun Saxena created YARN-4409:
--

 Summary: Fix javadoc and checkstyle issues in timelineservice code
 Key: YARN-4409
 URL: https://issues.apache.org/jira/browse/YARN-4409
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Varun Saxena
Assignee: Varun Saxena


There are a large number of javadoc and checkstyle issues currently open in 
timelineservice code. We need to fix them before we merge it into trunk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart

2015-12-01 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035223#comment-15035223
 ] 

Naganarasimha G R commented on YARN-4406:
-

Hi [~rchiang] & [~kshukla]
YARN-3102 is also about the same issue. I had stopped working on it earlier because I was not sure whether YARN-914 (or its sub-JIRAs) might impact or take care of this issue. If you have already narrowed down the cause, feel free to assign YARN-3102 and close this issue.

> RM Web UI continues to show decommissioned nodes even after RM restart
> --
>
> Key: YARN-4406
> URL: https://issues.apache.org/jira/browse/YARN-4406
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Ray Chiang
>Priority: Minor
>
> If you start up a cluster, decommission a NodeManager, and restart the RM, 
> the decommissioned node list will still show a positive number (1 in the case 
> of 1 node) and if you click on the list, it will be empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3417) AM to be able to exit with a request saying "restart me with these (possibly updated) resource requirements"

2015-12-01 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035246#comment-15035246
 ] 

Varun Saxena commented on YARN-3417:


Thanks Steve for the clarification.

> AM to be able to exit with a request saying "restart me with these (possibly 
> updated) resource requirements"
> 
>
> Key: YARN-3417
> URL: https://issues.apache.org/jira/browse/YARN-3417
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 2.7.0
>Reporter: Steve Loughran
>Assignee: Varun Saxena
>Priority: Minor
>
> If an AM wants to reconfigure itself or restart with new resources, there's 
> no way to do this without the active participation of a client.
> It can call System.exit and rely on YARN to restart it -but that counts as a 
> failure and may lose the entire app. furthermore, that doesn't allow the AM 
> to resize itself.
> A simple exit-code to be interpreted as restart-without-failure could handle 
> the first case; an explicit call to indicate restart, including potentially 
> new resource/label requirements, could be more reliabile, and certainly more 
> flexible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3862) Support for fetching specific configs and metrics based on prefixes

2015-12-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035267#comment-15035267
 ] 

Hadoop QA commented on YARN-3862:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 4 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 9m 
31s {color} | {color:green} feature-YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 18s 
{color} | {color:green} feature-YARN-2928 passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s 
{color} | {color:green} feature-YARN-2928 passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
11s {color} | {color:green} feature-YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s 
{color} | {color:green} feature-YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
17s {color} | {color:green} feature-YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 
38s {color} | {color:green} feature-YARN-2928 passed {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 18s 
{color} | {color:red} hadoop-yarn-server-timelineservice in feature-YARN-2928 
failed with JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s 
{color} | {color:green} feature-YARN-2928 passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
24s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 19s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 19s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 20s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 20s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 11s 
{color} | {color:red} Patch generated 20 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice
 (total was 98, now 94). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 29s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 
46s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 17s 
{color} | {color:red} hadoop-yarn-server-timelineservice in the patch failed 
with JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 23s 
{color} | {color:green} hadoop-yarn-server-timelineservice in the patch passed 
with JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 22s 
{color} | {color:green} hadoop-yarn-server-timelineservice in the patch passed 
with JDK v1.7.0_91. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
25s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 24m 23s {color} 
| {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:7c86163 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12775209/YARN-3862-feature-YARN-2928.005.patch
 |
| JIRA Issue | YARN-3862 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  findbugs  checkstyle  |
| uname | 

[jira] [Commented] (YARN-3862) Support for fetching specific configs and metrics based on prefixes

2015-12-01 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035275#comment-15035275
 ] 

Varun Saxena commented on YARN-3862:


[~sjlee0], kindly review. Regarding some of the checkstyle comments:

If the two checkstyle issues below have to be fixed, I think they should be fixed for around 10 other variables as well. We are directly accessing these variables in the subclasses (***EntityReader). Do you want me to add getters and setters for each of them?
{noformat}
./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/TimelineEntityReader.java:75:32:
 Variable 'confsToRetrieve' must be private and have accessor methods.
./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/TimelineEntityReader.java:76:32:
 Variable 'metricsToRetrieve' must be private and have accessor methods.
{noformat}
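For reference, a minimal sketch of what the checkstyle rule asks for; the class name is made up and the field type is simplified to {{Object}}, so this is not the actual patch code:

{code}
// Sketch only: make the field private and give the ***EntityReader
// subclasses accessors instead of direct field access.
public abstract class TimelineEntityReaderSketch {
  private Object confsToRetrieve;            // was accessed directly by subclasses

  protected Object getConfsToRetrieve() {
    return confsToRetrieve;
  }

  protected void setConfsToRetrieve(Object confs) {
    this.confsToRetrieve = confs;
  }
}
{code}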

The checkstyle issue below is introduced because of javadoc: the import is required because I link to this class in the javadoc. If I give the full class name along with the package name, it throws multiple javadoc warnings, as it did in the previous build. So I guess this should be fine.
{noformat}
./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/TimelineReader.java:31:8:
 Unused import - 
org.apache.hadoop.yarn.server.timelineservice.reader.filter.TimelinePrefixFilter.
{noformat}

> Support for fetching specific configs and metrics based on prefixes
> ---
>
> Key: YARN-3862
> URL: https://issues.apache.org/jira/browse/YARN-3862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-3862-YARN-2928.wip.01.patch, 
> YARN-3862-YARN-2928.wip.02.patch, YARN-3862-feature-YARN-2928.005.patch, 
> YARN-3862-feature-YARN-2928.04.patch, YARN-3862-feature-YARN-2928.wip.03.patch
>
>
> Currently, we will retrieve all the contents of the field if that field is 
> specified in the query API. In case of configs and metrics, this can become a 
> lot of data even though the user doesn't need it. So we need to provide a way 
> to query only a set of configs or metrics.
> As a comma separated list of configs/metrics to be returned would be quite 
> cumbersome to specify, we have to support one of the following options:
> # Prefix match
> # Regex
> # Group the configs/metrics and query that group.
> We also need a facility to specify a metric time window to return metrics in 
> that window. This may be useful in plotting graphs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4164) Retrospect update ApplicationPriority API return type

2015-12-01 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035302#comment-15035302
 ] 

Rohith Sharma K S commented on YARN-4164:
-

I think just updating the log message for clarity should be fine. I will update 
the patch with the log message change.

> Retrospect update ApplicationPriority API return type
> -
>
> Key: YARN-4164
> URL: https://issues.apache.org/jira/browse/YARN-4164
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
> Attachments: 0001-YARN-4164.patch, 0002-YARN-4164.patch, 
> 0003-YARN-4164.patch
>
>
> Currently, the {{ApplicationClientProtocol#updateApplicationPriority()}} API 
> returns an empty UpdateApplicationPriorityResponse.
> But the RM updates the priority to cluster.max-priority if the given priority is 
> greater than cluster.max-priority. In this scenario, we need to communicate the 
> updated priority back to the client rather than keeping quiet, where the client 
> assumes that the given priority itself was taken.
> The same scenario can also happen during application submission, but I feel that 
> when ApplicationClientProtocol#updateApplicationPriority() is invoked explicitly, 
> the response should contain the updated priority. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3862) Support for fetching specific configs and metrics based on prefixes

2015-12-01 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3862:
---
Attachment: YARN-3862-feature-YARN-2928.005.patch

Renamed patch to try and invoke Jenkins

> Support for fetching specific configs and metrics based on prefixes
> ---
>
> Key: YARN-3862
> URL: https://issues.apache.org/jira/browse/YARN-3862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-3862-YARN-2928.wip.01.patch, 
> YARN-3862-YARN-2928.wip.02.patch, YARN-3862-feature-YARN-2928.005.patch, 
> YARN-3862-feature-YARN-2928.04.patch, YARN-3862-feature-YARN-2928.05.patch, 
> YARN-3862-feature-YARN-2928.wip.03.patch
>
>
> Currently, we will retrieve all the contents of the field if that field is 
> specified in the query API. In case of configs and metrics, this can become a 
> lot of data even though the user doesn't need it. So we need to provide a way 
> to query only a set of configs or metrics.
> As a comma separated list of configs/metrics to be returned would be quite 
> cumbersome to specify, we have to support one of the following options:
> # Prefix match
> # Regex
> # Group the configs/metrics and query that group.
> We also need a facility to specify a metric time window to return metrics in 
> that window. This may be useful in plotting graphs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4292) ResourceUtilization should be a part of NodeInfo REST API

2015-12-01 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035250#comment-15035250
 ] 

Sunil G commented on YARN-4292:
---

The test case failures are known issues, except for TestApplicationPriority and 
TestCapacityScheduler. I verified locally that these test cases pass.

> ResourceUtilization should be a part of NodeInfo REST API
> -
>
> Key: YARN-4292
> URL: https://issues.apache.org/jira/browse/YARN-4292
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Sunil G
> Attachments: 0001-YARN-4292.patch, 0002-YARN-4292.patch, 
> 0003-YARN-4292.patch, 0004-YARN-4292.patch, 0005-YARN-4292.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4401) A failed app recovery should not prevent the RM from starting

2015-12-01 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035254#comment-15035254
 ] 

Rohith Sharma K S commented on YARN-4401:
-

bq. if a job is stored with a resource allocation that is higher than the 
configured maximum at the time of recovery, the recovery will throw an 
exception which will prevent the RM from starting.
Which version of Hadoop are you using? This issue is fixed in YARN-3493.

And regarding the patch, the app should never be removed from the RMContext at 
any point during recovery; doing so causes an ApplicationNotFoundException at the 
client, which is incorrect. In any case, to continue any of these flows, we need 
to trigger an appropriate event that completes the state transition.
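
To illustrate the event-driven alternative (a sketch only, not the attached 
patch; the choice of APP_REJECTED and the {{rmContext}}/{{appId}} variables are 
assumptions):

{code}
// Sketch: instead of removing the app from RMContext when its recovery fails,
// dispatch an event so the RMAppImpl state machine completes a proper transition
// and clients can still query the application.
rmContext.getDispatcher().getEventHandler().handle(
    new RMAppEvent(appId, RMAppEventType.APP_REJECTED));
{code}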

> A failed app recovery should not prevent the RM from starting
> -
>
> Key: YARN-4401
> URL: https://issues.apache.org/jira/browse/YARN-4401
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Critical
> Attachments: YARN-4401.001.patch
>
>
> There are many different reasons why an app recovery could fail with an 
> exception, causing the RM start to be aborted.  If that happens the RM will 
> fail to start.  Presumably, the reason the RM is trying to do a recovery is 
> that it's the standby trying to fill in for the active.  Failing to come up 
> defeats the purpose of the HA configuration.  Instead of preventing the RM 
> from starting, a failed app recovery should log an error and skip the 
> application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4164) Retrospect update ApplicationPriority API return type

2015-12-01 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035274#comment-15035274
 ] 

Sunil G commented on YARN-4164:
---

Thanks [~rohithsharma] for updating the patch.

{code}
Priority updateApplicationPriority =
client.updateApplicationPriority(appId, newAppPriority);
if (newAppPriority.equals(updateApplicationPriority)) {
  sysout.println("Successfully updated the application "
  + applicationId + " with priority '" + priority + "'");
} else {
  sysout.println("Updated the application  " + applicationId
  + " to cluster max priority '" + updateApplicationPriority + "'");
}
{code}

There is another corner case here when the app is in one of its FINAL states. In 
that case, the scheduler will not update the priority and the last-set priority is 
returned. So this CLI handling will fall into the else case and print 
{{"Updated the application to cluster max priority"}}, which looks incorrect.

One possibility is to look at the cluster max priority here to print the correct 
message. I am not keen on hitting the RM again with a remote request just to get 
the cluster priority, so maybe we can also return the cluster max priority in the 
response; but again, that puts more information in the response. Or we can simply 
print a general message in the else case that is correct for both situations (see 
the sketch below). Thoughts?
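
For the last option, a rough sketch of a wording that stays correct in both 
situations (reusing the variables from the snippet above; this is not from any 
attached patch):

{code}
Priority updateApplicationPriority =
    client.updateApplicationPriority(appId, newAppPriority);
if (newAppPriority.equals(updateApplicationPriority)) {
  sysout.println("Successfully updated the application "
      + applicationId + " with priority '" + priority + "'");
} else {
  // Generic wording: holds both when the RM clamped the request to the cluster
  // max priority and when the app was already in a final state and kept its
  // previously set priority.
  sysout.println("Application " + applicationId + " priority was not set to '"
      + priority + "'; the effective priority is '" + updateApplicationPriority
      + "'");
}
{code}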


> Retrospect update ApplicationPriority API return type
> -
>
> Key: YARN-4164
> URL: https://issues.apache.org/jira/browse/YARN-4164
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
> Attachments: 0001-YARN-4164.patch, 0002-YARN-4164.patch, 
> 0003-YARN-4164.patch
>
>
> Currently, the {{ApplicationClientProtocol#updateApplicationPriority()}} API 
> returns an empty UpdateApplicationPriorityResponse.
> But the RM updates the priority to cluster.max-priority if the given priority is 
> greater than cluster.max-priority. In this scenario, we need to communicate the 
> updated priority back to the client rather than keeping quiet, where the client 
> assumes that the given priority itself was taken.
> The same scenario can also happen during application submission, but I feel that 
> when ApplicationClientProtocol#updateApplicationPriority() is invoked explicitly, 
> the response should contain the updated priority. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3946) Allow fetching exact reason as to why a submitted app is in ACCEPTED state in CS

2015-12-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035294#comment-15035294
 ] 

Hadoop QA commented on YARN-3946:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 5 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 9m 
46s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 41s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 40s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
16s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 48s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
19s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
33s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 34s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
44s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 41s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 41s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 40s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 40s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 16s 
{color} | {color:red} Patch generated 21 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 (total was 655, now 672). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 46s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
18s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
44s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 33s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 34s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 86m 2s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 88m 26s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_91. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
46s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 216m 19s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_66 Failed junit tests | 
hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps |
|   | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| JDK v1.8.0_66 Timed out junit tests | 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestLeafQueue |
| JDK v1.7.0_91 Failed junit tests | 
hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps |
|   | 

[jira] [Commented] (YARN-3862) Support for fetching specific configs and metrics based on prefixes

2015-12-01 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035310#comment-15035310
 ] 

Varun Saxena commented on YARN-3862:


Ok...Will do so.

> Support for fetching specific configs and metrics based on prefixes
> ---
>
> Key: YARN-3862
> URL: https://issues.apache.org/jira/browse/YARN-3862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-3862-YARN-2928.wip.01.patch, 
> YARN-3862-YARN-2928.wip.02.patch, YARN-3862-feature-YARN-2928.005.patch, 
> YARN-3862-feature-YARN-2928.04.patch, YARN-3862-feature-YARN-2928.wip.03.patch
>
>
> Currently, we will retrieve all the contents of the field if that field is 
> specified in the query API. In case of configs and metrics, this can become a 
> lot of data even though the user doesn't need it. So we need to provide a way 
> to query only a set of configs or metrics.
> As a comma separated list of configs/metrics to be returned would be quite 
> cumbersome to specify, we have to support one of the following options:
> # Prefix match
> # Regex
> # Group the configs/metrics and query that group.
> We also need a facility to specify a metric time window to return metrics in 
> that window. This may be useful in plotting graphs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4395) Typo in comment in ClientServiceDelegate

2015-12-01 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton updated YARN-4395:
---
Assignee: Devon Michaels  (was: Daniel Templeton)

> Typo in comment in ClientServiceDelegate
> 
>
> Key: YARN-4395
> URL: https://issues.apache.org/jira/browse/YARN-4395
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Daniel Templeton
>Assignee: Devon Michaels
>Priority: Trivial
>
> Line 337 in {{invoke()}} has the following comment:
> {code}
> // if it's AM shut down, do not decrement maxClientRetry as we wait 
> for
> // AM to be restarted.
> {code}
> Ideally it should be:
> {code}
> // If its AM shut down, do not decrement maxClientRetry while we wait
> // for its AM to be restarted.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4164) Retrospect update ApplicationPriority API return type

2015-12-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034354#comment-15034354
 ] 

Hadoop QA commented on YARN-4164:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 9m 
30s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 11m 6s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 
33s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
10s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 56s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 
18s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 
43s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 25s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 45s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 
42s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 
28s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 10m 28s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 20m 54s 
{color} | {color:red} root-jdk1.8.0_66 with JDK v1.8.0_66 generated 1 new 
issues (was 751, now 751). {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 10m 28s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 10m 
30s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 10m 30s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 31m 24s 
{color} | {color:red} root-jdk1.7.0_85 with JDK v1.7.0_85 generated 1 new 
issues (was 745, now 745). {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 10m 30s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 11s 
{color} | {color:red} Patch generated 4 new checkstyle issues in root (total 
was 122, now 126). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 55s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 
18s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s 
{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git 
apply --whitespace=fix. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 6m 
37s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 25s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 47s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 29s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 17s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 70m 2s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit 

[jira] [Commented] (YARN-4108) CapacityScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request

2015-12-01 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034353#comment-15034353
 ] 

Wangda Tan commented on YARN-4108:
--

Thanks for sharing your thoughts, [~bikassaha]!

I've done a POC very similar to your proposal and uploaded it.

The major difference between my approach and yours is that I prefer to do the 
resource request matching within the scheduler's allocation logic, so we don't 
have to add a separate "scan" step to match pending resource requests against 
to-be-preempted containers. With this approach, the preemption logic and the 
allocation logic stay synchronized: we ensure that every preempted container can 
be used by apps from under-satisfied queues, progressively reaching the "ideal 
allocation" calculated by the preemption policy. The sketch below illustrates the 
matching idea.

Please kindly review.

> CapacityScheduler: Improve preemption to preempt only those containers that 
> would satisfy the incoming request
> --
>
> Key: YARN-4108
> URL: https://issues.apache.org/jira/browse/YARN-4108
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-4108-design-doc-v1.pdf, YARN-4108.poc.1.patch
>
>
> This is a sibling JIRA of YARN-2154. We should make sure container preemption 
> is more effective.
> *Requirements:*
> 1) Can handle case of user-limit preemption
> 2) Can handle case of resource placement requirements, such as: hard-locality 
> (I only want to use rack-1) / node-constraints (YARN-3409) / black-list (I 
> don't want to use rack1 and host\[1-3\])
> 3) Can handle preemption within a queue: cross-user preemption (YARN-2113), 
> cross-application preemption (such as priority-based (YARN-1963) / 
> fairness-based (YARN-3319)).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4405) Support node label store in non-appendable file system

2015-12-01 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-4405:


 Summary: Support node label store in non-appendable file system
 Key: YARN-4405
 URL: https://issues.apache.org/jira/browse/YARN-4405
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Wangda Tan
Assignee: Wangda Tan


The existing node label file system store implementation uses append to write 
edit logs. However, some file systems don't support append, so we need to add an 
implementation that supports such non-appendable file systems as well.
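
One possible direction (a sketch only, under the assumption that rewriting a full 
mirror file and renaming it into place is acceptable; the method and file names 
below are made up, not from any attached patch):

{code}
// Sketch of an overwrite-and-rename store for file systems without append():
// serialize the full label state, write it to a temporary file, then publish it
// by renaming over the previous mirror.
private void writeNewMirror(FileSystem fs, Path storeRoot, byte[] serializedLabels)
    throws IOException {
  Path tmp = new Path(storeRoot, "nodelabel.mirror.writing");
  Path mirror = new Path(storeRoot, "nodelabel.mirror");
  try (FSDataOutputStream out = fs.create(tmp, true)) {
    out.write(serializedLabels);
  }
  fs.delete(mirror, false); // remove the old mirror if present
  fs.rename(tmp, mirror);   // publish the new mirror
}
{code}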



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4390) Consider container request size during CS preemption

2015-12-01 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034376#comment-15034376
 ] 

Wangda Tan commented on YARN-4390:
--

Thanks for sharing your thoughts, [~curino]!

I agree with most of what you said: fixing large imbalances is more important 
than doing micro-corrections. That is enough when a cluster is large and 
resource requests are mostly homogeneous; the current PCPP can handle such cases 
quite well.

But in other cases, for example:
# Resource requests from different queues are very heterogeneous: some requests 
need only 1G of memory, and some need 32G.
# Hard locality is required (for example SLIDER-82).
the existing PCPP does not work well. I have seen a lot of excessive preemption 
on a customer's cluster with several hundred nodes and heterogeneous requests.

So I'm proposing an approach in YARN-4108 that combines the two: large 
imbalances are calculated by the preemption monitor, and micro-corrections are 
handled by the scheduler's allocation logic. I've uploaded a doc / POC patch to 
YARN-4108; please kindly review.

> Consider container request size during CS preemption
> 
>
> Key: YARN-4390
> URL: https://issues.apache.org/jira/browse/YARN-4390
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.0.0, 2.8.0, 2.7.3
>Reporter: Eric Payne
>Assignee: Eric Payne
>
> There are multiple reasons why preemption could unnecessarily preempt 
> containers. One is that an app could be requesting a large container (say 
> 8-GB), and the preemption monitor could conceivably preempt multiple 
> containers (say 8, 1-GB containers) in order to fill the large container 
> request. These smaller containers would then be rejected by the requesting AM 
> and potentially given right back to the preempted app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4401) A failed app recovery should not prevent the RM from starting

2015-12-01 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034399#comment-15034399
 ] 

Daniel Templeton commented on YARN-4401:


There are lots of reasons a recovery could fail.  For example, if a job is 
stored with a resource allocation that is higher than the configured maximum at 
the time of recovery, the recovery will throw an exception which will prevent 
the RM from starting.

In a single-RM configuration, it makes some sense to allow the RM restart to be 
interrupted by a recovery failure, but in an HA scenario the standby is becoming 
active precisely to prevent an outage.  Causing an outage over one bad application 
undermines the point of HA.  It becomes a question of trading an application 
failure for a service outage, and I think most sites would choose the former.

There are already yarn.fail-fast and yarn.resourcemanager.fail-fast, which control 
this behavior for some of the recovery failure scenarios, such as bad queue 
assignments.  I would propose we extend the meaning of those properties to cover 
the full range of what could go wrong during recovery.
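
For context, a minimal sketch of setting the two existing knobs programmatically 
(property names are the ones referenced above; the values are for illustration 
only):

{code}
// Sketch only: disable fail-fast behavior so recoverable failures do not abort
// startup. Whether these properties should also govern per-app recovery failures
// is exactly what is being proposed here.
Configuration conf = new YarnConfiguration();
conf.setBoolean("yarn.fail-fast", false);
conf.setBoolean("yarn.resourcemanager.fail-fast", false);
{code}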

> A failed app recovery should not prevent the RM from starting
> -
>
> Key: YARN-4401
> URL: https://issues.apache.org/jira/browse/YARN-4401
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Critical
>
> There are many different reasons why an app recovery could fail with an 
> exception, causing the RM start to be aborted.  If that happens the RM will 
> fail to start.  Presumably, the reason the RM is trying to do a recovery is 
> that it's the standby trying to fill in for the active.  Failing to come up 
> defeats the purpose of the HA configuration.  Instead of preventing the RM 
> from starting, a failed app recovery should log an error and skip the 
> application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4401) A failed app recovery should not prevent the RM from starting

2015-12-01 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton updated YARN-4401:
---
Attachment: YARN-4401.001.patch

Here's the basic idea of what I'm proposing.

> A failed app recovery should not prevent the RM from starting
> -
>
> Key: YARN-4401
> URL: https://issues.apache.org/jira/browse/YARN-4401
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Critical
> Attachments: YARN-4401.001.patch
>
>
> There are many different reasons why an app recovery could fail with an 
> exception, causing the RM start to be aborted.  If that happens the RM will 
> fail to start.  Presumably, the reason the RM is trying to do a recovery is 
> that it's the standby trying to fill in for the active.  Failing to come up 
> defeats the purpose of the HA configuration.  Instead of preventing the RM 
> from starting, a failed app recovery should log an error and skip the 
> application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4108) CapacityScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request

2015-12-01 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-4108:
-
Attachment: YARN-4108.poc.1.patch

> CapacityScheduler: Improve preemption to preempt only those containers that 
> would satisfy the incoming request
> --
>
> Key: YARN-4108
> URL: https://issues.apache.org/jira/browse/YARN-4108
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-4108.poc.1.patch
>
>
> This is a sibling JIRA of YARN-2154. We should make sure container preemption 
> is more effective.
> *Requirements:*
> 1) Can handle case of user-limit preemption
> 2) Can handle case of resource placement requirements, such as: hard-locality 
> (I only want to use rack-1) / node-constraints (YARN-3409) / black-list (I 
> don't want to use rack1 and host\[1-3\])
> 3) Can handle preemption within a queue: cross-user preemption (YARN-2113), 
> cross-application preemption (such as priority-based (YARN-1963) / 
> fairness-based (YARN-3319)).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4404) Typo in comment in SchedulerUtils

2015-12-01 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton updated YARN-4404:
---
Assignee: Devon Michaels  (was: Daniel Templeton)

> Typo in comment in SchedulerUtils
> -
>
> Key: YARN-4404
> URL: https://issues.apache.org/jira/browse/YARN-4404
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Daniel Templeton
>Assignee: Devon Michaels
>Priority: Trivial
>
> The comment starting on line 254 says:
> {code}
>   /**
>* Utility method to validate a resource request, by insuring that the
>* requested memory/vcore is non-negative and not greater than max
>* 
>* @throws InvalidResourceRequestException when there is invalid request
>*/
> {code}
> "Insuring" should be "ensuring."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4108) CapacityScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request

2015-12-01 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034346#comment-15034346
 ] 

Wangda Tan commented on YARN-4108:
--

Uploading design doc with initial POC patch for review.

I put the general ideas in the design doc, and you can also take a look at the 
POC patch if you are interested in implementation details. The "Background" part 
can be skipped if you already understand the preemption logic implementation.

What I have done in the POC patch:
- Added a "dryrun" flag to CapacityScheduler to do preemption checks.
- Implemented PreemptionMonitor (the major code comes from 
ProportionalCapacityPreemptionPolicy). See ProportionalCapacityMonitorPolicy.
- Implemented PreemptionManager: it stores the relationship between demanding apps 
(which come from under-satisfied queues) and to-be-preempted containers (which 
come from over-satisfied queues), and it can also make preemption decisions 
according to the ideal allocation (calculated by PreemptionMonitor).
- Added a simple unit test to demonstrate how it works end-to-end (allocate a 4G 
container which needs to preempt 4 * 1G containers on the same node).

TODO items:
- Reserved container preemption should also be considered.
- Optimize how to-be-preempted containers are selected; selected containers should 
have minimal impact.
- Similar to the above, when PreemptionManager finds a better combination of 
containers to preempt (such as containers from lower-priority applications), it 
should be able to update the preemption target(s).
- Many other corner cases.

Any review comments / suggestions are welcome! And please feel free to let me 
know if you have any questions. 
+ [~curino], [~jlowe], [~eepayne], [~sunilg]

Thanks,

> CapacityScheduler: Improve preemption to preempt only those containers that 
> would satisfy the incoming request
> --
>
> Key: YARN-4108
> URL: https://issues.apache.org/jira/browse/YARN-4108
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>
> This is a sibling JIRA of YARN-2154. We should make sure container preemption 
> is more effective.
> *Requirements:*
> 1) Can handle case of user-limit preemption
> 2) Can handle case of resource placement requirements, such as: hard-locality 
> (I only want to use rack-1) / node-constraints (YARN-3409) / black-list (I 
> don't want to use rack1 and host\[1-3\])
> 3) Can handle preemption within a queue: cross-user preemption (YARN-2113), 
> cross-application preemption (such as priority-based (YARN-1963) / 
> fairness-based (YARN-3319)).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4404) Typo in comment in SchedulerUtils

2015-12-01 Thread Daniel Templeton (JIRA)
Daniel Templeton created YARN-4404:
--

 Summary: Typo in comment in SchedulerUtils
 Key: YARN-4404
 URL: https://issues.apache.org/jira/browse/YARN-4404
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.1
Reporter: Daniel Templeton
Assignee: Daniel Templeton
Priority: Trivial


The comment starting on line 254 says:

{code}
  /**
   * Utility method to validate a resource request, by insuring that the
   * requested memory/vcore is non-negative and not greater than max
   * 
   * @throws InvalidResourceRequestException when there is invalid request
   */
{code}

"Insuring" should be "ensuring."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4108) CapacityScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request

2015-12-01 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-4108:
-
Attachment: YARN-4108-design-doc-v1.pdf

> CapacityScheduler: Improve preemption to preempt only those containers that 
> would satisfy the incoming request
> --
>
> Key: YARN-4108
> URL: https://issues.apache.org/jira/browse/YARN-4108
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-4108-design-doc-v1.pdf, YARN-4108.poc.1.patch
>
>
> This is a sibling JIRA of YARN-2154. We should make sure container preemption 
> is more effective.
> *Requirements:*
> 1) Can handle case of user-limit preemption
> 2) Can handle case of resource placement requirements, such as: hard-locality 
> (I only want to use rack-1) / node-constraints (YARN-3409) / black-list (I 
> don't want to use rack1 and host\[1-3\])
> 3) Can handle preemption within a queue: cross-user preemption (YARN-2113), 
> cross-application preemption (such as priority-based (YARN-1963) / 
> fairness-based (YARN-3319)).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4404) Typo in comment in SchedulerUtils

2015-12-01 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton updated YARN-4404:
---
Issue Type: Improvement  (was: Bug)

> Typo in comment in SchedulerUtils
> -
>
> Key: YARN-4404
> URL: https://issues.apache.org/jira/browse/YARN-4404
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Daniel Templeton
>Assignee: Devon Michaels
>Priority: Trivial
>
> The comment starting on line 254 says:
> {code}
>   /**
>* Utility method to validate a resource request, by insuring that the
>* requested memory/vcore is non-negative and not greater than max
>* 
>* @throws InvalidResourceRequestException when there is invalid request
>*/
> {code}
> "Insuring" should be "ensuring."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3769) Consider user limit when calculating total pending resource for preemption policy in Capacity Scheduler

2015-12-01 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-3769:
-
Attachment: YARN-3769-branch-2.6.001.patch

Attaching {{YARN-3769-branch-2.6.001.patch}} for backport to branch-2.6.

The TestLeafQueue unit test for multiple apps by multiple users had to be 
modified specifically to allow all apps to be active at the same time, since the 
way active apps are calculated differs between 2.6 and 2.7.

> Consider user limit when calculating total pending resource for preemption 
> policy in Capacity Scheduler
> ---
>
> Key: YARN-3769
> URL: https://issues.apache.org/jira/browse/YARN-3769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.6.0, 2.7.0, 2.8.0
>Reporter: Eric Payne
>Assignee: Eric Payne
> Fix For: 2.7.3
>
> Attachments: YARN-3769-branch-2.002.patch, 
> YARN-3769-branch-2.6.001.patch, YARN-3769-branch-2.7.002.patch, 
> YARN-3769-branch-2.7.003.patch, YARN-3769-branch-2.7.005.patch, 
> YARN-3769-branch-2.7.006.patch, YARN-3769-branch-2.7.007.patch, 
> YARN-3769.001.branch-2.7.patch, YARN-3769.001.branch-2.8.patch, 
> YARN-3769.003.patch, YARN-3769.004.patch, YARN-3769.005.patch
>
>
> We are seeing the preemption monitor preempting containers from queue A and 
> then seeing the capacity scheduler giving them immediately back to queue A. 
> This happens quite often and causes a lot of churn.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4406) RM Web UI continues to show decommissioned nodes even after RM restart

2015-12-01 Thread Ray Chiang (JIRA)
Ray Chiang created YARN-4406:


 Summary: RM Web UI continues to show decommissioned nodes even 
after RM restart
 Key: YARN-4406
 URL: https://issues.apache.org/jira/browse/YARN-4406
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Ray Chiang
Priority: Minor


If you start up a cluster, decommission a NodeManager, and restart the RM, the 
decommissioned node list will still show a positive number (1 in the case of 1 
node) and if you click on the list, it will be empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4398) Yarn recover functionality causes the cluster running slowly and the cluster usage rate is far below 100

2015-12-01 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034466#comment-15034466
 ] 

Jian He commented on YARN-4398:
---

Patch looks good to me.  Thanks [~iceberg565] and [~templedf] for the review. 

> Yarn recover functionality causes the cluster running slowly and the cluster 
> usage rate is far below 100
> 
>
> Key: YARN-4398
> URL: https://issues.apache.org/jira/browse/YARN-4398
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: NING DING
> Attachments: YARN-4398.2.patch, YARN-4398.3.patch
>
>
> In my Hadoop cluster, the ResourceManager recovery functionality is enabled 
> with FileSystemRMStateStore.
> I found this causes the YARN cluster to run slowly, and the cluster usage rate is 
> just 50% even though there are many pending apps. 
> The scenario is below.
> In thread A, the RMAppImpl$RMAppNewlySavingTransition is calling 
> storeNewApplication method defined in RMStateStore. This storeNewApplication 
> method is synchronized.
> {code:title=RMAppImpl.java|borderStyle=solid}
>   private static final class RMAppNewlySavingTransition extends 
> RMAppTransition {
> @Override
> public void transition(RMAppImpl app, RMAppEvent event) {
>   // If recovery is enabled then store the application information in a
>   // non-blocking call so make sure that RM has stored the information
>   // needed to restart the AM after RM restart without further client
>   // communication
>   LOG.info("Storing application with id " + app.applicationId);
>   app.rmContext.getStateStore().storeNewApplication(app);
> }
>   }
> {code}
> {code:title=RMStateStore.java|borderStyle=solid}
> public synchronized void storeNewApplication(RMApp app) {
> ApplicationSubmissionContext context = app
> 
> .getApplicationSubmissionContext();
> assert context instanceof ApplicationSubmissionContextPBImpl;
> ApplicationStateData appState =
> ApplicationStateData.newInstance(
> app.getSubmitTime(), app.getStartTime(), context, app.getUser());
> dispatcher.getEventHandler().handle(new RMStateStoreAppEvent(appState));
>   }
> {code}
> In thread B, the FileSystemRMStateStore is calling 
> storeApplicationStateInternal method. It's also synchronized.
> This storeApplicationStateInternal method saves an ApplicationStateData into 
> HDFS and it normally costs 90~300 milliseconds in my hadoop cluster.
> {code:title=FileSystemRMStateStore.java|borderStyle=solid}
> public synchronized void storeApplicationStateInternal(ApplicationId appId,
>   ApplicationStateData appStateDataPB) throws Exception {
> Path appDirPath = getAppDir(rmAppRoot, appId);
> mkdirsWithRetries(appDirPath);
> Path nodeCreatePath = getNodePath(appDirPath, appId.toString());
> LOG.info("Storing info for app: " + appId + " at: " + nodeCreatePath);
> byte[] appStateData = appStateDataPB.getProto().toByteArray();
> try {
>   // currently throw all exceptions. May need to respond differently for 
> HA
>   // based on whether we have lost the right to write to FS
>   writeFileWithRetries(nodeCreatePath, appStateData, true);
> } catch (Exception e) {
>   LOG.info("Error storing info for app: " + appId, e);
>   throw e;
> }
>   }
> {code}
> Suppose thread B enters the 
> FileSystemRMStateStore.storeApplicationStateInternal method first; then thread A 
> will be blocked for a while because of the synchronization. In the 
> ResourceManager there is only one RMStateStore instance; in my cluster it is of 
> type FileSystemRMStateStore.
> Debugging the RMAppNewlySavingTransition.transition method, the thread stack 
> shows it is called from the AsyncDispatcher.dispatch method. This method's code 
> is shown below. 
> {code:title=AsyncDispatcher.java|borderStyle=solid}
>   protected void dispatch(Event event) {
> //all events go thru this loop
> if (LOG.isDebugEnabled()) {
>   LOG.debug("Dispatching the event " + event.getClass().getName() + "."
>   + event.toString());
> }
> Class type = event.getType().getDeclaringClass();
> try{
>   EventHandler handler = eventDispatchers.get(type);
>   if(handler != null) {
> handler.handle(event);
>   } else {
> throw new Exception("No handler for registered for " + type);
>   }
> } catch (Throwable t) {
>   //TODO Maybe log the state of the queue
>   LOG.fatal("Error in dispatcher thread", t);
>   // If serviceStop is called, we should exit this thread gracefully.
>   if (exitOnDispatchException
>   && (ShutdownHookManager.get().isShutdownInProgress()) == false
>   && stopped == 

[jira] [Updated] (YARN-4164) Retrospect update ApplicationPriority API return type

2015-12-01 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-4164:

Attachment: 0003-YARN-4164.patch

Updated the patch to fix the review comment: null will not be returned; instead, 
the priority of the application will be returned.

> Retrospect update ApplicationPriority API return type
> -
>
> Key: YARN-4164
> URL: https://issues.apache.org/jira/browse/YARN-4164
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
> Attachments: 0001-YARN-4164.patch, 0002-YARN-4164.patch, 
> 0003-YARN-4164.patch
>
>
> Currently, the {{ApplicationClientProtocol#updateApplicationPriority()}} API 
> returns an empty UpdateApplicationPriorityResponse.
> But the RM updates the priority to cluster.max-priority if the given priority is 
> greater than cluster.max-priority. In this scenario, we need to communicate the 
> updated priority back to the client rather than keeping quiet, where the client 
> assumes that the given priority itself was taken.
> The same scenario can also happen during application submission, but I feel that 
> when ApplicationClientProtocol#updateApplicationPriority() is invoked explicitly, 
> the response should contain the updated priority. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4309) Add debug information to application logs when a container fails

2015-12-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033371#comment-15033371
 ] 

Hadoop QA commented on YARN-4309:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 9m 
12s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 37s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 13s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
29s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 33s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
39s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
56s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 33s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 7s 
{color} | {color:green} trunk passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
39s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 31s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 31s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 43s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 43s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 35s 
{color} | {color:red} Patch generated 2 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn (total was 357, now 356). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 48s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
43s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s 
{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
25s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 31s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 46s 
{color} | {color:green} the patch passed with JDK v1.7.0_85 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 24s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 58s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 49s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 28s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_85. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 23s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.7.0_85. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 11s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.7.0_85. {color} |
| 

[jira] [Commented] (YARN-3417) AM to be able to exit with a request saying "restart me with these (possibly updated) resource requirements"

2015-12-01 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033454#comment-15033454
 ] 

Steve Loughran commented on YARN-3417:
--

For Slider, we are coping with things today (and with fixed resource requirements) 
by doing an exit(-1) and restarting. This handles rolling upgrades provided the 
requirements don't change and the restart frequency is below that of the failure 
window.

So ... cut it from your todo list for now and we'll see where it ends up.

Thanks for offering to do it though, appreciated.


> AM to be able to exit with a request saying "restart me with these (possibly 
> updated) resource requirements"
> 
>
> Key: YARN-3417
> URL: https://issues.apache.org/jira/browse/YARN-3417
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 2.7.0
>Reporter: Steve Loughran
>Assignee: Varun Saxena
>Priority: Minor
>
> If an AM wants to reconfigure itself or restart with new resources, there's 
> no way to do this without the active participation of a client.
> It can call System.exit and rely on YARN to restart it -but that counts as a 
> failure and may lose the entire app. furthermore, that doesn't allow the AM 
> to resize itself.
> A simple exit-code to be interpreted as restart-without-failure could handle 
> the first case; an explicit call to indicate restart, including potentially 
> new resource/label requirements, could be more reliable, and certainly more 
> flexible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4328) Findbugs warning in resourcemanager in branch-2.7 and branch-2.6

2015-12-01 Thread Akira AJISAKA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira AJISAKA updated YARN-4328:

Attachment: YARN-4328.branch-2.6.00.patch

> Findbugs warning in resourcemanager in branch-2.7 and branch-2.6
> 
>
> Key: YARN-4328
> URL: https://issues.apache.org/jira/browse/YARN-4328
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1, 2.6.2
>Reporter: Varun Saxena
>Assignee: Akira AJISAKA
>Priority: Minor
> Attachments: YARN-4328.branch-2.6.00.patch, 
> YARN-4328.branch-2.7.00.patch
>
>
> This issue exists in both branch-2.7 and branch-2.6
> {noformat}
>  classname='org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKSyncOperationCallback'>
>  category='PERFORMANCE' message='Should 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKSyncOperationCallback
>  be a _static_ inner class?' lineNumber='118'/>
> {noformat}
> Below issue exists only in branch-2.6
> {noformat}
>  classname='org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt'>
>  category='MT_CORRECTNESS' message='Inconsistent synchronization of 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.queue;
>  locked 57% of time' lineNumber='261'/>
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4328) Findbugs warning in resourcemanager in branch-2.7 and branch-2.6

2015-12-01 Thread Akira AJISAKA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033309#comment-15033309
 ] 

Akira AJISAKA commented on YARN-4328:
-

I'm thinking it's okay to add a section to findbugs-exclude.xml to suppress the 
warning below.
{noformat}
{noformat}

> Findbugs warning in resourcemanager in branch-2.7 and branch-2.6
> 
>
> Key: YARN-4328
> URL: https://issues.apache.org/jira/browse/YARN-4328
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1, 2.6.2
>Reporter: Varun Saxena
>Assignee: Akira AJISAKA
>Priority: Minor
> Attachments: YARN-4328.branch-2.7.00.patch
>
>
> This issue exists in both branch-2.7 and branch-2.6
> {noformat}
>  classname='org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKSyncOperationCallback'>
>  category='PERFORMANCE' message='Should 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKSyncOperationCallback
>  be a _static_ inner class?' lineNumber='118'/>
> {noformat}
> Below issue exists only in branch-2.6
> {noformat}
>  classname='org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt'>
>  category='MT_CORRECTNESS' message='Inconsistent synchronization of 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.queue;
>  locked 57% of time' lineNumber='261'/>
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3456) Improve handling of incomplete TimelineEntities

2015-12-01 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033486#comment-15033486
 ] 

Rohith Sharma K S commented on YARN-3456:
-

+1 lgtm.. 

> Improve handling of incomplete TimelineEntities
> ---
>
> Key: YARN-3456
> URL: https://issues.apache.org/jira/browse/YARN-3456
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Steve Loughran
>Assignee: Varun Saxena
>Priority: Minor
> Attachments: YARN-3456.01.patch, YARN-3456.02.patch
>
>
> If an incomplete TimelineEntity is posted, it isn't checked client side ... 
> it gets all the way to the far end before triggering an NPE in the store.
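A minimal sketch of the kind of client-side check being discussed, assuming the ATS v1 {{TimelineEntity}} API; the actual patch may validate more fields and surface the error differently.
{code}
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.exceptions.YarnException;

public class TimelineEntityCheck {
  // Reject obviously incomplete entities on the client instead of letting
  // them travel all the way to the store and trigger an NPE there.
  public static void validate(TimelineEntity entity) throws YarnException {
    if (entity == null
        || entity.getEntityId() == null
        || entity.getEntityType() == null) {
      throw new YarnException(
          "Incomplete TimelineEntity: entityId and entityType are required");
    }
  }
}
{code}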



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3456) Improve handling of incomplete TimelineEntities

2015-12-01 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033488#comment-15033488
 ] 

Rohith Sharma K S commented on YARN-3456:
-

I will wait for a day before committing it.

> Improve handling of incomplete TimelineEntities
> ---
>
> Key: YARN-3456
> URL: https://issues.apache.org/jira/browse/YARN-3456
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Steve Loughran
>Assignee: Varun Saxena
>Priority: Minor
> Attachments: YARN-3456.01.patch, YARN-3456.02.patch
>
>
> If an incomplete TimelineEntity is posted, it isn't checked client side ... 
> it gets all the way to the far end before triggering an NPE in the store.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3862) Support for fetching specific configs and metrics based on prefixes

2015-12-01 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3862:
---
Attachment: YARN-3862-feature-YARN-2928.05.patch

> Support for fetching specific configs and metrics based on prefixes
> ---
>
> Key: YARN-3862
> URL: https://issues.apache.org/jira/browse/YARN-3862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: YARN-2928
>Reporter: Varun Saxena
>Assignee: Varun Saxena
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-3862-YARN-2928.wip.01.patch, 
> YARN-3862-YARN-2928.wip.02.patch, YARN-3862-feature-YARN-2928.04.patch, 
> YARN-3862-feature-YARN-2928.05.patch, YARN-3862-feature-YARN-2928.wip.03.patch
>
>
> Currently, we will retrieve all the contents of the field if that field is 
> specified in the query API. In case of configs and metrics, this can become a 
> lot of data even though the user doesn't need it. So we need to provide a way 
> to query only a set of configs or metrics.
> As a comma separated list of configs/metrics to be returned will be quite 
> cumbersome to specify, we have to support one of the following options:
> # Prefix match
> # Regex
> # Group the configs/metrics and query that group.
> We also need a facility to specify a metric time window to return metrics in 
> that window. This may be useful for plotting graphs.
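As a rough illustration of the first option (prefix match), the sketch below filters a config map by a set of prefixes. It is plain Java, not taken from any YARN-3862 patch, and the config keys used in it are only examples.
{code}
import java.util.HashMap;
import java.util.Map;

public class PrefixFilterSketch {
  // Return only the configs whose keys start with one of the given prefixes.
  public static Map<String, String> filterByPrefix(
      Map<String, String> configs, String... prefixes) {
    Map<String, String> matched = new HashMap<>();
    for (Map.Entry<String, String> e : configs.entrySet()) {
      for (String prefix : prefixes) {
        if (e.getKey().startsWith(prefix)) {
          matched.put(e.getKey(), e.getValue());
          break;
        }
      }
    }
    return matched;
  }

  public static void main(String[] args) {
    Map<String, String> configs = new HashMap<>();
    configs.put("mapreduce.map.memory.mb", "1024");
    configs.put("mapreduce.reduce.memory.mb", "2048");
    configs.put("yarn.app.mapreduce.am.resource.mb", "1536");
    // Ask only for the "mapreduce." configs instead of the whole set.
    System.out.println(filterByPrefix(configs, "mapreduce."));
  }
}
{code}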



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4292) ResourceUtilization should be a part of NodeInfo REST API

2015-12-01 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-4292:
--
Attachment: 0005-YARN-4292.patch

Thank you [~leftnoteasy] for the comments. Yes, I agree with your suggestion. 
Updating the patch now; kindly help to review it.

> ResourceUtilization should be a part of NodeInfo REST API
> -
>
> Key: YARN-4292
> URL: https://issues.apache.org/jira/browse/YARN-4292
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Sunil G
> Attachments: 0001-YARN-4292.patch, 0002-YARN-4292.patch, 
> 0003-YARN-4292.patch, 0004-YARN-4292.patch, 0005-YARN-4292.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4402) TestNodeManagerShutdown And TestNodeManagerResync fails with bind exception

2015-12-01 Thread Brahma Reddy Battula (JIRA)
Brahma Reddy Battula created YARN-4402:
--

 Summary: TestNodeManagerShutdown And TestNodeManagerResync fails 
with bind exception
 Key: YARN-4402
 URL: https://issues.apache.org/jira/browse/YARN-4402
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Brahma Reddy Battula
Assignee: Brahma Reddy Battula



https://builds.apache.org/job/Hadoop-Yarn-trunk/1465/testReport/

{noformat}
2015-12-01 04:56:07,150 INFO  [main] http.HttpServer2 
(HttpServer2.java:start(846)) - HttpServer.start() threw a non Bind IOException
java.net.BindException: Port in use: 0.0.0.0:8042
at 
org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:906)
at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:843)
at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:306)
at 
org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer.serviceStart(WebServer.java:73)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at 
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:368)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at 
org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerResync.testContainerPreservationOnResyncImpl(TestNodeManagerResync.java:164)
at 
org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerResync.testKillContainersOnResync(TestNodeManagerResync.java:141)
{noformat}
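A common way to avoid this kind of bind failure in tests is to stop depending on the fixed default web port (8042) and bind to an ephemeral port instead. The sketch below only shows that idea and is not necessarily what the attached patch does.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class EphemeralWebPortSketch {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    // Port 0 asks the OS for any free port, so concurrent test runs
    // no longer collide on the default 0.0.0.0:8042.
    conf.set(YarnConfiguration.NM_WEBAPP_ADDRESS, "0.0.0.0:0");
    System.out.println("NM webapp address: "
        + conf.get(YarnConfiguration.NM_WEBAPP_ADDRESS));
  }
}
{code}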



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout

2015-12-01 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033915#comment-15033915
 ] 

Tsuyoshi Ozawa commented on YARN-4348:
--

Ran the tests since the last Jenkins run failed to launch. The javadoc result 
seems to be a false positive since this patch doesn't include any javadoc changes.

{quote}
-1 overall.  

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

-1 javadoc.  The javadoc tool appears to have generated 2079 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

+1 findbugs.  The patch does not introduce any new Findbugs (version ) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.
{quote}

[~jianhe], could you take a look at the latest patch?

> ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of 
> zkSessionTimeout
> 
>
> Key: YARN-4348
> URL: https://issues.apache.org/jira/browse/YARN-4348
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.2, 2.6.2
>Reporter: Tsuyoshi Ozawa
>Assignee: Tsuyoshi Ozawa
>Priority: Blocker
> Attachments: YARN-4348-branch-2.7.002.patch, 
> YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, 
> YARN-4348.001.patch, YARN-4348.001.patch, log.txt
>
>
> Jian mentioned that the current internal ZK configuration of ZKRMStateStore 
> can cause the following situation:
> 1. syncInternal times out, 
> 2. but the sync succeeds later on.
> We should use zkResyncWaitTime as the timeout value.
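A minimal sketch of the idea in the description, assuming syncInternal blocks on a latch until the asynchronous ZooKeeper sync callback fires. The names and the constant below are hypothetical; the point is only that the wait is bounded by a dedicated resync-wait interval rather than the ZK session timeout.
{code}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class SyncWaitSketch {
  // Hypothetical value; the real interval would come from configuration.
  private static final long ZK_RESYNC_WAIT_MS = 10_000L;

  // Wait for the async sync() callback, bounded by the resync-wait interval
  // instead of the ZK session timeout.
  static boolean waitForSync(CountDownLatch syncLatch)
      throws InterruptedException {
    return syncLatch.await(ZK_RESYNC_WAIT_MS, TimeUnit.MILLISECONDS);
  }

  public static void main(String[] args) throws InterruptedException {
    CountDownLatch latch = new CountDownLatch(1);
    latch.countDown(); // pretend the ZooKeeper sync callback has completed
    System.out.println("synced = " + waitForSync(latch));
  }
}
{code}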



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4402) TestNodeManagerShutdown And TestNodeManagerResync fails with bind exception

2015-12-01 Thread Brahma Reddy Battula (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brahma Reddy Battula updated YARN-4402:
---
Attachment: YARN-4402.patch

> TestNodeManagerShutdown And TestNodeManagerResync fails with bind exception
> ---
>
> Key: YARN-4402
> URL: https://issues.apache.org/jira/browse/YARN-4402
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Brahma Reddy Battula
>Assignee: Brahma Reddy Battula
> Attachments: YARN-4402.patch
>
>
> https://builds.apache.org/job/Hadoop-Yarn-trunk/1465/testReport/
> {noformat}
> 2015-12-01 04:56:07,150 INFO  [main] http.HttpServer2 
> (HttpServer2.java:start(846)) - HttpServer.start() threw a non Bind 
> IOException
> java.net.BindException: Port in use: 0.0.0.0:8042
>   at 
> org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:906)
>   at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:843)
>   at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:306)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer.serviceStart(WebServer.java:73)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:368)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerResync.testContainerPreservationOnResyncImpl(TestNodeManagerResync.java:164)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerResync.testKillContainersOnResync(TestNodeManagerResync.java:141)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout

2015-12-01 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033937#comment-15033937
 ] 

Junping Du commented on YARN-4348:
--

Hi [~ozawa] and [~jianhe], is this a blocker bug? Can we move it to 2.6.4?

> ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of 
> zkSessionTimeout
> 
>
> Key: YARN-4348
> URL: https://issues.apache.org/jira/browse/YARN-4348
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.2, 2.6.2
>Reporter: Tsuyoshi Ozawa
>Assignee: Tsuyoshi Ozawa
>Priority: Blocker
> Attachments: YARN-4348-branch-2.7.002.patch, 
> YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, 
> YARN-4348.001.patch, YARN-4348.001.patch, log.txt
>
>
> Jian mentioned that the current internal ZK configuration of ZKRMStateStore 
> can cause the following situation:
> 1. syncInternal times out, 
> 2. but the sync succeeds later on.
> We should use zkResyncWaitTime as the timeout value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1848) Persist ClusterMetrics across RM HA transitions

2015-12-01 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033756#comment-15033756
 ] 

Junping Du commented on YARN-1848:
--

Moving it to 2.6.4 as there has been no update on this JIRA for a while.

> Persist ClusterMetrics across RM HA transitions
> ---
>
> Key: YARN-1848
> URL: https://issues.apache.org/jira/browse/YARN-1848
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>
> Post YARN-1705, ClusterMetrics are reset on transition to standby. This is 
> acceptable as the metrics show statistics since the RM became active. 
> Users might want to see metrics since the cluster was first started.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1767) Windows: Allow a way for users to augment classpath of YARN daemons

2015-12-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-1767:
-
Target Version/s: 2.7.3, 2.6.4  (was: 2.6.3, 2.7.3)

> Windows: Allow a way for users to augment classpath of YARN daemons
> ---
>
> Key: YARN-1767
> URL: https://issues.apache.org/jira/browse/YARN-1767
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.3.0
>Reporter: Karthik Kambatla
>
> YARN-1429 adds a way to augment the classpath for *nix-based systems. Need 
> something similar for Windows. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2457) FairScheduler: Handle preemption to help starved parent queues

2015-12-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-2457:
-
Target Version/s: 2.7.3, 2.6.4  (was: 2.6.3, 2.7.3)

> FairScheduler: Handle preemption to help starved parent queues
> --
>
> Key: YARN-2457
> URL: https://issues.apache.org/jira/browse/YARN-2457
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.5.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>
> YARN-2395/YARN-2394 add preemption timeout and threshold per queue, but don't 
> check for parent queue starvation. 
> We need to check that. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3478) FairScheduler page not performed because different enum of YarnApplicationState and RMAppState

2015-12-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-3478:
-
Target Version/s: 2.7.3, 2.6.4  (was: 2.6.3, 2.7.3)

> FairScheduler page not performed because different enum of 
> YarnApplicationState and RMAppState 
> ---
>
> Key: YARN-3478
> URL: https://issues.apache.org/jira/browse/YARN-3478
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.6.0
>Reporter: Xu Chen
> Attachments: YARN-3478.1.patch, YARN-3478.2.patch, YARN-3478.3.patch, 
> screenshot-1.png
>
>
> Got exception from log 
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
> at 
> com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
> at 
> com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
> at 
> com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
> at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
> at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:79)
> at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
> at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
> at 
> com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:96)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1225)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.lib.DynamicUserWebFilter$DynamicUserFilter.doFilter(DynamicUserWebFilter.java:59)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> at 
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> at 
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> at org.mortbay.jetty.Server.handle(Server.java:326)
> at 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
> at 
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
> at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
> at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
> at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> at 
> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
> at 
> 
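The title points at a mismatch between the RM-internal RMAppState enum and the public YarnApplicationState enum. As a rough illustration of one typical way such a mismatch surfaces (a name-based conversion hitting a state that only exists on the internal side), the sketch below uses small hypothetical stand-in enums rather than the real YARN classes.
{code}
public class EnumMismatchSketch {
  // Hypothetical stand-ins for YarnApplicationState and RMAppState; the real
  // enums differ in the same way (the internal one has extra states).
  enum WebState { NEW, RUNNING, FINISHED }
  enum InternalState { NEW, RUNNING, FINAL_SAVING, FINISHED }

  public static void main(String[] args) {
    InternalState s = InternalState.FINAL_SAVING;
    try {
      // valueOf(name()) only works when every internal state has a
      // same-named public state, which is not the case here.
      System.out.println(WebState.valueOf(s.name()));
    } catch (IllegalArgumentException e) {
      // An explicit internal-to-public mapping avoids this failure mode.
      System.out.println("No public state for " + s);
    }
  }
}
{code}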

[jira] [Commented] (YARN-3478) FairScheduler page not performed because different enum of YarnApplicationState and RMAppState

2015-12-01 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033763#comment-15033763
 ] 

Junping Du commented on YARN-3478:
--

Moving it to 2.6.4 as there has been no update on this JIRA for a while.

> FairScheduler page not performed because different enum of 
> YarnApplicationState and RMAppState 
> ---
>
> Key: YARN-3478
> URL: https://issues.apache.org/jira/browse/YARN-3478
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.6.0
>Reporter: Xu Chen
> Attachments: YARN-3478.1.patch, YARN-3478.2.patch, YARN-3478.3.patch, 
> screenshot-1.png
>
>
> Got exception from log 
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
> at 
> com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
> at 
> com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
> at 
> com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
> at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
> at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:79)
> at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
> at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
> at 
> com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:96)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1225)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.lib.DynamicUserWebFilter$DynamicUserFilter.doFilter(DynamicUserWebFilter.java:59)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> at 
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> at 
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> at org.mortbay.jetty.Server.handle(Server.java:326)
> at 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
> at 
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
> at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
> at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
> at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> at 
> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
> 

[jira] [Commented] (YARN-1767) Windows: Allow a way for users to augment classpath of YARN daemons

2015-12-01 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033761#comment-15033761
 ] 

Junping Du commented on YARN-1767:
--

Moving it to 2.6.4 as there has been no update on this JIRA for a while.

> Windows: Allow a way for users to augment classpath of YARN daemons
> ---
>
> Key: YARN-1767
> URL: https://issues.apache.org/jira/browse/YARN-1767
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.3.0
>Reporter: Karthik Kambatla
>
> YARN-1429 adds a way to augment the classpath for *nix-based systems. Need 
> something similar for Windows. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1480) RM web services getApps() accepts many more filters than ApplicationCLI "list" command

2015-12-01 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033765#comment-15033765
 ] 

Junping Du commented on YARN-1480:
--

Moving it to 2.6.4 as there has been no update on this JIRA for a while.

> RM web services getApps() accepts many more filters than ApplicationCLI 
> "list" command
> --
>
> Key: YARN-1480
> URL: https://issues.apache.org/jira/browse/YARN-1480
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Kenji Kikushima
> Attachments: YARN-1480-2.patch, YARN-1480-3.patch, YARN-1480-4.patch, 
> YARN-1480-5.patch, YARN-1480-6.patch, YARN-1480.patch
>
>
> Nowadays RM web services getApps() accepts many more filters than the 
> ApplicationCLI "list" command, which only accepts "state" and "type". IMHO, 
> ideally, different interfaces should provide consistent functionality. Would it 
> be better to allow more filters in ApplicationCLI?
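For context, the sketch below is roughly the programmatic equivalent of what the CLI list command exposes today (type and state filters only), using the public {{YarnClient}} API; it is an illustration, not part of any YARN-1480 patch.
{code}
import java.util.Collections;
import java.util.EnumSet;
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListAppsSketch {
  public static void main(String[] args) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(new YarnConfiguration());
    client.start();
    try {
      // The same two filters the CLI "list" command exposes today:
      // application type and application state.
      List<ApplicationReport> apps = client.getApplications(
          Collections.singleton("MAPREDUCE"),
          EnumSet.of(YarnApplicationState.RUNNING));
      System.out.println("Running MAPREDUCE apps: " + apps.size());
    } finally {
      client.stop();
    }
  }
}
{code}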



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2037) Add restart support for Unmanaged AMs

2015-12-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-2037:
-
Target Version/s: 2.7.3, 2.6.4  (was: 2.6.3, 2.7.3)

> Add restart support for Unmanaged AMs
> -
>
> Key: YARN-2037
> URL: https://issues.apache.org/jira/browse/YARN-2037
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>
> It would be nice to allow Unmanaged AMs also to restart in a work-preserving 
> way. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2599) Standby RM should also expose some jmx and metrics

2015-12-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-2599:
-
Target Version/s: 2.7.3, 2.6.4  (was: 2.6.3, 2.7.3)

> Standby RM should also expose some jmx and metrics
> --
>
> Key: YARN-2599
> URL: https://issues.apache.org/jira/browse/YARN-2599
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1
>Reporter: Karthik Kambatla
>Assignee: Rohith Sharma K S
>
> YARN-1898 redirects jmx and metrics to the Active. As discussed there, we 
> need to separate out metrics displayed so the Standby RM can also be 
> monitored. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2457) FairScheduler: Handle preemption to help starved parent queues

2015-12-01 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033757#comment-15033757
 ] 

Junping Du commented on YARN-2457:
--

Moving it to 2.6.4 as there has been no update on this JIRA for a while.

> FairScheduler: Handle preemption to help starved parent queues
> --
>
> Key: YARN-2457
> URL: https://issues.apache.org/jira/browse/YARN-2457
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.5.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>
> YARN-2395/YARN-2394 add preemption timeout and threshold per queue, but don't 
> check for parent queue starvation. 
> We need to check that. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1480) RM web services getApps() accepts many more filters than ApplicationCLI "list" command

2015-12-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-1480:
-
Target Version/s: 2.7.3, 2.6.4  (was: 2.6.3, 2.7.3)

> RM web services getApps() accepts many more filters than ApplicationCLI 
> "list" command
> --
>
> Key: YARN-1480
> URL: https://issues.apache.org/jira/browse/YARN-1480
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Kenji Kikushima
> Attachments: YARN-1480-2.patch, YARN-1480-3.patch, YARN-1480-4.patch, 
> YARN-1480-5.patch, YARN-1480-6.patch, YARN-1480.patch
>
>
> Nowadays RM web services getApps() accepts many more filters than the 
> ApplicationCLI "list" command, which only accepts "state" and "type". IMHO, 
> ideally, different interfaces should provide consistent functionality. Would it 
> be better to allow more filters in ApplicationCLI?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2037) Add restart support for Unmanaged AMs

2015-12-01 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033771#comment-15033771
 ] 

Junping Du commented on YARN-2037:
--

Moving it to 2.6.4 as there has been no update on this JIRA for a while.

> Add restart support for Unmanaged AMs
> -
>
> Key: YARN-2037
> URL: https://issues.apache.org/jira/browse/YARN-2037
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Karthik Kambatla
>
> It would be nice to allow Unmanaged AMs also to restart in a work-preserving 
> way. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

