[jira] [Commented] (YARN-1969) Fair Scheduler: Add policy for Earliest Deadline First

2014-05-13 Thread Maysam Yabandeh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995996#comment-13995996
 ] 

Maysam Yabandeh commented on YARN-1969:
---

Talked to [~jira.shegalov] offline. This would indeed allow RM to make more 
efficient scheduling decisions. This seems to be a good candidate for phase 2 
of this jira.

 Fair Scheduler: Add policy for Earliest Deadline First
 --

 Key: YARN-1969
 URL: https://issues.apache.org/jira/browse/YARN-1969
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Maysam Yabandeh
Assignee: Maysam Yabandeh

 What we are observing is that some big jobs with many allocated containers 
 are waiting for a few containers to finish. Under *fair-share scheduling*, 
 however, they have a low priority, since other jobs (usually much smaller 
 newcomers) are using resources well below their fair share; hence newly 
 released containers are not offered to the big job even though it is close 
 to finishing. Nevertheless, everybody would benefit from an unfair 
 scheduling that offers the resources to the big job, since the sooner the 
 big job finishes, the sooner it releases its many allocated resources for 
 use by other jobs.
 In other words, what we need is a variation of *Earliest Deadline First 
 scheduling* that takes into account the number of already-allocated 
 resources and the estimated time to finish.
 http://en.wikipedia.org/wiki/Earliest_deadline_first_scheduling
 For example, if a job is using MEM GB of memory and is expected to finish in 
 TIME minutes, its scheduling priority would be a function p of (MEM, TIME). 
 The expected time to finish can be estimated by the AppMaster using 
 TaskRuntimeEstimator#estimatedRuntime and supplied to the RM in the resource 
 request messages. To be less susceptible to apps gaming the system, we can 
 limit this scheduling to *only within a queue*: i.e., add an 
 EarliestDeadlinePolicy that extends SchedulingPolicy and let queues use it 
 by setting the schedulingPolicy field.
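
 A minimal sketch of the comparator such a policy could return from 
 SchedulingPolicy#getComparator() (the EarliestFinishComparator name and the 
 getEstimatedFinishTime hook are illustrative only, not existing APIs):
{code}
import java.io.Serializable;
import java.util.Comparator;

// Illustrative sketch: order apps so that the one expected to finish earliest
// is offered freed resources first.
public class EarliestFinishComparator
    implements Comparator<Schedulable>, Serializable {

  @Override
  public int compare(Schedulable s1, Schedulable s2) {
    long t1 = getEstimatedFinishTime(s1);
    long t2 = getEstimatedFinishTime(s2);
    return t1 < t2 ? -1 : (t1 == t2 ? 0 : 1); // earlier finish wins
  }

  // Hypothetical accessor; a real patch would plumb the estimate (from the
  // AM's TaskRuntimeEstimator#estimatedRuntime) through the resource request
  // and app attempt rather than return a placeholder.
  private static long getEstimatedFinishTime(Schedulable s) {
    return 0L; // placeholder for the sketch
  }
}
{code}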



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2049) Delegation token stuff for the timeline server

2014-05-13 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-2049:
-

 Summary: Delegation token stuff for the timeline server
 Key: YARN-2049
 URL: https://issues.apache.org/jira/browse/YARN-2049
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins

2014-05-13 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G reassigned YARN-1408:
-

Assignee: Sunil G

 Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task 
 timeout for 30mins
 --

 Key: YARN-1408
 URL: https://issues.apache.org/jira/browse/YARN-1408
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.2.0
Reporter: Sunil G
Assignee: Sunil G
 Fix For: 2.5.0

 Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, 
 Yarn-1408.4.patch, Yarn-1408.patch


 Capacity preemption is enabled as follows:
  *  yarn.resourcemanager.scheduler.monitor.enable = true
  *  yarn.resourcemanager.scheduler.monitor.policies = 
 org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
 Queues = a, b
 Capacity of Queue A = 80%
 Capacity of Queue B = 20%
 Step 1: Assign a big jobA to queue a, which uses the full cluster capacity.
 Step 2: Submit a jobB to queue b, which would use less than 20% of the 
 cluster capacity.
 The jobA task that uses queue b's capacity is preempted and killed.
 This caused the following problem:
 1. A new container was allocated for jobA in queue a based on a node update 
 from an NM.
 2. This container was then immediately preempted.
 The ACQUIRED at KILLED invalid state exception occurred when the next AM 
 heartbeat reached the RM:
 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
 Can't handle this event at current state
 org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
 ACQUIRED at KILLED
 This also caused the task to time out after 30 minutes, since the container 
 had already been killed by preemption:
 attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs
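
 One possible direction, shown only as a sketch and not necessarily the 
 committed fix, is to tolerate the late ACQUIRED event in the RMContainer 
 state machine, i.e. add a no-op transition to the existing 
 StateMachineFactory chain in RMContainerImpl:
{code}
// Sketch only: swallow a late ACQUIRED event on an already-KILLED container
// instead of throwing InvalidStateTransitonException. This line would be
// added to the existing .addTransition(...) chain in RMContainerImpl; the
// actual fix in the attached patches may take a different approach.
.addTransition(RMContainerState.KILLED, RMContainerState.KILLED,
    RMContainerEventType.ACQUIRED)
{code}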



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2048) List all of the containers of an application from the yarn web

2014-05-13 Thread Min Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996083#comment-13996083
 ] 

Min Zhou commented on YARN-2048:


[~zjshen] hmm... from your patch,  it's indeed a duplicate. Please go ahead.

 List all of the containers of an application from the yarn web
 --

 Key: YARN-2048
 URL: https://issues.apache.org/jira/browse/YARN-2048
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager, webapp
Affects Versions: 2.3.0, 2.4.0, 2.5.0
Reporter: Min Zhou
 Attachments: YARN-2048-trunk-v1.patch


 Currently, YARN does not provide a way to list all of the containers of an 
 application from its web UI. This kind of information is needed by 
 application users, who want to conveniently see how many containers their 
 applications have already acquired, as well as which nodes those containers 
 were launched on. They also want to view the logs of each container of an 
 application.
 One approach is to maintain a container list in RMAppImpl and expose this 
 info on the Application page. I will submit a patch soon.
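
 A rough sketch of the proposed bookkeeping (the field and accessor names are 
 hypothetical; the actual patch may keep this on the attempt rather than on 
 RMAppImpl):
{code}
// Hypothetical sketch: RMAppImpl keeps the containers of the application
// keyed by ContainerId, and the Application web page iterates over the
// getter to render the container table (id, node, log link).
private final ConcurrentMap<ContainerId, RMContainer> containers =
    new ConcurrentHashMap<ContainerId, RMContainer>();

public Collection<RMContainer> getContainers() {
  return containers.values();
}
{code}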



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler

2014-05-13 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995819#comment-13995819
 ] 

Jian He commented on YARN-1368:
---

Hi [~adhoot], thanks for working on the FS changes. Can you please separate 
them out and upload them to YARN-1370 as the FS-specific change? I already 
have a local patch that changes quite a bit from the latest patch uploaded 
here, and YARN-2017 is likely to go in first, which again conflicts quite a 
bit with the patch here. 

 Common work to re-populate containers’ state into scheduler
 ---

 Key: YARN-1368
 URL: https://issues.apache.org/jira/browse/YARN-1368
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Jian He
 Attachments: YARN-1368.1.patch, YARN-1368.combined.001.patch, 
 YARN-1368.preliminary.patch


 YARN-1367 adds support for the NM to tell the RM about all currently running 
 containers upon registration. The RM needs to send this information to the 
 schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover 
 the current allocation state of the cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2032) Implement a scalable, available TimelineStore using HBase

2014-05-13 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created YARN-2032:
-

 Summary: Implement a scalable, available TimelineStore using HBase
 Key: YARN-2032
 URL: https://issues.apache.org/jira/browse/YARN-2032
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


As discussed on YARN-1530, we should pursue implementing a scalable, available 
Timeline store using HBase.

One goal is to reuse most of the code from the LevelDB-based store - YARN-1635.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1936) Secured timeline client

2014-05-13 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1936:
--

Attachment: YARN-1936.1.patch

I created a patch:

1. It makes use of the hadoop-auth module and YARN-2049 to talk to the timeline 
server with either Kerberos authentication or a delegation token.

2. It adds a main method, which allows users to upload timeline data in a JSON 
file from the command line.

3. When YarnClient is used to submit an application and authentication is 
enabled, YarnClient checks whether the app submission context already has the 
timeline DT. If not, it adds the DT to the context, so that when the AM uses 
TimelineClient, it can use the DT for authentication, since it cannot use 
Kerberos there.
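
A rough sketch of the check described in point 3, with several assumptions: 
appContext is the ApplicationSubmissionContext, TIMELINE_TOKEN_SERVICE is an 
illustrative Text alias for the token's service name, and 
getTimelineDelegationToken() stands in for however the client actually obtains 
the DT:
{code}
// Sketch only: ensure the AM container's credentials carry a timeline DT.
Credentials credentials = new Credentials();
ByteBuffer tokens = appContext.getAMContainerSpec().getTokens();
if (tokens != null) {
  DataInputByteBuffer dibb = new DataInputByteBuffer();
  dibb.reset(tokens.duplicate());
  credentials.readTokenStorageStream(dibb);   // load tokens already attached
}
if (UserGroupInformation.isSecurityEnabled()
    && credentials.getToken(TIMELINE_TOKEN_SERVICE) == null) {
  // getTimelineDelegationToken() is a hypothetical helper for this sketch.
  Token<? extends TokenIdentifier> timelineDt = getTimelineDelegationToken();
  credentials.addToken(TIMELINE_TOKEN_SERVICE, timelineDt);
  DataOutputBuffer dob = new DataOutputBuffer();
  credentials.writeTokenStorageToStream(dob);
  appContext.getAMContainerSpec().setTokens(
      ByteBuffer.wrap(dob.getData(), 0, dob.getLength()));
}
{code}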

 Secured timeline client
 ---

 Key: YARN-1936
 URL: https://issues.apache.org/jira/browse/YARN-1936
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-1936.1.patch


 TimelineClient should be able to talk to the timeline server with kerberos 
 authentication or delegation token



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers

2014-05-13 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996165#comment-13996165
 ] 

Wangda Tan commented on YARN-2017:
--

Hi Jian,
Thanks for your efforts on this patch. Some comments:

1) SchedulerNode.java
{code}
  private synchronized void deductAvailableResource(Resource resource) {
    if (resource == null) {
      LOG.error("Invalid deduction of null resource for "
          + rmNode.getNodeAddress());
{code}
Even though this is the original logic of SchedulerNode, I think it's better to 
throw an exception instead of merely printing an error log here; a null 
resource passed in should be considered a serious problem in the scheduler. 
The same applies to several following places in this class.

2) SchedulerNode.java
{code}
+  private synchronized boolean isValidContainer(Container c) {
+if (launchedContainers.containsKey(c.getId()))
+  return true;
+return false;
+  }
{code}
Better to add {...} after the if.

3) SchedulerNode.java
{code}
+  public synchronized RMContainer getReservedContainer() {
+return reservedContainer;
+  }
{code}
I think it's better to add a setReservedContainer(...) instead of manipulating 
super.reservedContainer in its subclasses, and to change the protected 
reservedContainer field to private (see the sketch after this list).

4) In YarnScheduler.java
{code}
+  /**
+   * Get the whole resource capacity of the cluster.
+   * @return the whole resource capacity of the cluster.
+   */
+  @LimitedPrivate("yarn")
+  @Unstable
+  public Resource getClusterResource();
{code}
I'm wondering whether it is worthwhile to merge this method too; it causes a 
lot of code changes, and I found no common logic (like 
SchedulerNode/SchedulerAppAttempt) that uses it.

5) In FairScheduler.java
{code}
+  protected FSSchedulerApp getCurrentAttemptForContainer(ContainerId 
containerId) {
+return (FSSchedulerApp) super.getCurrentAttemptForContainer(containerId);
}
{code}
I understand this is an adapter; I agree with [~sandyr] about using generics to 
eliminate such type casts.
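
A minimal sketch of the setter suggested in point 3, assuming the field stays 
in SchedulerNode and the existing getter is kept as-is:
{code}
// Sketch for point 3: keep reservedContainer private in SchedulerNode and let
// subclasses go through a synchronized setter instead of touching the field.
private RMContainer reservedContainer;

public synchronized RMContainer getReservedContainer() {
  return reservedContainer;
}

protected synchronized void setReservedContainer(RMContainer reservedContainer) {
  this.reservedContainer = reservedContainer;
}
{code}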

 Merge some of the common lib code in schedulers
 ---

 Key: YARN-2017
 URL: https://issues.apache.org/jira/browse/YARN-2017
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2017.1.patch


 A bunch of the same code is repeated among the schedulers, e.g. between 
 FicaSchedulerNode and FSSchedulerNode. It would be good to merge and share it 
 in a common base.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2014-05-13 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1902:
--

Target Version/s: 2.5.0  (was: 2.3.0)
  Labels: client  (was: patch)

 Allocation of too many containers when a second request is done with the same 
 resource capability
 -

 Key: YARN-1902
 URL: https://issues.apache.org/jira/browse/YARN-1902
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Affects Versions: 2.2.0, 2.3.0, 2.4.0
Reporter: Sietse T. Au
  Labels: client
 Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch


 Regarding AMRMClientImpl:
 Scenario 1:
 Given a ContainerRequest x with Resource y, when addContainerRequest is 
 called z times with x, allocate is called, and at least one of the z 
 allocated containers is started, then if another addContainerRequest call is 
 made and subsequently an allocate call to the RM, (z+1) containers will be 
 allocated, where 1 container is expected.
 Scenario 2:
 No containers are started between the allocate calls. 
 Analyzing debug logs of the AMRMClientImpl, I have found that indeed (z+1) 
 containers are requested in both scenarios, but that the correct behavior is 
 observed only in the second scenario.
 Looking at the implementation, I have found that this (z+1) request is caused 
 by the structure of the remoteRequestsTable. The consequence of the 
 Map<Resource, ResourceRequestInfo> structure is that ResourceRequestInfo does 
 not hold any information about whether a request has been sent to the RM yet 
 or not.
 There are workarounds for this, such as releasing the excess containers 
 received.
 The solution implemented is to initialize a new ResourceRequest in 
 ResourceRequestInfo when a request has been successfully sent to the RM.
 The patch includes a test in which scenario one is tested.
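
 A condensed illustration of scenario 1 against the AMRMClient API, assuming 
 amrmClient is an already-started AMRMClient<ContainerRequest> (z = 2 here; 
 response handling omitted):
{code}
// Scenario 1 in miniature: the same request object is added twice, allocate()
// runs and containers start, then one more addContainerRequest causes the next
// allocate() to ask for (z+1) = 3 containers instead of the expected 1.
ContainerRequest x = new ContainerRequest(
    Resource.newInstance(1024, 1), null, null, Priority.newInstance(0));

amrmClient.addContainerRequest(x);                 // z = 2 identical requests
amrmClient.addContainerRequest(x);
AllocateResponse r1 = amrmClient.allocate(0.1f);   // containers allocated/started

amrmClient.addContainerRequest(x);                 // one more request ...
AllocateResponse r2 = amrmClient.allocate(0.2f);   // ... but 3 end up requested
{code}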



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2048) List all of the containers of an application from the yarn web

2014-05-13 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996055#comment-13996055
 ] 

Zhijie Shen commented on YARN-2048:
---

Is this a duplicate of YARN-1809?

In YARN-1809, one change is to make the RM web UI list containers as well.

 List all of the containers of an application from the yarn web
 --

 Key: YARN-2048
 URL: https://issues.apache.org/jira/browse/YARN-2048
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager, webapp
Affects Versions: 2.3.0, 2.4.0, 2.5.0
Reporter: Min Zhou
 Attachments: YARN-2048-trunk-v1.patch


 Currently, YARN does not provide a way to list all of the containers of an 
 application from its web UI. This kind of information is needed by 
 application users, who want to conveniently see how many containers their 
 applications have already acquired, as well as which nodes those containers 
 were launched on. They also want to view the logs of each container of an 
 application.
 One approach is to maintain a container list in RMAppImpl and expose this 
 info on the Application page. I will submit a patch soon.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1515) Ability to dump the container threads and stop the containers in a single RPC

2014-05-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993762#comment-13993762
 ] 

Jason Lowe commented on YARN-1515:
--

I apologize for the long delay in reviewing and the resulting upmerge it 
caused.  The patch looks good to me, with just some minor comments:

- StopContainerRequest#getDumpThreads should have javadocs and interface 
annotations like the other methods
- Why is StopContainersRequest#getStopRequests marked Unstable but 
setStopRequests is Stable?
- Nit: dumpThreads is an event-specific field; it would be nice to have an 
AMLauncherCleanupEvent that takes just the app attempt in the constructor and 
derives from AMLauncherEvent.

 Ability to dump the container threads and stop the containers in a single RPC
 -

 Key: YARN-1515
 URL: https://issues.apache.org/jira/browse/YARN-1515
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: api, nodemanager
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, 
 YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, 
 YARN-1515.v06.patch


 This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for 
 timed-out task attempts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-322) Add cpu information to queue metrics

2014-05-13 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995093#comment-13995093
 ] 

Nathan Roberts commented on YARN-322:
-

Arun, does this patch address what you were looking for? Happy to adjust if not.

 Add cpu information to queue metrics
 

 Key: YARN-322
 URL: https://issues.apache.org/jira/browse/YARN-322
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler, scheduler
Affects Versions: 2.4.0
Reporter: Arun C Murthy
Assignee: Nathan Roberts
 Fix For: 2.5.0

 Attachments: YARN-322.patch, YARN-322.patch


 Post YARN-2 we need to add cpu information to queue metrics.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2048) List all of the containers of an application from the yarn web

2014-05-13 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995824#comment-13995824
 ] 

Wangda Tan commented on YARN-2048:
--

Hi [~coderplay], +1 for this idea; it should be very helpful for debugging YARN 
applications.
I took a look at your patch; some comments:
1) When the AM is restarted, all containers will be copied to the new attempt's 
container list, which might confuse users about why the new attempt has all the 
containers from old attempts.
2) You might need to consider the following JIRAs to make the app-containers 
page include all expected containers: YARN-556, YARN-1885, YARN-1489.

 List all of the containers of an application from the yarn web
 --

 Key: YARN-2048
 URL: https://issues.apache.org/jira/browse/YARN-2048
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager, webapp
Affects Versions: 2.3.0, 2.4.0, 2.5.0
Reporter: Min Zhou
 Attachments: YARN-2048-trunk-v1.patch


 Currently, YARN does not provide a way to list all of the containers of an 
 application from its web UI. This kind of information is needed by 
 application users, who want to conveniently see how many containers their 
 applications have already acquired, as well as which nodes those containers 
 were launched on. They also want to view the logs of each container of an 
 application.
 One approach is to maintain a container list in RMAppImpl and expose this 
 info on the Application page. I will submit a patch soon.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers

2014-05-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995826#comment-13995826
 ] 

Hadoop QA commented on YARN-2017:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12644502/YARN-2017.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 8 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 2 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-tools/hadoop-sls 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3740//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/3740//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3740//console

This message is automatically generated.

 Merge some of the common lib code in schedulers
 ---

 Key: YARN-2017
 URL: https://issues.apache.org/jira/browse/YARN-2017
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2017.1.patch


 A bunch of the same code is repeated among the schedulers, e.g. between 
 FicaSchedulerNode and FSSchedulerNode. It would be good to merge and share it 
 in a common base.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1969) Fair Scheduler: Add policy for Earliest Deadline First

2014-05-13 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996101#comment-13996101
 ] 

Karthik Kambatla commented on YARN-1969:


I am a little confused. Is the original intention of the JIRA to do Earliest 
Endtime First, and *not* Earliest Deadline First? 

 Fair Scheduler: Add policy for Earliest Deadline First
 --

 Key: YARN-1969
 URL: https://issues.apache.org/jira/browse/YARN-1969
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Maysam Yabandeh
Assignee: Maysam Yabandeh

 What we are observing is that some big jobs with many allocated containers 
 are waiting for a few containers to finish. Under *fair-share scheduling*, 
 however, they have a low priority, since other jobs (usually much smaller 
 newcomers) are using resources well below their fair share; hence newly 
 released containers are not offered to the big job even though it is close 
 to finishing. Nevertheless, everybody would benefit from an unfair 
 scheduling that offers the resources to the big job, since the sooner the 
 big job finishes, the sooner it releases its many allocated resources for 
 use by other jobs.
 In other words, what we need is a variation of *Earliest Deadline First 
 scheduling* that takes into account the number of already-allocated 
 resources and the estimated time to finish.
 http://en.wikipedia.org/wiki/Earliest_deadline_first_scheduling
 For example, if a job is using MEM GB of memory and is expected to finish in 
 TIME minutes, its scheduling priority would be a function p of (MEM, TIME). 
 The expected time to finish can be estimated by the AppMaster using 
 TaskRuntimeEstimator#estimatedRuntime and supplied to the RM in the resource 
 request messages. To be less susceptible to apps gaming the system, we can 
 limit this scheduling to *only within a queue*: i.e., add an 
 EarliestDeadlinePolicy that extends SchedulingPolicy and let queues use it 
 by setting the schedulingPolicy field.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-998) Persistent resource change during NM/RM restart

2014-05-13 Thread Kenji Kikushima (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenji Kikushima updated YARN-998:
-

Attachment: YARN-998-sample.patch

The attached file is a sample implementation that persists ResourceOption on 
the RM. Please refer to it if you are interested. I'm okay with leaving it 
until the proper timing. Thanks.
- This patch needs YARN-1911.patch to avoid an NPE.
- Sorry, this patch supports RM restart only.
- It could affect scalability on large clusters because it uses XML to persist 
ResourceOption.

 Persistent resource change during NM/RM restart
 ---

 Key: YARN-998
 URL: https://issues.apache.org/jira/browse/YARN-998
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager, scheduler
Reporter: Junping Du
Assignee: Junping Du
 Attachments: YARN-998-sample.patch


 When the NM is restarted, whether planned or after a failure, the previous 
 dynamic resource setting should be kept for consistency.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2048) List all of the containers of an application from the yarn web

2014-05-13 Thread Min Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996126#comment-13996126
 ] 

Min Zhou commented on YARN-2048:


[~zjshen] Currently, the only implementation of ApplicationContext is 
ApplicationHistoryManagerImpl, which retrieves container information from the 
history store. 
Questions:
# How do you fetch the container info from a history server and display it on 
the RM web UI? 
# If the information comes from the history store, it seems the RM won't get 
that kind of info until the application is done? Sometimes a user's 
application might be a long-lived application that never finishes unless the 
user kills it.
# It seems the only way to provide container info to the RM is to maintain a 
list in RMAppAttemptImpl, which was my approach as well.

Min

 List all of the containers of an application from the yarn web
 --

 Key: YARN-2048
 URL: https://issues.apache.org/jira/browse/YARN-2048
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager, webapp
Affects Versions: 2.3.0, 2.4.0, 2.5.0
Reporter: Min Zhou
 Attachments: YARN-2048-trunk-v1.patch


 Currently, YARN does not provide a way to list all of the containers of an 
 application from its web UI. This kind of information is needed by 
 application users, who want to conveniently see how many containers their 
 applications have already acquired, as well as which nodes those containers 
 were launched on. They also want to view the logs of each container of an 
 application.
 One approach is to maintain a container list in RMAppImpl and expose this 
 info on the Application page. I will submit a patch soon.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2048) List all of the containers of an application from the yarn web

2014-05-13 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996062#comment-13996062
 ] 

Wangda Tan commented on YARN-2048:
--

Hi [~zjshen], I think these two JIRAs cover similar issues. I took a quick 
look at your patch and found that you haven't changed 
RMAppAttempt/RMApp/RMContainer, so could you please elaborate a little on what 
you did to get the containers of an application? I ask because I'm thinking 
[~coderplay]'s patch can be considered complementary to your solution if you 
haven't implemented that part. 

 List all of the containers of an application from the yarn web
 --

 Key: YARN-2048
 URL: https://issues.apache.org/jira/browse/YARN-2048
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager, webapp
Affects Versions: 2.3.0, 2.4.0, 2.5.0
Reporter: Min Zhou
 Attachments: YARN-2048-trunk-v1.patch


 Currently, YARN does not provide a way to list all of the containers of an 
 application from its web UI. This kind of information is needed by 
 application users, who want to conveniently see how many containers their 
 applications have already acquired, as well as which nodes those containers 
 were launched on. They also want to view the logs of each container of an 
 application.
 One approach is to maintain a container list in RMAppImpl and expose this 
 info on the Application page. I will submit a patch soon.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers

2014-05-13 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995836#comment-13995836
 ] 

Jian He commented on YARN-2017:
---

The findbugs warning should not be a problem, will suppress it.

 Merge some of the common lib code in schedulers
 ---

 Key: YARN-2017
 URL: https://issues.apache.org/jira/browse/YARN-2017
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2017.1.patch


 A bunch of the same code is repeated among the schedulers, e.g. between 
 FicaSchedulerNode and FSSchedulerNode. It would be good to merge and share it 
 in a common base.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2048) List all of the containers of an application from the yarn web

2014-05-13 Thread Min Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995863#comment-13995863
 ] 

Min Zhou commented on YARN-2048:


Thanks [~leftnoteasy],

I will take a look at the things you mentioned and resubmit another patch.

 List all of the containers of an application from the yarn web
 --

 Key: YARN-2048
 URL: https://issues.apache.org/jira/browse/YARN-2048
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager, webapp
Affects Versions: 2.3.0, 2.4.0, 2.5.0
Reporter: Min Zhou
 Attachments: YARN-2048-trunk-v1.patch


 Currently, YARN does not provide a way to list all of the containers of an 
 application from its web UI. This kind of information is needed by 
 application users, who want to conveniently see how many containers their 
 applications have already acquired, as well as which nodes those containers 
 were launched on. They also want to view the logs of each container of an 
 application.
 One approach is to maintain a container list in RMAppImpl and expose this 
 info on the Application page. I will submit a patch soon.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1976) Tracking url missing http protocol for FAILED application

2014-05-13 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995930#comment-13995930
 ] 

Junping Du commented on YARN-1976:
--

Thanks [~jianhe] for review and commit!

 Tracking url missing http protocol for FAILED application
 -

 Key: YARN-1976
 URL: https://issues.apache.org/jira/browse/YARN-1976
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Yesha Vora
Assignee: Junping Du
 Fix For: 2.4.1

 Attachments: YARN-1976-v2.patch, YARN-1976.patch


 Run yarn application -list -appStates FAILED; it does not print the http 
 protocol prefix the way it does for FINISHED apps.
 {noformat}
 -bash-4.1$ yarn application -list -appStates FINISHED,FAILED,KILLED
 14/04/15 23:55:07 INFO client.RMProxy: Connecting to ResourceManager at host
 Total number of applications (application-types: [] and states: [FINISHED, 
 FAILED, KILLED]):4
 Application-IdApplication-Name
 Application-Type  User   Queue   State
  Final-State ProgressTracking-URL
 application_1397598467870_0004   Sleep job   
 MAPREDUCEhrt_qa defaultFINISHED   
 SUCCEEDED 100% 
 http://host:19888/jobhistory/job/job_1397598467870_0004
 application_1397598467870_0003   Sleep job   
 MAPREDUCEhrt_qa defaultFINISHED   
 SUCCEEDED 100% 
 http://host:19888/jobhistory/job/job_1397598467870_0003
 application_1397598467870_0002   Sleep job   
 MAPREDUCEhrt_qa default  FAILED   
FAILED 100% 
 host:8088/cluster/app/application_1397598467870_0002
 application_1397598467870_0001  word count   
 MAPREDUCEhrt_qa defaultFINISHED   
 SUCCEEDED 100% 
 http://host:19888/jobhistory/job/job_1397598467870_0001
 {noformat}
 It only prints 'host:8088/cluster/app/application_1397598467870_0002' instead 
 of 'http://host:8088/cluster/app/application_1397598467870_0002'. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-05-13 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993895#comment-13993895
 ] 

Karthik Kambatla commented on YARN-556:
---

For the scheduler-related work itself, the offline sync-up concluded it would 
be best to move as much common code as possible to AbstractYarnScheduler. To 
unblock the restart work as early as possible, we should do it in two phases: 
a first phase that only pulls out the pieces that make it easier to handle 
recovery, and a more comprehensive re-jig later.

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2032) Implement a scalable, available TimelineStore using HBase

2014-05-13 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993864#comment-13993864
 ] 

Vinod Kumar Vavilapalli commented on YARN-2032:
---

Sure, as I can see, you already took it over while the mailing list is down :)

 Implement a scalable, available TimelineStore using HBase
 -

 Key: YARN-2032
 URL: https://issues.apache.org/jira/browse/YARN-2032
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Mayank Bansal

 As discussed on YARN-1530, we should pursue implementing a scalable, 
 available Timeline store using HBase.
 One goal is to reuse most of the code from the LevelDB-based store - 
 YARN-1635.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1922) Process group remains alive after container process is killed externally

2014-05-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993821#comment-13993821
 ] 

Hadoop QA commented on YARN-1922:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12644136/YARN-1922.3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3727//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3727//console

This message is automatically generated.

 Process group remains alive after container process is killed externally
 

 Key: YARN-1922
 URL: https://issues.apache.org/jira/browse/YARN-1922
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.4.0
 Environment: CentOS 6.4
Reporter: Billie Rinaldi
Assignee: Billie Rinaldi
 Attachments: YARN-1922.1.patch, YARN-1922.2.patch, YARN-1922.3.patch


 If the main container process is killed externally, ContainerLaunch does not 
 kill the rest of the process group.  Before sending the event that results in 
 the ContainerLaunch.containerCleanup method being called, ContainerLaunch 
 sets the completed flag to true.  Then when cleaning up, it doesn't try to 
 read the pid file if the completed flag is true.  If it read the pid file, it 
 would proceed to send the container a kill signal.  In the case of the 
 DefaultContainerExecutor, this would kill the process group.
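
 A paraphrased sketch of the cleanup flow described above (not the actual 
 ContainerLaunch source; method names are simplified and the pid-file handling 
 is condensed):
{code}
// Paraphrased: today the pid file is only consulted when the container has
// not been marked completed, so an externally-killed main process
// (completed == true) never gets its leftover process group signalled.
void containerCleanup() throws IOException {
  if (completed.get()) {
    return;                          // skips pid lookup; process group survives
  }
  String pid = getContainerPid();    // simplified stand-in for the pid-file read
  if (pid != null) {
    // With DefaultContainerExecutor this signals the whole process group.
    exec.signalContainer(user, pid, Signal.TERM);
  }
}
{code}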



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2052) Container ID format and clustertimestamp for Work preserving restart

2014-05-13 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996355#comment-13996355
 ] 

Tsuyoshi OZAWA commented on YARN-2052:
--

We have two options for now:

1. Changing the container id format. ContainerId should be an opaque string 
that YARN app developers don't take a dependency on. 
2. Preserving the container id format. RM restart phase 2 should be 
transparent to YARN users.

 Container ID format and clustertimestamp for Work preserving restart
 

 Key: YARN-2052
 URL: https://issues.apache.org/jira/browse/YARN-2052
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA

 We've been discussing on YARN-556 and YARN-2001 whether the container id 
 format should be changed to include the cluster timestamp. This JIRA is for 
 continuing that discussion. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1362) Distinguish between nodemanager shutdown for decommission vs shutdown for restart

2014-05-13 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996391#comment-13996391
 ] 

Junping Du commented on YARN-1362:
--

Thanks for the patch, [~jlowe]! I just started to work on rolling upgrade, so I 
need to understand the work-preserving behavior in NM restart.
One question here: do we expect that every shutdown operation on the NM hints 
that it will be restarted soon? If not, the work will still be preserved, since 
isDecommissioned is set to false by default. Is that the behavior we expect? 

 Distinguish between nodemanager shutdown for decommission vs shutdown for 
 restart
 -

 Key: YARN-1362
 URL: https://issues.apache.org/jira/browse/YARN-1362
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Attachments: YARN-1362.patch


 When a nodemanager shuts down it needs to determine whether it is likely to 
 be restarted.  If a restart is likely then it needs to preserve container 
 directories, logs, distributed cache entries, etc.  If it is being shut down 
 more permanently (e.g. a decommission) then the nodemanager should clean up 
 its directories and logs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2048) List all of the containers of an application from the yarn web

2014-05-13 Thread Min Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Zhou updated YARN-2048:
---

Description: 
Currently, YARN does not provide a way to list all of the containers of an 
application from its web UI. This kind of information is needed by application 
users, who want to conveniently see how many containers their applications 
have already acquired, as well as which nodes those containers were launched 
on. They also want to view the logs of each container of an application.

One approach is to maintain a container list in RMAppImpl and expose this info 
on the Application page. I will submit a patch soon.

  was:
Currently, YARN does not provide a way to list all of the containers of an 
application from its web UI. This kind of information is needed by application 
users, who want to conveniently see how many containers their applications 
have already acquired, as well as which node those containers were launched 
on. They also want to view the logs of each container of an application.

One approach is to maintain a container list in RMAppImpl and expose this info 
on the Application page. I will submit a patch soon.


 List all of the containers of an application from the yarn web
 --

 Key: YARN-2048
 URL: https://issues.apache.org/jira/browse/YARN-2048
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager, webapp
Reporter: Min Zhou

 Currently, YARN does not provide a way to list all of the containers of an 
 application from its web UI. This kind of information is needed by 
 application users, who want to conveniently see how many containers their 
 applications have already acquired, as well as which nodes those containers 
 were launched on. They also want to view the logs of each container of an 
 application.
 One approach is to maintain a container list in RMAppImpl and expose this 
 info on the Application page. I will submit a patch soon.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-182) Unnecessary Container killed by the ApplicationMaster message for successful containers

2014-05-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996398#comment-13996398
 ] 

Jason Lowe commented on YARN-182:
-

bq. In my case the reducers were moved to COMPLETED state after 22 mins, they 
had reached 100% progress at 15 mins.

Having progress reach 100% but the task not completing for 7 more minutes is an 
unrelated issue.  Check your reducer logs and/or the input format which is 
responsible for setting the progress.  This is probably a question better 
suited for the u...@hadoop.apache.org mailing list.

 Unnecessary Container killed by the ApplicationMaster message for 
 successful containers
 -

 Key: YARN-182
 URL: https://issues.apache.org/jira/browse/YARN-182
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.1-alpha
Reporter: zhengqiu cai
Assignee: Omkar Vinit Joshi
  Labels: hadoop, usability
 Attachments: Log.txt


 I was running wordcount and the ResourceManager web UI showed the status as 
 FINISHED SUCCEEDED, but the log showed Container killed by the 
 ApplicationMaster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1362) Distinguish between nodemanager shutdown for decommission vs shutdown for restart

2014-05-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996395#comment-13996395
 ] 

Jason Lowe commented on YARN-1362:
--

Yes, that's the intended behavior.  If ops is shutting down the NM and not 
expecting it to return anytime soon then it should be decommissioned from the 
RM.

 Distinguish between nodemanager shutdown for decommission vs shutdown for 
 restart
 -

 Key: YARN-1362
 URL: https://issues.apache.org/jira/browse/YARN-1362
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Attachments: YARN-1362.patch


 When a nodemanager shuts down it needs to determine whether it is likely to 
 be restarted.  If a restart is likely then it needs to preserve container 
 directories, logs, distributed cache entries, etc.  If it is being shut down 
 more permanently (e.g. a decommission) then the nodemanager should clean up 
 its directories and logs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2052) Container ID format and clustertimestamp for Work preserving restart

2014-05-13 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996400#comment-13996400
 ] 

Tsuyoshi OZAWA commented on YARN-2052:
--

One discussion point is how much YARN apps and cluster management systems 
(e.g. Apache Ambari) currently depend on the container id format. For example, 
MRv2 uses utility methods like ConverterUtils.toContainerId(containerIdStr) 
provided in org.apache.hadoop.yarn.util.
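
For reference, the kind of round-trip such callers rely on (the id value is 
illustrative):
{code}
// Callers parse container id strings back into ContainerId objects, so any
// format change (e.g. adding an epoch) affects them.
String idStr = "container_1397598467870_0004_01_000002";  // illustrative value
ContainerId cid = ConverterUtils.toContainerId(idStr);     // parse
String back = ConverterUtils.toString(cid);                // format again
assert idStr.equals(back);
{code}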

 Container ID format and clustertimestamp for Work preserving restart
 

 Key: YARN-2052
 URL: https://issues.apache.org/jira/browse/YARN-2052
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA

 We've been discussing on YARN-556 and YARN-2001 whether the container id 
 format should be changed to include the cluster timestamp. This JIRA is for 
 continuing that discussion. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-570) Time strings are formated in different timezone

2014-05-13 Thread Akira AJISAKA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira AJISAKA updated YARN-570:
---

Assignee: Akira AJISAKA  (was: PengZhang)

Assigned to myself. [~peng.zhang], feel free to reassign if you want to work on it.

 Time strings are formated in different timezone
 ---

 Key: YARN-570
 URL: https://issues.apache.org/jira/browse/YARN-570
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: PengZhang
Assignee: Akira AJISAKA
 Attachments: MAPREDUCE-5141.patch


 Time strings on different pages are displayed in different timezones.
 If a time is rendered by renderHadoopDate() in yarn.dt.plugins.js, it appears 
 as Wed, 10 Apr 2013 08:29:56 GMT.
 If it is formatted by format() in yarn.util.Times, it appears as 10-Apr-2013 
 16:29:56.
 Same value, but different timezones.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart

2014-05-13 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996343#comment-13996343
 ] 

Tsuyoshi OZAWA commented on YARN-556:
-

Good point, Bikas. Created YARN-2052 for tracking container id discussion. 
[~adhoot], let's discuss there.

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: Work Preserving RM Restart.pdf, 
 WorkPreservingRestartPrototype.001.patch


 YARN-128 covered storing the state needed for the RM to recover critical 
 information. This umbrella jira will track changes needed to recover the 
 running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1362) Distinguish between nodemanager shutdown for decommission vs shutdown for restart

2014-05-13 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996502#comment-13996502
 ] 

Junping Du commented on YARN-1362:
--

That makes sense. The patch LGTM. 
Kicking off Jenkins again since the patch has been around for a while (but it 
still applies). Will commit it after Jenkins gives a +1.

 Distinguish between nodemanager shutdown for decommission vs shutdown for 
 restart
 -

 Key: YARN-1362
 URL: https://issues.apache.org/jira/browse/YARN-1362
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Attachments: YARN-1362.patch


 When a nodemanager shuts down it needs to determine whether it is likely to 
 be restarted.  If a restart is likely then it needs to preserve container 
 directories, logs, distributed cache entries, etc.  If it is being shut down 
 more permanently (e.g. a decommission) then the nodemanager should clean up 
 its directories and logs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1751) Improve MiniYarnCluster for log aggregation testing

2014-05-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996587#comment-13996587
 ] 

Jason Lowe commented on YARN-1751:
--

+1, committing this.

 Improve MiniYarnCluster for log aggregation testing
 ---

 Key: YARN-1751
 URL: https://issues.apache.org/jira/browse/YARN-1751
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Ming Ma
Assignee: Ming Ma
 Attachments: YARN-1751-trunk.patch, YARN-1751.patch


 MiniYarnCluster specifies an individual remote log aggregation root dir for 
 each NM. Test code that uses MiniYarnCluster won't be able to get the value 
 of the log aggregation root dir. The following code isn't necessary in 
 MiniYarnCluster:
   File remoteLogDir =
       new File(testWorkDir, MiniYARNCluster.this.getName()
           + "-remoteLogDir-nm-" + index);
   remoteLogDir.mkdir();
   config.set(YarnConfiguration.NM_REMOTE_APP_LOG_DIR,
       remoteLogDir.getAbsolutePath());
 In LogCLIHelpers.java, dumpAllContainersLogs should pass its conf object to 
 the FileContext.getFileContext() call.
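
 The suggested LogCLIHelpers change amounts to something like the following 
 sketch (assuming the class's Configuration is reachable via getConf()):
{code}
// Sketch: use the FileContext built from the configuration so the remote log
// dir settings are honored, instead of the default FileContext.
FileContext fc = FileContext.getFileContext(getConf());
// ... previously: FileContext.getFileContext();
{code}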



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1302) Add AHSDelegationTokenSecretManager for ApplicationHistoryProtocol

2014-05-13 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996621#comment-13996621
 ] 

Zhijie Shen commented on YARN-1302:
---

It looks like we don't need to implement a separate DT stack for a single daemon.

 Add AHSDelegationTokenSecretManager for ApplicationHistoryProtocol
 --

 Key: YARN-1302
 URL: https://issues.apache.org/jira/browse/YARN-1302
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 Like the ApplicationClientProtocol, ApplicationHistoryProtocol needs its own 
 security stack. We need to implement AHSDelegationTokenSecretManager, 
 AHSDelegationTokenIdentifier, AHSDelegationTokenSelector, and other analogs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken

2014-05-13 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996675#comment-13996675
 ] 

Bikas Saha commented on YARN-2052:
--

The RM identifier is effectively the epoch for the RM. We already use it in the 
NM to differentiate between allocations made by the old RM vs. the new RM. 
Using the appId in the container id prevents us from using this epoch number, 
since the appId cannot change across restarts for containers belonging to the 
same app. That would be backwards incompatible.
Another alternative would be to replace the monotonically increasing sequence 
number with a unique identifier like a UUID. But that is also incompatible.
Another alternative is to create another epoch number for the RM in addition to 
the cluster timestamp. The monotonically increasing sequence could be a 
combination (concatenation) of the new epoch number and the sequence number, 
e.g. container_XXX_1000 after epoch 1. When the epoch number is 0 we can drop 
it, and things look the same as today, e.g. container_XXX_000.
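
A toy illustration of the concatenation idea (the helper name and the 3-digit 
padding simply mirror the example above; this is not the eventual 
implementation):
{code}
// Toy sketch: epoch 0 keeps today's form ("000"), while a later epoch is
// prefixed onto the zero-padded sequence ("1000" for sequence 0 in epoch 1).
static String containerIdSuffix(int epoch, int sequence) {
  String seq = String.format("%03d", sequence);
  return epoch == 0 ? seq : epoch + seq;
}
{code}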

 ContainerId creation after work preserving restart is broken
 

 Key: YARN-2052
 URL: https://issues.apache.org/jira/browse/YARN-2052
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA

 Container ids are made unique by using the app identifier and appending a 
 monotonically increasing sequence number to it. Since container creation is a 
 high churn activity the RM does not store the sequence number per app. So 
 after restart it does not know what the new sequence number should be for new 
 allocations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2052) Container ID format and clustertimestamp for Work preserving restart

2014-05-13 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2052:
-

Summary: Container ID format and clustertimestamp for Work preserving 
restart  (was: ClusterId format and clustertimestamp)

 Container ID format and clustertimestamp for Work preserving restart
 

 Key: YARN-2052
 URL: https://issues.apache.org/jira/browse/YARN-2052
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA

 We've been discussing on YARN-556 and YARN-2001 whether the container id 
 format should be changed to include the cluster timestamp. This JIRA is for 
 continuing that discussion. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2048) List all of the containers of an application from the yarn web

2014-05-13 Thread Min Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996752#comment-13996752
 ] 

Min Zhou commented on YARN-2048:


Makes sense now; please go ahead.

 List all of the containers of an application from the yarn web
 --

 Key: YARN-2048
 URL: https://issues.apache.org/jira/browse/YARN-2048
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager, webapp
Affects Versions: 2.3.0, 2.4.0, 2.5.0
Reporter: Min Zhou
 Attachments: YARN-2048-trunk-v1.patch


 Currently, YARN does not provide a way to list all of the containers of an 
 application from its web UI. This kind of information is needed by 
 application users, who want to conveniently see how many containers their 
 applications have already acquired, as well as which nodes those containers 
 were launched on. They also want to view the logs of each container of an 
 application.
 One approach is to maintain a container list in RMAppImpl and expose this 
 info on the Application page. I will submit a patch soon.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2017) Merge some of the common lib code in schedulers

2014-05-13 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2017:
--

Attachment: YARN-2017.3.patch

 Merge some of the common lib code in schedulers
 ---

 Key: YARN-2017
 URL: https://issues.apache.org/jira/browse/YARN-2017
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2017.1.patch, YARN-2017.2.patch, YARN-2017.3.patch


 A bunch of the same code is repeated among the schedulers, e.g. between 
 FicaSchedulerNode and FSSchedulerNode. It would be good to merge and share it 
 in a common base.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers

2014-05-13 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996492#comment-13996492
 ] 

Tsuyoshi OZAWA commented on YARN-2017:
--

Thanks for your patch, Jian! Some comments:

1. +1 for using generics to avoid the casting warnings, as Sandy and Wangda 
mentioned.
2. Can we assert that node is non-null with Preconditions.checkNotNull() in 
SchedulerNode?
{code}
  public SchedulerNode(RMNode node, boolean usePortForNodeName) {
this.rmNode = node;
this.availableResource = Resources.clone(node.getTotalCapability());
this.totalResourceCapability = Resources.clone(node.getTotalCapability());
...
  }
{code}
3. Some lines exceed 80 characters.
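
A minimal sketch of the null-check suggestion in point 2, assuming Guava's 
Preconditions (already on Hadoop's classpath); the class is a simplified 
stand-in rather than the real SchedulerNode:
{code}
// Illustrative only: fail fast on a null node instead of hitting an NPE later.
import com.google.common.base.Preconditions;

class SchedulerNodeSketch {
  private final Object rmNode;

  SchedulerNodeSketch(Object node) {
    this.rmNode = Preconditions.checkNotNull(node, "node cannot be null");
  }
}
{code}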

 Merge some of the common lib code in schedulers
 ---

 Key: YARN-2017
 URL: https://issues.apache.org/jira/browse/YARN-2017
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2017.1.patch


 A bunch of the same code is repeated among schedulers, e.g. between 
 FicaSchedulerNode and FSSchedulerNode. It would be good to merge and share it 
 in a common base.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-13 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995700#comment-13995700
 ] 

Xuan Gong commented on YARN-1861:
-

bq. I tried to just apply the test-case and run it without the core change and 
was expecting the active RM to go to standby and the standby RM to go to active 
once the originally active RM is fenced. Instead I get a NPE somewhere. Can the 
test be fixed to do so?

In the testcase, I manually send an RMFatalEvent with 
RMFatalEventType.STATE_STORE_FENCED to the current active RM (rm1). That RM 
handles the event and transitions to standby. Both RMs are then in standby 
state, while ZK still thinks rm1 is active, so it does not trigger leader 
election. I think this mimics the behavior we described previously. Without 
the core code change this testcase fails: the NM tries to connect to the 
active RM, but neither of the two RMs is active, so the NPE is expected.

bq. Also, we need to make sure that when automatic failover is enabled, all 
external interventions like a fence like this bug (and forced-manual failover 
from CLI?) do a similar reset into the leader election. There may not be cases 
like this today though..

For the external interventions with automatic failover right now, we have 
transitionToActive/transitionToStandby plus forcemanual from the CLI. The 
current behavior is: if we do transitionToActive + forcemanual + the current 
standby RM id, the standby RM transitions to active. In the meantime it does 
the fencing, and the current active RM transitions to standby. If there are 
any exceptions, the RM is either terminated or goes back to standby state, 
which resets the leader election. In both cases ZK triggers a new round of 
leader election.

If we do transitionToStandby + forcemanual + the current active RM id, both 
RMs end up in standby state and another transitionToActive command is needed.



 Both RM stuck in standby mode when automatic failover is enabled
 

 Key: YARN-1861
 URL: https://issues.apache.org/jira/browse/YARN-1861
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, 
 YARN-1861.5.patch, yarn-1861-1.patch, yarn-1861-6.patch


 In our HA tests we noticed that the tests got stuck because both RMs got 
 into standby state and neither became active.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers

2014-05-13 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996776#comment-13996776
 ] 

Jian He commented on YARN-2017:
---

bq. Since this is original logic of SchedulerNode, I think it's better to throw 
exception instead of print a irresponsible log here. Null object passed in 
should be considered big problem in scheduler. And several following places in 
this class.
On second thought, a user might pass in a resource request with a null 
capability. I would prefer not to change it; in fact, we could add many other 
null checks in many places. Changed the patch back.

 Merge some of the common lib code in schedulers
 ---

 Key: YARN-2017
 URL: https://issues.apache.org/jira/browse/YARN-2017
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2017.1.patch, YARN-2017.2.patch, YARN-2017.3.patch


 A bunch of the same code is repeated among schedulers, e.g. between 
 FicaSchedulerNode and FSSchedulerNode. It would be good to merge and share it 
 in a common base.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-13 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996749#comment-13996749
 ] 

Vinod Kumar Vavilapalli commented on YARN-1861:
---

Okay, that's much better. +1. Will check this in once Jenkins says okay..

 Both RM stuck in standby mode when automatic failover is enabled
 

 Key: YARN-1861
 URL: https://issues.apache.org/jira/browse/YARN-1861
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, 
 YARN-1861.5.patch, YARN-1861.7.patch, yarn-1861-1.patch, yarn-1861-6.patch


 In our HA tests we noticed that the tests got stuck because both RMs got 
 into standby state and neither became active.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-667) Data persisted in RM should be versioned

2014-05-13 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996354#comment-13996354
 ] 

Junping Du commented on YARN-667:
-

Thanks for your comments, [~zjshen]! 
bq. Shall we consider the version of history data as well?
That's a good point. We should consider history data as well: 
ApplicationHistoryData, ApplicationAttemptHistoryData, ContainerHistoryData, 
etc. could change in the future and should be properly versioned. In addition, 
any change to the format (i.e. APPLICATION_PREFIX, etc.) or the path of the 
history file (root+appID for now) could be very challenging for rolling 
upgrade (and would also count as an incompatible change, I think). [~zjshen], 
do you have a sense of how likely that is to happen in 2.x, given that you are 
currently working on ATS?

 Data persisted in RM should be versioned
 

 Key: YARN-667
 URL: https://issues.apache.org/jira/browse/YARN-667
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.0.4-alpha
Reporter: Siddharth Seth
Assignee: Junping Du

 Includes data persisted for RM restart, NodeManager directory structure and 
 the Aggregated Log Format.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2014) Performance: AM scaleability is 10% slower in 2.4 compared to 0.23.9

2014-05-13 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-2014:
--

Assignee: Jason Lowe

 Performance: AM scaleability is 10% slower in 2.4 compared to 0.23.9
 

 Key: YARN-2014
 URL: https://issues.apache.org/jira/browse/YARN-2014
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: patrick white
Assignee: Jason Lowe

 Performance comparison benchmarks of 2.x against 0.23 show that the AM 
 scalability benchmark's runtime is approximately 10% slower in 2.4.0. The 
 trend is consistent across later releases in both lines; the latest release 
 numbers are:
 2.4.0.0 runtime 255.6 seconds (avg of 5 passes)
 0.23.9.12 runtime 230.4 seconds (avg of 5 passes)
 Diff: -9.9%
 The AM scalability test is essentially a sleep job that measures the time to 
 launch and complete a large number of mappers.
 The diff is consistent and has been reproduced in both a larger (350 node, 
 100,000 mapper) perf environment and a small (10 node, 2,900 mapper) demo 
 cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-45) [Preemption] Scheduler feedback to AM to release containers

2014-05-13 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-45:


Summary: [Preemption] Scheduler feedback to AM to release containers  (was: 
Scheduler feedback to AM to release containers)

 [Preemption] Scheduler feedback to AM to release containers
 ---

 Key: YARN-45
 URL: https://issues.apache.org/jira/browse/YARN-45
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Chris Douglas
Assignee: Carlo Curino
 Fix For: 2.1.0-beta

 Attachments: YARN-45.1.patch, YARN-45.patch, YARN-45.patch, 
 YARN-45.patch, YARN-45.patch, YARN-45.patch, YARN-45.patch, 
 YARN-45_design_thoughts.pdf


 The ResourceManager strikes a balance between cluster utilization and strict 
 enforcement of resource invariants in the cluster. Individual allocations of 
 containers must be reclaimed- or reserved- to restore the global invariants 
 when cluster load shifts. In some cases, the ApplicationMaster can respond to 
 fluctuations in resource availability without losing the work already 
 completed by that task (MAPREDUCE-4584). Supplying it with this information 
 would be helpful for overall cluster utilization [1]. To this end, we want to 
 establish a protocol for the RM to ask the AM to release containers.
 [1] http://research.yahoo.com/files/yl-2012-003.pdf



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1184) ClassCastException is thrown during preemption When a huge job is submitted to a queue B whose resources is used by a job in queueA

2014-05-13 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1184:
--

Issue Type: Sub-task  (was: Bug)
Parent: YARN-45

 ClassCastException is thrown during preemption When a huge job is submitted 
 to a queue B whose resources is used by a job in queueA
 ---

 Key: YARN-1184
 URL: https://issues.apache.org/jira/browse/YARN-1184
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler, resourcemanager
Affects Versions: 2.1.0-beta
Reporter: J.Andreina
Assignee: Chris Douglas
 Fix For: 2.1.1-beta

 Attachments: Y1184-0.patch, Y1184-1.patch


 preemption is enabled.
 Queue = a,b
 a capacity = 30%
 b capacity = 70%
 Step 1: Assign a big job to queue a ( so that job_a will utilize some 
 resources from queue b)
 Step 2: Assign a big job to queue b.
 Following exception is thrown at Resource Manager
 {noformat}
 2013-09-12 10:42:32,535 ERROR [SchedulingMonitor 
 (ProportionalCapacityPreemptionPolicy)] yarn.YarnUncaughtExceptionHandler 
 (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread 
 Thread[SchedulingMonitor (ProportionalCapacityPreemptionPolicy),5,main] threw 
 an Exception.
 java.lang.ClassCastException: java.util.Collections$UnmodifiableSet cannot be 
 cast to java.util.NavigableSet
   at 
 org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.getContainersToPreempt(ProportionalCapacityPreemptionPolicy.java:403)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:202)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:173)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:72)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PreemptionChecker.run(SchedulingMonitor.java:82)
   at java.lang.Thread.run(Thread.java:662)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2014) Performance: AM scaleability is 10% slower in 2.4 compared to 0.23.9

2014-05-13 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996737#comment-13996737
 ] 

Vinod Kumar Vavilapalli commented on YARN-2014:
---

Thanks for the info Jason. Do you have a link to the JIRA covering the FS 
ServiceLoader stuff?

In your configs, which file-systems have impls defined? Or is it just the 
default impls added by the default config files? Maybe one thing that can be 
done, if possible and if you have time, is to remove the unnecessary 
service-loader declaration files (not sure what you call them) from the 
installation and try this again.



 Performance: AM scaleability is 10% slower in 2.4 compared to 0.23.9
 

 Key: YARN-2014
 URL: https://issues.apache.org/jira/browse/YARN-2014
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: patrick white
Assignee: Jason Lowe

 Performance comparison benchmarks of 2.x against 0.23 show that the AM 
 scalability benchmark's runtime is approximately 10% slower in 2.4.0. The 
 trend is consistent across later releases in both lines; the latest release 
 numbers are:
 2.4.0.0 runtime 255.6 seconds (avg of 5 passes)
 0.23.9.12 runtime 230.4 seconds (avg of 5 passes)
 Diff: -9.9%
 The AM scalability test is essentially a sleep job that measures the time to 
 launch and complete a large number of mappers.
 The diff is consistent and has been reproduced in both a larger (350 node, 
 100,000 mapper) perf environment and a small (10 node, 2,900 mapper) demo 
 cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart

2014-05-13 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-1366:
-

Attachment: YARN-1366.1.patch

I updated the patch with the following changes in AMRMClient (MapReduce is not 
considered here):
1. On resync from the RM, reset lastResponseId and re-register with the RM.
2. Add back the ResourceRequests from the last allocate request.
3. After 1 and 2, AMRMClient continues heartbeating. (A sketch of this flow 
follows below.)

The patch does not contain tests; I will add tests in the next patches.

Please review this initial patch and let me know whether it satisfies the task 
expectations.
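
A hedged sketch of the resync flow in points 1-3 above; this is not the actual 
patch, and the callAllocate/reRegister hooks and field names are assumptions 
for illustration only:
{code}
// Sketch: reset the response id, re-register, and resend outstanding asks
// when the RM answers a heartbeat with AM_RESYNC.
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateRequest;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.AMCommand;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

abstract class ResyncHeartbeatSketch {
  protected int lastResponseId = 0;
  // Outstanding asks, e.g. repopulated from the remoteRequestTable on resync.
  protected final List<ResourceRequest> pendingAsks = new ArrayList<ResourceRequest>();

  protected abstract AllocateResponse callAllocate(AllocateRequest req) throws Exception;
  protected abstract void reRegister() throws Exception;

  AllocateResponse heartbeat(float progress) throws Exception {
    AllocateResponse resp = callAllocate(AllocateRequest.newInstance(
        lastResponseId, progress, pendingAsks, new ArrayList<ContainerId>(), null));
    if (resp.getAMCommand() == AMCommand.AM_RESYNC) {
      lastResponseId = 0;  // 1. reset the allocate sequence number
      reRegister();        //    and re-register with the restarted RM
      // 2./3. outstanding asks stay in pendingAsks, so they are sent again
      //       on the next heartbeat.
      resp = callAllocate(AllocateRequest.newInstance(
          lastResponseId, progress, pendingAsks, new ArrayList<ContainerId>(), null));
    }
    lastResponseId = resp.getResponseId();
    return resp;
  }
}
{code}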

Work items to be decided:
1. On resync, the ResourceRequests from the last allocate are added back to 
the ask and sent again on the next heartbeat. My doubt here: what about older 
asks that were sent in earlier heartbeats but not yet allocated? Those earlier 
requests can be repopulated from the remoteRequestTable.
2. Should the MapReduce changes be handled in this jira?

The current behaviour of AMs treats RESYNC and SHUTDOWN the same. It would be 
very useful if the resync and shutdown commands were issued separately by 
ApplicationMasterService.

 ApplicationMasterService should Resync with the AM upon allocate call after 
 restart
 ---

 Key: YARN-1366
 URL: https://issues.apache.org/jira/browse/YARN-1366
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Rohith
 Attachments: YARN-1366.1.patch, YARN-1366.patch, 
 YARN-1366.prototype.patch, YARN-1366.prototype.patch


 The ApplicationMasterService currently sends a resync response to which the 
 AM responds by shutting down. The AM behavior is expected to change to 
 resyncing with the RM. Resync means resetting the allocate RPC 
 sequence number to 0 and the AM should send its entire outstanding request to 
 the RM. Note that if the AM is making its first allocate call to the RM then 
 things should proceed like normal without needing a resync. The RM will 
 return all containers that have completed since the RM last synced with the 
 AM. Some container completions may be reported more than once.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler

2014-05-13 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997024#comment-13997024
 ] 

Alejandro Abdelnur commented on YARN-1368:
--

[~jianhe], [~vinodkv], 

Unless I'm missing something, Anubhav was working on this JIRA. It is great 
that Jian did the refactoring to have common code for the schedulers and some 
testcases for it, but most of the work has been done by Anubhav and he was 
working actively on it. We should reassign the JIRA back to Anubhav and let him 
drive it to completion, agree?

 Common work to re-populate containers’ state into scheduler
 ---

 Key: YARN-1368
 URL: https://issues.apache.org/jira/browse/YARN-1368
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Jian He
 Attachments: YARN-1368.1.patch, YARN-1368.combined.001.patch, 
 YARN-1368.preliminary.patch


 YARN-1367 adds support for the NM to tell the RM about all currently running 
 containers upon registration. The RM needs to send this information to the 
 schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover 
 the current allocation state of the cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1957) ProportionalCapacityPreemptionPolicy handling of corner cases...

2014-05-13 Thread Chris Douglas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Douglas updated YARN-1957:


Fix Version/s: 3.0.0
   2.5.0

 ProportionalCapacityPreemptionPolicy handling of corner cases...
 ---

 Key: YARN-1957
 URL: https://issues.apache.org/jira/browse/YARN-1957
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Carlo Curino
Assignee: Carlo Curino
  Labels: capacity-scheduler, preemption
 Fix For: 3.0.0, 2.5.0, 2.4.1

 Attachments: YARN-1957.patch, YARN-1957.patch, YARN-1957_test.patch


 The current version of ProportionalCapacityPreemptionPolicy should be 
 improved to deal with the following two scenarios:
 1) when rebalancing over-capacity allocations, it potentially preempts 
 without considering the maxCapacity constraints of a queue (i.e., preempting 
 possibly more than strictly necessary)
 2) a zero-capacity queue is preempted even if there is no demand (consistent 
 with the old use of zero capacity to disable queues)
 The proposed patch fixes both issues and introduces a few new test cases.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2052) Container ID format and clustertimestamp for Work preserving restart

2014-05-13 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated YARN-2052:
-

Description: (was: We've been discussing whether container id format is 
changed to include cluster timestamp or not on YARN-556 and YARN-2001. This 
JIRA is for taking the discussion. )

 Container ID format and clustertimestamp for Work preserving restart
 

 Key: YARN-2052
 URL: https://issues.apache.org/jira/browse/YARN-2052
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2017) Merge some of the common lib code in schedulers

2014-05-13 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996756#comment-13996756
 ] 

Jian He commented on YARN-2017:
---

Thanks Sandy, Wangda, and Tsuyoshi for the review and comments!

bq. Why take out the header comment in SchedulerNode?
Accidentally removed; added it back.
bq. Can we use generics to avoid all the casting (and find bugs)?
Makes sense. Changed AbstractYarnScheduler to use generics. We may need to 
change SchedulerNode to SchedulerNode<FSSchedulerApp> as well to avoid the 
type-cast warning, but changing that makes the patch much bigger because there 
are so many references to SchedulerNode; we may fix it in a separate patch.
bq. I'm wondering if it is meaningful to merge this method, too much code 
changes due to this merge.
This will be commonly used by YARN-1368; that's why I merged it here.

Fixed the other comments accordingly.
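
A rough sketch of the generics direction discussed above; the class and member 
names are invented for illustration, and the real AbstractYarnScheduler 
signature in the patch may differ:
{code}
// Illustrative only: a scheduler base parameterized over its app/node types,
// so Fifo/Capacity/Fair subclasses get concrete types back without casts.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

abstract class GenericSchedulerSketch<A, N> {
  protected final Map<String, N> nodes = new ConcurrentHashMap<String, N>();
  protected final Map<String, A> applications = new ConcurrentHashMap<String, A>();

  protected N getNode(String nodeName) {
    return nodes.get(nodeName);   // concrete node type, no cast needed
  }

  protected A getApplication(String appId) {
    return applications.get(appId);
  }
}
{code}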

 Merge some of the common lib code in schedulers
 ---

 Key: YARN-2017
 URL: https://issues.apache.org/jira/browse/YARN-2017
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2017.1.patch, YARN-2017.2.patch


 A bunch of the same code is repeated among schedulers, e.g. between 
 FicaSchedulerNode and FSSchedulerNode. It would be good to merge and share it 
 in a common base.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2052) ClusterId format and clustertimestamp

2014-05-13 Thread Tsuyoshi OZAWA (JIRA)
Tsuyoshi OZAWA created YARN-2052:


 Summary: ClusterId format and clustertimestamp
 Key: YARN-2052
 URL: https://issues.apache.org/jira/browse/YARN-2052
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Tsuyoshi OZAWA


We've been discussing on YARN-556 and YARN-2001 whether the container ID 
format should be changed to include the cluster timestamp. This JIRA is for 
continuing that discussion.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2048) List all of the containers of an application from the yarn web

2014-05-13 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997089#comment-13997089
 ] 

Zhijie Shen commented on YARN-2048:
---

Hi [~coderplay], what is the exact requirement? 2.4 is already out, and 
YARN-1809 was not able to get in because we didn't have enough time to test it 
thoroughly. Hence we are targeting 2.5 for that jira, which should cover the 
scenario here.

Usually we won't include a major change in a maintenance release such as 
2.4.1. However, if you think it is really urgent and a required fix for 2.4.1, 
please feel free to reopen this and set the target version to 2.4.1. Thanks!

 List all of the containers of an application from the yarn web
 --

 Key: YARN-2048
 URL: https://issues.apache.org/jira/browse/YARN-2048
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager, webapp
Affects Versions: 2.3.0, 2.4.0, 2.5.0
Reporter: Min Zhou
 Attachments: YARN-2048-trunk-v1.patch


 Currently, YARN does not provide a way to list all of the containers of an 
 application from its web UI. Application users need this information: it lets 
 them see how many containers their applications have already acquired and 
 which nodes those containers were launched on, and they also want to view the 
 logs of each container of an application.
 One approach is to maintain a container list in RMAppImpl and expose this 
 info on the Application page. I will submit a patch soon.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1368) Common work to re-populate containers’ state into scheduler

2014-05-13 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997071#comment-13997071
 ] 

Jian He commented on YARN-1368:
---

[~tucu00], please understand that the patch uploaded here is completely 
different from the prototype patch on YARN-556. The patch here uses a 
different approach to cover all schedulers, and the whole container recovery 
flow is different, which simplifies things a lot. This jira itself was 
originally opened as “RM should populate running container allocation 
information from NM resync” and did not cover recovering the schedulers. To 
avoid confusion, I should have opened a new jira to describe the approaches 
instead of renaming this one.

 Common work to re-populate containers’ state into scheduler
 ---

 Key: YARN-1368
 URL: https://issues.apache.org/jira/browse/YARN-1368
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Jian He
 Attachments: YARN-1368.1.patch, YARN-1368.combined.001.patch, 
 YARN-1368.preliminary.patch


 YARN-1367 adds support for the NM to tell the RM about all currently running 
 containers upon registration. The RM needs to send this information to the 
 schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover 
 the current allocation state of the cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2016) Yarn getApplicationRequest start time range is not honored

2014-05-13 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996512#comment-13996512
 ] 

Junping Du commented on YARN-2016:
--

[~venkatnrangan], you are right. When a bug like this happens, we often 
suspect the client or server logic but ignore the wire logic. In most cases we 
don't even have a simple unit test verifying these PBImpls, which lets bugs 
hide easily. I have already filed YARN-2051 to address this; it would be great 
if you want to help there. Thanks!

 Yarn getApplicationRequest start time range is not honored
 --

 Key: YARN-2016
 URL: https://issues.apache.org/jira/browse/YARN-2016
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Venkat Ranganathan
Assignee: Junping Du
 Fix For: 2.4.1

 Attachments: YARN-2016.patch, YarnTest.java


 When we query for previous applications by creating an instance of 
 GetApplicationsRequest and setting the start time range and application tag, 
 we see that the start range provided is not honored and all applications with 
 the tag are returned.
 Attaching a reproducer.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1474) Make schedulers services

2014-05-13 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1474:
-

Attachment: YARN-1474.12.patch

Thank you for the great review, [~kkambatl]! I updated the patch to address 
the points; I believe this latest patch is much simpler than the previous one 
thanks to your review.

Here are replies to your review comments and the change log for this patch:
1. TestRMDelegationTokens causes an NPE in AllocationFileLoaderService#stop() 
if we don't check for null, which is a regression. Therefore, this patch still 
includes the null check in AllocationFileLoaderService. Please let me know if 
we can accept this regression. The log from the NPE in TestRMDelegationTokens 
is attached at the end of this comment.
2. Added spaces between the interfaces ResourceSchedulerWrapper implements.
3. Updated a comment about ResourceScheduler#setRMContext().
4. Changed to call {{reinitialize()}} from serviceInit()/serviceStart() in all 
schedulers.
5. FairScheduler: removed isUpdateThreadRunning/isSchedulingThreadRunning.
6. FairScheduler: serviceStartInternal()/serviceStopInternal() are removed as 
part of 4.
7, 8. Changed to call updateThread/schedulingThread#join() in serviceStop(). 
Additionally, AllocationFileLoaderService#reloadThread has the same problem, 
so AllocationFileLoaderService#stop() now calls join() as well. (A sketch of 
this pattern follows below.)
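
A minimal sketch of the stop-time handling in points 1 and 7-8 (null check 
plus interrupt/join), shown on a simplified AbstractService rather than the 
real FairScheduler or AllocationFileLoaderService:
{code}
// Illustrative only: guard against stop-before-start with a null check and
// join the background thread in serviceStop() so shutdown does not leak it.
import org.apache.hadoop.service.AbstractService;

class BackgroundThreadServiceSketch extends AbstractService {
  private static final long JOIN_TIMEOUT_MS = 1000;
  private volatile Thread updateThread;

  BackgroundThreadServiceSketch() {
    super("BackgroundThreadServiceSketch");
  }

  @Override
  protected void serviceStart() throws Exception {
    updateThread = new Thread(new Runnable() {
      @Override
      public void run() {
        while (!Thread.currentThread().isInterrupted()) {
          try {
            Thread.sleep(500);   // stand-in for periodic update/reload work
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }
      }
    });
    updateThread.setDaemon(true);
    updateThread.start();
    super.serviceStart();
  }

  @Override
  protected void serviceStop() throws Exception {
    Thread t = updateThread;
    if (t != null) {             // null check: stop() may run without start()
      t.interrupt();
      t.join(JOIN_TIMEOUT_MS);   // wait for the thread instead of leaking it
    }
    super.serviceStop();
  }
}
{code}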

{quote}
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.stop(AllocationFileLoaderService.java:149)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceStop(FairScheduler.java:1268)
at 
org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
at 
org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
at 
org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
at 
org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
at 
org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:506)
at 
org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.stopActiveServices(ResourceManager.java:839)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:889)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStop(ResourceManager.java:944)
at 
org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
at 
org.apache.hadoop.yarn.server.resourcemanager.security.TestRMDelegationTokens.testRMDTMasterKeyStateOnRollingMasterKey(TestRMDelegationTokens.java:124)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
{quote}

 Make schedulers services
 

 Key: YARN-1474
 URL: https://issues.apache.org/jira/browse/YARN-1474
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Affects Versions: 2.3.0
Reporter: Sandy Ryza
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-1474.1.patch, YARN-1474.10.patch, 
 YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.2.patch, YARN-1474.3.patch, 
 YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, 
 YARN-1474.8.patch, YARN-1474.9.patch


 Schedulers currently have a reinitialize but no start and stop.  Fitting them 
 into the YARN service model would make things more coherent.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2017) Merge some of the common lib code in schedulers

2014-05-13 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2017:
--

Attachment: YARN-2017.2.patch

 Merge some of the common lib code in schedulers
 ---

 Key: YARN-2017
 URL: https://issues.apache.org/jira/browse/YARN-2017
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2017.1.patch, YARN-2017.2.patch


 A bunch of the same code is repeated among schedulers, e.g. between 
 FicaSchedulerNode and FSSchedulerNode. It would be good to merge and share it 
 in a common base.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1362) Distinguish between nodemanager shutdown for decommission vs shutdown for restart

2014-05-13 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997095#comment-13997095
 ] 

Junping Du commented on YARN-1362:
--

I have committed this to trunk and branch-2. Thank you, [~jlowe]!

 Distinguish between nodemanager shutdown for decommission vs shutdown for 
 restart
 -

 Key: YARN-1362
 URL: https://issues.apache.org/jira/browse/YARN-1362
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Fix For: 2.5.0

 Attachments: YARN-1362.patch


 When a nodemanager shuts down it needs to determine if it is likely to be 
 restarted.  If a restart is likely then it needs to preserve container 
 directories, logs, distributed cache entries, etc.  If it is being shutdown 
 more permanently (e.g.: like a decommission) then the nodemanager should 
 cleanup directories and logs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-interval

2014-05-13 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2054:
---

Attachment: yarn-2054-1.patch

Straightforward patch that brings the cumulative retry time to 10 seconds, the 
same as yarn.resourcemanager.zk-timeout-ms.
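
For illustration only, one way to tune the retry settings so the cumulative 
retry window roughly matches the ZK session timeout; the concrete values here 
are assumptions, not necessarily the defaults chosen in the patch:
{code}
// Sketch: shrink the cumulative ZK retry window to ~10 s so a fenced/buggy RM
// gives up quickly and failover is not delayed.
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ZkRetryTuning {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    conf.setInt("yarn.resourcemanager.zk-num-retries", 10);
    conf.setLong("yarn.resourcemanager.zk-retry-interval-ms", 1000);
    // 10 retries * 1000 ms = 10 s, comparable to yarn.resourcemanager.zk-timeout-ms
    System.out.println("cumulative retry ms = "
        + conf.getInt("yarn.resourcemanager.zk-num-retries", 0)
        * conf.getLong("yarn.resourcemanager.zk-retry-interval-ms", 0));
  }
}
{code}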

 Poor defaults for YARN ZK configs for retries and retry-interval
 ---

 Key: YARN-2054
 URL: https://issues.apache.org/jira/browse/YARN-2054
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
 Attachments: yarn-2054-1.patch


 Currently, we have the following default values:
 # yarn.resourcemanager.zk-num-retries - 500
 # yarn.resourcemanager.zk-retry-interval-ms - 2000
 This leads to a cumulative 1000 seconds before the RM gives up trying to 
 connect to ZK. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2012) Fair Scheduler : Default rule in queue placement policy can take a queue as an optional attribute

2014-05-13 Thread Ashwin Shankar (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997054#comment-13997054
 ] 

Ashwin Shankar commented on YARN-2012:
--

Hi [~sandyr], do you have any comments?

 Fair Scheduler : Default rule in queue placement policy can take a queue as 
 an optional attribute
 -

 Key: YARN-2012
 URL: https://issues.apache.org/jira/browse/YARN-2012
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
  Labels: scheduler
 Attachments: YARN-2012-v1.txt, YARN-2012-v2.txt


 Currently the 'default' rule in the queue placement policy, if applied, puts 
 the app in the root.default queue. It would be great if the 'default' rule 
 could optionally point to a different queue as the default queue. This queue 
 should be an existing queue; if it is not, we fall back to the root.default 
 queue, hence keeping this rule terminal.
 The default queue can be a leaf queue, or it can be a parent queue if the 
 'default' rule is nested inside the nestedUserQueue rule (YARN-1864).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1474) Make schedulers services

2014-05-13 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997006#comment-13997006
 ] 

Tsuyoshi OZAWA commented on YARN-1474:
--

Waiting for Jenkins.

 Make schedulers services
 

 Key: YARN-1474
 URL: https://issues.apache.org/jira/browse/YARN-1474
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Affects Versions: 2.3.0
Reporter: Sandy Ryza
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-1474.1.patch, YARN-1474.10.patch, 
 YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.2.patch, YARN-1474.3.patch, 
 YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, 
 YARN-1474.8.patch, YARN-1474.9.patch


 Schedulers currently have a reinitialize but no start and stop.  Fitting them 
 into the YARN service model would make things more coherent.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-185) Add preemption to CS

2014-05-13 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-185:
-

Issue Type: Sub-task  (was: New Feature)
Parent: YARN-45

 Add preemption to CS
 

 Key: YARN-185
 URL: https://issues.apache.org/jira/browse/YARN-185
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Arun C Murthy
Assignee: Arun C Murthy

 Umbrella jira to track adding preemption to CS; let's track the work via sub-tasks.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-interval

2014-05-13 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997159#comment-13997159
 ] 

Karthik Kambatla commented on YARN-2054:


On a cluster with RM HA and a buggy RM, this led to a long wait before failover.

 Poor defaults for YARN ZK configs for retries and retry-interval
 ---

 Key: YARN-2054
 URL: https://issues.apache.org/jira/browse/YARN-2054
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla

 Currently, we have the following default values:
 # yarn.resourcemanager.zk-num-retries - 500
 # yarn.resourcemanager.zk-retry-interval-ms - 2000
 This leads to a cumulative 1000 seconds before the RM gives up trying to 
 connect to ZK. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1337) Recover containers upon nodemanager restart

2014-05-13 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-1337:
-

Description: To support work-preserving NM restart we need to recover the 
state of the containers when the nodemanager went down.  This includes 
informing the RM of containers that have exited in the interim and a strategy 
for dealing with the exit codes from those containers along with how to 
reacquire the active containers and determine their exit codes when they 
terminate.  The state of finished containers also needs to be recovered.  (was: 
To support work-preserving NM restart we need to recover the state of the 
containers that were active when the nodemanager went down.  This includes 
informing the RM of containers that have exited in the interim and a strategy 
for dealing with the exit codes from those containers along with how to 
reacquire the active containers and determine their exit codes when they 
terminate.)
Summary: Recover containers upon nodemanager restart  (was: Recover 
active container state upon nodemanager restart)

Updating the headline and description to note that this task also includes 
recovering the state of finished containers.

 Recover containers upon nodemanager restart
 ---

 Key: YARN-1337
 URL: https://issues.apache.org/jira/browse/YARN-1337
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe

 To support work-preserving NM restart we need to recover the state of the 
 containers when the nodemanager went down.  This includes informing the RM of 
 containers that have exited in the interim and a strategy for dealing with 
 the exit codes from those containers along with how to reacquire the active 
 containers and determine their exit codes when they terminate.  The state of 
 finished containers also needs to be recovered.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1368) Common work to re-populate containers’ state into scheduler

2014-05-13 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-1368:


Attachment: YARN-1368.combined.001.patch

Thanks [~jianhe] for making the scheduler changes generic. I have added back 
the FairScheduler changes accordingly, and I have refactored your unit test so 
we can test both Capacity and Fair. The rest of the patch looks similar to my 
[YARN-556|https://issues.apache.org/jira/browse/YARN-556] prototype patch, so 
we are pretty much in sync there.

 Common work to re-populate containers’ state into scheduler
 ---

 Key: YARN-1368
 URL: https://issues.apache.org/jira/browse/YARN-1368
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Jian He
 Attachments: YARN-1368.1.patch, YARN-1368.combined.001.patch, 
 YARN-1368.preliminary.patch


 YARN-1367 adds support for the NM to tell the RM about all currently running 
 containers upon registration. The RM needs to send this information to the 
 schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover 
 the current allocation state of the cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1550) the page http://ip:50030/cluster/scheduler has 500 error in fairScheduler

2014-05-13 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1550:
---

Description: 
Three steps:
1. Set a debug breakpoint in RMAppManager#submitApplication after this code:
if (rmContext.getRMApps().putIfAbsent(applicationId, application) !=
null) {
  String message = "Application with id " + applicationId
      + " is already present! Cannot add a duplicate!";
  LOG.warn(message);
  throw RPCUtil.getRemoteException(message);
}

2. Submit one application: hadoop jar 
~/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-ydh2.2.0-tests.jar
 sleep -Dhadoop.job.ugi=test2,#11 -Dmapreduce.job.queuename=p1 -m 1 -mt 1 
-r 1

3. Go to the page http://ip:50030/cluster/scheduler and observe a 500 ERROR!

the log:
{noformat}
2013-12-30 11:51:43,795 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error 
handling URI: /cluster/scheduler
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.FairSchedulerAppsBlock.render(FairSchedulerAppsBlock.java:96)
at 
org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66)
at 
org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76)
{noformat}

  was:
three Steps :
1、debug at RMAppManager#submitApplication after code
if (rmContext.getRMApps().putIfAbsent(applicationId, application) !=
null) {
  String message = Application with id  + applicationId
  +  is already present! Cannot add a duplicate!;
  LOG.warn(message);
  throw RPCUtil.getRemoteException(message);
}

2、submit one application:hadoop jar 
~/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-ydh2.2.0-tests.jar
 sleep -Dhadoop.job.ugi=test2,#11 -Dmapreduce.job.queuename=p1 -m 1 -mt 1 
-r 1

3、go in page :http://ip:50030/cluster/scheduler and find 500 ERROR!

the log:
2013-12-30 11:51:43,795 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error 
handling URI: /cluster/scheduler
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at 
com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
at 
com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
at 
com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
at 
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
at 
com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
at 
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
at 
com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1081)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at 

[jira] [Updated] (YARN-2052) ContainerId creation after work preserving restart is broken

2014-05-13 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated YARN-2052:
-

Summary: ContainerId creation after work preserving restart is broken  
(was: Container ID format and clustertimestamp for Work preserving restart)

 ContainerId creation after work preserving restart is broken
 

 Key: YARN-2052
 URL: https://issues.apache.org/jira/browse/YARN-2052
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1803) Signal container support in nodemanager

2014-05-13 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997062#comment-13997062
 ] 

Ming Ma commented on YARN-1803:
---

Vinod, thanks for the great feedback. So to summarize it,

1. Add a signalContainers method to both ApplicationClientProtocol and 
ContainerManagementProtocol to support an ordered list.
2. stopContainers will be deprecated eventually.
3. MR needs to be changed to call signalContainers instead of stopContainers.

For the SignalContainerCommand, I will update that in YARN-1897. We still need 
to define signalContainerRequest in addition to signalContainersRequest.

 Signal container support in nodemanager
 ---

 Key: YARN-1803
 URL: https://issues.apache.org/jira/browse/YARN-1803
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Ming Ma
Assignee: Ming Ma
 Attachments: YARN-1803.patch


 It could include the following:
 1. ContainerManager is able to process a new event type, 
 ContainerManagerEventType.SIGNAL_CONTAINERS, coming from NodeStatusUpdater 
 and deliver the request to ContainerExecutor.
 2. Translate the platform-independent signal command to Linux-specific 
 signals. Windows support will be tracked by another task.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1803) Signal container support in nodemanager

2014-05-13 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997252#comment-13997252
 ] 

Vinod Kumar Vavilapalli commented on YARN-1803:
---

bq. 1. Add signalContainers method to [..] ContainerManagementProtocol to 
support ordered list.
Yup. We can do the above as a follow up though. It seems like most cases are 
centered primarily around the RM API.

 Signal container support in nodemanager
 ---

 Key: YARN-1803
 URL: https://issues.apache.org/jira/browse/YARN-1803
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Ming Ma
Assignee: Ming Ma
 Attachments: YARN-1803.patch


 It could include the following:
 1. ContainerManager is able to process a new event type, 
 ContainerManagerEventType.SIGNAL_CONTAINERS, coming from NodeStatusUpdater 
 and deliver the request to ContainerExecutor.
 2. Translate the platform-independent signal command to Linux-specific 
 signals. Windows support will be tracked by another task.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Reopened] (YARN-2048) List all of the containers of an application from the yarn web

2014-05-13 Thread Min Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Zhou reopened YARN-2048:



I'd like to reopen it because I think we need this feature for 2.[0-4].x users 
before the timeline server work is finished. What do you think, [~zjshen]?

 List all of the containers of an application from the yarn web
 --

 Key: YARN-2048
 URL: https://issues.apache.org/jira/browse/YARN-2048
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager, webapp
Affects Versions: 2.3.0, 2.4.0, 2.5.0
Reporter: Min Zhou
 Attachments: YARN-2048-trunk-v1.patch


 Currently, YARN does not provide a way to list all of the containers of an 
 application from its web UI. Application users need this information: it lets 
 them see how many containers their applications have already acquired and 
 which nodes those containers were launched on, and they also want to view the 
 logs of each container of an application.
 One approach is to maintain a container list in RMAppImpl and expose this 
 info on the Application page. I will submit a patch soon.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2051) Add more unit tests for PBImpl that didn't get covered

2014-05-13 Thread Junping Du (JIRA)
Junping Du created YARN-2051:


 Summary: Add more unit tests for PBImpl that didn't get covered
 Key: YARN-2051
 URL: https://issues.apache.org/jira/browse/YARN-2051
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Junping Du
Priority: Critical


From YARN-2016 we can see that bugs can exist in the PB implementations of 
the protocol records. The bad news is that most of these PBImpls don't have 
any unit test verifying that the info is not lost or changed after 
serialization/deserialization. We should add more tests for them.
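
A hedged sketch of the round-trip pattern such tests could use; the record and 
accessors below are only an example (based on GetApplicationsRequest from 
YARN-2016), and the exact signatures should be verified against the code:
{code}
// Sketch: set fields on a PBImpl, rebuild the record from its serialized
// proto, and verify nothing was lost or changed on the wire path.
import java.util.Collections;
import org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetApplicationsRequestPBImpl;
import org.junit.Assert;
import org.junit.Test;

public class TestGetApplicationsRequestPBRoundTrip {
  @Test
  public void testProtoRoundTrip() {
    GetApplicationsRequestPBImpl original = new GetApplicationsRequestPBImpl();
    original.setStartRange(1L, 2L);
    original.setApplicationTags(Collections.singleton("tag1"));

    // Rebuild from the proto, as the RPC layer would on the receiving side.
    GetApplicationsRequestPBImpl copy =
        new GetApplicationsRequestPBImpl(original.getProto());

    Assert.assertEquals(original.getStartRange(), copy.getStartRange());
    Assert.assertEquals(original.getApplicationTags(), copy.getApplicationTags());
  }
}
{code}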



--
This message was sent by Atlassian JIRA
(v6.2#6252)