[jira] [Commented] (YARN-2247) Allow RM web services users to authenticate using delegation tokens

2014-07-25 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075257#comment-14075257
 ] 

Zhijie Shen commented on YARN-2247:
---

+1 for the latest patch. [~vinodkv], do you have more comments about this issue?

> Allow RM web services users to authenticate using delegation tokens
> ---
>
> Key: YARN-2247
> URL: https://issues.apache.org/jira/browse/YARN-2247
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
>Priority: Blocker
> Attachments: apache-yarn-2247.0.patch, apache-yarn-2247.1.patch, 
> apache-yarn-2247.2.patch, apache-yarn-2247.3.patch, apache-yarn-2247.4.patch, 
> apache-yarn-2247.5.patch
>
>
> The RM webapp should allow users to authenticate using delegation tokens to 
> maintain parity with RPC.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2346) Add a 'status' command to yarn-daemon.sh

2014-07-25 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated YARN-2346:
---

Target Version/s:   (was: 2.5.0)

> Add a 'status' command to yarn-daemon.sh
> 
>
> Key: YARN-2346
> URL: https://issues.apache.org/jira/browse/YARN-2346
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.2.0, 2.3.0, 2.2.1, 2.4.0, 2.4.1
>Reporter: Nikunj Bansal
>Assignee: Allen Wittenauer
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Adding a 'status' command to yarn-daemon.sh will be useful for finding out 
> the status of YARN daemons.
> Running the 'status' command should exit with a 0 exit code if the target 
> daemon is running and a non-zero code if it is not.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-2346) Add a 'status' command to yarn-daemon.sh

2014-07-25 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer reassigned YARN-2346:
--

Assignee: Allen Wittenauer

> Add a 'status' command to yarn-daemon.sh
> 
>
> Key: YARN-2346
> URL: https://issues.apache.org/jira/browse/YARN-2346
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.2.0, 2.3.0, 2.2.1, 2.4.0, 2.4.1
>Reporter: Nikunj Bansal
>Assignee: Allen Wittenauer
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Adding a 'status' command to yarn-daemon.sh will be useful for finding out 
> the status of YARN daemons.
> Running the 'status' command should exit with a 0 exit code if the target 
> daemon is running and a non-zero code if it is not.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-2346) Add a 'status' command to yarn-daemon.sh

2014-07-25 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer resolved YARN-2346.


Resolution: Duplicate

This is part of HADOOP-9902 now. Resolving.

> Add a 'status' command to yarn-daemon.sh
> 
>
> Key: YARN-2346
> URL: https://issues.apache.org/jira/browse/YARN-2346
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.2.0, 2.3.0, 2.2.1, 2.4.0, 2.4.1
>Reporter: Nikunj Bansal
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Adding a 'status' command to yarn-daemon.sh will be useful for finding out 
> the status of YARN daemons.
> Running the 'status' command should exit with a 0 exit code if the target 
> daemon is running and a non-zero code if it is not.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1726) ResourceSchedulerWrapper broken due to AbstractYarnScheduler

2014-07-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075228#comment-14075228
 ] 

Hudson commented on YARN-1726:
--

FAILURE: Integrated in Hadoop-trunk-Commit #5975 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5975/])
YARN-1726. Add missing files. ResourceSchedulerWrapper broken due to 
AbstractYarnScheduler. (Wei Yan via kasha) (kasha: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1613552)
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/test/java/org/apache/hadoop/yarn/sls/appmaster
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/test/java/org/apache/hadoop/yarn/sls/appmaster/TestAMSimulator.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/test/java/org/apache/hadoop/yarn/sls/nodemanager
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/test/java/org/apache/hadoop/yarn/sls/nodemanager/TestNMSimulator.java


> ResourceSchedulerWrapper broken due to AbstractYarnScheduler
> 
>
> Key: YARN-1726
> URL: https://issues.apache.org/jira/browse/YARN-1726
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.1
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Blocker
> Fix For: 2.5.0
>
> Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, 
> YARN-1726-6.patch, YARN-1726-7-branch2.patch, YARN-1726-7.patch, 
> YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch
>
>
> The YARN scheduler simulator failed when running Fair Scheduler, due to 
> AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper 
> should inherit AbstractYarnScheduler, instead of implementing 
> ResourceScheduler interface directly.
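For reference, a minimal sketch (not the committed patch; the generic parameters and the elided methods are assumptions) of the structural change the description calls for, i.e. extending AbstractYarnScheduler instead of implementing ResourceScheduler directly:

{code}
// Sketch only: the wrapper picks up the common state and behavior introduced by
// YARN-1041 by extending AbstractYarnScheduler rather than re-implementing the
// ResourceScheduler interface from scratch.
public class ResourceSchedulerWrapper
    extends AbstractYarnScheduler<SchedulerApplicationAttempt, SchedulerNode> {

  public ResourceSchedulerWrapper() {
    super(ResourceSchedulerWrapper.class.getName());
  }

  // ... delegate allocate()/handle() and the other scheduler calls to the
  // wrapped scheduler as before ...
}
{code}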



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1796) container-executor shouldn't require o-r permissions

2014-07-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075222#comment-14075222
 ] 

Hudson commented on YARN-1796:
--

FAILURE: Integrated in Hadoop-trunk-Commit #5974 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5974/])
YARN-1796. container-executor shouldn't require o-r permissions. Contributed by 
Aaron T. Myers. (atm: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1613548)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c


> container-executor shouldn't require o-r permissions
> 
>
> Key: YARN-1796
> URL: https://issues.apache.org/jira/browse/YARN-1796
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.4.0
>Reporter: Aaron T. Myers
>Assignee: Aaron T. Myers
>Priority: Minor
> Fix For: 2.6.0
>
> Attachments: YARN-1796.patch
>
>
> The container-executor currently checks that "other" users don't have read 
> permissions. This is unnecessary and runs contrary to the Debian packaging 
> policy manual.
> This is the analogous fix for YARN that was done for MR1 in MAPREDUCE-2103.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2354) DistributedShell may allocate more containers than client specified after it restarts

2014-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075224#comment-14075224
 ] 

Hadoop QA commented on YARN-2354:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12657943/YARN-2354-072514.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell:

  
org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build///testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build///console

This message is automatically generated.

> DistributedShell may allocate more containers than client specified after it 
> restarts
> -
>
> Key: YARN-2354
> URL: https://issues.apache.org/jira/browse/YARN-2354
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jian He
>Assignee: Li Lu
> Attachments: YARN-2354-072514.patch
>
>
> To reproduce, run distributed shell with the -num_containers option.
> In ApplicationMaster.java, the following code has an issue:
> {code}
>   int numTotalContainersToRequest =
> numTotalContainers - previousAMRunningContainers.size();
> for (int i = 0; i < numTotalContainersToRequest; ++i) {
>   ContainerRequest containerAsk = setupContainerAskForRM();
>   amRMClient.addContainerRequest(containerAsk);
> }
> numRequestedContainers.set(numTotalContainersToRequest);
> {code}
> numRequestedContainers doesn't account for the previous AM's requested 
> containers, so numRequestedContainers should be set to numTotalContainers.
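A minimal sketch of the fix implied by that last sentence (illustrative only, reusing the variable names from the snippet above rather than quoting the attached patch):

{code}
int numTotalContainersToRequest =
    numTotalContainers - previousAMRunningContainers.size();
// Ask only for the containers the previous attempt is not already running...
for (int i = 0; i < numTotalContainersToRequest; ++i) {
  ContainerRequest containerAsk = setupContainerAskForRM();
  amRMClient.addContainerRequest(containerAsk);
}
// ...but count the inherited containers as already requested, so the AM does
// not re-request them later and over-allocate after a restart.
numRequestedContainers.set(numTotalContainers);
{code}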



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits

2014-07-25 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075220#comment-14075220
 ] 

Wangda Tan commented on YARN-2069:
--

Hi Mayank,
Thanks for your detailed explanation; I think I understand your approach.

However, I think the current way of computing the target user limit is not correct. 
Let me explain: basically, the {{computeTargetedUserLimit}} you created is modified 
from {{computeUserLimit}}, and it calculates as follows:
{code}
target_capacity = used_capacity - resToObtain
min(max(target_capacity / #active_user,
        target_capacity * user_limit_percent),
    target_capacity * user_limit_factor)
{code}
So when user_limit_percent is set to its default (100%), it is possible that 
target_user_limit * #active_user > queue_max_capacity.
In that case, every user's usage can be below target_user_limit while the usage of 
the queue is still larger than its guaranteed resource.

Let me give you an example
{code}
Assume queue capacity = 50, used_resource = 70, resToObtain = 20
So target_capacity = 50, there're 5 users in the queue
user_limit_percent = 100%, user_limit_factor = 1 (both are default)

So target_user_capacity = min(max(50 / 5, 50 * 100%), 50) = 50
User1 used 20
User2 used 10
User3 used 10
User4 used 20
User5 used 10

So every user's used capacity is < target_user_capacity
{code}

In the existing logic of {{balanceUserLimitsinQueueForPreemption}}:
{code}
  if (Resources.lessThan(rc, clusterResource, userLimitforQueue,
  userConsumedResource)) {
 // do preemption
  } else 
  continue;
{code}
If a user's used resource is < target_user_capacity, that user will not be preempted.
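For concreteness, here is an illustrative sketch of the computation being discussed, plugged with the example numbers above (the method and variable names are mine, not the patch's):

{code}
static int targetUserLimit(int usedCapacity, int resToObtain, int activeUsers,
    double userLimitPercent, double userLimitFactor) {
  int targetCapacity = usedCapacity - resToObtain;               // 70 - 20 = 50
  double perUser   = (double) targetCapacity / activeUsers;      // 50 / 5  = 10
  double byPercent = targetCapacity * userLimitPercent;          // 50 * 1.0 = 50
  double byFactor  = targetCapacity * userLimitFactor;           // 50 * 1.0 = 50
  return (int) Math.min(Math.max(perUser, byPercent), byFactor); // = 50
}
// Every user's usage is below 50, so nothing gets preempted even though the
// queue is 20 over its guaranteed capacity.
{code}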

Mayank, is that correct? Or did I misunderstand your logic? Please let me know 
your comments.

Thanks,
Wangda

> CS queue level preemption should respect user-limits
> 
>
> Key: YARN-2069
> URL: https://issues.apache.org/jira/browse/YARN-2069
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Mayank Bansal
> Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, 
> YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch, 
> YARN-2069-trunk-6.patch, YARN-2069-trunk-7.patch
>
>
> This is different from (even if related to, and likely shares code with) 
> YARN-2113.
> YARN-2113 focuses on making sure that even if a queue has its guaranteed 
> capacity, its individual users are treated in line with their limits 
> irrespective of when they join in.
> This JIRA is about respecting user-limits while preempting containers to 
> balance queue capacities.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1796) container-executor shouldn't require o-r permissions

2014-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075212#comment-14075212
 ] 

Hadoop QA commented on YARN-1796:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12633282/YARN-1796.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4445//console

This message is automatically generated.

> container-executor shouldn't require o-r permissions
> 
>
> Key: YARN-1796
> URL: https://issues.apache.org/jira/browse/YARN-1796
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.4.0
>Reporter: Aaron T. Myers
>Assignee: Aaron T. Myers
>Priority: Minor
> Attachments: YARN-1796.patch
>
>
> The container-executor currently checks that "other" users don't have read 
> permissions. This is unnecessary and runs contrary to the Debian packaging 
> policy manual.
> This is the analogous fix for YARN that was done for MR1 in MAPREDUCE-2103.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1726) ResourceSchedulerWrapper broken due to AbstractYarnScheduler

2014-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075213#comment-14075213
 ] 

Hadoop QA commented on YARN-1726:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12657962/YARN-1726-7-branch2.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4446//console

This message is automatically generated.

> ResourceSchedulerWrapper broken due to AbstractYarnScheduler
> 
>
> Key: YARN-1726
> URL: https://issues.apache.org/jira/browse/YARN-1726
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.1
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Blocker
> Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, 
> YARN-1726-6.patch, YARN-1726-7-branch2.patch, YARN-1726-7.patch, 
> YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch
>
>
> The YARN scheduler simulator failed when running Fair Scheduler, due to 
> AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper 
> should inherit AbstractYarnScheduler, instead of implementing 
> ResourceScheduler interface directly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1796) container-executor shouldn't require o-r permissions

2014-07-25 Thread Aaron T. Myers (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron T. Myers updated YARN-1796:
-

Target Version/s: 2.6.0  (was: 2.4.0)

> container-executor shouldn't require o-r permissions
> 
>
> Key: YARN-1796
> URL: https://issues.apache.org/jira/browse/YARN-1796
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.4.0
>Reporter: Aaron T. Myers
>Assignee: Aaron T. Myers
>Priority: Minor
> Attachments: YARN-1796.patch
>
>
> The container-executor currently checks that "other" users don't have read 
> permissions. This is unnecessary and runs contrary to the Debian packaging 
> policy manual.
> This is the analogous fix for YARN that was done for MR1 in MAPREDUCE-2103.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1726) ResourceSchedulerWrapper broken due to AbstractYarnScheduler

2014-07-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075203#comment-14075203
 ] 

Hudson commented on YARN-1726:
--

FAILURE: Integrated in Hadoop-trunk-Commit #5973 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5973/])
YARN-1726. ResourceSchedulerWrapper broken due to AbstractYarnScheduler. (Wei 
Yan via kasha) (kasha: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1613547)
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/AMSimulator.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/MRAMSimulator.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/nodemanager/NMSimulator.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/ResourceSchedulerWrapper.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSCapacityScheduler.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/TaskRunner.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/test/java/org/apache/hadoop/yarn/sls/TestSLSRunner.java
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt


> ResourceSchedulerWrapper broken due to AbstractYarnScheduler
> 
>
> Key: YARN-1726
> URL: https://issues.apache.org/jira/browse/YARN-1726
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.1
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Blocker
> Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, 
> YARN-1726-6.patch, YARN-1726-7-branch2.patch, YARN-1726-7.patch, 
> YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch
>
>
> The YARN scheduler simulator failed when running Fair Scheduler, due to 
> AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper 
> should inherit AbstractYarnScheduler, instead of implementing 
> ResourceScheduler interface directly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2361) remove duplicate entries (EXPIRE event) in the EnumSet of event type in RMAppAttempt state machine

2014-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075200#comment-14075200
 ] 

Hadoop QA commented on YARN-2361:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657941/YARN-2361.000.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4442//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4442//console

This message is automatically generated.

> remove duplicate entries (EXPIRE event) in the EnumSet of event type in 
> RMAppAttempt state machine
> --
>
> Key: YARN-2361
> URL: https://issues.apache.org/jira/browse/YARN-2361
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: zhihai xu
>Priority: Minor
> Attachments: YARN-2361.000.patch
>
>
> remove duplicate entries in the EnumSet of event type in RMAppAttempt state 
> machine. The  event RMAppAttemptEventType.EXPIRE is duplicated in the 
> following code.
> {code}
>   EnumSet.of(RMAppAttemptEventType.ATTEMPT_ADDED,
>   RMAppAttemptEventType.EXPIRE,
>   RMAppAttemptEventType.LAUNCHED,
>   RMAppAttemptEventType.LAUNCH_FAILED,
>   RMAppAttemptEventType.EXPIRE,
>   RMAppAttemptEventType.REGISTERED,
>   RMAppAttemptEventType.CONTAINER_ALLOCATED,
>   RMAppAttemptEventType.UNREGISTERED,
>   RMAppAttemptEventType.KILL,
>   RMAppAttemptEventType.STATUS_UPDATE))
> {code}
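Worth noting: EnumSet.of() simply collapses the repeated EXPIRE entry at runtime, so this is a readability cleanup rather than a behavior change. The deduplicated set would be:

{code}
  EnumSet.of(RMAppAttemptEventType.ATTEMPT_ADDED,
      RMAppAttemptEventType.EXPIRE,
      RMAppAttemptEventType.LAUNCHED,
      RMAppAttemptEventType.LAUNCH_FAILED,
      RMAppAttemptEventType.REGISTERED,
      RMAppAttemptEventType.CONTAINER_ALLOCATED,
      RMAppAttemptEventType.UNREGISTERED,
      RMAppAttemptEventType.KILL,
      RMAppAttemptEventType.STATUS_UPDATE))
{code}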



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1726) ResourceSchedulerWrapper broken due to AbstractYarnScheduler

2014-07-25 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-1726:
--

Attachment: YARN-1726-7-branch2.patch

Updated a patch for branch-2.

> ResourceSchedulerWrapper broken due to AbstractYarnScheduler
> 
>
> Key: YARN-1726
> URL: https://issues.apache.org/jira/browse/YARN-1726
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.1
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Blocker
> Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, 
> YARN-1726-6.patch, YARN-1726-7-branch2.patch, YARN-1726-7.patch, 
> YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch
>
>
> The YARN scheduler simulator failed when running Fair Scheduler, due to 
> AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper 
> should inherit AbstractYarnScheduler, instead of implementing 
> ResourceScheduler interface directly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1726) ResourceSchedulerWrapper broken due to AbstractYarnScheduler

2014-07-25 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1726:
---

Summary: ResourceSchedulerWrapper broken due to AbstractYarnScheduler  
(was: ResourceSchedulerWrapper failed due to the AbstractYarnScheduler 
introduced in YARN-1041)

> ResourceSchedulerWrapper broken due to AbstractYarnScheduler
> 
>
> Key: YARN-1726
> URL: https://issues.apache.org/jira/browse/YARN-1726
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.1
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Blocker
> Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, 
> YARN-1726-6.patch, YARN-1726-7.patch, YARN-1726.patch, YARN-1726.patch, 
> YARN-1726.patch, YARN-1726.patch
>
>
> The YARN scheduler simulator failed when running Fair Scheduler, due to 
> AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper 
> should inherit AbstractYarnScheduler, instead of implementing 
> ResourceScheduler interface directly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041

2014-07-25 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075187#comment-14075187
 ] 

Karthik Kambatla commented on YARN-1726:


Checking this in..

> ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced 
> in YARN-1041
> 
>
> Key: YARN-1726
> URL: https://issues.apache.org/jira/browse/YARN-1726
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.1
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Blocker
> Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, 
> YARN-1726-6.patch, YARN-1726-7.patch, YARN-1726.patch, YARN-1726.patch, 
> YARN-1726.patch, YARN-1726.patch
>
>
> The YARN scheduler simulator failed when running Fair Scheduler, due to 
> AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper 
> should inherit AbstractYarnScheduler, instead of implementing 
> ResourceScheduler interface directly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041

2014-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075185#comment-14075185
 ] 

Hadoop QA commented on YARN-1726:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657940/YARN-1726-7.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-tools/hadoop-sls.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4443//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4443//console

This message is automatically generated.

> ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced 
> in YARN-1041
> 
>
> Key: YARN-1726
> URL: https://issues.apache.org/jira/browse/YARN-1726
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.1
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Blocker
> Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, 
> YARN-1726-6.patch, YARN-1726-7.patch, YARN-1726.patch, YARN-1726.patch, 
> YARN-1726.patch, YARN-1726.patch
>
>
> The YARN scheduler simulator failed when running Fair Scheduler, due to 
> AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper 
> should inherit AbstractYarnScheduler, instead of implementing 
> ResourceScheduler interface directly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically

2014-07-25 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075175#comment-14075175
 ] 

Xuan Gong commented on YARN-2212:
-

The test case failure is not related.

> ApplicationMaster needs to find a way to update the AMRMToken periodically
> --
>
> Key: YARN-2212
> URL: https://issues.apache.org/jira/browse/YARN-2212
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-2212.1.patch, YARN-2212.2.patch, 
> YARN-2212.3.1.patch, YARN-2212.3.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically

2014-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075173#comment-14075173
 ] 

Hadoop QA commented on YARN-2212:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657937/YARN-2212.3.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4441//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4441//console

This message is automatically generated.

> ApplicationMaster needs to find a way to update the AMRMToken periodically
> --
>
> Key: YARN-2212
> URL: https://issues.apache.org/jira/browse/YARN-2212
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-2212.1.patch, YARN-2212.2.patch, 
> YARN-2212.3.1.patch, YARN-2212.3.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2209) Replace allocate#resync command with ApplicationMasterNotRegisteredException to indicate AM to re-register on RM restart

2014-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075172#comment-14075172
 ] 

Hadoop QA commented on YARN-2209:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657938/YARN-2209.4.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 7 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4440//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4440//console

This message is automatically generated.

> Replace allocate#resync command with ApplicationMasterNotRegisteredException 
> to indicate AM to re-register on RM restart
> 
>
> Key: YARN-2209
> URL: https://issues.apache.org/jira/browse/YARN-2209
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, 
> YARN-2209.4.patch
>
>
> YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate that 
> the application should re-register on RM restart. We should do the same for the 
> AMS#allocate call as well.
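A hedged sketch of the AM-side handling this implies (the client and request variable names here are placeholders, not taken from the actual patch):

{code}
// Instead of checking for AMCommand.AM_RESYNC in the allocate response, the AM
// reacts to the exception thrown by allocate() after an RM restart.
try {
  allocateResponse = amrmProtocol.allocate(allocateRequest);
} catch (ApplicationMasterNotRegisteredException e) {
  // The restarted RM no longer knows this attempt: register again, then retry.
  amrmProtocol.registerApplicationMaster(registerRequest);
  allocateResponse = amrmProtocol.allocate(allocateRequest);
}
{code}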



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041

2014-07-25 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075166#comment-14075166
 ] 

Karthik Kambatla commented on YARN-1726:


+1 pending Jenkins. 

> ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced 
> in YARN-1041
> 
>
> Key: YARN-1726
> URL: https://issues.apache.org/jira/browse/YARN-1726
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.1
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Blocker
> Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, 
> YARN-1726-6.patch, YARN-1726-7.patch, YARN-1726.patch, YARN-1726.patch, 
> YARN-1726.patch, YARN-1726.patch
>
>
> The YARN scheduler simulator failed when running Fair Scheduler, due to 
> AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper 
> should inherit AbstractYarnScheduler, instead of implementing 
> ResourceScheduler interface directly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2354) DistributedShell may allocate more containers than client specified after it restarts

2014-07-25 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075165#comment-14075165
 ] 

Jian He commented on YARN-2354:
---

Looks good. Thanks for working on the patch, Li!

> DistributedShell may allocate more containers than client specified after it 
> restarts
> -
>
> Key: YARN-2354
> URL: https://issues.apache.org/jira/browse/YARN-2354
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jian He
>Assignee: Li Lu
> Attachments: YARN-2354-072514.patch
>
>
> To reproduce, run distributed shell with the -num_containers option.
> In ApplicationMaster.java, the following code has an issue:
> {code}
>   int numTotalContainersToRequest =
> numTotalContainers - previousAMRunningContainers.size();
> for (int i = 0; i < numTotalContainersToRequest; ++i) {
>   ContainerRequest containerAsk = setupContainerAskForRM();
>   amRMClient.addContainerRequest(containerAsk);
> }
> numRequestedContainers.set(numTotalContainersToRequest);
> {code}
> numRequestedContainers doesn't account for the previous AM's requested 
> containers, so numRequestedContainers should be set to numTotalContainers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2354) DistributedShell may allocate more containers than client specified after it restarts

2014-07-25 Thread Li Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Lu updated YARN-2354:


Attachment: YARN-2354-072514.patch

The problem was with numRequestedContainers. In the previous version it was 
initially set to numTotalContainers - previousAMRunningContainers.size(). Then, on 
container completion, the number of containers that need to be relaunched is 
calculated as numTotalContainers - numRequestedContainers, which normally equals 
previousAMRunningContainers.size(). If containers are not reused (no 
-keep_containers_across_application_attempts), there are no 
previousAMRunningContainers, so this problem only occurs when 
-keep_containers_across_application_attempts is set. 

I'm also fixing the testDSRestartWithPreviousRunningContainers UT associated 
with this issue. 
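For reference, a sketch of the completion-handling path described above (names follow the code quoted in the issue; this is illustrative, not the attached patch). Once numRequestedContainers is seeded with numTotalContainers, this path only replaces containers that actually completed instead of re-asking for the ones inherited from the previous attempt:

{code}
// On container completion: compute how many containers still need to be
// (re)launched and ask the RM for just that many.
int askCount = numTotalContainers - numRequestedContainers.get();
numRequestedContainers.addAndGet(askCount);
for (int i = 0; i < askCount; ++i) {
  ContainerRequest containerAsk = setupContainerAskForRM();
  amRMClient.addContainerRequest(containerAsk);
}
{code}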

> DistributedShell may allocate more containers than client specified after it 
> restarts
> -
>
> Key: YARN-2354
> URL: https://issues.apache.org/jira/browse/YARN-2354
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jian He
>Assignee: Li Lu
> Attachments: YARN-2354-072514.patch
>
>
> To reproduce, run distributed shell with the -num_containers option.
> In ApplicationMaster.java, the following code has an issue:
> {code}
>   int numTotalContainersToRequest =
> numTotalContainers - previousAMRunningContainers.size();
> for (int i = 0; i < numTotalContainersToRequest; ++i) {
>   ContainerRequest containerAsk = setupContainerAskForRM();
>   amRMClient.addContainerRequest(containerAsk);
> }
> numRequestedContainers.set(numTotalContainersToRequest);
> {code}
> numRequestedContainers doesn't account for the previous AM's requested 
> containers, so numRequestedContainers should be set to numTotalContainers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2361) remove duplicate entries (EXPIRE event) in the EnumSet of event type in RMAppAttempt state machine

2014-07-25 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2361:


Attachment: YARN-2361.000.patch

> remove duplicate entries (EXPIRE event) in the EnumSet of event type in 
> RMAppAttempt state machine
> --
>
> Key: YARN-2361
> URL: https://issues.apache.org/jira/browse/YARN-2361
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: zhihai xu
>Priority: Minor
> Attachments: YARN-2361.000.patch
>
>
> remove duplicate entries in the EnumSet of event type in RMAppAttempt state 
> machine. The  event RMAppAttemptEventType.EXPIRE is duplicated in the 
> following code.
> {code}
>   EnumSet.of(RMAppAttemptEventType.ATTEMPT_ADDED,
>   RMAppAttemptEventType.EXPIRE,
>   RMAppAttemptEventType.LAUNCHED,
>   RMAppAttemptEventType.LAUNCH_FAILED,
>   RMAppAttemptEventType.EXPIRE,
>   RMAppAttemptEventType.REGISTERED,
>   RMAppAttemptEventType.CONTAINER_ALLOCATED,
>   RMAppAttemptEventType.UNREGISTERED,
>   RMAppAttemptEventType.KILL,
>   RMAppAttemptEventType.STATUS_UPDATE))
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2361) remove duplicate entries (EXPIRE event) in the EnumSet of event type in RMAppAttempt state machine

2014-07-25 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2361:


Component/s: resourcemanager

> remove duplicate entries (EXPIRE event) in the EnumSet of event type in 
> RMAppAttempt state machine
> --
>
> Key: YARN-2361
> URL: https://issues.apache.org/jira/browse/YARN-2361
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: zhihai xu
>Priority: Minor
> Attachments: YARN-2361.000.patch
>
>
> remove duplicate entries in the EnumSet of event type in RMAppAttempt state 
> machine. The  event RMAppAttemptEventType.EXPIRE is duplicated in the 
> following code.
> {code}
>   EnumSet.of(RMAppAttemptEventType.ATTEMPT_ADDED,
>   RMAppAttemptEventType.EXPIRE,
>   RMAppAttemptEventType.LAUNCHED,
>   RMAppAttemptEventType.LAUNCH_FAILED,
>   RMAppAttemptEventType.EXPIRE,
>   RMAppAttemptEventType.REGISTERED,
>   RMAppAttemptEventType.CONTAINER_ALLOCATED,
>   RMAppAttemptEventType.UNREGISTERED,
>   RMAppAttemptEventType.KILL,
>   RMAppAttemptEventType.STATUS_UPDATE))
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2361) remove duplicate entries (EXPIRE event) in the EnumSet of event type in RMAppAttempt state machine

2014-07-25 Thread zhihai xu (JIRA)
zhihai xu created YARN-2361:
---

 Summary: remove duplicate entries (EXPIRE event) in the EnumSet of 
event type in RMAppAttempt state machine
 Key: YARN-2361
 URL: https://issues.apache.org/jira/browse/YARN-2361
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: zhihai xu
Priority: Minor
 Attachments: YARN-2361.000.patch

remove duplicate entries in the EnumSet of event type in RMAppAttempt state 
machine. The  event RMAppAttemptEventType.EXPIRE is duplicated in the following 
code.
{code}
  EnumSet.of(RMAppAttemptEventType.ATTEMPT_ADDED,
  RMAppAttemptEventType.EXPIRE,
  RMAppAttemptEventType.LAUNCHED,
  RMAppAttemptEventType.LAUNCH_FAILED,
  RMAppAttemptEventType.EXPIRE,
  RMAppAttemptEventType.REGISTERED,
  RMAppAttemptEventType.CONTAINER_ALLOCATED,
  RMAppAttemptEventType.UNREGISTERED,
  RMAppAttemptEventType.KILL,
  RMAppAttemptEventType.STATUS_UPDATE))
{code}




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041

2014-07-25 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-1726:
--

Attachment: YARN-1726-7.patch

Updated a patch to address the comments.

> ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced 
> in YARN-1041
> 
>
> Key: YARN-1726
> URL: https://issues.apache.org/jira/browse/YARN-1726
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.1
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Blocker
> Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, 
> YARN-1726-6.patch, YARN-1726-7.patch, YARN-1726.patch, YARN-1726.patch, 
> YARN-1726.patch, YARN-1726.patch
>
>
> The YARN scheduler simulator failed when running Fair Scheduler, due to 
> AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper 
> should inherit AbstractYarnScheduler, instead of implementing 
> ResourceScheduler interface directly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2209) Replace allocate#resync command with ApplicationMasterNotRegisteredException to indicate AM to re-register on RM restart

2014-07-25 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075125#comment-14075125
 ] 

Jian He commented on YARN-2209:
---

Thanks for the review, Rohith. 
Uploaded a new patch which fixed the above comments.

> Replace allocate#resync command with ApplicationMasterNotRegisteredException 
> to indicate AM to re-register on RM restart
> 
>
> Key: YARN-2209
> URL: https://issues.apache.org/jira/browse/YARN-2209
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, 
> YARN-2209.4.patch
>
>
> YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate that 
> the application should re-register on RM restart. We should do the same for the 
> AMS#allocate call as well.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically

2014-07-25 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-2212:


Attachment: YARN-2212.3.1.patch

> ApplicationMaster needs to find a way to update the AMRMToken periodically
> --
>
> Key: YARN-2212
> URL: https://issues.apache.org/jira/browse/YARN-2212
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-2212.1.patch, YARN-2212.2.patch, 
> YARN-2212.3.1.patch, YARN-2212.3.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2209) Replace allocate#resync command with ApplicationMasterNotRegisteredException to indicate AM to re-register on RM restart

2014-07-25 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2209:
--

Attachment: YARN-2209.4.patch

> Replace allocate#resync command with ApplicationMasterNotRegisteredException 
> to indicate AM to re-register on RM restart
> 
>
> Key: YARN-2209
> URL: https://issues.apache.org/jira/browse/YARN-2209
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, 
> YARN-2209.4.patch
>
>
> YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate that 
> the application should re-register on RM restart. We should do the same for the 
> AMS#allocate call as well.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic

2014-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075097#comment-14075097
 ] 

Hadoop QA commented on YARN-1707:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657926/YARN-1707.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:red}-1 javac{color}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4439//console

This message is automatically generated.

> Making the CapacityScheduler more dynamic
> -
>
> Key: YARN-1707
> URL: https://issues.apache.org/jira/browse/YARN-1707
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Carlo Curino
>Assignee: Carlo Curino
>  Labels: capacity-scheduler
> Attachments: YARN-1707.patch
>
>
> The CapacityScheduler is rather static at the moment, and refreshqueue 
> provides a rather heavy-handed way to reconfigure it. Moving towards 
> long-running services (tracked in YARN-896) and to enable more advanced 
> admission control and resource parcelling, we need to make the 
> CapacityScheduler more dynamic. This is instrumental to the umbrella JIRA 
> YARN-1051.
> Concretely this requires the following changes:
> * create queues dynamically
> * destroy queues dynamically
> * dynamically change queue parameters (e.g., capacity)
> * modify refreshqueue validation to enforce sum(child.getCapacity()) <= 100% 
> instead of == 100%
> We limit this to LeafQueues.
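A hedged sketch of the relaxed capacity validation mentioned in the list above (the method name and structure are assumptions, not taken from the attached patch):

{code}
// Children of a parent queue may now sum to less than 100% capacity, leaving
// headroom for queues added dynamically later; only exceeding 100% is rejected.
static void validateChildCapacities(java.util.List<CSQueue> children) {
  final float EPSILON = 1e-5f;
  float sum = 0f;
  for (CSQueue child : children) {
    sum += child.getCapacity();   // capacities assumed to be fractions in [0, 1]
  }
  if (sum > 1.0f + EPSILON) {     // previously: Math.abs(sum - 1.0f) > EPSILON
    throw new IllegalArgumentException(
        "Child capacities of a parent queue must not exceed 100%, got " + sum);
  }
}
{code}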



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios

2014-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075084#comment-14075084
 ] 

Hadoop QA commented on YARN-2026:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657909/YARN-2026-v3.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4438//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4438//console

This message is automatically generated.

> Fair scheduler : Fair share for inactive queues causes unfair allocation in 
> some scenarios
> --
>
> Key: YARN-2026
> URL: https://issues.apache.org/jira/browse/YARN-2026
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Reporter: Ashwin Shankar
>Assignee: Ashwin Shankar
>  Labels: scheduler
> Attachments: YARN-2026-v1.txt, YARN-2026-v2.txt, YARN-2026-v3.txt
>
>
> Problem 1 - While using hierarchical queues in the fair scheduler, there are a few 
> scenarios where we have seen a leaf queue with the least fair share take the 
> majority of the cluster and starve a sibling parent queue which has a greater 
> weight/fair share, and preemption doesn't kick in to reclaim resources.
> The root cause seems to be that the fair share of a parent queue is distributed 
> to all its children irrespective of whether each child is an active or an 
> inactive (no apps running) queue. Preemption based on fair share kicks in only 
> if the usage of a queue is less than 50% of its fair share and it has demands 
> greater than that. When there are many queues under a parent queue (with a high 
> fair share), each child queue's fair share becomes really low. As a result, when 
> only a few of these child queues have apps running, they reach their *tiny* fair 
> share quickly and preemption doesn't happen even if other (non-sibling) leaf 
> queues are hogging the cluster.
> This can be solved by dividing the fair share of a parent queue only among its 
> active child queues.
> Here is an example describing the problem and the proposed solution:
> root.lowPriorityQueue is a leaf queue with weight 2
> root.HighPriorityQueue is a parent queue with weight 8
> root.HighPriorityQueue has 10 child leaf queues: 
> root.HighPriorityQueue.childQ(1..10)
> The above config results in root.HighPriorityQueue having an 80% fair share, 
> and each of its ten child queues would have an 8% fair share. Preemption would 
> happen only if a child queue is <4% (0.5*8=4).
> Let's say at the moment no apps are running in any of 
> root.HighPriorityQueue.childQ(1..10) and a few apps are running in 
> root.lowPriorityQueue, which is taking up 95% of the cluster.
> Up to this point, the behavior of FS is correct.
> Now, let's say root.HighPriorityQueue.childQ1 gets a big job which requires 30% 
> of the cluster. It would get only the available 5% of the cluster, and 
> preemption wouldn't kick in since it is above 4% (half its fair share). This is 
> bad considering childQ1 is under a high-priority parent queue which has an *80% 
> fair share*.
> Until root.lowPriorityQueue starts relinquishing containers, we would see the 
> following allocation on the scheduler page:
> *root.lowPriorityQueue = 95%*
> *root.HighPriorityQueue.childQ1 = 5%*
> This can be solved by distributing a parent's fair share only to active 
> queues.
> So in the example above, since childQ1 is the only active queue 
> under root.HighPriorityQueue, it would get all of its parent's fair share, i.e. 
> 80%.
> This would cause preemption to reclaim the 30% needed by childQ1 from 
> root.lowPriorityQueue after fairSharePreemptionTimeout seconds.
> Problem 2 - Also note that a similar situation can happen between 
> root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2, if childQ2 
> hogs the

[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic

2014-07-25 Thread Carlo Curino (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075076#comment-14075076
 ] 

Carlo Curino commented on YARN-1707:


The attached patch is part of the YARN-1051 effort. As with the other patches in 
this series, it does not work on its own but has been cut out for ease of reviewing.

Given previous discussions, we introduced subclasses of ParentQueue and 
LeafQueue that are dynamically addable/removable/resizable, 
as well as changes in the CapacityScheduler to support the "move" of 
applications across queues. These are core features; we tested them on a cluster 
running lots of gridmix and manual jobs, and they seem to work fine, but I am sure 
there are corner cases and possibly metrics that are not updated 
correctly in all cases. We should also create a new set of tests for the 
dynamic behavior of the CapacityScheduler.



> Making the CapacityScheduler more dynamic
> -
>
> Key: YARN-1707
> URL: https://issues.apache.org/jira/browse/YARN-1707
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Carlo Curino
>Assignee: Carlo Curino
> Attachments: YARN-1707.patch
>
>
> The CapacityScheduler is rather static at the moment, and refreshqueue 
> provides a rather heavy-handed way to reconfigure it. Moving towards 
> long-running services (tracked in YARN-896) and to enable more advanced 
> admission control and resource parcelling, we need to make the 
> CapacityScheduler more dynamic. This is instrumental to the umbrella JIRA 
> YARN-1051.
> Concretely this requires the following changes:
> * create queues dynamically
> * destroy queues dynamically
> * dynamically change queue parameters (e.g., capacity)
> * modify refreshqueue validation to enforce sum(child.getCapacity()) <= 100% 
> instead of == 100%
> We limit this to LeafQueues.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically

2014-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075073#comment-14075073
 ] 

Hadoop QA commented on YARN-2212:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657905/YARN-2212.3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4437//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4437//console

This message is automatically generated.

> ApplicationMaster needs to find a way to update the AMRMToken periodically
> --
>
> Key: YARN-2212
> URL: https://issues.apache.org/jira/browse/YARN-2212
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-2212.1.patch, YARN-2212.2.patch, YARN-2212.3.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1707) Making the CapacityScheduler more dynamic

2014-07-25 Thread Carlo Curino (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlo Curino updated YARN-1707:
---

Attachment: YARN-1707.patch

> Making the CapacityScheduler more dynamic
> -
>
> Key: YARN-1707
> URL: https://issues.apache.org/jira/browse/YARN-1707
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Carlo Curino
>Assignee: Carlo Curino
> Attachments: YARN-1707.patch
>
>
> The CapacityScheduler is rather static at the moment, and refreshqueue 
> provides a rather heavy-handed way to reconfigure it. Moving towards 
> long-running services (tracked in YARN-896) and to enable more advanced 
> admission control and resource parcelling we need to make the 
> CapacityScheduler more dynamic. This is instrumental to the umbrella jira 
> YARN-1051.
> Concretely this requires the following changes:
> * create queues dynamically
> * destroy queues dynamically
> * dynamically change queue parameters (e.g., capacity) 
> * modify refreshqueue validation to enforce sum(child.getCapacity())<= 100% 
> instead of ==100%
> We limit this to LeafQueues. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios

2014-07-25 Thread Ashwin Shankar (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075011#comment-14075011
 ] 

Ashwin Shankar commented on YARN-2026:
--

Incorporated [~kasha]'s suggestion of having two notions of fairness.
Also incorporated [~sandyr]'s unit test comments.
Please let me know if you have any other comments.

Created YARN-2360 to deal with the UI changes for displaying dynamic fair share on the 
scheduler page.
I have not added dynamic fair share to FSQueueMetrics. Could you please let me 
know how these metrics are used and whether we want to add dynamic fair share there?
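
As an illustration of the dynamic notion (a minimal, self-contained sketch with 
hypothetical names, not the actual FairScheduler code): a parent's share is divided by 
weight only among children that currently have running apps, so a single active child 
can receive the parent's full share.
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "dynamic" fair share: a parent's share is split
// by weight among active children only, instead of among all children.
class DynamicFairShare {
  static Map<String, Double> compute(double parentShare,
                                     Map<String, Double> childWeights,
                                     List<String> activeChildren) {
    double activeWeight = 0.0;
    for (String child : activeChildren) {
      activeWeight += childWeights.get(child);
    }
    Map<String, Double> shares = new HashMap<String, Double>();
    for (String child : childWeights.keySet()) {
      if (activeChildren.contains(child) && activeWeight > 0.0) {
        shares.put(child, parentShare * childWeights.get(child) / activeWeight);
      } else {
        shares.put(child, 0.0); // inactive queues get no dynamic fair share
      }
    }
    return shares;
  }

  public static void main(String[] args) {
    Map<String, Double> weights = new HashMap<String, Double>();
    for (int i = 1; i <= 10; i++) {
      weights.put("childQ" + i, 1.0);
    }
    // Only childQ1 is active, so it receives the parent's full 80%.
    List<String> active = new ArrayList<String>();
    active.add("childQ1");
    System.out.println(DynamicFairShare.compute(80.0, weights, active));
  }
}
{code}
With the example from this JIRA (ten equally weighted children under an 80% parent and 
only childQ1 active), the sketch assigns 80.0 to childQ1 and 0.0 to the rest.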

> Fair scheduler : Fair share for inactive queues causes unfair allocation in 
> some scenarios
> --
>
> Key: YARN-2026
> URL: https://issues.apache.org/jira/browse/YARN-2026
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Reporter: Ashwin Shankar
>Assignee: Ashwin Shankar
>  Labels: scheduler
> Attachments: YARN-2026-v1.txt, YARN-2026-v2.txt, YARN-2026-v3.txt
>
>
> Problem 1 - While using hierarchical queues in the fair scheduler, there are a few 
> scenarios where we have seen a leaf queue with the least fair share take the 
> majority of the cluster and starve a sibling parent queue that has a greater 
> weight/fair share, and preemption doesn't kick in to reclaim resources.
> The root cause seems to be that the fair share of a parent queue is distributed 
> to all its children irrespective of whether a child is active or inactive (no 
> apps running). Preemption based on fair share kicks in only if the 
> usage of a queue is less than 50% of its fair share and it has demands 
> greater than that. When there are many queues under a parent queue (with a high 
> fair share), the child queues' fair shares become really low. As a result, when 
> only a few of these child queues have apps running, they reach their *tiny* fair 
> shares quickly and preemption doesn't happen even if other leaf 
> queues (non-siblings) are hogging the cluster.
> This can be solved by dividing the parent queue's fair share only among active 
> child queues.
> Here is an example describing the problem and the proposed solution:
> root.lowPriorityQueue is a leaf queue with weight 2
> root.HighPriorityQueue is a parent queue with weight 8
> root.HighPriorityQueue has 10 child leaf queues: 
> root.HighPriorityQueue.childQ(1..10)
> The above config results in root.HighPriorityQueue having an 80% fair share,
> and each of its ten child queues would have an 8% fair share. Preemption would 
> happen only if a child queue is <4% (0.5*8=4). 
> Let's say at the moment no apps are running in any of 
> root.HighPriorityQueue.childQ(1..10) and a few apps are running in 
> root.lowPriorityQueue, which is taking up 95% of the cluster.
> Up to this point, the behavior of FS is correct.
> Now, let's say root.HighPriorityQueue.childQ1 gets a big job which requires 30% 
> of the cluster. It would get only the available 5% of the cluster, and 
> preemption wouldn't kick in since it's above 4% (half its fair share). This is bad 
> considering childQ1 is under a high-priority parent queue which has an *80% fair 
> share*.
> Until root.lowPriorityQueue starts relinquishing containers, we would see the 
> following allocation on the scheduler page:
> *root.lowPriorityQueue = 95%*
> *root.HighPriorityQueue.childQ1 = 5%*
> This can be solved by distributing a parent's fair share only to active 
> queues.
> So in the example above, since childQ1 is the only active queue
> under root.HighPriorityQueue, it would get all of its parent's fair share, i.e. 
> 80%.
> This would cause preemption to reclaim the 30% needed by childQ1 from 
> root.lowPriorityQueue after fairSharePreemptionTimeout seconds.
> Problem 2 - Also note that a similar situation can happen between 
> root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2, if childQ2 
> hogs the cluster. childQ2 can take up 95% of the cluster and childQ1 would be stuck 
> at 5%, until childQ2 starts relinquishing containers. We would like each of 
> childQ1 and childQ2 to get half of root.HighPriorityQueue's fair share, i.e. 
> 40%, which would ensure childQ1 gets up to 40% of resources if needed, through 
> preemption.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-07-25 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2359:


Description: 
Application is hung without timeout and retry after DNS/network is down. 
It is because right after the container is allocated for the AM, the 
DNS/network is down for the node which has the AM container.
The application attempt is at state RMAppAttemptState.SCHEDULED, it receive 
RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the 
IllegalArgumentException(due to DNS error) happened, it stay at state 
RMAppAttemptState.SCHEDULED. In the state machine, only two events will be 
processed at this state:
RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
The code didn't handle the event(RMAppAttemptEventType.CONTAINER_FINISHED) 
which will be generated when the node and container timeout. So even the node 
is removed, the Application is still hung in this state 
RMAppAttemptState.SCHEDULED.
The only way to make the application exit this state is to send 
RMAppAttemptEventType.KILL event which will only be generated when you manually 
kill the application from Job Client by forceKillApplication.

To fix the issue, we should add an entry in the state machine table to handle 
RMAppAttemptEventType.CONTAINER_FINISHED event at state 
RMAppAttemptState.SCHEDULED
add the following code in StateMachineFactory:
{code}.addTransition(RMAppAttemptState.SCHEDULED, 
  RMAppAttemptState.FINAL_SAVING,
  RMAppAttemptEventType.CONTAINER_FINISHED,
  new FinalSavingTransition(
new AMContainerCrashedBeforeRunningTransition(), 
RMAppAttemptState.FAILED)){code}

  was:
Application is hung without timeout and retry after DNS/network is down. 
It is because right after the container is allocated for the AM, the 
DNS/network is down for the node which has the AM container.
The application attempt is at state RMAppAttemptState.SCHEDULED, it receive 
RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the 
IllegalArgumentException(due to DNS error) happened, it stay at state 
RMAppAttemptState.SCHEDULED. In the state machine, only two events will be 
processed at this state:
RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
The code didn't handle any event(RMAppAttemptEventType.CONTAINER_FINISHED) 
which will be generated by the node and container timeout. So even the node is 
removed, the Application is still hung in this state 
RMAppAttemptState.SCHEDULED.
The only way to make the application exit this state is to send 
RMAppAttemptEventType.KILL event which will only be generated when you manually 
kill the application from Job Client by forceKillApplication.

To fix the issue, we should add an entry in the state machine table to handle 
RMAppAttemptEventType.CONTAINER_FINISHED event at state 
RMAppAttemptState.SCHEDULED
add the following code in StateMachineFactory:
{code}.addTransition(RMAppAttemptState.SCHEDULED, 
  RMAppAttemptState.FINAL_SAVING,
  RMAppAttemptEventType.CONTAINER_FINISHED,
  new FinalSavingTransition(
new AMContainerCrashedBeforeRunningTransition(), 
RMAppAttemptState.FAILED)){code}


> Application is hung without timeout and retry after DNS/network is down. 
> -
>
> Key: YARN-2359
> URL: https://issues.apache.org/jira/browse/YARN-2359
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-2359.000.patch
>
>
> Application is hung without timeout and retry after DNS/network is down. 
> It is because right after the container is allocated for the AM, the 
> DNS/network is down for the node which has the AM container.
> The application attempt is at state RMAppAttemptState.SCHEDULED, it receive 
> RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the 
> IllegalArgumentException(due to DNS error) happened, it stay at state 
> RMAppAttemptState.SCHEDULED. In the state machine, only two events will be 
> processed at this state:
> RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
> The code didn't handle the event(RMAppAttemptEventType.CONTAINER_FINISHED) 
> which will be generated when the node and container timeout. So even the node 
> is removed, the Application is still hung in this state 
> RMAppAttemptState.SCHEDULED.
> The only way to make the application exit this state is to send 
> RMAppAttemptEventType.KILL event which will only be generated when you 
> manually kill the application from Job Client by forceKillApplication.
> To fix the issue, we should add an entry in the state machine table to handle 
> RMAppAttemptEventType.CONTAINER_FINISHED event at

[jira] [Commented] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041

2014-07-25 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074989#comment-14074989
 ] 

Karthik Kambatla commented on YARN-1726:


Comments on the trunk patch:

# NMSimulator: Maybe change this signature to match AMSimulator and {{throws 
Exception}}
{code}
  public void middleStep()
  throws YarnException, InterruptedException, IOException {
{code}
# Would the following affect performance? Is there a better alternative, maybe 
wait-notify (see the sketch after this list)?
{code}
while (rmAppAttempt.getAppAttemptState() != RMAppAttemptState.LAUNCHED) {
  Thread.sleep(50);
{code}
# In the test, remove the space in {{count --}}. Also, is there a 
reason we have to wait for 45 seconds? Can we use a MockClock to speed this 
test up?
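
For the second point, here is a minimal, self-contained sketch (hypothetical names, not 
the SLS code) of the wait/notify-style alternative, expressed with a CountDownLatch: the 
code that performs the state transition signals once, and the simulator thread blocks on 
the latch instead of polling every 50 ms.
{code}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: replace "poll the state, sleep 50 ms" with a latch that
// the transition code counts down once the attempt reaches LAUNCHED.
class LaunchedWaiter {
  private final CountDownLatch launched = new CountDownLatch(1);

  // Called by whatever code performs the transition to LAUNCHED.
  void onLaunched() {
    launched.countDown();
  }

  // Called by the simulator thread instead of the sleep loop; returns false on timeout.
  boolean awaitLaunched(long timeout, TimeUnit unit) throws InterruptedException {
    return launched.await(timeout, unit);
  }
}
{code}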

> ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced 
> in YARN-1041
> 
>
> Key: YARN-1726
> URL: https://issues.apache.org/jira/browse/YARN-1726
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.1
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Blocker
> Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, 
> YARN-1726-6.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, 
> YARN-1726.patch
>
>
> The YARN scheduler simulator failed when running Fair Scheduler, due to 
> AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper 
> should inherit AbstractYarnScheduler, instead of implementing 
> ResourceScheduler interface directly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios

2014-07-25 Thread Ashwin Shankar (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashwin Shankar updated YARN-2026:
-

Attachment: YARN-2026-v3.txt

> Fair scheduler : Fair share for inactive queues causes unfair allocation in 
> some scenarios
> --
>
> Key: YARN-2026
> URL: https://issues.apache.org/jira/browse/YARN-2026
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Reporter: Ashwin Shankar
>Assignee: Ashwin Shankar
>  Labels: scheduler
> Attachments: YARN-2026-v1.txt, YARN-2026-v2.txt, YARN-2026-v3.txt
>
>
> Problem 1 - While using hierarchical queues in the fair scheduler, there are a few 
> scenarios where we have seen a leaf queue with the least fair share take the 
> majority of the cluster and starve a sibling parent queue that has a greater 
> weight/fair share, and preemption doesn't kick in to reclaim resources.
> The root cause seems to be that the fair share of a parent queue is distributed 
> to all its children irrespective of whether a child is active or inactive (no 
> apps running). Preemption based on fair share kicks in only if the 
> usage of a queue is less than 50% of its fair share and it has demands 
> greater than that. When there are many queues under a parent queue (with a high 
> fair share), the child queues' fair shares become really low. As a result, when 
> only a few of these child queues have apps running, they reach their *tiny* fair 
> shares quickly and preemption doesn't happen even if other leaf 
> queues (non-siblings) are hogging the cluster.
> This can be solved by dividing the parent queue's fair share only among active 
> child queues.
> Here is an example describing the problem and the proposed solution:
> root.lowPriorityQueue is a leaf queue with weight 2
> root.HighPriorityQueue is a parent queue with weight 8
> root.HighPriorityQueue has 10 child leaf queues: 
> root.HighPriorityQueue.childQ(1..10)
> The above config results in root.HighPriorityQueue having an 80% fair share,
> and each of its ten child queues would have an 8% fair share. Preemption would 
> happen only if a child queue is <4% (0.5*8=4). 
> Let's say at the moment no apps are running in any of 
> root.HighPriorityQueue.childQ(1..10) and a few apps are running in 
> root.lowPriorityQueue, which is taking up 95% of the cluster.
> Up to this point, the behavior of FS is correct.
> Now, let's say root.HighPriorityQueue.childQ1 gets a big job which requires 30% 
> of the cluster. It would get only the available 5% of the cluster, and 
> preemption wouldn't kick in since it's above 4% (half its fair share). This is bad 
> considering childQ1 is under a high-priority parent queue which has an *80% fair 
> share*.
> Until root.lowPriorityQueue starts relinquishing containers, we would see the 
> following allocation on the scheduler page:
> *root.lowPriorityQueue = 95%*
> *root.HighPriorityQueue.childQ1 = 5%*
> This can be solved by distributing a parent's fair share only to active 
> queues.
> So in the example above, since childQ1 is the only active queue
> under root.HighPriorityQueue, it would get all of its parent's fair share, i.e. 
> 80%.
> This would cause preemption to reclaim the 30% needed by childQ1 from 
> root.lowPriorityQueue after fairSharePreemptionTimeout seconds.
> Problem 2 - Also note that a similar situation can happen between 
> root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2, if childQ2 
> hogs the cluster. childQ2 can take up 95% of the cluster and childQ1 would be stuck 
> at 5%, until childQ2 starts relinquishing containers. We would like each of 
> childQ1 and childQ2 to get half of root.HighPriorityQueue's fair share, i.e. 
> 40%, which would ensure childQ1 gets up to 40% of resources if needed, through 
> preemption.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-07-25 Thread Ashwin Shankar (JIRA)
Ashwin Shankar created YARN-2360:


 Summary: Fair Scheduler : Display dynamic fair share for queues on 
the scheduler page
 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar


Based on the discussion in YARN-2026,  we'd like to display dynamic fair share 
for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits

2014-07-25 Thread Mayank Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074940#comment-14074940
 ] 

Mayank Bansal commented on YARN-2069:
-

Hi [~wangda],

Thanks for the review.

Let me explain what this algorithm is doing.

Let's say you have queueA in your cluster with 30% capacity allocated to it.
Queue A is currently using 50% of the resources and has 5 users with a 20% user 
limit, which means each user is using 10% of the cluster's capacity.

Now there is another queue, queueB, with an allocated capacity of 70%.
The used capacity of queue B is 50%, and another application that needs 10% 
capacity gets submitted to queue B.

That 10% of capacity now has to be claimed back from queue A.
So the resources to obtain = 10%.
The targeted user limit will be 8% (this is always calculated from how much 
we need to claim back from the users).

Based on the current algorithm, it will take 2% of the resources from every 
user and leave the remaining balance with each user.
This also holds when the users are not all using the same amount of resources: 
the algorithm takes more from the users that are using more, balancing them 
down toward the targeted user limit, as in the sketch below.

Another thing this algorithm does is preempt from the application that was 
submitted last: if user1 has 2 applications, it will try to take the maximum 
number of containers from the last submitted application while leaving the AM 
container behind; the user limit, however, is honored across all of the user's 
applications in the queue combined.

The algorithm does not remove AM containers unless absolutely needed: it first 
takes all the task containers and only then considers AM containers for 
preemption.
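
A minimal, self-contained sketch of that per-user balancing step (hypothetical names, 
not the actual preemption code): every user above the targeted user limit gives back its 
excess, so heavier users give back more and users at or below the target are untouched.
{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: given per-user usage (as % of the cluster) and a
// targeted user limit, compute how much to preempt from each user.
class UserLimitBalancer {
  static Map<String, Double> toPreempt(Map<String, Double> usage, double targetUserLimit) {
    Map<String, Double> preempt = new HashMap<String, Double>();
    for (Map.Entry<String, Double> e : usage.entrySet()) {
      double over = e.getValue() - targetUserLimit;
      preempt.put(e.getKey(), Math.max(0.0, over)); // only users above the target give back
    }
    return preempt;
  }

  public static void main(String[] args) {
    // The example above: 5 users at 10% each, targeted user limit 8%.
    Map<String, Double> usage = new HashMap<String, Double>();
    for (int i = 1; i <= 5; i++) {
      usage.put("user" + i, 10.0);
    }
    // Prints 2.0 for every user, i.e. the 10% total that queue B needs.
    System.out.println(UserLimitBalancer.toPreempt(usage, 8.0));
  }
}
{code}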

Thanks,
Mayank


> CS queue level preemption should respect user-limits
> 
>
> Key: YARN-2069
> URL: https://issues.apache.org/jira/browse/YARN-2069
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Mayank Bansal
> Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, 
> YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch, 
> YARN-2069-trunk-6.patch, YARN-2069-trunk-7.patch
>
>
> This is different from (even if related to, and likely share code with) 
> YARN-2113.
> YARN-2113 focuses on making sure that even if queue has its guaranteed 
> capacity, its individual users are treated in line with their limits 
> irrespective of when they join in.
> This JIRA is about respecting user-limits while preempting containers to 
> balance queue capacities.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically

2014-07-25 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-2212:


Attachment: YARN-2212.3.patch

Added more test cases.

> ApplicationMaster needs to find a way to update the AMRMToken periodically
> --
>
> Key: YARN-2212
> URL: https://issues.apache.org/jira/browse/YARN-2212
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-2212.1.patch, YARN-2212.2.patch, YARN-2212.3.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-07-25 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2359:


Description: 
Application is hung without timeout and retry after DNS/network is down. 
It is because right after the container is allocated for the AM, the 
DNS/network is down for the node which has the AM container.
The application attempt is at state RMAppAttemptState.SCHEDULED, it receive 
RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the 
IllegalArgumentException(due to DNS error) happened, it stay at state 
RMAppAttemptState.SCHEDULED. In the state machine, only two events will be 
processed at this state:
RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
The code didn't handle any event(RMAppAttemptEventType.CONTAINER_FINISHED) 
which will be generated by the node and container timeout. So even the node is 
removed, the Application is still hung in this state 
RMAppAttemptState.SCHEDULED.
The only way to make the application exit this state is to send 
RMAppAttemptEventType.KILL event which will only be generated when you manually 
kill the application from Job Client by forceKillApplication.

To fix the issue, we should add an entry in the state machine table to handle 
RMAppAttemptEventType.CONTAINER_FINISHED event at state 
RMAppAttemptState.SCHEDULED
add the following code in StateMachineFactory:
{{ .addTransition(RMAppAttemptState.SCHEDULED, 
  RMAppAttemptState.FINAL_SAVING,
  RMAppAttemptEventType.CONTAINER_FINISHED,
  new FinalSavingTransition(
new AMContainerCrashedBeforeRunningTransition(), 
RMAppAttemptState.FAILED))}}

  was:
Application is hung without timeout and retry after DNS/network is down. 
It is because right after the container is allocated for the AM, the 
DNS/network is down for the node which has the AM container.
The application attempt is at state RMAppAttemptState.SCHEDULED, it receive 
RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the 
IllegalArgumentException(due to DNS error) happened, it stay at state 
RMAppAttemptState.SCHEDULED. In the state machine, only two events will be 
processed at this state:
RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
The code didn't handle any event(RMAppAttemptEventType.CONTAINER_FINISHED) 
which will be generated by the node and container timeout. So even the node is 
removed, the Application is still hung in this state 
RMAppAttemptState.SCHEDULED.
The only way to make the application exit this state is to send 
RMAppAttemptEventType.KILL event which will only be generated when you manually 
kill the application from Job Client by forceKillApplication.

To fix the issue, we should add an entry in the state machine table to handle 
RMAppAttemptEventType.CONTAINER_FINISHED event at state 
RMAppAttemptState.SCHEDULED
add the following code in StateMachineFactory:
 .addTransition(RMAppAttemptState.SCHEDULED, 
  RMAppAttemptState.FINAL_SAVING,
  RMAppAttemptEventType.CONTAINER_FINISHED,
  new FinalSavingTransition(
new AMContainerCrashedBeforeRunningTransition(), 
RMAppAttemptState.FAILED))


> Application is hung without timeout and retry after DNS/network is down. 
> -
>
> Key: YARN-2359
> URL: https://issues.apache.org/jira/browse/YARN-2359
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-2359.000.patch
>
>
> Application is hung without timeout and retry after DNS/network is down. 
> It is because right after the container is allocated for the AM, the 
> DNS/network is down for the node which has the AM container.
> The application attempt is at state RMAppAttemptState.SCHEDULED, it receive 
> RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the 
> IllegalArgumentException(due to DNS error) happened, it stay at state 
> RMAppAttemptState.SCHEDULED. In the state machine, only two events will be 
> processed at this state:
> RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
> The code didn't handle any event(RMAppAttemptEventType.CONTAINER_FINISHED) 
> which will be generated by the node and container timeout. So even the node 
> is removed, the Application is still hung in this state 
> RMAppAttemptState.SCHEDULED.
> The only way to make the application exit this state is to send 
> RMAppAttemptEventType.KILL event which will only be generated when you 
> manually kill the application from Job Client by forceKillApplication.
> To fix the issue, we should add an entry in the state machine table to handle 
> RMAppAttemptEventType.CONTAINER_FINISHED event at state 
> RMAppAttempt

[jira] [Updated] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-07-25 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2359:


Description: 
Application is hung without timeout and retry after DNS/network is down. 
It is because right after the container is allocated for the AM, the 
DNS/network is down for the node which has the AM container.
The application attempt is at state RMAppAttemptState.SCHEDULED, it receive 
RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the 
IllegalArgumentException(due to DNS error) happened, it stay at state 
RMAppAttemptState.SCHEDULED. In the state machine, only two events will be 
processed at this state:
RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
The code didn't handle any event(RMAppAttemptEventType.CONTAINER_FINISHED) 
which will be generated by the node and container timeout. So even the node is 
removed, the Application is still hung in this state 
RMAppAttemptState.SCHEDULED.
The only way to make the application exit this state is to send 
RMAppAttemptEventType.KILL event which will only be generated when you manually 
kill the application from Job Client by forceKillApplication.

To fix the issue, we should add an entry in the state machine table to handle 
RMAppAttemptEventType.CONTAINER_FINISHED event at state 
RMAppAttemptState.SCHEDULED
add the following code in StateMachineFactory:
{code}.addTransition(RMAppAttemptState.SCHEDULED, 
  RMAppAttemptState.FINAL_SAVING,
  RMAppAttemptEventType.CONTAINER_FINISHED,
  new FinalSavingTransition(
new AMContainerCrashedBeforeRunningTransition(), 
RMAppAttemptState.FAILED)){code}

  was:
Application is hung without timeout and retry after DNS/network is down. 
It is because right after the container is allocated for the AM, the 
DNS/network is down for the node which has the AM container.
The application attempt is at state RMAppAttemptState.SCHEDULED, it receive 
RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the 
IllegalArgumentException(due to DNS error) happened, it stay at state 
RMAppAttemptState.SCHEDULED. In the state machine, only two events will be 
processed at this state:
RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
The code didn't handle any event(RMAppAttemptEventType.CONTAINER_FINISHED) 
which will be generated by the node and container timeout. So even the node is 
removed, the Application is still hung in this state 
RMAppAttemptState.SCHEDULED.
The only way to make the application exit this state is to send 
RMAppAttemptEventType.KILL event which will only be generated when you manually 
kill the application from Job Client by forceKillApplication.

To fix the issue, we should add an entry in the state machine table to handle 
RMAppAttemptEventType.CONTAINER_FINISHED event at state 
RMAppAttemptState.SCHEDULED
add the following code in StateMachineFactory:
{{ .addTransition(RMAppAttemptState.SCHEDULED, 
  RMAppAttemptState.FINAL_SAVING,
  RMAppAttemptEventType.CONTAINER_FINISHED,
  new FinalSavingTransition(
new AMContainerCrashedBeforeRunningTransition(), 
RMAppAttemptState.FAILED))}}


> Application is hung without timeout and retry after DNS/network is down. 
> -
>
> Key: YARN-2359
> URL: https://issues.apache.org/jira/browse/YARN-2359
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-2359.000.patch
>
>
> Application is hung without timeout and retry after DNS/network is down. 
> It is because right after the container is allocated for the AM, the 
> DNS/network is down for the node which has the AM container.
> The application attempt is at state RMAppAttemptState.SCHEDULED, it receive 
> RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the 
> IllegalArgumentException(due to DNS error) happened, it stay at state 
> RMAppAttemptState.SCHEDULED. In the state machine, only two events will be 
> processed at this state:
> RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
> The code didn't handle any event(RMAppAttemptEventType.CONTAINER_FINISHED) 
> which will be generated by the node and container timeout. So even the node 
> is removed, the Application is still hung in this state 
> RMAppAttemptState.SCHEDULED.
> The only way to make the application exit this state is to send 
> RMAppAttemptEventType.KILL event which will only be generated when you 
> manually kill the application from Job Client by forceKillApplication.
> To fix the issue, we should add an entry in the state machine table to handle 
> RMAppAttemptEventType.CONTAINER_FINISHED event at state 
> R

[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-07-25 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074917#comment-14074917
 ] 

zhihai xu commented on YARN-2359:
-

The TestAMRestart test passes in my local build:

---
 T E S T S
---
Running 
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 90.076 sec - in 
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart

Results :

Tests run: 5, Failures: 0, Errors: 0, Skipped: 0

> Application is hung without timeout and retry after DNS/network is down. 
> -
>
> Key: YARN-2359
> URL: https://issues.apache.org/jira/browse/YARN-2359
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-2359.000.patch
>
>
> Application is hung without timeout and retry after DNS/network is down. 
> It is because right after the container is allocated for the AM, the 
> DNS/network is down for the node which has the AM container.
> The application attempt is at state RMAppAttemptState.SCHEDULED, it receive 
> RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the 
> IllegalArgumentException(due to DNS error) happened, it stay at state 
> RMAppAttemptState.SCHEDULED. In the state machine, only two events will be 
> processed at this state:
> RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
> The code didn't handle any event(RMAppAttemptEventType.CONTAINER_FINISHED) 
> which will be generated by the node and container timeout. So even the node 
> is removed, the Application is still hung in this state 
> RMAppAttemptState.SCHEDULED.
> The only way to make the application exit this state is to send 
> RMAppAttemptEventType.KILL event which will only be generated when you 
> manually kill the application from Job Client by forceKillApplication.
> To fix the issue, we should add an entry in the state machine table to handle 
> RMAppAttemptEventType.CONTAINER_FINISHED event at state 
> RMAppAttemptState.SCHEDULED
> add the following code in StateMachineFactory:
>  .addTransition(RMAppAttemptState.SCHEDULED, 
>   RMAppAttemptState.FINAL_SAVING,
>   RMAppAttemptEventType.CONTAINER_FINISHED,
>   new FinalSavingTransition(
> new AMContainerCrashedBeforeRunningTransition(), 
> RMAppAttemptState.FAILED))



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens

2014-07-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074890#comment-14074890
 ] 

Hudson commented on YARN-2211:
--

FAILURE: Integrated in Hadoop-trunk-Commit #5970 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5970/])
YARN-2211. Persist AMRMToken master key in RMStateStore for RM recovery. 
Contributed by Xuan Gong (jianhe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1613515)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/ProtocolHATestBase.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestApplicationMasterServiceOnHA.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMSecretManagerService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/MemoryRMStateStore.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/NullRMStateStore.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/records/AMRMTokenSecretManagerState.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/records/impl/pb/AMRMTokenSecretManagerStatePBImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/AMRMTokenSecretManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/proto
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/proto/yarn_server_resourcemanager_recovery.proto
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStoreTestBase.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestFSRMStateStore.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestRMAppTransitions.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestUtils.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestAMRMTokens.java


> RMStateStore need

[jira] [Commented] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens

2014-07-25 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074876#comment-14074876
 ] 

Jian He commented on YARN-2211:
---

looks good, +1

> RMStateStore needs to save AMRMToken master key for recovery when RM 
> restart/failover happens 
> --
>
> Key: YARN-2211
> URL: https://issues.apache.org/jira/browse/YARN-2211
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, 
> YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch, 
> YARN-2211.6.1.patch, YARN-2211.6.patch, YARN-2211.7.1.patch, 
> YARN-2211.7.patch, YARN-2211.8.1.patch, YARN-2211.8.patch
>
>
> After YARN-2208, AMRMToken can be rolled over periodically. We need to save 
> related Master Keys and use them to recover the AMRMToken when RM 
> restart/failover happens



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074873#comment-14074873
 ] 

Hadoop QA commented on YARN-2359:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657887/YARN-2359.000.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4436//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4436//console

This message is automatically generated.

> Application is hung without timeout and retry after DNS/network is down. 
> -
>
> Key: YARN-2359
> URL: https://issues.apache.org/jira/browse/YARN-2359
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-2359.000.patch
>
>
> Application is hung without timeout and retry after DNS/network is down. 
> It is because right after the container is allocated for the AM, the 
> DNS/network is down for the node which has the AM container.
> The application attempt is at state RMAppAttemptState.SCHEDULED, it receive 
> RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the 
> IllegalArgumentException(due to DNS error) happened, it stay at state 
> RMAppAttemptState.SCHEDULED. In the state machine, only two events will be 
> processed at this state:
> RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
> The code didn't handle any event(RMAppAttemptEventType.CONTAINER_FINISHED) 
> which will be generated by the node and container timeout. So even the node 
> is removed, the Application is still hung in this state 
> RMAppAttemptState.SCHEDULED.
> The only way to make the application exit this state is to send 
> RMAppAttemptEventType.KILL event which will only be generated when you 
> manually kill the application from Job Client by forceKillApplication.
> To fix the issue, we should add an entry in the state machine table to handle 
> RMAppAttemptEventType.CONTAINER_FINISHED event at state 
> RMAppAttemptState.SCHEDULED
> add the following code in StateMachineFactory:
>  .addTransition(RMAppAttemptState.SCHEDULED, 
>   RMAppAttemptState.FINAL_SAVING,
>   RMAppAttemptEventType.CONTAINER_FINISHED,
>   new FinalSavingTransition(
> new AMContainerCrashedBeforeRunningTransition(), 
> RMAppAttemptState.FAILED))



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041

2014-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074819#comment-14074819
 ] 

Hadoop QA commented on YARN-1726:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12657881/YARN-1726-6-branch2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-tools/hadoop-sls.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4435//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4435//console

This message is automatically generated.

> ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced 
> in YARN-1041
> 
>
> Key: YARN-1726
> URL: https://issues.apache.org/jira/browse/YARN-1726
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.1
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Blocker
> Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, 
> YARN-1726-6.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, 
> YARN-1726.patch
>
>
> The YARN scheduler simulator failed when running Fair Scheduler, due to 
> AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper 
> should inherit AbstractYarnScheduler, instead of implementing 
> ResourceScheduler interface directly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens

2014-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074815#comment-14074815
 ] 

Hadoop QA commented on YARN-2211:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657877/YARN-2211.8.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 10 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4434//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4434//console

This message is automatically generated.

> RMStateStore needs to save AMRMToken master key for recovery when RM 
> restart/failover happens 
> --
>
> Key: YARN-2211
> URL: https://issues.apache.org/jira/browse/YARN-2211
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, 
> YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch, 
> YARN-2211.6.1.patch, YARN-2211.6.patch, YARN-2211.7.1.patch, 
> YARN-2211.7.patch, YARN-2211.8.1.patch, YARN-2211.8.patch
>
>
> After YARN-2208, AMRMToken can be rolled over periodically. We need to save 
> related Master Keys and use them to recover the AMRMToken when RM 
> restart/failover happens



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-07-25 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2359:


Attachment: YARN-2359.000.patch

> Application is hung without timeout and retry after DNS/network is down. 
> -
>
> Key: YARN-2359
> URL: https://issues.apache.org/jira/browse/YARN-2359
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-2359.000.patch
>
>
> Application is hung without timeout and retry after DNS/network is down. 
> It is because right after the container is allocated for the AM, the 
> DNS/network is down for the node which has the AM container.
> The application attempt is at state RMAppAttemptState.SCHEDULED, it receive 
> RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the 
> IllegalArgumentException(due to DNS error) happened, it stay at state 
> RMAppAttemptState.SCHEDULED. In the state machine, only two events will be 
> processed at this state:
> RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
> The code didn't handle any event(RMAppAttemptEventType.CONTAINER_FINISHED) 
> which will be generated by the node and container timeout. So even the node 
> is removed, the Application is still hung in this state 
> RMAppAttemptState.SCHEDULED.
> The only way to make the application exit this state is to send 
> RMAppAttemptEventType.KILL event which will only be generated when you 
> manually kill the application from Job Client by forceKillApplication.
> To fix the issue, we should add an entry in the state machine table to handle 
> RMAppAttemptEventType.CONTAINER_FINISHED event at state 
> RMAppAttemptState.SCHEDULED
> add the following code in StateMachineFactory:
>  .addTransition(RMAppAttemptState.SCHEDULED, 
>   RMAppAttemptState.FINAL_SAVING,
>   RMAppAttemptEventType.CONTAINER_FINISHED,
>   new FinalSavingTransition(
> new AMContainerCrashedBeforeRunningTransition(), 
> RMAppAttemptState.FAILED))



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041

2014-07-25 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-1726:
--

Attachment: YARN-1726-6-branch2.patch

Updated the patch for branch-2.

> ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced 
> in YARN-1041
> 
>
> Key: YARN-1726
> URL: https://issues.apache.org/jira/browse/YARN-1726
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.1
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Blocker
> Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, 
> YARN-1726-6.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, 
> YARN-1726.patch
>
>
> The YARN scheduler simulator failed when running Fair Scheduler, due to 
> AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper 
> should inherit AbstractYarnScheduler, instead of implementing 
> ResourceScheduler interface directly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens

2014-07-25 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074753#comment-14074753
 ] 

Xuan Gong commented on YARN-2211:
-

Same patch, with just a fix to TestAMRMTokens that resolves the test-case 
failure.
We changed the exception message in AMRMTokenSecretManager#retrievePWD but did 
not update the exception message that the test case asserts on, which caused 
the test-case failure.

The new patch fixes it.
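
For context, a minimal sketch (hypothetical names and messages, not the actual 
TestAMRMTokens code) of the pattern involved: asserting on the exact exception wording 
breaks whenever the production message is reworded, whereas asserting on the exception 
type plus a stable substring does not.
{code}
import org.junit.Assert;
import org.junit.Test;

public class ExceptionMessageAssertSketch {

  // Hypothetical stand-in for the production call whose exception message was reworded.
  private void retrievePassword() {
    throw new IllegalStateException("Password not found for ApplicationAttempt");
  }

  @Test
  public void assertOnTypeAndStableSubstring() {
    try {
      retrievePassword();
      Assert.fail("expected an exception");
    } catch (IllegalStateException e) {
      // Asserting equality against the old exact wording would break whenever the
      // production message changes, which is what happened here; a type check plus
      // a stable substring survives the rewording.
      Assert.assertTrue(e.getMessage().contains("Password not found"));
    }
  }
}
{code}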


> RMStateStore needs to save AMRMToken master key for recovery when RM 
> restart/failover happens 
> --
>
> Key: YARN-2211
> URL: https://issues.apache.org/jira/browse/YARN-2211
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, 
> YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch, 
> YARN-2211.6.1.patch, YARN-2211.6.patch, YARN-2211.7.1.patch, 
> YARN-2211.7.patch, YARN-2211.8.1.patch, YARN-2211.8.patch
>
>
> After YARN-2208, AMRMToken can be rolled over periodically. We need to save 
> related Master Keys and use them to recover the AMRMToken when RM 
> restart/failover happens



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041

2014-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074749#comment-14074749
 ] 

Hadoop QA commented on YARN-1726:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657870/YARN-1726-6.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-tools/hadoop-sls.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4433//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4433//console

This message is automatically generated.

> ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced 
> in YARN-1041
> 
>
> Key: YARN-1726
> URL: https://issues.apache.org/jira/browse/YARN-1726
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.1
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Blocker
> Attachments: YARN-1726-5.patch, YARN-1726-6.patch, YARN-1726.patch, 
> YARN-1726.patch, YARN-1726.patch, YARN-1726.patch
>
>
> The YARN scheduler simulator failed when running Fair Scheduler, due to 
> AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper 
> should inherit AbstractYarnScheduler, instead of implementing 
> ResourceScheduler interface directly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens

2014-07-25 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-2211:


Attachment: YARN-2211.8.1.patch

> RMStateStore needs to save AMRMToken master key for recovery when RM 
> restart/failover happens 
> --
>
> Key: YARN-2211
> URL: https://issues.apache.org/jira/browse/YARN-2211
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, 
> YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch, 
> YARN-2211.6.1.patch, YARN-2211.6.patch, YARN-2211.7.1.patch, 
> YARN-2211.7.patch, YARN-2211.8.1.patch, YARN-2211.8.patch
>
>
> After YARN-2208, AMRMToken can be rolled over periodically. We need to save 
> related Master Keys and use them to recover the AMRMToken when RM 
> restart/failover happens



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens

2014-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074731#comment-14074731
 ] 

Hadoop QA commented on YARN-2211:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657856/YARN-2211.8.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 9 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.security.TestAMRMTokens

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4430//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4430//console

This message is automatically generated.

> RMStateStore needs to save AMRMToken master key for recovery when RM 
> restart/failover happens 
> --
>
> Key: YARN-2211
> URL: https://issues.apache.org/jira/browse/YARN-2211
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, 
> YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch, 
> YARN-2211.6.1.patch, YARN-2211.6.patch, YARN-2211.7.1.patch, 
> YARN-2211.7.patch, YARN-2211.8.patch
>
>
> After YARN-2208, AMRMToken can be rolled over periodically. We need to save 
> related Master Keys and use them to recover the AMRMToken when RM 
> restart/failover happens



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1354) Recover applications upon nodemanager restart

2014-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074728#comment-14074728
 ] 

Hadoop QA commented on YARN-1354:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657864/YARN-1354-v5.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4432//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4432//console

This message is automatically generated.

> Recover applications upon nodemanager restart
> -
>
> Key: YARN-1354
> URL: https://issues.apache.org/jira/browse/YARN-1354
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.3.0
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-1354-v1.patch, 
> YARN-1354-v2-and-YARN-1987-and-YARN-1362.patch, YARN-1354-v3.patch, 
> YARN-1354-v4.patch, YARN-1354-v5.patch
>
>
> The set of active applications in the nodemanager context needs to be 
> recovered for a work-preserving nodemanager restart.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041

2014-07-25 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-1726:
--

Attachment: YARN-1726-6.patch

Rebased the patch after YARN-2335. It may not work with branch-2; I will upload 
a separate patch for branch-2.

> ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced 
> in YARN-1041
> 
>
> Key: YARN-1726
> URL: https://issues.apache.org/jira/browse/YARN-1726
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.1
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Blocker
> Attachments: YARN-1726-5.patch, YARN-1726-6.patch, YARN-1726.patch, 
> YARN-1726.patch, YARN-1726.patch, YARN-1726.patch
>
>
> The YARN scheduler simulator failed when running Fair Scheduler, due to 
> AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper 
> should inherit AbstractYarnScheduler, instead of implementing 
> ResourceScheduler interface directly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2262) Few fields displaying wrong values in Timeline server after RM restart

2014-07-25 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074677#comment-14074677
 ] 

Zhijie Shen commented on YARN-2262:
---

[~nishan] and [~Naganarasimha], I have a general suggestion on this issue. 
According to the logs, the problem is likely related to the FS history store. 
Meanwhile, we're working on rebasing the generic history data onto the timeline 
store (see YARN-2033 for the motivation and more details). Once that is done, 
we may deprecate the current FS history store, because it has a number of 
limitations and it is expensive to maintain two store interfaces.

You're always welcome to help fix the bug, but I want you to be aware of this 
plan, since the effort may end up not being leveraged. If you have bandwidth, 
I would appreciate help with other issues such as YARN-2033, where I have a 
patch available for the timeline-store-based generic history service but 
haven't yet had a chance to test it with RM restart. Anyway, thanks for your 
interest in the timeline server. Please feel free to share your thoughts.

> Few fields displaying wrong values in Timeline server after RM restart
> --
>
> Key: YARN-2262
> URL: https://issues.apache.org/jira/browse/YARN-2262
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Affects Versions: 2.4.0
>Reporter: Nishan Shetty
>Assignee: Naganarasimha G R
> Attachments: Capture.PNG, Capture1.PNG, 
> yarn-testos-historyserver-HOST-10-18-40-95.log, 
> yarn-testos-resourcemanager-HOST-10-18-40-84.log, 
> yarn-testos-resourcemanager-HOST-10-18-40-95.log
>
>
> Few fields displaying wrong values in Timeline server after RM restart
> State:null
> FinalStatus:  UNDEFINED
> Started:  8-Jul-2014 14:58:08
> Elapsed:  2562047397789hrs, 44mins, 47sec 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1354) Recover applications upon nodemanager restart

2014-07-25 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-1354:
-

Attachment: YARN-1354-v5.patch

Updating patch to fix the warning.

> Recover applications upon nodemanager restart
> -
>
> Key: YARN-1354
> URL: https://issues.apache.org/jira/browse/YARN-1354
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.3.0
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-1354-v1.patch, 
> YARN-1354-v2-and-YARN-1987-and-YARN-1362.patch, YARN-1354-v3.patch, 
> YARN-1354-v4.patch, YARN-1354-v5.patch
>
>
> The set of active applications in the nodemanager context needs to be 
> recovered for a work-preserving nodemanager restart.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart

2014-07-25 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074635#comment-14074635
 ] 

Anubhav Dhoot commented on YARN-2229:
-

We cannot simply add a field and have old code not know about it. That would 
cause the old code to silently work with a wrong id (missing field). And 
because of the way we construct container ids, we need to add the new field 
(details in YARN-2052).

The only way I see this working (without a cluster shutdown) is if we support 
deserializing both the old format and the new format. When serializing, we can 
choose to emit the new field based on a condition (a flag or the version number 
of the daemon).
So the first rolling upgrade does not turn on the condition but ensures that 
all the code supports deserializing the new field if it exists. In the next 
rolling upgrade we can turn on the condition and serialize the new field.

The RM can ensure that NMs are upgraded to a specific version (i.e. support 
deserializing the new field) before allowing the flag to be turned on. That 
takes care of the case where someone does not follow the approach above.
Any problems with this approach?
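For illustration, here is a minimal sketch of the flag-gated scheme described 
above. It is not the actual ContainerId/protobuf code; the class names, the 
nullable-field representation and the flag handling are assumptions made only 
to show emitting the new field conditionally while always tolerating its 
absence on read:
{code}
// Illustrative sketch only -- not the real ContainerId or PB classes.
// A record with an optional "epoch" field; absence means the old format.
final class ContainerIdRecord {
  private final int id;
  private final Long epoch;                 // null == field not present (old format)
  ContainerIdRecord(int id, Long epoch) { this.id = id; this.epoch = epoch; }
  boolean hasEpoch() { return epoch != null; }
  long getEpoch()    { return epoch; }
  int getId()        { return id; }
}

final class RollingUpgradeCodec {
  // Turned on only after every daemon can read the new field,
  // i.e. after the first rolling upgrade has completed.
  private final boolean emitEpochField;
  RollingUpgradeCodec(boolean emitEpochField) { this.emitEpochField = emitEpochField; }

  ContainerIdRecord serialize(int id, long epoch) {
    // Flag off: old-format records only. Flag on: records carry the new field.
    return emitEpochField ? new ContainerIdRecord(id, epoch)
                          : new ContainerIdRecord(id, null);
  }

  long epochOf(ContainerIdRecord rec) {
    // Readers always tolerate both formats: an absent field means epoch 0.
    return rec.hasEpoch() ? rec.getEpoch() : 0L;
  }
}
{code}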

> ContainerId can overflow with RM restart
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.10.patch, 
> YARN-2229.10.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, 
> YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, 
> YARN-2229.8.patch, YARN-2229.9.patch
>
>
> On YARN-2052, we changed the containerId format: the upper 10 bits are for the 
> epoch and the lower 22 bits are for the sequence number of ids. This preserves 
> the semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is that the epoch can overflow 
> after the RM restarts 1024 times.
> To avoid the problem, it's better to make ContainerId a long. We need to define 
> the new format of the container id while preserving backward compatibility in 
> this JIRA.
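As a rough illustration of the packing described above (this is not the real 
ContainerId implementation, just the bit layout the description mentions):
{code}
// Illustrative only: 32-bit id with the epoch in the upper 10 bits and the
// sequence number in the lower 22 bits, as described above.
final class PackedIdSketch {
  static final int SEQUENCE_BITS = 22;
  static final int SEQUENCE_MASK = (1 << SEQUENCE_BITS) - 1;   // lower 22 bits

  static int pack(int epoch, int sequence) {
    return (epoch << SEQUENCE_BITS) | (sequence & SEQUENCE_MASK);
  }

  static int epochOf(int id)    { return id >>> SEQUENCE_BITS; }  // upper 10 bits
  static int sequenceOf(int id) { return id & SEQUENCE_MASK; }

  public static void main(String[] args) {
    int id = pack(3, 41);
    // The epoch wraps once it reaches 1 << 10 = 1024, which is the
    // overflow concern after 1024 RM restarts.
    System.out.println(epochOf(id) + " / " + sequenceOf(id));    // prints 3 / 41
  }
}
{code}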



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2262) Few fields displaying wrong values in Timeline server after RM restart

2014-07-25 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074636#comment-14074636
 ] 

Zhijie Shen commented on YARN-2262:
---

[~nishan], thanks for sharing the logs. I've done a preliminary investigation 
of the RM logs. It seems that the FS history store got into a bad state after 
the RM failed over. There are two types of exception:

1. The history file of application_1406035038624_0005 shouldn't exist, because 
before failover I didn't see in the log that application_1406035038624_0005 had 
already been started (or is the log incomplete?). However, the FS store found 
the history file on HDFS and wanted to append more information to it, but 
failed to open the file in append mode.
{code}
2014-07-23 17:00:03,066 ERROR 
org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
 Error when openning history file of application application_1406035038624_0005
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException):
 Failed to create file 
[/home/testos/timelinedata/generic-history/ApplicationHistoryDataRoot/application_1406035038624_0005]
 for [DFSClient_NONMAPREDUCE_-903472038_1] for client [10.18.40.84], because 
this file is already being created by [DFSClient_NONMAPREDUCE_1878412866_1] on 
[10.18.40.84]
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2549)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2378)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2613)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2576)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:537)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:373)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

at org.apache.hadoop.ipc.Client.call(Client.java:1410)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at $Proxy14.append(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.append(ClientNamenodeProtocolTranslatorPB.java:276)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
at $Proxy15.append(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1569)
at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1609)
at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1597)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:320)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:316)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:316)
at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1161)
at 
org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore$HistoryFileWriter.(FileSystemApplicationHistoryStore.java:723)
at 
org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.applicationStarted(FileSystemApplicationHistoryStore.java:418)
at 
org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter.handleWritingApplicationHistoryEvent(RMApplicationHistoryWriter.

[jira] [Commented] (YARN-2335) Annotate all hadoop-sls APIs as @Private

2014-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074634#comment-14074634
 ] 

Hadoop QA commented on YARN-2335:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12657858/YARN-2335-1.branch2.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4431//console

This message is automatically generated.

> Annotate all hadoop-sls APIs as @Private
> 
>
> Key: YARN-2335
> URL: https://issues.apache.org/jira/browse/YARN-2335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Minor
> Attachments: YARN-2335-1.branch2.patch, YARN-2335-1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2335) Annotate all hadoop-sls APIs as @Private

2014-07-25 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2335:
--

Attachment: YARN-2335-1.branch2.patch

Updated the patch for branch-2.

> Annotate all hadoop-sls APIs as @Private
> 
>
> Key: YARN-2335
> URL: https://issues.apache.org/jira/browse/YARN-2335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Minor
> Attachments: YARN-2335-1.branch2.patch, YARN-2335-1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2335) Annotate all hadoop-sls APIs as @Private

2014-07-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074622#comment-14074622
 ] 

Hudson commented on YARN-2335:
--

FAILURE: Integrated in Hadoop-trunk-Commit #5967 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5967/])
YARN-2335. Annotate all hadoop-sls APIs as @Private. (Wei Yan via kasha) 
(kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1613478)
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/RumenToSLSConverter.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/SLSRunner.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/AMSimulator.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/MRAMSimulator.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/conf/SLSConfiguration.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/nodemanager/NMSimulator.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/nodemanager/NodeInfo.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/CapacitySchedulerMetrics.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/ContainerSimulator.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/FairSchedulerMetrics.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/FifoSchedulerMetrics.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/NodeUpdateSchedulerEventWrapper.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/RMNodeWrapper.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/ResourceSchedulerWrapper.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSCapacityScheduler.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SchedulerMetrics.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SchedulerWrapper.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/TaskRunner.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/utils/SLSUtils.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/web/SLSWebApp.java
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt


> Annotate all hadoop-sls APIs as @Private
> 
>
> Key: YARN-2335
> URL: https://issues.apache.org/jira/browse/YARN-2335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Minor
> Attachments: YARN-2335-1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens

2014-07-25 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2211:
--

Attachment: YARN-2211.8.patch

Looks good overall. I fixed some log messages myself and re-submitted the patch.

> RMStateStore needs to save AMRMToken master key for recovery when RM 
> restart/failover happens 
> --
>
> Key: YARN-2211
> URL: https://issues.apache.org/jira/browse/YARN-2211
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, 
> YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch, 
> YARN-2211.6.1.patch, YARN-2211.6.patch, YARN-2211.7.1.patch, 
> YARN-2211.7.patch, YARN-2211.8.patch
>
>
> After YARN-2208, AMRMToken can be rolled over periodically. We need to save 
> related Master Keys and use them to recover the AMRMToken when RM 
> restart/failover happens



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2335) Annotate all hadoop-sls APIs as @Private

2014-07-25 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074600#comment-14074600
 ] 

Karthik Kambatla commented on YARN-2335:


Committed to trunk. branch-2 had conflicts. Mind updating the patch for 
branch-2? 

> Annotate all hadoop-sls APIs as @Private
> 
>
> Key: YARN-2335
> URL: https://issues.apache.org/jira/browse/YARN-2335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Minor
> Attachments: YARN-2335-1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2335) Annotate all hadoop-sls APIs as @Private

2014-07-25 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074593#comment-14074593
 ] 

Karthik Kambatla commented on YARN-2335:


+1

> Annotate all hadoop-sls APIs as @Private
> 
>
> Key: YARN-2335
> URL: https://issues.apache.org/jira/browse/YARN-2335
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Minor
> Attachments: YARN-2335-1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-07-25 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2359:


Priority: Critical  (was: Major)

> Application is hung without timeout and retry after DNS/network is down. 
> -
>
> Key: YARN-2359
> URL: https://issues.apache.org/jira/browse/YARN-2359
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
>
> Application is hung without timeout and retry after DNS/network is down. 
> This happens because, right after the container is allocated for the AM, the 
> DNS/network goes down on the node that hosts the AM container.
> The application attempt is in state RMAppAttemptState.SCHEDULED. It receives 
> the RMAppAttemptEventType.CONTAINER_ALLOCATED event, but because an 
> IllegalArgumentException (due to the DNS error) happens, it stays in state 
> RMAppAttemptState.SCHEDULED. In the state machine, only two events are 
> processed in this state:
> RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
> The code doesn't handle the RMAppAttemptEventType.CONTAINER_FINISHED event, 
> which is generated by the node and container timeout. So even after the node 
> is removed, the application is still hung in state RMAppAttemptState.SCHEDULED.
> The only way to make the application leave this state is to send the 
> RMAppAttemptEventType.KILL event, which is only generated when you manually 
> kill the application from the Job Client via forceKillApplication.
> To fix the issue, we should add an entry to the state machine table to handle 
> the RMAppAttemptEventType.CONTAINER_FINISHED event in state 
> RMAppAttemptState.SCHEDULED, by adding the following code in the 
> StateMachineFactory:
>  .addTransition(RMAppAttemptState.SCHEDULED, 
>   RMAppAttemptState.FINAL_SAVING,
>   RMAppAttemptEventType.CONTAINER_FINISHED,
>   new FinalSavingTransition(
> new AMContainerCrashedBeforeRunningTransition(), 
> RMAppAttemptState.FAILED))



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-07-25 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu reassigned YARN-2359:
---

Assignee: zhihai xu

> Application is hung without timeout and retry after DNS/network is down. 
> -
>
> Key: YARN-2359
> URL: https://issues.apache.org/jira/browse/YARN-2359
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>
> Application is hung without timeout and retry after DNS/network is down. 
> This happens because, right after the container is allocated for the AM, the 
> DNS/network goes down on the node that hosts the AM container.
> The application attempt is in state RMAppAttemptState.SCHEDULED. It receives 
> the RMAppAttemptEventType.CONTAINER_ALLOCATED event, but because an 
> IllegalArgumentException (due to the DNS error) happens, it stays in state 
> RMAppAttemptState.SCHEDULED. In the state machine, only two events are 
> processed in this state:
> RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
> The code doesn't handle the RMAppAttemptEventType.CONTAINER_FINISHED event, 
> which is generated by the node and container timeout. So even after the node 
> is removed, the application is still hung in state RMAppAttemptState.SCHEDULED.
> The only way to make the application leave this state is to send the 
> RMAppAttemptEventType.KILL event, which is only generated when you manually 
> kill the application from the Job Client via forceKillApplication.
> To fix the issue, we should add an entry to the state machine table to handle 
> the RMAppAttemptEventType.CONTAINER_FINISHED event in state 
> RMAppAttemptState.SCHEDULED, by adding the following code in the 
> StateMachineFactory:
>  .addTransition(RMAppAttemptState.SCHEDULED, 
>   RMAppAttemptState.FINAL_SAVING,
>   RMAppAttemptEventType.CONTAINER_FINISHED,
>   new FinalSavingTransition(
> new AMContainerCrashedBeforeRunningTransition(), 
> RMAppAttemptState.FAILED))



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-07-25 Thread zhihai xu (JIRA)
zhihai xu created YARN-2359:
---

 Summary: Application is hung without timeout and retry after 
DNS/network is down. 
 Key: YARN-2359
 URL: https://issues.apache.org/jira/browse/YARN-2359
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: zhihai xu


Application is hung without timeout and retry after DNS/network is down. 
This happens because, right after the container is allocated for the AM, the 
DNS/network goes down on the node that hosts the AM container.
The application attempt is in state RMAppAttemptState.SCHEDULED. It receives 
the RMAppAttemptEventType.CONTAINER_ALLOCATED event, but because an 
IllegalArgumentException (due to the DNS error) happens, it stays in state 
RMAppAttemptState.SCHEDULED. In the state machine, only two events are 
processed in this state:
RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
The code doesn't handle the RMAppAttemptEventType.CONTAINER_FINISHED event, 
which is generated by the node and container timeout. So even after the node is 
removed, the application is still hung in state RMAppAttemptState.SCHEDULED.
The only way to make the application leave this state is to send the 
RMAppAttemptEventType.KILL event, which is only generated when you manually 
kill the application from the Job Client via forceKillApplication.

To fix the issue, we should add an entry to the state machine table to handle 
the RMAppAttemptEventType.CONTAINER_FINISHED event in state 
RMAppAttemptState.SCHEDULED, by adding the following code in the 
StateMachineFactory:
 .addTransition(RMAppAttemptState.SCHEDULED, 
  RMAppAttemptState.FINAL_SAVING,
  RMAppAttemptEventType.CONTAINER_FINISHED,
  new FinalSavingTransition(
new AMContainerCrashedBeforeRunningTransition(), 
RMAppAttemptState.FAILED))



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2214) FairScheduler: preemptContainerPreCheck() in FSParentQueue delays convergence towards fairness

2014-07-25 Thread Ashwin Shankar (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074548#comment-14074548
 ] 

Ashwin Shankar commented on YARN-2214:
--

Thanks Karthik !

> FairScheduler: preemptContainerPreCheck() in FSParentQueue delays convergence 
> towards fairness
> --
>
> Key: YARN-2214
> URL: https://issues.apache.org/jira/browse/YARN-2214
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.5.0
>Reporter: Ashwin Shankar
>Assignee: Ashwin Shankar
> Fix For: 2.6.0
>
> Attachments: YARN-2214-v1.txt, YARN-2214-v2.txt
>
>
> preemptContainerPreCheck() in FSParentQueue rejects preemption requests if 
> the parent queue is below its fair share. This can delay convergence towards 
> fairness when the starved leaf queue and the queue above its fair share 
> belong under a non-root parent queue (i.e. their least common ancestor is a 
> parent queue which is not root).
> Here is an example:
> root.parent has fair share = 80% and usage = 80%
> root.parent.child1 has fair share = 40% and usage = 80%
> root.parent.child2 has fair share = 40% and usage = 0%
> Now a job is submitted to child2 and its demand is 40%.
> Preemption will kick in and try to reclaim all of the 40% from child1.
> When it preempts the first container from child1, the usage of root.parent 
> drops below 80%, which is less than root.parent's fair share, causing 
> preemption to stop. So only one container gets preempted in this round, 
> although the need is much larger. child2 would eventually get to half its fair 
> share, but only after multiple rounds of preemption.
> The solution is to remove preemptContainerPreCheck() from FSParentQueue and 
> keep it only in FSLeafQueue (where it already exists).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2214) FairScheduler: preemptContainerPreCheck() in FSParentQueue delays convergence towards fairness

2014-07-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074527#comment-14074527
 ] 

Hudson commented on YARN-2214:
--

FAILURE: Integrated in Hadoop-trunk-Commit #5966 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5966/])
YARN-2214. FairScheduler: preemptContainerPreCheck() in FSParentQueue delays 
convergence towards fairness. (Ashwin Shankar via kasha) (kasha: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1613459)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSParentQueue.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSQueue.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java


> FairScheduler: preemptContainerPreCheck() in FSParentQueue delays convergence 
> towards fairness
> --
>
> Key: YARN-2214
> URL: https://issues.apache.org/jira/browse/YARN-2214
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.5.0
>Reporter: Ashwin Shankar
>Assignee: Ashwin Shankar
> Attachments: YARN-2214-v1.txt, YARN-2214-v2.txt
>
>
> preemptContainerPreCheck() in FSParentQueue rejects preemption requests if 
> the parent queue is below its fair share. This can delay convergence towards 
> fairness when the starved leaf queue and the queue above its fair share 
> belong under a non-root parent queue (i.e. their least common ancestor is a 
> parent queue which is not root).
> Here is an example:
> root.parent has fair share = 80% and usage = 80%
> root.parent.child1 has fair share = 40% and usage = 80%
> root.parent.child2 has fair share = 40% and usage = 0%
> Now a job is submitted to child2 and its demand is 40%.
> Preemption will kick in and try to reclaim all of the 40% from child1.
> When it preempts the first container from child1, the usage of root.parent 
> drops below 80%, which is less than root.parent's fair share, causing 
> preemption to stop. So only one container gets preempted in this round, 
> although the need is much larger. child2 would eventually get to half its fair 
> share, but only after multiple rounds of preemption.
> The solution is to remove preemptContainerPreCheck() from FSParentQueue and 
> keep it only in FSLeafQueue (where it already exists).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1354) Recover applications upon nodemanager restart

2014-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074523#comment-14074523
 ] 

Hadoop QA commented on YARN-1354:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657689/YARN-1354-v4.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 1259 javac 
compiler warnings (more than the trunk's current 1258 warnings).

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4429//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/4429//artifact/trunk/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4429//console

This message is automatically generated.

> Recover applications upon nodemanager restart
> -
>
> Key: YARN-1354
> URL: https://issues.apache.org/jira/browse/YARN-1354
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Affects Versions: 2.3.0
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-1354-v1.patch, 
> YARN-1354-v2-and-YARN-1987-and-YARN-1362.patch, YARN-1354-v3.patch, 
> YARN-1354-v4.patch
>
>
> The set of active applications in the nodemanager context needs to be 
> recovered for a work-preserving nodemanager restart.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2358) TestNamenodeCapacityReport.testXceiverCount may sometimes fail due to lack of retry

2014-07-25 Thread Mit Desai (JIRA)
Mit Desai created YARN-2358:
---

 Summary: TestNamenodeCapacityReport.testXceiverCount may sometimes 
fail due to lack of retry
 Key: YARN-2358
 URL: https://issues.apache.org/jira/browse/YARN-2358
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Mit Desai
Assignee: Mit Desai


I have seen TestNamenodeCapacityReport.testXceiverCount fail intermittently in 
our nightly builds with the following error:
{noformat}
java.io.IOException: Unable to close file because the last block does not have 
enough number of replicas.
at 
org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2151)
at 
org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2119)
at 
org.apache.hadoop.hdfs.server.namenode.TestNamenodeCapacityReport.testXceiverCount(TestNamenodeCapacityReport.java:281)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2214) FairScheduler: preemptContainerPreCheck() in FSParentQueue delays convergence towards fairness

2014-07-25 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2214:
---

Summary: FairScheduler: preemptContainerPreCheck() in FSParentQueue delays 
convergence towards fairness  (was: preemptContainerPreCheck() in FSParentQueue 
delays convergence towards fairness)

> FairScheduler: preemptContainerPreCheck() in FSParentQueue delays convergence 
> towards fairness
> --
>
> Key: YARN-2214
> URL: https://issues.apache.org/jira/browse/YARN-2214
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.5.0
>Reporter: Ashwin Shankar
>Assignee: Ashwin Shankar
> Attachments: YARN-2214-v1.txt, YARN-2214-v2.txt
>
>
> preemptContainerPreCheck() in FSParentQueue rejects preemption requests if 
> the parent queue is below its fair share. This can delay convergence towards 
> fairness when the starved leaf queue and the queue above its fair share 
> belong under a non-root parent queue (i.e. their least common ancestor is a 
> parent queue which is not root).
> Here is an example:
> root.parent has fair share = 80% and usage = 80%
> root.parent.child1 has fair share = 40% and usage = 80%
> root.parent.child2 has fair share = 40% and usage = 0%
> Now a job is submitted to child2 and its demand is 40%.
> Preemption will kick in and try to reclaim all of the 40% from child1.
> When it preempts the first container from child1, the usage of root.parent 
> drops below 80%, which is less than root.parent's fair share, causing 
> preemption to stop. So only one container gets preempted in this round, 
> although the need is much larger. child2 would eventually get to half its fair 
> share, but only after multiple rounds of preemption.
> The solution is to remove preemptContainerPreCheck() from FSParentQueue and 
> keep it only in FSLeafQueue (where it already exists).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2214) FairScheduler: preemptContainerPreCheck() in FSParentQueue delays convergence towards fairness

2014-07-25 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2214:
---

Issue Type: Improvement  (was: Bug)

> FairScheduler: preemptContainerPreCheck() in FSParentQueue delays convergence 
> towards fairness
> --
>
> Key: YARN-2214
> URL: https://issues.apache.org/jira/browse/YARN-2214
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.5.0
>Reporter: Ashwin Shankar
>Assignee: Ashwin Shankar
> Attachments: YARN-2214-v1.txt, YARN-2214-v2.txt
>
>
> preemptContainerPreCheck() in FSParentQueue rejects preemption requests if 
> the parent queue is below its fair share. This can delay convergence towards 
> fairness when the starved leaf queue and the queue above its fair share 
> belong under a non-root parent queue (i.e. their least common ancestor is a 
> parent queue which is not root).
> Here is an example:
> root.parent has fair share = 80% and usage = 80%
> root.parent.child1 has fair share = 40% and usage = 80%
> root.parent.child2 has fair share = 40% and usage = 0%
> Now a job is submitted to child2 and its demand is 40%.
> Preemption will kick in and try to reclaim all of the 40% from child1.
> When it preempts the first container from child1, the usage of root.parent 
> drops below 80%, which is less than root.parent's fair share, causing 
> preemption to stop. So only one container gets preempted in this round, 
> although the need is much larger. child2 would eventually get to half its fair 
> share, but only after multiple rounds of preemption.
> The solution is to remove preemptContainerPreCheck() from FSParentQueue and 
> keep it only in FSLeafQueue (where it already exists).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2214) preemptContainerPreCheck() in FSParentQueue delays convergence towards fairness

2014-07-25 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074509#comment-14074509
 ] 

Karthik Kambatla commented on YARN-2214:


+1. Checking this in. 

> preemptContainerPreCheck() in FSParentQueue delays convergence towards 
> fairness
> ---
>
> Key: YARN-2214
> URL: https://issues.apache.org/jira/browse/YARN-2214
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.5.0
>Reporter: Ashwin Shankar
>Assignee: Ashwin Shankar
> Attachments: YARN-2214-v1.txt, YARN-2214-v2.txt
>
>
> preemptContainerPreCheck() in FSParentQueue rejects preemption requests if 
> the parent queue is below its fair share. This can delay convergence towards 
> fairness when the starved leaf queue and the queue above its fair share 
> belong under a non-root parent queue (i.e. their least common ancestor is a 
> parent queue which is not root).
> Here is an example:
> root.parent has fair share = 80% and usage = 80%
> root.parent.child1 has fair share = 40% and usage = 80%
> root.parent.child2 has fair share = 40% and usage = 0%
> Now a job is submitted to child2 and its demand is 40%.
> Preemption will kick in and try to reclaim all of the 40% from child1.
> When it preempts the first container from child1, the usage of root.parent 
> drops below 80%, which is less than root.parent's fair share, causing 
> preemption to stop. So only one container gets preempted in this round, 
> although the need is much larger. child2 would eventually get to half its fair 
> share, but only after multiple rounds of preemption.
> The solution is to remove preemptContainerPreCheck() from FSParentQueue and 
> keep it only in FSLeafQueue (where it already exists).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2209) Replace allocate#resync command with ApplicationMasterNotRegisteredException to indicate AM to re-register on RM restart

2014-07-25 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074366#comment-14074366
 ] 

Rohith commented on YARN-2209:
--

Hi [~jianhe], I reviewed the patch and have a couple of comments.

1. lastResponseID=0 is missing in RMContainerAllocator#getResources():
{code}
catch (ApplicationMasterNotRegisteredException e) {
  LOG.info("ApplicationMaster is out of sync with ResourceManager,"
      + " hence resync and send outstanding requests.");
  // RM may have restarted, re-register with RM.
  register();
  addOutstandingRequestOnResync();
  return null;
}
{code}

2. In AMRMClientAsyncImpl, the code below may lose one response, since the 
response is not added back to responseQueue when an InterruptedException 
occurs. This may be a corner case, but it can still happen because the JVM or 
the OS may interrupt the thread.
Can we add the response back to responseQueue on InterruptedException?
{code}
if (response != null) {
  try {
    responseQueue.put(response);
    break;
  } catch (InterruptedException ex) {
    LOG.debug("Interrupted while waiting to put on response queue", ex);
  }
{code}
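One possible shape of the change suggested here, as a sketch only: the 
{{response}}, {{responseQueue}} and {{LOG}} names come from the snippet above, 
while the surrounding loop structure is an assumption rather than the actual 
AMRMClientAsyncImpl code. The idea is to keep retrying the put and restore the 
interrupt status instead of dropping the response:
{code}
// Sketch only: do not drop the response on InterruptedException; retry the
// put and remember the interrupt so it can be restored afterwards.
boolean interrupted = false;
try {
  while (response != null) {
    try {
      responseQueue.put(response);
      response = null;                    // queued successfully, stop retrying
    } catch (InterruptedException ex) {
      LOG.debug("Interrupted while waiting to put on response queue", ex);
      interrupted = true;                 // remember the interrupt and retry
    }
  }
} finally {
  if (interrupted) {
    Thread.currentThread().interrupt();   // restore the interrupt status
  }
}
{code}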

> Replace allocate#resync command with ApplicationMasterNotRegisteredException 
> to indicate AM to re-register on RM restart
> 
>
> Key: YARN-2209
> URL: https://issues.apache.org/jira/browse/YARN-2209
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch
>
>
> YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate 
> application to re-register on RM restart. we should do the same for 
> AMS#allocate call also.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2357) Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 changes to branch-2

2014-07-25 Thread Remus Rusanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Remus Rusanu updated YARN-2357:
---

Attachment: (was: YARN-2357.1.patch)

> Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 
> changes to branch-2
> --
>
> Key: YARN-2357
> URL: https://issues.apache.org/jira/browse/YARN-2357
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: nodemanager
>Affects Versions: 2.4.0
>Reporter: Remus Rusanu
>Assignee: Remus Rusanu
>Priority: Critical
>  Labels: security, windows
> Attachments: YARN-2357.1.patch
>
>
> As title says. Once YARN-1063, YARN-1972 and YARN-2198 are committed to 
> trunk, they need to be backported to branch-2



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2357) Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 changes to branch-2

2014-07-25 Thread Remus Rusanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Remus Rusanu updated YARN-2357:
---

Attachment: YARN-2357.1.patch

Now with compile fix!

> Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 
> changes to branch-2
> --
>
> Key: YARN-2357
> URL: https://issues.apache.org/jira/browse/YARN-2357
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: nodemanager
>Affects Versions: 2.4.0
>Reporter: Remus Rusanu
>Assignee: Remus Rusanu
>Priority: Critical
>  Labels: security, windows
> Attachments: YARN-2357.1.patch
>
>
> As title says. Once YARN-1063, YARN-1972 and YARN-2198 are committed to 
> trunk, they need to be backported to branch-2



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2357) Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 changes to branch-2

2014-07-25 Thread Remus Rusanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Remus Rusanu updated YARN-2357:
---

Attachment: YARN-2357.1.patch

Patch .1 is a port of the currently uploaded YARN-1063 .6, YARN-1972 .3 and 
YARN-2198 .2 patches.

> Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 
> changes to branch-2
> --
>
> Key: YARN-2357
> URL: https://issues.apache.org/jira/browse/YARN-2357
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: nodemanager
>Affects Versions: 2.4.0
>Reporter: Remus Rusanu
>Assignee: Remus Rusanu
>Priority: Critical
>  Labels: security, windows
> Attachments: YARN-2357.1.patch
>
>
> As title says. Once YARN-1063, YARN-1972 and YARN-2198 are committed to 
> trunk, they need to be backported to branch-2



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2357) Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 changes to branch-2

2014-07-25 Thread Remus Rusanu (JIRA)
Remus Rusanu created YARN-2357:
--

 Summary: Port Windows Secure Container Executor YARN-1063, 
YARN-1972, YARN-2198 changes to branch-2
 Key: YARN-2357
 URL: https://issues.apache.org/jira/browse/YARN-2357
 Project: Hadoop YARN
  Issue Type: Task
  Components: nodemanager
Affects Versions: 2.4.0
Reporter: Remus Rusanu
Assignee: Remus Rusanu
Priority: Critical


As title says. Once YARN-1063, YARN-1972 and YARN-2198 are committed to trunk, 
they need to be backported to branch-2



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2356) yarn status command for non-existent application/application attempt/container is too verbose

2014-07-25 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-2356:
--

Attachment: Yarn-2356.1.patch

Fixed the "-status" commands to handle ApplicationNotFoundException, 
ApplicationAttemptNotFoundException and ContainerNotFoundException for 
non-existent entries.



> yarn status command for non-existent application/application 
> attempt/container is too verbose 
> --
>
> Key: YARN-2356
> URL: https://issues.apache.org/jira/browse/YARN-2356
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Minor
> Attachments: Yarn-2356.1.patch
>
>
> The *yarn application -status*, *applicationattempt -status* and *container 
> status* commands should suppress exceptions such as ApplicationNotFound, 
> ApplicationAttemptNotFound and ContainerNotFound for non-existent entries in 
> the RM or History Server instead of printing the full stack trace. 
> For example, the exception below could be reported more concisely:
> sunildev@host-a:~/hadoop/hadoop/bin> ./yarn application -status 
> application_1402668848165_0015
> No GC_PROFILE is given. Defaults to medium.
> 14/07/25 16:21:45 INFO client.RMProxy: Connecting to ResourceManager at 
> /10.18.40.77:45022
> Exception in thread "main" 
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1402668848165_0015' doesn't exist in RM.
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:285)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:607)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:932)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2099)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2095)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1626)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2093)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:166)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
> at $Proxy12.getApplicationReport(Unknown Source)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:291)
> at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.printApplicationReport(ApplicationCLI.java:428)
> at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:153)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:76)
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException):
>  Application with id 'application_1402668848165_0015' doesn't exist in RM.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2356) yarn status command for non-existent application/application attempt/container is too verbose

2014-07-25 Thread Sunil G (JIRA)
Sunil G created YARN-2356:
-

 Summary: yarn status command for non-existent 
application/application attempt/container is too verbose 
 Key: YARN-2356
 URL: https://issues.apache.org/jira/browse/YARN-2356
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Reporter: Sunil G
Assignee: Sunil G
Priority: Minor


The *yarn application -status*, *applicationattempt -status* and *container 
status* commands should suppress exceptions such as ApplicationNotFound, 
ApplicationAttemptNotFound and ContainerNotFound for non-existent entries in 
the RM or History Server instead of printing the full stack trace.

For example, the exception below could be reported more concisely:

sunildev@host-a:~/hadoop/hadoop/bin> ./yarn application -status 
application_1402668848165_0015
No GC_PROFILE is given. Defaults to medium.
14/07/25 16:21:45 INFO client.RMProxy: Connecting to ResourceManager at 
/10.18.40.77:45022
Exception in thread "main" 
org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
with id 'application_1402668848165_0015' doesn't exist in RM.
at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:285)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
at 
org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:607)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:932)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2099)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2095)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1626)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2093)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at 
org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at 
org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:166)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
at $Proxy12.getApplicationReport(Unknown Source)
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:291)
at 
org.apache.hadoop.yarn.client.cli.ApplicationCLI.printApplicationReport(ApplicationCLI.java:428)
at 
org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:153)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at 
org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:76)
Caused by: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException):
 Application with id 'application_1402668848165_0015' doesn't exist in RM.
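
For illustration only (this sketch is not part of any attached patch), the CLI 
could catch the remote exception and print a one-line message instead of the 
full stack trace, using the standard YarnClient API:

{code}
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class StatusSketch {
  // Hypothetical helper, not the real ApplicationCLI code: returns a non-zero
  // exit code and a concise message when the application does not exist.
  static int printStatus(YarnClient client, String appIdStr) {
    try {
      ApplicationId appId = ConverterUtils.toApplicationId(appIdStr);
      ApplicationReport report = client.getApplicationReport(appId);
      System.out.println("Application " + appIdStr + " is in state "
          + report.getYarnApplicationState());
      return 0;
    } catch (ApplicationNotFoundException e) {
      // One line for the user instead of the verbose trace above.
      System.err.println("Application with id '" + appIdStr
          + "' doesn't exist in RM.");
      return -1;
    } catch (Exception e) {
      System.err.println("Failed to get application report: " + e.getMessage());
      return -1;
    }
  }
}
{code}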




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2336) Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree

2014-07-25 Thread Akira AJISAKA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074268#comment-14074268
 ] 

Akira AJISAKA commented on YARN-2336:
-

Thanks [~kj-ki] for the update. +1 (non-binding).

> Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree
> --
>
> Key: YARN-2336
> URL: https://issues.apache.org/jira/browse/YARN-2336
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.4.1
>Reporter: Kenji Kikushima
>Assignee: Kenji Kikushima
> Attachments: YARN-2336-2.patch, YARN-2336-3.patch, YARN-2336.patch
>
>
> When we have sub queues in Fair Scheduler, the REST api returns JSON with a 
> missing '[' bracket for childQueues.
> This issue was found by [~ajisakaa] at YARN-1050.
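
For illustration only (a guess at the symptom, with assumed field names, not 
taken from the patch): at deeper levels of the queue tree the childQueues 
element is serialized as a single JSON object where a JSON array is expected, 
roughly:

{code}
"childQueues": {"queueName": "root.a.b", ...}    <-- missing '[' ... ']'
"childQueues": [{"queueName": "root.a.b", ...}]  <-- expected
{code}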



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits

2014-07-25 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074249#comment-14074249
 ] 

Wangda Tan commented on YARN-2069:
--

Hi [~mayank_bansal],
Thanks for working on this again. I've taken a brief look at your patch, and I 
think the general approach is:
- Compute a target-user-limit for a given queue,
- Preempt containers according to a user's current consumption and the 
target-user-limit,
- If more resources need to be preempted, consider preempting AM 
containers.

I think there are a couple of rules we need to respect (please let me know if 
you don't agree with any of them):
# The resources used by the users of a queue after preemption should be as 
even as possible
# Before we start preempting AM containers, all task containers should be 
preempted (according to YARN-2022, preempting AM containers has the lowest 
priority)
# If we do have to preempt AM containers, we should still respect #1

For #1,
if we want to quantify the result, it should be:
{code}
Let rp_i = used-resource-after-preemption of user_i, for i ∈ {users}
Minimize sqrt( Σ_i ( rp_i - (Σ_j rp_j) / #{users} )^2 )
{code}
In other words, we should minimize the standard deviation of 
used-resource-after-preemption across users.
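
As a small illustration of this metric (plain Java, a hypothetical helper, not 
part of the patch), the imbalance can be computed as the standard deviation of 
per-user usage after preemption:

{code}
// Standard deviation of per-user used-resource-after-preemption (in GB).
// Minimizing this is equivalent to minimizing the expression above, up to a
// constant factor.
static double imbalance(long[] usedAfterPreemption) {
  double mean = 0;
  for (long r : usedAfterPreemption) {
    mean += r;
  }
  mean /= usedAfterPreemption.length;
  double sumOfSquares = 0;
  for (long r : usedAfterPreemption) {
    sumOfSquares += (r - mean) * (r - mean);
  }
  return Math.sqrt(sumOfSquares / usedAfterPreemption.length);
}
{code}

For the examples below, the first "after preemption" outcome ({8, 8, 8, 24, 4} 
GB per user) scores about 7.0, the preempt-from-most-usage outcome 
({8, 16, 16, 8, 4}) about 4.8, and the ideal result ({12, 12, 12, 12, 4}) 
exactly 3.2.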

Since not all containers are equal in size, it is possible that the 
used-resource-after-preemption of a given user cannot exactly equal the 
target-user-limit. In the current logic, we make 
used-resource-after-preemption <= target-user-limit. Consider the following 
example,
{code}
qA: has user {V, W, X, Y, Z}; each user has one application
V: app5: {4, 4, 4, 4}, //means V has 4 containers, each one has memory=4G, 
minimum_allocation=1G
W: app4: {4, 4, 4, 4},
X: app3: {4, 4, 4, 4},
Y: app2: {4, 4, 4, 4, 4, 4},
Z: app1: {4}
target-user-limit=11,
resource-to-obtain=23

After preemption:
V: {4, 4}
W: {4, 4}
X: {4, 4}
Y: {4, 4, 4, 4, 4, 4}
Z: {4}
{code}
This imbalance happens because each application we preempt from may overshoot 
the user-limit by up to a container's size (a bias); the more users we process, 
the more this bias can accumulate. In other words, the imbalance grows linearly 
with number-of-users-in-a-queue multiplied by average-container-size.

And we cannot solve this problem simply by preempting from the user with the 
most usage; using the same example: 
{code}

qA: has user {V, W, X, Y, Z}; each user has one application
V: app5: {4, 4, 4, 4}, //means V has 4 containers, each one has memory=4G, 
minimum_allocation=1G
W: app4: {4, 4, 4, 4},
X: app3: {4, 4, 4, 4},
Y: app2: {4, 4, 4, 4, 4, 4},
Z: app1: {4}
target-user-limit=11,
resource-to-obtain=23

After preemption (from user has most usage, the sequence is Y->V->W->X->Z):
V: {4, 4}
W: {4, 4, 4, 4}
X: {4, 4, 4, 4}
Y: {4, 4}
Z: {4} 
{code}
Still not very balanced; the ideal result would be:
{code}

V: {4, 4, 4}
W: {4, 4, 4}
X: {4, 4, 4}
Y: {4, 4, 4}
Z: {4} 
{code}

In addition, this approach cannot satisfy rules #2/#3 either if the 
target-user-limit is not computed appropriately. 

So I propose to do it another way:
we should recompute (used-resource - marked-preempted-resource) for a user 
every time we decide to preempt a container. Maybe we can use a priority queue 
keyed by (used-resource - marked-preempted-resource), so that we don't need to 
compute a target user-limit at all.
The pseudocode for preempting resources of a queue might look like this (a 
Java sketch follows the pseudocode):
{code}
compute resToObtain first;

// first preempt task containers only
while (resToObtain > 0 && there are task containers left) {
  pick the user-x which has the most (used-resource - marked-preempted-resource)
  pick one task container-y of user-x to preempt
  resToObtain -= container-y.resource
}

if (resToObtain <= 0) {
  return;
}

// if more resources need to be preempted, start preempting AM containers
while (resToObtain > 0 && total-am-resource - marked-preempted-am-resource > 
max-am-percentage) {
  // do the same thing again:
  pick the user-x which has the most (used-resource - marked-preempted-resource)
  pick one AM container-y of user-x to preempt
  resToObtain -= container-y.resource
}
{code}
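
A self-contained Java sketch of the task-container phase of this idea (the 
types and fields below are hypothetical simplifications, not the actual 
CapacityScheduler classes):

{code}
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;
import java.util.PriorityQueue;

public class QueuePreemptionSketch {

  // Hypothetical, simplified per-user bookkeeping; not a real scheduler class.
  static class UserUsage {
    final String user;
    long used;                    // current used resource (GB)
    long markedPreempted;         // resource already marked for preemption
    final Deque<Long> containers; // task container sizes, in preemption order

    UserUsage(String user, long used, List<Long> containers) {
      this.user = user;
      this.used = used;
      this.containers = new ArrayDeque<Long>(containers);
    }

    long remaining() { return used - markedPreempted; }
  }

  // Marks task containers for preemption, always picking from the user with
  // the largest (used-resource - marked-preempted-resource), until resToObtain
  // is met or no task containers are left. Returns the amount still to obtain;
  // a value > 0 means the AM-container phase would run next.
  static long preemptTaskContainers(List<UserUsage> users, long resToObtain) {
    PriorityQueue<UserUsage> byRemaining = new PriorityQueue<UserUsage>(
        Comparator.comparingLong(UserUsage::remaining).reversed());
    byRemaining.addAll(users);

    while (resToObtain > 0 && !byRemaining.isEmpty()) {
      UserUsage top = byRemaining.poll();  // user with most remaining usage
      if (top.containers.isEmpty()) {
        continue;                          // nothing left to preempt from them
      }
      long c = top.containers.poll();      // pick one container to preempt
      top.markedPreempted += c;
      resToObtain -= c;
      byRemaining.add(top);                // re-insert with the updated key
    }
    return resToObtain;
  }
}
{code}

Re-inserting the user after each pick keeps the heap ordered by the updated 
(used-resource - marked-preempted-resource), which is what produces the 
balanced result: on the example above it ends with V, W, X and Y all at 12 GB 
and Z untouched.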

With this, the imbalance becomes linearly correlated with 
average-container-size only, and it also addresses rules #2/#3 mentioned 
above.
Mayank, does this look like a reasonable suggestion to you? Any other 
thoughts? [~vinodkv], [~curino], [~sunilg].

Thanks,
Wangda

> CS queue level preemption should respect user-limits
> 
>
> Key: YARN-2069
> URL: https://issues.apache.org/jira/browse/YARN-2069
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Mayank Bansal
> Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, 
> YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch, 
> YARN-2069-trunk-6.pat

[jira] [Commented] (YARN-2347) Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in yarn-server-common

2014-07-25 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074204#comment-14074204
 ] 

Zhijie Shen commented on YARN-2347:
---

Please ignore the previous comment 1; I posted the wrong one. The comment 1 I 
meant to post is:

1. The javadoc no longer seems correct after the refactoring.
{code}
+/**
+ * The version information of RM state.
+ */
+@Private
+@Unstable
+public abstract class StateVersion {
{code}
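
For illustration only (an assumption about the intended wording, not taken from 
the patch), the javadoc could describe the consolidated class generically 
instead of mentioning only the RM state:

{code}
import org.apache.hadoop.classification.InterfaceAudience.Private;
import org.apache.hadoop.classification.InterfaceStability.Unstable;

/**
 * The version information of state/schema data maintained by a YARN server
 * component (e.g. the RM state store, NM state store, or timeline store).
 */
@Private
@Unstable
public abstract class StateVersion {
  // abstract members omitted in this sketch
}
{code}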

> Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in 
> yarn-server-common
> 
>
> Key: YARN-2347
> URL: https://issues.apache.org/jira/browse/YARN-2347
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Junping Du
>Assignee: Junping Du
> Attachments: YARN-2347-v2.patch, YARN-2347-v3.patch, YARN-2347.patch
>
>
> We have similar things for version state for RM, NM, TS (TimelineServer), 
> etc. I think we should consolidate them into a common object.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2347) Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in yarn-server-common

2014-07-25 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074198#comment-14074198
 ] 

Zhijie Shen commented on YARN-2347:
---

[~djp], it's a good idea to refactor the code into common classes. The 
changes are straightforward and look mostly good to me. Just some minor 
comments.

1. Mark the class \@Private and \@Unstable?
{code}
+public class StateVersionPBImpl extends StateVersion {
{code}

2. I'm not sure StateVersion is the best name in this case. For example, 
StateVersion for a db schema sounds weird to me. Why not YarnVersion or even 
Version?

> Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in 
> yarn-server-common
> 
>
> Key: YARN-2347
> URL: https://issues.apache.org/jira/browse/YARN-2347
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Junping Du
>Assignee: Junping Du
> Attachments: YARN-2347-v2.patch, YARN-2347-v3.patch, YARN-2347.patch
>
>
> We have similar things for version state for RM, NM, TS (TimelineServer), 
> etc. I think we should consolidate them into a common object.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-641) Make AMLauncher in RM Use NMClient

2014-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074147#comment-14074147
 ] 

Hadoop QA commented on YARN-641:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12587395/YARN-641.3.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4428//console

This message is automatically generated.

> Make AMLauncher in RM Use NMClient
> --
>
> Key: YARN-641
> URL: https://issues.apache.org/jira/browse/YARN-641
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-641.1.patch, YARN-641.2.patch, YARN-641.3.patch
>
>
> YARN-422 adds NMClient. RM's AMLauncher is responsible for the interactions 
> with an application's AM container. AMLauncher should also replace the raw 
> ContainerManager proxy with NMClient.



--
This message was sent by Atlassian JIRA
(v6.2#6252)