[jira] [Assigned] (YARN-56) Handle container requests that request more resources than currently available in the cluster

2014-12-28 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-56?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla reassigned YARN-56:


Assignee: (was: Karthik Kambatla)

 Handle container requests that request more resources than currently 
 available in the cluster
 -

 Key: YARN-56
 URL: https://issues.apache.org/jira/browse/YARN-56
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.0.2-alpha, 0.23.3
Reporter: Hitesh Shah

 In heterogeneous clusters, a simple scheduler-side check that the 
 allocation request is within the max allocatable range is not enough. 
 If the large nodes in the cluster are not available, there may be 
 situations where some allocation requests will never be fulfilled. We need an 
 approach to decide when to invalidate such requests. For application 
 submissions, there will need to be a feedback loop for applications that 
 could not be launched. For running AMs, AllocationResponse may need to be 
 augmented with information about invalidated/cancelled container requests.
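
The gap described above can be pictured with a minimal sketch. It assumes resources are reduced to memory in MB, and every name in it (RequestFeasibility, Node, fitsMaxAllocation, fitsSomeAvailableNode) is hypothetical and not part of YARN. It only contrasts the simple max-allocation check with a check against the nodes that are actually available; it does not propose when a request should be invalidated.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; none of these types exist in YARN.
final class RequestFeasibility {

  /** A node with a fixed total memory capacity and an availability flag. */
  static final class Node {
    final int totalMemoryMb;
    final boolean available;

    Node(int totalMemoryMb, boolean available) {
      this.totalMemoryMb = totalMemoryMb;
      this.available = available;
    }
  }

  /** The simple check: is the request within the configured maximum? */
  static boolean fitsMaxAllocation(int requestMb, int maxAllocationMb) {
    return requestMb > 0 && requestMb <= maxAllocationMb;
  }

  /** The stricter check: can at least one available node hold the request? */
  static boolean fitsSomeAvailableNode(int requestMb, List<Node> nodes) {
    for (Node node : nodes) {
      if (node.available && node.totalMemoryMb >= requestMb) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Heterogeneous cluster in which the only large node is unavailable.
    List<Node> nodes = Arrays.asList(
        new Node(8192, true),
        new Node(8192, true),
        new Node(65536, false));
    int requestMb = 32768;
    int maxAllocationMb = 65536;

    System.out.println("within max allocation: "
        + fitsMaxAllocation(requestMb, maxAllocationMb));
    System.out.println("fits an available node: "
        + fitsSomeAvailableNode(requestMb, nodes));
  }
}
```

In this example the request passes the max-allocation check but fits no available node, which is the kind of request the issue suggests may need to be invalidated.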



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2984) Metrics for container's actual memory usage

2014-12-28 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2984:
---
Issue Type: Sub-task  (was: Improvement)
Parent: YARN-2141

 Metrics for container's actual memory usage
 ---

 Key: YARN-2984
 URL: https://issues.apache.org/jira/browse/YARN-2984
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
 Attachments: yarn-2984-prelim.patch


 It would be nice to capture resource usage per container, for a variety of 
 reasons. This JIRA is to track memory usage. 
 YARN-2965 tracks the resource usage on the node, and the two implementations 
 should reuse code as much as possible. 





[jira] [Assigned] (YARN-2532) Track pending resources at the application level

2014-12-28 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla reassigned YARN-2532:
--

Assignee: (was: Karthik Kambatla)

 Track pending resources at the application level 
 -

 Key: YARN-2532
 URL: https://issues.apache.org/jira/browse/YARN-2532
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Affects Versions: 2.5.1
Reporter: Karthik Kambatla

 SchedulerApplicationAttempt keeps track of current consumption of an app. It 
 would be nice to have a similar value tracked for pending requests. 
 The immediate uses I see are: (1) Showing this on the Web UI (YARN-2333) and 
 (2) updating demand in FS in an event-driven style (YARN-2353)





[jira] [Commented] (YARN-1856) cgroups based memory monitoring for containers

2014-12-28 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259695#comment-14259695
 ] 

Karthik Kambatla commented on YARN-1856:


I haven't had a chance to work on this further. [~beckham007] - how did your 
testing go? Please feel free to take this JIRA over if you want to contribute 
what you guys have done. 

 cgroups based memory monitoring for containers
 --

 Key: YARN-1856
 URL: https://issues.apache.org/jira/browse/YARN-1856
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla







[jira] [Assigned] (YARN-1856) cgroups based memory monitoring for containers

2014-12-28 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla reassigned YARN-1856:
--

Assignee: (was: Karthik Kambatla)

 cgroups based memory monitoring for containers
 --

 Key: YARN-1856
 URL: https://issues.apache.org/jira/browse/YARN-1856
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Karthik Kambatla







[jira] [Assigned] (YARN-1535) Add an option to yarn rmadmin to clear the znode used by embedded elector

2014-12-28 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla reassigned YARN-1535:
--

Assignee: (was: Karthik Kambatla)

 Add an option to yarn rmadmin to clear the znode used by embedded elector
 -

 Key: YARN-1535
 URL: https://issues.apache.org/jira/browse/YARN-1535
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: Karthik Kambatla

 YARN-1029 implements EmbeddedElectorService. Admins should have a way to 
 clear the znode that this elector uses. 





[jira] [Commented] (YARN-2965) Enhance Node Managers to monitor and report the resource usage on machines

2014-12-28 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259697#comment-14259697
 ] 

Karthik Kambatla commented on YARN-2965:


[~srikanthkandula], [~rgrandl] - any updates here? I am particularly keen to 
see how you plan to capture per-container usages, at least memory and CPU. I 
filed YARN-2984 and posted a preliminary patch there that captures memory 
consumption. 

 Enhance Node Managers to monitor and report the resource usage on machines
 --

 Key: YARN-2965
 URL: https://issues.apache.org/jira/browse/YARN-2965
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager, resourcemanager
Reporter: Robert Grandl
Assignee: Robert Grandl
 Attachments: ddoc_RT.docx


 This JIRA is about augmenting Node Managers to monitor resource usage on 
 the machine, aggregate these reports, and expose them to the RM. 





[jira] [Updated] (YARN-2141) [Umbrella] Capture container and node resource consumption

2014-12-28 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2141:
---
Summary: [Umbrella] Capture container and node resource consumption  (was: 
Capture container and node resource consumption)

 [Umbrella] Capture container and node resource consumption
 --

 Key: YARN-2141
 URL: https://issues.apache.org/jira/browse/YARN-2141
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Carlo Curino
Assignee: Karthik Kambatla
Priority: Minor

 Collecting per-container and per-node resource consumption statistics in a 
 fairly granular manner, and making them available to both infrastructure code 
 (e.g., schedulers) and users (e.g., AMs, or users directly via webapps), can 
 facilitate a range of performance work. 





[jira] [Assigned] (YARN-2141) [Umbrella] Capture container and node resource consumption

2014-12-28 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla reassigned YARN-2141:
--

Assignee: (was: Karthik Kambatla)

Filed YARN-2984 to capture container's actual memory consumption. Will file 
another sub-task for CPU (Carlo has emailed me his implementation offline). 

YARN-2965 covers capturing node resource consumption. 



 [Umbrella] Capture container and node resource consumption
 --

 Key: YARN-2141
 URL: https://issues.apache.org/jira/browse/YARN-2141
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Carlo Curino
Priority: Minor

 Collecting per-container and per-node resource consumption statistics in a 
 fairly granular manner, and making them available to both infrastructure code 
 (e.g., schedulers) and users (e.g., AMs, or users directly via webapps), can 
 facilitate a range of performance work. 





[jira] [Commented] (YARN-2716) Refactor ZKRMStateStore retry code with Apache Curator

2014-12-28 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259699#comment-14259699
 ] 

Karthik Kambatla commented on YARN-2716:


[~rkanter] - is it okay for me to take this over? We have recently seen more 
issues with the current implementation, and this rewrite could greatly help. 

 Refactor ZKRMStateStore retry code with Apache Curator
 --

 Key: YARN-2716
 URL: https://issues.apache.org/jira/browse/YARN-2716
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Robert Kanter

 Per the suggestion by [~kasha] in YARN-2131, it would be nice to use Curator to 
 simplify the retry logic in ZKRMStateStore.





[jira] [Resolved] (YARN-2063) ZKRMStateStore: Better handling of operation failures

2014-12-28 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla resolved YARN-2063.

Resolution: Duplicate

 ZKRMStateStore: Better handling of operation failures
 -

 Key: YARN-2063
 URL: https://issues.apache.org/jira/browse/YARN-2063
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Critical

 Today, when a ZK operation fails, we handle connection-loss and 
 operation-timeout the same way. This could definitely use some improvements:
 # Add special handling for other error codes
 # Connection-loss: Nullify zkClient, so a new connection is established
 # Operation-timeout: Retry a few times with exponential delay?
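
The three proposals above can be sketched as follows. This is only an illustration under stated assumptions: ZooKeeper error codes are reduced to a hypothetical Failure enum, and ZkRetrySketch, ZkOpException, Reconnector, backoffMillis and runWithRetries are made-up names, not code from ZKRMStateStore. Retrying immediately after a connection loss is also just one possible choice.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch; these names are not taken from ZKRMStateStore.
final class ZkRetrySketch {

  /** Simplified failure kinds standing in for ZooKeeper error codes. */
  enum Failure { CONNECTION_LOSS, OPERATION_TIMEOUT, OTHER }

  /** Exception carrying the simplified failure kind. */
  static final class ZkOpException extends Exception {
    final Failure failure;

    ZkOpException(Failure failure) {
      super(failure.name());
      this.failure = failure;
    }
  }

  /** Hook for dropping the current client so the next attempt reconnects. */
  interface Reconnector {
    void resetClient();
  }

  /** Delay before retry number attempt (0-based): base * 2^attempt, capped at maxMillis. */
  static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
    long delay = baseMillis;
    for (int i = 0; i < attempt; i++) {
      if (delay > maxMillis / 2) {
        return maxMillis; // doubling would pass the cap (and could overflow)
      }
      delay *= 2;
    }
    return Math.min(delay, maxMillis);
  }

  /** Runs op, handling each failure kind differently, up to maxRetries retries. */
  static <T> T runWithRetries(Callable<T> op, Reconnector reconnector,
      int maxRetries, long baseMillis, long maxMillis) throws Exception {
    for (int attempt = 0; ; attempt++) {
      try {
        return op.call();
      } catch (ZkOpException e) {
        if (e.failure == Failure.OTHER || attempt >= maxRetries) {
          throw e; // other error codes and exhausted retries go to the caller
        }
        if (e.failure == Failure.CONNECTION_LOSS) {
          reconnector.resetClient(); // force a new connection, then retry
        } else {
          Thread.sleep(backoffMillis(attempt, baseMillis, maxMillis));
        }
      }
    }
  }
}
```

With a base of 100 ms and a cap of 1000 ms, successive operation-timeout retries would wait 100, 200, 400, 800 and then 1000 ms.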





[jira] [Updated] (YARN-2062) Too many InvalidStateTransitionExceptions from NodeState.NEW on RM failover

2014-12-28 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2062:
---
Target Version/s: 2.7.0  (was: 2.6.0)

 Too many InvalidStateTransitionExceptions from NodeState.NEW on RM failover
 ---

 Key: YARN-2062
 URL: https://issues.apache.org/jira/browse/YARN-2062
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla

 On busy clusters, we see several 
 {{org.apache.hadoop.yarn.state.InvalidStateTransitonException}} for events 
 invoked against NEW nodes. 





[jira] [Updated] (YARN-2062) Too many InvalidStateTransitionExceptions from NodeState.NEW on RM failover

2014-12-28 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2062:
---
Attachment: yarn-2062-1.patch

Straightforward patch that adds a dummy transition so that invalid 
transitions are not logged. 

[~jianhe], [~adhoot] - does this patch make any sense? Should we be handling 
all these transitions to better handle work-preserving RM restart? 

 Too many InvalidStateTransitionExceptions from NodeState.NEW on RM failover
 ---

 Key: YARN-2062
 URL: https://issues.apache.org/jira/browse/YARN-2062
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
 Attachments: yarn-2062-1.patch


 On busy clusters, we see several 
 {{org.apache.hadoop.yarn.state.InvalidStateTransitonException}} for events 
 invoked against NEW nodes. 





[jira] [Updated] (YARN-2993) Several fixes (missing acl check, error log msg ...) and some refinement in AdminService

2014-12-28 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu updated YARN-2993:
-
Fix Version/s: 2.7.0

 Several fixes (missing acl check, error log msg ...) and some refinement in 
 AdminService
 

 Key: YARN-2993
 URL: https://issues.apache.org/jira/browse/YARN-2993
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Yi Liu
Assignee: Yi Liu
 Fix For: 2.7.0

 Attachments: YARN-2993.001.patch


 This JIRA is to resolve the following issues in 
 {{org.apache.hadoop.yarn.server.resourcemanager.AdminService}}:
 *1.* There is no ACL check for {{refreshServiceAcls}}.
 *2.* The log message in {{refreshAdminAcls}} is incorrect; it should be "... Can 
 not refresh Admin ACLs." instead of "... Can not refresh user-groups."
 *3.* Some unnecessary imports.
 *4.* {code}
 if (!isRMActive()) {
   RMAuditLogger.logFailure(user.getShortUserName(), argName,
       adminAcl.toString(), "AdminService",
       "ResourceManager is not active. Can not remove labels.");
   throwStandbyException();
 }
 {code}
 is common to many methods, with only the message differing; we should factor 
 it out into one common method.
 *5.* {code}
 LOG.info("Exception remove labels", ioe);
 RMAuditLogger.logFailure(user.getShortUserName(), argName,
     adminAcl.toString(), "AdminService", "Exception remove label");
 throw RPCUtil.getRemoteException(ioe);
 {code}
 is common to many methods, with only the message differing; we should factor 
 it out into one common method.
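
A rough sketch of what the shared methods proposed in items 4 and 5 could look like is given below. It is not the attached patch: the helper names checkRMStatus and logAndWrapException, the AdminServiceSketch class, and the list-based audit log are assumptions made so the example is self-contained, and plain IOException stands in for the standby and remote exceptions.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the audit log and the exceptions are stubbed out so the
// example compiles on its own. The snippets in the description use
// RMAuditLogger, throwStandbyException() and RPCUtil.getRemoteException().
final class AdminServiceSketch {
  final List<String> auditLog = new ArrayList<String>();
  boolean rmActive = true;

  /** Shared guard for item 4: log the failure and reject when the RM is not active. */
  void checkRMStatus(String user, String argName, String msg) throws IOException {
    if (!rmActive) {
      auditLog.add(user + " " + argName
          + ": ResourceManager is not active. Can not " + msg);
      throw new IOException("ResourceManager is in standby"); // stand-in exception
    }
  }

  /** Shared handler for item 5: log the failure and return the exception to throw. */
  IOException logAndWrapException(IOException ioe, String user, String argName,
      String msg) {
    auditLog.add(user + " " + argName + ": " + msg);
    return new IOException(msg, ioe); // stand-in for the remote exception
  }

  /** Example caller: only the messages differ between admin operations. */
  void removeLabels(String user) throws IOException {
    String argName = "removeLabels"; // hypothetical operation name
    checkRMStatus(user, argName, "remove labels.");
    try {
      doRemoveLabels();
    } catch (IOException ioe) {
      throw logAndWrapException(ioe, user, argName, "Exception remove label");
    }
  }

  /** Placeholder for the real work; always succeeds in this sketch. */
  void doRemoveLabels() throws IOException {
  }
}
```

Each admin method then carries only its own messages, and the audit-log format lives in one place.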





[jira] [Commented] (YARN-2993) Several fixes (missing acl check, error log msg ...) and some refinement in AdminService

2014-12-28 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259776#comment-14259776
 ] 

Yi Liu commented on YARN-2993:
--

Thanks [~djp] for the review and commit.

 Several fixes (missing acl check, error log msg ...) and some refinement in 
 AdminService
 

 Key: YARN-2993
 URL: https://issues.apache.org/jira/browse/YARN-2993
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Yi Liu
Assignee: Yi Liu
 Fix For: 2.7.0

 Attachments: YARN-2993.001.patch


 This JIRA is to resolve the following issues in 
 {{org.apache.hadoop.yarn.server.resourcemanager.AdminService}}:
 *1.* There is no ACL check for {{refreshServiceAcls}}.
 *2.* The log message in {{refreshAdminAcls}} is incorrect; it should be "... Can 
 not refresh Admin ACLs." instead of "... Can not refresh user-groups."
 *3.* Some unnecessary imports.
 *4.* {code}
 if (!isRMActive()) {
   RMAuditLogger.logFailure(user.getShortUserName(), argName,
       adminAcl.toString(), "AdminService",
       "ResourceManager is not active. Can not remove labels.");
   throwStandbyException();
 }
 {code}
 is common to many methods, with only the message differing; we should factor 
 it out into one common method.
 *5.* {code}
 LOG.info("Exception remove labels", ioe);
 RMAuditLogger.logFailure(user.getShortUserName(), argName,
     adminAcl.toString(), "AdminService", "Exception remove label");
 throw RPCUtil.getRemoteException(ioe);
 {code}
 is common to many methods, with only the message differing; we should factor 
 it out into one common method.





[jira] [Updated] (YARN-2797) TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase

2014-12-28 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2797:
---
Attachment: yarn-2797-1.patch

Straightforward patch.

 TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
 

 Key: YARN-2797
 URL: https://issues.apache.org/jira/browse/YARN-2797
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.5.1
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Minor
 Attachments: yarn-2797-1.patch








[jira] [Commented] (YARN-2062) Too many InvalidStateTransitionExceptions from NodeState.NEW on RM failover

2014-12-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259793#comment-14259793
 ] 

Hadoop QA commented on YARN-2062:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12689271/yarn-2062-1.patch
  against trunk revision 1454efe.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 15 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6200//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/6200//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6200//console

This message is automatically generated.

 Too many InvalidStateTransitionExceptions from NodeState.NEW on RM failover
 ---

 Key: YARN-2062
 URL: https://issues.apache.org/jira/browse/YARN-2062
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
 Attachments: yarn-2062-1.patch


 On busy clusters, we see several 
 {{org.apache.hadoop.yarn.state.InvalidStateTransitonException}} for events 
 invoked against NEW nodes. 





[jira] [Commented] (YARN-2797) TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase

2014-12-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259814#comment-14259814
 ] 

Hadoop QA commented on YARN-2797:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12689275/yarn-2797-1.patch
  against trunk revision 1454efe.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 15 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6201//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/6201//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6201//console

This message is automatically generated.

 TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
 

 Key: YARN-2797
 URL: https://issues.apache.org/jira/browse/YARN-2797
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.5.1
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Minor
 Attachments: yarn-2797-1.patch








[jira] [Commented] (YARN-2994) Document work-preserving RM restart

2014-12-28 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259853#comment-14259853
 ] 

Rohith commented on YARN-2994:
--

Thanks [~jianhe] for working on documenting the work-preserving restart feature. I 
quickly read the patch, and the changes look fine. I have one basic question: does 
work-preserving restart work only with ZKRMStateStore? It can also be used with 
FileSystemRMStateStore, right? 

 Document work-preserving RM restart
 ---

 Key: YARN-2994
 URL: https://issues.apache.org/jira/browse/YARN-2994
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2994.1.patch








[jira] [Commented] (YARN-2797) TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase

2014-12-28 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259863#comment-14259863
 ] 

Rohith commented on YARN-2797:
--

Thanks, Karthik, for working on this. One quick comment: 
ParameterizedSchedulerTestBase does not have FIFO scheduler configurations. 
TestWorkPreservingRMRestart runs for the FIFO scheduler too, so I think FIFO 
should also be included.

 TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
 

 Key: YARN-2797
 URL: https://issues.apache.org/jira/browse/YARN-2797
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.5.1
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Minor
 Attachments: yarn-2797-1.patch








[jira] [Commented] (YARN-2994) Document work-preserving RM restart

2014-12-28 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259864#comment-14259864
 ] 

Jian He commented on YARN-2994:
---

yes, that's correct. 

 Document work-preserving RM restart
 ---

 Key: YARN-2994
 URL: https://issues.apache.org/jira/browse/YARN-2994
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2994.1.patch








[jira] [Commented] (YARN-2992) ZKRMStateStore crashes due to session expiry

2014-12-28 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259887#comment-14259887
 ] 

Rohith commented on YARN-2992:
--

I see, yes. In my cluster, the configured retry count was very low, so the RM 
was exiting very soon.

 ZKRMStateStore crashes due to session expiry
 

 Key: YARN-2992
 URL: https://issues.apache.org/jira/browse/YARN-2992
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Blocker
 Fix For: 2.7.0

 Attachments: yarn-2992-1.patch


 We recently saw the RM crash with the following stacktrace. On session 
 expiry, we should gracefully transition to standby. 
 {noformat}
 2014-12-18 06:28:42,689 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
 org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
 = Session expired 
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) 
 at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931) 
 at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911) 
 at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:930)
 at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:927)
 at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1069)
 at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1088)
 at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:927)
 at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:941)
 at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:958)
 at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:687)
 {noformat}





[jira] [Commented] (YARN-2992) ZKRMStateStore crashes due to session expiry

2014-12-28 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14259885#comment-14259885
 ] 

Rohith commented on YARN-2992:
--

I see, yes.. In my cluster, configured retry was very less, so Rm was exiting 
very soon.

 ZKRMStateStore crashes due to session expiry
 

 Key: YARN-2992
 URL: https://issues.apache.org/jira/browse/YARN-2992
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Blocker
 Fix For: 2.7.0

 Attachments: yarn-2992-1.patch


 We recently saw the RM crash with the following stacktrace. On session 
 expiry, we should gracefully transition to standby. 
 {noformat}
 2014-12-18 06:28:42,689 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
 org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
 = Session expired 
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) 
 at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931) 
 at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:930)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:927)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1069)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1088)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:927)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:941)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:958)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:687)
  
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2992) ZKRMStateStore crashes due to session expiry

2014-12-28 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14259886#comment-14259886
 ] 

Rohith commented on YARN-2992:
--

I see, yes.. In my cluster, configured retry was very less, so Rm was exiting 
very soon.

 ZKRMStateStore crashes due to session expiry
 

 Key: YARN-2992
 URL: https://issues.apache.org/jira/browse/YARN-2992
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Blocker
 Fix For: 2.7.0

 Attachments: yarn-2992-1.patch


 We recently saw the RM crash with the following stacktrace. On session 
 expiry, we should gracefully transition to standby. 
 {noformat}
 2014-12-18 06:28:42,689 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
 org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
 = Session expired 
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) 
 at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931) 
 at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:930)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:927)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1069)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1088)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:927)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:941)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:958)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:687)
  
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2992) ZKRMStateStore crashes due to session expiry

2014-12-28 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14259884#comment-14259884
 ] 

Rohith commented on YARN-2992:
--

I see, yes.. In my cluster, configured retry was very less, so Rm was exiting 
very soon.

 ZKRMStateStore crashes due to session expiry
 

 Key: YARN-2992
 URL: https://issues.apache.org/jira/browse/YARN-2992
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Blocker
 Fix For: 2.7.0

 Attachments: yarn-2992-1.patch


 We recently saw the RM crash with the following stacktrace. On session 
 expiry, we should gracefully transition to standby. 
 {noformat}
 2014-12-18 06:28:42,689 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
 org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause: 
 org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
 = Session expired 
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) 
 at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931) 
 at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911) 
 at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:930)
 at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:927)
 at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1069)
 at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1088)
 at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:927)
 at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:941)
 at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:958)
 at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:687)
 {noformat}
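
 The behaviour requested above (demote to standby on session expiry rather 
 than treating it as fatal) might look roughly like the sketch below. This 
 is only an illustration and is not taken from the attached patch: the 
 exception class, the HaState enum, StoreOp and StateStoreSketch are all 
 hypothetical stand-ins, not the real ZooKeeper or YARN types.

```java
// Hypothetical stand-in for a ZooKeeper session-expired error; not the real type.
class SessionExpiredSketchException extends Exception {}

// Hypothetical HA states, assumed only for this illustration.
enum HaState { ACTIVE, STANDBY, STOPPED }

// A single state-store operation that may fail.
interface StoreOp {
    void run() throws Exception;
}

class StateStoreSketch {
    private HaState state = HaState.ACTIVE;

    HaState state() {
        return state;
    }

    // Runs one store operation. An expired session demotes the process to
    // standby; any other failure is still treated as fatal in this sketch.
    void runStoreOp(StoreOp op) {
        try {
            op.run();
        } catch (SessionExpiredSketchException e) {
            state = HaState.STANDBY;
        } catch (Exception e) {
            state = HaState.STOPPED;
        }
    }
}
```

 The only point of the sketch is that session expiry gets its own branch, 
 separate from the generic failure path that currently ends in a fatal event.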





[jira] [Updated] (YARN-2922) Concurrent Modification Exception in LeafQueue when collecting applications

2014-12-28 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-2922:
-
Target Version/s: 2.7.0

 Concurrent Modification Exception in LeafQueue when collecting applications
 ---

 Key: YARN-2922
 URL: https://issues.apache.org/jira/browse/YARN-2922
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.5.1
Reporter: Jason Tufo
Assignee: Rohith
 Attachments: 0001-YARN-2922.patch


 java.util.ConcurrentModificationException
 at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1115)
 at java.util.TreeMap$KeyIterator.next(TreeMap.java:1169)
 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.collectSchedulerApplications(LeafQueue.java:1618)
 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getAppsInQueue(CapacityScheduler.java:1119)
 at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueInfo(ClientRMService.java:798)
 at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueInfo(ApplicationClientProtocolPBServiceImpl.java:234)
 at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:333)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
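
 The trace shows a TreeMap key iterator failing fast inside 
 LeafQueue.collectSchedulerApplications, which is what java.util sorted 
 collections do when they are structurally modified while being iterated. 
 The sketch below reproduces that failure in a single thread and shows one 
 possible way to avoid it, by copying under the writers' lock. QueueSketch 
 and its methods are hypothetical names chosen for illustration; this is 
 not the LeafQueue code and not necessarily what the attached patch does.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a queue that tracks application ids in a sorted set.
class QueueSketch {
    private final Set<String> activeApps = new TreeSet<>();

    synchronized void addApp(String appId) {
        activeApps.add(appId);
    }

    // Iterates the live set without a snapshot. A structural change made while
    // the iterator is open makes the following next() call fail fast.
    // Needs at least two ids in the set for the failure to be observed.
    boolean liveIterationFails() {
        try {
            for (String app : activeApps) {
                // Simulates another thread adding an application mid-iteration.
                activeApps.add(app + "-late");
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Copies the ids while holding the same lock that writers take, so callers
    // iterate over a private snapshot instead of the live set.
    synchronized List<String> collectApps() {
        return new ArrayList<>(activeApps);
    }
}
```

 The snapshot costs one copy per call, which is usually acceptable for an 
 infrequent client RPC such as getQueueInfo; whether that trade-off suits 
 the scheduler here is an assumption of this sketch.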





[jira] [Assigned] (YARN-2991) TestRMRestart.testDecomissionedNMsMetricsOnRMRestart intermittently fails on trunk

2014-12-28 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith reassigned YARN-2991:


Assignee: Rohith

 TestRMRestart.testDecomissionedNMsMetricsOnRMRestart intermittently fails on 
 trunk
 --

 Key: YARN-2991
 URL: https://issues.apache.org/jira/browse/YARN-2991
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Zhijie Shen
Assignee: Rohith

 {code}
 Error Message
 test timed out after 6 milliseconds
 Stacktrace
 java.lang.Exception: test timed out after 6 milliseconds
   at java.lang.Object.wait(Native Method)
   at java.lang.Thread.join(Thread.java:1281)
   at java.lang.Thread.join(Thread.java:1355)
   at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:150)
   at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
   at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
   at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
   at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStop(ResourceManager.java:1106)
   at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testDecomissionedNMsMetricsOnRMRestart(TestRMRestart.java:1873)
 {code}
 It happened twice this month:
 https://builds.apache.org/job/PreCommit-YARN-Build/6096/
 https://builds.apache.org/job/PreCommit-YARN-Build/6182/





[jira] [Updated] (YARN-2991) TestRMRestart.testDecomissionedNMsMetricsOnRMRestart intermittently fails on trunk

2014-12-28 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-2991:
-
Priority: Blocker  (was: Major)

 TestRMRestart.testDecomissionedNMsMetricsOnRMRestart intermittently fails on 
 trunk
 --

 Key: YARN-2991
 URL: https://issues.apache.org/jira/browse/YARN-2991
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Zhijie Shen
Assignee: Rohith
Priority: Blocker

 {code}
 Error Message
 test timed out after 6 milliseconds
 Stacktrace
 java.lang.Exception: test timed out after 6 milliseconds
   at java.lang.Object.wait(Native Method)
   at java.lang.Thread.join(Thread.java:1281)
   at java.lang.Thread.join(Thread.java:1355)
   at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:150)
   at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
   at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
   at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
   at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStop(ResourceManager.java:1106)
   at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testDecomissionedNMsMetricsOnRMRestart(TestRMRestart.java:1873)
 {code}
 It happened twice this month:
 https://builds.apache.org/job/PreCommit-YARN-Build/6096/
 https://builds.apache.org/job/PreCommit-YARN-Build/6182/


