[jira] [Commented] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion.

2015-03-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375212#comment-14375212
 ] 

Hadoop QA commented on YARN-3385:
---------------------------------

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12706414/YARN-3385.000.patch
  against trunk revision 4cd54d9.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/7069//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7069//console

This message is automatically generated.

 Race condition: KeeperException$NoNodeException will cause RM shutdown during 
 ZK node deletion.
 ---

 Key: YARN-3385
 URL: https://issues.apache.org/jira/browse/YARN-3385
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical
 Attachments: YARN-3385.000.patch


 Race condition: KeeperException$NoNodeException will cause RM shutdown during 
 ZK node deletion(Op.delete).
 The race condition is similar to YARN-2721 and YARN-3023.
 Since the race condition exists for ZK node creation, it should also exist 
 for ZK node deletion.
 We see this issue with the following stack trace:
 {code}
 2015-03-17 19:18:58,958 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
 org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause:
 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
   at java.lang.Thread.run(Thread.java:745)
 2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
 status 1
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1621) Add CLI to list rows of task attempt ID, container ID, host of container, state of container

2015-03-22 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375071#comment-14375071
 ] 

Naganarasimha G R commented on YARN-1621:
-----------------------------------------

Hi [~noddi],
Sorry for the delayed response, I was held up with other activities. A few 
comments:
* Requires a rebase, as {{TestYarnCLI.java}} and {{ApplicationCLI.java}} are 
not compiling.
* Small nits: in ContainerCLI, YarnClientImpl, etc., many lines exceed 80 
characters; maybe we can run Eclipse formatting once.
* The output format for listing containers by AppAttemptID and by AppId can be 
unified; {{writer.println("ApplicationAttempt-Id: " + 
attemptReport.getApplicationAttemptId());}} could be added for the AppAttemptID 
case too. Your opinion?
* The code for printing the containers is common to the AppAttemptID and AppId 
paths, so we can reduce the duplication by extracting it into a common method 
(see the sketch after this list).
* listApplicationContainers can take the already-converted ApplicationId as an 
argument at line 160.
* ApplicationNotFoundException can also come from 
{{client.getContainers(appAttemptId, containerStates)}}; better to catch the 
exception and return an error exitCode for listApplicationAttemptContainers 
too. Wouldn't a common try/catch block around listApplicationContainers, 
capturing YarnException and IOException, be good?
* The exception handling might become too verbose; overall I was expecting 
something like:
{code}
String id = cliParser.getOptionValue(LIST_CMD);
try {
  try {
    listApplicationContainers(ConverterUtils.toApplicationId(id),
        containerStates);
  } catch (IllegalArgumentException e) {
    try {
      listApplicationAttemptContainers(
          ConverterUtils.toApplicationAttemptId(id), containerStates);
    } catch (IllegalArgumentException e1) {
      sysout.println("Wrong format of application ID or"
          + " application attempt ID");
      return exitCode;
    }
  }
} catch (YarnException e) {
  return exitCode;
} catch (IOException e) {
  return exitCode;
}
{code}
* Instead of throwing ApplicationNotFoundException, we can throw YarnException 
when {{app == null || 
!validApplicationStates.contains(app.getYarnApplicationState())}} in 
listApplicationContainers(applicationId, states).
* Better to add a comment in YarnClientImpl.getContainers explaining 
{noformat}isContainerStatesEmpty || !(containerStates.size() == 1
 && containerStates.contains(ContainerState.COMPLETE)){noformat}
* {{Boolean showFinishedContainers}}: better to use the primitive boolean 
instead of the wrapper class.
* Maybe we can also leverage the benefit of passing the states to the AHS; 
this will reduce the transfer of data from the AHS to the client. Your 
opinion?
* If we are incorporating the above point, then I feel we need to query the 
AHS for all states only when the app is not found in the RM; otherwise, 
querying for the COMPLETE state would be sufficient.
* There are no test cases for the modification of 
GetContainersRequestPBImpl/GetContainersRequestProto.
* There are some test case failures and findbugs issues reported; can you take 
a look at them?
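
A rough sketch of the common helper suggested above could look like the 
following; the class name, method name, and format string here are my own 
illustration, not the actual patch code:
{code}
import java.io.PrintWriter;
import java.util.List;

import org.apache.hadoop.yarn.api.records.ContainerReport;

// Illustrative shared helper for both the AppId and AppAttemptID listing
// paths; both CLI branches would delegate here instead of duplicating the
// printing loop.
class ContainerReportPrinter {
  private static final String FORMAT = "%30s\t%20s\t%20s\t%15s\t%30s\t%30s%n";

  static void printContainersReport(List<ContainerReport> containers,
      PrintWriter writer) {
    writer.printf(FORMAT, "Container-Id", "Start Time", "Finish Time",
        "State", "Host", "LOG-URL");
    for (ContainerReport report : containers) {
      writer.printf(FORMAT, report.getContainerId(),
          report.getCreationTime(), report.getFinishTime(),
          report.getContainerState(), report.getAssignedNode(),
          report.getLogUrl());
    }
    writer.flush();
  }
}
{code}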

I have not yet gone through the test code or applied and tested this patch; 
once you have rebased and we have finalized the above points, I will check the 
test code and also do some verification.

 Add CLI to list rows of task attempt ID, container ID, host of container, 
 state of container
 --

 Key: YARN-1621
 URL: https://issues.apache.org/jira/browse/YARN-1621
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.2.0
Reporter: Tassapol Athiapinya
Assignee: Bartosz Ɓugowski
 Attachments: YARN-1621.1.patch, YARN-1621.2.patch, YARN-1621.3.patch, 
 YARN-1621.4.patch, YARN-1621.5.patch


 As more applications are moved to YARN, we need a generic CLI to list rows of 
 task attempt ID, container ID, host of container, and state of container. 
 Today, if a YARN application running in a container hangs, there is no way to 
 find out more info, because a user does not know where each attempt is 
 running.
 For each running application, it is useful to differentiate between 
 running/succeeded/failed/killed containers.
  
 {code:title=proposed yarn cli}
 $ yarn application -list-containers -applicationId appId [-containerState 
 state of container]
 where containerState is an optional filter to list containers in the given state only.
 container state can be running/succeeded/killed/failed/all.
 A user can specify more than one container state at once e.g. KILLED,FAILED.
 task attempt ID container ID host of container state of container 
 {code}
 The CLI should work with both running and completed applications. If a 
 container 

[jira] [Updated] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion.

2015-03-22 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3385:

Summary: Race condition: KeeperException$NoNodeException will cause RM 
shutdown during ZK node deletion.  (was: Race condition: 
KeeperException$NoNodeException will cause RM shutdown during ZK node 
deletion(Op.delete).)

 Race condition: KeeperException$NoNodeException will cause RM shutdown during 
 ZK node deletion.
 ---

 Key: YARN-3385
 URL: https://issues.apache.org/jira/browse/YARN-3385
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical

 Race condition: KeeperException$NoNodeException will cause RM shutdown during 
 ZK node deletion(Op.delete).
 The race condition is similar as YARN-2721 and YARN-3023.
 since the race condition exists for ZK node creation, it should also exist 
 for  ZK node deletion.
 We see this issue with the following stack trace:
 {code}
 2015-03-17 19:18:58,958 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
 org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause:
 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
   at java.lang.Thread.run(Thread.java:745)
 2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
 status 1
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2178) TestApplicationMasterService sometimes fails in trunk

2015-03-22 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu resolved YARN-2178.
--------------------------
Resolution: Cannot Reproduce

 TestApplicationMasterService sometimes fails in trunk
 -

 Key: YARN-2178
 URL: https://issues.apache.org/jira/browse/YARN-2178
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Ted Yu
Priority: Minor
  Labels: test

 From https://builds.apache.org/job/Hadoop-Yarn-trunk/587/ :
 {code}
 Running 
 org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService
 Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 55.763 sec 
 <<< FAILURE! - in 
 org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService
 testInvalidContainerReleaseRequest(org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService)
   Time elapsed: 41.336 sec  <<< FAILURE!
 java.lang.AssertionError: AppAttempt state is not correct (timedout) 
 expected:<ALLOCATED> but was:<SCHEDULED>
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.failNotEquals(Assert.java:743)
   at org.junit.Assert.assertEquals(Assert.java:118)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:82)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:401)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService.testInvalidContainerReleaseRequest(TestApplicationMasterService.java:143)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion(Op.delete).

2015-03-22 Thread zhihai xu (JIRA)
zhihai xu created YARN-3385:
----------------------------

 Summary: Race condition: KeeperException$NoNodeException will 
cause RM shutdown during ZK node deletion(Op.delete).
 Key: YARN-3385
 URL: https://issues.apache.org/jira/browse/YARN-3385
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical


Race condition: KeeperException$NoNodeException will cause RM shutdown during 
ZK node deletion(Op.delete).
The race condition is similar to YARN-2721 and YARN-3023.
When the race condition exists for ZK node creation, it should also exist for 
ZK node deletion.
We see this issue with the following stack trace:
{code}
2015-03-17 19:18:58,958 FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
STATE_STORE_OP_FAILED. Cause:
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:745)
2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
status 1
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion(Op.delete).

2015-03-22 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375098#comment-14375098
 ] 

zhihai xu commented on YARN-3385:
---------------------------------

The sequence for the race condition is the following:
1. The RM tried to remove the state of application 
application_1426560404988_0132 from ZKRMStateStore.
{code}
2015-03-17 19:18:48,075 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Max number of 
completed apps kept in state store met: maxCompletedAppsInStateStore = 1, 
removing app application_1426560404988_0132 from state store.
2015-03-17 19:18:48,075 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing 
info for app: application_1426560404988_0132
{code}

2. Unluckily, a ConnectionLoss on the ZK session happened at the same time as 
the RM removed the application state from ZK.
The ZooKeeper server deleted the node successfully, but due to the 
ConnectionLoss, the RM didn't know the operation succeeded.
{code}
2015-03-17 19:18:51,836 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss
{code}

3. The RM retried the operation to remove the application state from ZK.
{code}
2015-03-17 19:18:51,837 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying 
operation on ZK. Retry no. 1
{code}

4. During the retry, the ZK session was reconnected.
{code}
2015-03-17 19:18:58,924 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server, sessionid = 0x24be28f536e2006, negotiated 
timeout = 1
{code}

5. Because the node had already been deleted successfully on ZooKeeper by the 
previous operation, the retry failed with a NoNode KeeperException:
{code}
2015-03-17 19:18:58,956 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
2015-03-17 19:18:58,956 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed 
out ZK retries. Giving up!
{code}

6. This NoNode KeeperException caused the app removal to fail in RMStateStore:
{code}
2015-03-17 19:18:58,956 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error 
removing app: application_1426560404988_0132
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
{code}

7. RMStateStore then sent an RMFatalEventType.STATE_STORE_OP_FAILED event to 
the ResourceManager:
{code}
  protected void notifyStoreOperationFailed(Exception failureCause) {
RMFatalEventType type;
if (failureCause instanceof StoreFencedException) {
  type = RMFatalEventType.STATE_STORE_FENCED;
} else {
  type = RMFatalEventType.STATE_STORE_OP_FAILED;
}
rmDispatcher.getEventHandler().handle(new RMFatalEvent(type, failureCause));
  }
{code}

8. The ResourceManager killed itself after receiving the 
STATE_STORE_OP_FAILED RMFatalEvent:
{code}
2015-03-17 19:18:58,958 FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
STATE_STORE_OP_FAILED. Cause:
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
status 1
{code}
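
A minimal sketch of the fix direction, assuming the same approach 
YARN-2721/YARN-3023 took for NodeExists on create: a NoNodeException on a 
retried delete means the node is already gone, which is exactly the end state 
the operation wants, so it can be tolerated instead of escalating into a fatal 
state-store error. The class and method names below are illustrative, not the 
exact ZKRMStateStore API:
{code}
import org.apache.zookeeper.KeeperException.NoNodeException;
import org.apache.zookeeper.ZooKeeper;

// Illustrative sketch only: swallow NoNodeException on delete, because a
// previous, connection-lost attempt probably already deleted the node.
class SafeZkDelete {
  static void safeDelete(ZooKeeper zkClient, String path) throws Exception {
    try {
      zkClient.delete(path, -1);  // version -1 matches any node version
    } catch (NoNodeException e) {
      // Desired end state (node absent) is already reached; treat the
      // retry as a success instead of raising STATE_STORE_OP_FAILED.
    }
  }
}
{code}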


 Race condition: KeeperException$NoNodeException will cause RM shutdown during 
 ZK node deletion(Op.delete).
 --

 Key: YARN-3385
 URL: https://issues.apache.org/jira/browse/YARN-3385
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical

 Race condition: KeeperException$NoNodeException will cause RM shutdown during 
 ZK node deletion(Op.delete).
 The race condition is similar to YARN-2721 and YARN-3023.
 When the race condition exists for ZK node creation, it should also exist for 
 ZK node deletion.
 We see this issue with the following stack trace:
 {code}
 2015-03-17 19:18:58,958 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
 org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause:
 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
   at 
 

[jira] [Updated] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion.

2015-03-22 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3385:

Attachment: YARN-3385.000.patch

 Race condition: KeeperException$NoNodeException will cause RM shutdown during 
 ZK node deletion.
 ---

 Key: YARN-3385
 URL: https://issues.apache.org/jira/browse/YARN-3385
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical
 Attachments: YARN-3385.000.patch


 Race condition: KeeperException$NoNodeException will cause RM shutdown during 
 ZK node deletion(Op.delete).
 The race condition is similar to YARN-2721 and YARN-3023.
 Since the race condition exists for ZK node creation, it should also exist 
 for ZK node deletion.
 We see this issue with the following stack trace:
 {code}
 2015-03-17 19:18:58,958 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
 org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause:
 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
   at java.lang.Thread.run(Thread.java:745)
 2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
 status 1
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion(Op.delete).

2015-03-22 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3385:

Description: 
Race condition: KeeperException$NoNodeException will cause RM shutdown during 
ZK node deletion(Op.delete).
The race condition is similar to YARN-2721 and YARN-3023.
Since the race condition exists for ZK node creation, it should also exist for 
ZK node deletion.
We see this issue with the following stack trace:
{code}
2015-03-17 19:18:58,958 FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
STATE_STORE_OP_FAILED. Cause:
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:745)
2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
status 1
{code}

  was:
Race condition: KeeperException$NoNodeException will cause RM shutdown during 
ZK node deletion(Op.delete).
The race condition is similar to YARN-2721 and YARN-3023.
When the race condition exists for ZK node creation, it should also exist for 
ZK node deletion.
We see this issue with the following stack trace:
{code}
2015-03-17 19:18:58,958 FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
STATE_STORE_OP_FAILED. Cause:
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:745)
2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
status 1
{code}


 Race condition: KeeperException$NoNodeException will cause RM shutdown during 
 ZK node deletion(Op.delete).
 

[jira] [Commented] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion.

2015-03-22 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375185#comment-14375185
 ] 

zhihai xu commented on YARN-3385:
---------------------------------

I uploaded a patch, YARN-3385.000.patch, for review. The patch fixes both 
Op.delete and zkClient.delete to tolerate NoNodeException, and optimizes the 
code in removeRMDelegationTokenState to skip the ZK delete operation if the 
node doesn't exist.
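
A hedged sketch of that removeRMDelegationTokenState optimization, using only 
stock ZooKeeper API calls (the real patch code may differ; the class and 
method names here are illustrative):
{code}
import java.util.ArrayList;
import java.util.List;

import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooKeeper;

// Illustrative only: queue the delete op for the token node only if the
// node still exists, so an already-removed node cannot fail the multi().
class TokenNodeRemover {
  static void removeTokenNode(ZooKeeper zkClient, String nodeRemovePath)
      throws Exception {
    List<Op> opList = new ArrayList<Op>();
    if (zkClient.exists(nodeRemovePath, false) != null) {
      opList.add(Op.delete(nodeRemovePath, -1));
    }
    if (!opList.isEmpty()) {
      zkClient.multi(opList);  // ZKRMStateStore wraps this in retries
    }
  }
}
{code}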

Without the patch, the test fails with the following message:
{code}
---
 T E S T S
---
Running 
org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore
Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.853 sec 
<<< FAILURE! - in 
org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore
testRMAppDeleteNoNodeException(org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore)
  Time elapsed: 1.253 sec  <<< FAILURE!
java.lang.AssertionError: NoNodeException should not happen.
at org.junit.Assert.fail(Assert.java:88)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore.testRMAppDeleteNoNodeException(TestZKRMStateStore.java:405)
Results :
Failed tests: 
  TestZKRMStateStore.testRMAppDeleteNoNodeException:405 NoNodeException should 
not happen.
Tests run: 5, Failures: 1, Errors: 0, Skipped: 0

org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:949)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:920)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:916)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1080)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1101)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:916)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:928)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:697)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore.testRMAppDelete(TestZKRMStateStore.java:401)
{code}

 Race condition: KeeperException$NoNodeException will cause RM shutdown during 
 ZK node deletion.
 ---

 Key: YARN-3385
 URL: https://issues.apache.org/jira/browse/YARN-3385
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical
 Attachments: YARN-3385.000.patch


 Race condition: KeeperException$NoNodeException will cause RM shutdown during 
 ZK node deletion(Op.delete).
 The race condition is similar to YARN-2721 and YARN-3023.
 Since the race condition exists for ZK node creation, it should also exist 
 for ZK node deletion.
 We see this issue with the following stack trace:
 {code}
 2015-03-17 19:18:58,958 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
 org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause:
 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
   at 
 

[jira] [Updated] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion.

2015-03-22 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3385:

Description: 
Race condition: KeeperException$NoNodeException will cause RM shutdown during 
ZK node deletion(Op.delete).
The race condition is similar to YARN-3023.
Since the race condition exists for ZK node creation, it should also exist for 
ZK node deletion.
We see this issue with the following stack trace:
{code}
2015-03-17 19:18:58,958 FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
STATE_STORE_OP_FAILED. Cause:
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:745)
2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
status 1
{code}

  was:
Race condition: KeeperException$NoNodeException will cause RM shutdown during 
ZK node deletion(Op.delete).
The race condition is similar to YARN-2721 and YARN-3023.
Since the race condition exists for ZK node creation, it should also exist for 
ZK node deletion.
We see this issue with the following stack trace:
{code}
2015-03-17 19:18:58,958 FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
STATE_STORE_OP_FAILED. Cause:
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:745)
2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
status 1
{code}


 Race condition: KeeperException$NoNodeException will cause RM shutdown during 
 ZK node deletion.
 ---

 Key: 

[jira] [Updated] (YARN-3384) test case failures in TestLogAggregationService

2015-03-22 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-3384:

Labels: test-fail  (was: )

 test case failures in TestLogAggregationService
 ---

 Key: YARN-3384
 URL: https://issues.apache.org/jira/browse/YARN-3384
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R
Priority: Minor
  Labels: test-fail
 Attachments: YARN-3384.20150321-1.patch


 The following test cases of TestLogAggregationService are failing:
 testMultipleAppsLogAggregation
 testLogAggregationServiceWithRetention
 testLogAggregationServiceWithInterval
 testLogAggregationServiceWithPatterns 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3386) Cgroups feature should work with default hierarchy settings of CentOS 7

2015-03-22 Thread Masatake Iwasaki (JIRA)
Masatake Iwasaki created YARN-3386:
--------------------------------------

 Summary: Cgroups feature should work with default hierarchy 
settings of CentOS 7
 Key: YARN-3386
 URL: https://issues.apache.org/jira/browse/YARN-3386
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Masatake Iwasaki
Assignee: Masatake Iwasaki


The path found by CgroupsLCEResourcesHandler#parseMtab contains a comma and 
results in failure of the container-executor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3386) Cgroups feature should work with default hierarchy settings of CentOS 7

2015-03-22 Thread Masatake Iwasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375382#comment-14375382
 ] 

Masatake Iwasaki commented on YARN-3386:


The list below shows the default cgroup mount settings in CentOS 7:
{noformat}
$ cat /proc/mounts | grep cgroup
tmpfs /sys/fs/cgroup tmpfs rw,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup 
rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup 
rw,nosuid,nodev,noexec,relatime,cpuacct,cpu 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,nosuid,nodev,noexec,relatime,net_cls 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/perf_event cgroup 
rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
{noformat}

{{CgroupsLCEResourcesHandler#parseMtab}} parses this and sets the value of 
{{controllerPath}} for cpu to {{/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn}}.

As a result, container-executor tries to write the pid to 
{{/sys/fs/cgroup/cpu}} (the part of the path before the comma) and fails:
{noformat}
2015-03-23 21:32:01,186 INFO 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 27
2015-03-23 21:32:01,186 INFO 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Stack trace: 
ExitCodeException exitCode=27:
2015-03-23 21:32:01,186 INFO 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:   at 
org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
2015-03-23 21:32:01,186 INFO 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:   at 
org.apache.hadoop.util.Shell.run(Shell.java:455)
2015-03-23 21:32:01,186 INFO 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:   at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
2015-03-23 21:32:01,186 INFO 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:   at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:293)
2015-03-23 21:32:01,186 INFO 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:   at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
2015-03-23 21:32:01,186 INFO 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:   at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
2015-03-23 21:32:01,186 INFO 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:   at 
java.util.concurrent.FutureTask.run(FutureTask.java:262)
2015-03-23 21:32:01,186 INFO 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
2015-03-23 21:32:01,186 INFO 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
2015-03-23 21:32:01,186 INFO 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:   at 
java.lang.Thread.run(Thread.java:744)
2015-03-23 21:32:01,186 INFO 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:
2015-03-23 21:32:01,187 INFO 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Shell output: main 
: command provided 1
2015-03-23 21:32:01,187 INFO 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : user is 
nobody
2015-03-23 21:32:01,187 INFO 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : requested 
yarn user is iwasakims
2015-03-23 21:32:01,187 INFO 
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Can't open file 
/sys/fs/cgroup/cpu as node manager - Is a directory
2015-03-23 21:32:01,187 WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
 Container exited with a non-zero exit code 27
{noformat}
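
As a minimal, self-contained illustration of the failure mode (my own demo, 
not Hadoop or container-executor code): the cpu controller's mount point 
itself contains a comma on CentOS 7, so any later step that treats a 
controller path list as comma-separated truncates it to a bare directory.
{code}
// Stand-alone demo of the truncation; the path value mirrors the
// controllerPath described above.
public class CommaPathDemo {
  public static void main(String[] args) {
    String controllerPath = "/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn";
    // container-executor receives cgroup paths in a comma-separated list,
    // so a naive split stops at "/sys/fs/cgroup/cpu" -- a directory, not a
    // tasks file, hence "Can't open file ... Is a directory" in the log.
    String truncated = controllerPath.split(",")[0];
    System.out.println(truncated);  // prints /sys/fs/cgroup/cpu
  }
}
{code}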


 Cgroups feature should work with default hierarchy settings of CentOS 7
 ---

 Key: YARN-3386
 URL: https://issues.apache.org/jira/browse/YARN-3386
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Masatake Iwasaki
Assignee: Masatake Iwasaki

 The path found by CgroupsLCEResourcesHandler#parseMtab contains a comma and 
 results in failure of the container-executor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)