[jira] [Commented] (YARN-2305) When a container is in reserved state then total cluster memory is displayed wrongly.

2015-06-16 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589294#comment-14589294
 ] 

Rohith commented on YARN-2305:
--

Updated the duplicated issue link.

 When a container is in reserved state then total cluster memory is displayed 
 wrongly.
 -

 Key: YARN-2305
 URL: https://issues.apache.org/jira/browse/YARN-2305
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.1
Reporter: J.Andreina
Assignee: Sunil G
 Attachments: Capture.jpg


 ENV Details:
 =  
  3 queues : a(50%), b(25%), c(25%) --- max utilization of all queues is set
 to 100%
  2-node cluster with a total memory of 16GB
 TestSteps:
 =
   Execute the following 3 jobs with different memory configurations for the
 map, reduce, and AM tasks:
   ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=a 
 -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 
 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=2048 
 /dir8 /preempt_85 (application_1405414066690_0023)
  ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=b 
 -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 
 -Dyarn.app.mapreduce.am.resource.mb=2048 -Dmapreduce.reduce.memory.mb=2048 
 /dir2 /preempt_86 (application_1405414066690_0025)
  
  ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=c 
 -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=1024 
 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=1024 
 /dir2 /preempt_62
 Issue
 =
   When 2GB of memory is in the reserved state, the total memory is shown as
 15GB and used as 15GB (while the actual total memory is 16GB)
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang

2015-06-16 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587519#comment-14587519
 ] 

Rohith commented on YARN-3809:
--

This is an interesting scenario, but I am not sure why the thread pool size is
set to 10 and is not configurable.
bq. the default RPC time out is 15 mins.. 
I see the RPC timeout is 1 minute; am I missing anything?
{code}
static final int DEFAULT_COMMAND_TIMEOUT = 60000;
...
  int expireIntvl = conf.getInt(NM_COMMAND_TIMEOUT, DEFAULT_COMMAND_TIMEOUT);
proxy =
(ContainerManagementProtocolPB) 
RPC.getProxy(ContainerManagementProtocolPB.class,
  clientVersion, addr, ugi, conf,
  NetUtils.getDefaultSocketFactory(conf), expireIntvl);
{code}
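For reference, the arithmetic behind that reading, as a tiny standalone sketch
(plain Java, illustrative only): the last argument to RPC.getProxy() above is
the RPC timeout in milliseconds.
{code}
public class RpcTimeoutCheck {
  public static void main(String[] args) {
    // Default command timeout from the snippet above, in milliseconds.
    final int DEFAULT_COMMAND_TIMEOUT = 60000;
    // 60000 ms / 1000 = 60 s, i.e. one minute rather than 15 minutes.
    System.out.println("RPC timeout = "
        + (DEFAULT_COMMAND_TIMEOUT / 1000) + " s");
  }
}
{code}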

 Failed to launch new attempts because ApplicationMasterLauncher's threads all 
 hang
 --

 Key: YARN-3809
 URL: https://issues.apache.org/jira/browse/YARN-3809
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Jun Gong
Assignee: Jun Gong

 ApplicationMasterLauncher creates a thread pool of size 10 to handle
 AMLauncherEventType events (LAUNCH and CLEANUP).
 In our cluster, there were many NMs with 10+ AMs running on them, and one NM
 shut down for some reason. After the RM marked the NM as LOST, it cleaned up
 the AMs running on it, so ApplicationMasterLauncher had to handle these 10+
 CLEANUP events. ApplicationMasterLauncher's thread pool filled up, and all its
 threads hung in containerMgrProxy.stopContainers(stopRequest) because the NM
 was down; the default RPC timeout is 15 minutes. That means that for 15
 minutes ApplicationMasterLauncher could not handle new events such as LAUNCH,
 so new attempts failed to launch because of the timeout.
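A minimal standalone sketch of the starvation described above (plain Java,
illustrative only; this is not the ApplicationMasterLauncher code):
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LauncherPoolStarvation {
  public static void main(String[] args) {
    // A fixed pool of 10 threads, as in ApplicationMasterLauncher.
    ExecutorService launcherPool = Executors.newFixedThreadPool(10);
    for (int i = 0; i < 10; i++) {        // ten CLEANUP events arrive
      launcherPool.submit(() -> {
        try {
          // Stands in for stopContainers() blocking on a dead NM.
          TimeUnit.MINUTES.sleep(15);
        } catch (InterruptedException ignored) {
        }
      });
    }
    // An eleventh event (e.g. LAUNCH) now waits for a free thread, so
    // the new attempt can time out before it is ever launched.
    launcherPool.submit(() -> System.out.println("AM launched"));
    launcherPool.shutdown();
  }
}
{code}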



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging

2015-06-15 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585664#comment-14585664
 ] 

Rohith commented on YARN-3789:
--

+1(non-binding)

 Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
 --

 Key: YARN-3789
 URL: https://issues.apache.org/jira/browse/YARN-3789
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor
 Attachments: 0001-YARN-3789.patch, 0002-YARN-3789.patch, 
 0003-YARN-3789.patch, 0004-YARN-3789.patch


 Duplicate logging from the ResourceManager
 during the AM limit check for each application
 {code}
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging

2015-06-15 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587412#comment-14587412
 ] 

Rohith commented on YARN-3789:
--

Looks good to me too.. 

 Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
 --

 Key: YARN-3789
 URL: https://issues.apache.org/jira/browse/YARN-3789
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor
 Attachments: 0001-YARN-3789.patch, 0002-YARN-3789.patch, 
 0003-YARN-3789.patch, 0004-YARN-3789.patch, 0005-YARN-3789.patch


 Duplicate logging from the ResourceManager
 during the AM limit check for each application
 {code}
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-1382) NodeListManager has a memory leak, unusableRMNodesConcurrentSet is never purged

2015-06-14 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith reassigned YARN-1382:


Assignee: Rohith

 NodeListManager has a memory leak, unusableRMNodesConcurrentSet is never 
 purged
 ---

 Key: YARN-1382
 URL: https://issues.apache.org/jira/browse/YARN-1382
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.2.0
Reporter: Alejandro Abdelnur
Assignee: Rohith

 If a node is in the unusable nodes set (unusableRMNodesConcurrentSet) and 
 never comes back, the node will be there forever.
 While the leak is not big, it gets aggravated if the NM addresses are 
 configured with ephemeral ports as when the nodes come back they come back as 
 new.
 Some related details in YARN-1343
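A minimal sketch of the leak (illustrative; the set type and the host:port
keys are assumptions, not the actual NodesListManager code):
{code}
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class UnusableNodesLeak {
  public static void main(String[] args) {
    Set<String> unusableNodes =
        Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
    // The node becomes UNUSABLE and is recorded under its host:port.
    unusableNodes.add("nm-host:45454");
    // The NM restarts on an ephemeral port and registers as a "new"
    // node, so this removal never matches the stale entry.
    unusableNodes.remove("nm-host:49152");
    System.out.println(unusableNodes);  // [nm-host:45454] stays forever
  }
}
{code}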



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler

2015-06-14 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585450#comment-14585450
 ] 

Rohith commented on YARN-3790:
--

Thanks @zhihai for your detailed explanation.. I got the problem :-)
Overall the patch looks good to me. I think we should change this JIRA's
component to scheduler, since the code change is in FairScheduler.

 TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in 
 trunk for FS scheduler
 

 Key: YARN-3790
 URL: https://issues.apache.org/jira/browse/YARN-3790
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Rohith
Assignee: zhihai xu
 Attachments: YARN-3790.000.patch


 Failure trace is as follows
 {noformat}
 Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec
 <<< FAILURE! - in
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
 testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
   Time elapsed: 6.502 sec  <<< FAILURE!
 java.lang.AssertionError: expected:<6144> but was:<8192>
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.failNotEquals(Assert.java:743)
   at org.junit.Assert.assertEquals(Assert.java:118)
   at org.junit.Assert.assertEquals(Assert.java:555)
   at org.junit.Assert.assertEquals(Assert.java:542)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-06-14 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585431#comment-14585431
 ] 

Rohith commented on YARN-3543:
--

Thanks [~xgong] for the review..
bq. Could we not directly change the ApplicationReport.newInstance() ? This
will break other applications, such as Tez.
IIUC, ApplicationReport#newInstance() is @Private annotated, so other clients
should not be able to use it. And in the earlier patch I had added a new
method which does not break compatibility, but [~vinodkv] suggested that I not
change this API in his review comment
[link|https://issues.apache.org/jira/browse/YARN-3543?focusedCommentId=14533819&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14533819]

 ApplicationReport should be able to tell whether the Application is AM 
 managed or not. 
 ---

 Key: YARN-3543
 URL: https://issues.apache.org/jira/browse/YARN-3543
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api
Affects Versions: 2.6.0
Reporter: Spandan Dutta
Assignee: Rohith
 Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 
 YARN-3543-AH.PNG, YARN-3543-RM.PNG


 Currently we can know whether an application submitted by the user is AM
 managed from the applicationSubmissionContext, but this can only be done at
 the time the user submits the job. We should have access to this info from
 the ApplicationReport as well, so that we can check whether an app is AM
 managed at any time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler

2015-06-10 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580255#comment-14580255
 ] 

Rohith commented on YARN-3790:
--

Thanks for looking into this issue.
bq. If UpdateThread call update after recoverContainersOnNode, the test will
succeed
In the test, I see the below code, which verifies that the containers are
recovered, right?
{code}
// Wait for RM to settle down on recovering containers;
waitForNumContainersToRecover(2, rm2, am1.getApplicationAttemptId());
Set<ContainerId> launchedContainers =
    ((RMNodeImpl) rm2.getRMContext().getRMNodes().get(nm1.getNodeId()))
        .getLaunchedContainers();
assertTrue(launchedContainers.contains(amContainer.getContainerId()));
assertTrue(launchedContainers.contains(runningContainer.getContainerId()));
{code}

Am I missing anything?
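If the failure really is an ordering race with the FairScheduler UpdateThread,
one way to make the metric assertion race-free would be to poll for it (a
sketch only; the queueMetrics handle and the target value are assumptions
taken from the failure trace, not the committed fix):
{code}
// Poll until the UpdateThread has folded the recovered containers into
// the metrics, instead of asserting immediately after recovery.
GenericTestUtils.waitFor(new Supplier<Boolean>() {
  @Override
  public Boolean get() {
    // 6144 is the expected allocated MB from the assertion failure.
    return queueMetrics.getAllocatedMB() == 6144;
  }
}, 100, 10000);
{code}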

 TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in 
 trunk for FS scheduler
 

 Key: YARN-3790
 URL: https://issues.apache.org/jira/browse/YARN-3790
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Rohith
Assignee: zhihai xu

 Failure trace is as follows
 {noformat}
 Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec
 <<< FAILURE! - in
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
 testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
   Time elapsed: 6.502 sec  <<< FAILURE!
 java.lang.AssertionError: expected:<6144> but was:<8192>
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.failNotEquals(Assert.java:743)
   at org.junit.Assert.assertEquals(Assert.java:118)
   at org.junit.Assert.assertEquals(Assert.java:555)
   at org.junit.Assert.assertEquals(Assert.java:542)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS scheduler

2015-06-10 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580228#comment-14580228
 ] 

Rohith commented on YARN-3790:
--

bq. I think this test fails intermittently.
Yes, it is failing intermittently. Maybe the issue summary can be updated.

 TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS 
 scheduler
 -

 Key: YARN-3790
 URL: https://issues.apache.org/jira/browse/YARN-3790
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Rohith
Assignee: zhihai xu

 Failure trace is as follows
 {noformat}
 Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec
 <<< FAILURE! - in
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
 testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
   Time elapsed: 6.502 sec  <<< FAILURE!
 java.lang.AssertionError: expected:<6144> but was:<8192>
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.failNotEquals(Assert.java:743)
   at org.junit.Assert.assertEquals(Assert.java:118)
   at org.junit.Assert.assertEquals(Assert.java:555)
   at org.junit.Assert.assertEquals(Assert.java:542)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler

2015-06-10 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3790:
-
Summary: TestWorkPreservingRMRestart#testSchedulerRecovery fails 
intermittently in trunk for FS scheduler  (was: 
TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS 
scheduler)

 TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in 
 trunk for FS scheduler
 

 Key: YARN-3790
 URL: https://issues.apache.org/jira/browse/YARN-3790
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Rohith
Assignee: zhihai xu

 Failure trace is as follows
 {noformat}
 Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec
 <<< FAILURE! - in
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
 testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
   Time elapsed: 6.502 sec  <<< FAILURE!
 java.lang.AssertionError: expected:<6144> but was:<8192>
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.failNotEquals(Assert.java:743)
   at org.junit.Assert.assertEquals(Assert.java:118)
   at org.junit.Assert.assertEquals(Assert.java:555)
   at org.junit.Assert.assertEquals(Assert.java:542)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3697) FairScheduler: ContinuousSchedulingThread can't be shutdown after stop sometimes.

2015-06-09 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578677#comment-14578677
 ] 

Rohith commented on YARN-3697:
--

Hi [~zxu], 
 Trying to understand the problem: does it occur when RM shutdown is called,
which tries to stop the FS service? Does it cause the RM to hang during
shutdown?

 FairScheduler: ContinuousSchedulingThread can't be shutdown after stop 
 sometimes. 
 --

 Key: YARN-3697
 URL: https://issues.apache.org/jira/browse/YARN-3697
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical
 Attachments: YARN-3697.000.patch


 FairScheduler: ContinuousSchedulingThread can't be shutdown after stop 
 sometimes. 
 The reason is that the InterruptedException is swallowed in
 continuousSchedulingAttempt:
 {code}
   try {
     if (node != null && Resources.fitsIn(minimumAllocation,
         node.getAvailableResource())) {
       attemptScheduling(node);
     }
   } catch (Throwable ex) {
     LOG.error("Error while attempting scheduling for node " + node +
         ": " + ex.toString(), ex);
   }
 {code}
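For contrast, a sketch of an interrupt-aware variant of the catch block above
(illustrative only, not the committed fix): handle the interrupt separately
and restore the thread's interrupt status so the scheduling loop can observe
it and exit.
{code}
try {
  if (node != null && Resources.fitsIn(minimumAllocation,
      node.getAvailableResource())) {
    attemptScheduling(node);
  }
} catch (Throwable ex) {
  if (ex instanceof InterruptedException
      || ex.getCause() instanceof InterruptedException) {
    // Restore the flag so ContinuousSchedulingThread.run() sees it.
    Thread.currentThread().interrupt();
  } else {
    LOG.error("Error while attempting scheduling for node " + node +
        ": " + ex.toString(), ex);
  }
}
{code}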
 I saw the following exception after stop:
 {code}
 2015-05-17 23:30:43,065 WARN  [FairSchedulerContinuousScheduling] 
 event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher 
 thread interrupted
 java.lang.InterruptedException
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
   at 
 java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
   at 
 java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:462)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:387)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:58)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.allocate(FSAppAttempt.java:357)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:516)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:649)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:803)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:334)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:173)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1082)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1014)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:285)
 2015-05-17 23:30:43,066 ERROR [FairSchedulerContinuousScheduling] 
 fair.FairScheduler (FairScheduler.java:continuousSchedulingAttempt(1017)) - 
 Error while attempting scheduling for node host: 127.0.0.2:2 #containers=1
 available=<memory:7168, vCores:7> used=<memory:1024, vCores:1>:
 org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
 java.lang.InterruptedException
 org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
 java.lang.InterruptedException
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:249)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467)
  

[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging

2015-06-09 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579198#comment-14579198
 ] 

Rohith commented on YARN-3789:
--

I think *Not activating the application* instead of *Not starting* would be
more meaningful.

 Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
 --

 Key: YARN-3789
 URL: https://issues.apache.org/jira/browse/YARN-3789
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor
 Attachments: 0001-YARN-3789.patch


 Duplicate logging from the ResourceManager
 during the AM limit check for each application
 {code}
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging

2015-06-09 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579184#comment-14579184
 ] 

Rohith commented on YARN-3789:
--

Thanks [~bibinchundatt] for reporting this and providing a patch.
Some comments:
# The log messages can be made clearer for log analysis (see the sketch after
this list). The messages could be like:
## Not starting the application <applicationId> as usedAMResource
<amIfStarted> exceeds AMResourceLimit <amLimit>
## Not starting the application <applicationId> for the user <user> as
usedUserAMResource <userAmIfStarted> exceeds userAMResourceLimit <userAMLimit>
# Can you update the issue summary and description to reflect the real
problem, i.e. the issue is a log message correction, not removal of duplicate
logging?
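A sketch of what the first message could look like in
LeafQueue#activateApplications() (illustrative; the variable names are
assumptions, not the actual patch):
{code}
// Include the application id and both resource values so each log
// line is self-describing.
if (!Resources.lessThanOrEqual(resourceCalculator, lastClusterResource,
    amIfStarted, amLimit)) {
  LOG.info("Not activating application " + applicationId
      + " as amIfStarted: " + amIfStarted
      + " exceeds amLimit: " + amLimit);
  continue;
}
{code}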

 Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
 --

 Key: YARN-3789
 URL: https://issues.apache.org/jira/browse/YARN-3789
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor
 Attachments: 0001-YARN-3789.patch


 Duplicate logging from the ResourceManager
 during the AM limit check for each application
 {code}
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3788) Application Master and Task Tracker timeouts are applied incorrectly

2015-06-09 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579188#comment-14579188
 ] 

Rohith commented on YARN-3788:
--

This is a MapReduce project issue/query; moving it to MR for further discussion.

 Application Master and Task Tracker timeouts are applied incorrectly
 

 Key: YARN-3788
 URL: https://issues.apache.org/jira/browse/YARN-3788
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Dmitry Sivachenko

 I am running a streaming job which requires a big (~50GB) data file to run 
 (file is attached via hadoop jar ... -file BigFile.dat).
 Most likely this command will fail as follows (note that the error message is
 rather meaningless):
 2015-05-27 15:55:00,754 WARN  [main] streaming.StreamJob 
 (StreamJob.java:parseArgv(291)) - -file option is deprecated, please use 
 generic option -files instead.
 packageJobJar: [/ssd/mt/lm/en_reorder.ylm, mapper.py, 
 /tmp/hadoop-mitya/hadoop-unjar3778165585140840383/] [] 
 /var/tmp/streamjob633547925483233845.jar tmpDir=null
 2015-05-27 19:46:22,942 INFO  [main] client.RMProxy 
 (RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager at 
 nezabudka1-00.yandex.ru/5.255.231.129:8032
 2015-05-27 19:46:23,733 INFO  [main] client.RMProxy 
 (RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager at 
 nezabudka1-00.yandex.ru/5.255.231.129:8032
 2015-05-27 20:13:37,231 INFO  [main] mapred.FileInputFormat 
 (FileInputFormat.java:listStatus(247)) - Total input paths to process : 1
 2015-05-27 20:13:38,110 INFO  [main] mapreduce.JobSubmitter 
 (JobSubmitter.java:submitJobInternal(396)) - number of splits:1
 2015-05-27 20:13:38,136 INFO  [main] Configuration.deprecation 
 (Configuration.java:warnOnceIfDeprecated(1009)) - mapred.reduce.tasks is 
 deprecated. Instead, use mapreduce.job.reduces
 2015-05-27 20:13:38,390 INFO  [main] mapreduce.JobSubmitter 
 (JobSubmitter.java:printTokens(479)) - Submitting tokens for job: 
 job_1431704916575_2531
 2015-05-27 20:13:38,689 INFO  [main] impl.YarnClientImpl 
 (YarnClientImpl.java:submitApplication(204)) - Submitted application 
 application_1431704916575_2531
 2015-05-27 20:13:38,743 INFO  [main] mapreduce.Job (Job.java:submit(1289)) - 
 The url to track the job: 
 http://nezabudka1-00.yandex.ru:8088/proxy/application_1431704916575_2531/
 2015-05-27 20:13:38,746 INFO  [main] mapreduce.Job 
 (Job.java:monitorAndPrintJob(1334)) - Running job: job_1431704916575_2531
 2015-05-27 21:04:12,353 INFO  [main] mapreduce.Job 
 (Job.java:monitorAndPrintJob(1355)) - Job job_1431704916575_2531 running in 
 uber mode : false
 2015-05-27 21:04:12,356 INFO  [main] mapreduce.Job 
 (Job.java:monitorAndPrintJob(1362)) - map 0% reduce 0%
 2015-05-27 21:04:12,374 INFO  [main] mapreduce.Job 
 (Job.java:monitorAndPrintJob(1375)) - Job job_1431704916575_2531 failed with 
 state FAILED due to: Application application_1431704916575_2531 failed 2 
 times due to ApplicationMaster for attempt 
 appattempt_1431704916575_2531_02 timed out. Failing the application.
 2015-05-27 21:04:12,473 INFO  [main] mapreduce.Job 
 (Job.java:monitorAndPrintJob(1380)) - Counters: 0
 2015-05-27 21:04:12,474 ERROR [main] streaming.StreamJob 
 (StreamJob.java:submitAndMonitorJob(1019)) - Job not Successful!
 Streaming Command Failed!
 This is because the yarn.am.liveness-monitor.expiry-interval-ms timeout
 (defaults to 600 sec) expires before the large data file is transferred.
 As the next step I increase yarn.am.liveness-monitor.expiry-interval-ms.
 After that the application is successfully initialized and tasks are spawned.
 But I encounter another error: the default 600-second mapreduce.task.timeout
 expires before the tasks are initialized, and the tasks fail.
 The error message "Task attempt_XXX failed to report status for 600 seconds"
 is also misleading: this timeout is supposed to kill non-responsive (stuck)
 tasks, but here it strikes because auxiliary data files are copying slowly.
 So I need to increase mapreduce.task.timeout too, and only after that is my
 job successful.
 At the very least, the error messages need to be tweaked to indicate that the
 Application (or Task) is failing because auxiliary files were not copied in
 that time, not just that a generic timeout expired.
 A better solution would be to not count the time spent distributing data
 files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS scheduler

2015-06-09 Thread Rohith (JIRA)
Rohith created YARN-3790:


 Summary: TestWorkPreservingRMRestart#testSchedulerRecovery fails 
in trunk for FS scheduler
 Key: YARN-3790
 URL: https://issues.apache.org/jira/browse/YARN-3790
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Rohith


Failure trace is as follows

{noformat}
Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec
<<< FAILURE! - in
org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
  Time elapsed: 6.502 sec  <<< FAILURE!
java.lang.AssertionError: expected:<6144> but was:<8192>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3017) ContainerID in ResourceManager Log Has Slightly Different Format From AppAttemptID

2015-06-08 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578296#comment-14578296
 ] 

Rohith commented on YARN-3017:
--

Thanks [~ozawa] for confirmation:-)

 ContainerID in ResourceManager Log Has Slightly Different Format From 
 AppAttemptID
 --

 Key: YARN-3017
 URL: https://issues.apache.org/jira/browse/YARN-3017
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.8.0
Reporter: MUFEED USMAN
Priority: Minor
  Labels: PatchAvailable
 Attachments: YARN-3017.patch, YARN-3017_1.patch, YARN-3017_2.patch


 Not sure if this should be filed as a bug or not.
 In the ResourceManager log in the events surrounding the creation of a new
 application attempt,
 ...
 ...
 2014-11-14 17:45:37,258 INFO
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching
 masterappattempt_1412150883650_0001_000002
 ...
 ...
 The application attempt has the ID format _1412150883650_0001_000002.
 Whereas the associated ContainerID goes by _1412150883650_0001_02_000001.
 ...
 ...
 2014-11-14 17:45:37,260 INFO
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting
 up
 container Container: [ContainerId: container_1412150883650_0001_02_000001,
 NodeId: n67:55933, NodeHttpAddress: n67:8042, Resource: <memory:2048,
 vCores:1, disks:0.0>, Priority: 0, Token: Token { kind: ContainerToken,
 service: 10.10.70.67:55933 }, ] for AM appattempt_1412150883650_0001_000002
 ...
 ...
 Curious to know if this is kept like that for a reason. If not, while using
 filtering tools to, say, grep events surrounding a specific attempt by the
 numeric ID part, information may slip out during troubleshooting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3017) ContainerID in ResourceManager Log Has Slightly Different Format From AppAttemptID

2015-06-08 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576652#comment-14576652
 ] 

Rohith commented on YARN-3017:
--

I see.. Thanks for the detailed explanation..

 ContainerID in ResourceManager Log Has Slightly Different Format From 
 AppAttemptID
 --

 Key: YARN-3017
 URL: https://issues.apache.org/jira/browse/YARN-3017
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.8.0
Reporter: MUFEED USMAN
Priority: Minor
  Labels: PatchAvailable
 Attachments: YARN-3017.patch, YARN-3017_1.patch, YARN-3017_2.patch


 Not sure if this should be filed as a bug or not.
 In the ResourceManager log in the events surrounding the creation of a new
 application attempt,
 ...
 ...
 2014-11-14 17:45:37,258 INFO
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching
 masterappattempt_1412150883650_0001_000002
 ...
 ...
 The application attempt has the ID format _1412150883650_0001_000002.
 Whereas the associated ContainerID goes by _1412150883650_0001_02_000001.
 ...
 ...
 2014-11-14 17:45:37,260 INFO
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting
 up
 container Container: [ContainerId: container_1412150883650_0001_02_000001,
 NodeId: n67:55933, NodeHttpAddress: n67:8042, Resource: <memory:2048,
 vCores:1, disks:0.0>, Priority: 0, Token: Token { kind: ContainerToken,
 service: 10.10.70.67:55933 }, ] for AM appattempt_1412150883650_0001_000002
 ...
 ...
 Curious to know if this is kept like that for a reason. If not, while using
 filtering tools to, say, grep events surrounding a specific attempt by the
 numeric ID part, information may slip out during troubleshooting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3017) ContainerID in ResourceManager Log Has Slightly Different Format From AppAttemptID

2015-06-08 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576671#comment-14576671
 ] 

Rohith commented on YARN-3017:
--

+1 lgtm (non-binding)

 ContainerID in ResourceManager Log Has Slightly Different Format From 
 AppAttemptID
 --

 Key: YARN-3017
 URL: https://issues.apache.org/jira/browse/YARN-3017
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.8.0
Reporter: MUFEED USMAN
Priority: Minor
  Labels: PatchAvailable
 Attachments: YARN-3017.patch, YARN-3017_1.patch, YARN-3017_2.patch


 Not sure if this should be filed as a bug or not.
 In the ResourceManager log in the events surrounding the creation of a new
 application attempt,
 ...
 ...
 2014-11-14 17:45:37,258 INFO
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching
 masterappattempt_1412150883650_0001_000002
 ...
 ...
 The application attempt has the ID format _1412150883650_0001_000002.
 Whereas the associated ContainerID goes by _1412150883650_0001_02_000001.
 ...
 ...
 2014-11-14 17:45:37,260 INFO
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting
 up
 container Container: [ContainerId: container_1412150883650_0001_02_000001,
 NodeId: n67:55933, NodeHttpAddress: n67:8042, Resource: <memory:2048,
 vCores:1, disks:0.0>, Priority: 0, Token: Token { kind: ContainerToken,
 service: 10.10.70.67:55933 }, ] for AM appattempt_1412150883650_0001_000002
 ...
 ...
 Curious to know if this is kept like that for a reason. If not, while using
 filtering tools to, say, grep events surrounding a specific attempt by the
 numeric ID part, information may slip out during troubleshooting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED

2015-06-08 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576774#comment-14576774
 ] 

Rohith commented on YARN-3535:
--

Recently we faced the same issue in a test. [~peng.zhang], would you mind
updating the patch?

  ResourceRequest should be restored back to scheduler when RMContainer is 
 killed at ALLOCATED
 -

 Key: YARN-3535
 URL: https://issues.apache.org/jira/browse/YARN-3535
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Assignee: Peng Zhang
Priority: Critical
  Labels: BB2015-05-TBR
 Attachments: YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, 
 yarn-app.log


 During a rolling update of the NM, the AM's start of a container on the NM
 failed, and then the job hung there.
 AM logs attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED

2015-06-08 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3535:
-
Priority: Critical  (was: Major)

  ResourceRequest should be restored back to scheduler when RMContainer is 
 killed at ALLOCATED
 -

 Key: YARN-3535
 URL: https://issues.apache.org/jira/browse/YARN-3535
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Assignee: Peng Zhang
Priority: Critical
  Labels: BB2015-05-TBR
 Attachments: YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, 
 yarn-app.log


 During a rolling update of the NM, the AM's start of a container on the NM
 failed, and then the job hung there.
 AM logs attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3508) Preemption processing occuring on the main RM dispatcher

2015-06-08 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577026#comment-14577026
 ] 

Rohith commented on YARN-3508:
--

The problem I see with clubbing these with scheduler events is that if there
are many scheduler events already in the event queue, the preemption events
are delayed. As [~varun_saxena] said, container preemption events should be
considered higher priority than scheduler events. Having a separate event
dispatcher for preemption events would allow them to compete for the lock at
an earlier stage rather than waiting for the scheduler event queue to drain. I
think the current patch approach makes sense to me, i.e. having an individual
dispatcher thread for preemption events.
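A minimal sketch of that wiring (the class names follow those mentioned in
this issue; the exact constructor arguments are an assumption):
{code}
// Register preemption events on their own AsyncDispatcher so they are
// not queued behind ordinary scheduler events on the main dispatcher.
AsyncDispatcher preemptionDispatcher = new AsyncDispatcher();
preemptionDispatcher.register(ContainerPreemptEventType.class,
    new RMContainerPreemptEventDispatcher(
        (PreemptableResourceScheduler) scheduler));
preemptionDispatcher.init(conf);
preemptionDispatcher.start();
{code}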

 Preemption processing occuring on the main RM dispatcher
 

 Key: YARN-3508
 URL: https://issues.apache.org/jira/browse/YARN-3508
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, scheduler
Affects Versions: 2.6.0
Reporter: Jason Lowe
Assignee: Varun Saxena
 Attachments: YARN-3508.002.patch, YARN-3508.01.patch


 We recently saw the RM for a large cluster lag far behind on the 
 AsyncDispacher event queue.  The AsyncDispatcher thread was consistently 
 blocked on the highly-contended CapacityScheduler lock trying to dispatch 
 preemption-related events for RMContainerPreemptEventDispatcher.  Preemption 
 processing should occur on the scheduler event dispatcher thread or a 
 separate thread to avoid delaying the processing of other events in the 
 primary dispatcher queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3775) Job does not exit after all node become unhealthy

2015-06-08 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith resolved YARN-3775.
--
Resolution: Not A Problem

Closing as Not A Problem. Please reopen if you disagree.

 Job does not exit after all node become unhealthy
 -

 Key: YARN-3775
 URL: https://issues.apache.org/jira/browse/YARN-3775
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.1
 Environment: Environment:
 Version : 2.7.0
 OS: RHEL7 
 NameNodes:  xiachsh11 xiachsh12 (HA enabled)
 DataNodes:  5 xiachsh13-17
 ResourceManage:  xiachsh11
 NodeManage: 5 xiachsh13-17 
 all nodes are openstack provisioned:  
 MEM: 1.5G 
 Disk: 16G 
Reporter: Chengshun Xia
 Attachments: logs.tar.gz


 Running Terasort with a data size of 10G, all the containers exited once the
 disk space threshold of 0.90 was reached; at this point the job does not exit
 and reports no error:
 15/06/05 13:13:28 INFO mapreduce.Job:  map 9% reduce 0%
 15/06/05 13:13:52 INFO mapreduce.Job:  map 10% reduce 0%
 15/06/05 13:14:30 INFO mapreduce.Job:  map 11% reduce 0%
 15/06/05 13:15:11 INFO mapreduce.Job:  map 12% reduce 0%
 15/06/05 13:15:43 INFO mapreduce.Job:  map 13% reduce 0%
 15/06/05 13:16:38 INFO mapreduce.Job:  map 14% reduce 0%
 15/06/05 13:16:41 INFO mapreduce.Job:  map 15% reduce 0%
 15/06/05 13:16:53 INFO mapreduce.Job:  map 16% reduce 0%
 15/06/05 13:17:24 INFO mapreduce.Job:  map 17% reduce 0%
 15/06/05 13:17:53 INFO mapreduce.Job:  map 18% reduce 0%
 15/06/05 13:18:36 INFO mapreduce.Job:  map 19% reduce 0%
 15/06/05 13:19:03 INFO mapreduce.Job:  map 20% reduce 0%
 15/06/05 13:19:09 INFO mapreduce.Job:  map 15% reduce 0%
 15/06/05 13:19:32 INFO mapreduce.Job:  map 16% reduce 0%
 15/06/05 13:20:00 INFO mapreduce.Job:  map 17% reduce 0%
 15/06/05 13:20:36 INFO mapreduce.Job:  map 18% reduce 0%
 15/06/05 13:20:57 INFO mapreduce.Job:  map 19% reduce 0%
 15/06/05 13:21:22 INFO mapreduce.Job:  map 18% reduce 0%
 15/06/05 13:21:24 INFO mapreduce.Job:  map 14% reduce 0%
 15/06/05 13:21:25 INFO mapreduce.Job:  map 9% reduce 0%
 15/06/05 13:21:28 INFO mapreduce.Job:  map 10% reduce 0%
 15/06/05 13:22:22 INFO mapreduce.Job:  map 11% reduce 0%
 15/06/05 13:23:06 INFO mapreduce.Job:  map 12% reduce 0%
 15/06/05 13:23:41 INFO mapreduce.Job:  map 9% reduce 0%
 15/06/05 13:23:42 INFO mapreduce.Job:  map 5% reduce 0%
 15/06/05 13:24:38 INFO mapreduce.Job:  map 6% reduce 0%
 15/06/05 13:25:16 INFO mapreduce.Job:  map 7% reduce 0%
 15/06/05 13:25:53 INFO mapreduce.Job:  map 8% reduce 0%
 15/06/05 13:26:35 INFO mapreduce.Job:  map 9% reduce 0%
 the last response time is  15/06/05 13:26:35
 and current time :
 [root@xiachsh11 logs]# date
 Fri Jun  5 19:19:59 EDT 2015
 [root@xiachsh11 logs]#
 [root@xiachsh11 logs]# yarn node -list
 15/06/05 19:20:18 INFO client.RMProxy: Connecting to ResourceManager at 
 xiachsh11.eng.platformlab.ibm.com/9.21.62.234:8032
 Total Nodes:0
  Node-Id Node-State Node-Http-Address   
 Number-of-Running-Containers
 [root@xiachsh11 logs]#



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3775) Job does not exit after all node become unhealthy

2015-06-08 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577065#comment-14577065
 ] 

Rohith commented on YARN-3775:
--

[~xiachengs...@yeah.net] Thanks for reporting the issue. IIUC, this is
expected behavior.
If an application attempt is killed for one of the following reasons, then
that attempt's failure is not counted towards the attempt-failure count (see
the sketch below):
# Preempted
# Aborted
# Disks failed (i.e. NM unhealthy)
# Killed by ResourceManager

In your case, the application's attempt got killed because of disk failure,
which the RM never considers an attempt failure. So the RM waits for the
application to launch and run on whatever NMs register with it later.
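Roughly the check the RM applies, paraphrased as a sketch (simplified, not the
exact RMAppAttemptImpl source):
{code}
boolean shouldCountTowardsMaxAttemptRetry(int exitStatus) {
  switch (exitStatus) {
    case ContainerExitStatus.PREEMPTED:
    case ContainerExitStatus.ABORTED:
    case ContainerExitStatus.DISKS_FAILED:
    case ContainerExitStatus.KILLED_BY_RESOURCEMANAGER:
      return false;  // not the app's fault; the failure is not counted
    default:
      return true;
  }
}
{code}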

 Job does not exit after all node become unhealthy
 -

 Key: YARN-3775
 URL: https://issues.apache.org/jira/browse/YARN-3775
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.1
 Environment: Environment:
 Version : 2.7.0
 OS: RHEL7 
 NameNodes:  xiachsh11 xiachsh12 (HA enabled)
 DataNodes:  5 xiachsh13-17
 ResourceManage:  xiachsh11
 NodeManage: 5 xiachsh13-17 
 all nodes are openstack provisioned:  
 MEM: 1.5G 
 Disk: 16G 
Reporter: Chengshun Xia
 Attachments: logs.tar.gz


 Running Terasort with a data size of 10G, all the containers exited once the
 disk space threshold of 0.90 was reached; at this point the job does not exit
 and reports no error:
 15/06/05 13:13:28 INFO mapreduce.Job:  map 9% reduce 0%
 15/06/05 13:13:52 INFO mapreduce.Job:  map 10% reduce 0%
 15/06/05 13:14:30 INFO mapreduce.Job:  map 11% reduce 0%
 15/06/05 13:15:11 INFO mapreduce.Job:  map 12% reduce 0%
 15/06/05 13:15:43 INFO mapreduce.Job:  map 13% reduce 0%
 15/06/05 13:16:38 INFO mapreduce.Job:  map 14% reduce 0%
 15/06/05 13:16:41 INFO mapreduce.Job:  map 15% reduce 0%
 15/06/05 13:16:53 INFO mapreduce.Job:  map 16% reduce 0%
 15/06/05 13:17:24 INFO mapreduce.Job:  map 17% reduce 0%
 15/06/05 13:17:53 INFO mapreduce.Job:  map 18% reduce 0%
 15/06/05 13:18:36 INFO mapreduce.Job:  map 19% reduce 0%
 15/06/05 13:19:03 INFO mapreduce.Job:  map 20% reduce 0%
 15/06/05 13:19:09 INFO mapreduce.Job:  map 15% reduce 0%
 15/06/05 13:19:32 INFO mapreduce.Job:  map 16% reduce 0%
 15/06/05 13:20:00 INFO mapreduce.Job:  map 17% reduce 0%
 15/06/05 13:20:36 INFO mapreduce.Job:  map 18% reduce 0%
 15/06/05 13:20:57 INFO mapreduce.Job:  map 19% reduce 0%
 15/06/05 13:21:22 INFO mapreduce.Job:  map 18% reduce 0%
 15/06/05 13:21:24 INFO mapreduce.Job:  map 14% reduce 0%
 15/06/05 13:21:25 INFO mapreduce.Job:  map 9% reduce 0%
 15/06/05 13:21:28 INFO mapreduce.Job:  map 10% reduce 0%
 15/06/05 13:22:22 INFO mapreduce.Job:  map 11% reduce 0%
 15/06/05 13:23:06 INFO mapreduce.Job:  map 12% reduce 0%
 15/06/05 13:23:41 INFO mapreduce.Job:  map 9% reduce 0%
 15/06/05 13:23:42 INFO mapreduce.Job:  map 5% reduce 0%
 15/06/05 13:24:38 INFO mapreduce.Job:  map 6% reduce 0%
 15/06/05 13:25:16 INFO mapreduce.Job:  map 7% reduce 0%
 15/06/05 13:25:53 INFO mapreduce.Job:  map 8% reduce 0%
 15/06/05 13:26:35 INFO mapreduce.Job:  map 9% reduce 0%
 the last response time is  15/06/05 13:26:35
 and current time :
 [root@xiachsh11 logs]# date
 Fri Jun  5 19:19:59 EDT 2015
 [root@xiachsh11 logs]#
 [root@xiachsh11 logs]# yarn node -list
 15/06/05 19:20:18 INFO client.RMProxy: Connecting to ResourceManager at 
 xiachsh11.eng.platformlab.ibm.com/9.21.62.234:8032
 Total Nodes:0
  Node-Id Node-State Node-Http-Address   
 Number-of-Running-Containers
 [root@xiachsh11 logs]#



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3780) Should use equals when compare Resource in RMNodeImpl#ReconnectNodeTransition

2015-06-07 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576550#comment-14576550
 ] 

Rohith commented on YARN-3780:
--

Makes sense.
+1 lgtm (non-binding)

 Should use equals when compare Resource in RMNodeImpl#ReconnectNodeTransition
 -

 Key: YARN-3780
 URL: https://issues.apache.org/jira/browse/YARN-3780
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor
 Attachments: YARN-3780.000.patch


 We should use equals when comparing Resources in
 RMNodeImpl#ReconnectNodeTransition to avoid an unnecessary
 NodeResourceUpdateSchedulerEvent.
 The current code uses {{!=}} to compare the Resource totalCapability, which
 compares references rather than the real values in the Resource. So we should
 use equals to compare Resources (see the illustration below).
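A two-line illustration of the difference (Resource.newInstance is the public
factory; the values are arbitrary):
{code}
Resource a = Resource.newInstance(4096, 4);
Resource b = Resource.newInstance(4096, 4);
System.out.println(a != b);       // true -- two distinct objects
System.out.println(a.equals(b));  // true -- but the same capability
{code}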



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3017) ContainerID in ResourceManager Log Has Slightly Different Format From AppAttemptID

2015-06-05 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574228#comment-14574228
 ] 

Rohith commented on YARN-3017:
--

bq. Could you give a little more detail about the possibility to break the 
rolling upgrade?
I was thinking about whether it causes any issue while parsing the containerId
after the upgrade. Say the current container-id format is
container_1430441527236_0001_01_000001 and the container is running on NM-1;
after the upgrade, the container-id format changes to
container_1430441527236_0001_000001_000001. But the NM reports its running
containers as container_1430441527236_0001_01_000001.

 ContainerID in ResourceManager Log Has Slightly Different Format From 
 AppAttemptID
 --

 Key: YARN-3017
 URL: https://issues.apache.org/jira/browse/YARN-3017
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.8.0
Reporter: MUFEED USMAN
Priority: Minor
  Labels: PatchAvailable
 Attachments: YARN-3017.patch, YARN-3017_1.patch, YARN-3017_2.patch


 Not sure if this should be filed as a bug or not.
 In the ResourceManager log in the events surrounding the creation of a new
 application attempt,
 ...
 ...
 2014-11-14 17:45:37,258 INFO
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching
 masterappattempt_1412150883650_0001_000002
 ...
 ...
 The application attempt has the ID format _1412150883650_0001_000002.
 Whereas the associated ContainerID goes by _1412150883650_0001_02_000001.
 ...
 ...
 2014-11-14 17:45:37,260 INFO
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting
 up
 container Container: [ContainerId: container_1412150883650_0001_02_000001,
 NodeId: n67:55933, NodeHttpAddress: n67:8042, Resource: <memory:2048,
 vCores:1, disks:0.0>, Priority: 0, Token: Token { kind: ContainerToken,
 service: 10.10.70.67:55933 }, ] for AM appattempt_1412150883650_0001_000002
 ...
 ...
 Curious to know if this is kept like that for a reason. If not, while using
 filtering tools to, say, grep events surrounding a specific attempt by the
 numeric ID part, information may slip out during troubleshooting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working as expected in FairScheduler

2015-06-05 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574393#comment-14574393
 ] 

Rohith commented on YARN-3758:
--

All this confusion should probably be resolved by YARN-2986. This issue can be
raised there to check whether they will handle it.

 The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not 
 working as expected in FairScheduler
 

 Key: YARN-3758
 URL: https://issues.apache.org/jira/browse/YARN-3758
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: skrho

 Hello there~~
 I have 2 clusters.
 The first cluster is 5 nodes, default 1 application queue, Capacity
 scheduler, 8G physical memory each node.
 The second cluster is 10 nodes, 2 application queues, fair-scheduler, 230G
 physical memory each node.
 Whenever a mapreduce job is running, I want the resourcemanager to set the
 minimum memory of 256m for containers.
 So I changed the configuration in yarn-site.xml & mapred-site.xml:
 yarn.scheduler.minimum-allocation-mb : 256
 mapreduce.map.java.opts : -Xms256m 
 mapreduce.reduce.java.opts : -Xms256m 
 mapreduce.map.memory.mb : 256 
 mapreduce.reduce.memory.mb : 256 
 In the first cluster, whenever a mapreduce job is running, I can see used
 memory of 256m in the web console ( http://installedIP:8088/cluster/nodes ).
 But in the second cluster, whenever a mapreduce job is running, I can see
 used memory of 1024m in the web console
 ( http://installedIP:8088/cluster/nodes ).
 I know the default memory value is 1024m, so if the memory setting is not
 changed, the default value is used.
 I have been testing for two weeks, but I don't know why the minimum memory
 setting is not working in the second cluster.
 Why does this difference happen?
 Is my configuration wrong, or is there a bug?
 Thank you for reading~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3017) ContainerID in ResourceManager Log Has Slightly Different Format From AppAttemptID

2015-06-04 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572289#comment-14572289
 ] 

Rohith commented on YARN-3017:
--

Apologies for coming very late to this issue. I am thinking that changing the
containerId format may break compatibility when a rolling upgrade is done with
RM HA + work-preserving recovery enabled. IIUC, using ZKRMStateStore, a
rolling upgrade can be done now.

 ContainerID in ResourceManager Log Has Slightly Different Format From 
 AppAttemptID
 --

 Key: YARN-3017
 URL: https://issues.apache.org/jira/browse/YARN-3017
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.8.0
Reporter: MUFEED USMAN
Priority: Minor
  Labels: PatchAvailable
 Attachments: YARN-3017.patch, YARN-3017_1.patch, YARN-3017_2.patch


 Not sure if this should be filed as a bug or not.
 In the ResourceManager log in the events surrounding the creation of a new
 application attempt,
 ...
 ...
 2014-11-14 17:45:37,258 INFO
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching
 masterappattempt_1412150883650_0001_02
 ...
 ...
 The application attempt has the ID format _1412150883650_0001_02.
 Whereas the associated ContainerID goes by _1412150883650_0001_02_.
 ...
 ...
 2014-11-14 17:45:37,260 INFO
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting 
 up
 container Container: [ContainerId: container_1412150883650_0001_02_01,
 NodeId: n67:55933, NodeHttpAddress: n67:8042, Resource: memory:2048, 
 vCores:1,
 disks:0.0, Priority: 0, Token: Token { kind: ContainerToken, service:
 10.10.70.67:55933 }, ] for AM appattempt_1412150883650_0001_02
 ...
 ...
 Curious to know if this is kept like that for a reason. If not, then when using
 filtering tools to, say, grep the events surrounding a specific attempt by the
 numeric part of its ID, information may slip through unnoticed during troubleshooting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-04 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572247#comment-14572247
 ] 

Rohith commented on YARN-3733:
--

+1 for handling virtual cores. This will be a good improvement for testing DominantRC functionality precisely.

 DominantRC#compare() does not work as expected if cluster resource is empty
 ---

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: 0001-YARN-3733.patch, 0002-YARN-3733.patch, 
 0002-YARN-3733.patch, YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other Yarn child is initiated, *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched

2015-06-04 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572244#comment-14572244
 ] 

Rohith commented on YARN-3754:
--

bq. When NM is shutting down, ContainerLaunch is also interrupted. During this 
interrupted exception handling, NM tries to update container diagnostics. But 
from main thread statestore is down ,hence caused the DB Close exception.
I think this issue occurred because the NM JVM did not exit on time, which allowed the state-store event to be processed. After YARN-3585, I think this should be OK. [~bibinchundatt] Can you please run a regression check for it?

 Race condition when the NodeManager is shutting down and container is launched
 --

 Key: YARN-3754
 URL: https://issues.apache.org/jira/browse/YARN-3754
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
 Environment: Suse 11 Sp3
Reporter: Bibin A Chundatt
Assignee: Sunil G
Priority: Critical
 Attachments: NM.log


 By the time the container is launched and returned to ContainerImpl, the NodeManager has closed the DB connection, resulting in {{org.iq80.leveldb.DBException: Closed}}.
 *Attaching the exception trace*
 {code}
 2015-05-30 02:11:49,122 WARN 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
  Unable to update state store diagnostics for 
 container_e310_1432817693365_3338_01_02
 java.io.IOException: org.iq80.leveldb.DBException: Closed
 at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
 at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: org.iq80.leveldb.DBException: Closed
 at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123)
 at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106)
 at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259)
 ... 15 more
 {code}
 We can add a check on whether the DB is closed while we move the container from the ACQUIRED state.
 As per the discussion in YARN-3585, the same has been added. A minimal sketch of such a guard is below.
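 This is a sketch only, under assumed names; it is not the actual NMLeveldbStateStoreService change:
 {code}
 package com.test.hadoop;

 import java.io.IOException;
 import org.apache.hadoop.yarn.api.records.ContainerId;

 // Sketch: "db" and "closed" are assumed stand-ins for the real
 // state-store handle and its shutdown flag.
 public abstract class GuardedStateStoreSketch {

   private Object db;        // stands in for the leveldb handle
   private boolean closed;   // set when the service closes the store

   public synchronized void storeContainerDiagnosticsIfOpen(
       ContainerId containerId, StringBuilder diagnostics) throws IOException {
     if (db == null || closed) {
       // NM is shutting down; skip the update rather than hit a closed DB.
       return;
     }
     storeContainerDiagnostics(containerId, diagnostics);
   }

   // The store call seen in the stack trace above.
   protected abstract void storeContainerDiagnostics(
       ContainerId containerId, StringBuilder diagnostics) throws IOException;
 }
 {code}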



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container

2015-06-04 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572628#comment-14572628
 ] 

Rohith commented on YARN-3758:
--

Had a look into the code for CS and FS. The understanding and behavior of minimum allocation differ across CS and FS.
# CS : It is straightforward: if any request asks for less than min-allocation-mb, CS normalizes the request to min-allocation-mb, and containers are allocated with minimum-allocation-mb.
# FS : If any request asks for less than min-allocation-mb, FS normalizes the request using the factor {{yarn.scheduler.increment-allocation-mb}}. In the example in the description, min-allocation-mb is 256mb, but increment-allocation-mb defaults to 1024mb, which always allocates 1024mb containers. {{yarn.scheduler.increment-allocation-mb}} has a huge effect: it changes the requested memory and assigns the newly calculated resource (see the sketch below).

The behavior is not consistent between CS and FS. I am not sure why an additional configuration was introduced in FS. Is it a bug?
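A minimal sketch of the two normalization rules as described above; this is an illustration under those assumptions, not the actual scheduler code:
{code}
public class NormalizationSketch {

  // CS: a request below minimum-allocation-mb is raised to that minimum.
  static int normalizeCapacityScheduler(int requestMb, int minMb) {
    return Math.max(requestMb, minMb);
  }

  // FS: the request is raised to the minimum and then rounded up to a
  // multiple of increment-allocation-mb (default 1024), which is why a
  // 256mb ask comes back as a 1024mb container.
  static int normalizeFairScheduler(int requestMb, int minMb, int incrementMb) {
    int raised = Math.max(requestMb, minMb);
    return ((raised + incrementMb - 1) / incrementMb) * incrementMb;
  }

  public static void main(String[] args) {
    System.out.println(normalizeCapacityScheduler(256, 256));   // 256
    System.out.println(normalizeFairScheduler(256, 256, 1024)); // 1024
  }
}
{code}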

 The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not 
 working in container
 

 Key: YARN-3758
 URL: https://issues.apache.org/jira/browse/YARN-3758
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: skrho

 Hello there~~
 I have 2 clusters.
 The first cluster is 5 nodes, with 1 default application queue, the Capacity scheduler, and 8G physical memory per node.
 The second cluster is 10 nodes, with 2 application queues, the fair-scheduler, and 230G physical memory per node.
 Whenever a mapreduce job is running, I want the resourcemanager to set the minimum container memory to 256m.
 So I changed the configuration in yarn-site.xml and mapred-site.xml:
 yarn.scheduler.minimum-allocation-mb : 256
 mapreduce.map.java.opts : -Xms256m
 mapreduce.reduce.java.opts : -Xms256m
 mapreduce.map.memory.mb : 256
 mapreduce.reduce.memory.mb : 256
 In the first cluster, whenever a mapreduce job is running, I can see 256m of used memory in the web console ( http://installedIP:8088/cluster/nodes ).
 But in the second cluster, whenever a mapreduce job is running, I can see 1024m of used memory in the web console ( http://installedIP:8088/cluster/nodes ).
 I know the default memory value is 1024m, so if the memory setting is not changed, the default value is used.
 I have been testing for two weeks, but I don't know why the minimum memory setting is not working in the second cluster.
 Why does this difference happen?
 Have I configured something wrong, or is there a bug?
 Thank you for reading~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container

2015-06-04 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572630#comment-14572630
 ] 

Rohith commented on YARN-3758:
--

bq. Is it a bug?
To be clear: is the inconsistent behavior a bug, or was it implemented intentionally for FS?

 The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not 
 working in container
 

 Key: YARN-3758
 URL: https://issues.apache.org/jira/browse/YARN-3758
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: skrho

 Hello there~~
 I have 2 clusters.
 The first cluster is 5 nodes, with 1 default application queue, the Capacity scheduler, and 8G physical memory per node.
 The second cluster is 10 nodes, with 2 application queues, the fair-scheduler, and 230G physical memory per node.
 Whenever a mapreduce job is running, I want the resourcemanager to set the minimum container memory to 256m.
 So I changed the configuration in yarn-site.xml and mapred-site.xml:
 yarn.scheduler.minimum-allocation-mb : 256
 mapreduce.map.java.opts : -Xms256m
 mapreduce.reduce.java.opts : -Xms256m
 mapreduce.map.memory.mb : 256
 mapreduce.reduce.memory.mb : 256
 In the first cluster, whenever a mapreduce job is running, I can see 256m of used memory in the web console ( http://installedIP:8088/cluster/nodes ).
 But in the second cluster, whenever a mapreduce job is running, I can see 1024m of used memory in the web console ( http://installedIP:8088/cluster/nodes ).
 I know the default memory value is 1024m, so if the memory setting is not changed, the default value is used.
 I have been testing for two weeks, but I don't know why the minimum memory setting is not working in the second cluster.
 Why does this difference happen?
 Have I configured something wrong, or is there a bug?
 Thank you for reading~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working as expected in FairScheduler

2015-06-04 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3758:
-
Summary: The mininum memory setting(yarn.scheduler.minimum-allocation-mb) 
is not working as expected in FairScheduler  (was: The mininum memory 
setting(yarn.scheduler.minimum-allocation-mb) is not working in container)

 The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not 
 working as expected in FairScheduler
 

 Key: YARN-3758
 URL: https://issues.apache.org/jira/browse/YARN-3758
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: skrho

 Hello there~~
 I have 2 clusters.
 The first cluster is 5 nodes, with 1 default application queue, the Capacity scheduler, and 8G physical memory per node.
 The second cluster is 10 nodes, with 2 application queues, the fair-scheduler, and 230G physical memory per node.
 Whenever a mapreduce job is running, I want the resourcemanager to set the minimum container memory to 256m.
 So I changed the configuration in yarn-site.xml and mapred-site.xml:
 yarn.scheduler.minimum-allocation-mb : 256
 mapreduce.map.java.opts : -Xms256m
 mapreduce.reduce.java.opts : -Xms256m
 mapreduce.map.memory.mb : 256
 mapreduce.reduce.memory.mb : 256
 In the first cluster, whenever a mapreduce job is running, I can see 256m of used memory in the web console ( http://installedIP:8088/cluster/nodes ).
 But in the second cluster, whenever a mapreduce job is running, I can see 1024m of used memory in the web console ( http://installedIP:8088/cluster/nodes ).
 I know the default memory value is 1024m, so if the memory setting is not changed, the default value is used.
 I have been testing for two weeks, but I don't know why the minimum memory setting is not working in the second cluster.
 Why does this difference happen?
 Have I configured something wrong, or is there a bug?
 Thank you for reading~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-03 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3733:
-
Attachment: 0002-YARN-3733.patch

Thanks [~sunilg] and [~leftnoteasy] for sharing your thoughts.

I modified the logic a bit, along with the order of the if checks, so that it handles all the possible input combinations in the table below. The problem was with the 5th and 7th inputs: for the 5th combination the validation returned 1 where zero was expected, i.e. the flow never reached the 2nd check, since the 1st step is an OR over memory vs cpu.
||Sl.no||cr||lhs||rhs||Output||
|1|0,0| 1,1 | 1,1 | 0 |
|2|0,0| 1,1 | 0,0 | 1 |
|3|0,0| 0,0 | 1,1 | -1 |
|4|0,0| 0,1 | 1,0 |  0 |
|5|0,0| 1,0 | 0,1 |  0 |
|6|0,0| 1,1 | 1,0 | 1  |
|7|0,0| 1,0 | 1,1 | -1  |

The updated patch has the following changes:
# Changed the logic for comparing lhs and rhs resources when clusterResource is empty, as suggested.
# Added a test for AMLimit usage.
# Added a test for all of the above input combinations.

A sketch of the resulting comparison rule follows. Kindly review the patch.
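For illustration only, a simplified sketch that reproduces the Output column of the table above when clusterResource is (0,0): compare the larger component first and the smaller component as a tie-breaker. This is a reading of the table, not the patch itself:
{code}
public class EmptyClusterCompareSketch {

  // Sign of the return value matches the table's Output column.
  static int compare(int lMem, int lCpu, int rMem, int rCpu) {
    int cmp = Integer.compare(Math.max(lMem, lCpu), Math.max(rMem, rCpu));
    if (cmp == 0) {
      // Dominant components tie; fall back to the smaller component.
      cmp = Integer.compare(Math.min(lMem, lCpu), Math.min(rMem, rCpu));
    }
    return cmp;
  }

  public static void main(String[] args) {
    System.out.println(compare(1, 0, 0, 1)); // row 5: 0
    System.out.println(compare(1, 1, 1, 0)); // row 6: 1
    System.out.println(compare(1, 0, 1, 1)); // row 7: -1
  }
}
{code}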

 DominantRC#compare() does not work as expected if cluster resource is empty
 ---

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: 0001-YARN-3733.patch, 0002-YARN-3733.patch, 
 YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other Yarn child is initiated, *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-03 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572085#comment-14572085
 ] 

Rohith commented on YARN-3733:
--

bq. only memory or vcores are more in TestCapacityScheduler.
All the input combinations are verified in TestResourceCalculator. In TestCapacityScheduler, app submission happens only with memory via {{MockRM.submitApp}}, so the default vcore minimum allocation of 1 is taken. So just changing the memory to {{amResourceLimit.getMemory() + 2}} should be enough.

bq. TestCapacityScheduler#verifyAMLimitForLeafQueue, while submitting second 
app, you could change the app name to app-2.
Agree.

I will upload a patch soon

 DominantRC#compare() does not work as expected if cluster resource is empty
 ---

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: 0001-YARN-3733.patch, 0002-YARN-3733.patch, 
 YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other Yarn child is initiated, *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-03 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3733:
-
Attachment: 0002-YARN-3733.patch

Updated the patch fixing the test-side comments. Kindly review the patch.

 DominantRC#compare() does not work as expected if cluster resource is empty
 ---

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: 0001-YARN-3733.patch, 0002-YARN-3733.patch, 
 0002-YARN-3733.patch, YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other Yarn child is initiated, *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-02 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568682#comment-14568682
 ] 

Rohith commented on YARN-3733:
--

Updated the summary to match the defect.

 DominantRC#compare() does not work as expected if cluster resource is empty
 ---

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other Yarn child is initiated, *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-02 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3733:
-
Attachment: 0001-YARN-3733.patch

The updated patch fixes the 2nd and 3rd scenarios from the above table (the scenarios of this issue) and refactors the test code.

For an overall solution that also covers input combinations like the 4th and 5th from the above table, we need to explore more on how to define the fraction and how to decide which resource is dominant. Any suggestions on this?



 DominantRC#compare() does not work as expected if cluster resource is empty
 ---

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: 0001-YARN-3733.patch, YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other Yarn child is initiated, *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-02 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3733:
-
Summary: DominantRC#compare() does not work as expected if cluster resource 
is empty  (was:  On RM restart AM getting more than maximum possible memory 
when many  tasks in queue)

 DominantRC#compare() does not work as expected if cluster resource is empty
 ---

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other Yarn child is initiated, *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566993#comment-14566993
 ] 

Rohith commented on YARN-3585:
--

This is a race condition between the NodeManager shutting down and a container being launched. By the time the container is launched and returned to ContainerImpl, the NodeManager has closed the DB connection, resulting in {{org.iq80.leveldb.DBException: Closed}}.

 NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Assignee: Rohith
Priority: Critical
 Attachments: YARN-3585.patch


 With NM recovery enabled, after decommission the nodemanager log shows it has stopped, but the process cannot exit.
 Non-daemon threads:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567189#comment-14567189
 ] 

Rohith commented on YARN-3733:
--

bq. Verify infinity by calling isInfinite(float v). Quoting from jdk7 
Since the infinity is derived from both lhs and rhs, the infinities cannot be differentiated for clusterResource=0,0, lhs=1,1, and rhs=2,2. {{getResourceAsValue()}} returns infinity for both l and r, so the two cannot be compared.
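A tiny demo of that collapse, using plain float division as a stand-in for the dominant-share math:
{code}
public class InfinityCompareDemo {
  public static void main(String[] args) {
    float cluster = 0f;      // empty cluster resource
    float l = 1f / cluster;  // Infinity for lhs
    float r = 2f / cluster;  // Infinity for rhs as well
    System.out.println(Float.isInfinite(l) && Float.isInfinite(r)); // true
    System.out.println(Float.compare(l, r)); // 0: both collapse to Infinity
  }
}
{code}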

  On RM restart AM getting more than maximum possible memory when many  tasks 
 in queue
 -

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other Yarn child is initiated, *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567196#comment-14567196
 ] 

Rohith commented on YARN-3585:
--

Yes, we can raise a different Jira. [~bibinchundatt] Can you raise one, so we can validate the issue there?

 NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Assignee: Rohith
Priority: Critical
 Attachments: YARN-3585.patch


 With NM recovery enabled, after decommission the nodemanager log shows it has stopped, but the process cannot exit.
 Non-daemon threads:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567186#comment-14567186
 ] 

Rohith commented on YARN-3733:
--

bq. 2. The newly added code is duplicated in two places, can you eliminate the 
duplicate code?
The second validation is not required in case of NaN; I will remove it in the next patch.

  On RM restart AM getting more than maximum possible memory when many  tasks 
 in queue
 -

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other Yarn child is initiated, *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567184#comment-14567184
 ] 

Rohith commented on YARN-3733:
--

Thanks [~devaraj.k] and [~sunilg] for the review.

bq. Can we check for lhs/rhs emptiness and compare these before ending up with 
infinite values? 
If we checked for emptiness, it would affect specific input values like clusterResource=0,0, lhs=1,1, and rhs=2,2: which one is then considered dominant? The dominant component cannot be retrieved directly from memory or cpu.

I have listed the possible input combinations that can occur in YARN below, with a small NaN-vs-Infinity demo after the table:
||Sl.no||clusterResource||lhs||rhs||Remark||
|1|0,0|0,0|0,0|Valid input; handled|
|2|0,0|positive integer,positive integer|0,0|NaN vs Infinity: the patch handles this scenario|
|3|0,0|0,0|positive integer,positive integer|NaN vs Infinity: the patch handles this scenario|
|4|0,0|positive integer,positive integer|positive integer,positive integer|Infinity vs Infinity: can this type occur in YARN?|
|5|0,0|positive integer,0|0,positive integer|Is this a valid input? Can this type occur in YARN?|
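A small demo of the NaN-vs-Infinity distinction in rows 2 and 3, again with plain float division standing in for the dominant-share math:
{code}
public class NaNVersusInfinityDemo {
  public static void main(String[] args) {
    float cluster = 0f;                // empty cluster resource
    System.out.println(1f / cluster);  // Infinity: non-zero value over zero
    System.out.println(0f / cluster);  // NaN: zero over zero
    // Float.compare treats NaN as greater than POSITIVE_INFINITY, which is
    // part of why the unpatched compare() misbehaves on these inputs.
    System.out.println(Float.compare(Float.NaN, Float.POSITIVE_INFINITY)); // 1
  }
}
{code}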


  On RM restart AM getting more than maximum possible memory when many  tasks 
 in queue
 -

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other Yarn child is initiated, *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567201#comment-14567201
 ] 

Rohith commented on YARN-3585:
--

The findbugs -1 does not show any error report, so I am not sure why the -1 was given.
The test failure is unrelated to this patch.

[~jlowe] Kindly review the patch. 

 NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Assignee: Rohith
Priority: Critical
 Attachments: YARN-3585.patch


 With NM recovery enabled, after decommission the nodemanager log shows it has stopped, but the process cannot exit.
 Non-daemon threads:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568462#comment-14568462
 ] 

Rohith commented on YARN-3733:
--

This fix needs to go into 2.7.1. Updated the target version to 2.7.1.

  On RM restart AM getting more than maximum possible memory when many  tasks 
 in queue
 -

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other Yarn child is initiated, *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568539#comment-14568539
 ] 

Rohith commented on YARN-3585:
--

Thanks [~jlowe] for the review.

bq. if we should flip the logic to not exit but then have NodeManager.main 
override that. This probably precludes the need to update existing tests.
Makes sense to me. Changed the logic to call JVM exit only when NodeManager is instantiated from the main function; a sketch of the approach is below.

bq. We should be using ExitUtil instead of System.exit directly.
Done.

bq. Nit: setexitOnShutdownEvent s/b setExitOnShutdownEvent
This method is not necessary now, since the patch assumes true when it is called only from the main function. I have removed it.

Kindly review the updated patch.
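A minimal sketch of the flipped logic, under assumed names; the real patch wires this through the NodeManager service lifecycle:
{code}
import org.apache.hadoop.util.ExitUtil;

// Sketch only: class and field names are assumed, not the committed patch.
public class NodeManagerExitSketch {

  // Default is NOT to exit, so tests that construct the NM directly
  // (without going through main) are unaffected.
  private boolean exitOnShutdownEvent = false;

  void handleShutdownEvent() {
    stop(); // stop services and close the recovery store
    if (exitOnShutdownEvent) {
      // ExitUtil rather than System.exit; terminating the JVM also takes
      // the lingering leveldb JNI thread down with it.
      ExitUtil.terminate(0);
    }
  }

  void stop() { /* service shutdown elided */ }

  public static void main(String[] args) {
    NodeManagerExitSketch nm = new NodeManagerExitSketch();
    nm.exitOnShutdownEvent = true; // only the real daemon opts in
    // ... init and start services here ...
  }
}
{code}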

 NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Assignee: Rohith
Priority: Critical
 Attachments: 0001-YARN-3585.patch, YARN-3585.patch


 With NM recovery enabled, after decommission the nodemanager log shows it has stopped, but the process cannot exit.
 Non-daemon threads:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-06-01 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3585:
-
Attachment: 0001-YARN-3585.patch

 NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Assignee: Rohith
Priority: Critical
 Attachments: 0001-YARN-3585.patch, YARN-3585.patch


 With NM recovery enabled, after decommission the nodemanager log shows it has stopped, but the process cannot exit.
 Non-daemon threads:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-06-01 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3733:
-
Target Version/s: 2.7.1

  On RM restart AM getting more than maximum possible memory when many  tasks 
 in queue
 -

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other Yarn child is initiated, *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-05-28 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3733:
-
Attachment: YARN-3733.patch

Attached the patch fixing the issue. Kindly review the patch.

  On RM restart AM getting more than maximum possible memory when many  tasks 
 in queue
 -

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other Yarn child is initiated, *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-05-28 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3585:
-
Attachment: YARN-3585.patch

 NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Assignee: Rohith
Priority: Critical
 Attachments: YARN-3585.patch


 With NM recovery enabled, after decommission the nodemanager log shows it has stopped, but the process cannot exit.
 Non-daemon threads:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-05-28 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3733:
-
Attachment: YARN-3733.patch

  On RM restart AM getting more than maximum possible memory when many  tasks 
 in queue
 -

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other Yarn child is initiated, *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-05-28 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3733:
-
Attachment: (was: YARN-3733.patch)

  On RM restart AM getting more than maximum possible memory when many  tasks 
 in queue
 -

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Blocker
 Attachments: YARN-3733.patch


 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other Yarn child is initiated, *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-05-28 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith reassigned YARN-3733:


Assignee: Rohith

  On RM restart AM getting more than maximum possible memory when many  tasks 
 in queue
 -

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Critical

 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other Yarn child is initiated, *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-05-28 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14562622#comment-14562622
 ] 

Rohith commented on YARN-3733:
--

Verified the RM logs from [~bibinchundatt] offline. The sequence of events that occurred:
# 30 applications are submitted to RM1 concurrently: *pendingApplications=18 and activeApplications=12*. The active applications move to the RUNNING state.
# RM1 switches to standby and RM2 transitions to the Active state, so the currently active RM is RM2.
# The previously submitted 30 applications start recovering. As part of the recovery process, all 30 applications are submitted to the scheduler and all of them become active, i.e. *activeApplications=30 and pendingApplications=0*, which is not expected to happen.
# The NMs register with the RM and the running AMs register with the RM.
# Since all 30 applications are activated, the scheduler tries to launch the ApplicationMasters of all of them and occupies the full cluster capacity.

Basically, the AM limit check in LeafQueue#activateApplications is not working as expected for {{DominantResourceCalculator}}. To confirm this, I wrote the simple program below for both the Default and Dominant resource calculators, using the memory values from the DEBUG log that follows. The output of the program:
For DefaultResourceCalculator the result is false, which limits applications from being activated when the AM resource limit is exceeded.
For DominantResourceCalculator the result is true, which allows all applications to be activated even when the AM resource limit is exceeded.
{noformat}
2015-05-28 14:00:52,704 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
application AMResource memory:4096, vCores:1 maxAMResourcePerQueuePercent 0.5 
amLimit memory:0, vCores:0 lastClusterResource memory:0, vCores:0 
amIfStarted memory:4096, vCores:1
{noformat}

{code}
package com.test.hadoop;

import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator;
import org.apache.hadoop.yarn.util.resource.DominantResourceCalculator;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

public class TestResourceCalculator {

  public static void main(String[] args) {
    ResourceCalculator defaultResourceCalculator =
        new DefaultResourceCalculator();
    ResourceCalculator dominantResourceCalculator =
        new DominantResourceCalculator();

    // Same values as in the DEBUG log above: empty cluster resource,
    // a 4096 MB AM about to start, and a zero AM limit.
    Resource lastClusterResource = Resource.newInstance(0, 0);
    Resource amIfStarted = Resource.newInstance(4096, 1);
    Resource amLimit = Resource.newInstance(0, 0);

    // Expected false; actual is false, so activation is correctly limited.
    System.out.println("DefaultResourceCalculator : "
        + Resources.lessThanOrEqual(defaultResourceCalculator,
            lastClusterResource, amIfStarted, amLimit));

    // Expected false; actual is true for DominantResourceCalculator,
    // so every application is activated despite the AM limit.
    System.out.println("DominantResourceCalculator : "
        + Resources.lessThanOrEqual(dominantResourceCalculator,
            lastClusterResource, amIfStarted, amLimit));
  }
}
{code}

  On RM restart AM getting more than maximum possible memory when many  tasks 
 in queue
 -

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Critical

 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks 
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other YARN child is initiated, *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since the max AM share is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-05-28 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14562638#comment-14562638
 ] 

Rohith commented on YARN-3733:
--

Steps to reproduce the scenario quickly. Assume max-am-resource-limit is 
configured to 0.5 and the cluster capacity is 10GB once an NM has registered, 
so the max AM resource limit is 5GB.
# Start the RM configured with DominantResourceCalculator (don't start any NM 
in the cluster).
# Submit 10 applications of 1GB each; all 10 applications get activated.
# Start the NM. The RM launches the AMs of all 10 applications, the cluster 
becomes full, and it hangs forever.
When no NM is registered, submitted applications should not be activated, i.e. 
they should not participate in scheduling.
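
A minimal sketch of the guard this implies (hypothetical, not the committed 
fix; {{lastClusterResource}} is assumed to be LeafQueue's cached cluster 
resource):
{code}
// Hypothetical sketch, not the actual patch: skip activation while no NM has
// registered, because the AM resource limit cannot be computed meaningfully
// against a zero cluster resource.
private synchronized void activateApplications() {
  if (lastClusterResource.equals(Resources.none())) {
    return; // no NodeManager registered yet, so do not activate anything
  }
  // ... existing loop that activates pending applications subject to amLimit ...
}
{code}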

  On RM restart AM getting more than maximum possible memory when many  tasks 
 in queue
 -

 Key: YARN-3733
 URL: https://issues.apache.org/jira/browse/YARN-3733
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3 , 2 NM , 2 RM
 one NM - 3 GB 6 v core
Reporter: Bibin A Chundatt
Assignee: Rohith
Priority: Critical

 Steps to reproduce
 =
 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
 size to 512 MB
 3. Configure capacity scheduler and AM limit to .5 
 (DominantResourceCalculator is configured)
 4. Submit 30 concurrent tasks 
 5. Switch RM
 Actual
 =
 For 12 jobs the AM gets allocated and all 12 start running
 No other YARN child is initiated, *all 12 jobs stay in RUNNING state forever*
 Expected
 ===
 Only 6 should be running at a time since the max AM share is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-05-28 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith reassigned YARN-3585:


Assignee: Rohith

 NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Assignee: Rohith
Priority: Critical

 With NM recovery enabled, after decommission, the nodemanager log shows it 
 stopped but the process cannot end. 
 non daemon thread:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-05-28 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14563180#comment-14563180
 ] 

Rohith commented on YARN-3585:
--

Another observation: after I enabled debug logs for the NodeManager, the 
occurrence of this issue became relatively low. I think a timing issue around 
the db close is causing the problem in LevelDB. The issue does not appear on 
all nodes every time, but in a cluster at least one node goes for a toss.

I too think it could be a LevelDB issue, and we should report it to the 
LevelDB project.

As for {{adding System.exit}} to the NodeManager graceful shutdown, it could 
mask many issues. If this is acceptable, I will upload a patch.

 NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Priority: Critical

 With NM recovery enabled, after decommission, the nodemanager log shows it 
 stopped but the process cannot end. 
 non daemon thread:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-05-28 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14563124#comment-14563124
 ] 

Rohith commented on YARN-3585:
--

Tested with a patch that logs before and after db.close(), and found that the 
db is indeed closed. No exception was thrown while closing it.

 NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Priority: Critical

 With NM recovery enabled, after decommission, the nodemanager log shows it 
 stopped but the process cannot end. 
 non daemon thread:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED

2015-05-27 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560774#comment-14560774
 ] 

Rohith commented on YARN-3535:
--

Thanks [~peng.zhang] for working on this issue.
Some comments:
# I think the method {{recoverResourceRequestForContainer}} should be 
synchronized (see the sketch after the test comments below); any thoughts?
# Why do we need the {{RMContextImpl.java}} changes? I think we can avoid 
them; they do not seem necessary.

Tests : 
# Any specific reason for changing {{TestAMRestart.java}}?
# IIUC, this issue can occur in every scheduler whenever the AM-RM heartbeat 
interval is shorter than the NM-RM heartbeat interval. So can we include a 
functional test case applicable to both CS and FS? Maybe you can add a test in 
the class extending {{ParameterizedSchedulerTestBase}}, i.e. 
TestAbstractYarnScheduler.
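
For comment #1, a hedged sketch of the shape I have in mind (assuming the 
method lives in {{AbstractYarnScheduler}}; the helper names in the body are 
illustrative, not necessarily the real ones):
{code}
// Sketch only: hold the scheduler lock while restoring the killed container's
// ResourceRequests, so that recovery cannot interleave with a concurrent
// allocation for the same application.
protected synchronized void recoverResourceRequestForContainer(
    RMContainer rmContainer) {
  List<ResourceRequest> requests = rmContainer.getResourceRequests();
  SchedulerApplicationAttempt attempt =
      getCurrentAttemptForContainer(rmContainer.getContainerId());
  if (attempt != null && requests != null) {
    // put the requests back into the attempt's pending set for re-allocation
    attempt.recoverResourceRequests(requests);
  }
}
{code}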


  ResourceRequest should be restored back to scheduler when RMContainer is 
 killed at ALLOCATED
 -

 Key: YARN-3535
 URL: https://issues.apache.org/jira/browse/YARN-3535
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Assignee: Peng Zhang
  Labels: BB2015-05-TBR
 Attachments: YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, 
 yarn-app.log


 During rolling update of NM, AM start of container on NM failed. 
 And then job hang there.
 Attach AM logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-05-27 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560965#comment-14560965
 ] 

Rohith commented on YARN-3585:
--

I have attached the NM logs and thread dump in YARN-3640. Could you get them 
from YARN-3640?

 NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Priority: Critical

 With NM recovery enabled, after decommission, the nodemanager log shows it 
 stopped but the process cannot end. 
 non daemon thread:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-05-27 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561256#comment-14561256
 ] 

Rohith commented on YARN-3585:
--

bq. Could you to instrument logs in the state store code to verify the leveldb 
database is indeed being closed even when it hangs? 
Sorry, I did not get exactly what and where I should add the logs. Do you mean 
I should add a log after {{NMLeveldbStateStoreService#closeStorage()}} is 
called? Something like the sketch below?
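
For instance, a hedged sketch of the instrumentation (placement and log 
wording hypothetical):
{code}
// Hypothetical instrumentation sketch inside NMLeveldbStateStoreService: log
// around db.close() to verify the leveldb handle is actually being released.
@Override
protected void closeStorage() throws IOException {
  LOG.info("Closing NM leveldb state store...");
  db.close();
  LOG.info("NM leveldb state store closed.");
}
{code}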

 NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Priority: Critical

 With NM recovery enabled, after decommission, the nodemanager log shows it 
 stopped but the process cannot end. 
 non daemon thread:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-05-27 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560470#comment-14560470
 ] 

Rohith commented on YARN-3585:
--

I tested locally with the YARN-3641 fix; the issue still exists.

 NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Priority: Critical

 With NM recovery enabled, after decommission, the nodemanager log shows it 
 stopped but the process cannot end. 
 non daemon thread:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3731) Unknown container. Container either has not started or has already completed or doesn’t belong to this node at all.

2015-05-27 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith resolved YARN-3731.
--
Resolution: Invalid

 Unknown container. Container either has not started or has already completed 
 or doesn’t belong to this node at all. 
 

 Key: YARN-3731
 URL: https://issues.apache.org/jira/browse/YARN-3731
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: amit
Priority: Critical

 Hi 
 I am importing data from sql server to hdfs and below is the command
 sqoop import --connect 
 "jdbc:sqlserver://Servername:1433;username=hadoop;password=Password;database=MSBI"
  --table DimDate --target-dir /Hadoop/hdpdatadn/dn/DW/msbi
 but I am getting the following error:
 User: amit.tomar
  Name: DimDate.jar
  Application Type: MAPREDUCE
  Application Tags:
  State: FAILED
  FinalStatus: FAILED
  Started: Wed May 27 12:39:48 +0800 2015
  Elapsed: 23sec
  Tracking URL: History
  Diagnostics: Application application_1432698911303_0005 failed 2 times due 
 to AM Container for appattempt_1432698911303_0005_02 exited with 
 exitCode: 1
 For more detailed output, check the application tracking 
 page: http://ServerName/proxy/application_1432698911303_0005/ Then, click on 
 the links to the logs of each attempt.
  Diagnostics: Exception from container-launch.
  Container id: container_1432698911303_0005_02_01
  Exit code: 1
  Stack trace: ExitCodeException exitCode=1:
  at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
  at org.apache.hadoop.util.Shell.run(Shell.java:455)
  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
  at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
  at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
  at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
  Shell output: 1 file(s) moved.
  Container exited with a non-zero exit code 1
  Failing this attempt. Failing the application. 
 From the log below is the message:
 java.lang.Exception: Unknown container. Container either has not started or 
 has already completed or doesn’t belong to this node at all. 
 Thanks in advance
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3731) Unknown container. Container either has not started or has already completed or doesn’t belong to this node at all.

2015-05-27 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14562280#comment-14562280
 ] 

Rohith commented on YARN-3731:
--

Closing the issue as invalid.

 Unknown container. Container either has not started or has already completed 
 or doesn’t belong to this node at all. 
 

 Key: YARN-3731
 URL: https://issues.apache.org/jira/browse/YARN-3731
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: amit
Priority: Critical

 Hi 
 I am importing data from sql server to hdfs and below is the command
 sqoop import --connect 
 "jdbc:sqlserver://Servername:1433;username=hadoop;password=Password;database=MSBI"
  --table DimDate --target-dir /Hadoop/hdpdatadn/dn/DW/msbi
 but I am getting the following error:
 User: amit.tomar
  Name: DimDate.jar
  Application Type: MAPREDUCE
  Application Tags:
  State: FAILED
  FinalStatus: FAILED
  Started: Wed May 27 12:39:48 +0800 2015
  Elapsed: 23sec
  Tracking URL: History
  Diagnostics: Application application_1432698911303_0005 failed 2 times due 
 to AM Container for appattempt_1432698911303_0005_02 exited with 
 exitCode: 1
 For more detailed output, check the application tracking 
 page: http://ServerName/proxy/application_1432698911303_0005/ Then, click on 
 the links to the logs of each attempt.
  Diagnostics: Exception from container-launch.
  Container id: container_1432698911303_0005_02_01
  Exit code: 1
  Stack trace: ExitCodeException exitCode=1:
  at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
  at org.apache.hadoop.util.Shell.run(Shell.java:455)
  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
  at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
  at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
  at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
  Shell output: 1 file(s) moved.
  Container exited with a non-zero exit code 1
  Failing this attempt. Failing the application. 
 From the log below is the message:
 java.lang.Exception: Unknown container. Container either has not started or 
 has already completed or doesn’t belong to this node at all. 
 Thanks in advance
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3731) Unknown container. Container either has not started or has already completed or doesn’t belong to this node at all.

2015-05-27 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14562279#comment-14562279
 ] 

Rohith commented on YARN-3731:
--

Hi [~amitmsbi]
Thanks for using Hadoop. You are trying to access the log link for an 
application master that was never launched. From the diagnostics message, it 
is clear that the application was not launched. So first and foremost, you 
need to check why the ApplicationMaster was not launched. There is likely an 
application configuration or classpath issue, which you can find in the 
stderr container logs.

Also, JIRA is meant for tracking development activities. For queries, kindly 
register to the [mailing list|https://hadoop.apache.org/mailing_lists.html] 
and send mail to the users mailing list, i.e. {{u...@hadoop.apache.org}}. 
Folks there will definitely help you solve or answer your queries.

 Unknown container. Container either has not started or has already completed 
 or doesn’t belong to this node at all. 
 

 Key: YARN-3731
 URL: https://issues.apache.org/jira/browse/YARN-3731
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: amit
Priority: Critical

 Hi 
 I am importing data from sql server to hdfs and below is the command
 sqoop import --connect 
 "jdbc:sqlserver://Servername:1433;username=hadoop;password=Password;database=MSBI"
  --table DimDate --target-dir /Hadoop/hdpdatadn/dn/DW/msbi
 but I am getting the following error:
 User: amit.tomar
  Name: DimDate.jar
  Application Type: MAPREDUCE
  Application Tags:
  State: FAILED
  FinalStatus: FAILED
  Started: Wed May 27 12:39:48 +0800 2015
  Elapsed: 23sec
  Tracking URL: History
  Diagnostics: Application application_1432698911303_0005 failed 2 times due 
 to AM Container for appattempt_1432698911303_0005_02 exited with 
 exitCode: 1
 For more detailed output, check the application tracking 
 page: http://ServerName/proxy/application_1432698911303_0005/ Then, click on 
 the links to the logs of each attempt.
  Diagnostics: Exception from container-launch.
  Container id: container_1432698911303_0005_02_01
  Exit code: 1
  Stack trace: ExitCodeException exitCode=1:
  at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
  at org.apache.hadoop.util.Shell.run(Shell.java:455)
  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
  at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
  at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
  at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
  Shell output: 1 file(s) moved.
  Container exited with a non-zero exit code 1
  Failing this attempt. Failing the application. 
 From the log below is the message:
 java.lang.Exception: Unknown container. Container either has not started or 
 has already completed or doesn’t belong to this node at all. 
 Thanks in advance
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-05-26 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560410#comment-14560410
 ] 

Rohith commented on YARN-3585:
--

I will test the YARN-3641 fix against this JIRA's scenario. About the patch, I 
think calling System.exit() explicitly after the shutdown thread exits is one 
option.

 NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Priority: Critical

 With NM recovery enabled, after decommission, the nodemanager log shows it 
 stopped but the process cannot end. 
 non daemon thread:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-05-25 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558050#comment-14558050
 ] 

Rohith commented on YARN-3543:
--

[~vinodkv] Kindly review the updated patch..

 ApplicationReport should be able to tell whether the Application is AM 
 managed or not. 
 ---

 Key: YARN-3543
 URL: https://issues.apache.org/jira/browse/YARN-3543
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api
Affects Versions: 2.6.0
Reporter: Spandan Dutta
Assignee: Rohith
 Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 
 YARN-3543-AH.PNG, YARN-3543-RM.PNG


 Currently we can know whether the application submitted by the user is AM 
 managed from the applicationSubmissionContext. This can be only done  at the 
 time when the user submits the job. We should have access to this info from 
 the ApplicationReport as well so that we can check whether an app is AM 
 managed or not anytime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2238) filtering on UI sticks even if I move away from the page

2015-05-24 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557960#comment-14557960
 ] 

Rohith commented on YARN-2238:
--

+1 lgtm (non-binding)

 filtering on UI sticks even if I move away from the page
 

 Key: YARN-2238
 URL: https://issues.apache.org/jira/browse/YARN-2238
 Project: Hadoop YARN
  Issue Type: Bug
  Components: webapp
Affects Versions: 2.4.0
Reporter: Sangjin Lee
Assignee: Jian He
  Labels: usability
 Attachments: YARN-2238.patch, YARN-2238.png, filtered.png


 The main data table in many web pages (RM, AM, etc.) seems to show an 
 unexpected filtering behavior.
 If I filter the table by typing something in the key or value field (or I 
 suspect any search field), the data table gets filtered. The example I used 
 is the job configuration page for a MR job. That is expected.
 However, when I move away from that page and visit any other web page of the 
 same type (e.g. a job configuration page), the page is rendered with the 
 filtering! That is unexpected.
 What's even stranger is that it does not render the filtering term. As a 
 result, I have a page that's mysteriously filtered but doesn't tell me what 
 it's filtering on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2238) filtering on UI sticks even if I move away from the page

2015-05-24 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557959#comment-14557959
 ] 

Rohith commented on YARN-2238:
--

Tested locally with the YARN-3707 fix, working fine :-)

 filtering on UI sticks even if I move away from the page
 

 Key: YARN-2238
 URL: https://issues.apache.org/jira/browse/YARN-2238
 Project: Hadoop YARN
  Issue Type: Bug
  Components: webapp
Affects Versions: 2.4.0
Reporter: Sangjin Lee
Assignee: Jian He
  Labels: usability
 Attachments: YARN-2238.patch, YARN-2238.png, filtered.png


 The main data table in many web pages (RM, AM, etc.) seems to show an 
 unexpected filtering behavior.
 If I filter the table by typing something in the key or value field (or I 
 suspect any search field), the data table gets filtered. The example I used 
 is the job configuration page for a MR job. That is expected.
 However, when I move away from that page and visit any other web page of the 
 same type (e.g. a job configuration page), the page is rendered with the 
 filtering! That is unexpected.
 What's even stranger is that it does not render the filtering term. As a 
 result, I have a page that's mysteriously filtered but doesn't tell me what 
 it's filtering on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-05-24 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3543:
-
Attachment: 0004-YARN-3543.patch

Attaching the same patch as before to kick off Jenkins.

 ApplicationReport should be able to tell whether the Application is AM 
 managed or not. 
 ---

 Key: YARN-3543
 URL: https://issues.apache.org/jira/browse/YARN-3543
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api
Affects Versions: 2.6.0
Reporter: Spandan Dutta
Assignee: Rohith
 Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 
 YARN-3543-AH.PNG, YARN-3543-RM.PNG


 Currently we can know whether the application submitted by the user is AM 
 managed from the applicationSubmissionContext. This can be only done  at the 
 time when the user submits the job. We should have access to this info from 
 the ApplicationReport as well so that we can check whether an app is AM 
 managed or not anytime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3708) container num become -1 after job finished

2015-05-24 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith resolved YARN-3708.
--
Resolution: Duplicate

This is a duplicate of YARN-3552. Closing the issue as a duplicate.

 container num become -1 after job finished
 --

 Key: YARN-3708
 URL: https://issues.apache.org/jira/browse/YARN-3708
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.7.0
Reporter: tongshiquan
Priority: Minor
 Attachments: screenshot-1.png






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-05-23 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557170#comment-14557170
 ] 

Rohith commented on YARN-3585:
--

I think we can invoke System.exit in a finally block once the NodeManager is 
shut down, and bypass it with a flag for test-case execution. Any thoughts?
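
A rough sketch of what I mean (the flag and helper names are hypothetical):
{code}
// Hypothetical sketch: after the NM services stop, force a JVM exit so that a
// lingering non-daemon thread (e.g. the leveldb JNI thread) cannot keep the
// process alive forever. Tests would set shouldExitOnShutdown to false.
private static boolean shouldExitOnShutdown = true;

static void stopAndExit(NodeManager nodeManager) {
  try {
    nodeManager.stop(); // normal graceful shutdown of all NM services
  } finally {
    if (shouldExitOnShutdown) {
      System.exit(0);   // hard exit, bypassed in tests via the flag
    }
  }
}
{code}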

 NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Priority: Critical

 With NM recovery enabled, after decommission, the nodemanager log shows it 
 stopped but the process cannot end. 
 non daemon thread:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2238) filtering on UI sticks even if I move away from the page

2015-05-23 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557277#comment-14557277
 ] 

Rohith commented on YARN-2238:
--

I do not have much knowledge of jQuery, but I did black-box testing on a 
1-node cluster with the patch applied. 
Some observations:
# Filtering on the scheduler page does not carry over to the application page. 
This is the JIRA scenario, and it is working fine.
# After navigating to the scheduler page, clicking a LeafQueue bar applies the 
filter but does not show any apps running on that queue on the scheduler page.

 filtering on UI sticks even if I move away from the page
 

 Key: YARN-2238
 URL: https://issues.apache.org/jira/browse/YARN-2238
 Project: Hadoop YARN
  Issue Type: Bug
  Components: webapp
Affects Versions: 2.4.0
Reporter: Sangjin Lee
Assignee: Jian He
  Labels: usability
 Attachments: YARN-2238.patch, filtered.png


 The main data table in many web pages (RM, AM, etc.) seems to show an 
 unexpected filtering behavior.
 If I filter the table by typing something in the key or value field (or I 
 suspect any search field), the data table gets filtered. The example I used 
 is the job configuration page for a MR job. That is expected.
 However, when I move away from that page and visit any other web page of the 
 same type (e.g. a job configuration page), the page is rendered with the 
 filtering! That is unexpected.
 What's even stranger is that it does not render the filtering term. As a 
 result, I have a page that's mysteriously filtered but doesn't tell me what 
 it's filtering on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2238) filtering on UI sticks even if I move away from the page

2015-05-23 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557279#comment-14557279
 ] 

Rohith commented on YARN-2238:
--

Attached an RM web UI screenshot that depicts problem 2 from my previous 
comment.

 filtering on UI sticks even if I move away from the page
 

 Key: YARN-2238
 URL: https://issues.apache.org/jira/browse/YARN-2238
 Project: Hadoop YARN
  Issue Type: Bug
  Components: webapp
Affects Versions: 2.4.0
Reporter: Sangjin Lee
Assignee: Jian He
  Labels: usability
 Attachments: YARN-2238.patch, YARN-2238.png, filtered.png


 The main data table in many web pages (RM, AM, etc.) seems to show an 
 unexpected filtering behavior.
 If I filter the table by typing something in the key or value field (or I 
 suspect any search field), the data table gets filtered. The example I used 
 is the job configuration page for a MR job. That is expected.
 However, when I move away from that page and visit any other web page of the 
 same type (e.g. a job configuration page), the page is rendered with the 
 filtering! That is unexpected.
 What's even stranger is that it does not render the filtering term. As a 
 result, I have a page that's mysteriously filtered but doesn't tell me what 
 it's filtering on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2238) filtering on UI sticks even if I move away from the page

2015-05-23 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-2238:
-
Attachment: YARN-2238.png

 filtering on UI sticks even if I move away from the page
 

 Key: YARN-2238
 URL: https://issues.apache.org/jira/browse/YARN-2238
 Project: Hadoop YARN
  Issue Type: Bug
  Components: webapp
Affects Versions: 2.4.0
Reporter: Sangjin Lee
Assignee: Jian He
  Labels: usability
 Attachments: YARN-2238.patch, YARN-2238.png, filtered.png


 The main data table in many web pages (RM, AM, etc.) seems to show an 
 unexpected filtering behavior.
 If I filter the table by typing something in the key or value field (or I 
 suspect any search field), the data table gets filtered. The example I used 
 is the job configuration page for a MR job. That is expected.
 However, when I move away from that page and visit any other web page of the 
 same type (e.g. a job configuration page), the page is rendered with the 
 filtering! That is unexpected.
 What's even stranger is that it does not render the filtering term. As a 
 result, I have a page that's mysteriously filtered but doesn't tell me what 
 it's filtering on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-05-22 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14556239#comment-14556239
 ] 

Rohith commented on YARN-3543:
--

[~aw] Would you help me understand and resolve a build issue? Basically, what 
I observe is that the patch contains changes to many files across several 
projects. When the test cases are triggered, the build ignores the applied 
patch and uses the existing class files, which causes the compilation failure 
and other issues. But if I apply the patch and build locally, it is 
successful.

 ApplicationReport should be able to tell whether the Application is AM 
 managed or not. 
 ---

 Key: YARN-3543
 URL: https://issues.apache.org/jira/browse/YARN-3543
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api
Affects Versions: 2.6.0
Reporter: Spandan Dutta
Assignee: Rohith
 Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
 0004-YARN-3543.patch, 0004-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG


 Currently we can know whether the application submitted by the user is AM 
 managed from the applicationSubmissionContext. This can be only done  at the 
 time when the user submits the job. We should have access to this info from 
 the ApplicationReport as well so that we can check whether an app is AM 
 managed or not anytime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3692) Allow REST API to set a user generated message when killing an application

2015-05-21 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554057#comment-14554057
 ] 

Rohith commented on YARN-3692:
--

All applications are killed by the user only. The diagnostic message for an 
application KILLED by the user is internal to YARN, whether the kill comes 
from REST or from whoever invokes ApplicationClientProtocol. 
Is the intent here to let the user set the reason for killing an application?
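
If so, a hedged sketch of the client-side shape this could take (the 
two-argument kill does not exist today; it is purely illustrative):
{code}
// Purely illustrative sketch of the proposed API surface in YarnClient:
// attach a user-supplied diagnostic message to the kill request.
public abstract void killApplication(ApplicationId applicationId,
    String diagnostics) throws YarnException, IOException;

// Hypothetical usage:
//   yarnClient.killApplication(appId, "Killed by user: superseded by a newer run");
{code}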

 Allow REST API to set a user generated message when killing an application
 --

 Key: YARN-3692
 URL: https://issues.apache.org/jira/browse/YARN-3692
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Rajat Jain
Assignee: Rohith

 Currently YARN's REST API supports killing an application without setting a 
 diagnostic message. It would be good to provide that support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-05-20 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3543:
-
Attachment: (was: 0003-YARN-3543.patch)

 ApplicationReport should be able to tell whether the Application is AM 
 managed or not. 
 ---

 Key: YARN-3543
 URL: https://issues.apache.org/jira/browse/YARN-3543
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api
Affects Versions: 2.6.0
Reporter: Spandan Dutta
Assignee: Rohith
  Labels: BB2015-05-TBR
 Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
 0004-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG


 Currently we can know whether the application submitted by the user is AM 
 managed from the applicationSubmissionContext. This can be only done  at the 
 time when the user submits the job. We should have access to this info from 
 the ApplicationReport as well so that we can check whether an app is AM 
 managed or not anytime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-20 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14552225#comment-14552225
 ] 

Rohith commented on YARN-3646:
--

+1 lgtm (non-binding). Waiting for the Jenkins report!

 Applications are getting stuck some times in case of retry policy forever
 -

 Key: YARN-3646
 URL: https://issues.apache.org/jira/browse/YARN-3646
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Reporter: Raju Bairishetti
 Attachments: YARN-3646.001.patch, YARN-3646.002.patch, YARN-3646.patch


 We have set  *yarn.resourcemanager.connect.wait-ms* to -1 to use  FOREVER 
 retry policy.
 Yarn client retries infinitely in case of exceptions from the RM because it 
 is using the FOREVER retry policy. The problem is that it retries for all 
 kinds of exceptions (like ApplicationNotFoundException), even when the 
 failure is not a connection failure. Due to this my application is not 
 progressing further.
 *Yarn client should not retry infinitely in case of non-connection failures.*
 We have written a simple yarn-client which tries to get an application 
 report for an invalid or older appId. The ResourceManager throws an 
 ApplicationNotFoundException as this is an invalid or older appId. But 
 because of the FOREVER retry policy, the client keeps retrying to get the 
 application report and the ResourceManager keeps throwing 
 ApplicationNotFoundException.
 {code}
 private void testYarnClientRetryPolicy() throws  Exception{
 YarnConfiguration conf = new YarnConfiguration();
 conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, 
 -1);
 YarnClient yarnClient = YarnClient.createYarnClient();
 yarnClient.init(conf);
 yarnClient.start();
 ApplicationId appId = ApplicationId.newInstance(1430126768987L, 
 10645);
 ApplicationReport report = yarnClient.getApplicationReport(appId);
 }
 {code}
 *RM logs:*
 {noformat}
 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call 
 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
 from 10.14.120.231:61621 Call#875162 Retry#0
 org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
 with id 'application_1430126768987_10645' doesn't exist in RM.
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
   at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
   at 
 org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
 
 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call 
 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
 from 10.14.120.231:61621 Call#875163 Retry#0
 
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-05-20 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14552165#comment-14552165
 ] 

Rohith commented on YARN-3543:
--

The build machine is not able to run all those tests in one shot. A similar 
issue was faced earlier in YARN-2784. I think we need to split this JIRA into 
a proto change, a WebUI change, an AH change, and more.

 ApplicationReport should be able to tell whether the Application is AM 
 managed or not. 
 ---

 Key: YARN-3543
 URL: https://issues.apache.org/jira/browse/YARN-3543
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api
Affects Versions: 2.6.0
Reporter: Spandan Dutta
Assignee: Rohith
  Labels: BB2015-05-TBR
 Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
 0004-YARN-3543.patch, 0004-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG


 Currently we can know whether the application submitted by the user is AM 
 managed from the applicationSubmissionContext. This can be only done  at the 
 time when the user submits the job. We should have access to this info from 
 the ApplicationReport as well so that we can check whether an app is AM 
 managed or not anytime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-05-20 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3543:
-
Attachment: 0004-YARN-3543.patch

Attached the same patch to kick off Jenkins.

 ApplicationReport should be able to tell whether the Application is AM 
 managed or not. 
 ---

 Key: YARN-3543
 URL: https://issues.apache.org/jira/browse/YARN-3543
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api
Affects Versions: 2.6.0
Reporter: Spandan Dutta
Assignee: Rohith
  Labels: BB2015-05-TBR
 Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
 0004-YARN-3543.patch, 0004-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG


 Currently we can know whether the application submitted by the user is AM 
 managed from the applicationSubmissionContext. This can be only done  at the 
 time when the user submits the job. We should have access to this info from 
 the ApplicationReport as well so that we can check whether an app is AM 
 managed or not anytime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-20 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14552091#comment-14552091
 ] 

Rohith commented on YARN-3646:
--

Thanks for updating the patch. Some comments on the tests:
# I think we can remove the tests added in the hadoop-common project, since the 
yarn-client test verifies the required functionality. Basically, the 
hadoop-common test was mocking the RMProxy functionality, and it passed even 
without the RMProxy fix.
# The code never reaches {{Assert.fail();}}; better to remove it.
# Catch ApplicationNotFoundException instead of catching Throwable. I think 
you can add {{expected = ApplicationNotFoundException.class}} to the @Test 
annotation, like below.
{code}
@Test(timeout = 30000, expected = ApplicationNotFoundException.class)
public void testClientWithRetryPolicyForEver() throws Exception {
  YarnConfiguration conf = new YarnConfiguration();
  conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, -1);

  ResourceManager rm = null;
  YarnClient yarnClient = null;
  try {
    // start rm
    rm = new ResourceManager();
    rm.init(conf);
    rm.start();

    yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // create invalid application id
    ApplicationId appId = ApplicationId.newInstance(1430126768987L, 10645);

    // RM should throw ApplicationNotFoundException exception
    yarnClient.getApplicationReport(appId);
  } finally {
    if (yarnClient != null) {
      yarnClient.stop();
    }
    if (rm != null) {
      rm.stop();
    }
  }
}
{code}
# Can you rename the test to reflect the actual functionality under test, e.g. 
{{testShouldNotRetryForeverForNonNetworkExceptions}}?

 Applications are getting stuck some times in case of retry policy forever
 -

 Key: YARN-3646
 URL: https://issues.apache.org/jira/browse/YARN-3646
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Reporter: Raju Bairishetti
 Attachments: YARN-3646.001.patch, YARN-3646.patch


 We have set  *yarn.resourcemanager.connect.wait-ms* to -1 to use  FOREVER 
 retry policy.
 Yarn client is infinitely retrying in case of exceptions from the RM as it is 
 using retrying policy as FOREVER. The problem is it is retrying for all kinds 
 of exceptions (like ApplicationNotFoundException), even though it is not a 
 connection failure. Due to this my application is not progressing further.
 *Yarn client should not retry infinitely in case of non connection failures.*
 We have written a simple yarn-client which is trying to get an application 
 report for an invalid  or older appId. ResourceManager is throwing an 
 ApplicationNotFoundException as this is an invalid or older appId.  But 
 because of retry policy FOREVER, client is keep on retrying for getting the 
 application report and ResourceManager is throwing 
 ApplicationNotFoundException continuously.
 {code}
 private void testYarnClientRetryPolicy() throws  Exception{
 YarnConfiguration conf = new YarnConfiguration();
 conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, 
 -1);
 YarnClient yarnClient = YarnClient.createYarnClient();
 yarnClient.init(conf);
 yarnClient.start();
 ApplicationId appId = ApplicationId.newInstance(1430126768987L, 
 10645);
 ApplicationReport report = yarnClient.getApplicationReport(appId);
 }
 {code}
 *RM logs:*
 {noformat}
 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call 
 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
 from 10.14.120.231:61621 Call#875162 Retry#0
 org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
 with id 'application_1430126768987_10645' doesn't exist in RM.
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
   at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
   at 
 org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 

[jira] [Assigned] (YARN-3692) Allow REST API to set a user generated message when killing an application

2015-05-20 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith reassigned YARN-3692:


Assignee: Rohith

 Allow REST API to set a user generated message when killing an application
 --

 Key: YARN-3692
 URL: https://issues.apache.org/jira/browse/YARN-3692
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Rajat Jain
Assignee: Rohith

 Currently YARN's REST API supports killing an application without setting a 
 diagnostic message. It would be good to provide that support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3674) YARN application disappears from view

2015-05-19 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14549928#comment-14549928
 ] 

Rohith commented on YARN-3674:
--

Is this a dup of YARN-2238?

 YARN application disappears from view
 -

 Key: YARN-3674
 URL: https://issues.apache.org/jira/browse/YARN-3674
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.8.0
Reporter: Sergey Shelukhin

 I have 2 tabs open at exact same URL with RUNNING applications view. There is 
 an application that is, in fact, running, that is visible in one tab but not 
 the other. This persists across refreshes. If I open new tab from the tab 
 where the application is not visible, in that tab it shows up ok.
 I didn't change scheduler/queue settings before this behavior happened; on 
 [~sseth]'s advice I went and tried to click the root node of the scheduler on 
 scheduler page; the app still does not become visible.
 Something got stuck somewhere...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-19 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14550233#comment-14550233
 ] 

Rohith commented on YARN-3646:
--

bq. Seems we do not even require exceptionToPolicy for FOREVER policy if we 
catch the exception in shouldRetry method.
Makes sense to me; I will review the patch. Thanks.

 Applications are getting stuck some times in case of retry policy forever
 -

 Key: YARN-3646
 URL: https://issues.apache.org/jira/browse/YARN-3646
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Reporter: Raju Bairishetti
 Attachments: YARN-3646.patch


 We have set  *yarn.resourcemanager.connect.wait-ms* to -1 to use  FOREVER 
 retry policy.
 Yarn client is infinitely retrying in case of exceptions from the RM as it is 
 using retrying policy as FOREVER. The problem is it is retrying for all kinds 
 of exceptions (like ApplicationNotFoundException), even though it is not a 
 connection failure. Due to this my application is not progressing further.
 *Yarn client should not retry infinitely in case of non connection failures.*
 We have written a simple yarn-client which is trying to get an application 
 report for an invalid  or older appId. ResourceManager is throwing an 
 ApplicationNotFoundException as this is an invalid or older appId.  But 
 because of retry policy FOREVER, client is keep on retrying for getting the 
 application report and ResourceManager is throwing 
 ApplicationNotFoundException continuously.
 {code}
 private void testYarnClientRetryPolicy() throws  Exception{
 YarnConfiguration conf = new YarnConfiguration();
 conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, 
 -1);
 YarnClient yarnClient = YarnClient.createYarnClient();
 yarnClient.init(conf);
 yarnClient.start();
 ApplicationId appId = ApplicationId.newInstance(1430126768987L, 
 10645);
 ApplicationReport report = yarnClient.getApplicationReport(appId);
 }
 {code}
 *RM logs:*
 {noformat}
 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call 
 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
 from 10.14.120.231:61621 Call#875162 Retry#0
 org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
 with id 'application_1430126768987_10645' doesn't exist in RM.
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
   at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
   at 
 org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
 
 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call 
 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
 from 10.14.120.231:61621 Call#875163 Retry#0
 
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-19 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14550256#comment-14550256
 ] 

Rohith commented on YARN-3646:
--

Thanks for working on this issue. The patch overall looks good to me.
nit: Can the test be moved to the YARN package, since the issue is in YARN? 
Otherwise, if there is any change in RMProxy, the test will not run.

 Applications are getting stuck some times in case of retry policy forever
 -

 Key: YARN-3646
 URL: https://issues.apache.org/jira/browse/YARN-3646
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Reporter: Raju Bairishetti
 Attachments: YARN-3646.patch


 We have set  *yarn.resourcemanager.connect.wait-ms* to -1 to use  FOREVER 
 retry policy.
 Yarn client is infinitely retrying in case of exceptions from the RM as it is 
 using retrying policy as FOREVER. The problem is it is retrying for all kinds 
 of exceptions (like ApplicationNotFoundException), even though it is not a 
 connection failure. Due to this my application is not progressing further.
 *Yarn client should not retry infinitely in case of non connection failures.*
 We have written a simple yarn-client which is trying to get an application 
 report for an invalid  or older appId. ResourceManager is throwing an 
 ApplicationNotFoundException as this is an invalid or older appId.  But 
 because of retry policy FOREVER, client is keep on retrying for getting the 
 application report and ResourceManager is throwing 
 ApplicationNotFoundException continuously.
 {code}
 private void testYarnClientRetryPolicy() throws  Exception{
 YarnConfiguration conf = new YarnConfiguration();
 conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, 
 -1);
 YarnClient yarnClient = YarnClient.createYarnClient();
 yarnClient.init(conf);
 yarnClient.start();
 ApplicationId appId = ApplicationId.newInstance(1430126768987L, 
 10645);
 ApplicationReport report = yarnClient.getApplicationReport(appId);
 }
 {code}
 *RM logs:*
 {noformat}
 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call 
 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
 from 10.14.120.231:61621 Call#875162 Retry#0
 org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
 with id 'application_1430126768987_10645' doesn't exist in RM.
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
   at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
   at 
 org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
 
 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call 
 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
 from 10.14.120.231:61621 Call#875163 Retry#0
 
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-19 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14550258#comment-14550258
 ] 

Rohith commented on YARN-3646:
--

And I verified it on a one-node cluster by enabling and disabling the 
retry-forever policy.

 Applications are getting stuck some times in case of retry policy forever
 -

 Key: YARN-3646
 URL: https://issues.apache.org/jira/browse/YARN-3646
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Reporter: Raju Bairishetti
 Attachments: YARN-3646.patch


 We have set  *yarn.resourcemanager.connect.wait-ms* to -1 to use  FOREVER 
 retry policy.
 Yarn client is infinitely retrying in case of exceptions from the RM as it is 
 using retrying policy as FOREVER. The problem is it is retrying for all kinds 
 of exceptions (like ApplicationNotFoundException), even though it is not a 
 connection failure. Due to this my application is not progressing further.
 *Yarn client should not retry infinitely in case of non connection failures.*
 We have written a simple yarn-client which is trying to get an application 
 report for an invalid  or older appId. ResourceManager is throwing an 
 ApplicationNotFoundException as this is an invalid or older appId.  But 
 because of retry policy FOREVER, client is keep on retrying for getting the 
 application report and ResourceManager is throwing 
 ApplicationNotFoundException continuously.
 {code}
 private void testYarnClientRetryPolicy() throws  Exception{
 YarnConfiguration conf = new YarnConfiguration();
 conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, 
 -1);
 YarnClient yarnClient = YarnClient.createYarnClient();
 yarnClient.init(conf);
 yarnClient.start();
 ApplicationId appId = ApplicationId.newInstance(1430126768987L, 
 10645);
 ApplicationReport report = yarnClient.getApplicationReport(appId);
 }
 {code}
 *RM logs:*
 {noformat}
 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call 
 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
 from 10.14.120.231:61621 Call#875162 Retry#0
 org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
 with id 'application_1430126768987_10645' doesn't exist in RM.
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
   at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
   at 
 org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
 
 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call 
 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
 from 10.14.120.231:61621 Call#875163 Retry#0
 
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-05-19 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3543:
-
Attachment: 0004-YARN-3543.patch

 ApplicationReport should be able to tell whether the Application is AM 
 managed or not. 
 ---

 Key: YARN-3543
 URL: https://issues.apache.org/jira/browse/YARN-3543
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api
Affects Versions: 2.6.0
Reporter: Spandan Dutta
Assignee: Rohith
  Labels: BB2015-05-TBR
 Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
 0003-YARN-3543.patch, 0004-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG


 Currently we can know whether the application submitted by the user is AM 
 managed from the applicationSubmissionContext. This can be only done  at the 
 time when the user submits the job. We should have access to this info from 
 the ApplicationReport as well so that we can check whether an app is AM 
 managed or not anytime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2268) Disallow formatting the RMStateStore when there is an RM running

2015-05-19 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14551826#comment-14551826
 ] 

Rohith commented on YARN-2268:
--

Thanks [~sunilg] [~jianhe] [~kasha] for sharing your thoughts.
bq. Given we recommend using the ZK-store when using HA, how about adding this 
for the ZK-store using an ephemeral znode for lock first?
+1, given that ZKRMStateStore is the recommended state store for HA. A rough 
sketch of the idea is below.

bq. How about creating a lock file and declaring it stale after a stipulated 
period of time.
If we use a stipulated period, I am thinking that within that period neither 
the RM can be started nor can the state store be formatted. And the lock file 
would have to be stored in HDFS regardless of which RMStateStore is used, 
which is an extra binding to the filesystem.

I am also wondering: why can't we use the general approach of polling the web 
service? It would give a more accurate state.
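
A rough sketch of the ephemeral-znode guard, with an illustrative znode path 
and connect string (not an actual patch): a running RM would create the 
ephemeral znode at startup, and the format command would refuse to run while 
it exists, since the znode disappears automatically when the RM's session ends.
{code}
import org.apache.zookeeper.ZooKeeper;

public class StateStoreFormatGuard {
  public static void main(String[] args) throws Exception {
    // connect string and lock path are illustrative only
    ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });
    try {
      // An active RM's session keeps this ephemeral znode alive; it goes
      // away automatically when the RM dies or loses its ZK session.
      if (zk.exists("/rmstore/ACTIVE_RM_LOCK", false) != null) {
        System.err.println("An RM appears to be active; refusing to format.");
        return;
      }
      System.out.println("No active RM detected; proceeding with format.");
      // ... delete the state store root znode here ...
    } finally {
      zk.close();
    }
  }
}
{code}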


 Disallow formatting the RMStateStore when there is an RM running
 

 Key: YARN-2268
 URL: https://issues.apache.org/jira/browse/YARN-2268
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Karthik Kambatla
Assignee: Rohith
 Attachments: 0001-YARN-2268.patch


 YARN-2131 adds a way to format the RMStateStore. However, it can be a problem 
 if we format the store while an RM is actively using it. It would be nice to 
 fail the format if there is an RM running and using this store. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-05-19 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14550854#comment-14550854
 ] 

Rohith commented on YARN-3543:
--

About the -1's from QA:
# Findbugs: YARN-3677 already exists to track the issue.
# Checkstyle: the error is that the number of parameters exceeds 7, which I 
think needs to be ignored. I am not sure whether it should be added to an 
ignore file or simply ignored.
# Regarding the test failures, I suspect the test machines; many tests are 
failing:
## Type-1: Address already in use exceptions.
## Type-2: NoSuchMethodError.
## Type-3: ClassCastException, and many others.

I am quite suspicious of the order of compilation and test execution. 
Probably, when running the resourcemanager tests, it is not including the 
modified classes from yarn-api/yarn-common, so the NoSuchMethodError is thrown.

 ApplicationReport should be able to tell whether the Application is AM 
 managed or not. 
 ---

 Key: YARN-3543
 URL: https://issues.apache.org/jira/browse/YARN-3543
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api
Affects Versions: 2.6.0
Reporter: Spandan Dutta
Assignee: Rohith
  Labels: BB2015-05-TBR
 Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
 0003-YARN-3543.patch, 0004-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG


 Currently we can know whether the application submitted by the user is AM 
 managed from the applicationSubmissionContext. This can be only done  at the 
 time when the user submits the job. We should have access to this info from 
 the ApplicationReport as well so that we can check whether an app is AM 
 managed or not anytime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-15 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544988#comment-14544988
 ] 

Rohith commented on YARN-3646:
--

Setting RetryPolicies.RETRY_FOREVER as the default policy in 
exceptionToPolicyMap is not sufficient; 
{{RetryPolicies.RetryForever.shouldRetry()}} should also check for connect 
exceptions and handle them. Otherwise shouldRetry always returns the 
RetryAction.RETRY action.
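
A minimal sketch of the idea, with an illustrative class name and exception 
list (not the actual YARN-3646 fix): retry forever only for connection-level 
failures, and fail fast for application-level exceptions such as 
ApplicationNotFoundException.
{code}
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.UnknownHostException;
import org.apache.hadoop.io.retry.RetryPolicy;

public class RetryForeverOnConnectFailure implements RetryPolicy {
  @Override
  public RetryAction shouldRetry(Exception e, int retries, int failovers,
      boolean isIdempotentOrAtMostOnce) throws Exception {
    if (e instanceof ConnectException
        || e instanceof NoRouteToHostException
        || e instanceof UnknownHostException) {
      return RetryAction.RETRY;  // connection-level failure: keep retrying
    }
    return RetryAction.FAIL;     // e.g. ApplicationNotFoundException: give up
  }
}
{code}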

 Applications are getting stuck some times in case of retry policy forever
 -

 Key: YARN-3646
 URL: https://issues.apache.org/jira/browse/YARN-3646
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Reporter: Raju Bairishetti

 We have set  *yarn.resourcemanager.connect.wait-ms* to -1 to use  FOREVER 
 retry policy.
 Yarn client is infinitely retrying in case of exceptions from the RM as it is 
 using retrying policy as FOREVER. The problem is it is retrying for all kinds 
 of exceptions (like ApplicationNotFoundException), even though it is not a 
 connection failure. Due to this my application is not progressing further.
 *Yarn client should not retry infinitely in case of non connection failures.*
 We have written a simple yarn-client which is trying to get an application 
 report for an invalid  or older appId. ResourceManager is throwing an 
 ApplicationNotFoundException as this is an invalid or older appId.  But 
 because of retry policy FOREVER, client is keep on retrying for getting the 
 application report and ResourceManager is throwing 
 ApplicationNotFoundException continuously.
 {code}
 private void testYarnClientRetryPolicy() throws  Exception{
 YarnConfiguration conf = new YarnConfiguration();
 conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, 
 -1);
 YarnClient yarnClient = YarnClient.createYarnClient();
 yarnClient.init(conf);
 yarnClient.start();
 ApplicationId appId = ApplicationId.newInstance(1430126768987L, 
 10645);
 ApplicationReport report = yarnClient.getApplicationReport(appId);
 }
 {code}
 *RM logs:*
 {noformat}
 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call 
 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
 from 10.14.120.231:61621 Call#875162 Retry#0
 org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
 with id 'application_1430126768987_10645' doesn't exist in RM.
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
   at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
   at 
 org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
 
 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call 
 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
 from 10.14.120.231:61621 Call#875163 Retry#0
 
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-05-15 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546560#comment-14546560
 ] 

Rohith commented on YARN-3543:
--

bq. ApplicationReport.newInstance() is Private, so you should simply update the 
existing method instead of adding a new one.
I understood your comment above as: since it is Private, the newInstance() 
method should not be modified. So I just added setter and getter methods in 
ApplicationReport. But doesn't that impact compatibility?

bq. app == null ? null : app.getUser()); What are these changes for?
This is for fixing a findbugs warning from an earlier Jenkins report. One 
thing I observed:
# when directly doing {{return ApplicationReport.newInstance}}, there is no 
findbugs warning, but
# when assigning {{ApplicationReport.newInstance}} to a new variable and 
returning that variable, findbugs gives a warning. So I changed the null check 
as above.

bq. AppInfo.getUnmanagedAM() needs to be renamed too.
Agree

 ApplicationReport should be able to tell whether the Application is AM 
 managed or not. 
 ---

 Key: YARN-3543
 URL: https://issues.apache.org/jira/browse/YARN-3543
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api
Affects Versions: 2.6.0
Reporter: Spandan Dutta
Assignee: Rohith
  Labels: BB2015-05-TBR
 Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
 0003-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG


 Currently we can know whether the application submitted by the user is AM 
 managed from the applicationSubmissionContext. This can be only done  at the 
 time when the user submits the job. We should have access to this info from 
 the ApplicationReport as well so that we can check whether an app is AM 
 managed or not anytime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-05-15 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546562#comment-14546562
 ] 

Rohith commented on YARN-3543:
--

bq. But doesn't that impact compatibility?
I meant that ApplicationReport.newInstance() is called from outside of YARN, 
e.g. in MR, NotRunningJob#getUnknownApplicationReport. Similarly, if any other 
YARN clients use ApplicationReport.newInstance, changing it would cause a 
compatibility issue. So I just provided setters and getters for the 
unmanaged-app flag, as sketched below.
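
A tiny sketch of the additive, compatibility-preserving approach, with 
illustrative names (not the actual patch): add new accessors with a safe 
default instead of changing the existing Private newInstance() signature, so 
existing callers keep compiling.
{code}
public class ApplicationReportSketch {
  // default value preserves the behaviour old callers expect
  private boolean unmanagedApp = false;

  public boolean isUnmanagedApp() {
    return unmanagedApp;
  }

  public void setUnmanagedApp(boolean unmanagedApp) {
    this.unmanagedApp = unmanagedApp;
  }
}
{code}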

 ApplicationReport should be able to tell whether the Application is AM 
 managed or not. 
 ---

 Key: YARN-3543
 URL: https://issues.apache.org/jira/browse/YARN-3543
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api
Affects Versions: 2.6.0
Reporter: Spandan Dutta
Assignee: Rohith
  Labels: BB2015-05-TBR
 Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
 0003-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG


 Currently we can know whether the application submitted by the user is AM 
 managed from the applicationSubmissionContext. This can be only done  at the 
 time when the user submits the job. We should have access to this info from 
 the ApplicationReport as well so that we can check whether an app is AM 
 managed or not anytime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-05-15 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545916#comment-14545916
 ] 

Rohith commented on YARN-3543:
--

Need to kick off Jenkins again to check whether the test failures are recurring.

 ApplicationReport should be able to tell whether the Application is AM 
 managed or not. 
 ---

 Key: YARN-3543
 URL: https://issues.apache.org/jira/browse/YARN-3543
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api
Affects Versions: 2.6.0
Reporter: Spandan Dutta
Assignee: Rohith
  Labels: BB2015-05-TBR
 Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
 0003-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG


 Currently we can know whether the application submitted by the user is AM 
 managed from the applicationSubmissionContext. This can be only done  at the 
 time when the user submits the job. We should have access to this info from 
 the ApplicationReport as well so that we can check whether an app is AM 
 managed or not anytime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3642) Hadoop2 yarn.resourcemanager.scheduler.address not loaded by RMProxy.java

2015-05-15 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith resolved YARN-3642.
--
Resolution: Invalid

Closing as Invalid.

For queries or basic environment problems, I suggest asking on the user 
mailing lists.

 Hadoop2 yarn.resourcemanager.scheduler.address not loaded by RMProxy.java
 -

 Key: YARN-3642
 URL: https://issues.apache.org/jira/browse/YARN-3642
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>qadoop-nn001.apsalar.com</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>qadoop-nn001.apsalar.com:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>qadoop-nn001.apsalar.com:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webap.address</name>
    <value>qadoop-nn001.apsalar.com:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>qadoop-nn001.apsalar.com:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>qadoop-nn001.apsalar.com:8033</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <description>Where to aggregate logs to.</description>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/var/log/hadoop/apps</value>
  </property>
  <property>
    <name>yarn.web-proxy.address</name>
    <value>qadoop-nn001.apsalar.com:8088</value>
  </property>
</configuration>
core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://qadoop-nn001.apsalar.com</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hdfs.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hdfs.groups</name>
    <value>*</value>
  </property>
</configuration>
hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/hadoop/nn</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/hadoop/dn/dfs</value>
  </property>
  <property>
    <name>dfs.http.address</name>
    <value>qadoop-nn001.apsalar.com:50070</value>
  </property>
  <property>
    <name>dfs.secondary.http.address</name>
    <value>qadoop-nn002.apsalar.com:50090</value>
  </property>
</configuration>
mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>qadoop-nn001.apsalar.com:8032</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>qadoop-nn001.apsalar.com:10020</value>
    <description>the JobHistoryServer address.</description>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>qadoop-nn001.apsalar.com:19888</value>
    <description>the JobHistoryServer web address</description>
  </property>
</configuration>
hbase-site.xml:
<configuration>
  <property>
    <name>hbase.master</name>
    <value>qadoop-nn001.apsalar.com:60000</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://qadoop-nn001.apsalar.com:8020/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/opt/local/zookeeper</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>qadoop-nn001.apsalar.com</value>
  </property>
  <property>
    <name>zookeeper.session.timeout</name>
    <value>180000</value>
  </property>
</configuration>
Reporter: Lee Hounshell

 There is an issue with Hadoop 2.7.0 when in distributed operation the 
 datanode is unable to reach the yarn scheduler.  In our yarn-site.xml, we 
 have defined this path to be:
 {code}
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>qadoop-nn001.apsalar.com:8030</value>
</property>
 {code}
 But when running an oozie job, the problem manifests when looking at the job 
 

[jira] [Commented] (YARN-3642) Hadoop2 yarn.resourcemanager.scheduler.address not loaded by RMProxy.java

2015-05-14 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14543551#comment-14543551
 ] 

Rohith commented on YARN-3642:
--

I think this is related to the /etc/hosts mapping. Does the /etc/hosts mapping 
for *qadoop-nn001.apsalar.com* exist on all of the NodeManager machines?
In your changed code you are setting an IP, which will work. Can you set the 
hostname and try? It won't work, I guess. Every NodeManager host would need an 
entry like the one below.
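
For illustration only (the IP address here is made up; it must be the real 
address of the RM host in your network):
{noformat}
# illustrative /etc/hosts entry required on every NodeManager machine
10.0.0.11   qadoop-nn001.apsalar.com   qadoop-nn001
{noformat}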

 Hadoop2 yarn.resourcemanager.scheduler.address not loaded by RMProxy.java
 -

 Key: YARN-3642
 URL: https://issues.apache.org/jira/browse/YARN-3642
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>qadoop-nn001.apsalar.com</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>qadoop-nn001.apsalar.com:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>qadoop-nn001.apsalar.com:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webap.address</name>
    <value>qadoop-nn001.apsalar.com:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>qadoop-nn001.apsalar.com:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>qadoop-nn001.apsalar.com:8033</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <description>Where to aggregate logs to.</description>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/var/log/hadoop/apps</value>
  </property>
  <property>
    <name>yarn.web-proxy.address</name>
    <value>qadoop-nn001.apsalar.com:8088</value>
  </property>
</configuration>
core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://qadoop-nn001.apsalar.com</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hdfs.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hdfs.groups</name>
    <value>*</value>
  </property>
</configuration>
hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/hadoop/nn</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/hadoop/dn/dfs</value>
  </property>
  <property>
    <name>dfs.http.address</name>
    <value>qadoop-nn001.apsalar.com:50070</value>
  </property>
  <property>
    <name>dfs.secondary.http.address</name>
    <value>qadoop-nn002.apsalar.com:50090</value>
  </property>
</configuration>
mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>qadoop-nn001.apsalar.com:8032</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>qadoop-nn001.apsalar.com:10020</value>
    <description>the JobHistoryServer address.</description>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>qadoop-nn001.apsalar.com:19888</value>
    <description>the JobHistoryServer web address</description>
  </property>
</configuration>
hbase-site.xml:
<configuration>
  <property>
    <name>hbase.master</name>
    <value>qadoop-nn001.apsalar.com:60000</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://qadoop-nn001.apsalar.com:8020/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/opt/local/zookeeper</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>qadoop-nn001.apsalar.com</value>
  </property>
  <property>
    <name>zookeeper.session.timeout</name>
    <value>180000</value>
  </property>
</configuration>
Reporter: Lee Hounshell

 There is an issue with Hadoop 2.7.0 when in distributed operation the 
 datanode is unable to reach the yarn scheduler.  In our yarn-site.xml, we 
 have defined this path to be:
 {code}
<property>

[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-14 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14543739#comment-14543739
 ] 

Rohith commented on YARN-3641:
--

bq. so we probably should still call ExitUtil.terminate.
I think this is the right way to avoid the JVM hanging during a graceful 
shutdown.

 NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
 in stopping NM's sub-services.
 ---

 Key: YARN-3641
 URL: https://issues.apache.org/jira/browse/YARN-3641
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, rolling upgrade
Affects Versions: 2.6.0
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Fix For: 2.7.1

 Attachments: YARN-3641.patch


 If NM' services not get stopped properly, we cannot start NM with enabling NM 
 restart with work preserving. The exception is as following:
 {noformat}
 org.apache.hadoop.service.ServiceStateException: 
 org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
 /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
 temporarily unavailable
   at 
 org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
 Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
 lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
 Resource temporarily unavailable
   at 
 org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
   at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
   at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   ... 5 more
 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
 (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down NodeManager at 
 c6403.ambari.apache.org/192.168.64.103
 /
 {noformat}
 The related code is as below in NodeManager.java:
 {code}
   @Override
   protected void serviceStop() throws Exception {
 if (isStopping.getAndSet(true)) {
   return;
 }
 super.serviceStop();
 stopRecoveryStore();
 DefaultMetricsSystem.shutdown();
   }
 {code}
 We can see we stop all NM registered services (NodeStatusUpdater, 
 LogAggregationService, ResourceLocalizationService, etc.) first. Any of 
 services get stopped with exception could cause stopRecoveryStore() get 
 skipped which means levelDB store is not get closed. So next time NM start, 
 it will get failed with exception above. 
 We should put stopRecoveryStore(); in a finally block.
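
A minimal sketch of that fix, reusing the fields from the snippet above (this 
may not match the committed patch exactly):
{code}
@Override
protected void serviceStop() throws Exception {
  if (isStopping.getAndSet(true)) {
    return;
  }
  try {
    super.serviceStop();
    DefaultMetricsSystem.shutdown();
  } finally {
    // Always close the leveldb store and release its LOCK file, even if a
    // sub-service failed to stop, so the next NM start can reopen the store.
    stopRecoveryStore();
  }
}
{code}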



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-14 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14543776#comment-14543776
 ] 

Rohith commented on YARN-3646:
--

Which version of Hadoop are you using? I don't see this problem in trunk or 
branch-2.

 Applications are getting stuck some times in case of retry policy forever
 -

 Key: YARN-3646
 URL: https://issues.apache.org/jira/browse/YARN-3646
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Reporter: Raju Bairishetti

 We have set  *yarn.resourcemanager.connect.wait-ms* to -1 to use  FOREVER 
 retry policy.
 Yarn client is infinitely retrying in case of exceptions from the RM as it is 
 using retrying policy as FOREVER. The problem is it is retrying for all kinds 
 of exceptions (like ApplicationNotFoundException), even though it is not a 
 connection failure. Due to this my application is not progressing further.
 *Yarn client should not retry infinitely in case of non connection failures.*
 We have written a simple yarn-client which is trying to get an application 
 report for an invalid  or older appId. ResourceManager is throwing an 
 ApplicationNotFoundException as this is an invalid or older appId.  But 
 because of retry policy FOREVER, client is keep on retrying for getting the 
 application report and ResourceManager is throwing 
 ApplicationNotFoundException continuously.
 {code}
 private void testYarnClientRetryPolicy() throws  Exception{
 YarnConfiguration conf = new YarnConfiguration();
 conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, 
 -1);
 YarnClient yarnClient = YarnClient.createYarnClient();
 yarnClient.init(conf);
 yarnClient.start();
 ApplicationId appId = ApplicationId.newInstance(1430126768987L, 
 10645);
 ApplicationReport report = yarnClient.getApplicationReport(appId);
 }
 {code}
 *RM logs:*
 {noformat}
 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call 
 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
 from 10.14.120.231:61621 Call#875162 Retry#0
 org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
 with id 'application_1430126768987_10645' doesn't exist in RM.
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
   at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
   at 
 org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
 
 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call 
 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
 from 10.14.120.231:61621 Call#875163 Retry#0
 
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3642) Hadoop2 yarn.resourcemanager.scheduler.address not loaded by RMProxy.java

2015-05-14 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544920#comment-14544920
 ] 

Rohith commented on YARN-3642:
--

How many NodeManagers are running? If more than one, then what I think 
happened in your case is that yarn-site.xml was never read by the client, i.e. 
the Oozie job, but you were still able to submit the job because you were 
probably submitting it from the local machine, i.e. where the RM is running. 
So with the default port the job can be submitted, but the ApplicationMaster 
is launched on a different machine, where a NodeManager is running. Since the 
scheduler address is not loaded from any configuration there, the AM tries to 
connect to the default address, i.e. 0.0.0.0:8030, which never connects.

I suggest making sure your yarn-site.xml is on the classpath before submitting 
the job, so the AM gets yarn.resourcemanager.scheduler.address and can connect 
to the RM. The other way is to explicitly set 
yarn.resourcemanager.scheduler.address from the job client, as sketched below.
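
A small sketch of that second workaround; the class name is illustrative, and 
the host/port are taken from the yarn-site.xml above:
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ExplicitSchedulerAddress {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // override yarn.resourcemanager.scheduler.address on the client side
    conf.set(YarnConfiguration.RM_SCHEDULER_ADDRESS,
        "qadoop-nn001.apsalar.com:8030");
    // hand 'conf' to the Job / YarnClient used for submission
    System.out.println(conf.get(YarnConfiguration.RM_SCHEDULER_ADDRESS));
  }
}
{code}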

 Hadoop2 yarn.resourcemanager.scheduler.address not loaded by RMProxy.java
 -

 Key: YARN-3642
 URL: https://issues.apache.org/jira/browse/YARN-3642
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>qadoop-nn001.apsalar.com</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>qadoop-nn001.apsalar.com:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>qadoop-nn001.apsalar.com:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webap.address</name>
    <value>qadoop-nn001.apsalar.com:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>qadoop-nn001.apsalar.com:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>qadoop-nn001.apsalar.com:8033</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <description>Where to aggregate logs to.</description>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/var/log/hadoop/apps</value>
  </property>
  <property>
    <name>yarn.web-proxy.address</name>
    <value>qadoop-nn001.apsalar.com:8088</value>
  </property>
</configuration>
core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://qadoop-nn001.apsalar.com</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hdfs.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hdfs.groups</name>
    <value>*</value>
  </property>
</configuration>
hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/hadoop/nn</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/hadoop/dn/dfs</value>
  </property>
  <property>
    <name>dfs.http.address</name>
    <value>qadoop-nn001.apsalar.com:50070</value>
  </property>
  <property>
    <name>dfs.secondary.http.address</name>
    <value>qadoop-nn002.apsalar.com:50090</value>
  </property>
</configuration>
mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>qadoop-nn001.apsalar.com:8032</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>qadoop-nn001.apsalar.com:10020</value>
    <description>the JobHistoryServer address.</description>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>qadoop-nn001.apsalar.com:19888</value>
    <description>the JobHistoryServer web address</description>
  </property>
</configuration>
hbase-site.xml:
<configuration>
  <property>
    <name>hbase.master</name>
    <value>qadoop-nn001.apsalar.com:60000</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://qadoop-nn001.apsalar.com:8020/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/opt/local/zookeeper</value>
  </property>
  <property>
 

[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-14 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544938#comment-14544938
 ] 

Rohith commented on YARN-3646:
--

Thanks for the explanation. I hit the problem on my machines too; last time 
when I tested, my configuration settings had an issue.

 Applications are getting stuck some times in case of retry policy forever
 -

 Key: YARN-3646
 URL: https://issues.apache.org/jira/browse/YARN-3646
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Reporter: Raju Bairishetti

 We have set  *yarn.resourcemanager.connect.wait-ms* to -1 to use  FOREVER 
 retry policy.
 Yarn client is infinitely retrying in case of exceptions from the RM as it is 
 using retrying policy as FOREVER. The problem is it is retrying for all kinds 
 of exceptions (like ApplicationNotFoundException), even though it is not a 
 connection failure. Due to this my application is not progressing further.
 *Yarn client should not retry infinitely in case of non connection failures.*
 We have written a simple yarn-client which is trying to get an application 
 report for an invalid  or older appId. ResourceManager is throwing an 
 ApplicationNotFoundException as this is an invalid or older appId.  But 
 because of retry policy FOREVER, client is keep on retrying for getting the 
 application report and ResourceManager is throwing 
 ApplicationNotFoundException continuously.
 {code}
 private void testYarnClientRetryPolicy() throws  Exception{
 YarnConfiguration conf = new YarnConfiguration();
 conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, 
 -1);
 YarnClient yarnClient = YarnClient.createYarnClient();
 yarnClient.init(conf);
 yarnClient.start();
 ApplicationId appId = ApplicationId.newInstance(1430126768987L, 
 10645);
 ApplicationReport report = yarnClient.getApplicationReport(appId);
 }
 {code}
 *RM logs:*
 {noformat}
 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call 
 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
 from 10.14.120.231:61621 Call#875162 Retry#0
 org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
 with id 'application_1430126768987_10645' doesn't exist in RM.
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
   at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
   at 
 org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
 
 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call 
 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
 from 10.14.120.231:61621 Call#875163 Retry#0
 
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

