[jira] [Commented] (YARN-2305) When a container is in reserved state then total cluster memory is displayed wrongly.

2015-06-16 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589294#comment-14589294
 ] 

Rohith commented on YARN-2305:
--

Updated the duplicated id link.

> When a container is in reserved state then total cluster memory is displayed 
> wrongly.
> -
>
> Key: YARN-2305
> URL: https://issues.apache.org/jira/browse/YARN-2305
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.1
>Reporter: J.Andreina
>Assignee: Sunil G
> Attachments: Capture.jpg
>
>
> ENV Details:
> =  
>  3 queues  :  a(50%),b(25%),c(25%) ---> All max utilization is set to 
> 100
>  2 Node cluster with total memory as 16GB
> TestSteps:
> =
>   Execute following 3 jobs with different memory configurations for 
> Map , reducer and AM task
>   ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=a 
> -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 
> -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=2048 
> /dir8 /preempt_85 (application_1405414066690_0023)
>  ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=b 
> -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 
> -Dyarn.app.mapreduce.am.resource.mb=2048 -Dmapreduce.reduce.memory.mb=2048 
> /dir2 /preempt_86 (application_1405414066690_0025)
>  
>  ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=c 
> -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=1024 
> -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=1024 
> /dir2 /preempt_62
> Issue
> =
>   when 2GB memory is in reserved state, total memory is shown as 
> 15GB and used as 15GB (while total memory is 16GB)
>  





[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang

2015-06-15 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587519#comment-14587519
 ] 

Rohith commented on YARN-3809:
--

This is an interesting scenario, but I am not sure why the thread pool size is 
fixed at 10 and is not configurable.
bq. the default RPC time out is 15 mins.. 
I see the RPC timeout is 1 minute; am I missing anything?
{code}
static final int DEFAULT_COMMAND_TIMEOUT = 60000;
...
  int expireIntvl = conf.getInt(NM_COMMAND_TIMEOUT, DEFAULT_COMMAND_TIMEOUT);
  proxy =
      (ContainerManagementProtocolPB) RPC.getProxy(
          ContainerManagementProtocolPB.class,
          clientVersion, addr, ugi, conf,
          NetUtils.getDefaultSocketFactory(conf), expireIntvl);
{code}
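For context, here is a minimal, self-contained sketch (not the actual ApplicationMasterLauncher code; the pool size and the sleep are stand-ins) of why a fixed pool whose workers all block on a dead-NM RPC cannot pick up new LAUNCH work until one of the calls times out:
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class FixedPoolSaturationSketch {
  public static void main(String[] args) {
    // Fixed pool of 10 threads, analogous to the launcher pool size.
    ExecutorService pool = Executors.newFixedThreadPool(10);

    // Ten CLEANUP-like tasks that block, simulating stopContainers() to a dead NM.
    for (int i = 0; i < 10; i++) {
      pool.submit(() -> {
        try {
          TimeUnit.MINUTES.sleep(15); // stand-in for the RPC timeout window
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }

    // A LAUNCH-like task queues behind the blocked workers and does not run
    // until one of them frees up.
    pool.submit(() -> System.out.println("launch new AM attempt"));
    pool.shutdown();
  }
}
{code}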

> Failed to launch new attempts because ApplicationMasterLauncher's threads all 
> hang
> --
>
> Key: YARN-3809
> URL: https://issues.apache.org/jira/browse/YARN-3809
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
>
> ApplicationMasterLauncher create a thread pool whose size is 10 to deal with 
> AMLauncherEventType(LAUNCH and CLEANUP).
> In our cluster, there were many NMs with 10+ AMs running on them, and one shut 
> down for some reason. After the RM found the NM LOST, it cleaned up the AMs 
> running on it. Then ApplicationMasterLauncher needed to handle these 10+ 
> CLEANUP events. ApplicationMasterLauncher's thread pool would be filled up, 
> and the threads all hang in containerMgrProxy.stopContainers(stopRequest) 
> because the NM was down and the default RPC time out is 15 mins. It means that 
> for 15 mins ApplicationMasterLauncher could not handle new events such as 
> LAUNCH, so new attempts fail to launch because of the time out.





[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging

2015-06-15 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587412#comment-14587412
 ] 

Rohith commented on YARN-3789:
--

Looks good to me too.. 

> Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
> --
>
> Key: YARN-3789
> URL: https://issues.apache.org/jira/browse/YARN-3789
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: 0001-YARN-3789.patch, 0002-YARN-3789.patch, 
> 0003-YARN-3789.patch, 0004-YARN-3789.patch, 0005-YARN-3789.patch
>
>
> Duplicate logging from resource manager
> during am limit check for each application
> {code}
> 015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> {code}





[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging

2015-06-15 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585664#comment-14585664
 ] 

Rohith commented on YARN-3789:
--

+1(non-binding)

> Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
> --
>
> Key: YARN-3789
> URL: https://issues.apache.org/jira/browse/YARN-3789
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: 0001-YARN-3789.patch, 0002-YARN-3789.patch, 
> 0003-YARN-3789.patch, 0004-YARN-3789.patch
>
>
> Duplicate logging from resource manager
> during am limit check for each application
> {code}
> 015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> {code}





[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler

2015-06-14 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585450#comment-14585450
 ] 

Rohith commented on YARN-3790:
--

Thanks @zhihai for your detailed explanation, I got the problem :-)
Overall the patch looks good to me. I think we should change this JIRA's 
component to scheduler, since the code change is in FairScheduler.

> TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in 
> trunk for FS scheduler
> 
>
> Key: YARN-3790
> URL: https://issues.apache.org/jira/browse/YARN-3790
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith
>Assignee: zhihai xu
> Attachments: YARN-3790.000.patch
>
>
> Failure trace is as follows
> {noformat}
> Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
> testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
>   Time elapsed: 6.502 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<6144> but was:<8192>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
> {noformat}





[jira] [Assigned] (YARN-1382) NodeListManager has a memory leak, unusableRMNodesConcurrentSet is never purged

2015-06-14 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith reassigned YARN-1382:


Assignee: Rohith

> NodeListManager has a memory leak, unusableRMNodesConcurrentSet is never 
> purged
> ---
>
> Key: YARN-1382
> URL: https://issues.apache.org/jira/browse/YARN-1382
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.2.0
>Reporter: Alejandro Abdelnur
>Assignee: Rohith
>
> If a node is in the unusable nodes set (unusableRMNodesConcurrentSet) and 
> never comes back, the node will be there forever.
> While the leak is not big, it gets aggravated if the NM addresses are 
> configured with ephemeral ports as when the nodes come back they come back as 
> new.
> Some related details in YARN-1343





[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-06-14 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585431#comment-14585431
 ] 

Rohith commented on YARN-3543:
--

Thanks [~xgong] for the review.
bq. Could we not directly change the ApplicationReport.newInstance() ? This 
will break other applications, such as Tez.
IIUC, ApplicationReport#newInstance() is annotated @Private, so other clients 
should not be able to use it. And in the earlier patch I had added a new method 
which does not break compatibility, but [~vinodkv] suggested that I not change 
this API in his review comment 
[link|https://issues.apache.org/jira/browse/YARN-3543?focusedCommentId=14533819&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14533819]

> ApplicationReport should be able to tell whether the Application is AM 
> managed or not. 
> ---
>
> Key: YARN-3543
> URL: https://issues.apache.org/jira/browse/YARN-3543
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api
>Affects Versions: 2.6.0
>Reporter: Spandan Dutta
>Assignee: Rohith
> Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
> 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
> 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 
> YARN-3543-AH.PNG, YARN-3543-RM.PNG
>
>
> Currently we can know whether the application submitted by the user is AM 
> managed from the applicationSubmissionContext. This can be only done  at the 
> time when the user submits the job. We should have access to this info from 
> the ApplicationReport as well so that we can check whether an app is AM 
> managed or not anytime.





[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler

2015-06-10 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580255#comment-14580255
 ] 

Rohith commented on YARN-3790:
--

Thanks for looking into this issue,
bq. If UpdateThread call update after recoverContainersOnNode, the test will 
succeed
In the test, I see the below code, which verifies that the containers are recovered, right?
{code}
// Wait for RM to settle down on recovering containers;
waitForNumContainersToRecover(2, rm2, am1.getApplicationAttemptId());
Set<ContainerId> launchedContainers =
    ((RMNodeImpl) rm2.getRMContext().getRMNodes().get(nm1.getNodeId()))
        .getLaunchedContainers();
assertTrue(launchedContainers.contains(amContainer.getContainerId()));
assertTrue(launchedContainers.contains(runningContainer.getContainerId()));
{code}

Am I missing anything?

> TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in 
> trunk for FS scheduler
> 
>
> Key: YARN-3790
> URL: https://issues.apache.org/jira/browse/YARN-3790
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith
>Assignee: zhihai xu
>
> Failure trace is as follows
> {noformat}
> Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
> testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
>   Time elapsed: 6.502 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<6144> but was:<8192>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
> {noformat}





[jira] [Updated] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler

2015-06-10 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3790:
-
Summary: TestWorkPreservingRMRestart#testSchedulerRecovery fails 
intermittently in trunk for FS scheduler  (was: 
TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS 
scheduler)

> TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in 
> trunk for FS scheduler
> 
>
> Key: YARN-3790
> URL: https://issues.apache.org/jira/browse/YARN-3790
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith
>Assignee: zhihai xu
>
> Failure trace is as follows
> {noformat}
> Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
> testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
>   Time elapsed: 6.502 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<6144> but was:<8192>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
> {noformat}





[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS scheduler

2015-06-10 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580228#comment-14580228
 ] 

Rohith commented on YARN-3790:
--

bq. I think this test fails intermittently.
Yes, it is failing intermittently. Maybe the issue summary can be updated.

> TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS 
> scheduler
> -
>
> Key: YARN-3790
> URL: https://issues.apache.org/jira/browse/YARN-3790
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith
>Assignee: zhihai xu
>
> Failure trace is as follows
> {noformat}
> Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
> testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
>   Time elapsed: 6.502 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<6144> but was:<8192>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.junit.Assert.assertEquals(Assert.java:542)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
> {noformat}





[jira] [Created] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS scheduler

2015-06-09 Thread Rohith (JIRA)
Rohith created YARN-3790:


 Summary: TestWorkPreservingRMRestart#testSchedulerRecovery fails 
in trunk for FS scheduler
 Key: YARN-3790
 URL: https://issues.apache.org/jira/browse/YARN-3790
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Rohith


Failure trace is as follows

{noformat}
Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec 
<<< FAILURE! - in 
org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)
  Time elapsed: 6.502 sec  <<< FAILURE!
java.lang.AssertionError: expected:<6144> but was:<8192>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
{noformat}





[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging

2015-06-09 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579198#comment-14579198
 ] 

Rohith commented on YARN-3789:
--

I think, instead of *Not starting*, *Not activating the application* would be 
more meaningful.

> Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
> --
>
> Key: YARN-3789
> URL: https://issues.apache.org/jira/browse/YARN-3789
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: 0001-YARN-3789.patch
>
>
> Duplicate logging from resource manager
> during am limit check for each application
> {code}
> 015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> {code}





[jira] [Commented] (YARN-3788) Application Master and Task Tracker timeouts are applied incorrectly

2015-06-09 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579188#comment-14579188
 ] 

Rohith commented on YARN-3788:
--

This is a MapReduce project issue/query; moving it to MR for further discussion.

> Application Master and Task Tracker timeouts are applied incorrectly
> 
>
> Key: YARN-3788
> URL: https://issues.apache.org/jira/browse/YARN-3788
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.1
>Reporter: Dmitry Sivachenko
>
> I am running a streaming job which requires a big (~50GB) data file to run 
> (file is attached via hadoop jar <...> -file BigFile.dat).
> Most likely this command will fail as follows (note that error message is 
> rather meaningless):
> 2015-05-27 15:55:00,754 WARN  [main] streaming.StreamJob 
> (StreamJob.java:parseArgv(291)) - -file option is deprecated, please use 
> generic option -files instead.
> packageJobJar: [/ssd/mt/lm/en_reorder.ylm, mapper.py, 
> /tmp/hadoop-mitya/hadoop-unjar3778165585140840383/] [] 
> /var/tmp/streamjob633547925483233845.jar tmpDir=null
> 2015-05-27 19:46:22,942 INFO  [main] client.RMProxy 
> (RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager at 
> nezabudka1-00.yandex.ru/5.255.231.129:8032
> 2015-05-27 19:46:23,733 INFO  [main] client.RMProxy 
> (RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager at 
> nezabudka1-00.yandex.ru/5.255.231.129:8032
> 2015-05-27 20:13:37,231 INFO  [main] mapred.FileInputFormat 
> (FileInputFormat.java:listStatus(247)) - Total input paths to process : 1
> 2015-05-27 20:13:38,110 INFO  [main] mapreduce.JobSubmitter 
> (JobSubmitter.java:submitJobInternal(396)) - number of splits:1
> 2015-05-27 20:13:38,136 INFO  [main] Configuration.deprecation 
> (Configuration.java:warnOnceIfDeprecated(1009)) - mapred.reduce.tasks is 
> deprecated. Instead, use mapreduce.job.reduces
> 2015-05-27 20:13:38,390 INFO  [main] mapreduce.JobSubmitter 
> (JobSubmitter.java:printTokens(479)) - Submitting tokens for job: 
> job_1431704916575_2531
> 2015-05-27 20:13:38,689 INFO  [main] impl.YarnClientImpl 
> (YarnClientImpl.java:submitApplication(204)) - Submitted application 
> application_1431704916575_2531
> 2015-05-27 20:13:38,743 INFO  [main] mapreduce.Job (Job.java:submit(1289)) - 
> The url to track the job: 
> http://nezabudka1-00.yandex.ru:8088/proxy/application_1431704916575_2531/
> 2015-05-27 20:13:38,746 INFO  [main] mapreduce.Job 
> (Job.java:monitorAndPrintJob(1334)) - Running job: job_1431704916575_2531
> 2015-05-27 21:04:12,353 INFO  [main] mapreduce.Job 
> (Job.java:monitorAndPrintJob(1355)) - Job job_1431704916575_2531 running in 
> uber mode : false
> 2015-05-27 21:04:12,356 INFO  [main] mapreduce.Job 
> (Job.java:monitorAndPrintJob(1362)) - map 0% reduce 0%
> 2015-05-27 21:04:12,374 INFO  [main] mapreduce.Job 
> (Job.java:monitorAndPrintJob(1375)) - Job job_1431704916575_2531 failed with 
> state FAILED due to: Application application_1431704916575_2531 failed 2 
> times due to ApplicationMaster for attempt 
> appattempt_1431704916575_2531_02 timed out. Failing the application.
> 2015-05-27 21:04:12,473 INFO  [main] mapreduce.Job 
> (Job.java:monitorAndPrintJob(1380)) - Counters: 0
> 2015-05-27 21:04:12,474 ERROR [main] streaming.StreamJob 
> (StreamJob.java:submitAndMonitorJob(1019)) - Job not Successful!
> Streaming Command Failed!
> This is because yarn.am.liveness-monitor.expiry-interval-ms (defaults to 600 
> sec) timeout expires before large data file is transferred.
> Next step I increase yarn.am.liveness-monitor.expiry-interval-ms.  After that 
> application is successfully initialized and tasks are spawned.
> But I encounter another error: the default 600 seconds mapreduce.task.timeout 
> expires before tasks are initialized and tasks fail.
> Error message Task attempt_XXX failed to report status for 600 seconds is 
> also misleading: this timeout is supposed to kill non-responsive (stuck) 
> tasks but it rather strikes because auxiliary data files are copying slowly.
> So I need to increase mapreduce.task.timeout too and only after that my job 
> is successful.
> At the very least error messages need to be tweaked to indicate that 
> Application (or Task) is failing because auxiliary files are not copied 
> during that time, not just generic "timeout expired".
> Better solution would be not to account time spent for data files 
> distribution.





[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging

2015-06-09 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579184#comment-14579184
 ] 

Rohith commented on YARN-3789:
--

Thanks [~bibinchundatt] for reporting and providing the patch.
Some comments:
# The log message can be made clearer for log analysis. The messages could be 
along these lines (see the sketch after this list):
## Not starting the application < applicationId > as usedAMResource 
< amIfStarted > exceeds AMResourceLimit < amLimit >
## Not starting the application < applicationId > for the user < user > as 
usedUserAMResource < userAmIfStarted > exceeds userAMResourceLimit < 
userAMLimit >
# Can you update the issue summary and description to reflect the real problem, 
i.e. the issue is log message correction, not removing duplicate logging.
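A rough illustration of the first suggestion (a sketch only; the logger setup, variable names and values below are hypothetical, not the committed patch):
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class ActivationLogSketch {
  private static final Log LOG = LogFactory.getLog(ActivationLogSketch.class);

  public static void main(String[] args) {
    // Hypothetical values, purely for illustration.
    String applicationId = "application_1433851936622_0001";
    String user = "bibin";
    int amIfStarted = 3072, amLimit = 2048;

    // Including the application id and the actual numbers makes each line
    // distinguishable during log analysis.
    LOG.info("Not activating application " + applicationId + " for user " + user
        + " as amIfStarted: " + amIfStarted + " MB"
        + " exceeds amLimit: " + amLimit + " MB");
  }
}
{code}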

> Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
> --
>
> Key: YARN-3789
> URL: https://issues.apache.org/jira/browse/YARN-3789
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Minor
> Attachments: 0001-YARN-3789.patch
>
>
> Duplicate logging from resource manager
> during am limit check for each application
> {code}
> 015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> 2015-06-09 17:32:40,019 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> not starting application as amIfStarted exceeds amLimit
> {code}





[jira] [Commented] (YARN-3697) FairScheduler: ContinuousSchedulingThread can't be shutdown after stop sometimes.

2015-06-09 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578677#comment-14578677
 ] 

Rohith commented on YARN-3697:
--

Hi [~zxu], 
 Trying to understand the problem: does it occur when RM shutdown is called, 
which tries to stop the FS service? Does it cause the RM to hang during 
shutdown?
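For reference, a minimal standalone illustration (not the FairScheduler code; the loop body is a stand-in) of how a catch (Throwable) around the loop body hides the interrupt unless the interrupt status is restored:
{code}
public class InterruptSwallowSketch {
  public static void main(String[] args) throws InterruptedException {
    Thread worker = new Thread(() -> {
      while (!Thread.currentThread().isInterrupted()) {
        try {
          Thread.sleep(100); // stand-in for attemptScheduling(node)
        } catch (Throwable t) {
          // sleep() clears the interrupt flag when it throws; if the exception
          // is only logged here, the while condition never sees the interrupt
          // and the thread cannot be shut down. Restoring the flag lets the
          // loop terminate.
          Thread.currentThread().interrupt();
        }
      }
      System.out.println("scheduling thread exited cleanly");
    });
    worker.start();
    worker.interrupt();
    worker.join();
  }
}
{code}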

> FairScheduler: ContinuousSchedulingThread can't be shutdown after stop 
> sometimes. 
> --
>
> Key: YARN-3697
> URL: https://issues.apache.org/jira/browse/YARN-3697
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.0
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-3697.000.patch
>
>
> FairScheduler: ContinuousSchedulingThread can't be shutdown after stop 
> sometimes. 
> The reason is because the InterruptedException is blocked in 
> continuousSchedulingAttempt
> {code}
>   try {
> if (node != null && Resources.fitsIn(minimumAllocation,
> node.getAvailableResource())) {
>   attemptScheduling(node);
> }
>   } catch (Throwable ex) {
> LOG.error("Error while attempting scheduling for node " + node +
> ": " + ex.toString(), ex);
>   }
> {code}
> I saw the following exception after stop:
> {code}
> 2015-05-17 23:30:43,065 WARN  [FairSchedulerContinuousScheduling] 
> event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher 
> thread interrupted
> java.lang.InterruptedException
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>   at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>   at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:462)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:387)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:58)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.allocate(FSAppAttempt.java:357)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:516)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:649)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:803)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:334)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:173)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1082)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1014)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:285)
> 2015-05-17 23:30:43,066 ERROR [FairSchedulerContinuousScheduling] 
> fair.FairScheduler (FairScheduler.java:continuousSchedulingAttempt(1017)) - 
> Error while attempting scheduling for node host: 127.0.0.2:2 #containers=1 
> available= used=: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.InterruptedException
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:249)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerS

[jira] [Commented] (YARN-3017) ContainerID in ResourceManager Log Has Slightly Different Format From AppAttemptID

2015-06-08 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578296#comment-14578296
 ] 

Rohith commented on YARN-3017:
--

Thanks [~ozawa] for confirmation:-)

> ContainerID in ResourceManager Log Has Slightly Different Format From 
> AppAttemptID
> --
>
> Key: YARN-3017
> URL: https://issues.apache.org/jira/browse/YARN-3017
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.8.0
>Reporter: MUFEED USMAN
>Priority: Minor
>  Labels: PatchAvailable
> Attachments: YARN-3017.patch, YARN-3017_1.patch, YARN-3017_2.patch
>
>
> Not sure if this should be filed as a bug or not.
> In the ResourceManager log in the events surrounding the creation of a new
> application attempt,
> ...
> ...
> 2014-11-14 17:45:37,258 INFO
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching
> masterappattempt_1412150883650_0001_02
> ...
> ...
> The application attempt has the ID format "_1412150883650_0001_02".
> Whereas the associated ContainerID goes by "_1412150883650_0001_02_".
> ...
> ...
> 2014-11-14 17:45:37,260 INFO
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting 
> up
> container Container: [ContainerId: container_1412150883650_0001_02_01,
> NodeId: n67:55933, NodeHttpAddress: n67:8042, Resource:  vCores:1,
> disks:0.0>, Priority: 0, Token: Token { kind: ContainerToken, service:
> 10.10.70.67:55933 }, ] for AM appattempt_1412150883650_0001_02
> ...
> ...
> Curious to know if this is kept like that for a reason. If not while using
> filtering tools to, say, grep events surrounding a specific attempt by the
> numeric ID part information may slip out during troubleshooting.





[jira] [Resolved] (YARN-3775) Job does not exit after all node become unhealthy

2015-06-08 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith resolved YARN-3775.
--
Resolution: Not A Problem

Closing as Not A Problem. Please reopen if you disagree.

> Job does not exit after all node become unhealthy
> -
>
> Key: YARN-3775
> URL: https://issues.apache.org/jira/browse/YARN-3775
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
> Environment: Environment:
> Version : 2.7.0
> OS: RHEL7 
> NameNodes:  xiachsh11 xiachsh12 (HA enabled)
> DataNodes:  5 xiachsh13-17
> ResourceManage:  xiachsh11
> NodeManage: 5 xiachsh13-17 
> all nodes are openstack provisioned:  
> MEM: 1.5G 
> Disk: 16G 
>Reporter: Chengshun Xia
> Attachments: logs.tar.gz
>
>
> Running Terasort with data size 10G, all the containers exit since the disk 
> space threshold 0.90 reached,at this point,the job does not exit with error 
> 15/06/05 13:13:28 INFO mapreduce.Job:  map 9% reduce 0%
> 15/06/05 13:13:52 INFO mapreduce.Job:  map 10% reduce 0%
> 15/06/05 13:14:30 INFO mapreduce.Job:  map 11% reduce 0%
> 15/06/05 13:15:11 INFO mapreduce.Job:  map 12% reduce 0%
> 15/06/05 13:15:43 INFO mapreduce.Job:  map 13% reduce 0%
> 15/06/05 13:16:38 INFO mapreduce.Job:  map 14% reduce 0%
> 15/06/05 13:16:41 INFO mapreduce.Job:  map 15% reduce 0%
> 15/06/05 13:16:53 INFO mapreduce.Job:  map 16% reduce 0%
> 15/06/05 13:17:24 INFO mapreduce.Job:  map 17% reduce 0%
> 15/06/05 13:17:53 INFO mapreduce.Job:  map 18% reduce 0%
> 15/06/05 13:18:36 INFO mapreduce.Job:  map 19% reduce 0%
> 15/06/05 13:19:03 INFO mapreduce.Job:  map 20% reduce 0%
> 15/06/05 13:19:09 INFO mapreduce.Job:  map 15% reduce 0%
> 15/06/05 13:19:32 INFO mapreduce.Job:  map 16% reduce 0%
> 15/06/05 13:20:00 INFO mapreduce.Job:  map 17% reduce 0%
> 15/06/05 13:20:36 INFO mapreduce.Job:  map 18% reduce 0%
> 15/06/05 13:20:57 INFO mapreduce.Job:  map 19% reduce 0%
> 15/06/05 13:21:22 INFO mapreduce.Job:  map 18% reduce 0%
> 15/06/05 13:21:24 INFO mapreduce.Job:  map 14% reduce 0%
> 15/06/05 13:21:25 INFO mapreduce.Job:  map 9% reduce 0%
> 15/06/05 13:21:28 INFO mapreduce.Job:  map 10% reduce 0%
> 15/06/05 13:22:22 INFO mapreduce.Job:  map 11% reduce 0%
> 15/06/05 13:23:06 INFO mapreduce.Job:  map 12% reduce 0%
> 15/06/05 13:23:41 INFO mapreduce.Job:  map 9% reduce 0%
> 15/06/05 13:23:42 INFO mapreduce.Job:  map 5% reduce 0%
> 15/06/05 13:24:38 INFO mapreduce.Job:  map 6% reduce 0%
> 15/06/05 13:25:16 INFO mapreduce.Job:  map 7% reduce 0%
> 15/06/05 13:25:53 INFO mapreduce.Job:  map 8% reduce 0%
> 15/06/05 13:26:35 INFO mapreduce.Job:  map 9% reduce 0%
> the last response time is  15/06/05 13:26:35
> and current time :
> [root@xiachsh11 logs]# date
> Fri Jun  5 19:19:59 EDT 2015
> [root@xiachsh11 logs]#
> [root@xiachsh11 logs]# yarn node -list
> 15/06/05 19:20:18 INFO client.RMProxy: Connecting to ResourceManager at 
> xiachsh11.eng.platformlab.ibm.com/9.21.62.234:8032
> Total Nodes:0
>  Node-Id Node-State Node-Http-Address   
> Number-of-Running-Containers
> [root@xiachsh11 logs]#





[jira] [Commented] (YARN-3775) Job does not exit after all node become unhealthy

2015-06-08 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577065#comment-14577065
 ] 

Rohith commented on YARN-3775:
--

[~xiachengs...@yeah.net] Thanks for reporting the issue. IIUC, this is expected 
behavior.
If the application attempt is killed for one of the following reasons, then the 
current attempt failure is not counted towards the attempt-failure count:
# Preempted
# Aborted
# Disks_failed (i.e. NM unhealthy)
# Killed by ResourceManager

In your case, the application attempt got killed because of disks_failed, which 
the RM never considers an attempt failure. So the RM waits to launch and run 
this application on any NM that registers to it later.
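A simplified sketch of that check (illustrative only, not the exact RMAppAttemptImpl logic):
{code}
import org.apache.hadoop.yarn.api.records.ContainerExitStatus;

public class AttemptRetrySketch {
  // Attempts whose AM container exited for these reasons are not counted
  // against the maximum AM attempts.
  static boolean countsTowardsMaxAttemptRetry(int amContainerExitStatus) {
    switch (amContainerExitStatus) {
      case ContainerExitStatus.PREEMPTED:
      case ContainerExitStatus.ABORTED:
      case ContainerExitStatus.DISKS_FAILED:
      case ContainerExitStatus.KILLED_BY_RESOURCEMANAGER:
        return false;
      default:
        return true;
    }
  }

  public static void main(String[] args) {
    // DISKS_FAILED (NM unhealthy) does not burn an attempt.
    System.out.println(countsTowardsMaxAttemptRetry(ContainerExitStatus.DISKS_FAILED));
  }
}
{code}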

> Job does not exit after all node become unhealthy
> -
>
> Key: YARN-3775
> URL: https://issues.apache.org/jira/browse/YARN-3775
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.1
> Environment: Environment:
> Version : 2.7.0
> OS: RHEL7 
> NameNodes:  xiachsh11 xiachsh12 (HA enabled)
> DataNodes:  5 xiachsh13-17
> ResourceManage:  xiachsh11
> NodeManage: 5 xiachsh13-17 
> all nodes are openstack provisioned:  
> MEM: 1.5G 
> Disk: 16G 
>Reporter: Chengshun Xia
> Attachments: logs.tar.gz
>
>
> Running Terasort with data size 10G, all the containers exit since the disk 
> space threshold 0.90 reached,at this point,the job does not exit with error 
> 15/06/05 13:13:28 INFO mapreduce.Job:  map 9% reduce 0%
> 15/06/05 13:13:52 INFO mapreduce.Job:  map 10% reduce 0%
> 15/06/05 13:14:30 INFO mapreduce.Job:  map 11% reduce 0%
> 15/06/05 13:15:11 INFO mapreduce.Job:  map 12% reduce 0%
> 15/06/05 13:15:43 INFO mapreduce.Job:  map 13% reduce 0%
> 15/06/05 13:16:38 INFO mapreduce.Job:  map 14% reduce 0%
> 15/06/05 13:16:41 INFO mapreduce.Job:  map 15% reduce 0%
> 15/06/05 13:16:53 INFO mapreduce.Job:  map 16% reduce 0%
> 15/06/05 13:17:24 INFO mapreduce.Job:  map 17% reduce 0%
> 15/06/05 13:17:53 INFO mapreduce.Job:  map 18% reduce 0%
> 15/06/05 13:18:36 INFO mapreduce.Job:  map 19% reduce 0%
> 15/06/05 13:19:03 INFO mapreduce.Job:  map 20% reduce 0%
> 15/06/05 13:19:09 INFO mapreduce.Job:  map 15% reduce 0%
> 15/06/05 13:19:32 INFO mapreduce.Job:  map 16% reduce 0%
> 15/06/05 13:20:00 INFO mapreduce.Job:  map 17% reduce 0%
> 15/06/05 13:20:36 INFO mapreduce.Job:  map 18% reduce 0%
> 15/06/05 13:20:57 INFO mapreduce.Job:  map 19% reduce 0%
> 15/06/05 13:21:22 INFO mapreduce.Job:  map 18% reduce 0%
> 15/06/05 13:21:24 INFO mapreduce.Job:  map 14% reduce 0%
> 15/06/05 13:21:25 INFO mapreduce.Job:  map 9% reduce 0%
> 15/06/05 13:21:28 INFO mapreduce.Job:  map 10% reduce 0%
> 15/06/05 13:22:22 INFO mapreduce.Job:  map 11% reduce 0%
> 15/06/05 13:23:06 INFO mapreduce.Job:  map 12% reduce 0%
> 15/06/05 13:23:41 INFO mapreduce.Job:  map 9% reduce 0%
> 15/06/05 13:23:42 INFO mapreduce.Job:  map 5% reduce 0%
> 15/06/05 13:24:38 INFO mapreduce.Job:  map 6% reduce 0%
> 15/06/05 13:25:16 INFO mapreduce.Job:  map 7% reduce 0%
> 15/06/05 13:25:53 INFO mapreduce.Job:  map 8% reduce 0%
> 15/06/05 13:26:35 INFO mapreduce.Job:  map 9% reduce 0%
> the last response time is  15/06/05 13:26:35
> and current time :
> [root@xiachsh11 logs]# date
> Fri Jun  5 19:19:59 EDT 2015
> [root@xiachsh11 logs]#
> [root@xiachsh11 logs]# yarn node -list
> 15/06/05 19:20:18 INFO client.RMProxy: Connecting to ResourceManager at 
> xiachsh11.eng.platformlab.ibm.com/9.21.62.234:8032
> Total Nodes:0
>  Node-Id Node-State Node-Http-Address   
> Number-of-Running-Containers
> [root@xiachsh11 logs]#





[jira] [Commented] (YARN-3508) Preemption processing occuring on the main RM dispatcher

2015-06-08 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577026#comment-14577026
 ] 

Rohith commented on YARN-3508:
--

The problem I see with clubbing these into scheduler events is that if there 
are many scheduler events already in the event queue, then it delays triggering 
the preemption events. As [~varun_saxena] said, container preemption events 
should be considered higher priority than scheduler events. Having a separate 
event dispatcher for preemption events would allow them to participate in 
obtaining the lock at earlier stages rather than waiting for the scheduler 
event queue to drain. I think the current patch approach makes sense to me, 
i.e. having an individual dispatcher thread for preemption events. 
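As a rough illustration (a sketch under assumptions: the event type and the handler body here are hypothetical, not the actual patch), registering preemption events on their own AsyncDispatcher keeps them off the main RM dispatcher queue:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.event.AsyncDispatcher;
import org.apache.hadoop.yarn.event.Event;
import org.apache.hadoop.yarn.event.EventHandler;

public class PreemptionDispatcherSketch {
  // Hypothetical event type, for illustration only.
  enum PreemptionEventType { PREEMPT_CONTAINER, KILL_CONTAINER }

  public static void main(String[] args) {
    // A dispatcher separate from the main RM dispatcher, so preemption events
    // contend for the scheduler lock on their own thread instead of queueing
    // behind scheduler events.
    AsyncDispatcher preemptionDispatcher = new AsyncDispatcher();
    preemptionDispatcher.register(PreemptionEventType.class,
        new EventHandler<Event>() {
          @Override
          public void handle(Event event) {
            // take the scheduler lock and apply the preemption here
          }
        });
    preemptionDispatcher.init(new Configuration());
    preemptionDispatcher.start();
    preemptionDispatcher.stop();
  }
}
{code}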

> Preemption processing occuring on the main RM dispatcher
> 
>
> Key: YARN-3508
> URL: https://issues.apache.org/jira/browse/YARN-3508
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, scheduler
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>Assignee: Varun Saxena
> Attachments: YARN-3508.002.patch, YARN-3508.01.patch
>
>
> We recently saw the RM for a large cluster lag far behind on the 
> AsyncDispacher event queue.  The AsyncDispatcher thread was consistently 
> blocked on the highly-contended CapacityScheduler lock trying to dispatch 
> preemption-related events for RMContainerPreemptEventDispatcher.  Preemption 
> processing should occur on the scheduler event dispatcher thread or a 
> separate thread to avoid delaying the processing of other events in the 
> primary dispatcher queue.





[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED

2015-06-08 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576774#comment-14576774
 ] 

Rohith commented on YARN-3535:
--

Recently, in testing, we faced the same issue. [~peng.zhang], would you mind 
updating the patch?

>  ResourceRequest should be restored back to scheduler when RMContainer is 
> killed at ALLOCATED
> -
>
> Key: YARN-3535
> URL: https://issues.apache.org/jira/browse/YARN-3535
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Peng Zhang
>Priority: Critical
>  Labels: BB2015-05-TBR
> Attachments: YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, 
> yarn-app.log
>
>
> During rolling update of NM, AM start of container on NM failed. 
> And then job hang there.
> Attach AM logs.





[jira] [Updated] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED

2015-06-08 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3535:
-
Priority: Critical  (was: Major)

>  ResourceRequest should be restored back to scheduler when RMContainer is 
> killed at ALLOCATED
> -
>
> Key: YARN-3535
> URL: https://issues.apache.org/jira/browse/YARN-3535
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Peng Zhang
>Priority: Critical
>  Labels: BB2015-05-TBR
> Attachments: YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, 
> yarn-app.log
>
>
> During rolling update of NM, AM start of container on NM failed. 
> And then job hang there.
> Attach AM logs.





[jira] [Commented] (YARN-3017) ContainerID in ResourceManager Log Has Slightly Different Format From AppAttemptID

2015-06-07 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576671#comment-14576671
 ] 

Rohith commented on YARN-3017:
--

+1 lgtm (non-binding)

> ContainerID in ResourceManager Log Has Slightly Different Format From 
> AppAttemptID
> --
>
> Key: YARN-3017
> URL: https://issues.apache.org/jira/browse/YARN-3017
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.8.0
>Reporter: MUFEED USMAN
>Priority: Minor
>  Labels: PatchAvailable
> Attachments: YARN-3017.patch, YARN-3017_1.patch, YARN-3017_2.patch
>
>
> Not sure if this should be filed as a bug or not.
> In the ResourceManager log in the events surrounding the creation of a new
> application attempt,
> ...
> ...
> 2014-11-14 17:45:37,258 INFO
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching
> masterappattempt_1412150883650_0001_02
> ...
> ...
> The application attempt has the ID format "_1412150883650_0001_02".
> Whereas the associated ContainerID goes by "_1412150883650_0001_02_".
> ...
> ...
> 2014-11-14 17:45:37,260 INFO
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting 
> up
> container Container: [ContainerId: container_1412150883650_0001_02_01,
> NodeId: n67:55933, NodeHttpAddress: n67:8042, Resource:  vCores:1,
> disks:0.0>, Priority: 0, Token: Token { kind: ContainerToken, service:
> 10.10.70.67:55933 }, ] for AM appattempt_1412150883650_0001_02
> ...
> ...
> Curious to know if this is kept like that for a reason. If not while using
> filtering tools to, say, grep events surrounding a specific attempt by the
> numeric ID part information may slip out during troubleshooting.





[jira] [Commented] (YARN-3017) ContainerID in ResourceManager Log Has Slightly Different Format From AppAttemptID

2015-06-07 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576652#comment-14576652
 ] 

Rohith commented on YARN-3017:
--

I see.. Thanks for the detailed explanation..

> ContainerID in ResourceManager Log Has Slightly Different Format From 
> AppAttemptID
> --
>
> Key: YARN-3017
> URL: https://issues.apache.org/jira/browse/YARN-3017
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.8.0
>Reporter: MUFEED USMAN
>Priority: Minor
>  Labels: PatchAvailable
> Attachments: YARN-3017.patch, YARN-3017_1.patch, YARN-3017_2.patch
>
>
> Not sure if this should be filed as a bug or not.
> In the ResourceManager log in the events surrounding the creation of a new
> application attempt,
> ...
> ...
> 2014-11-14 17:45:37,258 INFO
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching
> masterappattempt_1412150883650_0001_02
> ...
> ...
> The application attempt has the ID format "_1412150883650_0001_02".
> Whereas the associated ContainerID goes by "_1412150883650_0001_02_".
> ...
> ...
> 2014-11-14 17:45:37,260 INFO
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting 
> up
> container Container: [ContainerId: container_1412150883650_0001_02_01,
> NodeId: n67:55933, NodeHttpAddress: n67:8042, Resource:  vCores:1,
> disks:0.0>, Priority: 0, Token: Token { kind: ContainerToken, service:
> 10.10.70.67:55933 }, ] for AM appattempt_1412150883650_0001_02
> ...
> ...
> Curious to know if this is kept like that for a reason. If not while using
> filtering tools to, say, grep events surrounding a specific attempt by the
> numeric ID part information may slip out during troubleshooting.





[jira] [Commented] (YARN-3780) Should use equals when compare Resource in RMNodeImpl#ReconnectNodeTransition

2015-06-07 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576550#comment-14576550
 ] 

Rohith commented on YARN-3780:
--

Makes sense.
+1 lgtm (non-binding)
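A minimal sketch of the difference (illustrative capability values only, not the patch itself):
{code}
import org.apache.hadoop.yarn.api.records.Resource;

public class ResourceCompareSketch {
  public static void main(String[] args) {
    Resource oldCapability = Resource.newInstance(4096, 4);
    Resource newCapability = Resource.newInstance(4096, 4);

    // Reference comparison: true here means "different objects", even though
    // the capability did not change, so ReconnectNodeTransition would fire an
    // unnecessary NodeResourceUpdateSchedulerEvent.
    System.out.println(oldCapability != newCapability); // true

    // Value comparison: the capabilities are equal, so no update is needed.
    System.out.println(oldCapability.equals(newCapability)); // true
  }
}
{code}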

> Should use equals when compare Resource in RMNodeImpl#ReconnectNodeTransition
> -
>
> Key: YARN-3780
> URL: https://issues.apache.org/jira/browse/YARN-3780
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-3780.000.patch
>
>
> Should use equals when compare Resource in RMNodeImpl#ReconnectNodeTransition 
> to avoid unnecessary NodeResourceUpdateSchedulerEvent.
> The current code use {{!=}} to compare Resource totalCapability, which will 
> compare reference not the real value in Resource. So we should use equals to 
> compare Resource.





[jira] [Commented] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working as expected in FairScheduler

2015-06-05 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574393#comment-14574393
 ] 

Rohith commented on YARN-3758:
--

All this confusion should probably be resolved by YARN-2986. This issue can be 
raised there to check whether they will be handling it.

> The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not 
> working as expected in FairScheduler
> 
>
> Key: YARN-3758
> URL: https://issues.apache.org/jira/browse/YARN-3758
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: skrho
>
> Hello there~~
> I have 2 clusters
> First cluster is 5 node , default 1 application queue, Capacity scheduler, 8G 
> Physical memory each node
> Second cluster is 10 node, 2 application queuey, fair-scheduler, 230G 
> Physical memory each node
> Wherever a mapreduce job is running, I want resourcemanager is to set the 
> minimum memory  256m to container
> So I was changing configuration in yarn-site.xml & mapred-site.xml
> yarn.scheduler.minimum-allocation-mb : 256
> mapreduce.map.java.opts : -Xms256m 
> mapreduce.reduce.java.opts : -Xms256m 
> mapreduce.map.memory.mb : 256 
> mapreduce.reduce.memory.mb : 256 
> In First cluster  whenever a mapreduce job is running , I can see used memory 
> 256m in web console( http://installedIP:8088/cluster/nodes )
> But In Second cluster whenever a mapreduce job is running , I can see used 
> memory 1024m in web console( http://installedIP:8088/cluster/nodes ) 
> I know default memory value is 1024m, so if there is not changing memory 
> setting, the default value is working.
> I have been testing for two weeks, but I don't know why mimimum memory 
> setting is not working in second cluster
> Why this difference is happened? 
> Am I wrong setting configuration?
> or Is there bug?
> Thank you for reading~~





[jira] [Commented] (YARN-3017) ContainerID in ResourceManager Log Has Slightly Different Format From AppAttemptID

2015-06-05 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574228#comment-14574228
 ] 

Rohith commented on YARN-3017:
--

bq. Could you give a little more detail about the possibility to break the 
rolling upgrade?
I was wondering whether it causes any issue while parsing the containerId after 
the upgrade. Say the current container id format is 
container_1430441527236_0001_01_01, which is running on NM-1; after the 
upgrade the container-id format changes to container_1430441527236_0001_01_01, 
but the NM reports running containers as container_1430441527236_0001_01_01. 

> ContainerID in ResourceManager Log Has Slightly Different Format From 
> AppAttemptID
> --
>
> Key: YARN-3017
> URL: https://issues.apache.org/jira/browse/YARN-3017
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.8.0
>Reporter: MUFEED USMAN
>Priority: Minor
>  Labels: PatchAvailable
> Attachments: YARN-3017.patch, YARN-3017_1.patch, YARN-3017_2.patch
>
>
> Not sure if this should be filed as a bug or not.
> In the ResourceManager log in the events surrounding the creation of a new
> application attempt,
> ...
> ...
> 2014-11-14 17:45:37,258 INFO
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching
> masterappattempt_1412150883650_0001_02
> ...
> ...
> The application attempt has the ID format "_1412150883650_0001_02".
> Whereas the associated ContainerID goes by "_1412150883650_0001_02_".
> ...
> ...
> 2014-11-14 17:45:37,260 INFO
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting 
> up
> container Container: [ContainerId: container_1412150883650_0001_02_01,
> NodeId: n67:55933, NodeHttpAddress: n67:8042, Resource:  vCores:1,
> disks:0.0>, Priority: 0, Token: Token { kind: ContainerToken, service:
> 10.10.70.67:55933 }, ] for AM appattempt_1412150883650_0001_02
> ...
> ...
> Curious to know if this is kept like that for a reason. If not while using
> filtering tools to, say, grep events surrounding a specific attempt by the
> numeric ID part information may slip out during troubleshooting.





[jira] [Updated] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working as expected in FairScheduler

2015-06-04 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3758:
-
Summary: The mininum memory setting(yarn.scheduler.minimum-allocation-mb) 
is not working as expected in FairScheduler  (was: The mininum memory 
setting(yarn.scheduler.minimum-allocation-mb) is not working in container)

> The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not 
> working as expected in FairScheduler
> 
>
> Key: YARN-3758
> URL: https://issues.apache.org/jira/browse/YARN-3758
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: skrho
>
> Hello there~~
> I have 2 clusters
> First cluster is 5 node , default 1 application queue, Capacity scheduler, 8G 
> Physical memory each node
> Second cluster is 10 node, 2 application queuey, fair-scheduler, 230G 
> Physical memory each node
> Wherever a mapreduce job is running, I want resourcemanager is to set the 
> minimum memory  256m to container
> So I was changing configuration in yarn-site.xml & mapred-site.xml
> yarn.scheduler.minimum-allocation-mb : 256
> mapreduce.map.java.opts : -Xms256m 
> mapreduce.reduce.java.opts : -Xms256m 
> mapreduce.map.memory.mb : 256 
> mapreduce.reduce.memory.mb : 256 
> In First cluster  whenever a mapreduce job is running , I can see used memory 
> 256m in web console( http://installedIP:8088/cluster/nodes )
> But In Second cluster whenever a mapreduce job is running , I can see used 
> memory 1024m in web console( http://installedIP:8088/cluster/nodes ) 
> I know default memory value is 1024m, so if there is not changing memory 
> setting, the default value is working.
> I have been testing for two weeks, but I don't know why mimimum memory 
> setting is not working in second cluster
> Why this difference is happened? 
> Am I wrong setting configuration?
> or Is there bug?
> Thank you for reading~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container

2015-06-04 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572630#comment-14572630
 ] 

Rohith commented on YARN-3758:
--

bq. Is it a bug?
To be clear, is the inconsistent behavior a bug, or is it implemented intentionally 
for FS?

> The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not 
> working in container
> 
>
> Key: YARN-3758
> URL: https://issues.apache.org/jira/browse/YARN-3758
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: skrho
>
> Hello there~~
> I have 2 clusters
> First cluster is 5 node , default 1 application queue, Capacity scheduler, 8G 
> Physical memory each node
> Second cluster is 10 node, 2 application queuey, fair-scheduler, 230G 
> Physical memory each node
> Wherever a mapreduce job is running, I want resourcemanager is to set the 
> minimum memory  256m to container
> So I was changing configuration in yarn-site.xml & mapred-site.xml
> yarn.scheduler.minimum-allocation-mb : 256
> mapreduce.map.java.opts : -Xms256m 
> mapreduce.reduce.java.opts : -Xms256m 
> mapreduce.map.memory.mb : 256 
> mapreduce.reduce.memory.mb : 256 
> In First cluster  whenever a mapreduce job is running , I can see used memory 
> 256m in web console( http://installedIP:8088/cluster/nodes )
> But In Second cluster whenever a mapreduce job is running , I can see used 
> memory 1024m in web console( http://installedIP:8088/cluster/nodes ) 
> I know default memory value is 1024m, so if there is not changing memory 
> setting, the default value is working.
> I have been testing for two weeks, but I don't know why mimimum memory 
> setting is not working in second cluster
> Why this difference is happened? 
> Am I wrong setting configuration?
> or Is there bug?
> Thank you for reading~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container

2015-06-04 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572628#comment-14572628
 ] 

Rohith commented on YARN-3758:
--

I looked into the code for CS and FS. The understanding of minimum allocation and 
its behavior is different across CS and FS.
# CS : It is straightforward; if any request is less than 
min-allocation-mb, then CS normalizes the request to min-allocation-mb, and 
containers are allocated with minimum-allocation-mb. 
# FS : If any request is less than min-allocation-mb, then FS normalizes 
the request with the factor {{yarn.scheduler.increment-allocation-mb}}. In the example 
in the description, min-allocation-mb is 256mb, but increment-allocation-mb defaults to 
1024mb, which always allocates 1024mb to containers. {{yarn.scheduler.increment-allocation-mb}} 
has a huge effect: it changes the requested memory and assigns the newly calculated 
resource (see the rounding sketch below).

The behavior is not consistent between CS and FS. I am not sure why an 
additional configuration was introduced in FS. Is it a bug?
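
Not the actual CapacityScheduler/FairScheduler code, but a minimal sketch of the two 
rounding behaviours described above, assuming the request is simply rounded up to a 
multiple of the relevant step size:
{code}
// Minimal sketch of the two normalization behaviours (illustrative only, not the actual
// scheduler code).
public class NormalizeSketch {

  // Round a memory request up to the nearest multiple of the given step size.
  static int roundUp(int requestMb, int stepMb) {
    return ((requestMb + stepMb - 1) / stepMb) * stepMb;
  }

  public static void main(String[] args) {
    int requestMb = 256;

    // CS: the step factor is yarn.scheduler.minimum-allocation-mb (256 here).
    System.out.println("CS allocation: " + roundUp(requestMb, 256));   // 256

    // FS: the step factor is yarn.scheduler.increment-allocation-mb (default 1024),
    // which is why the second cluster shows 1024 MB per container.
    System.out.println("FS allocation: " + roundUp(requestMb, 1024));  // 1024
  }
}
{code}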

> The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not 
> working in container
> 
>
> Key: YARN-3758
> URL: https://issues.apache.org/jira/browse/YARN-3758
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: skrho
>
> Hello there~~
> I have 2 clusters
> First cluster is 5 node , default 1 application queue, Capacity scheduler, 8G 
> Physical memory each node
> Second cluster is 10 node, 2 application queuey, fair-scheduler, 230G 
> Physical memory each node
> Wherever a mapreduce job is running, I want resourcemanager is to set the 
> minimum memory  256m to container
> So I was changing configuration in yarn-site.xml & mapred-site.xml
> yarn.scheduler.minimum-allocation-mb : 256
> mapreduce.map.java.opts : -Xms256m 
> mapreduce.reduce.java.opts : -Xms256m 
> mapreduce.map.memory.mb : 256 
> mapreduce.reduce.memory.mb : 256 
> In First cluster  whenever a mapreduce job is running , I can see used memory 
> 256m in web console( http://installedIP:8088/cluster/nodes )
> But In Second cluster whenever a mapreduce job is running , I can see used 
> memory 1024m in web console( http://installedIP:8088/cluster/nodes ) 
> I know default memory value is 1024m, so if there is not changing memory 
> setting, the default value is working.
> I have been testing for two weeks, but I don't know why mimimum memory 
> setting is not working in second cluster
> Why this difference is happened? 
> Am I wrong setting configuration?
> or Is there bug?
> Thank you for reading~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3017) ContainerID in ResourceManager Log Has Slightly Different Format From AppAttemptID

2015-06-04 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572289#comment-14572289
 ] 

Rohith commented on YARN-3017:
--

Apologies for coming very late into this issue. Could changing the 
containerId format break compatibility when a rolling upgrade has been done 
with RM HA + work preserving enabled? IIUC, using ZKRMStateStore, a rolling 
upgrade can be done now.

> ContainerID in ResourceManager Log Has Slightly Different Format From 
> AppAttemptID
> --
>
> Key: YARN-3017
> URL: https://issues.apache.org/jira/browse/YARN-3017
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.8.0
>Reporter: MUFEED USMAN
>Priority: Minor
>  Labels: PatchAvailable
> Attachments: YARN-3017.patch, YARN-3017_1.patch, YARN-3017_2.patch
>
>
> Not sure if this should be filed as a bug or not.
> In the ResourceManager log in the events surrounding the creation of a new
> application attempt,
> ...
> ...
> 2014-11-14 17:45:37,258 INFO
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching
> masterappattempt_1412150883650_0001_02
> ...
> ...
> The application attempt has the ID format "_1412150883650_0001_02".
> Whereas the associated ContainerID goes by "_1412150883650_0001_02_".
> ...
> ...
> 2014-11-14 17:45:37,260 INFO
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting 
> up
> container Container: [ContainerId: container_1412150883650_0001_02_01,
> NodeId: n67:55933, NodeHttpAddress: n67:8042, Resource:  vCores:1,
> disks:0.0>, Priority: 0, Token: Token { kind: ContainerToken, service:
> 10.10.70.67:55933 }, ] for AM appattempt_1412150883650_0001_02
> ...
> ...
> Curious to know if this is kept like that for a reason. If not while using
> filtering tools to, say, grep events surrounding a specific attempt by the
> numeric ID part information may slip out during troubleshooting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-03 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572247#comment-14572247
 ] 

Rohith commented on YARN-3733:
--

+1 for handling virtual cores. This will be a good improvement for testing the 
DominantRC functionality precisely. 

> DominantRC#compare() does not work as expected if cluster resource is empty
> ---
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: 0001-YARN-3733.patch, 0002-YARN-3733.patch, 
> 0002-YARN-3733.patch, YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched

2015-06-03 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572244#comment-14572244
 ] 

Rohith commented on YARN-3754:
--

bq. When NM is shutting down, ContainerLaunch is also interrupted. During this 
interrupted exception handling, NM tries to update container diagnostics. But 
from main thread statestore is down ,hence caused the DB Close exception.
I think this issue was caused because the NM JVM did not exit on time, which allowed 
the statestore event to be processed. After YARN-3585, I think this should be OK.
[~bibinchundatt] Can you run a regression check for it please?

> Race condition when the NodeManager is shutting down and container is launched
> --
>
> Key: YARN-3754
> URL: https://issues.apache.org/jira/browse/YARN-3754
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
> Environment: Suse 11 Sp3
>Reporter: Bibin A Chundatt
>Assignee: Sunil G
>Priority: Critical
> Attachments: NM.log
>
>
> Container is launched and returned to ContainerImpl
> NodeManager closed the DB connection which resulting in 
> {{org.iq80.leveldb.DBException: Closed}}. 
> *Attaching the exception trace*
> {code}
> 2015-05-30 02:11:49,122 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Unable to update state store diagnostics for 
> container_e310_1432817693365_3338_01_02
> java.io.IOException: org.iq80.leveldb.DBException: Closed
> at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.iq80.leveldb.DBException: Closed
> at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123)
> at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106)
> at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259)
> ... 15 more
> {code}
> we can add a check whether DB is closed while we move container from ACQUIRED 
> state.
> As per the discussion in YARN-3585 have add the same



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-03 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3733:
-
Attachment: 0002-YARN-3733.patch

Updated the patch fixing the test-side comments. Kindly review the patch.

> DominantRC#compare() does not work as expected if cluster resource is empty
> ---
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: 0001-YARN-3733.patch, 0002-YARN-3733.patch, 
> 0002-YARN-3733.patch, YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-03 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572085#comment-14572085
 ] 

Rohith commented on YARN-3733:
--

bq. only memory or vcores are more in TestCapacityScheduler.
All the combinations of inputs are verified in TestResourceCalculator. And 
in TestCapacityScheduler, app submission happens only for memory in 
{{MockRM.submitApp}}, so the default vcore minimum allocation of 1 will be 
taken. So just changing memory to {{amResourceLimit.getMemory() + 
2}} should be enough.

bq. TestCapacityScheduler#verifyAMLimitForLeafQueue, while submitting second 
app, you could change the app name to "app-2".
Agree.

I will upload a patch soon

> DominantRC#compare() does not work as expected if cluster resource is empty
> ---
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: 0001-YARN-3733.patch, 0002-YARN-3733.patch, 
> YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-03 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3733:
-
Attachment: 0002-YARN-3733.patch

Thanks [~sunilg] and [~leftnoteasy] for sharing your thoughts..

I modified a bit of the logic and the order of the if checks so that it handles all 
the possible combinations of inputs in the table below. The problem was in the 5th and 7th 
inputs: the validation was returning 1 where zero was expected for the 5th 
combination, i.e. the flow never reached the 2nd check since the 1st step is an OR of memory vs 
cpu.
||Sl.no||cr||lhs||rhs||Output||
|1|<0,0>| <1,1> | <1,1> | 0 |
|2|<0,0>| <1,1> | <0,0> | 1 |
|3|<0,0>| <0,0> | <1,1> | -1 |
|4|<0,0>| <0,1> | <1,0> |  0 |
|5|<0,0>| <1,0> | <0,1> |  0 |
|6|<0,0>| <1,1> | <1,0> | 1  |
|7|<0,0>| <1,0> | <1,1> | -1  |

The updated patch has the following changes: 
# Changed the logic for comparing lhs and rhs resources when clusterResource is 
empty, as suggested.
# Added a test for AMLimit usage.
# Added tests for all of the above combinations of inputs.

Kindly review the patch
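
For readers following along, a minimal sketch of this kind of fallback comparison (an 
illustrative outline that treats resources as plain memory/vcore pairs; it is not the 
actual DominantResourceCalculator code, only a sketch that reproduces the expected 
outputs in the table above):
{code}
// Illustrative outline of the fallback comparison when clusterResource is <0,0>.
public class EmptyClusterCompareSketch {

  static int compareWhenClusterEmpty(int lhsMem, int lhsVcores, int rhsMem, int rhsVcores) {
    boolean lhsGreater = lhsMem > rhsMem || lhsVcores > rhsVcores;
    boolean rhsGreater = rhsMem > lhsMem || rhsVcores > lhsVcores;
    if (lhsGreater && rhsGreater) {
      return 0;    // e.g. rows 4/5: <0,1> vs <1,0> -- neither side dominates both dimensions
    } else if (lhsGreater) {
      return 1;    // e.g. row 2: <1,1> vs <0,0>, or row 6: <1,1> vs <1,0>
    } else if (rhsGreater) {
      return -1;   // e.g. row 3: <0,0> vs <1,1>, or row 7: <1,0> vs <1,1>
    }
    return 0;      // row 1: equal resources
  }

  public static void main(String[] args) {
    System.out.println(compareWhenClusterEmpty(1, 0, 0, 1));  // 0  (row 5)
    System.out.println(compareWhenClusterEmpty(1, 0, 1, 1));  // -1 (row 7)
  }
}
{code}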

> DominantRC#compare() does not work as expected if cluster resource is empty
> ---
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: 0001-YARN-3733.patch, 0002-YARN-3733.patch, 
> YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-02 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3733:
-
Attachment: 0001-YARN-3733.patch

The updated patch fixes the 2nd and 3rd scenarios in the above table (this issue's 
scenario) and refactors the test code.

As an overall solution that also handles input combinations like the 4th and 5th from the above 
table, we need to explore more on how to define the fraction and how to decide which 
one is dominant. Any suggestions on this?



> DominantRC#compare() does not work as expected if cluster resource is empty
> ---
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: 0001-YARN-3733.patch, YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-02 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568682#comment-14568682
 ] 

Rohith commented on YARN-3733:
--

Updated the summary as per defect.

> DominantRC#compare() does not work as expected if cluster resource is empty
> ---
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty

2015-06-01 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3733:
-
Summary: DominantRC#compare() does not work as expected if cluster resource 
is empty  (was:  On RM restart AM getting more than maximum possible memory 
when many  tasks in queue)

> DominantRC#compare() does not work as expected if cluster resource is empty
> ---
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568539#comment-14568539
 ] 

Rohith commented on YARN-3585:
--

Thanks [~jlowe] for the review .. 

bq. if we should flip the logic to not exit but then have NodeManager.main 
override that. This probably precludes the need to update existing tests.
Makes sense to me. Changed the logic to call the JVM exit only when the NodeManager is 
instantiated from the main function.

bq. We should be using ExitUtil instead of System.exit directly.
Done

bq. Nit: "setexitOnShutdownEvent" s/b "setExitOnShutdownEvent"
This method is not necessary now, since the patch assumes true when it is called 
only from the main function. I have removed it.
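
A rough sketch of the shape described above, assuming a simple flag that is set only from 
the main() entry point (the attached patch is authoritative; the class and field names here 
are illustrative, not the actual NodeManager code):
{code}
import org.apache.hadoop.util.ExitUtil;

// Rough sketch of the change described above (illustrative structure, not the patch itself):
// the JVM is force-exited on the SHUTDOWN event only when the NodeManager was started from
// its main() entry point, and ExitUtil is used so tests can intercept the exit.
public class NodeManagerExitSketch {
  private boolean exitOnShutdownEvent = false;   // defaults to "do not exit" for tests

  void handleShutdownEvent() {
    stopServices();                              // normal graceful shutdown
    if (exitOnShutdownEvent) {
      ExitUtil.terminate(-1);                    // only the real daemon forces the JVM down
    }
  }

  private void stopServices() {
    // stop NM services, close the recovery state store, etc.
  }

  public static void main(String[] args) {
    NodeManagerExitSketch nm = new NodeManagerExitSketch();
    nm.exitOnShutdownEvent = true;               // opted in only from main()
    // ... init and start the NodeManager services ...
  }
}
{code}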

Kindly review the updated patch.

> NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
> --
>
> Key: YARN-3585
> URL: https://issues.apache.org/jira/browse/YARN-3585
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Rohith
>Priority: Critical
> Attachments: 0001-YARN-3585.patch, YARN-3585.patch
>
>
> With NM recovery enabled, after decommission, nodemanager log show stop but 
> process cannot end. 
> non daemon thread:
> {noformat}
> "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
> condition [0x]
> "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
> [0x]
> "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 
> nid=0x29ed runnable 
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 
> nid=0x29ee runnable 
> "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 
> nid=0x29ef runnable 
> "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 
> nid=0x29f0 runnable 
> "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 
> nid=0x29f1 runnable 
> "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 
> nid=0x29f2 runnable 
> "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 
> nid=0x29f3 runnable 
> "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 
> nid=0x29f4 runnable 
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 
> runnable 
> "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 
> nid=0x29f5 runnable 
> "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 
> nid=0x29f6 runnable 
> "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
> on condition 
> {noformat}
> and jni leveldb thread stack
> {noformat}
> Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
> #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x7f33dfce2a3b in leveldb::(anonymous 
> namespace)::PosixEnv::BGThreadWrapper(void*) () from 
> /tmp/libleveldbjni-64-1-6922178968300745716.8
> #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
> #3  0x003d830e811d in clone () from /lib64/libc.so.6
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-06-01 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3585:
-
Attachment: 0001-YARN-3585.patch

> NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
> --
>
> Key: YARN-3585
> URL: https://issues.apache.org/jira/browse/YARN-3585
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Rohith
>Priority: Critical
> Attachments: 0001-YARN-3585.patch, YARN-3585.patch
>
>
> With NM recovery enabled, after decommission, nodemanager log show stop but 
> process cannot end. 
> non daemon thread:
> {noformat}
> "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
> condition [0x]
> "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
> [0x]
> "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 
> nid=0x29ed runnable 
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 
> nid=0x29ee runnable 
> "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 
> nid=0x29ef runnable 
> "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 
> nid=0x29f0 runnable 
> "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 
> nid=0x29f1 runnable 
> "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 
> nid=0x29f2 runnable 
> "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 
> nid=0x29f3 runnable 
> "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 
> nid=0x29f4 runnable 
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 
> runnable 
> "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 
> nid=0x29f5 runnable 
> "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 
> nid=0x29f6 runnable 
> "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
> on condition 
> {noformat}
> and jni leveldb thread stack
> {noformat}
> Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
> #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x7f33dfce2a3b in leveldb::(anonymous 
> namespace)::PosixEnv::BGThreadWrapper(void*) () from 
> /tmp/libleveldbjni-64-1-6922178968300745716.8
> #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
> #3  0x003d830e811d in clone () from /lib64/libc.so.6
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568462#comment-14568462
 ] 

Rohith commented on YARN-3733:
--

This issue's fix needs to go in for 2.7.1. Updated the target version to 2.7.1.

>  On RM restart AM getting more than maximum possible memory when many  tasks 
> in queue
> -
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-06-01 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3733:
-
Target Version/s: 2.7.1

>  On RM restart AM getting more than maximum possible memory when many  tasks 
> in queue
> -
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567201#comment-14567201
 ] 

Rohith commented on YARN-3585:
--

The findbugs -1 does not show any error report, so I am not sure why the -1 was given.
The test failure is unrelated to this patch.

[~jlowe] Kindly review the patch. 

> NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
> --
>
> Key: YARN-3585
> URL: https://issues.apache.org/jira/browse/YARN-3585
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Rohith
>Priority: Critical
> Attachments: YARN-3585.patch
>
>
> With NM recovery enabled, after decommission, nodemanager log show stop but 
> process cannot end. 
> non daemon thread:
> {noformat}
> "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
> condition [0x]
> "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
> [0x]
> "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 
> nid=0x29ed runnable 
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 
> nid=0x29ee runnable 
> "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 
> nid=0x29ef runnable 
> "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 
> nid=0x29f0 runnable 
> "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 
> nid=0x29f1 runnable 
> "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 
> nid=0x29f2 runnable 
> "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 
> nid=0x29f3 runnable 
> "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 
> nid=0x29f4 runnable 
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 
> runnable 
> "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 
> nid=0x29f5 runnable 
> "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 
> nid=0x29f6 runnable 
> "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
> on condition 
> {noformat}
> and jni leveldb thread stack
> {noformat}
> Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
> #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x7f33dfce2a3b in leveldb::(anonymous 
> namespace)::PosixEnv::BGThreadWrapper(void*) () from 
> /tmp/libleveldbjni-64-1-6922178968300745716.8
> #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
> #3  0x003d830e811d in clone () from /lib64/libc.so.6
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567196#comment-14567196
 ] 

Rohith commented on YARN-3585:
--

Yes, we can raise a different JIRA. [~bibinchundatt] Can you raise one so we can 
validate the issue there?

> NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
> --
>
> Key: YARN-3585
> URL: https://issues.apache.org/jira/browse/YARN-3585
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Rohith
>Priority: Critical
> Attachments: YARN-3585.patch
>
>
> With NM recovery enabled, after decommission, nodemanager log show stop but 
> process cannot end. 
> non daemon thread:
> {noformat}
> "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
> condition [0x]
> "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
> [0x]
> "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 
> nid=0x29ed runnable 
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 
> nid=0x29ee runnable 
> "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 
> nid=0x29ef runnable 
> "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 
> nid=0x29f0 runnable 
> "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 
> nid=0x29f1 runnable 
> "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 
> nid=0x29f2 runnable 
> "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 
> nid=0x29f3 runnable 
> "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 
> nid=0x29f4 runnable 
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 
> runnable 
> "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 
> nid=0x29f5 runnable 
> "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 
> nid=0x29f6 runnable 
> "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
> on condition 
> {noformat}
> and jni leveldb thread stack
> {noformat}
> Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
> #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x7f33dfce2a3b in leveldb::(anonymous 
> namespace)::PosixEnv::BGThreadWrapper(void*) () from 
> /tmp/libleveldbjni-64-1-6922178968300745716.8
> #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
> #3  0x003d830e811d in clone () from /lib64/libc.so.6
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567189#comment-14567189
 ] 

Rohith commented on YARN-3733:
--

bq. Verify infinity by calling isInfinite(float v). Quoting from jdk7 
Since infinity is derived from both lhs and rhs, infinity cannot be differentiated 
for clusterResource=<0,0>, lhs=<1,1>, and rhs=<2,2>. The method 
{{getResourceAsValue()}} returns infinity for both l and r, which cannot be compared.
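
To make the point concrete, a tiny standalone illustration of the degenerate float values 
(this only mirrors the idea behind {{getResourceAsValue()}}; it is not the YARN code):
{code}
// Why the ratios degenerate when clusterResource is <0,0>.
public class InfinityCompareSketch {
  public static void main(String[] args) {
    float clusterMemory = 0f;                    // empty cluster resource

    float lhsShare = 1f / clusterMemory;         // Infinity for lhs = <1,1>
    float rhsShare = 2f / clusterMemory;         // Infinity for rhs = <2,2>
    float emptyShare = 0f / clusterMemory;       // NaN for an all-zero lhs or rhs

    System.out.println(Float.compare(lhsShare, rhsShare)); // 0 -- both Infinity, indistinguishable
    System.out.println(Float.isInfinite(lhsShare));        // true, but true for rhs as well
    System.out.println(Float.isNaN(emptyShare));           // true -- the other problem case
  }
}
{code}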

>  On RM restart AM getting more than maximum possible memory when many  tasks 
> in queue
> -
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567186#comment-14567186
 ] 

Rohith commented on YARN-3733:
--

bq. 2. The newly added code is duplicated in two places, can you eliminate the 
duplicate code?
The second-time validation is not required in case of NaN; I will remove it in the next patch.

>  On RM restart AM getting more than maximum possible memory when many  tasks 
> in queue
> -
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-06-01 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567184#comment-14567184
 ] 

Rohith commented on YARN-3733:
--

Thanks [~devaraj.k] and [~sunilg] for review

bq. Can we check for lhs/rhs emptiness and compare these before ending up with 
infinite values? 
If we check for emptiness, this would affect specific input values like 
clusterResource=<0,0>, lhs=<1,1>, and rhs=<2,2>. Then which one is considered 
dominant? The dominant component cannot be retrieved directly from memory or cpu.

And I listed out the possible combinations of inputs that could occur in 
YARN. These are:
||Sl.no||clusterResource||lhs||rhs||Remark||
|1|<0,0>|<0,0>|<0,0>|Valid input; handled|
|2|<0,0>|<positive integer, positive integer>|<0,0>|NaN vs Infinity: the patch handles this scenario|
|3|<0,0>|<0,0>|<positive integer, positive integer>|NaN vs Infinity: the patch handles this scenario|
|4|<0,0>|<positive integer, positive integer>|<positive integer, positive integer>|Infinity vs Infinity: can this type occur in YARN?|
|5|<0,0>|<positive integer, 0>|<0, positive integer>|Is this valid input? Can this type occur in YARN?|


>  On RM restart AM getting more than maximum possible memory when many  tasks 
> in queue
> -
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-05-31 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566993#comment-14566993
 ] 

Rohith commented on YARN-3585:
--

This is a race condition between the NodeManager shutting down and a container being 
launched. By the time the container is launched and control returns to ContainerImpl, 
the NodeManager has already closed the DB connection, which results in 
{{org.iq80.leveldb.DBException: Closed}}.
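
As an illustration of the kind of guard this implies, a hypothetical sketch (the class, 
lock, flag, and method names are stand-ins, not the actual NM state-store API): skip the 
state-store write once the NM has begun shutting down, instead of hitting a closed leveldb 
handle.
{code}
import java.io.IOException;

// Hypothetical guard sketch for the race described above; names are stand-ins only.
class DiagnosticsUpdateSketch {
  private final Object stateStoreLock = new Object();
  private boolean storeClosed = false;

  void close() {
    synchronized (stateStoreLock) {
      storeClosed = true;              // set before the leveldb handle is actually closed
    }
  }

  void updateDiagnostics(String containerId, String diagnostics) {
    synchronized (stateStoreLock) {
      if (storeClosed) {
        System.err.println("State store already closed, skipping diagnostics for " + containerId);
        return;
      }
      try {
        storeContainerDiagnostics(containerId, diagnostics);
      } catch (IOException e) {
        System.err.println("Unable to update state store diagnostics for " + containerId);
      }
    }
  }

  // Stand-in for the real leveldb-backed write.
  void storeContainerDiagnostics(String containerId, String diagnostics) throws IOException { }
}
{code}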

> NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
> --
>
> Key: YARN-3585
> URL: https://issues.apache.org/jira/browse/YARN-3585
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Rohith
>Priority: Critical
> Attachments: YARN-3585.patch
>
>
> With NM recovery enabled, after decommission, nodemanager log show stop but 
> process cannot end. 
> non daemon thread:
> {noformat}
> "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
> condition [0x]
> "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
> [0x]
> "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 
> nid=0x29ed runnable 
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 
> nid=0x29ee runnable 
> "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 
> nid=0x29ef runnable 
> "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 
> nid=0x29f0 runnable 
> "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 
> nid=0x29f1 runnable 
> "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 
> nid=0x29f2 runnable 
> "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 
> nid=0x29f3 runnable 
> "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 
> nid=0x29f4 runnable 
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 
> runnable 
> "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 
> nid=0x29f5 runnable 
> "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 
> nid=0x29f6 runnable 
> "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
> on condition 
> {noformat}
> and jni leveldb thread stack
> {noformat}
> Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
> #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x7f33dfce2a3b in leveldb::(anonymous 
> namespace)::PosixEnv::BGThreadWrapper(void*) () from 
> /tmp/libleveldbjni-64-1-6922178968300745716.8
> #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
> #3  0x003d830e811d in clone () from /lib64/libc.so.6
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-05-28 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3733:
-
Attachment: YARN-3733.patch

>  On RM restart AM getting more than maximum possible memory when many  tasks 
> in queue
> -
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-05-28 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3733:
-
Attachment: (was: YARN-3733.patch)

>  On RM restart AM getting more than maximum possible memory when many  tasks 
> in queue
> -
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-05-28 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3733:
-
Attachment: YARN-3733.patch

Attached the patch fixing the issue. Kindly review the patch.

>  On RM restart AM getting more than maximum possible memory when many  tasks 
> in queue
> -
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Blocker
> Attachments: YARN-3733.patch
>
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-05-28 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3585:
-
Attachment: YARN-3585.patch

> NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
> --
>
> Key: YARN-3585
> URL: https://issues.apache.org/jira/browse/YARN-3585
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Rohith
>Priority: Critical
> Attachments: YARN-3585.patch
>
>
> With NM recovery enabled, after decommission, nodemanager log show stop but 
> process cannot end. 
> non daemon thread:
> {noformat}
> "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
> condition [0x]
> "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
> [0x]
> "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 
> nid=0x29ed runnable 
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 
> nid=0x29ee runnable 
> "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 
> nid=0x29ef runnable 
> "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 
> nid=0x29f0 runnable 
> "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 
> nid=0x29f1 runnable 
> "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 
> nid=0x29f2 runnable 
> "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 
> nid=0x29f3 runnable 
> "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 
> nid=0x29f4 runnable 
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 
> runnable 
> "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 
> nid=0x29f5 runnable 
> "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 
> nid=0x29f6 runnable 
> "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
> on condition 
> {noformat}
> and jni leveldb thread stack
> {noformat}
> Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
> #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x7f33dfce2a3b in leveldb::(anonymous 
> namespace)::PosixEnv::BGThreadWrapper(void*) () from 
> /tmp/libleveldbjni-64-1-6922178968300745716.8
> #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
> #3  0x003d830e811d in clone () from /lib64/libc.so.6
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-05-28 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith reassigned YARN-3585:


Assignee: Rohith

> NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
> --
>
> Key: YARN-3585
> URL: https://issues.apache.org/jira/browse/YARN-3585
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Rohith
>Priority: Critical
>
> With NM recovery enabled, after decommission, nodemanager log show stop but 
> process cannot end. 
> non daemon thread:
> {noformat}
> "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
> condition [0x]
> "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
> [0x]
> "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 
> nid=0x29ed runnable 
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 
> nid=0x29ee runnable 
> "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 
> nid=0x29ef runnable 
> "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 
> nid=0x29f0 runnable 
> "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 
> nid=0x29f1 runnable 
> "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 
> nid=0x29f2 runnable 
> "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 
> nid=0x29f3 runnable 
> "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 
> nid=0x29f4 runnable 
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 
> runnable 
> "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 
> nid=0x29f5 runnable 
> "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 
> nid=0x29f6 runnable 
> "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
> on condition 
> {noformat}
> and jni leveldb thread stack
> {noformat}
> Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
> #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x7f33dfce2a3b in leveldb::(anonymous 
> namespace)::PosixEnv::BGThreadWrapper(void*) () from 
> /tmp/libleveldbjni-64-1-6922178968300745716.8
> #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
> #3  0x003d830e811d in clone () from /lib64/libc.so.6
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-05-28 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563180#comment-14563180
 ] 

Rohith commented on YARN-3585:
--

Another observation: after I enabled debug logs for the NodeManager, the occurrence of this issue became relatively low. I think a timing issue around the db close is causing the problem in LevelDB. The issue does not always appear on all nodes, but at least one node in the cluster ends up going for a toss.

I also think this is a LevelDB issue, and we should report it to the LevelDB project.

Calling {{System.exit}} in the NodeManager graceful-shutdown path will mask many issues. Given that this is acceptable, I will upload a patch; a rough sketch of the idea follows.
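A minimal sketch of the idea, under the assumption that a hypothetical {{exitOnShutdown}} flag is used so tests can bypass the exit; this is not the actual NodeManager code:
{code}
// Hypothetical sketch only (not the actual NodeManager code): exit the JVM
// explicitly after the graceful stop completes, so a stuck non-daemon JNI
// thread (e.g. the leveldb background thread) cannot keep the process alive.
// Tests would set exitOnShutdown = false so the test JVM is not killed.
public class ExitAfterStop {
  static volatile boolean exitOnShutdown = true;   // tests clear this flag

  static void stopAndExit(org.apache.hadoop.service.Service nodeManager) {
    try {
      nodeManager.stop();          // serviceStop() closes the NM state store
    } finally {
      if (exitOnShutdown) {
        System.exit(0);            // note: this also masks whichever thread hung
      }
    }
  }
}
{code}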

> NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
> --
>
> Key: YARN-3585
> URL: https://issues.apache.org/jira/browse/YARN-3585
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Priority: Critical
>
> With NM recovery enabled, after decommission, nodemanager log show stop but 
> process cannot end. 
> non daemon thread:
> {noformat}
> "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
> condition [0x]
> "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
> [0x]
> "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 
> nid=0x29ed runnable 
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 
> nid=0x29ee runnable 
> "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 
> nid=0x29ef runnable 
> "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 
> nid=0x29f0 runnable 
> "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 
> nid=0x29f1 runnable 
> "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 
> nid=0x29f2 runnable 
> "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 
> nid=0x29f3 runnable 
> "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 
> nid=0x29f4 runnable 
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 
> runnable 
> "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 
> nid=0x29f5 runnable 
> "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 
> nid=0x29f6 runnable 
> "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
> on condition 
> {noformat}
> and jni leveldb thread stack
> {noformat}
> Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
> #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x7f33dfce2a3b in leveldb::(anonymous 
> namespace)::PosixEnv::BGThreadWrapper(void*) () from 
> /tmp/libleveldbjni-64-1-6922178968300745716.8
> #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
> #3  0x003d830e811d in clone () from /lib64/libc.so.6
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-05-28 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563124#comment-14563124
 ] 

Rohith commented on YARN-3585:
--

Tested with a patch that logs before and after db.close(), and found that the db is indeed closed. No exception was thrown while closing it; a sketch of the instrumentation follows.
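For reference, the instrumentation was essentially of this shape (a sketch, not the exact patch; {{LOG}} and {{db}} are the existing fields of {{NMLeveldbStateStoreService}}):
{code}
// Sketch of the logging used to verify the close; not the exact patch.
protected void closeStorage() throws java.io.IOException {
  LOG.info("Closing NM state store leveldb");
  db.close();                                    // org.iq80.leveldb.DB#close
  LOG.info("Closed NM state store leveldb");     // this line did show up in the logs
}
{code}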

> NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
> --
>
> Key: YARN-3585
> URL: https://issues.apache.org/jira/browse/YARN-3585
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Priority: Critical
>
> With NM recovery enabled, after decommission, nodemanager log show stop but 
> process cannot end. 
> non daemon thread:
> {noformat}
> "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
> condition [0x]
> "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
> [0x]
> "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 
> nid=0x29ed runnable 
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 
> nid=0x29ee runnable 
> "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 
> nid=0x29ef runnable 
> "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 
> nid=0x29f0 runnable 
> "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 
> nid=0x29f1 runnable 
> "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 
> nid=0x29f2 runnable 
> "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 
> nid=0x29f3 runnable 
> "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 
> nid=0x29f4 runnable 
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 
> runnable 
> "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 
> nid=0x29f5 runnable 
> "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 
> nid=0x29f6 runnable 
> "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
> on condition 
> {noformat}
> and jni leveldb thread stack
> {noformat}
> Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
> #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x7f33dfce2a3b in leveldb::(anonymous 
> namespace)::PosixEnv::BGThreadWrapper(void*) () from 
> /tmp/libleveldbjni-64-1-6922178968300745716.8
> #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
> #3  0x003d830e811d in clone () from /lib64/libc.so.6
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-05-28 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562638#comment-14562638
 ] 

Rohith commented on YARN-3733:
--

Steps to reproduce the scenario quickly. Assume that max-am-resource-limit is configured to 0.5 and the cluster capacity is 10GB once the NM registers, so the max AM resource limit is 5GB.
# Start the RM configured with DominantResourceCalculator (do not start any NM in the cluster).
# Submit 10 applications with 1GB each; all 10 applications get activated.
# Start the NM; the RM launches the AMs of all 10 applications, the cluster becomes full, and it hangs forever.
When no NM is registered, submitted applications should not be activated, i.e. they should not participate in scheduling; a rough sketch of such a guard is shown below.
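As a rough illustration of the expected behaviour (a hypothetical helper, not the actual LeafQueue code), activation could simply be skipped while the cluster has no registered capacity:
{code}
// Hypothetical guard, for illustration only: do not activate applications while
// no NodeManager has registered, i.e. while the cluster resource is still zero,
// otherwise the AM resource limit check has nothing meaningful to compare against.
boolean clusterHasCapacity(org.apache.hadoop.yarn.api.records.Resource clusterResource) {
  return clusterResource.getMemory() > 0 && clusterResource.getVirtualCores() > 0;
}
{code}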

>  On RM restart AM getting more than maximum possible memory when many  tasks 
> in queue
> -
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Critical
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-05-28 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562622#comment-14562622
 ] 

Rohith commented on YARN-3733:
--

Verified the RM logs from [~bibinchundatt] offline. The sequence of events that occurred:
# 30 applications are submitted to RM1 concurrently: *pendingApplications=18 and activeApplications=12*. The active applications move to the RUNNING state.
# RM1 switches to standby and RM2 transitions to active, so the currently active RM is RM2.
# The previously submitted 30 applications start recovering. As part of the recovery process, all 30 applications are submitted to the scheduler and all of them become active, i.e. *activeApplications=30 and pendingApplications=0*, which is not expected.
# The NM registers with the RM and the running AMs register with the RM.
# Since all 30 applications are activated, the scheduler tries to launch the ApplicationMasters of all of them and occupies the full cluster capacity.

Basically, the AM limit check in LeafQueue#activateApplications is not working as expected for {{DominantResourceCalculator}}, likely because its dominant-share comparison is computed against the cluster resource, which is zero before any NM registers. To confirm this, I wrote a simple program exercising both the default and dominant resource calculators with the memory configuration below. The output of the program is:
For DefaultResourceCalculator, the result is false, which limits applications from being activated when the AM resource limit is exceeded.
For DominantResourceCalculator, the result is true, which allows all applications to be activated even when the AM resource limit is exceeded.
{noformat}
2015-05-28 14:00:52,704 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
application AMResource  maxAMResourcePerQueuePercent 0.5 
amLimit  lastClusterResource  
amIfStarted 
{noformat}

{code}
package com.test.hadoop;

import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator;
import org.apache.hadoop.yarn.util.resource.DominantResourceCalculator;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

public class TestResourceCalculator {

  public static void main(String[] args) {
    ResourceCalculator defaultResourceCalculator = new DefaultResourceCalculator();
    ResourceCalculator dominantResourceCalculator = new DominantResourceCalculator();

    // Mirrors the LeafQueue#activateApplications check before any NM registers:
    // the cluster resource and the AM limit are both zero.
    Resource lastClusterResource = Resource.newInstance(0, 0);
    Resource amIfStarted = Resource.newInstance(4096, 1);
    Resource amLimit = Resource.newInstance(0, 0);

    // Expected false; actual is false, so DefaultResourceCalculator correctly
    // stops activation once the AM resource limit is exceeded.
    System.out.println("DefaultResourceCalculator : "
        + Resources.lessThanOrEqual(defaultResourceCalculator,
            lastClusterResource, amIfStarted, amLimit));

    // Expected false; actual is true, so DominantResourceCalculator lets every
    // application activate even though the AM resource limit is exceeded.
    System.out.println("DominantResourceCalculator : "
        + Resources.lessThanOrEqual(dominantResourceCalculator,
            lastClusterResource, amIfStarted, amLimit));
  }
}
{code}

>  On RM restart AM getting more than maximum possible memory when many  tasks 
> in queue
> -
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Critical
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue

2015-05-28 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith reassigned YARN-3733:


Assignee: Rohith

>  On RM restart AM getting more than maximum possible memory when many  tasks 
> in queue
> -
>
> Key: YARN-3733
> URL: https://issues.apache.org/jira/browse/YARN-3733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
> Environment: Suse 11 Sp3 , 2 NM , 2 RM
> one NM - 3 GB 6 v core
>Reporter: Bibin A Chundatt
>Assignee: Rohith
>Priority: Critical
>
> Steps to reproduce
> =
> 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster)
> 2. Configure map and reduce size to 512 MB  after changing scheduler minimum 
> size to 512 MB
> 3. Configure capacity scheduler and AM limit to .5 
> (DominantResourceCalculator is configured)
> 4. Submit 30 concurrent task 
> 5. Switch RM
> Actual
> =
> For 12 Jobs AM gets allocated and all 12 starts running
> No other Yarn child is initiated , *all 12 Jobs in Running state for ever*
> Expected
> ===
> Only 6 should be running at a time since max AM allocated is .5 (3072 MB)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3731) Unknown container. Container either has not started or has already completed or doesn’t belong to this node at all.

2015-05-27 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562280#comment-14562280
 ] 

Rohith commented on YARN-3731:
--

Closing the issue as invalid.

> Unknown container. Container either has not started or has already completed 
> or doesn’t belong to this node at all. 
> 
>
> Key: YARN-3731
> URL: https://issues.apache.org/jira/browse/YARN-3731
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: amit
>Priority: Critical
>
> Hi 
> I am importing data from sql server to hdfs and below is the command
> sqoop import –connect 
> “jdbc:sqlserver://Servername:1433;username=hadoop;password=Password;database=MSBI”
>  –table DimDate –target-dir /Hadoop/hdpdatadn/dn/DW/msbi
> but I am getting following error:
> User: amit.tomar
>  Name: DimDate.jar
>  Application Type: MAPREDUCE
>  Application Tags:
>  State: FAILED
>  FinalStatus: FAILED
>  Started: Wed May 27 12:39:48 +0800 2015
>  Elapsed: 23sec
>  Tracking URL: History
>  Diagnostics: Application application_1432698911303_0005 failed 2 times due 
> to AM Container for appattempt_1432698911303_0005_02 exited with 
> exitCode: 1
>  For more detailed output, check application tracking 
> page:http://ServerName/proxy/application_1432698911303_0005/Then, click on 
> links to logs of each attempt.
>  Diagnostics: Exception from container-launch.
>  Container id: container_1432698911303_0005_02_01
>  Exit code: 1
>  Stack trace: ExitCodeException exitCode=1:
>  at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>  at org.apache.hadoop.util.Shell.run(Shell.java:455)
>  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
>  Shell output: 1 file(s) moved.
>  Container exited with a non-zero exit code 1
>  Failing this attempt. Failing the application. 
> From the log below is the message:
> java.lang.Exception: Unknown container. Container either has not started or 
> has already completed or doesn’t belong to this node at all. 
> Thanks in advance
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3731) Unknown container. Container either has not started or has already completed or doesn’t belong to this node at all.

2015-05-27 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith resolved YARN-3731.
--
Resolution: Invalid

> Unknown container. Container either has not started or has already completed 
> or doesn’t belong to this node at all. 
> 
>
> Key: YARN-3731
> URL: https://issues.apache.org/jira/browse/YARN-3731
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: amit
>Priority: Critical
>
> Hi 
> I am importing data from sql server to hdfs and below is the command
> sqoop import –connect 
> “jdbc:sqlserver://Servername:1433;username=hadoop;password=Password;database=MSBI”
>  –table DimDate –target-dir /Hadoop/hdpdatadn/dn/DW/msbi
> but I am getting following error:
> User: amit.tomar
>  Name: DimDate.jar
>  Application Type: MAPREDUCE
>  Application Tags:
>  State: FAILED
>  FinalStatus: FAILED
>  Started: Wed May 27 12:39:48 +0800 2015
>  Elapsed: 23sec
>  Tracking URL: History
>  Diagnostics: Application application_1432698911303_0005 failed 2 times due 
> to AM Container for appattempt_1432698911303_0005_02 exited with 
> exitCode: 1
>  For more detailed output, check application tracking 
> page:http://ServerName/proxy/application_1432698911303_0005/Then, click on 
> links to logs of each attempt.
>  Diagnostics: Exception from container-launch.
>  Container id: container_1432698911303_0005_02_01
>  Exit code: 1
>  Stack trace: ExitCodeException exitCode=1:
>  at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>  at org.apache.hadoop.util.Shell.run(Shell.java:455)
>  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
>  Shell output: 1 file(s) moved.
>  Container exited with a non-zero exit code 1
>  Failing this attempt. Failing the application. 
> From the log below is the message:
> java.lang.Exception: Unknown container. Container either has not started or 
> has already completed or doesn’t belong to this node at all. 
> Thanks in advance
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3731) Unknown container. Container either has not started or has already completed or doesn’t belong to this node at all.

2015-05-27 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562279#comment-14562279
 ] 

Rohith commented on YARN-3731:
--

Hi [~amitmsbi]
Thanks for using Hadoop. You are trying to access the log link of an application whose ApplicationMaster never launched. From the diagnostics message, it is clear that the application did not launch. So first and foremost, you need to check why the ApplicationMaster did not launch. There could be an application configuration or classpath issue, which you can find in the stderr container logs; see the example command below.

Also, JIRA is meant for tracking development activities. For queries, kindly register with the [mailing list|https://hadoop.apache.org/mailing_lists.html] and send mail to the users list, i.e. {{u...@hadoop.apache.org}}. Folks there will definitely help you resolve or answer your queries.
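For example, assuming log aggregation is enabled on the cluster, the container logs (including stderr) for the failed application can be pulled with the {{yarn logs}} command:
{noformat}
# fetch the aggregated logs of the failed application, then inspect the
# stderr of the AM container for the actual launch failure
yarn logs -applicationId application_1432698911303_0005
{noformat}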

> Unknown container. Container either has not started or has already completed 
> or doesn’t belong to this node at all. 
> 
>
> Key: YARN-3731
> URL: https://issues.apache.org/jira/browse/YARN-3731
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: amit
>Priority: Critical
>
> Hi 
> I am importing data from sql server to hdfs and below is the command
> sqoop import –connect 
> “jdbc:sqlserver://Servername:1433;username=hadoop;password=Password;database=MSBI”
>  –table DimDate –target-dir /Hadoop/hdpdatadn/dn/DW/msbi
> but I am getting following error:
> User: amit.tomar
>  Name: DimDate.jar
>  Application Type: MAPREDUCE
>  Application Tags:
>  State: FAILED
>  FinalStatus: FAILED
>  Started: Wed May 27 12:39:48 +0800 2015
>  Elapsed: 23sec
>  Tracking URL: History
>  Diagnostics: Application application_1432698911303_0005 failed 2 times due 
> to AM Container for appattempt_1432698911303_0005_02 exited with 
> exitCode: 1
>  For more detailed output, check application tracking 
> page:http://ServerName/proxy/application_1432698911303_0005/Then, click on 
> links to logs of each attempt.
>  Diagnostics: Exception from container-launch.
>  Container id: container_1432698911303_0005_02_01
>  Exit code: 1
>  Stack trace: ExitCodeException exitCode=1:
>  at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>  at org.apache.hadoop.util.Shell.run(Shell.java:455)
>  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
>  Shell output: 1 file(s) moved.
>  Container exited with a non-zero exit code 1
>  Failing this attempt. Failing the application. 
> From the log below is the message:
> java.lang.Exception: Unknown container. Container either has not started or 
> has already completed or doesn’t belong to this node at all. 
> Thanks in advance
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-05-27 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561256#comment-14561256
 ] 

Rohith commented on YARN-3585:
--

bq. Could you to instrument logs in the state store code to verify the leveldb 
database is indeed being closed even when it hangs? 
Sorry, I did not get exactly what and where I should add the logs. Do you mean I should add a log after {{NMLeveldbStateStoreService#closeStorage()}} is called?

> NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
> --
>
> Key: YARN-3585
> URL: https://issues.apache.org/jira/browse/YARN-3585
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Priority: Critical
>
> With NM recovery enabled, after decommission, nodemanager log show stop but 
> process cannot end. 
> non daemon thread:
> {noformat}
> "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
> condition [0x]
> "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
> [0x]
> "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 
> nid=0x29ed runnable 
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 
> nid=0x29ee runnable 
> "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 
> nid=0x29ef runnable 
> "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 
> nid=0x29f0 runnable 
> "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 
> nid=0x29f1 runnable 
> "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 
> nid=0x29f2 runnable 
> "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 
> nid=0x29f3 runnable 
> "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 
> nid=0x29f4 runnable 
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 
> runnable 
> "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 
> nid=0x29f5 runnable 
> "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 
> nid=0x29f6 runnable 
> "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
> on condition 
> {noformat}
> and jni leveldb thread stack
> {noformat}
> Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
> #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x7f33dfce2a3b in leveldb::(anonymous 
> namespace)::PosixEnv::BGThreadWrapper(void*) () from 
> /tmp/libleveldbjni-64-1-6922178968300745716.8
> #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
> #3  0x003d830e811d in clone () from /lib64/libc.so.6
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-05-27 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560965#comment-14560965
 ] 

Rohith commented on YARN-3585:
--

I have attached the NM logs and a thread dump to YARN-3640. Could you get them from YARN-3640?

> NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
> --
>
> Key: YARN-3585
> URL: https://issues.apache.org/jira/browse/YARN-3585
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Priority: Critical
>
> With NM recovery enabled, after decommission, nodemanager log show stop but 
> process cannot end. 
> non daemon thread:
> {noformat}
> "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
> condition [0x]
> "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
> [0x]
> "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 
> nid=0x29ed runnable 
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 
> nid=0x29ee runnable 
> "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 
> nid=0x29ef runnable 
> "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 
> nid=0x29f0 runnable 
> "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 
> nid=0x29f1 runnable 
> "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 
> nid=0x29f2 runnable 
> "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 
> nid=0x29f3 runnable 
> "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 
> nid=0x29f4 runnable 
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 
> runnable 
> "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 
> nid=0x29f5 runnable 
> "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 
> nid=0x29f6 runnable 
> "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
> on condition 
> {noformat}
> and jni leveldb thread stack
> {noformat}
> Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
> #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x7f33dfce2a3b in leveldb::(anonymous 
> namespace)::PosixEnv::BGThreadWrapper(void*) () from 
> /tmp/libleveldbjni-64-1-6922178968300745716.8
> #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
> #3  0x003d830e811d in clone () from /lib64/libc.so.6
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED

2015-05-27 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560774#comment-14560774
 ] 

Rohith commented on YARN-3535:
--

Thanks [~peng.zhang] for working on this issue.
Some comments:
# I think the method {{recoverResourceRequestForContainer}} should be synchronized, any thoughts?
# Why do we require the {{RMContextImpl.java}} changes? I think we can avoid them; they do not seem necessary.

Tests:
# Any specific reason for changing {{TestAMRestart.java}}?
# IIUC, this issue can occur in all the schedulers, given that the AM-RM heartbeat interval is shorter than the NM-RM heartbeat interval. So can we include a functional test case applicable to both CS and FS? Maybe you can add the test to the class extending {{ParameterizedSchedulerTestBase}}, i.e. TestAbstractYarnScheduler.


>  ResourceRequest should be restored back to scheduler when RMContainer is 
> killed at ALLOCATED
> -
>
> Key: YARN-3535
> URL: https://issues.apache.org/jira/browse/YARN-3535
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Assignee: Peng Zhang
>  Labels: BB2015-05-TBR
> Attachments: YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, 
> yarn-app.log
>
>
> During rolling update of NM, AM start of container on NM failed. 
> And then job hang there.
> Attach AM logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-05-26 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560470#comment-14560470
 ] 

Rohith commented on YARN-3585:
--

I tested locally with the YARN-3641 fix; the issue still exists.

> NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
> --
>
> Key: YARN-3585
> URL: https://issues.apache.org/jira/browse/YARN-3585
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Priority: Critical
>
> With NM recovery enabled, after decommission, nodemanager log show stop but 
> process cannot end. 
> non daemon thread:
> {noformat}
> "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
> condition [0x]
> "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
> [0x]
> "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 
> nid=0x29ed runnable 
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 
> nid=0x29ee runnable 
> "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 
> nid=0x29ef runnable 
> "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 
> nid=0x29f0 runnable 
> "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 
> nid=0x29f1 runnable 
> "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 
> nid=0x29f2 runnable 
> "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 
> nid=0x29f3 runnable 
> "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 
> nid=0x29f4 runnable 
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 
> runnable 
> "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 
> nid=0x29f5 runnable 
> "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 
> nid=0x29f6 runnable 
> "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
> on condition 
> {noformat}
> and jni leveldb thread stack
> {noformat}
> Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
> #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x7f33dfce2a3b in leveldb::(anonymous 
> namespace)::PosixEnv::BGThreadWrapper(void*) () from 
> /tmp/libleveldbjni-64-1-6922178968300745716.8
> #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
> #3  0x003d830e811d in clone () from /lib64/libc.so.6
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-05-26 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560410#comment-14560410
 ] 

Rohith commented on YARN-3585:
--

I will test the YARN-3641 fix against this JIRA's scenario. About the patch, I think calling System.exit() explicitly after the shutdown thread exits is one option.

> NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
> --
>
> Key: YARN-3585
> URL: https://issues.apache.org/jira/browse/YARN-3585
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Priority: Critical
>
> With NM recovery enabled, after decommission, nodemanager log show stop but 
> process cannot end. 
> non daemon thread:
> {noformat}
> "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
> condition [0x]
> "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
> [0x]
> "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 
> nid=0x29ed runnable 
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 
> nid=0x29ee runnable 
> "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 
> nid=0x29ef runnable 
> "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 
> nid=0x29f0 runnable 
> "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 
> nid=0x29f1 runnable 
> "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 
> nid=0x29f2 runnable 
> "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 
> nid=0x29f3 runnable 
> "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 
> nid=0x29f4 runnable 
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 
> runnable 
> "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 
> nid=0x29f5 runnable 
> "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 
> nid=0x29f6 runnable 
> "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
> on condition 
> {noformat}
> and jni leveldb thread stack
> {noformat}
> Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
> #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x7f33dfce2a3b in leveldb::(anonymous 
> namespace)::PosixEnv::BGThreadWrapper(void*) () from 
> /tmp/libleveldbjni-64-1-6922178968300745716.8
> #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
> #3  0x003d830e811d in clone () from /lib64/libc.so.6
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-05-25 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14558050#comment-14558050
 ] 

Rohith commented on YARN-3543:
--

[~vinodkv] Kindly review the updated patch.

> ApplicationReport should be able to tell whether the Application is AM 
> managed or not. 
> ---
>
> Key: YARN-3543
> URL: https://issues.apache.org/jira/browse/YARN-3543
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api
>Affects Versions: 2.6.0
>Reporter: Spandan Dutta
>Assignee: Rohith
> Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
> 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
> 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 
> YARN-3543-AH.PNG, YARN-3543-RM.PNG
>
>
> Currently we can know whether the application submitted by the user is AM 
> managed from the applicationSubmissionContext. This can be only done  at the 
> time when the user submits the job. We should have access to this info from 
> the ApplicationReport as well so that we can check whether an app is AM 
> managed or not anytime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-05-24 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3543:
-
Attachment: 0004-YARN-3543.patch

Attaching the same patch as the previous one to kick off Jenkins.

> ApplicationReport should be able to tell whether the Application is AM 
> managed or not. 
> ---
>
> Key: YARN-3543
> URL: https://issues.apache.org/jira/browse/YARN-3543
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api
>Affects Versions: 2.6.0
>Reporter: Spandan Dutta
>Assignee: Rohith
> Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
> 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
> 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 
> YARN-3543-AH.PNG, YARN-3543-RM.PNG
>
>
> Currently we can know whether the application submitted by the user is AM 
> managed from the applicationSubmissionContext. This can be only done  at the 
> time when the user submits the job. We should have access to this info from 
> the ApplicationReport as well so that we can check whether an app is AM 
> managed or not anytime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2238) filtering on UI sticks even if I move away from the page

2015-05-24 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557960#comment-14557960
 ] 

Rohith commented on YARN-2238:
--

+1 lgtm (non-binding)

> filtering on UI sticks even if I move away from the page
> 
>
> Key: YARN-2238
> URL: https://issues.apache.org/jira/browse/YARN-2238
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Affects Versions: 2.4.0
>Reporter: Sangjin Lee
>Assignee: Jian He
>  Labels: usability
> Attachments: YARN-2238.patch, YARN-2238.png, filtered.png
>
>
> The main data table in many web pages (RM, AM, etc.) seems to show an 
> unexpected filtering behavior.
> If I filter the table by typing something in the key or value field (or I 
> suspect any search field), the data table gets filtered. The example I used 
> is the job configuration page for a MR job. That is expected.
> However, when I move away from that page and visit any other web page of the 
> same type (e.g. a job configuration page), the page is rendered with the 
> filtering! That is unexpected.
> What's even stranger is that it does not render the filtering term. As a 
> result, I have a page that's mysteriously filtered but doesn't tell me what 
> it's filtering on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2238) filtering on UI sticks even if I move away from the page

2015-05-24 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557959#comment-14557959
 ] 

Rohith commented on YARN-2238:
--

Tested locally with the YARN-3707 fix; working fine :-)

> filtering on UI sticks even if I move away from the page
> 
>
> Key: YARN-2238
> URL: https://issues.apache.org/jira/browse/YARN-2238
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Affects Versions: 2.4.0
>Reporter: Sangjin Lee
>Assignee: Jian He
>  Labels: usability
> Attachments: YARN-2238.patch, YARN-2238.png, filtered.png
>
>
> The main data table in many web pages (RM, AM, etc.) seems to show an 
> unexpected filtering behavior.
> If I filter the table by typing something in the key or value field (or I 
> suspect any search field), the data table gets filtered. The example I used 
> is the job configuration page for a MR job. That is expected.
> However, when I move away from that page and visit any other web page of the 
> same type (e.g. a job configuration page), the page is rendered with the 
> filtering! That is unexpected.
> What's even stranger is that it does not render the filtering term. As a 
> result, I have a page that's mysteriously filtered but doesn't tell me what 
> it's filtering on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3708) container num become -1 after job finished

2015-05-24 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith resolved YARN-3708.
--
Resolution: Duplicate

This is a duplicate of YARN-3552. Closing the issue as a duplicate.

> container num become -1 after job finished
> --
>
> Key: YARN-3708
> URL: https://issues.apache.org/jira/browse/YARN-3708
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.7.0
>Reporter: tongshiquan
>Priority: Minor
> Attachments: screenshot-1.png
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2238) filtering on UI sticks even if I move away from the page

2015-05-23 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557279#comment-14557279
 ] 

Rohith commented on YARN-2238:
--

Attached the RM web UI screenshot, which depicts problem 2 from my previous comment.

> filtering on UI sticks even if I move away from the page
> 
>
> Key: YARN-2238
> URL: https://issues.apache.org/jira/browse/YARN-2238
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Affects Versions: 2.4.0
>Reporter: Sangjin Lee
>Assignee: Jian He
>  Labels: usability
> Attachments: YARN-2238.patch, YARN-2238.png, filtered.png
>
>
> The main data table in many web pages (RM, AM, etc.) seems to show an 
> unexpected filtering behavior.
> If I filter the table by typing something in the key or value field (or I 
> suspect any search field), the data table gets filtered. The example I used 
> is the job configuration page for a MR job. That is expected.
> However, when I move away from that page and visit any other web page of the 
> same type (e.g. a job configuration page), the page is rendered with the 
> filtering! That is unexpected.
> What's even stranger is that it does not render the filtering term. As a 
> result, I have a page that's mysteriously filtered but doesn't tell me what 
> it's filtering on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2238) filtering on UI sticks even if I move away from the page

2015-05-23 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-2238:
-
Attachment: YARN-2238.png

> filtering on UI sticks even if I move away from the page
> 
>
> Key: YARN-2238
> URL: https://issues.apache.org/jira/browse/YARN-2238
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Affects Versions: 2.4.0
>Reporter: Sangjin Lee
>Assignee: Jian He
>  Labels: usability
> Attachments: YARN-2238.patch, YARN-2238.png, filtered.png
>
>
> The main data table in many web pages (RM, AM, etc.) seems to show an 
> unexpected filtering behavior.
> If I filter the table by typing something in the key or value field (or I 
> suspect any search field), the data table gets filtered. The example I used 
> is the job configuration page for a MR job. That is expected.
> However, when I move away from that page and visit any other web page of the 
> same type (e.g. a job configuration page), the page is rendered with the 
> filtering! That is unexpected.
> What's even stranger is that it does not render the filtering term. As a 
> result, I have a page that's mysteriously filtered but doesn't tell me what 
> it's filtering on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2238) filtering on UI sticks even if I move away from the page

2015-05-23 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557277#comment-14557277
 ] 

Rohith commented on YARN-2238:
--

I do not have much knowledge of jQuery, but I did black-box testing on a single-node cluster with the patch applied.
Some observations:
# Filtering on the scheduler page does not carry over to the application page. This is the scenario from this JIRA, and it works fine.
# After navigating to the scheduler page, clicking on a LeafQueue bar applies the filters but does not show any apps running on that queue in the scheduler page.

> filtering on UI sticks even if I move away from the page
> 
>
> Key: YARN-2238
> URL: https://issues.apache.org/jira/browse/YARN-2238
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Affects Versions: 2.4.0
>Reporter: Sangjin Lee
>Assignee: Jian He
>  Labels: usability
> Attachments: YARN-2238.patch, filtered.png
>
>
> The main data table in many web pages (RM, AM, etc.) seems to show an 
> unexpected filtering behavior.
> If I filter the table by typing something in the key or value field (or I 
> suspect any search field), the data table gets filtered. The example I used 
> is the job configuration page for a MR job. That is expected.
> However, when I move away from that page and visit any other web page of the 
> same type (e.g. a job configuration page), the page is rendered with the 
> filtering! That is unexpected.
> What's even stranger is that it does not render the filtering term. As a 
> result, I have a page that's mysteriously filtered but doesn't tell me what 
> it's filtering on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled

2015-05-22 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557170#comment-14557170
 ] 

Rohith commented on YARN-3585:
--

I think we can invoke System.exit() in a finally block once the NodeManager is shut down. For test-case execution, we can bypass it using a flag. Any thoughts?

> NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
> --
>
> Key: YARN-3585
> URL: https://issues.apache.org/jira/browse/YARN-3585
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Peng Zhang
>Priority: Critical
>
> With NM recovery enabled, after decommission, nodemanager log show stop but 
> process cannot end. 
> non daemon thread:
> {noformat}
> "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
> condition [0x]
> "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
> [0x]
> "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 
> nid=0x29ed runnable 
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 
> nid=0x29ee runnable 
> "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 
> nid=0x29ef runnable 
> "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 
> nid=0x29f0 runnable 
> "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 
> nid=0x29f1 runnable 
> "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 
> nid=0x29f2 runnable 
> "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 
> nid=0x29f3 runnable 
> "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 
> nid=0x29f4 runnable 
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 
> runnable 
> "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 
> nid=0x29f5 runnable 
> "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 
> nid=0x29f6 runnable 
> "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
> on condition 
> {noformat}
> and jni leveldb thread stack
> {noformat}
> Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
> #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x7f33dfce2a3b in leveldb::(anonymous 
> namespace)::PosixEnv::BGThreadWrapper(void*) () from 
> /tmp/libleveldbjni-64-1-6922178968300745716.8
> #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
> #3  0x003d830e811d in clone () from /lib64/libc.so.6
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-05-22 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556239#comment-14556239
 ] 

Rohith commented on YARN-3543:
--

[~aw] Would you help me understand and resolve a build issue? Basically, what I observe is that the patch contains many file changes spanning several projects. When the test cases are triggered, the build ignores the applied patches and picks up the existing class files, which causes the compilation failure and other issues. But if I apply the patch and build locally, it is successful.

> ApplicationReport should be able to tell whether the Application is AM 
> managed or not. 
> ---
>
> Key: YARN-3543
> URL: https://issues.apache.org/jira/browse/YARN-3543
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api
>Affects Versions: 2.6.0
>Reporter: Spandan Dutta
>Assignee: Rohith
> Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
> 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
> 0004-YARN-3543.patch, 0004-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG
>
>
> Currently we can know whether the application submitted by the user is AM 
> managed from the applicationSubmissionContext. This can be only done  at the 
> time when the user submits the job. We should have access to this info from 
> the ApplicationReport as well so that we can check whether an app is AM 
> managed or not anytime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3692) Allow REST API to set a user generated message when killing an application

2015-05-21 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554057#comment-14554057
 ] 

Rohith commented on YARN-3692:
--

All such applications are killed by a user. The diagnostic message for an application KILLED by a user is currently internal to YARN, whether the kill comes via REST or via ApplicationClientProtocol.
Is the intent here to let the user set the reason for killing the application? An example of what that could look like is sketched below.
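For context, the kill today goes through the app-state REST resource with only the target state in the body; the proposal would presumably carry an extra field for the user's reason (the {{diagnostics}} field name below is hypothetical, not an existing API):
{noformat}
# current behaviour: kill without any user-supplied message
curl -X PUT -H "Content-Type: application/json" \
     -d '{"state":"KILLED"}' \
     http://<rm-host>:8088/ws/v1/cluster/apps/<application-id>/state

# proposed (hypothetical field name): carry a user-supplied reason with the kill
curl -X PUT -H "Content-Type: application/json" \
     -d '{"state":"KILLED","diagnostics":"killed by ops: runaway job"}' \
     http://<rm-host>:8088/ws/v1/cluster/apps/<application-id>/state
{noformat}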

> Allow REST API to set a user generated message when killing an application
> --
>
> Key: YARN-3692
> URL: https://issues.apache.org/jira/browse/YARN-3692
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Rajat Jain
>Assignee: Rohith
>
> Currently YARN's REST API supports killing an application without setting a 
> diagnostic message. It would be good to provide that support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3692) Allow REST API to set a user generated message when killing an application

2015-05-20 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith reassigned YARN-3692:


Assignee: Rohith

> Allow REST API to set a user generated message when killing an application
> --
>
> Key: YARN-3692
> URL: https://issues.apache.org/jira/browse/YARN-3692
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Rajat Jain
>Assignee: Rohith
>
> Currently YARN's REST API supports killing an application without setting a 
> diagnostic message. It would be good to provide that support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-20 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552225#comment-14552225
 ] 

Rohith commented on YARN-3646:
--

+1 lgtm (non-binding). Waiting for the Jenkins report.

> Applications are getting stuck some times in case of retry policy forever
> -
>
> Key: YARN-3646
> URL: https://issues.apache.org/jira/browse/YARN-3646
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Reporter: Raju Bairishetti
> Attachments: YARN-3646.001.patch, YARN-3646.002.patch, YARN-3646.patch
>
>
> We have set  *yarn.resourcemanager.connect.wait-ms* to -1 to use  FOREVER 
> retry policy.
> Yarn client is infinitely retrying in case of exceptions from the RM as it is 
> using retrying policy as FOREVER. The problem is it is retrying for all kinds 
> of exceptions (like ApplicationNotFoundException), even though it is not a 
> connection failure. Due to this my application is not progressing further.
> *Yarn client should not retry infinitely in case of non connection failures.*
> We have written a simple yarn-client which is trying to get an application 
> report for an invalid  or older appId. ResourceManager is throwing an 
> ApplicationNotFoundException as this is an invalid or older appId.  But 
> because of retry policy FOREVER, client is keep on retrying for getting the 
> application report and ResourceManager is throwing 
> ApplicationNotFoundException continuously.
> {code}
> private void testYarnClientRetryPolicy() throws  Exception{
> YarnConfiguration conf = new YarnConfiguration();
> conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, 
> -1);
> YarnClient yarnClient = YarnClient.createYarnClient();
> yarnClient.init(conf);
> yarnClient.start();
> ApplicationId appId = ApplicationId.newInstance(1430126768987L, 
> 10645);
> ApplicationReport report = yarnClient.getApplicationReport(appId);
> }
> {code}
> *RM logs:*
> {noformat}
> 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from 10.14.120.231:61621 Call#875162 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1430126768987_10645' doesn't exist in RM.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> 
> 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from 10.14.120.231:61621 Call#875163 Retry#0
> 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-05-20 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552165#comment-14552165
 ] 

Rohith commented on YARN-3543:
--

The build machine is not able to run all those tests in one shot. A similar issue was 
faced earlier in YARN-2784. I think we need to split the JIRA into the proto change, 
the WebUI change, the AH change, and more.

> ApplicationReport should be able to tell whether the Application is AM 
> managed or not. 
> ---
>
> Key: YARN-3543
> URL: https://issues.apache.org/jira/browse/YARN-3543
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api
>Affects Versions: 2.6.0
>Reporter: Spandan Dutta
>Assignee: Rohith
>  Labels: BB2015-05-TBR
> Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
> 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
> 0004-YARN-3543.patch, 0004-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG
>
>
> Currently we can know whether the application submitted by the user is AM 
> managed from the applicationSubmissionContext. This can be only done  at the 
> time when the user submits the job. We should have access to this info from 
> the ApplicationReport as well so that we can check whether an app is AM 
> managed or not anytime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-20 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552091#comment-14552091
 ] 

Rohith commented on YARN-3646:
--

Thanks for updating the patch. Some comments on the tests:
# I think we can remove the tests added in the hadoop-common project, since the 
yarn-client test verifies the required functionality. Basically, the hadoop-common test 
was mocking the RMProxy functionality, and that test passed even without the RMProxy 
fix.
# The code never reaches {{Assert.fail("");}}; better to remove it.
# Catch ApplicationNotFoundException instead of catching Throwable. I think 
you can add {{expected = ApplicationNotFoundException.class}} to the @Test 
annotation, like below.
{code}
@Test(timeout = 30000, expected = ApplicationNotFoundException.class)
  public void testClientWithRetryPolicyForEver() throws Exception {
YarnConfiguration conf = new YarnConfiguration();
conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, -1);

ResourceManager rm = null;
YarnClient yarnClient = null;
try {
  // start rm
  rm = new ResourceManager();
  rm.init(conf);
  rm.start();

  yarnClient = YarnClient.createYarnClient();
  yarnClient.init(conf);
  yarnClient.start();

  // create invalid application id
  ApplicationId appId = ApplicationId.newInstance(1430126768987L, 10645);

  // RM should throw ApplicationNotFoundException exception
  yarnClient.getApplicationReport(appId);
} finally {
  if (yarnClient != null) {
yarnClient.stop();
  }
  if (rm != null) {
rm.stop();
  }
}
  }
{code}
# Can you rename the test to reflect the functionality under test, e.g. 
{{testShouldNotRetryForeverForNonNetworkExceptions}}?

> Applications are getting stuck some times in case of retry policy forever
> -
>
> Key: YARN-3646
> URL: https://issues.apache.org/jira/browse/YARN-3646
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Reporter: Raju Bairishetti
> Attachments: YARN-3646.001.patch, YARN-3646.patch
>
>
> We have set  *yarn.resourcemanager.connect.wait-ms* to -1 to use  FOREVER 
> retry policy.
> Yarn client is infinitely retrying in case of exceptions from the RM as it is 
> using retrying policy as FOREVER. The problem is it is retrying for all kinds 
> of exceptions (like ApplicationNotFoundException), even though it is not a 
> connection failure. Due to this my application is not progressing further.
> *Yarn client should not retry infinitely in case of non connection failures.*
> We have written a simple yarn-client which is trying to get an application 
> report for an invalid  or older appId. ResourceManager is throwing an 
> ApplicationNotFoundException as this is an invalid or older appId.  But 
> because of retry policy FOREVER, client is keep on retrying for getting the 
> application report and ResourceManager is throwing 
> ApplicationNotFoundException continuously.
> {code}
> private void testYarnClientRetryPolicy() throws  Exception{
> YarnConfiguration conf = new YarnConfiguration();
> conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, 
> -1);
> YarnClient yarnClient = YarnClient.createYarnClient();
> yarnClient.init(conf);
> yarnClient.start();
> ApplicationId appId = ApplicationId.newInstance(1430126768987L, 
> 10645);
> ApplicationReport report = yarnClient.getApplicationReport(appId);
> }
> {code}
> *RM logs:*
> {noformat}
> 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from 10.14.120.231:61621 Call#875162 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1430126768987_10645' doesn't exist in RM.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.had

[jira] [Updated] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-05-20 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3543:
-
Attachment: 0004-YARN-3543.patch

Attached the same patch to kick off Jenkins.

> ApplicationReport should be able to tell whether the Application is AM 
> managed or not. 
> ---
>
> Key: YARN-3543
> URL: https://issues.apache.org/jira/browse/YARN-3543
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api
>Affects Versions: 2.6.0
>Reporter: Spandan Dutta
>Assignee: Rohith
>  Labels: BB2015-05-TBR
> Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
> 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
> 0004-YARN-3543.patch, 0004-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG
>
>
> Currently we can know whether the application submitted by the user is AM 
> managed from the applicationSubmissionContext. This can be only done  at the 
> time when the user submits the job. We should have access to this info from 
> the ApplicationReport as well so that we can check whether an app is AM 
> managed or not anytime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-05-20 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3543:
-
Attachment: (was: 0003-YARN-3543.patch)

> ApplicationReport should be able to tell whether the Application is AM 
> managed or not. 
> ---
>
> Key: YARN-3543
> URL: https://issues.apache.org/jira/browse/YARN-3543
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api
>Affects Versions: 2.6.0
>Reporter: Spandan Dutta
>Assignee: Rohith
>  Labels: BB2015-05-TBR
> Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
> 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
> 0004-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG
>
>
> Currently we can know whether the application submitted by the user is AM 
> managed from the applicationSubmissionContext. This can be only done  at the 
> time when the user submits the job. We should have access to this info from 
> the ApplicationReport as well so that we can check whether an app is AM 
> managed or not anytime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2268) Disallow formatting the RMStateStore when there is an RM running

2015-05-19 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551826#comment-14551826
 ] 

Rohith commented on YARN-2268:
--

Thanks [~sunilg] [~jianhe] [~kasha] for sharing your thoughts.
bq. Given we recommend using the ZK-store when using HA, how about adding this 
for the ZK-store using an ephemeral znode for lock first?
+1, given that ZKRMStateStore is the recommended state store for HA.

bq. How about creating a lock file and declaring it stale after a stipulated 
period of time.
If we use a stipulated period, then within that period neither can the RM be started 
nor can the state store be formatted. Also, the lock file would have to be stored in 
HDFS regardless of which RMStateStore is used, which is an extra binding to the 
filesystem.

I am wondering why we can't use the general approach of polling the web service; 
it would give a more accurate state.
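To make the ephemeral-znode idea concrete, here is a rough sketch only (the path and 
class name are mine, not an actual implementation): the active RM holds an ephemeral 
znode while it uses the store, and the format command refuses to run while that znode 
exists.
{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Rough sketch: the active RM holds an ephemeral znode; the format command
// refuses to run while that znode exists.
public final class StoreFormatGuard {
  // Hypothetical znode path held by a running RM.
  private static final String RM_ACTIVE_PATH = "/rmstore/active-rm";

  /** Called by the RM once it starts using the state store. */
  static void markActive(ZooKeeper zk) throws Exception {
    zk.create(RM_ACTIVE_PATH, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
        CreateMode.EPHEMERAL);
  }

  /** Called by the format command before deleting any store data. */
  static void failIfRMActive(ZooKeeper zk) throws Exception {
    if (zk.exists(RM_ACTIVE_PATH, false) != null) {
      throw new IllegalStateException(
          "An RM appears to be using this state store; refusing to format");
    }
  }
}
{code}
Since the znode is ephemeral, it disappears automatically if the RM process dies, so 
no stale-lock timeout would be needed.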


> Disallow formatting the RMStateStore when there is an RM running
> 
>
> Key: YARN-2268
> URL: https://issues.apache.org/jira/browse/YARN-2268
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Karthik Kambatla
>Assignee: Rohith
> Attachments: 0001-YARN-2268.patch
>
>
> YARN-2131 adds a way to format the RMStateStore. However, it can be a problem 
> if we format the store while an RM is actively using it. It would be nice to 
> fail the format if there is an RM running and using this store. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-05-19 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550854#comment-14550854
 ] 

Rohith commented on YARN-3543:
--

About the -1's from QA:
# Findbugs: YARN-3677 already exists to track the issue.
# Checkstyle: the error is that the number of parameters exceeds 7, which I think needs 
to be ignored. I am not sure whether it should be added to an ignore file or just 
ignored.
# Regarding the test failures, I suspect the test machines; many tests are failing:
## Type-1: Address already in use exceptions.
## Type-2: NoSuchMethodError
## Type-3: ClassCastException and many others

I am fairly suspicious of the order of compilation and test execution. Probably, when 
running the resourcemanager tests, the modified classes in yarn-api/yarn-common are not 
being included, so the NoSuchMethodError is thrown.

> ApplicationReport should be able to tell whether the Application is AM 
> managed or not. 
> ---
>
> Key: YARN-3543
> URL: https://issues.apache.org/jira/browse/YARN-3543
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api
>Affects Versions: 2.6.0
>Reporter: Spandan Dutta
>Assignee: Rohith
>  Labels: BB2015-05-TBR
> Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
> 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
> 0003-YARN-3543.patch, 0004-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG
>
>
> Currently we can know whether the application submitted by the user is AM 
> managed from the applicationSubmissionContext. This can be only done  at the 
> time when the user submits the job. We should have access to this info from 
> the ApplicationReport as well so that we can check whether an app is AM 
> managed or not anytime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-05-19 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3543:
-
Attachment: 0004-YARN-3543.patch

> ApplicationReport should be able to tell whether the Application is AM 
> managed or not. 
> ---
>
> Key: YARN-3543
> URL: https://issues.apache.org/jira/browse/YARN-3543
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api
>Affects Versions: 2.6.0
>Reporter: Spandan Dutta
>Assignee: Rohith
>  Labels: BB2015-05-TBR
> Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
> 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
> 0003-YARN-3543.patch, 0004-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG
>
>
> Currently we can know whether the application submitted by the user is AM 
> managed from the applicationSubmissionContext. This can be only done  at the 
> time when the user submits the job. We should have access to this info from 
> the ApplicationReport as well so that we can check whether an app is AM 
> managed or not anytime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-19 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550258#comment-14550258
 ] 

Rohith commented on YARN-3646:
--

And I verified this on a one-node cluster by enabling and disabling the retry-forever 
policy.

> Applications are getting stuck some times in case of retry policy forever
> -
>
> Key: YARN-3646
> URL: https://issues.apache.org/jira/browse/YARN-3646
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Reporter: Raju Bairishetti
> Attachments: YARN-3646.patch
>
>
> We have set  *yarn.resourcemanager.connect.wait-ms* to -1 to use  FOREVER 
> retry policy.
> Yarn client is infinitely retrying in case of exceptions from the RM as it is 
> using retrying policy as FOREVER. The problem is it is retrying for all kinds 
> of exceptions (like ApplicationNotFoundException), even though it is not a 
> connection failure. Due to this my application is not progressing further.
> *Yarn client should not retry infinitely in case of non connection failures.*
> We have written a simple yarn-client which is trying to get an application 
> report for an invalid  or older appId. ResourceManager is throwing an 
> ApplicationNotFoundException as this is an invalid or older appId.  But 
> because of retry policy FOREVER, client is keep on retrying for getting the 
> application report and ResourceManager is throwing 
> ApplicationNotFoundException continuously.
> {code}
> private void testYarnClientRetryPolicy() throws  Exception{
> YarnConfiguration conf = new YarnConfiguration();
> conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, 
> -1);
> YarnClient yarnClient = YarnClient.createYarnClient();
> yarnClient.init(conf);
> yarnClient.start();
> ApplicationId appId = ApplicationId.newInstance(1430126768987L, 
> 10645);
> ApplicationReport report = yarnClient.getApplicationReport(appId);
> }
> {code}
> *RM logs:*
> {noformat}
> 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from 10.14.120.231:61621 Call#875162 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1430126768987_10645' doesn't exist in RM.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> 
> 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from 10.14.120.231:61621 Call#875163 Retry#0
> 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-19 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550256#comment-14550256
 ] 

Rohith commented on YARN-3646:
--

Thanks for working on this issue. The patch overall looks good to me.
nit: Can the test be moved to the YARN package, since the issue is in YARN? Otherwise, 
if there is any change in RMProxy, the test will not be run.

> Applications are getting stuck some times in case of retry policy forever
> -
>
> Key: YARN-3646
> URL: https://issues.apache.org/jira/browse/YARN-3646
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Reporter: Raju Bairishetti
> Attachments: YARN-3646.patch
>
>
> We have set  *yarn.resourcemanager.connect.wait-ms* to -1 to use  FOREVER 
> retry policy.
> Yarn client is infinitely retrying in case of exceptions from the RM as it is 
> using retrying policy as FOREVER. The problem is it is retrying for all kinds 
> of exceptions (like ApplicationNotFoundException), even though it is not a 
> connection failure. Due to this my application is not progressing further.
> *Yarn client should not retry infinitely in case of non connection failures.*
> We have written a simple yarn-client which is trying to get an application 
> report for an invalid  or older appId. ResourceManager is throwing an 
> ApplicationNotFoundException as this is an invalid or older appId.  But 
> because of retry policy FOREVER, client is keep on retrying for getting the 
> application report and ResourceManager is throwing 
> ApplicationNotFoundException continuously.
> {code}
> private void testYarnClientRetryPolicy() throws  Exception{
> YarnConfiguration conf = new YarnConfiguration();
> conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, 
> -1);
> YarnClient yarnClient = YarnClient.createYarnClient();
> yarnClient.init(conf);
> yarnClient.start();
> ApplicationId appId = ApplicationId.newInstance(1430126768987L, 
> 10645);
> ApplicationReport report = yarnClient.getApplicationReport(appId);
> }
> {code}
> *RM logs:*
> {noformat}
> 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from 10.14.120.231:61621 Call#875162 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1430126768987_10645' doesn't exist in RM.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> 
> 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from 10.14.120.231:61621 Call#875163 Retry#0
> 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-19 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550233#comment-14550233
 ] 

Rohith commented on YARN-3646:
--

bq. Seems we do not even require exceptionToPolicy for FOREVER policy if we 
catch the exception in shouldRetry method.
Makes sense to me; I will review the patch. Thanks.

> Applications are getting stuck some times in case of retry policy forever
> -
>
> Key: YARN-3646
> URL: https://issues.apache.org/jira/browse/YARN-3646
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Reporter: Raju Bairishetti
> Attachments: YARN-3646.patch
>
>
> We have set  *yarn.resourcemanager.connect.wait-ms* to -1 to use  FOREVER 
> retry policy.
> Yarn client is infinitely retrying in case of exceptions from the RM as it is 
> using retrying policy as FOREVER. The problem is it is retrying for all kinds 
> of exceptions (like ApplicationNotFoundException), even though it is not a 
> connection failure. Due to this my application is not progressing further.
> *Yarn client should not retry infinitely in case of non connection failures.*
> We have written a simple yarn-client which is trying to get an application 
> report for an invalid  or older appId. ResourceManager is throwing an 
> ApplicationNotFoundException as this is an invalid or older appId.  But 
> because of retry policy FOREVER, client is keep on retrying for getting the 
> application report and ResourceManager is throwing 
> ApplicationNotFoundException continuously.
> {code}
> private void testYarnClientRetryPolicy() throws  Exception{
> YarnConfiguration conf = new YarnConfiguration();
> conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, 
> -1);
> YarnClient yarnClient = YarnClient.createYarnClient();
> yarnClient.init(conf);
> yarnClient.start();
> ApplicationId appId = ApplicationId.newInstance(1430126768987L, 
> 10645);
> ApplicationReport report = yarnClient.getApplicationReport(appId);
> }
> {code}
> *RM logs:*
> {noformat}
> 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from 10.14.120.231:61621 Call#875162 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1430126768987_10645' doesn't exist in RM.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> 
> 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from 10.14.120.231:61621 Call#875163 Retry#0
> 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3674) YARN application disappears from view

2015-05-18 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14549928#comment-14549928
 ] 

Rohith commented on YARN-3674:
--

Is this dup of YARN-2238?

> YARN application disappears from view
> -
>
> Key: YARN-3674
> URL: https://issues.apache.org/jira/browse/YARN-3674
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.0
>Reporter: Sergey Shelukhin
>
> I have 2 tabs open at exact same URL with RUNNING applications view. There is 
> an application that is, in fact, running, that is visible in one tab but not 
> the other. This persists across refreshes. If I open new tab from the tab 
> where the application is not visible, in that tab it shows up ok.
> I didn't change scheduler/queue settings before this behavior happened; on 
> [~sseth]'s advice I went and tried to click the root node of the scheduler on 
> scheduler page; the app still does not become visible.
> Something got stuck somewhere...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-05-15 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546562#comment-14546562
 ] 

Rohith commented on YARN-3543:
--

bq. But doesn't impact compatibility?
I meant to say that ApplicationReport.newInstance() is called from outside of YARN, 
e.g. in MR, NotRunningJob#getUnknownApplicationReport. Similarly, if any other YARN 
clients are using ApplicationReport.newInstance, changing its signature would cause a 
compatibility issue. So I just provided setters and getters for UnmanagedApp.

> ApplicationReport should be able to tell whether the Application is AM 
> managed or not. 
> ---
>
> Key: YARN-3543
> URL: https://issues.apache.org/jira/browse/YARN-3543
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api
>Affects Versions: 2.6.0
>Reporter: Spandan Dutta
>Assignee: Rohith
>  Labels: BB2015-05-TBR
> Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
> 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
> 0003-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG
>
>
> Currently we can know whether the application submitted by the user is AM 
> managed from the applicationSubmissionContext. This can be only done  at the 
> time when the user submits the job. We should have access to this info from 
> the ApplicationReport as well so that we can check whether an app is AM 
> managed or not anytime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-05-15 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546560#comment-14546560
 ] 

Rohith commented on YARN-3543:
--

bq. ApplicationReport.newInstance() is Private, so you should simply update the 
existing method instead of adding a new one.
I understood your comment above to mean that, since it is private, the newInstance() 
method should not be modified. So I just added setter and getter methods in 
ApplicationReport. But doesn't that impact compatibility?

bq. app == null ? null : app.getUser()); What are these changes for?
This is for fixing a findbugs warning from the earlier Jenkins report. One thing I 
observed:
# when we {{return ApplicationReport.newInstance}} directly, it does not give a 
findbugs warning, but
# when we assign {{ApplicationReport.newInstance}} to a new variable and return that 
variable, it gives a findbugs warning. So I changed the null check as above.

bq. AppInfo.getUnmanagedAM() needs to be renamed too.
Agree

> ApplicationReport should be able to tell whether the Application is AM 
> managed or not. 
> ---
>
> Key: YARN-3543
> URL: https://issues.apache.org/jira/browse/YARN-3543
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api
>Affects Versions: 2.6.0
>Reporter: Spandan Dutta
>Assignee: Rohith
>  Labels: BB2015-05-TBR
> Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
> 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
> 0003-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG
>
>
> Currently we can know whether the application submitted by the user is AM 
> managed from the applicationSubmissionContext. This can be only done  at the 
> time when the user submits the job. We should have access to this info from 
> the ApplicationReport as well so that we can check whether an app is AM 
> managed or not anytime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.

2015-05-15 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545916#comment-14545916
 ] 

Rohith commented on YARN-3543:
--

Need to kick off Jenkins again to check whether the test failures are consistent.

> ApplicationReport should be able to tell whether the Application is AM 
> managed or not. 
> ---
>
> Key: YARN-3543
> URL: https://issues.apache.org/jira/browse/YARN-3543
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api
>Affects Versions: 2.6.0
>Reporter: Spandan Dutta
>Assignee: Rohith
>  Labels: BB2015-05-TBR
> Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 
> 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 
> 0003-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG
>
>
> Currently we can know whether the application submitted by the user is AM 
> managed from the applicationSubmissionContext. This can be only done  at the 
> time when the user submits the job. We should have access to this info from 
> the ApplicationReport as well so that we can check whether an app is AM 
> managed or not anytime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3642) Hadoop2 yarn.resourcemanager.scheduler.address not loaded by RMProxy.java

2015-05-15 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith resolved YARN-3642.
--
Resolution: Invalid

Closing as Invalid.

For any queries or basic environment problems, I suggest using the user mailing lists.

> Hadoop2 yarn.resourcemanager.scheduler.address not loaded by RMProxy.java
> -
>
> Key: YARN-3642
> URL: https://issues.apache.org/jira/browse/YARN-3642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: yarn-site.xml:
> 
>
>   yarn.nodemanager.aux-services
>   mapreduce_shuffle
>
>
>   yarn.nodemanager.aux-services.mapreduce.shuffle.class
>   org.apache.hadoop.mapred.ShuffleHandler
>
>
>   yarn.resourcemanager.hostname
>   qadoop-nn001.apsalar.com
>
>
>   yarn.resourcemanager.scheduler.address
>   qadoop-nn001.apsalar.com:8030
>
>
>   yarn.resourcemanager.address
>   qadoop-nn001.apsalar.com:8032
>
>
>   yarn.resourcemanager.webap.address
>   qadoop-nn001.apsalar.com:8088
>
>
>   yarn.resourcemanager.resource-tracker.address
>   qadoop-nn001.apsalar.com:8031
>
>
>   yarn.resourcemanager.admin.address
>   qadoop-nn001.apsalar.com:8033
>
>
>   yarn.log-aggregation-enable
>   true
>
>
>   Where to aggregate logs to.
>   yarn.nodemanager.remote-app-log-dir
>   /var/log/hadoop/apps
>
>
>   yarn.web-proxy.address
>   qadoop-nn001.apsalar.com:8088
>
> 
> core-site.xml:
> 
>
>   fs.defaultFS
>   hdfs://qadoop-nn001.apsalar.com
>
>
>   hadoop.proxyuser.hdfs.hosts
>   *
>
>
>   hadoop.proxyuser.hdfs.groups
>   *
>
> 
> hdfs-site.xml:
> 
>
>   dfs.replication
>   2
>
>
>   dfs.namenode.name.dir
>   file:/hadoop/nn
>
>
>   dfs.datanode.data.dir
>   file:/hadoop/dn/dfs
>
>
>   dfs.http.address
>   qadoop-nn001.apsalar.com:50070
>
>
>   dfs.secondary.http.address
>   qadoop-nn002.apsalar.com:50090
>
> 
> mapred-site.xml:
> 
> 
>   mapred.job.tracker 
>   qadoop-nn001.apsalar.com:8032 
>
>
>   mapreduce.framework.name
>   yarn
>
>
>   mapreduce.jobhistory.address
>   qadoop-nn001.apsalar.com:10020
>   the JobHistoryServer address.
>
>  
>   mapreduce.jobhistory.webapp.address  
>   qadoop-nn001.apsalar.com:19888  
>   the JobHistoryServer web address
>
> 
> hbase-site.xml:
> 
>  
> hbase.master 
> qadoop-nn001.apsalar.com:6 
>  
>  
> hbase.rootdir 
> hdfs://qadoop-nn001.apsalar.com:8020/hbase 
>  
>  
> hbase.cluster.distributed 
> true 
>  
> 
> hbase.zookeeper.property.dataDir
> /opt/local/zookeeper
>  
> 
> hbase.zookeeper.property.clientPort
> 2181 
> 
>  
> hbase.zookeeper.quorum 
> qadoop-nn001.apsalar.com 
>  
>  
> zookeeper.session.timeout 
> 18 
>  
> 
>Reporter: Lee Hounshell
>
> There is an issue with Hadoop 2.7.0 when in distributed operation the 
> datanode is unable to reach the yarn scheduler.  In our yarn-site.xml, we 
> have defined this path to be:
> {code}
>
>   yarn.resourcemanager.scheduler.address
>   qadoop-nn001.apsalar.com:8030
>
> {code}
> But when running an oozie job, the problem manifests when looking at the job 
> logs for the yarn container.
> We see logs similar to the following showing the connection problem:
> {quote}
> Showing 4096 bytes. Click here for full log
> [main] org.apache.hadoop.http.HttpServer2: Jetty bound to port 64065
> 2015-05-13 17:49:33,930 INFO [main] org.mortbay.log: jetty-6.1.26
> 2015-05-13 17:49:33,971 INFO [main] org.mortbay.log: Extract 
> jar:file:/opt/local/hadoop/hadoop-2.7.0/share/hadoop/yarn/hadoop-yarn-common-2.7.0.jar!/webapps/mapreduce
>  to /var/tmp/Jetty_0_0_0_0_64065_mapreduce.1ayyhk/webapp
> 2015-05-13 17:49:34,234 INFO [main] org.mortbay.log: Started 
> HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:64065
> 2015-05-13 17:49:34,234 INFO [main] org.apache.hadoop.yarn.webapp.WebApps: 
> Web app /mapreduce started at 64065
> 2015-05-13 17:49:34,645 INFO [main] org.apache.hadoop.yarn.webapp.WebApps: 
> Registered webapp guice modules
> 2015-05-13 17:49:34,651 INFO [main] org.apache.hadoop.ipc.CallQueueManager: 
> Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2015-05-13 17:49:34,652 INFO [Socket Reader #1 for port 38927] 
> org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 38927
> 2015-05-13 17:49:3

[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-14 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544988#comment-14544988
 ] 

Rohith commented on YARN-3646:
--

Setting RetryPolicies.RETRY_FOREVER as the default policy in exceptionToPolicyMap is 
not sufficient; {{RetryPolicies.RetryForever.shouldRetry()}} should also check for 
connect exceptions and handle them. Otherwise shouldRetry always returns the 
RetryAction.RETRY action.
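To illustrate the point, a rough sketch only (the class name and the exact exception 
list are mine, not the actual patch): retry only on connection-level failures and fail 
fast on everything else.
{code}
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.UnknownHostException;

import org.apache.hadoop.io.retry.RetryPolicy;

// Rough sketch: retry forever on connection-level failures, fail immediately on
// application-level exceptions such as ApplicationNotFoundException.
public class ConnectFailuresRetryForever implements RetryPolicy {
  @Override
  public RetryAction shouldRetry(Exception e, int retries, int failovers,
      boolean isIdempotentOrAtMostOnce) throws Exception {
    if (e instanceof ConnectException
        || e instanceof NoRouteToHostException
        || e instanceof UnknownHostException) {
      return RetryAction.RETRY;   // keep waiting for the RM to come back
    }
    return RetryAction.FAIL;      // non-connection failure: surface it to the caller
  }
}
{code}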

> Applications are getting stuck some times in case of retry policy forever
> -
>
> Key: YARN-3646
> URL: https://issues.apache.org/jira/browse/YARN-3646
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Reporter: Raju Bairishetti
>
> We have set  *yarn.resourcemanager.connect.wait-ms* to -1 to use  FOREVER 
> retry policy.
> Yarn client is infinitely retrying in case of exceptions from the RM as it is 
> using retrying policy as FOREVER. The problem is it is retrying for all kinds 
> of exceptions (like ApplicationNotFoundException), even though it is not a 
> connection failure. Due to this my application is not progressing further.
> *Yarn client should not retry infinitely in case of non connection failures.*
> We have written a simple yarn-client which is trying to get an application 
> report for an invalid  or older appId. ResourceManager is throwing an 
> ApplicationNotFoundException as this is an invalid or older appId.  But 
> because of retry policy FOREVER, client is keep on retrying for getting the 
> application report and ResourceManager is throwing 
> ApplicationNotFoundException continuously.
> {code}
> private void testYarnClientRetryPolicy() throws  Exception{
> YarnConfiguration conf = new YarnConfiguration();
> conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, 
> -1);
> YarnClient yarnClient = YarnClient.createYarnClient();
> yarnClient.init(conf);
> yarnClient.start();
> ApplicationId appId = ApplicationId.newInstance(1430126768987L, 
> 10645);
> ApplicationReport report = yarnClient.getApplicationReport(appId);
> }
> {code}
> *RM logs:*
> {noformat}
> 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from 10.14.120.231:61621 Call#875162 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1430126768987_10645' doesn't exist in RM.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> 
> 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from 10.14.120.231:61621 Call#875163 Retry#0
> 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-14 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544959#comment-14544959
 ] 

Rohith commented on YARN-3646:
--

I had copied *yarn.resourcemanager.connect.wait-ms* from the description, but the 
actual configuration is *yarn.resourcemanager.connect.max-wait.ms*.
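For reference, a rough sketch of how the constant used in the description maps to that 
property name:
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Rough sketch: the constant below resolves to "yarn.resourcemanager.connect.max-wait.ms",
// so either form requests the FOREVER retry policy.
YarnConfiguration conf = new YarnConfiguration();
conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, -1);
// equivalent: conf.setInt("yarn.resourcemanager.connect.max-wait.ms", -1);
{code}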

> Applications are getting stuck some times in case of retry policy forever
> -
>
> Key: YARN-3646
> URL: https://issues.apache.org/jira/browse/YARN-3646
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Reporter: Raju Bairishetti
>
> We have set  *yarn.resourcemanager.connect.wait-ms* to -1 to use  FOREVER 
> retry policy.
> Yarn client is infinitely retrying in case of exceptions from the RM as it is 
> using retrying policy as FOREVER. The problem is it is retrying for all kinds 
> of exceptions (like ApplicationNotFoundException), even though it is not a 
> connection failure. Due to this my application is not progressing further.
> *Yarn client should not retry infinitely in case of non connection failures.*
> We have written a simple yarn-client which is trying to get an application 
> report for an invalid  or older appId. ResourceManager is throwing an 
> ApplicationNotFoundException as this is an invalid or older appId.  But 
> because of retry policy FOREVER, client is keep on retrying for getting the 
> application report and ResourceManager is throwing 
> ApplicationNotFoundException continuously.
> {code}
> private void testYarnClientRetryPolicy() throws  Exception{
> YarnConfiguration conf = new YarnConfiguration();
> conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, 
> -1);
> YarnClient yarnClient = YarnClient.createYarnClient();
> yarnClient.init(conf);
> yarnClient.start();
> ApplicationId appId = ApplicationId.newInstance(1430126768987L, 
> 10645);
> ApplicationReport report = yarnClient.getApplicationReport(appId);
> }
> {code}
> *RM logs:*
> {noformat}
> 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from 10.14.120.231:61621 Call#875162 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1430126768987_10645' doesn't exist in RM.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> 
> 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from 10.14.120.231:61621 Call#875163 Retry#0
> 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-14 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544947#comment-14544947
 ] 

Rohith commented on YARN-3646:
--

RetryPolicies.RETRY_FOREVER should also use exceptionToPolicyMap.
[~raju.bairishetti] Feel free to take up this JIRA. 
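Something along these lines, as a rough sketch only (the chosen per-exception policy is 
illustrative, not the actual fix):
{code}
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;
import org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException;

// Rough sketch: wrap RETRY_FOREVER so that specific application-level exceptions
// fail immediately instead of being retried forever.
Map<Class<? extends Exception>, RetryPolicy> exceptionToPolicyMap =
    new HashMap<Class<? extends Exception>, RetryPolicy>();
exceptionToPolicyMap.put(ApplicationNotFoundException.class,
    RetryPolicies.TRY_ONCE_THEN_FAIL);
RetryPolicy policy = RetryPolicies.retryByException(
    RetryPolicies.RETRY_FOREVER, exceptionToPolicyMap);
{code}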

> Applications are getting stuck some times in case of retry policy forever
> -
>
> Key: YARN-3646
> URL: https://issues.apache.org/jira/browse/YARN-3646
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Reporter: Raju Bairishetti
>
> We have set  *yarn.resourcemanager.connect.wait-ms* to -1 to use  FOREVER 
> retry policy.
> Yarn client is infinitely retrying in case of exceptions from the RM as it is 
> using retrying policy as FOREVER. The problem is it is retrying for all kinds 
> of exceptions (like ApplicationNotFoundException), even though it is not a 
> connection failure. Due to this my application is not progressing further.
> *Yarn client should not retry infinitely in case of non connection failures.*
> We have written a simple yarn-client which is trying to get an application 
> report for an invalid  or older appId. ResourceManager is throwing an 
> ApplicationNotFoundException as this is an invalid or older appId.  But 
> because of retry policy FOREVER, client is keep on retrying for getting the 
> application report and ResourceManager is throwing 
> ApplicationNotFoundException continuously.
> {code}
> private void testYarnClientRetryPolicy() throws  Exception{
> YarnConfiguration conf = new YarnConfiguration();
> conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, 
> -1);
> YarnClient yarnClient = YarnClient.createYarnClient();
> yarnClient.init(conf);
> yarnClient.start();
> ApplicationId appId = ApplicationId.newInstance(1430126768987L, 
> 10645);
> ApplicationReport report = yarnClient.getApplicationReport(appId);
> }
> {code}
> *RM logs:*
> {noformat}
> 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from 10.14.120.231:61621 Call#875162 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1430126768987_10645' doesn't exist in RM.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> 
> 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from 10.14.120.231:61621 Call#875163 Retry#0
> 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-14 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544938#comment-14544938
 ] 

Rohith commented on YARN-3646:
--

Thanks for the explanation. I hit the problem on my machines too. Last time when I 
tested, the configuration settings had an issue.

> Applications are getting stuck some times in case of retry policy forever
> -
>
> Key: YARN-3646
> URL: https://issues.apache.org/jira/browse/YARN-3646
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Reporter: Raju Bairishetti
>
> We have set  *yarn.resourcemanager.connect.wait-ms* to -1 to use  FOREVER 
> retry policy.
> Yarn client is infinitely retrying in case of exceptions from the RM as it is 
> using retrying policy as FOREVER. The problem is it is retrying for all kinds 
> of exceptions (like ApplicationNotFoundException), even though it is not a 
> connection failure. Due to this my application is not progressing further.
> *Yarn client should not retry infinitely in case of non connection failures.*
> We have written a simple yarn-client which is trying to get an application 
> report for an invalid  or older appId. ResourceManager is throwing an 
> ApplicationNotFoundException as this is an invalid or older appId.  But 
> because of retry policy FOREVER, client is keep on retrying for getting the 
> application report and ResourceManager is throwing 
> ApplicationNotFoundException continuously.
> {code}
> private void testYarnClientRetryPolicy() throws  Exception{
> YarnConfiguration conf = new YarnConfiguration();
> conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, 
> -1);
> YarnClient yarnClient = YarnClient.createYarnClient();
> yarnClient.init(conf);
> yarnClient.start();
> ApplicationId appId = ApplicationId.newInstance(1430126768987L, 
> 10645);
> ApplicationReport report = yarnClient.getApplicationReport(appId);
> }
> {code}
> *RM logs:*
> {noformat}
> 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from 10.14.120.231:61621 Call#875162 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1430126768987_10645' doesn't exist in RM.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> 
> 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from 10.14.120.231:61621 Call#875163 Retry#0
> 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3642) Hadoop2 yarn.resourcemanager.scheduler.address not loaded by RMProxy.java

2015-05-14 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544920#comment-14544920
 ] 

Rohith commented on YARN-3642:
--

How many NodeManagers are running? If it is more than 1, then I think what happened in 
your case is that yarn-site.xml was never read by the client (i.e. the Oozie job), but 
you were still able to submit the job because you were probably submitting it from the 
local machine where the RM is running. So with the default port the job can be 
submitted, but when the ApplicationMaster is launched, it is launched on a different 
machine where a NodeManager is running. Since the scheduler address is not loaded from 
any configuration, the AM tries to connect to the default address, i.e. 0.0.0.0:8030, 
which never connects.

I suggest you make sure your yarn-site.xml is loaded into the classpath before 
submitting the job, so that the AM gets yarn.resourcemanager.scheduler.address and can 
connect to the RM. The other way is to explicitly set 
yarn.resourcemanager.scheduler.address from the job client, as in the sketch below.
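For example, something like the following from the job client (a rough sketch; the 
hostname is taken from your yarn-site.xml above):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Rough sketch: set the scheduler address explicitly so the AM does not fall back
// to the default 0.0.0.0:8030.
Configuration conf = new Configuration();
conf.set("yarn.resourcemanager.scheduler.address", "qadoop-nn001.apsalar.com:8030");
Job job = Job.getInstance(conf, "example-job");
{code}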

> Hadoop2 yarn.resourcemanager.scheduler.address not loaded by RMProxy.java
> -
>
> Key: YARN-3642
> URL: https://issues.apache.org/jira/browse/YARN-3642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
> Environment: yarn-site.xml:
> 
>
>   yarn.nodemanager.aux-services
>   mapreduce_shuffle
>
>
>   yarn.nodemanager.aux-services.mapreduce.shuffle.class
>   org.apache.hadoop.mapred.ShuffleHandler
>
>
>   yarn.resourcemanager.hostname
>   qadoop-nn001.apsalar.com
>
>
>   yarn.resourcemanager.scheduler.address
>   qadoop-nn001.apsalar.com:8030
>
>
>   yarn.resourcemanager.address
>   qadoop-nn001.apsalar.com:8032
>
>
>   yarn.resourcemanager.webap.address
>   qadoop-nn001.apsalar.com:8088
>
>
>   yarn.resourcemanager.resource-tracker.address
>   qadoop-nn001.apsalar.com:8031
>
>
>   yarn.resourcemanager.admin.address
>   qadoop-nn001.apsalar.com:8033
>
>
>   yarn.log-aggregation-enable
>   true
>
>
>   Where to aggregate logs to.
>   yarn.nodemanager.remote-app-log-dir
>   /var/log/hadoop/apps
>
>
>   yarn.web-proxy.address
>   qadoop-nn001.apsalar.com:8088
>
> 
> core-site.xml:
> 
>
>   fs.defaultFS
>   hdfs://qadoop-nn001.apsalar.com
>
>
>   hadoop.proxyuser.hdfs.hosts
>   *
>
>
>   hadoop.proxyuser.hdfs.groups
>   *
>
> 
> hdfs-site.xml:
> 
>
>   dfs.replication
>   2
>
>
>   dfs.namenode.name.dir
>   file:/hadoop/nn
>
>
>   dfs.datanode.data.dir
>   file:/hadoop/dn/dfs
>
>
>   dfs.http.address
>   qadoop-nn001.apsalar.com:50070
>
>
>   dfs.secondary.http.address
>   qadoop-nn002.apsalar.com:50090
>
> 
> mapred-site.xml:
> 
> 
>   mapred.job.tracker 
>   qadoop-nn001.apsalar.com:8032 
>
>
>   mapreduce.framework.name
>   yarn
>
>
>   mapreduce.jobhistory.address
>   qadoop-nn001.apsalar.com:10020
>   the JobHistoryServer address.
>
>  
>   mapreduce.jobhistory.webapp.address  
>   qadoop-nn001.apsalar.com:19888  
>   the JobHistoryServer web address
>
> 
> hbase-site.xml:
> 
>  
> hbase.master 
> qadoop-nn001.apsalar.com:6 
>  
>  
> hbase.rootdir 
> hdfs://qadoop-nn001.apsalar.com:8020/hbase 
>  
>  
> hbase.cluster.distributed 
> true 
>  
> 
> hbase.zookeeper.property.dataDir
> /opt/local/zookeeper
>  
> 
> hbase.zookeeper.property.clientPort
> 2181 
> 
>  
> hbase.zookeeper.quorum 
> qadoop-nn001.apsalar.com 
>  
>  
> zookeeper.session.timeout 
> 18 
>  
> 
>Reporter: Lee Hounshell
>
> There is an issue with Hadoop 2.7.0 when in distributed operation the 
> datanode is unable to reach the yarn scheduler.  In our yarn-site.xml, we 
> have defined this path to be:
> {code}
>
>   yarn.resourcemanager.scheduler.address
>   qadoop-nn001.apsalar.com:8030
>
> {code}
> But when running an oozie job, the problem manifests when looking at the job 
> logs for the yarn container.
> We see logs similar to the following showing the connection problem:
> {quote}
> Showing 4096 bytes. Click here for full log
> [main] org.apache.hadoop.http.HttpServer2: Jetty bound to port 64065
> 2015-05-13 17:49:33,930 INFO [main] org.mortbay.log: jetty-6.1.26
> 2015-05-13 17:49:33,971 INFO [main] org.mortbay.log: Extract 
> jar:file:/opt/local/hadoop/hadoop-2.7.0/share/hadoop/yarn/hadoop-yarn-common-2.

[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-14 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543776#comment-14543776
 ] 

Rohith commented on YARN-3646:
--

Which version of Hadoop are you using? I don't see this problem in trunk or 
branch-2.

> Applications are getting stuck some times in case of retry policy forever
> -
>
> Key: YARN-3646
> URL: https://issues.apache.org/jira/browse/YARN-3646
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Reporter: Raju Bairishetti
>
> We have set  *yarn.resourcemanager.connect.wait-ms* to -1 to use  FOREVER 
> retry policy.
> Yarn client is infinitely retrying in case of exceptions from the RM as it is 
> using retrying policy as FOREVER. The problem is it is retrying for all kinds 
> of exceptions (like ApplicationNotFoundException), even though it is not a 
> connection failure. Due to this my application is not progressing further.
> *Yarn client should not retry infinitely in case of non connection failures.*
> We have written a simple yarn-client which is trying to get an application 
> report for an invalid  or older appId. ResourceManager is throwing an 
> ApplicationNotFoundException as this is an invalid or older appId.  But 
> because of retry policy FOREVER, client is keep on retrying for getting the 
> application report and ResourceManager is throwing 
> ApplicationNotFoundException continuously.
> {code}
> private void testYarnClientRetryPolicy() throws  Exception{
> YarnConfiguration conf = new YarnConfiguration();
> conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, 
> -1);
> YarnClient yarnClient = YarnClient.createYarnClient();
> yarnClient.init(conf);
> yarnClient.start();
> ApplicationId appId = ApplicationId.newInstance(1430126768987L, 
> 10645);
> ApplicationReport report = yarnClient.getApplicationReport(appId);
> }
> {code}
> *RM logs:*
> {noformat}
> 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from 10.14.120.231:61621 Call#875162 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1430126768987_10645' doesn't exist in RM.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> 
> 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from 10.14.120.231:61621 Call#875163 Retry#0
> 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

