[jira] [Commented] (YARN-2305) When a container is in reserved state then total cluster memory is displayed wrongly.
[ https://issues.apache.org/jira/browse/YARN-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589294#comment-14589294 ] Rohith commented on YARN-2305: -- Updated the duplicated id link. > When a container is in reserved state then total cluster memory is displayed > wrongly. > - > > Key: YARN-2305 > URL: https://issues.apache.org/jira/browse/YARN-2305 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.1 >Reporter: J.Andreina >Assignee: Sunil G > Attachments: Capture.jpg > > > ENV Details: > = > 3 queues : a(50%),b(25%),c(25%) ---> All max utilization is set to > 100 > 2 Node cluster with total memory as 16GB > TestSteps: > = > Execute the following 3 jobs with different memory configurations for > Map, reducer and AM tasks > ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=a > -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 > -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=2048 > /dir8 /preempt_85 (application_1405414066690_0023) > ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=b > -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 > -Dyarn.app.mapreduce.am.resource.mb=2048 -Dmapreduce.reduce.memory.mb=2048 > /dir2 /preempt_86 (application_1405414066690_0025) > > ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=c > -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=1024 > -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=1024 > /dir2 /preempt_62 > Issue > = > When 2GB of memory is in reserved state, total memory is shown as > 15GB and used as 15GB (while total memory is 16GB) > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
[ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587519#comment-14587519 ] Rohith commented on YARN-3809: -- This is an interesting scenario, but I am not sure why the thread pool size is set to 10 and is not configurable. bq. the default RPC time out is 15 mins.. I see the RPC timeout is 1 minute; am I missing anything? {code} static final int DEFAULT_COMMAND_TIMEOUT = 60000; ... int expireIntvl = conf.getInt(NM_COMMAND_TIMEOUT, DEFAULT_COMMAND_TIMEOUT); proxy = (ContainerManagementProtocolPB) RPC.getProxy(ContainerManagementProtocolPB.class, clientVersion, addr, ugi, conf, NetUtils.getDefaultSocketFactory(conf), expireIntvl); {code} > Failed to launch new attempts because ApplicationMasterLauncher's threads all > hang > -- > > Key: YARN-3809 > URL: https://issues.apache.org/jira/browse/YARN-3809 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Jun Gong >Assignee: Jun Gong > > ApplicationMasterLauncher creates a thread pool whose size is 10 to deal with > AMLauncherEventType (LAUNCH and CLEANUP). > In our cluster, there were many NMs with 10+ AMs running on them, and one shut > down for some reason. After the RM found the NM LOST, it cleaned up the AMs running > on it. Then ApplicationMasterLauncher needed to handle these 10+ CLEANUP events. > ApplicationMasterLauncher's thread pool would be filled up, and the threads would all hang > in the code containerMgrProxy.stopContainers(stopRequest) because the NM was > down and the default RPC timeout is 15 mins. It means that for 15 mins > ApplicationMasterLauncher could not handle new events such as LAUNCH, so new > attempts would fail to launch because of the timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
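[Editor's note] To make the failure mode above concrete, here is a minimal, self-contained Java sketch (not the actual ApplicationMasterLauncher code; the pool size matches the hard-coded 10, but the sleep durations are illustrative assumptions) showing how a fixed pool whose workers all block on a slow RPC leaves a later LAUNCH task stuck in the queue:
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LauncherPoolDemo {
  public static void main(String[] args) {
    // Fixed pool of 10, mirroring ApplicationMasterLauncher's hard-coded size.
    ExecutorService pool = Executors.newFixedThreadPool(10);

    // Ten CLEANUP tasks block on an unreachable NM; the long sleep stands in
    // for stopContainers() waiting out the RPC timeout.
    for (int i = 0; i < 10; i++) {
      pool.execute(() -> {
        try {
          TimeUnit.MINUTES.sleep(15);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }

    // A LAUNCH task submitted now just waits in the queue: no worker is free
    // until an RPC times out, so new application attempts cannot be launched.
    pool.execute(() -> System.out.println("LAUNCH handled"));
    System.out.println("LAUNCH event queued behind blocked CLEANUP tasks");
    pool.shutdownNow(); // interrupt the demo workers so the JVM can exit
  }
}
{code}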
[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587412#comment-14587412 ] Rohith commented on YARN-3789: -- Looks good to me too.. > Refactor logs for LeafQueue#activateApplications() to remove duplicate logging > -- > > Key: YARN-3789 > URL: https://issues.apache.org/jira/browse/YARN-3789 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Minor > Attachments: 0001-YARN-3789.patch, 0002-YARN-3789.patch, > 0003-YARN-3789.patch, 0004-YARN-3789.patch, 0005-YARN-3789.patch > > > Duplicate logging from resource manager > during am limit check for each application > {code} > 015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585664#comment-14585664 ] Rohith commented on YARN-3789: -- +1(non-binding) > Refactor logs for LeafQueue#activateApplications() to remove duplicate logging > -- > > Key: YARN-3789 > URL: https://issues.apache.org/jira/browse/YARN-3789 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Minor > Attachments: 0001-YARN-3789.patch, 0002-YARN-3789.patch, > 0003-YARN-3789.patch, 0004-YARN-3789.patch > > > Duplicate logging from resource manager > during am limit check for each application > {code} > 015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585450#comment-14585450 ] Rohith commented on YARN-3790: -- Thanks @zhihai for your detailed explanation.. I got the problem :-) Overall the patch looks good to me. I think we should change this JIRA's component to scheduler since the code change is in FairScheduler > TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in > trunk for FS scheduler > > > Key: YARN-3790 > URL: https://issues.apache.org/jira/browse/YARN-3790 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith >Assignee: zhihai xu > Attachments: YARN-3790.000.patch > > > Failure trace is as follows > {noformat} > Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart > testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) > Time elapsed: 6.502 sec <<< FAILURE! > java.lang.AssertionError: expected:<6144> but was:<8192> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-1382) NodeListManager has a memory leak, unusableRMNodesConcurrentSet is never purged
[ https://issues.apache.org/jira/browse/YARN-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith reassigned YARN-1382: Assignee: Rohith > NodeListManager has a memory leak, unusableRMNodesConcurrentSet is never > purged > --- > > Key: YARN-1382 > URL: https://issues.apache.org/jira/browse/YARN-1382 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.2.0 >Reporter: Alejandro Abdelnur >Assignee: Rohith > > If a node is in the unusable nodes set (unusableRMNodesConcurrentSet) and > never comes back, the node will be there forever. > While the leak is not big, it gets aggravated if the NM addresses are > configured with ephemeral ports as when the nodes come back they come back as > new. > Some related details in YARN-1343 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585431#comment-14585431 ] Rohith commented on YARN-3543: -- Thanks [~xgong] for the review.. bq. Could we not directly change the ApplicationReport.newInstance() ? This will break other applications, such as Tez. IIUC, ApplicationReport#newInstance() is annotated @Private, so other clients should not be able to use it. And in the earlier patch I had added a new method which does not break compatibility, but [~vinodkv] suggested that I not change this API in his review comment [link|https://issues.apache.org/jira/browse/YARN-3543?focusedCommentId=14533819&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14533819] > ApplicationReport should be able to tell whether the Application is AM > managed or not. > --- > > Key: YARN-3543 > URL: https://issues.apache.org/jira/browse/YARN-3543 > Project: Hadoop YARN > Issue Type: Improvement > Components: api >Affects Versions: 2.6.0 >Reporter: Spandan Dutta >Assignee: Rohith > Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, > 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, > 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, > YARN-3543-AH.PNG, YARN-3543-RM.PNG > > > Currently we can know whether the application submitted by the user is AM > managed from the applicationSubmissionContext. This can only be done at the > time the user submits the job. We should have access to this info from > the ApplicationReport as well so that we can check whether an app is AM > managed or not at any time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580255#comment-14580255 ] Rohith commented on YARN-3790: -- Thanks for looking into this issue. bq. If UpdateThread call update after recoverContainersOnNode, the test will succeed In the test, I see the below code which verifies that containers are recovered, right? {code} // Wait for RM to settle down on recovering containers; waitForNumContainersToRecover(2, rm2, am1.getApplicationAttemptId()); Set<ContainerId> launchedContainers = ((RMNodeImpl) rm2.getRMContext().getRMNodes().get(nm1.getNodeId())) .getLaunchedContainers(); assertTrue(launchedContainers.contains(amContainer.getContainerId())); assertTrue(launchedContainers.contains(runningContainer.getContainerId())); {code} Am I missing anything? > TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in > trunk for FS scheduler > > > Key: YARN-3790 > URL: https://issues.apache.org/jira/browse/YARN-3790 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith >Assignee: zhihai xu > > Failure trace is as follows > {noformat} > Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart > testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) > Time elapsed: 6.502 sec <<< FAILURE! > java.lang.AssertionError: expected:<6144> but was:<8192> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
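[Editor's note] Since the failing assertion compares queue metrics that FairScheduler's UpdateThread refreshes asynchronously, one hedged way to de-flake a test like this (a sketch, not the committed fix) is to poll the metric until it settles before asserting. The 6144 target value comes from the failure trace; the helper name is hypothetical:
{code}
import static org.junit.Assert.assertEquals;

import org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics;

public class MetricsWaitSketch {
  // Polls until available memory reaches the expected value or we time out,
  // then asserts, so the check no longer races FairScheduler's UpdateThread.
  static void waitForAvailableMB(QueueMetrics metrics, long expectedMB,
      long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (metrics.getAvailableMB() != expectedMB
        && System.currentTimeMillis() < deadline) {
      Thread.sleep(100); // give the UpdateThread a chance to run
    }
    assertEquals(expectedMB, metrics.getAvailableMB());
  }
}
{code}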
[jira] [Updated] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3790: - Summary: TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler (was: TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS scheduler) > TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in > trunk for FS scheduler > > > Key: YARN-3790 > URL: https://issues.apache.org/jira/browse/YARN-3790 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith >Assignee: zhihai xu > > Failure trace is as follows > {noformat} > Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart > testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) > Time elapsed: 6.502 sec <<< FAILURE! > java.lang.AssertionError: expected:<6144> but was:<8192> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS scheduler
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580228#comment-14580228 ] Rohith commented on YARN-3790: -- bq. I think this test fails intermittently. Yes, it is failing intermittently. Maybe the issue summary can be updated. > TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS > scheduler > - > > Key: YARN-3790 > URL: https://issues.apache.org/jira/browse/YARN-3790 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Rohith >Assignee: zhihai xu > > Failure trace is as follows > {noformat} > Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart > testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) > Time elapsed: 6.502 sec <<< FAILURE! > java.lang.AssertionError: expected:<6144> but was:<8192> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS scheduler
Rohith created YARN-3790: Summary: TestWorkPreservingRMRestart#testSchedulerRecovery fails in trunk for FS scheduler Key: YARN-3790 URL: https://issues.apache.org/jira/browse/YARN-3790 Project: Hadoop YARN Issue Type: Bug Reporter: Rohith Failure trace is as follows {noformat} Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) Time elapsed: 6.502 sec <<< FAILURE! java.lang.AssertionError: expected:<6144> but was:<8192> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579198#comment-14579198 ] Rohith commented on YARN-3789: -- I think that, instead of *Not starting*, *Not activating the application* would be more meaningful. > Refactor logs for LeafQueue#activateApplications() to remove duplicate logging > -- > > Key: YARN-3789 > URL: https://issues.apache.org/jira/browse/YARN-3789 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Minor > Attachments: 0001-YARN-3789.patch > > > Duplicate logging from resource manager > during am limit check for each application > {code} > 015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3788) Application Master and Task Tracker timeouts are applied incorrectly
[ https://issues.apache.org/jira/browse/YARN-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579188#comment-14579188 ] Rohith commented on YARN-3788: -- This is a MapReduce project issue/query; moving to MR for further discussion. > Application Master and Task Tracker timeouts are applied incorrectly > > > Key: YARN-3788 > URL: https://issues.apache.org/jira/browse/YARN-3788 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.1 >Reporter: Dmitry Sivachenko > > I am running a streaming job which requires a big (~50GB) data file to run > (the file is attached via hadoop jar <...> -file BigFile.dat). > Most likely this command will fail as follows (note that the error message is > rather meaningless): > 2015-05-27 15:55:00,754 WARN [main] streaming.StreamJob > (StreamJob.java:parseArgv(291)) - -file option is deprecated, please use > generic option -files instead. > packageJobJar: [/ssd/mt/lm/en_reorder.ylm, mapper.py, > /tmp/hadoop-mitya/hadoop-unjar3778165585140840383/] [] > /var/tmp/streamjob633547925483233845.jar tmpDir=null > 2015-05-27 19:46:22,942 INFO [main] client.RMProxy > (RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager at > nezabudka1-00.yandex.ru/5.255.231.129:8032 > 2015-05-27 19:46:23,733 INFO [main] client.RMProxy > (RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager at > nezabudka1-00.yandex.ru/5.255.231.129:8032 > 2015-05-27 20:13:37,231 INFO [main] mapred.FileInputFormat > (FileInputFormat.java:listStatus(247)) - Total input paths to process : 1 > 2015-05-27 20:13:38,110 INFO [main] mapreduce.JobSubmitter > (JobSubmitter.java:submitJobInternal(396)) - number of splits:1 > 2015-05-27 20:13:38,136 INFO [main] Configuration.deprecation > (Configuration.java:warnOnceIfDeprecated(1009)) - mapred.reduce.tasks is > deprecated. Instead, use mapreduce.job.reduces > 2015-05-27 20:13:38,390 INFO [main] mapreduce.JobSubmitter > (JobSubmitter.java:printTokens(479)) - Submitting tokens for job: > job_1431704916575_2531 > 2015-05-27 20:13:38,689 INFO [main] impl.YarnClientImpl > (YarnClientImpl.java:submitApplication(204)) - Submitted application > application_1431704916575_2531 > 2015-05-27 20:13:38,743 INFO [main] mapreduce.Job (Job.java:submit(1289)) - > The url to track the job: > http://nezabudka1-00.yandex.ru:8088/proxy/application_1431704916575_2531/ > 2015-05-27 20:13:38,746 INFO [main] mapreduce.Job > (Job.java:monitorAndPrintJob(1334)) - Running job: job_1431704916575_2531 > 2015-05-27 21:04:12,353 INFO [main] mapreduce.Job > (Job.java:monitorAndPrintJob(1355)) - Job job_1431704916575_2531 running in > uber mode : false > 2015-05-27 21:04:12,356 INFO [main] mapreduce.Job > (Job.java:monitorAndPrintJob(1362)) - map 0% reduce 0% > 2015-05-27 21:04:12,374 INFO [main] mapreduce.Job > (Job.java:monitorAndPrintJob(1375)) - Job job_1431704916575_2531 failed with > state FAILED due to: Application application_1431704916575_2531 failed 2 > times due to ApplicationMaster for attempt > appattempt_1431704916575_2531_02 timed out. Failing the application. > 2015-05-27 21:04:12,473 INFO [main] mapreduce.Job > (Job.java:monitorAndPrintJob(1380)) - Counters: 0 > 2015-05-27 21:04:12,474 ERROR [main] streaming.StreamJob > (StreamJob.java:submitAndMonitorJob(1019)) - Job not Successful! > Streaming Command Failed! > This is because the yarn.am.liveness-monitor.expiry-interval-ms (defaults to 600 > sec) timeout expires before the large data file is transferred. > As a next step I increase yarn.am.liveness-monitor.expiry-interval-ms. After that the > application is successfully initialized and tasks are spawned. > But I encounter another error: the default 600-second mapreduce.task.timeout > expires before the tasks are initialized, and the tasks fail. > The error message "Task attempt_XXX failed to report status for 600 seconds" is > also misleading: this timeout is supposed to kill non-responsive (stuck) > tasks, but it rather strikes because auxiliary data files are copying slowly. > So I need to increase mapreduce.task.timeout too, and only after that is my job > successful. > At the very least, the error messages need to be tweaked to indicate that the > Application (or Task) is failing because auxiliary files were not copied > within that time, not just a generic "timeout expired". > A better solution would be to not count time spent on data file > distribution. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
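[Editor's note] For reference, a hedged sketch of raising the two timeouts discussed above programmatically (the 30-minute values are illustrative assumptions; in a real deployment these properties would normally go in yarn-site.xml and mapred-site.xml):
{code}
import org.apache.hadoop.conf.Configuration;

public class TimeoutTuningSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // AM liveness timeout: allow longer than the transfer time of the ~50GB
    // auxiliary file so the attempt is not declared dead during distribution.
    conf.setLong("yarn.am.liveness-monitor.expiry-interval-ms", 30 * 60 * 1000L);
    // Task progress-report timeout, raised for the same reason.
    conf.setLong("mapreduce.task.timeout", 30 * 60 * 1000L);
    System.out.println("AM expiry interval (ms): "
        + conf.getLong("yarn.am.liveness-monitor.expiry-interval-ms", 600000L));
  }
}
{code}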
[jira] [Commented] (YARN-3789) Refactor logs for LeafQueue#activateApplications() to remove duplicate logging
[ https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579184#comment-14579184 ] Rohith commented on YARN-3789: -- Thanks [~bibinchundatt] for reporting this and providing a patch. Some comments: # The log message can be made clearer for log analysis. The messages can be like ## Not starting the application as usedAMResource < amIfStarted > exceeds AMResourceLimit ## Not starting the application for the user as usedUserAMResource < userAmIfStarted > exceeds userAMResourceLimit < userAMLimit > # Can you update the issue summary and description to reflect the real problem, i.e. the issue is log message correction, not removing duplicate logging? > Refactor logs for LeafQueue#activateApplications() to remove duplicate logging > -- > > Key: YARN-3789 > URL: https://issues.apache.org/jira/browse/YARN-3789 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Minor > Attachments: 0001-YARN-3789.patch > > > Duplicate logging from resource manager > during am limit check for each application > {code} > 015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > 2015-06-09 17:32:40,019 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > not starting application as amIfStarted exceeds amLimit > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
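[Editor's note] A minimal sketch of the clearer, value-bearing log line suggested in the comment above (the helper and its parameters are illustrative assumptions, not the actual LeafQueue fields):
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.Resource;

public class ActivationLogSketch {
  private static final Log LOG = LogFactory.getLog(ActivationLogSketch.class);

  // Emits one value-bearing line per skipped application so repeated entries
  // are distinguishable in the RM log during analysis.
  static void logNotActivated(ApplicationId appId, Resource amIfStarted,
      Resource amLimit) {
    LOG.info("Not activating application " + appId + " as amIfStarted="
        + amIfStarted + " exceeds amLimit=" + amLimit);
  }
}
{code}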
[jira] [Commented] (YARN-3697) FairScheduler: ContinuousSchedulingThread can't be shutdown after stop sometimes.
[ https://issues.apache.org/jira/browse/YARN-3697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578677#comment-14578677 ] Rohith commented on YARN-3697: -- Hi [~zxu], trying to understand the problem: does it occur when RM shutdown is called, which tries to stop the FS service? Does it cause the RM to hang during shutdown? > FairScheduler: ContinuousSchedulingThread can't be shutdown after stop > sometimes. > -- > > Key: YARN-3697 > URL: https://issues.apache.org/jira/browse/YARN-3697 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.7.0 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3697.000.patch > > > FairScheduler: ContinuousSchedulingThread can't be shutdown after stop > sometimes. > The reason is because the InterruptedException is blocked in > continuousSchedulingAttempt > {code} > try { > if (node != null && Resources.fitsIn(minimumAllocation, > node.getAvailableResource())) { > attemptScheduling(node); > } > } catch (Throwable ex) { > LOG.error("Error while attempting scheduling for node " + node + > ": " + ex.toString(), ex); > } > {code} > I saw the following exception after stop: > {code} > 2015-05-17 23:30:43,065 WARN [FairSchedulerContinuousScheduling] > event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher > thread interrupted > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:462) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:387) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:58) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.allocate(FSAppAttempt.java:357) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:516) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:649) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:803) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:334) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:173) > at >
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1082) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1014) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:285) > 2015-05-17 23:30:43,066 ERROR [FairSchedulerContinuousScheduling] > fair.FairScheduler (FairScheduler.java:continuousSchedulingAttempt(1017)) - > Error while attempting scheduling for node host: 127.0.0.2:2 #containers=1 > available= used=: > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.lang.InterruptedException > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.lang.InterruptedException > at > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:249) > at > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerS
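[Editor's note] The root cause above — a blanket catch (Throwable) swallowing the InterruptedException (here wrapped in a YarnRuntimeException), so the thread never notices it was interrupted — can be illustrated with a minimal generic sketch (an illustration of the interrupt-restoration idea, not the actual YARN-3697 patch):
{code}
public class InterruptibleLoopSketch {
  public static void main(String[] args) throws Exception {
    Thread scheduler = new Thread(() -> {
      while (!Thread.currentThread().isInterrupted()) {
        try {
          doSchedulingAttempt();
        } catch (Throwable ex) {
          // If the cause chain carries an InterruptedException, restore the
          // interrupt flag so the while-condition sees it and the loop exits.
          for (Throwable t = ex; t != null; t = t.getCause()) {
            if (t instanceof InterruptedException) {
              Thread.currentThread().interrupt();
              break;
            }
          }
        }
      }
    }, "ContinuousSchedulingThread");
    scheduler.start();
    scheduler.interrupt();
    scheduler.join(5000); // terminates promptly instead of hanging forever
    System.out.println("still alive = " + scheduler.isAlive());
  }

  // Stands in for attemptScheduling(node); may be interrupted mid-wait.
  static void doSchedulingAttempt() throws InterruptedException {
    Thread.sleep(10);
  }
}
{code}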
[jira] [Commented] (YARN-3017) ContainerID in ResourceManager Log Has Slightly Different Format From AppAttemptID
[ https://issues.apache.org/jira/browse/YARN-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578296#comment-14578296 ] Rohith commented on YARN-3017: -- Thanks [~ozawa] for confirmation:-) > ContainerID in ResourceManager Log Has Slightly Different Format From > AppAttemptID > -- > > Key: YARN-3017 > URL: https://issues.apache.org/jira/browse/YARN-3017 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.8.0 >Reporter: MUFEED USMAN >Priority: Minor > Labels: PatchAvailable > Attachments: YARN-3017.patch, YARN-3017_1.patch, YARN-3017_2.patch > > > Not sure if this should be filed as a bug or not. > In the ResourceManager log in the events surrounding the creation of a new > application attempt, > ... > ... > 2014-11-14 17:45:37,258 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching > masterappattempt_1412150883650_0001_02 > ... > ... > The application attempt has the ID format "_1412150883650_0001_02". > Whereas the associated ContainerID goes by "_1412150883650_0001_02_". > ... > ... > 2014-11-14 17:45:37,260 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting > up > container Container: [ContainerId: container_1412150883650_0001_02_01, > NodeId: n67:55933, NodeHttpAddress: n67:8042, Resource: vCores:1, > disks:0.0>, Priority: 0, Token: Token { kind: ContainerToken, service: > 10.10.70.67:55933 }, ] for AM appattempt_1412150883650_0001_02 > ... > ... > Curious to know if this is kept like that for a reason. If not while using > filtering tools to, say, grep events surrounding a specific attempt by the > numeric ID part information may slip out during troubleshooting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3775) Job does not exit after all node become unhealthy
[ https://issues.apache.org/jira/browse/YARN-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith resolved YARN-3775. -- Resolution: Not A Problem Closing as Not A Problem. Please Reopen if you disagree.. > Job does not exit after all node become unhealthy > - > > Key: YARN-3775 > URL: https://issues.apache.org/jira/browse/YARN-3775 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 > Environment: Environment: > Version : 2.7.0 > OS: RHEL7 > NameNodes: xiachsh11 xiachsh12 (HA enabled) > DataNodes: 5 xiachsh13-17 > ResourceManage: xiachsh11 > NodeManage: 5 xiachsh13-17 > all nodes are openstack provisioned: > MEM: 1.5G > Disk: 16G >Reporter: Chengshun Xia > Attachments: logs.tar.gz > > > Running Terasort with data size 10G, all the containers exit since the disk > space threshold 0.90 reached,at this point,the job does not exit with error > 15/06/05 13:13:28 INFO mapreduce.Job: map 9% reduce 0% > 15/06/05 13:13:52 INFO mapreduce.Job: map 10% reduce 0% > 15/06/05 13:14:30 INFO mapreduce.Job: map 11% reduce 0% > 15/06/05 13:15:11 INFO mapreduce.Job: map 12% reduce 0% > 15/06/05 13:15:43 INFO mapreduce.Job: map 13% reduce 0% > 15/06/05 13:16:38 INFO mapreduce.Job: map 14% reduce 0% > 15/06/05 13:16:41 INFO mapreduce.Job: map 15% reduce 0% > 15/06/05 13:16:53 INFO mapreduce.Job: map 16% reduce 0% > 15/06/05 13:17:24 INFO mapreduce.Job: map 17% reduce 0% > 15/06/05 13:17:53 INFO mapreduce.Job: map 18% reduce 0% > 15/06/05 13:18:36 INFO mapreduce.Job: map 19% reduce 0% > 15/06/05 13:19:03 INFO mapreduce.Job: map 20% reduce 0% > 15/06/05 13:19:09 INFO mapreduce.Job: map 15% reduce 0% > 15/06/05 13:19:32 INFO mapreduce.Job: map 16% reduce 0% > 15/06/05 13:20:00 INFO mapreduce.Job: map 17% reduce 0% > 15/06/05 13:20:36 INFO mapreduce.Job: map 18% reduce 0% > 15/06/05 13:20:57 INFO mapreduce.Job: map 19% reduce 0% > 15/06/05 13:21:22 INFO mapreduce.Job: map 18% reduce 0% > 15/06/05 13:21:24 INFO mapreduce.Job: map 14% reduce 0% > 15/06/05 13:21:25 INFO mapreduce.Job: map 9% reduce 0% > 15/06/05 13:21:28 INFO mapreduce.Job: map 10% reduce 0% > 15/06/05 13:22:22 INFO mapreduce.Job: map 11% reduce 0% > 15/06/05 13:23:06 INFO mapreduce.Job: map 12% reduce 0% > 15/06/05 13:23:41 INFO mapreduce.Job: map 9% reduce 0% > 15/06/05 13:23:42 INFO mapreduce.Job: map 5% reduce 0% > 15/06/05 13:24:38 INFO mapreduce.Job: map 6% reduce 0% > 15/06/05 13:25:16 INFO mapreduce.Job: map 7% reduce 0% > 15/06/05 13:25:53 INFO mapreduce.Job: map 8% reduce 0% > 15/06/05 13:26:35 INFO mapreduce.Job: map 9% reduce 0% > the last response time is 15/06/05 13:26:35 > and current time : > [root@xiachsh11 logs]# date > Fri Jun 5 19:19:59 EDT 2015 > [root@xiachsh11 logs]# > [root@xiachsh11 logs]# yarn node -list > 15/06/05 19:20:18 INFO client.RMProxy: Connecting to ResourceManager at > xiachsh11.eng.platformlab.ibm.com/9.21.62.234:8032 > Total Nodes:0 > Node-Id Node-State Node-Http-Address > Number-of-Running-Containers > [root@xiachsh11 logs]# -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3775) Job does not exit after all node become unhealthy
[ https://issues.apache.org/jira/browse/YARN-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577065#comment-14577065 ] Rohith commented on YARN-3775: -- [~xiachengs...@yeah.net] Thanks for reporting the issue. IIUC, this is expected behavior. If the application attempt is killed for one of the following reasons, then that attempt failure is not counted towards the attempt-failure count: # Preempted # Aborted # Disk_failed (i.e. NM unhealthy) # Killed by ResourceManager. In your case, the application attempt got killed because of disk_failed, which the RM never considers an attempt failure. So the RM waits to launch and run these applications on whichever NMs register with it later. > Job does not exit after all node become unhealthy > - > > Key: YARN-3775 > URL: https://issues.apache.org/jira/browse/YARN-3775 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 > Environment: Environment: > Version : 2.7.0 > OS: RHEL7 > NameNodes: xiachsh11 xiachsh12 (HA enabled) > DataNodes: 5 xiachsh13-17 > ResourceManage: xiachsh11 > NodeManage: 5 xiachsh13-17 > all nodes are openstack provisioned: > MEM: 1.5G > Disk: 16G >Reporter: Chengshun Xia > Attachments: logs.tar.gz > > > Running Terasort with data size 10G, all the containers exit since the disk > space threshold 0.90 reached,at this point,the job does not exit with error > 15/06/05 13:13:28 INFO mapreduce.Job: map 9% reduce 0% > 15/06/05 13:13:52 INFO mapreduce.Job: map 10% reduce 0% > 15/06/05 13:14:30 INFO mapreduce.Job: map 11% reduce 0% > 15/06/05 13:15:11 INFO mapreduce.Job: map 12% reduce 0% > 15/06/05 13:15:43 INFO mapreduce.Job: map 13% reduce 0% > 15/06/05 13:16:38 INFO mapreduce.Job: map 14% reduce 0% > 15/06/05 13:16:41 INFO mapreduce.Job: map 15% reduce 0% > 15/06/05 13:16:53 INFO mapreduce.Job: map 16% reduce 0% > 15/06/05 13:17:24 INFO mapreduce.Job: map 17% reduce 0% > 15/06/05 13:17:53 INFO mapreduce.Job: map 18% reduce 0% > 15/06/05 13:18:36 INFO mapreduce.Job: map 19% reduce 0% > 15/06/05 13:19:03 INFO mapreduce.Job: map 20% reduce 0% > 15/06/05 13:19:09 INFO mapreduce.Job: map 15% reduce 0% > 15/06/05 13:19:32 INFO mapreduce.Job: map 16% reduce 0% > 15/06/05 13:20:00 INFO mapreduce.Job: map 17% reduce 0% > 15/06/05 13:20:36 INFO mapreduce.Job: map 18% reduce 0% > 15/06/05 13:20:57 INFO mapreduce.Job: map 19% reduce 0% > 15/06/05 13:21:22 INFO mapreduce.Job: map 18% reduce 0% > 15/06/05 13:21:24 INFO mapreduce.Job: map 14% reduce 0% > 15/06/05 13:21:25 INFO mapreduce.Job: map 9% reduce 0% > 15/06/05 13:21:28 INFO mapreduce.Job: map 10% reduce 0% > 15/06/05 13:22:22 INFO mapreduce.Job: map 11% reduce 0% > 15/06/05 13:23:06 INFO mapreduce.Job: map 12% reduce 0% > 15/06/05 13:23:41 INFO mapreduce.Job: map 9% reduce 0% > 15/06/05 13:23:42 INFO mapreduce.Job: map 5% reduce 0% > 15/06/05 13:24:38 INFO mapreduce.Job: map 6% reduce 0% > 15/06/05 13:25:16 INFO mapreduce.Job: map 7% reduce 0% > 15/06/05 13:25:53 INFO mapreduce.Job: map 8% reduce 0% > 15/06/05 13:26:35 INFO mapreduce.Job: map 9% reduce 0% > the last response time is 15/06/05 13:26:35 > and current time : > [root@xiachsh11 logs]# date > Fri Jun 5 19:19:59 EDT 2015 > [root@xiachsh11 logs]# > [root@xiachsh11 logs]# yarn node -list > 15/06/05 19:20:18 INFO client.RMProxy: Connecting to ResourceManager at > xiachsh11.eng.platformlab.ibm.com/9.21.62.234:8032 > Total Nodes:0 > Node-Id Node-State Node-Http-Address > Number-of-Running-Containers > [root@xiachsh11 logs]# -- This message was sent by Atlassian JIRA
(v6.3.4#6332)
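[Editor's note] A hedged sketch of the exemption logic described in the comment above (the method and class are illustrative, modeled on the RMAppAttempt retry-count check; the ContainerExitStatus constants themselves are real YARN API):
{code}
import org.apache.hadoop.yarn.api.records.ContainerExitStatus;

public class AttemptRetrySketch {
  // Returns false for the four exit causes the comment lists, so such
  // failures do not count toward the application's max-attempt limit.
  static boolean countsTowardsMaxAttempts(int amContainerExitStatus) {
    switch (amContainerExitStatus) {
      case ContainerExitStatus.PREEMPTED:
      case ContainerExitStatus.ABORTED:
      case ContainerExitStatus.DISKS_FAILED:
      case ContainerExitStatus.KILLED_BY_RESOURCEMANAGER:
        return false;
      default:
        return true;
    }
  }

  public static void main(String[] args) {
    // DISKS_FAILED (the unhealthy-NM case in this issue) is exempt.
    System.out.println(
        countsTowardsMaxAttempts(ContainerExitStatus.DISKS_FAILED)); // false
  }
}
{code}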
[jira] [Commented] (YARN-3508) Preemption processing occuring on the main RM dispatcher
[ https://issues.apache.org/jira/browse/YARN-3508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577026#comment-14577026 ] Rohith commented on YARN-3508: -- The problem I see with clubbing these with scheduler events is that if there are many scheduler events already in the event queue, the triggering of preemption events is delayed. As [~varun_saxena] said, container preemption events should be considered higher priority than scheduler events. Having a separate event dispatcher for preemption events would allow them to contend for the lock at earlier stages rather than waiting for the scheduler event queue to drain. The current patch approach makes sense to me, i.e. having an individual dispatcher thread for preemption events. > Preemption processing occuring on the main RM dispatcher > > > Key: YARN-3508 > URL: https://issues.apache.org/jira/browse/YARN-3508 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, scheduler >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-3508.002.patch, YARN-3508.01.patch > > > We recently saw the RM for a large cluster lag far behind on the > AsyncDispacher event queue. The AsyncDispatcher thread was consistently > blocked on the highly-contended CapacityScheduler lock trying to dispatch > preemption-related events for RMContainerPreemptEventDispatcher. Preemption > processing should occur on the scheduler event dispatcher thread or a > separate thread to avoid delaying the processing of other events in the > primary dispatcher queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
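[Editor's note] A minimal sketch of the approach being endorsed — giving preemption events their own AsyncDispatcher so they no longer queue behind scheduler events. The event and handler types below are illustrative stand-ins, not the exact classes in the patch; AsyncDispatcher itself is the real YARN event framework:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.event.AbstractEvent;
import org.apache.hadoop.yarn.event.AsyncDispatcher;
import org.apache.hadoop.yarn.event.EventHandler;

public class PreemptionDispatcherSketch {
  enum PreemptionEventType { KILL_CONTAINER }

  static class PreemptionEvent extends AbstractEvent<PreemptionEventType> {
    PreemptionEvent() { super(PreemptionEventType.KILL_CONTAINER); }
  }

  public static void main(String[] args) throws Exception {
    // Dedicated dispatcher: preemption events get their own queue and thread,
    // so a backlog of scheduler events cannot delay them.
    AsyncDispatcher preemptionDispatcher = new AsyncDispatcher();
    preemptionDispatcher.register(PreemptionEventType.class,
        (EventHandler<PreemptionEvent>) event ->
            System.out.println("handling " + event.getType()));
    preemptionDispatcher.init(new Configuration());
    preemptionDispatcher.start();

    preemptionDispatcher.getEventHandler().handle(new PreemptionEvent());
    Thread.sleep(200); // let the dispatcher thread drain the queue
    preemptionDispatcher.stop();
  }
}
{code}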
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576774#comment-14576774 ] Rohith commented on YARN-3535: -- Recently in testing we faced the same issue. [~peng.zhang], would you mind updating the patch? > ResourceRequest should be restored back to scheduler when RMContainer is > killed at ALLOCATED > - > > Key: YARN-3535 > URL: https://issues.apache.org/jira/browse/YARN-3535 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Assignee: Peng Zhang >Priority: Critical > Labels: BB2015-05-TBR > Attachments: YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, > yarn-app.log > > > During a rolling update of NM, the AM's start of a container on the NM failed, > and then the job hung there. > AM logs are attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3535: - Priority: Critical (was: Major) > ResourceRequest should be restored back to scheduler when RMContainer is > killed at ALLOCATED > - > > Key: YARN-3535 > URL: https://issues.apache.org/jira/browse/YARN-3535 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Assignee: Peng Zhang >Priority: Critical > Labels: BB2015-05-TBR > Attachments: YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, > yarn-app.log > > > During rolling update of NM, AM start of container on NM failed. > And then job hang there. > Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3017) ContainerID in ResourceManager Log Has Slightly Different Format From AppAttemptID
[ https://issues.apache.org/jira/browse/YARN-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576671#comment-14576671 ] Rohith commented on YARN-3017: -- +1 lgtm (non-binding) > ContainerID in ResourceManager Log Has Slightly Different Format From > AppAttemptID > -- > > Key: YARN-3017 > URL: https://issues.apache.org/jira/browse/YARN-3017 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.8.0 >Reporter: MUFEED USMAN >Priority: Minor > Labels: PatchAvailable > Attachments: YARN-3017.patch, YARN-3017_1.patch, YARN-3017_2.patch > > > Not sure if this should be filed as a bug or not. > In the ResourceManager log in the events surrounding the creation of a new > application attempt, > ... > ... > 2014-11-14 17:45:37,258 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching > masterappattempt_1412150883650_0001_02 > ... > ... > The application attempt has the ID format "_1412150883650_0001_02". > Whereas the associated ContainerID goes by "_1412150883650_0001_02_". > ... > ... > 2014-11-14 17:45:37,260 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting > up > container Container: [ContainerId: container_1412150883650_0001_02_01, > NodeId: n67:55933, NodeHttpAddress: n67:8042, Resource: vCores:1, > disks:0.0>, Priority: 0, Token: Token { kind: ContainerToken, service: > 10.10.70.67:55933 }, ] for AM appattempt_1412150883650_0001_02 > ... > ... > Curious to know if this is kept like that for a reason. If not while using > filtering tools to, say, grep events surrounding a specific attempt by the > numeric ID part information may slip out during troubleshooting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3017) ContainerID in ResourceManager Log Has Slightly Different Format From AppAttemptID
[ https://issues.apache.org/jira/browse/YARN-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576652#comment-14576652 ] Rohith commented on YARN-3017: -- I see.. Thanks for the detailed explanation.. > ContainerID in ResourceManager Log Has Slightly Different Format From > AppAttemptID > -- > > Key: YARN-3017 > URL: https://issues.apache.org/jira/browse/YARN-3017 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.8.0 >Reporter: MUFEED USMAN >Priority: Minor > Labels: PatchAvailable > Attachments: YARN-3017.patch, YARN-3017_1.patch, YARN-3017_2.patch > > > Not sure if this should be filed as a bug or not. > In the ResourceManager log in the events surrounding the creation of a new > application attempt, > ... > ... > 2014-11-14 17:45:37,258 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching > masterappattempt_1412150883650_0001_02 > ... > ... > The application attempt has the ID format "_1412150883650_0001_02". > Whereas the associated ContainerID goes by "_1412150883650_0001_02_". > ... > ... > 2014-11-14 17:45:37,260 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting > up > container Container: [ContainerId: container_1412150883650_0001_02_01, > NodeId: n67:55933, NodeHttpAddress: n67:8042, Resource: vCores:1, > disks:0.0>, Priority: 0, Token: Token { kind: ContainerToken, service: > 10.10.70.67:55933 }, ] for AM appattempt_1412150883650_0001_02 > ... > ... > Curious to know if this is kept like that for a reason. If not while using > filtering tools to, say, grep events surrounding a specific attempt by the > numeric ID part information may slip out during troubleshooting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3780) Should use equals when compare Resource in RMNodeImpl#ReconnectNodeTransition
[ https://issues.apache.org/jira/browse/YARN-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576550#comment-14576550 ] Rohith commented on YARN-3780: -- Makes sense, +1 lgtm (non-binding) > Should use equals when compare Resource in RMNodeImpl#ReconnectNodeTransition > - > > Key: YARN-3780 > URL: https://issues.apache.org/jira/browse/YARN-3780 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Minor > Attachments: YARN-3780.000.patch > > > Should use equals when comparing Resource in RMNodeImpl#ReconnectNodeTransition > to avoid an unnecessary NodeResourceUpdateSchedulerEvent. > The current code uses {{!=}} to compare the Resource totalCapability, which > compares references, not the real values in Resource. So we should use equals to > compare Resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
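[Editor's note] A small standalone illustration of the reference-vs-value distinction behind this fix (a sketch, not the RMNodeImpl code):
{code}
import org.apache.hadoop.yarn.api.records.Resource;

public class ResourceEqualityDemo {
  public static void main(String[] args) {
    // Two distinct objects describing the same node capability.
    Resource a = Resource.newInstance(4096, 4);
    Resource b = Resource.newInstance(4096, 4);

    // Reference comparison: true, because a and b are different objects --
    // this is what wrongly triggers a NodeResourceUpdateSchedulerEvent.
    System.out.println("a != b      -> " + (a != b));

    // Value comparison: true, so no spurious update event is needed.
    System.out.println("a.equals(b) -> " + a.equals(b));
  }
}
{code}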
[jira] [Commented] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working as expected in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574393#comment-14574393 ] Rohith commented on YARN-3758: -- All this confusion should probably be resolved by YARN-2986. This issue can be raised there to ask whether it will be handled. > The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not > working as expected in FairScheduler > > > Key: YARN-3758 > URL: https://issues.apache.org/jira/browse/YARN-3758 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: skrho > > Hello there~~ > I have 2 clusters > First cluster is 5 nodes, 1 default application queue, Capacity scheduler, 8G > physical memory each node > Second cluster is 10 nodes, 2 application queues, fair-scheduler, 230G > physical memory each node > Whenever a mapreduce job is running, I want the resourcemanager to set the > minimum memory of 256m for each container > So I was changing the configuration in yarn-site.xml & mapred-site.xml > yarn.scheduler.minimum-allocation-mb : 256 > mapreduce.map.java.opts : -Xms256m > mapreduce.reduce.java.opts : -Xms256m > mapreduce.map.memory.mb : 256 > mapreduce.reduce.memory.mb : 256 > In the first cluster, whenever a mapreduce job is running, I can see used memory > 256m in the web console( http://installedIP:8088/cluster/nodes ) > But in the second cluster, whenever a mapreduce job is running, I can see used > memory 1024m in the web console( http://installedIP:8088/cluster/nodes ) > I know the default memory value is 1024m, so if the memory setting is not > changed, the default value applies. > I have been testing for two weeks, but I don't know why the minimum memory > setting is not working in the second cluster > Why does this difference happen? > Am I setting the configuration wrong? > Or is there a bug? > Thank you for reading~~ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3017) ContainerID in ResourceManager Log Has Slightly Different Format From AppAttemptID
[ https://issues.apache.org/jira/browse/YARN-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574228#comment-14574228 ] Rohith commented on YARN-3017: -- bq. Could you give a little more detail about the possibility to break the rolling upgrade? I was wondering whether it causes any issue while parsing the containerId after an upgrade. Say the current container-id format is container_1430441527236_0001_01_01, which is running on NM-1; after the upgrade the container-id format changes to container_1430441527236_0001_01_01, but the NM reports running containers as container_1430441527236_0001_01_01. > ContainerID in ResourceManager Log Has Slightly Different Format From > AppAttemptID > -- > > Key: YARN-3017 > URL: https://issues.apache.org/jira/browse/YARN-3017 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.8.0 >Reporter: MUFEED USMAN >Priority: Minor > Labels: PatchAvailable > Attachments: YARN-3017.patch, YARN-3017_1.patch, YARN-3017_2.patch > > > Not sure if this should be filed as a bug or not. > In the ResourceManager log in the events surrounding the creation of a new > application attempt, > ... > ... > 2014-11-14 17:45:37,258 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching > masterappattempt_1412150883650_0001_02 > ... > ... > The application attempt has the ID format "_1412150883650_0001_02". > Whereas the associated ContainerID goes by "_1412150883650_0001_02_". > ... > ... > 2014-11-14 17:45:37,260 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting > up > container Container: [ContainerId: container_1412150883650_0001_02_01, > NodeId: n67:55933, NodeHttpAddress: n67:8042, Resource: vCores:1, > disks:0.0>, Priority: 0, Token: Token { kind: ContainerToken, service: > 10.10.70.67:55933 }, ] for AM appattempt_1412150883650_0001_02 > ... > ... > Curious to know if this is kept like that for a reason. If not while using > filtering tools to, say, grep events surrounding a specific attempt by the > numeric ID part information may slip out during troubleshooting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working as expected in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3758: - Summary: The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working as expected in FairScheduler (was: The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container) > The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not > working as expected in FairScheduler > > > Key: YARN-3758 > URL: https://issues.apache.org/jira/browse/YARN-3758 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: skrho > > Hello there~~ > I have 2 clusters > First cluster is 5 nodes, 1 default application queue, Capacity scheduler, 8G > physical memory each node > Second cluster is 10 nodes, 2 application queues, fair-scheduler, 230G > physical memory each node > Whenever a mapreduce job is running, I want the resourcemanager to set the > minimum memory of 256m for each container > So I was changing the configuration in yarn-site.xml & mapred-site.xml > yarn.scheduler.minimum-allocation-mb : 256 > mapreduce.map.java.opts : -Xms256m > mapreduce.reduce.java.opts : -Xms256m > mapreduce.map.memory.mb : 256 > mapreduce.reduce.memory.mb : 256 > In the first cluster, whenever a mapreduce job is running, I can see used memory > 256m in the web console( http://installedIP:8088/cluster/nodes ) > But in the second cluster, whenever a mapreduce job is running, I can see used > memory 1024m in the web console( http://installedIP:8088/cluster/nodes ) > I know the default memory value is 1024m, so if the memory setting is not > changed, the default value applies. > I have been testing for two weeks, but I don't know why the minimum memory > setting is not working in the second cluster > Why does this difference happen? > Am I setting the configuration wrong? > Or is there a bug? > Thank you for reading~~ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container
[ https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572630#comment-14572630 ] Rohith commented on YARN-3758: -- bq. Is it bug ? To be clear, is the inconsistent behavior a bug, or was it implemented intentionally for FS? > The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not > working in container > > > Key: YARN-3758 > URL: https://issues.apache.org/jira/browse/YARN-3758 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: skrho > > Hello there~~ > I have 2 clusters > First cluster is 5 node , default 1 application queue, Capacity scheduler, 8G > Physical memory each node > Second cluster is 10 node, 2 application queuey, fair-scheduler, 230G > Physical memory each node > Wherever a mapreduce job is running, I want resourcemanager is to set the > minimum memory 256m to container > So I was changing configuration in yarn-site.xml & mapred-site.xml > yarn.scheduler.minimum-allocation-mb : 256 > mapreduce.map.java.opts : -Xms256m > mapreduce.reduce.java.opts : -Xms256m > mapreduce.map.memory.mb : 256 > mapreduce.reduce.memory.mb : 256 > In First cluster whenever a mapreduce job is running , I can see used memory > 256m in web console( http://installedIP:8088/cluster/nodes ) > But In Second cluster whenever a mapreduce job is running , I can see used > memory 1024m in web console( http://installedIP:8088/cluster/nodes ) > I know default memory value is 1024m, so if there is not changing memory > setting, the default value is working. > I have been testing for two weeks, but I don't know why mimimum memory > setting is not working in second cluster > Why this difference is happened? > Am I wrong setting configuration? > or Is there bug? > Thank you for reading~~ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3758) The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not working in container
[ https://issues.apache.org/jira/browse/YARN-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572628#comment-14572628 ] Rohith commented on YARN-3758: -- I looked into the code for both CS and FS. The understanding of minimum allocation and its behavior differ across CS and FS. # CS : It is straightforward: if any request is less than min-allocation-mb, CS normalizes the request to min-allocation-mb, and containers are allocated with minimum-allocation-mb. # FS : if any request is less than min-allocation-mb, FS normalizes the request with the factor {{yarn.scheduler.increment-allocation-mb}}. In the example in the description, min-allocation-mb is 256mb, but increment-allocation-mb defaults to 1024mb, which always allocates 1024mb to containers. {{yarn.scheduler.increment-allocation-mb}} has a huge effect: it changes the requested memory and assigns the newly calculated resource. The behavior is not consistent between CS and FS. I am not sure why an additional configuration was introduced in FS. Is it a bug? > The mininum memory setting(yarn.scheduler.minimum-allocation-mb) is not > working in container > > > Key: YARN-3758 > URL: https://issues.apache.org/jira/browse/YARN-3758 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: skrho > > Hello there~~ > I have 2 clusters > First cluster is 5 node , default 1 application queue, Capacity scheduler, 8G > Physical memory each node > Second cluster is 10 node, 2 application queuey, fair-scheduler, 230G > Physical memory each node > Wherever a mapreduce job is running, I want resourcemanager is to set the > minimum memory 256m to container > So I was changing configuration in yarn-site.xml & mapred-site.xml > yarn.scheduler.minimum-allocation-mb : 256 > mapreduce.map.java.opts : -Xms256m > mapreduce.reduce.java.opts : -Xms256m > mapreduce.map.memory.mb : 256 > mapreduce.reduce.memory.mb : 256 > In First cluster whenever a mapreduce job is running , I can see used memory > 256m in web console( http://installedIP:8088/cluster/nodes ) > But In Second cluster whenever a mapreduce job is running , I can see used > memory 1024m in web console( http://installedIP:8088/cluster/nodes ) > I know default memory value is 1024m, so if there is not changing memory > setting, the default value is working. > I have been testing for two weeks, but I don't know why mimimum memory > setting is not working in second cluster > Why this difference is happened? > Am I wrong setting configuration? > or Is there bug? > Thank you for reading~~ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
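A minimal sketch of the normalization difference described in the comment above. This is illustrative arithmetic under the stated assumptions, not the actual CapacityScheduler/FairScheduler code, and the method names are made up:

{code}
// Sketch of the two normalization behaviors: CS rounds up against the
// minimum allocation, FS rounds up against increment-allocation-mb.
public class NormalizationSketch {
  // CS-style: raise to the minimum, then round up to a multiple of it.
  static int normalizeCS(int requestedMb, int minAllocMb) {
    int rounded = (int) Math.ceil((double) requestedMb / minAllocMb) * minAllocMb;
    return Math.max(rounded, minAllocMb);
  }

  // FS-style: round up to a multiple of yarn.scheduler.increment-allocation-mb
  // (default 1024), which can override a smaller minimum-allocation-mb.
  static int normalizeFS(int requestedMb, int minAllocMb, int incrementMb) {
    int atLeastMin = Math.max(requestedMb, minAllocMb);
    return (int) Math.ceil((double) atLeastMin / incrementMb) * incrementMb;
  }

  public static void main(String[] args) {
    // With min-allocation-mb=256: CS allocates 256, FS allocates 1024.
    System.out.println(normalizeCS(256, 256));        // 256
    System.out.println(normalizeFS(256, 256, 1024));  // 1024
  }
}
{code}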
[jira] [Commented] (YARN-3017) ContainerID in ResourceManager Log Has Slightly Different Format From AppAttemptID
[ https://issues.apache.org/jira/browse/YARN-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572289#comment-14572289 ] Rohith commented on YARN-3017: -- Apologies for coming very late to this issue. I am thinking that changing the containerId format may break compatibility when a rolling upgrade is done with RM HA + work-preserving recovery enabled? IIUC, using the ZKRMStateStore, a rolling upgrade can be done now. > ContainerID in ResourceManager Log Has Slightly Different Format From > AppAttemptID > -- > > Key: YARN-3017 > URL: https://issues.apache.org/jira/browse/YARN-3017 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.8.0 >Reporter: MUFEED USMAN >Priority: Minor > Labels: PatchAvailable > Attachments: YARN-3017.patch, YARN-3017_1.patch, YARN-3017_2.patch > > > Not sure if this should be filed as a bug or not. > In the ResourceManager log in the events surrounding the creation of a new > application attempt, > ... > ... > 2014-11-14 17:45:37,258 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching > masterappattempt_1412150883650_0001_02 > ... > ... > The application attempt has the ID format "_1412150883650_0001_02". > Whereas the associated ContainerID goes by "_1412150883650_0001_02_". > ... > ... > 2014-11-14 17:45:37,260 INFO > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting > up > container Container: [ContainerId: container_1412150883650_0001_02_01, > NodeId: n67:55933, NodeHttpAddress: n67:8042, Resource: vCores:1, > disks:0.0>, Priority: 0, Token: Token { kind: ContainerToken, service: > 10.10.70.67:55933 }, ] for AM appattempt_1412150883650_0001_02 > ... > ... > Curious to know if this is kept like that for a reason. If not while using > filtering tools to, say, grep events surrounding a specific attempt by the > numeric ID part information may slip out during troubleshooting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572247#comment-14572247 ] Rohith commented on YARN-3733: -- +1 for handling virtual cores. This will be a good improvement for testing the DominantRC functionality precisely. > DominantRC#compare() does not work as expected if cluster resource is empty > --- > > Key: YARN-3733 > URL: https://issues.apache.org/jira/browse/YARN-3733 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 , 2 NM , 2 RM > one NM - 3 GB 6 v core >Reporter: Bibin A Chundatt >Assignee: Rohith >Priority: Blocker > Attachments: 0001-YARN-3733.patch, 0002-YARN-3733.patch, > 0002-YARN-3733.patch, YARN-3733.patch > > > Steps to reproduce > = > 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) > 2. Configure map and reduce size to 512 MB after changing scheduler minimum > size to 512 MB > 3. Configure capacity scheduler and AM limit to .5 > (DominantResourceCalculator is configured) > 4. Submit 30 concurrent task > 5. Switch RM > Actual > = > For 12 Jobs AM gets allocated and all 12 starts running > No other Yarn child is initiated , *all 12 Jobs in Running state for ever* > Expected > === > Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched
[ https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572244#comment-14572244 ] Rohith commented on YARN-3754: -- bq. When NM is shutting down, ContainerLaunch is also interrupted. During this interrupted exception handling, NM tries to update container diagnostics. But from main thread statestore is down ,hence caused the DB Close exception. I think this issue was caused because the NM JVM did not exit on time, which allowed the statestore event to be processed. After YARN-3585, I think this should be OK. [~bibinchundatt] Can you run the regression again, please? > Race condition when the NodeManager is shutting down and container is launched > -- > > Key: YARN-3754 > URL: https://issues.apache.org/jira/browse/YARN-3754 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager > Environment: Suse 11 Sp3 >Reporter: Bibin A Chundatt >Assignee: Sunil G >Priority: Critical > Attachments: NM.log > > > Container is launched and returned to ContainerImpl > NodeManager closed the DB connection which resulting in > {{org.iq80.leveldb.DBException: Closed}}. > *Attaching the exception trace* > {code} > 2015-05-30 02:11:49,122 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Unable to update state store diagnostics for > container_e310_1432817693365_3338_01_02 > java.io.IOException: org.iq80.leveldb.DBException: Closed > at > org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.iq80.leveldb.DBException: Closed > at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123) > at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106) > at >
org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259) > ... 15 more > {code} > we can add a check whether DB is closed while we move container from ACQUIRED > state. > As per the discussion in YARN-3585 have add the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3733: - Attachment: 0002-YARN-3733.patch Updated the patch fixing the test-side review comments. Kindly review the patch > DominantRC#compare() does not work as expected if cluster resource is empty > --- > > Key: YARN-3733 > URL: https://issues.apache.org/jira/browse/YARN-3733 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 , 2 NM , 2 RM > one NM - 3 GB 6 v core >Reporter: Bibin A Chundatt >Assignee: Rohith >Priority: Blocker > Attachments: 0001-YARN-3733.patch, 0002-YARN-3733.patch, > 0002-YARN-3733.patch, YARN-3733.patch > > > Steps to reproduce > = > 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) > 2. Configure map and reduce size to 512 MB after changing scheduler minimum > size to 512 MB > 3. Configure capacity scheduler and AM limit to .5 > (DominantResourceCalculator is configured) > 4. Submit 30 concurrent task > 5. Switch RM > Actual > = > For 12 Jobs AM gets allocated and all 12 starts running > No other Yarn child is initiated , *all 12 Jobs in Running state for ever* > Expected > === > Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572085#comment-14572085 ] Rohith commented on YARN-3733: -- bq. only memory or vcores are more in TestCapacityScheduler. All the combinations of inputs are verified in TestResourceCalculator. In TestCapacityScheduler, app submission happens only for memory in {{MockRM.submitApp}}, so the default vcore minimum allocation of 1 is taken by default. So just changing the memory to {{amResourceLimit.getMemory() + 2}} should be enough. bq. TestCapacityScheduler#verifyAMLimitForLeafQueue, while submitting second app, you could change the app name to "app-2". Agree. I will upload a patch soon > DominantRC#compare() does not work as expected if cluster resource is empty > --- > > Key: YARN-3733 > URL: https://issues.apache.org/jira/browse/YARN-3733 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 , 2 NM , 2 RM > one NM - 3 GB 6 v core >Reporter: Bibin A Chundatt >Assignee: Rohith >Priority: Blocker > Attachments: 0001-YARN-3733.patch, 0002-YARN-3733.patch, > YARN-3733.patch > > > Steps to reproduce > = > 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) > 2. Configure map and reduce size to 512 MB after changing scheduler minimum > size to 512 MB > 3. Configure capacity scheduler and AM limit to .5 > (DominantResourceCalculator is configured) > 4. Submit 30 concurrent task > 5. Switch RM > Actual > = > For 12 Jobs AM gets allocated and all 12 starts running > No other Yarn child is initiated , *all 12 Jobs in Running state for ever* > Expected > === > Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3733: - Attachment: 0002-YARN-3733.patch Thanks [~sunilg] and [~leftnoteasy] for sharing your thoughts. I modified the logic a bit and reordered the if checks so that all the possible combinations of inputs in the table below are handled. The problem was with the 5th and 7th inputs: the validation was returning 1 where zero was expected for the 5th combination, i.e. the flow never reached the 2nd check since the 1st step ORs memory vs cpu.
||Sl.no||cr||lhs||rhs||Output||
|1|<0,0>| <1,1> | <1,1> | 0 |
|2|<0,0>| <1,1> | <0,0> | 1 |
|3|<0,0>| <0,0> | <1,1> | -1 |
|4|<0,0>| <0,1> | <1,0> | 0 |
|5|<0,0>| <1,0> | <0,1> | 0 |
|6|<0,0>| <1,1> | <1,0> | 1 |
|7|<0,0>| <1,0> | <1,1> | -1 |
The updated patch has the following changes: # Changed the logic for comparing lhs and rhs resources when clusterResource is empty, as suggested. # Added a test for AMLimit usage. # Added a test for all the above combinations of inputs. Kindly review the patch > DominantRC#compare() does not work as expected if cluster resource is empty > --- > > Key: YARN-3733 > URL: https://issues.apache.org/jira/browse/YARN-3733 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 , 2 NM , 2 RM > one NM - 3 GB 6 v core >Reporter: Bibin A Chundatt >Assignee: Rohith >Priority: Blocker > Attachments: 0001-YARN-3733.patch, 0002-YARN-3733.patch, > YARN-3733.patch > > > Steps to reproduce > = > 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) > 2. Configure map and reduce size to 512 MB after changing scheduler minimum > size to 512 MB > 3. Configure capacity scheduler and AM limit to .5 > (DominantResourceCalculator is configured) > 4. Submit 30 concurrent task > 5. Switch RM > Actual > = > For 12 Jobs AM gets allocated and all 12 starts running > No other Yarn child is initiated , *all 12 Jobs in Running state for ever* > Expected > === > Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
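A minimal sketch of comparison logic that reproduces the table above when clusterResource is empty. The class and method names are illustrative only; the real change lives in DominantRC#compare and this is not the committed patch:

{code}
// Illustrative fallback comparison for an empty cluster resource,
// matching the table above: rows 4 and 5 (<0,1> vs <1,0>) compare as 0,
// rows 2 and 6 as 1, rows 3 and 7 as -1.
public class EmptyClusterCompareSketch {
  static int compare(long lMem, long lCpu, long rMem, long rCpu) {
    boolean lhsGreater = (lMem > rMem) || (lCpu > rCpu);
    boolean rhsGreater = (rMem > lMem) || (rCpu > lCpu);
    if (lhsGreater && rhsGreater) {
      return 0;  // each side dominates one component (rows 4, 5)
    }
    if (lhsGreater) {
      return 1;  // rows 2, 6
    }
    if (rhsGreater) {
      return -1; // rows 3, 7
    }
    return 0;    // equal resources (row 1)
  }

  public static void main(String[] args) {
    System.out.println(compare(1, 0, 0, 1)); // 0  (row 5)
    System.out.println(compare(1, 0, 1, 1)); // -1 (row 7)
  }
}
{code}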
[jira] [Updated] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3733: - Attachment: 0001-YARN-3733.patch The updated patch fixes the 2nd and 3rd scenarios in the above table (the scenarios of this issue) and refactors the test code. For an overall solution that also handles input combinations like the 4th and 5th in the above table, we need to explore more how to define the fraction and how to decide which one is dominant. Any suggestions on this? > DominantRC#compare() does not work as expected if cluster resource is empty > --- > > Key: YARN-3733 > URL: https://issues.apache.org/jira/browse/YARN-3733 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 , 2 NM , 2 RM > one NM - 3 GB 6 v core >Reporter: Bibin A Chundatt >Assignee: Rohith >Priority: Blocker > Attachments: 0001-YARN-3733.patch, YARN-3733.patch > > > Steps to reproduce > = > 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) > 2. Configure map and reduce size to 512 MB after changing scheduler minimum > size to 512 MB > 3. Configure capacity scheduler and AM limit to .5 > (DominantResourceCalculator is configured) > 4. Submit 30 concurrent task > 5. Switch RM > Actual > = > For 12 Jobs AM gets allocated and all 12 starts running > No other Yarn child is initiated , *all 12 Jobs in Running state for ever* > Expected > === > Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568682#comment-14568682 ] Rohith commented on YARN-3733: -- Updated the summary as per the defect. > DominantRC#compare() does not work as expected if cluster resource is empty > --- > > Key: YARN-3733 > URL: https://issues.apache.org/jira/browse/YARN-3733 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 , 2 NM , 2 RM > one NM - 3 GB 6 v core >Reporter: Bibin A Chundatt >Assignee: Rohith >Priority: Blocker > Attachments: YARN-3733.patch > > > Steps to reproduce > = > 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) > 2. Configure map and reduce size to 512 MB after changing scheduler minimum > size to 512 MB > 3. Configure capacity scheduler and AM limit to .5 > (DominantResourceCalculator is configured) > 4. Submit 30 concurrent task > 5. Switch RM > Actual > = > For 12 Jobs AM gets allocated and all 12 starts running > No other Yarn child is initiated , *all 12 Jobs in Running state for ever* > Expected > === > Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3733) DominantRC#compare() does not work as expected if cluster resource is empty
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3733: - Summary: DominantRC#compare() does not work as expected if cluster resource is empty (was: On RM restart AM getting more than maximum possible memory when many tasks in queue) > DominantRC#compare() does not work as expected if cluster resource is empty > --- > > Key: YARN-3733 > URL: https://issues.apache.org/jira/browse/YARN-3733 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 , 2 NM , 2 RM > one NM - 3 GB 6 v core >Reporter: Bibin A Chundatt >Assignee: Rohith >Priority: Blocker > Attachments: YARN-3733.patch > > > Steps to reproduce > = > 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) > 2. Configure map and reduce size to 512 MB after changing scheduler minimum > size to 512 MB > 3. Configure capacity scheduler and AM limit to .5 > (DominantResourceCalculator is configured) > 4. Submit 30 concurrent task > 5. Switch RM > Actual > = > For 12 Jobs AM gets allocated and all 12 starts running > No other Yarn child is initiated , *all 12 Jobs in Running state for ever* > Expected > === > Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568539#comment-14568539 ] Rohith commented on YARN-3585: -- Thanks [~jlowe] for the review.. bq. if we should flip the logic to not exit but then have NodeManager.main override that. This probably precludes the need to update existing tests. Makes sense to me. Changed the logic to call JVM exit only when NodeManager is instantiated from the main function. bq. We should be using ExitUtil instead of System.exit directly. Done bq. Nit: "setexitOnShutdownEvent" s/b "setExitOnShutdownEvent" This method is not necessary now since the patch presumes true when NodeManager is instantiated only from the main function. I have removed it. Kindly review the updated patch > NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled > -- > > Key: YARN-3585 > URL: https://issues.apache.org/jira/browse/YARN-3585 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Assignee: Rohith >Priority: Critical > Attachments: 0001-YARN-3585.patch, YARN-3585.patch > > > With NM recovery enabled, after decommission, nodemanager log show stop but > process cannot end. > non daemon thread: > {noformat} > "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on > condition [0x] > "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable > [0x] > "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable > "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 > nid=0x29ed runnable > "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 > nid=0x29ee runnable > "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 > nid=0x29ef runnable > "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 > nid=0x29f0 runnable > "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 > nid=0x29f1 runnable > "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 > nid=0x29f2 runnable > "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 > nid=0x29f3 runnable > "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 > nid=0x29f4 runnable > "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 > runnable > "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 > nid=0x29f5 runnable > "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 > nid=0x29f6 runnable > "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting > on condition > {noformat} > and jni leveldb thread stack > {noformat} > Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): > #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/libpthread.so.0 > #1 0x7f33dfce2a3b in leveldb::(anonymous > namespace)::PosixEnv::BGThreadWrapper(void*) () from > /tmp/libleveldbjni-64-1-6922178968300745716.8 > #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 > #3 0x003d830e811d in clone () from /lib64/libc.so.6 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
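A minimal sketch of the approach described above, assuming a flag that only NodeManager.main turns on. Field, method, and class names here are illustrative, not the committed patch:

{code}
// Illustrative sketch: exit the JVM on SHUTDOWN only when the NM was
// started from main(), so tests that instantiate NodeManager directly
// are unaffected.
import org.apache.hadoop.util.ExitUtil;

public class NodeManagerSketch {
  private boolean shouldExitOnShutdownEvent = false;

  void handleShutdownEvent() {
    stop();  // normal graceful service stop
    if (shouldExitOnShutdownEvent) {
      // Force the process down even if non-daemon threads (e.g. the
      // leveldb JNI thread from the stack above) would keep it alive.
      ExitUtil.terminate(-1);
    }
  }

  void stop() { /* stop services, close the state store, ... */ }

  public static void main(String[] args) {
    NodeManagerSketch nm = new NodeManagerSketch();
    nm.shouldExitOnShutdownEvent = true;  // only the real daemon exits the JVM
    // ... init and start services ...
  }
}
{code}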
[jira] [Updated] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3585: - Attachment: 0001-YARN-3585.patch > NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled > -- > > Key: YARN-3585 > URL: https://issues.apache.org/jira/browse/YARN-3585 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Assignee: Rohith >Priority: Critical > Attachments: 0001-YARN-3585.patch, YARN-3585.patch > > > With NM recovery enabled, after decommission, nodemanager log show stop but > process cannot end. > non daemon thread: > {noformat} > "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on > condition [0x] > "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable > [0x] > "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable > "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 > nid=0x29ed runnable > "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 > nid=0x29ee runnable > "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 > nid=0x29ef runnable > "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 > nid=0x29f0 runnable > "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 > nid=0x29f1 runnable > "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 > nid=0x29f2 runnable > "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 > nid=0x29f3 runnable > "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 > nid=0x29f4 runnable > "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 > runnable > "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 > nid=0x29f5 runnable > "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 > nid=0x29f6 runnable > "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting > on condition > {noformat} > and jni leveldb thread stack > {noformat} > Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): > #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/libpthread.so.0 > #1 0x7f33dfce2a3b in leveldb::(anonymous > namespace)::PosixEnv::BGThreadWrapper(void*) () from > /tmp/libleveldbjni-64-1-6922178968300745716.8 > #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 > #3 0x003d830e811d in clone () from /lib64/libc.so.6 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568462#comment-14568462 ] Rohith commented on YARN-3733: -- This fix needs to go into 2.7.1. Updated the target version to 2.7.1 > On RM restart AM getting more than maximum possible memory when many tasks > in queue > - > > Key: YARN-3733 > URL: https://issues.apache.org/jira/browse/YARN-3733 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 , 2 NM , 2 RM > one NM - 3 GB 6 v core >Reporter: Bibin A Chundatt >Assignee: Rohith >Priority: Blocker > Attachments: YARN-3733.patch > > > Steps to reproduce > = > 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) > 2. Configure map and reduce size to 512 MB after changing scheduler minimum > size to 512 MB > 3. Configure capacity scheduler and AM limit to .5 > (DominantResourceCalculator is configured) > 4. Submit 30 concurrent task > 5. Switch RM > Actual > = > For 12 Jobs AM gets allocated and all 12 starts running > No other Yarn child is initiated , *all 12 Jobs in Running state for ever* > Expected > === > Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3733: - Target Version/s: 2.7.1 > On RM restart AM getting more than maximum possible memory when many tasks > in queue > - > > Key: YARN-3733 > URL: https://issues.apache.org/jira/browse/YARN-3733 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 , 2 NM , 2 RM > one NM - 3 GB 6 v core >Reporter: Bibin A Chundatt >Assignee: Rohith >Priority: Blocker > Attachments: YARN-3733.patch > > > Steps to reproduce > = > 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) > 2. Configure map and reduce size to 512 MB after changing scheduler minimum > size to 512 MB > 3. Configure capacity scheduler and AM limit to .5 > (DominantResourceCalculator is configured) > 4. Submit 30 concurrent task > 5. Switch RM > Actual > = > For 12 Jobs AM gets allocated and all 12 starts running > No other Yarn child is initiated , *all 12 Jobs in Running state for ever* > Expected > === > Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567201#comment-14567201 ] Rohith commented on YARN-3585: -- The findbugs -1 does not show any error report; I am not sure why the -1 was given. The test failure is unrelated to this patch. [~jlowe] Kindly review the patch. > NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled > -- > > Key: YARN-3585 > URL: https://issues.apache.org/jira/browse/YARN-3585 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Assignee: Rohith >Priority: Critical > Attachments: YARN-3585.patch > > > With NM recovery enabled, after decommission, nodemanager log show stop but > process cannot end. > non daemon thread: > {noformat} > "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on > condition [0x] > "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable > [0x] > "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable > "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 > nid=0x29ed runnable > "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 > nid=0x29ee runnable > "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 > nid=0x29ef runnable > "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 > nid=0x29f0 runnable > "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 > nid=0x29f1 runnable > "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 > nid=0x29f2 runnable > "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 > nid=0x29f3 runnable > "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 > nid=0x29f4 runnable > "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 > runnable > "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 > nid=0x29f5 runnable > "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 > nid=0x29f6 runnable > "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting > on condition > {noformat} > and jni leveldb thread stack > {noformat} > Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): > #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/libpthread.so.0 > #1 0x7f33dfce2a3b in leveldb::(anonymous > namespace)::PosixEnv::BGThreadWrapper(void*) () from > /tmp/libleveldbjni-64-1-6922178968300745716.8 > #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 > #3 0x003d830e811d in clone () from /lib64/libc.so.6 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567196#comment-14567196 ] Rohith commented on YARN-3585: -- Yes, we can raise a different Jira. [~bibinchundatt] Can you raise the Jira? We can validate the issue there. > NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled > -- > > Key: YARN-3585 > URL: https://issues.apache.org/jira/browse/YARN-3585 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Assignee: Rohith >Priority: Critical > Attachments: YARN-3585.patch > > > With NM recovery enabled, after decommission, nodemanager log show stop but > process cannot end. > non daemon thread: > {noformat} > "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on > condition [0x] > "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable > [0x] > "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable > "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 > nid=0x29ed runnable > "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 > nid=0x29ee runnable > "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 > nid=0x29ef runnable > "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 > nid=0x29f0 runnable > "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 > nid=0x29f1 runnable > "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 > nid=0x29f2 runnable > "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 > nid=0x29f3 runnable > "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 > nid=0x29f4 runnable > "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 > runnable > "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 > nid=0x29f5 runnable > "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 > nid=0x29f6 runnable > "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting > on condition > {noformat} > and jni leveldb thread stack > {noformat} > Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): > #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/libpthread.so.0 > #1 0x7f33dfce2a3b in leveldb::(anonymous > namespace)::PosixEnv::BGThreadWrapper(void*) () from > /tmp/libleveldbjni-64-1-6922178968300745716.8 > #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 > #3 0x003d830e811d in clone () from /lib64/libc.so.6 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567189#comment-14567189 ] Rohith commented on YARN-3733: -- bq. Verify infinity by calling isInfinite(float v). Quoting from jdk7 Since infinity is derived from lhs and rhs, infinity cannot be differentiated for clusterResource=<0,0>, lhs=<1,1>, and rhs=<2,2>. The method {{getResourceAsValue()}} returns infinity for both l and r, so they cannot be compared. > On RM restart AM getting more than maximum possible memory when many tasks > in queue > - > > Key: YARN-3733 > URL: https://issues.apache.org/jira/browse/YARN-3733 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 , 2 NM , 2 RM > one NM - 3 GB 6 v core >Reporter: Bibin A Chundatt >Assignee: Rohith >Priority: Blocker > Attachments: YARN-3733.patch > > > Steps to reproduce > = > 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) > 2. Configure map and reduce size to 512 MB after changing scheduler minimum > size to 512 MB > 3. Configure capacity scheduler and AM limit to .5 > (DominantResourceCalculator is configured) > 4. Submit 30 concurrent task > 5. Switch RM > Actual > = > For 12 Jobs AM gets allocated and all 12 starts running > No other Yarn child is initiated , *all 12 Jobs in Running state for ever* > Expected > === > Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
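To make the point above concrete, here is a tiny plain-Java demonstration of why the two infinite ratios cannot be told apart. This is an illustration only, not the YARN code path:

{code}
// With a zero denominator both ratios collapse to POSITIVE_INFINITY,
// so the comparison can no longer tell which side is larger.
public class InfinityCompareSketch {
  public static void main(String[] args) {
    float cluster = 0f;
    float l = 1f / cluster; // lhs=<1,1> over clusterResource=<0,0> -> Infinity
    float r = 2f / cluster; // rhs=<2,2> over clusterResource=<0,0> -> Infinity
    System.out.println(Float.isInfinite(l) && Float.isInfinite(r)); // true
    System.out.println(Float.compare(l, r)); // 0 -> indistinguishable
  }
}
{code}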
[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567186#comment-14567186 ] Rohith commented on YARN-3733: -- bq. 2. The newly added code is duplicated in two places, can you eliminate the duplicate code? The second validation is not required in case of NaN; I will remove it in the next patch. > On RM restart AM getting more than maximum possible memory when many tasks > in queue > - > > Key: YARN-3733 > URL: https://issues.apache.org/jira/browse/YARN-3733 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 , 2 NM , 2 RM > one NM - 3 GB 6 v core >Reporter: Bibin A Chundatt >Assignee: Rohith >Priority: Blocker > Attachments: YARN-3733.patch > > > Steps to reproduce > = > 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) > 2. Configure map and reduce size to 512 MB after changing scheduler minimum > size to 512 MB > 3. Configure capacity scheduler and AM limit to .5 > (DominantResourceCalculator is configured) > 4. Submit 30 concurrent task > 5. Switch RM > Actual > = > For 12 Jobs AM gets allocated and all 12 starts running > No other Yarn child is initiated , *all 12 Jobs in Running state for ever* > Expected > === > Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567184#comment-14567184 ] Rohith commented on YARN-3733: -- Thanks [~devaraj.k] and [~sunilg] for the review bq. Can we check for lhs/rhs emptiness and compare these before ending up with infinite values? If we check for emptiness, this would affect specific input values like clusterResource=<0,0>, lhs=<1,1>, and rhs=<2,2>: which one is then considered dominant? The dominant component cannot be retrieved directly from memory or cpu. I have listed the possible combinations of inputs that can occur in YARN. These are
||Sl.no||clusterResource||lhs||rhs||Remark||
|1|<0,0>|<0,0>|<0,0>|Valid input; handled|
|2|<0,0>||<0,0>|NaN vs Infinity: patch handles this scenario|
|3|<0,0>|<0,0>||NaN vs Infinity: patch handles this scenario|
|4|<0,0>|||Infinity vs Infinity: can this type occur in YARN?|
|5|<0,0>||<0,positive integer>|Is this a valid input? Can this type occur in YARN?|
> On RM restart AM getting more than maximum possible memory when many tasks > in queue > - > > Key: YARN-3733 > URL: https://issues.apache.org/jira/browse/YARN-3733 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 , 2 NM , 2 RM > one NM - 3 GB 6 v core >Reporter: Bibin A Chundatt >Assignee: Rohith >Priority: Blocker > Attachments: YARN-3733.patch > > > Steps to reproduce > = > 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) > 2. Configure map and reduce size to 512 MB after changing scheduler minimum > size to 512 MB > 3. Configure capacity scheduler and AM limit to .5 > (DominantResourceCalculator is configured) > 4. Submit 30 concurrent task > 5. Switch RM > Actual > = > For 12 Jobs AM gets allocated and all 12 starts running > No other Yarn child is initiated , *all 12 Jobs in Running state for ever* > Expected > === > Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566993#comment-14566993 ] Rohith commented on YARN-3585: -- This is a race condition between the NodeManager shutting down and a container being launched. By the time the container is launched and returned to ContainerImpl, the NodeManager has closed the DB connection, resulting in {{org.iq80.leveldb.DBException: Closed}} > NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled > -- > > Key: YARN-3585 > URL: https://issues.apache.org/jira/browse/YARN-3585 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Assignee: Rohith >Priority: Critical > Attachments: YARN-3585.patch > > > With NM recovery enabled, after decommission, nodemanager log show stop but > process cannot end. > non daemon thread: > {noformat} > "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on > condition [0x] > "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable > [0x] > "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable > "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 > nid=0x29ed runnable > "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 > nid=0x29ee runnable > "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 > nid=0x29ef runnable > "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 > nid=0x29f0 runnable > "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 > nid=0x29f1 runnable > "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 > nid=0x29f2 runnable > "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 > nid=0x29f3 runnable > "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 > nid=0x29f4 runnable > "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 > runnable > "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 > nid=0x29f5 runnable > "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 > nid=0x29f6 runnable > "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting > on condition > {noformat} > and jni leveldb thread stack > {noformat} > Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): > #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/libpthread.so.0 > #1 0x7f33dfce2a3b in leveldb::(anonymous > namespace)::PosixEnv::BGThreadWrapper(void*) () from > /tmp/libleveldbjni-64-1-6922178968300745716.8 > #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 > #3 0x003d830e811d in clone () from /lib64/libc.so.6 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
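The YARN-3754 description above suggests checking whether the DB is closed before writing. A minimal sketch of that guard, with hypothetical field and method names rather than the committed fix:

{code}
// Illustrative sketch of guarding state-store writes against the
// shutdown race: a write that arrives after close() is skipped instead
// of surfacing org.iq80.leveldb.DBException: Closed from the JNI layer.
public class GuardedStoreSketch {
  private final Object lock = new Object();
  private boolean closed = false;

  void storeContainerDiagnostics(String key, byte[] value) {
    synchronized (lock) {
      if (closed) {
        return; // NM is shutting down; drop the diagnostics update
      }
      // db.put(bytes(key), value);
    }
  }

  void close() {
    synchronized (lock) {
      closed = true;
      // db.close();
    }
  }
}
{code}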
[jira] [Updated] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3733: - Attachment: YARN-3733.patch > On RM restart AM getting more than maximum possible memory when many tasks > in queue > - > > Key: YARN-3733 > URL: https://issues.apache.org/jira/browse/YARN-3733 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 , 2 NM , 2 RM > one NM - 3 GB 6 v core >Reporter: Bibin A Chundatt >Assignee: Rohith >Priority: Blocker > Attachments: YARN-3733.patch > > > Steps to reproduce > = > 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) > 2. Configure map and reduce size to 512 MB after changing scheduler minimum > size to 512 MB > 3. Configure capacity scheduler and AM limit to .5 > (DominantResourceCalculator is configured) > 4. Submit 30 concurrent task > 5. Switch RM > Actual > = > For 12 Jobs AM gets allocated and all 12 starts running > No other Yarn child is initiated , *all 12 Jobs in Running state for ever* > Expected > === > Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3733: - Attachment: (was: YARN-3733.patch) > On RM restart AM getting more than maximum possible memory when many tasks > in queue > - > > Key: YARN-3733 > URL: https://issues.apache.org/jira/browse/YARN-3733 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 , 2 NM , 2 RM > one NM - 3 GB 6 v core >Reporter: Bibin A Chundatt >Assignee: Rohith >Priority: Blocker > Attachments: YARN-3733.patch > > > Steps to reproduce > = > 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) > 2. Configure map and reduce size to 512 MB after changing scheduler minimum > size to 512 MB > 3. Configure capacity scheduler and AM limit to .5 > (DominantResourceCalculator is configured) > 4. Submit 30 concurrent task > 5. Switch RM > Actual > = > For 12 Jobs AM gets allocated and all 12 starts running > No other Yarn child is initiated , *all 12 Jobs in Running state for ever* > Expected > === > Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3733: - Attachment: YARN-3733.patch Attached the patch fixing the issue. Kindly review the patch. > On RM restart AM getting more than maximum possible memory when many tasks > in queue > - > > Key: YARN-3733 > URL: https://issues.apache.org/jira/browse/YARN-3733 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 , 2 NM , 2 RM > one NM - 3 GB 6 v core >Reporter: Bibin A Chundatt >Assignee: Rohith >Priority: Blocker > Attachments: YARN-3733.patch > > > Steps to reproduce > = > 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) > 2. Configure map and reduce size to 512 MB after changing scheduler minimum > size to 512 MB > 3. Configure capacity scheduler and AM limit to .5 > (DominantResourceCalculator is configured) > 4. Submit 30 concurrent task > 5. Switch RM > Actual > = > For 12 Jobs AM gets allocated and all 12 starts running > No other Yarn child is initiated , *all 12 Jobs in Running state for ever* > Expected > === > Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3585: - Attachment: YARN-3585.patch > NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled > -- > > Key: YARN-3585 > URL: https://issues.apache.org/jira/browse/YARN-3585 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Assignee: Rohith >Priority: Critical > Attachments: YARN-3585.patch > > > With NM recovery enabled, after decommission, nodemanager log show stop but > process cannot end. > non daemon thread: > {noformat} > "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on > condition [0x] > "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable > [0x] > "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable > "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 > nid=0x29ed runnable > "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 > nid=0x29ee runnable > "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 > nid=0x29ef runnable > "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 > nid=0x29f0 runnable > "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 > nid=0x29f1 runnable > "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 > nid=0x29f2 runnable > "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 > nid=0x29f3 runnable > "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 > nid=0x29f4 runnable > "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 > runnable > "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 > nid=0x29f5 runnable > "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 > nid=0x29f6 runnable > "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting > on condition > {noformat} > and jni leveldb thread stack > {noformat} > Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): > #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/libpthread.so.0 > #1 0x7f33dfce2a3b in leveldb::(anonymous > namespace)::PosixEnv::BGThreadWrapper(void*) () from > /tmp/libleveldbjni-64-1-6922178968300745716.8 > #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 > #3 0x003d830e811d in clone () from /lib64/libc.so.6 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith reassigned YARN-3585: Assignee: Rohith > NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled > -- > > Key: YARN-3585 > URL: https://issues.apache.org/jira/browse/YARN-3585 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Assignee: Rohith >Priority: Critical > > With NM recovery enabled, after decommission, nodemanager log show stop but > process cannot end. > non daemon thread: > {noformat} > "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on > condition [0x] > "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable > [0x] > "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable > "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 > nid=0x29ed runnable > "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 > nid=0x29ee runnable > "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 > nid=0x29ef runnable > "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 > nid=0x29f0 runnable > "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 > nid=0x29f1 runnable > "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 > nid=0x29f2 runnable > "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 > nid=0x29f3 runnable > "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 > nid=0x29f4 runnable > "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 > runnable > "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 > nid=0x29f5 runnable > "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 > nid=0x29f6 runnable > "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting > on condition > {noformat} > and jni leveldb thread stack > {noformat} > Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): > #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/libpthread.so.0 > #1 0x7f33dfce2a3b in leveldb::(anonymous > namespace)::PosixEnv::BGThreadWrapper(void*) () from > /tmp/libleveldbjni-64-1-6922178968300745716.8 > #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 > #3 0x003d830e811d in clone () from /lib64/libc.so.6 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563180#comment-14563180 ] Rohith commented on YARN-3585: -- Another observation: I enabled debug logs for the NodeManager and noticed that the occurrence of this issue became relatively low. I think the timing of the db close is causing the issue in LevelDB. This issue does not always appear on all the nodes, but at least one node in the cluster goes for a toss. I too think it should be a LevelDB issue, and we should report the issue to LevelDB. Adding a {{System.exit}} call to the NodeManager graceful shutdown will mask many issues. Given this is acceptable, I will upload a patch. > NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled > -- > > Key: YARN-3585 > URL: https://issues.apache.org/jira/browse/YARN-3585 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Priority: Critical > > With NM recovery enabled, after decommission, nodemanager log show stop but > process cannot end. > non daemon thread: > {noformat} > "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on > condition [0x] > "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable > [0x] > "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable > "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 > nid=0x29ed runnable > "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 > nid=0x29ee runnable > "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 > nid=0x29ef runnable > "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 > nid=0x29f0 runnable > "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 > nid=0x29f1 runnable > "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 > nid=0x29f2 runnable > "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 > nid=0x29f3 runnable > "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 > nid=0x29f4 runnable > "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 > runnable > "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 > nid=0x29f5 runnable > "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 > nid=0x29f6 runnable > "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting > on condition > {noformat} > and jni leveldb thread stack > {noformat} > Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): > #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/libpthread.so.0 > #1 0x7f33dfce2a3b in leveldb::(anonymous > namespace)::PosixEnv::BGThreadWrapper(void*) () from > /tmp/libleveldbjni-64-1-6922178968300745716.8 > #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 > #3 0x003d830e811d in clone () from /lib64/libc.so.6 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563124#comment-14563124 ] Rohith commented on YARN-3585: -- Tested with a patch that logs before and after db.close, and found that the db is closed. No exceptions were thrown while closing via db.close. > NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled > -- > > Key: YARN-3585 > URL: https://issues.apache.org/jira/browse/YARN-3585 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Priority: Critical > > With NM recovery enabled, after decommission, nodemanager log show stop but > process cannot end. > non daemon thread: > {noformat} > "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on > condition [0x] > "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable > [0x] > "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable > "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 > nid=0x29ed runnable > "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 > nid=0x29ee runnable > "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 > nid=0x29ef runnable > "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 > nid=0x29f0 runnable > "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 > nid=0x29f1 runnable > "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 > nid=0x29f2 runnable > "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 > nid=0x29f3 runnable > "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 > nid=0x29f4 runnable > "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 > runnable > "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 > nid=0x29f5 runnable > "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 > nid=0x29f6 runnable > "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting > on condition > {noformat} > and jni leveldb thread stack > {noformat} > Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): > #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/libpthread.so.0 > #1 0x7f33dfce2a3b in leveldb::(anonymous > namespace)::PosixEnv::BGThreadWrapper(void*) () from > /tmp/libleveldbjni-64-1-6922178968300745716.8 > #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 > #3 0x003d830e811d in clone () from /lib64/libc.so.6 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562638#comment-14562638 ] Rohith commented on YARN-3733: -- Steps to reproduce the scenario quickly. Assume that max-am-resource-limit is configured as 0.5 and the cluster capacity is 10GB after the NM is registered, so the max AM resource limit is 5GB. # Start the RM configured with the DominantResourceCalculator (don't start any NM in the cluster). # Submit 10 applications of 1GB each; all 10 applications get activated. # Start the NM; the RM launches all 10 applications' AMs, the cluster is full, and the cluster hangs forever. When no NM is registered, submitted applications should not be activated, i.e. they should not participate in scheduling. > On RM restart AM getting more than maximum possible memory when many tasks > in queue > - > > Key: YARN-3733 > URL: https://issues.apache.org/jira/browse/YARN-3733 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 , 2 NM , 2 RM > one NM - 3 GB 6 v core >Reporter: Bibin A Chundatt >Assignee: Rohith >Priority: Critical > > Steps to reproduce > = > 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) > 2. Configure map and reduce size to 512 MB after changing scheduler minimum > size to 512 MB > 3. Configure capacity scheduler and AM limit to .5 > (DominantResourceCalculator is configured) > 4. Submit 30 concurrent task > 5. Switch RM > Actual > = > For 12 Jobs AM gets allocated and all 12 starts running > No other Yarn child is initiated , *all 12 Jobs in Running state for ever* > Expected > === > Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562622#comment-14562622 ] Rohith commented on YARN-3733: -- Verified the RM logs from [~bibinchundatt] offline. The sequence of events that occurred is: # 30 applications are submitted to RM1 concurrently: *pendingApplications=18 and activeApplications=12*. The active applications move to the RUNNING state. # RM1 switches to standby and RM2 transitions to Active; the currently active RM is RM2. # The previously submitted 30 applications start recovering. As part of the recovery process, all 30 applications are submitted to the scheduler and all of them become active, i.e. *activeApplications=30 and pendingApplications=0*, which is not expected to happen. # NMs register with the RM, and the running AMs register with the RM. # Since all 30 applications are activated, the scheduler tries to launch ApplicationMasters for all of them and occupies the full cluster capacity. Basically, the issue is that the AM limit check in LeafQueue#activateApplications is not working as expected for the {{DominantResourceCalculator}}. To confirm this, I wrote a simple program exercising both the Default and Dominant resource calculators with the memory configuration below. The output of the program: for the DefaultResourceCalculator the result is false, which limits the applications being activated when the AM resource limit is exceeded; for the DominantResourceCalculator the result is true, which allows all the applications to be activated even if the AM resource limit is exceeded.
{noformat}
2015-05-28 14:00:52,704 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: application AMResource maxAMResourcePerQueuePercent 0.5 amLimit lastClusterResource amIfStarted 
{noformat}
{code}
package com.test.hadoop;

import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator;
import org.apache.hadoop.yarn.util.resource.DominantResourceCalculator;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

public class TestResourceCalculator {

  public static void main(String[] args) {
    ResourceCalculator defaultResourceCalculator =
        new DefaultResourceCalculator();
    ResourceCalculator dominantResourceCalculator =
        new DominantResourceCalculator();

    Resource lastClusterResource = Resource.newInstance(0, 0);
    Resource amIfStarted = Resource.newInstance(4096, 1);
    Resource amLimit = Resource.newInstance(0, 0);

    // Expected result false; actual result is also false.
    System.out.println("DefaultResourceCalculator : "
        + Resources.lessThanOrEqual(defaultResourceCalculator,
            lastClusterResource, amIfStarted, amLimit));

    // Expected result false, but actual result is true for the
    // DominantResourceCalculator.
    System.out.println("DominantResourceCalculator : "
        + Resources.lessThanOrEqual(dominantResourceCalculator,
            lastClusterResource, amIfStarted, amLimit));
  }
}
{code}
> On RM restart AM getting more than maximum possible memory when many tasks > in queue > - > > Key: YARN-3733 > URL: https://issues.apache.org/jira/browse/YARN-3733 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: Suse 11 Sp3 , 2 NM , 2 RM > one NM - 3 GB 6 v core >Reporter: Bibin A Chundatt >Assignee: Rohith >Priority: Critical > > Steps to reproduce > = > 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) > 2. 
Configure map and reduce size to 512 MB after changing scheduler minimum > size to 512 MB > 3. Configure capacity scheduler and AM limit to .5 > (DominantResourceCalculator is configured) > 4. Submit 30 concurrent task > 5. Switch RM > Actual > = > For 12 Jobs AM gets allocated and all 12 starts running > No other Yarn child is initiated , *all 12 Jobs in Running state for ever* > Expected > === > Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3733) On RM restart AM getting more than maximum possible memory when many tasks in queue
[ https://issues.apache.org/jira/browse/YARN-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith reassigned YARN-3733: Assignee: Rohith > On RM restart AM getting more than maximum possible memory when many tasks > in queue > - > > Key: YARN-3733 > URL: https://issues.apache.org/jira/browse/YARN-3733 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager > Environment: Suse 11 Sp3 , 2 NM , 2 RM > one NM - 3 GB 6 v core >Reporter: Bibin A Chundatt >Assignee: Rohith >Priority: Critical > > Steps to reproduce > = > 1. Install HA with 2 RM 2 NM (3072 MB * 2 total cluster) > 2. Configure map and reduce size to 512 MB after changing scheduler minimum > size to 512 MB > 3. Configure capacity scheduler and AM limit to .5 > (DominantResourceCalculator is configured) > 4. Submit 30 concurrent task > 5. Switch RM > Actual > = > For 12 Jobs AM gets allocated and all 12 starts running > No other Yarn child is initiated , *all 12 Jobs in Running state for ever* > Expected > === > Only 6 should be running at a time since max AM allocated is .5 (3072 MB) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3731) Unknown container. Container either has not started or has already completed or doesn’t belong to this node at all.
[ https://issues.apache.org/jira/browse/YARN-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562280#comment-14562280 ] Rohith commented on YARN-3731: -- Closing the issue as invalid. > Unknown container. Container either has not started or has already completed > or doesn’t belong to this node at all. > > > Key: YARN-3731 > URL: https://issues.apache.org/jira/browse/YARN-3731 > Project: Hadoop YARN > Issue Type: Bug >Reporter: amit >Priority: Critical > > Hi > I am importing data from sql server to hdfs and below is the command > sqoop import –connect > “jdbc:sqlserver://Servername:1433;username=hadoop;password=Password;database=MSBI” > –table DimDate –target-dir /Hadoop/hdpdatadn/dn/DW/msbi > but I am getting following error: > User: amit.tomar > Name: DimDate.jar > Application Type: MAPREDUCE > Application Tags: > State: FAILED > FinalStatus: FAILED > Started: Wed May 27 12:39:48 +0800 2015 > Elapsed: 23sec > Tracking URL: History > Diagnostics: Application application_1432698911303_0005 failed 2 times due > to AM Container for appattempt_1432698911303_0005_02 exited with > exitCode: 1 > For more detailed output, check application tracking > page:http://ServerName/proxy/application_1432698911303_0005/Then, click on > links to logs of each attempt. > Diagnostics: Exception from container-launch. > Container id: container_1432698911303_0005_02_01 > Exit code: 1 > Stack trace: ExitCodeException exitCode=1: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:538) > at org.apache.hadoop.util.Shell.run(Shell.java:455) > at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Shell output: 1 file(s) moved. > Container exited with a non-zero exit code 1 > Failing this attempt. Failing the application. > From the log below is the message: > java.lang.Exception: Unknown container. Container either has not started or > has already completed or doesn’t belong to this node at all. > Thanks in advance > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3731) Unknown container. Container either has not started or has already completed or doesn’t belong to this node at all.
[ https://issues.apache.org/jira/browse/YARN-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith resolved YARN-3731. -- Resolution: Invalid > Unknown container. Container either has not started or has already completed > or doesn’t belong to this node at all. > > > Key: YARN-3731 > URL: https://issues.apache.org/jira/browse/YARN-3731 > Project: Hadoop YARN > Issue Type: Bug >Reporter: amit >Priority: Critical > > Hi > I am importing data from sql server to hdfs and below is the command > sqoop import –connect > “jdbc:sqlserver://Servername:1433;username=hadoop;password=Password;database=MSBI” > –table DimDate –target-dir /Hadoop/hdpdatadn/dn/DW/msbi > but I am getting following error: > User: amit.tomar > Name: DimDate.jar > Application Type: MAPREDUCE > Application Tags: > State: FAILED > FinalStatus: FAILED > Started: Wed May 27 12:39:48 +0800 2015 > Elapsed: 23sec > Tracking URL: History > Diagnostics: Application application_1432698911303_0005 failed 2 times due > to AM Container for appattempt_1432698911303_0005_02 exited with > exitCode: 1 > For more detailed output, check application tracking > page:http://ServerName/proxy/application_1432698911303_0005/Then, click on > links to logs of each attempt. > Diagnostics: Exception from container-launch. > Container id: container_1432698911303_0005_02_01 > Exit code: 1 > Stack trace: ExitCodeException exitCode=1: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:538) > at org.apache.hadoop.util.Shell.run(Shell.java:455) > at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Shell output: 1 file(s) moved. > Container exited with a non-zero exit code 1 > Failing this attempt. Failing the application. > From the log below is the message: > java.lang.Exception: Unknown container. Container either has not started or > has already completed or doesn’t belong to this node at all. > Thanks in advance > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3731) Unknown container. Container either has not started or has already completed or doesn’t belong to this node at all.
[ https://issues.apache.org/jira/browse/YARN-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562279#comment-14562279 ] Rohith commented on YARN-3731: -- Hi [~amitmsbi] Thanks for using Hadoop. You are trying to access the log link, but the application master itself was never launched. From the diagnostics message, it is clear that the application was not launched. So first and foremost, you need to check why the application master was not launched. There could be an application configuration or classpath issue, which you can find in the stderr container logs. Also, JIRA is meant for tracking development activities. For queries, kindly register with the [mailing list|https://hadoop.apache.org/mailing_lists.html] and send mail to the users mailing list, i.e. {{u...@hadoop.apache.org}}. Folks there will definitely help you solve or answer your queries. > Unknown container. Container either has not started or has already completed > or doesn’t belong to this node at all. > > > Key: YARN-3731 > URL: https://issues.apache.org/jira/browse/YARN-3731 > Project: Hadoop YARN > Issue Type: Bug >Reporter: amit >Priority: Critical > > Hi > I am importing data from sql server to hdfs and below is the command > sqoop import –connect > “jdbc:sqlserver://Servername:1433;username=hadoop;password=Password;database=MSBI” > –table DimDate –target-dir /Hadoop/hdpdatadn/dn/DW/msbi > but I am getting following error: > User: amit.tomar > Name: DimDate.jar > Application Type: MAPREDUCE > Application Tags: > State: FAILED > FinalStatus: FAILED > Started: Wed May 27 12:39:48 +0800 2015 > Elapsed: 23sec > Tracking URL: History > Diagnostics: Application application_1432698911303_0005 failed 2 times due > to AM Container for appattempt_1432698911303_0005_02 exited with > exitCode: 1 > For more detailed output, check application tracking > page:http://ServerName/proxy/application_1432698911303_0005/Then, click on > links to logs of each attempt. > Diagnostics: Exception from container-launch. > Container id: container_1432698911303_0005_02_01 > Exit code: 1 > Stack trace: ExitCodeException exitCode=1: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:538) > at org.apache.hadoop.util.Shell.run(Shell.java:455) > at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Shell output: 1 file(s) moved. > Container exited with a non-zero exit code 1 > Failing this attempt. Failing the application. > From the log below is the message: > java.lang.Exception: Unknown container. Container either has not started or > has already completed or doesn’t belong to this node at all. > Thanks in advance > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561256#comment-14561256 ] Rohith commented on YARN-3585: -- bq. Could you instrument logs in the state store code to verify the leveldb database is indeed being closed even when it hangs? Sorry, I did not get exactly what and where I should add logs. Do you mean I should add a log after {{NMLeveldbStateStoreService#closeStorage()}} is called? > NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled > -- > > Key: YARN-3585 > URL: https://issues.apache.org/jira/browse/YARN-3585 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Priority: Critical > > With NM recovery enabled, after decommission, nodemanager log show stop but > process cannot end. > non daemon thread: > {noformat} > "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on > condition [0x] > "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable > [0x] > "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable > "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 > nid=0x29ed runnable > "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 > nid=0x29ee runnable > "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 > nid=0x29ef runnable > "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 > nid=0x29f0 runnable > "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 > nid=0x29f1 runnable > "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 > nid=0x29f2 runnable > "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 > nid=0x29f3 runnable > "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 > nid=0x29f4 runnable > "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 > runnable > "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 > nid=0x29f5 runnable > "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 > nid=0x29f6 runnable > "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting > on condition > {noformat} > and jni leveldb thread stack > {noformat} > Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): > #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/libpthread.so.0 > #1 0x7f33dfce2a3b in leveldb::(anonymous > namespace)::PosixEnv::BGThreadWrapper(void*) () from > /tmp/libleveldbjni-64-1-6922178968300745716.8 > #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 > #3 0x003d830e811d in clone () from /lib64/libc.so.6 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
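For illustration, a minimal, self-contained sketch of the kind of before/after instrumentation discussed in the comment above, written against the same leveldbjni library the NM state store uses. The database path, class name, and messages are placeholders, not the actual NMLeveldbStateStoreService code:
{code}
import java.io.File;
import java.io.IOException;

import org.fusesource.leveldbjni.JniDBFactory;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;

public class LeveldbCloseProbe {

  public static void main(String[] args) throws IOException {
    Options options = new Options().createIfMissing(true);
    // Placeholder path; the real NM store lives under
    // yarn.nodemanager.recovery.dir.
    DB db = JniDBFactory.factory.open(new File("/tmp/nm-state-probe"), options);
    System.out.println("Closing leveldb database...");
    db.close();
    // If close() hangs, this line never appears in the output.
    System.out.println("leveldb database closed.");
  }
}
{code}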
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560965#comment-14560965 ] Rohith commented on YARN-3585: -- I have attached NM logs and a thread dump to YARN-3640. Could you get them from YARN-3640? > NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled > -- > > Key: YARN-3585 > URL: https://issues.apache.org/jira/browse/YARN-3585 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Priority: Critical > > With NM recovery enabled, after decommission, nodemanager log show stop but > process cannot end. > non daemon thread: > {noformat} > "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on > condition [0x] > "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable > [0x] > "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable > "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 > nid=0x29ed runnable > "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 > nid=0x29ee runnable > "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 > nid=0x29ef runnable > "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 > nid=0x29f0 runnable > "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 > nid=0x29f1 runnable > "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 > nid=0x29f2 runnable > "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 > nid=0x29f3 runnable > "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 > nid=0x29f4 runnable > "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 > runnable > "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 > nid=0x29f5 runnable > "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 > nid=0x29f6 runnable > "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting > on condition > {noformat} > and jni leveldb thread stack > {noformat} > Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): > #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/libpthread.so.0 > #1 0x7f33dfce2a3b in leveldb::(anonymous > namespace)::PosixEnv::BGThreadWrapper(void*) () from > /tmp/libleveldbjni-64-1-6922178968300745716.8 > #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 > #3 0x003d830e811d in clone () from /lib64/libc.so.6 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560774#comment-14560774 ] Rohith commented on YARN-3535: -- Thanks [~peng.zhang] for working on this issue. Some comments: # I think the method {{recoverResourceRequestForContainer}} should be synchronized, any thoughts? # Why do we require the {{RMContextImpl.java}} changes? I think we can avoid them; they are not necessarily required. Tests: # Any specific reason for changing {{TestAMRestart.java}}? # IIUC, this issue can occur in all the schedulers, given the AM-RM heartbeat interval is shorter than the NM-RM heartbeat interval. So can it include an FT test case that is applicable to both CS and FS? Maybe you can add a test in the class extending {{ParameterizedSchedulerTestBase}}, i.e. TestAbstractYarnScheduler. > ResourceRequest should be restored back to scheduler when RMContainer is > killed at ALLOCATED > - > > Key: YARN-3535 > URL: https://issues.apache.org/jira/browse/YARN-3535 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Assignee: Peng Zhang > Labels: BB2015-05-TBR > Attachments: YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, > yarn-app.log > > > During rolling update of NM, AM start of container on NM failed. > And then job hang there. > Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560470#comment-14560470 ] Rohith commented on YARN-3585: -- I tested locally using the YARN-3641 fix; the issue still exists. > NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled > -- > > Key: YARN-3585 > URL: https://issues.apache.org/jira/browse/YARN-3585 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Priority: Critical > > With NM recovery enabled, after decommission, nodemanager log show stop but > process cannot end. > non daemon thread: > {noformat} > "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on > condition [0x] > "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable > [0x] > "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable > "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 > nid=0x29ed runnable > "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 > nid=0x29ee runnable > "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 > nid=0x29ef runnable > "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 > nid=0x29f0 runnable > "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 > nid=0x29f1 runnable > "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 > nid=0x29f2 runnable > "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 > nid=0x29f3 runnable > "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 > nid=0x29f4 runnable > "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 > runnable > "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 > nid=0x29f5 runnable > "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 > nid=0x29f6 runnable > "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting > on condition > {noformat} > and jni leveldb thread stack > {noformat} > Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): > #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/libpthread.so.0 > #1 0x7f33dfce2a3b in leveldb::(anonymous > namespace)::PosixEnv::BGThreadWrapper(void*) () from > /tmp/libleveldbjni-64-1-6922178968300745716.8 > #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 > #3 0x003d830e811d in clone () from /lib64/libc.so.6 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560410#comment-14560410 ] Rohith commented on YARN-3585: -- I will test the YARN-3641 fix against this JIRA's scenario. About the patch, I think calling System.exit() explicitly after the shutdown thread exits is one option. > NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled > -- > > Key: YARN-3585 > URL: https://issues.apache.org/jira/browse/YARN-3585 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Priority: Critical > > With NM recovery enabled, after decommission, nodemanager log show stop but > process cannot end. > non daemon thread: > {noformat} > "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on > condition [0x] > "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable > [0x] > "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable > "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 > nid=0x29ed runnable > "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 > nid=0x29ee runnable > "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 > nid=0x29ef runnable > "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 > nid=0x29f0 runnable > "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 > nid=0x29f1 runnable > "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 > nid=0x29f2 runnable > "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 > nid=0x29f3 runnable > "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 > nid=0x29f4 runnable > "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 > runnable > "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 > nid=0x29f5 runnable > "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 > nid=0x29f6 runnable > "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting > on condition > {noformat} > and jni leveldb thread stack > {noformat} > Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): > #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/libpthread.so.0 > #1 0x7f33dfce2a3b in leveldb::(anonymous > namespace)::PosixEnv::BGThreadWrapper(void*) () from > /tmp/libleveldbjni-64-1-6922178968300745716.8 > #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 > #3 0x003d830e811d in clone () from /lib64/libc.so.6 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14558050#comment-14558050 ] Rohith commented on YARN-3543: -- [~vinodkv] Kindly review the updated patch.. > ApplicationReport should be able to tell whether the Application is AM > managed or not. > --- > > Key: YARN-3543 > URL: https://issues.apache.org/jira/browse/YARN-3543 > Project: Hadoop YARN > Issue Type: Improvement > Components: api >Affects Versions: 2.6.0 >Reporter: Spandan Dutta >Assignee: Rohith > Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, > 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, > 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, > YARN-3543-AH.PNG, YARN-3543-RM.PNG > > > Currently we can know whether the application submitted by the user is AM > managed from the applicationSubmissionContext. This can be only done at the > time when the user submits the job. We should have access to this info from > the ApplicationReport as well so that we can check whether an app is AM > managed or not anytime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3543: - Attachment: 0004-YARN-3543.patch Attaching same patch as previous to kick off Jenkins > ApplicationReport should be able to tell whether the Application is AM > managed or not. > --- > > Key: YARN-3543 > URL: https://issues.apache.org/jira/browse/YARN-3543 > Project: Hadoop YARN > Issue Type: Improvement > Components: api >Affects Versions: 2.6.0 >Reporter: Spandan Dutta >Assignee: Rohith > Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, > 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, > 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, > YARN-3543-AH.PNG, YARN-3543-RM.PNG > > > Currently we can know whether the application submitted by the user is AM > managed from the applicationSubmissionContext. This can be only done at the > time when the user submits the job. We should have access to this info from > the ApplicationReport as well so that we can check whether an app is AM > managed or not anytime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2238) filtering on UI sticks even if I move away from the page
[ https://issues.apache.org/jira/browse/YARN-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557960#comment-14557960 ] Rohith commented on YARN-2238: -- +1 lgtm (non-binding) > filtering on UI sticks even if I move away from the page > > > Key: YARN-2238 > URL: https://issues.apache.org/jira/browse/YARN-2238 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Affects Versions: 2.4.0 >Reporter: Sangjin Lee >Assignee: Jian He > Labels: usability > Attachments: YARN-2238.patch, YARN-2238.png, filtered.png > > > The main data table in many web pages (RM, AM, etc.) seems to show an > unexpected filtering behavior. > If I filter the table by typing something in the key or value field (or I > suspect any search field), the data table gets filtered. The example I used > is the job configuration page for a MR job. That is expected. > However, when I move away from that page and visit any other web page of the > same type (e.g. a job configuration page), the page is rendered with the > filtering! That is unexpected. > What's even stranger is that it does not render the filtering term. As a > result, I have a page that's mysteriously filtered but doesn't tell me what > it's filtering on. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2238) filtering on UI sticks even if I move away from the page
[ https://issues.apache.org/jira/browse/YARN-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557959#comment-14557959 ] Rohith commented on YARN-2238: -- Tested locally with YARN-3707 fix, working fine:-) > filtering on UI sticks even if I move away from the page > > > Key: YARN-2238 > URL: https://issues.apache.org/jira/browse/YARN-2238 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Affects Versions: 2.4.0 >Reporter: Sangjin Lee >Assignee: Jian He > Labels: usability > Attachments: YARN-2238.patch, YARN-2238.png, filtered.png > > > The main data table in many web pages (RM, AM, etc.) seems to show an > unexpected filtering behavior. > If I filter the table by typing something in the key or value field (or I > suspect any search field), the data table gets filtered. The example I used > is the job configuration page for a MR job. That is expected. > However, when I move away from that page and visit any other web page of the > same type (e.g. a job configuration page), the page is rendered with the > filtering! That is unexpected. > What's even stranger is that it does not render the filtering term. As a > result, I have a page that's mysteriously filtered but doesn't tell me what > it's filtering on. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3708) container num become -1 after job finished
[ https://issues.apache.org/jira/browse/YARN-3708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith resolved YARN-3708. -- Resolution: Duplicate This is duplicate of YARN-3552. Closing the issue as duplicate.. > container num become -1 after job finished > -- > > Key: YARN-3708 > URL: https://issues.apache.org/jira/browse/YARN-3708 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.7.0 >Reporter: tongshiquan >Priority: Minor > Attachments: screenshot-1.png > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2238) filtering on UI sticks even if I move away from the page
[ https://issues.apache.org/jira/browse/YARN-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557279#comment-14557279 ] Rohith commented on YARN-2238: -- Attached the RM web UI page image file which depicts the problem-2 in my previous comment. > filtering on UI sticks even if I move away from the page > > > Key: YARN-2238 > URL: https://issues.apache.org/jira/browse/YARN-2238 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Affects Versions: 2.4.0 >Reporter: Sangjin Lee >Assignee: Jian He > Labels: usability > Attachments: YARN-2238.patch, YARN-2238.png, filtered.png > > > The main data table in many web pages (RM, AM, etc.) seems to show an > unexpected filtering behavior. > If I filter the table by typing something in the key or value field (or I > suspect any search field), the data table gets filtered. The example I used > is the job configuration page for a MR job. That is expected. > However, when I move away from that page and visit any other web page of the > same type (e.g. a job configuration page), the page is rendered with the > filtering! That is unexpected. > What's even stranger is that it does not render the filtering term. As a > result, I have a page that's mysteriously filtered but doesn't tell me what > it's filtering on. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2238) filtering on UI sticks even if I move away from the page
[ https://issues.apache.org/jira/browse/YARN-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-2238: - Attachment: YARN-2238.png > filtering on UI sticks even if I move away from the page > > > Key: YARN-2238 > URL: https://issues.apache.org/jira/browse/YARN-2238 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Affects Versions: 2.4.0 >Reporter: Sangjin Lee >Assignee: Jian He > Labels: usability > Attachments: YARN-2238.patch, YARN-2238.png, filtered.png > > > The main data table in many web pages (RM, AM, etc.) seems to show an > unexpected filtering behavior. > If I filter the table by typing something in the key or value field (or I > suspect any search field), the data table gets filtered. The example I used > is the job configuration page for a MR job. That is expected. > However, when I move away from that page and visit any other web page of the > same type (e.g. a job configuration page), the page is rendered with the > filtering! That is unexpected. > What's even stranger is that it does not render the filtering term. As a > result, I have a page that's mysteriously filtered but doesn't tell me what > it's filtering on. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2238) filtering on UI sticks even if I move away from the page
[ https://issues.apache.org/jira/browse/YARN-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557277#comment-14557277 ] Rohith commented on YARN-2238: -- I do not have much knowledge of jQuery, but I did black-box testing on a one-node cluster with the patch applied. Some observations: # Filtering on the scheduler page does not carry over to the application page. This is the scenario in this JIRA, and it works fine. # After navigating to the scheduler page, clicking on a LeafQueue bar applies the filters but does not show any apps running on that queue on the scheduler page. > filtering on UI sticks even if I move away from the page > > > Key: YARN-2238 > URL: https://issues.apache.org/jira/browse/YARN-2238 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Affects Versions: 2.4.0 >Reporter: Sangjin Lee >Assignee: Jian He > Labels: usability > Attachments: YARN-2238.patch, filtered.png > > > The main data table in many web pages (RM, AM, etc.) seems to show an > unexpected filtering behavior. > If I filter the table by typing something in the key or value field (or I > suspect any search field), the data table gets filtered. The example I used > is the job configuration page for a MR job. That is expected. > However, when I move away from that page and visit any other web page of the > same type (e.g. a job configuration page), the page is rendered with the > filtering! That is unexpected. > What's even stranger is that it does not render the filtering term. As a > result, I have a page that's mysteriously filtered but doesn't tell me what > it's filtering on. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557170#comment-14557170 ] Rohith commented on YARN-3585: -- I think we can invoke System.exit in a finally block once the NodeManager is shut down. For test case execution, bypass it using a flag. Any thoughts? > NodeManager cannot exit on SHUTDOWN event triggered and NM recovery is enabled > -- > > Key: YARN-3585 > URL: https://issues.apache.org/jira/browse/YARN-3585 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Peng Zhang >Priority: Critical > > With NM recovery enabled, after decommission, nodemanager log show stop but > process cannot end. > non daemon thread: > {noformat} > "DestroyJavaVM" prio=10 tid=0x7f3460011800 nid=0x29ec waiting on > condition [0x] > "leveldb" prio=10 tid=0x7f3354001800 nid=0x2a97 runnable > [0x] > "VM Thread" prio=10 tid=0x7f3460167000 nid=0x29f8 runnable > "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x7f346002 > nid=0x29ed runnable > "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x7f3460022000 > nid=0x29ee runnable > "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x7f3460024000 > nid=0x29ef runnable > "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x7f3460025800 > nid=0x29f0 runnable > "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x7f3460027800 > nid=0x29f1 runnable > "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x7f3460029000 > nid=0x29f2 runnable > "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x7f346002b000 > nid=0x29f3 runnable > "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x7f346002d000 > nid=0x29f4 runnable > "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x7f3460120800 nid=0x29f7 > runnable > "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x7f346011c800 > nid=0x29f5 runnable > "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x7f346011e800 > nid=0x29f6 runnable > "VM Periodic Task Thread" prio=10 tid=0x7f346019f800 nid=0x2a01 waiting > on condition > {noformat} > and jni leveldb thread stack > {noformat} > Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): > #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/libpthread.so.0 > #1 0x7f33dfce2a3b in leveldb::(anonymous > namespace)::PosixEnv::BGThreadWrapper(void*) () from > /tmp/libleveldbjni-64-1-6922178968300745716.8 > #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 > #3 0x003d830e811d in clone () from /lib64/libc.so.6 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
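For illustration, a minimal, self-contained sketch of the System.exit-plus-flag idea proposed in the comment above. The class and field names are hypothetical; the real change would live in the NodeManager's shutdown path rather than in a standalone class:
{code}
public class NodeManagerExitSketch {

  // Tests would set this to false so the JVM is not killed mid-run.
  private static volatile boolean shouldExitJvm = true;

  public static void main(String[] args) {
    try {
      stopNodeManagerGracefully();
    } finally {
      if (shouldExitJvm) {
        // Forces the process to end even if a non-daemon thread
        // (e.g. the JNI leveldb background thread) is still alive.
        System.exit(0);
      }
    }
  }

  private static void stopNodeManagerGracefully() {
    // Placeholder for the composite-service stop logic (nodeManager.stop()).
  }
}
{code}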
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556239#comment-14556239 ] Rohith commented on YARN-3543: -- [~aw] Would you help me understand and resolve a build issue? Basically, what I observe is that the patch contains changes to many files across many projects. When the test cases are triggered, the build ignores the applied patches and uses the existing class files, which causes the compilation failure and other issues. But if I apply the patch and build locally, it is successful. > ApplicationReport should be able to tell whether the Application is AM > managed or not. > --- > > Key: YARN-3543 > URL: https://issues.apache.org/jira/browse/YARN-3543 > Project: Hadoop YARN > Issue Type: Improvement > Components: api >Affects Versions: 2.6.0 >Reporter: Spandan Dutta >Assignee: Rohith > Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, > 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, > 0004-YARN-3543.patch, 0004-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG > > > Currently we can know whether the application submitted by the user is AM > managed from the applicationSubmissionContext. This can be only done at the > time when the user submits the job. We should have access to this info from > the ApplicationReport as well so that we can check whether an app is AM > managed or not anytime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3692) Allow REST API to set a user generated message when killing an application
[ https://issues.apache.org/jira/browse/YARN-3692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554057#comment-14554057 ] Rohith commented on YARN-3692: -- All applications are killed by the user. The diagnostic message for an application KILLED by the user is internal to YARN, whether the kill comes from REST or from the ApplicationClientProtocol. Does this let the user set the reason for killing applications? > Allow REST API to set a user generated message when killing an application > -- > > Key: YARN-3692 > URL: https://issues.apache.org/jira/browse/YARN-3692 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Rajat Jain >Assignee: Rohith > > Currently YARN's REST API supports killing an application without setting a > diagnostic message. It would be good to provide that support. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
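For context, a hedged sketch of what the proposed call could look like from a client. Today the RM REST API accepts a PUT to /ws/v1/cluster/apps/{appid}/state with a JSON body such as {"state":"KILLED"}; the "diagnostics" field shown here is the proposed addition and does not exist yet, and the host and application id are placeholders:
{code}
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class KillAppWithMessage {

  public static void main(String[] args) throws Exception {
    // Placeholder RM host and application id.
    URL url = new URL("http://rmhost:8088/ws/v1/cluster/apps/"
        + "application_1430126768987_0001/state");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("PUT");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    // "diagnostics" is the proposed (not yet existing) field.
    String body = "{\"state\":\"KILLED\","
        + "\"diagnostics\":\"Killed by admin: runaway job\"}";
    try (OutputStream os = conn.getOutputStream()) {
      os.write(body.getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("HTTP response code: " + conn.getResponseCode());
  }
}
{code}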
[jira] [Assigned] (YARN-3692) Allow REST API to set a user generated message when killing an application
[ https://issues.apache.org/jira/browse/YARN-3692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith reassigned YARN-3692: Assignee: Rohith > Allow REST API to set a user generated message when killing an application > -- > > Key: YARN-3692 > URL: https://issues.apache.org/jira/browse/YARN-3692 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Rajat Jain >Assignee: Rohith > > Currently YARN's REST API supports killing an application without setting a > diagnostic message. It would be good to provide that support. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552225#comment-14552225 ] Rohith commented on YARN-3646: -- +1 lgtm (non-binding).. wait for jenkins report!! > Applications are getting stuck some times in case of retry policy forever > - > > Key: YARN-3646 > URL: https://issues.apache.org/jira/browse/YARN-3646 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Raju Bairishetti > Attachments: YARN-3646.001.patch, YARN-3646.002.patch, YARN-3646.patch > > > We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use FOREVER > retry policy. > Yarn client is infinitely retrying in case of exceptions from the RM as it is > using retrying policy as FOREVER. The problem is it is retrying for all kinds > of exceptions (like ApplicationNotFoundException), even though it is not a > connection failure. Due to this my application is not progressing further. > *Yarn client should not retry infinitely in case of non connection failures.* > We have written a simple yarn-client which is trying to get an application > report for an invalid or older appId. ResourceManager is throwing an > ApplicationNotFoundException as this is an invalid or older appId. But > because of retry policy FOREVER, client is keep on retrying for getting the > application report and ResourceManager is throwing > ApplicationNotFoundException continuously. > {code} > private void testYarnClientRetryPolicy() throws Exception{ > YarnConfiguration conf = new YarnConfiguration(); > conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, > -1); > YarnClient yarnClient = YarnClient.createYarnClient(); > yarnClient.init(conf); > yarnClient.start(); > ApplicationId appId = ApplicationId.newInstance(1430126768987L, > 10645); > ApplicationReport report = yarnClient.getApplicationReport(appId); > } > {code} > *RM logs:* > {noformat} > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875162 Retry#0 > org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application > with id 'application_1430126768987_10645' doesn't exist in RM. > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875163 Retry#0 > > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552165#comment-14552165 ] Rohith commented on YARN-3543: -- The build machine is not able to run all those tests in one shot. A similar issue was faced earlier in YARN-2784. I think we need to split the JIRA into a proto change, a WebUI change, an AH change, and more. > ApplicationReport should be able to tell whether the Application is AM > managed or not. > --- > > Key: YARN-3543 > URL: https://issues.apache.org/jira/browse/YARN-3543 > Project: Hadoop YARN > Issue Type: Improvement > Components: api >Affects Versions: 2.6.0 >Reporter: Spandan Dutta >Assignee: Rohith > Labels: BB2015-05-TBR > Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, > 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, > 0004-YARN-3543.patch, 0004-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG > > > Currently we can know whether the application submitted by the user is AM > managed from the applicationSubmissionContext. This can be only done at the > time when the user submits the job. We should have access to this info from > the ApplicationReport as well so that we can check whether an app is AM > managed or not anytime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552091#comment-14552091 ] Rohith commented on YARN-3646: -- Thanks for updating the patch. Some comments on the tests: # I think we can remove the tests added in the hadoop-common project, since yarn-client verifies the required functionality. Basically, the hadoop-common test was mocking the RMProxy functionality, and that test was passing even without the RMProxy fix. # The code never reaches {{Assert.fail("");}}; better to remove it. # Catch the ApplicationNotFoundException instead of catching Throwable. I think you can add {{expected = ApplicationNotFoundException.class}} to the @Test annotation like below.
{code}
@Test(timeout = 3, expected = ApplicationNotFoundException.class)
public void testClientWithRetryPolicyForEver() throws Exception {
  YarnConfiguration conf = new YarnConfiguration();
  conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, -1);
  ResourceManager rm = null;
  YarnClient yarnClient = null;
  try {
    // start rm
    rm = new ResourceManager();
    rm.init(conf);
    rm.start();
    yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();
    // create invalid application id
    ApplicationId appId = ApplicationId.newInstance(1430126768987L, 10645);
    // RM should throw ApplicationNotFoundException exception
    yarnClient.getApplicationReport(appId);
  } finally {
    if (yarnClient != null) {
      yarnClient.stop();
    }
    if (rm != null) {
      rm.stop();
    }
  }
}
{code}
# Can you rename the test to reflect the actual functionality under test, e.g. {{testShouldNotRetryForeverForNonNetworkExceptions}}?
> Applications are getting stuck some times in case of retry policy forever > - > > Key: YARN-3646 > URL: https://issues.apache.org/jira/browse/YARN-3646 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Raju Bairishetti > Attachments: YARN-3646.001.patch, YARN-3646.patch > > > We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use FOREVER > retry policy. > Yarn client is infinitely retrying in case of exceptions from the RM as it is > using retrying policy as FOREVER. The problem is it is retrying for all kinds > of exceptions (like ApplicationNotFoundException), even though it is not a > connection failure. Due to this my application is not progressing further. > *Yarn client should not retry infinitely in case of non connection failures.* > We have written a simple yarn-client which is trying to get an application > report for an invalid or older appId. ResourceManager is throwing an > ApplicationNotFoundException as this is an invalid or older appId. But > because of retry policy FOREVER, client is keep on retrying for getting the > application report and ResourceManager is throwing > ApplicationNotFoundException continuously. 
> {code} > private void testYarnClientRetryPolicy() throws Exception{ > YarnConfiguration conf = new YarnConfiguration(); > conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, > -1); > YarnClient yarnClient = YarnClient.createYarnClient(); > yarnClient.init(conf); > yarnClient.start(); > ApplicationId appId = ApplicationId.newInstance(1430126768987L, > 10645); > ApplicationReport report = yarnClient.getApplicationReport(appId); > } > {code} > *RM logs:* > {noformat} > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875162 Retry#0 > org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application > with id 'application_1430126768987_10645' doesn't exist in RM. > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.had
[jira] [Updated] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3543: - Attachment: 0004-YARN-3543.patch Attached same patch to kick off Jenkins > ApplicationReport should be able to tell whether the Application is AM > managed or not. > --- > > Key: YARN-3543 > URL: https://issues.apache.org/jira/browse/YARN-3543 > Project: Hadoop YARN > Issue Type: Improvement > Components: api >Affects Versions: 2.6.0 >Reporter: Spandan Dutta >Assignee: Rohith > Labels: BB2015-05-TBR > Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, > 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, > 0004-YARN-3543.patch, 0004-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG > > > Currently we can know whether the application submitted by the user is AM > managed from the applicationSubmissionContext. This can be only done at the > time when the user submits the job. We should have access to this info from > the ApplicationReport as well so that we can check whether an app is AM > managed or not anytime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3543: - Attachment: (was: 0003-YARN-3543.patch) > ApplicationReport should be able to tell whether the Application is AM > managed or not. > --- > > Key: YARN-3543 > URL: https://issues.apache.org/jira/browse/YARN-3543 > Project: Hadoop YARN > Issue Type: Improvement > Components: api >Affects Versions: 2.6.0 >Reporter: Spandan Dutta >Assignee: Rohith > Labels: BB2015-05-TBR > Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, > 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, > 0004-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG > > > Currently we can know whether the application submitted by the user is AM > managed from the applicationSubmissionContext. This can be only done at the > time when the user submits the job. We should have access to this info from > the ApplicationReport as well so that we can check whether an app is AM > managed or not anytime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2268) Disallow formatting the RMStateStore when there is an RM running
[ https://issues.apache.org/jira/browse/YARN-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551826#comment-14551826 ] Rohith commented on YARN-2268: -- Thanks [~sunilg] [~jianhe] [~kasha] for sharing your thoughts. bq. Given we recommend using the ZK-store when using HA, how about adding this for the ZK-store using an ephemeral znode for lock first? +1, given that ZKRMStateStore is the recommended state store for HA. bq. How about creating a lock file and declaring it stale after a stipulated period of time. If we use a stipulated period, then within that period neither can the RM be started nor can the state store be formatted. Also, the lock file would have to be stored in HDFS regardless of the RMStateStore implementation, which is an extra binding to the filesystem. Instead, why can't we use the general approach of polling the RM web service? That would give a more accurate state. > Disallow formatting the RMStateStore when there is an RM running > > > Key: YARN-2268 > URL: https://issues.apache.org/jira/browse/YARN-2268 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Karthik Kambatla >Assignee: Rohith > Attachments: 0001-YARN-2268.patch > > > YARN-2131 adds a way to format the RMStateStore. However, it can be a problem > if we format the store while an RM is actively using it. It would be nice to > fail the format if there is an RM running and using this store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
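As a rough, illustrative sketch of the ephemeral-znode lock discussed above (not taken from any attached patch; the lock path and class name are assumptions, and a real patch would derive the path from the configured ZK state-store root):

{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class StateStoreFormatLock {
  // Hypothetical lock path; a real patch would place this under the
  // configured ZK state-store root znode.
  private static final String LOCK_PATH = "/rmstore/FORMAT_LOCK";

  // Try to take an exclusive lock before formatting. A running RM would
  // hold an ephemeral znode at the same path, so creation fails while the
  // RM is alive, and the znode vanishes automatically if its session dies.
  public static boolean tryLock(ZooKeeper zk) throws Exception {
    try {
      zk.create(LOCK_PATH, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
          CreateMode.EPHEMERAL);
      return true;  // no live session holds the lock; safe to format
    } catch (KeeperException.NodeExistsException e) {
      return false; // an RM (or another format) currently owns the lock
    }
  }
}
{code}

The appeal of the ephemeral znode over a timestamped lock file is that liveness is tied to the ZK session rather than to a stipulated timeout, so neither side has to wait out a stale lock.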
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550854#comment-14550854 ] Rohith commented on YARN-3543: -- About the -1's from QA: # The findbugs warning is already tracked by YARN-3677. # The checkstyle error is that the number of parameters exceeds 7, which I think should be ignored. I am not sure whether it should be added to an ignore file or simply left as is. # Regarding the test failures, I suspect the test machines; many tests are failing: ## Type-1: Address already in use exceptions. ## Type-2: NoSuchMethodError. ## Type-3: ClassCastException, and many others. I also suspect the order of compilation and test execution: probably the resourcemanager test run is not picking up the modified classes in yarn-api/yarn-common, so the NoSuchMethodError is thrown. > ApplicationReport should be able to tell whether the Application is AM > managed or not. > --- > > Key: YARN-3543 > URL: https://issues.apache.org/jira/browse/YARN-3543 > Project: Hadoop YARN > Issue Type: Improvement > Components: api >Affects Versions: 2.6.0 >Reporter: Spandan Dutta >Assignee: Rohith > Labels: BB2015-05-TBR > Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, > 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, > 0003-YARN-3543.patch, 0004-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG > > > Currently we can know whether the application submitted by the user is AM > managed from the applicationSubmissionContext. This can be only done at the > time when the user submits the job. We should have access to this info from > the ApplicationReport as well so that we can check whether an app is AM > managed or not anytime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3543: - Attachment: 0004-YARN-3543.patch > ApplicationReport should be able to tell whether the Application is AM > managed or not. > --- > > Key: YARN-3543 > URL: https://issues.apache.org/jira/browse/YARN-3543 > Project: Hadoop YARN > Issue Type: Improvement > Components: api >Affects Versions: 2.6.0 >Reporter: Spandan Dutta >Assignee: Rohith > Labels: BB2015-05-TBR > Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, > 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, > 0003-YARN-3543.patch, 0004-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG > > > Currently we can know whether the application submitted by the user is AM > managed from the applicationSubmissionContext. This can be only done at the > time when the user submits the job. We should have access to this info from > the ApplicationReport as well so that we can check whether an app is AM > managed or not anytime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550258#comment-14550258 ] Rohith commented on YARN-3646: -- I also verified this on a single-node cluster by enabling and disabling the retry-forever policy. > Applications are getting stuck some times in case of retry policy forever > - > > Key: YARN-3646 > URL: https://issues.apache.org/jira/browse/YARN-3646 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Raju Bairishetti > Attachments: YARN-3646.patch > > > We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use FOREVER > retry policy. > Yarn client is infinitely retrying in case of exceptions from the RM as it is > using retrying policy as FOREVER. The problem is it is retrying for all kinds > of exceptions (like ApplicationNotFoundException), even though it is not a > connection failure. Due to this my application is not progressing further. > *Yarn client should not retry infinitely in case of non connection failures.* > We have written a simple yarn-client which is trying to get an application > report for an invalid or older appId. ResourceManager is throwing an > ApplicationNotFoundException as this is an invalid or older appId. But > because of retry policy FOREVER, client is keep on retrying for getting the > application report and ResourceManager is throwing > ApplicationNotFoundException continuously. > {code} > private void testYarnClientRetryPolicy() throws Exception{ > YarnConfiguration conf = new YarnConfiguration(); > conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, > -1); > YarnClient yarnClient = YarnClient.createYarnClient(); > yarnClient.init(conf); > yarnClient.start(); > ApplicationId appId = ApplicationId.newInstance(1430126768987L, > 10645); > ApplicationReport report = yarnClient.getApplicationReport(appId); > } > {code} > *RM logs:* > {noformat} > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875162 Retry#0 > org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application > with id 'application_1430126768987_10645' doesn't exist in RM. > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875163 Retry#0 > > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550256#comment-14550256 ] Rohith commented on YARN-3646: -- Thanks for working on this issue. The patch overall looks good to me. nit: Can the test be moved to the YARN package, since the issue is in YARN? Otherwise, if anything changes in RMProxy, the test will not run. > Applications are getting stuck some times in case of retry policy forever > - > > Key: YARN-3646 > URL: https://issues.apache.org/jira/browse/YARN-3646 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Raju Bairishetti > Attachments: YARN-3646.patch > > > We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use FOREVER > retry policy. > Yarn client is infinitely retrying in case of exceptions from the RM as it is > using retrying policy as FOREVER. The problem is it is retrying for all kinds > of exceptions (like ApplicationNotFoundException), even though it is not a > connection failure. Due to this my application is not progressing further. > *Yarn client should not retry infinitely in case of non connection failures.* > We have written a simple yarn-client which is trying to get an application > report for an invalid or older appId. ResourceManager is throwing an > ApplicationNotFoundException as this is an invalid or older appId. But > because of retry policy FOREVER, client is keep on retrying for getting the > application report and ResourceManager is throwing > ApplicationNotFoundException continuously. > {code} > private void testYarnClientRetryPolicy() throws Exception{ > YarnConfiguration conf = new YarnConfiguration(); > conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, > -1); > YarnClient yarnClient = YarnClient.createYarnClient(); > yarnClient.init(conf); > yarnClient.start(); > ApplicationId appId = ApplicationId.newInstance(1430126768987L, > 10645); > ApplicationReport report = yarnClient.getApplicationReport(appId); > } > {code} > *RM logs:* > {noformat} > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875162 Retry#0 > org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application > with id 'application_1430126768987_10645' doesn't exist in RM. 
> at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875163 Retry#0 > > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550233#comment-14550233 ] Rohith commented on YARN-3646: -- bq. Seems we do not even require exceptionToPolicy for FOREVER policy if we catch the exception in shouldRetry method. Makes sense to me; I will review the patch, thanks. > Applications are getting stuck some times in case of retry policy forever > - > > Key: YARN-3646 > URL: https://issues.apache.org/jira/browse/YARN-3646 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Raju Bairishetti > Attachments: YARN-3646.patch > > > We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use FOREVER > retry policy. > Yarn client is infinitely retrying in case of exceptions from the RM as it is > using retrying policy as FOREVER. The problem is it is retrying for all kinds > of exceptions (like ApplicationNotFoundException), even though it is not a > connection failure. Due to this my application is not progressing further. > *Yarn client should not retry infinitely in case of non connection failures.* > We have written a simple yarn-client which is trying to get an application > report for an invalid or older appId. ResourceManager is throwing an > ApplicationNotFoundException as this is an invalid or older appId. But > because of retry policy FOREVER, client is keep on retrying for getting the > application report and ResourceManager is throwing > ApplicationNotFoundException continuously. > {code} > private void testYarnClientRetryPolicy() throws Exception{ > YarnConfiguration conf = new YarnConfiguration(); > conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, > -1); > YarnClient yarnClient = YarnClient.createYarnClient(); > yarnClient.init(conf); > yarnClient.start(); > ApplicationId appId = ApplicationId.newInstance(1430126768987L, > 10645); > ApplicationReport report = yarnClient.getApplicationReport(appId); > } > {code} > *RM logs:* > {noformat} > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875162 Retry#0 > org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application > with id 'application_1430126768987_10645' doesn't exist in RM. 
> at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875163 Retry#0 > > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3674) YARN application disappears from view
[ https://issues.apache.org/jira/browse/YARN-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14549928#comment-14549928 ] Rohith commented on YARN-3674: -- Is this a dup of YARN-2238? > YARN application disappears from view > - > > Key: YARN-3674 > URL: https://issues.apache.org/jira/browse/YARN-3674 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.0 >Reporter: Sergey Shelukhin > > I have 2 tabs open at exact same URL with RUNNING applications view. There is > an application that is, in fact, running, that is visible in one tab but not > the other. This persists across refreshes. If I open new tab from the tab > where the application is not visible, in that tab it shows up ok. > I didn't change scheduler/queue settings before this behavior happened; on > [~sseth]'s advice I went and tried to click the root node of the scheduler on > scheduler page; the app still does not become visible. > Something got stuck somewhere... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546562#comment-14546562 ] Rohith commented on YARN-3543: -- bq. But doesn't that impact compatibility? I meant that ApplicationReport.newInstance() is called from outside of YARN. Ex: in MR, NotRunningJob#getUnknownApplicationReport. Similarly, if any other YARN clients call ApplicationReport.newInstance, changing it would cause a compatibility issue. So I just provided setters and getters for UnmanagedApp. > ApplicationReport should be able to tell whether the Application is AM > managed or not. > --- > > Key: YARN-3543 > URL: https://issues.apache.org/jira/browse/YARN-3543 > Project: Hadoop YARN > Issue Type: Improvement > Components: api >Affects Versions: 2.6.0 >Reporter: Spandan Dutta >Assignee: Rohith > Labels: BB2015-05-TBR > Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, > 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, > 0003-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG > > > Currently we can know whether the application submitted by the user is AM > managed from the applicationSubmissionContext. This can be only done at the > time when the user submits the job. We should have access to this info from > the ApplicationReport as well so that we can check whether an app is AM > managed or not anytime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546560#comment-14546560 ] Rohith commented on YARN-3543: -- bq. ApplicationReport.newInstance() is Private, so you should simply update the existing method instead of adding a new one. I understood your comment above to mean that since it is private, the newInstance() method should not be modified, so I just added setter and getter methods in ApplicationReport. But doesn't that impact compatibility? bq. app == null ? null : app.getUser()); What are these changes for? This fixes a findbugs warning from an earlier Jenkins report. One thing I observed: # {{return ApplicationReport.newInstance}} directly does not give a findbugs warning, but # assigning {{ApplicationReport.newInstance}} to a new variable and returning that variable gives a findbugs warning. So I changed the null check as above. bq. AppInfo.getUnmanagedAM() needs to be renamed too. Agree. > ApplicationReport should be able to tell whether the Application is AM > managed or not. > --- > > Key: YARN-3543 > URL: https://issues.apache.org/jira/browse/YARN-3543 > Project: Hadoop YARN > Issue Type: Improvement > Components: api >Affects Versions: 2.6.0 >Reporter: Spandan Dutta >Assignee: Rohith > Labels: BB2015-05-TBR > Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, > 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, > 0003-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG > > > Currently we can know whether the application submitted by the user is AM > managed from the applicationSubmissionContext. This can be only done at the > time when the user submits the job. We should have access to this info from > the ApplicationReport as well so that we can check whether an app is AM > managed or not anytime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
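A sketch of the setter/getter approach under discussion, assuming the method names from the patch under review (the committed API may differ). The point is that the existing {{@Private}} newInstance() factory stays untouched, so callers outside YARN, such as MapReduce's NotRunningJob#getUnknownApplicationReport, keep compiling:

{code}
// Sketch only: additions to ApplicationReport as discussed above; the
// method names follow the patch under review and may differ from what
// is finally committed.
public abstract class ApplicationReport {

  // The existing newInstance(...) factory is left unchanged so that
  // callers outside YARN are not broken by a new parameter.

  // Whether the application's AM is unmanaged; reports built by older
  // code paths would default this to false.
  public abstract boolean isUnmanagedApp();

  public abstract void setUnmanagedApp(boolean unmanagedApp);
}
{code}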
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545916#comment-14545916 ] Rohith commented on YARN-3543: -- Need to kick off Jenkins again to check whether the test failures are recurring. > ApplicationReport should be able to tell whether the Application is AM > managed or not. > --- > > Key: YARN-3543 > URL: https://issues.apache.org/jira/browse/YARN-3543 > Project: Hadoop YARN > Issue Type: Improvement > Components: api >Affects Versions: 2.6.0 >Reporter: Spandan Dutta >Assignee: Rohith > Labels: BB2015-05-TBR > Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, > 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, > 0003-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG > > > Currently we can know whether the application submitted by the user is AM > managed from the applicationSubmissionContext. This can be only done at the > time when the user submits the job. We should have access to this info from > the ApplicationReport as well so that we can check whether an app is AM > managed or not anytime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3642) Hadoop2 yarn.resourcemanager.scheduler.address not loaded by RMProxy.java
[ https://issues.apache.org/jira/browse/YARN-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith resolved YARN-3642. -- Resolution: Invalid Closing as Invalid. For queries or basic environment problems, I suggest asking on the user mailing lists. > Hadoop2 yarn.resourcemanager.scheduler.address not loaded by RMProxy.java > - > > Key: YARN-3642 > URL: https://issues.apache.org/jira/browse/YARN-3642 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: yarn-site.xml: > > > yarn.nodemanager.aux-services > mapreduce_shuffle > > > yarn.nodemanager.aux-services.mapreduce.shuffle.class > org.apache.hadoop.mapred.ShuffleHandler > > > yarn.resourcemanager.hostname > qadoop-nn001.apsalar.com > > > yarn.resourcemanager.scheduler.address > qadoop-nn001.apsalar.com:8030 > > > yarn.resourcemanager.address > qadoop-nn001.apsalar.com:8032 > > > yarn.resourcemanager.webap.address > qadoop-nn001.apsalar.com:8088 > > > yarn.resourcemanager.resource-tracker.address > qadoop-nn001.apsalar.com:8031 > > > yarn.resourcemanager.admin.address > qadoop-nn001.apsalar.com:8033 > > > yarn.log-aggregation-enable > true > > > Where to aggregate logs to. > yarn.nodemanager.remote-app-log-dir > /var/log/hadoop/apps > > > yarn.web-proxy.address > qadoop-nn001.apsalar.com:8088 > > > core-site.xml: > > > fs.defaultFS > hdfs://qadoop-nn001.apsalar.com > > > hadoop.proxyuser.hdfs.hosts > * > > > hadoop.proxyuser.hdfs.groups > * > > > hdfs-site.xml: > > > dfs.replication > 2 > > > dfs.namenode.name.dir > file:/hadoop/nn > > > dfs.datanode.data.dir > file:/hadoop/dn/dfs > > > dfs.http.address > qadoop-nn001.apsalar.com:50070 > > > dfs.secondary.http.address > qadoop-nn002.apsalar.com:50090 > > > mapred-site.xml: > > > mapred.job.tracker > qadoop-nn001.apsalar.com:8032 > > > mapreduce.framework.name > yarn > > > mapreduce.jobhistory.address > qadoop-nn001.apsalar.com:10020 > the JobHistoryServer address. > > > mapreduce.jobhistory.webapp.address > qadoop-nn001.apsalar.com:19888 > the JobHistoryServer web address > > > hbase-site.xml: > > > hbase.master > qadoop-nn001.apsalar.com:6 > > > hbase.rootdir > hdfs://qadoop-nn001.apsalar.com:8020/hbase > > > hbase.cluster.distributed > true > > > hbase.zookeeper.property.dataDir > /opt/local/zookeeper > > > hbase.zookeeper.property.clientPort > 2181 > > > hbase.zookeeper.quorum > qadoop-nn001.apsalar.com > > > zookeeper.session.timeout > 18 > > >Reporter: Lee Hounshell > > There is an issue with Hadoop 2.7.0 when in distributed operation the > datanode is unable to reach the yarn scheduler. In our yarn-site.xml, we > have defined this path to be: > {code} > > yarn.resourcemanager.scheduler.address > qadoop-nn001.apsalar.com:8030 > > {code} > But when running an oozie job, the problem manifests when looking at the job > logs for the yarn container. > We see logs similar to the following showing the connection problem: > {quote} > Showing 4096 bytes. 
Click here for full log > [main] org.apache.hadoop.http.HttpServer2: Jetty bound to port 64065 > 2015-05-13 17:49:33,930 INFO [main] org.mortbay.log: jetty-6.1.26 > 2015-05-13 17:49:33,971 INFO [main] org.mortbay.log: Extract > jar:file:/opt/local/hadoop/hadoop-2.7.0/share/hadoop/yarn/hadoop-yarn-common-2.7.0.jar!/webapps/mapreduce > to /var/tmp/Jetty_0_0_0_0_64065_mapreduce.1ayyhk/webapp > 2015-05-13 17:49:34,234 INFO [main] org.mortbay.log: Started > HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:64065 > 2015-05-13 17:49:34,234 INFO [main] org.apache.hadoop.yarn.webapp.WebApps: > Web app /mapreduce started at 64065 > 2015-05-13 17:49:34,645 INFO [main] org.apache.hadoop.yarn.webapp.WebApps: > Registered webapp guice modules > 2015-05-13 17:49:34,651 INFO [main] org.apache.hadoop.ipc.CallQueueManager: > Using callQueue class java.util.concurrent.LinkedBlockingQueue > 2015-05-13 17:49:34,652 INFO [Socket Reader #1 for port 38927] > org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 38927 > 2015-05-13 17:49:3
[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544988#comment-14544988 ] Rohith commented on YARN-3646: -- Setting RetryPolicies.RETRY_FOREVER as the default policy for exceptionToPolicyMap is not sufficient; {{RetryPolicies.RetryForever.shouldRetry()}} should also check for connect exceptions and handle them. Otherwise shouldRetry always returns the RetryAction.RETRY action. > Applications are getting stuck some times in case of retry policy forever > - > > Key: YARN-3646 > URL: https://issues.apache.org/jira/browse/YARN-3646 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Raju Bairishetti > > We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use FOREVER > retry policy. > Yarn client is infinitely retrying in case of exceptions from the RM as it is > using retrying policy as FOREVER. The problem is it is retrying for all kinds > of exceptions (like ApplicationNotFoundException), even though it is not a > connection failure. Due to this my application is not progressing further. > *Yarn client should not retry infinitely in case of non connection failures.* > We have written a simple yarn-client which is trying to get an application > report for an invalid or older appId. ResourceManager is throwing an > ApplicationNotFoundException as this is an invalid or older appId. But > because of retry policy FOREVER, client is keep on retrying for getting the > application report and ResourceManager is throwing > ApplicationNotFoundException continuously. > {code} > private void testYarnClientRetryPolicy() throws Exception{ > YarnConfiguration conf = new YarnConfiguration(); > conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, > -1); > YarnClient yarnClient = YarnClient.createYarnClient(); > yarnClient.init(conf); > yarnClient.start(); > ApplicationId appId = ApplicationId.newInstance(1430126768987L, > 10645); > ApplicationReport report = yarnClient.getApplicationReport(appId); > } > {code} > *RM logs:* > {noformat} > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875162 Retry#0 > org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application > with id 'application_1430126768987_10645' doesn't exist in RM. 
> at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875163 Retry#0 > > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
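To illustrate the point above, a hedged sketch of what a connectivity-aware retry-forever policy could look like; it assumes the branch-2-era org.apache.hadoop.io.retry.RetryPolicy signature and is not the committed YARN-3646 fix:

{code}
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.UnknownHostException;

import org.apache.hadoop.io.retry.RetryPolicy;

// Retry forever, but only on connectivity failures; any other exception
// (e.g. ApplicationNotFoundException) is surfaced to the caller instead
// of looping.
public class RetryForeverOnConnectFailure implements RetryPolicy {
  @Override
  public RetryAction shouldRetry(Exception e, int retries, int failovers,
      boolean isIdempotentOrAtMostOnce) throws Exception {
    if (e instanceof ConnectException
        || e instanceof NoRouteToHostException
        || e instanceof UnknownHostException) {
      return RetryAction.RETRY; // transient connectivity problem
    }
    return RetryAction.FAIL;    // non-connection failure: fail fast
  }
}
{code}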
[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544959#comment-14544959 ] Rohith commented on YARN-3646: -- I copied *yarn.resourcemanager.connect.wait-ms* from the description, but the actual configuration is *yarn.resourcemanager.connect.max-wait.ms*. > Applications are getting stuck some times in case of retry policy forever > - > > Key: YARN-3646 > URL: https://issues.apache.org/jira/browse/YARN-3646 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Raju Bairishetti > > We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use FOREVER > retry policy. > Yarn client is infinitely retrying in case of exceptions from the RM as it is > using retrying policy as FOREVER. The problem is it is retrying for all kinds > of exceptions (like ApplicationNotFoundException), even though it is not a > connection failure. Due to this my application is not progressing further. > *Yarn client should not retry infinitely in case of non connection failures.* > We have written a simple yarn-client which is trying to get an application > report for an invalid or older appId. ResourceManager is throwing an > ApplicationNotFoundException as this is an invalid or older appId. But > because of retry policy FOREVER, client is keep on retrying for getting the > application report and ResourceManager is throwing > ApplicationNotFoundException continuously. > {code} > private void testYarnClientRetryPolicy() throws Exception{ > YarnConfiguration conf = new YarnConfiguration(); > conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, > -1); > YarnClient yarnClient = YarnClient.createYarnClient(); > yarnClient.init(conf); > yarnClient.start(); > ApplicationId appId = ApplicationId.newInstance(1430126768987L, > 10645); > ApplicationReport report = yarnClient.getApplicationReport(appId); > } > {code} > *RM logs:* > {noformat} > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875162 Retry#0 > org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application > with id 'application_1430126768987_10645' doesn't exist in RM. 
> at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875163 Retry#0 > > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544947#comment-14544947 ] Rohith commented on YARN-3646: -- RetryPolicies.RETRY_FOREVER should also use the exceptionToPolicyMap. [~raju.bairishetti] Feel free to take up this JIRA. > Applications are getting stuck some times in case of retry policy forever > - > > Key: YARN-3646 > URL: https://issues.apache.org/jira/browse/YARN-3646 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Raju Bairishetti > > We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use FOREVER > retry policy. > Yarn client is infinitely retrying in case of exceptions from the RM as it is > using retrying policy as FOREVER. The problem is it is retrying for all kinds > of exceptions (like ApplicationNotFoundException), even though it is not a > connection failure. Due to this my application is not progressing further. > *Yarn client should not retry infinitely in case of non connection failures.* > We have written a simple yarn-client which is trying to get an application > report for an invalid or older appId. ResourceManager is throwing an > ApplicationNotFoundException as this is an invalid or older appId. But > because of retry policy FOREVER, client is keep on retrying for getting the > application report and ResourceManager is throwing > ApplicationNotFoundException continuously. > {code} > private void testYarnClientRetryPolicy() throws Exception{ > YarnConfiguration conf = new YarnConfiguration(); > conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, > -1); > YarnClient yarnClient = YarnClient.createYarnClient(); > yarnClient.init(conf); > yarnClient.start(); > ApplicationId appId = ApplicationId.newInstance(1430126768987L, > 10645); > ApplicationReport report = yarnClient.getApplicationReport(appId); > } > {code} > *RM logs:* > {noformat} > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875162 Retry#0 > org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application > with id 'application_1430126768987_10645' doesn't exist in RM. > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875163 Retry#0 > > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
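One possible way to wire RETRY_FOREVER through an exception-to-policy map, using the existing RetryPolicies.retryByException helper to illustrate the comment above; the class and method names are illustrative only, not the committed change:

{code}
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;
import org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException;

public class RetryForeverPolicySketch {
  // Exceptions the RM raises deliberately fail fast, while everything
  // else falls through to the retry-forever default policy.
  public static RetryPolicy buildPolicy() {
    Map<Class<? extends Exception>, RetryPolicy> exceptionToPolicyMap =
        new HashMap<Class<? extends Exception>, RetryPolicy>();
    exceptionToPolicyMap.put(ApplicationNotFoundException.class,
        RetryPolicies.TRY_ONCE_THEN_FAIL);
    return RetryPolicies.retryByException(
        RetryPolicies.RETRY_FOREVER, exceptionToPolicyMap);
  }
}
{code}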
[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544938#comment-14544938 ] Rohith commented on YARN-3646: -- Thanks for the explanation. I hit the problem on my machines too; last time I tested, there was an issue with the configuration settings. > Applications are getting stuck some times in case of retry policy forever > - > > Key: YARN-3646 > URL: https://issues.apache.org/jira/browse/YARN-3646 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Raju Bairishetti > > We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use FOREVER > retry policy. > Yarn client is infinitely retrying in case of exceptions from the RM as it is > using retrying policy as FOREVER. The problem is it is retrying for all kinds > of exceptions (like ApplicationNotFoundException), even though it is not a > connection failure. Due to this my application is not progressing further. > *Yarn client should not retry infinitely in case of non connection failures.* > We have written a simple yarn-client which is trying to get an application > report for an invalid or older appId. ResourceManager is throwing an > ApplicationNotFoundException as this is an invalid or older appId. But > because of retry policy FOREVER, client is keep on retrying for getting the > application report and ResourceManager is throwing > ApplicationNotFoundException continuously. > {code} > private void testYarnClientRetryPolicy() throws Exception{ > YarnConfiguration conf = new YarnConfiguration(); > conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, > -1); > YarnClient yarnClient = YarnClient.createYarnClient(); > yarnClient.init(conf); > yarnClient.start(); > ApplicationId appId = ApplicationId.newInstance(1430126768987L, > 10645); > ApplicationReport report = yarnClient.getApplicationReport(appId); > } > {code} > *RM logs:* > {noformat} > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875162 Retry#0 > org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application > with id 'application_1430126768987_10645' doesn't exist in RM. > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875163 Retry#0 > > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3642) Hadoop2 yarn.resourcemanager.scheduler.address not loaded by RMProxy.java
[ https://issues.apache.org/jira/browse/YARN-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544920#comment-14544920 ] Rohith commented on YARN-3642: -- How many NodeManagers are running? If more than one, then what I think happened in your case is that yarn-site.xml was never read by the client, i.e. the Oozie job, but you were still able to submit the job because you were likely submitting it from the local machine, i.e. where the RM is running. So the job can be submitted with the default port, but the ApplicationMaster is launched on a different machine where a NodeManager is running. Since the scheduler address is not loaded from any configuration there, the AM tries to connect to the default address, 0.0.0.0:8030, which can never connect. I suggest making sure your yarn-site.xml is loaded into the classpath before submitting the job, so that the AM gets yarn.resourcemanager.scheduler.address and can connect to the RM. The other way is to explicitly set yarn.resourcemanager.scheduler.address using the job client. > Hadoop2 yarn.resourcemanager.scheduler.address not loaded by RMProxy.java > - > > Key: YARN-3642 > URL: https://issues.apache.org/jira/browse/YARN-3642 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 > Environment: yarn-site.xml: > > > yarn.nodemanager.aux-services > mapreduce_shuffle > > > yarn.nodemanager.aux-services.mapreduce.shuffle.class > org.apache.hadoop.mapred.ShuffleHandler > > > yarn.resourcemanager.hostname > qadoop-nn001.apsalar.com > > > yarn.resourcemanager.scheduler.address > qadoop-nn001.apsalar.com:8030 > > > yarn.resourcemanager.address > qadoop-nn001.apsalar.com:8032 > > > yarn.resourcemanager.webap.address > qadoop-nn001.apsalar.com:8088 > > > yarn.resourcemanager.resource-tracker.address > qadoop-nn001.apsalar.com:8031 > > > yarn.resourcemanager.admin.address > qadoop-nn001.apsalar.com:8033 > > > yarn.log-aggregation-enable > true > > > Where to aggregate logs to. > yarn.nodemanager.remote-app-log-dir > /var/log/hadoop/apps > > > yarn.web-proxy.address > qadoop-nn001.apsalar.com:8088 > > > core-site.xml: > > > fs.defaultFS > hdfs://qadoop-nn001.apsalar.com > > > hadoop.proxyuser.hdfs.hosts > * > > > hadoop.proxyuser.hdfs.groups > * > > > hdfs-site.xml: > > > dfs.replication > 2 > > > dfs.namenode.name.dir > file:/hadoop/nn > > > dfs.datanode.data.dir > file:/hadoop/dn/dfs > > > dfs.http.address > qadoop-nn001.apsalar.com:50070 > > > dfs.secondary.http.address > qadoop-nn002.apsalar.com:50090 > > > mapred-site.xml: > > > mapred.job.tracker > qadoop-nn001.apsalar.com:8032 > > > mapreduce.framework.name > yarn > > > mapreduce.jobhistory.address > qadoop-nn001.apsalar.com:10020 > the JobHistoryServer address. > > > mapreduce.jobhistory.webapp.address > qadoop-nn001.apsalar.com:19888 > the JobHistoryServer web address > > > hbase-site.xml: > > > hbase.master > qadoop-nn001.apsalar.com:6 > > > hbase.rootdir > hdfs://qadoop-nn001.apsalar.com:8020/hbase > > > hbase.cluster.distributed > true > > > hbase.zookeeper.property.dataDir > /opt/local/zookeeper > > > hbase.zookeeper.property.clientPort > 2181 > > > hbase.zookeeper.quorum > qadoop-nn001.apsalar.com > > > zookeeper.session.timeout > 18 > > >Reporter: Lee Hounshell > > There is an issue with Hadoop 2.7.0 when in distributed operation the > datanode is unable to reach the yarn scheduler. 
In our yarn-site.xml, we > have defined this path to be: > {code} > > yarn.resourcemanager.scheduler.address > qadoop-nn001.apsalar.com:8030 > > {code} > But when running an oozie job, the problem manifests when looking at the job > logs for the yarn container. > We see logs similar to the following showing the connection problem: > {quote} > Showing 4096 bytes. Click here for full log > [main] org.apache.hadoop.http.HttpServer2: Jetty bound to port 64065 > 2015-05-13 17:49:33,930 INFO [main] org.mortbay.log: jetty-6.1.26 > 2015-05-13 17:49:33,971 INFO [main] org.mortbay.log: Extract > jar:file:/opt/local/hadoop/hadoop-2.7.0/share/hadoop/yarn/hadoop-yarn-common-2.
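A minimal sketch of the second workaround suggested in the comment above, explicitly setting the scheduler address on the client-side configuration when yarn-site.xml is not on the classpath (the hostname is taken from the reporter's environment):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ExplicitSchedulerAddress {
  public static Configuration buildConf() {
    Configuration conf = new YarnConfiguration();
    // Equivalent to setting yarn.resourcemanager.scheduler.address in
    // yarn-site.xml; without it the AM falls back to the 0.0.0.0:8030
    // default and can never reach the RM scheduler.
    conf.set(YarnConfiguration.RM_SCHEDULER_ADDRESS,
        "qadoop-nn001.apsalar.com:8030");
    return conf;
  }
}
{code}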
[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543776#comment-14543776 ] Rohith commented on YARN-3646: -- Which version of Hadoop are you using? I don't see this problem in trunk or branch-2. > Applications are getting stuck some times in case of retry policy forever > - > > Key: YARN-3646 > URL: https://issues.apache.org/jira/browse/YARN-3646 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Raju Bairishetti > > We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use FOREVER > retry policy. > Yarn client is infinitely retrying in case of exceptions from the RM as it is > using retrying policy as FOREVER. The problem is it is retrying for all kinds > of exceptions (like ApplicationNotFoundException), even though it is not a > connection failure. Due to this my application is not progressing further. > *Yarn client should not retry infinitely in case of non connection failures.* > We have written a simple yarn-client which is trying to get an application > report for an invalid or older appId. ResourceManager is throwing an > ApplicationNotFoundException as this is an invalid or older appId. But > because of retry policy FOREVER, client is keep on retrying for getting the > application report and ResourceManager is throwing > ApplicationNotFoundException continuously. > {code} > private void testYarnClientRetryPolicy() throws Exception{ > YarnConfiguration conf = new YarnConfiguration(); > conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, > -1); > YarnClient yarnClient = YarnClient.createYarnClient(); > yarnClient.init(conf); > yarnClient.start(); > ApplicationId appId = ApplicationId.newInstance(1430126768987L, > 10645); > ApplicationReport report = yarnClient.getApplicationReport(appId); > } > {code} > *RM logs:* > {noformat} > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875162 Retry#0 > org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application > with id 'application_1430126768987_10645' doesn't exist in RM. > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > > 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call > org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport > from 10.14.120.231:61621 Call#875163 Retry#0 > > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)