[jira] [Updated] (YARN-2328) FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped
[ https://issues.apache.org/jira/browse/YARN-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2328: --- Attachment: yarn-2328-2.patch Thanks Sandy. Removed the unrelated change. Will commit this if Jenkins is fine. > FairScheduler: Verify update and continuous scheduling threads are stopped > when the scheduler is stopped > > > Key: YARN-2328 > URL: https://issues.apache.org/jira/browse/YARN-2328 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.4.1 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla >Priority: Minor > Attachments: yarn-2328-1.patch, yarn-2328-2.patch > > > FairScheduler threads can use a little cleanup and tests. To begin with, the > update and continuous-scheduling threads should extend Thread and handle > being interrupted. We should have tests for starting and stopping them as > well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2337) remove duplication function call (setClientRMService) in resource manage class
zhihai xu created YARN-2337: --- Summary: remove duplication function call (setClientRMService) in resource manage class Key: YARN-2337 URL: https://issues.apache.org/jira/browse/YARN-2337 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: zhihai xu Priority: Minor Remove the duplicated function call (setClientRMService) in the ResourceManager class. rmContext.setClientRMService(clientRM); is called twice in serviceInit of ResourceManager.
[jira] [Assigned] (YARN-2337) remove duplication function call (setClientRMService) in resource manage class
[ https://issues.apache.org/jira/browse/YARN-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu reassigned YARN-2337: --- Assignee: zhihai xu
[jira] [Updated] (YARN-2337) remove duplication function call (setClientRMService) in resource manage class
[ https://issues.apache.org/jira/browse/YARN-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2337: Attachment: YARN-2337.000.patch
[jira] [Commented] (YARN-2337) remove duplication function call (setClientRMService) in resource manage class
[ https://issues.apache.org/jira/browse/YARN-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071450#comment-14071450 ] zhihai xu commented on YARN-2337: - It is not necessary to call rmContext.setClientRMService(clientRM); twice in the following code:
{code}
rmContext.setClientRMService(clientRM);
addService(clientRM);
rmContext.setClientRMService(clientRM);
{code}
The first call is removed in the patch.
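The call pattern above can be sketched with stand-in classes; the stubs below (RMContextStub, the call counter) are hypothetical and only mirror the duplicate-call shape of serviceInit, not the real Hadoop types:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in sketch of the duplicate setClientRMService call in serviceInit.
// These classes are hypothetical stubs, not the real ResourceManager types.
class DuplicateCallSketch {
    static class ClientRMService {}

    static class RMContextStub {
        int setCalls = 0;
        void setClientRMService(ClientRMService s) { setCalls++; }
    }

    static final List<Object> services = new ArrayList<>();
    static void addService(Object s) { services.add(s); }

    // Before the patch: the setter is invoked twice around addService,
    // so the first call is redundant.
    static int serviceInitBefore(RMContextStub ctx) {
        ClientRMService clientRM = new ClientRMService();
        ctx.setClientRMService(clientRM);
        addService(clientRM);
        ctx.setClientRMService(clientRM);
        return ctx.setCalls;
    }

    // After the patch: only the call after addService remains.
    static int serviceInitAfter(RMContextStub ctx) {
        ClientRMService clientRM = new ClientRMService();
        addService(clientRM);
        ctx.setClientRMService(clientRM);
        return ctx.setCalls;
    }
}
```

Since the second call overwrites the first with the same reference, dropping the first call is behavior-preserving.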
[jira] [Commented] (YARN-2284) Find missing config options in YarnConfiguration and yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071470#comment-14071470 ] Hadoop QA commented on YARN-2284: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657289/YARN2284-03.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common: org.apache.hadoop.ipc.TestIPC {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4399//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4399//console This message is automatically generated. 
> Find missing config options in YarnConfiguration and yarn-default.xml > - > > Key: YARN-2284 > URL: https://issues.apache.org/jira/browse/YARN-2284 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.4.1 >Reporter: Ray Chiang >Assignee: Ray Chiang >Priority: Minor > Labels: supportability > Attachments: YARN2284-01.patch, YARN2284-02.patch, YARN2284-03.patch > > > YarnConfiguration has one set of properties. yarn-default.xml has another > set of properties. Ideally, there should be an automatic way to find missing > properties in either location. > This is analogous to MAPREDUCE-5130, but for yarn-default.xml.
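The "automatic way to find missing properties" the issue asks for boils down to a set difference over property names. A minimal sketch, assuming the two name sets have already been extracted (the sample property names below are hypothetical examples, not the real lists):

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch of the check YARN-2284 describes: given the property names declared
// in YarnConfiguration and those present in yarn-default.xml, report the
// names that appear in one place but not the other.
class ConfigDiffSketch {
    // Returns the names in `source` that are absent from `target`,
    // sorted for stable reporting.
    static Set<String> missingFrom(Set<String> source, Set<String> target) {
        Set<String> missing = new TreeSet<>(source);
        missing.removeAll(target); // declared in source but not in target
        return missing;
    }
}
```

Running the check in both directions (code minus XML, XML minus code) catches both undocumented and dead properties.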
[jira] [Created] (YARN-2338) service assemble so complex
tangjunjie created YARN-2338: Summary: service assemble so complex Key: YARN-2338 URL: https://issues.apache.org/jira/browse/YARN-2338 Project: Hadoop YARN Issue Type: Wish Reporter: tangjunjie See ResourceManager: protected void serviceInit(Configuration configuration) throws Exception. So many services are assembled into the ResourceManager. Use Guice or another service-assembly framework to refactor this complex code.
[jira] [Created] (YARN-2339) service assemble so complex
tangjunjie created YARN-2339: Summary: service assemble so complex Key: YARN-2339 URL: https://issues.apache.org/jira/browse/YARN-2339 Project: Hadoop YARN Issue Type: Wish Reporter: tangjunjie See ResourceManager: protected void serviceInit(Configuration configuration) throws Exception. So many services are assembled into the ResourceManager. Use Guice or another service-assembly framework to refactor this complex code.
[jira] [Resolved] (YARN-2339) service assemble so complex
[ https://issues.apache.org/jira/browse/YARN-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangjunjie resolved YARN-2339. -- Resolution: Duplicate
[jira] [Commented] (YARN-2338) service assemble so complex
[ https://issues.apache.org/jira/browse/YARN-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071495#comment-14071495 ] Tsuyoshi OZAWA commented on YARN-2338: -- Hi, do you mean that we should use a DI framework? What kind of refactoring are you planning to do?
[jira] [Commented] (YARN-2337) remove duplication function call (setClientRMService) in resource manage class
[ https://issues.apache.org/jira/browse/YARN-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071496#comment-14071496 ] Tsuyoshi OZAWA commented on YARN-2337: -- +1 (non-binding); let's wait for the result of Jenkins CI.
[jira] [Commented] (YARN-2328) FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped
[ https://issues.apache.org/jira/browse/YARN-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071503#comment-14071503 ] Hadoop QA commented on YARN-2328: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657303/yarn-2328-2.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4400//console This message is automatically generated.
[jira] [Commented] (YARN-2313) Livelock can occur in FairScheduler when there are lots of running apps
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071547#comment-14071547 ] Karthik Kambatla commented on YARN-2313: Sorry for coming in late here; didn't see this before. I think we need a better solution here; otherwise, clusters will continue to run into this. One simple way to address this could be to wait {{updateInterval}} ms after finishing an iteration of the update thread before starting the next iteration. We should do something similar for the continuous-scheduling thread as well. > Livelock can occur in FairScheduler when there are lots of running apps > --- > > Key: YARN-2313 > URL: https://issues.apache.org/jira/browse/YARN-2313 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.4.1 >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Fix For: 2.6.0 > > Attachments: YARN-2313.1.patch, YARN-2313.2.patch, YARN-2313.3.patch, > YARN-2313.4.patch, rm-stack-trace.txt > > > Observed livelock on FairScheduler when there are lots of entries in the queue. After investigating the code, the following case can occur: > 1. {{update()}} called by UpdateThread takes longer than UPDATE_INTERVAL (500ms) if there are lots of queues. > 2. UpdateThread goes into a busy loop. > 3. Other threads (AllocationFileReloader, ResourceManager$SchedulerEventDispatcher) can wait forever.
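The suggestion above (sleep a full {{updateInterval}} after each iteration finishes) can be sketched as a plain loop; the runnable and interval are stand-ins, not the actual FairScheduler fields, and the interrupt handling models the thread cleanup YARN-2328 asks for:

```java
// Sketch of "wait updateInterval ms after finishing an iteration":
// because the sleep happens after update() returns, there is always a
// lock-free gap of updateIntervalMs between iterations, however long
// update() itself takes, so other threads contending for the scheduler
// lock get a window to acquire it.
class UpdateLoopSketch {
    static void runLoop(Runnable update, long updateIntervalMs, int iterations) {
        for (int i = 0; i < iterations; i++) {
            update.run(); // may hold the scheduler lock for a long time
            try {
                Thread.sleep(updateIntervalMs); // lock-free gap before next iteration
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status and stop
                return;
            }
        }
    }
}
```

The contrast is with a fixed-rate schedule, where a slow update() eats into (or eliminates) the gap before the next iteration.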
[jira] [Updated] (YARN-2328) FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped
[ https://issues.apache.org/jira/browse/YARN-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2328: --- Attachment: yarn-2328-2.patch Updated patch on latest trunk.
[jira] [Commented] (YARN-2313) Livelock can occur in FairScheduler when there are lots of running apps
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071558#comment-14071558 ] Karthik Kambatla commented on YARN-2313: Actually, thinking more about it, I don't quite understand how the update thread can go into a busy loop. Thread.sleep() and update() are called serially, so irrespective of how long update() takes, the next Thread.sleep is called for 500 ms, no? It is possible that those 500 ms are not enough for the other work and the scheduler lags, but it should still make progress.
[jira] [Commented] (YARN-2337) remove duplication function call (setClientRMService) in resource manage class
[ https://issues.apache.org/jira/browse/YARN-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071566#comment-14071566 ] Hadoop QA commented on YARN-2337: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657305/YARN-2337.000.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4401//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4401//console This message is automatically generated. 
[jira] [Commented] (YARN-2295) Refactor YARN distributed shell with existing public stable API
[ https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071599#comment-14071599 ] Hudson commented on YARN-2295: -- FAILURE: Integrated in Hadoop-Yarn-trunk #621 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/621/]) YARN-2295. Refactored DistributedShell to use public APIs of protocol records. Contributed by Li Lu (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612626) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/Client.java > Refactor YARN distributed shell with existing public stable API > --- > > Key: YARN-2295 > URL: https://issues.apache.org/jira/browse/YARN-2295 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Fix For: 2.6.0 > > Attachments: TEST-YARN-2295-071514.patch, YARN-2295-071514-1.patch, > YARN-2295-071514.patch, YARN-2295-072114.patch > > > Some API calls in YARN distributed shell have been marked as unstable and > private. Use existing public stable API to replace them, if possible.
[jira] [Commented] (YARN-2313) Livelock can occur in FairScheduler when there are lots of running apps
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071597#comment-14071597 ] Hudson commented on YARN-2313: -- FAILURE: Integrated in Hadoop-Yarn-trunk #621 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/621/]) YARN-2313. Livelock can occur in FairScheduler when there are lots of running apps (Tsuyoshi Ozawa via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612769) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerPreemption.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/FairScheduler.apt.vm
[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071611#comment-14071611 ] Hudson commented on YARN-2242: -- FAILURE: Integrated in Hadoop-Yarn-trunk #621 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/621/]) YARN-2242. Addendum patch. Improve exception information on AM launch crashes. (Contributed by Li Lu) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612565) * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java > Improve exception information on AM launch crashes > -- > > Key: YARN-2242 > URL: https://issues.apache.org/jira/browse/YARN-2242 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Fix For: 2.6.0 > > Attachments: YARN-2242-070115-2.patch, YARN-2242-070814-1.patch, > YARN-2242-070814.patch, YARN-2242-071114.patch, YARN-2242-071214.patch, > YARN-2242-071414.patch > > > Currently, each time the AM container crashes during launch, both the console and the > web UI only report a ShellExitCodeException. This is not only unhelpful, > but sometimes confusing. With the help of the log aggregator, container logs are > actually aggregated, and can be very helpful for debugging. One possible way > to improve the whole process is to send a "pointer" to the aggregated logs to > the programmer when reporting exception information.
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling thread when we lose a node
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071596#comment-14071596 ] Hudson commented on YARN-2273: -- FAILURE: Integrated in Hadoop-Yarn-trunk #621 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/621/]) YARN-2273. NPE in ContinuousScheduling thread when we lose a node. (Wei Yan via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612720) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java > NPE in ContinuousScheduling thread when we lose a node > -- > > Key: YARN-2273 > URL: https://issues.apache.org/jira/browse/YARN-2273 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, resourcemanager >Affects Versions: 2.3.0, 2.4.1 > Environment: cdh5.0.2 wheezy >Reporter: Andy Skelton >Assignee: Wei Yan > Fix For: 2.6.0 > > Attachments: YARN-2273-5.patch, YARN-2273-replayException.patch, > YARN-2273.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch > > > One DN experienced memory errors and entered a cycle of rebooting and > rejoining the cluster. 
After the second time the node went away, the RM > produced this:
> {code}
> 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available= used= with event: KILL
> 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity:
> 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception.
> java.lang.NullPointerException
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040)
> 	at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329)
> 	at java.util.TimSort.sort(TimSort.java:203)
> 	at java.util.TimSort.sort(TimSort.java:173)
> 	at java.util.Arrays.sort(Arrays.java:659)
> 	at java.util.Collections.sort(Collections.java:217)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306)
> 	at java.lang.Thread.run(Thread.java:744)
> {code}
> A few cycles later YARN was crippled. The RM was running and jobs could be submitted, but containers were not assigned and no progress was made. Restarting the RM resolved it.
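The stack trace shows the NPE arising inside the comparator while Collections.sort runs over the node list. A minimal model of the failure mode and of a defensive fix, using a map as a hypothetical stand-in for the scheduler's per-node resource lookup (not the actual YARN-2273 patch):

```java
import java.util.Comparator;
import java.util.Map;

// Sketch of the scenario: the comparator orders node names by available
// memory, but a node removed from the cluster no longer has an entry in
// the lookup; dereferencing that missing entry is what produced the NPE.
class NodeComparatorSketch {
    static Comparator<String> safeComparator(Map<String, Integer> availableMb) {
        return (a, b) -> {
            // Treat a removed node as having no available resources instead
            // of dereferencing a missing (null) entry.
            int availA = availableMb.getOrDefault(a, 0);
            int availB = availableMb.getOrDefault(b, 0);
            return Integer.compare(availB, availA); // most available first
        };
    }
}
```

A removed node then simply sorts to the end instead of crashing the ContinuousScheduling thread.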
[jira] [Commented] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071610#comment-14071610 ] Hudson commented on YARN-2319: -- FAILURE: Integrated in Hadoop-Yarn-trunk #621 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/621/]) YARN-2319. Made the MiniKdc instance start/close before/after the class of TestRMWebServicesDelegationTokens. Contributed by Wenwu Peng. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612588) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesDelegationTokens.java > Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java > --- > > Key: YARN-2319 > URL: https://issues.apache.org/jira/browse/YARN-2319 > Project: Hadoop YARN > Issue Type: Test > Components: resourcemanager >Affects Versions: 3.0.0, 2.5.0 >Reporter: Wenwu Peng >Assignee: Wenwu Peng > Fix For: 2.5.0 > > Attachments: YARN-2319.0.patch, YARN-2319.1.patch, YARN-2319.2.patch, > YARN-2319.2.patch > > > MiniKdc only invokes the start method, not stop, in > TestRMWebServicesDelegationTokens.java: > {code} > testMiniKDC.start(); > {code}
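The fix pairs every start() with a stop(); the commit message says this is done before/after the test class (i.e. JUnit's @BeforeClass/@AfterClass). The guarantee can be sketched with a hypothetical FakeKdc stand-in and a try/finally, which gives the same always-stopped property:

```java
// Sketch of the start/stop pairing YARN-2319 adds. FakeKdc is a
// hypothetical stub, not the real MiniKdc; it just records lifecycle state.
class KdcLifecycleSketch {
    static class FakeKdc {
        boolean running = false;
        void start() { running = true; }
        void stop() { running = false; }
    }

    // Returns whether the KDC is still running after the test body: with the
    // finally block it is always stopped, even if the body throws.
    static boolean runWithKdc(FakeKdc kdc, Runnable testBody) {
        kdc.start();
        try {
            testBody.run();
        } finally {
            kdc.stop(); // always executed, so no leaked KDC process/port
        }
        return kdc.running;
    }
}
```

Without the stop(), the leaked KDC can hold ports and threads across test classes, which is exactly the flakiness this kind of test cleanup prevents.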
[jira] [Commented] (YARN-2131) Add a way to format the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071607#comment-14071607 ] Hudson commented on YARN-2131: -- FAILURE: Integrated in Hadoop-Yarn-trunk #621 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/621/]) YARN-2131. Addendum2: Document -format-state-store. Add a way to format the RMStateStore. (Robert Kanter via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612634) * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/YarnCommands.apt.vm > Add a way to format the RMStateStore > > > Key: YARN-2131 > URL: https://issues.apache.org/jira/browse/YARN-2131 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Karthik Kambatla >Assignee: Robert Kanter > Fix For: 2.6.0 > > Attachments: YARN-2131.patch, YARN-2131.patch, > YARN-2131_addendum.patch, YARN-2131_addendum2.patch > > > There are cases when we don't want to recover past applications, but recover > applications going forward. To do this, one has to clear the store. Today, > there is no easy way to do this and users should understand how each store > works.
[jira] [Commented] (YARN-2313) Livelock can occur in FairScheduler when there are lots of running apps
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071620#comment-14071620 ] Tsuyoshi OZAWA commented on YARN-2313: -- Hi Karthik, thank you for pointing it out. {quote} So, irrespective of how long update() takes the next Thread.sleep is called for 500 ms, no? {quote} You're correct. The description "go busy loop" is wrong. But a starvation problem still remains: 1. {{FairScheduler#update()}} can take more than 10 sec, the default value of reloadIntervalMs, while holding the lock. 2. {{AllocationFileLoaderThread#onReload}} can take more than 500 ms, the default value of updateInterval, while holding the lock. 3. As a result, {{FairScheduler#update()}} and {{FairScheduler#onReload}} can always win the lock on the {{FairScheduler}} instance. 4. {{ResourceManager$SchedulerEventDispatcher}} can wait forever. The problem we faced was that the cluster (note that it's a very busy cluster!) hung even after killing existing apps. I got the stack trace when we faced the problem. In our case, we could avoid the problem by setting the configuration value (updateInterval) higher. IIUC, this is because it gives ResourceManager$SchedulerEventDispatcher enough margin to acquire the lock. As you mentioned, this fix is just a workaround. However, it's effective. A more fundamental fix would be to make updateInterval and reloadIntervalMs dynamic. Please correct me if I'm wrong. > Livelock can occur in FairScheduler when there are lots of running apps > --- > > Key: YARN-2313 > URL: https://issues.apache.org/jira/browse/YARN-2313 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.4.1 >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Fix For: 2.6.0 > > Attachments: YARN-2313.1.patch, YARN-2313.2.patch, YARN-2313.3.patch, > YARN-2313.4.patch, rm-stack-trace.txt > > > Observed livelock on FairScheduler when there are lots of entries in the queue. After > investigating the code, the following case can occur: > 1. 
{{update()}} called by UpdateThread takes longer than > UPDATE_INTERVAL (500ms) if there are lots of queues. > 2. UpdateThread goes into a busy loop. > 3. Other threads (AllocationFileReloader, > ResourceManager$SchedulerEventDispatcher) can wait forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
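The lock-margin argument in the comment above can be sketched in plain Java. This is an illustrative model only — the class and method names are invented, not FairScheduler's actual code: an "update" thread that repeatedly holds a shared lock, and a "dispatcher" thread that needs the same lock; the sleep between update passes is the window that gives the dispatcher a chance to get in.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative model (invented names): the sleep interval between update
// passes is the margin in which other lock users can run; shrinking it
// toward zero relative to the time spent under the lock starves waiters.
class LockMarginDemo {
    private static final Object schedulerLock = new Object();

    static int runDispatcherWithMargin(long intervalMs) {
        AtomicInteger dispatched = new AtomicInteger();
        Thread update = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {
                    synchronized (schedulerLock) {
                        Thread.sleep(5); // simulated long update() pass under lock
                    }
                    Thread.sleep(intervalMs); // margin for other lock users
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread dispatcher = new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                synchronized (schedulerLock) {
                    dispatched.incrementAndGet(); // simulated event handling
                }
            }
        });
        update.start();
        dispatcher.start();
        try {
            update.join();
            dispatcher.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return dispatched.get();
    }
}
```

Raising intervalMs in this model corresponds to the workaround of setting updateInterval higher.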
[jira] [Updated] (YARN-2336) Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree
[ https://issues.apache.org/jira/browse/YARN-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenji Kikushima updated YARN-2336: -- Attachment: YARN-2336-2.patch Fixed test failure. > Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree > -- > > Key: YARN-2336 > URL: https://issues.apache.org/jira/browse/YARN-2336 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.4.1 >Reporter: Kenji Kikushima >Assignee: Kenji Kikushima > Attachments: YARN-2336-2.patch, YARN-2336.patch > > > When we have sub queues in Fair Scheduler, REST api returns a missing '[' > bracket JSON for childQueues. > This issue was found by [~ajisakaa] at YARN-1050. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2328) FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped
[ https://issues.apache.org/jira/browse/YARN-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071636#comment-14071636 ] Hadoop QA commented on YARN-2328: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657329/yarn-2328-2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4402//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4402//console This message is automatically generated. 
> FairScheduler: Verify update and continuous scheduling threads are stopped > when the scheduler is stopped > > > Key: YARN-2328 > URL: https://issues.apache.org/jira/browse/YARN-2328 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.4.1 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla >Priority: Minor > Attachments: yarn-2328-1.patch, yarn-2328-2.patch, yarn-2328-2.patch > > > FairScheduler threads can use a little cleanup and tests. To begin with, the > update and continuous-scheduling threads should extend Thread and handle > being interrupted. We should have tests for starting and stopping them as > well. -- This message was sent by Atlassian JIRA (v6.2#6252)
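A minimal sketch of the cleanup this JIRA asks for — hypothetical names, not the actual yarn-2328 patch: a scheduling thread that extends Thread and treats interruption as the stop signal, so a test can start it, interrupt it, and verify it terminated.

```java
// Hypothetical sketch (not the actual patch): an update thread that exits
// cleanly on interrupt, which makes "the thread is stopped when the
// scheduler is stopped" directly testable.
class StoppableUpdateThread extends Thread {
    private final long intervalMs;

    StoppableUpdateThread(long intervalMs) {
        super("FairSchedulerUpdateThread");
        this.intervalMs = intervalMs;
        setDaemon(true);
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(intervalMs);
                // update(); // the scheduler pass would run here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore flag and fall out
            }
        }
    }
}
```

A test would start the thread, interrupt it, join with a timeout, and assert it is no longer alive.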
[jira] [Commented] (YARN-2336) Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree
[ https://issues.apache.org/jira/browse/YARN-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071665#comment-14071665 ] Hadoop QA commented on YARN-2336: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657339/YARN-2336-2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4403//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4403//console This message is automatically generated. > Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree > -- > > Key: YARN-2336 > URL: https://issues.apache.org/jira/browse/YARN-2336 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.4.1 >Reporter: Kenji Kikushima >Assignee: Kenji Kikushima > Attachments: YARN-2336-2.patch, YARN-2336.patch > > > When we have sub queues in Fair Scheduler, REST api returns a missing '[' > bracket JSON for childQueues. 
> This issue was found by [~ajisakaa] at YARN-1050. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071705#comment-14071705 ] Sunil G commented on YARN-2301: --- This will be a really useful enhancement. I have a concern here. bq. yarn container -list * *list* with ** comes after the variable input from the user (appId|appAttemptId), and is only for one of the types, named *appId*. It may also confuse the user as to which sub option needs the . I feel we could have a new command itself for listing application containers. A suggestion is: {noformat} yarn container -list-appid yarn container -list-appattemptid {noformat} OR {noformat} yarn application -list-containers {noformat} * I feel sequential checks with ConverterUtils.toApplicationId and ConverterUtils.toApplicationAttemptId have to be done to know whether the input is an appId|appAttemptId. So, redirecting to my point 1: if a separate command is there, maybe it can be handled in a better way from ApplicationCLI (rather than handling specific types of exceptions). Please share your thoughts > Improve yarn container command > -- > > Key: YARN-2301 > URL: https://issues.apache.org/jira/browse/YARN-2301 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Naganarasimha G R > Labels: usability > > While running the yarn container -list command, some > observations: > 1) the scheme (e.g. http/https) before LOG-URL is missing > 2) the start-time is printed as milliseconds (e.g. 1405540544844). Better to > print it in a time format. > 3) finish-time is 0 if the container is not yet finished. Maybe "N/A" > 4) May have an option to run as yarn container -list OR yarn > application -list-containers also. > As the attempt Id is not shown on the console, this makes it easier for the user to just copy > the appId and run it; it may also be useful for container-preserving AM > restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2313) Livelock can occur in FairScheduler when there are lots of running apps
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071728#comment-14071728 ] Hudson commented on YARN-2313: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1813 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1813/]) YARN-2313. Livelock can occur in FairScheduler when there are lots of running apps (Tsuyoshi Ozawa via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612769) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerPreemption.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/FairScheduler.apt.vm > Livelock can occur in FairScheduler when there are lots of running apps > --- > > Key: YARN-2313 > URL: https://issues.apache.org/jira/browse/YARN-2313 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.4.1 >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Fix For: 2.6.0 > > Attachments: YARN-2313.1.patch, YARN-2313.2.patch, YARN-2313.3.patch, > YARN-2313.4.patch, rm-stack-trace.txt > > > Observed livelock on FairScheduler when there are lots entry in queue. After > my investigating code, following case can occur: > 1. 
{{update()}} called by UpdateThread takes longer than > UPDATE_INTERVAL (500ms) if there are lots of queues. > 2. UpdateThread goes into a busy loop. > 3. Other threads (AllocationFileReloader, > ResourceManager$SchedulerEventDispatcher) can wait forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2295) Refactor YARN distributed shell with existing public stable API
[ https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071730#comment-14071730 ] Hudson commented on YARN-2295: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1813 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1813/]) YARN-2295. Refactored DistributedShell to use public APIs of protocol records. Contributed by Li Lu (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612626) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/Client.java > Refactor YARN distributed shell with existing public stable API > --- > > Key: YARN-2295 > URL: https://issues.apache.org/jira/browse/YARN-2295 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Fix For: 2.6.0 > > Attachments: TEST-YARN-2295-071514.patch, YARN-2295-071514-1.patch, > YARN-2295-071514.patch, YARN-2295-072114.patch > > > Some API calls in YARN distributed shell have been marked as unstable and > private. Use existing public stable API to replace them, if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071742#comment-14071742 ] Hudson commented on YARN-2242: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1813 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1813/]) YARN-2242. Addendum patch. Improve exception information on AM launch crashes. (Contributed by Li Lu) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612565) * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java > Improve exception information on AM launch crashes > -- > > Key: YARN-2242 > URL: https://issues.apache.org/jira/browse/YARN-2242 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Fix For: 2.6.0 > > Attachments: YARN-2242-070115-2.patch, YARN-2242-070814-1.patch, > YARN-2242-070814.patch, YARN-2242-071114.patch, YARN-2242-071214.patch, > YARN-2242-071414.patch > > > Now, each time the AM Container crashes during launch, both the console and the > webpage UI report only a ShellExitCodeException. This is not only unhelpful, > but sometimes confusing. With the help of the log aggregator, container logs are > actually aggregated, and can be very helpful for debugging. One possible way > to improve the whole process is to send a "pointer" to the aggregated logs to > the programmer when reporting exception information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071726#comment-14071726 ] Jason Lowe commented on YARN-2314: -- I suppose we could use a wait timeout. I was just matching the behavior when it tries to refresh the NM token on an in-use proxy which also waits indefinitely. What's the proposed behavior when the timeout expires? Log a message and then...? Arguably the timeouts should be on the RPC calls rather than the proxy cache, since I'm assuming if we're not willing to wait forever for a proxy to be freed up we're also not willing to wait forever for a remote call to complete. > ContainerManagementProtocolProxy can create thousands of threads for a large > cluster > > > Key: YARN-2314 > URL: https://issues.apache.org/jira/browse/YARN-2314 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 2.1.0-beta >Reporter: Jason Lowe >Priority: Critical > Attachments: nmproxycachefix.prototype.patch > > > ContainerManagementProtocolProxy has a cache of NM proxies, and the size of > this cache is configurable. However the cache can grow far beyond the > configured size when running on a large cluster and blow AM address/container > limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.2#6252)
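The bounded-wait alternative discussed above could look like the following sketch. The names are illustrative only — not ContainerManagementProtocolProxy's real API: waiters block against a deadline and get a failure back that the caller can log, instead of waiting indefinitely for a cached proxy to be freed.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative bounded pool (invented names): acquire() waits with a
// deadline instead of indefinitely, returning null on timeout so the
// caller decides whether to log, retry, or fail fast.
class BoundedProxyPool {
    private final Deque<String> free = new ArrayDeque<>();

    BoundedProxyPool(int size) {
        for (int i = 0; i < size; i++) {
            free.push("proxy-" + i);
        }
    }

    synchronized String acquire(long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (free.isEmpty()) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return null; // timed out waiting for a proxy to be freed
            }
            try {
                wait(remaining);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
        return free.pop();
    }

    synchronized void release(String proxy) {
        free.push(proxy);
        notifyAll(); // wake any thread waiting in acquire()
    }
}
```

As the comment notes, an RPC-level timeout may be the better place for this; the sketch only shows the cache-level variant under discussion.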
[jira] [Commented] (YARN-2131) Add a way to format the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071738#comment-14071738 ] Hudson commented on YARN-2131: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1813 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1813/]) YARN-2131. Addendum2: Document -format-state-store. Add a way to format the RMStateStore. (Robert Kanter via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612634) * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/YarnCommands.apt.vm > Add a way to format the RMStateStore > > > Key: YARN-2131 > URL: https://issues.apache.org/jira/browse/YARN-2131 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Karthik Kambatla >Assignee: Robert Kanter > Fix For: 2.6.0 > > Attachments: YARN-2131.patch, YARN-2131.patch, > YARN-2131_addendum.patch, YARN-2131_addendum2.patch > > > There are cases when we don't want to recover past applications, but recover > applications going forward. To do this, one has to clear the store. Today, > there is no easy way to do this and users should understand how each store > works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling thread when we lose a node
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071727#comment-14071727 ] Hudson commented on YARN-2273: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1813 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1813/]) YARN-2273. NPE in ContinuousScheduling thread when we lose a node. (Wei Yan via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612720) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java > NPE in ContinuousScheduling thread when we lose a node > -- > > Key: YARN-2273 > URL: https://issues.apache.org/jira/browse/YARN-2273 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, resourcemanager >Affects Versions: 2.3.0, 2.4.1 > Environment: cdh5.0.2 wheezy >Reporter: Andy Skelton >Assignee: Wei Yan > Fix For: 2.6.0 > > Attachments: YARN-2273-5.patch, YARN-2273-replayException.patch, > YARN-2273.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch > > > One DN experienced memory errors and entered a cycle of rebooting and > rejoining the cluster. 
After the second time the node went away, the RM > produced this: > {code} > 2014-07-09 21:47:36,571 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: > Application attempt appattempt_1404858438119_4352_01 released container > container_1404858438119_4352_01_04 on node: host: > node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 > available= used= with event: KILL > 2014-07-09 21:47:36,571 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: > Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: > > 2014-07-09 21:47:36,571 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread > Thread[ContinuousScheduling,5,main] threw an Exception. > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) > at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) > at java.util.TimSort.sort(TimSort.java:203) > at java.util.TimSort.sort(TimSort.java:173) > at java.util.Arrays.sort(Arrays.java:659) > at java.util.Collections.sort(Collections.java:217) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) > at java.lang.Thread.run(Thread.java:744) > {code} > A few cycles later YARN was crippled. The RM was running and jobs could be > submitted but containers were not assigned and no progress was made. > Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
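The NPE in the stack trace above comes from sorting nodes with a comparator while a node's data can vanish concurrently. One defensive option is a null-tolerant comparator — a sketch with invented names, not the actual YARN-2273 fix, which may differ: nodes whose data is gone sort last instead of throwing.

```java
import java.util.Comparator;
import java.util.Map;

// Defensive sketch (invented names): sort node names by available
// resource, tolerating entries removed from the map between snapshotting
// the node list and sorting it.
class NullSafeNodeSort {
    static Comparator<String> byAvailableDesc(Map<String, Integer> availableByNode) {
        return (a, b) -> {
            Integer ra = availableByNode.get(a); // null if the node just left
            Integer rb = availableByNode.get(b);
            if (ra == null && rb == null) return 0;
            if (ra == null) return 1;  // removed nodes sort last
            if (rb == null) return -1;
            return Integer.compare(rb, ra); // most available resource first
        };
    }
}
```

Mapping nulls consistently keeps the comparator's total-order contract intact, which TimSort (used by Collections.sort) requires.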
[jira] [Commented] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071741#comment-14071741 ] Hudson commented on YARN-2319: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1813 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1813/]) YARN-2319. Made the MiniKdc instance start/close before/after the class of TestRMWebServicesDelegationTokens. Contributed by Wenwu Peng. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612588) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesDelegationTokens.java > Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java > --- > > Key: YARN-2319 > URL: https://issues.apache.org/jira/browse/YARN-2319 > Project: Hadoop YARN > Issue Type: Test > Components: resourcemanager >Affects Versions: 3.0.0, 2.5.0 >Reporter: Wenwu Peng >Assignee: Wenwu Peng > Fix For: 2.5.0 > > Attachments: YARN-2319.0.patch, YARN-2319.1.patch, YARN-2319.2.patch, > YARN-2319.2.patch > > > MiniKdc only invokes the start method, not stop, in > TestRMWebServicesDelegationTokens.java > {code} > testMiniKDC.start(); > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1779) Handle AMRMTokens across RM failover
[ https://issues.apache.org/jira/browse/YARN-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071772#comment-14071772 ] Rohith commented on YARN-1779: -- This is a critical issue for the work-preserving restart feature. The AM cannot connect to the new RM because the proxy object is cached and the token service is overwritten. One approach to solving this is to clone the token object and add the clone to UserGroupInformation. A sample is below (generics restored; JIRA strips angle brackets) {code} for (Token<? extends TokenIdentifier> token : UserGroupInformation.getCurrentUser().getTokens()) { if (token.getKind().equals(AMRMTokenIdentifier.KIND_NAME)) { Token<? extends TokenIdentifier> specificToken = new Token<>(token); SecurityUtil.setTokenService(specificToken, resourceManagerAddress); UserGroupInformation.getCurrentUser().addToken(specificToken); } } {code} Does it make sense? > Handle AMRMTokens across RM failover > > > Key: YARN-1779 > URL: https://issues.apache.org/jira/browse/YARN-1779 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.3.0 >Reporter: Karthik Kambatla >Priority: Blocker > Labels: ha > > Verify if AMRMTokens continue to work against RM failover. If not, we will > have to do something along the lines of YARN-986. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071778#comment-14071778 ] Zhijie Shen commented on YARN-2301: --- [~sunilg], thanks for your input. Here's my response. bq. And is only for one of the type named appId. May be it may confuse user also, like which sub option needs the . I don't worry too much about it, because we can update the usage block to let users know how to use the opts correctly. When users make a mistake, they will be redirected to the usage output. bq. I feel may be we can have a new command itself for listing application container. I'm inclined not to change the command, to keep backward compatibility. bq. I feel sequential checks with ConverterUtils.toApplicationID and ConverterUtils.toApplicationAttemptId has to be done to know whether input is appId|appAttemptId. We can use ConverterUtils.APPLICATION_PREFIX and ConverterUtils.APPLICATION_ATTEMPT_PREFIX to check the prefix of the given id to determine whether it is the app id or the app attempt id. We don't actually need to handle the exception. > Improve yarn container command > -- > > Key: YARN-2301 > URL: https://issues.apache.org/jira/browse/YARN-2301 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Naganarasimha G R > Labels: usability > > While running the yarn container -list command, some > observations: > 1) the scheme (e.g. http/https) before LOG-URL is missing > 2) the start-time is printed as milliseconds (e.g. 1405540544844). Better to > print it in a time format. > 3) finish-time is 0 if the container is not yet finished. Maybe "N/A" > 4) May have an option to run as yarn container -list OR yarn > application -list-containers also. > As the attempt Id is not shown on the console, this makes it easier for the user to just copy > the appId and run it; it may also be useful for container-preserving AM > restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
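The prefix-check suggestion above can be sketched as a small helper. This is a hypothetical class: the prefix strings mirror the usual YARN id formats ("application_...", "appattempt_..."), while the real code would read the constants from ConverterUtils.

```java
// Hypothetical helper for the prefix check: classify a CLI argument by
// its id prefix rather than by catching parse exceptions. Prefixes are
// assumptions mirroring the usual YARN id formats.
class IdClassifier {
    private static final String APP_PREFIX = "application";
    private static final String ATTEMPT_PREFIX = "appattempt";

    static String classify(String id) {
        if (id.startsWith(ATTEMPT_PREFIX)) {
            return "appAttemptId";
        }
        if (id.startsWith(APP_PREFIX)) {
            return "appId";
        }
        return "unknown"; // the CLI would print the usage block here
    }
}
```

Note the attempt prefix must be checked first only if the prefixes could overlap; with these two strings either order works, since neither is a prefix of the other.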
[jira] [Updated] (YARN-1063) Winutils needs ability to create task as domain user
[ https://issues.apache.org/jira/browse/YARN-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-1063: --- Attachment: YARN-1063.5.patch Patch .5 changes the environment block of the secure process to inherit the parent environment. > Winutils needs ability to create task as domain user > > > Key: YARN-1063 > URL: https://issues.apache.org/jira/browse/YARN-1063 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager > Environment: Windows >Reporter: Kyle Leckie >Assignee: Remus Rusanu > Labels: security, windows > Attachments: YARN-1063.2.patch, YARN-1063.3.patch, YARN-1063.4.patch, > YARN-1063.5.patch, YARN-1063.patch > > > h1. Summary: > Securing a Hadoop cluster requires constructing some form of security > boundary around the processes executed in YARN containers. Isolation based on > Windows user isolation seems most feasible. This approach is similar to the > approach taken by the existing LinuxContainerExecutor. The current patch to > winutils.exe adds the ability to create a process as a domain user. > h1. Alternative Methods considered: > h2. Process rights limited by security token restriction: > On Windows access decisions are made by examining the security token of a > process. It is possible to spawn a process with a restricted security token. > Any of the rights granted by SIDs of the default token may be restricted. It > is possible to see this in action by examining the security token of a > sandboxed process launched by a web browser. Typically the launched process > will have a fully restricted token and needs to access machine resources > through a dedicated broker process that enforces a custom security policy. > This broker process mechanism would break compatibility with the typical > Hadoop container process. The Container process must be able to utilize > standard function calls for disk and network IO. 
I performed some work > looking at ways to ACL the local files to the specific launched without > granting rights to other processes launched on the same machine but found > this to be an overly complex solution. > h2. Relying on APP containers: > Recent versions of windows have the ability to launch processes within an > isolated container. Application containers are supported for execution of > WinRT based executables. This method was ruled out due to the lack of > official support for standard windows APIs. At some point in the future > windows may support functionality similar to BSD jails or Linux containers, > at that point support for containers should be added. > h1. Create As User Feature Description: > h2. Usage: > A new sub command was added to the set of task commands. Here is the syntax: > winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE] > Some notes: > * The username specified is in the format of "user@domain" > * The machine executing this command must be joined to the domain of the user > specified > * The domain controller must allow the account executing the command access > to the user information. For this join the account to the predefined group > labeled "Pre-Windows 2000 Compatible Access" > * The account running the command must have several rights on the local > machine. These can be managed manually using secpol.msc: > ** "Act as part of the operating system" - SE_TCB_NAME > ** "Replace a process-level token" - SE_ASSIGNPRIMARYTOKEN_NAME > ** "Adjust memory quotas for a process" - SE_INCREASE_QUOTA_NAME > * The launched process will not have rights to the desktop so will not be > able to display any information or create UI. > * The launched process will have no network credentials. Any access of > network resources that requires domain authentication will fail. > h2. Implementation: > Winutils performs the following steps: > # Enable the required privileges for the current process. 
> # Register as a trusted process with the Local Security Authority (LSA). > # Create a new logon for the user passed on the command line. > # Load/Create a profile on the local machine for the new logon. > # Create a new environment for the new logon. > # Launch the new process in a job with the task name specified and using the > created logon. > # Wait for the JOB to exit. > h2. Future work: > The following work was scoped out of this check in: > * Support for non-domain users or machine that are not domain joined. > * Support for privilege isolation by running the task launcher in a high > privilege service with access over an ACLed named pipe. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1972) Implement secure Windows Container Executor
[ https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-1972: --- Attachment: YARN-1972.3.patch Patch .3 reverts the separation of createUserAppCacheDirs, as per review comment, and adds hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/SecureContainer.apt.vm > Implement secure Windows Container Executor > --- > > Key: YARN-1972 > URL: https://issues.apache.org/jira/browse/YARN-1972 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: YARN-1972.1.patch, YARN-1972.2.patch, YARN-1972.3.patch > > > h1. Windows Secure Container Executor (WCE) > YARN-1063 adds the necessary infrastructure to launch a process as a domain > user as a solution for the problem of having a security boundary between > processes executed in YARN containers and the Hadoop services. The WCE is a > container executor that leverages the winutils capabilities introduced in > YARN-1063 and launches containers as an OS process running as the job > submitter user. A description of the S4U infrastructure used by YARN-1063 and the > alternatives considered can be read on that JIRA. > The WCE is based on the DefaultContainerExecutor. It relies on the DCE to > drive the flow of execution, but it overrides some methods to the effect of: > * change the DCE created user cache directories to be owned by the job user > and by the nodemanager group. > * changes the actual container run command to use the 'createAsUser' command > of winutils task instead of 'create' > * runs the localization as a standalone process instead of an in-process Java > method call. This in turn relies on the winutils createAsUser feature to run > the localization as the job user. 
> > When compared to LinuxContainerExecutor (LCE), the WCE has some minor > differences: > * it does not delegate the creation of the user cache directories to the > native implementation. > * it does not require special handling to be able to delete user files > The WCE approach came from practical trial and error. I had > to iron out some issues around the Windows script shell limitations (command > line length) to get it to work, the biggest issue being the huge CLASSPATH > that is commonplace in Hadoop environment container executions. The job > container itself is already dealing with this via a so called 'classpath > jar', see HADOOP-8899 and YARN-316 for details. For the WCE localizer launch > as a separate container the same issue had to be resolved and I used the same > 'classpath jar' approach. > h2. Deployment Requirements > To use the WCE one needs to set the > `yarn.nodemanager.container-executor.class` to > `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` > and set the `yarn.nodemanager.windows-secure-container-executor.group` to a > Windows security group that the nodemanager service principal is a > member of (equivalent of the LCE > `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE > does not require any configuration outside of Hadoop's own yarn-site.xml. > For the WCE to work the nodemanager must run as a service principal that is a > member of the local Administrators group or LocalSystem. This is derived from > the need to invoke the LoadUserProfile API, which mentions these requirements in > its specification. This is in addition to the SE_TCB privilege mentioned in > YARN-1063, but this requirement automatically implies that the SE_TCB > privilege is held by the nodemanager. For the Linux speakers in the audience, > the requirement is basically to run the NM as root. > h2. 
Dedicated high privilege Service > Due to the high privileges required by the WCE, we had discussed the need to > isolate the high privilege operations into a separate process, an 'executor' > service that is solely responsible for starting the containers (including the > localizer). The NM would have to authenticate, authorize and communicate with > this service via an IPC mechanism and use this service to launch the > containers. I still believe we'll end up deploying such a service, but the > effort to onboard such a new platform-specific service on the project is > not trivial. -- This message was sent by Atlassian JIRA (v6.2#6252)
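The deployment requirements above name exactly two yarn-site.xml keys. As a minimal sketch, they can be captured with plain `java.util.Properties` standing in for the XML file; the executor class name and property keys come from the text above, while the group name "HadoopNMGroup" is a made-up example value:

```java
import java.util.Properties;

/**
 * Sketch of the two WCE settings described in the deployment requirements.
 * Properties is a stand-in for yarn-site.xml; "HadoopNMGroup" is a
 * hypothetical group that the NM service principal would belong to.
 */
public class WceConfigSketch {
    static Properties wceConfig() {
        Properties conf = new Properties();
        // Select the secure Windows executor implementation.
        conf.setProperty("yarn.nodemanager.container-executor.class",
            "org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor");
        // Must name a Windows security group the NM principal is a member of.
        conf.setProperty("yarn.nodemanager.windows-secure-container-executor.group",
            "HadoopNMGroup");
        return conf;
    }
}
```

In a real deployment these would be `<property>` entries in yarn-site.xml rather than Java code.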
[jira] [Commented] (YARN-1063) Winutils needs ability to create task as domain user
[ https://issues.apache.org/jira/browse/YARN-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071781#comment-14071781 ] Hadoop QA commented on YARN-1063: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657356/YARN-1063.5.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4404//console This message is automatically generated. > Winutils needs ability to create task as domain user > > > Key: YARN-1063 > URL: https://issues.apache.org/jira/browse/YARN-1063 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager > Environment: Windows >Reporter: Kyle Leckie >Assignee: Remus Rusanu > Labels: security, windows > Attachments: YARN-1063.2.patch, YARN-1063.3.patch, YARN-1063.4.patch, > YARN-1063.5.patch, YARN-1063.patch > > > h1. Summary: > Securing a Hadoop cluster requires constructing some form of security > boundary around the processes executed in YARN containers. Isolation based on > Windows users seems most feasible. This approach is similar to the > approach taken by the existing LinuxContainerExecutor. The current patch to > winutils.exe adds the ability to create a process as a domain user. > h1. Alternative Methods considered: > h2. Process rights limited by security token restriction: > On Windows access decisions are made by examining the security token of a > process. It is possible to spawn a process with a restricted security token. > Any of the rights granted by SIDs of the default token may be restricted. It > is possible to see this in action by examining the security token of a > sandboxed process launched by a web browser. 
Typically the launched process > will have a fully restricted token and need to access machine resources > through a dedicated broker process that enforces a custom security policy. > This broker process mechanism would break compatibility with the typical > Hadoop container process. The container process must be able to utilize > standard function calls for disk and network IO. I performed some work > looking at ways to ACL the local files to the specific launched process without > granting rights to other processes launched on the same machine, but found > this to be an overly complex solution. > h2. Relying on APP containers: > Recent versions of Windows have the ability to launch processes within an > isolated container. Application containers are supported for execution of > WinRT based executables. This method was ruled out due to the lack of > official support for standard Windows APIs. At some point in the future > Windows may support functionality similar to BSD jails or Linux containers; > at that point support for containers should be added. > h1. Create As User Feature Description: > h2. Usage: > A new sub-command was added to the set of task commands. Here is the syntax: > winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE] > Some notes: > * The username specified is in the format of "user@domain" > * The machine executing this command must be joined to the domain of the user > specified > * The domain controller must allow the account executing the command access > to the user information. For this, join the account to the predefined group > labeled "Pre-Windows 2000 Compatible Access" > * The account running the command must have several rights on the local > machine. 
These can be managed manually using secpol.msc: > ** "Act as part of the operating system" - SE_TCB_NAME > ** "Replace a process-level token" - SE_ASSIGNPRIMARYTOKEN_NAME > ** "Adjust memory quotas for a process" - SE_INCREASE_QUOTA_NAME > * The launched process will not have rights to the desktop so will not be > able to display any information or create UI. > * The launched process will have no network credentials. Any access of > network resources that requires domain authentication will fail. > h2. Implementation: > Winutils performs the following steps: > # Enable the required privileges for the current process. > # Register as a trusted process with the Local Security Authority (LSA). > # Create a new logon for the user passed on the command line. > # Load/Create a profile on the local machine for the new logon. > # Create a new environment for the new logon. > # Launch the new process in a job with the task name specified and using the > created logon. > # Wait for the JOB to exit. > h2. Future work: > The following work was scoped out of this check-in: > * Support for non-domain users or machines that are not domain-joined. > * Support for privilege isolation by run
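The `createAsUser` syntax above can be sketched as a small command-line builder. Only the `winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE]` syntax and the "user@domain" username format come from the description; the wrapper class and its names are hypothetical, not part of winutils:

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical helper that assembles the winutils command line described
 * above. The "winutils task createAsUser" sub-command is real (YARN-1063);
 * this wrapper exists only to illustrate the argument layout.
 */
public class CreateAsUserCommand {
    static List<String> build(String taskName, String user,
                              String domain, String commandLine) {
        // Syntax: winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE]
        // The username must be in "user@domain" format.
        return Arrays.asList("winutils", "task", "createAsUser",
                taskName, user + "@" + domain, commandLine);
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ",
                build("task1", "alice", "EXAMPLE", "cmd /c whoami")));
    }
}
```

On a real node this list would be handed to a process launcher; note the launched process has no desktop rights and no network credentials, per the notes above.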
[jira] [Updated] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2198: --- Attachment: YARN-2198.2.patch Patch .2 enables mutual auth on LRPC. TODO: separate config for the service from yarn-site.xml and update SecureExecutor.apt.vm to reflect the reality of YARN-2198 > Remove the need to run NodeManager as privileged account for Windows Secure > Container Executor > -- > > Key: YARN-2198 > URL: https://issues.apache.org/jira/browse/YARN-2198 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: YARN-2198.1.patch, YARN-2198.2.patch > > > YARN-1972 introduces a Secure Windows Container Executor. However this > executor requires the process launching the container to be LocalSystem or > a member of the local Administrators group. Since the process in question > is the NodeManager, the requirement translates to the entire NM running as a > privileged account, a very large surface area to review and protect. > This proposal is to move the privileged operations into a dedicated NT > service. The NM can run as a low privilege account and communicate with the > privileged NT service when it needs to launch a container. This would reduce > the surface exposed to the high privileges. > There has to exist a secure, authenticated and authorized channel of > communication between the NM and the privileged NT service. Possible > alternatives are a new TCP endpoint, Java RPC etc. My proposal though would > be to use Windows LPC (Local Procedure Calls), which is a Windows platform > specific inter-process communication channel that satisfies all requirements > and is easy to deploy. The privileged NT service would register and listen on > an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop > with libwinutils which would host the LPC client code. 
The client would > connect to the LPC port (NtConnectPort) and send a message requesting a > container launch (NtRequestWaitReplyPort). LPC provides authentication and > the privileged NT service can use authorization API (AuthZ) to validate the > caller. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2247) Allow RM web services users to authenticate using delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-2247: Attachment: apache-yarn-2247.4.patch {quote} Varun Vasudev, thanks for your patience on my comments. The new patch looks almost good to me. Just some nits: 1. Should not be necessary. Always load TimelineAuthenticationFilter. With "simple" type, the pseudo handler is still to be used. {noformat} +if (authType.equals("simple") && !UserGroupInformation.isSecurityEnabled()) { + container.addFilter("authentication", +AuthenticationFilter.class.getName(), filterConfig); + return; +} {noformat} {quote} Good point. Fixed. {quote} 2. Check not null first for testMiniKDC and rm? Same for TestRMWebappAuthentication {noformat} +testMiniKDC.stop(); +rm.stop(); {noformat} {quote} Fixed. {quote} 3. I didn't find the logic to forbid it. Anyway, is it good to mention it in the document as well? {noformat} + // Test to make sure that we can't do delegation token + // functions using just delegation token auth {noformat} {quote} The test is in RMWebServices. {noformat} callerUGI = createKerberosUserGroupInformation(hsr); {noformat} which in turn has this check {noformat} String authType = hsr.getAuthType(); if (!KerberosAuthenticationHandler.TYPE.equals(authType)) { String msg = "Delegation token operations can only be carried out on a " + "Kerberos authenticated channel"; throw new YarnException(msg); } {noformat} I've documented it under the delegation token REST API section: {noformat} All delegation token requests must be carried out on a Kerberos authenticated connection (using SPNEGO). 
{noformat} > Allow RM web services users to authenticate using delegation tokens > --- > > Key: YARN-2247 > URL: https://issues.apache.org/jira/browse/YARN-2247 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Blocker > Attachments: apache-yarn-2247.0.patch, apache-yarn-2247.1.patch, > apache-yarn-2247.2.patch, apache-yarn-2247.3.patch, apache-yarn-2247.4.patch > > > The RM webapp should allow users to authenticate using delegation tokens to > maintain parity with RPC. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1342) Recover container tokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1342: - Attachment: YARN-1342v6.patch Thanks for the review, Junping! bq. Would you confirm my understanding is correct? If so, the following code may not be necessary? Yes, that's correct. Sorry, I meant to remove that code to match the same behavior from NMContainerTokenSecretManagerInNM and forgot to do so. Thanks for catching this, and I updated the patch accordingly. > Recover container tokens upon nodemanager restart > - > > Key: YARN-1342 > URL: https://issues.apache.org/jira/browse/YARN-1342 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-1342.patch, YARN-1342v2.patch, > YARN-1342v3-and-YARN-1987.patch, YARN-1342v4.patch, YARN-1342v5.patch, > YARN-1342v6.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
[ https://issues.apache.org/jira/browse/YARN-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071829#comment-14071829 ] Hudson commented on YARN-2319: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1840 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1840/]) YARN-2319. Made the MiniKdc instance start/close before/after the class of TestRMWebServicesDelegationTokens. Contributed by Wenwu Peng. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612588) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesDelegationTokens.java > Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java > --- > > Key: YARN-2319 > URL: https://issues.apache.org/jira/browse/YARN-2319 > Project: Hadoop YARN > Issue Type: Test > Components: resourcemanager >Affects Versions: 3.0.0, 2.5.0 >Reporter: Wenwu Peng >Assignee: Wenwu Peng > Fix For: 2.5.0 > > Attachments: YARN-2319.0.patch, YARN-2319.1.patch, YARN-2319.2.patch, > YARN-2319.2.patch > > > MiniKdc only invokes the start method, not stop, in > TestRMWebServicesDelegationTokens.java > {code} > testMiniKDC.start(); > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
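The fix described here (and the "check not null first" review comment on YARN-2247 above) boils down to a null-guarded teardown. A minimal self-contained sketch, with `FakeKdc` standing in for the real MiniKdc class from hadoop-minikdc:

```java
/**
 * Sketch of the null-safe teardown pattern: stop helpers only if they
 * were actually created. FakeKdc is a stand-in for MiniKdc; the guard
 * keeps tearDown() safe even when setup failed before start().
 */
public class TeardownSketch {
    static class FakeKdc {
        boolean running;
        void start() { running = true; }
        void stop()  { running = false; }
    }

    static FakeKdc kdc; // may stay null if setup never ran

    static void tearDown() {
        if (kdc != null) {   // guard against setup having failed early
            kdc.stop();
        }
    }

    public static void main(String[] args) {
        tearDown();          // safe: kdc was never created
        kdc = new FakeKdc();
        kdc.start();
        tearDown();
        System.out.println(kdc.running); // false
    }
}
```

In the actual test the start/stop calls live in JUnit `@BeforeClass`/`@AfterClass` methods, per the commit message above.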
[jira] [Commented] (YARN-2313) Livelock can occur in FairScheduler when there are lots of running apps
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071816#comment-14071816 ] Hudson commented on YARN-2313: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1840 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1840/]) YARN-2313. Livelock can occur in FairScheduler when there are lots of running apps (Tsuyoshi Ozawa via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612769) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerPreemption.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/FairScheduler.apt.vm > Livelock can occur in FairScheduler when there are lots of running apps > --- > > Key: YARN-2313 > URL: https://issues.apache.org/jira/browse/YARN-2313 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.4.1 >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Fix For: 2.6.0 > > Attachments: YARN-2313.1.patch, YARN-2313.2.patch, YARN-2313.3.patch, > YARN-2313.4.patch, rm-stack-trace.txt > > > Observed livelock on FairScheduler when there are lots of entries in the queue. After > investigating the code, the following case can occur: > 1. 
{{update()}} called by UpdateThread takes longer than > UPDATE_INTERVAL (500ms) if there are many queues. > 2. UpdateThread goes into a busy loop. > 3. Other threads (AllocationFileReloader, > ResourceManager$SchedulerEventDispatcher) can wait forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
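The busy-loop mechanics in steps 1-3 can be sketched as follows. This is illustrative only, not FairScheduler's actual code: if one `update()` pass takes longer than the interval, the remaining sleep time is zero or negative and the thread never yields; clamping the sleep to a minimum avoids that.

```java
/**
 * Illustration of the livelock mechanism: a fixed 500ms interval turns
 * into a busy loop once update() itself takes >= 500ms, because the
 * "remaining" sleep becomes <= 0. The clamped variant always yields.
 * Constants and method names are illustrative, not FairScheduler's.
 */
public class UpdateLoopSketch {
    static final long UPDATE_INTERVAL_MS = 500;

    // naive: sleep whatever is left of the interval (may be <= 0 -> busy loop)
    static long naiveSleep(long updateDurationMs) {
        return UPDATE_INTERVAL_MS - updateDurationMs;
    }

    // fixed: always sleep at least a little so other threads can make progress
    static long clampedSleep(long updateDurationMs) {
        return Math.max(UPDATE_INTERVAL_MS - updateDurationMs, 10);
    }
}
```

With many queues, `updateDurationMs` grows past 500ms and the naive computation goes non-positive, which is exactly step 2 in the report.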
[jira] [Commented] (YARN-2131) Add a way to format the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071826#comment-14071826 ] Hudson commented on YARN-2131: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1840 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1840/]) YARN-2131. Addendum2: Document -format-state-store. Add a way to format the RMStateStore. (Robert Kanter via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612634) * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/YarnCommands.apt.vm > Add a way to format the RMStateStore > > > Key: YARN-2131 > URL: https://issues.apache.org/jira/browse/YARN-2131 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Karthik Kambatla >Assignee: Robert Kanter > Fix For: 2.6.0 > > Attachments: YARN-2131.patch, YARN-2131.patch, > YARN-2131_addendum.patch, YARN-2131_addendum2.patch > > > There are cases when we don't want to recover past applications, but recover > applications going forward. To do this, one has to clear the store. Today, > there is no easy way to do this and users should understand how each store > works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2295) Refactor YARN distributed shell with existing public stable API
[ https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071818#comment-14071818 ] Hudson commented on YARN-2295: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1840 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1840/]) YARN-2295. Refactored DistributedShell to use public APIs of protocol records. Contributed by Li Lu (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612626) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/Client.java > Refactor YARN distributed shell with existing public stable API > --- > > Key: YARN-2295 > URL: https://issues.apache.org/jira/browse/YARN-2295 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Fix For: 2.6.0 > > Attachments: TEST-YARN-2295-071514.patch, YARN-2295-071514-1.patch, > YARN-2295-071514.patch, YARN-2295-072114.patch > > > Some API calls in YARN distributed shell have been marked as unstable and > private. Use existing public stable API to replace them, if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071830#comment-14071830 ] Hudson commented on YARN-2242: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1840 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1840/]) YARN-2242. Addendum patch. Improve exception information on AM launch crashes. (Contributed by Li Lu) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612565) * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java > Improve exception information on AM launch crashes > -- > > Key: YARN-2242 > URL: https://issues.apache.org/jira/browse/YARN-2242 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Fix For: 2.6.0 > > Attachments: YARN-2242-070115-2.patch, YARN-2242-070814-1.patch, > YARN-2242-070814.patch, YARN-2242-071114.patch, YARN-2242-071214.patch, > YARN-2242-071414.patch > > > Now on each time AM Container crashes during launch, both the console and the > webpage UI only report a ShellExitCodeExecption. This is not only unhelpful, > but sometimes confusing. With the help of log aggregator, container logs are > actually aggregated, and can be very helpful for debugging. One possible way > to improve the whole process is to send a "pointer" to the aggregated logs to > the programmer when reporting exception information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling thread when we lose a node
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071815#comment-14071815 ] Hudson commented on YARN-2273: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1840 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1840/]) YARN-2273. NPE in ContinuousScheduling thread when we lose a node. (Wei Yan via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612720) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java > NPE in ContinuousScheduling thread when we lose a node > -- > > Key: YARN-2273 > URL: https://issues.apache.org/jira/browse/YARN-2273 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, resourcemanager >Affects Versions: 2.3.0, 2.4.1 > Environment: cdh5.0.2 wheezy >Reporter: Andy Skelton >Assignee: Wei Yan > Fix For: 2.6.0 > > Attachments: YARN-2273-5.patch, YARN-2273-replayException.patch, > YARN-2273.patch, YARN-2273.patch, YARN-2273.patch, YARN-2273.patch > > > One DN experienced memory errors and entered a cycle of rebooting and > rejoining the cluster. 
After the second time the node went away, the RM > produced this: > {code} > 2014-07-09 21:47:36,571 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: > Application attempt appattempt_1404858438119_4352_01 released container > container_1404858438119_4352_01_04 on node: host: > node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 > available= used= with event: KILL > 2014-07-09 21:47:36,571 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: > Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: > > 2014-07-09 21:47:36,571 ERROR > org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread > Thread[ContinuousScheduling,5,main] threw an Exception. > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) > at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) > at java.util.TimSort.sort(TimSort.java:203) > at java.util.TimSort.sort(TimSort.java:173) > at java.util.Arrays.sort(Arrays.java:659) > at java.util.Collections.sort(Collections.java:217) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) > at java.lang.Thread.run(Thread.java:744) > {code} > A few cycles later YARN was crippled. The RM was running and jobs could be > submitted but containers were not assigned and no progress was made. > Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
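The stack trace shows the comparator blowing up while sorting nodes after one was removed. A minimal sketch of the failure mode and a snapshot-and-filter workaround; the names here are illustrative, not FairScheduler's actual fields (the NPE in the real code comes from dereferencing state of a node that no longer exists):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch: a comparator that looks up per-node state in a shared map
 * fails (here via Integer unboxing of a null get()) when a node is
 * removed between building the node list and sorting it. Sorting a
 * snapshot filtered against the live map avoids the NPE.
 */
public class NodeSortSketch {
    static List<String> sortByAvailable(List<String> nodes,
                                        Map<String, Integer> available) {
        // snapshot + filter: drop nodes that disappeared from the map
        List<String> snapshot = new ArrayList<>();
        for (String n : nodes) {
            if (available.containsKey(n)) {
                snapshot.add(n);
            }
        }
        // descending by available resource; every key is guaranteed present
        snapshot.sort((a, b) ->
                Integer.compare(available.get(b), available.get(a)));
        return snapshot;
    }

    public static void main(String[] args) {
        Map<String, Integer> avail = new HashMap<>();
        avail.put("nodeA", 4096);
        avail.put("nodeB", 1024);
        // "nodeC" has just been removed from the cluster: no map entry
        List<String> nodes = new ArrayList<>(Arrays.asList("nodeA", "nodeB", "nodeC"));
        System.out.println(sortByAvailable(nodes, avail)); // [nodeA, nodeB]
    }
}
```

Sorting the raw list with the same comparator would hit `available.get("nodeC")` returning null and throw during unboxing, mirroring the NPE in the trace.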
[jira] [Commented] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071838#comment-14071838 ] Sunil G commented on YARN-2301: --- bq. we can update the usage block to let users know how to use the opts correctly. When users make a mistake, they will be redirected to the usage output +1. Yes. Users can be redirected back to the correct usage. > Improve yarn container command > -- > > Key: YARN-2301 > URL: https://issues.apache.org/jira/browse/YARN-2301 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Naganarasimha G R > Labels: usability > > While running the yarn container -list command, some > observations: > 1) the scheme (e.g. http/https) before LOG-URL is missing > 2) the start-time is printed as milliseconds (e.g. 1405540544844). Better to > print it in a time format. > 3) finish-time is 0 if the container is not yet finished. Maybe "N/A" > 4) May have an option to run as yarn container -list OR yarn > application -list-containers also. > As the attempt Id is not shown on the console, this makes it easier for the user to just copy > the appId and run it; it may also be useful for container-preserving AM > restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2340) NPE thrown when RM restart after queue is STOPPED
Nishan Shetty created YARN-2340: --- Summary: NPE thrown when RM restart after queue is STOPPED Key: YARN-2340 URL: https://issues.apache.org/jira/browse/YARN-2340 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.4.1 Environment: Capacityscheduler with Queue a, b Reporter: Nishan Shetty Priority: Critical While a job is in progress, set the Queue state to STOPPED and then restart the RM. Observe that the standby RM fails to come up as active, throwing the below NPE: 2014-07-23 18:43:24,432 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1406116264351_0014_02 State change from NEW to SUBMITTED 2014-07-23 18:43:24,433 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:568) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:916) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:101) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:602) at java.lang.Thread.run(Thread.java:662) 2014-07-23 18:43:24,434 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2340) NPE thrown when RM restart after queue is STOPPED
[ https://issues.apache.org/jira/browse/YARN-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishan Shetty resolved YARN-2340. - Resolution: Unresolved > NPE thrown when RM restart after queue is STOPPED > - > > Key: YARN-2340 > URL: https://issues.apache.org/jira/browse/YARN-2340 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, scheduler >Affects Versions: 2.4.1 > Environment: Capacityscheduler with Queue a, b >Reporter: Nishan Shetty >Priority: Critical > > While a job is in progress, set the Queue state to STOPPED and then restart the RM. > Observe that the standby RM fails to come up as active, throwing the below NPE: > 2014-07-23 18:43:24,432 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: > appattempt_1406116264351_0014_02 State change from NEW to SUBMITTED > 2014-07-23 18:43:24,433 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_ADDED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:568) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:916) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:101) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:602) > at java.lang.Thread.run(Thread.java:662) > 2014-07-23 18:43:24,434 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (YARN-2340) NPE thrown when RM restart after queue is STOPPED
[ https://issues.apache.org/jira/browse/YARN-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishan Shetty reopened YARN-2340: - > NPE thrown when RM restart after queue is STOPPED > - > > Key: YARN-2340 > URL: https://issues.apache.org/jira/browse/YARN-2340 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, scheduler >Affects Versions: 2.4.1 > Environment: Capacityscheduler with Queue a, b >Reporter: Nishan Shetty >Priority: Critical > > While a job is in progress, set the Queue state to STOPPED and then restart the RM. > Observe that the standby RM fails to come up as active, throwing the below NPE: > 2014-07-23 18:43:24,432 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: > appattempt_1406116264351_0014_02 State change from NEW to SUBMITTED > 2014-07-23 18:43:24,433 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_ADDED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:568) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:916) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:101) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:602) > at java.lang.Thread.run(Thread.java:662) > 2014-07-23 18:43:24,434 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071853#comment-14071853 ] Tsuyoshi OZAWA commented on YARN-2229: -- Thanks for the comments, Jian, Zhijie and Sid. {quote} For example, ContainerTokenIdentifier serializes a long (getContainerId()) at RM side, but deserializes an int (getId()) at NM side. In this case, I'm afraid it's going to be wrong {quote} If we treat backward compatibility as the first priority, we can choose the first design I proposed, as Sid mentioned. This design choice looks reasonable to me. [~jianhe], what do you think? We discussed that we should avoid introducing a new field to the ContainerId class. In my opinion, this reason is weaker than the backward compatibility one. {quote} ConverterUtils is a separate consideration. It is marked as @private - but is used in MapReduce for example (and also in Tez). Looks like the toString method isn't being changed either, which means the ConverterUtils method would continue to work. {quote} I'm thinking of suffixing the epoch at the end of the container id. It will work with an old jar which includes the old {{ConverterUtils#toContainerId}}. YARN-2182 is the JIRA to address the change of {{ConverterUtils#toContainerId}}. > ContainerId can overflow with RM restart > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229.1.patch, YARN-2229.10.patch, > YARN-2229.10.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, > YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, > YARN-2229.8.patch, YARN-2229.9.patch > > > On YARN-2052, we changed the containerId format: upper 10 bits are for the epoch, > lower 22 bits are for the sequence number of Ids. 
This is for preserving > semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, > {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and > {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM > restarts 1024 times. > To avoid the problem, its better to make containerId long. We need to define > the new format of container Id with preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
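The overflow concern in this issue follows directly from the bit layout stated in the description (10 epoch bits, 22 sequence bits). A small sketch of that packing — the class and method names are illustrative, not YARN's actual API — shows the epoch silently wrapping once the RM has restarted 1024 times:

```java
// Sketch of the 32-bit containerId layout described above:
// upper 10 bits = epoch (RM restart generation), lower 22 bits = sequence.
public class ContainerIdLayoutSketch {
    static final int SEQ_BITS = 22;
    static final int SEQ_MASK = (1 << SEQ_BITS) - 1; // 0x3FFFFF

    static int pack(int epoch, int seq) {
        return (epoch << SEQ_BITS) | (seq & SEQ_MASK);
    }
    static int epochOf(int id) { return id >>> SEQ_BITS; }
    static int seqOf(int id)   { return id & SEQ_MASK; }

    public static void main(String[] args) {
        int id = pack(3, 42);
        System.out.println(epochOf(id)); // 3
        System.out.println(seqOf(id));   // 42
        // Epoch 1024 needs an 11th bit, so it wraps: this is the overflow
        // after 1024 RM restarts that motivates widening the id to a long.
        System.out.println(epochOf(pack(1024, 0))); // 0
    }
}
```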
[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071859#comment-14071859 ] Tsuyoshi OZAWA commented on YARN-2229: -- {quote} I'm not sure it's good to mark a @Stable method back to @Unstable Agree with Zhijie on not changing an @Stable method to @Unstable. Deprecate in this patch itself? {quote} The @Stable-or-@Unstable discussion disappears if we decide to continue to use {{getId}}. I'd like to decide that before continuing the discussion. {quote} hashCode and equals are inconsistent in the latest patch. One uses getId(), the other uses getContainerId {quote} This is my mistake, I'll update it in the next patch. Or, if we decide to continue to use {{getId}}, I'll revert it. > ContainerId can overflow with RM restart > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229.1.patch, YARN-2229.10.patch, > YARN-2229.10.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, > YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, > YARN-2229.8.patch, YARN-2229.9.patch > > > On YARN-2052, we changed containerId format: upper 10 bits are for epoch, > lower 22 bits are for sequence number of Ids. This is for preserving > semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, > {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and > {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM > restarts 1024 times. > To avoid the problem, its better to make containerId long. We need to define > the new format of container Id with preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
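The review point above — equals() and hashCode() built from different id fields — is worth seeing concretely. A toy class (not YARN code; the field and method names only mimic the getId()/getContainerId() split) shows how the inconsistency makes hash-based collections miss "equal" objects:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: equals() compares the truncated int id while
// hashCode() uses the full long, so two objects that compare equal can
// hash differently -- and HashSet lookups silently fail.
public class IdConsistencySketch {
    static final int SEQ_BITS = 22;

    static final class Id {
        final long full;                       // stands in for getContainerId()
        Id(long full) { this.full = full; }
        int shortId() { return (int) (full & ((1 << SEQ_BITS) - 1)); } // getId()

        @Override public boolean equals(Object o) {
            return o instanceof Id && ((Id) o).shortId() == shortId(); // int field
        }
        @Override public int hashCode() {
            return Long.hashCode(full);                                // long field
        }
    }

    public static void main(String[] args) {
        Id epoch1 = new Id((1L << SEQ_BITS) | 7); // epoch 1, sequence 7
        Id epoch0 = new Id(7);                    // epoch 0, sequence 7
        Set<Id> ids = new HashSet<>();
        ids.add(epoch1);
        System.out.println(epoch1.equals(epoch0)); // true
        System.out.println(ids.contains(epoch0));  // false: hash codes differ
    }
}
```

HashMap compares stored hashes before calling equals(), so the lookup fails even though the objects are "equal" — which is why the contract requires both methods to use the same fields.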
[jira] [Commented] (YARN-1342) Recover container tokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071865#comment-14071865 ] Hadoop QA commented on YARN-1342: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657361/YARN-1342v6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4406//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4406//console This message is automatically generated. 
> Recover container tokens upon nodemanager restart > - > > Key: YARN-1342 > URL: https://issues.apache.org/jira/browse/YARN-1342 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-1342.patch, YARN-1342v2.patch, > YARN-1342v3-and-YARN-1987.patch, YARN-1342v4.patch, YARN-1342v5.patch, > YARN-1342v6.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2247) Allow RM web services users to authenticate using delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071878#comment-14071878 ] Hadoop QA commented on YARN-2247: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657359/apache-yarn-2247.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4405//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4405//console This message is automatically generated. 
> Allow RM web services users to authenticate using delegation tokens > --- > > Key: YARN-2247 > URL: https://issues.apache.org/jira/browse/YARN-2247 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Blocker > Attachments: apache-yarn-2247.0.patch, apache-yarn-2247.1.patch, > apache-yarn-2247.2.patch, apache-yarn-2247.3.patch, apache-yarn-2247.4.patch > > > The RM webapp should allow users to authenticate using delegation tokens to > maintain parity with RPC. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2262) Few fields displaying wrong values in Timeline server after RM restart
[ https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishan Shetty updated YARN-2262: Attachment: yarn-testos-resourcemanager-HOST-10-18-40-84.log yarn-testos-historyserver-HOST-10-18-40-95.log Capture1.PNG Capture.PNG yarn-testos-resourcemanager-HOST-10-18-40-95.log > Few fields displaying wrong values in Timeline server after RM restart > -- > > Key: YARN-2262 > URL: https://issues.apache.org/jira/browse/YARN-2262 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: 2.4.0 >Reporter: Nishan Shetty >Assignee: Naganarasimha G R > Attachments: Capture.PNG, Capture1.PNG, > yarn-testos-historyserver-HOST-10-18-40-95.log, > yarn-testos-resourcemanager-HOST-10-18-40-84.log, > yarn-testos-resourcemanager-HOST-10-18-40-95.log > > > Few fields displaying wrong values in Timeline server after RM restart > State:null > FinalStatus: UNDEFINED > Started: 8-Jul-2014 14:58:08 > Elapsed: 2562047397789hrs, 44mins, 47sec -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2262) Few fields displaying wrong values in Timeline server after RM restart
[ https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071944#comment-14071944 ] Nishan Shetty commented on YARN-2262: - [~zjshen] Attached logs Application id is application_1406114813957_0002 > Few fields displaying wrong values in Timeline server after RM restart > -- > > Key: YARN-2262 > URL: https://issues.apache.org/jira/browse/YARN-2262 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: 2.4.0 >Reporter: Nishan Shetty >Assignee: Naganarasimha G R > Attachments: Capture.PNG, Capture1.PNG, > yarn-testos-historyserver-HOST-10-18-40-95.log, > yarn-testos-resourcemanager-HOST-10-18-40-84.log, > yarn-testos-resourcemanager-HOST-10-18-40-95.log > > > Few fields displaying wrong values in Timeline server after RM restart > State:null > FinalStatus: UNDEFINED > Started: 8-Jul-2014 14:58:08 > Elapsed: 2562047397789hrs, 44mins, 47sec -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071985#comment-14071985 ] Naganarasimha G R commented on YARN-2301: - Thanks [~zjshen], [~sunilg], [~devaraj.k] and [~jianhe] for the comments. I will start modifying as per [~zjshen]'s approach and try to provide the patch as soon as possible. > Improve yarn container command > -- > > Key: YARN-2301 > URL: https://issues.apache.org/jira/browse/YARN-2301 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Naganarasimha G R > Labels: usability > > While running the yarn container -list command, some > observations: > 1) the scheme (e.g. http/https) before LOG-URL is missing > 2) the start-time is printed in milliseconds (e.g. 1405540544844). Better to > print it in a time format. > 3) finish-time is 0 if the container is not yet finished. Maybe show "N/A" > 4) May have an option to run as yarn container -list OR yarn > application -list-containers also. > As the attempt id is not shown on the console, this makes it easier for the user to just copy > the appId and run it; it may also be useful for container-preserving AM > restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2337) remove duplication function call (setClientRMService) in resource manage class
[ https://issues.apache.org/jira/browse/YARN-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071999#comment-14071999 ] zhihai xu commented on YARN-2337: - [~ozawa] thanks for your quick response. > remove duplication function call (setClientRMService) in resource manage class > -- > > Key: YARN-2337 > URL: https://issues.apache.org/jira/browse/YARN-2337 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Minor > Attachments: YARN-2337.000.patch > > > remove duplication function call (setClientRMService) in resource manage > class. > rmContext.setClientRMService(clientRM); is duplicate in serviceInit of > ResourceManager. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2313) Livelock can occur in FairScheduler when there are lots of running apps
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072042#comment-14072042 ] Karthik Kambatla commented on YARN-2313: Thanks for the explanation, [~ozawa]. I see the issue clearly now. In that case, a better approach might be to have a single "maintenance" thread that periodically executes a bunch of runnables (reload, update, continuous-scheduling) serially. Otherwise, as we add more threads that hold onto the scheduler lock, it will be hairy to tune all of them so the scheduler can make some meaningful progress. > Livelock can occur in FairScheduler when there are lots of running apps > --- > > Key: YARN-2313 > URL: https://issues.apache.org/jira/browse/YARN-2313 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.4.1 >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Fix For: 2.6.0 > > Attachments: YARN-2313.1.patch, YARN-2313.2.patch, YARN-2313.3.patch, > YARN-2313.4.patch, rm-stack-trace.txt > > > Observed a livelock in FairScheduler when there are lots of entries in the queue. After > investigating the code, the following case can occur: > 1. {{update()}} called by UpdateThread takes longer than > UPDATE_INTERVAL (500ms) if there are lots of queues. > 2. UpdateThread goes into a busy loop. > 3. Other threads (AllocationFileReloader, > ResourceManager$SchedulerEventDispatcher) can wait forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
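Karthik's single-maintenance-thread suggestion above can be sketched with a standard `ScheduledExecutorService` — this is a hedged illustration of the idea, not the FairScheduler patch; the task bodies and intervals are placeholders:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// One thread runs all periodic maintenance tasks serially.
// scheduleWithFixedDelay measures the delay from the END of each run, so
// a slow update() pushes the next cycle back instead of degenerating into
// the busy loop described in the bug.
public class MaintenanceThreadSketch {
    static int runFor(long millis) throws InterruptedException {
        ScheduledExecutorService maintenance =
            Executors.newSingleThreadScheduledExecutor();
        AtomicInteger updates = new AtomicInteger();

        Runnable update = updates::incrementAndGet;       // recompute fair shares
        Runnable reload = () -> { /* reload allocation file */ };

        maintenance.scheduleWithFixedDelay(update, 0, 100, TimeUnit.MILLISECONDS);
        maintenance.scheduleWithFixedDelay(reload, 0, 1_000, TimeUnit.MILLISECONDS);

        TimeUnit.MILLISECONDS.sleep(millis);
        maintenance.shutdownNow();
        return updates.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runFor(350) >= 2); // update ran repeatedly, serially
    }
}
```

Because both runnables share one thread, only one of them can hold the scheduler lock at a time, which is the tuning simplification the comment is after.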
[jira] [Updated] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically
[ https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2212: Attachment: YARN-2212.2.patch > ApplicationMaster needs to find a way to update the AMRMToken periodically > -- > > Key: YARN-2212 > URL: https://issues.apache.org/jira/browse/YARN-2212 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-2212.1.patch, YARN-2212.2.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically
[ https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072086#comment-14072086 ] Xuan Gong commented on YARN-2212: - Merged YARN-2237 together. > ApplicationMaster needs to find a way to update the AMRMToken periodically > -- > > Key: YARN-2212 > URL: https://issues.apache.org/jira/browse/YARN-2212 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-2212.1.patch, YARN-2212.2.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Moved] (YARN-2341) Refactor TestCapacityScheduler to separate tests per feature
[ https://issues.apache.org/jira/browse/YARN-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer moved MAPREDUCE-723 to YARN-2341: -- Component/s: (was: capacity-sched) capacityscheduler Issue Type: Test (was: Bug) Key: YARN-2341 (was: MAPREDUCE-723) Project: Hadoop YARN (was: Hadoop Map/Reduce) > Refactor TestCapacityScheduler to separate tests per feature > > > Key: YARN-2341 > URL: https://issues.apache.org/jira/browse/YARN-2341 > Project: Hadoop YARN > Issue Type: Test > Components: capacityscheduler >Reporter: Vinod Kumar Vavilapalli > > TestCapacityScheduler has grown rapidly over time. It now has tests for > various features interspersed amongst each other. It would be helpful to > separate out tests per feature, moving out the central mock objects to a > primary test class. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2342) When killing a task, we don't always need to send a subsequent SIGKILL
[ https://issues.apache.org/jira/browse/YARN-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2342: --- Labels: newbie (was: ) > When killing a task, we don't always need to send a subsequent SIGKILL > -- > > Key: YARN-2342 > URL: https://issues.apache.org/jira/browse/YARN-2342 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Vinod Kumar Vavilapalli > Labels: newbie > > In both TaskController/LinuxTaskController, while killing tasks, first a > SIGTERM and then a subsequent SIGKILL are sent. We don't always need to send the SIGKILL. It can be avoided when the SIGTERM command (kill pid for process or > kill -- -pid for session) returns a non-zero exit code, i.e. when the signal > is not sent successfully because the process/process group doesn't exist. 'man 2 > kill' says the exit code is non-zero only when the process/process group is not alive, > an invalid signal is specified, or the caller doesn't have permission. The > last two don't happen in mapred code. -- This message was sent by Atlassian JIRA (v6.2#6252)
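The escalation logic the issue describes — SIGTERM first, SIGKILL only if the TERM was actually deliverable and the process survived — can be sketched in Java with `Process`/`ProcessHandle` (the real TaskController shells out to `kill(2)` instead; this is only an illustration and assumes a POSIX system with a `sleep` binary):

```java
import java.util.concurrent.TimeUnit;

// Sketch of "skip the SIGKILL when it isn't needed": escalate only when
// the SIGTERM request succeeded (the process still existed) and the
// process is still alive after a grace period.
public class GracefulKillSketch {
    static boolean killGracefully(Process p, long graceSeconds)
            throws InterruptedException {
        boolean termDelivered = p.toHandle().destroy(); // SIGTERM; may be false
                                                        // if it can't be requested
        if (termDelivered && !p.waitFor(graceSeconds, TimeUnit.SECONDS)) {
            p.destroyForcibly();                        // SIGKILL, only if needed
        }
        p.waitFor();
        return p.isAlive();
    }

    public static void main(String[] args) throws Exception {
        Process sleeper = new ProcessBuilder("sleep", "30").start();
        // A well-behaved process exits on SIGTERM, so no SIGKILL is sent.
        System.out.println(killGracefully(sleeper, 2)); // false: process is dead
    }
}
```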
[jira] [Commented] (YARN-2342) When killing a task, we don't always need to send a subsequent SIGKILL
[ https://issues.apache.org/jira/browse/YARN-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072249#comment-14072249 ] Allen Wittenauer commented on YARN-2342: Moving this to YARN, as we need to check the container executor. > When killing a task, we don't always need to send a subsequent SIGKILL > -- > > Key: YARN-2342 > URL: https://issues.apache.org/jira/browse/YARN-2342 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Vinod Kumar Vavilapalli > Labels: newbie > > In both TaskController/LinuxTaskController, while killing tasks, first a > SIGTERM and then a subsequent SIGKILL are sent. We don't always need to send the SIGKILL. It can be avoided when the SIGTERM command (kill pid for process or > kill -- -pid for session) returns a non-zero exit code, i.e. when the signal > is not sent successfully because the process/process group doesn't exist. 'man 2 > kill' says the exit code is non-zero only when the process/process group is not alive, > an invalid signal is specified, or the caller doesn't have permission. The > last two don't happen in mapred code. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Moved] (YARN-2342) When killing a task, we don't always need to send a subsequent SIGKILL
[ https://issues.apache.org/jira/browse/YARN-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer moved MAPREDUCE-780 to YARN-2342: -- Issue Type: Improvement (was: Bug) Key: YARN-2342 (was: MAPREDUCE-780) Project: Hadoop YARN (was: Hadoop Map/Reduce) > When killing a task, we don't always need to send a subsequent SIGKILL > -- > > Key: YARN-2342 > URL: https://issues.apache.org/jira/browse/YARN-2342 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Vinod Kumar Vavilapalli > Labels: newbie > > In both TaskController/LinuxTaskController, while killing tasks, first a > SIGTERM and then a subsequent SIGKILL are sent. We don't always need to send the SIGKILL. It can be avoided when the SIGTERM command (kill pid for process or > kill -- -pid for session) returns a non-zero exit code, i.e. when the signal > is not sent successfully because the process/process group doesn't exist. 'man 2 > kill' says the exit code is non-zero only when the process/process group is not alive, > an invalid signal is specified, or the caller doesn't have permission. The > last two don't happen in mapred code. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2343) Improve
Li Lu created YARN-2343: --- Summary: Improve Key: YARN-2343 URL: https://issues.apache.org/jira/browse/YARN-2343 Project: Hadoop YARN Issue Type: Improvement Reporter: Li Lu Priority: Trivial -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Moved] (YARN-2344) Provide a mechanism to pause the jobtracker
[ https://issues.apache.org/jira/browse/YARN-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer moved MAPREDUCE-828 to YARN-2344: -- Component/s: (was: jobtracker) resourcemanager Key: YARN-2344 (was: MAPREDUCE-828) Project: Hadoop YARN (was: Hadoop Map/Reduce) > Provide a mechanism to pause the jobtracker > --- > > Key: YARN-2344 > URL: https://issues.apache.org/jira/browse/YARN-2344 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Reporter: Hemanth Yamijala > > We've seen scenarios where we have needed to stop the namenode for a > maintenance activity. In such scenarios, if the jobtracker (JT) continues to > run, jobs would fail due to initialization or task failures (due to DFS). We > could restart the JT with job recovery enabled during such scenarios. But > restart has proved to be a very intrusive activity, particularly if the JT is > not at fault itself and does not require a restart. The ask is for an > admin-controlled feature to pause the JT, which would take it to a state > somewhat analogous to the safe mode of DFS. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2344) Provide a mechanism to pause the jobtracker
[ https://issues.apache.org/jira/browse/YARN-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072323#comment-14072323 ] Allen Wittenauer commented on YARN-2344: Moving this to YARN. We still need a way to pause the ResourceManager so that it stops accepting new submissions. > Provide a mechanism to pause the jobtracker > --- > > Key: YARN-2344 > URL: https://issues.apache.org/jira/browse/YARN-2344 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Reporter: Hemanth Yamijala > > We've seen scenarios where we have needed to stop the namenode for a > maintenance activity. In such scenarios, if the jobtracker (JT) continues to > run, jobs would fail due to initialization or task failures (due to DFS). We > could restart the JT with job recovery enabled during such scenarios. But > restart has proved to be a very intrusive activity, particularly if the JT is > not at fault itself and does not require a restart. The ask is for an > admin-controlled feature to pause the JT, which would take it to a state > somewhat analogous to the safe mode of DFS. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2343) Improve error message on token expire exception
[ https://issues.apache.org/jira/browse/YARN-2343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2343: Description: Some token expire exceptions are triggered by wrong time settings on cluster nodes, but the current exception message does not explicitly address that. It would be helpful to add a message explicitly pointing out that this exception could be caused by machines out of sync in time, or even wrong time zone settings. Assignee: Li Lu Labels: usability (was: ) Summary: Improve error message on token expire exception (was: Improve) > Improve error message on token expire exception > --- > > Key: YARN-2343 > URL: https://issues.apache.org/jira/browse/YARN-2343 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Li Lu >Assignee: Li Lu >Priority: Trivial > Labels: usability > > Some token expire exceptions are triggered by wrong time settings on cluster > nodes, but the current exception message does not explicitly address that. It > would be helpful to add a message explicitly pointing out that this > exception could be caused by machines out of sync in time, or even wrong time > zone settings. -- This message was sent by Atlassian JIRA (v6.2#6252)
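The shape of the improvement the issue asks for is simple: when a token is rejected as expired, the message should name clock skew and time-zone misconfiguration as likely causes. A hypothetical sketch — this is not the attached patch, and the method and parameter names are invented for illustration:

```java
// Illustrative only: build an "expired token" message that includes the
// clock-skew hint requested in this issue.
public class TokenExpiryMessageSketch {
    static String expiredTokenMessage(String tokenKind, long expiryMillis,
                                      long nowMillis) {
        return tokenKind + " has expired (expiry=" + expiryMillis
            + " ms, now=" + nowMillis + " ms). If this is unexpected, check"
            + " that clocks and time zones are in sync across cluster nodes"
            + " (e.g. via NTP).";
    }

    public static void main(String[] args) {
        System.out.println(expiredTokenMessage("AMRMToken",
            1_406_100_000_000L, System.currentTimeMillis()));
    }
}
```

Including both timestamps in the message lets an operator see the skew directly instead of guessing at it.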
[jira] [Updated] (YARN-2343) Improve error message on token expire exception
[ https://issues.apache.org/jira/browse/YARN-2343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2343: Attachment: YARN-2343-072314.patch Adding a message pointing out that the token expire exception could be caused by machines out of sync or wrong timezone settings. > Improve error message on token expire exception > --- > > Key: YARN-2343 > URL: https://issues.apache.org/jira/browse/YARN-2343 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Li Lu >Assignee: Li Lu >Priority: Trivial > Labels: usability > Attachments: YARN-2343-072314.patch > > > Some token expire exceptions are triggered by wrong time settings on cluster > nodes, but the current exception message does not explicitly address that. It > would be helpful to add a message explicitly pointing out that this > exception could be caused by machines out of sync in time, or even wrong time > zone settings. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2345) yarn rmadin -report
Allen Wittenauer created YARN-2345: -- Summary: yarn rmadin -report Key: YARN-2345 URL: https://issues.apache.org/jira/browse/YARN-2345 Project: Hadoop YARN Issue Type: Improvement Reporter: Allen Wittenauer It would be good to have an equivalent of hdfs dfsadmin -report in YARN. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072356#comment-14072356 ] Siddharth Seth commented on YARN-2229: -- [~ozawa] - I was primarily looking at this from a backward compatibility perspective. Will leave the decision to go with the current approach or adding a hidden field to you, Jian and Zhijie. > ContainerId can overflow with RM restart > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229.1.patch, YARN-2229.10.patch, > YARN-2229.10.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, > YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, > YARN-2229.8.patch, YARN-2229.9.patch > > > On YARN-2052, we changed containerId format: upper 10 bits are for epoch, > lower 22 bits are for sequence number of Ids. This is for preserving > semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, > {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and > {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM > restarts 1024 times. > To avoid the problem, its better to make containerId long. We need to define > the new format of container Id with preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2345) yarn rmadmin -report
[ https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2345: --- Summary: yarn rmadmin -report (was: yarn rmadin -report) > yarn rmadmin -report > > > Key: YARN-2345 > URL: https://issues.apache.org/jira/browse/YARN-2345 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Allen Wittenauer > > It would be good to have an equivalent of hdfs dfsadmin -report in YARN. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2345) yarn rmadmin -report
[ https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2345: --- Labels: newbie (was: ) > yarn rmadmin -report > > > Key: YARN-2345 > URL: https://issues.apache.org/jira/browse/YARN-2345 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Allen Wittenauer > Labels: newbie > > It would be good to have an equivalent of hdfs dfsadmin -report in YARN. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2346) Add a 'status' command to yarn-daemon.sh
Nikunj Bansal created YARN-2346: --- Summary: Add a 'status' command to yarn-daemon.sh Key: YARN-2346 URL: https://issues.apache.org/jira/browse/YARN-2346 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.1, 2.4.0, 2.3.0, 2.2.0 Reporter: Nikunj Bansal Priority: Minor Adding a 'status' command to yarn-daemon.sh will be useful for finding out the status of yarn daemons. Running the 'status' command should exit with a 0 exit code if the target daemon is running and with a non-zero code in case it's not. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2346) Add a 'status' command to yarn-daemon.sh
[ https://issues.apache.org/jira/browse/YARN-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikunj Bansal updated YARN-2346: Affects Version/s: 2.2.1 > Add a 'status' command to yarn-daemon.sh > > > Key: YARN-2346 > URL: https://issues.apache.org/jira/browse/YARN-2346 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.2.0, 2.3.0, 2.2.1, 2.4.0, 2.4.1 >Reporter: Nikunj Bansal >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > Adding a 'status' command to yarn-daemon.sh will be useful for finding out > the status of yarn daemons. > Running the 'status' command should exit with a 0 exit code if the target > daemon is running and with a non-zero code in case it's not. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2147) client lacks delegation token exception details when application submit fails
[ https://issues.apache.org/jira/browse/YARN-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072390#comment-14072390 ] Jason Lowe commented on YARN-2147: -- The test timeouts were an artifact of a period where we were requiring each test to have a timeout to work around a surefire timeout bug, but we no longer need each test to have one. It's not going to hurt if present even for tests that shouldn't need them as long as the timeout is reasonable for the test. +1 lgtm. Committing this. > client lacks delegation token exception details when application submit fails > - > > Key: YARN-2147 > URL: https://issues.apache.org/jira/browse/YARN-2147 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Jason Lowe >Assignee: Chen He >Priority: Minor > Attachments: YARN-2147-v2.patch, YARN-2147-v3.patch, > YARN-2147-v4.patch, YARN-2147-v5.patch, YARN-2147.patch > > > When an client submits an application and the delegation token process fails > the client can lack critical details needed to understand the nature of the > error. Only the message of the error exception is conveyed to the client, > which sometimes isn't enough to debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2343) Improve error message on token expire exception
[ https://issues.apache.org/jira/browse/YARN-2343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072394#comment-14072394 ] Hadoop QA commented on YARN-2343: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657437/YARN-2343-072314.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4407//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4407//console This message is automatically generated. 
> Improve error message on token expire exception > --- > > Key: YARN-2343 > URL: https://issues.apache.org/jira/browse/YARN-2343 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Li Lu >Assignee: Li Lu >Priority: Trivial > Labels: usability > Attachments: YARN-2343-072314.patch > > > Some token expire exceptions are triggered by wrong time settings on cluster > nodes, but the current exception message does not explicitly address that. It > would be helpful to add a message explicitly pointing out that this > exception could be caused by machines out of sync in time, or even wrong time > zone settings. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2214) preemptContainerPreCheck() in FSParentQueue delays convergence towards fairness
[ https://issues.apache.org/jira/browse/YARN-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072409#comment-14072409 ] Ashwin Shankar commented on YARN-2214: -- [~kasha], [~sandyr] Can one of you please take a look at this one? Thanks in advance! > preemptContainerPreCheck() in FSParentQueue delays convergence towards > fairness > --- > > Key: YARN-2214 > URL: https://issues.apache.org/jira/browse/YARN-2214 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.5.0 >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Attachments: YARN-2214-v1.txt > > > preemptContainerPreCheck() in FSParentQueue rejects preemption requests if > the parent queue is below its fair share. This can cause a delay in converging > towards fairness when the starved leaf queue and the queue above fair share > belong under a non-root parent queue (i.e., their least common ancestor is a > parent queue which is not root). > Here is an example: > root.parent has fair share = 80% and usage = 80% > root.parent.child1 has fair share = 40%, usage = 80% > root.parent.child2 has fair share = 40%, usage = 0% > Now a job is submitted to child2 and the demand is 40%. > Preemption will kick in and try to reclaim all the 40% from child1. > When it preempts the first container from child1, the usage of root.parent > will become < 80%, which is less than root.parent's fair share, causing > preemption to stop. So only one container gets preempted in this round, > although the need is a lot more. child2 would eventually get to half its fair > share, but only after multiple rounds of preemption. > The solution is to remove preemptContainerPreCheck() from FSParentQueue and keep it > only in FSLeafQueue (where it already exists). -- This message was sent by Atlassian JIRA (v6.2#6252)
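The convergence problem described above can be illustrated with a toy simulation (hypothetical names and percentages, not the actual FairScheduler code): with the parent-queue pre-check in place, each preemption round reclaims only one container before root.parent dips below its fair share and the round halts.

```java
// Toy model of the convergence problem described above (hypothetical
// names, not the actual FairScheduler classes). Usage is in percent of
// cluster capacity; each container is assumed to use 10%.
public class PreemptionSketch {
    static final int CONTAINER = 10;          // each container uses 10%
    static final int PARENT_FAIR_SHARE = 80;  // root.parent fair share

    // One preemption round with the FSParentQueue pre-check: stop as soon
    // as the parent queue falls below its fair share.
    static int roundWithParentCheck(int child1Usage, int need) {
        int preempted = 0;
        while (need > 0) {
            int parentUsage = child1Usage - preempted; // child2 still at 0%
            if (parentUsage < PARENT_FAIR_SHARE) {
                break;  // pre-check rejects further preemption this round
            }
            preempted += CONTAINER;
            need -= CONTAINER;
        }
        return preempted;
    }

    public static void main(String[] args) {
        // child1 usage = 80%, child2 demand = 40% (its full fair share).
        int preempted = roundWithParentCheck(80, 40);
        // Only one container (10%) is reclaimed before root.parent drops
        // below 80% and the pre-check halts the round.
        System.out.println("preempted this round: " + preempted + "%");
    }
}
```

Removing the check from the parent queue (keeping it only in FSLeafQueue, as the JIRA proposes) would let a single round reclaim the full 40%.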
[jira] [Commented] (YARN-2147) client lacks delegation token exception details when application submit fails
[ https://issues.apache.org/jira/browse/YARN-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072426#comment-14072426 ] Hudson commented on YARN-2147: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5956 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5956/]) YARN-2147. client lacks delegation token exception details when application submit fails. Contributed by Chen He (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612950) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestDelegationTokenRenewer.java > client lacks delegation token exception details when application submit fails > - > > Key: YARN-2147 > URL: https://issues.apache.org/jira/browse/YARN-2147 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Jason Lowe >Assignee: Chen He >Priority: Minor > Fix For: 3.0.0, 2.6.0 > > Attachments: YARN-2147-v2.patch, YARN-2147-v3.patch, > YARN-2147-v4.patch, YARN-2147-v5.patch, YARN-2147.patch > > > When an client submits an application and the delegation token process fails > the client can lack critical details needed to understand the nature of the > error. Only the message of the error exception is conveyed to the client, > which sometimes isn't enough to debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072507#comment-14072507 ] Li Lu commented on YARN-2314: - Yes, that makes sense. And I do agree that this is a quick fix to the problem. > ContainerManagementProtocolProxy can create thousands of threads for a large > cluster > > > Key: YARN-2314 > URL: https://issues.apache.org/jira/browse/YARN-2314 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 2.1.0-beta >Reporter: Jason Lowe >Priority: Critical > Attachments: nmproxycachefix.prototype.patch > > > ContainerManagementProtocolProxy has a cache of NM proxies, and the size of > this cache is configurable. However, the cache can grow far beyond the > configured size when running on a large cluster and blow past AM address/container > limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.2#6252)
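A generic sketch of how such a cache can be kept bounded (this is an illustration of the general LRU idea, not the attached prototype patch, which additionally must close evicted proxy connections):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Generic sketch of a size-bounded LRU cache (not the actual
// ContainerManagementProtocolProxy fix): the eldest entry is evicted
// once the configured maximum is exceeded, so the cache cannot grow
// without bound on a large cluster.
public class BoundedProxyCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize;

    public BoundedProxyCache(int maxSize) {
        super(16, 0.75f, true);  // access-order iteration gives LRU behavior
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // In a real proxy cache, the evicted proxy's connection (and its
        // threads) would have to be closed here as well.
        return size() > maxSize;
    }
}
```

The thread-count problem described in the issue comes precisely from entries that outlive the configured bound while their RPC connections stay open.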
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-415: Attachment: YARN-415.201407232237.txt [~leftnoteasy], Thank you for your reply. I have implemented the following changes in the current patch. {quote} 1. Revert changes of SchedulerAppReport; we already have changed ApplicationResourceUsageReport, and memory utilization should be a part of the resource usage report. {quote} Changes to SchedulerAppReport have been reverted. {quote} 2. Remove getMemory(VCore)Seconds from RMAppAttempt, modify RMAppAttemptMetrics#getFinishedMemory(VCore)Seconds to return completed+running resource utilization. {quote} I have removed the getters and setters from RMAppAttempt and added RMAppAttemptMetrics#getResourceUtilization, which returns a single ResourceUtilization instance that contains both memorySeconds and vcoreSeconds for the appAttempt. These include both finished and running statistics IF the appAttempt is ALSO the current attempt. If not, it only includes the finished statistics. {quote} 3. Move {code} ._("Resources:", String.format("%d MB-seconds, %d vcore-seconds", app.getMemorySeconds(), app.getVcoreSeconds())) {code} from "Application Overview" to "Application Metrics", and rename it to "Resource Seconds". It should be considered part of the application metrics instead of the overview. {quote} Changes completed. {quote} 4. Change finishedMemory/VCoreSeconds to AtomicLong in RMAppAttemptMetrics so that they can be efficiently accessed by multiple threads. {quote} Changes completed. {quote} 5. I think it's better to add a new method in SchedulerApplicationAttempt like getMemoryUtilization, which will only return memory/cpu seconds. We do this to prevent locking the scheduling thread when showing application metrics on the web UI. getMemoryUtilization will be used by RMAppAttemptMetrics#getFinishedMemory(VCore)Seconds to return completed+running resource utilization. 
And used by SchedulerApplicationAttempt#getResourceUsageReport as well. The MemoryUtilization class may contain two fields: runningContainerMemory(VCore)Seconds. {quote} Added ResourceUtilization (instead of MemoryUtilization), but did not make the other changes, as per comment: https://issues.apache.org/jira/browse/YARN-415?focusedCommentId=14071181&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14071181 {quote} 6. Since computing running-container resource utilization is not O(1), we need to scan all containers under an application. I think it's better to cache a previous compute result, and recompute it after several seconds have elapsed (maybe 1-3 seconds should be enough). {quote} I added cached values in SchedulerApplicationAttempt for memorySeconds and vcoreSeconds that are updated when 1) a request is received to calculate these metrics, AND 2) it has been more than 3 seconds since the last request. One thing I did notice when these values are cached is that there is a race where containers can get counted twice: - RMAppAttemptMetrics#getResourceUtilization sends a request to calculate running containers, and container X is almost finished. RMAppAttemptMetrics#getResourceUtilization adds the finished values to the running values and returns ResourceUtilization. - Container X completes, and its memorySeconds and vcoreSeconds are added to the finished values for the appAttempt. - RMAppAttemptMetrics#getResourceUtilization makes another request before the 3-second interval, and the cached values are added to the finished values for the appAttempt. Since both the cached values and the finished values contain metrics for container X, those are double counted until 3 seconds elapse and the next RMAppAttemptMetrics#getResourceUtilization request is made. {quote} And you can modify SchedulerApplicationAttempt#liveContainers to be a ConcurrentHashMap. 
With #6, getting memory utilization to show metrics on the web UI will not lock the scheduling thread at all. {quote} I am a little reluctant to modify the type of SchedulerApplicationAttempt#liveContainers as part of this JIRA. That seems like something that could be done separately. > Capture memory utilization at the app-level for chargeback > -- > > Key: YARN-415 > URL: https://issues.apache.org/jira/browse/YARN-415 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Affects Versions: 0.23.6 >Reporter: Kendall Thrapp >Assignee: Andrey Klochkov > Attachments: YARN-415--n10.patch, YARN-415--n2.patch, > YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, > YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, > YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, > YARN-415.201
[jira] [Commented] (YARN-2338) service assemble so complex
[ https://issues.apache.org/jira/browse/YARN-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072638#comment-14072638 ] tangjunjie commented on YARN-2338: -- Hello, Tsuyoshi OZAWA. I think the service assembly should be removed from the resourcemanager, because the main task of the resourcemanager is to allocate resources, and so on. Consider using a lightweight DI framework like Guice for the refactoring. Then the resourcemanager code will get rid of the bad code smell. Use XML or annotations to express the service assembly. For example, .. I think test code will also benefit from this refactoring, because we can easily mock a service and then inject it for tests. > service assemble so complex > --- > > Key: YARN-2338 > URL: https://issues.apache.org/jira/browse/YARN-2338 > Project: Hadoop YARN > Issue Type: Wish >Reporter: tangjunjie > > See ResourceManager > protected void serviceInit(Configuration configuration) throws Exception > So many services are assembled into the resourcemanager. > Use Guice or another service-assembly framework to refactor this complex code. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072634#comment-14072634 ] Hadoop QA commented on YARN-415: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657484/YARN-415.201407232237.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.client.TestApplicationMasterServiceOnHA org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4408//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4408//console This message is automatically generated. 
> Capture memory utilization at the app-level for chargeback > -- > > Key: YARN-415 > URL: https://issues.apache.org/jira/browse/YARN-415 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Affects Versions: 0.23.6 >Reporter: Kendall Thrapp >Assignee: Andrey Klochkov > Attachments: YARN-415--n10.patch, YARN-415--n2.patch, > YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, > YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, > YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, > YARN-415.201406262136.txt, YARN-415.201407042037.txt, > YARN-415.201407071542.txt, YARN-415.201407171553.txt, > YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.patch > > > For the purpose of chargeback, I'd like to be able to compute the cost of an > application in terms of cluster resource usage. To start out, I'd like to > get the memory utilization of an application. The unit should be MB-seconds > or something similar and, from a chargeback perspective, the memory amount > should be the memory reserved for the application, as even if the app didn't > use all that memory, no one else was able to use it. > (reserved ram for container 1 * lifetime of container 1) + (reserved ram for > container 2 * lifetime of container 2) + ... + (reserved ram for container n > * lifetime of container n) > It'd be nice to have this at the app level instead of the job level because: > 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't > appear on the job history server). > 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). > This new metric should be available both through the RM UI and RM Web > Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
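The chargeback formula quoted in the description (sum over containers of reserved RAM times lifetime) can be sketched as follows, using hypothetical container records rather than actual YARN API objects:

```java
// Sketch of the MB-seconds chargeback formula quoted above, with
// hypothetical per-container data (reserved MB and lifetime in seconds);
// not the actual YARN-415 implementation.
public class ChargebackSketch {
    static long memorySeconds(long[] reservedMb, long[] lifetimeSec) {
        long total = 0;
        for (int i = 0; i < reservedMb.length; i++) {
            // (reserved ram for container i) * (lifetime of container i)
            total += reservedMb[i] * lifetimeSec[i];
        }
        return total;
    }

    public static void main(String[] args) {
        // Two containers: 1024 MB for 60 s, 2048 MB for 30 s.
        long mbSec = memorySeconds(new long[] {1024, 2048},
                                   new long[] {60, 30});
        System.out.println(mbSec + " MB-seconds"); // 122880 MB-seconds
    }
}
```

As the description notes, it is the *reserved* memory that is charged, since the cluster could not give that memory to anyone else even if the app left it idle.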
[jira] [Commented] (YARN-2338) service assemble so complex
[ https://issues.apache.org/jira/browse/YARN-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072661#comment-14072661 ] dingjiaqi commented on YARN-2338: - Hi, Tsuyoshi OZAWA. I agree with tangjunjie. Do you need to refactor it? > service assemble so complex > --- > > Key: YARN-2338 > URL: https://issues.apache.org/jira/browse/YARN-2338 > Project: Hadoop YARN > Issue Type: Wish >Reporter: tangjunjie > > See ResourceManager > protected void serviceInit(Configuration configuration) throws Exception > So many services are assembled into the resourcemanager. > Use Guice or another service-assembly framework to refactor this complex code. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072696#comment-14072696 ] Wangda Tan commented on YARN-415: - Hi Eric, Thanks for updating your patch. I don't have major comments now. *Following are some minor comments:* 1) RMAppAttemptImpl.java 1.1 There're some irrelevant line changes in RMAppAttemptImpl, could you please revert them? Like {code} RMAppAttemptEventType.RECOVER, new AttemptRecoveredTransition()) - + {code} 1.2 getResourceUtilization: {code} +if (rmApps != null) { + RMApp app = rmApps.get(attemptId.getApplicationId()); + if (app != null) { {code} I think these two cases cannot happen; we don't need the null checks to avoid a potential bug here. {code} + ApplicationResourceUsageReport appResUsageRpt = {code} It's better to name it appResUsageReport, since rpt is not a common abbreviation of report. 2) RMContainerImpl.java 2.1 updateAttemptMetrics: {code} if (rmApps != null) { RMApp rmApp = rmApps.get(container.getApplicationAttemptId().getApplicationId()); if (rmApp != null) { {code} Again, I think the two null checks are unnecessary. 3) SchedulerApplicationAttempt.java 3.1 Some rename suggestions (please let me know if you have a better idea): CACHE_MILLI -> MEMORY_UTILIZATION_CACHE_MILLISECONDS lastTime -> lastMemoryUtilizationUpdateTime cachedMemorySeconds -> lastMemorySeconds same for cachedVCore ... 4) AppBlock.java Should we rename "Resource Seconds:" to "Resource Utilization" or something? 5) Test 5.1 I'm wondering if we need to add an end-to-end test, since we changed RMAppAttempt/RMContainerImpl/SchedulerApplicationAttempt. It could consist of submitting an application, launching several containers, and finishing the application. It's better to make the launched application contain several application attempts. While the application is running, there are multiple containers running and multiple containers finished. We can check whether the total resource utilization is as expected. *To your comments:* 1) bq. 
One thing I did notice when these values are cached is that there is a race where containers can get counted twice: I think this cannot be avoided; it should be a transient state, and Jian He and I discussed this a long time ago. But apparently, the 3-second cache makes it more than a transient state. I suggest making "lastTime" in SchedulerApplicationAttempt protected. Then in FiCaSchedulerApp/FSSchedulerApp, when removing a container from liveContainers (in the completedContainer method), you can set lastTime to a negative value like -1, and the next time we try to get the accumulated resource utilization, it will recompute all container utilization. 2) bq. I am a little reluctant to modify the type of SchedulerApplicationAttempt#liveContainers as part of this JIRA. That seems like something that could be done separately. I think that will be fine :), because the current getRunningResourceUtilization is called by getResourceUsageReport, and getResourceUsageReport is synchronized; whether or not we change liveContainers to a concurrent map, we cannot solve the locking problem. I agree to enhance it in a separate JIRA in the future. 
Thanks, Wangda > Capture memory utilization at the app-level for chargeback > -- > > Key: YARN-415 > URL: https://issues.apache.org/jira/browse/YARN-415 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Affects Versions: 0.23.6 >Reporter: Kendall Thrapp >Assignee: Andrey Klochkov > Attachments: YARN-415--n10.patch, YARN-415--n2.patch, > YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, > YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, > YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, > YARN-415.201406262136.txt, YARN-415.201407042037.txt, > YARN-415.201407071542.txt, YARN-415.201407171553.txt, > YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.patch > > > For the purpose of chargeback, I'd like to be able to compute the cost of an > application in terms of cluster resource usage. To start out, I'd like to > get the memory utilization of an application. The unit should be MB-seconds > or something similar and, from a chargeback perspective, the memory amount > should be the memory reserved for the application, as even if the app didn't > use all that memory, no one else was able to use it. > (reserved ram for container 1 * lifetime of container 1) + (reserved ram for > container 2 * lifetime of container 2) + ... + (reserved ram for container n > * lifetime of container n) > It'd be nice to have this at the app level instead of the job level because: > 1. We'd still be able to get memory usage for jobs that crashed (and wouldn
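The caching-with-invalidation scheme discussed in the comments above (recompute at most every few seconds, but reset the timestamp to -1 on container completion so the next read recomputes immediately) can be sketched like this. Names are hypothetical; this is not the actual YARN-415 patch:

```java
// Sketch of the cache-invalidation idea discussed above (hypothetical
// names, not the actual YARN-415 patch): cached memory-seconds are
// recomputed at most every CACHE_MILLIS, but completing a container
// resets the timestamp so the next read recomputes immediately,
// avoiding the double-counting race.
public class UtilizationCacheSketch {
    static final long CACHE_MILLIS = 3000;

    private long lastUpdateTime = -1;   // -1 forces a recompute
    private long cachedMemorySeconds = 0;

    // "liveMemorySeconds" stands in for the (non-O(1)) scan over
    // liveContainers that the real code would perform.
    synchronized long getMemorySeconds(long now, long liveMemorySeconds) {
        if (lastUpdateTime < 0 || now - lastUpdateTime > CACHE_MILLIS) {
            cachedMemorySeconds = liveMemorySeconds;  // recompute
            lastUpdateTime = now;
        }
        return cachedMemorySeconds;
    }

    // Called when a container completes: invalidate the cache so its
    // usage is not counted both as "running" (cached) and "finished".
    synchronized void onContainerCompleted() {
        lastUpdateTime = -1;
    }
}
```

Without the invalidation hook, a container finishing inside the cache window is counted in both the cached running total and the finished total until the window expires, which is exactly the race Eric describes.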
[jira] [Commented] (YARN-2277) Add Cross-Origin support to the ATS REST API
[ https://issues.apache.org/jira/browse/YARN-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072704#comment-14072704 ] Jonathan Eagles commented on YARN-2277: --- [~zjshen], [~vinodkv], do you have any comments or concerns with the approach above? Would like to get some feed back soon since TEZ-8 is basing work off of the CORS patch above. > Add Cross-Origin support to the ATS REST API > > > Key: YARN-2277 > URL: https://issues.apache.org/jira/browse/YARN-2277 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles > Attachments: YARN-2277-CORS.patch, YARN-2277-JSONP.patch > > > As the Application Timeline Server is not provided with built-in UI, it may > make sense to enable JSONP or CORS Rest API capabilities to allow for remote > UI to access the data directly via javascript without cross side server > browser blocks coming into play. > Example client may be like > http://api.jquery.com/jQuery.getJSON/ > This can alleviate the need to create a local proxy cache. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1342) Recover container tokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072707#comment-14072707 ] Junping Du commented on YARN-1342: -- Thanks for updating the patch, [~jlowe]! Patch looks good to me. Hey [~devaraj.k], if you don't have additional comments, I will commit it tomorrow. > Recover container tokens upon nodemanager restart > - > > Key: YARN-1342 > URL: https://issues.apache.org/jira/browse/YARN-1342 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-1342.patch, YARN-1342v2.patch, > YARN-1342v3-and-YARN-1987.patch, YARN-1342v4.patch, YARN-1342v5.patch, > YARN-1342v6.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2347) Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in yarn-server-common
Junping Du created YARN-2347: Summary: Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in yarn-server-common Key: YARN-2347 URL: https://issues.apache.org/jira/browse/YARN-2347 Project: Hadoop YARN Issue Type: Sub-task Reporter: Junping Du Assignee: Junping Du We have similar things for version state for RM, NM, TS (TimelineServer), etc. I think we should consolidate them into a common object. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2300) Document better sample requests for RM web services for submitting apps
[ https://issues.apache.org/jira/browse/YARN-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072722#comment-14072722 ] Zhijie Shen commented on YARN-2300: --- +1 LGTM, will commit the patch > Document better sample requests for RM web services for submitting apps > --- > > Key: YARN-2300 > URL: https://issues.apache.org/jira/browse/YARN-2300 > Project: Hadoop YARN > Issue Type: Improvement > Components: documentation >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: apache-yarn-2300.0.patch > > > The documentation for RM web services should provide better examples for app > submission. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2347) Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in yarn-server-common
[ https://issues.apache.org/jira/browse/YARN-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-2347: - Attachment: YARN-2347.patch > Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in > yarn-server-common > > > Key: YARN-2347 > URL: https://issues.apache.org/jira/browse/YARN-2347 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Junping Du >Assignee: Junping Du > Attachments: YARN-2347.patch > > > We have similar things for version state for RM, NM, TS (TimelineServer), > etc. I think we should consolidate them into a common object. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2300) Document better sample requests for RM web services for submitting apps
[ https://issues.apache.org/jira/browse/YARN-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072745#comment-14072745 ] Hudson commented on YARN-2300: -- FAILURE: Integrated in Hadoop-trunk-Commit #5957 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5957/]) YARN-2300. Improved the documentation of the sample requests for RM REST API - submitting an app. Contributed by Varun Vasudev. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1612981) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm > Document better sample requests for RM web services for submitting apps > --- > > Key: YARN-2300 > URL: https://issues.apache.org/jira/browse/YARN-2300 > Project: Hadoop YARN > Issue Type: Improvement > Components: documentation >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Fix For: 2.5.0 > > Attachments: apache-yarn-2300.0.patch > > > The documentation for RM web services should provide better examples for app > submission. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072751#comment-14072751 ] Arpit Agarwal commented on YARN-1994: - +1 for the v6 patch. I will hold off on committing until Vinod or another YARN committer can sanity check the changes. Thanks [~cwelch] and [~mipoto]! I think there is a Windows line-endings issue with the patch hence Jenkins failed to pick it up. I was able to apply it with _git apply -p0 --whitespace=fix_ > Expose YARN/MR endpoints on multiple interfaces > --- > > Key: YARN-1994 > URL: https://issues.apache.org/jira/browse/YARN-1994 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager, webapp >Affects Versions: 2.4.0 >Reporter: Arpit Agarwal >Assignee: Craig Welch > Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.2.patch, > YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch > > > YARN and MapReduce daemons currently do not support specifying a wildcard > address for the server endpoints. This prevents the endpoints from being > accessible from all interfaces on a multihomed machine. > Note that if we do specify INADDR_ANY for any of the options, it will break > clients as they will attempt to connect to 0.0.0.0. We need a solution that > allows specifying a hostname or IP-address for clients while requesting > wildcard bind for the servers. > (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2348) ResourceManager web UI should display locale time instead of UTC time
Leitao Guo created YARN-2348: Summary: ResourceManager web UI should display locale time instead of UTC time Key: YARN-2348 URL: https://issues.apache.org/jira/browse/YARN-2348 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: Leitao Guo Attachments: 1.before-change.jpg, 2.after-change.jpg The ResourceManager web UI, including the application list and scheduler pages, displays UTC time by default, which will confuse users who do not use UTC time. The web UI should display the user's local time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2348) ResourceManager web UI should display locale time instead of UTC time
[ https://issues.apache.org/jira/browse/YARN-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leitao Guo updated YARN-2348: - Attachment: 2.after-change.jpg > ResourceManager web UI should display locale time instead of UTC time > - > > Key: YARN-2348 > URL: https://issues.apache.org/jira/browse/YARN-2348 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.1 >Reporter: Leitao Guo > Attachments: 1.before-change.jpg, 2.after-change.jpg > > > ResourceManager web UI, including application list and scheduler, displays > UTC time in default, this will confuse users who do not use UTC time. This > web UI should display local time of users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2348) ResourceManager web UI should display locale time instead of UTC time
[ https://issues.apache.org/jira/browse/YARN-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leitao Guo updated YARN-2348: - Attachment: 1.before-change.jpg > ResourceManager web UI should display locale time instead of UTC time > - > > Key: YARN-2348 > URL: https://issues.apache.org/jira/browse/YARN-2348 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.1 >Reporter: Leitao Guo > Attachments: 1.before-change.jpg, 2.after-change.jpg > > > ResourceManager web UI, including application list and scheduler, displays > UTC time in default, this will confuse users who do not use UTC time. This > web UI should display local time of users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2348) ResourceManager web UI should display locale time instead of UTC time
[ https://issues.apache.org/jira/browse/YARN-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leitao Guo updated YARN-2348: - Attachment: YARN-2348.patch Please take a look at the patch. > ResourceManager web UI should display locale time instead of UTC time > - > > Key: YARN-2348 > URL: https://issues.apache.org/jira/browse/YARN-2348 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.1 >Reporter: Leitao Guo > Attachments: 1.before-change.jpg, 2.after-change.jpg, YARN-2348.patch > > > ResourceManager web UI, including application list and scheduler, displays > UTC time in default, this will confuse users who do not use UTC time. This > web UI should display local time of users. -- This message was sent by Atlassian JIRA (v6.2#6252)
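The general idea behind such a fix can be sketched as follows. This is a generic illustration of formatting an epoch timestamp in the viewer's local time zone instead of UTC, not the attached patch (which would change the web UI's rendering code):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Generic sketch (not the actual YARN-2348 patch): render an epoch
// timestamp in a chosen time zone instead of always in UTC.
public class LocalTimeSketch {
    static String format(long epochMillis, TimeZone tz) {
        SimpleDateFormat fmt =
            new SimpleDateFormat("EEE MMM dd HH:mm:ss zzz yyyy");
        fmt.setTimeZone(tz);
        return fmt.format(new Date(epochMillis));
    }

    public static void main(String[] args) {
        long start = 1406160000000L;  // a hypothetical application start time
        System.out.println("UTC:   " + format(start, TimeZone.getTimeZone("UTC")));
        System.out.println("Local: " + format(start, TimeZone.getDefault()));
    }
}
```

In a web UI, the conversion is usually done client-side (in the browser, which knows the viewer's time zone) rather than server-side, since different viewers may be in different zones.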
[jira] [Commented] (YARN-2347) Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in yarn-server-common
[ https://issues.apache.org/jira/browse/YARN-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072770#comment-14072770 ] Hadoop QA commented on YARN-2347: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657521/YARN-2347.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4409//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4409//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4409//console This message is automatically generated. 
> Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in > yarn-server-common > > > Key: YARN-2347 > URL: https://issues.apache.org/jira/browse/YARN-2347 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Junping Du >Assignee: Junping Du > Attachments: YARN-2347.patch > > > We have similar things for version state for RM, NM, TS (TimelineServer), > etc. I think we should consolidate them into a common object. -- This message was sent by Atlassian JIRA (v6.2#6252)