[jira] [Commented] (YARN-2270) TestFSDownload#testDownloadPublicWithStatCache fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061777#comment-14061777 ] Hadoop QA commented on YARN-2270: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12655703/YARN-2270.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4302//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4302//console This message is automatically generated. > TestFSDownload#testDownloadPublicWithStatCache fails in trunk > - > > Key: YARN-2270 > URL: https://issues.apache.org/jira/browse/YARN-2270 > Project: Hadoop YARN > Issue Type: Test >Affects Versions: 2.4.1 >Reporter: Ted Yu >Assignee: Akira AJISAKA >Priority: Minor > Attachments: YARN-2270.patch > > > From https://builds.apache.org/job/Hadoop-yarn-trunk/608/console : > {code} > Running org.apache.hadoop.yarn.util.TestFSDownload > Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.955 sec <<< > FAILURE! 
- in org.apache.hadoop.yarn.util.TestFSDownload > testDownloadPublicWithStatCache(org.apache.hadoop.yarn.util.TestFSDownload) > Time elapsed: 0.137 sec <<< FAILURE! > java.lang.AssertionError: null > at org.junit.Assert.fail(Assert.java:86) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertTrue(Assert.java:52) > at > org.apache.hadoop.yarn.util.TestFSDownload.testDownloadPublicWithStatCache(TestFSDownload.java:363) > {code} > Similar error can be seen here: > https://builds.apache.org/job/PreCommit-YARN-Build/4243//testReport/org.apache.hadoop.yarn.util/TestFSDownload/testDownloadPublicWithStatCache/ > Looks like future.get() returned null. -- This message was sent by Atlassian JIRA (v6.2#6252)
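The suspected failure mode (future.get() returning null, which then trips an assertTrue downstream) can be reproduced in isolation. The following is a minimal standalone sketch, not the FSDownload or TestFSDownload code: if the Callable submitted to an executor returns null, the Future's result is null as well.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureNullSketch {
    // Returns whatever the submitted Callable produced; a Callable that
    // returns null makes future.get() null too, and any assertion built on
    // that result will fail exactly as in the stack trace above.
    public static Object runTask() {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        try {
            Future<Object> future = exec.submit(() -> null); // stand-in for the stat-cache task
            return future.get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            exec.shutdown();
        }
    }
}
```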
[jira] [Updated] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-1408: -- Attachment: (was: Yarn-1408.11.patch) > Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins > -- > > Key: YARN-1408 > URL: https://issues.apache.org/jira/browse/YARN-1408 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager > Affects Versions: 2.2.0 > Reporter: Sunil G > Assignee: Sunil G > Attachments: Yarn-1408.1.patch, Yarn-1408.10.patch, Yarn-1408.11.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.5.patch, Yarn-1408.6.patch, Yarn-1408.7.patch, Yarn-1408.8.patch, Yarn-1408.9.patch, Yarn-1408.patch > > > Capacity preemption is enabled as follows: > * yarn.resourcemanager.scheduler.monitor.enable=true > * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy > Queues = a, b > Capacity of Queue A = 80% > Capacity of Queue B = 20% > Step 1: Submit a big jobA to queue a, which uses the full cluster capacity. > Step 2: Submit a jobB to queue b, which would use less than 20% of the cluster capacity. > The jobA task using queue b's capacity is preempted and killed. > This caused the following problem: > 1. A new container was allocated for jobA in Queue A as per a node update from an NM. > 2. This container was immediately preempted. > Here, an "ACQUIRED at KILLED" invalid state exception occurred when the next AM heartbeat reached the RM: > ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state > org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED > This also caused the task to time out after 30 minutes, as the container was already killed by preemption. > attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
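The scenario above can be illustrated with a toy state machine (this is a standalone sketch, not Hadoop's StateMachineFactory or RMContainerImpl): an ACQUIRED event delivered by a late AM heartbeat after preemption has already moved the container to KILLED. Registering an explicit no-op transition for (KILLED, ACQUIRED) turns the crash into a harmless ignore; whether that is the right fix for RMContainerImpl is what the attached patches address.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.Map;

public class ContainerStateSketch {
    public enum State { NEW, ALLOCATED, ACQUIRED, KILLED }
    public enum Event { ALLOCATE, ACQUIRE, KILL }

    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
    private State state = State.NEW;

    public void addTransition(State from, Event on, State to) {
        table.computeIfAbsent(from, k -> new EnumMap<>(Event.class)).put(on, to);
    }

    public State handle(Event e) {
        State next = table.getOrDefault(state, Collections.emptyMap()).get(e);
        if (next == null) {
            // Mirrors the InvalidStateTransitonException in the log above.
            throw new IllegalStateException("Invalid event: " + e + " at " + state);
        }
        state = next;
        return state;
    }

    public static State demo() {
        ContainerStateSketch c = new ContainerStateSketch();
        c.addTransition(State.NEW, Event.ALLOCATE, State.ALLOCATED);
        c.addTransition(State.ALLOCATED, Event.KILL, State.KILLED);
        // Without this registration, the late ACQUIRE below would throw.
        c.addTransition(State.KILLED, Event.ACQUIRE, State.KILLED);
        c.handle(Event.ALLOCATE);
        c.handle(Event.KILL);           // preemption kills the container
        return c.handle(Event.ACQUIRE); // late AM heartbeat: ignored, stays KILLED
    }
}
```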
[jira] [Updated] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-1408: -- Attachment: Yarn-1408.11.patch The test case failures are in webapp and are due to a connection bind exception. I corrected the visibility as mentioned by [~jianhe]. Attaching the patch again to re-run the test cases. > Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins > -- > > Key: YARN-1408 > URL: https://issues.apache.org/jira/browse/YARN-1408 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager > Affects Versions: 2.2.0 > Reporter: Sunil G > Assignee: Sunil G > Attachments: Yarn-1408.1.patch, Yarn-1408.10.patch, Yarn-1408.11.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.5.patch, Yarn-1408.6.patch, Yarn-1408.7.patch, Yarn-1408.8.patch, Yarn-1408.9.patch, Yarn-1408.patch > > > Capacity preemption is enabled as follows: > * yarn.resourcemanager.scheduler.monitor.enable=true > * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy > Queues = a, b > Capacity of Queue A = 80% > Capacity of Queue B = 20% > Step 1: Submit a big jobA to queue a, which uses the full cluster capacity. > Step 2: Submit a jobB to queue b, which would use less than 20% of the cluster capacity. > The jobA task using queue b's capacity is preempted and killed. > This caused the following problem: > 1. A new container was allocated for jobA in Queue A as per a node update from an NM. > 2. This container was immediately preempted. > Here, an "ACQUIRED at KILLED" invalid state exception occurred when the next AM heartbeat reached the RM: > ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state > org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED > This also caused the task to time out after 30 minutes, as the container was already killed by preemption. > attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2287) Add audit log levels for NM and RM
Varun Saxena created YARN-2287: -- Summary: Add audit log levels for NM and RM Key: YARN-2287 URL: https://issues.apache.org/jira/browse/YARN-2287 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Affects Versions: 2.4.1 Reporter: Varun Saxena NM and RM audit logging can be done based on log level, as some of the audit logs, especially the container audit logs, appear too many times. By introducing log levels, certain audit logs can be suppressed if not required in a deployment. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2033: -- Attachment: ProposalofStoringYARNMetricsintotheTimelineStore.pdf > Investigate merging generic-history into the Timeline Store > --- > > Key: YARN-2033 > URL: https://issues.apache.org/jira/browse/YARN-2033 > Project: Hadoop YARN > Issue Type: Sub-task > Reporter: Vinod Kumar Vavilapalli > Assignee: Vinod Kumar Vavilapalli > Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf > > > Having two different stores isn't amenable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. > One goal is to try to keep the client-side interfaces as close to what we have today as possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061873#comment-14061873 ] Zhijie Shen commented on YARN-2033: --- Reassigning the ticket to myself. > Investigate merging generic-history into the Timeline Store > --- > > Key: YARN-2033 > URL: https://issues.apache.org/jira/browse/YARN-2033 > Project: Hadoop YARN > Issue Type: Sub-task > Reporter: Vinod Kumar Vavilapalli > Assignee: Zhijie Shen > Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf > > > Having two different stores isn't amenable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. > One goal is to try to keep the client-side interfaces as close to what we have today as possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen reassigned YARN-2033: - Assignee: Zhijie Shen (was: Vinod Kumar Vavilapalli) > Investigate merging generic-history into the Timeline Store > --- > > Key: YARN-2033 > URL: https://issues.apache.org/jira/browse/YARN-2033 > Project: Hadoop YARN > Issue Type: Sub-task > Reporter: Vinod Kumar Vavilapalli > Assignee: Zhijie Shen > Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf > > > Having two different stores isn't amenable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. > One goal is to try to keep the client-side interfaces as close to what we have today as possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061872#comment-14061872 ] Hadoop QA commented on YARN-1408: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12655718/Yarn-1408.11.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4303//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4303//console This message is automatically generated. 
> Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins > -- > > Key: YARN-1408 > URL: https://issues.apache.org/jira/browse/YARN-1408 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager > Affects Versions: 2.2.0 > Reporter: Sunil G > Assignee: Sunil G > Attachments: Yarn-1408.1.patch, Yarn-1408.10.patch, Yarn-1408.11.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.5.patch, Yarn-1408.6.patch, Yarn-1408.7.patch, Yarn-1408.8.patch, Yarn-1408.9.patch, Yarn-1408.patch > > > Capacity preemption is enabled as follows: > * yarn.resourcemanager.scheduler.monitor.enable=true > * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy > Queues = a, b > Capacity of Queue A = 80% > Capacity of Queue B = 20% > Step 1: Submit a big jobA to queue a, which uses the full cluster capacity. > Step 2: Submit a jobB to queue b, which would use less than 20% of the cluster capacity. > The jobA task using queue b's capacity is preempted and killed. > This caused the following problem: > 1. A new container was allocated for jobA in Queue A as per a node update from an NM. > 2. This container was immediately preempted. > Here, an "ACQUIRED at KILLED" invalid state exception occurred when the next AM heartbeat reached the RM: > ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state > org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED > This also caused the task to time out after 30 minutes, as the container was already killed by preemption. > attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2033: -- Attachment: YARN-2033.Prototype.patch Uploading the proposal of changes and the demo code. > Investigate merging generic-history into the Timeline Store > --- > > Key: YARN-2033 > URL: https://issues.apache.org/jira/browse/YARN-2033 > Project: Hadoop YARN > Issue Type: Sub-task > Reporter: Vinod Kumar Vavilapalli > Assignee: Zhijie Shen > Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.Prototype.patch > > > Having two different stores isn't amenable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. > One goal is to try to keep the client-side interfaces as close to what we have today as possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061889#comment-14061889 ] Zhijie Shen commented on YARN-2033: --- bq. I would think the timeline store might have to support storing a lot more information than the history store. In that case, one might want to keep them separate? IMHO, it depends on the use cases. Either the generic or the application-specific metrics can be large, or both. The problem with keeping them separate is having two different sets of interfaces, which doubles our maintenance and upgrade effort. For example, we've implemented data retention and caching for the LevelDB-based timeline store, but generic history cannot take advantage of them unless we implement similar features for the application history store again. I don't think keeping both metric sets in the same store will cause much trouble for either. With a unified store interface, we can focus the effort on improving the store implementation, which can still isolate the two metric sets (e.g., by storing them in two tables). > Investigate merging generic-history into the Timeline Store > --- > > Key: YARN-2033 > URL: https://issues.apache.org/jira/browse/YARN-2033 > Project: Hadoop YARN > Issue Type: Sub-task > Reporter: Vinod Kumar Vavilapalli > Assignee: Zhijie Shen > Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.Prototype.patch > > > Having two different stores isn't amenable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. > One goal is to try to keep the client-side interfaces as close to what we have today as possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
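The direction argued in the comment above can be sketched as follows. This is a hypothetical illustration, not the actual YARN TimelineStore API: one store serving both generic history and application-specific timeline data, while the implementation keeps the two data sets isolated (here, one in-memory map per "table"), so features like retention and caching are implemented once and shared.

```java
import java.util.HashMap;
import java.util.Map;

public class UnifiedStoreSketch {
    // One backing structure per logical table, e.g. "generic-history" vs "timeline".
    private final Map<String, Map<String, byte[]>> tables = new HashMap<>();

    public void put(String table, String entityId, byte[] payload) {
        tables.computeIfAbsent(table, k -> new HashMap<>()).put(entityId, payload);
    }

    public byte[] get(String table, String entityId) {
        Map<String, byte[]> t = tables.get(table);
        return t == null ? null : t.get(entityId);
    }
}
```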
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061891#comment-14061891 ] Junping Du commented on YARN-1341: -- Hey [~jlowe], I also agree it is better to discuss the inconsistency scenarios for each case on separate JIRAs. However, for now, our conclusions from these discussions can only be verified in theory; there may still be bugs or issues in practice. Thus, I also suggest we have a central place to document these assumptions/conclusions from the discussions; it would help us and others in the community identify potential issues when coming up with UTs or other integration tests on negative cases later. What do you think? If you agree, we can move this documentation effort to another JIRA (an umbrella or a dedicated one, whichever you prefer) and continue the discussion on this particular case. For this particular one, the assumptions from the discussion above seem to be: if the NM restarts with stale keys, a. if currentMasterKey is stale, it will be updated and overridden soon after re-registering with the RM; nothing is affected. b. if previousMasterKey is stale, then the real previous master key is lost, so the effect is: AMs holding the real previous master key cannot connect to the NM to launch containers. c. if applicationMasterKeys are stale, then the old keys tracked in applicationMasterKeys are lost after restart; the effect is: AMs with old keys cannot connect to the NM to launch containers. I would prefer option 1 too, given the effects listed here. Anything I am missing?
> Recover NMTokens upon nodemanager restart > - > > Key: YARN-1341 > URL: https://issues.apache.org/jira/browse/YARN-1341 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, > YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2288) Data persistent in timelinestore should be versioned
Junping Du created YARN-2288: Summary: Data persistent in timelinestore should be versioned Key: YARN-2288 URL: https://issues.apache.org/jira/browse/YARN-2288 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.4.1 Reporter: Junping Du Assignee: Junping Du We have a LevelDB-backed TimelineStore; it should have a schema version to accommodate future schema changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
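A minimal sketch of the versioning idea (the key name and method are assumptions for illustration, not the eventual patch): store a schema version under a reserved key, stamp it on first start, and refuse to load data written under an incompatible version. The Map stands in for the LevelDB instance (db.get/db.put in real code).

```java
import java.util.Map;

public class SchemaVersionSketch {
    static final String VERSION_KEY = "timeline-store-version"; // hypothetical reserved key
    static final int CURRENT_VERSION = 1;

    public static boolean isCompatible(Map<String, Integer> db) {
        Integer stored = db.get(VERSION_KEY);
        if (stored == null) {             // fresh store: stamp the current version
            db.put(VERSION_KEY, CURRENT_VERSION);
            return true;
        }
        return stored == CURRENT_VERSION; // mismatch => caller must migrate or fail fast
    }
}
```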
[jira] [Updated] (YARN-2289) ApplicationHistoryStore should be versioned
[ https://issues.apache.org/jira/browse/YARN-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-2289: - Component/s: applications > ApplicationHistoryStore should be versioned > --- > > Key: YARN-2289 > URL: https://issues.apache.org/jira/browse/YARN-2289 > Project: Hadoop YARN > Issue Type: Sub-task > Components: applications >Reporter: Junping Du > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2289) ApplicationHistoryStore should be versioned
Junping Du created YARN-2289: Summary: ApplicationHistoryStore should be versioned Key: YARN-2289 URL: https://issues.apache.org/jira/browse/YARN-2289 Project: Hadoop YARN Issue Type: Sub-task Reporter: Junping Du -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2289) ApplicationHistoryStore should be versioned
[ https://issues.apache.org/jira/browse/YARN-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061905#comment-14061905 ] Junping Du commented on YARN-2289: -- Generic History Server is being refactored to be based on TimelineStore. > ApplicationHistoryStore should be versioned > --- > > Key: YARN-2289 > URL: https://issues.apache.org/jira/browse/YARN-2289 > Project: Hadoop YARN > Issue Type: Sub-task > Components: applications >Reporter: Junping Du > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2256) Too many nodemanager audit logs are generated
[ https://issues.apache.org/jira/browse/YARN-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061907#comment-14061907 ] Varun Saxena commented on YARN-2256: Adding log levels in the RM and NM is addressed by this issue. > Too many nodemanager audit logs are generated > - > > Key: YARN-2256 > URL: https://issues.apache.org/jira/browse/YARN-2256 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager > Affects Versions: 2.4.0 > Reporter: Varun Saxena > > The following audit logs are generated too many times (due to the possibility of a large number of containers): > 1. In the NM - audit logs corresponding to starting, stopping and finishing a container > 2. In the RM - audit logs corresponding to the AM allocating a container and the AM releasing a container > We can have different log levels for the NM and RM audit logs and move these successful-container logs to DEBUG. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2256) Too many nodemanager audit logs are generated
[ https://issues.apache.org/jira/browse/YARN-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061908#comment-14061908 ] Varun Saxena commented on YARN-2256: Changed the NM container logs to DEBUG level so that they don't appear in the audit logs by default. > Too many nodemanager audit logs are generated > - > > Key: YARN-2256 > URL: https://issues.apache.org/jira/browse/YARN-2256 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager > Affects Versions: 2.4.0 > Reporter: Varun Saxena > > The following audit logs are generated too many times (due to the possibility of a large number of containers): > 1. In the NM - audit logs corresponding to starting, stopping and finishing a container > 2. In the RM - audit logs corresponding to the AM allocating a container and the AM releasing a container > We can have different log levels for the NM and RM audit logs and move these successful-container logs to DEBUG. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2256) Too many nodemanager audit logs are generated
[ https://issues.apache.org/jira/browse/YARN-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2256: --- Attachment: YARN-2256.patch Please review the patch. > Too many nodemanager audit logs are generated > - > > Key: YARN-2256 > URL: https://issues.apache.org/jira/browse/YARN-2256 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager > Affects Versions: 2.4.0 > Reporter: Varun Saxena > Attachments: YARN-2256.patch > > > The following audit logs are generated too many times (due to the possibility of a large number of containers): > 1. In the NM - audit logs corresponding to starting, stopping and finishing a container > 2. In the RM - audit logs corresponding to the AM allocating a container and the AM releasing a container > We can have different log levels for the NM and RM audit logs and move these successful-container logs to DEBUG. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2287) Add audit log levels for NM and RM
[ https://issues.apache.org/jira/browse/YARN-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061919#comment-14061919 ] Varun Saxena commented on YARN-2287: I will make the following changes: 1. Create new logSuccess and logFailure methods with an additional parameter indicating the log level. This can be an enum in RMAuditLogger and NMAuditLogger. 2. The previous logSuccess method will continue printing logs at INFO level; the new method can be used to print logs at the appropriate level. > Add audit log levels for NM and RM > -- > > Key: YARN-2287 > URL: https://issues.apache.org/jira/browse/YARN-2287 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager > Affects Versions: 2.4.1 > Reporter: Varun Saxena > > NM and RM audit logging can be done based on log level, as some of the audit logs, especially the container audit logs, appear too many times. By introducing log levels, certain audit logs can be suppressed if not required in a deployment. -- This message was sent by Atlassian JIRA (v6.2#6252)
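The proposed change can be sketched as below. The names (AuditLevel, logSuccess) follow the comment above, but the exact signatures in RMAuditLogger/NMAuditLogger may differ; this sketch returns the formatted line instead of dispatching to a real logger so the routing is visible. The existing no-level overload keeps today's INFO behavior, so current callers are untouched, while per-container logs would call the new overload with DEBUG and be suppressed at the default log level.

```java
public class AuditLoggerSketch {
    public enum AuditLevel { DEBUG, INFO, WARN, ERROR }

    // Existing behavior: defaults to INFO, so current call sites are unchanged.
    public static String logSuccess(String user, String op, String target) {
        return logSuccess(user, op, target, AuditLevel.INFO);
    }

    // New overload: the caller picks the level, e.g. DEBUG for container events.
    public static String logSuccess(String user, String op, String target,
                                    AuditLevel level) {
        // A real logger would dispatch on 'level'; here we just prefix it.
        return level + " USER=" + user + " OPERATION=" + op
                + " TARGET=" + target + " RESULT=SUCCESS";
    }
}
```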
[jira] [Commented] (YARN-2256) Too many nodemanager audit logs are generated
[ https://issues.apache.org/jira/browse/YARN-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061922#comment-14061922 ] Hadoop QA commented on YARN-2256: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12655734/YARN-2256.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4304//console This message is automatically generated. > Too many nodemanager audit logs are generated > - > > Key: YARN-2256 > URL: https://issues.apache.org/jira/browse/YARN-2256 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager > Affects Versions: 2.4.0 > Reporter: Varun Saxena > Attachments: YARN-2256.patch > > > The following audit logs are generated too many times (due to the possibility of a large number of containers): > 1. In the NM - audit logs corresponding to starting, stopping and finishing a container > 2. In the RM - audit logs corresponding to the AM allocating a container and the AM releasing a container > We can have different log levels for the NM and RM audit logs and move these successful-container logs to DEBUG. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2228) TimelineServer should load pseudo authentication filter when authentication = simple
[ https://issues.apache.org/jira/browse/YARN-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061930#comment-14061930 ] Hudson commented on YARN-2228: -- FAILURE: Integrated in Hadoop-Yarn-trunk #613 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/613/]) YARN-2228. Augmented TimelineServer to load pseudo authentication filter when authentication = simple. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1610575) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/ForbiddenException.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/GenericExceptionHandler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryServer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineACLsManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineAuthenticationFilter.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineAuthenticationFilterInitializer.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/webapp/TimelineWebServices.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestMemoryApplicationHistoryStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestTimelineWebServices.java > TimelineServer should load pseudo authentication filter when authentication = > simple > > > Key: YARN-2228 > URL: https://issues.apache.org/jira/browse/YARN-2228 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Fix For: 2.6.0 > > Attachments: YARN-2228.1.patch, YARN-2228.2.patch, YARN-2228.3.patch, > YARN-2228.4.patch, YARN-2228.5.patch, YARN-2228.6.patch > > > When kerberos authentication is not enabled, we should let the timeline > server to work with pseudo authentication filter. In this way, the sever is > able to detect the request user by checking "user.name". > On the other hand, timeline client should append "user.name" in un-secure > case as well, such that ACLs can keep working in this case. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2260) Add containers to launchedContainers list in RMNode on container recovery
[ https://issues.apache.org/jira/browse/YARN-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061932#comment-14061932 ] Hudson commented on YARN-2260: -- FAILURE: Integrated in Hadoop-Yarn-trunk #613 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/613/]) YARN-2260. Fixed ResourceManager's RMNode to correctly remember containers when nodes resync during work-preserving RM restart. Contributed by Jian He. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1610557) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java > Add containers to launchedContainers list in RMNode on container recovery > - > > Key: YARN-2260 > URL: https://issues.apache.org/jira/browse/YARN-2260 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > Fix For: 2.6.0 > > Attachments: YARN-2260.1.patch, YARN-2260.2.patch > > > The justLaunchedContainers map in RMNode should be re-populated when > container is sent from NM for recovery. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1050) Document the Fair Scheduler REST API
[ https://issues.apache.org/jira/browse/YARN-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061959#comment-14061959 ] Akira AJISAKA commented on YARN-1050: - Thanks [~kj-ki] for the update.
{code}
+"childQueues": {
+"clusterResources": {
+"memory": 8192,
+"vCores": 8
{code}
A '[' bracket is needed here, since "childQueues" is a collection. Minor nit: there are some trailing whitespaces in the JSON response body. > Document the Fair Scheduler REST API > > > Key: YARN-1050 > URL: https://issues.apache.org/jira/browse/YARN-1050 > Project: Hadoop YARN > Issue Type: Improvement > Components: documentation >Reporter: Sandy Ryza >Assignee: Kenji Kikushima > Attachments: YARN-1050-2.patch, YARN-1050.patch > > > The documentation should be placed here along with the Capacity Scheduler > documentation: > http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API -- This message was sent by Atlassian JIRA (v6.2#6252)
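Akira's point can be shown with a small sketch of the corrected fragment: since a queue may have several children, "childQueues" must parse as a JSON array. The surrounding structure below is assumed for illustration; only the field names and values come from the snippet in the comment.

```python
import json

# Hypothetical corrected fragment of the documented Fair Scheduler response:
# "childQueues" is now a JSON array ('[' ... ']') holding queue objects.
corrected = json.loads("""
{
  "childQueues": [
    {
      "clusterResources": {
        "memory": 8192,
        "vCores": 8
      }
    }
  ]
}
""")

# A collection parses to a list, so documentation samples round-trip cleanly.
assert isinstance(corrected["childQueues"], list)
assert corrected["childQueues"][0]["clusterResources"]["memory"] == 8192
```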
[jira] [Updated] (YARN-2256) Too many nodemanager audit logs are generated
[ https://issues.apache.org/jira/browse/YARN-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2256: --- Attachment: (was: YARN-2256.patch) > Too many nodemanager audit logs are generated > - > > Key: YARN-2256 > URL: https://issues.apache.org/jira/browse/YARN-2256 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager >Affects Versions: 2.4.0 >Reporter: Varun Saxena > > Following audit logs are generated too many times(due to the possibility of a > large number of containers) : > 1. In NM - Audit logs corresponding to Starting, Stopping and finishing of a > container > 2. In RM - Audit logs corresponding to AM allocating a container and AM > releasing a container > We can have different log levels even for NM and RM audit logs and move these > successful container related logs to DEBUG. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2256) Too many nodemanager audit logs are generated
[ https://issues.apache.org/jira/browse/YARN-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2256: --- Attachment: YARN-2256.patch > Too many nodemanager audit logs are generated > - > > Key: YARN-2256 > URL: https://issues.apache.org/jira/browse/YARN-2256 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager >Affects Versions: 2.4.0 >Reporter: Varun Saxena > Attachments: YARN-2256.patch > > > Following audit logs are generated too many times(due to the possibility of a > large number of containers) : > 1. In NM - Audit logs corresponding to Starting, Stopping and finishing of a > container > 2. In RM - Audit logs corresponding to AM allocating a container and AM > releasing a container > We can have different log levels even for NM and RM audit logs and move these > successful container related logs to DEBUG. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2256) Too many nodemanager audit logs are generated
[ https://issues.apache.org/jira/browse/YARN-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062009#comment-14062009 ] Hadoop QA commented on YARN-2256: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12655753/YARN-2256.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4305//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4305//console This message is automatically generated. 
> Too many nodemanager audit logs are generated > - > > Key: YARN-2256 > URL: https://issues.apache.org/jira/browse/YARN-2256 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager >Affects Versions: 2.4.0 >Reporter: Varun Saxena > Attachments: YARN-2256.patch > > > Following audit logs are generated too many times(due to the possibility of a > large number of containers) : > 1. In NM - Audit logs corresponding to Starting, Stopping and finishing of a > container > 2. In RM - Audit logs corresponding to AM allocating a container and AM > releasing a container > We can have different log levels even for NM and RM audit logs and move these > successful container related logs to DEBUG. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2152) Recover missing container information
[ https://issues.apache.org/jira/browse/YARN-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062060#comment-14062060 ] Jason Lowe commented on YARN-2152: -- Yeah that's what I suspected as well, but I wanted to mention it in case I missed something. It's crucial we get the token compatibility sorted out sooner rather than later, otherwise I can see us regularly breaking compatibility between even minor versions as we tweak tokens to add features. Whenever that happens rolling upgrades will not work in practice. > Recover missing container information > - > > Key: YARN-2152 > URL: https://issues.apache.org/jira/browse/YARN-2152 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > Fix For: 2.5.0 > > Attachments: YARN-2152.1.patch, YARN-2152.1.patch, YARN-2152.2.patch, > YARN-2152.3.patch > > > Container information such as container priority and container start time > cannot be recovered because NM container today lacks such container > information to send across on NM registration when RM recovery happens -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2260) Add containers to launchedContainers list in RMNode on container recovery
[ https://issues.apache.org/jira/browse/YARN-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062075#comment-14062075 ] Hudson commented on YARN-2260: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1805 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1805/]) YARN-2260. Fixed ResourceManager's RMNode to correctly remember containers when nodes resync during work-preserving RM restart. Contributed by Jian He. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1610557) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java > Add containers to launchedContainers list in RMNode on container recovery > - > > Key: YARN-2260 > URL: https://issues.apache.org/jira/browse/YARN-2260 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > Fix For: 2.6.0 > > Attachments: YARN-2260.1.patch, YARN-2260.2.patch > > > The justLaunchedContainers map in RMNode should be re-populated when > container is sent from NM for recovery. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2228) TimelineServer should load pseudo authentication filter when authentication = simple
[ https://issues.apache.org/jira/browse/YARN-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062073#comment-14062073 ] Hudson commented on YARN-2228: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1805 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1805/]) YARN-2228. Augmented TimelineServer to load pseudo authentication filter when authentication = simple. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1610575) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/ForbiddenException.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/GenericExceptionHandler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryServer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineACLsManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineAuthenticationFilter.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineAuthenticationFilterInitializer.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/webapp/TimelineWebServices.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestMemoryApplicationHistoryStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestTimelineWebServices.java > TimelineServer should load pseudo authentication filter when authentication = > simple > > > Key: YARN-2228 > URL: https://issues.apache.org/jira/browse/YARN-2228 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Fix For: 2.6.0 > > Attachments: YARN-2228.1.patch, YARN-2228.2.patch, YARN-2228.3.patch, > YARN-2228.4.patch, YARN-2228.5.patch, YARN-2228.6.patch > > > When kerberos authentication is not enabled, we should let the timeline > server to work with pseudo authentication filter. In this way, the sever is > able to detect the request user by checking "user.name". > On the other hand, timeline client should append "user.name" in un-secure > case as well, such that ACLs can keep working in this case. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2256) Too many nodemanager audit logs are generated
[ https://issues.apache.org/jira/browse/YARN-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062079#comment-14062079 ] Varun Saxena commented on YARN-2256: Only changed the log level from INFO to DEBUG, so no tests need to be included. Manually tested the flows in which the changed audit logs appear. > Too many nodemanager audit logs are generated > - > > Key: YARN-2256 > URL: https://issues.apache.org/jira/browse/YARN-2256 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager >Affects Versions: 2.4.0 >Reporter: Varun Saxena > Attachments: YARN-2256.patch > > > Following audit logs are generated too many times(due to the possibility of a > large number of containers) : > 1. In NM - Audit logs corresponding to Starting, Stopping and finishing of a > container > 2. In RM - Audit logs corresponding to AM allocating a container and AM > releasing a container > We can have different log levels even for NM and RM audit logs and move these > successful container related logs to DEBUG. -- This message was sent by Atlassian JIRA (v6.2#6252)
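The change under review — demoting high-volume per-container success audit records from INFO to DEBUG so deployments can suppress them — can be sketched with an illustrative analogue (this is not the NMAuditLogger/RMAuditLogger code itself):

```python
import logging

audit = logging.getLogger("nm.audit")
audit.setLevel(logging.INFO)  # production default: per-container noise suppressed

# Capture what actually gets emitted so the filtering is observable.
records = []
handler = logging.Handler()
handler.emit = lambda rec: records.append(rec.getMessage())
audit.addHandler(handler)

def log_container_event(event, container_id, success=True):
    # Successful start/stop/finish events are high-volume, so log at DEBUG;
    # failures stay at INFO so they are never filtered out.
    level = logging.DEBUG if success else logging.INFO
    audit.log(level, "%s container=%s", event, container_id)

log_container_event("START", "container_01")                 # suppressed at INFO
log_container_event("START", "container_02", success=False)  # kept

assert records == ["START container=container_02"]
```

Flipping the logger to `logging.DEBUG` in a debug deployment would bring the per-container records back without any code change, which is the point of the patch.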
[jira] [Commented] (YARN-2228) TimelineServer should load pseudo authentication filter when authentication = simple
[ https://issues.apache.org/jira/browse/YARN-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062171#comment-14062171 ] Hudson commented on YARN-2228: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1832 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1832/]) YARN-2228. Augmented TimelineServer to load pseudo authentication filter when authentication = simple. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1610575) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/ForbiddenException.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/GenericExceptionHandler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryServer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineACLsManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineAuthenticationFilter.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineAuthenticationFilterInitializer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/webapp/TimelineWebServices.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestMemoryApplicationHistoryStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestTimelineWebServices.java > TimelineServer should load pseudo authentication filter when authentication = > simple > > > Key: YARN-2228 > URL: https://issues.apache.org/jira/browse/YARN-2228 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Fix For: 2.6.0 > > Attachments: YARN-2228.1.patch, YARN-2228.2.patch, YARN-2228.3.patch, > YARN-2228.4.patch, YARN-2228.5.patch, YARN-2228.6.patch > > > When kerberos authentication is not enabled, we should let the timeline > server to work with pseudo authentication filter. In this way, the sever is > able to detect the request user by checking "user.name". > On the other hand, timeline client should append "user.name" in un-secure > case as well, such that ACLs can keep working in this case. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2260) Add containers to launchedContainers list in RMNode on container recovery
[ https://issues.apache.org/jira/browse/YARN-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062173#comment-14062173 ] Hudson commented on YARN-2260: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1832 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1832/]) YARN-2260. Fixed ResourceManager's RMNode to correctly remember containers when nodes resync during work-preserving RM restart. Contributed by Jian He. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1610557) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java > Add containers to launchedContainers list in RMNode on container recovery > - > > Key: YARN-2260 > URL: https://issues.apache.org/jira/browse/YARN-2260 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > Fix For: 2.6.0 > > Attachments: YARN-2260.1.patch, YARN-2260.2.patch > > > The justLaunchedContainers map in RMNode should be re-populated when > container is sent from NM for recovery. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2045) Data persisted in NM should be versioned
[ https://issues.apache.org/jira/browse/YARN-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062212#comment-14062212 ] Jason Lowe commented on YARN-2045: -- bq. I agree the concept is not quite the same but I tend to handle them both together as either of change (protobuf schema or layout schema) will bring difficulty/risky for NMStateStoreService to load old version of data. I think lumping them together and handling them in the implementation-specific code is fine, but if the implementation is handling all the details then why is it exposed in the interface? I think the most telling point is that in the proposed patch no common code actually uses the interfaces that were added. Each implementation does its own version setting, its own compatibility check, and I assume its own marshaling in the future if necessary. The interfaces aren't called by common code. Maybe I'm not seeing the future use case of these methods? I guess it could be useful for common code to do logging/reporting of the persisted/current versions or maybe to do a very simplistic incompatibility check (e.g.: assume different major numbers means incompatible), although arguably the implementation could simply log these numbers as it initializes and is already doing an implementation-specific compatibility check. However I'm particularly doubtful of the storeVersion method as it seems like the only way to safely convert versions in the general sense is with implementation-specific code. Using the conversion pseudo-code above as an example, if we crash halfway through the conversion of a series of objects then we have a mix of old and new data on the next restart but the stored version number is still old (or vice-versa if we store the new version first then convert). In an implementation-specific approach it may be possible to make the conversion atomic, e.g.: using a batch write for the entire conversion in leveldb. 
Therefore it makes more sense to me that an implementation should be responsible for deciding when and how to update the persisted schema version. I would expect implementations to do this sort of conversion during initialization and potentially the old persisted version would never be seen since it would already be converted. Do you have an example where using the storeVersion method in the interface via implementation-independent code would be more appropriate and therefore the storeVersion method in the interface is necessary? To summarize, I can see exposing the ability to get the persisted and current state store versions in the interface for logging, etc. However I don't see how implementation-independent code can properly update the version via the interface. We're lumping both interface and implementation-specific schema changes in the same version number, and it isn't possible to do an update of multiple store objects atomically via the current interface. bq. Are you suggesting NMDBSchemaVersion to play as PBImpl directly to include raw protobuf or something else? Sort of a subset of what the PBImpl is doing. I was thinking of having NMDBSchemaVersion wrap the protobuf but in a read-only way (i.e.: no set methods, no builder stuff). If one wants to change the version number, create a new protobuf. PBImpls tend to get into trouble because they can be written, and it's simpler to treat the protobufs as immutable as they were intended. Another approach would be to simply have some static helper util methods that take two protobufs to do the compatibility checks, etc. Although I don't think we can really implement a useful isCompatibleTo check in implementation-independent code since the version number encodes implementation-specific schema information. Anyway I didn't mean to drag out this change for too long. 
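The "very simplistic incompatibility check" floated earlier (assume a different major number means incompatible) could be sketched like this; the major.minor encoding and the function name are assumptions for illustration, not the NMDBSchemaVersion API:

```python
def is_compatible(persisted: str, current: str) -> bool:
    """Simplistic schema check: same major version means the store is
    loadable (possibly after an implementation-specific conversion);
    a different major version means refuse to load the state store."""
    persisted_major, _, _ = persisted.partition(".")
    current_major, _, _ = current.partition(".")
    return persisted_major == current_major

assert is_compatible("1.0", "1.2")      # minor bump: old data still readable
assert not is_compatible("1.2", "2.0")  # major bump: incompatible layout
```

As the comment argues, anything beyond this trivial comparison ends up depending on implementation-specific schema knowledge, which is why the check arguably belongs inside the store implementation rather than in the common interface.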
I'm wondering about these interfaces since I'm a strong believer that interfaces should be minimal and necessary, and I'm having difficulty seeing how these interfaces are really going to be used. However I'm probably in the minority on these methods. If people feel strongly that these interfaces are necessary and useful then go ahead and add them. It seems to me that these interfaces will either never be called or only called for trivial reasons (e.g.: logging). However I don't think having them is going to break anything or be an unreasonable burden on an implementation, rather just extra baggage that state store implementations have to expose. As for the PBImpl, it's mostly a nit. If you really would rather keep it in I guess that's fine. We should be able to remove it later if we realize we don't have a use for it. The main change I think has to be made is the leveldb schema check should handle the original method for storing the schema. Two ways to handle that are either explicitly check for the "1.0" string before trying to parse the
[jira] [Updated] (YARN-2287) Add audit log levels for NM and RM
[ https://issues.apache.org/jira/browse/YARN-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2287: --- Attachment: YARN-2287.patch Kindly review the patch > Add audit log levels for NM and RM > -- > > Key: YARN-2287 > URL: https://issues.apache.org/jira/browse/YARN-2287 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager >Affects Versions: 2.4.1 >Reporter: Varun Saxena > Attachments: YARN-2287.patch > > > NM and RM audit logging can be done based on log level as some of the audit > logs, especially the container audit logs appear too many times. By > introducing log level, certain audit logs can be suppressed, if not required > in deployment. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2233) Implement web services to create, renew and cancel delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-2233: Attachment: apache-yarn-2233.4.patch {quote} bq.It seems to me that all API implementations should take the fulll principle name if available. I meant to replace all occurrences of getCallerUserGroupInformation(hsr), if that makes sense. {quote} Fixed this. The principal is used everywhere. {quote} bq.We should set all the fields of a DT - token, renewer, expiration-time all the time - new-token, renew-token? renewDelegationToken only returns only the expiry-time and getToken only returns the token. This is consistent with RPCs. But I think in a followup, we should fix this. Fixed. bq. You meant we will fix this in a separate JIRA? I still see renewToken not returning the entire token info. I'm okay doing it separately, just clarifying what you said.. {quote} I've fixed this for creating a new delegation token but I didn't fix it for renew token. I think it's ok to fix it as part of a separate JIRA. > Implement web services to create, renew and cancel delegation tokens > > > Key: YARN-2233 > URL: https://issues.apache.org/jira/browse/YARN-2233 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Blocker > Attachments: apache-yarn-2233.0.patch, apache-yarn-2233.1.patch, > apache-yarn-2233.2.patch, apache-yarn-2233.3.patch, apache-yarn-2233.4.patch > > > Implement functionality to create, renew and cancel delegation tokens. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062261#comment-14062261 ] Eric Payne commented on YARN-415: - Hi [~leftnoteasy]. Thank you very much for reviewing my patch. I think I understand what you are suggesting. Please let me clarify: {quote} 1) Add memory utilization to RMAppMetrics/RMAppAttemptMetrics {quote} Since every RMAppAttemptImpl object has a reference to an RMAppAttemptMetrics object, you are suggesting that I move the resource usage stats to RMAppAttemptMetrics. Also, when reporting on resource usage, use the reporting methods from RMAppAttempt and RMApp. {quote} 2) Keep running container resource utilization in SchedulerApplicationAttempt {quote} As the patch for YARN-415 currently stands, it keeps resource usage stats for both running and finished containers in the SchedulerApplicationAttempt object. Your suggestion is to keep resource usage stats only for running containers. {quote} 3) Move completed container resource calculation to RMContainerImpl#FinishTransition {quote} For completed containers, you are suggesting that the calculation be done for final resource usage stats within the RMContainerImpl#FinishTransition method and have that send the resource stats as a payload within the RMAppAttemptContainerFinishedEvent event. Then, when RMAppAttemptImpl receives the CONTAINER_FINISHED event, it would add the resource usage stats for the finished containers to those already collected within the RMAppAttemptMetrics object. Is that correct? 
> Capture memory utilization at the app-level for chargeback > -- > > Key: YARN-415 > URL: https://issues.apache.org/jira/browse/YARN-415 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Affects Versions: 0.23.6 >Reporter: Kendall Thrapp >Assignee: Andrey Klochkov > Attachments: YARN-415--n10.patch, YARN-415--n2.patch, > YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, > YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, > YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, > YARN-415.201406262136.txt, YARN-415.201407042037.txt, > YARN-415.201407071542.txt, YARN-415.patch > > > For the purpose of chargeback, I'd like to be able to compute the cost of an > application in terms of cluster resource usage. To start out, I'd like to > get the memory utilization of an application. The unit should be MB-seconds > or something similar and, from a chargeback perspective, the memory amount > should be the memory reserved for the application, as even if the app didn't > use all that memory, no one else was able to use it. > (reserved ram for container 1 * lifetime of container 1) + (reserved ram for > container 2 * lifetime of container 2) + ... + (reserved ram for container n > * lifetime of container n) > It'd be nice to have this at the app level instead of the job level because: > 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't > appear on the job history server). > 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). > This new metric should be available both through the RM UI and RM Web > Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
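The charging formula in the issue description translates almost directly into code; a small sketch with hypothetical container records (reserved memory in MB, lifetime in seconds):

```python
def memory_mb_seconds(containers):
    # (reserved MB for container i) * (lifetime of container i), summed.
    # Charges for what was reserved, not what was actually used, since
    # reserved memory was unavailable to everyone else either way.
    return sum(mem_mb * lifetime_s for mem_mb, lifetime_s in containers)

# Hypothetical app: two 1 GB containers for 60 s plus one 2 GB container for 30 s.
usage = memory_mb_seconds([(1024, 60), (1024, 60), (2048, 30)])
assert usage == 1024 * 60 * 2 + 2048 * 30  # 184320 MB-seconds
```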
[jira] [Commented] (YARN-2233) Implement web services to create, renew and cancel delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062270#comment-14062270 ] Varun Vasudev commented on YARN-2233: - [~tucu00] I'm going to file another ticket to migrate over to the hadoop-common implementation once you've committed the changes (and once support for passing tokens via headers is added). > Implement web services to create, renew and cancel delegation tokens > > > Key: YARN-2233 > URL: https://issues.apache.org/jira/browse/YARN-2233 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Blocker > Attachments: apache-yarn-2233.0.patch, apache-yarn-2233.1.patch, > apache-yarn-2233.2.patch, apache-yarn-2233.3.patch, apache-yarn-2233.4.patch > > > Implement functionality to create, renew and cancel delegation tokens. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2290) Add support for passing delegation tokens via headers for web services
Varun Vasudev created YARN-2290: --- Summary: Add support for passing delegation tokens via headers for web services Key: YARN-2290 URL: https://issues.apache.org/jira/browse/YARN-2290 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev HADOOP-10799 refactors the WebHDFS code to handle delegation tokens as part of hadoop-common. We should add support to pass delegation tokens as a header instead of as part of the URL. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2291) Timeline and RM web services should use same authentication code
Varun Vasudev created YARN-2291: --- Summary: Timeline and RM web services should use same authentication code Key: YARN-2291 URL: https://issues.apache.org/jira/browse/YARN-2291 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev The TimelineServer and the RM web services have very similar requirements and implementation for authentication via delegation tokens apart from the fact that the RM web services requires delegation tokens to be passed as a header. They should use the same code base instead of different implementations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062281#comment-14062281 ] Milan Potocnik commented on YARN-1994: -- Both TestFSDownload and TestMemoryApplicationHistoryStore pass on my box and do not seem to be related to the change. > Expose YARN/MR endpoints on multiple interfaces > --- > > Key: YARN-1994 > URL: https://issues.apache.org/jira/browse/YARN-1994 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager, webapp >Affects Versions: 2.4.0 >Reporter: Arpit Agarwal >Assignee: Craig Welch > Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.2.patch, > YARN-1994.3.patch, YARN-1994.4.patch > > > YARN and MapReduce daemons currently do not support specifying a wildcard > address for the server endpoints. This prevents the endpoints from being > accessible from all interfaces on a multihomed machine. > Note that if we do specify INADDR_ANY for any of the options, it will break > clients as they will attempt to connect to 0.0.0.0. We need a solution that > allows specifying a hostname or IP-address for clients while requesting > wildcard bind for the servers. > (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
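The requirement the issue describes — let the daemon bind the wildcard address so all interfaces are served, while clients are always given a concrete, routable hostname — can be sketched as follows. The config keys and hostnames are hypothetical, not the actual YARN properties:

```python
def resolve_addresses(conf):
    # Clients must receive a routable host, never 0.0.0.0; the server may
    # optionally bind the wildcard so every interface on a multihomed
    # machine is reachable.
    client_addr = (conf["host"], conf["port"])
    bind_host = "0.0.0.0" if conf.get("bind-wildcard") else conf["host"]
    return client_addr, (bind_host, conf["port"])

client, server = resolve_addresses(
    {"host": "rm.example.com", "port": 8032, "bind-wildcard": True})
assert client == ("rm.example.com", 8032)  # what clients connect to
assert server == ("0.0.0.0", 8032)         # what the daemon binds
```

Keeping the two addresses separate is what prevents the failure mode noted in the description, where advertising INADDR_ANY makes clients try to connect to 0.0.0.0.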
[jira] [Created] (YARN-2292) RM web services should use hadoop-common for authentication using delegation tokens
Varun Vasudev created YARN-2292: --- Summary: RM web services should use hadoop-common for authentication using delegation tokens Key: YARN-2292 URL: https://issues.apache.org/jira/browse/YARN-2292 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev HADOOP-10771 refactors the WebHDFS authentication code to hadoop-common. YARN-2290 will add support for passing delegation tokens via headers. Once support is added RM web services should use the authentication code from hadoop-common -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields
[ https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062331#comment-14062331 ] Vinod Kumar Vavilapalli commented on YARN-668: -- One other important point from the design doc at YARN-666 is to make sure that, during the upgrade, tokens are accepted by both the old and new NMs. We need some magic on the ResourceManager. > TokenIdentifier serialization should consider Unknown fields > > > Key: YARN-668 > URL: https://issues.apache.org/jira/browse/YARN-668 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Siddharth Seth >Assignee: Vinod Kumar Vavilapalli > > This would allow changing of the TokenIdentifier between versions. The > current serialization is Writable. A simple way to achieve this would be to > have a Proto object as the payload for TokenIdentifiers, instead of > individual fields. > TokenIdentifier continues to implement Writable to work with the RPC layer - > but the payload itself is serialized using PB. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2152) Recover missing container information
[ https://issues.apache.org/jira/browse/YARN-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062332#comment-14062332 ] Vinod Kumar Vavilapalli commented on YARN-2152: --- Yeah, I just realized that YARN-668 already exists for this. I made a comment there to make sure we don't miss this. I was actively thinking about it, this is one of the big pending issues for rolling upgrades.. > Recover missing container information > - > > Key: YARN-2152 > URL: https://issues.apache.org/jira/browse/YARN-2152 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > Fix For: 2.5.0 > > Attachments: YARN-2152.1.patch, YARN-2152.1.patch, YARN-2152.2.patch, > YARN-2152.3.patch > > > Container information such as container priority and container start time > cannot be recovered because NM container today lacks such container > information to send across on NM registration when RM recovery happens -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2233) Implement web services to create, renew and cancel delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-2233: Attachment: apache-yarn-2233.5.patch Uploaded new patch fixing findbug error. The test case failures are due to TestClientRMService.testForceKillApplication failing which lead to a whole bunch of subsequent tests to fail. > Implement web services to create, renew and cancel delegation tokens > > > Key: YARN-2233 > URL: https://issues.apache.org/jira/browse/YARN-2233 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Blocker > Attachments: apache-yarn-2233.0.patch, apache-yarn-2233.1.patch, > apache-yarn-2233.2.patch, apache-yarn-2233.3.patch, apache-yarn-2233.4.patch, > apache-yarn-2233.5.patch > > > Implement functionality to create, renew and cancel delegation tokens. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2100) Refactor the Timeline Server code for Kerberos + DT authentication
[ https://issues.apache.org/jira/browse/YARN-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2100: -- Summary: Refactor the Timeline Server code for Kerberos + DT authentication (was: Refactor the code of Kerberos + DT authentication) > Refactor the Timeline Server code for Kerberos + DT authentication > -- > > Key: YARN-2100 > URL: https://issues.apache.org/jira/browse/YARN-2100 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > The customized Kerberos + DT authentication of the timeline server largely > refers to that of Http FS, therefore, there're a portion of duplicate code. > We should think about refactor the code if it is necessary. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2292) RM web services should use hadoop-common for authentication using delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062379#comment-14062379 ] Vinod Kumar Vavilapalli commented on YARN-2292: --- YARN-2100 is the related ticket for Timeline Service.. > RM web services should use hadoop-common for authentication using delegation > tokens > --- > > Key: YARN-2292 > URL: https://issues.apache.org/jira/browse/YARN-2292 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Vasudev >Assignee: Varun Vasudev > > HADOOP-10771 refactors the WebHDFS authentication code to hadoop-common. > YARN-2290 will add support for passing delegation tokens via headers. Once > support is added RM web services should use the authentication code from > hadoop-common -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2100) Refactor the Timeline Server code for Kerberos + DT authentication
[ https://issues.apache.org/jira/browse/YARN-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2100: -- Target Version/s: 2.6.0 > Refactor the Timeline Server code for Kerberos + DT authentication > -- > > Key: YARN-2100 > URL: https://issues.apache.org/jira/browse/YARN-2100 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > The customized Kerberos + DT authentication of the timeline server largely > refers to that of Http FS, therefore, there're a portion of duplicate code. > We should think about refactor the code if it is necessary. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2292) RM web services should use hadoop-common for authentication using delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2292: -- Target Version/s: 2.6.0 > RM web services should use hadoop-common for authentication using delegation > tokens > --- > > Key: YARN-2292 > URL: https://issues.apache.org/jira/browse/YARN-2292 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Vasudev >Assignee: Varun Vasudev > > HADOOP-10771 refactors the WebHDFS authentication code to hadoop-common. > YARN-2290 will add support for passing delegation tokens via headers. Once > support is added RM web services should use the authentication code from > hadoop-common -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2284) Find missing config options in YarnConfiguration and yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062382#comment-14062382 ] Hadoop QA commented on YARN-2284: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12655640/YARN2284-01.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1262 javac compiler warnings (more than the trunk's current 1258 warnings). {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 2 warning messages. See https://builds.apache.org/job/PreCommit-YARN-Build/4307//artifact/trunk/patchprocess/diffJavadocWarnings.txt for details. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common: org.apache.hadoop.ipc.TestIPC org.apache.hadoop.fs.TestSymlinkLocalFSFileSystem org.apache.hadoop.fs.TestSymlinkLocalFSFileContext org.apache.hadoop.yarn.util.TestFSDownload {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4307//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4307//artifact/trunk/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4307//console This message is automatically generated. 
> Find missing config options in YarnConfiguration and yarn-default.xml > - > > Key: YARN-2284 > URL: https://issues.apache.org/jira/browse/YARN-2284 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.4.1 >Reporter: Ray Chiang >Assignee: Ray Chiang >Priority: Minor > Labels: supportability > Attachments: YARN2284-01.patch > > > YarnConfiguration has one set of properties. yarn-default.xml has another > set of properties. Ideally, there should be an automatic way to find missing > properties in either location. > This is analogous to MAPREDUCE-5130, but for yarn-default.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2233) Implement web services to create, renew and cancel delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062350#comment-14062350 ] Hadoop QA commented on YARN-2233: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12655791/apache-yarn-2233.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-auth hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. 
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4306//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4306//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4306//console This message is automatically generated. > Implement web services to create, renew and cancel delegation tokens > > > Key: YARN-2233 > URL: https://issues.apache.org/jira/browse/YARN-2233 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Blocker > Attachments: apache-yarn-2233.0.patch, apache-yarn-2233.1.patch, > apache-yarn-2233.2.patch, apache-yarn-2233.3.patch, apache-yarn-2233.4.patch > > > Implement functionality to create, renew and cancel delegation tokens. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2291) Timeline and RM web services should use same authentication code
[ https://issues.apache.org/jira/browse/YARN-2291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062392#comment-14062392 ] Vinod Kumar Vavilapalli commented on YARN-2291: --- This is likely a dup of the combination of YARN-2100 & YARN-2292. Keeping it open for now, we can close as later as is needed. > Timeline and RM web services should use same authentication code > > > Key: YARN-2291 > URL: https://issues.apache.org/jira/browse/YARN-2291 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Vasudev >Assignee: Varun Vasudev > > The TimelineServer and the RM web services have very similar requirements and > implementation for authentication via delegation tokens apart from the fact > that the RM web services requires delegation tokens to be passed as a header. > They should use the same code base instead of different implementations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062431#comment-14062431 ] Sunil G commented on YARN-796: -- Hi [~gp.leftnoteasy] Great. This feature will be a big addition to YARN. I have few thoughts on this. 1. In our use case scenarios, we are more likely to have OR and NOT. I feel combination of these labels need to be in a defined or restricted way. Result of some combinations (AND, OR and NOT) may come invalid, and some may need to be reduced. This complexity need not have to bring to RM to take a final decision. 2. *Reservation*: If a node label has many nodes under it, then there is a chance of reservation. Valid candidates may come later, so solution can be look in to this aspect also. Node Label level reservations ? 3. Centralized Configuration: If a new node is added to cluster, may be it can be started by having a label configuration in its yarn-site.xml. This may be fine I feel. your thoughts? > Allow for (admin) labels on nodes and resource-requests > --- > > Key: YARN-796 > URL: https://issues.apache.org/jira/browse/YARN-796 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Arun C Murthy >Assignee: Wangda Tan > Attachments: LabelBasedScheduling.pdf, > Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch > > > It will be useful for admins to specify labels for nodes. Examples of labels > are OS, processor architecture etc. > We should expose these labels and allow applications to specify labels on > resource-requests. > Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2219) AMs and NMs can get exceptions after recovery but before scheduler knowns apps and app-attempts
[ https://issues.apache.org/jira/browse/YARN-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2219: -- Attachment: YARN-2219.2.patch > AMs and NMs can get exceptions after recovery but before scheduler knowns > apps and app-attempts > --- > > Key: YARN-2219 > URL: https://issues.apache.org/jira/browse/YARN-2219 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Ashwin Shankar >Assignee: Jian He > Attachments: YARN-2219.1.patch, YARN-2219.2.patch > > > {code} > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart > testAppReregisterOnRMWorkPreservingRestart[0](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) > Time elapsed: 4.335 sec <<< ERROR! > java.lang.NullPointerException: null > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getTransferredContainers(AbstractYarnScheduler.java:91) > at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.registerApplicationMaster(ApplicationMasterService.java:297) > at > org.apache.hadoop.yarn.server.resourcemanager.MockAM$1.run(MockAM.java:113) > at > org.apache.hadoop.yarn.server.resourcemanager.MockAM$1.run(MockAM.java:110) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1626) > at > org.apache.hadoop.yarn.server.resourcemanager.MockAM.registerAppAttempt(MockAM.java:109) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testAppReregisterOnRMWorkPreservingRestart(TestWorkPreservingRMRestart.java:562) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2219) AMs and NMs can get exceptions after recovery but before scheduler knowns apps and app-attempts
[ https://issues.apache.org/jira/browse/YARN-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062492#comment-14062492 ] Jian He commented on YARN-2219: --- Fixed the comments bq. instead of the shouldNotifyAppAccepted nomenclature, we can say isAppRecovering and flip the logic Updated the naming for attempt also to be consistent. > AMs and NMs can get exceptions after recovery but before scheduler knowns > apps and app-attempts > --- > > Key: YARN-2219 > URL: https://issues.apache.org/jira/browse/YARN-2219 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Ashwin Shankar >Assignee: Jian He > Attachments: YARN-2219.1.patch, YARN-2219.2.patch > > > {code} > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart > testAppReregisterOnRMWorkPreservingRestart[0](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) > Time elapsed: 4.335 sec <<< ERROR! > java.lang.NullPointerException: null > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getTransferredContainers(AbstractYarnScheduler.java:91) > at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.registerApplicationMaster(ApplicationMasterService.java:297) > at > org.apache.hadoop.yarn.server.resourcemanager.MockAM$1.run(MockAM.java:113) > at > org.apache.hadoop.yarn.server.resourcemanager.MockAM$1.run(MockAM.java:110) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1626) > at > org.apache.hadoop.yarn.server.resourcemanager.MockAM.registerAppAttempt(MockAM.java:109) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testAppReregisterOnRMWorkPreservingRestart(TestWorkPreservingRMRestart.java:562) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2233) Implement web services to create, renew and cancel delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062509#comment-14062509 ] Hadoop QA commented on YARN-2233: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12655816/apache-yarn-2233.5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-auth hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4308//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4308//console This message is automatically generated. 
> Implement web services to create, renew and cancel delegation tokens > > > Key: YARN-2233 > URL: https://issues.apache.org/jira/browse/YARN-2233 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Blocker > Attachments: apache-yarn-2233.0.patch, apache-yarn-2233.1.patch, > apache-yarn-2233.2.patch, apache-yarn-2233.3.patch, apache-yarn-2233.4.patch, > apache-yarn-2233.5.patch > > > Implement functionality to create, renew and cancel delegation tokens. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2293) Scoring for NMs to identify a better candidate to launch AMs
Sunil G created YARN-2293: - Summary: Scoring for NMs to identify a better candidate to launch AMs Key: YARN-2293 URL: https://issues.apache.org/jira/browse/YARN-2293 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Sunil G Assignee: Sunil G Container exit status from NM is giving indications of reasons for its failure. Some times, it may be because of container launching problems in NM. In a heterogeneous cluster, some machines with weak hardware may cause more failures. It will be better not to launch AMs there more often. Also I would like to clear that container failures because of buggy job should not result in decreasing score. As mentioned earlier, based on exit status if a scoring mechanism is added for NMs in RM, then NMs with better scores can be given for launching AMs. Thoughts? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2293) Scoring for NMs to identify a better candidate to launch AMs
[ https://issues.apache.org/jira/browse/YARN-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062530#comment-14062530 ] Jason Lowe commented on YARN-2293: -- This sounds very similar to YARN-2005, if a bit more general. This approach sounds like it could support a "gray" area for NMs where it really doesn't like to launch AMs on a node but may choose to do so anyway if that's the only place it can find. It may be more fruitful to continue this discussion over on YARN-2005 and hash through how exit status would map to scoring adjustments, how the score would affect scheduling, and work through various corner cases. > Scoring for NMs to identify a better candidate to launch AMs > > > Key: YARN-2293 > URL: https://issues.apache.org/jira/browse/YARN-2293 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager >Reporter: Sunil G >Assignee: Sunil G > > Container exit status from NM is giving indications of reasons for its > failure. Some times, it may be because of container launching problems in NM. > In a heterogeneous cluster, some machines with weak hardware may cause more > failures. It will be better not to launch AMs there more often. Also I would > like to clear that container failures because of buggy job should not result > in decreasing score. > As mentioned earlier, based on exit status if a scoring mechanism is added > for NMs in RM, then NMs with better scores can be given for launching AMs. > Thoughts? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2219) AMs and NMs can get exceptions after recovery but before scheduler knowns apps and app-attempts
[ https://issues.apache.org/jira/browse/YARN-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062547#comment-14062547 ] Hadoop QA commented on YARN-2219: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12655840/YARN-2219.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4309//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4309//console This message is automatically generated. 
> AMs and NMs can get exceptions after recovery but before scheduler knowns > apps and app-attempts > --- > > Key: YARN-2219 > URL: https://issues.apache.org/jira/browse/YARN-2219 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Ashwin Shankar >Assignee: Jian He > Attachments: YARN-2219.1.patch, YARN-2219.2.patch > > > {code} > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart > testAppReregisterOnRMWorkPreservingRestart[0](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) > Time elapsed: 4.335 sec <<< ERROR! > java.lang.NullPointerException: null > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getTransferredContainers(AbstractYarnScheduler.java:91) > at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.registerApplicationMaster(ApplicationMasterService.java:297) > at > org.apache.hadoop.yarn.server.resourcemanager.MockAM$1.run(MockAM.java:113) > at > org.apache.hadoop.yarn.server.resourcemanager.MockAM$1.run(MockAM.java:110) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1626) > at > org.apache.hadoop.yarn.server.resourcemanager.MockAM.registerAppAttempt(MockAM.java:109) > at > org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testAppReregisterOnRMWorkPreservingRestart(TestWorkPreservingRMRestart.java:562) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2264) Race in DrainDispatcher can cause random test failures
[ https://issues.apache.org/jira/browse/YARN-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062551#comment-14062551 ] Jian He commented on YARN-2264: --- patch looks good. > Race in DrainDispatcher can cause random test failures > -- > > Key: YARN-2264 > URL: https://issues.apache.org/jira/browse/YARN-2264 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Siddharth Seth >Assignee: Li Lu > Attachments: YARN-2264-070814.patch > > > This is what can happen. > This is the potential race. > DrainDispatcher is started via serviceStart() . As a last step, this starts > the actual dispatcher thread (eventHandlingThread.start() - and returns > immediately - which means the thread may or may not have started up by the > time start returns. > Event sequence: > UserThread: calls dispatcher.getEventHandler().handle() > This sets drained = false, and a context switch happens. > DispatcherThread: starts running > DispatcherThread drained = queue.isEmpty(); -> This sets drained to true, > since Thread1 yielded before putting anything into the queue. > UserThread: actual.handle(event) - which puts the event in the queue for the > dispatcher thread to process, and returns control. > UserThread: dispatcher.await() - Since drained is true, this returns > immediately - even though there is a pending event to process. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062573#comment-14062573 ] Jian He commented on YARN-2211: --- some comments: - setCurrnetMasterKeyData, setNextMasterKeyData methods not used - change not needed? {code} |- AMRMTOKEN_SECRET_MANAGER_ROOT_ZNODE_NAME {code} - Fix “System.out.println(stateData.getCurrentTokenMasterKey());” in FileSystemRMStateStore - Test: add test in restart scenario that AM issued with rolled-over AMRMToken is still able to communicate with restarted RM. testAppAttemptTokensRestoredOnRMRestart may help writing the test. > RMStateStore needs to save AMRMToken master key for recovery when RM > restart/failover happens > -- > > Key: YARN-2211 > URL: https://issues.apache.org/jira/browse/YARN-2211 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch > > > After YARN-2208, AMRMToken can be rolled over periodically. We need to save > related Master Keys and use them to recover the AMRMToken when RM > restart/failover happens -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062578#comment-14062578 ] Hadoop QA commented on YARN-2069: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12654821/YARN-2069-trunk-5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4311//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4311//console This message is automatically generated. 
> CS queue level preemption should respect user-limits > > > Key: YARN-2069 > URL: https://issues.apache.org/jira/browse/YARN-2069 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Vinod Kumar Vavilapalli >Assignee: Mayank Bansal > Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, > YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch > > > This is different from (even if related to, and likely shares code with) > YARN-2113. > YARN-2113 focuses on making sure that even if a queue has its guaranteed > capacity, its individual users are treated in line with their limits > irrespective of when they join in. > This JIRA is about respecting user-limits while preempting containers to > balance queue capacities. -- This message was sent by Atlassian JIRA (v6.2#6252)
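The idea of respecting user-limits during preemption can be sketched as a victim-selection loop that skips any container whose removal would push its user below the user-limit. This is an illustrative model with hypothetical names, not the actual ProportionalCapacityPreemptionPolicy logic:

```python
# Illustrative sketch only (not CapacityScheduler code): when preempting to
# rebalance queue capacity, never take a container whose user would drop
# below their user-limit share as a result.

def select_preemption_victims(containers, user_usage, user_limit, needed):
    """containers: list of (user, resource) pairs, newest first.
    user_usage: dict user -> current usage (mutated as victims are chosen).
    user_limit: per-user floor; needed: resources to reclaim."""
    victims = []
    freed = 0
    for user, res in containers:
        if freed >= needed:
            break
        # Respect user-limits: skip if preemption would go below the floor.
        if user_usage[user] - res < user_limit:
            continue
        victims.append((user, res))
        user_usage[user] -= res
        freed += res
    return victims

# Example: the queue needs 4 units back; user B is already at the limit (4),
# so only user A's containers are preempted.
usage = {"A": 10, "B": 4}
print(select_preemption_victims([("A", 2), ("A", 2), ("B", 2)], usage,
                                user_limit=4, needed=4))
```

The loop stops as soon as enough resource is reclaimed, so a user sitting exactly at their limit is never touched.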
[jira] [Updated] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2211: Attachment: YARN-2211.4.patch > RMStateStore needs to save AMRMToken master key for recovery when RM > restart/failover happens > -- > > Key: YARN-2211 > URL: https://issues.apache.org/jira/browse/YARN-2211 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, > YARN-2211.4.patch > > > After YARN-2208, AMRMToken can be rolled over periodically. We need to save > related Master Keys and use them to recover the AMRMToken when RM > restart/failover happens -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062617#comment-14062617 ] Xuan Gong commented on YARN-2211: - bq. setCurrnetMasterKeyData, setNextMasterKeyData methods not used Removed bq. change not needed? |- AMRMTOKEN_SECRET_MANAGER_ROOT_ZNODE_NAME Removed bq. Fix “System.out.println(stateData.getCurrentTokenMasterKey());” in FileSystemRMStateStore Removed bq. Test: add test in restart scenario that AM issued with rolled-over AMRMToken is still able to communicate with restarted RM. testAppAttemptTokensRestoredOnRMRestart may help writing the test. Yes, will add this Testcase in next ticket YARN-2212 > RMStateStore needs to save AMRMToken master key for recovery when RM > restart/failover happens > -- > > Key: YARN-2211 > URL: https://issues.apache.org/jira/browse/YARN-2211 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, > YARN-2211.4.patch > > > After YARN-2208, AMRMToken can be rolled over periodically. We need to save > related Master Keys and use them to recover the AMRMToken when RM > restart/failover happens -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2285) Preemption can cause capacity scheduler to show 5,000% queue absolute used capacity.
[ https://issues.apache.org/jira/browse/YARN-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062693#comment-14062693 ] Tassapol Athiapinya commented on YARN-2285: --- After a closer look, 5000% is a valid number. It means 5000% of the "guaranteed capacity" of queue A (about 50% of absolute used capacity). I am changing the jira title accordingly. I will also make this an improvement jira instead of a bug. The point here becomes whether it would be nice to "re-label" the text in the web UI to better reflect its meaning: "% used next to a queue is % of guaranteed queue capacity, not absolute used capacity". > Preemption can cause capacity scheduler to show 5,000% queue absolute used > capacity. > > > Key: YARN-2285 > URL: https://issues.apache.org/jira/browse/YARN-2285 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.5.0 > Environment: Turn on CS Preemption. >Reporter: Tassapol Athiapinya > Attachments: preemption_5000_percent.png > > > I configure queue A, B to have 1%, 99% capacity respectively. There is no max > capacity for each queue. Set high user limit factor. > Submit app 1 to queue A. AM container takes 50% of cluster memory. Task > containers take another 50%. Submit app 2 to queue B. Preempt task containers > of app 1 out. Turns out capacity of queue B increases to 99% but queue A has > 5000% used. -- This message was sent by Atlassian JIRA (v6.2#6252)
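The 5,000% reading is simple arithmetic once the denominator is known: the UI reports usage relative to the queue's guaranteed capacity, not to the whole cluster. A minimal sketch with the numbers from this report (queue A guaranteed 1% of the cluster, holding about 50% of it); the function name is ours, not YARN's:

```python
def pct_of_guaranteed(absolute_used_pct, guaranteed_pct):
    """Express a queue's usage as a percentage of its guaranteed capacity."""
    return 100.0 * absolute_used_pct / guaranteed_pct

# Queue A: guaranteed 1% of the cluster, actually holding ~50% of it.
print(pct_of_guaranteed(50, 1))   # 5000.0 -> the "5,000%" shown in the UI
# Queue B: guaranteed 99%, holding 99% after preemption.
print(pct_of_guaranteed(99, 99))  # 100.0
```

So the number is internally consistent; the relabeling question is purely about making the denominator explicit in the UI.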
[jira] [Updated] (YARN-2285) Preemption can cause capacity scheduler to show 5,000% queue capacity.
[ https://issues.apache.org/jira/browse/YARN-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tassapol Athiapinya updated YARN-2285: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) Summary: Preemption can cause capacity scheduler to show 5,000% queue capacity. (was: Preemption can cause capacity scheduler to show 5,000% queue absolute used capacity.) > Preemption can cause capacity scheduler to show 5,000% queue capacity. > -- > > Key: YARN-2285 > URL: https://issues.apache.org/jira/browse/YARN-2285 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.5.0 > Environment: Turn on CS Preemption. >Reporter: Tassapol Athiapinya >Priority: Minor > Attachments: preemption_5000_percent.png > > > I configure queue A, B to have 1%, 99% capacity respectively. There is no max > capacity for each queue. Set high user limit factor. > Submit app 1 to queue A. AM container takes 50% of cluster memory. Task > containers take another 50%. Submit app 2 to queue B. Preempt task containers > of app 1 out. Turns out capacity of queue B increases to 99% but queue A has > 5000% used. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062698#comment-14062698 ] Mayank Bansal commented on YARN-1408: - +1. Committing. Thanks [~sunilg] for the patch. Thanks [~jianhe], [~vinodkv] and [~wangda] for the reviews. Thanks, Mayank > Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task > timeout for 30mins > -- > > Key: YARN-1408 > URL: https://issues.apache.org/jira/browse/YARN-1408 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.2.0 >Reporter: Sunil G >Assignee: Sunil G > Attachments: Yarn-1408.1.patch, Yarn-1408.10.patch, > Yarn-1408.11.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, > Yarn-1408.5.patch, Yarn-1408.6.patch, Yarn-1408.7.patch, Yarn-1408.8.patch, > Yarn-1408.9.patch, Yarn-1408.patch > > > Capacity preemption is enabled as follows. > * yarn.resourcemanager.scheduler.monitor.enable= true , > * > yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy > Queue = a,b > Capacity of Queue A = 80% > Capacity of Queue B = 20% > Step 1: Assign a big jobA on queue a which uses full cluster capacity > Step 2: Submit a jobB to queue b which would use less than 20% of cluster > capacity > A jobA task which uses queue b capacity has been preempted and killed. > This caused the problem below: > 1. A new Container got allocated for jobA in Queue A as per a node update > from an NM. > 2. This container was preempted immediately as per preemption. > Here the ACQUIRED at KILLED Invalid State exception came when the next AM > heartbeat reached the RM. 
> ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > Can't handle this event at current state > org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: > ACQUIRED at KILLED > This also caused the Task to go for a timeout of 30 minutes as this Container > was already killed by preemption. > attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
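The failure mode in the description can be modeled as an event arriving in a state with no registered transition. The toy state table below is illustrative only (the real RMContainerImpl state machine is far larger); it just shows why a late ACQUIRED event delivered to an already-killed container surfaces as an InvalidStateTransitonException:

```python
# Minimal sketch of the race described above (not RMContainerImpl itself):
# preemption kills a freshly allocated container before the AM heartbeat
# delivers ACQUIRED, so the event arrives in a state with no transition.

class InvalidStateTransitionError(Exception):
    pass

# (current_state, event) -> next_state; deliberately incomplete, like any
# real state machine table that omits impossible-by-design combinations.
TRANSITIONS = {
    ("ALLOCATED", "ACQUIRED"): "ACQUIRED",
    ("ALLOCATED", "KILL"): "KILLED",
    ("ACQUIRED", "KILL"): "KILLED",
}

def handle(state, event):
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise InvalidStateTransitionError(
            "Invalid event: %s at %s" % (event, state))

state = handle("ALLOCATED", "KILL")  # preemption kills the new container
# The next AM heartbeat then tries to acquire the already-killed container:
# handle(state, "ACQUIRED") raises "Invalid event: ACQUIRED at KILLED".
```

The fix committed here makes the scheduler tolerate that late event instead of crashing the dispatcher and leaving the task to hit its 30-minute timeout.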
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062696#comment-14062696 ] Hadoop QA commented on YARN-1408: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12655718/Yarn-1408.11.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4312//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4312//console This message is automatically generated. 
> Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task > timeout for 30mins > -- > > Key: YARN-1408 > URL: https://issues.apache.org/jira/browse/YARN-1408 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.2.0 >Reporter: Sunil G >Assignee: Sunil G > Attachments: Yarn-1408.1.patch, Yarn-1408.10.patch, > Yarn-1408.11.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, > Yarn-1408.5.patch, Yarn-1408.6.patch, Yarn-1408.7.patch, Yarn-1408.8.patch, > Yarn-1408.9.patch, Yarn-1408.patch > > > Capacity preemption is enabled as follows. > * yarn.resourcemanager.scheduler.monitor.enable= true , > * > yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy > Queue = a,b > Capacity of Queue A = 80% > Capacity of Queue B = 20% > Step 1: Assign a big jobA on queue a which uses full cluster capacity > Step 2: Submitted a jobB to queue b which would use less than 20% of cluster > capacity > JobA task which uses queue b capcity is been preempted and killed. > This caused below problem: > 1. New Container has got allocated for jobA in Queue A as per node update > from an NM. > 2. This container has been preempted immediately as per preemption. > Here ACQUIRED at KILLED Invalid State exception came when the next AM > heartbeat reached RM. > ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > Can't handle this event at current state > org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: > ACQUIRED at KILLED > This also caused the Task to go for a timeout for 30minutes as this Container > was already killed by preemption. > attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2285) Preemption can cause capacity scheduler to show 5,000% queue capacity.
[ https://issues.apache.org/jira/browse/YARN-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062697#comment-14062697 ] Tassapol Athiapinya commented on YARN-2285: --- Also, it is not major, but the percentage shown is not right. In the attached screenshot, root queue used is 146.5%. > Preemption can cause capacity scheduler to show 5,000% queue capacity. > -- > > Key: YARN-2285 > URL: https://issues.apache.org/jira/browse/YARN-2285 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.5.0 > Environment: Turn on CS Preemption. >Reporter: Tassapol Athiapinya >Priority: Minor > Attachments: preemption_5000_percent.png > > > I configure queue A, B to have 1%, 99% capacity respectively. There is no max > capacity for each queue. Set high user limit factor. > Submit app 1 to queue A. AM container takes 50% of cluster memory. Task > containers take another 50%. Submit app 2 to queue B. Preempt task containers > of app 1 out. Turns out capacity of queue B increases to 99% but queue A has > 5000% used. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062716#comment-14062716 ] Hudson commented on YARN-1408: -- FAILURE: Integrated in Hadoop-trunk-Commit #5887 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5887/]) YARN-1408 Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins. (Sunil G via mayank) (mayank: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1610860) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSSchedulerApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/TestRMContainerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerTestBase.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java
[jira] [Updated] (YARN-1336) Work-preserving nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1336: - Attachment: NMRestartDesignOverview.pdf Attaching a PDF that briefly describes the approach and how the methods of the state store interface are used to persist and recover state. > Work-preserving nodemanager restart > --- > > Key: YARN-1336 > URL: https://issues.apache.org/jira/browse/YARN-1336 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: NMRestartDesignOverview.pdf, YARN-1336-rollup.patch > > > This serves as an umbrella ticket for tasks related to work-preserving > nodemanager restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062727#comment-14062727 ] Craig Welch commented on YARN-1680: --- It looks like this won't account for nodes that are blacklisted based on their rack; I think this is an uncovered case. > availableResources sent to applicationMaster in heartbeat should exclude > blacklistedNodes free memory. > -- > > Key: YARN-1680 > URL: https://issues.apache.org/jira/browse/YARN-1680 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.2.0, 2.3.0 > Environment: SuSE 11 SP2 + Hadoop-2.3 >Reporter: Rohith >Assignee: Chen He > Attachments: YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch > > > There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster > slow start is set to 1. > The running job's reducer tasks occupied 29GB of the cluster. One NodeManager (NM-4) > became unstable (3 Maps got killed), so the MRAppMaster blacklisted the unstable > NodeManager (NM-4). All reducer tasks are running in the cluster now. > MRAppMaster does not preempt the reducers because, for the reducer preemption > calculation, headRoom includes blacklisted nodes' memory. This makes > jobs hang forever (ResourceManager does not assign any new containers on > blacklisted nodes but returns availableResources computed from cluster free > memory). -- This message was sent by Atlassian JIRA (v6.2#6252)
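The fix the report asks for amounts to excluding blacklisted nodes' free memory when computing the headroom returned to the AM. A hedged sketch with hypothetical names and made-up numbers in MB (note it keys on node names only, so it shares the rack-level-blacklisting gap raised in the comment above):

```python
# Sketch of the headroom adjustment requested in YARN-1680 (names are
# hypothetical, not the RM's actual code): free memory on blacklisted
# nodes is unplaceable for this AM and must not be advertised as headroom.

def available_headroom(nodes, blacklisted):
    """nodes: dict node name -> free memory (MB); blacklisted: set of names."""
    return sum(free for node, free in nodes.items()
               if node not in blacklisted)

nodes = {"NM-1": 0, "NM-2": 0, "NM-3": 1024, "NM-4": 2048}
naive = sum(nodes.values())                   # counts blacklisted NM-4: 3072
usable = available_headroom(nodes, {"NM-4"})  # excludes NM-4: 1024
# Advertising 3072 makes the AM believe reducers need not be preempted,
# while only 1024 MB is actually placeable, so the job hangs.
print(naive, usable)
```

Matching against racks as well as node names would close the uncovered case mentioned in the comment, but that is outside this sketch.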
[jira] [Created] (YARN-2294) Update sample program and documentations for writing YARN Application
Li Lu created YARN-2294: --- Summary: Update sample program and documentations for writing YARN Application Key: YARN-2294 URL: https://issues.apache.org/jira/browse/YARN-2294 Project: Hadoop YARN Issue Type: Improvement Reporter: Li Lu Many APIs for writing YARN applications have been stabilized. However, some of them have changed since the sample YARN programs, like distributed shell, and the documentation were last updated. There are on-going discussions on the users mailing list about updating the outdated "Writing YARN Applications" documentation. Updating the sample programs like distributed shell is also needed, since they are probably the first demonstration of YARN applications that newcomers see. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2295) Updating Client of YARN distributed shell with existing public stable API
Li Lu created YARN-2295: --- Summary: Updating Client of YARN distributed shell with existing public stable API Key: YARN-2295 URL: https://issues.apache.org/jira/browse/YARN-2295 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Some API calls in YARN distributed shell client have been marked as unstable and private. Use existing public stable API to replace them, if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2295) Update Client of YARN distributed shell with existing public stable API
[ https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2295: Summary: Update Client of YARN distributed shell with existing public stable API (was: Updating Client of YARN distributed shell with existing public stable API) > Update Client of YARN distributed shell with existing public stable API > --- > > Key: YARN-2295 > URL: https://issues.apache.org/jira/browse/YARN-2295 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2295-071514.patch > > > Some API calls in YARN distributed shell client have been marked as unstable > and private. Use existing public stable API to replace them, if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2295) Updating Client of YARN distributed shell with existing public stable API
[ https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2295: Attachment: YARN-2295-071514.patch Replacing the unstable privately visible Records.newRecord method with the newInstance method for each class. > Updating Client of YARN distributed shell with existing public stable API > - > > Key: YARN-2295 > URL: https://issues.apache.org/jira/browse/YARN-2295 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2295-071514.patch > > > Some API calls in YARN distributed shell client have been marked as unstable > and private. Use existing public stable API to replace them, if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2296) Update Application Master of YARN distributed shell with existing public stable API
Li Lu created YARN-2296: --- Summary: Update Application Master of YARN distributed shell with existing public stable API Key: YARN-2296 URL: https://issues.apache.org/jira/browse/YARN-2296 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2295) Update Client of YARN distributed shell with existing public stable API
[ https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2295: Attachment: YARN-2295-071514-1.patch Updated patch with refactoring in both AM and Client > Update Client of YARN distributed shell with existing public stable API > --- > > Key: YARN-2295 > URL: https://issues.apache.org/jira/browse/YARN-2295 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2295-071514-1.patch, YARN-2295-071514.patch > > > Some API calls in YARN distributed shell client have been marked as unstable > and private. Use existing public stable API to replace them, if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2295) Refactor YARN distributed shell with existing public stable API
[ https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2295: Description: Some API calls in YARN distributed shell have been marked as unstable and private. Use existing public stable API to replace them, if possible. (was: Some API calls in YARN distributed shell client have been marked as unstable and private. Use existing public stable API to replace them, if possible. ) Summary: Refactor YARN distributed shell with existing public stable API (was: Update Client of YARN distributed shell with existing public stable API) > Refactor YARN distributed shell with existing public stable API > --- > > Key: YARN-2295 > URL: https://issues.apache.org/jira/browse/YARN-2295 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2295-071514-1.patch, YARN-2295-071514.patch > > > Some API calls in YARN distributed shell have been marked as unstable and > private. Use existing public stable API to replace them, if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2296) Update Application Master of YARN distributed shell with existing public stable API
[ https://issues.apache.org/jira/browse/YARN-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu resolved YARN-2296. - Resolution: Duplicate Merged into YARN-2295 > Update Application Master of YARN distributed shell with existing public > stable API > --- > > Key: YARN-2296 > URL: https://issues.apache.org/jira/browse/YARN-2296 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2295) Refactor YARN distributed shell with existing public stable API
[ https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062799#comment-14062799 ] Hadoop QA commented on YARN-2295: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12655897/YARN-2295-071514.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell: org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4313//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4313//console This message is automatically generated. 
> Refactor YARN distributed shell with existing public stable API > --- > > Key: YARN-2295 > URL: https://issues.apache.org/jira/browse/YARN-2295 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2295-071514-1.patch, YARN-2295-071514.patch > > > Some API calls in YARN distributed shell have been marked as unstable and > private. Use existing public stable API to replace them, if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2233) Implement web services to create, renew and cancel delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062808#comment-14062808 ] Vinod Kumar Vavilapalli commented on YARN-2233: --- Looks good, +1. Checking this in.. > Implement web services to create, renew and cancel delegation tokens > > > Key: YARN-2233 > URL: https://issues.apache.org/jira/browse/YARN-2233 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Blocker > Attachments: apache-yarn-2233.0.patch, apache-yarn-2233.1.patch, > apache-yarn-2233.2.patch, apache-yarn-2233.3.patch, apache-yarn-2233.4.patch, > apache-yarn-2233.5.patch > > > Implement functionality to create, renew and cancel delegation tokens. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1695) Implement the rest (writable APIs) of RM web-services
[ https://issues.apache.org/jira/browse/YARN-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1695: -- Priority: Major (was: Blocker) > Implement the rest (writable APIs) of RM web-services > - > > Key: YARN-1695 > URL: https://issues.apache.org/jira/browse/YARN-1695 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Vinod Kumar Vavilapalli >Assignee: Varun Vasudev > > MAPREDUCE-2863 added the REST web-services to RM and NM. But all the APIs > added there were only focused on obtaining information from the cluster. We > need to have the following REST APIs to finish the feature > - Application submission/termination (Priority): This unblocks easy client > interaction with a YARN cluster > - Application Client protocol: For resource scheduling by apps written in an > arbitrary language. Will have to think about throughput concerns > - ContainerManagement Protocol: Again for arbitrary language apps. > One important thing to note here is that we already have client libraries on > all the three protocols that do some heavy-lifting. One part of the > effort is to figure out if they can be made any thinner and/or how > web-services will implement the same functionality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2295) Refactor YARN distributed shell with existing public stable API
[ https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062826#comment-14062826 ] Hadoop QA commented on YARN-2295: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12655903/YARN-2295-071514-1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell: org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4314//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4314//console This message is automatically generated. 
> Refactor YARN distributed shell with existing public stable API > --- > > Key: YARN-2295 > URL: https://issues.apache.org/jira/browse/YARN-2295 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2295-071514-1.patch, YARN-2295-071514.patch > > > Some API calls in YARN distributed shell have been marked as unstable and > private. Use existing public stable API to replace them, if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2233) Implement web services to create, renew and cancel delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062831#comment-14062831 ] Hudson commented on YARN-2233: -- FAILURE: Integrated in Hadoop-trunk-Commit #5888 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5888/]) YARN-2233. Implemented ResourceManager web-services to create, renew and cancel delegation tokens. Contributed by Varun Vasudev. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1610876) * /hadoop/common/trunk/hadoop-common-project/hadoop-auth/pom.xml * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/RMDelegationTokenSecretManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/DelegationToken.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesDelegationTokens.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm > Implement web services to create, renew and cancel delegation tokens > > > Key: YARN-2233 > URL: https://issues.apache.org/jira/browse/YARN-2233 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Blocker > Fix 
For: 2.5.0 > > Attachments: apache-yarn-2233.0.patch, apache-yarn-2233.1.patch, > apache-yarn-2233.2.patch, apache-yarn-2233.3.patch, apache-yarn-2233.4.patch, > apache-yarn-2233.5.patch > > > Implement functionality to create, renew and cancel delegation tokens. -- This message was sent by Atlassian JIRA (v6.2#6252)
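For context, a minimal sketch of composing a client request to the new endpoint. The `/ws/v1/cluster/delegation-token` path comes from the ResourceManagerRest documentation touched by this commit; the host, renewer value, and helper names are illustrative assumptions, not part of the patch.

```java
// Hypothetical helper for building a delegation-token creation request to
// the RM web services. Only the endpoint path is taken from the patched
// documentation; everything else here is an assumption for illustration.
final class TokenRequestSketch {

    // POST target: <rm http address>/ws/v1/cluster/delegation-token
    static String tokenEndpoint(String rmAddress) {
        return rmAddress + "/ws/v1/cluster/delegation-token";
    }

    // JSON body naming the renewer, e.g. {"renewer":"yarn"}
    static String requestBody(String renewer) {
        return "{\"renewer\":\"" + renewer + "\"}";
    }
}
```

A client would POST `requestBody(...)` to `tokenEndpoint(...)` over an authenticated (e.g. Kerberos/SPNEGO) connection; renew and cancel follow the same path with different verbs per the patched docs.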
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062861#comment-14062861 ] Jason Lowe commented on YARN-1341: -- Thanks for commenting, Devaraj! My apologies for the late reply, as I was on vacation and am still catching up. bq. In addition to option 1), I'd think of making the NM down if NM fails to store RM keys for certain number of times(configurable) consecutively. As for retries, I mentioned earlier that if retries are likely to help then the state store implementation should do so rather than have the common code do so. For the leveldb implementation it is very unlikely that a retry is going to do anything other than just make the operation take longer to ultimately fail. The firmware of the drive is already going to implement a large number of retries to attempt to recover from hardware errors, and non-hardware local filesystem errors are highly unlikely to be fixed by simply retrying immediately. If that were the case then I'd expect retries to be implemented in many other places where the local filesystem is used by Hadoop code. bq. And also we can make it(i.e. tear down NM or not) as configurable I'd like to avoid adding yet more config options unless we think we really need them, but if people agree this needs to be configurable then we can do so. Also I assume in that scenario you would want the NM to shutdown while also tearing down containers, cleaning up, etc. as if it didn't support recovery. Tearing down the NM on a state store error just to have it start up again and try to recover with stale state seems pointless -- might as well have just kept running which is a better outcome. Or am I missing a use case for that? And thanks, Junping, for the recent comments! bq. If you are also agree on this, we can separate this document effort to other JIRA (Umbrella or a dedicated one, whatever you like) and continue the discussion on this particular case. 
Sure, we can discuss general error handling or an overall document for it either on YARN-1336 or a new JIRA. bq. a. if currentMasterKey is stale, it can be updated and override soon with registering to RM later. Nothing is affected. Correct, the NM should receive the current master key upon re-registration with the RM after it restarts. bq. b. if previousMasterKey is stale, then the real previous master key is lost, so the affection is: AMs with real master key cannot connect to NM to launch containers. AMs that have the current master key will still be able to connect because the NM just got the current master key as described in a). AM's that have the previous master key will not be able to connect to the NM unless that particular master key also happened to be successfully associated with the attempt in the state store (related to case c). bq. c. if applicationMasterKeys are stale, then previous old keys get tracked in applicationMasterKeys get lost after restart. The affection is: AMs with old keys cannot connect to NM to launch containers. AMs that use an old key (i.e.: not the current or previous master key) would be unable to connect to the NM. bq. Anything I am missing here? I don't believe so. The bottom line is that an AM may not be able to successfully connect to an NM after a restart with stale NM token state. > Recover NMTokens upon nodemanager restart > - > > Key: YARN-1341 > URL: https://issues.apache.org/jira/browse/YARN-1341 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, > YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
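The three staleness cases above (current key refreshed on re-registration, previous key possibly lost, per-application keys possibly lost) can be summarized in a small model. This is a hedged sketch of the discussion, not the actual NMTokenSecretManager code; all names are illustrative.

```java
import java.util.Set;

// Illustrative model of the cases above: after an NM restart, an AM's
// NM token is accepted only if its master key matches the current key
// (refreshed on re-registration with the RM), the recovered previous
// key, or a key recovered per-application from the state store.
final class NMTokenModel {

    static boolean canConnect(int amKeyId, int currentKeyId, int previousKeyId,
                              Set<Integer> recoveredAppKeyIds) {
        return amKeyId == currentKeyId
                || amKeyId == previousKeyId
                || recoveredAppKeyIds.contains(amKeyId);
    }
}
```

An AM holding a key that is neither current, previous, nor recovered for its application is the failure mode described: it simply cannot connect to the NM after restart with stale state.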
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062879#comment-14062879 ] Mayank Bansal commented on YARN-1408: - Committed to trunk, branch-2 and branch-2.5. branch-2.5 needed some rebasing; updating the patch for branch-2.5. Thanks, Mayank > Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task > timeout for 30mins > -- > > Key: YARN-1408 > URL: https://issues.apache.org/jira/browse/YARN-1408 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.2.0 >Reporter: Sunil G >Assignee: Sunil G > Attachments: YARN-1408-branch-2.5-1.patch, Yarn-1408.1.patch, > Yarn-1408.10.patch, Yarn-1408.11.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, > Yarn-1408.4.patch, Yarn-1408.5.patch, Yarn-1408.6.patch, Yarn-1408.7.patch, > Yarn-1408.8.patch, Yarn-1408.9.patch, Yarn-1408.patch > > > Capacity preemption is enabled as follows. > * yarn.resourcemanager.scheduler.monitor.enable= true , > * > yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy > Queue = a,b > Capacity of Queue A = 80% > Capacity of Queue B = 20% > Step 1: Assign a big jobA on queue a which uses full cluster capacity > Step 2: Submitted a jobB to queue b which would use less than 20% of cluster > capacity > JobA task which uses queue b capacity has been preempted and killed. > This caused the problem below: > 1. New Container has got allocated for jobA in Queue A as per node update > from an NM. > 2. This container has been preempted immediately as per preemption. > Here ACQUIRED at KILLED Invalid State exception came when the next AM > heartbeat reached RM. 
> ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > Can't handle this event at current state > org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: > ACQUIRED at KILLED > This also caused the Task to go for a timeout for 30minutes as this Container > was already killed by preemption. > attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-1408: Attachment: YARN-1408-branch-2.5-1.patch Rebasing against branch-2.5 > Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task > timeout for 30mins > -- > > Key: YARN-1408 > URL: https://issues.apache.org/jira/browse/YARN-1408 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.2.0 >Reporter: Sunil G >Assignee: Sunil G > Attachments: YARN-1408-branch-2.5-1.patch, Yarn-1408.1.patch, > Yarn-1408.10.patch, Yarn-1408.11.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, > Yarn-1408.4.patch, Yarn-1408.5.patch, Yarn-1408.6.patch, Yarn-1408.7.patch, > Yarn-1408.8.patch, Yarn-1408.9.patch, Yarn-1408.patch > > > Capacity preemption is enabled as follows. > * yarn.resourcemanager.scheduler.monitor.enable= true , > * > yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy > Queue = a,b > Capacity of Queue A = 80% > Capacity of Queue B = 20% > Step 1: Assign a big jobA on queue a which uses full cluster capacity > Step 2: Submitted a jobB to queue b which would use less than 20% of cluster > capacity > JobA task which uses queue b capacity has been preempted and killed. > This caused the problem below: > 1. New Container has got allocated for jobA in Queue A as per node update > from an NM. > 2. This container has been preempted immediately as per preemption. > Here ACQUIRED at KILLED Invalid State exception came when the next AM > heartbeat reached RM. 
> ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > Can't handle this event at current state > org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: > ACQUIRED at KILLED > This also caused the Task to go for a timeout for 30minutes as this Container > was already killed by preemption. > attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2285) Preemption can cause capacity scheduler to show 5,000% queue capacity.
[ https://issues.apache.org/jira/browse/YARN-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan reassigned YARN-2285: Assignee: Wangda Tan Assigned it to me, working on this ... > Preemption can cause capacity scheduler to show 5,000% queue capacity. > -- > > Key: YARN-2285 > URL: https://issues.apache.org/jira/browse/YARN-2285 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.5.0 > Environment: Turn on CS Preemption. >Reporter: Tassapol Athiapinya >Assignee: Wangda Tan >Priority: Minor > Attachments: preemption_5000_percent.png > > > I configure queue A, B to have 1%, 99% capacity respectively. There is no max > capacity for each queue. Set high user limit factor. > Submit app 1 to queue A. AM container takes 50% of cluster memory. Task > containers take another 50%. Submit app 2 to queue B. Preempt task containers > of app 1 out. Turns out capacity of queue B increases to 99% but queue A has > 5000% used. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2295) Refactor YARN distributed shell with existing public stable API
[ https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2295: Attachment: YARN-2295-071514-1.patch Could not reproduce the UT failure locally. Resubmitting this patch to see if the problem reappears. > Refactor YARN distributed shell with existing public stable API > --- > > Key: YARN-2295 > URL: https://issues.apache.org/jira/browse/YARN-2295 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2295-071514-1.patch, YARN-2295-071514-1.patch, > YARN-2295-071514.patch > > > Some API calls in YARN distributed shell have been marked as unstable and > private. Use existing public stable API to replace them, if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2297) Preemption can hang in corner case by not allowing any task container to proceed.
Tassapol Athiapinya created YARN-2297: - Summary: Preemption can hang in corner case by not allowing any task container to proceed. Key: YARN-2297 URL: https://issues.apache.org/jira/browse/YARN-2297 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.5.0 Reporter: Tassapol Athiapinya Priority: Critical Preemption can cause hang issue in single-node cluster. Only AMs run. No task container can run. h3. queue configuration Queue A/B has 1% and 99% respectively. No max capacity. h3. scenario Turn on preemption. Configure 1 NM with 4 GB of memory. Use only 2 apps. Use 1 user. Submit app 1 to queue A. AM needs 2 GB. There is 1 task that needs 2 GB. Occupy entire cluster. Submit app 2 to queue B. AM needs 2 GB. There are 3 tasks that need 2 GB each. Instead of entire app 1 preempted, app 1 AM will stay. App 2 AM will launch. No task of either app can proceed. h3. commands /usr/lib/hadoop/bin/hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar randomtextwriter "-Dmapreduce.map.memory.mb=2000" "-Dyarn.app.mapreduce.am.command-opts=-Xmx1800M" "-Dmapreduce.randomtextwriter.bytespermap=2147483648" "-Dmapreduce.job.queuename=A" "-Dmapreduce.map.maxattempts=100" "-Dmapreduce.am.max-attempts=1" "-Dyarn.app.mapreduce.am.resource.mb=2000" "-Dmapreduce.map.java.opts=-Xmx1800M" "-Dmapreduce.randomtextwriter.mapsperhost=1" "-Dmapreduce.randomtextwriter.totalbytes=2147483648" dir1 /usr/lib/hadoop/bin/hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar sleep "-Dmapreduce.map.memory.mb=2000" "-Dyarn.app.mapreduce.am.command-opts=-Xmx1800M" "-Dmapreduce.job.queuename=B" "-Dmapreduce.map.maxattempts=100" "-Dmapreduce.am.max-attempts=1" "-Dyarn.app.mapreduce.am.resource.mb=2000" "-Dmapreduce.map.java.opts=-Xmx1800M" -m 1 -r 0 -mt 4000 -rt 0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062911#comment-14062911 ] Hadoop QA commented on YARN-2069: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12654821/YARN-2069-trunk-5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4315//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4315//console This message is automatically generated. > CS queue level preemption should respect user-limits > > > Key: YARN-2069 > URL: https://issues.apache.org/jira/browse/YARN-2069 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Vinod Kumar Vavilapalli >Assignee: Mayank Bansal > Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, > YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch > > > This is different from (even if related to, and likely share code with) > YARN-2113. 
> YARN-2113 focuses on making sure that even if a queue has its guaranteed > capacity, its individual users are treated in line with their limits > irrespective of when they join in. > This JIRA is about respecting user-limits while preempting containers to > balance queue capacities. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2295) Refactor YARN distributed shell with existing public stable API
[ https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2295: Attachment: (was: YARN-2295-071514-1.patch) > Refactor YARN distributed shell with existing public stable API > --- > > Key: YARN-2295 > URL: https://issues.apache.org/jira/browse/YARN-2295 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2295-071514-1.patch, YARN-2295-071514.patch > > > Some API calls in YARN distributed shell have been marked as unstable and > private. Use existing public stable API to replace them, if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2297) Preemption can hang in corner case by not allowing any task container to proceed.
[ https://issues.apache.org/jira/browse/YARN-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan reassigned YARN-2297: Assignee: Wangda Tan > Preemption can hang in corner case by not allowing any task container to > proceed. > - > > Key: YARN-2297 > URL: https://issues.apache.org/jira/browse/YARN-2297 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.5.0 >Reporter: Tassapol Athiapinya >Assignee: Wangda Tan >Priority: Critical > > Preemption can cause hang issue in single-node cluster. Only AMs run. No task > container can run. > h3. queue configuration > Queue A/B has 1% and 99% respectively. > No max capacity. > h3. scenario > Turn on preemption. Configure 1 NM with 4 GB of memory. Use only 2 apps. Use > 1 user. > Submit app 1 to queue A. AM needs 2 GB. There is 1 task that needs 2 GB. > Occupy entire cluster. > Submit app 2 to queue B. AM needs 2 GB. There are 3 tasks that need 2 GB each. > Instead of entire app 1 preempted, app 1 AM will stay. App 2 AM will launch. > No task of either app can proceed. > h3. 
commands > /usr/lib/hadoop/bin/hadoop jar > /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar randomtextwriter > "-Dmapreduce.map.memory.mb=2000" > "-Dyarn.app.mapreduce.am.command-opts=-Xmx1800M" > "-Dmapreduce.randomtextwriter.bytespermap=2147483648" > "-Dmapreduce.job.queuename=A" "-Dmapreduce.map.maxattempts=100" > "-Dmapreduce.am.max-attempts=1" "-Dyarn.app.mapreduce.am.resource.mb=2000" > "-Dmapreduce.map.java.opts=-Xmx1800M" > "-Dmapreduce.randomtextwriter.mapsperhost=1" > "-Dmapreduce.randomtextwriter.totalbytes=2147483648" dir1 > /usr/lib/hadoop/bin/hadoop jar > /usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar sleep > "-Dmapreduce.map.memory.mb=2000" > "-Dyarn.app.mapreduce.am.command-opts=-Xmx1800M" > "-Dmapreduce.job.queuename=B" "-Dmapreduce.map.maxattempts=100" > "-Dmapreduce.am.max-attempts=1" "-Dyarn.app.mapreduce.am.resource.mb=2000" > "-Dmapreduce.map.java.opts=-Xmx1800M" -m 1 -r 0 -mt 4000 -rt 0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2295) Refactor YARN distributed shell with existing public stable API
[ https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062936#comment-14062936 ] Hadoop QA commented on YARN-2295: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12655934/YARN-2295-071514-1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell: org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4316//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4316//console This message is automatically generated. 
> Refactor YARN distributed shell with existing public stable API > --- > > Key: YARN-2295 > URL: https://issues.apache.org/jira/browse/YARN-2295 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-2295-071514-1.patch, YARN-2295-071514.patch > > > Some API calls in YARN distributed shell have been marked as unstable and > private. Use existing public stable API to replace them, if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2295) Refactor YARN distributed shell with existing public stable API
[ https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2295: Attachment: TEST-YARN-2295-071514.patch Probably a deterministic failure on the server. Using a trivial formatting patch with no trailing tabs to see if the problem is with the server. > Refactor YARN distributed shell with existing public stable API > --- > > Key: YARN-2295 > URL: https://issues.apache.org/jira/browse/YARN-2295 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: TEST-YARN-2295-071514.patch, YARN-2295-071514-1.patch, > YARN-2295-071514.patch > > > Some API calls in YARN distributed shell have been marked as unstable and > private. Use existing public stable API to replace them, if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2295) Refactor YARN distributed shell with existing public stable API
[ https://issues.apache.org/jira/browse/YARN-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062964#comment-14062964 ] Hadoop QA commented on YARN-2295: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12655957/TEST-YARN-2295-071514.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell: org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4317//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4317//console This message is automatically generated. 
> Refactor YARN distributed shell with existing public stable API > --- > > Key: YARN-2295 > URL: https://issues.apache.org/jira/browse/YARN-2295 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Li Lu >Assignee: Li Lu > Attachments: TEST-YARN-2295-071514.patch, YARN-2295-071514-1.patch, > YARN-2295-071514.patch > > > Some API calls in YARN distributed shell have been marked as unstable and > private. Use existing public stable API to replace them, if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062965#comment-14062965 ] Craig Welch commented on YARN-1198: --- It seems like the common problem with this group of JIRAs is that when the cluster is resource-constrained or has a small number of large jobs using most of the resources, it can get into deadlock scenarios. In addition to fixes for the specific behaviors, I think it would be worthwhile to take a min of the calculated headroom against "cluster headroom" as a sanity check, cluster headroom being the total cluster resource minus utilized resources. I've attached a partial patch for that. This will not help with the application blacklist case (1680), but it would help with 1857 and 2008 (it doesn't correct the mistake in headroom calculation, but it should prevent it from causing a deadlock). (That's not to say we should not also fix the individual issues, just that this might be a good "catch all" for others we aren't aware of / the problem generally.) I'm attaching an initial pass at doing this (it's just the basics to see if the direction makes sense, not a finished product). > Capacity Scheduler headroom calculation does not work as expected > - > > Key: YARN-1198 > URL: https://issues.apache.org/jira/browse/YARN-1198 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Omkar Vinit Joshi >Assignee: Omkar Vinit Joshi > > Today headroom calculation (for the app) takes place only when > * New node is added/removed from the cluster > * New container is getting assigned to the application. > However there are potentially lot of situations which are not considered for > this calculation > * If a container finishes then headroom for that application will change and > should be notified to the AM accordingly. 
> * If a single user has submitted multiple applications (app1 and app2) to the > same queue then > ** If app1's container finishes then not only app1's but also app2's AM > should be notified about the change in headroom. > ** Similarly if a container is assigned to any applications app1/app2 then > both AM should be notified about their headroom. > ** To simplify the whole communication process it is ideal to keep headroom > per User per LeafQueue so that everyone gets the same picture (apps belonging > to same user and submitted in same queue). > * If a new user submits an application to the queue then all applications > submitted by all users in that queue should be notified of the headroom > change. > * Also today headroom is an absolute number ( I think it should be normalized > but then this is going to be not backward compatible..) > * Also when admin user refreshes queue headroom has to be updated. > These all are the potential bugs in headroom calculations -- This message was sent by Atlassian JIRA (v6.2#6252)
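The "min against cluster headroom" sanity check proposed in the comment above can be sketched as follows. This is a simplified, memory-only model under assumed names, not the actual CapacityScheduler patch:

```java
// Hedged sketch: cap the per-app headroom at what the cluster can still
// actually provide, so a miscalculated headroom can never exceed the
// cluster's free capacity and drive the AM into a deadlock.
final class HeadroomSketch {

    static long clampHeadroomMB(long computedHeadroomMB,
                                long clusterTotalMB, long clusterUsedMB) {
        // "Cluster headroom" = total cluster resource minus utilized resources.
        long clusterHeadroomMB = Math.max(0L, clusterTotalMB - clusterUsedMB);
        return Math.min(computedHeadroomMB, clusterHeadroomMB);
    }
}
```

For example, a (mistakenly) computed 10 GB headroom on a 32 GB cluster with 30 GB in use would be clamped to the 2 GB actually free.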
[jira] [Updated] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-1198: -- Attachment: YARN-1198.1.patch > Capacity Scheduler headroom calculation does not work as expected > - > > Key: YARN-1198 > URL: https://issues.apache.org/jira/browse/YARN-1198 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Omkar Vinit Joshi >Assignee: Omkar Vinit Joshi > Attachments: YARN-1198.1.patch > > > Today headroom calculation (for the app) takes place only when > * New node is added/removed from the cluster > * New container is getting assigned to the application. > However there are potentially lot of situations which are not considered for > this calculation > * If a container finishes then headroom for that application will change and > should be notified to the AM accordingly. > * If a single user has submitted multiple applications (app1 and app2) to the > same queue then > ** If app1's container finishes then not only app1's but also app2's AM > should be notified about the change in headroom. > ** Similarly if a container is assigned to any applications app1/app2 then > both AM should be notified about their headroom. > ** To simplify the whole communication process it is ideal to keep headroom > per User per LeafQueue so that everyone gets the same picture (apps belonging > to same user and submitted in same queue). > * If a new user submits an application to the queue then all applications > submitted by all users in that queue should be notified of the headroom > change. > * Also today headroom is an absolute number ( I think it should be normalized > but then this is going to be not backward compatible..) > * Also when admin user refreshes queue headroom has to be updated. > These all are the potential bugs in headroom calculations -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062972#comment-14062972 ] Craig Welch commented on YARN-1680: --- I was also wondering if we could maintain a Resource representing the amount of resources blacklisted by the application, updated as nodes/racks are added to or removed from the application blacklist, instead of iterating the nodes looking for the amount of blacklisted resources at the time of headroom calculation. This "blacklisted" resource would be subtracted from the cluster resource (similar to how it works in the current patch in that respect) to make sure the headroom calculation is correct. It seems like this might be a good approach as it should be "close to free" to update that blacklist resource when adding and removing things from the blacklist, and I think blacklisting may be less frequent than headroom calculation. Thoughts? > availableResources sent to applicationMaster in heartbeat should exclude > blacklistedNodes free memory. > -- > > Key: YARN-1680 > URL: https://issues.apache.org/jira/browse/YARN-1680 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.2.0, 2.3.0 > Environment: SuSE 11 SP2 + Hadoop-2.3 >Reporter: Rohith >Assignee: Chen He > Attachments: YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch > > > There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster > slow start is set to 1. > A job is running; reducer tasks occupied 29GB of the cluster. One NodeManager (NM-4) > became unstable (3 maps got killed), so MRAppMaster blacklisted the unstable > NodeManager (NM-4). All reducer tasks are now running in the cluster. > MRAppMaster does not preempt the reducers because for reducer preemption > calculation, headRoom considers blacklisted nodes' memory. 
This makes > jobs to hang forever(ResourceManager does not assing any new containers on > blacklisted nodes but returns availableResouce considers cluster free > memory). -- This message was sent by Atlassian JIRA (v6.2#6252)
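The running-total idea from the comment above can be sketched like this. All names are illustrative, not the actual MRAppMaster or RM code, and the model is memory-only:

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch: track the total capacity of blacklisted nodes as nodes
// enter and leave the blacklist, so the headroom calculation can subtract
// it in O(1) rather than iterating every node each time.
final class BlacklistTracker {

    private final Map<String, Long> blacklistedMB = new HashMap<>();
    private long blacklistedTotalMB = 0L;

    void blacklist(String node, long capacityMB) {
        // putIfAbsent returns null only on first insertion, keeping the
        // running total correct if a node is blacklisted twice.
        if (blacklistedMB.putIfAbsent(node, capacityMB) == null) {
            blacklistedTotalMB += capacityMB;
        }
    }

    void unblacklist(String node) {
        Long cap = blacklistedMB.remove(node);
        if (cap != null) {
            blacklistedTotalMB -= cap;
        }
    }

    long headroomMB(long clusterTotalMB, long usedMB) {
        // Headroom excludes capacity the app can never be scheduled on.
        return Math.max(0L, clusterTotalMB - blacklistedTotalMB - usedMB);
    }
}
```

In the scenario above (32 GB cluster, 8 GB NM blacklisted), subtracting the blacklisted capacity shrinks the reported headroom, which is what would let the AM decide to preempt reducers instead of hanging.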
[jira] [Updated] (YARN-2045) Data persisted in NM should be versioned
[ https://issues.apache.org/jira/browse/YARN-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-2045: - Attachment: YARN-2045-v3.patch Thanks [~jlowe] for the above comments. In the v3 patch: - Removed the related interfaces in NMStateStoreService; we can add them back if we find them useful in the future. - To handle the old version type (String) issue, renamed DB_SCHEMA_VERSION_KEY; if data cannot be loaded against the new key, treat it as new version type 1.0. - Still keeping the PBImpl there; we can improve/remove it if we find it not useful in the future. - Addressed [~vvasudev]'s comments above. > Data persisted in NM should be versioned > > > Key: YARN-2045 > URL: https://issues.apache.org/jira/browse/YARN-2045 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.4.1 >Reporter: Junping Du >Assignee: Junping Du > Attachments: YARN-2045-v2.patch, YARN-2045-v3.patch, YARN-2045.patch > > > As a split task from YARN-667, we want to add version info to NM related > data, include: > - NodeManager local LevelDB state > - NodeManager directory structure -- This message was sent by Atlassian JIRA (v6.2#6252)
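The renamed-key fallback described in the v3 patch notes can be sketched as follows. This uses a plain map as a stand-in for the LevelDB store; the key string and baseline default are assumptions for illustration, not the patch's actual values.

```java
import java.util.Map;

// Hedged sketch of the fallback: look up the schema version under the
// renamed key; if nothing is stored there (data written before the
// rename, i.e. the old String-typed version), treat the store as the
// baseline version 1.0.
final class SchemaVersionSketch {

    static final String DB_SCHEMA_VERSION_KEY = "nm-schema-version"; // illustrative name

    static String loadVersion(Map<String, String> store) {
        String v = store.get(DB_SCHEMA_VERSION_KEY);
        return (v != null) ? v : "1.0";
    }
}
```

The design point is that old state stores never carry the new key, so "key absent" unambiguously identifies pre-rename data without having to parse the old value format.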