[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148834#comment-14148834 ] Zhijie Shen commented on YARN-2468: --- The patch is generally good. Some minor comments and questions about the code.

1. Should the first one be @VisibleForTesting? And is the second one necessary?
{code}
- private static String getNodeString(NodeId nodeId) {
+ public static String getNodeString(NodeId nodeId) {
    return nodeId.toString().replace(":", "_");
  }
-
+
+ public static String getNodeString(String nodeId) {
+   return nodeId.replace(":", "_");
+ }
{code}

2. Add a TODO saying the test will be fixed in a followup Jira, in case we forget it?
{code}
+ @Ignore
  @Test
  public void testNoLogs() throws Exception {
{code}

3. Based on my understanding, uploadedFiles holds the candidate files to upload? If so, can we rename the variable and related methods?
{code}
+ private Set uploadedFiles = new HashSet();
{code}

4. I assume this var is going to capture all the existing log files on HDFS, isn't it? If so, its computation seems to be problematic, because it doesn't exclude the files that should be excluded. And what's the effect on alreadyUploadedLogs?
{code}
+ private Set allExistingFileMeta = new HashSet();
{code}
{code}
Iterable<String> mask = Iterables.filter(alreadyUploadedLogs, new Predicate<String>() {
  @Override
  public boolean apply(String next) {
    return currentExistingLogFiles.contains(next);
  }
});
{code}

5. Make the old LogValue constructor based on the new one?

6. LogValue.write does not need to be changed?

7. It's recommended to close Closeable objects via IOUtils, but it seems that AggregatedLogFormat already had this issue before. Let's file a separate ticket for it.
{code}
+ if (this.fsDataOStream != null) {
+   this.fsDataOStream.close();
+ }
{code}

8. nodeId seems to be of no use. No need to pass it into AppLogAggregatorImpl.
{code}
+ private final NodeId nodeId;
{code}

9.
remoteNodeLogDirForApp doesn't affect remoteNodeTmpLogFileForApp, which only depends on remoteNodeLogFileForApp. remoteNodeLogFileForApp is determined at construction, so remoteNodeTmpLogFileForApp should be final and computed once in the constructor as well. And the constructor param remoteNodeLogDirForApp should be renamed back to remoteNodeLogFileForApp.
{code}
- private final Path remoteNodeTmpLogFileForApp;
+ private Path remoteNodeTmpLogFileForApp;
{code}
{code}
- private Path getRemoteNodeTmpLogFileForApp() {
+ private Path getRemoteNodeTmpLogFileForApp(Path remoteNodeLogDirForApp) {
    return new Path(remoteNodeLogFileForApp.getParent(),
-     (remoteNodeLogFileForApp.getName() + TMP_FILE_SUFFIX));
+     (remoteNodeLogFileForApp.getName() + LogAggregationUtils.TMP_FILE_SUFFIX));
  }
{code}

10. One typo:
{code}
// if any of the previous uoloaded logs have been deleted,
{code}

11. One question: if uploading one file fails in LogValue.write(), uploadedFiles will not reflect the missing file, and it will not be uploaded again?

> Log handling for LRS > > > Key: YARN-2468 > URL: https://issues.apache.org/jira/browse/YARN-2468 > Project: Hadoop YARN > Issue Type: Sub-task > Components: log-aggregation, nodemanager, resourcemanager >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, > YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, > YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, > YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, > YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, > YARN-2468.7.1.patch, YARN-2468.7.patch > > > Currently, when application is finished, NM will start to do the log > aggregation. But for Long running service applications, this is not ideal. > The problems we have are: > 1) LRS applications are expected to run for a long time (weeks, months).
> 2) Currently, all the container logs (from one NM) will be written into a > single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
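The mask computation questioned in comment 4 above (which already-uploaded log names still exist on disk) can be sketched self-contained. This is an illustrative example only, not code from the patch: it uses `java.util.stream` in place of Guava's `Iterables.filter`, and the class and method names (`UploadMaskSketch`, `mask`) are invented for the sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

public class UploadMaskSketch {
    /**
     * Returns the already-uploaded log names that still exist on disk
     * (the "mask" from comment 4). Names that were uploaded but have
     * since been deleted drop out of the result.
     */
    static Set<String> mask(Set<String> alreadyUploadedLogs,
                            Set<String> currentExistingLogFiles) {
        return alreadyUploadedLogs.stream()
                .filter(currentExistingLogFiles::contains)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        Set<String> uploaded = new HashSet<>(Arrays.asList("syslog", "stderr"));
        Set<String> existing = new HashSet<>(Arrays.asList("syslog", "stdout"));
        // "stderr" was uploaded but has since been deleted, so it is masked out
        System.out.println(mask(uploaded, existing)); // [syslog]
    }
}
```

As the reviewer notes, a correct version would additionally have to subtract the files excluded by the include/exclude patterns before treating the remainder as upload candidates.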
[jira] [Commented] (YARN-1051) YARN Admission Control/Planner: enhancing the resource allocation model with time.
[ https://issues.apache.org/jira/browse/YARN-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148765#comment-14148765 ] Hadoop QA commented on YARN-1051: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671361/YARN-1051.1.patch against trunk revision f435724. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 21 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.mapreduce.lib.input.TestMRCJCFileInputFormat {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5139//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5139//console This message is automatically generated. > YARN Admission Control/Planner: enhancing the resource allocation model with > time. 
> -- > > Key: YARN-1051 > URL: https://issues.apache.org/jira/browse/YARN-1051 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, resourcemanager, scheduler >Reporter: Carlo Curino >Assignee: Carlo Curino > Attachments: YARN-1051-design.pdf, YARN-1051.1.patch, > YARN-1051.patch, curino_MSR-TR-2013-108.pdf, techreport.pdf > > > In this umbrella JIRA we propose to extend the YARN RM to handle time > explicitly, allowing users to "reserve" capacity over time. This is an > important step towards SLAs, long-running services, workflows, and helps for > gang scheduling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields
[ https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148734#comment-14148734 ] Hadoop QA commented on YARN-668: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671371/YARN-668-v9.patch against trunk revision e96ce6f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 11 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5141//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5141//console This message is automatically generated. 
> TokenIdentifier serialization should consider Unknown fields > > > Key: YARN-668 > URL: https://issues.apache.org/jira/browse/YARN-668 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Siddharth Seth >Assignee: Junping Du >Priority: Blocker > Attachments: YARN-668-demo.patch, YARN-668-v2.patch, > YARN-668-v3.patch, YARN-668-v4.patch, YARN-668-v5.patch, YARN-668-v6.patch, > YARN-668-v7.patch, YARN-668-v8.patch, YARN-668-v9.patch, YARN-668.patch > > > This would allow changing of the TokenIdentifier between versions. The > current serialization is Writable. A simple way to achieve this would be to > have a Proto object as the payload for TokenIdentifiers, instead of > individual fields. > TokenIdentifier continues to implement Writable to work with the RPC layer - > but the payload itself is serialized using PB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
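The round-trip idea in the description above — keep Writable for the RPC layer but carry the fields as one opaque serialized payload so that unknown fields survive — can be sketched without Hadoop or protobuf. Everything here is illustrative: the class `OpaqueTokenIdentifier` and its methods are invented for the sketch, and a plain byte array stands in for the PB message the actual patch uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Mimics a TokenIdentifier whose Writable form is a single length-prefixed
// payload. An old reader keeps the raw bytes it cannot interpret, so
// re-serializing preserves fields added by a newer version.
public class OpaqueTokenIdentifier {
    private byte[] payload = new byte[0]; // would be a PB message in YARN-668

    public void setPayload(byte[] p) { payload = p.clone(); }
    public byte[] getPayload() { return payload.clone(); }

    public void write(DataOutput out) throws IOException {
        out.writeInt(payload.length); // length prefix
        out.write(payload);           // opaque blob, passed through verbatim
    }

    public void readFields(DataInput in) throws IOException {
        payload = new byte[in.readInt()];
        in.readFully(payload);        // kept even if this version can't parse it
    }

    // Serialize and deserialize once, returning the payload that survives.
    static byte[] roundTrip(byte[] payload) {
        try {
            OpaqueTokenIdentifier id = new OpaqueTokenIdentifier();
            id.setPayload(payload);
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            id.write(new DataOutputStream(bos));
            OpaqueTokenIdentifier copy = new OpaqueTokenIdentifier();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
            return copy.getPayload();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(new String(roundTrip("fields-from-a-newer-version".getBytes())));
    }
}
```

The point of the design is that the intermediate reader never parses the blob, so it cannot drop fields it does not know about, which is exactly what field-by-field Writable serialization cannot guarantee.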
[jira] [Commented] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148712#comment-14148712 ] Hadoop QA commented on YARN-2179: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671373/YARN-2179-trunk-v8.patch against trunk revision e96ce6f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5142//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5142//console This message is automatically generated. 
> Initial cache manager structure and context > --- > > Key: YARN-2179 > URL: https://issues.apache.org/jira/browse/YARN-2179 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chris Trezzo >Assignee: Chris Trezzo > Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, > YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, > YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch > > > Implement the initial shared cache manager structure and context. The > SCMContext will be used by a number of manager services (i.e. the backing > store and the cleaner service). The AppChecker is used to gather the > currently running applications on SCM startup (necessary for an scm that is > backed by an in-memory store). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148696#comment-14148696 ] Wangda Tan commented on YARN-2594: -- I think the previously uploaded patch can still solve the problem. Eliminating the read lock in thread#2 will keep thread#2 from being blocked by the pending writeLock, and thread#2 will release the synchronized lock which thread#1 waits for, so thread#1 can continue too. After that, thread#3 can finally acquire the write lock. > Potential deadlock in RM when querying ApplicationResourceUsageReport > - > > Key: YARN-2594 > URL: https://issues.apache.org/jira/browse/YARN-2594 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Blocker > Attachments: YARN-2594.patch > > > ResourceManager sometimes becomes unresponsive: > There was an exception in the ResourceManager log, which contains only the following > type of messages: > {code} > 2014-09-19 19:13:45,241 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000 > 2014-09-19 19:30:26,312 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000 > 2014-09-19 19:47:07,351 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000 > 2014-09-19 20:03:48,460 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000 > 2014-09-19 20:20:29,542 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000 > 2014-09-19 20:37:10,635 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000 > 2014-09-19 20:53:51,722 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-2179: --- Attachment: YARN-2179-trunk-v8.patch [~kasha] [~vinodkv] Attached is v8. This latest patch addresses the most recent comments from Karthik. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148676#comment-14148676 ] Wangda Tan commented on YARN-2594: -- [~zxu], Thanks for the explanation, it's very helpful; now I understand how a pending write lock can block a read lock. I've created a test program:
{code}
package sandbox;

import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock.ReadLock;
import java.util.concurrent.locks.ReentrantReadWriteLock.WriteLock;

public class Tester {
  private static class ReadThread implements Runnable {
    private String name;
    private ReadLock readLock;

    ReadThread(String name, ReadLock readLock) {
      this.name = name;
      this.readLock = readLock;
    }

    @Override
    public void run() {
      System.out.println("try lock read - " + name);
      readLock.lock();
      System.out.println("lock read - " + name);
    }
  }

  private static class WriteThread implements Runnable {
    private String name;
    private WriteLock writeLock;

    WriteThread(String name, WriteLock writeLock) {
      this.name = name;
      this.writeLock = writeLock;
    }

    @Override
    public void run() {
      System.out.println("try lock write - " + name);
      writeLock.lock();
      System.out.println("lock write - " + name);
    }
  }

  public static void main(String[] args) throws InterruptedException {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    ReadLock readLock = lock.readLock();
    WriteLock writeLock = lock.writeLock();
    Thread r1 = new Thread(new ReadThread("1", readLock));
    Thread r2 = new Thread(new ReadThread("2", readLock));
    Thread w = new Thread(new WriteThread("3", writeLock));
    r1.start();
    Thread.sleep(100);
    w.start();
    Thread.sleep(100);
    r2.start();
  }
}
{code}
Exactly as you described, a waiting write lock will block subsequent read locks to avoid starvation.
[jira] [Updated] (YARN-668) TokenIdentifier serialization should consider Unknown fields
[ https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-668: Attachment: YARN-668-v9.patch Fix test failures in v9 patch. > TokenIdentifier serialization should consider Unknown fields > > > Key: YARN-668 > URL: https://issues.apache.org/jira/browse/YARN-668 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Siddharth Seth >Assignee: Junping Du >Priority: Blocker > Attachments: YARN-668-demo.patch, YARN-668-v2.patch, > YARN-668-v3.patch, YARN-668-v4.patch, YARN-668-v5.patch, YARN-668-v6.patch, > YARN-668-v7.patch, YARN-668-v8.patch, YARN-668-v9.patch, YARN-668.patch > > > This would allow changing of the TokenIdentifier between versions. The > current serialization is Writable. A simple way to achieve this would be to > have a Proto object as the payload for TokenIdentifiers, instead of > individual fields. > TokenIdentifier continues to implement Writable to work with the RPC layer - > but the payload itself is serialized using PB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148665#comment-14148665 ] zhihai xu commented on YARN-2594: - The [ReentrantReadWriteLock | http://tutorials.jenkov.com/java-util-concurrent/readwritelock.html] implementation is:
{code}
Read Lock: granted if no threads have locked the ReadWriteLock for writing,
and no thread has requested a write lock (but not yet obtained it).
Thus, multiple threads can lock the lock for reading.

Write Lock: granted if no threads are reading or writing.
Thus, only one thread at a time can lock the lock for writing.
{code}
Based on the above information, the first three threads can cause a deadlock. The readLock is first acquired by thread#1, then thread#3 is blocked waiting for the writeLock; finally, when thread#2 tries to acquire the readLock, thread#2 is also blocked because thread#3 requested the writeLock before thread#2. So this is not a bug in Java. The following is the source code in ReentrantReadWriteLock.java:
{code}
static final class NonfairSync extends Sync {
    private static final long serialVersionUID = -8159625535654395037L;
    final boolean writerShouldBlock() {
        return false; // writers can always barge
    }
    final boolean readerShouldBlock() {
        /* As a heuristic to avoid indefinite writer starvation,
         * block if the thread that momentarily appears to be head
         * of queue, if one exists, is a waiting writer. This is
         * only a probabilistic effect since a new reader will not
         * block if there is a waiting writer behind other enabled
         * readers that have not yet drained from the queue.
         */
        return apparentlyFirstQueuedIsExclusive();
    }
}
{code}
readerShouldBlock checks whether any thread requested the writeLock before it.
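The blocking behavior described above can be demonstrated deterministically by constructing the lock in fair mode (`new ReentrantReadWriteLock(true)`); the non-fair heuristic quoted from `NonfairSync` usually behaves the same way in practice. This is a minimal illustrative sketch, not code from the patch; the class and method names are invented.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class QueuedWriterBlocksReader {
    // Whether a *new* reader can acquire the read lock while another thread
    // holds it for reading and a writer is already queued behind that reader.
    static boolean secondReaderCanAcquire() {
        try {
            ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true); // fair
            lock.readLock().lock();                        // thread#1: holds the read lock

            Thread writer = new Thread(() -> {             // thread#3: queues for the write lock
                lock.writeLock().lock();
                lock.writeLock().unlock();
            });
            writer.start();
            while (!lock.hasQueuedThreads()) Thread.yield(); // wait until the writer is parked

            boolean[] got = new boolean[1];
            Thread reader2 = new Thread(() -> {            // thread#2: arrives behind the writer
                try {
                    got[0] = lock.readLock().tryLock(200, TimeUnit.MILLISECONDS);
                    if (got[0]) lock.readLock().unlock();
                } catch (InterruptedException ignored) { }
            });
            reader2.start();
            reader2.join();

            lock.readLock().unlock();                      // release thread#1, letting the writer run
            writer.join();
            return got[0];
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second reader acquired: " + secondReaderCanAcquire()); // false
    }
}
```

The timed `tryLock` honors the fairness setting (unlike the untimed `tryLock()`, which barges), so the second reader reliably times out behind the queued writer.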
[jira] [Commented] (YARN-2610) Hamlet doesn't close table tags
[ https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148660#comment-14148660 ] Hadoop QA commented on YARN-2610: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671365/YARN-2610-02.patch against trunk revision f435724. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5140//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5140//console This message is automatically generated. > Hamlet doesn't close table tags > --- > > Key: YARN-2610 > URL: https://issues.apache.org/jira/browse/YARN-2610 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Ray Chiang >Assignee: Ray Chiang > Labels: supportability > Attachments: YARN-2610-01.patch, YARN-2610-02.patch > > > Revisiting a subset of MAPREDUCE-2993. > The table tags are not configured to close > properly in Hamlet. While this is allowed in HTML 4.01, missing closing > table tags tend to wreak havoc with a lot of HTML processors (although not > usually browsers).
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148646#comment-14148646 ] Wangda Tan commented on YARN-2594: -- [~kasha], Thanks for your comments; we should definitely reduce synchronized locking, but this problem does not seem to be caused by that. I had a discussion with Jian He, and we found 4 suspicious threads. Threads #2/#4 try to acquire the read lock but fail, while at the same time *no write lock is held by anyone* (thread#3 is waiting for the write lock). This looks more like a Java bug to me. Following are links describing that bug; some other people claim it is not yet fixed. 1) Java bug description: http://webcache.googleusercontent.com/search?q=cache:fjM5oxWzmCsJ:bugs.java.com/view_bug.do%3Fbug_id%3D6822370+&cd=1&hl=en&ct=clnk&gl=hk 2) People report the bug still occurs: http://cs.oswego.edu/pipermail/concurrency-interest/2010-September/007413.html Thoughts? Following are threads #1-#4.
*Thread#1*
{code}
"IPC Server handler 45 on 8032" daemon prio=10 tid=0x7f032909b000 nid=0x7bd7 waiting for monitor entry [0x7f0307aa9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceUsageReport(SchedulerApplicationAttempt.java:541)
	- waiting to lock <0xe0e7ea70> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getAppResourceUsageReport(AbstractYarnScheduler.java:196)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.getApplicationResourceUsageReport(RMAppAttemptImpl.java:703)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.createAndGetApplicationReport(RMAppImpl.java:569)
	at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:294)
	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
	at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
{code}
*Thread#2*
{code}
"ResourceManager Event Processor" prio=10 tid=0x7f0328db9800 nid=0x7aeb waiting on condition [0x7f0311a48000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0xe0e72bc0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:964)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
	at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:731)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.getCurrentAppAttempt(RMAppImpl.java:476)
	at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.updateAttemptMetrics(RMContainerImpl.java:509)
	at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:495)
	at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:484)
	at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
	- locked <0xe0e85318> (a org.apache.hadoop.yarn.state.S
[jira] [Commented] (YARN-2523) ResourceManager UI showing negative value for "Decommissioned Nodes" field
[ https://issues.apache.org/jira/browse/YARN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148636#comment-14148636 ] Rohith commented on YARN-2523: -- Thanks [~jianhe] and [~jlowe] for review and committing this:-) > ResourceManager UI showing negative value for "Decommissioned Nodes" field > -- > > Key: YARN-2523 > URL: https://issues.apache.org/jira/browse/YARN-2523 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, webapp >Affects Versions: 3.0.0 >Reporter: Nishan Shetty >Assignee: Rohith > Fix For: 2.6.0 > > Attachments: YARN-2523.1.patch, YARN-2523.2.patch, YARN-2523.patch, > YARN-2523.patch > > > 1. Decommission one NodeManager by configuring ip in excludehost file > 2. Remove ip from excludehost file > 3. Execute -refreshNodes command and restart Decommissioned NodeManager > Observe that in RM UI negative value for "Decommissioned Nodes" field is shown -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2610) Hamlet doesn't close table tags
[ https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-2610: - Attachment: YARN-2610-02.patch Fixes for unit tests that don't expect closing table tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1051) YARN Admission Control/Planner: enhancing the resource allocation model with time.
[ https://issues.apache.org/jira/browse/YARN-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-1051: - Attachment: YARN-1051.1.patch Attaching a patch with the [fixes | https://issues.apache.org/jira/browse/YARN-2611?focusedCommentId=14148604] from YARN-2611. * MAPREDUCE-6094 is already tracking the fix for the _TestMRCJCFileInputFormat.testAddInputPath()_ test case failure * MAPREDUCE-6048 has been opened for the intermittent failure of _TestJavaSerialization_ > YARN Admission Control/Planner: enhancing the resource allocation model with > time. > -- > > Key: YARN-1051 > URL: https://issues.apache.org/jira/browse/YARN-1051 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, resourcemanager, scheduler >Reporter: Carlo Curino >Assignee: Carlo Curino > Attachments: YARN-1051-design.pdf, YARN-1051.1.patch, > YARN-1051.patch, curino_MSR-TR-2013-108.pdf, techreport.pdf > > > In this umbrella JIRA we propose to extend the YARN RM to handle time > explicitly, allowing users to "reserve" capacity over time. This is an > important step towards SLAs, long-running services, workflows, and helps for > gang scheduling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2611) Fix jenkins findbugs warning and test case failures for trunk merge patch
[ https://issues.apache.org/jira/browse/YARN-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-2611: - Attachment: YARN-2611.patch Attaching a patch that fixes the findbugs warnings and TestRMWebServicesCapacitySched. TestJavaSerialization runs successfully on my machine, so the failure must be an intermittent one. TestMRCJCFileInputFormat fails on my machine on both the branch and trunk, and the error looks unrelated to our patch. > Fix jenkins findbugs warning and test case failures for trunk merge patch > - > > Key: YARN-2611 > URL: https://issues.apache.org/jira/browse/YARN-2611 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, resourcemanager, scheduler >Reporter: Subru Krishnan >Assignee: Subru Krishnan > Attachments: YARN-2611.patch > > > This JIRA is to fix jenkins findbugs warnings and test case failures for > trunk merge patch as [reported | > https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=14148506] in > YARN-1051 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2611) Fix jenkins findbugs warning and test case failures for trunk merge patch
[ https://issues.apache.org/jira/browse/YARN-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-2611: - Description: This JIRA is to fix jenkins findbugs warnings and test case failures for trunk merge patch as [reported | https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=14148506] in YARN-1051 (was: This JIRA is to https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=14148506) > Fix jenkins findbugs warning and test case failures for trunk merge patch > - > > Key: YARN-2611 > URL: https://issues.apache.org/jira/browse/YARN-2611 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, resourcemanager, scheduler >Reporter: Subru Krishnan >Assignee: Subru Krishnan > > This JIRA is to fix jenkins findbugs warnings and test case failures for > trunk merge patch as [reported | > https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=14148506] in > YARN-1051 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2611) Fix jenkins findbugs warning and test case failures for trunk merge patch
[ https://issues.apache.org/jira/browse/YARN-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-2611: - Description: This JIRA is to https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=14148506 (was: This JIRA is to track the changes required to ensure branch yarn-1051 is ready to be merged with trunk. This includes fixing any compilation issues, findbug and/or javadoc warning, test cases failures, etc if any.) > Fix jenkins findbugs warning and test case failures for trunk merge patch > - > > Key: YARN-2611 > URL: https://issues.apache.org/jira/browse/YARN-2611 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, resourcemanager, scheduler >Reporter: Subru Krishnan >Assignee: Subru Krishnan > > This JIRA is to > https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=14148506 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2611) Fix jenkins findbugs warning and test case failures for trunk merge patch
Subru Krishnan created YARN-2611: Summary: Fix jenkins findbugs warning and test case failures for trunk merge patch Key: YARN-2611 URL: https://issues.apache.org/jira/browse/YARN-2611 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager, scheduler Reporter: Subru Krishnan Assignee: Subru Krishnan This JIRA is to track the changes required to ensure branch yarn-1051 is ready to be merged with trunk. This includes fixing any compilation issues, findbug and/or javadoc warning, test cases failures, etc if any. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2610) Hamlet doesn't close table tags
[ https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148569#comment-14148569 ] Hadoop QA commented on YARN-2610: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671350/YARN-2610-01.patch against trunk revision e9c37de. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common: org.apache.hadoop.yarn.webapp.hamlet.TestHamlet org.apache.hadoop.yarn.webapp.view.TestInfoBlock {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5138//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5138//console This message is automatically generated. > Hamlet doesn't close table tags > --- > > Key: YARN-2610 > URL: https://issues.apache.org/jira/browse/YARN-2610 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Ray Chiang >Assignee: Ray Chiang > Labels: supportability > Attachments: YARN-2610-01.patch > > > Revisiting a subset of MAPREDUCE-2993. 
> The table tags are not configured to close > properly in Hamlet. While this is allowed in HTML 4.01, missing closing > table tags tend to wreak havoc with a lot of HTML processors (although not > usually browsers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2608) FairScheduler: Potential deadlocks in loading alloc files and clock access
[ https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148562#comment-14148562 ] Hudson commented on YARN-2608: -- FAILURE: Integrated in Hadoop-trunk-Commit #6115 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6115/]) YARN-2608. FairScheduler: Potential deadlocks in loading alloc files and clock access. (Wei Yan via kasha) (kasha: rev f4357240a6f81065d91d5f443ed8fc8cd2a14a8f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/CHANGES.txt > FairScheduler: Potential deadlocks in loading alloc files and clock access > -- > > Key: YARN-2608 > URL: https://issues.apache.org/jira/browse/YARN-2608 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wei Yan >Assignee: Wei Yan > Fix For: 2.6.0 > > Attachments: YARN-2608-1.patch, YARN-2608-2.patch, YARN-2608-3.patch > > > Two potential deadlocks exist inside the FairScheduler. > 1. AllocationFileLoaderService would reload the queue configuration, which > calls FairScheduler.AllocationReloadListener.onReload() function. And require > *FairScheduler's lock*; > {code} > public void onReload(AllocationConfiguration queueInfo) { > synchronized (FairScheduler.this) { > > } > } > {code} > after that, it would require the *QueueManager's queues lock*. > {code} > private FSQueue getQueue(String name, boolean create, FSQueueType > queueType) { > name = ensureRootPrefix(name); > synchronized (queues) { > > } > } > {code} > Another thread FairScheduler.assignToQueue may also need to create a new > queue when a new job submitted. This thread would hold the *QueueManager's > queues lock* firstly, and then would like to hold the *FairScheduler's lock* > as it needs to call FairScheduler.getClock() function when creating a new > FSLeafQueue. Deadlock may happen here. > 2. 
The AllocationFileLoaderService holds *AllocationFileLoaderService's > lock* first, and then waits for *FairScheduler's lock*. Another thread (like > AdminService.refreshQueues) may call FairScheduler's reinitialize function, > which holds *FairScheduler's lock* first, and then waits for > *AllocationFileLoaderService's lock*. Deadlock may happen here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
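The inverse-order lock acquisitions described above can be reduced to a minimal, self-contained sketch (the class and method names here are hypothetical stand-ins, not the actual FairScheduler code). The standard remedy, which is what the patch effectively establishes, is to make every thread take the two monitors in the same order so no cycle can form:

```java
import java.util.concurrent.CountDownLatch;

public class LockOrderSketch {
    // Hypothetical stand-ins for FairScheduler's monitor and QueueManager's queues lock.
    private final Object schedulerLock = new Object();
    private final Object queuesLock = new Object();

    // onReload path: scheduler lock first, then queues lock (as in the report).
    void reload() {
        synchronized (schedulerLock) {
            synchronized (queuesLock) { /* update queue configuration */ }
        }
    }

    // assignToQueue path, fixed: same order as reload(), so no cycle is possible.
    // The buggy version took queuesLock first and then needed schedulerLock.
    void assignToQueue() {
        synchronized (schedulerLock) {
            synchronized (queuesLock) { /* create an FSLeafQueue using the clock */ }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        LockOrderSketch s = new LockOrderSketch();
        CountDownLatch done = new CountDownLatch(2);
        Thread t1 = new Thread(() -> { for (int i = 0; i < 10_000; i++) s.reload(); done.countDown(); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 10_000; i++) s.assignToQueue(); done.countDown(); });
        t1.start(); t2.start();
        done.await(); // completes only because both threads use one lock order
        System.out.println("no deadlock");
    }
}
```

If `assignToQueue()` instead synchronized on `queuesLock` first, the two loops could block each other permanently, which is the hang reported in this issue.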
[jira] [Updated] (YARN-2608) FairScheduler: Potential deadlocks in loading alloc files and clock
[ https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2608: --- Summary: FairScheduler: Potential deadlocks in loading alloc files and clock (was: FairScheduler may hung due to two potential deadlocks) > FairScheduler: Potential deadlocks in loading alloc files and clock > --- > > Key: YARN-2608 > URL: https://issues.apache.org/jira/browse/YARN-2608 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wei Yan >Assignee: Wei Yan > Attachments: YARN-2608-1.patch, YARN-2608-2.patch, YARN-2608-3.patch > > > Two potential deadlocks exist inside the FairScheduler. > 1. AllocationFileLoaderService would reload the queue configuration, which > calls FairScheduler.AllocationReloadListener.onReload() function. And require > *FairScheduler's lock*; > {code} > public void onReload(AllocationConfiguration queueInfo) { > synchronized (FairScheduler.this) { > > } > } > {code} > after that, it would require the *QueueManager's queues lock*. > {code} > private FSQueue getQueue(String name, boolean create, FSQueueType > queueType) { > name = ensureRootPrefix(name); > synchronized (queues) { > > } > } > {code} > Another thread FairScheduler.assignToQueue may also need to create a new > queue when a new job submitted. This thread would hold the *QueueManager's > queues lock* firstly, and then would like to hold the *FairScheduler's lock* > as it needs to call FairScheduler.getClock() function when creating a new > FSLeafQueue. Deadlock may happen here. > 2. The AllocationFileLoaderService holds *AllocationFileLoaderService's > lock* first, and then waits for *FairScheduler's lock*. Another thread (like > AdminService.refreshQueues) may call FairScheduler's reinitialize function, > which holds *FairScheduler's lock* first, and then waits for > *AllocationFileLoaderService's lock*. Deadlock may happen here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2608) FairScheduler: Potential deadlocks in loading alloc files and clock access
[ https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2608: --- Summary: FairScheduler: Potential deadlocks in loading alloc files and clock access (was: FairScheduler: Potential deadlocks in loading alloc files and clock) > FairScheduler: Potential deadlocks in loading alloc files and clock access > -- > > Key: YARN-2608 > URL: https://issues.apache.org/jira/browse/YARN-2608 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wei Yan >Assignee: Wei Yan > Attachments: YARN-2608-1.patch, YARN-2608-2.patch, YARN-2608-3.patch > > > Two potential deadlocks exist inside the FairScheduler. > 1. AllocationFileLoaderService would reload the queue configuration, which > calls FairScheduler.AllocationReloadListener.onReload() function. And require > *FairScheduler's lock*; > {code} > public void onReload(AllocationConfiguration queueInfo) { > synchronized (FairScheduler.this) { > > } > } > {code} > after that, it would require the *QueueManager's queues lock*. > {code} > private FSQueue getQueue(String name, boolean create, FSQueueType > queueType) { > name = ensureRootPrefix(name); > synchronized (queues) { > > } > } > {code} > Another thread FairScheduler.assignToQueue may also need to create a new > queue when a new job submitted. This thread would hold the *QueueManager's > queues lock* firstly, and then would like to hold the *FairScheduler's lock* > as it needs to call FairScheduler.getClock() function when creating a new > FSLeafQueue. Deadlock may happen here. > 2. The AllocationFileLoaderService holds *AllocationFileLoaderService's > lock* first, and then waits for *FairScheduler's lock*. Another thread (like > AdminService.refreshQueues) may call FairScheduler's reinitialize function, > which holds *FairScheduler's lock* first, and then waits for > *AllocationFileLoaderService's lock*. Deadlock may happen here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2608) FairScheduler may hung due to two potential deadlocks
[ https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148538#comment-14148538 ] Karthik Kambatla commented on YARN-2608: +1. Committing this. > FairScheduler may hung due to two potential deadlocks > - > > Key: YARN-2608 > URL: https://issues.apache.org/jira/browse/YARN-2608 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wei Yan >Assignee: Wei Yan > Attachments: YARN-2608-1.patch, YARN-2608-2.patch, YARN-2608-3.patch > > > Two potential deadlocks exist inside the FairScheduler. > 1. AllocationFileLoaderService would reload the queue configuration, which > calls FairScheduler.AllocationReloadListener.onReload() function. And require > *FairScheduler's lock*; > {code} > public void onReload(AllocationConfiguration queueInfo) { > synchronized (FairScheduler.this) { > > } > } > {code} > after that, it would require the *QueueManager's queues lock*. > {code} > private FSQueue getQueue(String name, boolean create, FSQueueType > queueType) { > name = ensureRootPrefix(name); > synchronized (queues) { > > } > } > {code} > Another thread FairScheduler.assignToQueue may also need to create a new > queue when a new job submitted. This thread would hold the *QueueManager's > queues lock* firstly, and then would like to hold the *FairScheduler's lock* > as it needs to call FairScheduler.getClock() function when creating a new > FSLeafQueue. Deadlock may happen here. > 2. The AllocationFileLoaderService holds *AllocationFileLoaderService's > lock* first, and then waits for *FairScheduler's lock*. Another thread (like > AdminService.refreshQueues) may call FairScheduler's reinitialize function, > which holds *FairScheduler's lock* first, and then waits for > *AllocationFileLoaderService's lock*. Deadlock may happen here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2610) Hamlet doesn't close table tags
[ https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-2610: - Attachment: YARN-2610-01.patch Turn on closing tags for HTML table formatting. > Hamlet doesn't close table tags > --- > > Key: YARN-2610 > URL: https://issues.apache.org/jira/browse/YARN-2610 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Ray Chiang >Assignee: Ray Chiang > Labels: supportability > Attachments: YARN-2610-01.patch > > > Revisiting a subset of MAPREDUCE-2993. > The , , , , tags are not configured to close > properly in Hamlet. While this is allowed in HTML 4.01, missing closing > table tags tends to wreak havoc with a lot of HTML processors (although not > usually browsers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2610) Hamlet doesn't close table tags
Ray Chiang created YARN-2610: Summary: Hamlet doesn't close table tags Key: YARN-2610 URL: https://issues.apache.org/jira/browse/YARN-2610 Project: Hadoop YARN Issue Type: Bug Reporter: Ray Chiang Assignee: Ray Chiang Revisiting a subset of MAPREDUCE-2993. The table tags are not configured to close properly in Hamlet. While this is allowed in HTML 4.01, missing closing table tags tend to wreak havoc with a lot of HTML processors (although not usually browsers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
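The "havoc" for strict HTML processors is easy to demonstrate (an illustrative sketch, not Hamlet code): a strict XML/XHTML parser rejects a table fragment whose tags are left open, while it accepts the same fragment once the tags are closed. Browsers tolerate both, which is why the bug goes unnoticed in the UI.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

public class UnclosedTableDemo {
    // Returns true if the fragment parses as well-formed XML/XHTML.
    static boolean parses(String html) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String unclosed = "<table><tr><td>cell";                    // what Hamlet emits today
        String closed   = "<table><tr><td>cell</td></tr></table>";  // what the patch produces
        System.out.println(parses(unclosed)); // false: premature end of document
        System.out.println(parses(closed));   // true: well-formed
    }
}
```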
[jira] [Commented] (YARN-2602) Generic History Service of TimelineServer sometimes not able to handle NPE
[ https://issues.apache.org/jira/browse/YARN-2602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148519#comment-14148519 ] Hadoop QA commented on YARN-2602: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671335/YARN-2602.1.patch against trunk revision 8269bfa. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5137//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5137//console This message is automatically generated. > Generic History Service of TimelineServer sometimes not able to handle NPE > -- > > Key: YARN-2602 > URL: https://issues.apache.org/jira/browse/YARN-2602 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.6.0 > Environment: ATS is running with AHS/GHS enabled to use TimelineStore. 
> Running for 4-5 days, with many random example jobs running >Reporter: Karam Singh >Assignee: Zhijie Shen > Attachments: YARN-2602.1.patch > > > ATS is running with AHS/GHS enabled to use TimelineStore. > Running for 4-5 day, with many random example jobs running . > When I ran WS API for AHS/GHS: > {code} > curl --negotiate -u : > 'http:///v1/applicationhistory/apps/application_1411579118376_0001' > {code} > It ran successfully. > However > {code} > curl --negotiate -u : > 'http:///ws/v1/applicationhistory/apps' > {"exception":"WebApplicationException","message":"java.lang.NullPointerException","javaClassName":"javax.ws.rs.WebApplicationException"} > {code} > Failed with Internal server error 500. > After looking at TimelineServer logs found that there was NPE: -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1051) YARN Admission Control/Planner: enhancing the resource allocation model with time.
[ https://issues.apache.org/jira/browse/YARN-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148506#comment-14148506 ] Hadoop QA commented on YARN-1051: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671311/YARN-1051.patch against trunk revision 9f9a222. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 20 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 8 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.mapreduce.lib.input.TestMRCJCFileInputFormat org.apache.hadoop.mapred.TestJavaSerialization org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. 
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5133//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5133//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5133//console This message is automatically generated. > YARN Admission Control/Planner: enhancing the resource allocation model with > time. > -- > > Key: YARN-1051 > URL: https://issues.apache.org/jira/browse/YARN-1051 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, resourcemanager, scheduler >Reporter: Carlo Curino >Assignee: Carlo Curino > Attachments: YARN-1051-design.pdf, YARN-1051.patch, > curino_MSR-TR-2013-108.pdf, techreport.pdf > > > In this umbrella JIRA we propose to extend the YARN RM to handle time > explicitly, allowing users to "reserve" capacity over time. This is an > important step towards SLAs, long-running services, workflows, and helps for > gang scheduling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2608) FairScheduler may hung due to two potential deadlocks
[ https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148496#comment-14148496 ] Hadoop QA commented on YARN-2608: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671328/YARN-2608-3.patch against trunk revision 8269bfa. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5136//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5136//console This message is automatically generated. > FairScheduler may hung due to two potential deadlocks > - > > Key: YARN-2608 > URL: https://issues.apache.org/jira/browse/YARN-2608 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wei Yan >Assignee: Wei Yan > Attachments: YARN-2608-1.patch, YARN-2608-2.patch, YARN-2608-3.patch > > > Two potential deadlocks exist inside the FairScheduler. > 1. 
AllocationFileLoaderService would reload the queue configuration, which > calls FairScheduler.AllocationReloadListener.onReload() function. And require > *FairScheduler's lock*; > {code} > public void onReload(AllocationConfiguration queueInfo) { > synchronized (FairScheduler.this) { > > } > } > {code} > after that, it would require the *QueueManager's queues lock*. > {code} > private FSQueue getQueue(String name, boolean create, FSQueueType > queueType) { > name = ensureRootPrefix(name); > synchronized (queues) { > > } > } > {code} > Another thread FairScheduler.assignToQueue may also need to create a new > queue when a new job submitted. This thread would hold the *QueueManager's > queues lock* firstly, and then would like to hold the *FairScheduler's lock* > as it needs to call FairScheduler.getClock() function when creating a new > FSLeafQueue. Deadlock may happen here. > 2. The AllocationFileLoaderService holds *AllocationFileLoaderService's > lock* first, and then waits for *FairScheduler's lock*. Another thread (like > AdminService.refreshQueues) may call FairScheduler's reinitialize function, > which holds *FairScheduler's lock* first, and then waits for > *AllocationFileLoaderService's lock*. Deadlock may happen here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148493#comment-14148493 ] Karthik Kambatla commented on YARN-2179: Comments: # Nit: YarnConfiguration - The definition of string constants corresponding to the config names are inconsistently indented. My personal preference is to put the value being assigned in the subsequent line if it does not all fit in one line. # AppChecker constructor that takes a name, should use that name. # In RemoteAppChecker, I would list all the known ACTIVE_STATES instead of using the complement: {code} private static final EnumSet ACTIVE_STATES = EnumSet.complementOf(EnumSet.of(YarnApplicationState.FINISHED, YarnApplicationState.FAILED, YarnApplicationState.KILLED)); {code} # SharedCacheManager: the following two lines should be moved to serviceInit(). We can get rid of serviceStart altogether. {code} DefaultMetricsSystem.initialize("SharedCacheManager"); JvmMetrics.initSingleton("SharedCacheManager", null); {code} > Initial cache manager structure and context > --- > > Key: YARN-2179 > URL: https://issues.apache.org/jira/browse/YARN-2179 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chris Trezzo >Assignee: Chris Trezzo > Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, > YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, > YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch > > > Implement the initial shared cache manager structure and context. The > SCMContext will be used by a number of manager services (i.e. the backing > store and the cleaner service). The AppChecker is used to gather the > currently running applications on SCM startup (necessary for an scm that is > backed by an in-memory store). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
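The reviewer's suggestion in item 3 can be sketched with a local stand-in enum (its values mirror the 2.x `YarnApplicationState` as assumed here; this is not the Hadoop class itself): listing the active states explicitly yields the same set as the complement today, but will not silently classify a future terminal state as active.

```java
import java.util.EnumSet;

public class ActiveStatesSketch {
    // Hypothetical stand-in for YarnApplicationState.
    enum State { NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED }

    public static void main(String[] args) {
        // The patch's form: everything that is not a terminal state.
        EnumSet<State> byComplement =
            EnumSet.complementOf(EnumSet.of(State.FINISHED, State.FAILED, State.KILLED));

        // The reviewer's form: name the active states explicitly.
        EnumSet<State> explicit = EnumSet.of(
            State.NEW, State.NEW_SAVING, State.SUBMITTED, State.ACCEPTED, State.RUNNING);

        // Identical today; the explicit set stays correct even if a new
        // terminal state is later added to the enum.
        System.out.println(byComplement.equals(explicit));
    }
}
```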
[jira] [Commented] (YARN-1963) Support priorities across applications within the same queue
[ https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148474#comment-14148474 ] Maysam Yabandeh commented on YARN-1963: --- Thanks [~sunilg] for the design doc. It might be useful if I share our use cases with you. Our most important use case is to let the admin change an app's priority at runtime while it is running. A typical example is a job that gets unlucky and takes much longer than usual due to some node failures or bugs. The user complains that the job is about to miss its deadline, and the admin needs a way to prioritize the user's job over the other jobs in the queue. This use case seems to be mentioned in Item 3 of Section 1.5.3 in the design doc, but its "priority" seems not to be high. Another use case is to dynamically give a job higher priority based on the job status. For example, when mappers fail and there is no headroom in the queue, the job preempts its reducers to make space for its mappers. However, the freed space is not necessarily offered back to the job in fair scheduling. Ideally, the job could increase its priority when its reducers are being stalled waiting for its mappers to be assigned. bq. Once all these requests of higher priority applications are served, then lower priority application requests will get served from Resource Manager. We are using the fair scheduler, and I assumed this jira also covers that, since YARN-2098 was created as a sub-task. The design doc, however, seems to be fairly centered around CapacityScheduler. In the case of the fair scheduler, I guess the priority can also be incorporated into the fair share calculation, instead of the strict order of high priority first. 
> Support priorities across applications within the same queue > - > > Key: YARN-1963 > URL: https://issues.apache.org/jira/browse/YARN-1963 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, resourcemanager >Reporter: Arun C Murthy >Assignee: Sunil G > Attachments: YARN Application Priorities Design.pdf > > > It will be very useful to support priorities among applications within the > same queue, particularly in production scenarios. It allows for finer-grained > controls without having to force admins to create a multitude of queues, plus > allows existing applications to continue using existing queues which are > usually part of institutional memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
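The commenter's idea of folding priority into the fair-share calculation, rather than strictly serving higher priorities first, can be sketched as priority-weighted shares. The weights and formula below are purely illustrative, not taken from the design doc or the FairScheduler code:

```java
public class WeightedShareSketch {
    // Split queue capacity among apps in proportion to priority weights,
    // instead of serving higher-priority apps strictly first.
    static double[] shares(double capacity, double[] priorityWeights) {
        double total = 0;
        for (double w : priorityWeights) total += w;
        double[] out = new double[priorityWeights.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = capacity * priorityWeights[i] / total;
        }
        return out;
    }

    public static void main(String[] args) {
        // Three apps with priority weights 1, 2, and 5 on an 80-container queue:
        // higher priority gets a larger share, but no app is starved outright.
        double[] s = shares(80, new double[]{1, 2, 5});
        System.out.printf("%.0f %.0f %.0f%n", s[0], s[1], s[2]);
    }
}
```

Under strict priority ordering, the weight-5 app would be served entirely before the others; the weighted form trades that for continuous shares.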
[jira] [Updated] (YARN-2320) Removing old application history store after we store the history data to timeline store
[ https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2320: -- Component/s: timelineserver > Removing old application history store after we store the history data to > timeline store > > > Key: YARN-2320 > URL: https://issues.apache.org/jira/browse/YARN-2320 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-2320.1.patch, YARN-2320.2.patch > > > After YARN-2033, we should deprecate application history store set. There's > no need to maintain two sets of store interfaces. In addition, we should > conclude the outstanding jira's under YARN-321 about the application history > store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2320) Removing old application history store after we store the history data to timeline store
[ https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2320: -- Target Version/s: 2.6.0 > Removing old application history store after we store the history data to > timeline store > > > Key: YARN-2320 > URL: https://issues.apache.org/jira/browse/YARN-2320 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-2320.1.patch, YARN-2320.2.patch > > > After YARN-2033, we should deprecate application history store set. There's > no need to maintain two sets of store interfaces. In addition, we should > conclude the outstanding jira's under YARN-321 about the application history > store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148452#comment-14148452 ] Wilfred Spiegelenburg commented on YARN-2578: - I proposed fixing the RPC code and setting the timeout by default in HDFS-4858, but there was no interest in fixing the client (at that point in time). So we now have to fix it everywhere, unless we can get everyone on board and get the behaviour changed in the RPC code. The comments are still in that JIRA, and it would be a straightforward fix in the RPC code. > NM does not failover timely if RM node network connection fails > --- > > Key: YARN-2578 > URL: https://issues.apache.org/jira/browse/YARN-2578 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.5.1 >Reporter: Wilfred Spiegelenburg > Attachments: YARN-2578.patch > > > The NM does not fail over correctly when the network cable of the RM is > unplugged or the failure is simulated by a "service network stop" or a > firewall that drops all traffic on the node. The RM fails over to the standby > node when the failure is detected as expected. The NM should then re-register > with the new active RM. This re-register takes a long time (15 minutes or > more). Until then the cluster has no nodes for processing and applications > are stuck. > Reproduction test case which can be used in any environment: > - create a cluster with 3 nodes > node 1: ZK, NN, JN, ZKFC, DN, RM, NM > node 2: ZK, NN, JN, ZKFC, DN, RM, NM > node 3: ZK, JN, DN, NM > - start all services make sure they are in good health > - kill the network connection of the RM that is active using one of the > network kills from above > - observe the NN and RM failover > - the DN's fail over to the new active NN > - the NM does not recover for a long time > - the logs show a long delay and traces show no change at all > The stack traces of the NM all show the same set of threads. 
The main thread > which should be used in the re-register is the "Node Status Updater" This > thread is stuck in: > {code} > "Node Status Updater" prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in > Object.wait() [0x7f5a51fc1000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0xed62f488> (a org.apache.hadoop.ipc.Client$Call) > at java.lang.Object.wait(Object.java:503) > at org.apache.hadoop.ipc.Client.call(Client.java:1395) > - locked <0xed62f488> (a org.apache.hadoop.ipc.Client$Call) > at org.apache.hadoop.ipc.Client.call(Client.java:1362) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) > at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source) > at > org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) > {code} > The client connection which goes through the proxy can be traced back to the > ResourceTrackerPBClientImpl. The generated proxy does not time out and we > should be using a version which takes the RPC timeout (from the > configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
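The behaviour the fix aims for — a heartbeat call that gives up after a bounded wait instead of blocking indefinitely in `Object.wait()` — can be illustrated with a self-contained sketch using plain `java.util.concurrent` (no Hadoop RPC APIs; the heartbeat callable is hypothetical):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedCall {

    // Bound the wait on a potentially blocking call, instead of the
    // indefinite wait the untimed proxy ends up in when the peer vanishes.
    static <T> T callWithTimeout(Callable<T> call, long timeoutMs) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            return executor.submit(call).get(timeoutMs, TimeUnit.MILLISECONDS);
        } finally {
            executor.shutdownNow(); // interrupt the call if it is still blocked
        }
    }

    public static void main(String[] args) throws Exception {
        // Fast path: the call answers well within the bound.
        System.out.println(callWithTimeout(() -> "heartbeat-ack", 1000));

        // Slow path: a hung connection now surfaces as a TimeoutException,
        // letting the caller retry against the new active RM.
        try {
            callWithTimeout(() -> { Thread.sleep(60_000); return "never"; }, 100);
        } catch (TimeoutException expected) {
            System.out.println("timed out instead of hanging");
        }
    }
}
```

The real fix would pass the configured RPC timeout into the generated proxy rather than wrapping calls like this, but the observable effect is the same: the caller regains control and can fail over.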
[jira] [Updated] (YARN-2602) Generic History Service of TimelineServer sometimes not able to handle NPE
[ https://issues.apache.org/jira/browse/YARN-2602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2602: -- Attachment: YARN-2602.1.patch Create a patch to fix the problem > Generic History Service of TimelineServer sometimes not able to handle NPE > -- > > Key: YARN-2602 > URL: https://issues.apache.org/jira/browse/YARN-2602 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.6.0 > Environment: ATS is running with AHS/GHS enabled to use TimelineStore. > Running for 4-5 days, with many random example jobs running >Reporter: Karam Singh >Assignee: Zhijie Shen > Attachments: YARN-2602.1.patch > > > ATS is running with AHS/GHS enabled to use TimelineStore. > Running for 4-5 day, with many random example jobs running . > When I ran WS API for AHS/GHS: > {code} > curl --negotiate -u : > 'http:///v1/applicationhistory/apps/application_1411579118376_0001' > {code} > It ran successfully. > However > {code} > curl --negotiate -u : > 'http:///ws/v1/applicationhistory/apps' > {"exception":"WebApplicationException","message":"java.lang.NullPointerException","javaClassName":"javax.ws.rs.WebApplicationException"} > {code} > Failed with Internal server error 500. > After looking at TimelineServer logs found that there was NPE: -- This message was sent by Atlassian JIRA (v6.3.4#6332)
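Independent of the attached patch, the general failure class here — one null field turning a listing endpoint into an HTTP 500 — is usually handled with defensive filtering. A hypothetical, non-Hadoop sketch of that pattern:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class AppsEndpoint {

    // Defensive listing: a null store result or a null record should
    // degrade to an empty/partial response, not surface as an NPE
    // that the web layer reports as an internal server error.
    static List<String> listApps(List<String> rawApps) {
        if (rawApps == null) {
            return Collections.emptyList();
        }
        List<String> apps = new ArrayList<>();
        for (String app : rawApps) {
            if (app != null) {
                apps.add(app);
            }
        }
        return apps;
    }

    public static void main(String[] args) {
        System.out.println(listApps(java.util.Arrays.asList("application_1411579118376_0001", null)));
        System.out.println(listApps(null)); // []
    }
}
```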
[jira] [Commented] (YARN-2608) FairScheduler may hung due to two potential deadlocks
[ https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148424#comment-14148424 ] Hadoop QA commented on YARN-2608: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671317/YARN-2608-2.patch against trunk revision 9f9a222. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 10 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5135//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5135//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5135//console This message is automatically generated. 
> FairScheduler may hung due to two potential deadlocks > - > > Key: YARN-2608 > URL: https://issues.apache.org/jira/browse/YARN-2608 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wei Yan >Assignee: Wei Yan > Attachments: YARN-2608-1.patch, YARN-2608-2.patch, YARN-2608-3.patch > > > Two potential deadlocks exist inside the FairScheduler. > 1. AllocationFileLoaderService would reload the queue configuration, which > calls FairScheduler.AllocationReloadListener.onReload() function. And require > *FairScheduler's lock*; > {code} > public void onReload(AllocationConfiguration queueInfo) { > synchronized (FairScheduler.this) { > > } > } > {code} > after that, it would require the *QueueManager's queues lock*. > {code} > private FSQueue getQueue(String name, boolean create, FSQueueType > queueType) { > name = ensureRootPrefix(name); > synchronized (queues) { > > } > } > {code} > Another thread FairScheduler.assignToQueue may also need to create a new > queue when a new job submitted. This thread would hold the *QueueManager's > queues lock* firstly, and then would like to hold the *FairScheduler's lock* > as it needs to call FairScheduler.getClock() function when creating a new > FSLeafQueue. Deadlock may happen here. > 2. The AllocationFileLoaderService holds *AllocationFileLoaderService's > lock* first, and then waits for *FairScheduler's lock*. Another thread (like > AdminService.refreshQueues) may call FairScheduler's reinitialize function, > which holds *FairScheduler's lock* first, and then waits for > *AllocationFileLoaderService's lock*. Deadlock may happen here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
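The standard cure for the first deadlock described above is a single global lock order. A minimal standalone sketch (illustrative names, not the actual FairScheduler classes) in which both code paths acquire the scheduler lock before the queues lock, so the circular wait cannot form:

```java
public class LockOrdering {

    private final Object schedulerLock = new Object(); // stands in for FairScheduler.this
    private final Object queuesLock = new Object();    // stands in for QueueManager's queues map
    private int updates;

    // Reload path: schedulerLock, then queuesLock.
    void onReload() {
        synchronized (schedulerLock) {
            synchronized (queuesLock) {
                updates++; // apply reloaded queue configuration
            }
        }
    }

    // Assignment path: same order — schedulerLock first, matching onReload(),
    // rather than taking queuesLock and then needing the scheduler lock.
    void assignToQueue() {
        synchronized (schedulerLock) {
            synchronized (queuesLock) {
                updates++; // create the queue if missing
            }
        }
    }

    int updates() { return updates; }

    public static void main(String[] args) throws InterruptedException {
        LockOrdering s = new LockOrdering();
        Thread reloader = new Thread(() -> { for (int i = 0; i < 10_000; i++) s.onReload(); });
        Thread assigner = new Thread(() -> { for (int i = 0; i < 10_000; i++) s.assignToQueue(); });
        reloader.start(); assigner.start();
        reloader.join(); assigner.join();
        System.out.println("completed without deadlock: " + s.updates()); // 20000
    }
}
```

An alternative, which the actual patches lean toward, is to shrink the critical sections so one path never needs both locks at once; either way the invariant is that no thread ever waits for lock A while holding lock B if another thread can do the reverse.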
[jira] [Commented] (YARN-2523) ResourceManager UI showing negative value for "Decommissioned Nodes" field
[ https://issues.apache.org/jira/browse/YARN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148422#comment-14148422 ] Hudson commented on YARN-2523: -- FAILURE: Integrated in Hadoop-trunk-Commit #6113 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6113/]) YARN-2523. ResourceManager UI showing negative value for "Decommissioned Nodes" field. Contributed by Rohith (jlowe: rev 8269bfa613999f71767de3c0369817b58cfe1416) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java > ResourceManager UI showing negative value for "Decommissioned Nodes" field > -- > > Key: YARN-2523 > URL: https://issues.apache.org/jira/browse/YARN-2523 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, webapp >Affects Versions: 3.0.0 >Reporter: Nishan Shetty >Assignee: Rohith > Fix For: 2.6.0 > > Attachments: YARN-2523.1.patch, YARN-2523.2.patch, YARN-2523.patch, > YARN-2523.patch > > > 1. Decommission one NodeManager by configuring ip in excludehost file > 2. Remove ip from excludehost file > 3. Execute -refreshNodes command and restart Decommissioned NodeManager > Observe that in RM UI negative value for "Decommissioned Nodes" field is shown -- This message was sent by Atlassian JIRA (v6.3.4#6332)
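One way to make a counter like "Decommissioned Nodes" robust against the event sequence in the reproduction steps — tracking the set of decommissioned nodes instead of incrementing and decrementing a raw integer — can be sketched as follows (hypothetical illustration, not the NodesListManager code):

```java
import java.util.HashSet;
import java.util.Set;

public class DecommissionCount {

    private final Set<String> decommissioned = new HashSet<>();

    // Keeping the actual set means a re-register or refresh event for a
    // node that was never counted (or was already removed) is a no-op,
    // so the reported value can never go negative.
    void nodeDecommissioned(String nodeId) { decommissioned.add(nodeId); }

    void nodeRecommissioned(String nodeId) { decommissioned.remove(nodeId); }

    int count() { return decommissioned.size(); }

    public static void main(String[] args) {
        DecommissionCount metric = new DecommissionCount();
        metric.nodeDecommissioned("node1:8041");
        metric.nodeRecommissioned("node1:8041");
        metric.nodeRecommissioned("node1:8041"); // duplicate event, harmless
        System.out.println(metric.count()); // 0, never -1
    }
}
```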
[jira] [Updated] (YARN-2608) FairScheduler may hung due to two potential deadlocks
[ https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2608: -- Attachment: YARN-2608-3.patch Update a patch to fix the findbugs. > FairScheduler may hung due to two potential deadlocks > - > > Key: YARN-2608 > URL: https://issues.apache.org/jira/browse/YARN-2608 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wei Yan >Assignee: Wei Yan > Attachments: YARN-2608-1.patch, YARN-2608-2.patch, YARN-2608-3.patch > > > Two potential deadlocks exist inside the FairScheduler. > 1. AllocationFileLoaderService would reload the queue configuration, which > calls FairScheduler.AllocationReloadListener.onReload() function. And require > *FairScheduler's lock*; > {code} > public void onReload(AllocationConfiguration queueInfo) { > synchronized (FairScheduler.this) { > > } > } > {code} > after that, it would require the *QueueManager's queues lock*. > {code} > private FSQueue getQueue(String name, boolean create, FSQueueType > queueType) { > name = ensureRootPrefix(name); > synchronized (queues) { > > } > } > {code} > Another thread FairScheduler.assignToQueue may also need to create a new > queue when a new job submitted. This thread would hold the *QueueManager's > queues lock* firstly, and then would like to hold the *FairScheduler's lock* > as it needs to call FairScheduler.getClock() function when creating a new > FSLeafQueue. Deadlock may happen here. > 2. The AllocationFileLoaderService holds *AllocationFileLoaderService's > lock* first, and then waits for *FairScheduler's lock*. Another thread (like > AdminService.refreshQueues) may call FairScheduler's reinitialize function, > which holds *FairScheduler's lock* first, and then waits for > *AllocationFileLoaderService's lock*. Deadlock may happen here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields
[ https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148405#comment-14148405 ] Hadoop QA commented on YARN-668: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671313/YARN-668-v8.patch against trunk revision 9f9a222. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 10 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. 
The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.client.api.impl.TestNMClient org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart org.apache.hadoop.yarn.security.TestYARNTokenIdentifier org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerResync org.apache.hadoop.yarn.server.nodemanager.containermanager.TestContainerManagerRecovery org.apache.hadoop.yarn.server.nodemanager.security.TestNMTokenSecretManagerInNM org.apache.hadoop.yarn.server.nodemanager.containermanager.TestContainerManager org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServicesContainers org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater org.apache.hadoop.yarn.server.nodemanager.TestEventFlow org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerReboot org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServicesApps org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServices org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation org.apache.hadoop.yarn.server.resourcemanager.scheduler.TestSchedulerUtils org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler org.apache.hadoop.yarn.server.TestContainerManagerSecurity The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.client.api.impl.TestAMRMClient {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5134//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5134//console This message is automatically generated. > TokenIdentifier serialization should consider Unknown fields > > > Key: YARN-668 > URL: https://issues.apache.org/jira/browse/YARN-668 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Siddharth Seth >Assignee: Junping Du >Priority: Blocker > Attachments: YARN-668-demo.patch, YARN-668-v2.patch, > YARN-668-v3.patch, YARN-668-v4.patch, YARN-668-v5.patch, YARN-668-v6.patch, > YARN-668-v7.patch, YARN-668-v8.patch, YARN-668.patch > > > This would allow changing of the TokenIdentifier between versions. The > current serialization is Writable. A simple way to achieve this would be to > have a Proto object as the payload for TokenIdentifiers, instead of > individual fields. > TokenIdentifier continues to implement Writable to work with the RPC layer - > but the payload itself is serialized using PB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2523) ResourceManager UI showing negative value for "Decommissioned Nodes" field
[ https://issues.apache.org/jira/browse/YARN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148401#comment-14148401 ] Jason Lowe commented on YARN-2523: -- +1 lgtm. Committing this. > ResourceManager UI showing negative value for "Decommissioned Nodes" field > -- > > Key: YARN-2523 > URL: https://issues.apache.org/jira/browse/YARN-2523 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, webapp >Affects Versions: 3.0.0 >Reporter: Nishan Shetty >Assignee: Rohith > Attachments: YARN-2523.1.patch, YARN-2523.2.patch, YARN-2523.patch, > YARN-2523.patch > > > 1. Decommission one NodeManager by configuring ip in excludehost file > 2. Remove ip from excludehost file > 3. Execute -refreshNodes command and restart Decommissioned NodeManager > Observe that in RM UI negative value for "Decommissioned Nodes" field is shown -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2609) Example of use for the ReservationSystem
Carlo Curino created YARN-2609: -- Summary: Example of use for the ReservationSystem Key: YARN-2609 URL: https://issues.apache.org/jira/browse/YARN-2609 Project: Hadoop YARN Issue Type: Improvement Reporter: Carlo Curino Assignee: Carlo Curino Priority: Minor This JIRA provides a simple new example in mapreduce-examples that requests a reservation and submits a Pi computation within that reservation. It is meant just to show how to interact with the reservation system. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148397#comment-14148397 ] Hadoop QA commented on YARN-913: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671298/YARN-913-010.patch against trunk revision 9f9a222. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 36 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1266 javac compiler warnings (more than the trunk's current 1265 warnings). {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 2 warning messages. See https://builds.apache.org/job/PreCommit-YARN-Build/5131//artifact/PreCommit-HADOOP-Build-patchprocess/diffJavadocWarnings.txt for details. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. 
The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.ha.TestZKFailoverControllerStress org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell org.apache.hadoop.yarn.registry.secure.TestSecureRMRegistryOperations {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5131//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5131//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-registry.html Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5131//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-common.html Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5131//artifact/PreCommit-HADOOP-Build-patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5131//console This message is automatically generated. 
> Add a way to register long-lived services in a YARN cluster > --- > > Key: YARN-913 > URL: https://issues.apache.org/jira/browse/YARN-913 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, resourcemanager >Affects Versions: 2.5.0, 2.4.1 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, > 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, > YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, > YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, > YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, > YARN-913-010.patch, yarnregistry.pdf, yarnregistry.tla > > > In a YARN cluster you can't predict where services will come up -or on what > ports. The services need to work those things out as they come up and then > publish them somewhere. > Applications need to be able to find the service instance they are to bond to > -and not any others in the cluster. > Some kind of service registry -in the RM, in ZK, could do this. If the RM > held the write access to the ZK nodes, it would be more secure than having > apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148381#comment-14148381 ] Karthik Kambatla edited comment on YARN-2578 at 9/25/14 10:04 PM: -- Thanks for the clarification, Wilfred. Don't we need to do the same from AM->RM and Client->RM as well? Instead of fixing it everywhere, how about we fix this in RPC itself? In https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488, instead of using 0 as the default value, the default could be looked up in the Configuration. No? If we think it is better to do it, we should probably create a common JIRA and take the opinion from HDFS folks as well. was (Author: kkambatl): Thanks for the clarification, Wilfred. Don't we need to do the same from AM->RM and Client->RM as well? Instead of fixing it everywhere, how about we fix this in RPC itself? In https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488, instead of using 0 as the default value, the default could be looked up in the Configuration. No? > NM does not failover timely if RM node network connection fails > --- > > Key: YARN-2578 > URL: https://issues.apache.org/jira/browse/YARN-2578 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.5.1 >Reporter: Wilfred Spiegelenburg > Attachments: YARN-2578.patch > > > The NM does not fail over correctly when the network cable of the RM is > unplugged or the failure is simulated by a "service network stop" or a > firewall that drops all traffic on the node. The RM fails over to the standby > node when the failure is detected as expected. The NM should then re-register > with the new active RM. This re-register takes a long time (15 minutes or > more). Until then the cluster has no nodes for processing and applications > are stuck. 
> Reproduction test case which can be used in any environment: > - create a cluster with 3 nodes > node 1: ZK, NN, JN, ZKFC, DN, RM, NM > node 2: ZK, NN, JN, ZKFC, DN, RM, NM > node 3: ZK, JN, DN, NM > - start all services make sure they are in good health > - kill the network connection of the RM that is active using one of the > network kills from above > - observe the NN and RM failover > - the DN's fail over to the new active NN > - the NM does not recover for a long time > - the logs show a long delay and traces show no change at all > The stack traces of the NM all show the same set of threads. The main thread > which should be used in the re-register is the "Node Status Updater" This > thread is stuck in: > {code} > "Node Status Updater" prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in > Object.wait() [0x7f5a51fc1000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0xed62f488> (a org.apache.hadoop.ipc.Client$Call) > at java.lang.Object.wait(Object.java:503) > at org.apache.hadoop.ipc.Client.call(Client.java:1395) > - locked <0xed62f488> (a org.apache.hadoop.ipc.Client$Call) > at org.apache.hadoop.ipc.Client.call(Client.java:1362) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) > at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source) > at > org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) > {code} > The client connection which goes through the proxy can be traced back to the > ResourceTrackerPBClientImpl. The generated proxy does not time out and we > should be using a version which takes the RPC timeout (from the > configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148381#comment-14148381 ] Karthik Kambatla commented on YARN-2578: Thanks for the clarification, Wilfred. Don't we need to do the same from AM->RM and Client->RM as well? Instead of fixing it everywhere, how about we fix this in RPC itself? In https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488, instead of using 0 as the default value, the default could be looked up in the Configuration. No? > NM does not failover timely if RM node network connection fails > --- > > Key: YARN-2578 > URL: https://issues.apache.org/jira/browse/YARN-2578 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.5.1 >Reporter: Wilfred Spiegelenburg > Attachments: YARN-2578.patch > > > The NM does not fail over correctly when the network cable of the RM is > unplugged or the failure is simulated by a "service network stop" or a > firewall that drops all traffic on the node. The RM fails over to the standby > node when the failure is detected as expected. The NM should then re-register > with the new active RM. This re-register takes a long time (15 minutes or > more). Until then the cluster has no nodes for processing and applications > are stuck. > Reproduction test case which can be used in any environment: > - create a cluster with 3 nodes > node 1: ZK, NN, JN, ZKFC, DN, RM, NM > node 2: ZK, NN, JN, ZKFC, DN, RM, NM > node 3: ZK, JN, DN, NM > - start all services make sure they are in good health > - kill the network connection of the RM that is active using one of the > network kills from above > - observe the NN and RM failover > - the DN's fail over to the new active NN > - the NM does not recover for a long time > - the logs show a long delay and traces show no change at all > The stack traces of the NM all show the same set of threads. 
The main thread > which should be used in the re-register is the "Node Status Updater" This > thread is stuck in: > {code} > "Node Status Updater" prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in > Object.wait() [0x7f5a51fc1000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0xed62f488> (a org.apache.hadoop.ipc.Client$Call) > at java.lang.Object.wait(Object.java:503) > at org.apache.hadoop.ipc.Client.call(Client.java:1395) > - locked <0xed62f488> (a org.apache.hadoop.ipc.Client$Call) > at org.apache.hadoop.ipc.Client.call(Client.java:1362) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) > at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source) > at > org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80) > {code} > The client connection which goes through the proxy can be traced back to the > ResourceTrackerPBClientImpl. The generated proxy does not time out and we > should be using a version which takes the RPC timeout (from the > configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
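Karthik's suggestion — resolving the hard-coded 0 against a configured default instead of treating it as "wait forever" — can be sketched with a plain map standing in for Hadoop's Configuration. The key name below is illustrative, not a confirmed Hadoop property:

```java
import java.util.Map;

public class RpcTimeoutDefaults {

    // If the caller passes 0 (the current RPC.java default, meaning
    // "no timeout"), fall back to a configured cluster-wide value
    // rather than letting the call block indefinitely.
    static int resolveTimeout(int requestedMs, Map<String, String> conf) {
        if (requestedMs > 0) {
            return requestedMs; // an explicit timeout always wins
        }
        String configured = conf.get("ipc.client.rpc-timeout.ms"); // illustrative key
        return configured == null ? 0 : Integer.parseInt(configured);
    }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of("ipc.client.rpc-timeout.ms", "60000");
        System.out.println(resolveTimeout(0, conf));     // 60000, taken from config
        System.out.println(resolveTimeout(15000, conf)); // 15000, explicit value wins
    }
}
```

Done once in the RPC layer, this would cover the NM->RM, AM->RM, and Client->RM paths without touching each proxy creation site, which is why the comment proposes raising it as a common JIRA.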
[jira] [Commented] (YARN-2608) FairScheduler may hung due to two potential deadlocks
[ https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148378#comment-14148378 ] Hadoop QA commented on YARN-2608: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671305/YARN-2608-1.patch against trunk revision 9f9a222. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 10 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5132//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5132//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5132//console This message is automatically generated. 
> FairScheduler may hung due to two potential deadlocks > - > > Key: YARN-2608 > URL: https://issues.apache.org/jira/browse/YARN-2608 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wei Yan >Assignee: Wei Yan > Attachments: YARN-2608-1.patch, YARN-2608-2.patch > > > Two potential deadlocks exist inside the FairScheduler. > 1. AllocationFileLoaderService would reload the queue configuration, which > calls the FairScheduler.AllocationReloadListener.onReload() function and > requires the *FairScheduler's lock*; > {code} > public void onReload(AllocationConfiguration queueInfo) { > synchronized (FairScheduler.this) { > > } > } > {code} > after that, it would require the *QueueManager's queues lock*. > {code} > private FSQueue getQueue(String name, boolean create, FSQueueType > queueType) { > name = ensureRootPrefix(name); > synchronized (queues) { > > } > } > {code} > Another thread, FairScheduler.assignToQueue, may also need to create a new > queue when a new job is submitted. This thread would hold the *QueueManager's > queues lock* first, and then would want to hold the *FairScheduler's lock*, > as it needs to call the FairScheduler.getClock() function when creating a new > FSLeafQueue. Deadlock may happen here. > 2. The AllocationFileLoaderService holds the *AllocationFileLoaderService's > lock* first, and then waits for the *FairScheduler's lock*. Another thread (like > AdminService.refreshQueues) may call FairScheduler's reinitialize function, > which holds the *FairScheduler's lock* first, and then waits for the > *AllocationFileLoaderService's lock*. Deadlock may happen here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
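The first scenario above is the classic lock-ordering deadlock: one thread takes the scheduler lock then the queues lock, the other takes them in the opposite order. A minimal, self-contained sketch of the usual fix, acquiring the two locks in one global order on both paths (illustrative names only, not the real FairScheduler code):

```java
public class LockOrdering {
    private final Object schedulerLock = new Object(); // stands in for FairScheduler.this
    private final Object queuesLock = new Object();    // stands in for QueueManager's queues
    private int queueCount = 0;

    // Reload path: scheduler lock first, then queues lock.
    public void reloadQueues() {
        synchronized (schedulerLock) {
            synchronized (queuesLock) {
                queueCount++;
            }
        }
    }

    // Queue-creation path, fixed to use the SAME order. The deadlock arose
    // because this path originally took queuesLock first and then needed
    // the scheduler lock (for getClock()) while still holding it.
    public int createQueue() {
        synchronized (schedulerLock) {
            synchronized (queuesLock) {
                return ++queueCount;
            }
        }
    }

    public int getQueueCount() {
        synchronized (queuesLock) {
            return queueCount;
        }
    }
}
```

The patch takes an alternative route, removing the scheduler lock from the read path entirely (a volatile clock), which also breaks the cycle; consistent ordering is the general-purpose fix sketched here.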
[jira] [Commented] (YARN-2550) TestAMRestart fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148376#comment-14148376 ] Jason Lowe commented on YARN-2550: -- This looks like a dup of YARN-2483. > TestAMRestart fails intermittently > -- > > Key: YARN-2550 > URL: https://issues.apache.org/jira/browse/YARN-2550 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Rohith > > testShouldNotCountFailureToMaxAttemptRetry(org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart) > Time elapsed: 50.64 sec <<< FAILURE! > java.lang.AssertionError: AppAttempt state is not correct (timedout) > expected: but was: > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at > org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:84) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:417) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.launchAM(MockRM.java:582) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.launchAndRegisterAM(MockRM.java:589) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForNewAMToLaunchAndRegister(MockRM.java:182) > at > org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart.testShouldNotCountFailureToMaxAttemptRetry(TestAMRestart.java:402) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2608) FairScheduler may hung due to two potential deadlocks
[ https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148354#comment-14148354 ] Karthik Kambatla commented on YARN-2608: +1, pending Jenkins. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2602) Generic History Service of TimelineServer sometimes not able to handle NPE
[ https://issues.apache.org/jira/browse/YARN-2602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148348#comment-14148348 ] Zhijie Shen commented on YARN-2602: --- The problem is that YARN_APPLICATION_VIEW_ACLS field can be null. For example, if it is a DS app, the client leaves the ACL field null. > Generic History Service of TimelineServer sometimes not able to handle NPE > -- > > Key: YARN-2602 > URL: https://issues.apache.org/jira/browse/YARN-2602 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.6.0 > Environment: ATS is running with AHS/GHS enabled to use TimelineStore. > Running for 4-5 days, with many random example jobs running >Reporter: Karam Singh >Assignee: Zhijie Shen > > ATS is running with AHS/GHS enabled to use TimelineStore. > Running for 4-5 day, with many random example jobs running . > When I ran WS API for AHS/GHS: > {code} > curl --negotiate -u : > 'http:///v1/applicationhistory/apps/application_1411579118376_0001' > {code} > It ran successfully. > However > {code} > curl --negotiate -u : > 'http:///ws/v1/applicationhistory/apps' > {"exception":"WebApplicationException","message":"java.lang.NullPointerException","javaClassName":"javax.ws.rs.WebApplicationException"} > {code} > Failed with Internal server error 500. > After looking at TimelineServer logs found that there was NPE: -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2608) FairScheduler may hung due to two potential deadlocks
[ https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2608: -- Attachment: YARN-2608-2.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2608) FairScheduler may hung due to two potential deadlocks
[ https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148328#comment-14148328 ] Karthik Kambatla commented on YARN-2608: Nit: Nothing to do with this patch, can we annotate FS#setClock as VisibleForTesting? Otherwise, patch looks good to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-668) TokenIdentifier serialization should consider Unknown fields
[ https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148074#comment-14148074 ] Vinod Kumar Vavilapalli edited comment on YARN-668 at 9/25/14 9:19 PM: --- Quick look at the patch - None of the records in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/yarn_security_token.proto are supposed to be exposed to users. We can move it to a sub-folder server and explicit comment in the proto file saying they are not consumable. - What about other tokens? We have Client to AM token, RM delegation-tokens etc. was (Author: vinodkv): Quick look at the patch - None of the records in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/yarn_security_token.proto are supposed to be exposed to users. We can move it to a sub-folder server and explicit comment in the proto file saying they are consumable. - What about other tokens? We have Client to AM token, RM delegation-tokens etc. > TokenIdentifier serialization should consider Unknown fields > > > Key: YARN-668 > URL: https://issues.apache.org/jira/browse/YARN-668 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Siddharth Seth >Assignee: Junping Du >Priority: Blocker > Attachments: YARN-668-demo.patch, YARN-668-v2.patch, > YARN-668-v3.patch, YARN-668-v4.patch, YARN-668-v5.patch, YARN-668-v6.patch, > YARN-668-v7.patch, YARN-668-v8.patch, YARN-668.patch > > > This would allow changing of the TokenIdentifier between versions. The > current serialization is Writable. A simple way to achieve this would be to > have a Proto object as the payload for TokenIdentifiers, instead of > individual fields. > TokenIdentifier continues to implement Writable to work with the RPC layer - > but the payload itself is serialized using PB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields
[ https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148326#comment-14148326 ] Junping Du commented on YARN-668: - Thanks [~vinodkv] and [~jianhe] for the review and comments! I addressed them in the latest v8 patch. bq. None of the records in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/yarn_security_token.proto are supposed to be exposed to users. We can move it to a sub-folder server and explicit comment in the proto file saying they are consumable. Good point. Moved, and added the comments. bq. What about other tokens? We have Client to AM token, RM delegation-tokens etc. The plan is to address those two tokens in a separate patch, given the patch here is already big and further changes would add to the chance of conflicts with other recent work. bq. containerManagerImpl, TestApplicationMasterService changes revert There are actually some changes in these two files. bq. Proto definition should have the same default. Good point. The default value is now added to the proto definition. However, I am not sure whether protobuf has an equivalent of Integer.MIN_VALUE, so I just use the hard-coded number for now. bq. "following constructors may be not needed." And "remove the commented code" Removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login
[ https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148315#comment-14148315 ] Jonathan Eagles commented on YARN-2606: --- I don't have any context why login is part of start. > Application History Server tries to access hdfs before doing secure login > - > > Key: YARN-2606 > URL: https://issues.apache.org/jira/browse/YARN-2606 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.6.0 >Reporter: Mit Desai >Assignee: Mit Desai > Attachments: YARN-2606.patch > > > While testing the Application Timeline Server, the server would not come up > in a secure cluster, as it would keep trying to access hdfs without having > done the secure login. It would repeatedly try authenticating and finally hit > stack overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-668) TokenIdentifier serialization should consider Unknown fields
[ https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-668: Attachment: YARN-668-v8.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2608) FairScheduler may hung due to two potential deadlocks
[ https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148302#comment-14148302 ] Wei Yan commented on YARN-2608: --- For the first deadlock, since the clock is only changed by test cases, we can simply remove the synchronization and make the clock volatile. For the second deadlock, we can likewise remove the synchronized keyword from the reinitialize and initScheduler functions; the reinitialize function would then acquire the *AllocationFileLoaderService's lock* first, and then the *FairScheduler's lock*. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
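The volatile-clock part of the fix can be sketched like this (an illustrative {{Clock}} interface standing in for YARN's, not the actual patch): a volatile reference makes a test-only {{setClock()}} visible to readers without taking the FairScheduler monitor, removing the lock dependency from the queue-creation path:

```java
public class SchedulerClockHolder {
    // Illustrative stand-in for YARN's Clock interface.
    public interface Clock {
        long getTime();
    }

    // volatile: a setClock() from one thread is immediately visible to all
    // readers without taking the scheduler monitor, so creating an
    // FSLeafQueue no longer needs the FairScheduler lock for getClock().
    private volatile Clock clock = System::currentTimeMillis;

    public Clock getClock() {       // note: no synchronized needed
        return clock;
    }

    public void setClock(Clock newClock) { // only test cases call this
        this.clock = newClock;
    }
}
```

This is safe precisely because the clock is a single independently-published reference; volatile gives visibility but not atomicity across multiple fields, so it would not be enough for compound state.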
[jira] [Updated] (YARN-1051) YARN Admission Control/Planner: enhancing the resource allocation model with time.
[ https://issues.apache.org/jira/browse/YARN-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-1051: - Attachment: YARN-1051.patch I am attaching a merge patch with trunk for easy reference. This patch is created after rebasing branch yarn-1051 with trunk. I ran test-patch against trunk with the attached patch in my box and got a +1. > YARN Admission Control/Planner: enhancing the resource allocation model with > time. > -- > > Key: YARN-1051 > URL: https://issues.apache.org/jira/browse/YARN-1051 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler, resourcemanager, scheduler >Reporter: Carlo Curino >Assignee: Carlo Curino > Attachments: YARN-1051-design.pdf, YARN-1051.patch, > curino_MSR-TR-2013-108.pdf, techreport.pdf > > > In this umbrella JIRA we propose to extend the YARN RM to handle time > explicitly, allowing users to "reserve" capacity over time. This is an > important step towards SLAs, long-running services, workflows, and helps for > gang scheduling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2608) FairScheduler may hung due to two potential deadlocks
[ https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2608: -- Description: Two potential deadlocks exist inside the FairScheduler. 1. AllocationFileLoaderService would reload the queue configuration, which calls the FairScheduler.AllocationReloadListener.onReload() function and requires the *FairScheduler's lock*; {code} public void onReload(AllocationConfiguration queueInfo) { synchronized (FairScheduler.this) { } } {code} after that, it would require the *QueueManager's queues lock*. {code} private FSQueue getQueue(String name, boolean create, FSQueueType queueType) { name = ensureRootPrefix(name); synchronized (queues) { } } {code} Another thread, FairScheduler.assignToQueue, may also need to create a new queue when a new job is submitted. This thread would hold the *QueueManager's queues lock* first, and then would want to hold the *FairScheduler's lock*, as it needs to call the FairScheduler.getClock() function when creating a new FSLeafQueue. Deadlock may happen here. 2. The AllocationFileLoaderService holds the *AllocationFileLoaderService's lock* first, and then waits for the *FairScheduler's lock*. Another thread (like AdminService.refreshQueues) may call FairScheduler's reinitialize function, which holds the *FairScheduler's lock* first, and then waits for the *AllocationFileLoaderService's lock*. Deadlock may happen here. was:Two potential deadlocks exist inside the FairScheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login
[ https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148288#comment-14148288 ] Hadoop QA commented on YARN-2606: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671295/YARN-2606.patch against trunk revision 1861b32. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5130//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5130//console This message is automatically generated. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login
[ https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148282#comment-14148282 ] Mit Desai commented on YARN-2606: - Thanks for the suggestion [~zjshen]. Moving the FS operations to serviceStart() would work too. But I went with this option because, to me, doing the login during initialization makes more sense than doing it midway through startup. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2608) FairScheduler may hung due to two potential deadlocks
[ https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2608: -- Attachment: YARN-2608-1.patch > FairScheduler may hung due to two potential deadlocks > - > > Key: YARN-2608 > URL: https://issues.apache.org/jira/browse/YARN-2608 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wei Yan >Assignee: Wei Yan > Attachments: YARN-2608-1.patch > > > Two potential deadlocks exist inside the FairScheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login
[ https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148278#comment-14148278 ] Zhijie Shen commented on YARN-2606: --- [~jeagles], I saw that doSecureLogin is invoked at the start stage in both the RM and the NM, and I'm a bit concerned that moving it to init will cause some unexpected behavior. Do you have any idea about the rationale behind this choice? I'm not aware of it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2608) FairScheduler may hung due to two potential deadlocks
Wei Yan created YARN-2608: - Summary: FairScheduler may hung due to two potential deadlocks Key: YARN-2608 URL: https://issues.apache.org/jira/browse/YARN-2608 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Two potential deadlocks exist inside the FairScheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login
[ https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148273#comment-14148273 ] Jonathan Eagles commented on YARN-2606: --- I can see this both ways. It seems correct both to log in during initialization and to wait until start to do file operations, although fixing either one of them would indeed fix the issue at hand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
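Either way the ordering constraint is the same: the secure login must complete before the first filesystem access. A self-contained sketch of that two-phase lifecycle (hypothetical names, not Hadoop's actual AbstractService):

```java
import java.util.ArrayList;
import java.util.List;

public class HistoryServerLifecycle {
    private final List<String> events = new ArrayList<>();
    private boolean loggedIn = false;

    // Phase 1 (init): perform the secure login before anything can touch HDFS.
    public void serviceInit() {
        events.add("login");
        loggedIn = true;
    }

    // Phase 2 (start): filesystem operations may now assume valid credentials.
    public void serviceStart() {
        if (!loggedIn) {
            throw new IllegalStateException("FS access attempted before secure login");
        }
        events.add("fs-access");
    }

    public List<String> getEvents() {
        return events;
    }
}
```

The patch's choice (login during init) and the alternative raised above (moving the FS operations to start) both satisfy this ordering; the bug was that neither held.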
[jira] [Updated] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-913: Attachment: YARN-913-010.patch This patch doesn't look at why yesterday's Jenkins tests failed, so if they are due to these changes, those failures won't have been fixed. Key changes come from experience implementing a (not in this patch) read-only REST view. # Renamed fields in the {{ServiceRecord}} because Jersey ignores {{@JsonProperty}} annotations that give fields specific names. So no {{yarn:id}} or {{yarn:persistence}} in the JSON; the fields are called {{yarn_id}} and {{yarn_persistence}} instead. # A specific exception {{NoRecordException}} to differentiate "could not resolve a node as there isn't any entry with the header used to identify service records" from {{InvalidRecordException}}, which is only triggered on parse problems. # Added a lightweight {{list()}} operation that only returns the child paths; the original {{list(path) -> List}} is renamed to {{listFull}}. There's a CLI client for this being written; it'll help validate the API and identify any further points for tuning. > Add a way to register long-lived services in a YARN cluster > --- > > Key: YARN-913 > URL: https://issues.apache.org/jira/browse/YARN-913 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, resourcemanager >Affects Versions: 2.5.0, 2.4.1 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, > 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, > YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, > YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, > YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, > YARN-913-010.patch, yarnregistry.pdf, yarnregistry.tla > > > In a YARN cluster you can't predict where services will come up -or on what > ports. 
The services need to work those things out as they come up and then > publish them somewhere. > Applications need to be able to find the service instance they are to bond to > -and not any others in the cluster. > Some kind of service registry -in the RM, in ZK, could do this. If the RM > held the write access to the ZK nodes, it would be more secure than having > apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148260#comment-14148260 ] Hadoop QA commented on YARN-2198: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671287/YARN-2198.trunk.10.patch against trunk revision 6c54308. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5128//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5128//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5128//console This message is automatically generated. 
> Remove the need to run NodeManager as privileged account for Windows Secure > Container Executor > -- > > Key: YARN-2198 > URL: https://issues.apache.org/jira/browse/YARN-2198 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, > YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, > YARN-2198.delta.7.patch, YARN-2198.separation.patch, > YARN-2198.trunk.10.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, > YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch > > > YARN-1972 introduces a Secure Windows Container Executor. However this > executor requires the process launching the container to be LocalSystem or > a member of the local Administrators group. Since the process in question > is the NodeManager, the requirement translates to the entire NM running as a > privileged account, a very large surface area to review and protect. > This proposal is to move the privileged operations into a dedicated NT > service. The NM can run as a low privilege account and communicate with the > privileged NT service when it needs to launch a container. This would reduce > the surface exposed to the high privileges. > There has to exist a secure, authenticated and authorized channel of > communication between the NM and the privileged NT service. Possible > alternatives are a new TCP endpoint, Java RPC etc. My proposal though would > be to use Windows LPC (Local Procedure Calls), which is a Windows platform > specific inter-process communication channel that satisfies all requirements > and is easy to deploy. The privileged NT service would register and listen on > an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop > with libwinutils which would host the LPC client code. 
The client would > connect to the LPC port (NtConnectPort) and send a message requesting a > container launch (NtRequestWaitReplyPort). LPC provides authentication and > the privileged NT service can use authorization API (AuthZ) to validate the > caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login
[ https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148255#comment-14148255 ] Zhijie Shen commented on YARN-2606: --- The right fix may be moving the FS operations to serviceStart(). See the similar code in FileSystemRMStateStore: {code} @Override protected synchronized void startInternal() throws Exception { // create filesystem only now, as part of service-start. By this time, RM is // authenticated with kerberos so we are good to create a file-system // handle. Configuration conf = new Configuration(getConfig()); conf.setBoolean("dfs.client.retry.policy.enabled", true); String retryPolicy = conf.get(YarnConfiguration.FS_RM_STATE_STORE_RETRY_POLICY_SPEC, YarnConfiguration.DEFAULT_FS_RM_STATE_STORE_RETRY_POLICY_SPEC); conf.set("dfs.client.retry.policy.spec", retryPolicy); fs = fsWorkingPath.getFileSystem(conf); fs.mkdirs(rmDTSecretManagerRoot); fs.mkdirs(rmAppRoot); fs.mkdirs(amrmTokenSecretManagerRoot); } {code} BTW, we're thinking about removing the old application history store stack (YARN-2320). > Application History Server tries to access hdfs before doing secure login > - > > Key: YARN-2606 > URL: https://issues.apache.org/jira/browse/YARN-2606 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.6.0 >Reporter: Mit Desai >Assignee: Mit Desai > Attachments: YARN-2606.patch > > > While testing the Application Timeline Server, the server would not come up > in a secure cluster, as it would keep trying to access hdfs without having > done the secure login. It would repeatedly try authenticating and finally hit > stack overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
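The {{startInternal}} pattern quoted above can be sketched without any Hadoop dependency. The following is a minimal, hypothetical illustration of the lifecycle ordering only; the class and method names are invented for the example and are not the actual FileSystemApplicationHistoryStore code:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the service-lifecycle ordering discussed above:
// secure login belongs in init; filesystem access happens only in start.
// Names are illustrative; this is not the real YARN service API.
class LifecycleSketch {
    final List<String> events = new ArrayList<>();

    void serviceInit() {
        // Kerberos login would happen here, before any HDFS access.
        events.add("secure-login");
    }

    void serviceStart() {
        // Only now, once authenticated, create the FileSystem handle
        // and do mkdirs() etc. (cf. FileSystemRMStateStore.startInternal).
        events.add("create-fs-handle");
    }

    public static void main(String[] args) {
        LifecycleSketch s = new LifecycleSketch();
        s.serviceInit();
        s.serviceStart();
        System.out.println(String.join(",", s.events));
    }
}
```

The point is purely the ordering: any code that needs an authenticated caller must run after the init phase that performs the login, which is what moving the FS operations to serviceStart() achieves.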
[jira] [Created] (YARN-2607) TestDistributedShell fails in trunk
Ted Yu created YARN-2607: Summary: TestDistributedShell fails in trunk Key: YARN-2607 URL: https://issues.apache.org/jira/browse/YARN-2607 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu From https://builds.apache.org/job/Hadoop-Yarn-trunk/691/console : {code} testDSRestartWithPreviousRunningContainers(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) Time elapsed: 35.641 sec <<< FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSRestartWithPreviousRunningContainers(TestDistributedShell.java:308) {code} On Linux, I got the following locally: {code} testDSAttemptFailuresValidityIntervalFailed(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) Time elapsed: 64.715 sec <<< FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertFalse(Assert.java:64) at org.junit.Assert.assertFalse(Assert.java:74) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSAttemptFailuresValidityIntervalFailed(TestDistributedShell.java:384) testDSAttemptFailuresValidityIntervalSucess(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) Time elapsed: 115.842 sec <<< ERROR! 
java.lang.Exception: test timed out after 9 milliseconds at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.yarn.applications.distributedshell.Client.monitorApplication(Client.java:680) at org.apache.hadoop.yarn.applications.distributedshell.Client.run(Client.java:661) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSAttemptFailuresValidityIntervalSucess(TestDistributedShell.java:342) testDSRestartWithPreviousRunningContainers(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) Time elapsed: 35.633 sec <<< FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSRestartWithPreviousRunningContainers(TestDistributedShell.java:308) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2606) Application History Server tries to access hdfs before doing secure login
[ https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-2606: Attachment: YARN-2606.patch Attaching the patch. > Application History Server tries to access hdfs before doing secure login > - > > Key: YARN-2606 > URL: https://issues.apache.org/jira/browse/YARN-2606 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.6.0 >Reporter: Mit Desai >Assignee: Mit Desai > Attachments: YARN-2606.patch > > > While testing the Application Timeline Server, the server would not come up > in a secure cluster, as it would keep trying to access hdfs without having > done the secure login. It would repeatedly try authenticating and finally hit > stack overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2606) Application History Server tries to access hdfs before doing secure login
[ https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-2606: Component/s: timelineserver > Application History Server tries to access hdfs before doing secure login > - > > Key: YARN-2606 > URL: https://issues.apache.org/jira/browse/YARN-2606 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.6.0 >Reporter: Mit Desai >Assignee: Mit Desai > Attachments: YARN-2606.patch > > > While testing the Application Timeline Server, the server would not come up > in a secure cluster, as it would keep trying to access hdfs without having > done the secure login. It would repeatedly try authenticating and finally hit > stack overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2606) Application History Server tries to access hdfs before doing secure login
Mit Desai created YARN-2606: --- Summary: Application History Server tries to access hdfs before doing secure login Key: YARN-2606 URL: https://issues.apache.org/jira/browse/YARN-2606 Project: Hadoop YARN Issue Type: Bug Reporter: Mit Desai Assignee: Mit Desai While testing the Application Timeline Server, the server would not come up in a secure cluster, as it would keep trying to access hdfs without having done the secure login. It would repeatedly try authenticating and finally hit stack overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148227#comment-14148227 ] Hadoop QA commented on YARN-2179: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671286/YARN-2179-trunk-v7.patch against trunk revision 6c54308. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5129//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5129//console This message is automatically generated. 
> Initial cache manager structure and context > --- > > Key: YARN-2179 > URL: https://issues.apache.org/jira/browse/YARN-2179 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chris Trezzo >Assignee: Chris Trezzo > Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, > YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, > YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch > > > Implement the initial shared cache manager structure and context. The > SCMContext will be used by a number of manager services (i.e. the backing > store and the cleaner service). The AppChecker is used to gather the > currently running applications on SCM startup (necessary for an SCM that is > backed by an in-memory store). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2180) In-memory backing store for cache manager
[ https://issues.apache.org/jira/browse/YARN-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-2180: --- Attachment: YARN-2180-trunk-v4.patch [~kasha] [~vinodkv] [~sjlee0] Attached is v4. Here are some significant changes: 1. Bootstrapping and old SCMContext logic is now moved to the serviceInit of the in-memory store. 2. SCMStore interface is annotated properly with private and evolving. 3. Eviction logic of a shared cache resource has moved to the SCMStore implementation. The isResourceEvictable method has been added to the SCMStore interface to expose this. 4. There is a new configuration class (InMemorySCMStoreConfiguration) that allows for InMemorySCMStore implementation specific configuration. 5. Javadoc rework and method name refactoring so that items and their references in the shared cache are referred to as shared cache resources and shared cache resource references. 6. Various other refactors to address comments. > In-memory backing store for cache manager > - > > Key: YARN-2180 > URL: https://issues.apache.org/jira/browse/YARN-2180 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chris Trezzo >Assignee: Chris Trezzo > Attachments: YARN-2180-trunk-v1.patch, YARN-2180-trunk-v2.patch, > YARN-2180-trunk-v3.patch, YARN-2180-trunk-v4.patch > > > Implement an in-memory backing store for the cache manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
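As a rough illustration of change #3 above (eviction logic living behind the store interface, exposed via {{isResourceEvictable}}), here is a hedged, self-contained sketch. Apart from the {{isResourceEvictable}} name taken from the comment, the interface and its methods are invented for this example and are not the real YARN-2180 API:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of an SCM store that owns its eviction policy, so a cleaner
// service can ask the store whether a cached resource may be removed.
// Everything except the isResourceEvictable name is hypothetical.
interface CacheStoreSketch {
    void addResource(String key);
    void addReference(String key);
    void removeReference(String key);
    boolean isResourceEvictable(String key);
}

class InMemoryCacheStoreSketch implements CacheStoreSketch {
    private final Map<String, Integer> refCounts = new HashMap<>();

    public void addResource(String key)     { refCounts.putIfAbsent(key, 0); }
    public void addReference(String key)    { refCounts.merge(key, 1, Integer::sum); }
    public void removeReference(String key) { refCounts.merge(key, -1, Integer::sum); }

    // A resource with no live references is a candidate for the cleaner.
    public boolean isResourceEvictable(String key) {
        return refCounts.getOrDefault(key, 0) <= 0;
    }
}
```

The design point is that different store implementations (in-memory, persistent) can carry different eviction policies without the cleaner knowing about them.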
[jira] [Updated] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2198: --- Attachment: YARN-2198.trunk.10.patch > Remove the need to run NodeManager as privileged account for Windows Secure > Container Executor > -- > > Key: YARN-2198 > URL: https://issues.apache.org/jira/browse/YARN-2198 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, > YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, > YARN-2198.delta.7.patch, YARN-2198.separation.patch, > YARN-2198.trunk.10.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, > YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch > > > YARN-1972 introduces a Secure Windows Container Executor. However this > executor requires the process launching the container to be LocalSystem or > a member of the local Administrators group. Since the process in question > is the NodeManager, the requirement translates to the entire NM running as a > privileged account, a very large surface area to review and protect. > This proposal is to move the privileged operations into a dedicated NT > service. The NM can run as a low privilege account and communicate with the > privileged NT service when it needs to launch a container. This would reduce > the surface exposed to the high privileges. > There has to exist a secure, authenticated and authorized channel of > communication between the NM and the privileged NT service. Possible > alternatives are a new TCP endpoint, Java RPC etc. My proposal though would > be to use Windows LPC (Local Procedure Calls), which is a Windows platform > specific inter-process communication channel that satisfies all requirements > and is easy to deploy. 
The privileged NT service would register and listen on > an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop > with libwinutils which would host the LPC client code. The client would > connect to the LPC port (NtConnectPort) and send a message requesting a > container launch (NtRequestWaitReplyPort). LPC provides authentication and > the privileged NT service can use authorization API (AuthZ) to validate the > caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2198: --- Attachment: (was: YARN-2198.trunk.10.patch) > Remove the need to run NodeManager as privileged account for Windows Secure > Container Executor > -- > > Key: YARN-2198 > URL: https://issues.apache.org/jira/browse/YARN-2198 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, > YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, > YARN-2198.delta.7.patch, YARN-2198.separation.patch, YARN-2198.trunk.4.patch, > YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, > YARN-2198.trunk.9.patch > > > YARN-1972 introduces a Secure Windows Container Executor. However this > executor requires the process launching the container to be LocalSystem or > a member of the local Administrators group. Since the process in question > is the NodeManager, the requirement translates to the entire NM running as a > privileged account, a very large surface area to review and protect. > This proposal is to move the privileged operations into a dedicated NT > service. The NM can run as a low privilege account and communicate with the > privileged NT service when it needs to launch a container. This would reduce > the surface exposed to the high privileges. > There has to exist a secure, authenticated and authorized channel of > communication between the NM and the privileged NT service. Possible > alternatives are a new TCP endpoint, Java RPC etc. My proposal though would > be to use Windows LPC (Local Procedure Calls), which is a Windows platform > specific inter-process communication channel that satisfies all requirements > and is easy to deploy. The privileged NT service would register and listen on > an LPC port (NtCreatePort, NtListenPort). 
The NM would use JNI to interop > with libwinutils which would host the LPC client code. The client would > connect to the LPC port (NtConnectPort) and send a message requesting a > container launch (NtRequestWaitReplyPort). LPC provides authentication and > the privileged NT service can use authorization API (AuthZ) to validate the > caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-2179: --- Attachment: YARN-2179-trunk-v7.patch Slight update. AppChecker and RemoteAppChecker are now services to allow for proper handling of YarnClient. > Initial cache manager structure and context > --- > > Key: YARN-2179 > URL: https://issues.apache.org/jira/browse/YARN-2179 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chris Trezzo >Assignee: Chris Trezzo > Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, > YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, > YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch > > > Implement the initial shared cache manager structure and context. The > SCMContext will be used by a number of manager services (i.e. the backing > store and the cleaner service). The AppChecker is used to gather the > currently running applications on SCM startup (necessary for an SCM that is > backed by an in-memory store). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1963) Support priorities across applications within the same queue
[ https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-1963: -- Attachment: YARN Application Priorities Design.pdf Hi all, I am uploading an initial draft of the Application Priority design. Kindly review it and share your thoughts. I am planning to file the sub-JIRAs by the end of the week, after a round of review. Thank you [~vinodkv] for the support. > Support priorities across applications within the same queue > - > > Key: YARN-1963 > URL: https://issues.apache.org/jira/browse/YARN-1963 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, resourcemanager >Reporter: Arun C Murthy >Assignee: Sunil G > Attachments: YARN Application Priorities Design.pdf > > > It will be very useful to support priorities among applications within the > same queue, particularly in production scenarios. It allows for finer-grained > controls without having to force admins to create a multitude of queues, plus > allows existing applications to continue using existing queues which are > usually part of institutional memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2009) Priority support for preemption in ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G reassigned YARN-2009: - Assignee: Sunil G > Priority support for preemption in ProportionalCapacityPreemptionPolicy > --- > > Key: YARN-2009 > URL: https://issues.apache.org/jira/browse/YARN-2009 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Devaraj K >Assignee: Sunil G > > While preempting containers based on the queue ideal assignment, we may need > to consider preempting the low priority application containers first. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2601) RMs(HA RMS) can't enter active state
[ https://issues.apache.org/jira/browse/YARN-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148155#comment-14148155 ] Aroop Maliakkal commented on YARN-2601: --- As a workaround, we deleted the entries in /rmstore/ZKRMStateRoot/RMAppRoot and restarted the RMs. Looks like that fixed the issue. > RMs(HA RMS) can't enter active state > > > Key: YARN-2601 > URL: https://issues.apache.org/jira/browse/YARN-2601 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Cindy Li > > 2014-09-24 15:04:04,527 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing > event for application_1409048687352_0552 of type APP_REJECTED > 2014-09-24 15:04:04,528 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: > application_1409048687352_0552 State change from NEW to FAILED > 2014-09-24 15:04:04,528 DEBUG org.apache.hadoop.yarn.event.AsyncDispatcher: > Dispatching the event > org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.AppRemovedSchedulerEvent.EventType: > APP_REMOVED > 2014-09-24 15:04:04,528 DEBUG org.apache.hadoop.yarn.event.AsyncDispatcher: > Dispatching the event > org.apache.hadoop.yarn.server.resourcemanager.RMAppManagerEvent.EventType: > APP_COMPLETED > 2014-09-24 15:04:04,528 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: RMAppManager > processing event for application_1409048687352_0552 of type APP_COMPLETED > 2014-09-24 15:04:04,528 WARN > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=b_hiveperf0 > OPERATION=Application Finished - Failed TARGET=RMAppManager > RESULT=FAILURE DESCRIPTION=App failed with state: FAILED > PERMISSIONS=hadoop tried to renew an expired token > at > org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:366) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:6279) > at > 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:488) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:923) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2020) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2016) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1650) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2014) > APPID=application_1409048687352_0552 > 2014-09-24 15:04:04,529 DEBUG org.apache.hadoop.service.AbstractService: > Service: RMActiveServices entered state STOPPED > > 2014-09-24 15:04:04,538 WARN > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop > OPERATION=transitionToActiveTARGET=RMHAProtocolService > RESULT=FAILURE DESCRIPTION=Exception transitioning to active > PERMISSIONS=Users [hadoop] are allowed > 2014-09-24 15:04:04,539 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Exception handling the winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) > at > 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596) > at > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:292) > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) > ... 4 more > Caused by: org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.security.token.SecretManager$InvalidToken: hadoop tried to > renew an expired token >
[jira] [Commented] (YARN-2523) ResourceManager UI showing negative value for "Decommissioned Nodes" field
[ https://issues.apache.org/jira/browse/YARN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148135#comment-14148135 ] Jian He commented on YARN-2523: --- [~jlowe], would you like to take another look? > ResourceManager UI showing negative value for "Decommissioned Nodes" field > -- > > Key: YARN-2523 > URL: https://issues.apache.org/jira/browse/YARN-2523 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, webapp >Affects Versions: 3.0.0 >Reporter: Nishan Shetty >Assignee: Rohith > Attachments: YARN-2523.1.patch, YARN-2523.2.patch, YARN-2523.patch, > YARN-2523.patch > > > 1. Decommission one NodeManager by configuring its ip in the excludehost file > 2. Remove the ip from the excludehost file > 3. Execute the -refreshNodes command and restart the decommissioned NodeManager > Observe that the RM UI shows a negative value for the "Decommissioned Nodes" field -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields
[ https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148125#comment-14148125 ] Jian He commented on YARN-668: -- - revert the ContainerManagerImpl and TestApplicationMasterService changes - default value of AMRMTokenIdentifier keyId: {{private int keyId = Integer.MIN_VALUE;}}. The proto definition should have the same default - the following constructors may not be needed. {code} public NMTokenIdentifier(NMTokenIdentifierProto proto) { this.proto = proto; } {code} - why remove the following? {code} // LogAggregationContext is set as null Assert.assertNull(getLogAggregationContextFromContainerToken(rm1, nm1, null)); {code} - remove the commented code {code} /*ByteArrayDataInput input = ByteStreams.newDataInput( containerToken.getIdentifier().array()); ContainerTokenIdentifier containerTokenIdentifier = new ContainerTokenIdentifier(); containerTokenIdentifier.readFields(input);*/ {code} > TokenIdentifier serialization should consider Unknown fields > > > Key: YARN-668 > URL: https://issues.apache.org/jira/browse/YARN-668 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Siddharth Seth >Assignee: Junping Du >Priority: Blocker > Attachments: YARN-668-demo.patch, YARN-668-v2.patch, > YARN-668-v3.patch, YARN-668-v4.patch, YARN-668-v5.patch, YARN-668-v6.patch, > YARN-668-v7.patch, YARN-668.patch > > > This would allow changing of the TokenIdentifier between versions. The > current serialization is Writable. A simple way to achieve this would be to > have a Proto object as the payload for TokenIdentifiers, instead of > individual fields. > TokenIdentifier continues to implement Writable to work with the RPC layer - > but the payload itself is serialized using PB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2605) [RM HA] Rest api endpoints doing redirect incorrectly
bc Wong created YARN-2605: - Summary: [RM HA] Rest api endpoints doing redirect incorrectly Key: YARN-2605 URL: https://issues.apache.org/jira/browse/YARN-2605 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: bc Wong The standby RM's webui tries to do a redirect via meta-refresh. That is fine for pages designed to be viewed by web browsers, but the API endpoints shouldn't do that, since most programmatic HTTP clients do not do meta-refresh. I'd suggest HTTP 303, or returning a well-defined error message (JSON or XML) stating the standby status and a link to the active RM. The standby RM is returning this today: {noformat} $ curl -i http://bcsec-1.ent.cloudera.com:8088/ws/v1/cluster/metrics HTTP/1.1 200 OK Cache-Control: no-cache Expires: Thu, 25 Sep 2014 18:34:53 GMT Date: Thu, 25 Sep 2014 18:34:53 GMT Pragma: no-cache Expires: Thu, 25 Sep 2014 18:34:53 GMT Date: Thu, 25 Sep 2014 18:34:53 GMT Pragma: no-cache Content-Type: text/plain; charset=UTF-8 Refresh: 3; url=http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics Content-Length: 117 Server: Jetty(6.1.26) This is standby RM. Redirecting to the current active RM: http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
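The suggested behavior (an explicit 303 with a Location header and a machine-readable body, instead of an HTML meta-refresh) can be sketched with the JDK's built-in HTTP server. This is a hypothetical illustration, not the actual RM web app code; the class name, port, and active-RM URL are placeholders:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Sketch of a standby REST endpoint answering with HTTP 303 See Other
// plus a Location header and a JSON body, so programmatic clients can
// follow the redirect without parsing HTML. activeRmUrl is a placeholder.
public class StandbyRedirectSketch {
    public static HttpServer start(int port, String activeRmUrl) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/ws/v1", exchange -> {
            String target = activeRmUrl + exchange.getRequestURI();
            exchange.getResponseHeaders().set("Location", target);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            byte[] body = ("{\"standby\":true,\"activeRm\":\"" + target + "\"}")
                    .getBytes(StandardCharsets.UTF_8);
            // 303 makes the redirect explicit at the HTTP level.
            exchange.sendResponseHeaders(303, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

A curl against such an endpoint would show a 303 status line and a Location header, which both browsers and HTTP client libraries handle natively, unlike the Refresh header in the transcript above.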
[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148098#comment-14148098 ] Thomas Graves commented on YARN-1769: - We've been running this now on a cluster for quite a while and it's showing great improvements in the time to get larger containers. I would like to put this in. > CapacityScheduler: Improve reservations > > > Key: YARN-1769 > URL: https://issues.apache.org/jira/browse/YARN-1769 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch > > > Currently the CapacityScheduler uses reservations in order to handle requests > for large containers and the fact there might not currently be enough space > available on a single host. > The current algorithm for reservations is to reserve as many containers as > currently required and then it will start to reserve more above that after a > certain number of re-reservations (currently biased against larger > containers). Anytime it hits the limit of number reserved it stops looking > at any other nodes. This results in potentially missing nodes that have > enough space to fulfill the request. > The other place for improvement is currently reservations count against your > queue capacity. If you have reservations you could hit the various limits > which would then stop you from looking further at that node. > The above 2 cases can cause an application requesting a larger container to > take a long time to get its resources. 
> We could improve upon both of those by simply continuing to look at incoming > nodes to see if we could potentially swap out a reservation for an actual > allocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
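The improvement the YARN-1769 description proposes (keep examining incoming nodes and swap a reservation for an actual allocation when one fits) can be modeled in a few lines. This is a hypothetical model, not the CapacityScheduler's real logic; `ReservationSwap`, `onNodeHeartbeat`, and the string outcomes are illustration-only names.

```java
// Hypothetical model of the proposed behavior (not the actual CapacityScheduler):
// on each node heartbeat, if the node has room for a request that is currently
// only reserved elsewhere, allocate here and release the reservation rather
// than staying blocked waiting for the reserved node to free up.
class ReservationSwap {
    static String onNodeHeartbeat(int nodeFreeMb, int requestMb, boolean hasReservationElsewhere) {
        if (nodeFreeMb >= requestMb) {
            // Enough room on this node: allocate, dropping any stale reservation.
            return hasReservationElsewhere ? "unreserve-and-allocate" : "allocate";
        }
        // Not enough room: keep (or place) a reservation and keep scanning nodes.
        return "reserve";
    }
}
```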
[jira] [Commented] (YARN-2494) [YARN-796] Node label manager API and storage implementations
[ https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148095#comment-14148095 ] Craig Welch commented on YARN-2494: --- ...other kinds of labels..., rather > [YARN-796] Node label manager API and storage implementations > - > > Key: YARN-2494 > URL: https://issues.apache.org/jira/browse/YARN-2494 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, > YARN-2494.patch, YARN-2494.patch, YARN-2494.patch > > > This JIRA includes APIs and storage implementations of node label manager, > NodeLabelManager is an abstract class used to manage labels of nodes in the > cluster, it has APIs to query/modify > - Nodes according to given label > - Labels according to given hostname > - Add/remove labels > - Set labels of nodes in the cluster > - Persist/recover changes of labels/labels-on-nodes to/from storage > And it has two implementations to store modifications > - Memory based storage: It will not persist changes, so all labels will be > lost when RM restart > - FileSystem based storage: It will persist/recover to/from FileSystem (like > HDFS), and all labels and labels-on-nodes will be recovered upon RM restart -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2494) [YARN-796] Node label manager API and storage implementations
[ https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148094#comment-14148094 ] Craig Welch commented on YARN-2494: --- -re I suggest to change addLabels to addNodeLabels because we may support more different kind of labels in the future, change removeLabels to removeExistingLabels, and leave NodeLabelsManager.existingLabels unchanged. I thought we'd settled on just adding "Node" to the names which did not have it, so addNodeLabels, removeNodeLabels, etc. I don't think "Existing" and "Known" are particularly helpful, the concern was to distinguish these as "NodeLabel" operations, to leave room in the future for other kinds of nodes. Also, with the refactor to a "store" type and dropping the configuration option, do we still have a way to specify something other than the hdfs store? wrt leveldb - we ended up with hdfs for the ha case, I think anything we do should be distributed, not local - so zookeeper, hbase, etc. 
> [YARN-796] Node label manager API and storage implementations > - > > Key: YARN-2494 > URL: https://issues.apache.org/jira/browse/YARN-2494 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, > YARN-2494.patch, YARN-2494.patch, YARN-2494.patch > > > This JIRA includes APIs and storage implementations of node label manager, > NodeLabelManager is an abstract class used to manage labels of nodes in the > cluster, it has APIs to query/modify > - Nodes according to given label > - Labels according to given hostname > - Add/remove labels > - Set labels of nodes in the cluster > - Persist/recover changes of labels/labels-on-nodes to/from storage > And it has two implementations to store modifications > - Memory based storage: It will not persist changes, so all labels will be > lost when RM restart > - FileSystem based storage: It will persist/recover to/from FileSystem (like > HDFS), and all labels and labels-on-nodes will be recovered upon RM restart -- This message was sent by Atlassian JIRA (v6.3.4#6332)
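The memory-based storage described in the YARN-2494 summary (label-to-nodes and node-to-labels queries, no persistence) can be sketched as follows. `InMemoryNodeLabelStore` and its method names are assumptions for illustration, not the patch's actual API.

```java
import java.util.*;

// Hypothetical sketch of the memory-based store described above (not the real
// NodeLabelManager): bidirectional label<->node lookup with add/set operations.
// Nothing is persisted, so, as the JIRA notes, all labels are lost on RM restart.
class InMemoryNodeLabelStore {
    private final Map<String, Set<String>> nodesByLabel = new HashMap<>();
    private final Map<String, Set<String>> labelsByNode = new HashMap<>();

    void addLabel(String label) {
        nodesByLabel.putIfAbsent(label, new HashSet<>());
    }

    // Replace the label set of one node, updating both indexes.
    void setLabelsOnNode(String node, Set<String> labels) {
        labelsByNode.put(node, new HashSet<>(labels));
        for (String l : labels) {
            addLabel(l);
            nodesByLabel.get(l).add(node);
        }
    }

    Set<String> getNodesForLabel(String label) {
        return nodesByLabel.getOrDefault(label, Collections.emptySet());
    }

    Set<String> getLabelsForNode(String node) {
        return labelsByNode.getOrDefault(node, Collections.emptySet());
    }
}
```

A FileSystem-backed implementation would keep the same query surface but additionally log each mutation to HDFS so the maps can be rebuilt on RM restart.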
[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields
[ https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148074#comment-14148074 ] Vinod Kumar Vavilapalli commented on YARN-668: -- Quick look at the patch - None of the records in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/yarn_security_token.proto are supposed to be exposed to users. We can move them to a server sub-folder and add an explicit comment in the proto file saying they are not consumable. - What about other tokens? We have Client to AM token, RM delegation-tokens etc. > TokenIdentifier serialization should consider Unknown fields > > > Key: YARN-668 > URL: https://issues.apache.org/jira/browse/YARN-668 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Siddharth Seth >Assignee: Junping Du >Priority: Blocker > Attachments: YARN-668-demo.patch, YARN-668-v2.patch, > YARN-668-v3.patch, YARN-668-v4.patch, YARN-668-v5.patch, YARN-668-v6.patch, > YARN-668-v7.patch, YARN-668.patch > > > This would allow changing of the TokenIdentifier between versions. The > current serialization is Writable. A simple way to achieve this would be to > have a Proto object as the payload for TokenIdentifiers, instead of > individual fields. > TokenIdentifier continues to implement Writable to work with the RPC layer - > but the payload itself is serialized using PB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148059#comment-14148059 ] Hadoop QA commented on YARN-1769: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671255/YARN-1769.patch against trunk revision e0b1dc5. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5127//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5127//console This message is automatically generated. 
> CapacityScheduler: Improve reservations > > > Key: YARN-1769 > URL: https://issues.apache.org/jira/browse/YARN-1769 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch > > > Currently the CapacityScheduler uses reservations in order to handle requests > for large containers and the fact there might not currently be enough space > available on a single host. > The current algorithm for reservations is to reserve as many containers as > currently required and then it will start to reserve more above that after a > certain number of re-reservations (currently biased against larger > containers). Anytime it hits the limit of number reserved it stops looking > at any other nodes. This results in potentially missing nodes that have > enough space to fullfill the request. > The other place for improvement is currently reservations count against your > queue capacity. If you have reservations you could hit the various limits > which would then stop you from looking further at that node. > The above 2 cases can cause an application requesting a larger container to > take a long time to gets it resources. > We could improve upon both of those by simply continuing to look at incoming > nodes to see if we could potentially swap out a reservation for an actual > allocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2594: --- Summary: Potential deadlock in RM when querying ApplicationResourceUsageReport (was: ResourceManger sometimes become un-responsive) > Potential deadlock in RM when querying ApplicationResourceUsageReport > - > > Key: YARN-2594 > URL: https://issues.apache.org/jira/browse/YARN-2594 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Blocker > Attachments: YARN-2594.patch > > > ResoruceManager sometimes become un-responsive: > There was in exception in ResourceManager log and contains only following > type of messages: > {code} > 2014-09-19 19:13:45,241 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000 > 2014-09-19 19:30:26,312 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000 > 2014-09-19 19:47:07,351 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000 > 2014-09-19 20:03:48,460 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000 > 2014-09-19 20:20:29,542 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000 > 2014-09-19 20:37:10,635 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000 > 2014-09-19 20:53:51,722 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2594) ResourceManger sometimes become un-responsive
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148049#comment-14148049 ] Karthik Kambatla commented on YARN-2594: Thanks for working on this, Wangda. As I see, we could adopt the approach in the current patch. If we do so, we should avoid using readLock in other get methods that access {{RMAppImpl#currentAttempt}}. {{RMAppAttemptImpl}} should handle the thread-safety of its fields. Either in addition to or instead of current approach, we really need to cleanup {{SchedulerApplicationAttempt}}. Most of the methods there are synchronized, and many of them just call synchronized methods in {{AppSchedulingInfo}}. Needless to say, this is more involved and we need to be very careful. I am open to adopting the first approach in this JIRA and file follow-up JIRAs to address the second approach suggested. PS: We really need to set up jcarder or something to identify most of these deadlock paths. > ResourceManger sometimes become un-responsive > - > > Key: YARN-2594 > URL: https://issues.apache.org/jira/browse/YARN-2594 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Blocker > Attachments: YARN-2594.patch > > > ResoruceManager sometimes become un-responsive: > There was in exception in ResourceManager log and contains only following > type of messages: > {code} > 2014-09-19 19:13:45,241 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000 > 2014-09-19 19:30:26,312 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000 > 2014-09-19 19:47:07,351 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000 > 2014-09-19 20:03:48,460 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000 > 2014-09-19 20:20:29,542 INFO 
event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000 > 2014-09-19 20:37:10,635 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000 > 2014-09-19 20:53:51,722 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
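The fix direction Karthik outlines for YARN-2594 (don't hold a read lock in getters that reach into `RMAppImpl#currentAttempt`; let `RMAppAttemptImpl` guard its own fields) amounts to a snapshot-then-call pattern. The sketch below is an illustration of that pattern only; `SnapshotGetter`, `Attempt`, and `usedMbSnapshot` are hypothetical names, not the RM's actual code.

```java
// Hypothetical illustration of the discussed fix (not the real RMAppImpl):
// holding the app's read lock while calling into the current attempt invites
// the classic A->B vs B->A lock inversion when another thread holds the
// attempt's lock and wants the app's write lock. Instead, take one volatile
// snapshot of the attempt reference and call out with no app-level lock held.
class SnapshotGetter {
    static class Attempt {
        // The attempt is responsible for its own thread-safety.
        synchronized int usedMb() { return 1024; }
    }

    private volatile Attempt currentAttempt = new Attempt();

    // Safe shape: single volatile read, then the call-out; no app lock held.
    int usedMbSnapshot() {
        Attempt a = currentAttempt;
        return a == null ? 0 : a.usedMb();
    }
}
```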
[jira] [Commented] (YARN-2594) ResourceManger sometimes become un-responsive
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148006#comment-14148006 ] Karthik Kambatla commented on YARN-2594: Taking a look at the issue and the patch.. > ResourceManger sometimes become un-responsive > - > > Key: YARN-2594 > URL: https://issues.apache.org/jira/browse/YARN-2594 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Blocker > Attachments: YARN-2594.patch > > > ResoruceManager sometimes become un-responsive: > There was in exception in ResourceManager log and contains only following > type of messages: > {code} > 2014-09-19 19:13:45,241 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000 > 2014-09-19 19:30:26,312 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000 > 2014-09-19 19:47:07,351 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000 > 2014-09-19 20:03:48,460 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000 > 2014-09-19 20:20:29,542 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000 > 2014-09-19 20:37:10,635 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000 > 2014-09-19 20:53:51,722 INFO event.AsyncDispatcher > (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2604) Scheduler should consider max-allocation-* in conjunction with the largest node
[ https://issues.apache.org/jira/browse/YARN-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148003#comment-14148003 ] Karthik Kambatla commented on YARN-2604: bq. I guess it comes down to whether we really want to immediately fail an app if no node in the cluster at the time of submission has the sufficient resources. If that's OK then we can do a simple change like the one you originally proposed. This would be light-weight and quick particularly for mis-configuration cases, and I think there is merit to doing this in addition to YARN-56. Let me re-open this and work on a patch. > Scheduler should consider max-allocation-* in conjunction with the largest > node > --- > > Key: YARN-2604 > URL: https://issues.apache.org/jira/browse/YARN-2604 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.5.1 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > > If the scheduler max-allocation-* values are larger than the resources > available on the largest node in the cluster, an application requesting > resources between the two values will be accepted by the scheduler but the > requests will never be satisfied. The app essentially hangs forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
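The "simple change" discussed in YARN-2604 (fail an app at submission when no node can ever satisfy its request) reduces to one comparison. This is a hypothetical sketch of that check, not the scheduler's actual code; `SubmissionCheck` and `shouldReject` are illustration-only names.

```java
// Hypothetical sketch of the proposed submission-time validation (not the real
// scheduler code): reject outright when a request fits under max-allocation-*
// but exceeds the largest node currently in the cluster -- the gap where an
// app is accepted today and then hangs forever.
class SubmissionCheck {
    static boolean shouldReject(int requestMb, int maxAllocationMb, int largestNodeMb) {
        if (requestMb > maxAllocationMb) {
            return true; // already rejected today by the max-allocation check
        }
        // The case this JIRA targets: allowed by config, unsatisfiable in practice.
        return requestMb > largestNodeMb;
    }
}
```

As the comment notes, this is light-weight and catches misconfiguration immediately, while YARN-56-style handling of nodes joining or leaving later remains a separate concern.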
[jira] [Reopened] (YARN-2604) Scheduler should consider max-allocation-* in conjunction with the largest node
[ https://issues.apache.org/jira/browse/YARN-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla reopened YARN-2604: > Scheduler should consider max-allocation-* in conjunction with the largest > node > --- > > Key: YARN-2604 > URL: https://issues.apache.org/jira/browse/YARN-2604 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.5.1 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > > If the scheduler max-allocation-* values are larger than the resources > available on the largest node in the cluster, an application requesting > resources between the two values will be accepted by the scheduler but the > requests will never be satisfied. The app essentially hangs forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-1769: Attachment: YARN-1769.patch > CapacityScheduler: Improve reservations > > > Key: YARN-1769 > URL: https://issues.apache.org/jira/browse/YARN-1769 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch > > > Currently the CapacityScheduler uses reservations in order to handle requests > for large containers and the fact there might not currently be enough space > available on a single host. > The current algorithm for reservations is to reserve as many containers as > currently required and then it will start to reserve more above that after a > certain number of re-reservations (currently biased against larger > containers). Anytime it hits the limit of number reserved it stops looking > at any other nodes. This results in potentially missing nodes that have > enough space to fullfill the request. > The other place for improvement is currently reservations count against your > queue capacity. If you have reservations you could hit the various limits > which would then stop you from looking further at that node. > The above 2 cases can cause an application requesting a larger container to > take a long time to gets it resources. > We could improve upon both of those by simply continuing to look at incoming > nodes to see if we could potentially swap out a reservation for an actual > allocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2161) Fix build on macosx: YARN parts
[ https://issues.apache.org/jira/browse/YARN-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147977#comment-14147977 ] Hudson commented on YARN-2161: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1907 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1907/]) YARN-2161. Fix build on macosx: YARN parts (Binglin Chang via aw) (aw: rev 034df0e2eb2824fb46a1e75b52d43d9914a04e56) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/config.h.cmake * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/configuration.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/CMakeLists.txt > Fix build on macosx: YARN parts > --- > > Key: YARN-2161 > URL: https://issues.apache.org/jira/browse/YARN-2161 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Binglin Chang >Assignee: Binglin Chang > Fix For: 2.6.0 > > Attachments: YARN-2161.v1.patch, YARN-2161.v2.patch > > > When compiling on macosx with -Pnative, there are several warning and errors, > fix this would help hadoop developers with macosx env. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2596) TestWorkPreservingRMRestart fails with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147982#comment-14147982 ] Hudson commented on YARN-2596: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1907 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1907/]) YARN-2596. TestWorkPreservingRMRestart fails with FairScheduler. (kasha) (kasha: rev 39c87344e16a08ab69e25345b3bce92aec92db47) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java * hadoop-yarn-project/CHANGES.txt > TestWorkPreservingRMRestart fails with FairScheduler > > > Key: YARN-2596 > URL: https://issues.apache.org/jira/browse/YARN-2596 > Project: Hadoop YARN > Issue Type: Test >Reporter: Junping Du >Assignee: Karthik Kambatla > Fix For: 2.6.0 > > Attachments: yarn-2596-1.patch > > > As test result from YARN-668, the test failure can be reproduce locally > without apply new patch to trunk. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2546) REST API for application creation/submission is using strings for numeric & boolean values
[ https://issues.apache.org/jira/browse/YARN-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147973#comment-14147973 ] Hudson commented on YARN-2546: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1907 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1907/]) YARN-2546. Made REST API for application creation/submission use numeric and boolean types instead of the string of them. Contributed by Varun Vasudev. (zjshen: rev 72b0881ca641fa830c907823f674a5c5e39aa15a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/JAXBContextResolver.java > REST API for application creation/submission is using strings for numeric & > boolean values > -- > > Key: YARN-2546 > URL: https://issues.apache.org/jira/browse/YARN-2546 > Project: Hadoop YARN > Issue Type: Bug > Components: api >Affects Versions: 2.5.1 >Reporter: Doug Haigh >Assignee: Varun Vasudev > Fix For: 2.6.0 > > Attachments: apache-yarn-2546.0.patch, apache-yarn-2546.1.patch > > > When YARN responds with or accepts JSON, numbers & booleans are being > represented as strings which can cause parsing problems. 
> Resource values look like > { > "application-id":"application_1404198295326_0001", > "maximum-resource-capability": >{ > "memory":"8192", > "vCores":"32" >} > } > Instead of > { > "application-id":"application_1404198295326_0001", > "maximum-resource-capability": >{ > "memory":8192, > "vCores":32 >} > } > When I POST to start a job, numeric values are represented as numbers: > "local-resources": > { > "entry": > [ > { > "key":"AppMaster.jar", > "value": > { > > "resource":"hdfs://hdfs-namenode:9000/user/testuser/DistributedShell/demo-app/AppMaster.jar", > "type":"FILE", > "visibility":"APPLICATION", > "size": "43004", > "timestamp": "1405452071209" > } > } > ] > }, > Instead of > "local-resources": > { > "entry": > [ > { > "key":"AppMaster.jar", > "value": > { > > "resource":"hdfs://hdfs-namenode:9000/user/testuser/DistributedShell/demo-app/AppMaster.jar", > "type":"FILE", > "visibility":"APPLICATION", > "size": 43004, > "timestamp": 1405452071209 > } > } > ] > }, > Similarly, Boolean values are also represented as strings: > "keep-containers-across-application-attempts":"false" > Instead of > "keep-containers-across-application-attempts":false -- This message was sent by Atlassian JIRA (v6.3.4#6332)
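The parsing problem YARN-2546 describes can be demonstrated with a strict numeric reader: a typed client that expects a JSON number chokes on the quoted form the API emits. This is an illustration only, not YARN or client code; `JsonTypingDemo` and `readMemory` are hypothetical names.

```java
// Illustrative only (not YARN code): why string-typed numbers in the REST
// responses break typed clients. A strict numeric field rejects the quoted
// form "8192" that the buggy API emits, while the bare number parses fine.
class JsonTypingDemo {
    // Stand-in for a typed client reading the raw "memory" field value.
    static Integer readMemory(String rawFieldValue) {
        try {
            return Integer.valueOf(rawFieldValue.trim());
        } catch (NumberFormatException e) {
            return null; // quoted string value: strict parse fails
        }
    }
}
```

The same asymmetry applies to booleans: `"false"` (a JSON string) is truthy or unparseable in strictly typed bindings, whereas `false` (a JSON boolean) deserializes cleanly.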
[jira] [Commented] (YARN-2102) More generalized timeline ACLs
[ https://issues.apache.org/jira/browse/YARN-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147980#comment-14147980 ] Hudson commented on YARN-2102: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1907 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1907/]) YARN-2102. Added the concept of a Timeline Domain to handle read/write ACLs on Timeline service event data. Contributed by Zhijie Shen. (vinodkv: rev d78b452a4f413c6931a494c33df0666ce9b44973) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/TimelineClient.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TimelineStoreTestUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineACLsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/timeline/TestTimelineRecords.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestTimelineWebServicesWithSSL.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestTimelineWebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineReader.java * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDomain.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineWriter.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDomains.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/webapp/TimelineWebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestMemoryTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestLeveldbTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineACLsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/MemoryTimelineStore.java > More generalized timeline ACLs > -- > > Key: YARN-2102 > URL: https://issues.apache.org/jira/browse/YARN-2102 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Fix For: 2.6.0 > > Attachments: GeneralizedTimelineACLs.pdf, YARN-2102.1.patch, > YARN-2102.2.patch, 
YARN-2102.3.patch, YARN-2102.5.patch, YARN-2102.6.patch, > YARN-2102.7.patch, YARN-2102.8.patch > > > We need to differentiate the access controls of reading and writing > operations, and we need to think about cross-entity access control. For > example, if we are executing a workflow of MR jobs, which writing the > timeline data of this workflow, we don't want other user to pollute the > timeline data of the workflow by putting something under it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2581) NMs need to find a way to get LogAggregationContext
[ https://issues.apache.org/jira/browse/YARN-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147981#comment-14147981 ] Hudson commented on YARN-2581: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1907 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1907/]) YARN-2581. Passed LogAggregationContext to NM via ContainerTokenIdentifier. Contributed by Xuan Gong. (zjshen: rev c86674a3a4d99aa56bb8ed3f6df51e3fef215eba) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestContainerAllocation.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/RMContainerTokenSecretManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/ContainerTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/proto/yarn_server_nodemanager_recovery.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/event/LogHandlerAppStartedEvent.java * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationInitEvent.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManagerRecovery.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManager.java > NMs need to find a way to get LogAggregationContext > --- > > Key: YARN-2581 > URL: https://issues.apache.org/jira/browse/YARN-2581 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager, resourcemanager >Reporter: Xuan Gong >Assignee: Xuan Gong > Fix For: 2.6.0 > > Attachments: YARN-2581.1.patch, YARN-2581.2.patch, YARN-2581.3.patch, > YARN-2581.4.patch > > > After YARN-2569, we have a LogAggregationContext for an application in > ApplicationSubmissionContext. NMs need to find a way to get this information. > We have this requirement: all containers in the same application should > honor the same LogAggregationContext.
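The mechanism committed above can be sketched roughly as follows. This is a hypothetical simplification, not the real Hadoop classes: the RM embeds the application's LogAggregationContext into every container token it issues, so each NM that launches a container of the app recovers the same per-application settings.

```java
// Hypothetical sketch of the YARN-2581 idea (not the actual Hadoop classes):
// the RM puts the application's LogAggregationContext into every container
// token, so all containers of one application carry identical settings.
public class TokenContextSketch {
    // Stand-in for LogAggregationContext: which log files to include/exclude.
    public static final class LogAggregationContext {
        public final String includePattern;
        public final String excludePattern;
        public LogAggregationContext(String inc, String exc) {
            this.includePattern = inc;
            this.excludePattern = exc;
        }
    }

    // Stand-in for ContainerTokenIdentifier: carries the context with the ids.
    public static final class ContainerTokenIdentifier {
        public final String appId;
        public final String containerId;
        public final LogAggregationContext logContext;
        public ContainerTokenIdentifier(String appId, String containerId,
                                        LogAggregationContext ctx) {
            this.appId = appId;
            this.containerId = containerId;
            this.logContext = ctx;
        }
    }

    // RM side: every container of the same application gets the same context.
    public static ContainerTokenIdentifier issueToken(
            String appId, String containerId, LogAggregationContext appCtx) {
        return new ContainerTokenIdentifier(appId, containerId, appCtx);
    }

    public static void main(String[] args) {
        LogAggregationContext ctx = new LogAggregationContext("*.log", "*.tmp");
        ContainerTokenIdentifier t1 = issueToken("app_1", "c_01", ctx);
        ContainerTokenIdentifier t2 = issueToken("app_1", "c_02", ctx);
        // NM side: both containers honor the identical per-application context.
        System.out.println(t1.logContext == t2.logContext); // prints true
    }
}
```

Pushing the context through the token (rather than a separate RPC) means the NM needs no extra round trip: the information arrives with the very first container launch.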
[jira] [Commented] (YARN-2523) ResourceManager UI showing negative value for "Decommissioned Nodes" field
[ https://issues.apache.org/jira/browse/YARN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147955#comment-14147955 ] Jian He commented on YARN-2523: --- +1 for the latest patch. > ResourceManager UI showing negative value for "Decommissioned Nodes" field > -- > > Key: YARN-2523 > URL: https://issues.apache.org/jira/browse/YARN-2523 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, webapp >Affects Versions: 3.0.0 >Reporter: Nishan Shetty >Assignee: Rohith > Attachments: YARN-2523.1.patch, YARN-2523.2.patch, YARN-2523.patch, > YARN-2523.patch > > > 1. Decommission one NodeManager by configuring its IP in the excludehost file > 2. Remove the IP from the excludehost file > 3. Execute the -refreshNodes command and restart the decommissioned NodeManager > Observe that the RM UI shows a negative value for the "Decommissioned Nodes" field
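A counter like "Decommissioned Nodes" can underflow as in the steps above when the metric is decremented on a rejoin without checking that the node was ever counted as decommissioned. The following sketch, which is a hypothetical illustration and not the actual ClusterMetrics code, shows the guarded version:

```java
// Hypothetical sketch of how a "Decommissioned Nodes" gauge can go negative
// (YARN-2523 pattern) and how a guard prevents it. Not the actual Hadoop code.
import java.util.HashSet;
import java.util.Set;

public class DecommissionMetricSketch {
    private int decommissionedNodes = 0;
    // Track which nodes we actually counted, so decrements stay balanced.
    private final Set<String> decommissioned = new HashSet<>();

    public void nodeDecommissioned(String nodeId) {
        if (decommissioned.add(nodeId)) {
            decommissionedNodes++;
        }
    }

    // Called when a node is removed from the exclude list and restarted.
    public void nodeRejoined(String nodeId) {
        // Only decrement if this node was really counted as decommissioned;
        // an unconditional decrement here is what underflows the gauge.
        if (decommissioned.remove(nodeId)) {
            decommissionedNodes--;
        }
    }

    public int getDecommissionedNodes() {
        return decommissionedNodes;
    }

    public static void main(String[] args) {
        DecommissionMetricSketch m = new DecommissionMetricSketch();
        m.nodeDecommissioned("host1:8041");
        m.nodeRejoined("host1:8041");
        m.nodeRejoined("host1:8041"); // duplicate rejoin must not underflow
        System.out.println(m.getDecommissionedNodes()); // prints 0, not -1
    }
}
```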
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147952#comment-14147952 ] Remus Rusanu commented on YARN-2198: The findbugs warning is {code} Inconsistent synchronization of org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.delegationTokenSequenceNumber; locked 71% of time {code} > Remove the need to run NodeManager as privileged account for Windows Secure > Container Executor > -- > > Key: YARN-2198 > URL: https://issues.apache.org/jira/browse/YARN-2198 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, > YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, > YARN-2198.delta.7.patch, YARN-2198.separation.patch, > YARN-2198.trunk.10.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, > YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch > > > YARN-1972 introduces a Secure Windows Container Executor. However this > executor requires the process launching the container to be LocalSystem or > a member of the local Administrators group. Since the process in question > is the NodeManager, the requirement means the entire NM must run as a > privileged account, a very large surface area to review and protect. > This proposal is to move the privileged operations into a dedicated NT > service. The NM can run as a low privilege account and communicate with the > privileged NT service when it needs to launch a container. This would reduce > the surface exposed to the high privileges. > There has to exist a secure, authenticated and authorized channel of > communication between the NM and the privileged NT service. Possible > alternatives are a new TCP endpoint, Java RPC etc. 
My proposal though would > be to use Windows LPC (Local Procedure Calls), which is a Windows platform > specific inter-process communication channel that satisfies all requirements > and is easy to deploy. The privileged NT service would register and listen on > an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop > with libwinutils which would host the LPC client code. The client would > connect to the LPC port (NtConnectPort) and send a message requesting a > container launch (NtRequestWaitReplyPort). LPC provides authentication and > the privileged NT service can use authorization API (AuthZ) to validate the > caller.
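The findbugs warning quoted earlier in this message ("Inconsistent synchronization ... locked 71% of time") fires when a field is accessed under a lock on most code paths but without it on at least one. A minimal hypothetical sketch of the pattern and its fix, not the actual AbstractDelegationTokenSecretManager code:

```java
// Hypothetical sketch of the findbugs "Inconsistent synchronization" pattern:
// a field written under "synchronized" but read on one path without the lock.
// The fix is to take the same lock on every access (or use an atomic field).
// This is not the actual AbstractDelegationTokenSecretManager code.
public class SequenceNumberSketch {
    private int sequenceNumber = 0;

    // Locked writer: the common path, which findbugs counts as "locked".
    public synchronized int incrementSequenceNumber() {
        return ++sequenceNumber;
    }

    // BUG pattern that triggers the warning (kept here as a comment):
    //   public int getSequenceNumber() { return sequenceNumber; }

    // FIX: read under the same lock, so 100% of accesses are synchronized.
    public synchronized int getSequenceNumber() {
        return sequenceNumber;
    }

    public static void main(String[] args) {
        SequenceNumberSketch s = new SequenceNumberSketch();
        s.incrementSequenceNumber();
        s.incrementSequenceNumber();
        System.out.println(s.getSequenceNumber()); // prints 2
    }
}
```

An unlocked read of a plain int is not guaranteed to see the latest write on another thread, which is why the detector treats the mixed pattern as a likely bug rather than a style issue.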
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147946#comment-14147946 ] Remus Rusanu commented on YARN-2198: Core test failure is: {code} Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 120.538 sec <<< FAILURE! - in org.apache.hadoop.crypto.random.TestOsSecureRandom testOsSecureRandomSetConf(org.apache.hadoop.crypto.random.TestOsSecureRandom) Time elapsed: 120.011 sec <<< ERROR! java.lang.Exception: test timed out after 120000 milliseconds at java.io.FileInputStream.readBytes(Native Method) at java.io.FileInputStream.read(FileInputStream.java:220) at java.io.BufferedInputStream.read1(BufferedInputStream.java:256) at java.io.BufferedInputStream.read(BufferedInputStream.java:317) at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264) at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:306) at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158) at java.io.InputStreamReader.read(InputStreamReader.java:167) at java.io.BufferedReader.fill(BufferedReader.java:136) at java.io.BufferedReader.read1(BufferedReader.java:187) at java.io.BufferedReader.read(BufferedReader.java:261) at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:727) at org.apache.hadoop.util.Shell.runCommand(Shell.java:524) at org.apache.hadoop.util.Shell.run(Shell.java:455) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:714) at org.apache.hadoop.crypto.random.TestOsSecureRandom.testOsSecureRandomSetConf(TestOsSecureRandom.java:149) {code} > Remove the need to run NodeManager as privileged account for Windows Secure > Container Executor > -- > > Key: YARN-2198 > URL: https://issues.apache.org/jira/browse/YARN-2198 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, > YARN-2198.delta.4.patch, 
YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, > YARN-2198.delta.7.patch, YARN-2198.separation.patch, > YARN-2198.trunk.10.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, > YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch > > > YARN-1972 introduces a Secure Windows Container Executor. However this > executor requires the process launching the container to be LocalSystem or > a member of the local Administrators group. Since the process in question > is the NodeManager, the requirement means the entire NM must run as a > privileged account, a very large surface area to review and protect. > This proposal is to move the privileged operations into a dedicated NT > service. The NM can run as a low privilege account and communicate with the > privileged NT service when it needs to launch a container. This would reduce > the surface exposed to the high privileges. > There has to exist a secure, authenticated and authorized channel of > communication between the NM and the privileged NT service. Possible > alternatives are a new TCP endpoint, Java RPC etc. My proposal though would > be to use Windows LPC (Local Procedure Calls), which is a Windows platform > specific inter-process communication channel that satisfies all requirements > and is easy to deploy. The privileged NT service would register and listen on > an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop > with libwinutils which would host the LPC client code. The client would > connect to the LPC port (NtConnectPort) and send a message requesting a > container launch (NtRequestWaitReplyPort). LPC provides authentication and > the privileged NT service can use authorization API (AuthZ) to validate the > caller.
[jira] [Commented] (YARN-2604) Scheduler should consider max-allocation-* in conjunction with the largest node
[ https://issues.apache.org/jira/browse/YARN-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147942#comment-14147942 ] Jason Lowe commented on YARN-2604: -- Ah, I see, yes they're a little bit different. They'd be the same if we want to consider the large node that is unhealthy/lost equivalent to an overloaded large node. In both cases we had the resources to satisfy the request at one point but no longer do. I guess it comes down to whether we really want to immediately fail an app if no node in the cluster at the time of submission has sufficient resources. If that's OK then we can do a simple change like the one you originally proposed. If the nodes are there but unusable for some reason (e.g.: unhealthy) and we want to wait around for a bit then it gets closer to what YARN-56 is trying to do. > Scheduler should consider max-allocation-* in conjunction with the largest > node > --- > > Key: YARN-2604 > URL: https://issues.apache.org/jira/browse/YARN-2604 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.5.1 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > > If the scheduler max-allocation-* values are larger than the resources > available on the largest node in the cluster, an application requesting > resources between the two values will be accepted by the scheduler but the > requests will never be satisfied. The app essentially hangs forever.
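The check under discussion can be sketched as follows. This is a hypothetical simplification (memory only, no unhealthy-node handling), not the actual scheduler code: a request is validated against the configured max-allocation and, additionally, against the largest node currently in the cluster, so that a request between the two values is rejected at submission instead of hanging.

```java
// Hypothetical sketch of the YARN-2604 proposal: validate a resource request
// against both the configured max-allocation and the largest live node.
// Simplified to memory only; not the actual YARN scheduler code.
import java.util.Arrays;
import java.util.List;

public class MaxAllocationSketch {
    public static boolean isSatisfiable(long requestedMb, long maxAllocationMb,
                                        List<Long> nodeMemoriesMb) {
        if (requestedMb > maxAllocationMb) {
            return false; // already rejected by the scheduler today
        }
        // Proposed additional check: some node must be able to hold the request.
        long largestNodeMb = nodeMemoriesMb.stream()
                .mapToLong(Long::longValue).max().orElse(0L);
        return requestedMb <= largestNodeMb;
    }

    public static void main(String[] args) {
        List<Long> nodes = Arrays.asList(4096L, 8192L); // largest node: 8 GB
        long maxAllocation = 16384L;                    // max-allocation: 16 GB
        // 12 GB is below max-allocation, but no node can ever satisfy it.
        System.out.println(isSatisfiable(12288L, maxAllocation, nodes)); // prints false
        System.out.println(isSatisfiable(6144L, maxAllocation, nodes));  // prints true
    }
}
```

As the comment above notes, the open design question is not the check itself but when to apply it: failing fast at submission handles the "cluster was never big enough" case, while waiting (as in YARN-56) handles nodes that are temporarily unhealthy or lost.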