[jira] [Commented] (YARN-2468) Log handling for LRS

2014-09-25 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148834#comment-14148834
 ] 

Zhijie Shen commented on YARN-2468:
---

The patch is generally good. Some minor comments and a few questions about the code.

1. Should the first one be marked @VisibleForTesting? And is the second one really necessary?
{code}
-  private static String getNodeString(NodeId nodeId) {
+  public static String getNodeString(NodeId nodeId) {
     return nodeId.toString().replace(":", "_");
   }
-
+
+  public static String getNodeString(String nodeId) {
+    return nodeId.replace(":", "_");
+  }
{code}
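For example (just a sketch, assuming Guava's @VisibleForTesting is on the classpath; the enclosing class name is only a placeholder):
{code}
import org.apache.hadoop.yarn.api.records.NodeId;

import com.google.common.annotations.VisibleForTesting;

public class LogAggregationUtilsSketch {
  // made public only so that tests can call it, not as part of the real API
  @VisibleForTesting
  public static String getNodeString(NodeId nodeId) {
    return nodeId.toString().replace(":", "_");
  }
}
{code}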

2. Add a TODO saying the test will be fixed in a follow-up JIRA, in case we 
forget it?
{code}
+  @Ignore
   @Test
   public void testNoLogs() throws Exception {
{code}
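Something like the following would make the intent harder to miss (sketch only; the enclosing test class name is a placeholder, and JUnit's @Ignore reason string is optional):
{code}
import org.junit.Ignore;
import org.junit.Test;

public class TestNoLogsSketch {
  // TODO: re-enable once the LRS log handling is complete (follow-up JIRA, number TBD)
  @Ignore("Temporarily disabled; will be fixed in a follow-up JIRA")
  @Test
  public void testNoLogs() throws Exception {
    // original test body unchanged
  }
}
{code}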

3. Based on my understanding, uploadedFiles holds the candidate files to upload? 
If so, can we rename the variable and the related methods accordingly?
{code}
+private Set uploadedFiles = new HashSet();
{code}

4. I assume this var is going to capture all the existing log files on HDFS, 
isn't it? If so, its computation seems problematic, because it doesn't exclude 
the files that should be excluded. And what's the effect on alreadyUploadedLogs?
{code}
+private Set allExistingFileMeta = new HashSet();
{code}
{code}
      Iterable<String> mask =
          Iterables.filter(alreadyUploadedLogs, new Predicate<String>() {
            @Override
            public boolean apply(String next) {
              return currentExistingLogFiles.contains(next);
            }
          });
{code}
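To make the intent concrete, here is roughly what I had in mind (a sketch with made-up names using Guava's Sets.difference, not the actual patch code):
{code}
import java.util.Set;

import com.google.common.collect.Sets;

class UploadCandidatesSketch {
  /**
   * Candidate files = everything currently found on HDFS minus the files that
   * the exclude patterns rule out; alreadyUploadedLogs would then be masked
   * against this candidate set rather than against the raw listing.
   */
  static Set<String> candidateFiles(Set<String> allExistingFileMeta,
      Set<String> excludedLogFiles) {
    return Sets.difference(allExistingFileMeta, excludedLogFiles);
  }
}
{code}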

5. Can we make the old LogValue constructor delegate to the new one?
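A generic illustration of what I mean (made-up class, not the real LogValue signature):
{code}
// Keep the old constructor and make it delegate to the new, more general one,
// so the initialization logic lives in a single place.
class DelegationSketch {
  private final String a;
  private final boolean b;

  DelegationSketch(String a) {             // "old" constructor
    this(a, false);                        // delegate with a default for the new argument
  }

  DelegationSketch(String a, boolean b) {  // "new" constructor
    this.a = a;
    this.b = b;
  }
}
{code}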

6. Does LogValue.write really need to be changed?

7. It's recommended to close Closeable objects via IOUtils, but it seems that 
AggregatedLogFormat already had this issue before this patch. Let's file a 
separate ticket for it.
{code}
+if (this.fsDataOStream != null) {
+  this.fsDataOStream.close();
+}
{code}
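For the separate ticket, the idiom would be roughly the following (a sketch using the existing org.apache.hadoop.io.IOUtils#cleanup helper, which is null-safe and logs rather than throws; class and field names here are placeholders):
{code}
import java.io.OutputStream;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.IOUtils;

class CloseSketch {
  private static final Log LOG = LogFactory.getLog(CloseSketch.class);
  private OutputStream fsDataOStream;

  void close() {
    // null-safe; logs and swallows any IOException thrown by close()
    IOUtils.cleanup(LOG, this.fsDataOStream);
  }
}
{code}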

8. nodeId seems to be unused; there is no need to pass it into AppLogAggregatorImpl.
{code}
+  private final NodeId nodeId;
{code}

9. remoteNodeLogDirForApp doesn't affect remoteNodeTmpLogFileForApp, which 
depends only on remoteNodeLogFileForApp. Since remoteNodeLogFileForApp is 
determined at construction, remoteNodeTmpLogFileForApp should stay final and be 
computed once in the constructor as well. The constructor param 
remoteNodeLogDirForApp should also be renamed back to remoteNodeLogFileForApp.
{code}
-  private final Path remoteNodeTmpLogFileForApp;
+  private Path remoteNodeTmpLogFileForApp;
{code}
{code}
-  private Path getRemoteNodeTmpLogFileForApp() {
+  private Path getRemoteNodeTmpLogFileForApp(Path remoteNodeLogDirForApp) {
     return new Path(remoteNodeLogFileForApp.getParent(),
-        (remoteNodeLogFileForApp.getName() + TMP_FILE_SUFFIX));
+        (remoteNodeLogFileForApp.getName() + LogAggregationUtils.TMP_FILE_SUFFIX));
   }
{code}
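In other words, something along these lines (sketch; only the field names and the suffix constant are taken from the patch, the rest is illustrative, and I'm assuming TMP_FILE_SUFFIX is exposed from LogAggregationUtils as in the patch):
{code}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.logaggregation.LogAggregationUtils;

class AppLogAggregatorSketch {
  private final Path remoteNodeLogFileForApp;
  private final Path remoteNodeTmpLogFileForApp;

  AppLogAggregatorSketch(Path remoteNodeLogFileForApp) {
    this.remoteNodeLogFileForApp = remoteNodeLogFileForApp;
    // depends only on remoteNodeLogFileForApp, so compute it once here and
    // keep the field final
    this.remoteNodeTmpLogFileForApp = new Path(
        remoteNodeLogFileForApp.getParent(),
        remoteNodeLogFileForApp.getName() + LogAggregationUtils.TMP_FILE_SUFFIX);
  }
}
{code}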

10. One typo: "uoloaded" should be "uploaded".
{code}
  // if any of the previous uoloaded logs have been deleted,
{code}

11. One question: if a file fails to upload in LogValue.write(), uploadedFiles 
will not reflect that the file is missing, so it will never be uploaded again?

> Log handling for LRS
> 
>
> Key: YARN-2468
> URL: https://issues.apache.org/jira/browse/YARN-2468
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, nodemanager, resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, 
> YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, 
> YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, 
> YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, 
> YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, 
> YARN-2468.7.1.patch, YARN-2468.7.patch
>
>
> Currently, when application is finished, NM will start to do the log 
> aggregation. But for Long running service applications, this is not ideal. 
> The problems we have are:
> 1) LRS applications are expected to run for a long time (weeks, months).
> 2) Currently, all the container logs (from one NM) will be written into a 
> single file. The files could become larger and larger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1051) YARN Admission Control/Planner: enhancing the resource allocation model with time.

2014-09-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148765#comment-14148765
 ] 

Hadoop QA commented on YARN-1051:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671361/YARN-1051.1.patch
  against trunk revision f435724.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 21 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  org.apache.hadoop.mapreduce.lib.input.TestMRCJCFileInputFormat

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5139//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5139//console

This message is automatically generated.

> YARN Admission Control/Planner: enhancing the resource allocation model with 
> time.
> --
>
> Key: YARN-1051
> URL: https://issues.apache.org/jira/browse/YARN-1051
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, resourcemanager, scheduler
>Reporter: Carlo Curino
>Assignee: Carlo Curino
> Attachments: YARN-1051-design.pdf, YARN-1051.1.patch, 
> YARN-1051.patch, curino_MSR-TR-2013-108.pdf, techreport.pdf
>
>
> In this umbrella JIRA we propose to extend the YARN RM to handle time 
> explicitly, allowing users to "reserve" capacity over time. This is an 
> important step towards SLAs, long-running services, workflows, and helps for 
> gang scheduling.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields

2014-09-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148734#comment-14148734
 ] 

Hadoop QA commented on YARN-668:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671371/YARN-668-v9.patch
  against trunk revision e96ce6f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 11 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5141//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5141//console

This message is automatically generated.

> TokenIdentifier serialization should consider Unknown fields
> 
>
> Key: YARN-668
> URL: https://issues.apache.org/jira/browse/YARN-668
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Siddharth Seth
>Assignee: Junping Du
>Priority: Blocker
> Attachments: YARN-668-demo.patch, YARN-668-v2.patch, 
> YARN-668-v3.patch, YARN-668-v4.patch, YARN-668-v5.patch, YARN-668-v6.patch, 
> YARN-668-v7.patch, YARN-668-v8.patch, YARN-668-v9.patch, YARN-668.patch
>
>
> This would allow changing of the TokenIdentifier between versions. The 
> current serialization is Writable. A simple way to achieve this would be to 
> have a Proto object as the payload for TokenIdentifiers, instead of 
> individual fields.
> TokenIdentifier continues to implement Writable to work with the RPC layer - 
> but the payload itself is serialized using PB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-09-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148712#comment-14148712
 ] 

Hadoop QA commented on YARN-2179:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12671373/YARN-2179-trunk-v8.patch
  against trunk revision e96ce6f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5142//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5142//console

This message is automatically generated.

> Initial cache manager structure and context
> ---
>
> Key: YARN-2179
> URL: https://issues.apache.org/jira/browse/YARN-2179
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chris Trezzo
>Assignee: Chris Trezzo
> Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, 
> YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, 
> YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch
>
>
> Implement the initial shared cache manager structure and context. The 
> SCMContext will be used by a number of manager services (i.e. the backing 
> store and the cleaner service). The AppChecker is used to gather the 
> currently running applications on SCM startup (necessary for an scm that is 
> backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport

2014-09-25 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148696#comment-14148696
 ] 

Wangda Tan commented on YARN-2594:
--

I think the previously uploaded patch can still solve the problem. Eliminating 
the read lock in thread#2 means thread#2 is no longer blocked by the pending 
writeLock, so it will release the synchronized lock that thread#1 is waiting 
for, and thread#1 can continue too. After that, thread#3 can finally acquire 
the writeLock.

> Potential deadlock in RM when querying ApplicationResourceUsageReport
> -
>
> Key: YARN-2594
> URL: https://issues.apache.org/jira/browse/YARN-2594
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Karam Singh
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-2594.patch
>
>
> ResoruceManager sometimes become un-responsive:
> There was in exception in ResourceManager log and contains only  following 
> type of messages:
> {code}
> 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
> 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
> 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
> 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
> 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
> 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
> 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2179) Initial cache manager structure and context

2014-09-25 Thread Chris Trezzo (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Trezzo updated YARN-2179:
---
Attachment: YARN-2179-trunk-v8.patch

[~kasha] [~vinodkv]

Attached is v8. This latest patch addresses the most recent comments from 
Karthik.

> Initial cache manager structure and context
> ---
>
> Key: YARN-2179
> URL: https://issues.apache.org/jira/browse/YARN-2179
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chris Trezzo
>Assignee: Chris Trezzo
> Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, 
> YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, 
> YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch
>
>
> Implement the initial shared cache manager structure and context. The 
> SCMContext will be used by a number of manager services (i.e. the backing 
> store and the cleaner service). The AppChecker is used to gather the 
> currently running applications on SCM startup (necessary for an scm that is 
> backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport

2014-09-25 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148676#comment-14148676
 ] 

Wangda Tan commented on YARN-2594:
--

[~zxu],
Thanks for the explanation, it's very helpful. Now I understand how a pending 
write lock can block a read lock.

I've created a test program:
{code}
package sandbox;

import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock.ReadLock;
import java.util.concurrent.locks.ReentrantReadWriteLock.WriteLock;

public class Tester {
  private static class ReadThread implements Runnable {
    private String name;
    private ReadLock readLock;

    ReadThread(String name, ReadLock readLock) {
      this.name = name;
      this.readLock = readLock;
    }

    @Override
    public void run() {
      System.out.println("try lock read - " + name);
      readLock.lock();
      System.out.println("lock read - " + name);
    }
  }

  private static class WriteThread implements Runnable {
    private String name;
    private WriteLock writeLock;

    WriteThread(String name, WriteLock writeLock) {
      this.name = name;
      this.writeLock = writeLock;
    }

    @Override
    public void run() {
      System.out.println("try lock write - " + name);
      writeLock.lock();
      System.out.println("lock write - " + name);
    }
  }

  public static void main(String[] args) throws InterruptedException {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    ReadLock readLock = lock.readLock();
    WriteLock writeLock = lock.writeLock();

    Thread r1 = new Thread(new ReadThread("1", readLock));
    Thread r2 = new Thread(new ReadThread("2", readLock));
    Thread w = new Thread(new WriteThread("3", writeLock));

    r1.start();           // reader 1 acquires the read lock and never releases it
    Thread.sleep(100);
    w.start();            // writer 3 queues up waiting for the write lock
    Thread.sleep(100);
    r2.start();           // reader 2 now blocks behind the queued writer
  }
}
{code}

Exactly as you described, a waiting write lock will block subsequent read locks 
to avoid writer starvation.

> Potential deadlock in RM when querying ApplicationResourceUsageReport
> -
>
> Key: YARN-2594
> URL: https://issues.apache.org/jira/browse/YARN-2594
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Karam Singh
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-2594.patch
>
>
> ResoruceManager sometimes become un-responsive:
> There was in exception in ResourceManager log and contains only  following 
> type of messages:
> {code}
> 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
> 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
> 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
> 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
> 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
> 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
> 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-668) TokenIdentifier serialization should consider Unknown fields

2014-09-25 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-668:

Attachment: YARN-668-v9.patch

Fix test failures in v9 patch.

> TokenIdentifier serialization should consider Unknown fields
> 
>
> Key: YARN-668
> URL: https://issues.apache.org/jira/browse/YARN-668
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Siddharth Seth
>Assignee: Junping Du
>Priority: Blocker
> Attachments: YARN-668-demo.patch, YARN-668-v2.patch, 
> YARN-668-v3.patch, YARN-668-v4.patch, YARN-668-v5.patch, YARN-668-v6.patch, 
> YARN-668-v7.patch, YARN-668-v8.patch, YARN-668-v9.patch, YARN-668.patch
>
>
> This would allow changing of the TokenIdentifier between versions. The 
> current serialization is Writable. A simple way to achieve this would be to 
> have a Proto object as the payload for TokenIdentifiers, instead of 
> individual fields.
> TokenIdentifier continues to implement Writable to work with the RPC layer - 
> but the payload itself is serialized using PB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport

2014-09-25 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148665#comment-14148665
 ] 

zhihai xu commented on YARN-2594:
-

The [ReentrantReadWriteLock | 
http://tutorials.jenkov.com/java-util-concurrent/readwritelock.html] locking 
rules are:
{code}
Read Lock   If no threads have locked the ReadWriteLock for writing, 
and no thread have requested a write lock (but not yet obtained it). 
Thus, multiple threads can lock the lock for reading.
Write Lock  If no threads are reading or writing. 
Thus, only one thread at a time can lock the lock for writing
{code}
Based on the above information, the first three threads can cause a deadlock: 
the readLock is first acquired by thread#1, then thread#3 blocks waiting for 
the writeLock, and finally, when thread#2 tries to acquire the readLock, it is 
also blocked because thread#3 requested the writeLock before thread#2.
So this is not a bug in Java.
The following is the source code from ReentrantReadWriteLock.java:
The following is the source code in ReentrantReadWriteLock.java:
{code}
static final class NonfairSync extends Sync {
    private static final long serialVersionUID = -8159625535654395037L;
    final boolean writerShouldBlock() {
        return false; // writers can always barge
    }
    final boolean readerShouldBlock() {
        /* As a heuristic to avoid indefinite writer starvation,
         * block if the thread that momentarily appears to be head
         * of queue, if one exists, is a waiting writer.  This is
         * only a probabilistic effect since a new reader will not
         * block if there is a waiting writer behind other enabled
         * readers that have not yet drained from the queue.
         */
        return apparentlyFirstQueuedIsExclusive();
    }
}
{code}
readerShouldBlock checks whether any thread has requested the writeLock ahead of the current reader.

> Potential deadlock in RM when querying ApplicationResourceUsageReport
> -
>
> Key: YARN-2594
> URL: https://issues.apache.org/jira/browse/YARN-2594
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Karam Singh
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-2594.patch
>
>
> ResoruceManager sometimes become un-responsive:
> There was in exception in ResourceManager log and contains only  following 
> type of messages:
> {code}
> 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
> 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
> 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
> 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
> 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
> 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
> 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2610) Hamlet doesn't close table tags

2014-09-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148660#comment-14148660
 ] 

Hadoop QA commented on YARN-2610:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671365/YARN-2610-02.patch
  against trunk revision f435724.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5140//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5140//console

This message is automatically generated.

> Hamlet doesn't close table tags
> ---
>
> Key: YARN-2610
> URL: https://issues.apache.org/jira/browse/YARN-2610
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ray Chiang
>Assignee: Ray Chiang
>  Labels: supportability
> Attachments: YARN-2610-01.patch, YARN-2610-02.patch
>
>
> Revisiting a subset of MAPREDUCE-2993.
> The , , , ,  tags are not configured to close 
> properly in Hamlet.  While this is allowed in HTML 4.01, missing closing 
> table tags tends to wreak havoc with a lot of HTML processors (although not 
> usually browsers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport

2014-09-25 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148646#comment-14148646
 ] 

Wangda Tan commented on YARN-2594:
--

[~kasha],
Thanks for your comments. We should definitely reduce the synchronized locking, 
but this problem doesn't seem to be caused by that.
I had a discussion with Jian He, and we found 4 suspicious threads.

Threads #2/#4 try to acquire the readLock but fail, even though at the same 
time *no writeLock is held by anyone* (thread#3 is only waiting for the 
writeLock). This looks more like a Java bug to me.
Below are links describing that bug; some other people claim it has not been 
fixed yet.
1) Java bug description: 
http://webcache.googleusercontent.com/search?q=cache:fjM5oxWzmCsJ:bugs.java.com/view_bug.do%3Fbug_id%3D6822370+&cd=1&hl=en&ct=clnk&gl=hk
2) People report the bug still occurs:
http://cs.oswego.edu/pipermail/concurrency-interest/2010-September/007413.html

Thoughts? Following are threads #1-#4.

*Thread#1*
{code}
"IPC Server handler 45 on 8032" daemon prio=10 tid=0x7f032909b000 
nid=0x7bd7 waiting for monitor entry [0x7f0307aa9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceUsageReport(SchedulerApplicationAttempt.java:541)
- waiting to lock <0xe0e7ea70> (a 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getAppResourceUsageReport(AbstractYarnScheduler.java:196)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.getApplicationResourceUsageReport(RMAppAttemptImpl.java:703)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.createAndGetApplicationReport(RMAppImpl.java:569)
at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:294)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
at 
org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
{code}

*Thread#2*
{code}
"ResourceManager Event Processor" prio=10 tid=0x7f0328db9800 nid=0x7aeb 
waiting on condition [0x7f0311a48000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0xe0e72bc0> (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:964)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
at 
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:731)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.getCurrentAppAttempt(RMAppImpl.java:476)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.updateAttemptMetrics(RMContainerImpl.java:509)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:495)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:484)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
- locked <0xe0e85318> (a 
org.apache.hadoop.yarn.state.S

[jira] [Commented] (YARN-2523) ResourceManager UI showing negative value for "Decommissioned Nodes" field

2014-09-25 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148636#comment-14148636
 ] 

Rohith commented on YARN-2523:
--

Thanks [~jianhe] and [~jlowe] for reviewing and committing this :-)

> ResourceManager UI showing negative value for "Decommissioned Nodes" field
> --
>
> Key: YARN-2523
> URL: https://issues.apache.org/jira/browse/YARN-2523
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, webapp
>Affects Versions: 3.0.0
>Reporter: Nishan Shetty
>Assignee: Rohith
> Fix For: 2.6.0
>
> Attachments: YARN-2523.1.patch, YARN-2523.2.patch, YARN-2523.patch, 
> YARN-2523.patch
>
>
> 1. Decommission one NodeManager by configuring ip in excludehost file
> 2. Remove ip from excludehost file
> 3. Execute -refreshNodes command and restart Decommissioned NodeManager
> Observe that in RM UI negative value for "Decommissioned Nodes" field is shown



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2610) Hamlet doesn't close table tags

2014-09-25 Thread Ray Chiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Chiang updated YARN-2610:
-
Attachment: YARN-2610-02.patch

Fixes for unit tests that don't expect closing table tags.

> Hamlet doesn't close table tags
> ---
>
> Key: YARN-2610
> URL: https://issues.apache.org/jira/browse/YARN-2610
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ray Chiang
>Assignee: Ray Chiang
>  Labels: supportability
> Attachments: YARN-2610-01.patch, YARN-2610-02.patch
>
>
> Revisiting a subset of MAPREDUCE-2993.
> The , , , ,  tags are not configured to close 
> properly in Hamlet.  While this is allowed in HTML 4.01, missing closing 
> table tags tends to wreak havoc with a lot of HTML processors (although not 
> usually browsers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1051) YARN Admission Control/Planner: enhancing the resource allocation model with time.

2014-09-25 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-1051:
-
Attachment: YARN-1051.1.patch

Attaching a patch with the [fixes | 
https://issues.apache.org/jira/browse/YARN-2611?focusedCommentId=14148604] from 
YARN-2611.
 
  * MAPREDUCE-6094 is already tracking the fix for the 
_TestMRCJCFileInputFormat.testAddInputPath()_ test case failure
  * MAPREDUCE-6048 has been opened for the intermittent failure of 
_TestJavaSerialization_

> YARN Admission Control/Planner: enhancing the resource allocation model with 
> time.
> --
>
> Key: YARN-1051
> URL: https://issues.apache.org/jira/browse/YARN-1051
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, resourcemanager, scheduler
>Reporter: Carlo Curino
>Assignee: Carlo Curino
> Attachments: YARN-1051-design.pdf, YARN-1051.1.patch, 
> YARN-1051.patch, curino_MSR-TR-2013-108.pdf, techreport.pdf
>
>
> In this umbrella JIRA we propose to extend the YARN RM to handle time 
> explicitly, allowing users to "reserve" capacity over time. This is an 
> important step towards SLAs, long-running services, workflows, and helps for 
> gang scheduling.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2611) Fix jenkins findbugs warning and test case failures for trunk merge patch

2014-09-25 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-2611:
-
Attachment: YARN-2611.patch

Attaching a patch that fixes the findbugs warnings and 
TestRMWebServicesCapacitySched.

TestJavaSerialization runs successfully on my machine, so the failure must be 
an intermittent one.

TestMRCJCFileInputFormat fails on my machine on both the branch and trunk, and 
the error looks unrelated to our patch.

> Fix jenkins findbugs warning and test case failures for trunk merge patch
> -
>
> Key: YARN-2611
> URL: https://issues.apache.org/jira/browse/YARN-2611
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, resourcemanager, scheduler
>Reporter: Subru Krishnan
>Assignee: Subru Krishnan
> Attachments: YARN-2611.patch
>
>
> This JIRA is to fix jenkins findbugs warnings and test case failures for 
> trunk merge patch  as [reported | 
> https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=14148506] in 
> YARN-1051



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2611) Fix jenkins findbugs warning and test case failures for trunk merge patch

2014-09-25 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-2611:
-
Description: This JIRA is to fix jenkins findbugs warnings and test case 
failures for trunk merge patch  as [reported | 
https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=14148506] in 
YARN-1051  (was: This JIRA is to 
https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=14148506)

> Fix jenkins findbugs warning and test case failures for trunk merge patch
> -
>
> Key: YARN-2611
> URL: https://issues.apache.org/jira/browse/YARN-2611
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, resourcemanager, scheduler
>Reporter: Subru Krishnan
>Assignee: Subru Krishnan
>
> This JIRA is to fix jenkins findbugs warnings and test case failures for 
> trunk merge patch  as [reported | 
> https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=14148506] in 
> YARN-1051



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2611) Fix jenkins findbugs warning and test case failures for trunk merge patch

2014-09-25 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-2611:
-
Description: This JIRA is to 
https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=14148506  
(was: This JIRA is to track the changes required to ensure branch yarn-1051 is 
ready to be merged with trunk. This includes fixing any compilation issues, 
findbug and/or javadoc warning, test cases failures, etc if any.)

> Fix jenkins findbugs warning and test case failures for trunk merge patch
> -
>
> Key: YARN-2611
> URL: https://issues.apache.org/jira/browse/YARN-2611
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, resourcemanager, scheduler
>Reporter: Subru Krishnan
>Assignee: Subru Krishnan
>
> This JIRA is to 
> https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=14148506



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2611) Fix jenkins findbugs warning and test case failures for trunk merge patch

2014-09-25 Thread Subru Krishnan (JIRA)
Subru Krishnan created YARN-2611:


 Summary: Fix jenkins findbugs warning and test case failures for 
trunk merge patch
 Key: YARN-2611
 URL: https://issues.apache.org/jira/browse/YARN-2611
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler, resourcemanager, scheduler
Reporter: Subru Krishnan
Assignee: Subru Krishnan


This JIRA is to track the changes required to ensure branch yarn-1051 is ready 
to be merged with trunk. This includes fixing any compilation issues, findbugs 
and/or javadoc warnings, test case failures, etc., if any.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2610) Hamlet doesn't close table tags

2014-09-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148569#comment-14148569
 ] 

Hadoop QA commented on YARN-2610:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671350/YARN-2610-01.patch
  against trunk revision e9c37de.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common:

  org.apache.hadoop.yarn.webapp.hamlet.TestHamlet
  org.apache.hadoop.yarn.webapp.view.TestInfoBlock

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5138//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5138//console

This message is automatically generated.

> Hamlet doesn't close table tags
> ---
>
> Key: YARN-2610
> URL: https://issues.apache.org/jira/browse/YARN-2610
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ray Chiang
>Assignee: Ray Chiang
>  Labels: supportability
> Attachments: YARN-2610-01.patch
>
>
> Revisiting a subset of MAPREDUCE-2993.
> The , , , ,  tags are not configured to close 
> properly in Hamlet.  While this is allowed in HTML 4.01, missing closing 
> table tags tends to wreak havoc with a lot of HTML processors (although not 
> usually browsers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2608) FairScheduler: Potential deadlocks in loading alloc files and clock access

2014-09-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148562#comment-14148562
 ] 

Hudson commented on YARN-2608:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6115 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6115/])
YARN-2608. FairScheduler: Potential deadlocks in loading alloc files and clock 
access. (Wei Yan via kasha) (kasha: rev 
f4357240a6f81065d91d5f443ed8fc8cd2a14a8f)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* hadoop-yarn-project/CHANGES.txt


> FairScheduler: Potential deadlocks in loading alloc files and clock access
> --
>
> Key: YARN-2608
> URL: https://issues.apache.org/jira/browse/YARN-2608
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wei Yan
>Assignee: Wei Yan
> Fix For: 2.6.0
>
> Attachments: YARN-2608-1.patch, YARN-2608-2.patch, YARN-2608-3.patch
>
>
> Two potential deadlocks exist inside the FairScheduler.
> 1. AllocationFileLoaderService would reload the queue configuration, which 
> calls FairScheduler.AllocationReloadListener.onReload() function. And require 
> *FairScheduler's lock*; 
> {code}
>   public void onReload(AllocationConfiguration queueInfo) {
>   synchronized (FairScheduler.this) {
>   
>   }
>   }
> {code}
> after that, it would require the *QueueManager's queues lock*.
> {code}
>   private FSQueue getQueue(String name, boolean create, FSQueueType 
> queueType) {
>   name = ensureRootPrefix(name);
>   synchronized (queues) {
>   
>   }
>   }
> {code}
> Another thread FairScheduler.assignToQueue may also need to create a new 
> queue when a new job submitted. This thread would hold the *QueueManager's 
> queues lock* firstly, and then would like to hold the *FairScheduler's lock* 
> as it needs to call FairScheduler.getClock() function when creating a new 
> FSLeafQueue. Deadlock may happen here.
> 2. The AllocationFileLoaderService holds  *AllocationFileLoaderService's 
> lock* first, and then waits for *FairScheduler's lock*. Another thread (like 
> AdminService.refreshQueues) may call FairScheduler's reinitialize function, 
> which holds *FairScheduler's lock* first, and then waits for 
> *AllocationFileLoaderService's lock*. Deadlock may happen here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2608) FairScheduler: Potential deadlocks in loading alloc files and clock

2014-09-25 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2608:
---
Summary: FairScheduler: Potential deadlocks in loading alloc files and 
clock  (was: FairScheduler may hung due to two potential deadlocks)

> FairScheduler: Potential deadlocks in loading alloc files and clock
> ---
>
> Key: YARN-2608
> URL: https://issues.apache.org/jira/browse/YARN-2608
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wei Yan
>Assignee: Wei Yan
> Attachments: YARN-2608-1.patch, YARN-2608-2.patch, YARN-2608-3.patch
>
>
> Two potential deadlocks exist inside the FairScheduler.
> 1. AllocationFileLoaderService would reload the queue configuration, which 
> calls FairScheduler.AllocationReloadListener.onReload() function. And require 
> *FairScheduler's lock*; 
> {code}
>   public void onReload(AllocationConfiguration queueInfo) {
>   synchronized (FairScheduler.this) {
>   
>   }
>   }
> {code}
> after that, it would require the *QueueManager's queues lock*.
> {code}
>   private FSQueue getQueue(String name, boolean create, FSQueueType 
> queueType) {
>   name = ensureRootPrefix(name);
>   synchronized (queues) {
>   
>   }
>   }
> {code}
> Another thread FairScheduler.assignToQueue may also need to create a new 
> queue when a new job submitted. This thread would hold the *QueueManager's 
> queues lock* firstly, and then would like to hold the *FairScheduler's lock* 
> as it needs to call FairScheduler.getClock() function when creating a new 
> FSLeafQueue. Deadlock may happen here.
> 2. The AllocationFileLoaderService holds  *AllocationFileLoaderService's 
> lock* first, and then waits for *FairScheduler's lock*. Another thread (like 
> AdminService.refreshQueues) may call FairScheduler's reinitialize function, 
> which holds *FairScheduler's lock* first, and then waits for 
> *AllocationFileLoaderService's lock*. Deadlock may happen here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2608) FairScheduler: Potential deadlocks in loading alloc files and clock access

2014-09-25 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2608:
---
Summary: FairScheduler: Potential deadlocks in loading alloc files and 
clock access  (was: FairScheduler: Potential deadlocks in loading alloc files 
and clock)

> FairScheduler: Potential deadlocks in loading alloc files and clock access
> --
>
> Key: YARN-2608
> URL: https://issues.apache.org/jira/browse/YARN-2608
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wei Yan
>Assignee: Wei Yan
> Attachments: YARN-2608-1.patch, YARN-2608-2.patch, YARN-2608-3.patch
>
>
> Two potential deadlocks exist inside the FairScheduler.
> 1. AllocationFileLoaderService would reload the queue configuration, which 
> calls FairScheduler.AllocationReloadListener.onReload() function. And require 
> *FairScheduler's lock*; 
> {code}
>   public void onReload(AllocationConfiguration queueInfo) {
>   synchronized (FairScheduler.this) {
>   
>   }
>   }
> {code}
> after that, it would require the *QueueManager's queues lock*.
> {code}
>   private FSQueue getQueue(String name, boolean create, FSQueueType 
> queueType) {
>   name = ensureRootPrefix(name);
>   synchronized (queues) {
>   
>   }
>   }
> {code}
> Another thread FairScheduler.assignToQueue may also need to create a new 
> queue when a new job submitted. This thread would hold the *QueueManager's 
> queues lock* firstly, and then would like to hold the *FairScheduler's lock* 
> as it needs to call FairScheduler.getClock() function when creating a new 
> FSLeafQueue. Deadlock may happen here.
> 2. The AllocationFileLoaderService holds  *AllocationFileLoaderService's 
> lock* first, and then waits for *FairScheduler's lock*. Another thread (like 
> AdminService.refreshQueues) may call FairScheduler's reinitialize function, 
> which holds *FairScheduler's lock* first, and then waits for 
> *AllocationFileLoaderService's lock*. Deadlock may happen here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2608) FairScheduler may hung due to two potential deadlocks

2014-09-25 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148538#comment-14148538
 ] 

Karthik Kambatla commented on YARN-2608:


+1. Committing this. 

> FairScheduler may hung due to two potential deadlocks
> -
>
> Key: YARN-2608
> URL: https://issues.apache.org/jira/browse/YARN-2608
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wei Yan
>Assignee: Wei Yan
> Attachments: YARN-2608-1.patch, YARN-2608-2.patch, YARN-2608-3.patch
>
>
> Two potential deadlocks exist inside the FairScheduler.
> 1. AllocationFileLoaderService would reload the queue configuration, which 
> calls FairScheduler.AllocationReloadListener.onReload() function. And require 
> *FairScheduler's lock*; 
> {code}
>   public void onReload(AllocationConfiguration queueInfo) {
>   synchronized (FairScheduler.this) {
>   
>   }
>   }
> {code}
> after that, it would require the *QueueManager's queues lock*.
> {code}
>   private FSQueue getQueue(String name, boolean create, FSQueueType 
> queueType) {
>   name = ensureRootPrefix(name);
>   synchronized (queues) {
>   
>   }
>   }
> {code}
> Another thread FairScheduler.assignToQueue may also need to create a new 
> queue when a new job submitted. This thread would hold the *QueueManager's 
> queues lock* firstly, and then would like to hold the *FairScheduler's lock* 
> as it needs to call FairScheduler.getClock() function when creating a new 
> FSLeafQueue. Deadlock may happen here.
> 2. The AllocationFileLoaderService holds  *AllocationFileLoaderService's 
> lock* first, and then waits for *FairScheduler's lock*. Another thread (like 
> AdminService.refreshQueues) may call FairScheduler's reinitialize function, 
> which holds *FairScheduler's lock* first, and then waits for 
> *AllocationFileLoaderService's lock*. Deadlock may happen here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2610) Hamlet doesn't close table tags

2014-09-25 Thread Ray Chiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Chiang updated YARN-2610:
-
Attachment: YARN-2610-01.patch

Turn on closing tags for HTML table formatting.

> Hamlet doesn't close table tags
> ---
>
> Key: YARN-2610
> URL: https://issues.apache.org/jira/browse/YARN-2610
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ray Chiang
>Assignee: Ray Chiang
>  Labels: supportability
> Attachments: YARN-2610-01.patch
>
>
> Revisiting a subset of MAPREDUCE-2993.
> The , , , ,  tags are not configured to close 
> properly in Hamlet.  While this is allowed in HTML 4.01, missing closing 
> table tags tends to wreak havoc with a lot of HTML processors (although not 
> usually browsers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2610) Hamlet doesn't close table tags

2014-09-25 Thread Ray Chiang (JIRA)
Ray Chiang created YARN-2610:


 Summary: Hamlet doesn't close table tags
 Key: YARN-2610
 URL: https://issues.apache.org/jira/browse/YARN-2610
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ray Chiang
Assignee: Ray Chiang


Revisiting a subset of MAPREDUCE-2993.

The table-related tags are not configured to close properly in Hamlet.  While 
this is allowed in HTML 4.01, missing closing table tags tend to wreak havoc 
with a lot of HTML processors (although not usually browsers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2602) Generic History Service of TimelineServer sometimes not able to handle NPE

2014-09-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148519#comment-14148519
 ] 

Hadoop QA commented on YARN-2602:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671335/YARN-2602.1.patch
  against trunk revision 8269bfa.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5137//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5137//console

This message is automatically generated.

> Generic History Service of TimelineServer sometimes not able to handle NPE
> --
>
> Key: YARN-2602
> URL: https://issues.apache.org/jira/browse/YARN-2602
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.6.0
> Environment: ATS is running with AHS/GHS enabled to use TimelineStore.
> Running for 4-5 days, with many random example jobs running
>Reporter: Karam Singh
>Assignee: Zhijie Shen
> Attachments: YARN-2602.1.patch
>
>
> ATS is running with AHS/GHS enabled to use TimelineStore.
> Running for 4-5 day, with many random example jobs running .
> When I ran WS API for AHS/GHS:
> {code}
> curl --negotiate -u : 
> 'http:///v1/applicationhistory/apps/application_1411579118376_0001'
> {code}
> It ran successfully.
> However
> {code}
> curl --negotiate -u : 
> 'http:///ws/v1/applicationhistory/apps'
> {"exception":"WebApplicationException","message":"java.lang.NullPointerException","javaClassName":"javax.ws.rs.WebApplicationException"}
> {code}
> Failed with Internal server error 500.
> After looking at TimelineServer logs found that there was NPE:



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1051) YARN Admission Control/Planner: enhancing the resource allocation model with time.

2014-09-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148506#comment-14148506
 ] 

Hadoop QA commented on YARN-1051:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671311/YARN-1051.patch
  against trunk revision 9f9a222.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 20 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 8 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  org.apache.hadoop.mapreduce.lib.input.TestMRCJCFileInputFormat
  org.apache.hadoop.mapred.TestJavaSerialization
  
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5133//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5133//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5133//console

This message is automatically generated.

> YARN Admission Control/Planner: enhancing the resource allocation model with 
> time.
> --
>
> Key: YARN-1051
> URL: https://issues.apache.org/jira/browse/YARN-1051
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, resourcemanager, scheduler
>Reporter: Carlo Curino
>Assignee: Carlo Curino
> Attachments: YARN-1051-design.pdf, YARN-1051.patch, 
> curino_MSR-TR-2013-108.pdf, techreport.pdf
>
>
> In this umbrella JIRA we propose to extend the YARN RM to handle time 
> explicitly, allowing users to "reserve" capacity over time. This is an 
> important step towards SLAs, long-running services, workflows, and helps for 
> gang scheduling.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2608) FairScheduler may hung due to two potential deadlocks

2014-09-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148496#comment-14148496
 ] 

Hadoop QA commented on YARN-2608:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671328/YARN-2608-3.patch
  against trunk revision 8269bfa.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5136//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5136//console

This message is automatically generated.

> FairScheduler may hung due to two potential deadlocks
> -
>
> Key: YARN-2608
> URL: https://issues.apache.org/jira/browse/YARN-2608
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wei Yan
>Assignee: Wei Yan
> Attachments: YARN-2608-1.patch, YARN-2608-2.patch, YARN-2608-3.patch
>
>
> Two potential deadlocks exist inside the FairScheduler.
> 1. AllocationFileLoaderService would reload the queue configuration, which 
> calls FairScheduler.AllocationReloadListener.onReload() function. And require 
> *FairScheduler's lock*; 
> {code}
>   public void onReload(AllocationConfiguration queueInfo) {
>   synchronized (FairScheduler.this) {
>   
>   }
>   }
> {code}
> after that, it would require the *QueueManager's queues lock*.
> {code}
>   private FSQueue getQueue(String name, boolean create, FSQueueType 
> queueType) {
>   name = ensureRootPrefix(name);
>   synchronized (queues) {
>   
>   }
>   }
> {code}
> Another thread FairScheduler.assignToQueue may also need to create a new 
> queue when a new job submitted. This thread would hold the *QueueManager's 
> queues lock* firstly, and then would like to hold the *FairScheduler's lock* 
> as it needs to call FairScheduler.getClock() function when creating a new 
> FSLeafQueue. Deadlock may happen here.
> 2. The AllocationFileLoaderService holds  *AllocationFileLoaderService's 
> lock* first, and then waits for *FairScheduler's lock*. Another thread (like 
> AdminService.refreshQueues) may call FairScheduler's reinitialize function, 
> which holds *FairScheduler's lock* first, and then waits for 
> *AllocationFileLoaderService's lock*. Deadlock may happen here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-09-25 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148493#comment-14148493
 ] 

Karthik Kambatla commented on YARN-2179:


Comments:
# Nit: YarnConfiguration - the string constants corresponding to the config 
names are inconsistently indented. My personal preference is to put the value 
being assigned on the subsequent line if it does not all fit on one line.
# The AppChecker constructor that takes a name should use that name.
# In RemoteAppChecker, I would list all the known ACTIVE_STATES explicitly 
instead of using the complement (see the sketch at the end of this comment):
{code}
  private static final EnumSet ACTIVE_STATES =
  EnumSet.complementOf(EnumSet.of(YarnApplicationState.FINISHED,
YarnApplicationState.FAILED,
YarnApplicationState.KILLED));
{code}
# SharedCacheManager: the following two lines should be moved to serviceInit(). 
We can get rid of serviceStart altogether. 
{code}
DefaultMetricsSystem.initialize("SharedCacheManager");
JvmMetrics.initSingleton("SharedCacheManager", null);
{code}
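
For illustration, here is a minimal sketch of what the explicit enumeration 
suggested in item 3 could look like; the list simply mirrors the states that 
the current complement yields, so treat it as a sketch rather than final code:
{code}
  // Hypothetical sketch: enumerate the non-terminal states explicitly instead
  // of taking the complement of FINISHED/FAILED/KILLED.
  private static final EnumSet<YarnApplicationState> ACTIVE_STATES =
      EnumSet.of(YarnApplicationState.NEW,
                 YarnApplicationState.NEW_SAVING,
                 YarnApplicationState.SUBMITTED,
                 YarnApplicationState.ACCEPTED,
                 YarnApplicationState.RUNNING);
{code}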

> Initial cache manager structure and context
> ---
>
> Key: YARN-2179
> URL: https://issues.apache.org/jira/browse/YARN-2179
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chris Trezzo
>Assignee: Chris Trezzo
> Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, 
> YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, 
> YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch
>
>
> Implement the initial shared cache manager structure and context. The 
> SCMContext will be used by a number of manager services (i.e. the backing 
> store and the cleaner service). The AppChecker is used to gather the 
> currently running applications on SCM startup (necessary for an scm that is 
> backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1963) Support priorities across applications within the same queue

2014-09-25 Thread Maysam Yabandeh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148474#comment-14148474
 ] 

Maysam Yabandeh commented on YARN-1963:
---

Thanks [~sunilg] for the design doc.

It might be useful if I share with you our use cases. 

Our most important use case is to let the admin change an app's priority while 
it is running. An example is when a job gets unlucky and takes much longer than 
usual due to node failures or bugs. The user complains that the job is about to 
miss its deadline, and the admin needs a way to prioritize that user's job over 
the other jobs in the queue. This use case seems to be mentioned in Item 3 of 
Section 1.5.3 of the design doc, but its "priority" does not appear to be high.

Another use case is to dynamically give a job higher priority based on its 
status. For example, when a mapper fails and there is no headroom in the queue, 
the job preempts its reducers to make space for its mappers. The freed space, 
however, is not necessarily offered back to the job under fair scheduling. 
Ideally, the job could increase its priority when its reducers are stalled 
waiting for its mappers to be assigned.

bq. Once all these requests of higher priority applications are served, then 
lower priority application requests will get served from Resource Manager.

We are using the fair scheduler, and I assumed this jira is meant to cover it 
as well, since YARN-2098 was created as a sub-task. The design doc, however, 
seems to be fairly centered around the CapacityScheduler. In the case of the 
fair scheduler, I guess the priority could also be incorporated into the fair 
share calculation instead of serving higher priorities in strict order (see 
the sketch below).
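
To make the last point concrete, here is a minimal, purely hypothetical sketch 
of folding priority into the weight used for the fair share computation; the 
method name and the per-level factor are made up for illustration and are not 
part of any patch:
{code}
// Purely illustrative: scale an app's fair-share weight by its priority level
// instead of serving higher-priority apps in strict order. The factor per
// level is an assumed tuning knob, not an existing configuration property.
float priorityAdjustedWeight(float baseWeight, int priority) {
  final float FACTOR_PER_LEVEL = 1.5f;
  return baseWeight * (float) Math.pow(FACTOR_PER_LEVEL, priority);
}
{code}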


> Support priorities across applications within the same queue 
> -
>
> Key: YARN-1963
> URL: https://issues.apache.org/jira/browse/YARN-1963
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: api, resourcemanager
>Reporter: Arun C Murthy
>Assignee: Sunil G
> Attachments: YARN Application Priorities Design.pdf
>
>
> It will be very useful to support priorities among applications within the 
> same queue, particularly in production scenarios. It allows for finer-grained 
> controls without having to force admins to create a multitude of queues, plus 
> allows existing applications to continue using existing queues which are 
> usually part of institutional memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2320) Removing old application history store after we store the history data to timeline store

2014-09-25 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2320:
--
Component/s: timelineserver

> Removing old application history store after we store the history data to 
> timeline store
> 
>
> Key: YARN-2320
> URL: https://issues.apache.org/jira/browse/YARN-2320
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-2320.1.patch, YARN-2320.2.patch
>
>
> After YARN-2033, we should deprecate application history store set. There's 
> no need to maintain two sets of store interfaces. In addition, we should 
> conclude the outstanding jira's under YARN-321 about the application history 
> store.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2320) Removing old application history store after we store the history data to timeline store

2014-09-25 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2320:
--
Target Version/s: 2.6.0

> Removing old application history store after we store the history data to 
> timeline store
> 
>
> Key: YARN-2320
> URL: https://issues.apache.org/jira/browse/YARN-2320
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-2320.1.patch, YARN-2320.2.patch
>
>
> After YARN-2033, we should deprecate application history store set. There's 
> no need to maintain two sets of store interfaces. In addition, we should 
> conclude the outstanding jira's under YARN-321 about the application history 
> store.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails

2014-09-25 Thread Wilfred Spiegelenburg (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148452#comment-14148452
 ] 

Wilfred Spiegelenburg commented on YARN-2578:
-

I proposed fixing the RPC code and setting the timeout by default in HDFS-4858, 
but there was no interest in fixing the client at that point in time. So we now 
have to fix it everywhere, unless we can get everyone on board and get the 
behaviour changed in the RPC code. The comments are still in that jira, and it 
would be a straightforward fix in the RPC code.

> NM does not failover timely if RM node network connection fails
> ---
>
> Key: YARN-2578
> URL: https://issues.apache.org/jira/browse/YARN-2578
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.1
>Reporter: Wilfred Spiegelenburg
> Attachments: YARN-2578.patch
>
>
> The NM does not fail over correctly when the network cable of the RM is 
> unplugged or the failure is simulated by a "service network stop" or a 
> firewall that drops all traffic on the node. The RM fails over to the standby 
> node when the failure is detected as expected. The NM should than re-register 
> with the new active RM. This re-register takes a long time (15 minutes or 
> more). Until then the cluster has no nodes for processing and applications 
> are stuck.
> Reproduction test case which can be used in any environment:
> - create a cluster with 3 nodes
> node 1: ZK, NN, JN, ZKFC, DN, RM, NM
> node 2: ZK, NN, JN, ZKFC, DN, RM, NM
> node 3: ZK, JN, DN, NM
> - start all services make sure they are in good health
> - kill the network connection of the RM that is active using one of the 
> network kills from above
> - observe the NN and RM failover
> - the DN's fail over to the new active NN
> - the NM does not recover for a long time
> - the logs show a long delay and traces show no change at all
> The stack traces of the NM all show the same set of threads. The main thread 
> which should be used in the re-register is the "Node Status Updater" This 
> thread is stuck in:
> {code}
> "Node Status Updater" prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in 
> Object.wait() [0x7f5a51fc1000]
>java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
>   at java.lang.Object.wait(Object.java:503)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1395)
>   - locked <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1362)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source)
>   at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> {code}
> The client connection which goes through the proxy can be traced back to the 
> ResourceTrackerPBClientImpl. The generated proxy does not time out and we 
> should be using a version which takes the RPC timeout (from the 
> configuration) as a parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2602) Generic History Service of TimelineServer sometimes not able to handle NPE

2014-09-25 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2602:
--
Attachment: YARN-2602.1.patch

Created a patch to fix the problem.

> Generic History Service of TimelineServer sometimes not able to handle NPE
> --
>
> Key: YARN-2602
> URL: https://issues.apache.org/jira/browse/YARN-2602
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.6.0
> Environment: ATS is running with AHS/GHS enabled to use TimelineStore.
> Running for 4-5 days, with many random example jobs running
>Reporter: Karam Singh
>Assignee: Zhijie Shen
> Attachments: YARN-2602.1.patch
>
>
> ATS is running with AHS/GHS enabled to use TimelineStore.
> Running for 4-5 day, with many random example jobs running .
> When I ran WS API for AHS/GHS:
> {code}
> curl --negotiate -u : 
> 'http:///v1/applicationhistory/apps/application_1411579118376_0001'
> {code}
> It ran successfully.
> However
> {code}
> curl --negotiate -u : 
> 'http:///ws/v1/applicationhistory/apps'
> {"exception":"WebApplicationException","message":"java.lang.NullPointerException","javaClassName":"javax.ws.rs.WebApplicationException"}
> {code}
> Failed with Internal server error 500.
> After looking at TimelineServer logs found that there was NPE:



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2608) FairScheduler may hung due to two potential deadlocks

2014-09-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148424#comment-14148424
 ] 

Hadoop QA commented on YARN-2608:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671317/YARN-2608-2.patch
  against trunk revision 9f9a222.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 10 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5135//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5135//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5135//console

This message is automatically generated.

> FairScheduler may hung due to two potential deadlocks
> -
>
> Key: YARN-2608
> URL: https://issues.apache.org/jira/browse/YARN-2608
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wei Yan
>Assignee: Wei Yan
> Attachments: YARN-2608-1.patch, YARN-2608-2.patch, YARN-2608-3.patch
>
>
> Two potential deadlocks exist inside the FairScheduler.
> 1. AllocationFileLoaderService would reload the queue configuration, which 
> calls FairScheduler.AllocationReloadListener.onReload() function. And require 
> *FairScheduler's lock*; 
> {code}
>   public void onReload(AllocationConfiguration queueInfo) {
>   synchronized (FairScheduler.this) {
>   
>   }
>   }
> {code}
> after that, it would require the *QueueManager's queues lock*.
> {code}
>   private FSQueue getQueue(String name, boolean create, FSQueueType 
> queueType) {
>   name = ensureRootPrefix(name);
>   synchronized (queues) {
>   
>   }
>   }
> {code}
> Another thread FairScheduler.assignToQueue may also need to create a new 
> queue when a new job submitted. This thread would hold the *QueueManager's 
> queues lock* firstly, and then would like to hold the *FairScheduler's lock* 
> as it needs to call FairScheduler.getClock() function when creating a new 
> FSLeafQueue. Deadlock may happen here.
> 2. The AllocationFileLoaderService holds  *AllocationFileLoaderService's 
> lock* first, and then waits for *FairScheduler's lock*. Another thread (like 
> AdminService.refreshQueues) may call FairScheduler's reinitialize function, 
> which holds *FairScheduler's lock* first, and then waits for 
> *AllocationFileLoaderService's lock*. Deadlock may happen here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2523) ResourceManager UI showing negative value for "Decommissioned Nodes" field

2014-09-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148422#comment-14148422
 ] 

Hudson commented on YARN-2523:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6113 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6113/])
YARN-2523. ResourceManager UI showing negative value for "Decommissioned Nodes" 
field. Contributed by Rohith (jlowe: rev 
8269bfa613999f71767de3c0369817b58cfe1416)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java


> ResourceManager UI showing negative value for "Decommissioned Nodes" field
> --
>
> Key: YARN-2523
> URL: https://issues.apache.org/jira/browse/YARN-2523
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, webapp
>Affects Versions: 3.0.0
>Reporter: Nishan Shetty
>Assignee: Rohith
> Fix For: 2.6.0
>
> Attachments: YARN-2523.1.patch, YARN-2523.2.patch, YARN-2523.patch, 
> YARN-2523.patch
>
>
> 1. Decommission one NodeManager by configuring ip in excludehost file
> 2. Remove ip from excludehost file
> 3. Execute -refreshNodes command and restart Decommissioned NodeManager
> Observe that in RM UI negative value for "Decommissioned Nodes" field is shown



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2608) FairScheduler may hung due to two potential deadlocks

2014-09-25 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2608:
--
Attachment: YARN-2608-3.patch

Updated the patch to fix the findbugs warnings.

> FairScheduler may hung due to two potential deadlocks
> -
>
> Key: YARN-2608
> URL: https://issues.apache.org/jira/browse/YARN-2608
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wei Yan
>Assignee: Wei Yan
> Attachments: YARN-2608-1.patch, YARN-2608-2.patch, YARN-2608-3.patch
>
>
> Two potential deadlocks exist inside the FairScheduler.
> 1. AllocationFileLoaderService would reload the queue configuration, which 
> calls FairScheduler.AllocationReloadListener.onReload() function. And require 
> *FairScheduler's lock*; 
> {code}
>   public void onReload(AllocationConfiguration queueInfo) {
>   synchronized (FairScheduler.this) {
>   
>   }
>   }
> {code}
> after that, it would require the *QueueManager's queues lock*.
> {code}
>   private FSQueue getQueue(String name, boolean create, FSQueueType 
> queueType) {
>   name = ensureRootPrefix(name);
>   synchronized (queues) {
>   
>   }
>   }
> {code}
> Another thread FairScheduler.assignToQueue may also need to create a new 
> queue when a new job submitted. This thread would hold the *QueueManager's 
> queues lock* firstly, and then would like to hold the *FairScheduler's lock* 
> as it needs to call FairScheduler.getClock() function when creating a new 
> FSLeafQueue. Deadlock may happen here.
> 2. The AllocationFileLoaderService holds  *AllocationFileLoaderService's 
> lock* first, and then waits for *FairScheduler's lock*. Another thread (like 
> AdminService.refreshQueues) may call FairScheduler's reinitialize function, 
> which holds *FairScheduler's lock* first, and then waits for 
> *AllocationFileLoaderService's lock*. Deadlock may happen here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields

2014-09-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148405#comment-14148405
 ] 

Hadoop QA commented on YARN-668:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671313/YARN-668-v8.patch
  against trunk revision 9f9a222.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 10 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests:

  org.apache.hadoop.yarn.client.api.impl.TestNMClient
  
org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart
  org.apache.hadoop.yarn.security.TestYARNTokenIdentifier
  
org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerResync
  
org.apache.hadoop.yarn.server.nodemanager.containermanager.TestContainerManagerRecovery
  
org.apache.hadoop.yarn.server.nodemanager.security.TestNMTokenSecretManagerInNM
  
org.apache.hadoop.yarn.server.nodemanager.containermanager.TestContainerManager
  
org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServicesContainers
  
org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater
  org.apache.hadoop.yarn.server.nodemanager.TestEventFlow
  
org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown
  
org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerReboot
  
org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServicesApps
  
org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServices
  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation
  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.TestSchedulerUtils
  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler
  org.apache.hadoop.yarn.server.TestContainerManagerSecurity

  The following test timeouts occurred in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests:

org.apache.hadoop.yarn.client.api.impl.TestAMRMClient

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5134//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5134//console

This message is automatically generated.

> TokenIdentifier serialization should consider Unknown fields
> 
>
> Key: YARN-668
> URL: https://issues.apache.org/jira/browse/YARN-668
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Siddharth Seth
>Assignee: Junping Du
>Priority: Blocker
> Attachments: YARN-668-demo.patch, YARN-668-v2.patch, 
> YARN-668-v3.patch, YARN-668-v4.patch, YARN-668-v5.patch, YARN-668-v6.patch, 
> YARN-668-v7.patch, YARN-668-v8.patch, YARN-668.patch
>
>
> This would allow changing of the TokenIdentifier between versions. The 
> current serialization is Writable. A simple way to achieve this would be to 
> have a Proto object as the payload for TokenIdentifiers, instead of 
> individual fields.
> TokenIdentifier continues to implement Writable to work with the RPC layer - 
> but the payload itself is serialized using PB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2523) ResourceManager UI showing negative value for "Decommissioned Nodes" field

2014-09-25 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148401#comment-14148401
 ] 

Jason Lowe commented on YARN-2523:
--

+1 lgtm.  Committing this.

> ResourceManager UI showing negative value for "Decommissioned Nodes" field
> --
>
> Key: YARN-2523
> URL: https://issues.apache.org/jira/browse/YARN-2523
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, webapp
>Affects Versions: 3.0.0
>Reporter: Nishan Shetty
>Assignee: Rohith
> Attachments: YARN-2523.1.patch, YARN-2523.2.patch, YARN-2523.patch, 
> YARN-2523.patch
>
>
> 1. Decommission one NodeManager by configuring ip in excludehost file
> 2. Remove ip from excludehost file
> 3. Execute -refreshNodes command and restart Decommissioned NodeManager
> Observe that in RM UI negative value for "Decommissioned Nodes" field is shown



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2609) Example of use for the ReservationSystem

2014-09-25 Thread Carlo Curino (JIRA)
Carlo Curino created YARN-2609:
--

 Summary: Example of use for the ReservationSystem
 Key: YARN-2609
 URL: https://issues.apache.org/jira/browse/YARN-2609
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Carlo Curino
Assignee: Carlo Curino
Priority: Minor


This JIRA adds a simple new example to mapreduce-examples that requests a 
reservation and submits a Pi computation within that reservation. It is meant 
just to show how to interact with the reservation system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster

2014-09-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148397#comment-14148397
 ] 

Hadoop QA commented on YARN-913:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671298/YARN-913-010.patch
  against trunk revision 9f9a222.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 36 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 1266 javac 
compiler warnings (more than the trunk's current 1265 warnings).

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 2 
warning messages.
See 
https://builds.apache.org/job/PreCommit-YARN-Build/5131//artifact/PreCommit-HADOOP-Build-patchprocess/diffJavadocWarnings.txt
 for details.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 2 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-common-project/hadoop-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests:

  org.apache.hadoop.ha.TestZKFailoverControllerStress
  
org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell
  
org.apache.hadoop.yarn.registry.secure.TestSecureRMRegistryOperations

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5131//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5131//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-registry.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5131//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-common.html
Javac warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5131//artifact/PreCommit-HADOOP-Build-patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5131//console

This message is automatically generated.

> Add a way to register long-lived services in a YARN cluster
> ---
>
> Key: YARN-913
> URL: https://issues.apache.org/jira/browse/YARN-913
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: api, resourcemanager
>Affects Versions: 2.5.0, 2.4.1
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 
> 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, 
> YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, 
> YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, 
> YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, 
> YARN-913-010.patch, yarnregistry.pdf, yarnregistry.tla
>
>
> In a YARN cluster you can't predict where services will come up -or on what 
> ports. The services need to work those things out as they come up and then 
> publish them somewhere.
> Applications need to be able to find the service instance they are to bond to 
> -and not any others in the cluster.
> Some kind of service registry -in the RM, in ZK, could do this. If the RM 
> held the write access to the ZK nodes, it would be more secure than having 
> apps register with ZK themselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (YARN-2578) NM does not failover timely if RM node network connection fails

2014-09-25 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148381#comment-14148381
 ] 

Karthik Kambatla edited comment on YARN-2578 at 9/25/14 10:04 PM:
--

Thanks for the clarification, Wilfred. Don't we need to do the same for AM->RM 
and Client->RM as well? 

Instead of fixing it everywhere, how about we fix this in RPC itself? In 
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488,
 instead of using 0 as the default value, the default could be looked up in the 
Configuration. No? 

If we think it is better to do it, we should probably create a common JIRA and 
take the opinion from HDFS folks as well. 


was (Author: kkambatl):
Thanks for the clarification, Wilfred. Don't we need to the same from AM->RM 
and Client->RM as well? 

Instead of fixing it everywhere, how about we fix this in RPC itself? In 
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488,
 instead of using 0 as the default value, the default could be looked up in the 
Configuration. No? 

> NM does not failover timely if RM node network connection fails
> ---
>
> Key: YARN-2578
> URL: https://issues.apache.org/jira/browse/YARN-2578
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.1
>Reporter: Wilfred Spiegelenburg
> Attachments: YARN-2578.patch
>
>
> The NM does not fail over correctly when the network cable of the RM is 
> unplugged or the failure is simulated by a "service network stop" or a 
> firewall that drops all traffic on the node. The RM fails over to the standby 
> node when the failure is detected as expected. The NM should than re-register 
> with the new active RM. This re-register takes a long time (15 minutes or 
> more). Until then the cluster has no nodes for processing and applications 
> are stuck.
> Reproduction test case which can be used in any environment:
> - create a cluster with 3 nodes
> node 1: ZK, NN, JN, ZKFC, DN, RM, NM
> node 2: ZK, NN, JN, ZKFC, DN, RM, NM
> node 3: ZK, JN, DN, NM
> - start all services make sure they are in good health
> - kill the network connection of the RM that is active using one of the 
> network kills from above
> - observe the NN and RM failover
> - the DN's fail over to the new active NN
> - the NM does not recover for a long time
> - the logs show a long delay and traces show no change at all
> The stack traces of the NM all show the same set of threads. The main thread 
> which should be used in the re-register is the "Node Status Updater" This 
> thread is stuck in:
> {code}
> "Node Status Updater" prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in 
> Object.wait() [0x7f5a51fc1000]
>java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
>   at java.lang.Object.wait(Object.java:503)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1395)
>   - locked <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1362)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source)
>   at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> {code}
> The client connection which goes through the proxy can be traced back to the 
> ResourceTrackerPBClientImpl. The generated proxy does not time out and we 
> should be using a version which takes the RPC timeout (from the 
> configuration) as a parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails

2014-09-25 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148381#comment-14148381
 ] 

Karthik Kambatla commented on YARN-2578:


Thanks for the clarification, Wilfred. Don't we need to do the same for AM->RM 
and Client->RM as well? 

Instead of fixing it everywhere, how about we fix this in RPC itself? In 
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488,
 instead of using 0 as the default value, the default could be looked up in the 
Configuration. No? 
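
For illustration only, roughly what that could look like in the RPC method 
linked above; the config key name, and the assumption that the surrounding 
overload's parameters are in scope, are made up purely for this sketch:
{code}
// Hypothetical sketch: read a default RPC timeout from the Configuration
// instead of hard-coding 0. The key name below is assumed for illustration.
int rpcTimeout = conf.getInt("ipc.client.rpc.timeout.ms", 0);
return getProtocolProxy(protocol, clientVersion, addr, ticket, conf,
    factory, rpcTimeout, connectionRetryPolicy);
{code}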

> NM does not failover timely if RM node network connection fails
> ---
>
> Key: YARN-2578
> URL: https://issues.apache.org/jira/browse/YARN-2578
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.1
>Reporter: Wilfred Spiegelenburg
> Attachments: YARN-2578.patch
>
>
> The NM does not fail over correctly when the network cable of the RM is 
> unplugged or the failure is simulated by a "service network stop" or a 
> firewall that drops all traffic on the node. The RM fails over to the standby 
> node when the failure is detected as expected. The NM should than re-register 
> with the new active RM. This re-register takes a long time (15 minutes or 
> more). Until then the cluster has no nodes for processing and applications 
> are stuck.
> Reproduction test case which can be used in any environment:
> - create a cluster with 3 nodes
> node 1: ZK, NN, JN, ZKFC, DN, RM, NM
> node 2: ZK, NN, JN, ZKFC, DN, RM, NM
> node 3: ZK, JN, DN, NM
> - start all services make sure they are in good health
> - kill the network connection of the RM that is active using one of the 
> network kills from above
> - observe the NN and RM failover
> - the DN's fail over to the new active NN
> - the NM does not recover for a long time
> - the logs show a long delay and traces show no change at all
> The stack traces of the NM all show the same set of threads. The main thread 
> which should be used in the re-register is the "Node Status Updater" This 
> thread is stuck in:
> {code}
> "Node Status Updater" prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in 
> Object.wait() [0x7f5a51fc1000]
>java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
>   at java.lang.Object.wait(Object.java:503)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1395)
>   - locked <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1362)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source)
>   at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> {code}
> The client connection which goes through the proxy can be traced back to the 
> ResourceTrackerPBClientImpl. The generated proxy does not time out and we 
> should be using a version which takes the RPC timeout (from the 
> configuration) as a parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2608) FairScheduler may hung due to two potential deadlocks

2014-09-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148378#comment-14148378
 ] 

Hadoop QA commented on YARN-2608:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671305/YARN-2608-1.patch
  against trunk revision 9f9a222.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 10 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5132//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5132//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5132//console

This message is automatically generated.

> FairScheduler may hung due to two potential deadlocks
> -
>
> Key: YARN-2608
> URL: https://issues.apache.org/jira/browse/YARN-2608
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wei Yan
>Assignee: Wei Yan
> Attachments: YARN-2608-1.patch, YARN-2608-2.patch
>
>
> Two potential deadlocks exist inside the FairScheduler.
> 1. AllocationFileLoaderService would reload the queue configuration, which 
> calls FairScheduler.AllocationReloadListener.onReload() function. And require 
> *FairScheduler's lock*; 
> {code}
>   public void onReload(AllocationConfiguration queueInfo) {
>   synchronized (FairScheduler.this) {
>   
>   }
>   }
> {code}
> after that, it would require the *QueueManager's queues lock*.
> {code}
>   private FSQueue getQueue(String name, boolean create, FSQueueType 
> queueType) {
>   name = ensureRootPrefix(name);
>   synchronized (queues) {
>   
>   }
>   }
> {code}
> Another thread FairScheduler.assignToQueue may also need to create a new 
> queue when a new job submitted. This thread would hold the *QueueManager's 
> queues lock* firstly, and then would like to hold the *FairScheduler's lock* 
> as it needs to call FairScheduler.getClock() function when creating a new 
> FSLeafQueue. Deadlock may happen here.
> 2. The AllocationFileLoaderService holds  *AllocationFileLoaderService's 
> lock* first, and then waits for *FairScheduler's lock*. Another thread (like 
> AdminService.refreshQueues) may call FairScheduler's reinitialize function, 
> which holds *FairScheduler's lock* first, and then waits for 
> *AllocationFileLoaderService's lock*. Deadlock may happen here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2550) TestAMRestart fails intermittently

2014-09-25 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148376#comment-14148376
 ] 

Jason Lowe commented on YARN-2550:
--

This looks like a dup of YARN-2483.

> TestAMRestart fails intermittently
> --
>
> Key: YARN-2550
> URL: https://issues.apache.org/jira/browse/YARN-2550
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Rohith
>
> testShouldNotCountFailureToMaxAttemptRetry(org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart)
>   Time elapsed: 50.64 sec  <<< FAILURE!
> java.lang.AssertionError: AppAttempt state is not correct (timedout) 
> expected: but was:
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:84)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:417)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.MockRM.launchAM(MockRM.java:582)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.MockRM.launchAndRegisterAM(MockRM.java:589)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForNewAMToLaunchAndRegister(MockRM.java:182)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart.testShouldNotCountFailureToMaxAttemptRetry(TestAMRestart.java:402)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2608) FairScheduler may hung due to two potential deadlocks

2014-09-25 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148354#comment-14148354
 ] 

Karthik Kambatla commented on YARN-2608:


+1, pending Jenkins. 

> FairScheduler may hung due to two potential deadlocks
> -
>
> Key: YARN-2608
> URL: https://issues.apache.org/jira/browse/YARN-2608
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wei Yan
>Assignee: Wei Yan
> Attachments: YARN-2608-1.patch, YARN-2608-2.patch
>
>
> Two potential deadlocks exist inside the FairScheduler.
> 1. AllocationFileLoaderService would reload the queue configuration, which 
> calls FairScheduler.AllocationReloadListener.onReload() function. And require 
> *FairScheduler's lock*; 
> {code}
>   public void onReload(AllocationConfiguration queueInfo) {
>   synchronized (FairScheduler.this) {
>   
>   }
>   }
> {code}
> after that, it would require the *QueueManager's queues lock*.
> {code}
>   private FSQueue getQueue(String name, boolean create, FSQueueType 
> queueType) {
>   name = ensureRootPrefix(name);
>   synchronized (queues) {
>   
>   }
>   }
> {code}
> Another thread FairScheduler.assignToQueue may also need to create a new 
> queue when a new job submitted. This thread would hold the *QueueManager's 
> queues lock* firstly, and then would like to hold the *FairScheduler's lock* 
> as it needs to call FairScheduler.getClock() function when creating a new 
> FSLeafQueue. Deadlock may happen here.
> 2. The AllocationFileLoaderService holds  *AllocationFileLoaderService's 
> lock* first, and then waits for *FairScheduler's lock*. Another thread (like 
> AdminService.refreshQueues) may call FairScheduler's reinitialize function, 
> which holds *FairScheduler's lock* first, and then waits for 
> *AllocationFileLoaderService's lock*. Deadlock may happen here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2602) Generic History Service of TimelineServer sometimes not able to handle NPE

2014-09-25 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148348#comment-14148348
 ] 

Zhijie Shen commented on YARN-2602:
---

The problem is that the YARN_APPLICATION_VIEW_ACLS field can be null. For 
example, for a distributed shell (DS) app, the client leaves the ACL field null.
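
For illustration, the kind of null guard needed; the variable and key names 
below are assumptions for the sketch only, not the actual patch:
{code}
// Hypothetical sketch: treat a missing YARN_APPLICATION_VIEW_ACLS entry as
// "no view ACLs" instead of dereferencing null later on.
Object viewAclsObj = entityInfo.get("YARN_APPLICATION_VIEW_ACLS");
String viewAcls = (viewAclsObj == null) ? "" : viewAclsObj.toString();
{code}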

> Generic History Service of TimelineServer sometimes not able to handle NPE
> --
>
> Key: YARN-2602
> URL: https://issues.apache.org/jira/browse/YARN-2602
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.6.0
> Environment: ATS is running with AHS/GHS enabled to use TimelineStore.
> Running for 4-5 days, with many random example jobs running
>Reporter: Karam Singh
>Assignee: Zhijie Shen
>
> ATS is running with AHS/GHS enabled to use TimelineStore.
> Running for 4-5 day, with many random example jobs running .
> When I ran WS API for AHS/GHS:
> {code}
> curl --negotiate -u : 
> 'http:///v1/applicationhistory/apps/application_1411579118376_0001'
> {code}
> It ran successfully.
> However
> {code}
> curl --negotiate -u : 
> 'http:///ws/v1/applicationhistory/apps'
> {"exception":"WebApplicationException","message":"java.lang.NullPointerException","javaClassName":"javax.ws.rs.WebApplicationException"}
> {code}
> Failed with Internal server error 500.
> After looking at TimelineServer logs found that there was NPE:



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2608) FairScheduler may hung due to two potential deadlocks

2014-09-25 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2608:
--
Attachment: YARN-2608-2.patch

> FairScheduler may hung due to two potential deadlocks
> -
>
> Key: YARN-2608
> URL: https://issues.apache.org/jira/browse/YARN-2608
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wei Yan
>Assignee: Wei Yan
> Attachments: YARN-2608-1.patch, YARN-2608-2.patch
>
>
> Two potential deadlocks exist inside the FairScheduler.
> 1. AllocationFileLoaderService would reload the queue configuration, which 
> calls FairScheduler.AllocationReloadListener.onReload() function. And require 
> *FairScheduler's lock*; 
> {code}
>   public void onReload(AllocationConfiguration queueInfo) {
>   synchronized (FairScheduler.this) {
>   
>   }
>   }
> {code}
> after that, it would require the *QueueManager's queues lock*.
> {code}
>   private FSQueue getQueue(String name, boolean create, FSQueueType 
> queueType) {
>   name = ensureRootPrefix(name);
>   synchronized (queues) {
>   
>   }
>   }
> {code}
> Another thread FairScheduler.assignToQueue may also need to create a new 
> queue when a new job submitted. This thread would hold the *QueueManager's 
> queues lock* firstly, and then would like to hold the *FairScheduler's lock* 
> as it needs to call FairScheduler.getClock() function when creating a new 
> FSLeafQueue. Deadlock may happen here.
> 2. The AllocationFileLoaderService holds  *AllocationFileLoaderService's 
> lock* first, and then waits for *FairScheduler's lock*. Another thread (like 
> AdminService.refreshQueues) may call FairScheduler's reinitialize function, 
> which holds *FairScheduler's lock* first, and then waits for 
> *AllocationFileLoaderService's lock*. Deadlock may happen here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2608) FairScheduler may hung due to two potential deadlocks

2014-09-25 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148328#comment-14148328
 ] 

Karthik Kambatla commented on YARN-2608:


Nit: nothing to do with this patch, but can we annotate FS#setClock as 
VisibleForTesting? 

Otherwise, patch looks good to me. 
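
For reference, a minimal sketch of that nit, assuming the existing setter 
otherwise stays as it is:
{code}
// Hypothetical sketch: mark the test-only clock setter accordingly.
@VisibleForTesting
public synchronized void setClock(Clock clock) {
  this.clock = clock;
}
{code}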

> FairScheduler may hung due to two potential deadlocks
> -
>
> Key: YARN-2608
> URL: https://issues.apache.org/jira/browse/YARN-2608
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wei Yan
>Assignee: Wei Yan
> Attachments: YARN-2608-1.patch
>
>
> Two potential deadlocks exist inside the FairScheduler.
> 1. AllocationFileLoaderService would reload the queue configuration, which 
> calls FairScheduler.AllocationReloadListener.onReload() function. And require 
> *FairScheduler's lock*; 
> {code}
>   public void onReload(AllocationConfiguration queueInfo) {
>   synchronized (FairScheduler.this) {
>   
>   }
>   }
> {code}
> after that, it would require the *QueueManager's queues lock*.
> {code}
>   private FSQueue getQueue(String name, boolean create, FSQueueType 
> queueType) {
>   name = ensureRootPrefix(name);
>   synchronized (queues) {
>   
>   }
>   }
> {code}
> Another thread FairScheduler.assignToQueue may also need to create a new 
> queue when a new job submitted. This thread would hold the *QueueManager's 
> queues lock* firstly, and then would like to hold the *FairScheduler's lock* 
> as it needs to call FairScheduler.getClock() function when creating a new 
> FSLeafQueue. Deadlock may happen here.
> 2. The AllocationFileLoaderService holds  *AllocationFileLoaderService's 
> lock* first, and then waits for *FairScheduler's lock*. Another thread (like 
> AdminService.refreshQueues) may call FairScheduler's reinitialize function, 
> which holds *FairScheduler's lock* first, and then waits for 
> *AllocationFileLoaderService's lock*. Deadlock may happen here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (YARN-668) TokenIdentifier serialization should consider Unknown fields

2014-09-25 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148074#comment-14148074
 ] 

Vinod Kumar Vavilapalli edited comment on YARN-668 at 9/25/14 9:19 PM:
---

Quick look at the patch
 - None of the records in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/yarn_security_token.proto
 are supposed to be exposed to users. We can move them to a server sub-folder 
and add an explicit comment in the proto file saying they are not consumable.
 - What about other tokens? We have Client to AM token, RM delegation-tokens 
etc.


was (Author: vinodkv):
Quick look at the patch
 - None of the records in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/yarn_security_token.proto
 are supposed to be exposed to users. We can move it to a sub-folder server and 
explicit comment in the proto file saying they are consumable.
 - What about other tokens? We have Client to AM token, RM delegation-tokens 
etc.

> TokenIdentifier serialization should consider Unknown fields
> 
>
> Key: YARN-668
> URL: https://issues.apache.org/jira/browse/YARN-668
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Siddharth Seth
>Assignee: Junping Du
>Priority: Blocker
> Attachments: YARN-668-demo.patch, YARN-668-v2.patch, 
> YARN-668-v3.patch, YARN-668-v4.patch, YARN-668-v5.patch, YARN-668-v6.patch, 
> YARN-668-v7.patch, YARN-668-v8.patch, YARN-668.patch
>
>
> This would allow changing of the TokenIdentifier between versions. The 
> current serialization is Writable. A simple way to achieve this would be to 
> have a Proto object as the payload for TokenIdentifiers, instead of 
> individual fields.
> TokenIdentifier continues to implement Writable to work with the RPC layer - 
> but the payload itself is serialized using PB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields

2014-09-25 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148326#comment-14148326
 ] 

Junping Du commented on YARN-668:
-

Thanks [~vinodkv] and [~jianhe] for the review and comments! I addressed your 
comments in the latest v8 patch.
bq. None of the records in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/yarn_security_token.proto
 are supposed to be exposed to users. We can move it to a server sub-folder and 
add an explicit comment in the proto file saying they are consumable.
Good point. Moved the file and added the comments.

bq. What about other tokens? We have Client to AM token, RM delegation-tokens 
etc.
The plan is to address these two tokens in a separate patch, given that the patch 
here is already big and adding them would increase the chance of conflicts with 
other changes that may land around the same time.

bq. containerManagerImpl, TestApplicationMasterService changes revert
There are actually some changes in these two files.

bq. Proto definition should have the same default.
Good point. I have now added a default value to the proto definition. However, 
I am not sure whether protobuf has a named constant similar to Integer.MIN_VALUE, 
so I just use the hard-coded number for now.
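
A minimal, purely illustrative Java sketch of the point above (the class and field 
names are placeholders, not the actual AMRMTokenIdentifier code): the Java-side 
default is Integer.MIN_VALUE, and since protobuf has no named constant for it, a 
matching proto2 default would have to spell out the literal, e.g. 
{{[default = -2147483648]}}.
{code}
// Illustrative only: shows why the proto default must be the hard-coded
// number -2147483648 to match the Java-side Integer.MIN_VALUE default.
public class KeyIdDefaultSketch {

  // Java side: an unset keyId is represented by Integer.MIN_VALUE.
  private int keyId = Integer.MIN_VALUE;

  public int getKeyId() {
    return keyId;
  }

  public static void main(String[] args) {
    // Prints -2147483648, the literal a proto2 [default = ...] would carry.
    System.out.println(new KeyIdDefaultSketch().getKeyId());
  }
}
{code}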

bq." following constructors may be not needed." And "remove the commented code"
Removed.


> TokenIdentifier serialization should consider Unknown fields
> 
>
> Key: YARN-668
> URL: https://issues.apache.org/jira/browse/YARN-668
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Siddharth Seth
>Assignee: Junping Du
>Priority: Blocker
> Attachments: YARN-668-demo.patch, YARN-668-v2.patch, 
> YARN-668-v3.patch, YARN-668-v4.patch, YARN-668-v5.patch, YARN-668-v6.patch, 
> YARN-668-v7.patch, YARN-668-v8.patch, YARN-668.patch
>
>
> This would allow changing of the TokenIdentifier between versions. The 
> current serialization is Writable. A simple way to achieve this would be to 
> have a Proto object as the payload for TokenIdentifiers, instead of 
> individual fields.
> TokenIdentifier continues to implement Writable to work with the RPC layer - 
> but the payload itself is serialized using PB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-25 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148315#comment-14148315
 ] 

Jonathan Eagles commented on YARN-2606:
---

I don't have any context on why login is part of start.

> Application History Server tries to access hdfs before doing secure login
> -
>
> Key: YARN-2606
> URL: https://issues.apache.org/jira/browse/YARN-2606
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Mit Desai
>Assignee: Mit Desai
> Attachments: YARN-2606.patch
>
>
> While testing the Application Timeline Server, the server would not come up 
> in a secure cluster, as it would keep trying to access hdfs without having 
> done the secure login. It would repeatedly try authenticating and finally hit 
> stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-668) TokenIdentifier serialization should consider Unknown fields

2014-09-25 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-668:

Attachment: YARN-668-v8.patch

> TokenIdentifier serialization should consider Unknown fields
> 
>
> Key: YARN-668
> URL: https://issues.apache.org/jira/browse/YARN-668
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Siddharth Seth
>Assignee: Junping Du
>Priority: Blocker
> Attachments: YARN-668-demo.patch, YARN-668-v2.patch, 
> YARN-668-v3.patch, YARN-668-v4.patch, YARN-668-v5.patch, YARN-668-v6.patch, 
> YARN-668-v7.patch, YARN-668-v8.patch, YARN-668.patch
>
>
> This would allow changing of the TokenIdentifier between versions. The 
> current serialization is Writable. A simple way to achieve this would be to 
> have a Proto object as the payload for TokenIdentifiers, instead of 
> individual fields.
> TokenIdentifier continues to implement Writable to work with the RPC layer - 
> but the payload itself is serialized using PB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2608) FairScheduler may hung due to two potential deadlocks

2014-09-25 Thread Wei Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148302#comment-14148302
 ] 

Wei Yan commented on YARN-2608:
---

For the first deadlock, since the clock is only changed by test cases, we can 
simply remove the synchronized blocks and make the clock field volatile. For the 
second deadlock, we can also remove the synchronized keyword from the reinitialize 
and initScheduler functions; the reinitialize function would then acquire the 
*AllocationFileLoaderService's lock* first, and then the *FairScheduler's lock*.
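
A minimal sketch of the first fix, under the assumption that only tests ever swap 
the clock (illustrative code, not the actual FairScheduler):
{code}
// Illustrative sketch: publish the clock via a volatile field so readers
// never need the scheduler lock, removing one edge of the lock cycle.
public class ClockHolderSketch {

  /** Stand-in for org.apache.hadoop.yarn.util.Clock. */
  public interface Clock {
    long getTime();
  }

  // volatile gives safe publication; only tests ever replace the clock.
  private volatile Clock clock = new Clock() {
    @Override
    public long getTime() {
      return System.currentTimeMillis();
    }
  };

  // No synchronized here, so callers (e.g. while creating an FSLeafQueue)
  // do not need the scheduler lock just to read the clock.
  public Clock getClock() {
    return clock;
  }

  // Used by test cases only.
  public void setClock(Clock newClock) {
    this.clock = newClock;
  }
}
{code}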

> FairScheduler may hung due to two potential deadlocks
> -
>
> Key: YARN-2608
> URL: https://issues.apache.org/jira/browse/YARN-2608
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wei Yan
>Assignee: Wei Yan
> Attachments: YARN-2608-1.patch
>
>
> Two potential deadlocks exist inside the FairScheduler.
> 1. AllocationFileLoaderService would reload the queue configuration, which 
> calls the FairScheduler.AllocationReloadListener.onReload() function and 
> requires the *FairScheduler's lock*:
> {code}
>   public void onReload(AllocationConfiguration queueInfo) {
>   synchronized (FairScheduler.this) {
>   
>   }
>   }
> {code}
> after that, it would require the *QueueManager's queues lock*.
> {code}
>   private FSQueue getQueue(String name, boolean create, FSQueueType 
> queueType) {
>   name = ensureRootPrefix(name);
>   synchronized (queues) {
>   
>   }
>   }
> {code}
> Another thread, FairScheduler.assignToQueue, may also need to create a new 
> queue when a new job is submitted. This thread holds the *QueueManager's 
> queues lock* first, and then wants to hold the *FairScheduler's lock* 
> because it needs to call the FairScheduler.getClock() function when creating 
> a new FSLeafQueue. Deadlock may happen here.
> 2. The AllocationFileLoaderService holds *AllocationFileLoaderService's 
> lock* first, and then waits for *FairScheduler's lock*. Another thread (like 
> AdminService.refreshQueues) may call FairScheduler's reinitialize function, 
> which holds *FairScheduler's lock* first, and then waits for 
> *AllocationFileLoaderService's lock*. Deadlock may happen here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1051) YARN Admission Control/Planner: enhancing the resource allocation model with time.

2014-09-25 Thread Subru Krishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-1051:
-
Attachment: YARN-1051.patch

I am attaching a merge patch with trunk for easy reference. This patch was 
created after rebasing branch yarn-1051 onto trunk. I ran test-patch against 
trunk with the attached patch on my box and got a +1.

> YARN Admission Control/Planner: enhancing the resource allocation model with 
> time.
> --
>
> Key: YARN-1051
> URL: https://issues.apache.org/jira/browse/YARN-1051
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, resourcemanager, scheduler
>Reporter: Carlo Curino
>Assignee: Carlo Curino
> Attachments: YARN-1051-design.pdf, YARN-1051.patch, 
> curino_MSR-TR-2013-108.pdf, techreport.pdf
>
>
> In this umbrella JIRA we propose to extend the YARN RM to handle time 
> explicitly, allowing users to "reserve" capacity over time. This is an 
> important step towards SLAs, long-running services, workflows, and helps for 
> gang scheduling.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2608) FairScheduler may hung due to two potential deadlocks

2014-09-25 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2608:
--
Description: 
Two potential deadlocks exist inside the FairScheduler.
1. AllocationFileLoaderService would reload the queue configuration, which 
calls the FairScheduler.AllocationReloadListener.onReload() function and 
requires the *FairScheduler's lock*:
{code}
  public void onReload(AllocationConfiguration queueInfo) {
  synchronized (FairScheduler.this) {
  
  }
  }
{code}
after that, it would require the *QueueManager's queues lock*.
{code}
  private FSQueue getQueue(String name, boolean create, FSQueueType queueType) {
  name = ensureRootPrefix(name);
  synchronized (queues) {
  
  }
  }
{code}

Another thread, FairScheduler.assignToQueue, may also need to create a new queue 
when a new job is submitted. This thread holds the *QueueManager's queues 
lock* first, and then wants to hold the *FairScheduler's lock* because it 
needs to call the FairScheduler.getClock() function when creating a new 
FSLeafQueue. Deadlock may happen here.

2. The AllocationFileLoaderService holds *AllocationFileLoaderService's lock* 
first, and then waits for *FairScheduler's lock*. Another thread (like 
AdminService.refreshQueues) may call FairScheduler's reinitialize function, 
which holds *FairScheduler's lock* first, and then waits for 
*AllocationFileLoaderService's lock*. Deadlock may happen here.



  was:Two potential deadlocks exist inside the FairScheduler.


> FairScheduler may hung due to two potential deadlocks
> -
>
> Key: YARN-2608
> URL: https://issues.apache.org/jira/browse/YARN-2608
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wei Yan
>Assignee: Wei Yan
> Attachments: YARN-2608-1.patch
>
>
> Two potential deadlocks exist inside the FairScheduler.
> 1. AllocationFileLoaderService would reload the queue configuration, which 
> calls the FairScheduler.AllocationReloadListener.onReload() function and 
> requires the *FairScheduler's lock*:
> {code}
>   public void onReload(AllocationConfiguration queueInfo) {
>   synchronized (FairScheduler.this) {
>   
>   }
>   }
> {code}
> after that, it would require the *QueueManager's queues lock*.
> {code}
>   private FSQueue getQueue(String name, boolean create, FSQueueType 
> queueType) {
>   name = ensureRootPrefix(name);
>   synchronized (queues) {
>   
>   }
>   }
> {code}
> Another thread, FairScheduler.assignToQueue, may also need to create a new 
> queue when a new job is submitted. This thread holds the *QueueManager's 
> queues lock* first, and then wants to hold the *FairScheduler's lock* 
> because it needs to call the FairScheduler.getClock() function when creating 
> a new FSLeafQueue. Deadlock may happen here.
> 2. The AllocationFileLoaderService holds *AllocationFileLoaderService's 
> lock* first, and then waits for *FairScheduler's lock*. Another thread (like 
> AdminService.refreshQueues) may call FairScheduler's reinitialize function, 
> which holds *FairScheduler's lock* first, and then waits for 
> *AllocationFileLoaderService's lock*. Deadlock may happen here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148288#comment-14148288
 ] 

Hadoop QA commented on YARN-2606:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671295/YARN-2606.patch
  against trunk revision 1861b32.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5130//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5130//console

This message is automatically generated.

> Application History Server tries to access hdfs before doing secure login
> -
>
> Key: YARN-2606
> URL: https://issues.apache.org/jira/browse/YARN-2606
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Mit Desai
>Assignee: Mit Desai
> Attachments: YARN-2606.patch
>
>
> While testing the Application Timeline Server, the server would not come up 
> in a secure cluster, as it would keep trying to access hdfs without having 
> done the secure login. It would repeatedly try authenticating and finally hit 
> stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-25 Thread Mit Desai (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148282#comment-14148282
 ] 

Mit Desai commented on YARN-2606:
-

Thanks for the suggestion [~zjshen]. Moving the FS operations to 
serviceStart() would work too, but I went with this option because, to me, 
doing the login during initialization makes more sense than doing it midway 
through startup.
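
A hedged sketch of that ordering (the class name and config keys below are 
illustrative placeholders, not the actual ApplicationHistoryServer code): log in 
inside serviceInit(), before any child service that touches HDFS is initialized.
{code}
// Illustrative only: do the Kerberos login first thing in serviceInit(),
// so later init/start steps that touch HDFS run as the logged-in user.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.service.CompositeService;

public class SecureLoginFirstService extends CompositeService {

  public SecureLoginFirstService() {
    super(SecureLoginFirstService.class.getName());
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    if (UserGroupInformation.isSecurityEnabled()) {
      doSecureLogin(conf);
    }
    // Child services that open FileSystem handles are added/inited after this.
    super.serviceInit(conf);
  }

  private void doSecureLogin(Configuration conf) throws IOException {
    // Config key names here are placeholders; the real service reads its own keys.
    SecurityUtil.login(conf, "yarn.timeline-service.keytab",
        "yarn.timeline-service.principal");
  }
}
{code}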

> Application History Server tries to access hdfs before doing secure login
> -
>
> Key: YARN-2606
> URL: https://issues.apache.org/jira/browse/YARN-2606
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Mit Desai
>Assignee: Mit Desai
> Attachments: YARN-2606.patch
>
>
> While testing the Application Timeline Server, the server would not come up 
> in a secure cluster, as it would keep trying to access hdfs without having 
> done the secure login. It would repeatedly try authenticating and finally hit 
> stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2608) FairScheduler may hung due to two potential deadlocks

2014-09-25 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2608:
--
Attachment: YARN-2608-1.patch

> FairScheduler may hung due to two potential deadlocks
> -
>
> Key: YARN-2608
> URL: https://issues.apache.org/jira/browse/YARN-2608
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wei Yan
>Assignee: Wei Yan
> Attachments: YARN-2608-1.patch
>
>
> Two potential deadlocks exist inside the FairScheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-25 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148278#comment-14148278
 ] 

Zhijie Shen commented on YARN-2606:
---

[~jeagles], I saw that doSecureLogin is invoked at the start stage in both the RM 
and the NM, and I'm a bit concerned that moving it to init will cause some 
unexpected behavior. Do you have any idea about the rationale behind this 
choice? I wasn't aware of it before.

> Application History Server tries to access hdfs before doing secure login
> -
>
> Key: YARN-2606
> URL: https://issues.apache.org/jira/browse/YARN-2606
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Mit Desai
>Assignee: Mit Desai
> Attachments: YARN-2606.patch
>
>
> While testing the Application Timeline Server, the server would not come up 
> in a secure cluster, as it would keep trying to access hdfs without having 
> done the secure login. It would repeatedly try authenticating and finally hit 
> stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2608) FairScheduler may hung due to two potential deadlocks

2014-09-25 Thread Wei Yan (JIRA)
Wei Yan created YARN-2608:
-

 Summary: FairScheduler may hung due to two potential deadlocks
 Key: YARN-2608
 URL: https://issues.apache.org/jira/browse/YARN-2608
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wei Yan
Assignee: Wei Yan


Two potential deadlocks exist inside the FairScheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-25 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148273#comment-14148273
 ] 

Jonathan Eagles commented on YARN-2606:
---

I can see this both ways. It seems correct both to log in during initialization 
and to wait until start to do file operations, although a fix to either one 
of them does indeed fix the issue at hand.

> Application History Server tries to access hdfs before doing secure login
> -
>
> Key: YARN-2606
> URL: https://issues.apache.org/jira/browse/YARN-2606
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Mit Desai
>Assignee: Mit Desai
> Attachments: YARN-2606.patch
>
>
> While testing the Application Timeline Server, the server would not come up 
> in a secure cluster, as it would keep trying to access hdfs without having 
> done the secure login. It would repeatedly try authenticating and finally hit 
> stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-913) Add a way to register long-lived services in a YARN cluster

2014-09-25 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated YARN-913:

Attachment: YARN-913-010.patch

This patch doesn't look at why yesterday's Jenkins tests failed, so if the 
failures are due to these changes, they won't have been fixed.

Key changes come from experience implementing a read-only REST view (not in 
this patch).
# renamed fields in the {{ServiceRecord}} because Jersey ignores 
{{@JsonProperty}} annotations that give fields specific names. So there is no 
{{yarn:id}} or {{yarn:persistence}} in the JSON; the fields are called 
{{yarn_id}} and {{yarn_persistence}} instead.
# Added a specific exception, {{NoRecordException}}, to differentiate "could not 
resolve a node because there isn't any entry with the header used to identify 
service records" from {{InvalidRecordException}}, which is only triggered on 
parse problems.
# added a lightweight {{list()}} operation that only returns the child paths; 
the original {{list(path) -> List}} has been renamed to {{listFull}} (a sketch 
of the split follows below).

There's a CLI client for this being written; it'll help validate the API and 
identify any further points for tuning.
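
A minimal illustrative sketch of the {{list()}} / {{listFull()}} split described 
above; the names and return types here are guesses, not the patch's actual API.
{code}
// Illustrative only: a cheap child-path listing next to the heavier call
// that resolves every child into its full service record.
import java.io.IOException;
import java.util.List;

// Stand-in for the registry's record type.
class ServiceRecordStub { }

interface RegistryListSketch {

  /** Lightweight: just the names of the child paths under {@code path}. */
  List<String> list(String path) throws IOException;

  /** Heavier (previously {@code list}): resolve each child into its record. */
  List<ServiceRecordStub> listFull(String path) throws IOException;
}
{code}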

> Add a way to register long-lived services in a YARN cluster
> ---
>
> Key: YARN-913
> URL: https://issues.apache.org/jira/browse/YARN-913
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: api, resourcemanager
>Affects Versions: 2.5.0, 2.4.1
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 
> 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, 
> YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, 
> YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, 
> YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, 
> YARN-913-010.patch, yarnregistry.pdf, yarnregistry.tla
>
>
> In a YARN cluster you can't predict where services will come up -or on what 
> ports. The services need to work those things out as they come up and then 
> publish them somewhere.
> Applications need to be able to find the service instance they are to bond to 
> -and not any others in the cluster.
> Some kind of service registry -in the RM, in ZK, could do this. If the RM 
> held the write access to the ZK nodes, it would be more secure than having 
> apps register with ZK themselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-09-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148260#comment-14148260
 ] 

Hadoop QA commented on YARN-2198:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12671287/YARN-2198.trunk.10.patch
  against trunk revision 6c54308.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5128//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5128//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-common.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5128//console

This message is automatically generated.

> Remove the need to run NodeManager as privileged account for Windows Secure 
> Container Executor
> --
>
> Key: YARN-2198
> URL: https://issues.apache.org/jira/browse/YARN-2198
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Remus Rusanu
>Assignee: Remus Rusanu
>  Labels: security, windows
> Attachments: YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, 
> YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, 
> YARN-2198.delta.7.patch, YARN-2198.separation.patch, 
> YARN-2198.trunk.10.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, 
> YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch
>
>
> YARN-1972 introduces a Secure Windows Container Executor. However, this 
> executor requires the process launching the container to be LocalSystem or 
> a member of the local Administrators group. Since the process in question 
> is the NodeManager, the requirement translates into the entire NM running as a 
> privileged account, a very large surface area to review and protect.
> This proposal is to move the privileged operations into a dedicated NT 
> service. The NM can run as a low privilege account and communicate with the 
> privileged NT service when it needs to launch a container. This would reduce 
> the surface exposed to the high privileges. 
> There has to exist a secure, authenticated and authorized channel of 
> communication between the NM and the privileged NT service. Possible 
> alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
> be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
> specific inter-process communication channel that satisfies all requirements 
> and is easy to deploy. The privileged NT service would register and listen on 
> an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
> with libwinutils which would host the LPC client code. The client would 
> connect to the LPC port (NtConnectPort) and send a message requesting a 
> container launch (NtRequestWaitReplyPort). LPC provides authentication and 
> the privileged NT service can use authorization API (AuthZ) to validate the 
> caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-25 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148255#comment-14148255
 ] 

Zhijie Shen commented on YARN-2606:
---

The right fix may be moving the FS operations to serviceStart(). See the 
similar code in FileSystemRMStateStore:
{code}
  @Override
  protected synchronized void startInternal() throws Exception {
    // create filesystem only now, as part of service-start. By this time, RM is
    // authenticated with kerberos so we are good to create a file-system
    // handle.
    Configuration conf = new Configuration(getConfig());
    conf.setBoolean("dfs.client.retry.policy.enabled", true);
    String retryPolicy =
        conf.get(YarnConfiguration.FS_RM_STATE_STORE_RETRY_POLICY_SPEC,
            YarnConfiguration.DEFAULT_FS_RM_STATE_STORE_RETRY_POLICY_SPEC);
    conf.set("dfs.client.retry.policy.spec", retryPolicy);

    fs = fsWorkingPath.getFileSystem(conf);
    fs.mkdirs(rmDTSecretManagerRoot);
    fs.mkdirs(rmAppRoot);
    fs.mkdirs(amrmTokenSecretManagerRoot);
  }
{code}

BTW, we're thinking about removing the old application history store stack 
(YARN-2320).

> Application History Server tries to access hdfs before doing secure login
> -
>
> Key: YARN-2606
> URL: https://issues.apache.org/jira/browse/YARN-2606
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Mit Desai
>Assignee: Mit Desai
> Attachments: YARN-2606.patch
>
>
> While testing the Application Timeline Server, the server would not come up 
> in a secure cluster, as it would keep trying to access hdfs without having 
> done the secure login. It would repeatedly try authenticating and finally hit 
> stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2607) TestDistributedShell fails in trunk

2014-09-25 Thread Ted Yu (JIRA)
Ted Yu created YARN-2607:


 Summary: TestDistributedShell fails in trunk
 Key: YARN-2607
 URL: https://issues.apache.org/jira/browse/YARN-2607
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Ted Yu


From https://builds.apache.org/job/Hadoop-Yarn-trunk/691/console :
{code}
testDSRestartWithPreviousRunningContainers(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell)
  Time elapsed: 35.641 sec  <<< FAILURE!
java.lang.AssertionError: null
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at 
org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSRestartWithPreviousRunningContainers(TestDistributedShell.java:308)
{code}
On Linux, I got the following locally:
{code}
testDSAttemptFailuresValidityIntervalFailed(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell)
  Time elapsed: 64.715 sec  <<< FAILURE!
java.lang.AssertionError: null
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertFalse(Assert.java:64)
at org.junit.Assert.assertFalse(Assert.java:74)
at 
org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSAttemptFailuresValidityIntervalFailed(TestDistributedShell.java:384)

testDSAttemptFailuresValidityIntervalSucess(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell)
  Time elapsed: 115.842 sec  <<< ERROR!
java.lang.Exception: test timed out after 9 milliseconds
at java.lang.Thread.sleep(Native Method)
at 
org.apache.hadoop.yarn.applications.distributedshell.Client.monitorApplication(Client.java:680)
at 
org.apache.hadoop.yarn.applications.distributedshell.Client.run(Client.java:661)
at 
org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSAttemptFailuresValidityIntervalSucess(TestDistributedShell.java:342)

testDSRestartWithPreviousRunningContainers(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell)
  Time elapsed: 35.633 sec  <<< FAILURE!
java.lang.AssertionError: null
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at 
org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSRestartWithPreviousRunningContainers(TestDistributedShell.java:308)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-25 Thread Mit Desai (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mit Desai updated YARN-2606:

Attachment: YARN-2606.patch

Attaching the patch.

> Application History Server tries to access hdfs before doing secure login
> -
>
> Key: YARN-2606
> URL: https://issues.apache.org/jira/browse/YARN-2606
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Mit Desai
>Assignee: Mit Desai
> Attachments: YARN-2606.patch
>
>
> While testing the Application Timeline Server, the server would not come up 
> in a secure cluster, as it would keep trying to access hdfs without having 
> done the secure login. It would repeatedly try authenticating and finally hit 
> stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-25 Thread Mit Desai (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mit Desai updated YARN-2606:

Component/s: timelineserver

> Application History Server tries to access hdfs before doing secure login
> -
>
> Key: YARN-2606
> URL: https://issues.apache.org/jira/browse/YARN-2606
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.6.0
>Reporter: Mit Desai
>Assignee: Mit Desai
> Attachments: YARN-2606.patch
>
>
> While testing the Application Timeline Server, the server would not come up 
> in a secure cluster, as it would keep trying to access hdfs without having 
> done the secure login. It would repeatedly try authenticating and finally hit 
> stack overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2606) Application History Server tries to access hdfs before doing secure login

2014-09-25 Thread Mit Desai (JIRA)
Mit Desai created YARN-2606:
---

 Summary: Application History Server tries to access hdfs before 
doing secure login
 Key: YARN-2606
 URL: https://issues.apache.org/jira/browse/YARN-2606
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Mit Desai
Assignee: Mit Desai


While testing the Application Timeline Server, the server would not come up in 
a secure cluster, as it would keep trying to access hdfs without having done 
the secure login. It would repeatedly try authenticating and finally hit stack 
overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2179) Initial cache manager structure and context

2014-09-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148227#comment-14148227
 ] 

Hadoop QA commented on YARN-2179:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12671286/YARN-2179-trunk-v7.patch
  against trunk revision 6c54308.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5129//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5129//console

This message is automatically generated.

> Initial cache manager structure and context
> ---
>
> Key: YARN-2179
> URL: https://issues.apache.org/jira/browse/YARN-2179
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chris Trezzo
>Assignee: Chris Trezzo
> Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, 
> YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, 
> YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch
>
>
> Implement the initial shared cache manager structure and context. The 
> SCMContext will be used by a number of manager services (i.e. the backing 
> store and the cleaner service). The AppChecker is used to gather the 
> currently running applications on SCM startup (necessary for an SCM that is 
> backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2180) In-memory backing store for cache manager

2014-09-25 Thread Chris Trezzo (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Trezzo updated YARN-2180:
---
Attachment: YARN-2180-trunk-v4.patch

[~kasha] [~vinodkv] [~sjlee0]
Attached is v4. Here are some significant changes:
1. Bootstrapping and the old SCMContext logic have been moved to the serviceInit 
of the in-memory store.
2. The SCMStore interface is annotated properly with private and evolving.
3. The eviction logic for a shared cache resource has moved to the SCMStore 
implementation. The isResourceEvictable method has been added to the SCMStore 
interface to expose this (see the sketch after this list).
4. There is a new configuration class (InMemorySCMStoreConfiguration) that 
allows for InMemorySCMStore implementation-specific configuration.
5. Javadoc rework and method name refactoring so that items and their 
references in the shared cache are referred to as shared cache resources and 
shared cache resource references.
6. Various other refactors to address comments.
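
A purely illustrative sketch of item 3 above; the real SCMStore interface, its 
annotations, and the exact method signature in the patch may differ.
{code}
// Illustrative only: the store decides whether a cached resource may be
// evicted, so eviction policy lives with the store implementation.
import org.apache.hadoop.classification.InterfaceAudience.Private;
import org.apache.hadoop.classification.InterfaceStability.Evolving;

@Private
@Evolving
interface SCMStoreSketch {

  /**
   * @param resourceKey checksum/key identifying the shared cache resource
   * @return true if the cleaner service is allowed to evict the resource
   */
  boolean isResourceEvictable(String resourceKey);
}
{code}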

> In-memory backing store for cache manager
> -
>
> Key: YARN-2180
> URL: https://issues.apache.org/jira/browse/YARN-2180
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chris Trezzo
>Assignee: Chris Trezzo
> Attachments: YARN-2180-trunk-v1.patch, YARN-2180-trunk-v2.patch, 
> YARN-2180-trunk-v3.patch, YARN-2180-trunk-v4.patch
>
>
> Implement an in-memory backing store for the cache manager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-09-25 Thread Remus Rusanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Remus Rusanu updated YARN-2198:
---
Attachment: YARN-2198.trunk.10.patch

> Remove the need to run NodeManager as privileged account for Windows Secure 
> Container Executor
> --
>
> Key: YARN-2198
> URL: https://issues.apache.org/jira/browse/YARN-2198
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Remus Rusanu
>Assignee: Remus Rusanu
>  Labels: security, windows
> Attachments: YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, 
> YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, 
> YARN-2198.delta.7.patch, YARN-2198.separation.patch, 
> YARN-2198.trunk.10.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, 
> YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch
>
>
> YARN-1972 introduces a Secure Windows Container Executor. However, this 
> executor requires the process launching the container to be LocalSystem or 
> a member of the local Administrators group. Since the process in question 
> is the NodeManager, the requirement translates into the entire NM running as a 
> privileged account, a very large surface area to review and protect.
> This proposal is to move the privileged operations into a dedicated NT 
> service. The NM can run as a low privilege account and communicate with the 
> privileged NT service when it needs to launch a container. This would reduce 
> the surface exposed to the high privileges. 
> There has to exist a secure, authenticated and authorized channel of 
> communication between the NM and the privileged NT service. Possible 
> alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
> be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
> specific inter-process communication channel that satisfies all requirements 
> and is easy to deploy. The privileged NT service would register and listen on 
> an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
> with libwinutils which would host the LPC client code. The client would 
> connect to the LPC port (NtConnectPort) and send a message requesting a 
> container launch (NtRequestWaitReplyPort). LPC provides authentication and 
> the privileged NT service can use authorization API (AuthZ) to validate the 
> caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-09-25 Thread Remus Rusanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Remus Rusanu updated YARN-2198:
---
Attachment: (was: YARN-2198.trunk.10.patch)

> Remove the need to run NodeManager as privileged account for Windows Secure 
> Container Executor
> --
>
> Key: YARN-2198
> URL: https://issues.apache.org/jira/browse/YARN-2198
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Remus Rusanu
>Assignee: Remus Rusanu
>  Labels: security, windows
> Attachments: YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, 
> YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, 
> YARN-2198.delta.7.patch, YARN-2198.separation.patch, YARN-2198.trunk.4.patch, 
> YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, 
> YARN-2198.trunk.9.patch
>
>
> YARN-1972 introduces a Secure Windows Container Executor. However, this 
> executor requires the process launching the container to be LocalSystem or 
> a member of the local Administrators group. Since the process in question 
> is the NodeManager, the requirement translates into the entire NM running as a 
> privileged account, a very large surface area to review and protect.
> This proposal is to move the privileged operations into a dedicated NT 
> service. The NM can run as a low privilege account and communicate with the 
> privileged NT service when it needs to launch a container. This would reduce 
> the surface exposed to the high privileges. 
> There has to exist a secure, authenticated and authorized channel of 
> communication between the NM and the privileged NT service. Possible 
> alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
> be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
> specific inter-process communication channel that satisfies all requirements 
> and is easy to deploy. The privileged NT service would register and listen on 
> an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
> with libwinutils which would host the LPC client code. The client would 
> connect to the LPC port (NtConnectPort) and send a message requesting a 
> container launch (NtRequestWaitReplyPort). LPC provides authentication and 
> the privileged NT service can use authorization API (AuthZ) to validate the 
> caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2179) Initial cache manager structure and context

2014-09-25 Thread Chris Trezzo (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Trezzo updated YARN-2179:
---
Attachment: YARN-2179-trunk-v7.patch

Slight update. AppChecker and RemoteAppChecker are now services to allow for 
proper handling of YarnClient.

> Initial cache manager structure and context
> ---
>
> Key: YARN-2179
> URL: https://issues.apache.org/jira/browse/YARN-2179
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chris Trezzo
>Assignee: Chris Trezzo
> Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, 
> YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, 
> YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch
>
>
> Implement the initial shared cache manager structure and context. The 
> SCMContext will be used by a number of manager services (i.e. the backing 
> store and the cleaner service). The AppChecker is used to gather the 
> currently running applications on SCM startup (necessary for an SCM that is 
> backed by an in-memory store).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1963) Support priorities across applications within the same queue

2014-09-25 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-1963:
--
Attachment: YARN Application Priorities Design.pdf

Hi All

I am uploading an initial draft of the Application Priority design. Kindly review 
it and share your thoughts. I am planning to bring up the sub-JIRAs by the end 
of the week, after a round of review.

Thank you [~vinodkv] for the support.

> Support priorities across applications within the same queue 
> -
>
> Key: YARN-1963
> URL: https://issues.apache.org/jira/browse/YARN-1963
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: api, resourcemanager
>Reporter: Arun C Murthy
>Assignee: Sunil G
> Attachments: YARN Application Priorities Design.pdf
>
>
> It will be very useful to support priorities among applications within the 
> same queue, particularly in production scenarios. It allows for finer-grained 
> controls without having to force admins to create a multitude of queues, plus 
> allows existing applications to continue using existing queues which are 
> usually part of institutional memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-2009) Priority support for preemption in ProportionalCapacityPreemptionPolicy

2014-09-25 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G reassigned YARN-2009:
-

Assignee: Sunil G

> Priority support for preemption in ProportionalCapacityPreemptionPolicy
> ---
>
> Key: YARN-2009
> URL: https://issues.apache.org/jira/browse/YARN-2009
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Devaraj K
>Assignee: Sunil G
>
> While preempting containers based on the queue ideal assignment, we may need 
> to consider preempting the low priority application containers first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2601) RMs(HA RMS) can't enter active state

2014-09-25 Thread Aroop Maliakkal (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148155#comment-14148155
 ] 

Aroop Maliakkal commented on YARN-2601:
---

As a workaround, we deleted the entries in /rmstore/ZKRMStateRoot/RMAppRoot and 
restarted the RMs. Looks like that fixed the issue.

> RMs(HA RMS) can't enter active state
> 
>
> Key: YARN-2601
> URL: https://issues.apache.org/jira/browse/YARN-2601
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Cindy Li
>
> 2014-09-24 15:04:04,527 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing 
> event for application_1409048687352_0552 of type APP_REJECTED
> 2014-09-24 15:04:04,528 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
> application_1409048687352_0552 State change from NEW to FAILED
> 2014-09-24 15:04:04,528 DEBUG org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Dispatching the event 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.AppRemovedSchedulerEvent.EventType:
>  APP_REMOVED
> 2014-09-24 15:04:04,528 DEBUG org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Dispatching the event 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManagerEvent.EventType: 
> APP_COMPLETED
> 2014-09-24 15:04:04,528 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: RMAppManager 
> processing event for application_1409048687352_0552 of type APP_COMPLETED
> 2014-09-24 15:04:04,528 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=b_hiveperf0 
>  OPERATION=Application Finished - Failed TARGET=RMAppManager 
> RESULT=FAILURE  DESCRIPTION=App failed with state: FAILED   
> PERMISSIONS=hadoop tried to renew an expired token
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:366)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:6279)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:488)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:923)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2020)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2016)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1650)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2014)
> APPID=application_1409048687352_0552
> 2014-09-24 15:04:04,529 DEBUG org.apache.hadoop.service.AbstractService: 
> Service: RMActiveServices entered state STOPPED
> 
> 2014-09-24 15:04:04,538 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop   
> OPERATION=transitionToActiveTARGET=RMHAProtocolService  
> RESULT=FAILURE  DESCRIPTION=Exception transitioning to active   
> PERMISSIONS=Users [hadoop] are allowed
> 2014-09-24 15:04:04,539 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Exception handling the winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
> transitioning to Active mode
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:292)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116)
> ... 4 more
> Caused by: org.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.security.token.SecretManager$InvalidToken: hadoop tried to 
> renew an expired token
>  

[jira] [Commented] (YARN-2523) ResourceManager UI showing negative value for "Decommissioned Nodes" field

2014-09-25 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148135#comment-14148135
 ] 

Jian He commented on YARN-2523:
---

[~jlowe], would you like to take another look ? 

> ResourceManager UI showing negative value for "Decommissioned Nodes" field
> --
>
> Key: YARN-2523
> URL: https://issues.apache.org/jira/browse/YARN-2523
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, webapp
>Affects Versions: 3.0.0
>Reporter: Nishan Shetty
>Assignee: Rohith
> Attachments: YARN-2523.1.patch, YARN-2523.2.patch, YARN-2523.patch, 
> YARN-2523.patch
>
>
> 1. Decommission one NodeManager by configuring ip in excludehost file
> 2. Remove ip from excludehost file
> 3. Execute -refreshNodes command and restart Decommissioned NodeManager
> Observe that in RM UI negative value for "Decommissioned Nodes" field is shown



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields

2014-09-25 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148125#comment-14148125
 ] 

Jian He commented on YARN-668:
--

- containerManagerImpl, TestApplicationMasterService changes revert
- default value of AMRMTokenIdentifier keyId. {{private int keyId = 
Integer.MIN_VALUE;}}. Proto definition should have the same default
- following constructors may be not needed.
{code}
  public NMTokenIdentifier(NMTokenIdentifierProto proto) {
this.proto = proto;
  }
{code}
- why remove the following?
{code}
  // LogAggregationContext is set as null
Assert.assertNull(getLogAggregationContextFromContainerToken(rm1, nm1, 
null));
{code}
- remove the commented code
{code}
/*ByteArrayDataInput input = ByteStreams.newDataInput(
containerToken.getIdentifier().array());
ContainerTokenIdentifier containerTokenIdentifier =
new ContainerTokenIdentifier();
containerTokenIdentifier.readFields(input);*/
{code}

> TokenIdentifier serialization should consider Unknown fields
> 
>
> Key: YARN-668
> URL: https://issues.apache.org/jira/browse/YARN-668
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Siddharth Seth
>Assignee: Junping Du
>Priority: Blocker
> Attachments: YARN-668-demo.patch, YARN-668-v2.patch, 
> YARN-668-v3.patch, YARN-668-v4.patch, YARN-668-v5.patch, YARN-668-v6.patch, 
> YARN-668-v7.patch, YARN-668.patch
>
>
> This would allow changing of the TokenIdentifier between versions. The 
> current serialization is Writable. A simple way to achieve this would be to 
> have a Proto object as the payload for TokenIdentifiers, instead of 
> individual fields.
> TokenIdentifier continues to implement Writable to work with the RPC layer - 
> but the payload itself is serialized using PB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2605) [RM HA] Rest api endpoints doing redirect incorrectly

2014-09-25 Thread bc Wong (JIRA)
bc Wong created YARN-2605:
-

 Summary: [RM HA] Rest api endpoints doing redirect incorrectly
 Key: YARN-2605
 URL: https://issues.apache.org/jira/browse/YARN-2605
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: bc Wong


The standby RM's webui tries to do a redirect via meta-refresh. That is fine 
for pages designed to be viewed by web browsers, but the API endpoints 
shouldn't do that. Most programmatic HTTP clients do not do meta-refresh. I'd 
suggest HTTP 303, or returning a well-defined error message (json or xml) 
stating the standby status and a link to the active RM.

The standby RM is returning this today:
{noformat}
$ curl -i http://bcsec-1.ent.cloudera.com:8088/ws/v1/cluster/metrics
HTTP/1.1 200 OK
Cache-Control: no-cache
Expires: Thu, 25 Sep 2014 18:34:53 GMT
Date: Thu, 25 Sep 2014 18:34:53 GMT
Pragma: no-cache
Expires: Thu, 25 Sep 2014 18:34:53 GMT
Date: Thu, 25 Sep 2014 18:34:53 GMT
Pragma: no-cache
Content-Type: text/plain; charset=UTF-8
Refresh: 3; url=http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics
Content-Length: 117
Server: Jetty(6.1.26)

This is standby RM. Redirecting to the current active RM: 
http://bcsec-2.ent.cloudera.com:8088/ws/v1/cluster/metrics
{noformat}
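
To illustrate the client-side effect (a hedged sketch; the host name below is made 
up): a stock HttpURLConnection follows a 303 via the Location header automatically, 
but it treats the current 200-plus-Refresh reply as a plain successful response.
{code}
// Illustrative probe: shows why meta-refresh is invisible to programmatic
// clients while a 303 + Location would be followed transparently.
import java.net.HttpURLConnection;
import java.net.URL;

public class RmRedirectProbe {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://standby-rm.example.com:8088/ws/v1/cluster/metrics");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setInstanceFollowRedirects(true); // honours 3xx + Location only

    int status = conn.getResponseCode();
    String refresh = conn.getHeaderField("Refresh");
    if (status == 200 && refresh != null) {
      // Today's standby behaviour: looks like success, the body is just a notice.
      System.out.println("Standby RM; client must redirect itself: " + refresh);
    } else {
      System.out.println("HTTP " + status + " served by " + conn.getURL());
    }
    conn.disconnect();
  }
}
{code}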



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-25 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148098#comment-14148098
 ] 

Thomas Graves commented on YARN-1769:
-

We've been running this on a cluster for quite a while now, and it's showing great 
improvements in the time to get larger containers. I would like to put this in.

> CapacityScheduler:  Improve reservations
> 
>
> Key: YARN-1769
> URL: https://issues.apache.org/jira/browse/YARN-1769
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch
>
>
> Currently the CapacityScheduler uses reservations in order to handle requests 
> for large containers and the fact there might not currently be enough space 
> available on a single host.
> The current algorithm for reservations is to reserve as many containers as 
> currently required and then it will start to reserve more above that after a 
> certain number of re-reservations (currently biased against larger 
> containers). Any time it hits the limit on the number reserved, it stops looking 
> at any other nodes. This results in potentially missing nodes that have 
> enough space to fulfill the request.
> The other place for improvement is that currently reservations count against your 
> queue capacity. If you have reservations, you could hit the various limits, 
> which would then stop you from looking further at that node.
> The above 2 cases can cause an application requesting a larger container to 
> take a long time to get its resources.
> We could improve upon both of those by simply continuing to look at incoming 
> nodes to see if we could potentially swap out a reservation for an actual 
> allocation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2494) [YARN-796] Node label manager API and storage implementations

2014-09-25 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148095#comment-14148095
 ] 

Craig Welch commented on YARN-2494:
---

...other kinds of labels..., rather

> [YARN-796] Node label manager API and storage implementations
> -
>
> Key: YARN-2494
> URL: https://issues.apache.org/jira/browse/YARN-2494
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, 
> YARN-2494.patch, YARN-2494.patch, YARN-2494.patch
>
>
> This JIRA includes the APIs and storage implementations of the node label manager.
> NodeLabelManager is an abstract class used to manage labels of nodes in the 
> cluster; it has APIs to query/modify:
> - Nodes according to a given label
> - Labels according to a given hostname
> - Add/remove labels
> - Set labels of nodes in the cluster
> - Persist/recover changes of labels/labels-on-nodes to/from storage
> And it has two implementations to store modifications:
> - Memory based storage: it will not persist changes, so all labels will be 
> lost when the RM restarts
> - FileSystem based storage: it will persist/recover to/from a FileSystem (like 
> HDFS), and all labels and labels-on-nodes will be recovered upon RM restart
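
As an illustration of the query/modify surface described above, a hypothetical 
sketch of such an API (not the committed signatures) might look like:
{code}
// Hypothetical sketch of the manager API described above; these are not the
// committed method signatures, just an illustration of the query/modify surface.
import java.io.IOException;
import java.util.Map;
import java.util.Set;

interface NodeLabelManagerSketch {
  void addNodeLabels(Set<String> labels) throws IOException;
  void removeNodeLabels(Set<String> labels) throws IOException;
  void setLabelsOnNode(String host, Set<String> labels) throws IOException;
  Set<String> getNodesWithLabel(String label);
  Set<String> getLabelsOnNode(String host);
  Map<String, Set<String>> getLabelsOnAllNodes();
  // Memory-based store: no-op; FileSystem-based store: replay persisted changes.
  void recover() throws IOException;
}
{code}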



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2494) [YARN-796] Node label manager API and storage implementations

2014-09-25 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148094#comment-14148094
 ] 

Craig Welch commented on YARN-2494:
---

-re I suggest to change addLabels to addNodeLabels because we may support more 
different kinds of labels in the future, change removeLabels to 
removeExistingLabels, and leave NodeLabelsManager.existingLabels unchanged.

I thought we'd settled on just adding "Node" to the names which did not have 
it, so addNodeLabels, removeNodeLabels, etc. I don't think "Existing" and 
"Known" are particularly helpful; the concern was to distinguish these as 
"NodeLabel" operations, to leave room in the future for other kinds of nodes.

Also, with the refactor to a "store" type and dropping the configuration 
option, do we still have a way to specify something other than the hdfs store?

wrt leveldb - we ended up with hdfs for the ha case, I think anything we do 
should be distributed, not local - so zookeeper, hbase, etc.

> [YARN-796] Node label manager API and storage implementations
> -
>
> Key: YARN-2494
> URL: https://issues.apache.org/jira/browse/YARN-2494
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, 
> YARN-2494.patch, YARN-2494.patch, YARN-2494.patch
>
>
> This JIRA includes APIs and storage implementations of node label manager,
> NodeLabelManager is an abstract class used to manage labels of nodes in the 
> cluster, it has APIs to query/modify
> - Nodes according to given label
> - Labels according to given hostname
> - Add/remove labels
> - Set labels of nodes in the cluster
> - Persist/recover changes of labels/labels-on-nodes to/from storage
> And it has two implementations to store modifications
> - Memory based storage: It will not persist changes, so all labels will be 
> lost when RM restart
> - FileSystem based storage: It will persist/recover to/from FileSystem (like 
> HDFS), and all labels and labels-on-nodes will be recovered upon RM restart



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields

2014-09-25 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148074#comment-14148074
 ] 

Vinod Kumar Vavilapalli commented on YARN-668:
--

Quick look at the patch:
 - None of the records in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/yarn_security_token.proto
 are supposed to be exposed to users. We can move them to a server sub-folder and 
add an explicit comment in the proto file saying they are not meant to be consumed 
by users.
 - What about other tokens? We have the Client-to-AM token, RM delegation tokens, 
etc.

> TokenIdentifier serialization should consider Unknown fields
> 
>
> Key: YARN-668
> URL: https://issues.apache.org/jira/browse/YARN-668
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Siddharth Seth
>Assignee: Junping Du
>Priority: Blocker
> Attachments: YARN-668-demo.patch, YARN-668-v2.patch, 
> YARN-668-v3.patch, YARN-668-v4.patch, YARN-668-v5.patch, YARN-668-v6.patch, 
> YARN-668-v7.patch, YARN-668.patch
>
>
> This would allow changing of the TokenIdentifier between versions. The 
> current serialization is Writable. A simple way to achieve this would be to 
> have a Proto object as the payload for TokenIdentifiers, instead of 
> individual fields.
> TokenIdentifier continues to implement Writable to work with the RPC layer - 
> but the payload itself is serialized using PB.
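
A simplified sketch of that pattern, with hypothetical class and proto names, 
might look like:
{code}
// Simplified sketch of the pattern in the summary above: the identifier keeps a
// Writable surface for the RPC layer, but the payload is an opaque protobuf blob,
// so fields unknown to an older reader are preserved. Class and proto names are
// hypothetical.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

abstract class ProtoBackedTokenIdentifierSketch {

  // Serialized protobuf payload, e.g. a hypothetical FooTokenIdentifierProto.
  private byte[] protoBytes = new byte[0];

  public void write(DataOutput out) throws IOException {
    out.writeInt(protoBytes.length); // Writable-style framing for the RPC layer
    out.write(protoBytes);           // PB carries the fields, including unknown ones
  }

  public void readFields(DataInput in) throws IOException {
    protoBytes = new byte[in.readInt()];
    in.readFully(protoBytes);
    // A concrete subclass would parse lazily, e.g. FooTokenIdentifierProto.parseFrom(protoBytes).
  }
}
{code}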



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148059#comment-14148059
 ] 

Hadoop QA commented on YARN-1769:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12671255/YARN-1769.patch
  against trunk revision e0b1dc5.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5127//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5127//console

This message is automatically generated.

> CapacityScheduler:  Improve reservations
> 
>
> Key: YARN-1769
> URL: https://issues.apache.org/jira/browse/YARN-1769
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch
>
>
> Currently the CapacityScheduler uses reservations in order to handle requests 
> for large containers and the fact that there might not currently be enough space 
> available on a single host.
> The current algorithm for reservations is to reserve as many containers as 
> currently required, and then it will start to reserve more above that after a 
> certain number of re-reservations (currently biased against larger 
> containers). Any time it hits the limit of the number reserved, it stops looking 
> at any other nodes. This results in potentially missing nodes that have 
> enough space to fulfill the request.
> The other place for improvement is that reservations currently count against your 
> queue capacity. If you have reservations, you could hit the various limits, 
> which would then stop you from looking further at that node.
> The above 2 cases can cause an application requesting a larger container to 
> take a long time to get its resources.
> We could improve upon both of those by simply continuing to look at incoming 
> nodes to see if we could potentially swap out a reservation for an actual 
> allocation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport

2014-09-25 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2594:
---
Summary: Potential deadlock in RM when querying 
ApplicationResourceUsageReport  (was: ResourceManger sometimes become 
un-responsive)

> Potential deadlock in RM when querying ApplicationResourceUsageReport
> -
>
> Key: YARN-2594
> URL: https://issues.apache.org/jira/browse/YARN-2594
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Karam Singh
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-2594.patch
>
>
> ResourceManager sometimes becomes unresponsive:
> There was no exception in the ResourceManager log; it contains only the following 
> type of messages:
> {code}
> 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
> 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
> 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
> 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
> 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
> 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
> 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2594) ResourceManger sometimes become un-responsive

2014-09-25 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148049#comment-14148049
 ] 

Karthik Kambatla commented on YARN-2594:


Thanks for working on this, Wangda. 

As I see it, we could adopt the approach in the current patch. If we do so, we 
should avoid using the readLock in other get methods that access 
{{RMAppImpl#currentAttempt}}. {{RMAppAttemptImpl}} should handle the 
thread-safety of its fields.

Either in addition to or instead of the current approach, we really need to clean up 
{{SchedulerApplicationAttempt}}. Most of the methods there are synchronized, 
and many of them just call synchronized methods in {{AppSchedulingInfo}}. 
Needless to say, this is more involved and we need to be very careful. 

I am open to adopting the first approach in this JIRA and filing follow-up JIRAs 
to address the second approach suggested. 

PS: We really need to set up jcarder or something to identify most of these 
deadlock paths. 
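
To make the first approach above concrete, a minimal sketch (illustrative class 
and field names, not the actual RMAppImpl/RMAppAttemptImpl code) might look like:
{code}
// Illustration only (not the actual RMAppImpl/RMAppAttemptImpl code): simple
// getters on the app skip the app-level readLock; the current-attempt reference
// is published through a volatile field, and the attempt guards its own state.
import java.util.concurrent.locks.ReentrantReadWriteLock;

class AppSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private volatile AttemptSketch currentAttempt; // safe to read without the readLock

  AttemptSketch getCurrentAttempt() {
    return currentAttempt; // no app lock here, so this read cannot join a deadlock cycle
  }

  void setCurrentAttempt(AttemptSketch attempt) {
    lock.writeLock().lock(); // state transitions still take the write lock
    try {
      this.currentAttempt = attempt;
    } finally {
      lock.writeLock().unlock();
    }
  }
}

class AttemptSketch {
  private long memorySeconds; // guarded by "this"

  synchronized long getMemorySeconds() { // the attempt handles its own thread-safety
    return memorySeconds;
  }

  synchronized void addMemorySeconds(long delta) {
    memorySeconds += delta;
  }
}
{code}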

> ResourceManger sometimes become un-responsive
> -
>
> Key: YARN-2594
> URL: https://issues.apache.org/jira/browse/YARN-2594
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Karam Singh
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-2594.patch
>
>
> ResourceManager sometimes becomes unresponsive:
> There was no exception in the ResourceManager log; it contains only the following 
> type of messages:
> {code}
> 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
> 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
> 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
> 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
> 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
> 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
> 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2594) ResourceManger sometimes become un-responsive

2014-09-25 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148006#comment-14148006
 ] 

Karthik Kambatla commented on YARN-2594:


Taking a look at the issue and the patch.. 

> ResourceManger sometimes become un-responsive
> -
>
> Key: YARN-2594
> URL: https://issues.apache.org/jira/browse/YARN-2594
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Karam Singh
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-2594.patch
>
>
> ResourceManager sometimes becomes unresponsive:
> There was no exception in the ResourceManager log; it contains only the following 
> type of messages:
> {code}
> 2014-09-19 19:13:45,241 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000
> 2014-09-19 19:30:26,312 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000
> 2014-09-19 19:47:07,351 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000
> 2014-09-19 20:03:48,460 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000
> 2014-09-19 20:20:29,542 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000
> 2014-09-19 20:37:10,635 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000
> 2014-09-19 20:53:51,722 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2604) Scheduler should consider max-allocation-* in conjunction with the largest node

2014-09-25 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148003#comment-14148003
 ] 

Karthik Kambatla commented on YARN-2604:


bq. I guess it comes down to whether we really want to immediately fail an app 
if no node in the cluster at the time of submission has sufficient 
resources. If that's OK then we can do a simple change like the one you 
originally proposed.
This would be lightweight and quick, particularly for misconfiguration cases, 
and I think there is merit to doing this in addition to YARN-56. Let me reopen 
this and work on a patch. 
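
A rough sketch of such a submission-time check (illustrative names and 
simplified resource handling, not the actual scheduler code) might look like:
{code}
// Illustration only: a submission-time check that rejects an ask no node could
// ever satisfy, instead of accepting the app and letting it hang. Names and the
// flat int parameters are simplifications, not the actual scheduler code.
class MaxAllocationCheckSketch {

  static void validateSubmission(int requestedMB, int requestedVCores,
                                 int largestNodeMB, int largestNodeVCores,
                                 int maxAllocMB, int maxAllocVCores) {
    // Effective ceiling: the configured max-allocation-* capped by the largest node.
    int effectiveMaxMB = Math.min(maxAllocMB, largestNodeMB);
    int effectiveMaxVCores = Math.min(maxAllocVCores, largestNodeVCores);
    if (requestedMB > effectiveMaxMB || requestedVCores > effectiveMaxVCores) {
      // Fail fast so a misconfigured or oversized request is reported immediately.
      throw new IllegalArgumentException("Requested resource <" + requestedMB
          + " MB, " + requestedVCores + " vcores> exceeds what any node can offer");
    }
  }
}
{code}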

> Scheduler should consider max-allocation-* in conjunction with the largest 
> node
> ---
>
> Key: YARN-2604
> URL: https://issues.apache.org/jira/browse/YARN-2604
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.5.1
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>
> If the scheduler max-allocation-* values are larger than the resources 
> available on the largest node in the cluster, an application requesting 
> resources between the two values will be accepted by the scheduler but the 
> requests will never be satisfied. The app essentially hangs forever. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (YARN-2604) Scheduler should consider max-allocation-* in conjunction with the largest node

2014-09-25 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla reopened YARN-2604:


> Scheduler should consider max-allocation-* in conjunction with the largest 
> node
> ---
>
> Key: YARN-2604
> URL: https://issues.apache.org/jira/browse/YARN-2604
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.5.1
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>
> If the scheduler max-allocation-* values are larger than the resources 
> available on the largest node in the cluster, an application requesting 
> resources between the two values will be accepted by the scheduler but the 
> requests will never be satisfied. The app essentially hangs forever. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations

2014-09-25 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-1769:

Attachment: YARN-1769.patch

> CapacityScheduler:  Improve reservations
> 
>
> Key: YARN-1769
> URL: https://issues.apache.org/jira/browse/YARN-1769
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, 
> YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch
>
>
> Currently the CapacityScheduler uses reservations in order to handle requests 
> for large containers and the fact that there might not currently be enough space 
> available on a single host.
> The current algorithm for reservations is to reserve as many containers as 
> currently required, and then it will start to reserve more above that after a 
> certain number of re-reservations (currently biased against larger 
> containers). Any time it hits the limit of the number reserved, it stops looking 
> at any other nodes. This results in potentially missing nodes that have 
> enough space to fulfill the request.
> The other place for improvement is that reservations currently count against your 
> queue capacity. If you have reservations, you could hit the various limits, 
> which would then stop you from looking further at that node.
> The above 2 cases can cause an application requesting a larger container to 
> take a long time to get its resources.
> We could improve upon both of those by simply continuing to look at incoming 
> nodes to see if we could potentially swap out a reservation for an actual 
> allocation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2161) Fix build on macosx: YARN parts

2014-09-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147977#comment-14147977
 ] 

Hudson commented on YARN-2161:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1907 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1907/])
YARN-2161. Fix build on macosx: YARN parts (Binglin Chang via aw) (aw: rev 
034df0e2eb2824fb46a1e75b52d43d9914a04e56)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/config.h.cmake
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/configuration.c
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/CMakeLists.txt


> Fix build on macosx: YARN parts
> ---
>
> Key: YARN-2161
> URL: https://issues.apache.org/jira/browse/YARN-2161
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Binglin Chang
>Assignee: Binglin Chang
> Fix For: 2.6.0
>
> Attachments: YARN-2161.v1.patch, YARN-2161.v2.patch
>
>
> When compiling on macosx with -Pnative, there are several warnings and errors; 
> fixing these would help Hadoop developers working in a macosx env. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2596) TestWorkPreservingRMRestart fails with FairScheduler

2014-09-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147982#comment-14147982
 ] 

Hudson commented on YARN-2596:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1907 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1907/])
YARN-2596. TestWorkPreservingRMRestart fails with FairScheduler. (kasha) 
(kasha: rev 39c87344e16a08ab69e25345b3bce92aec92db47)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java
* hadoop-yarn-project/CHANGES.txt


> TestWorkPreservingRMRestart fails with FairScheduler
> 
>
> Key: YARN-2596
> URL: https://issues.apache.org/jira/browse/YARN-2596
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Junping Du
>Assignee: Karthik Kambatla
> Fix For: 2.6.0
>
> Attachments: yarn-2596-1.patch
>
>
> As seen in the test results from YARN-668, the test failure can be reproduced 
> locally without applying the new patch to trunk. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2546) REST API for application creation/submission is using strings for numeric & boolean values

2014-09-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147973#comment-14147973
 ] 

Hudson commented on YARN-2546:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1907 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1907/])
YARN-2546. Made REST API for application creation/submission use numeric and 
boolean types instead of the string of them. Contributed by Varun Vasudev. 
(zjshen: rev 72b0881ca641fa830c907823f674a5c5e39aa15a)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/JAXBContextResolver.java


> REST API for application creation/submission is using strings for numeric & 
> boolean values
> --
>
> Key: YARN-2546
> URL: https://issues.apache.org/jira/browse/YARN-2546
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: api
>Affects Versions: 2.5.1
>Reporter: Doug Haigh
>Assignee: Varun Vasudev
> Fix For: 2.6.0
>
> Attachments: apache-yarn-2546.0.patch, apache-yarn-2546.1.patch
>
>
> When YARN responds with or accepts JSON, numbers & booleans are being 
> represented as strings, which can cause parsing problems.
> Resource values look like 
> {
>   "application-id":"application_1404198295326_0001",
>   "maximum-resource-capability":
>{
>   "memory":"8192",
>   "vCores":"32"
>}
> }
> Instead of
> {
>   "application-id":"application_1404198295326_0001",
>   "maximum-resource-capability":
>{
>   "memory":8192,
>   "vCores":32
>}
> }
> When I POST to start a job, numeric values are represented as numbers:
>   "local-resources":
>   {
> "entry":
> [
>   {
> "key":"AppMaster.jar",
> "value":
> {
>   
> "resource":"hdfs://hdfs-namenode:9000/user/testuser/DistributedShell/demo-app/AppMaster.jar",
>   "type":"FILE",
>   "visibility":"APPLICATION",
>   "size": "43004",
>   "timestamp": "1405452071209"
> }
>   }
> ]
>   },
> Instead of
>   "local-resources":
>   {
> "entry":
> [
>   {
> "key":"AppMaster.jar",
> "value":
> {
>   
> "resource":"hdfs://hdfs-namenode:9000/user/testuser/DistributedShell/demo-app/AppMaster.jar",
>   "type":"FILE",
>   "visibility":"APPLICATION",
>   "size": 43004,
>   "timestamp": 1405452071209
> }
>   }
> ]
>   },
> Similarly, Boolean values are also represented as strings:
> "keep-containers-across-application-attempts":"false"
> Instead of 
> "keep-containers-across-application-attempts":false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2102) More generalized timeline ACLs

2014-09-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147980#comment-14147980
 ] 

Hudson commented on YARN-2102:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1907 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1907/])
YARN-2102. Added the concept of a Timeline Domain to handle read/write ACLs on 
Timeline service event data. Contributed by Zhijie Shen. (vinodkv: rev 
d78b452a4f413c6931a494c33df0666ce9b44973)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/TimelineClient.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TimelineStoreTestUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineACLsManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/timeline/TestTimelineRecords.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestTimelineWebServicesWithSSL.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestTimelineWebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineReader.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDomain.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineWriter.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDomains.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/webapp/TimelineWebServices.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestMemoryTimelineStore.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestLeveldbTimelineStore.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineACLsManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/MemoryTimelineStore.java


> More generalized timeline ACLs
> --
>
> Key: YARN-2102
> URL: https://issues.apache.org/jira/browse/YARN-2102
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Fix For: 2.6.0
>
> Attachments: GeneralizedTimelineACLs.pdf, YARN-2102.1.patch, 
> YARN-2102.2.patch, YARN-2102.3.patch, YARN-2102.5.patch, YARN-2102.6.patch, 
> YARN-2102.7.patch, YARN-2102.8.patch
>
>
> We need to differentiate the access controls of reading and writing 
> operations, and we need to think about cross-entity access control. For 
> example, if we are executing a workflow of MR jobs which writes the 
> timeline data of this workflow, we don't want other users to pollute the 
> timeline data of the workflow by putting something under it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2581) NMs need to find a way to get LogAggregationContext

2014-09-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147981#comment-14147981
 ] 

Hudson commented on YARN-2581:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1907 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1907/])
YARN-2581. Passed LogAggregationContext to NM via ContainerTokenIdentifier. 
Contributed by Xuan Gong. (zjshen: rev c86674a3a4d99aa56bb8ed3f6df51e3fef215eba)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestContainerAllocation.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/RMContainerTokenSecretManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/ContainerTokenIdentifier.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/proto/yarn_server_nodemanager_recovery.proto
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/event/LogHandlerAppStartedEvent.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationInitEvent.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManagerRecovery.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManager.java


> NMs need to find a way to get LogAggregationContext
> ---
>
> Key: YARN-2581
> URL: https://issues.apache.org/jira/browse/YARN-2581
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, resourcemanager
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Fix For: 2.6.0
>
> Attachments: YARN-2581.1.patch, YARN-2581.2.patch, YARN-2581.3.patch, 
> YARN-2581.4.patch
>
>
> After YARN-2569, we have a LogAggregationContext for the application in 
> ApplicationSubmissionContext. NMs need to find a way to get this information.
> We have this requirement: all containers in the same application should 
> honor the same LogAggregationContext.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2523) ResourceManager UI showing negative value for "Decommissioned Nodes" field

2014-09-25 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147955#comment-14147955
 ] 

Jian He commented on YARN-2523:
---

+1 for the latest patch.

> ResourceManager UI showing negative value for "Decommissioned Nodes" field
> --
>
> Key: YARN-2523
> URL: https://issues.apache.org/jira/browse/YARN-2523
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, webapp
>Affects Versions: 3.0.0
>Reporter: Nishan Shetty
>Assignee: Rohith
> Attachments: YARN-2523.1.patch, YARN-2523.2.patch, YARN-2523.patch, 
> YARN-2523.patch
>
>
> 1. Decommission one NodeManager by configuring its IP in the excludehost file
> 2. Remove the IP from the excludehost file
> 3. Execute the -refreshNodes command and restart the decommissioned NodeManager
> Observe that the RM UI shows a negative value for the "Decommissioned Nodes" field



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-09-25 Thread Remus Rusanu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147952#comment-14147952
 ] 

Remus Rusanu commented on YARN-2198:


The findbugs warning is 
{code}
Inconsistent synchronization of 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.delegationTokenSequenceNumber;
 locked 71% of time
{code}

> Remove the need to run NodeManager as privileged account for Windows Secure 
> Container Executor
> --
>
> Key: YARN-2198
> URL: https://issues.apache.org/jira/browse/YARN-2198
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Remus Rusanu
>Assignee: Remus Rusanu
>  Labels: security, windows
> Attachments: YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, 
> YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, 
> YARN-2198.delta.7.patch, YARN-2198.separation.patch, 
> YARN-2198.trunk.10.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, 
> YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch
>
>
> YARN-1972 introduces a Secure Windows Container Executor. However, this 
> executor requires the process launching the container to be LocalSystem or 
> a member of a local Administrators group. Since the process in question 
> is the NodeManager, the requirement translates to the entire NM running as a 
> privileged account, a very large surface area to review and protect.
> This proposal is to move the privileged operations into a dedicated NT 
> service. The NM can run as a low privilege account and communicate with the 
> privileged NT service when it needs to launch a container. This would reduce 
> the surface exposed to the high privileges. 
> There has to exist a secure, authenticated and authorized channel of 
> communication between the NM and the privileged NT service. Possible 
> alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
> be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
> specific inter-process communication channel that satisfies all requirements 
> and is easy to deploy. The privileged NT service would register and listen on 
> an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
> with libwinutils which would host the LPC client code. The client would 
> connect to the LPC port (NtConnectPort) and send a message requesting a 
> container launch (NtRequestWaitReplyPort). LPC provides authentication and 
> the privileged NT service can use authorization API (AuthZ) to validate the 
> caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-09-25 Thread Remus Rusanu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147946#comment-14147946
 ] 

Remus Rusanu commented on YARN-2198:


Core test failure is:
{code}
Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 120.538 sec <<< 
FAILURE! - in org.apache.hadoop.crypto.random.TestOsSecureRandom
testOsSecureRandomSetConf(org.apache.hadoop.crypto.random.TestOsSecureRandom)  
Time elapsed: 120.011 sec  <<< ERROR!
java.lang.Exception: test timed out after 12 milliseconds
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:220)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:306)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
at java.io.InputStreamReader.read(InputStreamReader.java:167)
at java.io.BufferedReader.fill(BufferedReader.java:136)
at java.io.BufferedReader.read1(BufferedReader.java:187)
at java.io.BufferedReader.read(BufferedReader.java:261)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:727)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:524)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:714)
at 
org.apache.hadoop.crypto.random.TestOsSecureRandom.testOsSecureRandomSetConf(TestOsSecureRandom.java:149)
{code}


> Remove the need to run NodeManager as privileged account for Windows Secure 
> Container Executor
> --
>
> Key: YARN-2198
> URL: https://issues.apache.org/jira/browse/YARN-2198
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Remus Rusanu
>Assignee: Remus Rusanu
>  Labels: security, windows
> Attachments: YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, 
> YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, 
> YARN-2198.delta.7.patch, YARN-2198.separation.patch, 
> YARN-2198.trunk.10.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, 
> YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch
>
>
> YARN-1972 introduces a Secure Windows Container Executor. However, this 
> executor requires the process launching the container to be LocalSystem or 
> a member of a local Administrators group. Since the process in question 
> is the NodeManager, the requirement translates to the entire NM running as a 
> privileged account, a very large surface area to review and protect.
> This proposal is to move the privileged operations into a dedicated NT 
> service. The NM can run as a low privilege account and communicate with the 
> privileged NT service when it needs to launch a container. This would reduce 
> the surface exposed to the high privileges. 
> There has to exist a secure, authenticated and authorized channel of 
> communication between the NM and the privileged NT service. Possible 
> alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
> be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
> specific inter-process communication channel that satisfies all requirements 
> and is easy to deploy. The privileged NT service would register and listen on 
> an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
> with libwinutils which would host the LPC client code. The client would 
> connect to the LPC port (NtConnectPort) and send a message requesting a 
> container launch (NtRequestWaitReplyPort). LPC provides authentication and 
> the privileged NT service can use authorization API (AuthZ) to validate the 
> caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2604) Scheduler should consider max-allocation-* in conjunction with the largest node

2014-09-25 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147942#comment-14147942
 ] 

Jason Lowe commented on YARN-2604:
--

Ah, I see, yes, they're a little bit different. They'd be the same if we want 
to consider a large node that is unhealthy/lost equivalent to an overloaded 
large node. In both cases we had the resources to satisfy the request at one 
point but no longer do.

I guess it comes down to whether we really want to immediately fail an app if 
no node in the cluster at the time of submission has sufficient resources. 
If that's OK then we can do a simple change like the one you originally 
proposed. If the nodes are there but unusable for some reason (e.g., 
unhealthy) and we want to wait around for a bit, then it gets closer to what 
YARN-56 is trying to do.

> Scheduler should consider max-allocation-* in conjunction with the largest 
> node
> ---
>
> Key: YARN-2604
> URL: https://issues.apache.org/jira/browse/YARN-2604
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.5.1
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>
> If the scheduler max-allocation-* values are larger than the resources 
> available on the largest node in the cluster, an application requesting 
> resources between the two values will be accepted by the scheduler but the 
> requests will never be satisfied. The app essentially hangs forever. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

