[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148834#comment-14148834 ] Zhijie Shen commented on YARN-2468:
---
The patch is generally good. Some minor comments and a few questions about the code.

1. Should the first method be @VisibleForTesting? And is the second one necessary?
{code}
-  private static String getNodeString(NodeId nodeId) {
+  public static String getNodeString(NodeId nodeId) {
     return nodeId.toString().replace(":", "_");
   }
-
+
+  public static String getNodeString(String nodeId) {
+    return nodeId.replace(":", "_");
+  }
{code}
2. Add a TODO saying the test will be fixed in a follow-up JIRA, in case we forget it?
{code}
+  @Ignore
   @Test
   public void testNoLogs() throws Exception {
{code}
3. Based on my understanding, uploadedFiles holds the candidate files to upload? If so, can we rename the variable and the related methods?
{code}
+  private Set<File> uploadedFiles = new HashSet<File>();
{code}
4. I assume this variable is meant to capture all the existing log files on HDFS, isn't it? If so, its computation seems problematic, because it doesn't exclude the files that should be excluded. And what's the effect on alreadyUploadedLogs?
{code}
+  private Set<String> allExistingFileMeta = new HashSet<String>();
{code}
{code}
  Iterable<String> mask =
      Iterables.filter(alreadyUploadedLogs, new Predicate<String>() {
        @Override
        public boolean apply(String next) {
          return currentExistingLogFiles.contains(next);
        }
      });
{code}
5. Make the old LogValue constructor delegate to the new one?
6. Does LogValue.write need to be changed at all?
7. It's recommended to close Closeable objects via IOUtils, but it seems AggregatedLogFormat already had this issue before this patch. Let's file a separate ticket for it.
{code}
+  if (this.fsDataOStream != null) {
+    this.fsDataOStream.close();
+  }
{code}
8. nodeId seems to be unused. No need to pass it into AppLogAggregatorImpl.
{code}
+  private final NodeId nodeId;
{code}
9. remoteNodeLogDirForApp doesn't affect remoteNodeTmpLogFileForApp, which depends only on remoteNodeLogFileForApp. Since remoteNodeLogFileForApp is determined at construction time, remoteNodeTmpLogFileForApp should stay final and be computed once in the constructor as well, and the constructor parameter remoteNodeLogDirForApp should be renamed back to remoteNodeLogFileForApp.
{code}
-  private final Path remoteNodeTmpLogFileForApp;
+  private Path remoteNodeTmpLogFileForApp;
{code}
{code}
-  private Path getRemoteNodeTmpLogFileForApp() {
+  private Path getRemoteNodeTmpLogFileForApp(Path remoteNodeLogDirForApp) {
     return new Path(remoteNodeLogFileForApp.getParent(),
-        (remoteNodeLogFileForApp.getName() + TMP_FILE_SUFFIX));
+        (remoteNodeLogFileForApp.getName() + LogAggregationUtils.TMP_FILE_SUFFIX));
   }
{code}
10. One typo:
{code}
  // if any of the previous uoloaded logs have been deleted,
{code}
11. One question: if a file fails to upload in LogValue.write(), uploadedFiles will not reflect the missing file; will it ever be uploaded again?
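For readers following comment 4: here is a plain-Java rendering of what the quoted Guava filter computes. This is an illustration only; the variable names (alreadyUploadedLogs, currentExistingLogFiles) come from the snippet above, and the surrounding class is invented.
{code}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class UploadedLogMask {
  // Keep only the previously uploaded log names that still exist on disk;
  // names deleted since the last upload cycle drop out of the mask.
  public static Set<String> maskDeleted(Set<String> alreadyUploadedLogs,
      Set<String> currentExistingLogFiles) {
    Set<String> mask = new HashSet<String>(alreadyUploadedLogs);
    mask.retainAll(currentExistingLogFiles);
    return mask;
  }

  public static void main(String[] args) {
    Set<String> uploaded = new HashSet<String>(Arrays.asList("syslog", "stderr"));
    Set<String> existing = new HashSet<String>(Arrays.asList("syslog"));
    System.out.println(maskDeleted(uploaded, existing)); // prints [syslog]
  }
}
{code}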
Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch Currently, when an application finishes, the NM starts log aggregation. But for long-running service (LRS) applications this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) are written into a single file, so the files could grow larger and larger.
[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148904#comment-14148904 ] Tsuyoshi OZAWA commented on YARN-1458: -- [~kkambatl], would you mind backporting the patch to branch-2.5? It looks like a critical problem since 2.2.0. FairScheduler: Zero weight can lead to livelock --- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor thread blocks when clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster for days to reproduce it. The output of the jstack command on the resourcemanager pid:
{code}
"ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
	- waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
	at java.lang.Thread.run(Thread.java:744)
……
"FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
	- locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
	- locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
	at
java.lang.Thread.run(Thread.java:744)
{code}
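For context on the livelock itself: a simplified, self-contained model of the fair-share search that the blocked FairSchedulerUpdateThread is running. This is not the real ComputeFairShares code; it only illustrates why an all-zero weight set keeps the search from ever converging while the scheduler lock is held.
{code}
// Simplified model of the ComputeFairShares search (illustrative only).
// Fair shares are found by growing a weight-to-resource ratio r until
// sum(weight_i * r) reaches cluster capacity. With all-zero weights the
// sum stays 0 for every r, so the real search can never make progress.
public class ZeroWeightLivelock {
  static long resourceUsedWithRatio(double r, double[] weights) {
    long used = 0;
    for (double w : weights) {
      used += (long) (w * r);
    }
    return used;
  }

  public static void main(String[] args) {
    double[] weights = {0.0, 0.0};   // every runnable app has weight 0
    long capacity = 100;
    double r = 1.0;
    for (int i = 0; i < 64; i++) {   // the real loop has no such bound
      if (resourceUsedWithRatio(r, weights) >= capacity) {
        System.out.println("converged at r=" + r);
        return;
      }
      r *= 2;
    }
    System.out.println("no progress after 64 doublings: livelock");
  }
}
{code}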
[jira] [Created] (YARN-2612) Some completed containers are not reported to NM
hex108 created YARN-2612: Summary: Some completed containers are not reported to NM Key: YARN-2612 URL: https://issues.apache.org/jira/browse/YARN-2612 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: hex108 Fix For: 2.6.0 In YARN-1372, the NM keeps reporting completed containers to the RM until it gets an ACK from the RM. If the AM does not call allocate (which means the AM never acks the RM), the RM will not ack the NM. We have observed these two cases when running the MapReduce 'pi' job: 1) The RM sends completed containers to the AM; after receiving them, the AM considers its work done and needs no more resources, so it does not call allocate again. 2) When the AM finishes, it cannot ack the RM because the AM itself has not finished yet. To solve this problem, we have two options: 1) When RMAppAttempt calls FinalTransition, the AppAttempt has finished, so the RM can send this AppAttempt's completed containers to the NM. 2) In FairScheduler#nodeUpdate, if a completed container sent by the NM has no corresponding RMContainer, the RM just acks it back to the NM. We prefer solution 2 because it is clearer and more concise; however, the RM might ack the same completed containers to the NM multiple times.
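A minimal sketch of what solution 2 could look like, using generic Java types rather than the real RMContainer/ContainerStatus classes; the attached patch is the authoritative version.
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NodeUpdateSketch {
  // stand-in for the scheduler's map of live containers
  private final Map<String, Object> liveContainers = new HashMap<String, Object>();

  // Returns the container ids to ack back to the NM in the heartbeat response.
  List<String> nodeUpdate(List<String> completedContainerIds) {
    List<String> acked = new ArrayList<String>();
    for (String id : completedContainerIds) {
      Object rmContainer = liveContainers.remove(id);
      if (rmContainer == null) {
        // The RM has already forgotten this container (e.g. the AM pulled
        // it); ack it anyway instead of just logging "Null container
        // completed", so the NM stops re-reporting it.
        acked.add(id);
        continue;
      }
      // ... normal completed-container handling would go here ...
      acked.add(id);
    }
    return acked;
  }

  public static void main(String[] args) {
    NodeUpdateSketch s = new NodeUpdateSketch();
    System.out.println(s.nodeUpdate(Arrays.asList("container_1", "container_2")));
  }
}
{code}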
[jira] [Updated] (YARN-2612) Some completed containers are not reported to NM
[ https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hex108 updated YARN-2612: --- Attachment: YARN-2612.patch
[jira] [Updated] (YARN-2612) Some completed containers are not reported to NM
[ https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hex108 updated YARN-2612: --- Description: In YARN-1372, the NM keeps reporting completed containers to the RM until it gets an ACK from the RM. If the AM does not call allocate (which means the AM never acks the RM), the RM will not ack the NM. We ([~chenchun]) have observed these two cases when running the MapReduce 'pi' job: 1) The RM sends completed containers to the AM; after receiving them, the AM considers its work done and needs no more resources, so it does not call allocate again. 2) When the AM finishes, it cannot ack the RM because the AM itself has not finished yet. To solve this problem, we have two options: 1) When RMAppAttempt calls FinalTransition, the AppAttempt has finished, so the RM can send this AppAttempt's completed containers to the NM. 2) In FairScheduler#nodeUpdate, if a completed container sent by the NM has no corresponding RMContainer, the RM just acks it back to the NM. We prefer solution 2 because it is clearer and more concise; however, the RM might ack the same completed containers to the NM multiple times.
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148951#comment-14148951 ] Steve Loughran commented on YARN-913: - The distributed shell test failure *appears* to be YARN-2607, i.e. independent of this patch; HADOOP-10668 covers the TestZKFailoverControllerStress intermittent failure. {{TestSecureRMRegistryOperations}} fails in the setup phase: the setup of the registry path in a {{zookee...@example.com.doAs()}} clause fails with a permissions error, as if the first test case had set up the path without write access. More diagnostics are needed here, such as the identity of the user making the call; maybe start the test with some diagnostics of the path.
{code}
testAnonReadAccess(org.apache.hadoop.yarn.registry.secure.TestSecureRMRegistryOperations) Time elapsed: 0.099 sec ERROR!
org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.fs.PathAccessDeniedException: `/registry/users / [ 1, 'world,'anyone 31, 'sasl,'zookee...@example.com 31, 'sasl,'zookee...@example.com 31, 'sasl,'zookee...@example.com ]': Permission denied: KeeperErrorCode = NoAuth for /registry/users
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:113)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
	at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:688)
	at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:672)
	at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
	at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:668)
	at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:453)
	at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:443)
	at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:423)
	at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:44)
	at org.apache.hadoop.yarn.registry.client.services.zk.CuratorService.zkMkPath(CuratorService.java:539)
	at org.apache.hadoop.yarn.registry.client.services.zk.CuratorService.maybeCreate(CuratorService.java:426)
	at org.apache.hadoop.yarn.registry.server.services.RegistryAdminService.createRootRegistryPaths(RegistryAdminService.java:201)
	at org.apache.hadoop.yarn.registry.server.services.RegistryAdminService.serviceStart(RegistryAdminService.java:187)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.yarn.registry.secure.TestSecureRMRegistryOperations$1.run(TestSecureRMRegistryOperations.java:106)
	at org.apache.hadoop.yarn.registry.secure.TestSecureRMRegistryOperations$1.run(TestSecureRMRegistryOperations.java:98)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1640)
	at org.apache.hadoop.yarn.registry.secure.TestSecureRMRegistryOperations.startRMRegistryOperations(TestSecureRMRegistryOperations.java:97)
	at org.apache.hadoop.yarn.registry.secure.TestSecureRMRegistryOperations.testAnonReadAccess(TestSecureRMRegistryOperations.java:130)
{code}
Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, YARN-913-010.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK- could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves.
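On the "more diagnostics needed" point above, here is a sketch of the kind of bootstrap-time dump that could identify the caller and the ACLs actually present on the registry path. It assumes an already-started CuratorFramework client; the class and method names are invented for illustration.
{code}
import java.util.List;

import org.apache.curator.framework.CuratorFramework;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.zookeeper.data.ACL;

public final class RegistryPathDiagnostics {
  private RegistryPathDiagnostics() {}

  // Log who we are and what ACLs the path carries before attempting the
  // privileged create, so NoAuth failures like the one above are explainable.
  public static void dump(CuratorFramework curator, String path)
      throws Exception {
    System.out.println("Current user: "
        + UserGroupInformation.getCurrentUser());
    List<ACL> acls = curator.getACL().forPath(path);
    for (ACL acl : acls) {
      System.out.println(path + " ACL: " + acl.getId()
          + " perms=" + acl.getPerms());
    }
  }
}
{code}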
[jira] [Updated] (YARN-2612) Some completed containers are not reported to NM
[ https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-2612: --- Attachment: YARN-2612.2.patch Also changed the Capacity and FIFO schedulers.
[jira] [Updated] (YARN-2612) Some completed containers are not reported to NM
[ https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-2612: --- Description: We are testing RM work-preserving restart and found the following logs when we ran a simple MapReduce 'pi' job. Some completed containers that had already been pulled by the AM were never reported back to the NM, so the NM continuously reported those completed containers after the AM had finished.
{code}
2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
{code}
In YARN-1372, the NM keeps reporting completed containers to the RM until it gets an ACK from the RM. If the AM does not call allocate (which means the AM never acks the RM), the RM will not ack the NM. We ([~chenchun]) have observed these two cases when running the MapReduce 'pi' job: 1) The RM sends completed containers to the AM; after receiving them, the AM considers its work done and needs no more resources, so it does not call allocate again. 2) When the AM finishes, it cannot ack the RM because the AM itself has not finished yet. To solve this problem, we have two options: 1) When RMAppAttempt calls FinalTransition, the AppAttempt has finished, so the RM can send this AppAttempt's completed containers to the NM. 2) In FairScheduler#nodeUpdate, if a completed container sent by the NM has no corresponding RMContainer, the RM just acks it back to the NM. We prefer solution 2 because it is clearer and more concise; however, the RM might ack the same completed containers to the NM multiple times.
[jira] [Commented] (YARN-2612) Some completed containers are not reported to NM
[ https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148983#comment-14148983 ] Hadoop QA commented on YARN-2612: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671413/YARN-2612.patch against trunk revision 662fc11. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5143//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5143//console This message is automatically generated.
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149002#comment-14149002 ] Remus Rusanu commented on YARN-2198: The findbugs issue is HADOOP-11122. Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, YARN-2198.separation.patch, YARN-2198.trunk.10.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch YARN-1972 introduces a Secure Windows Container Executor. However, this executor requires the process launching the container to be LocalSystem or a member of the local Administrators group. Since the process in question is the NodeManager, the requirement translates into running the entire NM as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low-privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed at high privilege. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC, etc. My proposal, though, is to use Windows LPC (Local Procedure Calls), a Windows platform-specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils, which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication, and the privileged NT service can use the authorization API (AuthZ) to validate the caller.
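To make the proposed NM-to-NT-service boundary concrete, here is a hypothetical sketch of what the Java side of such a JNI binding might look like. Every name in it is invented for illustration; the real interface is defined by the patch.
{code}
public final class PrivilegedServiceClient {
  static {
    // assumed JNI library name; the LPC client code would live in native code
    System.loadLibrary("winutils");
  }

  private PrivilegedServiceClient() {}

  // Ask the privileged NT service, over LPC, to launch a container process
  // on behalf of the given user; the return value is a hypothetical status
  // code. Method name and parameters are invented for this sketch.
  public static native int launchContainer(String user, String command,
      String workDir);
}
{code}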
[jira] [Updated] (YARN-2357) Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 changes to branch-2
[ https://issues.apache.org/jira/browse/YARN-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2357: --- Attachment: YARN-2357.3.patch .3.patch is the port of YARN-2198.trunk.10.patch Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 changes to branch-2 -- Key: YARN-2357 URL: https://issues.apache.org/jira/browse/YARN-2357 Project: Hadoop YARN Issue Type: Task Components: nodemanager Affects Versions: 2.4.0 Reporter: Remus Rusanu Assignee: Remus Rusanu Priority: Critical Labels: security, windows Attachments: YARN-2357.1.patch, YARN-2357.2.patch, YARN-2357.3.patch As title says. Once YARN-1063, YARN-1972 and YARN-2198 are committed to trunk, they need to be backported to branch-2.
[jira] [Updated] (YARN-2612) Some completed containers are not reported to NM
[ https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-2612: --- Attachment: (was: YARN-2612.2.patch)
[jira] [Updated] (YARN-2612) Some completed containers are not reported to NM
[ https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-2612: --- Attachment: (was: YARN-2612.patch)
[jira] [Commented] (YARN-2612) Some completed containers are not reported to NM
[ https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149006#comment-14149006 ] Hadoop QA commented on YARN-2612: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671420/YARN-2612.2.patch against trunk revision 662fc11. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5144//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5144//console This message is automatically generated.
[jira] [Updated] (YARN-2612) Some completed containers are not reported to NM
[ https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-2612: --- Description: We are testing RM work-preserving restart and found the following logs when we ran a simple MapReduce 'pi' job. Some completed containers that had already been pulled by the AM were never reported back to the NM, so the NM continuously reported those completed containers after the AM had finished.
{code}
2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
{code}
In YARN-1372, the NM keeps reporting completed containers to the RM until it gets an ACK from the RM. If the AM does not call allocate (which means the AM never acks the RM), the RM will not ack the NM. We ([~chenchun]) have observed these two cases when running the MapReduce 'pi' job: 1) The RM sends completed containers to the AM; after receiving them, the AM considers its work done and needs no more resources, so it does not call allocate again. 2) When the AM finishes, it cannot ack the RM because the AM itself has not finished yet. We think that when RMAppAttempt calls BaseFinalTransition, the AppAttempt has finished, so the RM can send this AppAttempt's completed containers to the NM.
[jira] [Commented] (YARN-2610) Hamlet doesn't close table tags
[ https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149029#comment-14149029 ] Devaraj K commented on YARN-2610: - It seems this has been done purposefully. [~rchiang], please have a look at the discussion in MAPREDUCE-2993. Hamlet doesn't close table tags --- Key: YARN-2610 URL: https://issues.apache.org/jira/browse/YARN-2610 Project: Hadoop YARN Issue Type: Bug Reporter: Ray Chiang Assignee: Ray Chiang Labels: supportability Attachments: YARN-2610-01.patch, YARN-2610-02.patch Revisiting a subset of MAPREDUCE-2993. The th, td, thead, tfoot, and tr tags are not configured to close properly in Hamlet. While this is allowed in HTML 4.01, missing closing table tags tend to wreak havoc with a lot of HTML processors (although not usually browsers).
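A small demonstration of the symptom described above: markup with unclosed table tags is tolerated by browsers but rejected outright by a strict XML-based processor. This is a generic illustration, not Hamlet code.
{code}
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

public class UnclosedTableTags {
  static void tryParse(String label, String html) {
    try {
      DocumentBuilder b =
          DocumentBuilderFactory.newInstance().newDocumentBuilder();
      b.parse(new ByteArrayInputStream(html.getBytes("UTF-8")));
      System.out.println(label + ": parsed OK");
    } catch (Exception e) {
      System.out.println(label + ": " + e.getMessage());
    }
  }

  public static void main(String[] args) {
    tryParse("unclosed", "<table><tr><td>cell</table>");       // fails
    tryParse("closed", "<table><tr><td>cell</td></tr></table>"); // parses
  }
}
{code}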
[jira] [Updated] (YARN-2357) Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 changes to branch-2
[ https://issues.apache.org/jira/browse/YARN-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2357: --- Attachment: (was: YARN-2357.3.patch)
[jira] [Updated] (YARN-2357) Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 changes to branch-2
[ https://issues.apache.org/jira/browse/YARN-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2357: --- Attachment: YARN-2357.3.patch
[jira] [Issue Comment Deleted] (YARN-2610) Hamlet doesn't close table tags
[ https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-2610: Comment: was deleted (was: It seems this has been done purposefully. [~rchiang] Please have look into the discussion in jira MAPREDUCE-2993. )
[jira] [Commented] (YARN-2608) FairScheduler: Potential deadlocks in loading alloc files and clock access
[ https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149053#comment-14149053 ] Hudson commented on YARN-2608: -- FAILURE: Integrated in Hadoop-Yarn-trunk #692 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/692/]) YARN-2608. FairScheduler: Potential deadlocks in loading alloc files and clock access. (Wei Yan via kasha) (kasha: rev f4357240a6f81065d91d5f443ed8fc8cd2a14a8f) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java FairScheduler: Potential deadlocks in loading alloc files and clock access -- Key: YARN-2608 URL: https://issues.apache.org/jira/browse/YARN-2608 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Fix For: 2.6.0 Attachments: YARN-2608-1.patch, YARN-2608-2.patch, YARN-2608-3.patch Two potential deadlocks exist inside the FairScheduler. 1. AllocationFileLoaderService reloads the queue configuration, which calls the FairScheduler.AllocationReloadListener.onReload() function and requires the *FairScheduler* lock:
{code}
public void onReload(AllocationConfiguration queueInfo) {
  synchronized (FairScheduler.this) {
    ...
  }
}
{code}
After that, it requires the *QueueManager queues* lock:
{code}
private FSQueue getQueue(String name, boolean create, FSQueueType queueType) {
  name = ensureRootPrefix(name);
  synchronized (queues) {
    ...
  }
}
{code}
Another thread, FairScheduler.assignToQueue, may also need to create a new queue when a new job is submitted. That thread holds the *QueueManager queues* lock first, and then wants the *FairScheduler* lock because it calls FairScheduler.getClock() when creating a new FSLeafQueue. Deadlock may happen here. 2. The AllocationFileLoaderService holds the *AllocationFileLoaderService* lock first and then waits for the *FairScheduler* lock. Another thread (like AdminService.refreshQueues) may call FairScheduler's reinitialize function, which holds the *FairScheduler* lock first and then waits for the *AllocationFileLoaderService* lock. Deadlock may happen here.
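A minimal two-lock model of deadlock case 1, in generic Java rather than the real scheduler classes: each thread takes the two locks in the opposite order, which is the classic deadlock recipe; taking them in one consistent order, or not holding both at once, removes the hazard.
{code}
public class LockOrderDeadlock {
  private final Object schedulerLock = new Object();
  private final Object queuesLock = new Object();

  void onReload() {                  // mimics AllocationReloadListener.onReload()
    synchronized (schedulerLock) {
      pause();                       // widen the race window for the demo
      synchronized (queuesLock) { /* rebuild queues */ }
    }
  }

  void assignToQueue() {             // mimics FairScheduler.assignToQueue()
    synchronized (queuesLock) {
      pause();
      synchronized (schedulerLock) { /* getClock(), create FSLeafQueue */ }
    }
  }

  private static void pause() {
    try { Thread.sleep(100); } catch (InterruptedException ignored) { }
  }

  // Running this will usually hang; the hang is the deadlock being modeled.
  public static void main(String[] args) {
    LockOrderDeadlock d = new LockOrderDeadlock();
    new Thread(d::onReload).start();
    new Thread(d::assignToQueue).start();
  }
}
{code}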
[jira] [Commented] (YARN-2523) ResourceManager UI showing negative value for Decommissioned Nodes field
[ https://issues.apache.org/jira/browse/YARN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149051#comment-14149051 ] Hudson commented on YARN-2523: -- FAILURE: Integrated in Hadoop-Yarn-trunk #692 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/692/]) YARN-2523. ResourceManager UI showing negative value for Decommissioned Nodes field. Contributed by Rohith (jlowe: rev 8269bfa613999f71767de3c0369817b58cfe1416) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java ResourceManager UI showing negative value for Decommissioned Nodes field -- Key: YARN-2523 URL: https://issues.apache.org/jira/browse/YARN-2523 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 3.0.0 Reporter: Nishan Shetty Assignee: Rohith Fix For: 2.6.0 Attachments: YARN-2523.1.patch, YARN-2523.2.patch, YARN-2523.patch, YARN-2523.patch 1. Decommission one NodeManager by configuring its IP in the excludehost file 2. Remove the IP from the excludehost file 3. Execute the -refreshNodes command and restart the decommissioned NodeManager Observe that the RM UI shows a negative value for the Decommissioned Nodes field.
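One generic way to keep such a counter from going negative is to decrement it only when the node was actually recorded as decommissioned. The sketch below is illustrative only; it is not the committed patch, which per the commit above touches NodesListManager and RMNodeImpl.
{code}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class DecommissionMetric {
  private final AtomicInteger decommissionedNMs = new AtomicInteger();
  private final Set<String> decommissionedHosts =
      ConcurrentHashMap.newKeySet();

  public void nodeDecommissioned(String host) {
    // count each host at most once
    if (decommissionedHosts.add(host)) {
      decommissionedNMs.incrementAndGet();
    }
  }

  public void nodeRejoined(String host) {
    // guarded decrement: a refreshNodes + restart sequence cannot
    // drive the counter below zero
    if (decommissionedHosts.remove(host)) {
      decommissionedNMs.decrementAndGet();
    }
  }

  public int decommissionedCount() {
    return decommissionedNMs.get();
  }
}
{code}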
[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1879: - Attachment: YARN-1879.15.patch Updated. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol --- Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch
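For readers unfamiliar with these annotations: Hadoop's retry framework provides org.apache.hadoop.io.retry.Idempotent and org.apache.hadoop.io.retry.AtMostOnce for marking protocol methods so the RPC retry policy knows whether a call is safe to replay. The interface below only shows the shape of such a change; which annotation the patch applies to each ApplicationMasterProtocol method is determined by the patch itself.
{code}
import java.io.IOException;

import org.apache.hadoop.io.retry.AtMostOnce;
import org.apache.hadoop.io.retry.Idempotent;

// Illustrative interface; not ApplicationMasterProtocol itself.
public interface ExampleProtocol {
  @Idempotent
  String readSomething(String key) throws IOException;  // safe to retry freely

  @AtMostOnce
  void applySomething(String key) throws IOException;   // retried at most once
}
{code}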
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149114#comment-14149114 ] Hadoop QA commented on YARN-1879: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671436/YARN-1879.15.patch against trunk revision 662fc11. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5145//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5145//console This message is automatically generated.
[jira] [Updated] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-913: Attachment: YARN-913-011.patch Print out detailed diags (inc ACLs) on permissions problems during registry bootstrap Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, YARN-913-010.patch, YARN-913-011.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2608) FairScheduler: Potential deadlocks in loading alloc files and clock access
[ https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149194#comment-14149194 ] Hudson commented on YARN-2608: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1883 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1883/]) YARN-2608. FairScheduler: Potential deadlocks in loading alloc files and clock access. (Wei Yan via kasha) (kasha: rev f4357240a6f81065d91d5f443ed8fc8cd2a14a8f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/CHANGES.txt FairScheduler: Potential deadlocks in loading alloc files and clock access -- Key: YARN-2608 URL: https://issues.apache.org/jira/browse/YARN-2608 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Fix For: 2.6.0 Attachments: YARN-2608-1.patch, YARN-2608-2.patch, YARN-2608-3.patch Two potential deadlocks exist inside the FairScheduler. 1. AllocationFileLoaderService reloads the queue configuration, which calls the FairScheduler.AllocationReloadListener.onReload() function and requires the *FairScheduler's lock*: {code} public void onReload(AllocationConfiguration queueInfo) { synchronized (FairScheduler.this) { } } {code} after that, it requires the *QueueManager's queues lock*: {code} private FSQueue getQueue(String name, boolean create, FSQueueType queueType) { name = ensureRootPrefix(name); synchronized (queues) { } } {code} Another thread, FairScheduler.assignToQueue, may also need to create a new queue when a new job is submitted. This thread holds the *QueueManager's queues lock* first, and then tries to take the *FairScheduler's lock*, as it needs to call the FairScheduler.getClock() function when creating a new FSLeafQueue. Deadlock may happen here. 2. The AllocationFileLoaderService holds the *AllocationFileLoaderService's lock* first, and then waits for the *FairScheduler's lock*. Another thread (like AdminService.refreshQueues) may call FairScheduler's reinitialize function, which holds the *FairScheduler's lock* first, and then waits for the *AllocationFileLoaderService's lock*. Deadlock may happen here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
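To make the lock-order inversion concrete, here is a minimal, self-contained Java sketch; the class and field names are invented stand-ins for the FairScheduler monitor and QueueManager's queues map, not the actual YARN code:
{code}
// Hypothetical stand-ins: "scheduler" plays the role of the FairScheduler
// monitor, "queues" the role of QueueManager's queues map monitor.
public class LockOrderInversionDemo {
  private final Object scheduler = new Object();
  private final Object queues = new Object();

  // Reload path: scheduler lock first, then queues lock.
  void onReload() {
    synchronized (scheduler) {
      synchronized (queues) {
        // rebuild the queue configuration
      }
    }
  }

  // Submission path: queues lock first, then scheduler lock.
  void assignToQueue() {
    synchronized (queues) {
      synchronized (scheduler) {
        // e.g. read the scheduler clock while creating a leaf queue
      }
    }
  }
}
{code}
If one thread enters onReload() while another enters assignToQueue(), each can end up holding one monitor while waiting on the other. The standard cures are to acquire the locks in a single global order on both paths, or to shrink the inner critical section (for example, pass the clock value in rather than locking the scheduler to read it).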
[jira] [Commented] (YARN-2523) ResourceManager UI showing negative value for Decommissioned Nodes field
[ https://issues.apache.org/jira/browse/YARN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149192#comment-14149192 ] Hudson commented on YARN-2523: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1883 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1883/]) YARN-2523. ResourceManager UI showing negative value for Decommissioned Nodes field. Contributed by Rohith (jlowe: rev 8269bfa613999f71767de3c0369817b58cfe1416) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java ResourceManager UI showing negative value for Decommissioned Nodes field -- Key: YARN-2523 URL: https://issues.apache.org/jira/browse/YARN-2523 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 3.0.0 Reporter: Nishan Shetty Assignee: Rohith Fix For: 2.6.0 Attachments: YARN-2523.1.patch, YARN-2523.2.patch, YARN-2523.patch, YARN-2523.patch 1. Decommission one NodeManager by configuring its IP in the excludehost file 2. Remove the IP from the excludehost file 3. Execute the -refreshNodes command and restart the decommissioned NodeManager Observe that the RM UI shows a negative value for the Decommissioned Nodes field -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2523) ResourceManager UI showing negative value for Decommissioned Nodes field
[ https://issues.apache.org/jira/browse/YARN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149271#comment-14149271 ] Hudson commented on YARN-2523: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1908 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1908/]) YARN-2523. ResourceManager UI showing negative value for Decommissioned Nodes field. Contributed by Rohith (jlowe: rev 8269bfa613999f71767de3c0369817b58cfe1416) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java ResourceManager UI showing negative value for Decommissioned Nodes field -- Key: YARN-2523 URL: https://issues.apache.org/jira/browse/YARN-2523 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 3.0.0 Reporter: Nishan Shetty Assignee: Rohith Fix For: 2.6.0 Attachments: YARN-2523.1.patch, YARN-2523.2.patch, YARN-2523.patch, YARN-2523.patch 1. Decommission one NodeManager by configuring its IP in the excludehost file 2. Remove the IP from the excludehost file 3. Execute the -refreshNodes command and restart the decommissioned NodeManager Observe that the RM UI shows a negative value for the Decommissioned Nodes field -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2608) FairScheduler: Potential deadlocks in loading alloc files and clock access
[ https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149273#comment-14149273 ] Hudson commented on YARN-2608: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1908 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1908/]) YARN-2608. FairScheduler: Potential deadlocks in loading alloc files and clock access. (Wei Yan via kasha) (kasha: rev f4357240a6f81065d91d5f443ed8fc8cd2a14a8f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/CHANGES.txt FairScheduler: Potential deadlocks in loading alloc files and clock access -- Key: YARN-2608 URL: https://issues.apache.org/jira/browse/YARN-2608 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Fix For: 2.6.0 Attachments: YARN-2608-1.patch, YARN-2608-2.patch, YARN-2608-3.patch Two potential deadlocks exist inside the FairScheduler. 1. AllocationFileLoaderService reloads the queue configuration, which calls the FairScheduler.AllocationReloadListener.onReload() function and requires the *FairScheduler's lock*: {code} public void onReload(AllocationConfiguration queueInfo) { synchronized (FairScheduler.this) { } } {code} after that, it requires the *QueueManager's queues lock*: {code} private FSQueue getQueue(String name, boolean create, FSQueueType queueType) { name = ensureRootPrefix(name); synchronized (queues) { } } {code} Another thread, FairScheduler.assignToQueue, may also need to create a new queue when a new job is submitted. This thread holds the *QueueManager's queues lock* first, and then tries to take the *FairScheduler's lock*, as it needs to call the FairScheduler.getClock() function when creating a new FSLeafQueue. Deadlock may happen here. 2. The AllocationFileLoaderService holds the *AllocationFileLoaderService's lock* first, and then waits for the *FairScheduler's lock*. Another thread (like AdminService.refreshQueues) may call FairScheduler's reinitialize function, which holds the *FairScheduler's lock* first, and then waits for the *AllocationFileLoaderService's lock*. Deadlock may happen here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149310#comment-14149310 ] Hadoop QA commented on YARN-913: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671457/YARN-913-011.patch against trunk revision 662fc11. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 36 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1266 javac compiler warnings (more than the trunk's current 1265 warnings). {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 2 warning messages. See https://builds.apache.org/job/PreCommit-YARN-Build/5146//artifact/PreCommit-HADOOP-Build-patchprocess/diffJavadocWarnings.txt for details. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell org.apache.hadoop.yarn.registry.secure.TestSecureRMRegistryOperations {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5146//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5146//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-registry.html Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5146//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-common.html Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5146//artifact/PreCommit-HADOOP-Build-patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5146//console This message is automatically generated. Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, YARN-913-010.patch, YARN-913-011.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. 
Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149565#comment-14149565 ] Thomas Graves commented on YARN-1769: - Thanks for the review Jason. I'll update the patch and remove some of the logging or make it truly debug. CapacityScheduler: Improve reservations Key: YARN-1769 URL: https://issues.apache.org/jira/browse/YARN-1769 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch Currently the CapacityScheduler uses reservations in order to handle requests for large containers and the fact that there might not currently be enough space available on a single host. The current algorithm for reservations is to reserve as many containers as currently required, and then it will start to reserve more above that after a certain number of re-reservations (currently biased against larger containers). Any time it hits the limit on the number reserved, it stops looking at any other nodes. This results in potentially missing nodes that have enough space to fulfill the request. The other place for improvement is that reservations currently count against your queue capacity. If you have reservations, you could hit the various limits, which would then stop you from looking further at that node. The above 2 cases can cause an application requesting a larger container to take a long time to get its resources. We could improve upon both of those by simply continuing to look at incoming nodes to see if we could potentially swap out a reservation for an actual allocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
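To illustrate the proposed "swap a reservation for an allocation" idea, here is a self-contained sketch; all types and method names here are invented for illustration and are far simpler than the real CapacityScheduler classes:
{code}
// Illustrative only: when a heartbeating node has enough free space for a
// request that currently only holds a reservation elsewhere, allocate here
// and release the old reservation instead of skipping the node.
class ReservationSwapSketch {
  interface Node { int availableMb(); void allocate(int memoryMb); }
  interface Reservation { void unreserve(); }

  private Reservation current; // reservation held on some other node, may be null

  void onNodeHeartbeat(Node node, int requestedMb) {
    if (node.availableMb() >= requestedMb) {
      if (current != null) {
        current.unreserve(); // free the capacity reserved elsewhere
        current = null;
      }
      node.allocate(requestedMb); // satisfy the request immediately
    }
  }
}
{code}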
[jira] [Commented] (YARN-2611) Fix jenkins findbugs warning and test case failures for trunk merge patch
[ https://issues.apache.org/jira/browse/YARN-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149580#comment-14149580 ] Subru Krishnan commented on YARN-2611: -- With the fixes included in the previously attached patch, YARN-1051 got an all [clear | https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=14148765] from Jenkins. The only test case that fails, _TestMRCJCFileInputFormat_, is independent of this patch and is tracked in MAPREDUCE-6094. Fix jenkins findbugs warning and test case failures for trunk merge patch - Key: YARN-2611 URL: https://issues.apache.org/jira/browse/YARN-2611 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager, scheduler Reporter: Subru Krishnan Assignee: Subru Krishnan Attachments: YARN-2611.patch This JIRA is to fix jenkins findbugs warnings and test case failures for trunk merge patch as [reported | https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=14148506] in YARN-1051 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abin Shahab updated YARN-1964: -- Attachment: YARN-1964.patch Patch that scopes down the YARN integration as mentioned above. Create Docker analog of the LinuxContainerExecutor in YARN -- Key: YARN-1964 URL: https://issues.apache.org/jira/browse/YARN-1964 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.2.0 Reporter: Arun C Murthy Assignee: Abin Shahab Attachments: YARN-1964.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch Docker (https://www.docker.io/) is, increasingly, a very popular container technology. In context of YARN, the support for Docker will provide a very elegant solution to allow applications to *package* their software into a Docker container (entire Linux file system incl. custom versions of perl, python etc.) and use it as a blueprint to launch all their YARN containers with requisite software environment. This provides both consistency (all YARN containers will have the same software environment) and isolation (no interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149664#comment-14149664 ] Hadoop QA commented on YARN-1964: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671475/YARN-1964.patch against trunk revision a6049aa. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5147//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5147//console This message is automatically generated. Create Docker analog of the LinuxContainerExecutor in YARN -- Key: YARN-1964 URL: https://issues.apache.org/jira/browse/YARN-1964 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.2.0 Reporter: Arun C Murthy Assignee: Abin Shahab Attachments: YARN-1964.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch Docker (https://www.docker.io/) is, increasingly, a very popular container technology. In context of YARN, the support for Docker will provide a very elegant solution to allow applications to *package* their software into a Docker container (entire Linux file system incl. custom versions of perl, python etc.) and use it as a blueprint to launch all their YARN containers with requisite software environment. This provides both consistency (all YARN containers will have the same software environment) and isolation (no interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2611) Fix jenkins findbugs warning and test case failures for trunk merge patch
[ https://issues.apache.org/jira/browse/YARN-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149672#comment-14149672 ] Carlo Curino commented on YARN-2611: Minor: the equals() method for ReservationInterval could be simplified Other than that the patch looks good. Fix jenkins findbugs warning and test case failures for trunk merge patch - Key: YARN-2611 URL: https://issues.apache.org/jira/browse/YARN-2611 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager, scheduler Reporter: Subru Krishnan Assignee: Subru Krishnan Attachments: YARN-2611.patch This JIRA is to fix jenkins findbugs warnings and test case failures for trunk merge patch as [reported | https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=14148506] in YARN-1051 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
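For reference, a simplified equals() along those lines might look like the sketch below. It assumes ReservationInterval is a plain pair of long fields (startTime, endTime); it is illustrative rather than the committed code, and hashCode must stay consistent with whatever fields equals compares:
{code}
@Override
public boolean equals(Object obj) {
  if (this == obj) {
    return true;
  }
  if (!(obj instanceof ReservationInterval)) {
    return false;
  }
  ReservationInterval other = (ReservationInterval) obj;
  // Two intervals are equal iff both endpoints match.
  return startTime == other.startTime && endTime == other.endTime;
}
{code}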
[jira] [Updated] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization
[ https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-2387: Target Version/s: 2.6.0 Resource Manager crashes with NPE due to lack of synchronization Key: YARN-2387 URL: https://issues.apache.org/jira/browse/YARN-2387 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.5.0 Reporter: Mit Desai Assignee: Mit Desai We recently came across a 0.23 RM crashing with an NPE. Here is the stacktrace for it. {noformat} 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34) at org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339) at java.lang.Thread.run(Thread.java:722) 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. {noformat} On investigating the issue, we found that ContainerStatusPBImpl has methods that are called by different threads and are not synchronized. The 2.x code looks the same. We need to make these methods synchronized so that we do not encounter this problem in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization
[ https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2387: --- Priority: Blocker (was: Major) Resource Manager crashes with NPE due to lack of synchronization Key: YARN-2387 URL: https://issues.apache.org/jira/browse/YARN-2387 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.5.0 Reporter: Mit Desai Assignee: Mit Desai Priority: Blocker We recently came across a 0.23 RM crashing with an NPE. Here is the stacktrace for it. {noformat} 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34) at org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339) at java.lang.Thread.run(Thread.java:722) 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. {noformat} On investigating the issue, we found that ContainerStatusPBImpl has methods that are called by different threads and are not synchronized. The 2.x code looks the same. We need to make these methods synchronized so that we do not encounter this problem in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization
[ https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-2387: Attachment: YARN-2387.patch Attaching the patch Resource Manager crashes with NPE due to lack of synchronization Key: YARN-2387 URL: https://issues.apache.org/jira/browse/YARN-2387 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.5.0 Reporter: Mit Desai Assignee: Mit Desai Priority: Blocker Attachments: YARN-2387.patch We recently came across a 0.23 RM crashing with an NPE. Here is the stacktrace for it. {noformat} 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34) at org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339) at java.lang.Thread.run(Thread.java:722) 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. {noformat} On investigating the issue, we found that ContainerStatusPBImpl has methods that are called by different threads and are not synchronized. The 2.x code looks the same. We need to make these methods synchronized so that we do not encounter this problem in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
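The shape of the fix is to serialize access to the shared proto/builder state. A minimal sketch of the idea follows; the method and field names are taken from the stack trace above, the bodies are abbreviated, and this is not the committed patch:
{code}
// Every method that reads or merges the shared builder state becomes
// synchronized, so a concurrent toString()/getProto() can no longer
// observe a half-merged builder. Bodies abbreviated for illustration.
public synchronized ContainerStatusProto getProto() {
  mergeLocalToProto();
  proto = viaProto ? proto : builder.build();
  viaProto = true;
  return proto;
}

private synchronized void mergeLocalToBuilder() {
  // copy locally cached fields (container id, state, diagnostics, ...)
  // into the protobuf builder
}
{code}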
[jira] [Commented] (YARN-1963) Support priorities across applications within the same queue
[ https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149696#comment-14149696 ] Sunil G commented on YARN-1963: --- Thank you [~maysamyabandeh] for providing us the use cases. 1. bq.use case seems to be mentioned in Item 3 of Section 1.5.3 Yes. Changing the priority of an application at runtime will help to overcome the scenario you mentioned. I will incorporate this by providing more scenarios and their impacts. 2. bq.priority can also be incorporated to the fair share calculation Application priority will be supported by both schedulers, and there are sub-JIRAs opened for the same; however, we can realign them w.r.t the same base design, and I will include changes from Fair as well. As of now, priority labels and the internal implementation will be common; however, separate ACL/per-queue priority-label configurations will be required at the scheduler level. In future, when both schedulers share the same config and common code, this can be pulled out as common code. For now, configurations and their specific implementations can be done separately for the two schedulers. Sub-JIRAs will be split accordingly. Support priorities across applications within the same queue - Key: YARN-1963 URL: https://issues.apache.org/jira/browse/YARN-1963 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Reporter: Arun C Murthy Assignee: Sunil G Attachments: YARN Application Priorities Design.pdf It will be very useful to support priorities among applications within the same queue, particularly in production scenarios. It allows for finer-grained controls without having to force admins to create a multitude of queues, plus allows existing applications to continue using existing queues which are usually part of institutional memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149701#comment-14149701 ] Karthik Kambatla commented on YARN-1458: I am open to backporting this to branch-2.5, but we don't have a 2.5.2 release planned yet. We should probably discuss 2.5.2 and the need for it on the dev lists. FairScheduler: Zero weight can lead to livelock --- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submitted lots of jobs; it is not easy to reproduce. We ran the test cluster for days to reproduce it. The output of the jstack command on the resourcemanager pid (a sketch of a zero-weight guard follows the trace): {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
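The livelock risk comes from computing fair shares when the active schedulables have zero total weight: no weight-to-resource ratio can distribute anything, so the search in ComputeFairShares makes no progress. A guard of the following shape avoids that; the snippet is illustrative (the Schedulable, getWeights, and setFairShare names mirror the stack trace above but are assumptions here), not the committed fix:
{code}
// Illustrative fragment: bail out of the share computation when the
// total weight is zero, instead of searching for a ratio that cannot
// converge.
double totalWeight = 0.0;
for (Schedulable s : schedulables) {
  totalWeight += s.getWeights().getWeight(type);
}
if (totalWeight <= 0.0) {
  for (Schedulable s : schedulables) {
    s.setFairShare(Resources.none()); // nothing can be distributed
  }
  return;
}
// ... fall through to the existing ratio search ...
{code}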
[jira] [Resolved] (YARN-2611) Fix jenkins findbugs warning and test case failures for trunk merge patch
[ https://issues.apache.org/jira/browse/YARN-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan resolved YARN-2611. -- Resolution: Fixed Thanks [~curino] for reviewing the patch. The _ReservationInterval.equals()_ is autogenerated by eclipse. I just committed this to branch yarn-1051. Fix jenkins findbugs warning and test case failures for trunk merge patch - Key: YARN-2611 URL: https://issues.apache.org/jira/browse/YARN-2611 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager, scheduler Reporter: Subru Krishnan Assignee: Subru Krishnan Attachments: YARN-2611.patch This JIRA is to fix jenkins findbugs warnings and test case failures for trunk merge patch as [reported | https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=14148506] in YARN-1051 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1051) YARN Admission Control/Planner: enhancing the resource allocation model with time.
[ https://issues.apache.org/jira/browse/YARN-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlo Curino updated YARN-1051: --- Attachment: socc14-paper15.pdf Pre-camera ready version of SoCC paper. YARN Admission Control/Planner: enhancing the resource allocation model with time. -- Key: YARN-1051 URL: https://issues.apache.org/jira/browse/YARN-1051 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager, scheduler Reporter: Carlo Curino Assignee: Carlo Curino Attachments: YARN-1051-design.pdf, YARN-1051.1.patch, YARN-1051.patch, curino_MSR-TR-2013-108.pdf, socc14-paper15.pdf, techreport.pdf In this umbrella JIRA we propose to extend the YARN RM to handle time explicitly, allowing users to reserve capacity over time. This is an important step towards SLAs, long-running services, workflows, and helps for gang scheduling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1051) YARN Admission Control/Planner: enhancing the resource allocation model with time.
[ https://issues.apache.org/jira/browse/YARN-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149727#comment-14149727 ] Hadoop QA commented on YARN-1051: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671498/socc14-paper15.pdf against trunk revision 55302cc. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5149//console This message is automatically generated. YARN Admission Control/Planner: enhancing the resource allocation model with time. -- Key: YARN-1051 URL: https://issues.apache.org/jira/browse/YARN-1051 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager, scheduler Reporter: Carlo Curino Assignee: Carlo Curino Attachments: YARN-1051-design.pdf, YARN-1051.1.patch, YARN-1051.patch, curino_MSR-TR-2013-108.pdf, socc14-paper15.pdf, techreport.pdf In this umbrella JIRA we propose to extend the YARN RM to handle time explicitly, allowing users to reserve capacity over time. This is an important step towards SLAs, long-running services, workflows, and helps for gang scheduling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2577) Clarify ACL delimiter and how to configure ACL groups only
[ https://issues.apache.org/jira/browse/YARN-2577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2577: --- Assignee: Miklos Christine Clarify ACL delimiter and how to configure ACL groups only -- Key: YARN-2577 URL: https://issues.apache.org/jira/browse/YARN-2577 Project: Hadoop YARN Issue Type: Improvement Components: documentation, fairscheduler Affects Versions: 2.5.1 Reporter: Miklos Christine Assignee: Miklos Christine Priority: Trivial Labels: newbie Attachments: YARN-2577.patch Reading through the Fair Scheduler documentation, it would be great to explicitly state that the delimiter for the fair scheduler ACLs is the space character. If specifying only ACL groups, users should begin the value with the space character. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
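As an illustration of the convention being documented (this snippet is an example for this write-up, not part of the patch): an ACL value has the form "user1,user2 group1,group2", with a single space separating the user list from the group list, so a groups-only ACL starts with a leading space:
{code}
<!-- Example allocation-file snippet only. The leading space in
     aclSubmitApps means "no users, just the group admins";
     aclAdministerApps lists users, a space, then groups. -->
<queue name="prod">
  <aclSubmitApps> admins</aclSubmitApps>
  <aclAdministerApps>alice,bob admins</aclAdministerApps>
</queue>
{code}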
[jira] [Commented] (YARN-2577) Clarify ACL delimiter and how to configure ACL groups only
[ https://issues.apache.org/jira/browse/YARN-2577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149732#comment-14149732 ] Allen Wittenauer commented on YARN-2577: +1 lgtm. Will commit to trunk and branch-2. Thanks! Clarify ACL delimiter and how to configure ACL groups only -- Key: YARN-2577 URL: https://issues.apache.org/jira/browse/YARN-2577 Project: Hadoop YARN Issue Type: Improvement Components: documentation, fairscheduler Affects Versions: 2.5.1 Reporter: Miklos Christine Assignee: Miklos Christine Priority: Trivial Labels: newbie Attachments: YARN-2577.patch Reading through the Fair Scheduler documentation, it would be great to explicitly state that the delimiter for the fair scheduler ACLs is the space character. If specifying only ACL groups, users should begin the value with the space character. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization
[ https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149744#comment-14149744 ] Hadoop QA commented on YARN-2387: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671494/YARN-2387.patch against trunk revision 55302cc. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5148//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5148//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5148//console This message is automatically generated. Resource Manager crashes with NPE due to lack of synchronization Key: YARN-2387 URL: https://issues.apache.org/jira/browse/YARN-2387 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.5.0 Reporter: Mit Desai Assignee: Mit Desai Priority: Blocker Attachments: YARN-2387.patch We recently came across a 0.23 RM crashing with an NPE. Here is the stacktrace for it. 
{noformat} 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34) at org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339) at java.lang.Thread.run(Thread.java:722) 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. {noformat} On investigating the issue, we found that ContainerStatusPBImpl has methods that are called by different threads and are not synchronized. The 2.x code looks the same. We need to make these methods synchronized so that we do not encounter this problem in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2577) Clarify ACL delimiter and how to configure ACL groups only
[ https://issues.apache.org/jira/browse/YARN-2577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149747#comment-14149747 ] Hudson commented on YARN-2577: -- FAILURE: Integrated in Hadoop-trunk-Commit #6121 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6121/]) YARN-2577. Clarify ACL delimiter and how to configure ACL groups only (Mikos Christine via aw) (aw: rev ac70c27473251b389f32f4a33085d6a9ee3a0b3c) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/FairScheduler.apt.vm * hadoop-yarn-project/CHANGES.txt Clarify ACL delimiter and how to configure ACL groups only -- Key: YARN-2577 URL: https://issues.apache.org/jira/browse/YARN-2577 Project: Hadoop YARN Issue Type: Improvement Components: documentation, fairscheduler Affects Versions: 2.5.1 Reporter: Miklos Christine Assignee: Miklos Christine Priority: Trivial Labels: newbie Fix For: 2.6.0 Attachments: YARN-2577.patch Reading through the Fair Scheduler documentation, it would be great to explicitly state that the delimiter for the fair scheduler ACLs is the space character. If specifying only ACL groups, users should begin the value with the space character. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2527) NPE in ApplicationACLsManager
[ https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149814#comment-14149814 ] Zhijie Shen commented on YARN-2527: --- This is the code in ContainerLaunchContextPBImpl. It seems that the ACLs will never be null from the CLC. {code} public Map<ApplicationAccessType, String> getApplicationACLs() { initApplicationACLs(); return this.applicationACLS; } private void initApplicationACLs() { if (this.applicationACLS != null) { return; } ContainerLaunchContextProtoOrBuilder p = viaProto ? proto : builder; List<ApplicationACLMapProto> list = p.getApplicationACLsList(); this.applicationACLS = new HashMap<ApplicationAccessType, String>(list.size()); for (ApplicationACLMapProto aclProto : list) { this.applicationACLS.put(ProtoUtils.convertFromProtoFormat(aclProto.getAccessType()), aclProto.getAcl()); } } {code} I'm still thinking it may be a race condition: the app is already in RMContext, but its ACLs have not yet been put into ApplicationACLsManager. This needs to be confirmed by [~miguenther]. In any case, the NPE happens, and ApplicationACLsManager should be self-sufficient in handling the potential null case. Let's do the fix as suggested. Will review the patch and come back to you asap. NPE in ApplicationACLsManager - Key: YARN-2527 URL: https://issues.apache.org/jira/browse/YARN-2527 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Benoy Antony Assignee: Benoy Antony Attachments: YARN-2527.patch, YARN-2527.patch NPE in _ApplicationACLsManager_ can result in a 500 Internal Server Error. The relevant stacktrace snippet from the ResourceManager logs is below: {code} Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) {code} This issue was reported by [~miguenther]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
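A sketch of the self-sufficient handling suggested above might look like the following. The signature matches how checkAccess appears in the stack trace, but the body is illustrative (the owner-only fallback is an assumption), not the attached patch:
{code}
// Null-safe lookup: if no ACLs were registered for this application
// (e.g. the race described above), fall back to an owner-only ACL
// instead of dereferencing a null map entry. Illustrative, not the patch.
public boolean checkAccess(UserGroupInformation callerUGI,
    ApplicationAccessType accessType, String owner, ApplicationId appId) {
  Map<ApplicationAccessType, AccessControlList> acls = applicationACLS.get(appId);
  AccessControlList acl = (acls == null) ? null : acls.get(accessType);
  if (acl == null) {
    acl = new AccessControlList(owner); // assumed fallback: owner (plus admins) only
  }
  return adminAclsManager.isAdmin(callerUGI) || acl.isUserAllowed(callerUGI);
}
{code}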
[jira] [Commented] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149838#comment-14149838 ] Karthik Kambatla commented on YARN-2179: Thanks Chris. The latest patch looks good to me. Just two more nits, sorry for not noticing sooner. # Rename CacheStructureUtil to SharedCache(Structure)Util? # Mark RemoteAppChecker, Util-class, SharedCacheManager Private-Unstable. [~vinodkv] - do you have any other comments on this patch? Initial cache manager structure and context --- Key: YARN-2179 URL: https://issues.apache.org/jira/browse/YARN-2179 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch Implement the initial shared cache manager structure and context. The SCMContext will be used by a number of manager services (i.e. the backing store and the cleaner service). The AppChecker is used to gather the currently running applications on SCM startup (necessary for an scm that is backed by an in-memory store). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-1769: Attachment: YARN-1769.patch Patch with log statements changed to debug CapacityScheduler: Improve reservations Key: YARN-1769 URL: https://issues.apache.org/jira/browse/YARN-1769 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch Currently the CapacityScheduler uses reservations in order to handle requests for large containers and the fact that there might not currently be enough space available on a single host. The current algorithm for reservations is to reserve as many containers as currently required, and then it will start to reserve more above that after a certain number of re-reservations (currently biased against larger containers). Any time it hits the limit on the number reserved, it stops looking at any other nodes. This results in potentially missing nodes that have enough space to fulfill the request. The other place for improvement is that reservations currently count against your queue capacity. If you have reservations, you could hit the various limits, which would then stop you from looking further at that node. The above 2 cases can cause an application requesting a larger container to take a long time to get its resources. We could improve upon both of those by simply continuing to look at incoming nodes to see if we could potentially swap out a reservation for an actual allocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2180) In-memory backing store for cache manager
[ https://issues.apache.org/jira/browse/YARN-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-2180: --- Attachment: YARN-2180-trunk-v5.patch Attached v5. This is a slight rebase so that it applies cleanly on top of trunk+YARN-2179. In-memory backing store for cache manager - Key: YARN-2180 URL: https://issues.apache.org/jira/browse/YARN-2180 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2180-trunk-v1.patch, YARN-2180-trunk-v2.patch, YARN-2180-trunk-v3.patch, YARN-2180-trunk-v4.patch, YARN-2180-trunk-v5.patch Implement an in-memory backing store for the cache manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2372) There are Chinese Characters in the FairScheduler's document
[ https://issues.apache.org/jira/browse/YARN-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149881#comment-14149881 ] Hudson commented on YARN-2372: -- FAILURE: Integrated in Hadoop-trunk-Commit #6125 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6125/]) YARN-2372. There are Chinese Characters in the FairScheduler's document (Fengdong Yu via aw) (aw: rev 32870db0fb91e115b5e44edb7b313368e8e81b1e) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/FairScheduler.apt.vm There are Chinese Characters in the FairScheduler's document Key: YARN-2372 URL: https://issues.apache.org/jira/browse/YARN-2372 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.4.1 Reporter: Fengdong Yu Assignee: Fengdong Yu Priority: Minor Fix For: 2.6.0 Attachments: YARN-2372.patch, YARN-2372.patch, YARN-2372.patch, YARN-2372.patch, YARN-2372.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login
[ https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149891#comment-14149891 ] Zhijie Shen commented on YARN-2606: --- Talked to Vinod offline briefly. It seems that the YARN daemons are not supposed to make external calls until the start stage. Application History Server tries to access hdfs before doing secure login - Key: YARN-2606 URL: https://issues.apache.org/jira/browse/YARN-2606 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Mit Desai Attachments: YARN-2606.patch While testing the Application Timeline Server, the server would not come up in a secure cluster, as it would keep trying to access hdfs without having done the secure login. It would repeatedly try authenticating and finally hit a stack overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
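In other words, under the service lifecycle, anything that touches HDFS should wait for the start stage, after the daemon's secure login. A hedged sketch of the pattern follows; the service class, config key, and paths are hypothetical, and this is not the actual patch:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.service.AbstractService;

// Illustrative lifecycle pattern, not the actual patch: read config in
// serviceInit, but defer the first HDFS access to serviceStart so it
// happens after the daemon's secure (Kerberos) login has completed.
public class HistoryStoreService extends AbstractService {
  private FileSystem fs;
  private Path root;

  public HistoryStoreService() {
    super(HistoryStoreService.class.getName());
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    root = new Path(conf.get("sample.history.dir", "/tmp/history")); // hypothetical key
    super.serviceInit(conf); // no external calls here
  }

  @Override
  protected void serviceStart() throws Exception {
    fs = root.getFileSystem(getConfig()); // first external call, post-login
    fs.mkdirs(root);
    super.serviceStart();
  }
}
{code}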
[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login
[ https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149905#comment-14149905 ] Mit Desai commented on YARN-2606: - I see. Thanks for the info. I did not know about that. I will post a refreshed patch once I have made the changes and tested it. Application History Server tries to access hdfs before doing secure login - Key: YARN-2606 URL: https://issues.apache.org/jira/browse/YARN-2606 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Mit Desai Attachments: YARN-2606.patch While testing the Application Timeline Server, the server would not come up in a secure cluster, as it would keep trying to access hdfs without having done the secure login. It would repeatedly try authenticating and finally hit stack overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-913: Attachment: YARN-913-012.patch Tightened down code, docs, and javadocs; moved classes around to position things. The registry security test failure on Jenkins didn't arise in the last patch submission; there's no obvious reason for that (more precisely, for why it arose in the first place). Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
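For background, the registration pattern under discussion, an app publishing its endpoint to ZK directly (exactly what the proposal moves behind the RM for security), looks roughly like this sketch with a plain Curator client; the path and payload format are invented for illustration:
{code}
import java.nio.charset.StandardCharsets;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Illustration of the registry concept only: publish a service
// endpoint under a well-known ZK path, then look it up by name.
public class RegistrySketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework zk = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
    zk.start();

    String path = "/registry/users/alice/services/hbase/master"; // invented path
    byte[] endpoint = "host1.example.com:16000".getBytes(StandardCharsets.UTF_8);
    zk.create().creatingParentsIfNeeded().forPath(path, endpoint);

    // A client that knows only the logical name finds the endpoint:
    String found = new String(zk.getData().forPath(path), StandardCharsets.UTF_8);
    System.out.println("hbase master at " + found);
    zk.close();
  }
}
{code}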
[jira] [Updated] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2566: Attachment: (was: YARN-2566.000.patch) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file; if the copy fails for the first localDir due to not enough disk space, the localization will fail even though there is plenty of disk space in other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED APPID=application_1410663092546_0004 CONTAINERID=container_1410663092546_0004_01_01 2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
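The fix direction the description implies, trying each local dir instead of only the first, might look roughly like this sketch (a simplified signature and helper, not the actual patch):
{code}
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

// Sketch: fall back to the next local dir when a copy fails
// (e.g. disk full), instead of failing localization outright.
public class TokenCopySketch {
  public static Path copyTokenFile(FileContext lfs, Path tokenSrc,
      List<Path> localDirs, String tokenFileName) throws IOException {
    IOException last = null;
    for (Path dir : localDirs) {
      Path dst = new Path(dir, tokenFileName);
      try {
        lfs.util().copy(tokenSrc, dst);
        return dst; // the first dir with enough space wins
      } catch (IOException e) {
        last = e;   // remember the failure and try the next localDir
      }
    }
    throw new IOException("Failed to copy " + tokenSrc
        + " to any of " + localDirs, last);
  }
}
{code}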
[jira] [Updated] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2566: Attachment: YARN-2566.000.patch IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file; if the copy fails for the first localDir due to not enough disk space, the localization will fail even though there is plenty of disk space in other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED APPID=application_1410663092546_0004 CONTAINERID=container_1410663092546_0004_01_01 2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149937#comment-14149937 ] zhihai xu commented on YARN-2566: - [The Findbugs warnings link | https://builds.apache.org/job/PreCommit-YARN-Build/5037//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html] does not exist. Reattaching the patch to restart the test. IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file; if the copy fails for the first localDir due to not enough disk space, the localization will fail even though there is plenty of disk space in other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at 
org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE
[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149941#comment-14149941 ] Hadoop QA commented on YARN-1769: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671525/YARN-1769.patch against trunk revision 3a1f981. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5150//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5150//console This message is automatically generated. CapacityScheduler: Improve reservations Key: YARN-1769 URL: https://issues.apache.org/jira/browse/YARN-1769 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch Currently the CapacityScheduler uses reservations in order to handle requests for large containers and the fact that there might not currently be enough space available on a single host. The current algorithm is to reserve as many containers as currently required, and then start to reserve more above that after a certain number of re-reservations (currently biased against larger containers). Any time it hits the limit on the number reserved, it stops looking at any other nodes. This results in potentially missing nodes that have enough space to fulfill the request. The other place for improvement is that reservations currently count against your queue capacity. If you have reservations you could hit the various limits, which would then stop you from looking further at that node. The above 2 cases can cause an application requesting a larger container to take a long time to get its resources. We could improve upon both of those by simply continuing to look at incoming nodes to see if we could potentially swap out a reservation for an actual allocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-2179: --- Attachment: YARN-2179-trunk-v9.patch [~kasha] [~vinodkv] Attached v9 to address the last comments. Initial cache manager structure and context --- Key: YARN-2179 URL: https://issues.apache.org/jira/browse/YARN-2179 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch, YARN-2179-trunk-v9.patch Implement the initial shared cache manager structure and context. The SCMContext will be used by a number of manager services (i.e. the backing store and the cleaner service). The AppChecker is used to gather the currently running applications on SCM startup (necessary for an SCM that is backed by an in-memory store). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2591) AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data
[ https://issues.apache.org/jira/browse/YARN-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen reassigned YARN-2591: - Assignee: Zhijie Shen AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data --- Key: YARN-2591 URL: https://issues.apache.org/jira/browse/YARN-2591 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 3.0.0, 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data. Currently, it is going to return INTERNAL_SERVER_ERROR(500). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields
[ https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149952#comment-14149952 ] Jian He commented on YARN-668: -- Looks good overall, minor comments: - In ContainerTokenIdentifier, check null as well? Similarly for all {{getUser}}? {code} public ContainerId getContainerID() { return new ContainerIdPBImpl(proto.getContainerId()); } {code} - This change can be reverted back {code} // LogAggregationContext is set as null Assert.assertNull(getLogAggregationContextFromContainerToken(rm1, nm1, null)); {code} TokenIdentifier serialization should consider Unknown fields Key: YARN-668 URL: https://issues.apache.org/jira/browse/YARN-668 Project: Hadoop YARN Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Junping Du Priority: Blocker Attachments: YARN-668-demo.patch, YARN-668-v2.patch, YARN-668-v3.patch, YARN-668-v4.patch, YARN-668-v5.patch, YARN-668-v6.patch, YARN-668-v7.patch, YARN-668-v8.patch, YARN-668-v9.patch, YARN-668.patch This would allow changing of the TokenIdentifier between versions. The current serialization is Writable. A simple way to achieve this would be to have a Proto object as the payload for TokenIdentifiers, instead of individual fields. TokenIdentifier continues to implement Writable to work with the RPC layer - but the payload itself is serialized using PB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
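A self-contained sketch of the null-check pattern being requested (FakeProto stands in for the generated protobuf class; this is not necessarily the shape of the final patch):
{code}
// Sketch of the null-check pattern for PB-backed getters.
public class NullCheckSketch {
  // FakeProto mimics a generated protobuf message with hasXxx()/getXxx().
  static final class FakeProto {
    private final String containerId; // null when the field is unset
    FakeProto(String containerId) { this.containerId = containerId; }
    boolean hasContainerId() { return containerId != null; }
    String getContainerId() { return containerId; }
  }

  static String getContainerID(FakeProto proto) {
    // return null instead of wrapping an unset field
    return proto.hasContainerId() ? proto.getContainerId() : null;
  }

  public static void main(String[] args) {
    System.out.println(getContainerID(new FakeProto(null)));          // null
    System.out.println(getContainerID(new FakeProto("container_1"))); // container_1
  }
}
{code}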
[jira] [Commented] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150005#comment-14150005 ] Hadoop QA commented on YARN-2179: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671538/YARN-2179-trunk-v9.patch against trunk revision b40f433. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5153//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5153//console This message is automatically generated. Initial cache manager structure and context --- Key: YARN-2179 URL: https://issues.apache.org/jira/browse/YARN-2179 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch, YARN-2179-trunk-v9.patch Implement the initial shared cache manager structure and context. The SCMContext will be used by a number of manager services (i.e. the backing store and the cleaner service). The AppChecker is used to gather the currently running applications on SCM startup (necessary for an SCM that is backed by an in-memory store). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-668) TokenIdentifier serialization should consider Unknown fields
[ https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-668: Attachment: YARN-668-v10.patch Nice catch, [~jianhe]! Fixed these issues in the v10 patch. TokenIdentifier serialization should consider Unknown fields Key: YARN-668 URL: https://issues.apache.org/jira/browse/YARN-668 Project: Hadoop YARN Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Junping Du Priority: Blocker Attachments: YARN-668-demo.patch, YARN-668-v10.patch, YARN-668-v2.patch, YARN-668-v3.patch, YARN-668-v4.patch, YARN-668-v5.patch, YARN-668-v6.patch, YARN-668-v7.patch, YARN-668-v8.patch, YARN-668-v9.patch, YARN-668.patch This would allow changing of the TokenIdentifier between versions. The current serialization is Writable. A simple way to achieve this would be to have a Proto object as the payload for TokenIdentifiers, instead of individual fields. TokenIdentifier continues to implement Writable to work with the RPC layer - but the payload itself is serialized using PB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150026#comment-14150026 ] Hadoop QA commented on YARN-2566: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671536/YARN-2566.000.patch against trunk revision b40f433. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5152//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5152//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5152//console This message is automatically generated. IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file; if the copy fails for the first localDir due to not enough disk space, the localization will fail even though there is plenty of disk space in other localDirs. 
We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344)
[jira] [Updated] (YARN-2180) In-memory backing store for cache manager
[ https://issues.apache.org/jira/browse/YARN-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-2180: --- Attachment: (was: YARN-2180-trunk-v5.patch) In-memory backing store for cache manager - Key: YARN-2180 URL: https://issues.apache.org/jira/browse/YARN-2180 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2180-trunk-v1.patch, YARN-2180-trunk-v2.patch, YARN-2180-trunk-v3.patch, YARN-2180-trunk-v4.patch, YARN-2180-trunk-v5.patch Implement an in-memory backing store for the cache manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2180) In-memory backing store for cache manager
[ https://issues.apache.org/jira/browse/YARN-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-2180: --- Attachment: YARN-2180-trunk-v5.patch Re-attached v5 to accommodate the SharedCacheStructureUtil rename. In-memory backing store for cache manager - Key: YARN-2180 URL: https://issues.apache.org/jira/browse/YARN-2180 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2180-trunk-v1.patch, YARN-2180-trunk-v2.patch, YARN-2180-trunk-v3.patch, YARN-2180-trunk-v4.patch, YARN-2180-trunk-v5.patch Implement an in-memory backing store for the cache manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2206) Update document for applications REST API response examples
[ https://issues.apache.org/jira/browse/YARN-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150044#comment-14150044 ] Hadoop QA commented on YARN-2206: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12652326/YARN-2206.patch against trunk revision aa5d925. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5155//console This message is automatically generated. Update document for applications REST API response examples --- Key: YARN-2206 URL: https://issues.apache.org/jira/browse/YARN-2206 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.4.0 Reporter: Kenji Kikushima Assignee: Kenji Kikushima Priority: Minor Attachments: YARN-2206.patch In ResourceManagerRest.apt.vm, Applications API responses are missing some elements. - JSON response should have applicationType and applicationTags. - XML response should have applicationTags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2481) YARN should allow defining the location of java
[ https://issues.apache.org/jira/browse/YARN-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150049#comment-14150049 ] Arun C Murthy commented on YARN-2481: - [~ashahab] YARN already allows JAVA_HOME to be overridden... take a look at {{ApplicationConstants.Environment.JAVA_HOME}} and {{YarnConfiguration.DEFAULT_NM_ENV_WHITELIST}} for the code-path. YARN should allow defining the location of java --- Key: YARN-2481 URL: https://issues.apache.org/jira/browse/YARN-2481 Project: Hadoop YARN Issue Type: New Feature Reporter: Abin Shahab YARN right now uses the location of JAVA_HOME on the host to launch containers. This does not work with Docker containers, which have their own filesystem namespace and OS. If the location of the Java binary of the container to be launched is configurable, YARN can launch containers that have Java in a different location than the host. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
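For instance, an AM can override JAVA_HOME per container through the launch context environment; a minimal sketch, with a placeholder path:
{code}
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.util.Records;

// Sketch: point a container at a JVM inside its own filesystem
// (e.g. a Docker image) instead of the host's JAVA_HOME.
public class JavaHomeSketch {
  public static ContainerLaunchContext withCustomJavaHome() {
    Map<String, String> env = new HashMap<>();
    env.put(ApplicationConstants.Environment.JAVA_HOME.name(),
        "/usr/lib/jvm/container-java"); // placeholder path for illustration
    ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
    ctx.setEnvironment(env);
    return ctx;
  }
}
{code}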
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150060#comment-14150060 ] Hadoop QA commented on YARN-913: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671529/YARN-913-012.patch against trunk revision b40f433. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 36 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1266 javac compiler warnings (more than the trunk's current 1265 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.ha.TestZKFailoverControllerStress org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell org.apache.hadoop.yarn.registry.secure.TestSecureRMRegistryOperations org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification The test build failed in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5151//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5151//artifact/PreCommit-HADOOP-Build-patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5151//console This message is automatically generated. Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. 
If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2613) NMClient doesn't have retries for supporting rolling-upgrades
Jian He created YARN-2613: - Summary: NMClient doesn't have retries for supporting rolling-upgrades Key: YARN-2613 URL: https://issues.apache.org/jira/browse/YARN-2613 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He While the NM is undergoing a rolling upgrade, the client should retry the NM until it comes back up. This jira is to add an NMProxy (similar to RMProxy) with a retry implementation to support rolling upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
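The RMProxy-style retry idea might be sketched with the generic hadoop-common retry utilities (NMProtocolSketch is an invented stand-in for the real client protocol):
{code}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;
import org.apache.hadoop.io.retry.RetryProxy;

// Sketch: wrap an NM-facing protocol in a retry proxy so calls
// survive the window where the NM is down for a rolling upgrade.
public class NMProxySketch {
  interface NMProtocolSketch { // stand-in for the real protocol
    String status() throws java.io.IOException;
  }

  public static NMProtocolSketch wrap(NMProtocolSketch raw) {
    RetryPolicy policy = RetryPolicies.retryUpToMaximumCountWithFixedSleep(
        30, 2, TimeUnit.SECONDS); // roughly one minute of retries
    return (NMProtocolSketch) RetryProxy.create(
        NMProtocolSketch.class, raw, policy);
  }
}
{code}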
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150093#comment-14150093 ] Xuan Gong commented on YARN-2468: - bq. One question: if one file is failed at uploading in LogValue.write(), uploadedFiles will not reflect the missing uploaded file, and it will not be uploaded again? Good catch. Fixed this in the new patch. Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
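A plausible shape for that fix, purely illustrative with invented names: record a file into the uploaded set only after its copy succeeds, so a failed upload stays eligible for the next aggregation cycle.
{code}
import java.io.File;
import java.util.HashSet;
import java.util.Set;

// Sketch: only mark files as uploaded once the write succeeded,
// so a failure leaves them eligible for the next aggregation cycle.
public class UploadTrackingSketch {
  private final Set<File> uploadedFiles = new HashSet<>();

  interface Uploader { void upload(File f) throws Exception; }

  public void writeAll(Set<File> candidates, Uploader uploader) {
    for (File f : candidates) {
      try {
        uploader.upload(f);
        uploadedFiles.add(f); // recorded only on success
      } catch (Exception e) {
        // f stays out of uploadedFiles and is retried later
      }
    }
  }
}
{code}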
[jira] [Updated] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2468: Attachment: YARN-2468.8.patch Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2566: Attachment: YARN-2566.001.patch IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch, YARN-2566.001.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file; if the copy fails for the first localDir due to not enough disk space, the localization will fail even though there is plenty of disk space in other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED APPID=application_1410663092546_0004 CONTAINERID=container_1410663092546_0004_01_01 2014-09-13 23:33:25,187 INFO
[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150108#comment-14150108 ] zhihai xu commented on YARN-2566: - Uploaded a new patch, YARN-2566.001.patch, to fix the findbugs issue by catching IOException instead of Exception. IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch, YARN-2566.001.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file; if the copy fails for the first localDir due to not enough disk space, the localization will fail even though there is plenty of disk space in other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at 
org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED APPID=application_1410663092546_0004
[jira] [Updated] (YARN-2583) Modify the LogDeletionService to support Log aggregation for LRS
[ https://issues.apache.org/jira/browse/YARN-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2583: Attachment: YARN-2583.1.patch Modify the LogDeletionService to support Log aggregation for LRS Key: YARN-2583 URL: https://issues.apache.org/jira/browse/YARN-2583 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2583.1.patch Currently, AggregatedLogDeletionService will delete old logs from HDFS. It checks the cut-off time, and if all logs for this application are older than the cut-off time, the app-log-dir is deleted from HDFS. This will not work for LRS: we expect an LRS application to keep running for a long time. Two different scenarios: 1) If we configured the rollingIntervalSeconds, new log files will keep being uploaded to HDFS. The number of log files for this application will grow larger and larger, and no log files will ever be deleted. 2) If we did not configure the rollingIntervalSeconds, the log file can only be uploaded to HDFS after the application is finished. It is very possible that the logs are uploaded after the cut-off time, which causes problems because by then the app-log-dir for this application in HDFS has been deleted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
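One way the deletion logic could account for LRS (a hedged sketch, not the attached patch): delete individual rolled log files past the cutoff while keeping the application's log directory alive.
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: per-file deletion for a still-running LRS app; remove
// only rolled log files older than the cutoff, keep the directory.
public class LrsLogDeletionSketch {
  public static void deleteOldLogs(FileSystem fs, Path appLogDir,
      long cutoffMillis) throws IOException {
    for (FileStatus file : fs.listStatus(appLogDir)) {
      if (file.isFile() && file.getModificationTime() < cutoffMillis) {
        fs.delete(file.getPath(), false); // non-recursive: single file
      }
    }
    // appLogDir itself is left in place for future uploads
  }
}
{code}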
[jira] [Updated] (YARN-2591) AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data
[ https://issues.apache.org/jira/browse/YARN-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2591: -- Target Version/s: 2.6.0 AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data --- Key: YARN-2591 URL: https://issues.apache.org/jira/browse/YARN-2591 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 3.0.0, 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data. Currently, it is going to return INTERNAL_SERVER_ERROR(500). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2591) AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data
[ https://issues.apache.org/jira/browse/YARN-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2591: -- Attachment: YARN-2591.1.patch Created a patch to throw ForbiddenException if access is denied by ApplicationACLsManager. AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data --- Key: YARN-2591 URL: https://issues.apache.org/jira/browse/YARN-2591 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 3.0.0, 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2591.1.patch AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data. Currently, it is going to return INTERNAL_SERVER_ERROR(500). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
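Roughly the check in question; the ACL manager and exception types are the standard YARN ones, but the surrounding method is invented for illustration:
{code}
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.yarn.api.records.ApplicationAccessType;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.server.security.ApplicationACLsManager;
import org.apache.hadoop.yarn.webapp.ForbiddenException;

// Sketch: turn a denied ACL check into a 403 instead of letting it
// surface as a 500.
public class AclCheckSketch {
  public static void checkAccess(ApplicationACLsManager aclsManager,
      UserGroupInformation callerUGI, String appOwner, ApplicationId appId) {
    if (callerUGI != null && !aclsManager.checkAccess(
        callerUGI, ApplicationAccessType.VIEW_APP, appOwner, appId)) {
      throw new ForbiddenException("User " + callerUGI.getShortUserName()
          + " does not have privilege to see application " + appId);
    }
  }
}
{code}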
[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields
[ https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150155#comment-14150155 ] Hadoop QA commented on YARN-668: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671553/YARN-668-v10.patch against trunk revision c7c8e38. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 10 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5154//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5154//console This message is automatically generated. TokenIdentifier serialization should consider Unknown fields Key: YARN-668 URL: https://issues.apache.org/jira/browse/YARN-668 Project: Hadoop YARN Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Junping Du Priority: Blocker Attachments: YARN-668-demo.patch, YARN-668-v10.patch, YARN-668-v2.patch, YARN-668-v3.patch, YARN-668-v4.patch, YARN-668-v5.patch, YARN-668-v6.patch, YARN-668-v7.patch, YARN-668-v8.patch, YARN-668-v9.patch, YARN-668.patch This would allow changing of the TokenIdentifier between versions. The current serialization is Writable. A simple way to achieve this would be to have a Proto object as the payload for TokenIdentifiers, instead of individual fields. TokenIdentifier continues to implement Writable to work with the RPC layer - but the payload itself is serialized using PB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150162#comment-14150162 ] Jian He commented on YARN-2594: --- The current patch looks good to me; thanks all for the discussion! Potential deadlock in RM when querying ApplicationResourceUsageReport - Key: YARN-2594 URL: https://issues.apache.org/jira/browse/YARN-2594 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karam Singh Assignee: Wangda Tan Priority: Blocker Attachments: YARN-2594.patch The ResourceManager sometimes becomes unresponsive: there is no exception in the ResourceManager log, and it contains only the following type of messages: {code} 2014-09-19 19:13:45,241 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000 2014-09-19 19:30:26,312 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000 2014-09-19 19:47:07,351 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000 2014-09-19 20:03:48,460 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000 2014-09-19 20:20:29,542 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000 2014-09-19 20:37:10,635 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000 2014-09-19 20:53:51,722 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150177#comment-14150177 ] Karthik Kambatla commented on YARN-2594: As I commented earlier, the current approach is fine with me. My review comments still apply: we should avoid using readLock in other get methods that access RMAppImpl#currentAttempt. RMAppAttemptImpl should handle the thread-safety of its fields. Can we also file follow-up JIRAs to clean up synchronization in SchedulerApplicationAttempt? Potential deadlock in RM when querying ApplicationResourceUsageReport - Key: YARN-2594 URL: https://issues.apache.org/jira/browse/YARN-2594 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karam Singh Assignee: Wangda Tan Priority: Blocker Attachments: YARN-2594.patch The ResourceManager sometimes becomes unresponsive: there is no exception in the ResourceManager log, and it contains only the following type of messages: {code} 2014-09-19 19:13:45,241 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000 2014-09-19 19:30:26,312 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000 2014-09-19 19:47:07,351 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000 2014-09-19 20:03:48,460 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000 2014-09-19 20:20:29,542 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000 2014-09-19 20:37:10,635 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000 2014-09-19 20:53:51,722 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
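A hedged illustration of the pattern Karthik is asking for — reading a volatile reference instead of taking RMAppImpl's readLock in a simple getter; the field and method names are simplified, not the committed code:
{code}
// Sketch excerpt from a class like RMAppImpl: currentAttempt is published
// through a volatile reference, so getters need no readLock, and
// RMAppAttemptImpl guards its own internal state.
private volatile RMAppAttempt currentAttempt;

public ApplicationResourceUsageReport getApplicationResourceUsageReport() {
  RMAppAttempt attempt = this.currentAttempt; // one volatile read, no lock
  // Delegate to the attempt, which handles its own thread-safety.
  return attempt == null ? null : attempt.getApplicationResourceUsageReport();
}
{code}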
[jira] [Updated] (YARN-2613) NMClient doesn't have retries for supporting rolling-upgrades
[ https://issues.apache.org/jira/browse/YARN-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2613: -- Attachment: YARN-2613.1.patch NMClient doesn't have retries for supporting rolling-upgrades - Key: YARN-2613 URL: https://issues.apache.org/jira/browse/YARN-2613 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2613.1.patch While the NM is undergoing a rolling upgrade, the client should retry the NM until it comes back up. This jira is to add an NMProxy (similar to RMProxy) with a retry implementation to support rolling upgrades. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2613) NMClient doesn't have retries for supporting rolling-upgrades
[ https://issues.apache.org/jira/browse/YARN-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150185#comment-14150185 ] Jian He commented on YARN-2613: --- Patch: - Created a new NMProxy class for instantiating the ContainerManagementProtocol proxy with a retry implementation, and created a new common base ServerProxy class. - Updated existing code to use the new NMProxy class. - Manually tested on a single-node cluster: submit an MR job and kill the NM; the MR job will retry the NM. NMClient doesn't have retries for supporting rolling-upgrades - Key: YARN-2613 URL: https://issues.apache.org/jira/browse/YARN-2613 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2613.1.patch While the NM is undergoing a rolling upgrade, the client should retry the NM until it comes back up. This jira is to add an NMProxy (similar to RMProxy) with a retry implementation to support rolling upgrades. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
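A hedged sketch of what such a retrying proxy can look like using Hadoop's stock retry machinery; the policy parameters and the rawProxy argument are illustrative, not the patch's actual values:
{code}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;
import org.apache.hadoop.io.retry.RetryProxy;
import org.apache.hadoop.yarn.api.ContainerManagementProtocol;

public class NMProxySketch {
  // Wrap the NM-facing protocol proxy so calls ride over an NM restart
  // during a rolling upgrade instead of failing immediately.
  public static ContainerManagementProtocol createRetriableProxy(
      ContainerManagementProtocol rawProxy) {
    RetryPolicy retryPolicy = RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
        TimeUnit.MINUTES.toMillis(3),   // give the NM 3 minutes to come back
        TimeUnit.SECONDS.toMillis(10),  // probing every 10 seconds
        TimeUnit.MILLISECONDS);
    return (ContainerManagementProtocol) RetryProxy.create(
        ContainerManagementProtocol.class, rawProxy, retryPolicy);
  }
}
{code}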
[jira] [Commented] (YARN-1615) Fix typos in FSSchedulerApp.java
[ https://issues.apache.org/jira/browse/YARN-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150186#comment-14150186 ] Hadoop QA commented on YARN-1615: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12623903/YARN-1615.patch against trunk revision 6b7673e. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5157//console This message is automatically generated. Fix typos in FSSchedulerApp.java Key: YARN-1615 URL: https://issues.apache.org/jira/browse/YARN-1615 Project: Hadoop YARN Issue Type: Bug Components: documentation, scheduler Affects Versions: 2.2.0 Reporter: Akira AJISAKA Assignee: Akira AJISAKA Priority: Trivial Labels: newbie Attachments: YARN-1615.patch In FSSchedulerApp.java there're 4 typos: {code} * containers over rack-local or off-switch containers. To acheive this * we first only allow node-local assigments for a given prioirty level, * then relax the locality threshold once we've had a long enough period * without succesfully scheduling. We measure both the number of missed {code} They should be fixed as follows: {code} * containers over rack-local or off-switch containers. To achieve this * we first only allow node-local assignments for a given priority level, * then relax the locality threshold once we've had a long enough period * without successfully scheduling. We measure both the number of missed {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150200#comment-14150200 ] Karthik Kambatla commented on YARN-2179: +1. Initial cache manager structure and context --- Key: YARN-2179 URL: https://issues.apache.org/jira/browse/YARN-2179 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch, YARN-2179-trunk-v9.patch Implement the initial shared cache manager structure and context. The SCMContext will be used by a number of manager services (i.e. the backing store and the cleaner service). The AppChecker is used to gather the currently running applications on SCM startup (necessary for an SCM that is backed by an in-memory store). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
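A hedged sketch of the role AppChecker plays in that description; the methods shown are a guess at its shape, not the patch's exact signatures:
{code}
import java.util.Collection;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.exceptions.YarnException;

// Sketch: the SCM consults this on startup so an in-memory store can
// re-learn which applications are still running and must not have their
// cache references dropped.
public interface AppChecker {
  // True if the application is still active in the cluster.
  boolean isApplicationActive(ApplicationId id) throws YarnException;

  // All currently active applications, gathered once at SCM startup.
  Collection<ApplicationId> getActiveApplications() throws YarnException;
}
{code}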
[jira] [Commented] (YARN-2591) AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data
[ https://issues.apache.org/jira/browse/YARN-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150220#comment-14150220 ] Hadoop QA commented on YARN-2591: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671578/YARN-2591.1.patch against trunk revision 6b7673e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5158//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5158//console This message is automatically generated. AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data --- Key: YARN-2591 URL: https://issues.apache.org/jira/browse/YARN-2591 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 3.0.0, 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2591.1.patch AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data. Currently, it is going to return INTERNAL_SERVER_ERROR(500). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150221#comment-14150221 ] Zhijie Shen commented on YARN-2468: --- +1 for the latest patch Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150244#comment-14150244 ] Hadoop QA commented on YARN-2468: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671569/YARN-2468.8.patch against trunk revision 6b7673e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA org.apache.hadoop.yarn.client.TestResourceTrackerOnHA org.apache.hadoop.yarn.server.nodemanager.containermanager.TestContainerManager org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart org.apache.hadoop.yarn.server.TestContainerManageTests {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5159//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5159//console This message is automatically generated. Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2613) NMClient doesn't have retries for supporting rolling-upgrades
[ https://issues.apache.org/jira/browse/YARN-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150245#comment-14150245 ] Hadoop QA commented on YARN-2613: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671581/YARN-2613.1.patch against trunk revision 6b7673e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA org.apache.hadoop.yarn.server.nodemanager.containermanager.TestContainerManager org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.TestContainersMonitor org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.TestPBLocalizerRPC The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.server.TestContainerManageTests {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5160//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5160//console This message is automatically generated. NMClient doesn't have retries for supporting rolling-upgrades - Key: YARN-2613 URL: https://issues.apache.org/jira/browse/YARN-2613 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2613.1.patch While the NM is undergoing a rolling upgrade, the client should retry the NM until it comes back up. This jira is to add an NMProxy (similar to RMProxy) with a retry implementation to support rolling upgrades. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150255#comment-14150255 ] Hadoop QA commented on YARN-2566: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671570/YARN-2566.001.patch against trunk revision 6b7673e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5156//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5156//console This message is automatically generated. IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch, YARN-2566.001.patch startLocalizer in DefaultContainerExecutor only uses the first localDir to copy the token file. If the copy fails for the first localDir due to not enough disk space, the localization fails even when there is plenty of disk space in the other localDirs.
We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at
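The failure mode above suggests the fix direction: do not hard-code the first localDir. A hedged sketch, where the helper name and parameters are illustrative rather than the patch's actual code:
{code}
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

public class TokenCopySketch {
  // Sketch: try each localDir in turn and fail only if none works, instead
  // of failing outright when the first dir is out of space.
  static Path copyTokenFile(FileContext lfs, Path tokenSrc,
      List<String> localDirs) throws IOException {
    IOException lastFailure = null;
    for (String dir : localDirs) {
      try {
        Path dst = new Path(dir, tokenSrc.getName());
        lfs.util().copy(tokenSrc, dst); // may throw, e.g. on ENOSPC
        return dst;                     // succeeded on this dir
      } catch (IOException e) {
        lastFailure = e;                // remember and try the next dir
      }
    }
    throw lastFailure != null ? lastFailure
        : new IOException("No usable localDir for " + tokenSrc);
  }
}
{code}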
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150266#comment-14150266 ] Zhijie Shen commented on YARN-2468: --- The test failures seem to be related to address binding conflicts. Restarting the build. Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2609) Example of use for the ReservationSystem
[ https://issues.apache.org/jira/browse/YARN-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlo Curino updated YARN-2609: --- Attachment: YARN-2609.patch Example of use for the ReservationSystem Key: YARN-2609 URL: https://issues.apache.org/jira/browse/YARN-2609 Project: Hadoop YARN Issue Type: Improvement Reporter: Carlo Curino Assignee: Carlo Curino Priority: Minor Attachments: YARN-2609.patch This JIRA provides a simple new example in mapreduce-examples that requests a reservation and submits a Pi computation in the reservation. This is meant just to show how to interact with the reservation system. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2609) Example of use for the ReservationSystem
[ https://issues.apache.org/jira/browse/YARN-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlo Curino updated YARN-2609: --- Attachment: YARN-2609.docx Example of use for the ReservationSystem Key: YARN-2609 URL: https://issues.apache.org/jira/browse/YARN-2609 Project: Hadoop YARN Issue Type: Improvement Reporter: Carlo Curino Assignee: Carlo Curino Priority: Minor Attachments: YARN-2609.docx, YARN-2609.patch This JIRA provides a simple new example in mapreduce-examples that requests a reservation and submits a Pi computation in the reservation. This is meant just to show how to interact with the reservation system. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2609) Example of use for the ReservationSystem
[ https://issues.apache.org/jira/browse/YARN-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150285#comment-14150285 ] Carlo Curino commented on YARN-2609: Per [~kasha]'s request, we provide a simple way to see YARN-1051 in action (adding to the mapreduce-examples) and a brief usage document. Please refer to the documents associated with YARN-1051 for more context and the design vision. This patch can be improved/extended for the actual commit; it is just to facilitate the evaluation of YARN-1051 for the ongoing merge-to-trunk vote. Example of use for the ReservationSystem Key: YARN-2609 URL: https://issues.apache.org/jira/browse/YARN-2609 Project: Hadoop YARN Issue Type: Improvement Reporter: Carlo Curino Assignee: Carlo Curino Priority: Minor Attachments: YARN-2609.docx, YARN-2609.patch This JIRA provides a simple new example in mapreduce-examples that requests a reservation and submits a Pi computation in the reservation. This is meant just to show how to interact with the reservation system. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
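For orientation, a hedged sketch of the client-side interaction the example demonstrates; the sizes, times, and queue name are made up, and the attached .docx and patch describe the real flow:
{code}
import java.util.Collections;
import org.apache.hadoop.yarn.api.protocolrecords.ReservationSubmissionRequest;
import org.apache.hadoop.yarn.api.records.ReservationDefinition;
import org.apache.hadoop.yarn.api.records.ReservationId;
import org.apache.hadoop.yarn.api.records.ReservationRequest;
import org.apache.hadoop.yarn.api.records.ReservationRequestInterpreter;
import org.apache.hadoop.yarn.api.records.ReservationRequests;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ReservationExampleSketch {
  // Sketch: reserve 10 containers of 1GB/1 vcore between arrival and
  // deadline, then run the Pi job inside the returned reservation.
  static ReservationId reserveForPi(YarnClient yarnClient,
      long arrivalMs, long deadlineMs) throws Exception {
    ReservationRequest ask = ReservationRequest.newInstance(
        Resource.newInstance(1024, 1), 10);
    ReservationRequests asks = ReservationRequests.newInstance(
        Collections.singletonList(ask), ReservationRequestInterpreter.R_ALL);
    ReservationDefinition definition = ReservationDefinition.newInstance(
        arrivalMs, deadlineMs, asks, "pi-reservation");
    ReservationSubmissionRequest request =
        ReservationSubmissionRequest.newInstance(definition, "default");
    // The Pi job is then submitted with this ReservationId set on it.
    return yarnClient.submitReservation(request).getReservationId();
  }
}
{code}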
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150292#comment-14150292 ] Hadoop QA commented on YARN-2468: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671569/YARN-2468.8.patch against trunk revision 6b7673e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5161//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5161//console This message is automatically generated. Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2614) Cleanup synchronized method in SchedulerApplicationAttempt
Wangda Tan created YARN-2614: Summary: Cleanup synchronized method in SchedulerApplicationAttempt Key: YARN-2614 URL: https://issues.apache.org/jira/browse/YARN-2614 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Wangda Tan According to discussions in YARN-2594, there are some methods in SchedulerApplicationAttempt that will be accessed by other modules, which can lead to a potential deadlock in the RM; we should clean them up as much as we can. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
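One hedged illustration of the kind of cleanup meant here — returning an immutable snapshot taken under the monitor rather than letting other modules call into synchronized getters and risk lock-order inversion; the names are simplified, not a proposal for the actual patch:
{code}
// Sketch excerpt from a class like SchedulerApplicationAttempt: copy the
// mutable Resource while holding the monitor and hand out the copy, so
// callers never hold this object's lock while touching other locks.
public synchronized Resource getCurrentConsumptionSnapshot() {
  return Resource.newInstance(currentConsumption.getMemory(),
      currentConsumption.getVirtualCores());
}
{code}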
[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150301#comment-14150301 ] Wangda Tan commented on YARN-2594: -- Thanks [~jianhe] and [~kasha] for the review, I created YARN-2614 to track the SchedulerApplicationAttempt synchronization cleanups. Potential deadlock in RM when querying ApplicationResourceUsageReport - Key: YARN-2594 URL: https://issues.apache.org/jira/browse/YARN-2594 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karam Singh Assignee: Wangda Tan Priority: Blocker Attachments: YARN-2594.patch The ResourceManager sometimes becomes unresponsive: there is no exception in the ResourceManager log, and it contains only the following type of messages: {code} 2014-09-19 19:13:45,241 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000 2014-09-19 19:30:26,312 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000 2014-09-19 19:47:07,351 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000 2014-09-19 20:03:48,460 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000 2014-09-19 20:20:29,542 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000 2014-09-19 20:37:10,635 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000 2014-09-19 20:53:51,722 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields
[ https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150300#comment-14150300 ] Hudson commented on YARN-668: - SUCCESS: Integrated in Hadoop-trunk-Commit #6130 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6130/]) YARN-668. Changed NMTokenIdentifier/AMRMTokenIdentifier/ContainerTokenIdentifier to use protobuf object as the payload. Contributed by Junping Du. (jianhe: rev 5391919b09ce9549d13c897aa89bb0a0536760fe) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/server/yarn_security_token.proto * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/NMTokenIdentifierNewForTest.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/ContainerTokenIdentifierForTest.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/pom.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/NMTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/AMRMTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/proto/test_token.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/pom.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/proto/test_amrm_token.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/AMRMTokenIdentifierForTest.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/ContainerTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/security/TestYARNTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/pom.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationMasterService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java TokenIdentifier serialization should consider Unknown fields Key: YARN-668 URL: https://issues.apache.org/jira/browse/YARN-668 Project: Hadoop YARN Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Junping Du Priority: Blocker Fix For: 2.6.0 Attachments: YARN-668-demo.patch, YARN-668-v10.patch, YARN-668-v2.patch, YARN-668-v3.patch, YARN-668-v4.patch, YARN-668-v5.patch, YARN-668-v6.patch, YARN-668-v7.patch, YARN-668-v8.patch, YARN-668-v9.patch, YARN-668.patch This would allow changing of the TokenIdentifier between versions. The current serialization is Writable. A simple way to achieve this would be to have a Proto object as the payload for TokenIdentifiers, instead of individual fields. 
TokenIdentifier continues to implement Writable to work with the RPC layer - but the payload itself is serialized using PB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2180) In-memory backing store for cache manager
[ https://issues.apache.org/jira/browse/YARN-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150324#comment-14150324 ] Karthik Kambatla commented on YARN-2180: Thanks for the updates, Chris. The overall approach looks good. Review comments: # SharedCacheManager#createSCMStoreService should use ReflectionUtils.newInstance. RMProxy is an example. # Thinking out loud here. YarnConfiguration (and yarn-default): I was wondering if we need a separate prefix for manager. Do we have more configs coming later that are specific to manager? yarn.sharedcache.store is not ambiguous. # SCMStore ## A couple of lines are longer than 80 chars. ## For resources that are not in the store, isn't the access time trivially zero? I am okay with returning -1 for those cases, but will returning zero help at call sites? ## Nit: Would keep the methods concerning references all together. # InMemorySCMStore configuration - do we need a separate configuration class for the in-memory store? Why not include it in YarnConfiguration, similar to the RMStore implementations? Depending on what we decide, we might want to change the actual config names. # InMemorySCMStore ## Can we rename map to something more descriptive? cacheResources? ## Nit: Move the bootstrapping code to a different method for readability? ## Isn't the following synchronized block prone to races when different threads lock on different objects? {code} synchronized (initialApps) { initialApps = getInitialApps(conf); } {code} ## We can leave it as is for now, but the implementation of AppChecker should come from some util method based on whether it is embedded or not. If we are open to it, we can add that method now and make it return RemoteAppChecker by default. ## Nit: The following should fit on two lines: {code} Map<String, String> getInitialCachedResources(FileSystem fs, Configuration conf) throws IOException { {code} ## Use containsKey instead of the following? {code} String mapped = initialCachedEntries.get(key); if (mapped != null) { {code} ## clearCache() - we should annotate each TODO with a follow-up JIRA, so we don't forget. In-memory backing store for cache manager - Key: YARN-2180 URL: https://issues.apache.org/jira/browse/YARN-2180 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2180-trunk-v1.patch, YARN-2180-trunk-v2.patch, YARN-2180-trunk-v3.patch, YARN-2180-trunk-v4.patch, YARN-2180-trunk-v5.patch Implement an in-memory backing store for the cache manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
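On the synchronized-block race quoted above, a hedged sketch of the usual fix — locking on a dedicated final object rather than on a field that the block itself reassigns; only getInitialApps and initialApps come from the quoted code, the rest is illustrative:
{code}
// Sketch: two threads entering the original block could end up holding the
// monitors of two different objects once initialApps is reassigned; a final
// lock object makes the critical section actually mutually exclusive.
private final Object initialAppsLock = new Object();
private List<ApplicationId> initialApps;

void bootstrap(Configuration conf) {
  synchronized (initialAppsLock) {      // stable lock object
    initialApps = getInitialApps(conf); // getInitialApps from the patch
  }
}
{code}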
[jira] [Commented] (YARN-2180) In-memory backing store for cache manager
[ https://issues.apache.org/jira/browse/YARN-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150328#comment-14150328 ] Karthik Kambatla commented on YARN-2180: Forgot to mention - we should annotate the new classes as @Private and @Evolving|@Unstable as appropriate. In-memory backing store for cache manager - Key: YARN-2180 URL: https://issues.apache.org/jira/browse/YARN-2180 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2180-trunk-v1.patch, YARN-2180-trunk-v2.patch, YARN-2180-trunk-v3.patch, YARN-2180-trunk-v4.patch, YARN-2180-trunk-v5.patch Implement an in-memory backing store for the cache manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
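Concretely, with Hadoop's standard classification annotations; the class shown is from this patch's discussion, but the particular annotation pairing is only a suggestion:
{code}
import org.apache.hadoop.classification.InterfaceAudience.Private;
import org.apache.hadoop.classification.InterfaceStability.Evolving;

@Private   // internal to YARN, not a user-facing API
@Evolving  // may change incompatibly between minor releases
public class InMemorySCMStore extends SCMStore {
  // ...
}
{code}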
[jira] [Commented] (YARN-2613) NMClient doesn't have retries for supporting rolling-upgrades
[ https://issues.apache.org/jira/browse/YARN-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150334#comment-14150334 ] Hadoop QA commented on YARN-2613: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671581/YARN-2613.1.patch against trunk revision 5f16c98. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5162//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5162//console This message is automatically generated. NMClient doesn't have retries for supporting rolling-upgrades - Key: YARN-2613 URL: https://issues.apache.org/jira/browse/YARN-2613 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2613.1.patch While the NM is undergoing a rolling upgrade, the client should retry the NM until it comes back up. This jira is to add an NMProxy (similar to RMProxy) with a retry implementation to support rolling upgrades. -- This message was sent by Atlassian JIRA (v6.3.4#6332)