[jira] [Commented] (YARN-527) Local filecache mkdir fails

2013-04-04 Thread Knut O. Hellan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621878#comment-13621878
 ] 

Knut O. Hellan commented on YARN-527:
-

Yes, this is a duplicate of YARN-467 so you may close it. We will add cronjobs 
to delete old directories as a temporary workaround until we can test 
2.0.5-beta. Thanks!

 Local filecache mkdir fails
 ---

 Key: YARN-527
 URL: https://issues.apache.org/jira/browse/YARN-527
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.0-alpha
 Environment: RHEL 6.3 with CDH4.1.3 Hadoop, HA with two name nodes 
 and six worker nodes.
Reporter: Knut O. Hellan
Priority: Minor
 Attachments: yarn-site.xml


 Jobs failed with no other explanation than this stack trace:
 2013-03-29 16:46:02,671 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics 
 report from attempt_1364591875320_0017_m_00_0: 
 java.io.IOException: mkdir of /disk3/yarn/local/filecache/-4230789355400878397 failed
 at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:932)
 at 
 org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143)
 at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
 at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:706)
 at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:703)
 at 
 org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2333)
 at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:703)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:147)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 Manually creating the directory worked. This behavior was common to at least 
 several nodes in the cluster.
 The situation was resolved by removing and recreating all 
 /disk?/yarn/local/filecache directories on all nodes.
 It is unclear whether Yarn struggled with the number of files or if there 
 were corrupt files in the caches. The situation was triggered by a node dying.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-538) RM address DNS lookup can cause unnecessary slowness on every JHS page load

2013-04-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622012#comment-13622012
 ] 

Hudson commented on YARN-538:
-

Integrated in Hadoop-Yarn-trunk #174 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/174/])
YARN-538. RM address DNS lookup can cause unnecessary slowness on every JHS 
page load. (sandyr via tucu) (Revision 1464197)

 Result = SUCCESS
tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1464197
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java


 RM address DNS lookup can cause unnecessary slowness on every JHS page load 
 

 Key: YARN-538
 URL: https://issues.apache.org/jira/browse/YARN-538
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.0.3-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Fix For: 2.0.5-beta

 Attachments: MAPREDUCE-5111.patch


 When I run the job history server locally, every page load takes tens of 
 seconds.  I profiled the process and discovered that all the extra time 
 was spent inside YarnConfiguration#getRMWebAppURL, trying to resolve 0.0.0.0 
 to a hostname.  When I changed my yarn.resourcemanager.address to localhost, 
 the page load times decreased drastically.
 There's no reason that we need to perform this resolution on every page load.
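 A minimal sketch of the kind of caching that would avoid repeating the lookup on 
 every page load, assuming a hypothetical CachedRMWebAppUrl helper; this is not the 
 attached patch, and the config default shown is illustrative only:
 {code:java}
 import java.util.concurrent.atomic.AtomicReference;

 import org.apache.hadoop.conf.Configuration;

 // Hypothetical helper (not the YARN-538 patch): compute the RM web app URL once
 // and reuse it, instead of re-reading/resolving the address on every page load.
 public final class CachedRMWebAppUrl {
   private static final AtomicReference<String> CACHED = new AtomicReference<String>();

   private CachedRMWebAppUrl() {
   }

   public static String get(Configuration conf) {
     String url = CACHED.get();
     if (url == null) {
       // The expensive part (reading the address and possibly resolving a
       // wildcard like 0.0.0.0 to a hostname) happens only once.
       String addr = conf.get("yarn.resourcemanager.webapp.address", "0.0.0.0:8088");
       url = "http://" + addr;
       CACHED.compareAndSet(null, url);
     }
     return CACHED.get();
   }
 }
 {code}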

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-516) TestContainerLocalizer.testContainerLocalizerMain is failing

2013-04-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622015#comment-13622015
 ] 

Hudson commented on YARN-516:
-

Integrated in Hadoop-Yarn-trunk #174 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/174/])
Revert YARN-516 per HADOOP-9357. (Revision 1464181)

 Result = SUCCESS
eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1464181
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestContainerLocalizer.java


 TestContainerLocalizer.testContainerLocalizerMain is failing
 

 Key: YARN-516
 URL: https://issues.apache.org/jira/browse/YARN-516
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Andrew Wang
 Fix For: 2.0.5-beta

 Attachments: YARN-516.txt




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-101) If the heartbeat message loss, the nodestatus info of complete container will loss too.

2013-04-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622023#comment-13622023
 ] 

Hudson commented on YARN-101:
-

Integrated in Hadoop-Yarn-trunk #174 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/174/])
YARN-101. Fix NodeManager heartbeat processing to not lose track of 
completed containers in case of dropped heartbeats. Contributed by Xuan Gong. 
(Revision 1464105)

 Result = SUCCESS
vinodkv : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1464105
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java


 If  the heartbeat message loss, the nodestatus info of complete container 
 will loss too.
 

 Key: YARN-101
 URL: https://issues.apache.org/jira/browse/YARN-101
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
 Environment: suse.
Reporter: xieguiming
Assignee: Xuan Gong
Priority: Minor
 Fix For: 2.0.5-beta

 Attachments: YARN-101.1.patch, YARN-101.2.patch, YARN-101.3.patch, 
 YARN-101.4.patch, YARN-101.5.patch, YARN-101.6.patch


 see the red color:
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.java
  protected void startStatusUpdater() {
    new Thread("Node Status Updater") {
      @Override
      @SuppressWarnings("unchecked")
      public void run() {
        int lastHeartBeatID = 0;
        while (!isStopped) {
          // Send heartbeat
          try {
            synchronized (heartbeatMonitor) {
              heartbeatMonitor.wait(heartBeatInterval);
            }
            {color:red}
            // Before we send the heartbeat, we get the NodeStatus,
            // whose method removes completed containers.
            NodeStatus nodeStatus = getNodeStatus();
            {color}
            nodeStatus.setResponseId(lastHeartBeatID);

            NodeHeartbeatRequest request = recordFactory
                .newRecordInstance(NodeHeartbeatRequest.class);
            request.setNodeStatus(nodeStatus);
            {color:red}
            // But if the nodeHeartbeat fails, we've already removed the
            // containers and have no way to know about it. We aren't handling
            // a nodeHeartbeat failure case here.
            HeartbeatResponse response =
                resourceTracker.nodeHeartbeat(request).getHeartbeatResponse();
            {color}
            if (response.getNodeAction() == NodeAction.SHUTDOWN) {
              LOG.info("Recieved SHUTDOWN signal from Resourcemanager as " +
                  "part of heartbeat, hence shutting down.");
              NodeStatusUpdaterImpl.this.stop();
              break;
            }
            if (response.getNodeAction() == NodeAction.REBOOT) {
              LOG.info("Node is out of sync with ResourceManager,"
                  + " hence rebooting.");
              NodeStatusUpdaterImpl.this.reboot();
              break;
            }
            lastHeartBeatID = response.getResponseId();
            List<ContainerId> containersToCleanup = response
                .getContainersToCleanupList();
            if (containersToCleanup.size() != 0) {
              dispatcher.getEventHandler().handle(
                  new CMgrCompletedContainersEvent(containersToCleanup));
            }
            List<ApplicationId> appsToCleanup =
                response.getApplicationsToCleanupList();
            // Only start tracking for keepAlive on FINISH_APP
            trackAppsForKeepAlive(appsToCleanup);
            if (appsToCleanup.size() != 0) {
              dispatcher.getEventHandler().handle(
                  new CMgrCompletedAppsEvent(appsToCleanup));
            }
          } catch (Throwable e) {
            // TODO Better error handling. Thread can die with the rest of the
            // NM still running.
            LOG.error("Caught exception in status-updater", e);
          }
        }
      }
    }.start();
  }

  private NodeStatus getNodeStatus() {
    NodeStatus nodeStatus = recordFactory.newRecordInstance(NodeStatus.class);
    nodeStatus.setNodeId(this.nodeId);
    int numActiveContainers = 0;
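 A minimal sketch of one possible mitigation for the problem highlighted above, 
 reusing the fields from the snippet (recordFactory, resourceTracker, 
 lastHeartBeatID, LOG); the pendingCompletedContainers buffer and heartbeatOnce() 
 method are hypothetical and not part of the attached patches:
 {code:java}
 // Illustrative sketch only: buffer completed container statuses and drop them
 // only after the heartbeat that carried them succeeds, so a dropped heartbeat
 // does not lose the completions. Assumes getNodeStatus() copies completed
 // container statuses into pendingCompletedContainers instead of discarding them.
 private final List<ContainerStatus> pendingCompletedContainers =
     new ArrayList<ContainerStatus>();

 private void heartbeatOnce() {
   NodeStatus nodeStatus = getNodeStatus();
   nodeStatus.setResponseId(lastHeartBeatID);

   NodeHeartbeatRequest request =
       recordFactory.newRecordInstance(NodeHeartbeatRequest.class);
   request.setNodeStatus(nodeStatus);
   try {
     HeartbeatResponse response =
         resourceTracker.nodeHeartbeat(request).getHeartbeatResponse();
     lastHeartBeatID = response.getResponseId();
     // Only now is it safe to forget the completed containers we reported.
     pendingCompletedContainers.clear();
   } catch (Exception e) {
     // Heartbeat failed: keep pendingCompletedContainers so the statuses are
     // re-sent on the next heartbeat instead of being silently lost.
     LOG.error("Heartbeat failed; will retry with buffered container statuses", e);
   }
 }
 {code}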
 

[jira] [Commented] (YARN-381) Improve FS docs

2013-04-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622028#comment-13622028
 ] 

Hudson commented on YARN-381:
-

Integrated in Hadoop-Yarn-trunk #174 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/174/])
YARN-381. Improve fair scheduler docs. Contributed by Sandy Ryza. (Revision 
1464130)

 Result = SUCCESS
tomwhite : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1464130
Files : 
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/site/apt/ClusterSetup.apt.vm
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/FairScheduler.apt.vm


 Improve FS docs
 ---

 Key: YARN-381
 URL: https://issues.apache.org/jira/browse/YARN-381
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Affects Versions: 2.0.0-alpha
Reporter: Eli Collins
Assignee: Sandy Ryza
Priority: Minor
 Fix For: 2.0.5-beta

 Attachments: YARN-381.patch


 The MR2 FS docs could use some improvements.
 Configuration:
 - sizebasedweight - what is the size here? Total memory usage?
 Pool properties:
 - minResources - what does min amount of aggregate memory mean given that 
 this is not a reservation?
 - maxResources - is this a hard limit?
 - weight: How is this ratio configured? E.g. base is 1 and all weights are 
 relative to that?
 - schedulingMode - what is the default? Is fifo pure fifo, e.g. does it wait until 
 all tasks for the job are finished before launching the next job?
 There's no mention of ACLs, even though they're supported. See the CS docs 
 for comparison.
 Also there are a couple of typos worth fixing while we're at it, e.g. "finish. 
 apps to run".
 Worth keeping in mind that some of these will need to be updated to reflect 
 that resource calculators are now pluggable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-536) Remove ContainerStatus, ContainerState from Container api interface as they will not be called by the container object

2013-04-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622029#comment-13622029
 ] 

Hudson commented on YARN-536:
-

Integrated in Hadoop-Yarn-trunk #174 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/174/])
YARN-536. Removed the unused objects ContainerStatus and ContainerState 
from Container, which also don't belong to the container. Contributed by Xuan 
Gong. (Revision 1464271)

 Result = SUCCESS
vinodkv : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1464271
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/Container.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/BuilderUtils.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/NodeManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestFifoScheduler.java


 Remove ContainerStatus, ContainerState from Container api interface as they 
 will not be called by the container object
 --

 Key: YARN-536
 URL: https://issues.apache.org/jira/browse/YARN-536
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Xuan Gong
Assignee: Xuan Gong
 Fix For: 2.0.5-beta

 Attachments: YARN-536.1.patch, YARN-536.2.patch


 Remove ContainerState and ContainerStatus from the Container interface. They will 
 not be called by the container object.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-381) Improve FS docs

2013-04-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622154#comment-13622154
 ] 

Hudson commented on YARN-381:
-

Integrated in Hadoop-Hdfs-trunk #1363 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1363/])
YARN-381. Improve fair scheduler docs. Contributed by Sandy Ryza. (Revision 
1464130)

 Result = FAILURE
tomwhite : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1464130
Files : 
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/site/apt/ClusterSetup.apt.vm
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/FairScheduler.apt.vm


 Improve FS docs
 ---

 Key: YARN-381
 URL: https://issues.apache.org/jira/browse/YARN-381
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Affects Versions: 2.0.0-alpha
Reporter: Eli Collins
Assignee: Sandy Ryza
Priority: Minor
 Fix For: 2.0.5-beta

 Attachments: YARN-381.patch


 The MR2 FS docs could use some improvements.
 Configuration:
 - sizebasedweight - what is the size here? Total memory usage?
 Pool properties:
 - minResources - what does min amount of aggregate memory mean given that 
 this is not a reservation?
 - maxResources - is this a hard limit?
 - weight: How is this ratio configured? E.g. base is 1 and all weights are 
 relative to that?
 - schedulingMode - what is the default? Is fifo pure fifo, e.g. does it wait until 
 all tasks for the job are finished before launching the next job?
 There's no mention of ACLs, even though they're supported. See the CS docs 
 for comparison.
 Also there are a couple of typos worth fixing while we're at it, e.g. "finish. 
 apps to run".
 Worth keeping in mind that some of these will need to be updated to reflect 
 that resource calculators are now pluggable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-536) Remove ContainerStatus, ContainerState from Container api interface as they will not be called by the container object

2013-04-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622155#comment-13622155
 ] 

Hudson commented on YARN-536:
-

Integrated in Hadoop-Hdfs-trunk #1363 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1363/])
YARN-536. Removed the unused objects ContainerStatus and ContainerState 
from Container, which also don't belong to the container. Contributed by Xuan 
Gong. (Revision 1464271)

 Result = FAILURE
vinodkv : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1464271
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/Container.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/BuilderUtils.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/NodeManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestFifoScheduler.java


 Remove ContainerStatus, ContainerState from Container api interface as they 
 will not be called by the container object
 --

 Key: YARN-536
 URL: https://issues.apache.org/jira/browse/YARN-536
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Xuan Gong
Assignee: Xuan Gong
 Fix For: 2.0.5-beta

 Attachments: YARN-536.1.patch, YARN-536.2.patch


 Remove ContainerState and ContainerStatus from the Container interface. They will 
 not be called by the container object.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-458) YARN daemon addresses must be placed in many different configs

2013-04-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622145#comment-13622145
 ] 

Hudson commented on YARN-458:
-

Integrated in Hadoop-Hdfs-trunk #1363 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1363/])
YARN-458. YARN daemon addresses must be placed in many different configs. 
(sandyr via tucu) (Revision 1464204)

 Result = FAILURE
tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1464204
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml


 YARN daemon addresses must be placed in many different configs
 --

 Key: YARN-458
 URL: https://issues.apache.org/jira/browse/YARN-458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, resourcemanager
Affects Versions: 2.0.3-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Fix For: 2.0.5-beta

 Attachments: YARN-458.patch


 The YARN resourcemanager's address is included in four different configs: 
 yarn.resourcemanager.scheduler.address, 
 yarn.resourcemanager.resource-tracker.address, yarn.resourcemanager.address, 
 and yarn.resourcemanager.admin.address
 A new user trying to configure a cluster needs to know the names of all four 
 of these configs.
 The same issue exists for nodemanagers.
 It would be much easier if they could simply specify 
 yarn.resourcemanager.hostname and yarn.nodemanager.hostname, and the default 
 ports for the other configs would kick in.
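 A minimal sketch of the proposed usage, assuming the defaulting described above is 
 in place; the hostname value and the printing class are illustrative, not part of 
 the attached patch:
 {code:java}
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.yarn.conf.YarnConfiguration;

 // Illustrative sketch of the proposed usability win (not the committed patch):
 // the user sets only the RM hostname, and the per-service addresses would fall
 // back to "<hostname>:<default port>" when not set explicitly.
 public class SingleHostnameConfigExample {
   public static void main(String[] args) {
     Configuration conf = new YarnConfiguration();
     conf.set("yarn.resourcemanager.hostname", "rm.example.com"); // hypothetical host

     // With the proposed defaulting, these would resolve against the RM hostname
     // instead of requiring four separate entries.
     String[] keys = {
         "yarn.resourcemanager.address",
         "yarn.resourcemanager.scheduler.address",
         "yarn.resourcemanager.resource-tracker.address",
         "yarn.resourcemanager.admin.address"
     };
     for (String key : keys) {
       System.out.println(key + " = " + conf.get(key));
     }
   }
 }
 {code}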

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-382) SchedulerUtils improve way normalizeRequest sets the resource capabilities

2013-04-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622151#comment-13622151
 ] 

Hudson commented on YARN-382:
-

Integrated in Hadoop-Hdfs-trunk #1363 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1363/])
YARN-382. SchedulerUtils improve way normalizeRequest sets the resource 
capabilities (Zhijie Shen via bikas) (Revision 1463653)

 Result = FAILURE
bikas : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1463653
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestSchedulerUtils.java


 SchedulerUtils improve way normalizeRequest sets the resource capabilities
 --

 Key: YARN-382
 URL: https://issues.apache.org/jira/browse/YARN-382
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Affects Versions: 2.0.3-alpha
Reporter: Thomas Graves
Assignee: Zhijie Shen
 Fix For: 2.0.5-beta

 Attachments: YARN-382_1.patch, YARN-382_2.patch, YARN-382_demo.patch


 In YARN-370, we changed it from setting the capability to directly setting 
 memory and cores:
 -ask.setCapability(normalized);
 +ask.getCapability().setMemory(normalized.getMemory());
 +ask.getCapability().setVirtualCores(normalized.getVirtualCores());
 We did this because it directly sets the values in the original resource 
 object passed in when the AM gets allocated; without it, the AM doesn't get 
 the resource normalized correctly in the submission context. See YARN-370 
 for more details.
 I think we should find a better way of doing this long term: one, so we don't 
 have to keep adding things there when new resources are added; two, because 
 it's a bit confusing as to what it's doing and prone to someone accidentally 
 breaking it in the future again. Something closer to what Arun suggested in 
 YARN-370 would be better, but we need to make sure all the places work and get 
 some more testing on it before putting it in.
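 A minimal sketch of the distinction discussed above, using the Resource and 
 ResourceRequest records; the class and method names are illustrative, not the 
 committed SchedulerUtils code:
 {code:java}
 import org.apache.hadoop.yarn.api.records.Resource;
 import org.apache.hadoop.yarn.api.records.ResourceRequest;

 // Illustrative sketch: mutating the existing capability object updates every
 // holder of that reference, while swapping in a new Resource only updates this
 // particular request.
 public class NormalizeSketch {

   // In-place update: callers that kept a reference to ask.getCapability()
   // (e.g. the submission context) see the normalized values too.
   static void normalizeInPlace(ResourceRequest ask, Resource normalized) {
     ask.getCapability().setMemory(normalized.getMemory());
     ask.getCapability().setVirtualCores(normalized.getVirtualCores());
   }

   // Replacement: only this request points at the normalized Resource; any
   // other holder of the old capability object still sees the raw values.
   static void normalizeByReplacement(ResourceRequest ask, Resource normalized) {
     ask.setCapability(normalized);
   }
 }
 {code}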

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-541) getAllocatedContainers() is not returning all the allocated containers

2013-04-04 Thread Krishna Kishore Bonagiri (JIRA)
Krishna Kishore Bonagiri created YARN-541:
-

 Summary: getAllocatedContainers() is not returning all the 
allocated containers
 Key: YARN-541
 URL: https://issues.apache.org/jira/browse/YARN-541
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.3-alpha
 Environment: Redhat Linux 64-bit
Reporter: Krishna Kishore Bonagiri


I am running an application that was written and working well with 
hadoop-2.0.0-alpha, but when I run the same application against 2.0.3-alpha, the 
getAllocatedContainers() method called on AMResponse sometimes does not return 
all the allocated containers. For example, I request 10 containers and this 
method sometimes gives me only 9, yet when I looked at the ResourceManager log, 
the 10th container was also allocated. It happens only occasionally and works 
fine all other times. If I send one more request to the RM for the remaining 
container after it failed to give them the first time (and before releasing the 
already acquired ones), it can allocate that container. I am running only one 
application at a time, but thousands of them one after another.
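
A minimal sketch of the workaround described above; askFor() and 
heartbeatAndGetAllocated() are hypothetical stand-ins for building resource 
requests and calling allocate(), not a specific YARN client API:
{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.yarn.api.records.Container;

// Illustrative sketch: keep heartbeating until the expected number of containers
// has actually been handed to the AM, and only re-ask for the shortfall after a
// number of empty responses, as in the workaround described in the report.
public abstract class ReRequestingAllocator {

  protected abstract void askFor(int numContainers);

  protected abstract List<Container> heartbeatAndGetAllocated();

  public List<Container> acquire(int wanted) throws InterruptedException {
    List<Container> acquired = new ArrayList<Container>();
    askFor(wanted);                        // initial request for all containers
    int idleHeartbeats = 0;
    while (acquired.size() < wanted) {
      List<Container> batch = heartbeatAndGetAllocated();
      acquired.addAll(batch);              // RM may hand out fewer per heartbeat
      idleHeartbeats = batch.isEmpty() ? idleHeartbeats + 1 : 0;
      if (idleHeartbeats > 30) {
        // Workaround from the report: ask again for the shortfall before
        // releasing anything already acquired.
        askFor(wanted - acquired.size());
        idleHeartbeats = 0;
      }
      Thread.sleep(100);                   // simple pacing between heartbeats
    }
    return acquired;
  }
}
{code}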

My main worry is that even though the RM's log says that all 10 requested 
containers are allocated, the getAllocatedContainers() method is not returning 
all of them; surprisingly, it returned only 9. I never saw this kind of issue 
in the previous version, i.e. hadoop-2.0.0-alpha.

Thanks,
Kishore

 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-538) RM address DNS lookup can cause unnecessary slowness on every JHS page load

2013-04-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622138#comment-13622138
 ] 

Hudson commented on YARN-538:
-

Integrated in Hadoop-Hdfs-trunk #1363 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1363/])
YARN-538. RM address DNS lookup can cause unnecessary slowness on every JHS 
page load. (sandyr via tucu) (Revision 1464197)

 Result = FAILURE
tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1464197
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java


 RM address DNS lookup can cause unnecessary slowness on every JHS page load 
 

 Key: YARN-538
 URL: https://issues.apache.org/jira/browse/YARN-538
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.0.3-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Fix For: 2.0.5-beta

 Attachments: MAPREDUCE-5111.patch


 When I run the job history server locally, every page load takes tens of 
 seconds.  I profiled the process and discovered that all the extra time 
 was spent inside YarnConfiguration#getRMWebAppURL, trying to resolve 0.0.0.0 
 to a hostname.  When I changed my yarn.resourcemanager.address to localhost, 
 the page load times decreased drastically.
 There's no reason that we need to perform this resolution on every page load.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-101) If the heartbeat message loss, the nodestatus info of complete container will loss too.

2013-04-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622344#comment-13622344
 ] 

Hudson commented on YARN-101:
-

Integrated in Hadoop-Mapreduce-trunk #1390 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1390/])
YARN-101. Fix NodeManager heartbeat processing to not lose track of 
completed containers in case of dropped heartbeats. Contributed by Xuan Gong. 
(Revision 1464105)

 Result = SUCCESS
vinodkv : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1464105
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java


 If  the heartbeat message loss, the nodestatus info of complete container 
 will loss too.
 

 Key: YARN-101
 URL: https://issues.apache.org/jira/browse/YARN-101
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
 Environment: suse.
Reporter: xieguiming
Assignee: Xuan Gong
Priority: Minor
 Fix For: 2.0.5-beta

 Attachments: YARN-101.1.patch, YARN-101.2.patch, YARN-101.3.patch, 
 YARN-101.4.patch, YARN-101.5.patch, YARN-101.6.patch


 see the red color:
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.java
  protected void startStatusUpdater() {
    new Thread("Node Status Updater") {
      @Override
      @SuppressWarnings("unchecked")
      public void run() {
        int lastHeartBeatID = 0;
        while (!isStopped) {
          // Send heartbeat
          try {
            synchronized (heartbeatMonitor) {
              heartbeatMonitor.wait(heartBeatInterval);
            }
            {color:red}
            // Before we send the heartbeat, we get the NodeStatus,
            // whose method removes completed containers.
            NodeStatus nodeStatus = getNodeStatus();
            {color}
            nodeStatus.setResponseId(lastHeartBeatID);

            NodeHeartbeatRequest request = recordFactory
                .newRecordInstance(NodeHeartbeatRequest.class);
            request.setNodeStatus(nodeStatus);
            {color:red}
            // But if the nodeHeartbeat fails, we've already removed the
            // containers and have no way to know about it. We aren't handling
            // a nodeHeartbeat failure case here.
            HeartbeatResponse response =
                resourceTracker.nodeHeartbeat(request).getHeartbeatResponse();
            {color}
            if (response.getNodeAction() == NodeAction.SHUTDOWN) {
              LOG.info("Recieved SHUTDOWN signal from Resourcemanager as " +
                  "part of heartbeat, hence shutting down.");
              NodeStatusUpdaterImpl.this.stop();
              break;
            }
            if (response.getNodeAction() == NodeAction.REBOOT) {
              LOG.info("Node is out of sync with ResourceManager,"
                  + " hence rebooting.");
              NodeStatusUpdaterImpl.this.reboot();
              break;
            }
            lastHeartBeatID = response.getResponseId();
            List<ContainerId> containersToCleanup = response
                .getContainersToCleanupList();
            if (containersToCleanup.size() != 0) {
              dispatcher.getEventHandler().handle(
                  new CMgrCompletedContainersEvent(containersToCleanup));
            }
            List<ApplicationId> appsToCleanup =
                response.getApplicationsToCleanupList();
            // Only start tracking for keepAlive on FINISH_APP
            trackAppsForKeepAlive(appsToCleanup);
            if (appsToCleanup.size() != 0) {
              dispatcher.getEventHandler().handle(
                  new CMgrCompletedAppsEvent(appsToCleanup));
            }
          } catch (Throwable e) {
            // TODO Better error handling. Thread can die with the rest of the
            // NM still running.
            LOG.error("Caught exception in status-updater", e);
          }
        }
      }
    }.start();
  }

  private NodeStatus getNodeStatus() {
    NodeStatus nodeStatus = recordFactory.newRecordInstance(NodeStatus.class);
    nodeStatus.setNodeId(this.nodeId);
    int numActiveContainers = 0;

[jira] [Commented] (YARN-536) Remove ContainerStatus, ContainerState from Container api interface as they will not be called by the container object

2013-04-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622350#comment-13622350
 ] 

Hudson commented on YARN-536:
-

Integrated in Hadoop-Mapreduce-trunk #1390 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1390/])
YARN-536. Removed the unused objects ContainerStatus and ContainerState 
from Container, which also don't belong to the container. Contributed by Xuan 
Gong. (Revision 1464271)

 Result = SUCCESS
vinodkv : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1464271
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/Container.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/BuilderUtils.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/NodeManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestFifoScheduler.java


 Remove ContainerStatus, ContainerState from Container api interface as they 
 will not be called by the container object
 --

 Key: YARN-536
 URL: https://issues.apache.org/jira/browse/YARN-536
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Xuan Gong
Assignee: Xuan Gong
 Fix For: 2.0.5-beta

 Attachments: YARN-536.1.patch, YARN-536.2.patch


 Remove ContainerState and ContainerStatus from the Container interface. They will 
 not be called by the container object.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (YARN-527) Local filecache mkdir fails

2013-04-04 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved YARN-527.
--

Resolution: Duplicate

Closing as duplicate as per the comments above.

 Local filecache mkdir fails
 ---

 Key: YARN-527
 URL: https://issues.apache.org/jira/browse/YARN-527
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.0-alpha
 Environment: RHEL 6.3 with CDH4.1.3 Hadoop, HA with two name nodes 
 and six worker nodes.
Reporter: Knut O. Hellan
Priority: Minor
 Attachments: yarn-site.xml


 Jobs failed with no other explanation than this stack trace:
 2013-03-29 16:46:02,671 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics 
 report from attempt_1364591875320_0017_m_00_0: 
 java.io.IOException: mkdir of /disk3/yarn/local/filecache/-4230789355400878397 failed
 at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:932)
 at 
 org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143)
 at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
 at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:706)
 at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:703)
 at 
 org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2333)
 at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:703)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:147)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 Manually creating the directory worked. This behavior was common to at least 
 several nodes in the cluster.
 The situation was resolved by removing and recreating all 
 /disk?/yarn/local/filecache directories on all nodes.
 It is unclear whether Yarn struggled with the number of files or if there 
 were corrupt files in the caches. The situation was triggered by a node dying.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-398) Allow white-list and black-list of resources

2013-04-04 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated YARN-398:
---

Attachment: YARN-398.patch

I got this done on a long flight a week or two ago... needs more testing etc.

 Allow white-list and black-list of resources
 

 Key: YARN-398
 URL: https://issues.apache.org/jira/browse/YARN-398
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Arun C Murthy
Assignee: Arun C Murthy
 Attachments: YARN-398.patch


 Allow white-list and black-list of resources in scheduler api.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-392) Make it possible to schedule to specific nodes without dropping locality

2013-04-04 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622558#comment-13622558
 ] 

Arun C Murthy commented on YARN-392:


[~bikassaha] I'm against using timers for specifying locality delays - it 
doesn't make sense for a variety of reasons documented elsewhere. 



[~sandyr] I just uploaded a patch I lost track of for a week or so on YARN-398. 
Looks like we both are doing the same thing. I'm happy to repurpose one of the 
two jiras for CS while the other can do the same for FS. Makes sense? 

In my patch I called the flag 'strictLocality', which defaults to 'false'. 
That should solve the need for white-lists. Makes sense? 



I agree we should tackle black-listing separately.
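
A minimal sketch of the semantics described above; the released ResourceRequest 
API of this era has no strictLocality flag, so the class below is purely 
illustrative:
{code:java}
// Hypothetical illustration of a strict-locality (white-list) request: with
// strictLocality == true the scheduler would place containers only on the named
// resource; with false it may relax to the rack and then to "*", as today.
public class LocalityAwareRequest {
  private final String resourceName;    // a specific host, a rack, or "*"
  private final int numContainers;
  private final boolean strictLocality; // defaults to false in the proposal

  public LocalityAwareRequest(String resourceName, int numContainers,
      boolean strictLocality) {
    this.resourceName = resourceName;
    this.numContainers = numContainers;
    this.strictLocality = strictLocality;
  }

  public int getNumContainers() {
    return numContainers;
  }

  // True if the scheduler is allowed to satisfy this request at otherResourceName.
  public boolean mayRelaxTo(String otherResourceName) {
    return !strictLocality || resourceName.equals(otherResourceName);
  }
}
{code}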




 Make it possible to schedule to specific nodes without dropping locality
 

 Key: YARN-392
 URL: https://issues.apache.org/jira/browse/YARN-392
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Sandy Ryza
 Attachments: YARN-392-1.patch, YARN-392.patch


 Currently it's not possible to specify scheduling requests for specific nodes 
 and nowhere else. The RM automatically relaxes locality to rack and * and 
 assigns non-specified machines to the app.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-392) Make it possible to schedule to specific nodes without dropping locality

2013-04-04 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622561#comment-13622561
 ] 

Arun C Murthy commented on YARN-392:


To be clear, the approach I took on YARN-398 allows for the 'I want only one 
container, and only on node1 or node2' use-case.

 Make it possible to schedule to specific nodes without dropping locality
 

 Key: YARN-392
 URL: https://issues.apache.org/jira/browse/YARN-392
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Sandy Ryza
 Attachments: YARN-392-1.patch, YARN-392.patch


 Currently it's not possible to specify scheduling requests for specific nodes 
 and nowhere else. The RM automatically relaxes locality to rack and * and 
 assigns non-specified machines to the app.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-392) Make it possible to schedule to specific nodes without dropping locality

2013-04-04 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622563#comment-13622563
 ] 

Arun C Murthy commented on YARN-392:


Also, it allows for 'I want one container on any one of the following n racks' 
too.

 Make it possible to schedule to specific nodes without dropping locality
 

 Key: YARN-392
 URL: https://issues.apache.org/jira/browse/YARN-392
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Sandy Ryza
 Attachments: YARN-392-1.patch, YARN-392.patch


 Currently it's not possible to specify scheduling requests for specific nodes 
 and nowhere else. The RM automatically relaxes locality to rack and * and 
 assigns non-specified machines to the app.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-525) make CS node-locality-delay refreshable

2013-04-04 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-525:
---

Assignee: Thomas Graves

 make CS node-locality-delay refreshable
 ---

 Key: YARN-525
 URL: https://issues.apache.org/jira/browse/YARN-525
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.0.3-alpha, 0.23.7
Reporter: Thomas Graves
Assignee: Thomas Graves

 the config yarn.scheduler.capacity.node-locality-delay doesn't change when 
 you change the value in capacity_scheduler.xml and then run yarn rmadmin 
 -refreshQueues.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-392) Make it possible to schedule to specific nodes without dropping locality

2013-04-04 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622584#comment-13622584
 ] 

Bikas Saha commented on YARN-392:
-

bq. I'm against using timers for specifying locality delays - it doesn't make 
sense for a variety of reasons documented elsewhere.
Can you please point me to them?

 Make it possible to schedule to specific nodes without dropping locality
 

 Key: YARN-392
 URL: https://issues.apache.org/jira/browse/YARN-392
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Sandy Ryza
 Attachments: YARN-392-1.patch, YARN-392.patch


 Currently it's not possible to specify scheduling requests for specific nodes 
 and nowhere else. The RM automatically relaxes locality to rack and * and 
 assigns non-specified machines to the app.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-495) Change NM behavior of reboot to resync

2013-04-04 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated YARN-495:


Summary: Change NM behavior of reboot to resync  (was: Containers are not 
terminated when the NM is rebooted)

 Change NM behavior of reboot to resync
 --

 Key: YARN-495
 URL: https://issues.apache.org/jira/browse/YARN-495
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-495.1.patch, YARN-495.2.patch


 When a reboot command is sent from the RM, the node manager doesn't clean up the 
 containers while it's stopping.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-529) MR job succeeds and exits even when unregister with RM fails

2013-04-04 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated YARN-529:


Summary: MR job succeeds and exits even when unregister with RM fails  
(was: Succeeded MR job is retried by RM if finishApplicationMaster() call fails)

 MR job succeeds and exits even when unregister with RM fails
 

 Key: YARN-529
 URL: https://issues.apache.org/jira/browse/YARN-529
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He

 The MR app master will clean the staging dir if the job has already succeeded 
 and it is asked to reboot. If the finishApplicationMaster call fails, the RM 
 will consider this job unfinished and launch further attempts; those attempts 
 will fail because the staging dir has been cleaned.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-540) RM state store not cleaned if job succeeds but RM shutdown and restart-dispatcher stopped before it can process REMOVE_APP event

2013-04-04 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622681#comment-13622681
 ] 

Bikas Saha commented on YARN-540:
-

This is a known issue. The problem here is that the RM state store is 
essentially a write-ahead log. But in the application unregister/finish case, 
the application has already finished before the RM stores that fact in its 
state. So the RM by itself cannot avoid this problem. Since it's a race 
condition, we may choose not to fix it unless we see it happen often in 
practice.
The solutions that come to mind are:
1) finishApplicationMaster() blocks until the finish is stored in the store. 
This risks getting blocked on a slow or unavailable store. Also, the RM 
does a bunch of other things before an application finishes. The RM may not be 
able to remove the application from the store until all those steps are 
complete.
2) finishApplicationMaster() becomes a 2-step process in which, in the second 
step, the app waits for the RM to change the app's state to FINISHED before 
exiting (a sketch of this option follows below).
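
A minimal sketch of option 2, assuming a hypothetical 
rmClient.getApplicationState() helper rather than an actual RM protocol method:
{code:java}
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;

// Hypothetical sketch: after unregistering, the AM does not exit until the RM
// reports the application as FINISHED, i.e. the finish has been persisted.
public class WaitForFinishSketch {

  interface RmClient {
    YarnApplicationState getApplicationState(ApplicationId appId) throws Exception;
  }

  static void finishAndWait(RmClient rmClient, ApplicationId appId)
      throws Exception {
    // Step 1: the normal unregister/finish call would happen here.
    // Step 2: poll until the RM has recorded the terminal state.
    while (rmClient.getApplicationState(appId) != YarnApplicationState.FINISHED) {
      Thread.sleep(1000);
    }
    // Only now is it safe to exit and let the staging dir be cleaned up.
  }
}
{code}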

 RM state store not cleaned if job succeeds but RM shutdown and 
 restart-dispatcher stopped before it can process REMOVE_APP event
 

 Key: YARN-540
 URL: https://issues.apache.org/jira/browse/YARN-540
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He

 When a job succeeds and successfully calls finishApplicationMaster, but the RM 
 shuts down and the restart dispatcher is stopped before it can process the 
 REMOVE_APP event, the next time the RM comes back it will reload the existing 
 state files even though the job succeeded.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-534) AM max attempts is not checked when RM restart and try to recover attempts

2013-04-04 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622685#comment-13622685
 ] 

Bikas Saha commented on YARN-534:
-

Turns out that the max-attempts limit is checked when a job fails (and tries to 
launch a new attempt) and not when the new attempt is actually being launched. 
The RM, on restart, could choose to remove applications that have already hit 
the limit.

 AM max attempts is not checked when RM restart and try to recover attempts
 --

 Key: YARN-534
 URL: https://issues.apache.org/jira/browse/YARN-534
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Jian He

 Currently, AM max attempts is only checked when the current attempt fails, to 
 decide whether to create a new attempt. If the RM restarts before the 
 max attempt fails, it will not clean the state store; when the RM comes back, it 
 will retry the attempt again.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (YARN-542) Change the default AM retry value to be not one

2013-04-04 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli reassigned YARN-542:


Assignee: Vinod Kumar Vavilapalli

 Change the default AM retry value to be not one
 ---

 Key: YARN-542
 URL: https://issues.apache.org/jira/browse/YARN-542
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli

 Today, the AM max-retries is set to 1, which is a bad choice. AM max-retries 
 accounts for both AM-level failures as well as container crashes due to 
 localization issues, lost nodes, etc. To account for AM crashes due to problems 
 that are not caused by user code, mainly lost nodes, we want to give AMs some 
 retries.
 I propose we change it to at least two. We can change it to 4 to match other 
 retry-configs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-493) NodeManager job control logic flaws on Windows

2013-04-04 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-493:
---

Attachment: YARN-493.3.patch

Here is a new patch that renames the new {{Shell}} methods to 
{{appendScriptExtension}}.

Regarding trying to use {{Shell#getRunScriptCommand}} in the badSymlink test, I 
have not been able to get this to work.  The test depends on very specific 
quoting, and the conversion to absolute path inside 
{{Shell#getRunScriptCommand}} (required by other callers) interferes with this.

 NodeManager job control logic flaws on Windows
 --

 Key: YARN-493
 URL: https://issues.apache.org/jira/browse/YARN-493
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Chris Nauroth
Assignee: Chris Nauroth
 Fix For: 3.0.0

 Attachments: YARN-493.1.patch, YARN-493.2.patch, YARN-493.3.patch


 Both product and test code contain some platform-specific assumptions, such 
 as availability of bash for executing a command in a container and signals to 
 check existence of a process and terminate it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-525) make CS node-locality-delay refreshable

2013-04-04 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-525:
---

Attachment: YARN-525-branch-0.23.patch

 make CS node-locality-delay refreshable
 ---

 Key: YARN-525
 URL: https://issues.apache.org/jira/browse/YARN-525
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.0.3-alpha, 0.23.7
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-525-branch-0.23.patch, YARN-525-branch-0.23.patch


 the config yarn.scheduler.capacity.node-locality-delay doesn't change when 
 you change the value in capacity_scheduler.xml and then run yarn rmadmin 
 -refreshQueues.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-392) Make it possible to schedule to specific nodes without dropping locality

2013-04-04 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622765#comment-13622765
 ] 

Sandy Ryza commented on YARN-392:
-

[~acmurthy], that makes sense to me.  We can use this one for FS and YARN-398 
for CS?  Do you think this should go into FIFO as well?
[~bikassaha], if we went with your proposal, would it not make sense to go with 
the convention used in the FS/CS already, in which the locality delay is a 
fraction of the cluster size?  In your proposal, if I want a node-local 
container at node1, would I specify the locality delay on the request for node1 
or on the request for the rack that node1 is on?

 Make it possible to schedule to specific nodes without dropping locality
 

 Key: YARN-392
 URL: https://issues.apache.org/jira/browse/YARN-392
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Sandy Ryza
 Attachments: YARN-392-1.patch, YARN-392.patch


 Currently it's not possible to specify scheduling requests for specific nodes 
 and nowhere else. The RM automatically relaxes locality to rack and * and 
 assigns non-specified machines to the app.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-493) NodeManager job control logic flaws on Windows

2013-04-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622772#comment-13622772
 ] 

Hadoop QA commented on YARN-493:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12577046/YARN-493.3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/670//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/670//console

This message is automatically generated.

 NodeManager job control logic flaws on Windows
 --

 Key: YARN-493
 URL: https://issues.apache.org/jira/browse/YARN-493
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Chris Nauroth
Assignee: Chris Nauroth
 Fix For: 3.0.0

 Attachments: YARN-493.1.patch, YARN-493.2.patch, YARN-493.3.patch


 Both product and test code contain some platform-specific assumptions, such 
 as availability of bash for executing a command in a container and signals to 
 check existence of a process and terminate it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-525) make CS node-locality-delay refreshable

2013-04-04 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-525:
---

Attachment: YARN-525.patch

Added a unit test and included patches for trunk and branch-2.

 make CS node-locality-delay refreshable
 ---

 Key: YARN-525
 URL: https://issues.apache.org/jira/browse/YARN-525
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.0.3-alpha, 0.23.7
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-525-branch-0.23.patch, YARN-525-branch-0.23.patch, 
 YARN-525.patch


 the config yarn.scheduler.capacity.node-locality-delay doesn't change when 
 you change the value in capacity_scheduler.xml and then run yarn rmadmin 
 -refreshQueues.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-479) NM retry behavior for connection to RM should be similar for lost heartbeats

2013-04-04 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622783#comment-13622783
 ] 

Bikas Saha commented on YARN-479:
-

I don't see the value of waitForever if we can specify a large value for the 
retry interval (1 day or so).

Not sure what retryCounts is buying us.

What is the intention of catching and rethrowing the exception without doing 
anything else?
{code}
+  } catch (YarnException e) {
+//catch and throw the exception if tried MAX wait time to connect RM
+throw e;
{code}

There is a finally block which will make the code sleep for longer than 
necessary before exiting. This becomes important because admins might kill the 
NM after waiting a few seconds for it to exit; in that time the NM has to do a 
bunch of cleanup tasks, and this extra sleep does not help.

Unrelated to this change, but does the NM really shut down when the heartbeat 
fails right now? It looks like the thread just keeps running. After this change 
it looks like the heartbeat thread will just exit. That does not by itself mean 
that the NM will shut down, does it?
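To make these review points concrete, here is a minimal sketch of a retry loop 
that avoids both the do-nothing catch/rethrow and a sleep in a finally block on 
the way out (all names below are illustrative placeholders, not the actual 
NodeStatusUpdater code):
{code}
// Illustrative sketch only, not the actual NodeStatusUpdater code.
void registerWithRetries() throws YarnException, InterruptedException {
  final long retryIntervalMs = 10 * 1000L;        // a large interval can stand in for "waitForever"
  final long deadline = System.currentTimeMillis() + 15 * 60 * 1000L;
  while (true) {
    try {
      registerWithRM();                           // placeholder for the call that may throw YarnException
      return;                                     // success: return immediately, no trailing sleep
    } catch (YarnException e) {
      if (System.currentTimeMillis() >= deadline) {
        throw e;                                  // out of time: propagate directly, no wrapper catch needed
      }
    }
    Thread.sleep(retryIntervalMs);                // sleep only because another attempt follows
  }
}
{code}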

 NM retry behavior for connection to RM should be similar for lost heartbeats
 

 Key: YARN-479
 URL: https://issues.apache.org/jira/browse/YARN-479
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Jian He
 Attachments: YARN-479.1.patch, YARN-479.2.patch, YARN-479.3.patch, 
 YARN-479.4.patch, YARN-479.5.patch


 Regardless of connection loss at the start or at an intermediate point, NM's 
 retry behavior to the RM should follow the same flow. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-196) Nodemanager should be more robust in handling connection failure to ResourceManager when a cluster is started

2013-04-04 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622785#comment-13622785
 ] 

Bikas Saha commented on YARN-196:
-

There is a finally block which will make the code sleep for longer than 
necessary before exiting. This becomes important because admins might kill the 
NM after waiting a few seconds for it to exit; in that time the NM has to do a 
bunch of cleanup tasks, and this extra sleep does not help.

 Nodemanager should be more robust in handling connection failure  to 
 ResourceManager when a cluster is started
 --

 Key: YARN-196
 URL: https://issues.apache.org/jira/browse/YARN-196
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.0.0, 2.0.0-alpha
Reporter: Ramgopal N
Assignee: Xuan Gong
 Fix For: 2.0.5-beta

 Attachments: MAPREDUCE-3676.patch, YARN-196.10.patch, 
 YARN-196.11.patch, YARN-196.12.1.patch, YARN-196.12.patch, YARN-196.1.patch, 
 YARN-196.2.patch, YARN-196.3.patch, YARN-196.4.patch, YARN-196.5.patch, 
 YARN-196.6.patch, YARN-196.7.patch, YARN-196.8.patch, YARN-196.9.patch


 If the NM is started before the RM, the NM shuts down with the 
 following error:
 {code}
 ERROR org.apache.hadoop.yarn.service.CompositeService: Error starting 
 services org.apache.hadoop.yarn.server.nodemanager.NodeManager
 org.apache.avro.AvroRuntimeException: 
 java.lang.reflect.UndeclaredThrowableException
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:149)
   at 
 org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:68)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.start(NodeManager.java:167)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:242)
 Caused by: java.lang.reflect.UndeclaredThrowableException
   at 
 org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:66)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:182)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:145)
   ... 3 more
 Caused by: com.google.protobuf.ServiceException: java.net.ConnectException: 
 Call From HOST-10-18-52-230/10.18.52.230 to HOST-10-18-52-250:8025 failed on 
 connection exception: java.net.ConnectException: Connection refused; For more 
 details see:  http://wiki.apache.org/hadoop/ConnectionRefused
   at 
 org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:131)
   at $Proxy23.registerNodeManager(Unknown Source)
   at 
 org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:59)
   ... 5 more
 Caused by: java.net.ConnectException: Call From 
 HOST-10-18-52-230/10.18.52.230 to HOST-10-18-52-250:8025 failed on connection 
 exception: java.net.ConnectException: Connection refused; For more details 
 see:  http://wiki.apache.org/hadoop/ConnectionRefused
   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:857)
   at org.apache.hadoop.ipc.Client.call(Client.java:1141)
   at org.apache.hadoop.ipc.Client.call(Client.java:1100)
   at 
 org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:128)
   ... 7 more
 Caused by: java.net.ConnectException: Connection refused
   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
   at 
 sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
   at 
 org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:659)
   at 
 org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:469)
   at 
 org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:563)
   at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:211)
   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1247)
   at org.apache.hadoop.ipc.Client.call(Client.java:1117)
   ... 9 more
 2012-01-16 15:04:13,336 WARN org.apache.hadoop.yarn.event.AsyncDispatcher: 
 AsyncDispatcher thread interrupted
 java.lang.InterruptedException
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1899)
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1934)
   

[jira] [Commented] (YARN-525) make CS node-locality-delay refreshable

2013-04-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622819#comment-13622819
 ] 

Hadoop QA commented on YARN-525:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12577058/YARN-525.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/671//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/671//console

This message is automatically generated.

 make CS node-locality-delay refreshable
 ---

 Key: YARN-525
 URL: https://issues.apache.org/jira/browse/YARN-525
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.0.3-alpha, 0.23.7
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-525-branch-0.23.patch, YARN-525-branch-0.23.patch, 
 YARN-525.patch


 the config yarn.scheduler.capacity.node-locality-delay doesn't change when 
 you change the value in capacity_scheduler.xml and then run yarn rmadmin 
 -refreshQueues.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-470) Support a way to disable resource monitoring on the NodeManager

2013-04-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622828#comment-13622828
 ] 

Hudson commented on YARN-470:
-

Integrated in Hadoop-trunk-Commit #3565 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/3565/])
Updated CHANGES.txt to reflect YARN-470 being merged into 
branch-2.0.4-alpha. (Revision 1464772)

 Result = SUCCESS
sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1464772
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt


 Support a way to disable resource monitoring on the NodeManager
 ---

 Key: YARN-470
 URL: https://issues.apache.org/jira/browse/YARN-470
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Hitesh Shah
Assignee: Siddharth Seth
  Labels: usability
 Fix For: 2.0.4-alpha

 Attachments: YARN-470_2.txt, YARN-470.txt


 Currently, the memory management monitor's check is disabled when the maxMem 
 is set to -1. However, the maxMem is also sent to the RM when the NM 
 registers with it (to define the max limit of allocatable resources). 
 We need an explicit flag to disable monitoring to avoid the problems caused 
 by the overloading of the max memory value.
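 As a sketch of what such an explicit switch could look like in yarn-site.xml 
 (the property names below are the memory-check flags from later Hadoop 2.x 
 releases and are shown only as an assumption, not taken from this JIRA):
 {code:xml}
 <!-- Assumed/illustrative: disable the NM memory checks without overloading maxMem -->
 <property>
   <name>yarn.nodemanager.pmem-check-enabled</name>
   <value>false</value>
 </property>
 <property>
   <name>yarn.nodemanager.vmem-check-enabled</name>
   <value>false</value>
 </property>
 {code}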

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-532) RMAdminProtocolPBClientImpl should implement Closeable

2013-04-04 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated YARN-532:


Attachment: YARN-532.txt

LocalizationProtocol now implements Closeable as well.

 RMAdminProtocolPBClientImpl should implement Closeable
 --

 Key: YARN-532
 URL: https://issues.apache.org/jira/browse/YARN-532
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.3-alpha
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Attachments: YARN-532.txt, YARN-532.txt


 Required for RPC.stopProxy to work. Already done in most of the other 
 protocols. (MAPREDUCE-5117 addressing the one other protocol missing this)
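 The shape involved is roughly the following (an assumed sketch, not the 
 attached patch; the proxy field stands in for the generated PB proxy type):
 {code}
 import java.io.Closeable;
 import org.apache.hadoop.ipc.RPC;

 // Sketch: implementing Closeable lets callers release the underlying RPC proxy.
 public class ExampleProtocolPBClientImpl implements Closeable {
   private final Object proxy;          // stands in for the generated PB proxy

   public ExampleProtocolPBClientImpl(Object proxy) {
     this.proxy = proxy;
   }

   @Override
   public void close() {
     RPC.stopProxy(proxy);              // stops the underlying RPC proxy
   }
 }
 {code}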

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-99) Jobs fail during resource localization when private distributed-cache hits unix directory limits

2013-04-04 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-99:


Issue Type: Sub-task  (was: Bug)
Parent: YARN-543

 Jobs fail during resource localization when private distributed-cache hits 
 unix directory limits
 

 Key: YARN-99
 URL: https://issues.apache.org/jira/browse/YARN-99
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 3.0.0, 2.0.0-alpha
Reporter: Devaraj K
Assignee: Omkar Vinit Joshi
 Attachments: yarn-99-20130324.patch, yarn-99-20130403.1.patch, 
 yarn-99-20130403.patch


 If we have multiple jobs which use the distributed cache with small files, 
 the per-directory limit is reached before the cache size limit, and it then 
 fails to create any more directories in the file cache. The jobs start 
 failing with the below exception.
 {code:xml}
 java.io.IOException: mkdir of 
 /tmp/nm-local-dir/usercache/root/filecache/1701886847734194975 failed
   at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:909)
   at 
 org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143)
   at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:706)
   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:703)
   at 
 org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2325)
   at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:703)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:147)
   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
 {code}
 We should have a mechanism to clean the cache files when the number of 
 directories crosses a specified limit, just as we do for the cache size.
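 One general way to avoid the per-directory limit (a sketch of the idea only, 
 not the patch attached here) is to spread cache entries over a small directory 
 hierarchy instead of one flat filecache directory:
 {code}
 import java.io.File;

 // Sketch only: map a cache id to a nested path so that no single directory
 // accumulates more entries than a filesystem's per-directory limit allows.
 public class HierarchicalCachePath {
   private static final int FANOUT = 36;          // assumed fan-out per level

   public static File pathFor(File cacheRoot, long resourceId, int levels) {
     File dir = cacheRoot;
     long id = resourceId < 0 ? -(resourceId + 1) : resourceId;
     for (int i = 0; i < levels; i++) {           // e.g. levels = 2 keeps directories small
       dir = new File(dir, Long.toString(id % FANOUT, FANOUT));
       id /= FANOUT;
     }
     return new File(dir, Long.toString(resourceId));
   }

   public static void main(String[] args) {
     System.out.println(pathFor(
         new File("/tmp/nm-local-dir/usercache/root/filecache"),
         1701886847734194975L, 2));
   }
 }
 {code}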

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-539) LocalizedResources are leaked in memory in case resource localization fails

2013-04-04 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-539:
-

Issue Type: Sub-task  (was: Bug)
Parent: YARN-543

 LocalizedResources are leaked in memory in case resource localization fails
 ---

 Key: YARN-539
 URL: https://issues.apache.org/jira/browse/YARN-539
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi

 If resource localization fails, the resource remains in memory and is
 1) either cleaned up the next time cache cleanup runs and there is a space 
 crunch (if sufficient space is available in the cache, it will remain in 
 memory), or
 2) reused if a LocalizationRequest comes in again for the same resource.
 I think that when resource localization fails, that event should be sent to 
 the LocalResourceTracker, which will then remove it from its cache.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-543) [Umbrella] NodeManager localization related issues

2013-04-04 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-543:
-

Component/s: nodemanager

 [Umbrella] NodeManager localization related issues
 --

 Key: YARN-543
 URL: https://issues.apache.org/jira/browse/YARN-543
 Project: Hadoop YARN
  Issue Type: Task
  Components: nodemanager
Reporter: Vinod Kumar Vavilapalli

 Seeing a bunch of localization related issues being worked on, this is the 
 tracking ticket.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-544) Failed resource localization might introduce a race condition.

2013-04-04 Thread Omkar Vinit Joshi (JIRA)
Omkar Vinit Joshi created YARN-544:
--

 Summary: Failed resource localization might introduce a race 
condition.
 Key: YARN-544
 URL: https://issues.apache.org/jira/browse/YARN-544
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi


When resource localization fails [Public localizer / LocalizerRunner (Private)]
it sends a ContainerResourceFailedEvent to the containers, which then send
ResourceReleaseEvents to the failed resource. In the end, when the
LocalizedResource's ref count drops to 0, its state is changed from DOWNLOADING
to INIT.
Now if a resource gets a ResourceRequestEvent between the
ContainerResourceFailedEvent and the last ResourceReleaseEvent, then the ref
count for that resource will not drop to 0 and the container which sent the
ResourceRequestEvent will keep waiting.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-532) RMAdminProtocolPBClientImpl should implement Closeable

2013-04-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622923#comment-13622923
 ] 

Hadoop QA commented on YARN-532:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12577088/YARN-532.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/672//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/672//console

This message is automatically generated.

 RMAdminProtocolPBClientImpl should implement Closeable
 --

 Key: YARN-532
 URL: https://issues.apache.org/jira/browse/YARN-532
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.3-alpha
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Attachments: YARN-532.txt, YARN-532.txt


 Required for RPC.stopProxy to work. Already done in most of the other 
 protocols. (MAPREDUCE-5117 addressing the one other protocol missing this)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-479) NM retry behavior for connection to RM should be similar for lost heartbeats

2013-04-04 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-479:
-

Attachment: YARN-479.6.patch

 NM retry behavior for connection to RM should be similar for lost heartbeats
 

 Key: YARN-479
 URL: https://issues.apache.org/jira/browse/YARN-479
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Jian He
 Attachments: YARN-479.1.patch, YARN-479.2.patch, YARN-479.3.patch, 
 YARN-479.4.patch, YARN-479.5.patch, YARN-479.6.patch


 Regardless of connection loss at the start or at an intermediate point, NM's 
 retry behavior to the RM should follow the same flow. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-479) NM retry behavior for connection to RM should be similar for lost heartbeats

2013-04-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622967#comment-13622967
 ] 

Hadoop QA commented on YARN-479:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12577107/YARN-479.6.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/673//console

This message is automatically generated.

 NM retry behavior for connection to RM should be similar for lost heartbeats
 

 Key: YARN-479
 URL: https://issues.apache.org/jira/browse/YARN-479
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Jian He
 Attachments: YARN-479.1.patch, YARN-479.2.patch, YARN-479.3.patch, 
 YARN-479.4.patch, YARN-479.5.patch, YARN-479.6.patch


 Regardless of connection loss at the start or at an intermediate point, NM's 
 retry behavior to the RM should follow the same flow. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-544) Failed resource localization might introduce a race condition.

2013-04-04 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622969#comment-13622969
 ] 

Vinod Kumar Vavilapalli commented on YARN-544:
--

When you get around to doing this, please write a test case first to reproduce 
it. Thanks.

 Failed resource localization might introduce a race condition.
 --

 Key: YARN-544
 URL: https://issues.apache.org/jira/browse/YARN-544
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi

 When resource localization fails [Public localizer / 
 LocalizerRunner(Private)] it sends ContainerResourceFailedEvent to the 
 containers which then sends ResourceReleaseEvent to the failed resource. In 
 the end when LocalizedResource's ref count drops to 0 its state is changed 
 from DOWNLOADING to INIT.
 Now if a Resource gets ResourceRequestEvent in between 
 ContainerResourceFailedEvent and last ResourceReleaseEvent then for that 
 resource ref count will not drop to 0 and the container which sent the 
 ResourceRequestEvent will keep waiting.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-544) Failed resource localization might introduce a race condition.

2013-04-04 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-544:
-

Issue Type: Sub-task  (was: Bug)
Parent: YARN-543

 Failed resource localization might introduce a race condition.
 --

 Key: YARN-544
 URL: https://issues.apache.org/jira/browse/YARN-544
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi

 When resource localization fails [Public localizer / 
 LocalizerRunner(Private)] it sends ContainerResourceFailedEvent to the 
 containers which then sends ResourceReleaseEvent to the failed resource. In 
 the end when LocalizedResource's ref count drops to 0 its state is changed 
 from DOWNLOADING to INIT.
 Now if a Resource gets ResourceRequestEvent in between 
 ContainerResourceFailedEvent and last ResourceReleaseEvent then for that 
 resource ref count will not drop to 0 and the container which sent the 
 ResourceRequestEvent will keep waiting.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-532) RMAdminProtocolPBClientImpl should implement Closeable

2013-04-04 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622972#comment-13622972
 ] 

Vinod Kumar Vavilapalli commented on YARN-532:
--

Looks good, checking it in.

 RMAdminProtocolPBClientImpl should implement Closeable
 --

 Key: YARN-532
 URL: https://issues.apache.org/jira/browse/YARN-532
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.3-alpha
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Attachments: YARN-532.txt, YARN-532.txt


 Required for RPC.stopProxy to work. Already done in most of the other 
 protocols. (MAPREDUCE-5117 addressing the one other protocol missing this)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-532) RMAdminProtocolPBClientImpl should implement Closeable

2013-04-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622993#comment-13622993
 ] 

Hudson commented on YARN-532:
-

Integrated in Hadoop-trunk-Commit #3567 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/3567/])
YARN-532. Change RMAdmin and Localization client protocol PB 
implementations to implement closeable so that they can be stopped when needed 
via RPC.stopProxy(). Contributed by Siddharth Seth. (Revision 1464788)

 Result = SUCCESS
vinodkv : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1464788
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/impl/pb/client/RMAdminProtocolPBClientImpl.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/impl/pb/client/LocalizationProtocolPBClientImpl.java


 RMAdminProtocolPBClientImpl should implement Closeable
 --

 Key: YARN-532
 URL: https://issues.apache.org/jira/browse/YARN-532
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.3-alpha
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Fix For: 2.0.5-beta

 Attachments: YARN-532.txt, YARN-532.txt


 Required for RPC.stopProxy to work. Already done in most of the other 
 protocols. (MAPREDUCE-5117 addressing the one other protocol missing this)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-493) NodeManager job control logic flaws on Windows

2013-04-04 Thread Ivan Mitic (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622995#comment-13622995
 ] 

Ivan Mitic commented on YARN-493:
-

+1, latest patch looks good to me, thanks Chris

 NodeManager job control logic flaws on Windows
 --

 Key: YARN-493
 URL: https://issues.apache.org/jira/browse/YARN-493
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Chris Nauroth
Assignee: Chris Nauroth
 Fix For: 3.0.0

 Attachments: YARN-493.1.patch, YARN-493.2.patch, YARN-493.3.patch


 Both product and test code contain some platform-specific assumptions, such 
 as availability of bash for executing a command in a container and signals to 
 check existence of a process and terminate it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-539) LocalizedResources are leaked in memory in case resource localization fails

2013-04-04 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13623082#comment-13623082
 ] 

Omkar Vinit Joshi commented on YARN-539:


At present the flow of events when resource localization fails is as follows:
* When resource localization fails (public localizer / LocalizerRunner
(private)), it sends a ContainerResourceFailedEvent to the containers, which
then send ResourceReleaseEvents to the failed resource. In the end, when the
LocalizedResource's ref count drops to 0, its state is changed from DOWNLOADING
to INIT.

Because of this, the resource may stay in memory (a leak in the
ResourceLocalizationTracker), and it may also introduce a race condition
([YARN-544|https://issues.apache.org/jira/browse/YARN-544]).

The proposed solution is:
* When resource localization fails, a resource-localization-failed event
(ResourceFailedEvent) is sent to the LocalResourcesTrackerImpl. The tracker
will remove this localized resource from its cache and will then pass the
event on to the LocalizedResource, which will notify all the containers that
were waiting for this resource. The containers will no longer send an
additional ResourceReleaseEvent.
* To keep the flow the same for success as well as failure, the
localization-successful event will also be sent to the LocalizedResource via
the LocalResourcesTrackerImpl.
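A toy sketch of the proposed routing (the class and event names below are
simplified stand-ins, not the real NodeManager types):
{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for the proposed tracker behaviour, not the real NM code.
class ToyResourceTracker {
  private final Map<String, Set<String>> waiters = new HashMap<String, Set<String>>();

  // A container asks for a resource; it is remembered as a waiter.
  void request(String resource, String container) {
    Set<String> set = waiters.get(resource);
    if (set == null) {
      set = new HashSet<String>();
      waiters.put(resource, set);
    }
    set.add(container);
  }

  // Proposed flow: the failure event reaches the tracker first, which drops the
  // resource from its cache and then notifies every waiting container, so the
  // containers never have to send a separate release event.
  void localizationFailed(String resource, Exception cause) {
    Set<String> set = waiters.remove(resource);
    if (set != null) {
      for (String container : set) {
        System.out.println("notify " + container + ": localization of "
            + resource + " failed: " + cause.getMessage());
      }
    }
  }
}
{code}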

 LocalizedResources are leaked in memory in case resource localization fails
 ---

 Key: YARN-539
 URL: https://issues.apache.org/jira/browse/YARN-539
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi

 If resource localization fails then resource remains in memory and is
 1) Either cleaned up when next time cache cleanup runs and there is space 
 crunch. (If sufficient space in cache is available then it will remain in 
 memory).
 2) reused if LocalizationRequest comes again for the same resource.
 I think when resource localization fails then that event should be sent to 
 LocalResourceTracker which will then remove it from its cache.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-493) NodeManager job control logic flaws on Windows

2013-04-04 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13623084#comment-13623084
 ] 

Chris Nauroth commented on YARN-493:


Thank you for the reviews, Ivan!

 NodeManager job control logic flaws on Windows
 --

 Key: YARN-493
 URL: https://issues.apache.org/jira/browse/YARN-493
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Chris Nauroth
Assignee: Chris Nauroth
 Fix For: 3.0.0

 Attachments: YARN-493.1.patch, YARN-493.2.patch, YARN-493.3.patch


 Both product and test code contain some platform-specific assumptions, such 
 as availability of bash for executing a command in a container and signals to 
 check existence of a process and terminate it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-493) NodeManager job control logic flaws on Windows

2013-04-04 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13623088#comment-13623088
 ] 

Vinod Kumar Vavilapalli commented on YARN-493:
--

Looking at this for final review/commit.

 NodeManager job control logic flaws on Windows
 --

 Key: YARN-493
 URL: https://issues.apache.org/jira/browse/YARN-493
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Chris Nauroth
Assignee: Chris Nauroth
 Fix For: 3.0.0

 Attachments: YARN-493.1.patch, YARN-493.2.patch, YARN-493.3.patch


 Both product and test code contain some platform-specific assumptions, such 
 as availability of bash for executing a command in a container and signals to 
 check existence of a process and terminate it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-479) NM retry behavior for connection to RM should be similar for lost heartbeats

2013-04-04 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-479:
-

Attachment: YARN-479.7.patch

fix conflicts with YARN-101

 NM retry behavior for connection to RM should be similar for lost heartbeats
 

 Key: YARN-479
 URL: https://issues.apache.org/jira/browse/YARN-479
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Jian He
 Attachments: YARN-479.1.patch, YARN-479.2.patch, YARN-479.3.patch, 
 YARN-479.4.patch, YARN-479.5.patch, YARN-479.6.patch, YARN-479.7.patch


 Regardless of connection loss at the start or at an intermediate point, NM's 
 retry behavior to the RM should follow the same flow. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-479) NM retry behavior for connection to RM should be similar for lost heartbeats

2013-04-04 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-479:
-

Attachment: YARN-479.8.patch

Added a test case that the NodeStatusUpdater will retry a fixed number of 
times and eventually send SHUTDOWN to the NM.

 NM retry behavior for connection to RM should be similar for lost heartbeats
 

 Key: YARN-479
 URL: https://issues.apache.org/jira/browse/YARN-479
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Jian He
 Attachments: YARN-479.1.patch, YARN-479.2.patch, YARN-479.3.patch, 
 YARN-479.4.patch, YARN-479.5.patch, YARN-479.6.patch, YARN-479.7.patch, 
 YARN-479.8.patch


 Regardless of connection loss at the start or at an intermediate point, NM's 
 retry behavior to the RM should follow the same flow. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-479) NM retry behavior for connection to RM should be similar for lost heartbeats

2013-04-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13623294#comment-13623294
 ] 

Hadoop QA commented on YARN-479:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12577136/YARN-479.8.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/675//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/675//console

This message is automatically generated.

 NM retry behavior for connection to RM should be similar for lost heartbeats
 

 Key: YARN-479
 URL: https://issues.apache.org/jira/browse/YARN-479
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Jian He
 Attachments: YARN-479.1.patch, YARN-479.2.patch, YARN-479.3.patch, 
 YARN-479.4.patch, YARN-479.5.patch, YARN-479.6.patch, YARN-479.7.patch, 
 YARN-479.8.patch


 Regardless of connection loss at the start or at an intermediate point, NM's 
 retry behavior to the RM should follow the same flow. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-157) The option shell_command and shell_script have conflict

2013-04-04 Thread rainy Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

rainy Yu updated YARN-157:
--

Attachment: shell_script.sh
YARN-157.patch

Added a unit test. Thanks to Vinod Kumar Vavilapalli for the help.

 The option shell_command and shell_script have conflict
 ---

 Key: YARN-157
 URL: https://issues.apache.org/jira/browse/YARN-157
 Project: Hadoop YARN
  Issue Type: Bug
  Components: applications/distributed-shell
Affects Versions: 2.0.1-alpha
Reporter: Li Ming
Assignee: rainy Yu
  Labels: patch
 Attachments: hadoop_yarn.patch, shell_script.sh, YARN-157.patch


 The DistributedShell has an option, shell_script, that lets the user specify 
 a shell script to be executed in the containers. The issue is that the 
 shell_command option is mandatory, so if both options are set, every 
 container executor ends with exitCode=1. This is because DistributedShell 
 executes the shell_command and the shell_script together. For example, if 
 shell_command is 'date', then the final command executed in the container is 
 date `ExecShellScript.sh`, so the date command treats the output of 
 ExecShellScript.sh as its parameter and there is an error. 
 To solve this, DistributedShell should not use the value of the shell_command 
 option when the shell_script option is set, and the shell_command option 
 should not be mandatory. 
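 A minimal sketch of the suggested behaviour (this is not the DistributedShell 
 code itself; only the option semantics are taken from this issue):
 {code}
 // Sketch only: when a script is provided, run just the localized script and
 // ignore shell_command instead of concatenating the two.
 static String buildContainerCommand(String shellCommand, String shellScriptPath) {
   if (shellScriptPath != null && !shellScriptPath.isEmpty()) {
     return "/bin/bash " + shellScriptPath;       // run only the script
   }
   if (shellCommand != null && !shellCommand.isEmpty()) {
     return shellCommand;                         // no script: run the plain command
   }
   throw new IllegalArgumentException(
       "either shell_command or shell_script must be specified");
 }
 {code}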

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-54) AggregatedLogFormat should be marked Private / Unstable

2013-04-04 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-54?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-54:


Issue Type: Sub-task  (was: Bug)
Parent: YARN-386

 AggregatedLogFormat should be marked Private / Unstable
 ---

 Key: YARN-54
 URL: https://issues.apache.org/jira/browse/YARN-54
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.0.0-alpha
Reporter: Jason Lowe
Assignee: Siddharth Seth
Priority: Trivial
 Attachments: YARN54.txt


 AggregatedLogFormat is still in a state of flux, so we should mark it as 
 Private / Unstable for clarity.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-547) New resource localization is tried even when Localized Resource is in DOWNLOADING state

2013-04-04 Thread Omkar Vinit Joshi (JIRA)
Omkar Vinit Joshi created YARN-547:
--

 Summary: New resource localization is tried even when Localized 
Resource is in DOWNLOADING state
 Key: YARN-547
 URL: https://issues.apache.org/jira/browse/YARN-547
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi


At present, when multiple containers request a localized resource:
1) If the resource is not present, it is first created and resource
localization starts (the LocalizedResource is in the DOWNLOADING state).
2) If multiple ResourceRequestEvents come in while it is in this state,
ResourceLocalizationEvents are fired for all of them.

Most of the time this does not result in a duplicate resource download, but
there is a race condition present there.
Location: ResourceLocalizationService.addResource, where the request is added
to the attempts even though an event already exists.

The root cause is the presence of the FetchResourceTransition on receiving a
ResourceRequestEvent in the DOWNLOADING state.
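A rough sketch of the intended state handling (the names and fields below are
placeholders, not the actual LocalizedResource state machine):
{code}
// Placeholder sketch, not the real LocalizedResource transitions: in the
// DOWNLOADING state a new request should only register another waiter and
// must not trigger a second fetch.
synchronized void handleRequest(String container) {
  switch (state) {
    case INIT:
      state = State.DOWNLOADING;
      waiters.add(container);
      startDownload();             // fire the localization exactly once
      break;
    case DOWNLOADING:
      waiters.add(container);      // no FetchResourceTransition here
      break;
    case LOCALIZED:
      notifyAvailable(container);  // already on disk, just hand it out
      break;
  }
}
{code}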

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira