[jira] [Commented] (YARN-527) Local filecache mkdir fails
[ https://issues.apache.org/jira/browse/YARN-527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621878#comment-13621878 ] Knut O. Hellan commented on YARN-527: - Yes, this is a duplicate of YARN-467 so you may close it. We will add cronjobs to delete old directories as a temporary workaround until we can test 2.0.5-beta. Thanks! Local filecache mkdir fails --- Key: YARN-527 URL: https://issues.apache.org/jira/browse/YARN-527 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.0-alpha Environment: RHEL 6.3 with CDH4.1.3 Hadoop, HA with two name nodes and six worker nodes. Reporter: Knut O. Hellan Priority: Minor Attachments: yarn-site.xml Jobs failed with no other explanation than this stack trace: 2013-03-29 16:46:02,671 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1364591875320_0017_m_00_0: java.io.IOException: mkdir of /disk3/yarn/local/filecache/-4230789355400878397 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:932) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:706) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:703) at org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2333) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:703) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:147) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Manually creating the directory worked. This behavior was common to at least several nodes in the cluster. The situation was resolved by removing and recreating all /disk?/yarn/local/filecache directories on all nodes. It is unclear whether Yarn struggled with the number of files or if there were corrupt files in the caches. The situation was triggered by a node dying. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-538) RM address DNS lookup can cause unnecessary slowness on every JHS page load
[ https://issues.apache.org/jira/browse/YARN-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622012#comment-13622012 ] Hudson commented on YARN-538: - Integrated in Hadoop-Yarn-trunk #174 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/174/]) YARN-538. RM address DNS lookup can cause unnecessary slowness on every JHS page load. (sandyr via tucu) (Revision 1464197) Result = SUCCESS tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1464197 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java RM address DNS lookup can cause unnecessary slowness on every JHS page load Key: YARN-538 URL: https://issues.apache.org/jira/browse/YARN-538 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 2.0.5-beta Attachments: MAPREDUCE-5111.patch When I run the job history server locally, every page load takes in the 10s of seconds. I profiled the process and discovered that all the extra time was spent inside YarnConfiguration#getRMWebAppURL, trying to resolve 0.0.0.0 to a hostname. When I changed my yarn.resourcemanager.address to localhost, the page load times decreased drastically. There's no reason that we need to perform this resolution on every page load. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
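As an illustration of the kind of fix the report points at, here is a minimal sketch that resolves the wildcard RM web-app address once and reuses the result, instead of repeating the reverse-DNS lookup on every JHS page render. The class and method names are assumptions for this sketch and do not reflect the actual YarnConfiguration#getRMWebAppURL code.

{code}
import java.net.InetAddress;
import java.net.UnknownHostException;

public final class RmWebAppUrlCache {
  // Resolved once and reused; avoids a DNS lookup on every page render.
  private static volatile String cachedUrl;

  public static String getRmWebAppUrl(String configuredHost, int port) {
    String url = cachedUrl;
    if (url == null) {
      String host = configuredHost;
      try {
        // Only resolve when the configured address is the bind-all wildcard.
        if ("0.0.0.0".equals(host)) {
          host = InetAddress.getLocalHost().getCanonicalHostName();
        }
      } catch (UnknownHostException e) {
        // Fall back to the configured value rather than failing the page load.
      }
      url = "http://" + host + ":" + port;
      cachedUrl = url;
    }
    return url;
  }
}
{code}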
[jira] [Commented] (YARN-516) TestContainerLocalizer.testContainerLocalizerMain is failing
[ https://issues.apache.org/jira/browse/YARN-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622015#comment-13622015 ] Hudson commented on YARN-516: - Integrated in Hadoop-Yarn-trunk #174 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/174/]) Revert YARN-516 per HADOOP-9357. (Revision 1464181) Result = SUCCESS eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1464181 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestContainerLocalizer.java TestContainerLocalizer.testContainerLocalizerMain is failing Key: YARN-516 URL: https://issues.apache.org/jira/browse/YARN-516 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Andrew Wang Fix For: 2.0.5-beta Attachments: YARN-516.txt -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-101) If the heartbeat message loss, the nodestatus info of complete container will loss too.
[ https://issues.apache.org/jira/browse/YARN-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622023#comment-13622023 ] Hudson commented on YARN-101: - Integrated in Hadoop-Yarn-trunk #174 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/174/]) YARN-101. Fix NodeManager heartbeat processing to not lose track of completed containers in case of dropped heartbeats. Contributed by Xuan Gong. (Revision 1464105) Result = SUCCESS vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1464105 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java If the heartbeat message loss, the nodestatus info of complete container will loss too. Key: YARN-101 URL: https://issues.apache.org/jira/browse/YARN-101 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Environment: suse. Reporter: xieguiming Assignee: Xuan Gong Priority: Minor Fix For: 2.0.5-beta Attachments: YARN-101.1.patch, YARN-101.2.patch, YARN-101.3.patch, YARN-101.4.patch, YARN-101.5.patch, YARN-101.6.patch see the red color: org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.java protected void startStatusUpdater() { new Thread("Node Status Updater") { @Override @SuppressWarnings("unchecked") public void run() { int lastHeartBeatID = 0; while (!isStopped) { // Send heartbeat try { synchronized (heartbeatMonitor) { heartbeatMonitor.wait(heartBeatInterval); } {color:red} // Before we send the heartbeat, we get the NodeStatus, // whose method removes completed containers. NodeStatus nodeStatus = getNodeStatus(); {color} nodeStatus.setResponseId(lastHeartBeatID); NodeHeartbeatRequest request = recordFactory .newRecordInstance(NodeHeartbeatRequest.class); request.setNodeStatus(nodeStatus); {color:red} // But if the nodeHeartbeat fails, we have already removed the completed containers and have no way to report them again. We aren't handling a nodeHeartbeat failure case here. HeartbeatResponse response = resourceTracker.nodeHeartbeat(request).getHeartbeatResponse(); {color} if (response.getNodeAction() == NodeAction.SHUTDOWN) { LOG.info("Recieved SHUTDOWN signal from Resourcemanager as part of heartbeat," + " hence shutting down."); NodeStatusUpdaterImpl.this.stop(); break; } if (response.getNodeAction() == NodeAction.REBOOT) { LOG.info("Node is out of sync with ResourceManager," + " hence rebooting."); NodeStatusUpdaterImpl.this.reboot(); break; } lastHeartBeatID = response.getResponseId(); List<ContainerId> containersToCleanup = response .getContainersToCleanupList(); if (containersToCleanup.size() != 0) { dispatcher.getEventHandler().handle( new CMgrCompletedContainersEvent(containersToCleanup)); } List<ApplicationId> appsToCleanup = response.getApplicationsToCleanupList(); //Only start tracking for keepAlive on FINISH_APP trackAppsForKeepAlive(appsToCleanup); if (appsToCleanup.size() != 0) { dispatcher.getEventHandler().handle( new CMgrCompletedAppsEvent(appsToCleanup)); } } catch (Throwable e) { // TODO Better error handling. Thread can die with the rest of the NM still running. LOG.error("Caught exception in status-updater", e); } } } }.start(); } private NodeStatus getNodeStatus() { NodeStatus nodeStatus = recordFactory.newRecordInstance(NodeStatus.class); nodeStatus.setNodeId(this.nodeId); int numActiveContainers = 0;
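The committed fix lives in the files listed above; as a rough sketch of the underlying idea, assuming placeholder names rather than the actual NodeStatusUpdaterImpl fields, the NM could buffer completed-container statuses and drop them only once a heartbeat is acknowledged, so a dropped heartbeat cannot lose them:

{code}
import java.util.ArrayList;
import java.util.List;

public class PendingCompletedContainers {
  // Completed-container ids not yet acknowledged by the RM.
  private final List<String> pending = new ArrayList<String>();

  synchronized List<String> snapshotForHeartbeat(List<String> newlyCompleted) {
    pending.addAll(newlyCompleted);
    // Send everything still unacknowledged, including earlier leftovers.
    return new ArrayList<String>(pending);
  }

  synchronized void onHeartbeatAcked() {
    pending.clear();  // forget only after the RM has seen them
  }

  synchronized void onHeartbeatFailed() {
    // Keep the pending list; it will be re-sent with the next heartbeat.
  }
}
{code}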
[jira] [Commented] (YARN-381) Improve FS docs
[ https://issues.apache.org/jira/browse/YARN-381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622028#comment-13622028 ] Hudson commented on YARN-381: - Integrated in Hadoop-Yarn-trunk #174 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/174/]) YARN-381. Improve fair scheduler docs. Contributed by Sandy Ryza. (Revision 1464130) Result = SUCCESS tomwhite : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1464130 Files : * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/site/apt/ClusterSetup.apt.vm * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/FairScheduler.apt.vm Improve FS docs --- Key: YARN-381 URL: https://issues.apache.org/jira/browse/YARN-381 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.0.0-alpha Reporter: Eli Collins Assignee: Sandy Ryza Priority: Minor Fix For: 2.0.5-beta Attachments: YARN-381.patch The MR2 FS docs could use some improvements. Configuration: - sizebasedweight - what is the size here? Total memory usage? Pool properties: - minResources - what does min amount of aggregate memory mean given that this is not a reservation? - maxResources - is this a hard limit? - weight: How is this ratio configured? Eg base is 1 and all weights are relative to that? - schedulingMode - what is the default? Is fifo pure fifo, eg waits until all tasks for the job are finished before launching the next job? There's no mention of ACLs, even though they're supported. See the CS docs for comparison. Also there are a couple typos worth fixing while we're at it, eg finish. apps to run Worth keeping in mind that some of these will need to be updated to reflect that resource calculators are now pluggable. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-536) Remove ContainerStatus, ContainerState from Container api interface as they will not be called by the container object
[ https://issues.apache.org/jira/browse/YARN-536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622029#comment-13622029 ] Hudson commented on YARN-536: - Integrated in Hadoop-Yarn-trunk #174 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/174/]) YARN-536. Removed the unused objects ContainerStatus and ContainerStatus from Container which also don't belong to the container. Contributed by Xuan Gong. (Revision 1464271) Result = SUCCESS vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1464271 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/Container.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/BuilderUtils.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/NodeManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestFifoScheduler.java Remove ContainerStatus, ContainerState from Container api interface as they will not be called by the container object -- Key: YARN-536 URL: https://issues.apache.org/jira/browse/YARN-536 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.0.5-beta Attachments: YARN-536.1.patch, YARN-536.2.patch Remove containerstate, containerStatus from container interface. They will not be called by container object -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-381) Improve FS docs
[ https://issues.apache.org/jira/browse/YARN-381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622154#comment-13622154 ] Hudson commented on YARN-381: - Integrated in Hadoop-Hdfs-trunk #1363 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1363/]) YARN-381. Improve fair scheduler docs. Contributed by Sandy Ryza. (Revision 1464130) Result = FAILURE tomwhite : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1464130 Files : * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/site/apt/ClusterSetup.apt.vm * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/FairScheduler.apt.vm Improve FS docs --- Key: YARN-381 URL: https://issues.apache.org/jira/browse/YARN-381 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.0.0-alpha Reporter: Eli Collins Assignee: Sandy Ryza Priority: Minor Fix For: 2.0.5-beta Attachments: YARN-381.patch The MR2 FS docs could use some improvements. Configuration: - sizebasedweight - what is the size here? Total memory usage? Pool properties: - minResources - what does min amount of aggregate memory mean given that this is not a reservation? - maxResources - is this a hard limit? - weight: How is this ratio configured? Eg base is 1 and all weights are relative to that? - schedulingMode - what is the default? Is fifo pure fifo, eg waits until all tasks for the job are finished before launching the next job? There's no mention of ACLs, even though they're supported. See the CS docs for comparison. Also there are a couple typos worth fixing while we're at it, eg finish. apps to run Worth keeping in mind that some of these will need to be updated to reflect that resource calculators are now pluggable. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-536) Remove ContainerStatus, ContainerState from Container api interface as they will not be called by the container object
[ https://issues.apache.org/jira/browse/YARN-536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622155#comment-13622155 ] Hudson commented on YARN-536: - Integrated in Hadoop-Hdfs-trunk #1363 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1363/]) YARN-536. Removed the unused objects ContainerStatus and ContainerStatus from Container which also don't belong to the container. Contributed by Xuan Gong. (Revision 1464271) Result = FAILURE vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1464271 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/Container.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/BuilderUtils.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/NodeManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestFifoScheduler.java Remove ContainerStatus, ContainerState from Container api interface as they will not be called by the container object -- Key: YARN-536 URL: https://issues.apache.org/jira/browse/YARN-536 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.0.5-beta Attachments: YARN-536.1.patch, YARN-536.2.patch Remove containerstate, containerStatus from container interface. They will not be called by container object -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-458) YARN daemon addresses must be placed in many different configs
[ https://issues.apache.org/jira/browse/YARN-458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622145#comment-13622145 ] Hudson commented on YARN-458: - Integrated in Hadoop-Hdfs-trunk #1363 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1363/]) YARN-458. YARN daemon addresses must be placed in many different configs. (sandyr via tucu) (Revision 1464204) Result = FAILURE tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1464204 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml YARN daemon addresses must be placed in many different configs -- Key: YARN-458 URL: https://issues.apache.org/jira/browse/YARN-458 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 2.0.5-beta Attachments: YARN-458.patch The YARN resourcemanager's address is included in four different configs: yarn.resourcemanager.scheduler.address, yarn.resourcemanager.resource-tracker.address, yarn.resourcemanager.address, and yarn.resourcemanager.admin.address A new user trying to configure a cluster needs to know the names of all these four configs. The same issue exists for nodemanagers. It would be much easier if they could simply specify yarn.resourcemanager.hostname and yarn.nodemanager.hostname and default ports for the other ones would kick in. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
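As a minimal sketch of the proposal, assuming a single yarn.resourcemanager.hostname property and illustrative method names (this is not the committed yarn-default.xml change), each per-service address could fall back to that hostname plus a default port whenever it is not set explicitly:

{code}
import org.apache.hadoop.conf.Configuration;

public class RmAddressDefaults {
  public static String addressFor(Configuration conf, String property, int defaultPort) {
    String explicit = conf.get(property);
    if (explicit != null) {
      return explicit;  // an explicitly configured address still takes precedence
    }
    String host = conf.get("yarn.resourcemanager.hostname", "0.0.0.0");
    return host + ":" + defaultPort;  // derive host:port from the single hostname
  }
}
{code}

For example, addressFor(conf, "yarn.resourcemanager.scheduler.address", 8030) would yield <hostname>:8030 when only the hostname is configured; the port value here is only an example.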
[jira] [Commented] (YARN-382) SchedulerUtils improve way normalizeRequest sets the resource capabilities
[ https://issues.apache.org/jira/browse/YARN-382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622151#comment-13622151 ] Hudson commented on YARN-382: - Integrated in Hadoop-Hdfs-trunk #1363 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1363/]) YARN-382. SchedulerUtils improve way normalizeRequest sets the resource capabilities (Zhijie Shen via bikas) (Revision 1463653) Result = FAILURE bikas : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1463653 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestSchedulerUtils.java SchedulerUtils improve way normalizeRequest sets the resource capabilities -- Key: YARN-382 URL: https://issues.apache.org/jira/browse/YARN-382 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.0.3-alpha Reporter: Thomas Graves Assignee: Zhijie Shen Fix For: 2.0.5-beta Attachments: YARN-382_1.patch, YARN-382_2.patch, YARN-382_demo.patch In YARN-370, we changed it from setting the capability to directly setting memory and cores: -ask.setCapability(normalized); +ask.getCapability().setMemory(normalized.getMemory()); +ask.getCapability().setVirtualCores(normalized.getVirtualCores()); We did this because it is directly setting the values in the original resource object passed in when the AM gets allocated and without it the AM doesn't get the resource normalized correctly in the submission context. See YARN-370 for more details. I think we should find a better way of doing this long term, one so we don't have to keep adding things there when new resources are added, two because its a bit confusing as to what its doing and prone to someone accidentally breaking it in the future again. Something closer to what Arun suggested in YARN-370 would be better but we need to make sure all the places work and get some more testing on it before putting it in. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
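To make the discussion concrete, here is a rough sketch of in-place normalization in the spirit of the quoted change: round the request up to the scheduler minimum and write the values back into the caller's Resource, so the AM submission context sees the normalized values. The rounding rule is an assumption for illustration; only the setMemory/setVirtualCores calls come from the snippet above.

{code}
import org.apache.hadoop.yarn.api.records.Resource;

public class NormalizeSketch {
  static void normalizeInPlace(Resource ask, int minMemoryMb, int minVcores) {
    int memory = roundUp(ask.getMemory(), minMemoryMb);
    int vcores = roundUp(ask.getVirtualCores(), minVcores);
    // Mutate the original object rather than replacing it, so callers that
    // kept a reference (like the AM submission context) see the new values.
    ask.setMemory(memory);
    ask.setVirtualCores(vcores);
  }

  private static int roundUp(int value, int multiple) {
    int v = Math.max(value, multiple);
    return ((v + multiple - 1) / multiple) * multiple;
  }
}
{code}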
[jira] [Created] (YARN-541) getAllocatedContainers() is not returning all the allocated containers
Krishna Kishore Bonagiri created YARN-541: - Summary: getAllocatedContainers() is not returning all the allocated containers Key: YARN-541 URL: https://issues.apache.org/jira/browse/YARN-541 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.3-alpha Environment: Redhat Linux 64-bit Reporter: Krishna Kishore Bonagiri I am running an application that was written and working well with the hadoop-2.0.0-alpha but when I am running the same against 2.0.3-alpha, the getAllocatedContainers() method called on AMResponse is not returning all the containers allocated sometimes. For example, I request for 10 containers and this method gives me only 9 containers sometimes, and when I looked at the log of Resource Manager, the 10th container is also allocated. It happens only sometimes randomly and works fine all other times. If I send one more request for the remaining container to RM after it failed to give them the first time(and before releasing already acquired ones), it could allocate that container. I am running only one application at a time, but 1000s of them one after another. My main worry is, even though the RM's log is saying that all 10 requested containers are allocated, the getAllocatedContainers() method is not returning me all of them, it returned only 9 surprisingly. I never saw this kind of issue in the previous version, i.e. hadoop-2.0.0-alpha. Thanks, Kishore -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
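A small sketch of the pattern this report implies on the AM side: allocated containers can arrive spread over several allocate() heartbeats, so the caller has to keep polling and accumulating rather than expecting all of them in one response. The AllocateClient interface below is a stand-in, not the real AMResponse/AMRMProtocol API, and it does not address the RM-side discrepancy being reported.

{code}
import java.util.ArrayList;
import java.util.List;

public class AllocationPoller {
  /** Stand-in for the protocol call that returns newly allocated container ids. */
  interface AllocateClient {
    List<String> allocateOnce() throws Exception;
  }

  static List<String> waitForContainers(AllocateClient client, int wanted,
      long heartbeatMillis) throws Exception {
    List<String> granted = new ArrayList<String>();
    while (granted.size() < wanted) {
      // Each heartbeat may return zero or more of the remaining containers.
      granted.addAll(client.allocateOnce());
      if (granted.size() < wanted) {
        Thread.sleep(heartbeatMillis);
      }
    }
    return granted;
  }
}
{code}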
[jira] [Commented] (YARN-538) RM address DNS lookup can cause unnecessary slowness on every JHS page load
[ https://issues.apache.org/jira/browse/YARN-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622138#comment-13622138 ] Hudson commented on YARN-538: - Integrated in Hadoop-Hdfs-trunk #1363 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1363/]) YARN-538. RM address DNS lookup can cause unnecessary slowness on every JHS page load. (sandyr via tucu) (Revision 1464197) Result = FAILURE tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1464197 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java RM address DNS lookup can cause unnecessary slowness on every JHS page load Key: YARN-538 URL: https://issues.apache.org/jira/browse/YARN-538 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 2.0.5-beta Attachments: MAPREDUCE-5111.patch When I run the job history server locally, every page load takes in the 10s of seconds. I profiled the process and discovered that all the extra time was spent inside YarnConfiguration#getRMWebAppURL, trying to resolve 0.0.0.0 to a hostname. When I changed my yarn.resourcemanager.address to localhost, the page load times decreased drastically. There's no reason that we need to perform this resolution on every page load. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-101) If the heartbeat message loss, the nodestatus info of complete container will loss too.
[ https://issues.apache.org/jira/browse/YARN-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622344#comment-13622344 ] Hudson commented on YARN-101: - Integrated in Hadoop-Mapreduce-trunk #1390 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1390/]) YARN-101. Fix NodeManager heartbeat processing to not lose track of completed containers in case of dropped heartbeats. Contributed by Xuan Gong. (Revision 1464105) Result = SUCCESS vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1464105 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java If the heartbeat message loss, the nodestatus info of complete container will loss too. Key: YARN-101 URL: https://issues.apache.org/jira/browse/YARN-101 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Environment: suse. Reporter: xieguiming Assignee: Xuan Gong Priority: Minor Fix For: 2.0.5-beta Attachments: YARN-101.1.patch, YARN-101.2.patch, YARN-101.3.patch, YARN-101.4.patch, YARN-101.5.patch, YARN-101.6.patch see the red color: org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.java protected void startStatusUpdater() { new Thread("Node Status Updater") { @Override @SuppressWarnings("unchecked") public void run() { int lastHeartBeatID = 0; while (!isStopped) { // Send heartbeat try { synchronized (heartbeatMonitor) { heartbeatMonitor.wait(heartBeatInterval); } {color:red} // Before we send the heartbeat, we get the NodeStatus, // whose method removes completed containers. NodeStatus nodeStatus = getNodeStatus(); {color} nodeStatus.setResponseId(lastHeartBeatID); NodeHeartbeatRequest request = recordFactory .newRecordInstance(NodeHeartbeatRequest.class); request.setNodeStatus(nodeStatus); {color:red} // But if the nodeHeartbeat fails, we have already removed the completed containers and have no way to report them again. We aren't handling a nodeHeartbeat failure case here. HeartbeatResponse response = resourceTracker.nodeHeartbeat(request).getHeartbeatResponse(); {color} if (response.getNodeAction() == NodeAction.SHUTDOWN) { LOG.info("Recieved SHUTDOWN signal from Resourcemanager as part of heartbeat," + " hence shutting down."); NodeStatusUpdaterImpl.this.stop(); break; } if (response.getNodeAction() == NodeAction.REBOOT) { LOG.info("Node is out of sync with ResourceManager," + " hence rebooting."); NodeStatusUpdaterImpl.this.reboot(); break; } lastHeartBeatID = response.getResponseId(); List<ContainerId> containersToCleanup = response .getContainersToCleanupList(); if (containersToCleanup.size() != 0) { dispatcher.getEventHandler().handle( new CMgrCompletedContainersEvent(containersToCleanup)); } List<ApplicationId> appsToCleanup = response.getApplicationsToCleanupList(); //Only start tracking for keepAlive on FINISH_APP trackAppsForKeepAlive(appsToCleanup); if (appsToCleanup.size() != 0) { dispatcher.getEventHandler().handle( new CMgrCompletedAppsEvent(appsToCleanup)); } } catch (Throwable e) { // TODO Better error handling. Thread can die with the rest of the NM still running. LOG.error("Caught exception in status-updater", e); } } } }.start(); } private NodeStatus getNodeStatus() { NodeStatus nodeStatus = recordFactory.newRecordInstance(NodeStatus.class); nodeStatus.setNodeId(this.nodeId); int numActiveContainers =
[jira] [Commented] (YARN-536) Remove ContainerStatus, ContainerState from Container api interface as they will not be called by the container object
[ https://issues.apache.org/jira/browse/YARN-536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622350#comment-13622350 ] Hudson commented on YARN-536: - Integrated in Hadoop-Mapreduce-trunk #1390 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1390/]) YARN-536. Removed the unused objects ContainerStatus and ContainerStatus from Container which also don't belong to the container. Contributed by Xuan Gong. (Revision 1464271) Result = SUCCESS vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1464271 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/Container.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/BuilderUtils.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/NodeManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestFifoScheduler.java Remove ContainerStatus, ContainerState from Container api interface as they will not be called by the container object -- Key: YARN-536 URL: https://issues.apache.org/jira/browse/YARN-536 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.0.5-beta Attachments: YARN-536.1.patch, YARN-536.2.patch Remove containerstate, containerStatus from container interface. They will not be called by container object -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (YARN-527) Local filecache mkdir fails
[ https://issues.apache.org/jira/browse/YARN-527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli resolved YARN-527. -- Resolution: Duplicate Closing as duplicate as per comments above. Local filecache mkdir fails --- Key: YARN-527 URL: https://issues.apache.org/jira/browse/YARN-527 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.0-alpha Environment: RHEL 6.3 with CDH4.1.3 Hadoop, HA with two name nodes and six worker nodes. Reporter: Knut O. Hellan Priority: Minor Attachments: yarn-site.xml Jobs failed with no other explanation than this stack trace: 2013-03-29 16:46:02,671 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1364591875320_0017_m_00_0: java.io.IOException: mkdir of /disk3/yarn/local/filecache/-4230789355400878397 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:932) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:706) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:703) at org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2333) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:703) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:147) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Manually creating the directory worked. This behavior was common to at least several nodes in the cluster. The situation was resolved by removing and recreating all /disk?/yarn/local/filecache directories on all nodes. It is unclear whether Yarn struggled with the number of files or if there were corrupt files in the caches. The situation was triggered by a node dying. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-398) Allow white-list and black-list of resources
[ https://issues.apache.org/jira/browse/YARN-398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-398: --- Attachment: YARN-398.patch I got this done on a long flight a week or two ago... needs more testing etc. Allow white-list and black-list of resources Key: YARN-398 URL: https://issues.apache.org/jira/browse/YARN-398 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Arun C Murthy Attachments: YARN-398.patch Allow white-list and black-list of resources in scheduler api. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-392) Make it possible to schedule to specific nodes without dropping locality
[ https://issues.apache.org/jira/browse/YARN-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622558#comment-13622558 ] Arun C Murthy commented on YARN-392: [~bikassaha] I'm against using timers for specifying locality delays - it doesn't make sense for a variety of reasons documented elsewhere. [~sandyr] I just uploaded a patch I lost track of for a week or so on YARN-398. Looks like we both are doing the same thing. I'm happy to repurpose one of the two jiras for CS while the other can do the same for FS. Makes sense? In my patch I called the flag as 'strictLocality' which defaults to 'false'. That should solve the need for white-lists. Makes sense? I agree we should tackle black-listing separately. Make it possible to schedule to specific nodes without dropping locality Key: YARN-392 URL: https://issues.apache.org/jira/browse/YARN-392 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Sandy Ryza Attachments: YARN-392-1.patch, YARN-392.patch Currently its not possible to specify scheduling requests for specific nodes and nowhere else. The RM automatically relaxes locality to rack and * and assigns non-specified machines to the app. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-392) Make it possible to schedule to specific nodes without dropping locality
[ https://issues.apache.org/jira/browse/YARN-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622561#comment-13622561 ] Arun C Murthy commented on YARN-392: To be clear, the approach I took on YARN-398 allows for the 'I want only one container, and only on node1 or node2' use-case. Make it possible to schedule to specific nodes without dropping locality Key: YARN-392 URL: https://issues.apache.org/jira/browse/YARN-392 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Sandy Ryza Attachments: YARN-392-1.patch, YARN-392.patch Currently its not possible to specify scheduling requests for specific nodes and nowhere else. The RM automatically relaxes locality to rack and * and assigns non-specified machines to the app. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-392) Make it possible to schedule to specific nodes without dropping locality
[ https://issues.apache.org/jira/browse/YARN-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622563#comment-13622563 ] Arun C Murthy commented on YARN-392: Also, it allows for I want 'one container on any one of the following n racks' too. Make it possible to schedule to specific nodes without dropping locality Key: YARN-392 URL: https://issues.apache.org/jira/browse/YARN-392 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Sandy Ryza Attachments: YARN-392-1.patch, YARN-392.patch Currently its not possible to specify scheduling requests for specific nodes and nowhere else. The RM automatically relaxes locality to rack and * and assigns non-specified machines to the app. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
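As a rough illustration of the 'strictLocality' idea discussed in the comments above, the sketch below shows a request shape that names specific nodes and opts out of relaxing to rack or *. The class and field names are placeholders for the proposal, not the actual YARN-398 patch or the ResourceRequest API.

{code}
import java.util.Arrays;
import java.util.List;

public class LocalityRequestSketch {
  final String resourceName;     // a node, a rack, or "*"
  final int numContainers;
  final boolean strictLocality;  // true => do not relax to rack or "*"

  LocalityRequestSketch(String resourceName, int numContainers, boolean strictLocality) {
    this.resourceName = resourceName;
    this.numContainers = numContainers;
    this.strictLocality = strictLocality;
  }

  // "I want only one container, and only on node1 or node2."
  static List<LocalityRequestSketch> oneContainerOnEitherNode() {
    return Arrays.asList(
        new LocalityRequestSketch("node1", 1, true),
        new LocalityRequestSketch("node2", 1, true));
  }
}
{code}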
[jira] [Updated] (YARN-525) make CS node-locality-delay refreshable
[ https://issues.apache.org/jira/browse/YARN-525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-525: --- Assignee: Thomas Graves make CS node-locality-delay refreshable --- Key: YARN-525 URL: https://issues.apache.org/jira/browse/YARN-525 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.0.3-alpha, 0.23.7 Reporter: Thomas Graves Assignee: Thomas Graves the config yarn.scheduler.capacity.node-locality-delay doesn't change when you change the value in capacity_scheduler.xml and then run yarn rmadmin -refreshQueues. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
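A minimal sketch of what the refreshable behavior could look like, assuming illustrative class and field names rather than the actual CapacityScheduler code: re-read the value whenever the queues are reinitialized instead of only at scheduler startup.

{code}
import org.apache.hadoop.conf.Configuration;

public class LocalityDelaySketch {
  private volatile int nodeLocalityDelay;

  public void reinitialize(Configuration conf) {
    // Called both at startup and from "yarn rmadmin -refreshQueues".
    nodeLocalityDelay = conf.getInt(
        "yarn.scheduler.capacity.node-locality-delay", -1);
  }

  public int getNodeLocalityDelay() {
    return nodeLocalityDelay;
  }
}
{code}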
[jira] [Commented] (YARN-392) Make it possible to schedule to specific nodes without dropping locality
[ https://issues.apache.org/jira/browse/YARN-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622584#comment-13622584 ] Bikas Saha commented on YARN-392: - bq. I'm against using timers for specifying locality delays - it doesn't make sense for a variety of reasons documented elsewhere. Can you please point me to them? Make it possible to schedule to specific nodes without dropping locality Key: YARN-392 URL: https://issues.apache.org/jira/browse/YARN-392 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Sandy Ryza Attachments: YARN-392-1.patch, YARN-392.patch Currently its not possible to specify scheduling requests for specific nodes and nowhere else. The RM automatically relaxes locality to rack and * and assigns non-specified machines to the app. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-495) Change NM behavior of reboot to resync
[ https://issues.apache.org/jira/browse/YARN-495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated YARN-495: Summary: Change NM behavior of reboot to resync (was: Containers are not terminated when the NM is rebooted) Change NM behavior of reboot to resync -- Key: YARN-495 URL: https://issues.apache.org/jira/browse/YARN-495 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-495.1.patch, YARN-495.2.patch When a reboot command is sent from RM, the node manager doesn't clean up the containers while its stopping. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-529) MR job succeeds and exits even when unregister with RM fails
[ https://issues.apache.org/jira/browse/YARN-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated YARN-529: Summary: MR job succeeds and exits even when unregister with RM fails (was: Succeeded MR job is retried by RM if finishApplicationMaster() call fails) MR job succeeds and exits even when unregister with RM fails Key: YARN-529 URL: https://issues.apache.org/jira/browse/YARN-529 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Jian He Assignee: Jian He MR app master will clean staging dir, if the job is already succeeded and asked to reboot. If the finishApplicationMaster call fails, RM will consider this job unfinished and launch further attempts, further attempts will fail because staging dir is cleaned -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-540) RM state store not cleaned if job succeeds but RM shutdown and restart-dispatcher stopped before it can process REMOVE_APP event
[ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622681#comment-13622681 ] Bikas Saha commented on YARN-540: - This is a known issue. The problem here is that the RM state store is essentially a write-ahead log. But in the application unregister/finish case, the application has already finished before the RM stores that fact in its state. So the RM by itself cannot avoid this problem. Since it's a race condition we may choose not to fix it unless we see this happen often in practice. The solutions that come to mind are 1) finishApplicationMaster() blocks until the finish is stored in the store. This has issues of getting blocked on a slow/unavailable store. Also, the RM does a bunch of other things before an application finishes. The RM may not be able to remove the application from the store until all those steps are complete. 2) finishApplicationMaster() becomes a 2-step process in which, in the second step, the app waits for the RM to change the app's state to FINISHED before exiting. RM state store not cleaned if job succeeds but RM shutdown and restart-dispatcher stopped before it can process REMOVE_APP event Key: YARN-540 URL: https://issues.apache.org/jira/browse/YARN-540 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He When a job succeeds and successfully calls finishApplicationMaster, but the RM is shut down and the dispatcher is stopped before it can process the REMOVE_APP event, then the next time the RM comes back it will reload the existing state files even though the job succeeded. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
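A hedged sketch of option (2) above: the AM treats unregistration as a two-step handshake and exits only after the RM confirms the finish has been recorded. The RmFinishClient interface is a placeholder for whatever RPC this would become, not an existing API.

{code}
public class TwoStepUnregisterSketch {
  interface RmFinishClient {
    void requestFinish() throws Exception;        // step 1: ask the RM to finish the app
    boolean isFinishRecorded() throws Exception;  // step 2: has the RM persisted it?
  }

  static void unregister(RmFinishClient rm, long pollMillis, int maxPolls) throws Exception {
    rm.requestFinish();
    for (int i = 0; i < maxPolls; i++) {
      if (rm.isFinishRecorded()) {
        return;                                   // safe to exit the AM now
      }
      Thread.sleep(pollMillis);
    }
    // Give up after a bounded wait rather than blocking the AM forever.
  }
}
{code}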
[jira] [Commented] (YARN-534) AM max attempts is not checked when RM restart and try to recover attempts
[ https://issues.apache.org/jira/browse/YARN-534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622685#comment-13622685 ] Bikas Saha commented on YARN-534: - Turns out that the max attempts limit is checked when a job fails (and tries to launch a new attempt) and not when the new attempt is actually being launched. The RM, on restart, could choose to remove applications that have already hit the limit. AM max attempts is not checked when RM restart and try to recover attempts -- Key: YARN-534 URL: https://issues.apache.org/jira/browse/YARN-534 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Currently, AM max attempts is only checked when the current attempt fails, to decide whether to create a new attempt. If the RM restarts before the max attempt fails, it will not clean the state store; when the RM comes back, it will retry the attempt again. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
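A minimal sketch of the suggestion in the comment: when recovering applications from the state store, skip any application whose stored attempt count already reached the max-attempts limit (and remove it from the store) instead of launching another attempt. The RecoveredApp type is a placeholder for whatever the store actually returns.

{code}
import java.util.ArrayList;
import java.util.List;

public class RecoveryFilterSketch {
  static class RecoveredApp {
    final String appId;
    final int attemptsSoFar;
    RecoveredApp(String appId, int attemptsSoFar) {
      this.appId = appId;
      this.attemptsSoFar = attemptsSoFar;
    }
  }

  static List<RecoveredApp> appsToRecover(List<RecoveredApp> stored, int maxAttempts) {
    List<RecoveredApp> recover = new ArrayList<RecoveredApp>();
    for (RecoveredApp app : stored) {
      if (app.attemptsSoFar < maxAttempts) {
        recover.add(app);  // still eligible for another attempt
      }
      // else: remove the entry from the store instead of launching a new attempt
    }
    return recover;
  }
}
{code}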
[jira] [Assigned] (YARN-542) Change the default AM retry value to be not one
[ https://issues.apache.org/jira/browse/YARN-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli reassigned YARN-542: Assignee: Vinod Kumar Vavilapalli Change the default AM retry value to be not one --- Key: YARN-542 URL: https://issues.apache.org/jira/browse/YARN-542 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Today, the AM max-retries is set to 1, which is a bad choice. AM max-retries accounts for both AM-level failures as well as container crashes due to localization issues, lost nodes etc. To account for AM crashes due to problems that are not caused by user code, mainly lost nodes, we want to give AMs some retries. I propose we change it to at least two. Can change it to 4 to match other retry-configs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-493) NodeManager job control logic flaws on Windows
[ https://issues.apache.org/jira/browse/YARN-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth updated YARN-493: --- Attachment: YARN-493.3.patch Here is a new patch that renames the new {{Shell}} methods to {{appendScriptExtension}}. Regarding trying to use {{Shell#getRunScriptCommand}} in the badSymlink test, I have not been able to get this to work. The test depends on very specific quoting, and the conversion to absolute path inside {{Shell#getRunScriptCommand}} (required by other callers) interferes with this. NodeManager job control logic flaws on Windows -- Key: YARN-493 URL: https://issues.apache.org/jira/browse/YARN-493 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Chris Nauroth Assignee: Chris Nauroth Fix For: 3.0.0 Attachments: YARN-493.1.patch, YARN-493.2.patch, YARN-493.3.patch Both product and test code contain some platform-specific assumptions, such as availability of bash for executing a command in a container and signals to check existence of a process and terminate it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-525) make CS node-locality-delay refreshable
[ https://issues.apache.org/jira/browse/YARN-525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-525: --- Attachment: YARN-525-branch-0.23.patch make CS node-locality-delay refreshable --- Key: YARN-525 URL: https://issues.apache.org/jira/browse/YARN-525 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.0.3-alpha, 0.23.7 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-525-branch-0.23.patch, YARN-525-branch-0.23.patch the config yarn.scheduler.capacity.node-locality-delay doesn't change when you change the value in capacity_scheduler.xml and then run yarn rmadmin -refreshQueues. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-392) Make it possible to schedule to specific nodes without dropping locality
[ https://issues.apache.org/jira/browse/YARN-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622765#comment-13622765 ] Sandy Ryza commented on YARN-392: - [~acmurthy], that makes sense to me. We can use this one for FS and YARN-398 for CS? Do you think this should go into FIFO as well? [~bikassaha], if we went with your proposal, would it not make sense to go with the convention used in the FS/CS already, in which the locality delay is a fraction of the cluster size? In your proposal, if I want a node-local container at node1, would I specify the locality delay on the request for node1 or on the request for the rack that node1 is on? Make it possible to schedule to specific nodes without dropping locality Key: YARN-392 URL: https://issues.apache.org/jira/browse/YARN-392 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Sandy Ryza Attachments: YARN-392-1.patch, YARN-392.patch Currently its not possible to specify scheduling requests for specific nodes and nowhere else. The RM automatically relaxes locality to rack and * and assigns non-specified machines to the app. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-493) NodeManager job control logic flaws on Windows
[ https://issues.apache.org/jira/browse/YARN-493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622772#comment-13622772 ] Hadoop QA commented on YARN-493: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12577046/YARN-493.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/670//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/670//console This message is automatically generated. NodeManager job control logic flaws on Windows -- Key: YARN-493 URL: https://issues.apache.org/jira/browse/YARN-493 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Chris Nauroth Assignee: Chris Nauroth Fix For: 3.0.0 Attachments: YARN-493.1.patch, YARN-493.2.patch, YARN-493.3.patch Both product and test code contain some platform-specific assumptions, such as availability of bash for executing a command in a container and signals to check existence of a process and terminate it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-525) make CS node-locality-delay refreshable
[ https://issues.apache.org/jira/browse/YARN-525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-525: --- Attachment: YARN-525.patch added unit test and include patch for trunk and branch-2. make CS node-locality-delay refreshable --- Key: YARN-525 URL: https://issues.apache.org/jira/browse/YARN-525 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.0.3-alpha, 0.23.7 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-525-branch-0.23.patch, YARN-525-branch-0.23.patch, YARN-525.patch the config yarn.scheduler.capacity.node-locality-delay doesn't change when you change the value in capacity_scheduler.xml and then run yarn rmadmin -refreshQueues. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-479) NM retry behavior for connection to RM should be similar for lost heartbeats
[ https://issues.apache.org/jira/browse/YARN-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622783#comment-13622783 ] Bikas Saha commented on YARN-479: - I don't see the value of waitForever if we can specify a large value for the retry interval (1 day or so). I am not sure what retryCounts is buying us. What is the intention of catching and rethrowing the exception without doing anything else? {code} + } catch (YarnException e) { +//catch and throw the exception if tried MAX wait time to connect RM +throw e; {code} There is a finally block which will make the code sleep for longer than necessary before exiting. This becomes important because admins might kill the NM after waiting only a few seconds for it to exit; in that time the NM has to do a bunch of clean-up tasks, and this extra sleep does not help. Unrelated to this change, but does the NM really shut down when the heartbeat fails right now? It looks like the thread just keeps running. After this change it looks like the heartbeat thread will just exit. That does not mean the NM will shut down, does it? NM retry behavior for connection to RM should be similar for lost heartbeats Key: YARN-479 URL: https://issues.apache.org/jira/browse/YARN-479 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Assignee: Jian He Attachments: YARN-479.1.patch, YARN-479.2.patch, YARN-479.3.patch, YARN-479.4.patch, YARN-479.5.patch Regardless of connection loss at the start or at an intermediate point, the NM's retry behavior to the RM should follow the same flow. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
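To make the review comment concrete, here is a hedged before/after sketch of the pattern being questioned; the class, method, and variable names are illustrative and do not mirror the actual NodeStatusUpdater code. The point is that a catch block which only rethrows adds nothing, and a sleep in a finally block also runs on the exit path.
{code}
// Illustrative only -- not the actual patch. The questionable shape:
//
//   try {
//     heartbeat();
//   } catch (Exception e) {
//     throw e;                 // catch-and-rethrow with no other work adds nothing
//   } finally {
//     Thread.sleep(interval);  // also runs when we are about to exit/shut down
//   }
//
// A shape that avoids sleeping on the way out:
public class HeartbeatLoopSketch {
  private volatile boolean shouldRun = true;

  public void run(long intervalMs) throws InterruptedException {
    while (shouldRun) {
      try {
        heartbeat();
      } catch (Exception e) {
        // decide whether to give up; if so, break out without an extra sleep
        shouldRun = false;
        break;
      }
      Thread.sleep(intervalMs); // only sleep when another iteration follows
    }
  }

  private void heartbeat() throws Exception {
    // placeholder for the RM heartbeat call
  }
}
{code}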
[jira] [Commented] (YARN-196) Nodemanager should be more robust in handling connection failure to ResourceManager when a cluster is started
[ https://issues.apache.org/jira/browse/YARN-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622785#comment-13622785 ] Bikas Saha commented on YARN-196: - There is a finally block which will make the code sleep for longer than necessary before exiting. This becomes important because admins might kill the NM after waiting only a few seconds for it to exit; in that time the NM has to do a bunch of clean-up tasks, and this extra sleep does not help. Nodemanager should be more robust in handling connection failure to ResourceManager when a cluster is started -- Key: YARN-196 URL: https://issues.apache.org/jira/browse/YARN-196 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0, 2.0.0-alpha Reporter: Ramgopal N Assignee: Xuan Gong Fix For: 2.0.5-beta Attachments: MAPREDUCE-3676.patch, YARN-196.10.patch, YARN-196.11.patch, YARN-196.12.1.patch, YARN-196.12.patch, YARN-196.1.patch, YARN-196.2.patch, YARN-196.3.patch, YARN-196.4.patch, YARN-196.5.patch, YARN-196.6.patch, YARN-196.7.patch, YARN-196.8.patch, YARN-196.9.patch If the NM is started before the RM, the NM shuts down with the following error {code} ERROR org.apache.hadoop.yarn.service.CompositeService: Error starting services org.apache.hadoop.yarn.server.nodemanager.NodeManager org.apache.avro.AvroRuntimeException: java.lang.reflect.UndeclaredThrowableException at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:149) at org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:68) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.start(NodeManager.java:167) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:242) Caused by: java.lang.reflect.UndeclaredThrowableException at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:66) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:182) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:145) ... 3 more Caused by: com.google.protobuf.ServiceException: java.net.ConnectException: Call From HOST-10-18-52-230/10.18.52.230 to HOST-10-18-52-250:8025 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:131) at $Proxy23.registerNodeManager(Unknown Source) at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:59) ... 5 more Caused by: java.net.ConnectException: Call From HOST-10-18-52-230/10.18.52.230 to HOST-10-18-52-250:8025 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:857) at org.apache.hadoop.ipc.Client.call(Client.java:1141) at org.apache.hadoop.ipc.Client.call(Client.java:1100) at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:128) ...
7 more Caused by: java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:659) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:469) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:563) at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:211) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1247) at org.apache.hadoop.ipc.Client.call(Client.java:1117) ... 9 more 2012-01-16 15:04:13,336 WARN org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher thread interrupted java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1899) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1934)
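The fix direction discussed in this issue is to retry the registration instead of aborting service start. A minimal, hedged sketch of a bounded retry loop is below; the interface, method, and timeout names are hypothetical and do not mirror the actual YARN-196 patch.
{code}
// Hypothetical sketch of retrying NM registration instead of failing fast
// when the RM is not yet reachable.
import java.io.IOException;
import java.net.ConnectException;

public class RegistrationRetrySketch {

  interface Registrar {
    void registerNodeManager() throws IOException;
  }

  static void registerWithRetries(Registrar registrar, long maxWaitMs,
      long retryIntervalMs) throws IOException, InterruptedException {
    long deadline = System.currentTimeMillis() + maxWaitMs;
    while (true) {
      try {
        registrar.registerNodeManager();
        return;                          // success
      } catch (ConnectException e) {
        if (System.currentTimeMillis() >= deadline) {
          throw e;                       // give up after the max wait time
        }
        Thread.sleep(retryIntervalMs);   // RM may simply not be up yet
      }
    }
  }
}
{code}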
[jira] [Commented] (YARN-525) make CS node-locality-delay refreshable
[ https://issues.apache.org/jira/browse/YARN-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622819#comment-13622819 ] Hadoop QA commented on YARN-525: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12577058/YARN-525.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/671//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/671//console This message is automatically generated. make CS node-locality-delay refreshable --- Key: YARN-525 URL: https://issues.apache.org/jira/browse/YARN-525 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.0.3-alpha, 0.23.7 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-525-branch-0.23.patch, YARN-525-branch-0.23.patch, YARN-525.patch the config yarn.scheduler.capacity.node-locality-delay doesn't change when you change the value in capacity_scheduler.xml and then run yarn rmadmin -refreshQueues. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-470) Support a way to disable resource monitoring on the NodeManager
[ https://issues.apache.org/jira/browse/YARN-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622828#comment-13622828 ] Hudson commented on YARN-470: - Integrated in Hadoop-trunk-Commit #3565 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/3565/]) Updated CHANGES.txt to reflect YARN-470 being merged into branch-2.0.4-alpha. (Revision 1464772) Result = SUCCESS sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1464772 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt Support a way to disable resource monitoring on the NodeManager --- Key: YARN-470 URL: https://issues.apache.org/jira/browse/YARN-470 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Hitesh Shah Assignee: Siddharth Seth Labels: usability Fix For: 2.0.4-alpha Attachments: YARN-470_2.txt, YARN-470.txt Currently, the memory management monitor's check is disabled when the maxMem is set to -1. However, the maxMem is also sent to the RM when the NM registers with it (to define the max limit of allocatable resources). We need an explicit flag to disable monitoring to avoid the problems caused by overloading the max memory value. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
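A hedged sketch of the requested behavior follows: monitoring is gated on a dedicated boolean flag instead of the sentinel maxMem == -1. The property name yarn.nodemanager.resource-monitoring.enabled is a made-up placeholder; the actual configuration key introduced by the patch may differ.
{code}
import org.apache.hadoop.conf.Configuration;

// Illustration only: decouple "monitoring on/off" from the max-memory value
// that is also reported to the RM at registration time.
public class MonitoringFlagSketch {
  // Placeholder key -- not necessarily what YARN-470 actually added.
  public static final String MONITORING_ENABLED =
      "yarn.nodemanager.resource-monitoring.enabled";

  public static boolean isMonitoringEnabled(Configuration conf) {
    // The explicit flag decides; maxMem keeps its single meaning of
    // "resources advertised to the RM" and is no longer overloaded.
    return conf.getBoolean(MONITORING_ENABLED, true);
  }
}
{code}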
[jira] [Updated] (YARN-532) RMAdminProtocolPBClientImpl should implement Closeable
[ https://issues.apache.org/jira/browse/YARN-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated YARN-532: Attachment: YARN-532.txt Updated patch: LocalizationProtocol implements Closeable as well. RMAdminProtocolPBClientImpl should implement Closeable -- Key: YARN-532 URL: https://issues.apache.org/jira/browse/YARN-532 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.3-alpha Reporter: Siddharth Seth Assignee: Siddharth Seth Attachments: YARN-532.txt, YARN-532.txt Required for RPC.stopProxy to work. Already done in most of the other protocols. (MAPREDUCE-5117 addresses the one other protocol missing this) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
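For reference, the Closeable pattern being added looks roughly like the sketch below. The class and field names here are assumed for illustration; RPC.stopProxy and the Closeable contract are standard Hadoop/Java APIs, but this is not the patch itself.
{code}
import java.io.Closeable;
import java.io.IOException;
import org.apache.hadoop.ipc.RPC;

// Rough shape of a protocol PB client that can be shut down via RPC.stopProxy.
public class ClientPBClientSketch implements Closeable {

  private final Object proxy;   // the underlying RPC proxy instance

  public ClientPBClientSketch(Object proxy) {
    this.proxy = proxy;
  }

  @Override
  public void close() throws IOException {
    if (proxy != null) {
      // Stopping via RPC.stopProxy requires the wrapper to be Closeable,
      // which is exactly what this JIRA adds to RMAdminProtocolPBClientImpl.
      RPC.stopProxy(proxy);
    }
  }
}
{code}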
[jira] [Updated] (YARN-99) Jobs fail during resource localization when private distributed-cache hits unix directory limits
[ https://issues.apache.org/jira/browse/YARN-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-99: Issue Type: Sub-task (was: Bug) Parent: YARN-543 Jobs fail during resource localization when private distributed-cache hits unix directory limits Key: YARN-99 URL: https://issues.apache.org/jira/browse/YARN-99 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 3.0.0, 2.0.0-alpha Reporter: Devaraj K Assignee: Omkar Vinit Joshi Attachments: yarn-99-20130324.patch, yarn-99-20130403.1.patch, yarn-99-20130403.patch If we have multiple jobs which use the distributed cache with many small files, the per-directory limit is reached before the cache size limit, and the NM fails to create any new directories in the file cache. The jobs start failing with the below exception. {code:xml} java.io.IOException: mkdir of /tmp/nm-local-dir/usercache/root/filecache/1701886847734194975 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:909) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:706) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:703) at org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2325) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:703) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:147) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {code} We should have a mechanism to clean the cache files when the number of directories crosses a specified limit, just as we do for the cache size. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
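One common way around a per-directory entry limit, sketched below under stated assumptions, is to spread cache entries over a small tree of subdirectories derived from the resource id rather than putting everything in one flat filecache directory. This is only an illustration of the idea, not the layout any eventual patch uses.
{code}
import java.io.File;

// Hypothetical sketch: map a resource id to a nested path so no single
// directory accumulates enough children to hit the filesystem's limit.
public class CacheLayoutSketch {
  private static final int FANOUT = 36;   // entries per directory level

  public static File pathFor(File cacheRoot, long resourceId) {
    long id = resourceId & Long.MAX_VALUE;           // keep it non-negative
    String level1 = Long.toString(id % FANOUT, 36);
    String level2 = Long.toString((id / FANOUT) % FANOUT, 36);
    // e.g. <root>/a/7/<resourceId> instead of <root>/<resourceId>
    return new File(new File(new File(cacheRoot, level1), level2),
        Long.toString(id));
  }
}
{code}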
[jira] [Updated] (YARN-539) LocalizedResources are leaked in memory in case resource localization fails
[ https://issues.apache.org/jira/browse/YARN-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-539: - Issue Type: Sub-task (was: Bug) Parent: YARN-543 LocalizedResources are leaked in memory in case resource localization fails --- Key: YARN-539 URL: https://issues.apache.org/jira/browse/YARN-539 Project: Hadoop YARN Issue Type: Sub-task Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi If resource localization fails then the resource remains in memory and is either 1) cleaned up the next time cache cleanup runs and there is a space crunch (if sufficient space is available in the cache then it will remain in memory), or 2) reused if a LocalizationRequest comes in again for the same resource. I think that when resource localization fails, that event should be sent to the LocalResourceTracker, which will then remove it from its cache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-543) [Umbrella] NodeManager localization related issues
[ https://issues.apache.org/jira/browse/YARN-543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-543: - Component/s: nodemanager [Umbrella] NodeManager localization related issues -- Key: YARN-543 URL: https://issues.apache.org/jira/browse/YARN-543 Project: Hadoop YARN Issue Type: Task Components: nodemanager Reporter: Vinod Kumar Vavilapalli Seeing a bunch of localization-related issues being worked on, this is the tracking ticket. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (YARN-544) Failed resource localization might introduce a race condition.
Omkar Vinit Joshi created YARN-544: -- Summary: Failed resource localization might introduce a race condition. Key: YARN-544 URL: https://issues.apache.org/jira/browse/YARN-544 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi When resource localization fails [Public localizer / LocalizerRunner(Private)] it sends a ContainerResourceFailedEvent to the containers, which then send a ResourceReleaseEvent to the failed resource. In the end, when the LocalizedResource's ref count drops to 0, its state is changed from DOWNLOADING to INIT. Now if a resource gets a ResourceRequestEvent in between the ContainerResourceFailedEvent and the last ResourceReleaseEvent, then the ref count for that resource will not drop to 0 and the container which sent the ResourceRequestEvent will keep waiting. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-532) RMAdminProtocolPBClientImpl should implement Closeable
[ https://issues.apache.org/jira/browse/YARN-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622923#comment-13622923 ] Hadoop QA commented on YARN-532: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12577088/YARN-532.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/672//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/672//console This message is automatically generated. RMAdminProtocolPBClientImpl should implement Closeable -- Key: YARN-532 URL: https://issues.apache.org/jira/browse/YARN-532 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.3-alpha Reporter: Siddharth Seth Assignee: Siddharth Seth Attachments: YARN-532.txt, YARN-532.txt Required for RPC.stopProxy to work. Already done in most of the other protocols. (MAPREDUCE-5117 addressing the one other protocol missing this) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-479) NM retry behavior for connection to RM should be similar for lost heartbeats
[ https://issues.apache.org/jira/browse/YARN-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-479: - Attachment: YARN-479.6.patch NM retry behavior for connection to RM should be similar for lost heartbeats Key: YARN-479 URL: https://issues.apache.org/jira/browse/YARN-479 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Assignee: Jian He Attachments: YARN-479.1.patch, YARN-479.2.patch, YARN-479.3.patch, YARN-479.4.patch, YARN-479.5.patch, YARN-479.6.patch Regardless of connection loss at the start or at an intermediate point, NM's retry behavior to the RM should follow the same flow. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-479) NM retry behavior for connection to RM should be similar for lost heartbeats
[ https://issues.apache.org/jira/browse/YARN-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622967#comment-13622967 ] Hadoop QA commented on YARN-479: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12577107/YARN-479.6.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/673//console This message is automatically generated. NM retry behavior for connection to RM should be similar for lost heartbeats Key: YARN-479 URL: https://issues.apache.org/jira/browse/YARN-479 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Assignee: Jian He Attachments: YARN-479.1.patch, YARN-479.2.patch, YARN-479.3.patch, YARN-479.4.patch, YARN-479.5.patch, YARN-479.6.patch Regardless of connection loss at the start or at an intermediate point, NM's retry behavior to the RM should follow the same flow. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-544) Failed resource localization might introduce a race condition.
[ https://issues.apache.org/jira/browse/YARN-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622969#comment-13622969 ] Vinod Kumar Vavilapalli commented on YARN-544: -- When you come around to doing this, please write a test-case first to reproduce this. Tx. Failed resource localization might introduce a race condition. -- Key: YARN-544 URL: https://issues.apache.org/jira/browse/YARN-544 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi When resource localization fails [Public localizer / LocalizerRunner(Private)] it sends ContainerResourceFailedEvent to the containers which then sends ResourceReleaseEvent to the failed resource. In the end when LocalizedResource's ref count drops to 0 its state is changed from DOWNLOADING to INIT. Now if a Resource gets ResourceRequestEvent in between ContainerResourceFailedEvent and last ResourceReleaseEvent then for that resource ref count will not drop to 0 and the container which sent the ResourceRequestEvent will keep waiting. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-544) Failed resource localization might introduce a race condition.
[ https://issues.apache.org/jira/browse/YARN-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-544: - Issue Type: Sub-task (was: Bug) Parent: YARN-543 Failed resource localization might introduce a race condition. -- Key: YARN-544 URL: https://issues.apache.org/jira/browse/YARN-544 Project: Hadoop YARN Issue Type: Sub-task Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi When resource localization fails [Public localizer / LocalizerRunner(Private)] it sends ContainerResourceFailedEvent to the containers which then sends ResourceReleaseEvent to the failed resource. In the end when LocalizedResource's ref count drops to 0 its state is changed from DOWNLOADING to INIT. Now if a Resource gets ResourceRequestEvent in between ContainerResourceFailedEvent and last ResourceReleaseEvent then for that resource ref count will not drop to 0 and the container which sent the ResourceRequestEvent will keep waiting. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-532) RMAdminProtocolPBClientImpl should implement Closeable
[ https://issues.apache.org/jira/browse/YARN-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622972#comment-13622972 ] Vinod Kumar Vavilapalli commented on YARN-532: -- Looks good, checking it in. RMAdminProtocolPBClientImpl should implement Closeable -- Key: YARN-532 URL: https://issues.apache.org/jira/browse/YARN-532 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.3-alpha Reporter: Siddharth Seth Assignee: Siddharth Seth Attachments: YARN-532.txt, YARN-532.txt Required for RPC.stopProxy to work. Already done in most of the other protocols. (MAPREDUCE-5117 addressing the one other protocol missing this) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-532) RMAdminProtocolPBClientImpl should implement Closeable
[ https://issues.apache.org/jira/browse/YARN-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622993#comment-13622993 ] Hudson commented on YARN-532: - Integrated in Hadoop-trunk-Commit #3567 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/3567/]) YARN-532. Change RMAdmin and Localization client protocol PB implementations to implement closeable so that they can be stopped when needed via RPC.stopProxy(). Contributed by Siddharth Seth. (Revision 1464788) Result = SUCCESS vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1464788 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/impl/pb/client/RMAdminProtocolPBClientImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/impl/pb/client/LocalizationProtocolPBClientImpl.java RMAdminProtocolPBClientImpl should implement Closeable -- Key: YARN-532 URL: https://issues.apache.org/jira/browse/YARN-532 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.3-alpha Reporter: Siddharth Seth Assignee: Siddharth Seth Fix For: 2.0.5-beta Attachments: YARN-532.txt, YARN-532.txt Required for RPC.stopProxy to work. Already done in most of the other protocols. (MAPREDUCE-5117 addressing the one other protocol missing this) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-493) NodeManager job control logic flaws on Windows
[ https://issues.apache.org/jira/browse/YARN-493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13622995#comment-13622995 ] Ivan Mitic commented on YARN-493: - +1, latest patch looks good to me, thanks Chris NodeManager job control logic flaws on Windows -- Key: YARN-493 URL: https://issues.apache.org/jira/browse/YARN-493 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Chris Nauroth Assignee: Chris Nauroth Fix For: 3.0.0 Attachments: YARN-493.1.patch, YARN-493.2.patch, YARN-493.3.patch Both product and test code contain some platform-specific assumptions, such as availability of bash for executing a command in a container and signals to check existence of a process and terminate it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-539) LocalizedResources are leaked in memory in case resource localization fails
[ https://issues.apache.org/jira/browse/YARN-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13623082#comment-13623082 ] Omkar Vinit Joshi commented on YARN-539: At present the flow of events in case resource localization fails is as follows * When resource localization fails (Public localizer / LocalizerRunner (Private)) it sends a ContainerResourceFailedEvent to the containers, which then send a ResourceReleaseEvent to the failed resource. In the end, when the LocalizedResource's ref count drops to 0, its state is changed from DOWNLOADING to INIT. Due to this the resource may end up staying in memory (a memory leak in the ResourceLocalizationTracker) or it may also introduce a race condition [YARN-544|https://issues.apache.org/jira/browse/YARN-544] The proposed solution is * When resource localization fails, a resource-localization-failed event (ResourceFailedEvent) is sent to the LocalResourcesTrackerImpl. The tracker will remove this localized resource from its cache and will then pass the event on to the LocalizedResource, which will notify all the containers that were waiting for this resource. The containers will no longer send an additional ResourceReleaseEvent. * To keep the flow the same for success as well as failure, the localization-successful event will also be sent to the LocalizedResource via the LocalResourcesTrackerImpl. LocalizedResources are leaked in memory in case resource localization fails --- Key: YARN-539 URL: https://issues.apache.org/jira/browse/YARN-539 Project: Hadoop YARN Issue Type: Sub-task Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi If resource localization fails then the resource remains in memory and is either 1) cleaned up the next time cache cleanup runs and there is a space crunch (if sufficient space is available in the cache then it will remain in memory), or 2) reused if a LocalizationRequest comes in again for the same resource. I think that when resource localization fails, that event should be sent to the LocalResourceTracker, which will then remove it from its cache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
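A hedged sketch of the proposed failure path follows. The class names roughly track the ones mentioned in the comment, but the data structures and method signatures are invented for illustration and do not mirror the actual NodeManager code.
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: on failure the tracker removes the resource from its
// cache and hands back the waiting containers, so nothing stays leaked.
public class TrackerFailureSketch {
  private final Map<String, List<String>> waitingContainersByResource =
      new HashMap<>();

  public void register(String resourceKey, String containerId) {
    waitingContainersByResource
        .computeIfAbsent(resourceKey, k -> new ArrayList<>())
        .add(containerId);
  }

  // Analogue of handling a ResourceFailedEvent in LocalResourcesTrackerImpl.
  public List<String> onLocalizationFailed(String resourceKey) {
    // 1) drop the resource from the cache so it cannot be reused or leaked
    List<String> waiters = waitingContainersByResource.remove(resourceKey);
    // 2) the caller notifies these containers; they no longer need to send
    //    a separate ResourceReleaseEvent
    return waiters == null ? new ArrayList<>() : waiters;
  }
}
{code}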
[jira] [Commented] (YARN-493) NodeManager job control logic flaws on Windows
[ https://issues.apache.org/jira/browse/YARN-493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13623084#comment-13623084 ] Chris Nauroth commented on YARN-493: Thank you for the reviews, Ivan! NodeManager job control logic flaws on Windows -- Key: YARN-493 URL: https://issues.apache.org/jira/browse/YARN-493 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Chris Nauroth Assignee: Chris Nauroth Fix For: 3.0.0 Attachments: YARN-493.1.patch, YARN-493.2.patch, YARN-493.3.patch Both product and test code contain some platform-specific assumptions, such as availability of bash for executing a command in a container and signals to check existence of a process and terminate it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-493) NodeManager job control logic flaws on Windows
[ https://issues.apache.org/jira/browse/YARN-493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13623088#comment-13623088 ] Vinod Kumar Vavilapalli commented on YARN-493: -- Looking at this for final review/commit. NodeManager job control logic flaws on Windows -- Key: YARN-493 URL: https://issues.apache.org/jira/browse/YARN-493 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Chris Nauroth Assignee: Chris Nauroth Fix For: 3.0.0 Attachments: YARN-493.1.patch, YARN-493.2.patch, YARN-493.3.patch Both product and test code contain some platform-specific assumptions, such as availability of bash for executing a command in a container and signals to check existence of a process and terminate it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-479) NM retry behavior for connection to RM should be similar for lost heartbeats
[ https://issues.apache.org/jira/browse/YARN-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-479: - Attachment: YARN-479.7.patch Fixed conflicts with YARN-101. NM retry behavior for connection to RM should be similar for lost heartbeats Key: YARN-479 URL: https://issues.apache.org/jira/browse/YARN-479 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Assignee: Jian He Attachments: YARN-479.1.patch, YARN-479.2.patch, YARN-479.3.patch, YARN-479.4.patch, YARN-479.5.patch, YARN-479.6.patch, YARN-479.7.patch Regardless of connection loss at the start or at an intermediate point, NM's retry behavior to the RM should follow the same flow. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-479) NM retry behavior for connection to RM should be similar for lost heartbeats
[ https://issues.apache.org/jira/browse/YARN-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-479: - Attachment: YARN-479.8.patch Added a test case verifying that the NodeStatusUpdater retries a fixed number of times and eventually sends SHUTDOWN to the NM. NM retry behavior for connection to RM should be similar for lost heartbeats Key: YARN-479 URL: https://issues.apache.org/jira/browse/YARN-479 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Assignee: Jian He Attachments: YARN-479.1.patch, YARN-479.2.patch, YARN-479.3.patch, YARN-479.4.patch, YARN-479.5.patch, YARN-479.6.patch, YARN-479.7.patch, YARN-479.8.patch Regardless of connection loss at the start or at an intermediate point, NM's retry behavior to the RM should follow the same flow. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-479) NM retry behavior for connection to RM should be similar for lost heartbeats
[ https://issues.apache.org/jira/browse/YARN-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13623294#comment-13623294 ] Hadoop QA commented on YARN-479: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12577136/YARN-479.8.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/675//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/675//console This message is automatically generated. NM retry behavior for connection to RM should be similar for lost heartbeats Key: YARN-479 URL: https://issues.apache.org/jira/browse/YARN-479 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Assignee: Jian He Attachments: YARN-479.1.patch, YARN-479.2.patch, YARN-479.3.patch, YARN-479.4.patch, YARN-479.5.patch, YARN-479.6.patch, YARN-479.7.patch, YARN-479.8.patch Regardless of connection loss at the start or at an intermediate point, NM's retry behavior to the RM should follow the same flow. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-157) The option shell_command and shell_script have conflict
[ https://issues.apache.org/jira/browse/YARN-157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] rainy Yu updated YARN-157: -- Attachment: shell_script.sh YARN-157.patch Added a unit test. Thanks to Vinod Kumar Vavilapalli for the help. The option shell_command and shell_script have conflict --- Key: YARN-157 URL: https://issues.apache.org/jira/browse/YARN-157 Project: Hadoop YARN Issue Type: Bug Components: applications/distributed-shell Affects Versions: 2.0.1-alpha Reporter: Li Ming Assignee: rainy Yu Labels: patch Attachments: hadoop_yarn.patch, shell_script.sh, YARN-157.patch The DistributedShell has an option shell_script to let the user specify a shell script which will be executed in containers. But the issue is that the shell_command option is mandatory, so if both options are set, then every container executor will end with exitCode=1. This is because DistributedShell executes the shell_command and shell_script together. For example, if shell_command is 'date' then the final command executed in the container is date `ExecShellScript.sh`, so the date command will treat the result of ExecShellScript.sh as its parameter, and there will be an error. To solve this, the DistributedShell should not use the value of the shell_command option when the shell_script option is set, and the shell_command option should also not be mandatory. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
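The proposed behavior can be summarized with a small hedged sketch: when a script is supplied, ignore shell_command rather than concatenating the two. The method below illustrates only that decision; it is not the DistributedShell client code, and the bash invocation is an assumption.
{code}
// Illustration of the proposed option handling, not the actual patch:
// prefer the script when it is given, otherwise require a command.
public class ShellOptionSketch {
  public static String resolveCommand(String shellCommand, String shellScriptPath) {
    boolean hasScript = shellScriptPath != null && !shellScriptPath.isEmpty();
    if (hasScript) {
      // Run the localized script on its own; do not prepend shell_command,
      // which is what currently produces "date `ExecShellScript.sh`".
      return "/bin/bash " + shellScriptPath;
    }
    if (shellCommand == null || shellCommand.isEmpty()) {
      throw new IllegalArgumentException(
          "Either shell_command or shell_script must be specified");
    }
    return shellCommand;
  }
}
{code}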
[jira] [Updated] (YARN-54) AggregatedLogFormat should be marked Private / Unstable
[ https://issues.apache.org/jira/browse/YARN-54?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-54: Issue Type: Sub-task (was: Bug) Parent: YARN-386 AggregatedLogFormat should be marked Private / Unstable --- Key: YARN-54 URL: https://issues.apache.org/jira/browse/YARN-54 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.0.0-alpha Reporter: Jason Lowe Assignee: Siddharth Seth Priority: Trivial Attachments: YARN54.txt AggregatedLogFormat is still in a state of flux, so we should mark it as Private / Unstable for clarity. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
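Marking a class Private/Unstable in Hadoop is done with the standard classification annotations; a short example of how AggregatedLogFormat would be tagged is below, with the class body elided.
{code}
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

// Marks the format as internal and still subject to change.
@InterfaceAudience.Private
@InterfaceStability.Unstable
public class AggregatedLogFormat {
  // ... existing implementation unchanged ...
}
{code}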
[jira] [Created] (YARN-547) New resource localization is tried even when Localized Resource is in DOWNLOADING state
Omkar Vinit Joshi created YARN-547: -- Summary: New resource localization is tried even when Localized Resource is in DOWNLOADING state Key: YARN-547 URL: https://issues.apache.org/jira/browse/YARN-547 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi At present, when multiple containers request a localized resource: 1) If the resource is not present then it is first created and resource localization starts (the LocalizedResource is in the DOWNLOADING state). 2) If, while in this state, multiple ResourceRequestEvents come in, then ResourceLocalizationEvents are fired for all of them. Most of the time this does not result in a duplicate resource download, but there is a race condition present there. Location: ResourceLocalizationService.addResource .. addition of the request into attempts in case an event already exists. The root cause for this is the presence of FetchResourceTransition on receiving a ResourceRequestEvent in the DOWNLOADING state. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
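A hedged sketch of the suggested guard: when a request arrives while the resource is already DOWNLOADING, only record the waiting container instead of firing another fetch. The enum and method names are invented for illustration; the real state machine lives in LocalizedResource/ResourceLocalizationService.
{code}
import java.util.ArrayList;
import java.util.List;

// Illustration only: avoid kicking off a second download for a resource that
// is already being fetched.
public class DownloadGuardSketch {
  enum State { INIT, DOWNLOADING, LOCALIZED }

  private State state = State.INIT;
  private final List<String> waitingContainers = new ArrayList<>();

  public synchronized void onResourceRequest(String containerId) {
    waitingContainers.add(containerId);
    if (state == State.INIT) {
      state = State.DOWNLOADING;
      startDownload();          // only the first request triggers a fetch
    }
    // In the DOWNLOADING state the request is merely queued -- no new fetch,
    // which removes the race described in this JIRA.
  }

  private void startDownload() {
    // placeholder for handing the resource to a localizer
  }
}
{code}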