[jira] [Created] (YARN-892) Resource Manager throws InvalidStateTransitonException: Invalid event: CONTAINER_FINISHED at ALLOCATED
Devaraj K created YARN-892:
-------------------------------

             Summary: Resource Manager throws InvalidStateTransitonException: Invalid event: CONTAINER_FINISHED at ALLOCATED
                 Key: YARN-892
                 URL: https://issues.apache.org/jira/browse/YARN-892
             Project: Hadoop YARN
          Issue Type: Sub-task
          Components: resourcemanager
    Affects Versions: 2.0.5-alpha
            Reporter: Devaraj K
            Assignee: Devaraj K

{code:xml}
2013-06-28 18:18:59,255 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: CONTAINER_FINISHED at ALLOCATED
	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:627)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:99)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:495)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:476)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
	at java.lang.Thread.run(Thread.java:662)
{code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
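The stack trace above comes from YARN's table-driven state machine: transitions are registered per (state, event) pair, and an event arriving in a state with no registered transition for it is rejected with "Invalid event: X at Y". A minimal sketch of that pattern (illustrative only; the class and enum names here are hypothetical, not the actual StateMachineFactory API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a table-driven state machine: transitions are registered per
// (state, event) pair; an unregistered pair raises an "invalid event" error,
// mirroring InvalidStateTransitonException in the trace above.
class AttemptStateMachine {
    enum State { ALLOCATED, LAUNCHED, FINISHED }
    enum Event { LAUNCH, CONTAINER_FINISHED }

    private final Map<State, Map<Event, State>> table = new HashMap<>();
    private State current = State.ALLOCATED;

    void addTransition(State from, Event on, State to) {
        table.computeIfAbsent(from, s -> new HashMap<>()).put(on, to);
    }

    void handle(Event event) {
        Map<Event, State> legal = table.get(current);
        if (legal == null || !legal.containsKey(event)) {
            // Corresponds to: Invalid event: CONTAINER_FINISHED at ALLOCATED
            throw new IllegalStateException(
                "Invalid event: " + event + " at " + current);
        }
        current = legal.get(event);
    }

    State getCurrentState() { return current; }
}
```

In this sketch, a CONTAINER_FINISHED event delivered while the attempt is still ALLOCATED (no transition registered for that pair) throws, which is exactly the shape of the bug reported here.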
[jira] [Commented] (YARN-888) clean up POM dependencies
     [ https://issues.apache.org/jira/browse/YARN-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696842#comment-13696842 ]

Timothy St. Clair commented on YARN-888:
----------------------------------------

[~tucu00], I have a series of tickets relating to this, and I'm wondering if it makes sense to use this one as an umbrella and tree off it.

clean up POM dependencies
-------------------------

                 Key: YARN-888
                 URL: https://issues.apache.org/jira/browse/YARN-888
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 2.1.0-beta
            Reporter: Alejandro Abdelnur

Intermediate 'pom' modules define dependencies that are inherited by the leaf modules. This is causing issues in the IntelliJ IDE. We should normalize the leaf modules as in common, hdfs and tools, where all dependencies are defined in each leaf module and the intermediate 'pom' modules do not define any dependency.
[jira] [Resolved] (YARN-892) Resource Manager throws InvalidStateTransitonException: Invalid event: CONTAINER_FINISHED at ALLOCATED
     [ https://issues.apache.org/jira/browse/YARN-892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe resolved YARN-892.
-----------------------------
    Resolution: Duplicate

Resource Manager throws InvalidStateTransitonException: Invalid event: CONTAINER_FINISHED at ALLOCATED
------------------------------------------------------------------------------------------------------

                 Key: YARN-892
                 URL: https://issues.apache.org/jira/browse/YARN-892
             Project: Hadoop YARN
          Issue Type: Sub-task
          Components: resourcemanager
    Affects Versions: 2.0.5-alpha
            Reporter: Devaraj K
            Assignee: Devaraj K
[jira] [Updated] (YARN-862) ResourceManager and NodeManager versions should match on node registration or error out
     [ https://issues.apache.org/jira/browse/YARN-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated YARN-862:
-------------------------------
    Target Version/s: 0.23.10  (was: 0.23.9)

ResourceManager and NodeManager versions should match on node registration or error out
---------------------------------------------------------------------------------------

                 Key: YARN-862
                 URL: https://issues.apache.org/jira/browse/YARN-862
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager, resourcemanager
    Affects Versions: 0.23.8
            Reporter: Robert Parker
            Assignee: Robert Parker
         Attachments: YARN-862-b0.23-v1.patch, YARN-862-b0.23-v2.patch

For branch-0.23 the versions of the node manager and the resource manager should match to complete a successful registration.
[jira] [Updated] (YARN-556) RM Restart phase 2 - Work preserving restart
     [ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated YARN-556:
----------------------------
    Issue Type: New Feature  (was: Sub-task)
        Parent: (was: YARN-128)

RM Restart phase 2 - Work preserving restart
--------------------------------------------

                 Key: YARN-556
                 URL: https://issues.apache.org/jira/browse/YARN-556
             Project: Hadoop YARN
          Issue Type: New Feature
          Components: resourcemanager
            Reporter: Bikas Saha
            Assignee: Bikas Saha
              Labels: gsoc2013

The basic idea is already documented in YARN-128. This JIRA will describe further details.
[jira] [Commented] (YARN-149) ResourceManager (RM) High-Availability (HA)
     [ https://issues.apache.org/jira/browse/YARN-149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697005#comment-13697005 ]

Bikas Saha commented on YARN-149:
---------------------------------

I will be posting a short design/road-map document shortly. If anyone has ideas, notes etc., then please start posting them so that I can consolidate. Overall, most of the tools and interfaces are already available in common via the HDFS HA project. The work will mainly be around integrating them with YARN/RM.

ResourceManager (RM) High-Availability (HA)
-------------------------------------------

                 Key: YARN-149
                 URL: https://issues.apache.org/jira/browse/YARN-149
             Project: Hadoop YARN
          Issue Type: New Feature
            Reporter: Harsh J
            Assignee: Bikas Saha

One of the goals presented in MAPREDUCE-279 was high availability. One approach that was discussed, per Mahadev and others on https://issues.apache.org/jira/browse/MAPREDUCE-2648 and in other places, was ZK:

{quote}
Am not sure, if you already know about the MR-279 branch (the next version of MR framework). We've been trying to integrate ZK into the framework from the beginning. As for now, we are just doing restart with ZK but soon we should have a HA soln with ZK.
{quote}

There is now MAPREDUCE-4343 that tracks recoverability via ZK. This JIRA is meant to track HA via ZK. Currently there isn't an HA solution for the RM, via ZK or otherwise.
[jira] [Updated] (YARN-128) RM Restart
     [ https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated YARN-128:
----------------------------
    Summary: RM Restart  (was: RM Restart )

RM Restart
----------

                 Key: YARN-128
                 URL: https://issues.apache.org/jira/browse/YARN-128
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.0.0-alpha
            Reporter: Arun C Murthy
            Assignee: Bikas Saha
         Attachments: MR-4343.1.patch, restart-12-11-zkstore.patch, restart-fs-store-11-17.patch, restart-zk-store-11-17.patch, RM-recovery-initial-thoughts.txt, RMRestartPhase1.pdf, YARN-128.full-code.3.patch, YARN-128.full-code-4.patch, YARN-128.full-code.5.patch, YARN-128.new-code-added.3.patch, YARN-128.new-code-added-4.patch, YARN-128.old-code-removed.3.patch, YARN-128.old-code-removed.4.patch, YARN-128.patch

We should resurrect 'RM Restart', which we disabled sometime during the RM refactor.
[jira] [Updated] (YARN-814) Difficult to diagnose a failed container launch when error due to invalid environment variable
     [ https://issues.apache.org/jira/browse/YARN-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jian He updated YARN-814:
-------------------------
    Attachment: (was: YARN-814.3.patch)

Difficult to diagnose a failed container launch when error due to invalid environment variable
----------------------------------------------------------------------------------------------

                 Key: YARN-814
                 URL: https://issues.apache.org/jira/browse/YARN-814
             Project: Hadoop YARN
          Issue Type: Sub-task
            Reporter: Hitesh Shah
            Assignee: Jian He
         Attachments: YARN-814.1.patch, YARN-814.2.patch, YARN-814.3.patch, YARN-814.patch

The container's launch script sets up environment variables, symlinks etc. If there is any failure when setting up this basic context (before the actual user's process is launched), nothing is captured by the NM. This makes it impossible to diagnose the reason for the failure. To reproduce, set an env var whose value contains characters that trigger syntax errors in bash.
[jira] [Updated] (YARN-814) Difficult to diagnose a failed container launch when error due to invalid environment variable
     [ https://issues.apache.org/jira/browse/YARN-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jian He updated YARN-814:
-------------------------
    Attachment: YARN-814.3.patch

Difficult to diagnose a failed container launch when error due to invalid environment variable
----------------------------------------------------------------------------------------------

                 Key: YARN-814
                 URL: https://issues.apache.org/jira/browse/YARN-814
             Project: Hadoop YARN
          Issue Type: Sub-task
            Reporter: Hitesh Shah
            Assignee: Jian He
         Attachments: YARN-814.1.patch, YARN-814.2.patch, YARN-814.3.patch, YARN-814.patch
[jira] [Commented] (YARN-814) Difficult to diagnose a failed container launch when error due to invalid environment variable
     [ https://issues.apache.org/jira/browse/YARN-814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697049#comment-13697049 ]

Hadoop QA commented on YARN-814:
--------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12590275/YARN-814.3.patch
  against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests.
    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1412//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1412//console

This message is automatically generated.

Difficult to diagnose a failed container launch when error due to invalid environment variable
----------------------------------------------------------------------------------------------

                 Key: YARN-814
                 URL: https://issues.apache.org/jira/browse/YARN-814
             Project: Hadoop YARN
          Issue Type: Sub-task
            Reporter: Hitesh Shah
            Assignee: Jian He
         Attachments: YARN-814.1.patch, YARN-814.2.patch, YARN-814.3.patch, YARN-814.patch
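The reproduction step for YARN-814 (an env var value that breaks the launch script before the user's process starts) can be sketched as a stand-alone program. This is a hypothetical repro, not NM code; it assumes a POSIX /bin/sh and shows why capturing the script's output is what makes the failure diagnosable:

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical repro of a container launch script failing during env setup:
// an unquoted env value containing shell metacharacters makes the script
// itself fail before the "user process" (the echo) is ever exec'd.
// Capturing the script's output is what makes such failures diagnosable.
class LaunchScriptRepro {
    static String runScript(String envValue) throws Exception {
        Path script = Files.createTempFile("launch", ".sh");
        // Unquoted value, as a naively generated launch script might emit it.
        Files.write(script, ("export MY_VAR=" + envValue + "\nexec echo started\n")
            .getBytes(StandardCharsets.UTF_8));
        Process p = new ProcessBuilder("/bin/sh", script.toString())
            .redirectErrorStream(true)   // merge stderr so the error is captured
            .start();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        InputStream is = p.getInputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = is.read(buf)) > 0) out.write(buf, 0, n);
        int code = p.waitFor();
        Files.deleteIfExists(script);
        return "exit=" + code + " output=" + out.toString("UTF-8");
    }
}
```

With a benign value the script prints "started" and exits 0; with a value like `oops; (` the shell hits a syntax error, the user process never runs, and only the captured output explains why.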
[jira] [Updated] (YARN-864) YARN NM leaking containers with CGroups
     [ https://issues.apache.org/jira/browse/YARN-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Omkar Vinit Joshi updated YARN-864:
-----------------------------------
    Assignee: Jian He

YARN NM leaking containers with CGroups
---------------------------------------

                 Key: YARN-864
                 URL: https://issues.apache.org/jira/browse/YARN-864
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.0.5-alpha
         Environment: YARN 2.0.5-alpha with patches applied for YARN-799 and YARN-600.
            Reporter: Chris Riccomini
            Assignee: Jian He
         Attachments: rm-log, YARN-864.1.patch, YARN-864.2.patch

Hey Guys,

I'm running YARN 2.0.5-alpha with CGroups and the stateful RM turned on, and I'm seeing containers getting leaked by the NMs. I'm not quite sure what's going on -- has anyone seen this before? I'm concerned that maybe it's a misunderstanding on my part about how YARN's lifecycle works.

When I look in my AM logs for my app (not an MR app master), I see:

2013-06-19 05:34:22 AppMasterTaskManager [INFO] Got an exit code of -100. This means that container container_1371141151815_0008_03_02 was killed by YARN, either due to being released by the application master or being 'lost' due to node failures etc.
2013-06-19 05:34:22 AppMasterTaskManager [INFO] Released container container_1371141151815_0008_03_02 was assigned task ID 0. Requesting a new container for the task.

The AM has been running steadily the whole time. Here's what the NM logs say:

{noformat}
05:34:59,783 WARN  AsyncDispatcher:109 - Interrupted Exception while stopping
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1143)
	at java.lang.Thread.join(Thread.java:1196)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107)
	at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
	at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:209)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:336)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
	at java.lang.Thread.run(Thread.java:619)
05:35:00,314 WARN  ContainersMonitorImpl:463 - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
05:35:00,434 WARN  CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0006_01_001598
05:35:00,434 WARN  CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0008_03_02
05:35:00,434 WARN  ContainerLaunch:247 - Failed to launch container.
java.io.IOException: java.lang.InterruptedException
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
	at org.apache.hadoop.util.Shell.run(Shell.java:129)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:619)
05:35:00,434 WARN  ContainerLaunch:247 - Failed to launch container.
java.io.IOException: java.lang.InterruptedException
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
	at org.apache.hadoop.util.Shell.run(Shell.java:129)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
	at
[jira] [Commented] (YARN-712) RMDelegationTokenSecretManager shouldn't start in non-secure mode
     [ https://issues.apache.org/jira/browse/YARN-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697069#comment-13697069 ]

Omkar Vinit Joshi commented on YARN-712:
----------------------------------------

Can we enable it irrespective of security, like ContainerToken?

RMDelegationTokenSecretManager shouldn't start in non-secure mode
-----------------------------------------------------------------

                 Key: YARN-712
                 URL: https://issues.apache.org/jira/browse/YARN-712
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Jian He
            Assignee: Jian He

The RM will just be doing useless work, as no tokens are issued.
[jira] [Updated] (YARN-815) Add container failure handling to distributed-shell
     [ https://issues.apache.org/jira/browse/YARN-815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-815:
-----------------------------------------
    Issue Type: Improvement  (was: Bug)

Add container failure handling to distributed-shell
---------------------------------------------------

                 Key: YARN-815
                 URL: https://issues.apache.org/jira/browse/YARN-815
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: applications/distributed-shell
            Reporter: Vinod Kumar Vavilapalli

Today, if any container fails for whatever reason, the app simply ignores it. We should handle retries, improve error reporting, etc.
[jira] [Updated] (YARN-769) Add metrics for number of containers
     [ https://issues.apache.org/jira/browse/YARN-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-769:
-----------------------------------------
    Issue Type: Improvement  (was: Bug)

Add metrics for number of containers
------------------------------------

                 Key: YARN-769
                 URL: https://issues.apache.org/jira/browse/YARN-769
             Project: Hadoop YARN
          Issue Type: Improvement
    Affects Versions: 2.0.4-alpha
            Reporter: Arun C Murthy

We should add metrics to the RM to track available (min-sized) containers.
[jira] [Updated] (YARN-772) Document ApplicationConstants for AM implementors
     [ https://issues.apache.org/jira/browse/YARN-772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-772:
-----------------------------------------
    Issue Type: Improvement  (was: Bug)

Document ApplicationConstants for AM implementors
-------------------------------------------------

                 Key: YARN-772
                 URL: https://issues.apache.org/jira/browse/YARN-772
             Project: Hadoop YARN
          Issue Type: Improvement
            Reporter: Arun C Murthy

We should document features like LOG_DIR_EXPANSION_VAR, APP_SUBMIT_TIME_ENV etc. for folks developing new applications in the WritingYarnApplications doc.
[jira] [Updated] (YARN-705) Review of Field Rules, Default Values and Sanity Check for ContainerManager
     [ https://issues.apache.org/jira/browse/YARN-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-705:
-----------------------------------------
    Issue Type: Improvement  (was: Bug)

Review of Field Rules, Default Values and Sanity Check for ContainerManager
---------------------------------------------------------------------------

                 Key: YARN-705
                 URL: https://issues.apache.org/jira/browse/YARN-705
             Project: Hadoop YARN
          Issue Type: Improvement
            Reporter: Zhijie Shen
            Assignee: Zhijie Shen

Need to do things similar to those mentioned in YARN-698.
[jira] [Updated] (YARN-710) Add to ser/deser methods to RecordFactory
     [ https://issues.apache.org/jira/browse/YARN-710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-710:
-----------------------------------------
    Issue Type: Improvement  (was: Bug)

Add to ser/deser methods to RecordFactory
-----------------------------------------

                 Key: YARN-710
                 URL: https://issues.apache.org/jira/browse/YARN-710
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: api
    Affects Versions: 2.0.4-alpha
            Reporter: Alejandro Abdelnur
            Assignee: Alejandro Abdelnur
         Attachments: YARN-710.patch, YARN-710.patch

In order to do things like AM failover and checkpointing, I need to serialize app IDs, app attempt IDs, containers and/or IDs, resource requests, etc. Because we are wrapping/hiding the PB implementation from the APIs, we are hiding the built-in PB ser/deser capabilities.
[jira] [Updated] (YARN-662) Enforce required parameters for all the protocols
     [ https://issues.apache.org/jira/browse/YARN-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-662:
-----------------------------------------
    Issue Type: Bug  (was: Sub-task)
        Parent: (was: YARN-386)

Enforce required parameters for all the protocols
-------------------------------------------------

                 Key: YARN-662
                 URL: https://issues.apache.org/jira/browse/YARN-662
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Siddharth Seth
            Assignee: Zhijie Shen

All proto fields are marked as optional. We need to mark some of them as required, or enforce this server side. Server side is likely better since that's more flexible (example: deprecating a field type in favour of another - either of the two must be present).
[jira] [Updated] (YARN-704) Review of Field Rules, Default Values and Sanity Check for AMRMProtocol
     [ https://issues.apache.org/jira/browse/YARN-704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-704:
-----------------------------------------
    Issue Type: Sub-task  (was: Bug)
        Parent: YARN-662

Review of Field Rules, Default Values and Sanity Check for AMRMProtocol
-----------------------------------------------------------------------

                 Key: YARN-704
                 URL: https://issues.apache.org/jira/browse/YARN-704
             Project: Hadoop YARN
          Issue Type: Sub-task
            Reporter: Zhijie Shen
            Assignee: Zhijie Shen

Need to do things similar to those mentioned in YARN-698.
[jira] [Updated] (YARN-703) Review of Field Rules, Default Values and Sanity Check for RMAdminProtocol
     [ https://issues.apache.org/jira/browse/YARN-703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-703:
-----------------------------------------
    Issue Type: Sub-task  (was: Bug)
        Parent: YARN-662

Review of Field Rules, Default Values and Sanity Check for RMAdminProtocol
--------------------------------------------------------------------------

                 Key: YARN-703
                 URL: https://issues.apache.org/jira/browse/YARN-703
             Project: Hadoop YARN
          Issue Type: Sub-task
            Reporter: Zhijie Shen
            Assignee: Zhijie Shen

Need to do things similar to those mentioned in YARN-698.
[jira] [Updated] (YARN-705) Review of Field Rules, Default Values and Sanity Check for ContainerManager
     [ https://issues.apache.org/jira/browse/YARN-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-705:
-----------------------------------------
    Issue Type: Sub-task  (was: Improvement)
        Parent: YARN-662

Review of Field Rules, Default Values and Sanity Check for ContainerManager
---------------------------------------------------------------------------

                 Key: YARN-705
                 URL: https://issues.apache.org/jira/browse/YARN-705
             Project: Hadoop YARN
          Issue Type: Sub-task
            Reporter: Zhijie Shen
            Assignee: Zhijie Shen

Need to do things similar to those mentioned in YARN-698.
[jira] [Updated] (YARN-662) Enforce required parameters for all the protocols
     [ https://issues.apache.org/jira/browse/YARN-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-662:
-----------------------------------------
    Issue Type: Improvement  (was: Bug)

Enforce required parameters for all the protocols
-------------------------------------------------

                 Key: YARN-662
                 URL: https://issues.apache.org/jira/browse/YARN-662
             Project: Hadoop YARN
          Issue Type: Improvement
            Reporter: Siddharth Seth
            Assignee: Zhijie Shen

All proto fields are marked as optional. We need to mark some of them as required, or enforce this server side. Server side is likely better since that's more flexible (example: deprecating a field type in favour of another - either of the two must be present).
[jira] [Updated] (YARN-641) Make AMLauncher in RM Use NMClient
     [ https://issues.apache.org/jira/browse/YARN-641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-641:
-----------------------------------------
    Issue Type: Improvement  (was: Bug)

Make AMLauncher in RM Use NMClient
----------------------------------

                 Key: YARN-641
                 URL: https://issues.apache.org/jira/browse/YARN-641
             Project: Hadoop YARN
          Issue Type: Improvement
            Reporter: Zhijie Shen
            Assignee: Zhijie Shen
         Attachments: YARN-641.1.patch, YARN-641.2.patch, YARN-641.3.patch

YARN-422 adds NMClient. The RM's AMLauncher is responsible for the interactions with an application's AM container. AMLauncher should also replace the raw ContainerManager proxy with NMClient.
[jira] [Updated] (YARN-698) Review of Field Rules, Default Values and Sanity Check for ClientRMProtocol
     [ https://issues.apache.org/jira/browse/YARN-698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-698:
-----------------------------------------
    Issue Type: Sub-task  (was: Bug)
        Parent: YARN-662

Review of Field Rules, Default Values and Sanity Check for ClientRMProtocol
---------------------------------------------------------------------------

                 Key: YARN-698
                 URL: https://issues.apache.org/jira/browse/YARN-698
             Project: Hadoop YARN
          Issue Type: Sub-task
            Reporter: Zhijie Shen
            Assignee: Zhijie Shen

We need to check the fields of the protos used by ClientRMProtocol (recursively) to clarify the following:

1. Whether the field should be required or optional.
2. What the default value should be if the field is optional.
3. Whether a sanity check is required to validate the input value against the field's value domain.
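The three checks listed in YARN-698 (required vs. optional, defaults for optionals, and domain sanity checks) can be sketched as a server-side validator. The request type and its fields below are hypothetical placeholders, not the actual ClientRMProtocol records:

```java
// Sketch of a server-side field policy: each field is either required
// (reject if absent) or optional with a documented default, and values are
// checked against their domain. Hypothetical request type for illustration.
class RequestValidator {
    static class GetAppsRequest {
        Integer limit;        // optional, default 100, must be > 0
        String queue;         // optional, default "default"
        String applicationId; // required
    }

    static void validate(GetAppsRequest r) {
        // 1. Required field: reject if absent.
        if (r.applicationId == null || r.applicationId.isEmpty()) {
            throw new IllegalArgumentException("applicationId is required");
        }
        // 2. Optional fields: fill documented defaults.
        if (r.limit == null) r.limit = 100;
        if (r.queue == null) r.queue = "default";
        // 3. Sanity check against the field's value domain.
        if (r.limit <= 0) {
            throw new IllegalArgumentException("limit must be positive: " + r.limit);
        }
    }
}
```

Doing this on the server side (rather than marking proto fields required) matches the flexibility argument made in YARN-662: a deprecated field and its replacement can both be optional on the wire while the server enforces "at least one present".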
[jira] [Updated] (YARN-431) [Umbrella] Complete/Stabilize YARN application log-aggregation
     [ https://issues.apache.org/jira/browse/YARN-431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-431:
-----------------------------------------
    Issue Type: Task  (was: Bug)

[Umbrella] Complete/Stabilize YARN application log-aggregation
--------------------------------------------------------------

                 Key: YARN-431
                 URL: https://issues.apache.org/jira/browse/YARN-431
             Project: Hadoop YARN
          Issue Type: Task
            Reporter: Vinod Kumar Vavilapalli
[jira] [Updated] (YARN-399) Add an out of band heartbeat damper
     [ https://issues.apache.org/jira/browse/YARN-399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-399:
-----------------------------------------
    Issue Type: Improvement  (was: Bug)

Add an out of band heartbeat damper
-----------------------------------

                 Key: YARN-399
                 URL: https://issues.apache.org/jira/browse/YARN-399
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: nodemanager
    Affects Versions: 0.23.6
            Reporter: Thomas Graves
            Assignee: Thomas Graves
         Attachments: YARN-399.PATCH

We are seeing issues with the scheduler queue backing up on the RM. We have the nodemanager heartbeats set at 5 seconds, which should be more than long enough for the number of apps we are running. We believe this is due to the out of band heartbeats of the nodemanager coming too soon when we have jobs with lots of containers that finish quickly. To help with that, we could add an out of band heartbeat damper to the nodemanager, similar to what 1.x Tasktrackers have. MAPREDUCE-2355 added it in 1.x.
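The damper described in YARN-399 (and in MAPREDUCE-2355 for 1.x TaskTrackers) boils down to enforcing a minimum interval between heartbeats. A minimal sketch, with an illustrative interval and the simplifying assumption that a suppressed out-of-band heartbeat can just be folded into the next regular one:

```java
// Sketch of a heartbeat "damper": out-of-band heartbeats (e.g. triggered by a
// container finishing) are suppressed if they would fire too soon after the
// previous heartbeat, protecting the RM scheduler queue from bursts.
// Illustrative design, not the YARN-399 patch.
class HeartbeatDamper {
    private final long minIntervalMs;
    private long lastHeartbeatMs;
    private boolean sentAny = false;

    HeartbeatDamper(long minIntervalMs) {
        this.minIntervalMs = minIntervalMs;
    }

    /** Returns true if a heartbeat may be sent at time nowMs. */
    synchronized boolean tryHeartbeat(long nowMs) {
        if (sentAny && nowMs - lastHeartbeatMs < minIntervalMs) {
            return false; // too soon: fold into the next regular heartbeat
        }
        lastHeartbeatMs = nowMs;
        sentAny = true;
        return true;
    }
}
```

With a 1-second damper, a burst of quickly finishing containers produces at most one out-of-band heartbeat per second instead of one per container.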
[jira] [Updated] (YARN-437) Update documentation of Writing Yarn Applications to match current best practices
[ https://issues.apache.org/jira/browse/YARN-437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-437: - Issue Type: Improvement (was: Bug) Update documentation of Writing Yarn Applications to match current best practices --- Key: YARN-437 URL: https://issues.apache.org/jira/browse/YARN-437 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Hitesh Shah Assignee: Eli Reisman Attachments: YARN-437-1.patch, YARN-437-2.patch, YARN-437-3.patch Should fix docs to point to usage of YarnClient and AMRMClient helper libs.
[jira] [Updated] (YARN-436) Document how to use DistributedShell yarn application
[ https://issues.apache.org/jira/browse/YARN-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-436: - Issue Type: Improvement (was: Bug) Document how to use DistributedShell yarn application - Key: YARN-436 URL: https://issues.apache.org/jira/browse/YARN-436 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Hitesh Shah Assignee: Hitesh Shah
[jira] [Commented] (YARN-149) ResourceManager (RM) High-Availability (HA)
[ https://issues.apache.org/jira/browse/YARN-149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697165#comment-13697165 ] Karthik Kambatla commented on YARN-149: --- Sounds good, thanks Bikas. I have also been thinking about this and working on a draft. Will get it into shape and attach it here. ResourceManager (RM) High-Availability (HA) --- Key: YARN-149 URL: https://issues.apache.org/jira/browse/YARN-149 Project: Hadoop YARN Issue Type: New Feature Reporter: Harsh J Assignee: Bikas Saha This jira tracks the work needed to support one RM instance failing over to another RM instance so that we can have RM HA. Work includes leader election, transfer of control to the leader, and client re-direction to the new leader.
[jira] [Commented] (YARN-675) In YarnClient, pull AM logs on AM container failure
[ https://issues.apache.org/jira/browse/YARN-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697185#comment-13697185 ] Zhijie Shen commented on YARN-675: -- [~sandyr], would you mind my taking this ticket over? We're trying to push the better error reporting tickets to be fixed ASAP. Thanks! In YarnClient, pull AM logs on AM container failure --- Key: YARN-675 URL: https://issues.apache.org/jira/browse/YARN-675 Project: Hadoop YARN Issue Type: Sub-task Components: client Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Similar to MAPREDUCE-4362, when an AM container fails, it would be helpful to pull its logs from the NM to the client so that they can be displayed immediately to the user.
[jira] [Commented] (YARN-675) In YarnClient, pull AM logs on AM container failure
[ https://issues.apache.org/jira/browse/YARN-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697209#comment-13697209 ] Sandy Ryza commented on YARN-675: - [~zjshen], thanks for the help, feel free to take it over. We're also trying to get these in ASAP. My delay in working on it has been that it depends on YARN-649, so any feedback there would help move things forward as well.
[jira] [Commented] (YARN-353) Add Zookeeper-based store implementation for RMStateStore
[ https://issues.apache.org/jira/browse/YARN-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697224#comment-13697224 ] Jian He commented on YARN-353: -- I'm taking this over Add Zookeeper-based store implementation for RMStateStore - Key: YARN-353 URL: https://issues.apache.org/jira/browse/YARN-353 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Hitesh Shah Assignee: Bikas Saha Attachments: YARN-353.1.patch Add a store that writes RM state data to ZK
[jira] [Commented] (YARN-814) Difficult to diagnose a failed container launch when error due to invalid environment variable
[ https://issues.apache.org/jira/browse/YARN-814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697240#comment-13697240 ] Hitesh Shah commented on YARN-814: -- Comments: Why is shExec.getOutput() being ignored (and replaced with exception.getMessage())? Have you run this with a test script that emits information both to stdout and stderr?
{code}
+ LOG.warn("Exception from container-launch with container ID: "
+     + containerId + " and exit code: " + exitCode, e);
+ logOutput(e.getMessage());
{code}
- logging the exception twice? - logOutput() does not seem to log any contextual information - have you looked at the NM logs to see if it actually provides useful debugging information when running multiple containers at the same time?
{code}
  LOG.warn("Exit code from container is : " + exitCode);
- logOutput(shExec.getOutput());
+ logOutput(e.getMessage());
{code}
- Earlier comment about the LOG.warn not being useful not addressed?
{code}
  throw new IOException("App initialization failed (" + exitCode +
-     ") with output: " + shExec.getOutput(), e);
+     ") with output: " + e.getMessage(), e);
{code}
- The exception e is already being passed. Why the need to add e.getMessage() too? Difficult to diagnose a failed container launch when error due to invalid environment variable -- Key: YARN-814 URL: https://issues.apache.org/jira/browse/YARN-814 Project: Hadoop YARN Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Jian He Attachments: YARN-814.1.patch, YARN-814.2.patch, YARN-814.3.patch, YARN-814.patch The container's launch script sets up environment variables, symlinks, etc. If there is any failure when setting up the basic context (before the actual user's process is launched), nothing is captured by the NM. This makes it impossible to diagnose the reason for the failure. To reproduce, set an env var where the value contains characters that throw syntax errors in bash.
[jira] [Assigned] (YARN-661) NM fails to cleanup local directories for users
[ https://issues.apache.org/jira/browse/YARN-661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi reassigned YARN-661: -- Assignee: Omkar Vinit Joshi NM fails to cleanup local directories for users --- Key: YARN-661 URL: https://issues.apache.org/jira/browse/YARN-661 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.0-beta, 0.23.8 Reporter: Jason Lowe Assignee: Omkar Vinit Joshi YARN-71 added deletion of local directories on startup, but in practice it fails to delete the directories because of permission problems. The top-level usercache directory is owned by the user but is in a directory that is not writable by the user. Therefore the deletion of the user's usercache directory, as the user, fails due to lack of permissions.
[jira] [Commented] (YARN-661) NM fails to cleanup local directories for users
[ https://issues.apache.org/jira/browse/YARN-661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697277#comment-13697277 ] Omkar Vinit Joshi commented on YARN-661: Taking this over. Just reproduced this issue on a secured cluster. It exists and needs to be fixed.
[jira] [Commented] (YARN-661) NM fails to cleanup local directories for users
[ https://issues.apache.org/jira/browse/YARN-661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697320#comment-13697320 ] Omkar Vinit Joshi commented on YARN-661: I guess we need two features in the deletion service: * A way for the user to specify that all the subdirectories and files inside a parent directory should be deleted, but not the parent directory itself. * A way to define dependencies between deletion tasks. For example, we need to delete the usercache files before actually deleting the parent usercache directory itself.
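The first feature requested above (delete everything inside a parent directory without removing the parent itself) can be sketched as follows. This is an illustrative sketch only; the real NM DeletionService works differently (and runs as a different user). It does show the ordering constraint from the second bullet: children are always removed before their parent.

```java
import java.io.File;

// Illustrative sketch: delete all contents of a directory but keep the
// directory itself, removing children before their parent (post-order),
// which is the dependency ordering described in the comment above.
public class DeleteContents {

    // Deletes everything inside 'parent' but leaves 'parent' in place.
    public static void deleteContents(File parent) {
        File[] children = parent.listFiles();
        if (children == null) return; // not a directory, or I/O error
        for (File child : children) {
            deleteRecursively(child);
        }
    }

    private static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        f.delete(); // a directory is empty by the time we get here
    }
}
```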
[jira] [Commented] (YARN-814) Difficult to diagnose a failed container launch when error due to invalid environment variable
[ https://issues.apache.org/jira/browse/YARN-814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697377#comment-13697377 ] Hitesh Shah commented on YARN-814: -- There is no guarantee that shExec.getOutput() will always be empty. For example:
{code}
echo About to run invalid command
./run_invalid_command.sh
{code}
The above should generate output both on stdout and stderr. The patch seems to be throwing away potentially valid output that may be useful for debugging. It seems like you need to capture both stdout and stderr information.
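The point above is that a launched script can write to both streams, so both need to be captured. A minimal sketch, assuming a POSIX shell is available: here stderr is simply merged into stdout via ProcessBuilder; a real implementation might instead read the two streams separately on two threads to keep them apart. This is not how Hadoop's Shell class does it.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Illustrative sketch: run a command and capture everything it writes to
// stdout AND stderr, by folding stderr into the stdout pipe. Without this
// (or a second reader thread), the stderr half of the output is lost.
public class CaptureOutput {

    public static String runAndCapture(String... command) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // fold stderr into stdout
        Process p = pb.start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        p.waitFor();
        return out.toString();
    }
}
```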
[jira] [Created] (YARN-894) NodeHealthScriptRunner timeout checking is inaccurate on Windows
Chuan Liu created YARN-894: -- Summary: NodeHealthScriptRunner timeout checking is inaccurate on Windows Key: YARN-894 URL: https://issues.apache.org/jira/browse/YARN-894 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.1.0-beta Reporter: Chuan Liu Assignee: Chuan Liu Priority: Minor In the {{NodeHealthScriptRunner}} method, we set the HealthChecker status based on the Shell execution results. Some statuses are based on the exception thrown during the Shell script execution. Currently, we catch a non-ExitCodeException from ShellCommandExecutor, and if Shell has the timeout status set at the same time, we also set the HealthChecker status to timeout. We have the following execution sequence in Shell: 1) In the main thread, schedule a delayed timer task that will kill the original process upon timeout. 2) In the main thread, open a buffered reader on the process's standard output stream. 3) When the timeout happens, the timer task calls {{Process#destroy()}} to kill the main process. On Linux, when the timeout happens and the process is killed, the buffered reader throws an IOException with the message Stream closed in the main thread. On Windows, we don't get the IOException; only -1 is returned from the reader, indicating the stream is finished. As a result, the timeout status is not set on Windows, and {{TestNodeHealthService}} fails on Windows because of this.
[jira] [Updated] (YARN-894) NodeHealthScriptRunner timeout checking is inaccurate on Windows
[ https://issues.apache.org/jira/browse/YARN-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuan Liu updated YARN-894: --- Attachment: wait.sh wait.cmd ReadProcessStdout.java Attaching a Java file that verifies the above description. When executed on Windows, we have the following result:
{noformat}
C:\Users\chuanliu\Documents>java ReadProcessStdout wait.cmd
Process was destroyed!
-1
exit code: 1
{noformat}
On Linux, the results look like the following:
{noformat}
~$ java ReadProcessStdout ./wait.sh
Process was destroyed!
-1
Stream closed
java.io.IOException: Stream closed
        at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:145)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:308)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:306)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
        at java.io.InputStreamReader.read(InputStreamReader.java:167)
        at java.io.BufferedReader.fill(BufferedReader.java:136)
        at java.io.BufferedReader.readLine(BufferedReader.java:299)
        at java.io.BufferedReader.readLine(BufferedReader.java:362)
        at ReadProcessStdout.main(ReadProcessStdout.java:25)
{noformat}
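A reconstruction in the spirit of the attached ReadProcessStdout.java (the actual attachment may differ): start a long-running process, destroy it from a timer task, and observe how the blocked reader ends. As described above, the outcome is platform- and JDK-dependent: a plain end-of-stream on some platforms, an IOException on others, which is exactly why timeout detection cannot rely on the exception alone.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Timer;
import java.util.TimerTask;

// Illustrative sketch: read a process's stdout while a timer kills the
// process, and report whether the reader ended with EOF or an IOException.
// Mirrors the two outcomes (Windows vs. Linux) discussed in the ticket.
public class DestroyWhileReading {

    // Returns "EOF" if the reader saw end-of-stream, or "IOException: <msg>"
    // if the blocked read was interrupted by an exception.
    public static String readUntilKilled(long killAfterMs, String... command)
            throws Exception {
        final Process p = new ProcessBuilder(command).start();
        Timer timer = new Timer(true); // daemon timer, like Shell's timeout task
        timer.schedule(new TimerTask() {
            @Override public void run() { p.destroy(); }
        }, killAfterMs);
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            while (r.readLine() != null) { /* discard output */ }
            return "EOF";
        } catch (IOException e) {
            return "IOException: " + e.getMessage();
        }
    }
}
```

Because the outcome differs by platform, callers that need a reliable timeout signal should track the timer state explicitly rather than infer it from how the read ended.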
[jira] [Updated] (YARN-894) NodeHealthScriptRunner timeout checking is inaccurate on Windows
[ https://issues.apache.org/jira/browse/YARN-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuan Liu updated YARN-894: --- Attachment: YARN-894-trunk.patch Attaching a patch that fixes the above issue on Windows. Also changing the test to use a different command for 'sleep' and a different shell script extension on Windows.
[jira] [Updated] (YARN-353) Add Zookeeper-based store implementation for RMStateStore
[ https://issues.apache.org/jira/browse/YARN-353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-353: - Attachment: YARN-353.2.patch Rebased the patch and added an RMDelegationToken restore implementation for the ZKStateStore.
[jira] [Commented] (YARN-894) NodeHealthScriptRunner timeout checking is inaccurate on Windows
[ https://issues.apache.org/jira/browse/YARN-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697407#comment-13697407 ] Hadoop QA commented on YARN-894: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12590349/YARN-894-trunk.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1413//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1413//console This message is automatically generated.
[jira] [Commented] (YARN-353) Add Zookeeper-based store implementation for RMStateStore
[ https://issues.apache.org/jira/browse/YARN-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697417#comment-13697417 ] Hadoop QA commented on YARN-353: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12590350/YARN-353.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1414//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/1414//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1414//console This message is automatically generated. 
[jira] [Commented] (YARN-675) In YarnClient, pull AM logs on AM container failure
[ https://issues.apache.org/jira/browse/YARN-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697493#comment-13697493 ] Zhijie Shen commented on YARN-675: -- Taking it over. Thanks!
[jira] [Assigned] (YARN-675) In YarnClient, pull AM logs on AM container failure
[ https://issues.apache.org/jira/browse/YARN-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen reassigned YARN-675: Assignee: Zhijie Shen
[jira] [Commented] (YARN-873) YARNClient.getApplicationReport(unknownAppId) returns a null report
[ https://issues.apache.org/jira/browse/YARN-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697513#comment-13697513 ] Xuan Gong commented on YARN-873: Throwing an exception may not be a good option. If we throw an exception, the client may think there is a problem with the command, when the command actually worked fine. Instead, we could output something like: This appId does not exist. Please use the command yarn application -list to get information on all applications. YARNClient.getApplicationReport(unknownAppId) returns a null report --- Key: YARN-873 URL: https://issues.apache.org/jira/browse/YARN-873 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.1.0-beta Reporter: Bikas Saha Assignee: Xuan Gong How can the client find out that the app does not exist?
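The client-side choice discussed above (return a user-facing message for unknown ids instead of throwing) can be modeled with a small stub. This is illustrative only: the real YarnClient API is not used here, and the lookup map simply stands in for the RM.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stub: models getApplicationReport returning null for an
// unknown appId, with the client turning that null into a helpful message
// rather than an exception, as suggested in the comment above.
public class AppLookup {
    private final Map<String, String> reports = new HashMap<>();

    public void addReport(String appId, String report) {
        reports.put(appId, report);
    }

    // Null report => unknown application: return guidance, don't throw.
    public String describe(String appId) {
        String report = reports.get(appId);
        if (report == null) {
            return "Application with id " + appId + " does not exist. "
                + "Use 'yarn application -list' to list all applications.";
        }
        return report;
    }
}
```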
[jira] [Updated] (YARN-710) Add to ser/deser methods to RecordFactory
[ https://issues.apache.org/jira/browse/YARN-710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alejandro Abdelnur updated YARN-710: Attachment: YARN-710-wip.patch Sidd, attaching a patch with your suggestion on how to get the class. However, something has changed significantly since the last patch. I've tried getting things to work again but it is plain ugly, and I don't like it at all (see the wip patch). Still, it is not working because I cannot force creation of the underlying proto. Any idea on how to untangle this? Add to ser/deser methods to RecordFactory - Key: YARN-710 URL: https://issues.apache.org/jira/browse/YARN-710 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.0.4-alpha Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Attachments: YARN-710.patch, YARN-710.patch, YARN-710-wip.patch In order to do things like AM failover and checkpointing, I need to serialize app IDs, app attempt IDs, containers and/or their IDs, resource requests, etc. Because we are wrapping/hiding the PB implementation from the APIs, we are hiding the built-in PB ser/deser capabilities.
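The kind of round trip the ticket asks for can be sketched with a stand-in record. Protobuf is what YARN actually uses under its record wrappers; the hand-rolled byte encoding below is invented purely to illustrate the serialize/deserialize shape an AM would need for checkpointing ids.

```java
import java.nio.ByteBuffer;

// Illustrative sketch: serialize a stand-in "application id" record
// (cluster timestamp + sequence number, the two fields a YARN app id
// carries) to bytes and restore it. The encoding here is invented for
// illustration; real YARN records would use their protobuf form.
public class RecordRoundTrip {

    public static byte[] serialize(long clusterTimestamp, int appId) {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES + Integer.BYTES);
        buf.putLong(clusterTimestamp);
        buf.putInt(appId);
        return buf.array();
    }

    // Returns {clusterTimestamp, appId} decoded from the byte form.
    public static long[] deserialize(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        return new long[] { buf.getLong(), buf.getInt() };
    }
}
```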