[jira] [Created] (YARN-892) Resource Manager throws InvalidStateTransitonException: Invalid event: CONTAINER_FINISHED at ALLOCATED

2013-07-01 Thread Devaraj K (JIRA)
Devaraj K created YARN-892:
--

 Summary: Resource Manager throws InvalidStateTransitonException: 
Invalid event: CONTAINER_FINISHED at ALLOCATED
 Key: YARN-892
 URL: https://issues.apache.org/jira/browse/YARN-892
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.0.5-alpha
Reporter: Devaraj K
Assignee: Devaraj K


{code:xml}
2013-06-28 18:18:59,255 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
CONTAINER_FINISHED at ALLOCATED
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:627)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:99)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:495)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:476)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
at java.lang.Thread.run(Thread.java:662)
{code}
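
For context, a minimal, self-contained sketch of the failure mode, using a 
simplified analogue of the event-driven state machine rather than the real 
{{StateMachineFactory}} API (all names below are illustrative):

{code}
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

// Simplified analogue of an event-driven state machine: an event arriving
// in a state with no registered transition for it (here CONTAINER_FINISHED
// while the attempt is still ALLOCATED) has nowhere to go, which is what
// doTransition() surfaces as InvalidStateTransitonException.
public class AttemptStateMachineSketch {
  enum State { ALLOCATED, LAUNCHED, RUNNING, FAILED }
  enum Event { LAUNCHED, CONTAINER_FINISHED }

  private final Map<State, Map<Event, State>> transitions =
      new EnumMap<State, Map<Event, State>>(State.class);
  private State current = State.ALLOCATED;

  void addTransition(State pre, Event on, State post) {
    Map<Event, State> forState = transitions.get(pre);
    if (forState == null) {
      forState = new HashMap<Event, State>();
      transitions.put(pre, forState);
    }
    forState.put(on, post);
  }

  void handle(Event event) {
    Map<Event, State> legal = transitions.get(current);
    if (legal == null || !legal.containsKey(event)) {
      throw new IllegalStateException(
          "Invalid event: " + event + " at " + current);
    }
    current = legal.get(event);
  }

  public static void main(String[] args) {
    AttemptStateMachineSketch sm = new AttemptStateMachineSketch();
    // CONTAINER_FINISHED is only registered for later states, so a container
    // that dies before the launch is acknowledged trips the error above.
    sm.addTransition(State.LAUNCHED, Event.CONTAINER_FINISHED, State.FAILED);
    sm.handle(Event.CONTAINER_FINISHED); // throws: Invalid event ... at ALLOCATED
  }
}
{code}

The fix is presumably to register a transition (or an explicit ignore) for 
CONTAINER_FINISHED while the attempt is in ALLOCATED.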

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-888) clean up POM dependencies

2013-07-01 Thread Timothy St. Clair (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696842#comment-13696842
 ] 

Timothy St. Clair commented on YARN-888:


[~tucu00], I have a series of tickets relating to *this*, and I'm wondering if 
it makes sense to use this as an umbrella and tree off from it.

 clean up POM dependencies
 -

 Key: YARN-888
 URL: https://issues.apache.org/jira/browse/YARN-888
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.0-beta
Reporter: Alejandro Abdelnur

 Intermediate 'pom' modules define dependencies inherited by leaf modules.
 This is causing issues in the IntelliJ IDE.
 We should normalize the leaf modules like in common, hdfs and tools, where all 
 dependencies are defined in each leaf module and the intermediate 'pom' 
 modules do not define any dependencies.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (YARN-892) Resource Manager throws InvalidStateTransitonException: Invalid event: CONTAINER_FINISHED at ALLOCATED

2013-07-01 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved YARN-892.
-

Resolution: Duplicate

 Resource Manager throws InvalidStateTransitonException: Invalid event: 
 CONTAINER_FINISHED at ALLOCATED
 --

 Key: YARN-892
 URL: https://issues.apache.org/jira/browse/YARN-892
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.0.5-alpha
Reporter: Devaraj K
Assignee: Devaraj K

 {code:xml}
 2013-06-28 18:18:59,255 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 Can't handle this event at current state
 org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
 CONTAINER_FINISHED at ALLOCATED
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:627)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:99)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:495)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:476)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
   at java.lang.Thread.run(Thread.java:662)
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-862) ResourceManager and NodeManager versions should match on node registration or error out

2013-07-01 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated YARN-862:
---

Target Version/s: 0.23.10  (was: 0.23.9)

 ResourceManager and NodeManager versions should match on node registration or 
 error out
 ---

 Key: YARN-862
 URL: https://issues.apache.org/jira/browse/YARN-862
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, resourcemanager
Affects Versions: 0.23.8
Reporter: Robert Parker
Assignee: Robert Parker
 Attachments: YARN-862-b0.23-v1.patch, YARN-862-b0.23-v2.patch


 For branch-0.23 the versions of the node manager and the resource manager 
 should match to complete a successful registration.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-556) RM Restart phase 2 - Work preserving restart

2013-07-01 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated YARN-556:


Issue Type: New Feature  (was: Sub-task)
Parent: (was: YARN-128)

 RM Restart phase 2 - Work preserving restart
 

 Key: YARN-556
 URL: https://issues.apache.org/jira/browse/YARN-556
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Bikas Saha
Assignee: Bikas Saha
  Labels: gsoc2013

 The basic idea is already documented on YARN-128. This will describe further 
 details.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-149) ResourceManager (RM) High-Availability (HA)

2013-07-01 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697005#comment-13697005
 ] 

Bikas Saha commented on YARN-149:
-

I will be posting a short design/road-map document shortly. If anyone has 
ideas, notes, etc., please start posting so that I can consolidate them. 
Overall, most of the tools and interfaces are already available in common via 
the HDFS HA project. The work will mainly be around integrating them with 
YARN/RM.

 ResourceManager (RM) High-Availability (HA)
 ---

 Key: YARN-149
 URL: https://issues.apache.org/jira/browse/YARN-149
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Harsh J
Assignee: Bikas Saha

  One of the goals presented on MAPREDUCE-279 was to have high availability. 
 One way that was discussed, per Mahadev/others on 
 https://issues.apache.org/jira/browse/MAPREDUCE-2648 and other places, was ZK:
 {quote}
 Am not sure, if you already know about the MR-279 branch (the next version of 
 MR framework). We've been trying to integrate ZK into the framework from the 
 beginning. As for now, we are just doing restart with ZK but soon we should 
 have a HA soln with ZK.
 {quote}
 There is now MAPREDUCE-4343 that tracks recoverability via ZK. This JIRA is 
 meant to track HA via ZK.
 Currently there isn't a HA solution for RM, via ZK or otherwise.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-128) RM Restart

2013-07-01 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated YARN-128:


Summary: RM Restart  (was: RM Restart )

 RM Restart
 --

 Key: YARN-128
 URL: https://issues.apache.org/jira/browse/YARN-128
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.0-alpha
Reporter: Arun C Murthy
Assignee: Bikas Saha
 Attachments: MR-4343.1.patch, restart-12-11-zkstore.patch, 
 restart-fs-store-11-17.patch, restart-zk-store-11-17.patch, 
 RM-recovery-initial-thoughts.txt, RMRestartPhase1.pdf, 
 YARN-128.full-code.3.patch, YARN-128.full-code-4.patch, 
 YARN-128.full-code.5.patch, YARN-128.new-code-added.3.patch, 
 YARN-128.new-code-added-4.patch, YARN-128.old-code-removed.3.patch, 
 YARN-128.old-code-removed.4.patch, YARN-128.patch


 We should resurrect 'RM Restart' which we disabled sometime during the RM 
 refactor.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-814) Difficult to diagnose a failed container launch when error due to invalid environment variable

2013-07-01 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-814:
-

Attachment: (was: YARN-814.3.patch)

 Difficult to diagnose a failed container launch when error due to invalid 
 environment variable
 --

 Key: YARN-814
 URL: https://issues.apache.org/jira/browse/YARN-814
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Jian He
 Attachments: YARN-814.1.patch, YARN-814.2.patch, YARN-814.3.patch, 
 YARN-814.patch


 The container's launch script sets up environment variables, symlinks etc. 
 If there is any failure when setting up the basic context ( before the actual 
 user's process is launched ), nothing is captured by the NM. This makes it 
 impossible to diagnose the reason for the failure. 
 To reproduce, set an env var where the value contains characters that throw 
 syntax errors in bash. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-814) Difficult to diagnose a failed container launch when error due to invalid environment variable

2013-07-01 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-814:
-

Attachment: YARN-814.3.patch

 Difficult to diagnose a failed container launch when error due to invalid 
 environment variable
 --

 Key: YARN-814
 URL: https://issues.apache.org/jira/browse/YARN-814
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Jian He
 Attachments: YARN-814.1.patch, YARN-814.2.patch, YARN-814.3.patch, 
 YARN-814.patch


 The container's launch script sets up environment variables, symlinks etc. 
 If there is any failure when setting up the basic context ( before the actual 
 user's process is launched ), nothing is captured by the NM. This makes it 
 impossible to diagnose the reason for the failure. 
 To reproduce, set an env var where the value contains characters that throw 
 syntax errors in bash. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-814) Difficult to diagnose a failed container launch when error due to invalid environment variable

2013-07-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697049#comment-13697049
 ] 

Hadoop QA commented on YARN-814:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12590275/YARN-814.3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1412//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1412//console

This message is automatically generated.

 Difficult to diagnose a failed container launch when error due to invalid 
 environment variable
 --

 Key: YARN-814
 URL: https://issues.apache.org/jira/browse/YARN-814
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Jian He
 Attachments: YARN-814.1.patch, YARN-814.2.patch, YARN-814.3.patch, 
 YARN-814.patch


 The container's launch script sets up environment variables, symlinks etc. 
 If there is any failure when setting up the basic context ( before the actual 
 user's process is launched ), nothing is captured by the NM. This makes it 
 impossible to diagnose the reason for the failure. 
 To reproduce, set an env var where the value contains characters that throw 
 syntax errors in bash. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-864) YARN NM leaking containers with CGroups

2013-07-01 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-864:
---

Assignee: Jian He

 YARN NM leaking containers with CGroups
 ---

 Key: YARN-864
 URL: https://issues.apache.org/jira/browse/YARN-864
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.5-alpha
 Environment: YARN 2.0.5-alpha with patches applied for YARN-799 and 
 YARN-600.
Reporter: Chris Riccomini
Assignee: Jian He
 Attachments: rm-log, YARN-864.1.patch, YARN-864.2.patch


 Hey Guys,
 I'm running YARN 2.0.5-alpha with CGroups and stateful RM turned on, and I'm 
 seeing containers getting leaked by the NMs. I'm not quite sure what's going 
 on -- has anyone seen this before? I'm concerned that maybe it's a 
 misunderstanding on my part about how YARN's lifecycle works.
 When I look in my AM logs for my app (not an MR app master), I see:
 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Got an exit code of -100. 
 This means that container container_1371141151815_0008_03_02 was killed 
 by YARN, either due to being released by the application master or being 
 'lost' due to node failures etc.
 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Released container 
 container_1371141151815_0008_03_02 was assigned task ID 0. Requesting a 
 new container for the task.
 The AM has been running steadily the whole time. Here's what the NM logs say:
 {noformat}
 05:34:59,783  WARN AsyncDispatcher:109 - Interrupted Exception while stopping
 java.lang.InterruptedException
 at java.lang.Object.wait(Native Method)
 at java.lang.Thread.join(Thread.java:1143)
 at java.lang.Thread.join(Thread.java:1196)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107)
 at 
 org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
 at 
 org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:209)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:336)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
 at java.lang.Thread.run(Thread.java:619)
 05:35:00,314  WARN ContainersMonitorImpl:463 - 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
  is interrupted. Exiting.
 05:35:00,434  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup 
 at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0006_01_001598
 05:35:00,434  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup 
 at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0008_03_02
 05:35:00,434  WARN ContainerLaunch:247 - Failed to launch container.
 java.io.IOException: java.lang.InterruptedException
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
 at org.apache.hadoop.util.Shell.run(Shell.java:129)
 at 
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
 at 
 org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:619)
 05:35:00,434  WARN ContainerLaunch:247 - Failed to launch container.
 java.io.IOException: java.lang.InterruptedException
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
 at org.apache.hadoop.util.Shell.run(Shell.java:129)
 at 
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
 at 
 org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
 at 
 

[jira] [Commented] (YARN-712) RMDelegationTokenSecretManager shouldn't start in non-secure mode

2013-07-01 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697069#comment-13697069
 ] 

Omkar Vinit Joshi commented on YARN-712:


Can we enable it irrespective of security, like the ContainerToken?

 RMDelegationTokenSecretManager shouldn't start in non-secure mode 
 --

 Key: YARN-712
 URL: https://issues.apache.org/jira/browse/YARN-712
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He

 RM will just be doing useless work as no tokens are issued.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-815) Add container failure handling to distributed-shell

2013-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-815:
-

Issue Type: Improvement  (was: Bug)

 Add container failure handling to distributed-shell
 ---

 Key: YARN-815
 URL: https://issues.apache.org/jira/browse/YARN-815
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: applications/distributed-shell
Reporter: Vinod Kumar Vavilapalli

 Today, if any container fails for whatever reason, the app simply ignores it. 
 We should handle retries, improve error reporting, etc.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-769) Add metrics for number of containers

2013-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-769:
-

Issue Type: Improvement  (was: Bug)

 Add metrics for number of containers
 

 Key: YARN-769
 URL: https://issues.apache.org/jira/browse/YARN-769
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.0.4-alpha
Reporter: Arun C Murthy

 We should add metrics to RM to track available (min-sized) containers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-772) Document ApplicationConstants for AM implementors

2013-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-772:
-

Issue Type: Improvement  (was: Bug)

 Document ApplicationConstants for AM implementors
 -

 Key: YARN-772
 URL: https://issues.apache.org/jira/browse/YARN-772
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Arun C Murthy

 We should document features like LOG_DIR_EXPANSION_VAR, APP_SUBMIT_TIME_ENV, 
 etc. in the WritingYarnApplications doc for folks developing new applications.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-705) Review of Field Rules, Default Values and Sanity Check for ContainerManager

2013-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-705:
-

Issue Type: Improvement  (was: Bug)

 Review of Field Rules, Default Values and Sanity Check for ContainerManager
 ---

 Key: YARN-705
 URL: https://issues.apache.org/jira/browse/YARN-705
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 Need to do things similar to those mentioned in YARN-698.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-710) Add to ser/deser methods to RecordFactory

2013-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-710:
-

Issue Type: Improvement  (was: Bug)

 Add to ser/deser methods to RecordFactory
 -

 Key: YARN-710
 URL: https://issues.apache.org/jira/browse/YARN-710
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api
Affects Versions: 2.0.4-alpha
Reporter: Alejandro Abdelnur
Assignee: Alejandro Abdelnur
 Attachments: YARN-710.patch, YARN-710.patch


 In order to do things like AM failover and checkpointing, I need to serialize 
 app IDs, app attempt IDs, containers and/or their IDs, resource requests, etc.
 Because we are wrapping/hiding the PB implementation from the APIs, we are 
 hiding the built-in PB ser/deser capabilities.
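
For illustration, one possible shape for the requested helpers. The interface 
below is hypothetical, not the attached patch:

{code}
import java.io.IOException;

// Hypothetical shape for the requested additions (illustrative only, not
// the attached patch): expose the wrapped PB message's built-in
// serialization without leaking the *PBImpl classes to callers.
public interface RecordSerDe {
  /** Serialize a PB-backed record (app ID, container, request, ...) to bytes. */
  <T> byte[] serialize(T record) throws IOException;

  /** Reconstruct a record of the given class from previously serialized bytes. */
  <T> T deserialize(Class<T> clazz, byte[] bytes) throws IOException;
}
{code}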

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-662) Enforce required parameters for all the protocols

2013-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-662:
-

Issue Type: Bug  (was: Sub-task)
Parent: (was: YARN-386)

 Enforce required parameters for all the protocols
 -

 Key: YARN-662
 URL: https://issues.apache.org/jira/browse/YARN-662
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Siddharth Seth
Assignee: Zhijie Shen

 All proto fields are marked as optional. We need to mark some of them as 
 required, or enforce these checks server side. Server side is likely better 
 since that's more flexible (example: deprecating a field type in favour of 
 another, where either of the two must be present).
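
A minimal sketch of what server-side enforcement could look like. Names and 
structure here are illustrative, not a proposal from this ticket:

{code}
import java.util.function.Predicate;

// Sketch of server-side enforcement of "logically required" fields:
// each RPC handler runs its checks before acting, which keeps the .proto
// flexible (fields stay optional on the wire) while the server can still
// reject requests missing a field, or missing both of two alternatives
// (e.g. a deprecated field and its replacement).
public final class RequiredFields {
  public static <T> void require(T request, Predicate<T> present, String what) {
    if (!present.test(request)) {
      throw new IllegalArgumentException("Missing required field: " + what);
    }
  }

  public static <T> void requireEither(T request, Predicate<T> a,
      Predicate<T> b, String what) {
    if (!a.test(request) && !b.test(request)) {
      throw new IllegalArgumentException("Need one of: " + what);
    }
  }
}
{code}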

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-704) Review of Field Rules, Default Values and Sanity Check for AMRMProtocol

2013-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-704:
-

Issue Type: Sub-task  (was: Bug)
Parent: YARN-662

 Review of Field Rules, Default Values and Sanity Check for AMRMProtocol
 ---

 Key: YARN-704
 URL: https://issues.apache.org/jira/browse/YARN-704
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 Need to do things similar to those mentioned in YARN-698.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-703) Review of Field Rules, Default Values and Sanity Check for RMAdminProtocol

2013-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-703:
-

Issue Type: Sub-task  (was: Bug)
Parent: YARN-662

 Review of Field Rules, Default Values and Sanity Check for RMAdminProtocol
 --

 Key: YARN-703
 URL: https://issues.apache.org/jira/browse/YARN-703
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 Need to do things similar to those mentioned in YARN-698.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-705) Review of Field Rules, Default Values and Sanity Check for ContainerManager

2013-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-705:
-

Issue Type: Sub-task  (was: Improvement)
Parent: YARN-662

 Review of Field Rules, Default Values and Sanity Check for ContainerManager
 ---

 Key: YARN-705
 URL: https://issues.apache.org/jira/browse/YARN-705
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 Need to do things similar to those mentioned in YARN-698.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-662) Enforce required parameters for all the protocols

2013-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-662:
-

Issue Type: Improvement  (was: Bug)

 Enforce required parameters for all the protocols
 -

 Key: YARN-662
 URL: https://issues.apache.org/jira/browse/YARN-662
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Siddharth Seth
Assignee: Zhijie Shen

 All proto fields are marked as optional. We need to mark some of them as 
 required, or enforce these checks server side. Server side is likely better 
 since that's more flexible (example: deprecating a field type in favour of 
 another, where either of the two must be present).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-641) Make AMLauncher in RM Use NMClient

2013-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-641:
-

Issue Type: Improvement  (was: Bug)

 Make AMLauncher in RM Use NMClient
 --

 Key: YARN-641
 URL: https://issues.apache.org/jira/browse/YARN-641
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-641.1.patch, YARN-641.2.patch, YARN-641.3.patch


 YARN-422 adds NMClient. RM's AMLauncher is responsible for the interactions 
 with an application's AM container. AMLauncher should also replace the raw 
 ContainerManager proxy with NMClient.
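
As a rough illustration of the direction, a sketch of launching a container 
through NMClient; package locations and signatures below follow later Hadoop 
2.x releases and may differ from the patch:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.NMClient;

// Sketch only: launch the AM container through the NMClient library
// instead of a hand-built ContainerManager proxy. Obtaining 'container'
// and 'launchContext' is elided.
public class AmLaunchSketch {
  void launch(Container container, ContainerLaunchContext launchContext)
      throws Exception {
    NMClient nmClient = NMClient.createNMClient();
    nmClient.init(new Configuration());
    nmClient.start();
    try {
      nmClient.startContainer(container, launchContext); // talks to the NM
    } finally {
      nmClient.stop();
    }
  }
}
{code}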

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-698) Review of Field Rules, Default Values and Sanity Check for ClientRMProtocol

2013-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-698:
-

Issue Type: Sub-task  (was: Bug)
Parent: YARN-662

 Review of Field Rules, Default Values and Sanity Check for ClientRMProtocol
 ---

 Key: YARN-698
 URL: https://issues.apache.org/jira/browse/YARN-698
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen

 We need to check the fields of the protos used by ClientRMProtocol 
 (recursively) to clarify the following:
 1. Whether the field should be required or optional
 2. What the default value should be if the field is optional
 3. Whether a sanity check is required to validate the input value against the 
 field's value domain.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-431) [Umbrella] Complete/Stabilize YARN application log-aggregation

2013-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-431:
-

Issue Type: Task  (was: Bug)

 [Umbrella] Complete/Stabilize YARN application log-aggregation
 -

 Key: YARN-431
 URL: https://issues.apache.org/jira/browse/YARN-431
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Vinod Kumar Vavilapalli



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-399) Add an out of band heartbeat damper

2013-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-399:
-

Issue Type: Improvement  (was: Bug)

 Add an out of band heartbeat damper
 ---

 Key: YARN-399
 URL: https://issues.apache.org/jira/browse/YARN-399
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 0.23.6
Reporter: Thomas Graves
Assignee: Thomas Graves
 Attachments: YARN-399.PATCH


 We are seeing issues with the scheduler queue backing up on the RM. We have 
 the nodemanager heartbeats set at 5 seconds, which should be more than long 
 enough for the number of apps we are running.  We believe this is due to the 
 out of band heartbeats of the nodemanager coming too soon when we have jobs 
 with lots of containers that finish quickly.
 To help with that we could add an out of band heartbeat damper to the 
 nodemanager similar to what 1.x TaskTrackers have.  MAPREDUCE-2355 added it 
 in 1.x.
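
A minimal sketch of the damper idea, assuming a simple minimum-interval 
policy (the attached patch may gate this differently):

{code}
// Sketch of a heartbeat damper: an out-of-band heartbeat is only allowed
// if enough time has passed since the last one, so bursts of quickly
// finishing containers cannot flood the RM scheduler queue. The interval
// value would come from configuration.
public class HeartbeatDamper {
  private final long minIntervalMs;
  private long lastHeartbeatMs;

  public HeartbeatDamper(long minIntervalMs) {
    this.minIntervalMs = minIntervalMs;
  }

  /** Returns true if an out-of-band heartbeat may be sent now. */
  public synchronized boolean maySendNow() {
    long now = System.currentTimeMillis();
    if (now - lastHeartbeatMs >= minIntervalMs) {
      lastHeartbeatMs = now;
      return true;
    }
    return false; // damped: the regular heartbeat will carry the update
  }
}
{code}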

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-437) Update documentation of Writing Yarn Applications to match current best practices

2013-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-437:
-

Issue Type: Improvement  (was: Bug)

 Update documentation of Writing Yarn Applications to match current best 
 practices
 ---

 Key: YARN-437
 URL: https://issues.apache.org/jira/browse/YARN-437
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Reporter: Hitesh Shah
Assignee: Eli Reisman
 Attachments: YARN-437-1.patch, YARN-437-2.patch, YARN-437-3.patch


 Should fix docs to point to usage of YarnClient and AMRMClient helper libs. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-436) Document how to use DistributedShell yarn application

2013-07-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-436:
-

Issue Type: Improvement  (was: Bug)

 Document how to use DistributedShell yarn application
 -

 Key: YARN-436
 URL: https://issues.apache.org/jira/browse/YARN-436
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Reporter: Hitesh Shah
Assignee: Hitesh Shah



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-149) ResourceManager (RM) High-Availability (HA)

2013-07-01 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697165#comment-13697165
 ] 

Karthik Kambatla commented on YARN-149:
---

Sounds good, thanks Bikas. I have also been thinking about this and working on 
a draft. Will get it into shape and attach it here. 

 ResourceManager (RM) High-Availability (HA)
 ---

 Key: YARN-149
 URL: https://issues.apache.org/jira/browse/YARN-149
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Harsh J
Assignee: Bikas Saha

 This JIRA tracks the work needed to support one RM instance failing over to 
 another RM instance so that we can have RM HA. Work includes leader election, 
 transfer of control to the leader, and client redirection to the new leader.
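
For illustration only, not the design being proposed here: a ZK-based leader 
election in the style alluded to above, using Apache Curator's LeaderLatch 
recipe. The connect string and latch path are placeholders:

{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Sketch: each RM candidate blocks until it wins the latch, then assumes
// the active role. Fencing, transfer of control, and client redirection
// are the real work and are elided here.
public class RmLeaderElectionSketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework zk = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
    zk.start();
    LeaderLatch latch = new LeaderLatch(zk, "/yarn/rm-leader");
    latch.start();
    latch.await(); // blocks until this RM instance is elected leader
    System.out.println("Became active RM; starting services...");
  }
}
{code}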

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-675) In YarnClient, pull AM logs on AM container failure

2013-07-01 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697185#comment-13697185
 ] 

Zhijie Shen commented on YARN-675:
--

[~sandyr], would you mind my taking this ticket over? We're trying to push the 
better error reporting tickets to be fixed ASAP. Thanks!

 In YarnClient, pull AM logs on AM container failure
 ---

 Key: YARN-675
 URL: https://issues.apache.org/jira/browse/YARN-675
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client
Affects Versions: 2.0.4-alpha
Reporter: Sandy Ryza

 Similar to MAPREDUCE-4362, when an AM container fails, it would be helpful to 
 pull its logs from the NM to the client so that they can be displayed 
 immediately to the user.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-675) In YarnClient, pull AM logs on AM container failure

2013-07-01 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697209#comment-13697209
 ] 

Sandy Ryza commented on YARN-675:
-

[~zjshen], thanks for the help, feel free to take it over.  We're also trying 
to get these in ASAP.  My delay in working on it has been that it depends on 
YARN-649, so any feedback there would help move things forward as well.

 In YarnClient, pull AM logs on AM container failure
 ---

 Key: YARN-675
 URL: https://issues.apache.org/jira/browse/YARN-675
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client
Affects Versions: 2.0.4-alpha
Reporter: Sandy Ryza

 Similar to MAPREDUCE-4362, when an AM container fails, it would be helpful to 
 pull its logs from the NM to the client so that they can be displayed 
 immediately to the user.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-353) Add Zookeeper-based store implementation for RMStateStore

2013-07-01 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697224#comment-13697224
 ] 

Jian He commented on YARN-353:
--

I'm taking this over

 Add Zookeeper-based store implementation for RMStateStore
 -

 Key: YARN-353
 URL: https://issues.apache.org/jira/browse/YARN-353
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Hitesh Shah
Assignee: Bikas Saha
 Attachments: YARN-353.1.patch


 Add a store that writes RM state data to ZK
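
A minimal sketch of the idea, assuming raw ZooKeeper API calls and placeholder 
paths (not the attached patch):

{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch: persist serialized RM state under a znode per application so a
// restarted RM can read it back. Paths and payloads are placeholders.
public class ZkStoreSketch {
  private final ZooKeeper zk;

  public ZkStoreSketch(ZooKeeper zk) { this.zk = zk; }

  void storeApplicationState(String appId, byte[] serializedState)
      throws Exception {
    zk.create("/rmstore/" + appId, serializedState,
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
  }

  byte[] loadApplicationState(String appId) throws Exception {
    return zk.getData("/rmstore/" + appId, false, null);
  }
}
{code}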

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-814) Difficult to diagnose a failed container launch when error due to invalid environment variable

2013-07-01 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697240#comment-13697240
 ] 

Hitesh Shah commented on YARN-814:
--

Comments:

Why is shExec.getOutput() being ignored (and replaced with 
exception.getMessage())? 
Have you run this with a test script that emits information both to stdout and 
stderr? 

{code}
+      LOG.warn("Exception from container-launch with container ID: "
+          + containerId + " and exit code: " + exitCode, e);
+      logOutput(e.getMessage());
{code}
  - logging the exception twice?
  - logOutput() does not seem to log any contextual information - have you 
looked at the NM logs to see if it actually provides useful debugging 
information when running multiple containers at the same time?

{code}
      LOG.warn("Exit code from container is : " + exitCode);
-      logOutput(shExec.getOutput());
+      logOutput(e.getMessage());
{code}
  - Earlier comment about the LOG.warn not being useful was not addressed?

{code}
      throw new IOException("App initialization failed (" + exitCode +
-          ") with output: " + shExec.getOutput(), e);
+          ") with output: " + e.getMessage(), e);
{code}
  - The exception e is already being passed. Why the need to add e.getMessage() 
too? 



 Difficult to diagnose a failed container launch when error due to invalid 
 environment variable
 --

 Key: YARN-814
 URL: https://issues.apache.org/jira/browse/YARN-814
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Jian He
 Attachments: YARN-814.1.patch, YARN-814.2.patch, YARN-814.3.patch, 
 YARN-814.patch


 The container's launch script sets up environment variables, symlinks etc. 
 If there is any failure when setting up the basic context ( before the actual 
 user's process is launched ), nothing is captured by the NM. This makes it 
 impossible to diagnose the reason for the failure. 
 To reproduce, set an env var where the value contains characters that throw 
 syntax errors in bash. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (YARN-661) NM fails to cleanup local directories for users

2013-07-01 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi reassigned YARN-661:
--

Assignee: Omkar Vinit Joshi

 NM fails to cleanup local directories for users
 ---

 Key: YARN-661
 URL: https://issues.apache.org/jira/browse/YARN-661
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.1.0-beta, 0.23.8
Reporter: Jason Lowe
Assignee: Omkar Vinit Joshi

 YARN-71 added deletion of local directories on startup, but in practice it 
 fails to delete the directories because of permission problems.  The 
 top-level usercache directory is owned by the user but is in a directory that 
 is not writable by the user.  Therefore the deletion of the user's usercache 
 directory, as the user, fails due to lack of permissions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-661) NM fails to cleanup local directories for users

2013-07-01 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697277#comment-13697277
 ] 

Omkar Vinit Joshi commented on YARN-661:


Taking this over... just reproduced this issue on a secured cluster. It 
exists and needs to be fixed.

 NM fails to cleanup local directories for users
 ---

 Key: YARN-661
 URL: https://issues.apache.org/jira/browse/YARN-661
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.1.0-beta, 0.23.8
Reporter: Jason Lowe

 YARN-71 added deletion of local directories on startup, but in practice it 
 fails to delete the directories because of permission problems.  The 
 top-level usercache directory is owned by the user but is in a directory that 
 is not writable by the user.  Therefore the deletion of the user's usercache 
 directory, as the user, fails due to lack of permissions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-661) NM fails to cleanup local directories for users

2013-07-01 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697320#comment-13697320
 ] 

Omkar Vinit Joshi commented on YARN-661:


I guess we need 2 features in the deletion service (a minimal sketch of the 
first one follows below):
* A way for the user to specify that all the subdirectories and files inside 
a parent directory should be deleted, but not the parent directory itself.
* A way to define dependencies between deletion tasks. For example, we need to 
delete the usercache files before actually deleting the parent usercache 
itself...
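
A sketch of the first feature as a hypothetical helper, not the actual 
DeletionService API:

{code}
import java.io.File;

// Sketch: delete everything under a parent directory while leaving the
// parent itself in place. The depth-first ordering also illustrates the
// second feature: children are removed before the directory that
// contains them.
public class DeleteContentsOnly {
  static void deleteContents(File parent) {
    File[] children = parent.listFiles();
    if (children == null) {
      return; // not a directory, or an I/O error
    }
    for (File child : children) {
      if (child.isDirectory()) {
        deleteContents(child); // empty the subdirectory first
      }
      child.delete();
    }
  }
}
{code}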


 NM fails to cleanup local directories for users
 ---

 Key: YARN-661
 URL: https://issues.apache.org/jira/browse/YARN-661
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.1.0-beta, 0.23.8
Reporter: Jason Lowe
Assignee: Omkar Vinit Joshi

 YARN-71 added deletion of local directories on startup, but in practice it 
 fails to delete the directories because of permission problems.  The 
 top-level usercache directory is owned by the user but is in a directory that 
 is not writable by the user.  Therefore the deletion of the user's usercache 
 directory, as the user, fails due to lack of permissions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-814) Difficult to diagnose a failed container launch when error due to invalid environment variable

2013-07-01 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697377#comment-13697377
 ] 

Hitesh Shah commented on YARN-814:
--

There is no guarantee that shExec.getOutput() will always be empty.

For example:

{code}
echo About to run invalid command
./run_invalid_command.sh
{code}

The above should generate output both on stdout and stderr. The patch seems to 
be throwing away potentially valid output that may be useful for debugging. It 
seems like you need to capture both stdout and stderr information (a sketch 
follows below). 
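
A minimal sketch of that suggestion, as a hypothetical helper rather than 
part of the patch:

{code}
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Sketch: merge the child's stderr into stdout so the launch diagnostics
// capture both streams instead of only the exception message.
public class CaptureBothStreams {
  static String runAndCapture(String... command) throws Exception {
    ProcessBuilder pb = new ProcessBuilder(command);
    pb.redirectErrorStream(true); // interleave stderr with stdout
    Process p = pb.start();
    BufferedReader r =
        new BufferedReader(new InputStreamReader(p.getInputStream()));
    StringBuilder output = new StringBuilder();
    try {
      String line;
      while ((line = r.readLine()) != null) {
        output.append(line).append('\n');
      }
    } finally {
      r.close();
    }
    int exitCode = p.waitFor();
    return "exit code " + exitCode + ", output:\n" + output;
  }
}
{code}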

 Difficult to diagnose a failed container launch when error due to invalid 
 environment variable
 --

 Key: YARN-814
 URL: https://issues.apache.org/jira/browse/YARN-814
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Jian He
 Attachments: YARN-814.1.patch, YARN-814.2.patch, YARN-814.3.patch, 
 YARN-814.patch


 The container's launch script sets up environment variables, symlinks etc. 
 If there is any failure when setting up the basic context ( before the actual 
 user's process is launched ), nothing is captured by the NM. This makes it 
 impossible to diagnose the reason for the failure. 
 To reproduce, set an env var where the value contains characters that throw 
 syntax errors in bash. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-894) NodeHealthScriptRunner timeout checking is inaccurate on Windows

2013-07-01 Thread Chuan Liu (JIRA)
Chuan Liu created YARN-894:
--

 Summary: NodeHealthScriptRunner timeout checking is inaccurate on 
Windows
 Key: YARN-894
 URL: https://issues.apache.org/jira/browse/YARN-894
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 2.1.0-beta
Reporter: Chuan Liu
Assignee: Chuan Liu
Priority: Minor


In the {{NodeHealthScriptRunner}} method, we set the HealthChecker status based 
on the Shell execution results. Some statuses are based on the exception thrown 
during the Shell script execution.

Currently, we catch a non-ExitCodeException from ShellCommandExecutor, and 
if Shell has the timeout status set at the same time, we also set the 
HealthChecker status to timeout.

We have the following execution sequence in Shell:
1) In the main thread, schedule a delayed timer task that will kill the 
original process upon timeout.
2) In the main thread, open a buffered reader on the process's standard 
output stream.
3) When the timeout happens, the timer task will call {{Process#destroy()}} 
to kill the process.

On Linux, when the timeout happens and the process is killed, the buffered 
reader will throw an IOException with the message "Stream closed" in the main 
thread.

On Windows, we don't get the IOException. Only -1 is returned from the 
reader, which indicates the stream is finished. As a result, the timeout status 
is not set on Windows, and {{TestNodeHealthService}} fails on Windows because 
of this.


 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-894) NodeHealthScriptRunner timeout checking is inaccurate on Windows

2013-07-01 Thread Chuan Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chuan Liu updated YARN-894:
---

Attachment: wait.sh
wait.cmd
ReadProcessStdout.java

Attaching a Java file that verifies the above description. When executed on 
Windows, we get the following result:
{noformat}
C:\Users\chuanliu\Documents>java ReadProcessStdout wait.cmd
Process was destroyed!
-1
exit code: 1
{noformat}

On Linux, the results look like the following:
{noformat}
~$ java ReadProcessStdout ./wait.sh
Process was destroyed!
-1
Stream closed
java.io.IOException: Stream closed
at 
java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:145)
at java.io.BufferedInputStream.read(BufferedInputStream.java:308)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:306)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
at java.io.InputStreamReader.read(InputStreamReader.java:167)
at java.io.BufferedReader.fill(BufferedReader.java:136)
at java.io.BufferedReader.readLine(BufferedReader.java:299)
at java.io.BufferedReader.readLine(BufferedReader.java:362)
at ReadProcessStdout.main(ReadProcessStdout.java:25)
{noformat}
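
For readers without the attachment, a guess at the shape of 
ReadProcessStdout.java (the real attachment may differ):

{code}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Timer;
import java.util.TimerTask;

// Sketch: launch the given script, destroy the process from a timer task,
// and show what the blocked stdout reader observes afterwards.
public class ReadProcessStdoutSketch {
  public static void main(String[] args) throws Exception {
    final Process p = new ProcessBuilder(args[0]).start();
    new Timer(true).schedule(new TimerTask() {
      @Override public void run() {
        p.destroy();
        System.out.println("Process was destroyed!");
      }
    }, 2000); // kill the child after 2 seconds, like the health-check timer
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(p.getInputStream()));
    System.out.println(reader.read()); // -1 once the child is gone
    try {
      reader.readLine(); // on Linux this is where "Stream closed" surfaces
    } catch (IOException e) {
      System.out.println(e.getMessage());
      e.printStackTrace(System.out);
    }
    System.out.println("exit code: " + p.waitFor());
  }
}
{code}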

 NodeHealthScriptRunner timeout checking is inaccurate on Windows
 

 Key: YARN-894
 URL: https://issues.apache.org/jira/browse/YARN-894
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 2.1.0-beta
Reporter: Chuan Liu
Assignee: Chuan Liu
Priority: Minor
 Attachments: ReadProcessStdout.java, wait.cmd, wait.sh


 In the {{NodeHealthScriptRunner}} method, we set the HealthChecker status 
 based on the Shell execution results. Some statuses are based on the exception 
 thrown during the Shell script execution.
 Currently, we catch a non-ExitCodeException from ShellCommandExecutor, 
 and if Shell has the timeout status set at the same time, we also set the 
 HealthChecker status to timeout.
 We have the following execution sequence in Shell:
 1) In the main thread, schedule a delayed timer task that will kill the 
 original process upon timeout.
 2) In the main thread, open a buffered reader on the process's standard 
 output stream.
 3) When the timeout happens, the timer task will call {{Process#destroy()}} 
 to kill the process.
 On Linux, when the timeout happens and the process is killed, the buffered 
 reader will throw an IOException with the message "Stream closed" in the main 
 thread.
 On Windows, we don't get the IOException. Only -1 is returned from the 
 reader, which indicates the stream is finished. As a result, the timeout 
 status is not set on Windows, and {{TestNodeHealthService}} fails on Windows 
 because of this.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-894) NodeHealthScriptRunner timeout checking is inaccurate on Windows

2013-07-01 Thread Chuan Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chuan Liu updated YARN-894:
---

Attachment: YARN-894-trunk.patch

Attaching a patch that fixes the above issue on Windows. It also changes the 
test to use a different 'sleep' command and shell script extension on Windows 
(a sketch of the fix's idea follows below).
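
Sketch of the fix's idea. {{ShellCommandExecutor}} and {{isTimedOut()}} are 
real Hadoop utilities, but the surrounding method is hypothetical and the 
actual patch may differ:

{code}
import org.apache.hadoop.util.Shell.ShellCommandExecutor;

// Sketch: decide "timed out" from the executor's own flag instead of from
// an IOException that only Linux produces when the stream is torn down.
public class TimeoutCheckSketch {
  static boolean timedOut(String script, long timeoutMs) {
    ShellCommandExecutor shexec = new ShellCommandExecutor(
        new String[] { script }, null, null, timeoutMs);
    try {
      shexec.execute();
    } catch (Exception e) {
      // Ignore here: on Windows no exception accompanies the timeout anyway.
    }
    return shexec.isTimedOut(); // authoritative on both platforms
  }
}
{code}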

 NodeHealthScriptRunner timeout checking is inaccurate on Windows
 

 Key: YARN-894
 URL: https://issues.apache.org/jira/browse/YARN-894
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 2.1.0-beta
Reporter: Chuan Liu
Assignee: Chuan Liu
Priority: Minor
 Attachments: ReadProcessStdout.java, wait.cmd, wait.sh, 
 YARN-894-trunk.patch


 In the {{NodeHealthScriptRunner}} method, we set the HealthChecker status 
 based on the Shell execution results. Some statuses are based on the exception 
 thrown during the Shell script execution.
 Currently, we catch a non-ExitCodeException from ShellCommandExecutor, 
 and if Shell has the timeout status set at the same time, we also set the 
 HealthChecker status to timeout.
 We have the following execution sequence in Shell:
 1) In the main thread, schedule a delayed timer task that will kill the 
 original process upon timeout.
 2) In the main thread, open a buffered reader on the process's standard 
 output stream.
 3) When the timeout happens, the timer task will call {{Process#destroy()}} 
 to kill the process.
 On Linux, when the timeout happens and the process is killed, the buffered 
 reader will throw an IOException with the message "Stream closed" in the main 
 thread.
 On Windows, we don't get the IOException. Only -1 is returned from the 
 reader, which indicates the stream is finished. As a result, the timeout 
 status is not set on Windows, and {{TestNodeHealthService}} fails on Windows 
 because of this.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-353) Add Zookeeper-based store implementation for RMStateStore

2013-07-01 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-353:
-

Attachment: YARN-353.2.patch

Rebased the patch and added the RMDelegationToken restore implementation for 
the ZKStateStore. 

 Add Zookeeper-based store implementation for RMStateStore
 -

 Key: YARN-353
 URL: https://issues.apache.org/jira/browse/YARN-353
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Hitesh Shah
Assignee: Bikas Saha
 Attachments: YARN-353.1.patch, YARN-353.2.patch


 Add a store that writes RM state data to ZK

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-894) NodeHealthScriptRunner timeout checking is inaccurate on Windows

2013-07-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697407#comment-13697407
 ] 

Hadoop QA commented on YARN-894:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12590349/YARN-894-trunk.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1413//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1413//console

This message is automatically generated.

 NodeHealthScriptRunner timeout checking is inaccurate on Windows
 

 Key: YARN-894
 URL: https://issues.apache.org/jira/browse/YARN-894
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 2.1.0-beta
Reporter: Chuan Liu
Assignee: Chuan Liu
Priority: Minor
 Attachments: ReadProcessStdout.java, wait.cmd, wait.sh, 
 YARN-894-trunk.patch


 In the {{NodeHealthScriptRunner}} method, we set the HealthChecker status 
 based on the Shell execution results. Some statuses are based on the 
 exception thrown during the Shell script execution.
 Currently, we catch a non-ExitCodeException from ShellCommandExecutor, and 
 if Shell has the timeout status set at the same time, we also set the 
 HealthChecker status to timeout.
 We have the following execution sequence in Shell:
 1) In the main thread, schedule a delayed timer task that will kill the 
 original process upon timeout.
 2) In the main thread, open a buffered reader and read the process's 
 output stream.
 3) When the timeout happens, the timer task calls {{Process#destroy()}} to 
 kill the process.
 On Linux, when the timeout happens and the process is killed, the buffered 
 reader throws an IOException with the message "Stream closed" in the main 
 thread.
 On Windows, there is no IOException. Only -1 is returned from the reader, 
 indicating the stream is finished. As a result, the timeout status is not 
 set on Windows, and {{TestNodeHealthService}} fails on Windows because of 
 this.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-353) Add Zookeeper-based store implementation for RMStateStore

2013-07-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697417#comment-13697417
 ] 

Hadoop QA commented on YARN-353:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12590350/YARN-353.2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 3 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1414//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/1414//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1414//console

This message is automatically generated.

 Add Zookeeper-based store implementation for RMStateStore
 -

 Key: YARN-353
 URL: https://issues.apache.org/jira/browse/YARN-353
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Hitesh Shah
Assignee: Bikas Saha
 Attachments: YARN-353.1.patch, YARN-353.2.patch


 Add a store that writes RM state data to ZK

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-675) In YarnClient, pull AM logs on AM container failure

2013-07-01 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697493#comment-13697493
 ] 

Zhijie Shen commented on YARN-675:
--

I'll take it over. Thanks!

 In YarnClient, pull AM logs on AM container failure
 ---

 Key: YARN-675
 URL: https://issues.apache.org/jira/browse/YARN-675
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client
Affects Versions: 2.0.4-alpha
Reporter: Sandy Ryza
Assignee: Zhijie Shen

 Similar to MAPREDUCE-4362, when an AM container fails, it would be helpful to 
 pull its logs from the NM to the client so that they can be displayed 
 immediately to the user.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (YARN-675) In YarnClient, pull AM logs on AM container failure

2013-07-01 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen reassigned YARN-675:


Assignee: Zhijie Shen

 In YarnClient, pull AM logs on AM container failure
 ---

 Key: YARN-675
 URL: https://issues.apache.org/jira/browse/YARN-675
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client
Affects Versions: 2.0.4-alpha
Reporter: Sandy Ryza
Assignee: Zhijie Shen

 Similar to MAPREDUCE-4362, when an AM container fails, it would be helpful to 
 pull its logs from the NM to the client so that they can be displayed 
 immediately to the user.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-873) YARNClient.getApplicationReport(unknownAppId) returns a null report

2013-07-01 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697513#comment-13697513
 ] 

Xuan Gong commented on YARN-873:


Throwing an exception may not be a good option. If we throw an exception, the 
client may think there is a problem with the command, when the command 
actually works fine. 
Instead, we could output something like: "This appId does not exist. Please 
use the command 'yarn application -list' to get information about all 
applications."
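
A hedged sketch of the client-side handling being suggested (the helper 
class, message text, and import paths are illustrative assumptions, not 
committed behavior; package locations may differ by branch):

{code}
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.YarnClient;

// Hypothetical helper, not part of YarnClient as shipped.
class AppReportPrinter {
  static void printStatus(YarnClient client, ApplicationId appId)
      throws Exception {
    ApplicationReport report = client.getApplicationReport(appId);
    if (report == null) {
      // print a friendly message instead of throwing
      System.err.println("Application " + appId + " does not exist. "
          + "Use 'yarn application -list' to see all applications.");
      return;
    }
    System.out.println("Application " + appId + " is in state "
        + report.getYarnApplicationState());
  }
}
{code}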

 YARNClient.getApplicationReport(unknownAppId) returns a null report
 ---

 Key: YARN-873
 URL: https://issues.apache.org/jira/browse/YARN-873
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.1.0-beta
Reporter: Bikas Saha
Assignee: Xuan Gong

 How can the client find out that the app does not exist?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-710) Add to ser/deser methods to RecordFactory

2013-07-01 Thread Alejandro Abdelnur (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alejandro Abdelnur updated YARN-710:


Attachment: YARN-710-wip.patch

Sidd,

Attaching a patch with your suggestion on how to get the class.

However, something has changed significantly since the last patch.

I've tried getting things to work again, but it is plain ugly and I don't 
like it at all (see the wip patch). It is still not working because I cannot 
force creation of the underlying proto.

Any ideas on how to untangle this?
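
For reference, a minimal sketch of the unwrap/rewrap approach such ser/deser 
helpers imply (the exact class paths and constructors are assumptions from 
the PB impl layer, not the patch's API):

{code}
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.impl.pb.ApplicationIdPBImpl;
import org.apache.hadoop.yarn.proto.YarnProtos.ApplicationIdProto;

// Hypothetical sketch, not the proposed RecordFactory methods.
class RecordSerDeSketch {
  // serialize by unwrapping the record to its underlying protobuf
  static byte[] serialize(ApplicationId appId) {
    return ((ApplicationIdPBImpl) appId).getProto().toByteArray();
  }

  // deserialize by parsing the proto and wrapping it back in a record
  static ApplicationId deserialize(byte[] bytes) throws Exception {
    return new ApplicationIdPBImpl(ApplicationIdProto.parseFrom(bytes));
  }
}
{code}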

 Add to ser/deser methods to RecordFactory
 -

 Key: YARN-710
 URL: https://issues.apache.org/jira/browse/YARN-710
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api
Affects Versions: 2.0.4-alpha
Reporter: Alejandro Abdelnur
Assignee: Alejandro Abdelnur
 Attachments: YARN-710.patch, YARN-710.patch, YARN-710-wip.patch


 In order to do things like AM failover and checkpointing, I need to 
 serialize app IDs, app attempt IDs, containers and/or container IDs, 
 resource requests, etc.
 Because we are wrapping/hiding the PB implementation behind the APIs, we are 
 also hiding the built-in PB ser/deser capabilities.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira