[jira] [Created] (YARN-892) Resource Manager throws InvalidStateTransitonException: Invalid event: CONTAINER_FINISHED at ALLOCATED
Devaraj K created YARN-892:
-------------------------------

             Summary: Resource Manager throws InvalidStateTransitonException: Invalid event: CONTAINER_FINISHED at ALLOCATED
                 Key: YARN-892
                 URL: https://issues.apache.org/jira/browse/YARN-892
             Project: Hadoop YARN
          Issue Type: Sub-task
          Components: resourcemanager
    Affects Versions: 2.0.5-alpha
            Reporter: Devaraj K
            Assignee: Devaraj K

{code:xml}
2013-06-28 18:18:59,255 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: CONTAINER_FINISHED at ALLOCATED
	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:627)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:99)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:495)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:476)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
	at java.lang.Thread.run(Thread.java:662)
{code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
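The stack trace above comes from YARN's table-driven state machine: transitions are registered per (state, event) pair, and an event arriving in a state with no registered transition for it is rejected with "Invalid event: X at Y". A minimal sketch of that pattern (illustrative only; the class and enum names here are hypothetical, not the actual StateMachineFactory API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a table-driven state machine: transitions are registered per
// (state, event) pair; an unregistered pair raises an "invalid event" error,
// mirroring InvalidStateTransitonException in the trace above.
class AttemptStateMachine {
    enum State { ALLOCATED, LAUNCHED, FINISHED }
    enum Event { LAUNCH, CONTAINER_FINISHED }

    private final Map<State, Map<Event, State>> table = new HashMap<>();
    private State current = State.ALLOCATED;

    void addTransition(State from, Event on, State to) {
        table.computeIfAbsent(from, s -> new HashMap<>()).put(on, to);
    }

    void handle(Event event) {
        Map<Event, State> legal = table.get(current);
        if (legal == null || !legal.containsKey(event)) {
            // Corresponds to: Invalid event: CONTAINER_FINISHED at ALLOCATED
            throw new IllegalStateException(
                "Invalid event: " + event + " at " + current);
        }
        current = legal.get(event);
    }

    State getCurrentState() { return current; }
}
```

In this sketch, a CONTAINER_FINISHED event delivered while the attempt is still ALLOCATED (no transition registered for that pair) throws, which is exactly the shape of the bug reported here.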
[jira] [Commented] (YARN-888) clean up POM dependencies
     [ https://issues.apache.org/jira/browse/YARN-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696842#comment-13696842 ]

Timothy St. Clair commented on YARN-888:
----------------------------------------

[~tucu00], I have a series of tickets relating to this, and I'm wondering if it makes sense to use this one as an umbrella and tree off it.

clean up POM dependencies
-------------------------

                 Key: YARN-888
                 URL: https://issues.apache.org/jira/browse/YARN-888
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 2.1.0-beta
            Reporter: Alejandro Abdelnur

Intermediate 'pom' modules define dependencies that are inherited by the leaf modules. This is causing issues in the IntelliJ IDE. We should normalize the leaf modules as in common, hdfs and tools, where all dependencies are defined in each leaf module and the intermediate 'pom' modules do not define any dependency.
[jira] [Resolved] (YARN-892) Resource Manager throws InvalidStateTransitonException: Invalid event: CONTAINER_FINISHED at ALLOCATED
     [ https://issues.apache.org/jira/browse/YARN-892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe resolved YARN-892.
-----------------------------
    Resolution: Duplicate

Resource Manager throws InvalidStateTransitonException: Invalid event: CONTAINER_FINISHED at ALLOCATED
------------------------------------------------------------------------------------------------------

                 Key: YARN-892
                 URL: https://issues.apache.org/jira/browse/YARN-892
             Project: Hadoop YARN
          Issue Type: Sub-task
          Components: resourcemanager
    Affects Versions: 2.0.5-alpha
            Reporter: Devaraj K
            Assignee: Devaraj K
[jira] [Updated] (YARN-862) ResourceManager and NodeManager versions should match on node registration or error out
     [ https://issues.apache.org/jira/browse/YARN-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated YARN-862:
-------------------------------
    Target Version/s: 0.23.10  (was: 0.23.9)

ResourceManager and NodeManager versions should match on node registration or error out
---------------------------------------------------------------------------------------

                 Key: YARN-862
                 URL: https://issues.apache.org/jira/browse/YARN-862
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager, resourcemanager
    Affects Versions: 0.23.8
            Reporter: Robert Parker
            Assignee: Robert Parker
         Attachments: YARN-862-b0.23-v1.patch, YARN-862-b0.23-v2.patch

For branch-0.23 the versions of the node manager and the resource manager should match to complete a successful registration.
[jira] [Updated] (YARN-556) RM Restart phase 2 - Work preserving restart
     [ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated YARN-556:
----------------------------
    Issue Type: New Feature  (was: Sub-task)
        Parent: (was: YARN-128)

RM Restart phase 2 - Work preserving restart
--------------------------------------------

                 Key: YARN-556
                 URL: https://issues.apache.org/jira/browse/YARN-556
             Project: Hadoop YARN
          Issue Type: New Feature
          Components: resourcemanager
            Reporter: Bikas Saha
            Assignee: Bikas Saha
              Labels: gsoc2013

The basic idea is already documented in YARN-128. This JIRA will describe further details.
[jira] [Commented] (YARN-149) ResourceManager (RM) High-Availability (HA)
     [ https://issues.apache.org/jira/browse/YARN-149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697005#comment-13697005 ]

Bikas Saha commented on YARN-149:
---------------------------------

I will be posting a short design/road-map document shortly. If anyone has ideas, notes etc., then please start posting them so that I can consolidate. Overall, most of the tools and interfaces are already available in common via the HDFS HA project. The work will mainly be around integrating them with YARN/RM.

ResourceManager (RM) High-Availability (HA)
-------------------------------------------

                 Key: YARN-149
                 URL: https://issues.apache.org/jira/browse/YARN-149
             Project: Hadoop YARN
          Issue Type: New Feature
            Reporter: Harsh J
            Assignee: Bikas Saha

One of the goals presented in MAPREDUCE-279 was high availability. One approach that was discussed, per Mahadev and others on https://issues.apache.org/jira/browse/MAPREDUCE-2648 and in other places, was ZK:

{quote}
Am not sure, if you already know about the MR-279 branch (the next version of MR framework). We've been trying to integrate ZK into the framework from the beginning. As for now, we are just doing restart with ZK but soon we should have a HA soln with ZK.
{quote}

There is now MAPREDUCE-4343 that tracks recoverability via ZK. This JIRA is meant to track HA via ZK. Currently there isn't an HA solution for the RM, via ZK or otherwise.
[jira] [Updated] (YARN-128) RM Restart
     [ https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated YARN-128:
----------------------------
    Summary: RM Restart  (was: RM Restart )

RM Restart
----------

                 Key: YARN-128
                 URL: https://issues.apache.org/jira/browse/YARN-128
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.0.0-alpha
            Reporter: Arun C Murthy
            Assignee: Bikas Saha
         Attachments: MR-4343.1.patch, restart-12-11-zkstore.patch, restart-fs-store-11-17.patch, restart-zk-store-11-17.patch, RM-recovery-initial-thoughts.txt, RMRestartPhase1.pdf, YARN-128.full-code.3.patch, YARN-128.full-code-4.patch, YARN-128.full-code.5.patch, YARN-128.new-code-added.3.patch, YARN-128.new-code-added-4.patch, YARN-128.old-code-removed.3.patch, YARN-128.old-code-removed.4.patch, YARN-128.patch

We should resurrect 'RM Restart', which we disabled sometime during the RM refactor.
[jira] [Updated] (YARN-814) Difficult to diagnose a failed container launch when error due to invalid environment variable
     [ https://issues.apache.org/jira/browse/YARN-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jian He updated YARN-814:
-------------------------
    Attachment: (was: YARN-814.3.patch)

Difficult to diagnose a failed container launch when error due to invalid environment variable
----------------------------------------------------------------------------------------------

                 Key: YARN-814
                 URL: https://issues.apache.org/jira/browse/YARN-814
             Project: Hadoop YARN
          Issue Type: Sub-task
            Reporter: Hitesh Shah
            Assignee: Jian He
         Attachments: YARN-814.1.patch, YARN-814.2.patch, YARN-814.3.patch, YARN-814.patch

The container's launch script sets up environment variables, symlinks etc. If there is any failure when setting up this basic context (before the actual user's process is launched), nothing is captured by the NM. This makes it impossible to diagnose the reason for the failure. To reproduce, set an env var whose value contains characters that trigger syntax errors in bash.
[jira] [Updated] (YARN-814) Difficult to diagnose a failed container launch when error due to invalid environment variable
     [ https://issues.apache.org/jira/browse/YARN-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jian He updated YARN-814:
-------------------------
    Attachment: YARN-814.3.patch

Difficult to diagnose a failed container launch when error due to invalid environment variable
----------------------------------------------------------------------------------------------

                 Key: YARN-814
                 URL: https://issues.apache.org/jira/browse/YARN-814
             Project: Hadoop YARN
          Issue Type: Sub-task
            Reporter: Hitesh Shah
            Assignee: Jian He
         Attachments: YARN-814.1.patch, YARN-814.2.patch, YARN-814.3.patch, YARN-814.patch
[jira] [Commented] (YARN-814) Difficult to diagnose a failed container launch when error due to invalid environment variable
     [ https://issues.apache.org/jira/browse/YARN-814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697049#comment-13697049 ]

Hadoop QA commented on YARN-814:
--------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12590275/YARN-814.3.patch
  against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests.
    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1412//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1412//console

This message is automatically generated.

Difficult to diagnose a failed container launch when error due to invalid environment variable
----------------------------------------------------------------------------------------------

                 Key: YARN-814
                 URL: https://issues.apache.org/jira/browse/YARN-814
             Project: Hadoop YARN
          Issue Type: Sub-task
            Reporter: Hitesh Shah
            Assignee: Jian He
         Attachments: YARN-814.1.patch, YARN-814.2.patch, YARN-814.3.patch, YARN-814.patch
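The reproduction step for YARN-814 (an env var value that breaks the launch script before the user's process starts) can be sketched as a stand-alone program. This is a hypothetical repro, not NM code; it assumes a POSIX /bin/sh and shows why capturing the script's output is what makes the failure diagnosable:

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical repro of a container launch script failing during env setup:
// an unquoted env value containing shell metacharacters makes the script
// itself fail before the "user process" (the echo) is ever exec'd.
// Capturing the script's output is what makes such failures diagnosable.
class LaunchScriptRepro {
    static String runScript(String envValue) throws Exception {
        Path script = Files.createTempFile("launch", ".sh");
        // Unquoted value, as a naively generated launch script might emit it.
        Files.write(script, ("export MY_VAR=" + envValue + "\nexec echo started\n")
            .getBytes(StandardCharsets.UTF_8));
        Process p = new ProcessBuilder("/bin/sh", script.toString())
            .redirectErrorStream(true)   // merge stderr so the error is captured
            .start();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        InputStream is = p.getInputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = is.read(buf)) > 0) out.write(buf, 0, n);
        int code = p.waitFor();
        Files.deleteIfExists(script);
        return "exit=" + code + " output=" + out.toString("UTF-8");
    }
}
```

With a benign value the script prints "started" and exits 0; with a value like `oops; (` the shell hits a syntax error, the user process never runs, and only the captured output explains why.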
[jira] [Updated] (YARN-864) YARN NM leaking containers with CGroups
     [ https://issues.apache.org/jira/browse/YARN-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Omkar Vinit Joshi updated YARN-864:
-----------------------------------
    Assignee: Jian He

YARN NM leaking containers with CGroups
---------------------------------------

                 Key: YARN-864
                 URL: https://issues.apache.org/jira/browse/YARN-864
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.0.5-alpha
         Environment: YARN 2.0.5-alpha with patches applied for YARN-799 and YARN-600.
            Reporter: Chris Riccomini
            Assignee: Jian He
         Attachments: rm-log, YARN-864.1.patch, YARN-864.2.patch

Hey Guys,

I'm running YARN 2.0.5-alpha with CGroups and the stateful RM turned on, and I'm seeing containers getting leaked by the NMs. I'm not quite sure what's going on -- has anyone seen this before? I'm concerned that maybe it's a misunderstanding on my part about how YARN's lifecycle works.

When I look in my AM logs for my app (not an MR app master), I see:

2013-06-19 05:34:22 AppMasterTaskManager [INFO] Got an exit code of -100. This means that container container_1371141151815_0008_03_02 was killed by YARN, either due to being released by the application master or being 'lost' due to node failures etc.
2013-06-19 05:34:22 AppMasterTaskManager [INFO] Released container container_1371141151815_0008_03_02 was assigned task ID 0. Requesting a new container for the task.

The AM has been running steadily the whole time. Here's what the NM logs say:

{noformat}
05:34:59,783 WARN  AsyncDispatcher:109 - Interrupted Exception while stopping
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1143)
	at java.lang.Thread.join(Thread.java:1196)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107)
	at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
	at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:209)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:336)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
	at java.lang.Thread.run(Thread.java:619)
05:35:00,314 WARN  ContainersMonitorImpl:463 - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
05:35:00,434 WARN  CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0006_01_001598
05:35:00,434 WARN  CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0008_03_02
05:35:00,434 WARN  ContainerLaunch:247 - Failed to launch container.
java.io.IOException: java.lang.InterruptedException
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
	at org.apache.hadoop.util.Shell.run(Shell.java:129)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:619)
05:35:00,434 WARN  ContainerLaunch:247 - Failed to launch container.
java.io.IOException: java.lang.InterruptedException
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
	at org.apache.hadoop.util.Shell.run(Shell.java:129)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
	at
[jira] [Commented] (YARN-712) RMDelegationTokenSecretManager shouldn't start in non-secure mode
     [ https://issues.apache.org/jira/browse/YARN-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697069#comment-13697069 ]

Omkar Vinit Joshi commented on YARN-712:
----------------------------------------

Can we enable it irrespective of security, like ContainerToken?

RMDelegationTokenSecretManager shouldn't start in non-secure mode
-----------------------------------------------------------------

                 Key: YARN-712
                 URL: https://issues.apache.org/jira/browse/YARN-712
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Jian He
            Assignee: Jian He

The RM will just be doing useless work, as no tokens are issued.
[jira] [Updated] (YARN-815) Add container failure handling to distributed-shell
     [ https://issues.apache.org/jira/browse/YARN-815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-815:
-----------------------------------------
    Issue Type: Improvement  (was: Bug)

Add container failure handling to distributed-shell
---------------------------------------------------

                 Key: YARN-815
                 URL: https://issues.apache.org/jira/browse/YARN-815
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: applications/distributed-shell
            Reporter: Vinod Kumar Vavilapalli

Today, if any container fails for whatever reason, the app simply ignores it. We should handle retries, improve error reporting, etc.
[jira] [Updated] (YARN-769) Add metrics for number of containers
     [ https://issues.apache.org/jira/browse/YARN-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-769:
-----------------------------------------
    Issue Type: Improvement  (was: Bug)

Add metrics for number of containers
------------------------------------

                 Key: YARN-769
                 URL: https://issues.apache.org/jira/browse/YARN-769
             Project: Hadoop YARN
          Issue Type: Improvement
    Affects Versions: 2.0.4-alpha
            Reporter: Arun C Murthy

We should add metrics to the RM to track available (min-sized) containers.
[jira] [Updated] (YARN-772) Document ApplicationConstants for AM implementors
     [ https://issues.apache.org/jira/browse/YARN-772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-772:
-----------------------------------------
    Issue Type: Improvement  (was: Bug)

Document ApplicationConstants for AM implementors
-------------------------------------------------

                 Key: YARN-772
                 URL: https://issues.apache.org/jira/browse/YARN-772
             Project: Hadoop YARN
          Issue Type: Improvement
            Reporter: Arun C Murthy

We should document features like LOG_DIR_EXPANSION_VAR, APP_SUBMIT_TIME_ENV etc. for folks developing new applications in the WritingYarnApplications doc.
[jira] [Updated] (YARN-705) Review of Field Rules, Default Values and Sanity Check for ContainerManager
     [ https://issues.apache.org/jira/browse/YARN-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-705:
-----------------------------------------
    Issue Type: Improvement  (was: Bug)

Review of Field Rules, Default Values and Sanity Check for ContainerManager
---------------------------------------------------------------------------

                 Key: YARN-705
                 URL: https://issues.apache.org/jira/browse/YARN-705
             Project: Hadoop YARN
          Issue Type: Improvement
            Reporter: Zhijie Shen
            Assignee: Zhijie Shen

Need to do things similar to those mentioned in YARN-698.
[jira] [Updated] (YARN-710) Add to ser/deser methods to RecordFactory
     [ https://issues.apache.org/jira/browse/YARN-710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-710:
-----------------------------------------
    Issue Type: Improvement  (was: Bug)

Add to ser/deser methods to RecordFactory
-----------------------------------------

                 Key: YARN-710
                 URL: https://issues.apache.org/jira/browse/YARN-710
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: api
    Affects Versions: 2.0.4-alpha
            Reporter: Alejandro Abdelnur
            Assignee: Alejandro Abdelnur
         Attachments: YARN-710.patch, YARN-710.patch

In order to do things like AM failover and checkpointing, I need to serialize app IDs, app attempt IDs, containers and/or IDs, resource requests, etc. Because we are wrapping/hiding the PB implementation from the APIs, we are hiding the built-in PB ser/deser capabilities.
[jira] [Updated] (YARN-662) Enforce required parameters for all the protocols
     [ https://issues.apache.org/jira/browse/YARN-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-662:
-----------------------------------------
    Issue Type: Bug  (was: Sub-task)
        Parent: (was: YARN-386)

Enforce required parameters for all the protocols
-------------------------------------------------

                 Key: YARN-662
                 URL: https://issues.apache.org/jira/browse/YARN-662
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Siddharth Seth
            Assignee: Zhijie Shen

All proto fields are marked as optional. We need to mark some of them as required, or enforce this server side. Server side is likely better since that's more flexible (example: deprecating a field type in favour of another - either of the two must be present).
[jira] [Updated] (YARN-704) Review of Field Rules, Default Values and Sanity Check for AMRMProtocol
     [ https://issues.apache.org/jira/browse/YARN-704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-704:
-----------------------------------------
    Issue Type: Sub-task  (was: Bug)
        Parent: YARN-662

Review of Field Rules, Default Values and Sanity Check for AMRMProtocol
-----------------------------------------------------------------------

                 Key: YARN-704
                 URL: https://issues.apache.org/jira/browse/YARN-704
             Project: Hadoop YARN
          Issue Type: Sub-task
            Reporter: Zhijie Shen
            Assignee: Zhijie Shen

Need to do things similar to those mentioned in YARN-698.
[jira] [Updated] (YARN-703) Review of Field Rules, Default Values and Sanity Check for RMAdminProtocol
     [ https://issues.apache.org/jira/browse/YARN-703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-703:
-----------------------------------------
    Issue Type: Sub-task  (was: Bug)
        Parent: YARN-662

Review of Field Rules, Default Values and Sanity Check for RMAdminProtocol
--------------------------------------------------------------------------

                 Key: YARN-703
                 URL: https://issues.apache.org/jira/browse/YARN-703
             Project: Hadoop YARN
          Issue Type: Sub-task
            Reporter: Zhijie Shen
            Assignee: Zhijie Shen

Need to do things similar to those mentioned in YARN-698.
[jira] [Updated] (YARN-705) Review of Field Rules, Default Values and Sanity Check for ContainerManager
     [ https://issues.apache.org/jira/browse/YARN-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-705:
-----------------------------------------
    Issue Type: Sub-task  (was: Improvement)
        Parent: YARN-662

Review of Field Rules, Default Values and Sanity Check for ContainerManager
---------------------------------------------------------------------------

                 Key: YARN-705
                 URL: https://issues.apache.org/jira/browse/YARN-705
             Project: Hadoop YARN
          Issue Type: Sub-task
            Reporter: Zhijie Shen
            Assignee: Zhijie Shen

Need to do things similar to those mentioned in YARN-698.
[jira] [Updated] (YARN-662) Enforce required parameters for all the protocols
     [ https://issues.apache.org/jira/browse/YARN-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-662:
-----------------------------------------
    Issue Type: Improvement  (was: Bug)

Enforce required parameters for all the protocols
-------------------------------------------------

                 Key: YARN-662
                 URL: https://issues.apache.org/jira/browse/YARN-662
             Project: Hadoop YARN
          Issue Type: Improvement
            Reporter: Siddharth Seth
            Assignee: Zhijie Shen

All proto fields are marked as optional. We need to mark some of them as required, or enforce this server side. Server side is likely better since that's more flexible (example: deprecating a field type in favour of another - either of the two must be present).
[jira] [Updated] (YARN-641) Make AMLauncher in RM Use NMClient
     [ https://issues.apache.org/jira/browse/YARN-641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-641:
-----------------------------------------
    Issue Type: Improvement  (was: Bug)

Make AMLauncher in RM Use NMClient
----------------------------------

                 Key: YARN-641
                 URL: https://issues.apache.org/jira/browse/YARN-641
             Project: Hadoop YARN
          Issue Type: Improvement
            Reporter: Zhijie Shen
            Assignee: Zhijie Shen
         Attachments: YARN-641.1.patch, YARN-641.2.patch, YARN-641.3.patch

YARN-422 adds NMClient. The RM's AMLauncher is responsible for the interactions with an application's AM container. AMLauncher should also replace the raw ContainerManager proxy with NMClient.
[jira] [Updated] (YARN-698) Review of Field Rules, Default Values and Sanity Check for ClientRMProtocol
     [ https://issues.apache.org/jira/browse/YARN-698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-698:
-----------------------------------------
    Issue Type: Sub-task  (was: Bug)
        Parent: YARN-662

Review of Field Rules, Default Values and Sanity Check for ClientRMProtocol
---------------------------------------------------------------------------

                 Key: YARN-698
                 URL: https://issues.apache.org/jira/browse/YARN-698
             Project: Hadoop YARN
          Issue Type: Sub-task
            Reporter: Zhijie Shen
            Assignee: Zhijie Shen

We need to check the fields of the protos used by ClientRMProtocol (recursively) to clarify the following:

1. Whether the field should be required or optional.
2. What the default value should be if the field is optional.
3. Whether a sanity check is required to validate the input value against the field's value domain.
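The three checks listed in YARN-698 (required vs. optional, defaults for optionals, and domain sanity checks) can be sketched as a server-side validator. The request type and its fields below are hypothetical placeholders, not the actual ClientRMProtocol records:

```java
// Sketch of a server-side field policy: each field is either required
// (reject if absent) or optional with a documented default, and values are
// checked against their domain. Hypothetical request type for illustration.
class RequestValidator {
    static class GetAppsRequest {
        Integer limit;        // optional, default 100, must be > 0
        String queue;         // optional, default "default"
        String applicationId; // required
    }

    static void validate(GetAppsRequest r) {
        // 1. Required field: reject if absent.
        if (r.applicationId == null || r.applicationId.isEmpty()) {
            throw new IllegalArgumentException("applicationId is required");
        }
        // 2. Optional fields: fill documented defaults.
        if (r.limit == null) r.limit = 100;
        if (r.queue == null) r.queue = "default";
        // 3. Sanity check against the field's value domain.
        if (r.limit <= 0) {
            throw new IllegalArgumentException("limit must be positive: " + r.limit);
        }
    }
}
```

Doing this on the server side (rather than marking proto fields required) matches the flexibility argument made in YARN-662: a deprecated field and its replacement can both be optional on the wire while the server enforces "at least one present".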
[jira] [Updated] (YARN-431) [Umbrella] Complete/Stabilize YARN application log-aggregation
     [ https://issues.apache.org/jira/browse/YARN-431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-431:
-----------------------------------------
    Issue Type: Task  (was: Bug)

[Umbrella] Complete/Stabilize YARN application log-aggregation
--------------------------------------------------------------

                 Key: YARN-431
                 URL: https://issues.apache.org/jira/browse/YARN-431
             Project: Hadoop YARN
          Issue Type: Task
            Reporter: Vinod Kumar Vavilapalli
[jira] [Updated] (YARN-399) Add an out of band heartbeat damper
     [ https://issues.apache.org/jira/browse/YARN-399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-399:
-----------------------------------------
    Issue Type: Improvement  (was: Bug)

Add an out of band heartbeat damper
-----------------------------------

                 Key: YARN-399
                 URL: https://issues.apache.org/jira/browse/YARN-399
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: nodemanager
    Affects Versions: 0.23.6
            Reporter: Thomas Graves
            Assignee: Thomas Graves
         Attachments: YARN-399.PATCH

We are seeing issues with the scheduler queue backing up on the RM. We have the nodemanager heartbeats set at 5 seconds, which should be more than long enough for the number of apps we are running. We believe this is due to the out of band heartbeats of the nodemanager coming too soon when we have jobs with lots of containers that finish quickly. To help with that, we could add an out of band heartbeat damper to the nodemanager, similar to what 1.x Tasktrackers have. MAPREDUCE-2355 added it in 1.x.
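The damper described in YARN-399 (and in MAPREDUCE-2355 for 1.x TaskTrackers) boils down to enforcing a minimum interval between heartbeats. A minimal sketch, with an illustrative interval and the simplifying assumption that a suppressed out-of-band heartbeat can just be folded into the next regular one:

```java
// Sketch of a heartbeat "damper": out-of-band heartbeats (e.g. triggered by a
// container finishing) are suppressed if they would fire too soon after the
// previous heartbeat, protecting the RM scheduler queue from bursts.
// Illustrative design, not the YARN-399 patch.
class HeartbeatDamper {
    private final long minIntervalMs;
    private long lastHeartbeatMs;
    private boolean sentAny = false;

    HeartbeatDamper(long minIntervalMs) {
        this.minIntervalMs = minIntervalMs;
    }

    /** Returns true if a heartbeat may be sent at time nowMs. */
    synchronized boolean tryHeartbeat(long nowMs) {
        if (sentAny && nowMs - lastHeartbeatMs < minIntervalMs) {
            return false; // too soon: fold into the next regular heartbeat
        }
        lastHeartbeatMs = nowMs;
        sentAny = true;
        return true;
    }
}
```

With a 1-second damper, a burst of quickly finishing containers produces at most one out-of-band heartbeat per second instead of one per container.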
[jira] [Updated] (YARN-437) Update documentation of Writing Yarn Applications to match current best practices
[ https://issues.apache.org/jira/browse/YARN-437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-437: - Issue Type: Improvement (was: Bug) Update documentation of Writing Yarn Applications to match current best practices --- Key: YARN-437 URL: https://issues.apache.org/jira/browse/YARN-437 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Hitesh Shah Assignee: Eli Reisman Attachments: YARN-437-1.patch, YARN-437-2.patch, YARN-437-3.patch Should fix docs to point to usage of YarnClient and AMRMClient helper libs.
[jira] [Updated] (YARN-436) Document how to use DistributedShell yarn application
[ https://issues.apache.org/jira/browse/YARN-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-436: - Issue Type: Improvement (was: Bug) Document how to use DistributedShell yarn application - Key: YARN-436 URL: https://issues.apache.org/jira/browse/YARN-436 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Hitesh Shah Assignee: Hitesh Shah
[jira] [Commented] (YARN-149) ResourceManager (RM) High-Availability (HA)
[ https://issues.apache.org/jira/browse/YARN-149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697165#comment-13697165 ] Karthik Kambatla commented on YARN-149: --- Sounds good, thanks Bikas. I have also been thinking about this and working on a draft. Will get it into shape and attach it here. ResourceManager (RM) High-Availability (HA) --- Key: YARN-149 URL: https://issues.apache.org/jira/browse/YARN-149 Project: Hadoop YARN Issue Type: New Feature Reporter: Harsh J Assignee: Bikas Saha This jira tracks the work needed to support one RM instance failing over to another RM instance so that we can have RM HA. Work includes leader election, transfer of control to the leader, and client re-direction to the new leader.
[jira] [Commented] (YARN-675) In YarnClient, pull AM logs on AM container failure
[ https://issues.apache.org/jira/browse/YARN-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697185#comment-13697185 ] Zhijie Shen commented on YARN-675: -- [~sandyr], would you mind my taking this ticket over? We're trying to push the better error reporting tickets to be fixed ASAP. Thanks! In YarnClient, pull AM logs on AM container failure --- Key: YARN-675 URL: https://issues.apache.org/jira/browse/YARN-675 Project: Hadoop YARN Issue Type: Sub-task Components: client Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Similar to MAPREDUCE-4362, when an AM container fails, it would be helpful to pull its logs from the NM to the client so that they can be displayed immediately to the user.
[jira] [Commented] (YARN-675) In YarnClient, pull AM logs on AM container failure
[ https://issues.apache.org/jira/browse/YARN-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697209#comment-13697209 ] Sandy Ryza commented on YARN-675: - [~zjshen], thanks for the help, feel free to take it over. We're also trying to get these in ASAP. My delay in working on it has been that it depends on YARN-649, so any feedback there would help move things forward as well.
[jira] [Commented] (YARN-353) Add Zookeeper-based store implementation for RMStateStore
[ https://issues.apache.org/jira/browse/YARN-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697224#comment-13697224 ] Jian He commented on YARN-353: -- I'm taking this over Add Zookeeper-based store implementation for RMStateStore - Key: YARN-353 URL: https://issues.apache.org/jira/browse/YARN-353 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Hitesh Shah Assignee: Bikas Saha Attachments: YARN-353.1.patch Add a store that writes RM state data to ZK
[jira] [Commented] (YARN-814) Difficult to diagnose a failed container launch when error due to invalid environment variable
[ https://issues.apache.org/jira/browse/YARN-814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697240#comment-13697240 ] Hitesh Shah commented on YARN-814: -- Comments: Why is shExec.getOutput() being ignored (and replaced with exception.getMessage())? Have you run this with a test script that emits information both to stdout and stderr?
{code}
+ LOG.warn("Exception from container-launch with container ID: "
+     + containerId + " and exit code: " + exitCode, e);
+ logOutput(e.getMessage());
{code}
- logging the exception twice? - logOutput() does not seem to log any contextual information - have you looked at the NM logs to see if it actually provides useful debugging information when running multiple containers at the same time?
{code}
  LOG.warn("Exit code from container is : " + exitCode);
- logOutput(shExec.getOutput());
+ logOutput(e.getMessage());
{code}
- Earlier comment about the LOG.warn not being useful not addressed?
{code}
  throw new IOException("App initialization failed (" + exitCode +
-     ") with output: " + shExec.getOutput(), e);
+     ") with output: " + e.getMessage(), e);
{code}
- The exception e is already being passed. Why the need to add e.getMessage() too? Difficult to diagnose a failed container launch when error due to invalid environment variable -- Key: YARN-814 URL: https://issues.apache.org/jira/browse/YARN-814 Project: Hadoop YARN Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Jian He Attachments: YARN-814.1.patch, YARN-814.2.patch, YARN-814.3.patch, YARN-814.patch The container's launch script sets up environment variables, symlinks, etc. If there is any failure when setting up the basic context (before the actual user's process is launched), nothing is captured by the NM. This makes it impossible to diagnose the reason for the failure. To reproduce, set an env var where the value contains characters that throw syntax errors in bash.
[jira] [Assigned] (YARN-661) NM fails to cleanup local directories for users
[ https://issues.apache.org/jira/browse/YARN-661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi reassigned YARN-661: -- Assignee: Omkar Vinit Joshi NM fails to cleanup local directories for users --- Key: YARN-661 URL: https://issues.apache.org/jira/browse/YARN-661 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.0-beta, 0.23.8 Reporter: Jason Lowe Assignee: Omkar Vinit Joshi YARN-71 added deletion of local directories on startup, but in practice it fails to delete the directories because of permission problems. The top-level usercache directory is owned by the user but is in a directory that is not writable by the user. Therefore the deletion of the user's usercache directory, as the user, fails due to lack of permissions.
[jira] [Commented] (YARN-661) NM fails to cleanup local directories for users
[ https://issues.apache.org/jira/browse/YARN-661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697277#comment-13697277 ] Omkar Vinit Joshi commented on YARN-661: Taking this over. Just reproduced this issue on a secured cluster. It exists and needs to be fixed.
[jira] [Commented] (YARN-661) NM fails to cleanup local directories for users
[ https://issues.apache.org/jira/browse/YARN-661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697320#comment-13697320 ] Omkar Vinit Joshi commented on YARN-661: I guess we need two features in the deletion service: * A way for the user to specify that all the subdirectories and files inside a parent directory should be deleted, but not the parent directory itself. * A way to define dependencies between deletion tasks. For example, we need to delete the usercache files before actually deleting the parent usercache directory itself.
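The first feature requested above (delete everything inside a parent directory without removing the parent itself) can be sketched as follows. This is an illustrative sketch only; the real NM DeletionService works differently (and runs as a different user). It does show the ordering constraint from the second bullet: children are always removed before their parent.

```java
import java.io.File;

// Illustrative sketch: delete all contents of a directory but keep the
// directory itself, removing children before their parent (post-order),
// which is the dependency ordering described in the comment above.
public class DeleteContents {

    // Deletes everything inside 'parent' but leaves 'parent' in place.
    public static void deleteContents(File parent) {
        File[] children = parent.listFiles();
        if (children == null) return; // not a directory, or I/O error
        for (File child : children) {
            deleteRecursively(child);
        }
    }

    private static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        f.delete(); // a directory is empty by the time we get here
    }
}
```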
[jira] [Commented] (YARN-814) Difficult to diagnose a failed container launch when error due to invalid environment variable
[ https://issues.apache.org/jira/browse/YARN-814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697377#comment-13697377 ] Hitesh Shah commented on YARN-814: -- There is no guarantee that shExec.getOutput() will always be empty. For example:
{code}
echo About to run invalid command
./run_invalid_command.sh
{code}
The above should generate output both on stdout and stderr. The patch seems to be throwing away potentially valid output that may be useful for debugging. It seems like you need to capture both stdout and stderr information.
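The point above is that a launched script can write to both streams, so both need to be captured. A minimal sketch, assuming a POSIX shell is available: here stderr is simply merged into stdout via ProcessBuilder; a real implementation might instead read the two streams separately on two threads to keep them apart. This is not how Hadoop's Shell class does it.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Illustrative sketch: run a command and capture everything it writes to
// stdout AND stderr, by folding stderr into the stdout pipe. Without this
// (or a second reader thread), the stderr half of the output is lost.
public class CaptureOutput {

    public static String runAndCapture(String... command) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // fold stderr into stdout
        Process p = pb.start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        p.waitFor();
        return out.toString();
    }
}
```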
[jira] [Created] (YARN-894) NodeHealthScriptRunner timeout checking is inaccurate on Windows
Chuan Liu created YARN-894: -- Summary: NodeHealthScriptRunner timeout checking is inaccurate on Windows Key: YARN-894 URL: https://issues.apache.org/jira/browse/YARN-894 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.1.0-beta Reporter: Chuan Liu Assignee: Chuan Liu Priority: Minor In the {{NodeHealthScriptRunner}} method, we set the HealthChecker status based on the Shell execution results. Some statuses are based on the exception thrown during the Shell script execution. Currently, we catch a non-ExitCodeException from ShellCommandExecutor, and if Shell has the timeout status set at the same time, we also set the HealthChecker status to timeout. We have the following execution sequence in Shell: 1) In the main thread, schedule a delayed timer task that will kill the original process upon timeout. 2) In the main thread, open a buffered reader on the process's standard output stream. 3) When the timeout happens, the timer task calls {{Process#destroy()}} to kill the main process. On Linux, when the timeout happens and the process is killed, the buffered reader throws an IOException with the message Stream closed in the main thread. On Windows, we don't get the IOException; only -1 is returned from the reader, indicating the stream is finished. As a result, the timeout status is not set on Windows, and {{TestNodeHealthService}} fails on Windows because of this.
[jira] [Updated] (YARN-894) NodeHealthScriptRunner timeout checking is inaccurate on Windows
[ https://issues.apache.org/jira/browse/YARN-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuan Liu updated YARN-894: --- Attachment: wait.sh wait.cmd ReadProcessStdout.java Attaching a Java file that verifies the above description. When executed on Windows, we have the following result:
{noformat}
C:\Users\chuanliu\Documents>java ReadProcessStdout wait.cmd
Process was destroyed!
-1
exit code: 1
{noformat}
On Linux, the results look like the following:
{noformat}
~$ java ReadProcessStdout ./wait.sh
Process was destroyed!
-1
Stream closed
java.io.IOException: Stream closed
        at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:145)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:308)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:306)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
        at java.io.InputStreamReader.read(InputStreamReader.java:167)
        at java.io.BufferedReader.fill(BufferedReader.java:136)
        at java.io.BufferedReader.readLine(BufferedReader.java:299)
        at java.io.BufferedReader.readLine(BufferedReader.java:362)
        at ReadProcessStdout.main(ReadProcessStdout.java:25)
{noformat}
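A reconstruction in the spirit of the attached ReadProcessStdout.java (the actual attachment may differ): start a long-running process, destroy it from a timer task, and observe how the blocked reader ends. As described above, the outcome is platform- and JDK-dependent: a plain end-of-stream on some platforms, an IOException on others, which is exactly why timeout detection cannot rely on the exception alone.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Timer;
import java.util.TimerTask;

// Illustrative sketch: read a process's stdout while a timer kills the
// process, and report whether the reader ended with EOF or an IOException.
// Mirrors the two outcomes (Windows vs. Linux) discussed in the ticket.
public class DestroyWhileReading {

    // Returns "EOF" if the reader saw end-of-stream, or "IOException: <msg>"
    // if the blocked read was interrupted by an exception.
    public static String readUntilKilled(long killAfterMs, String... command)
            throws Exception {
        final Process p = new ProcessBuilder(command).start();
        Timer timer = new Timer(true); // daemon timer, like Shell's timeout task
        timer.schedule(new TimerTask() {
            @Override public void run() { p.destroy(); }
        }, killAfterMs);
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            while (r.readLine() != null) { /* discard output */ }
            return "EOF";
        } catch (IOException e) {
            return "IOException: " + e.getMessage();
        }
    }
}
```

Because the outcome differs by platform, callers that need a reliable timeout signal should track the timer state explicitly rather than infer it from how the read ended.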
[jira] [Updated] (YARN-894) NodeHealthScriptRunner timeout checking is inaccurate on Windows
[ https://issues.apache.org/jira/browse/YARN-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuan Liu updated YARN-894: --- Attachment: YARN-894-trunk.patch Attaching a patch that fixes the above issue on Windows. Also changing the test to use a different command for 'sleep' and a different shell script extension on Windows.
[jira] [Updated] (YARN-353) Add Zookeeper-based store implementation for RMStateStore
[ https://issues.apache.org/jira/browse/YARN-353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-353: - Attachment: YARN-353.2.patch Rebased the patch and added an RMDelegationToken restore implementation for the ZKStateStore.
[jira] [Commented] (YARN-894) NodeHealthScriptRunner timeout checking is inaccurate on Windows
[ https://issues.apache.org/jira/browse/YARN-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697407#comment-13697407 ] Hadoop QA commented on YARN-894: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12590349/YARN-894-trunk.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1413//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1413//console This message is automatically generated.
[jira] [Commented] (YARN-353) Add Zookeeper-based store implementation for RMStateStore
[ https://issues.apache.org/jira/browse/YARN-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697417#comment-13697417 ] Hadoop QA commented on YARN-353: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12590350/YARN-353.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1414//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/1414//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1414//console This message is automatically generated. 
[jira] [Commented] (YARN-675) In YarnClient, pull AM logs on AM container failure
[ https://issues.apache.org/jira/browse/YARN-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697493#comment-13697493 ] Zhijie Shen commented on YARN-675: -- Taking it over. Thanks!
[jira] [Assigned] (YARN-675) In YarnClient, pull AM logs on AM container failure
[ https://issues.apache.org/jira/browse/YARN-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen reassigned YARN-675: Assignee: Zhijie Shen
[jira] [Commented] (YARN-873) YARNClient.getApplicationReport(unknownAppId) returns a null report
[ https://issues.apache.org/jira/browse/YARN-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13697513#comment-13697513 ] Xuan Gong commented on YARN-873: Throwing an exception may not be a good option. If we throw an exception, the client may think there is a problem with the command, when the command actually worked fine. Instead, we could output something like: This appId does not exist. Please use the command yarn application -list to get information on all applications. YARNClient.getApplicationReport(unknownAppId) returns a null report --- Key: YARN-873 URL: https://issues.apache.org/jira/browse/YARN-873 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.1.0-beta Reporter: Bikas Saha Assignee: Xuan Gong How can the client find out that the app does not exist?
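The client-side choice discussed above (return a user-facing message for unknown ids instead of throwing) can be modeled with a small stub. This is illustrative only: the real YarnClient API is not used here, and the lookup map simply stands in for the RM.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stub: models getApplicationReport returning null for an
// unknown appId, with the client turning that null into a helpful message
// rather than an exception, as suggested in the comment above.
public class AppLookup {
    private final Map<String, String> reports = new HashMap<>();

    public void addReport(String appId, String report) {
        reports.put(appId, report);
    }

    // Null report => unknown application: return guidance, don't throw.
    public String describe(String appId) {
        String report = reports.get(appId);
        if (report == null) {
            return "Application with id " + appId + " does not exist. "
                + "Use 'yarn application -list' to list all applications.";
        }
        return report;
    }
}
```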
[jira] [Updated] (YARN-710) Add to ser/deser methods to RecordFactory
[ https://issues.apache.org/jira/browse/YARN-710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alejandro Abdelnur updated YARN-710: Attachment: YARN-710-wip.patch Sidd, attaching a patch with your suggestion on how to get the class. However, something has changed significantly since the last patch. I've tried getting things to work again but it is plain ugly, and I don't like it at all (see the wip patch). Still, it is not working because I cannot force creation of the underlying proto. Any idea on how to untangle this? Add to ser/deser methods to RecordFactory - Key: YARN-710 URL: https://issues.apache.org/jira/browse/YARN-710 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.0.4-alpha Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Attachments: YARN-710.patch, YARN-710.patch, YARN-710-wip.patch In order to do things like AM failover and checkpointing, I need to serialize app IDs, app attempt IDs, containers and/or their IDs, resource requests, etc. Because we are wrapping/hiding the PB implementation from the APIs, we are hiding the built-in PB ser/deser capabilities.
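The kind of round trip the ticket asks for can be sketched with a stand-in record. Protobuf is what YARN actually uses under its record wrappers; the hand-rolled byte encoding below is invented purely to illustrate the serialize/deserialize shape an AM would need for checkpointing ids.

```java
import java.nio.ByteBuffer;

// Illustrative sketch: serialize a stand-in "application id" record
// (cluster timestamp + sequence number, the two fields a YARN app id
// carries) to bytes and restore it. The encoding here is invented for
// illustration; real YARN records would use their protobuf form.
public class RecordRoundTrip {

    public static byte[] serialize(long clusterTimestamp, int appId) {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES + Integer.BYTES);
        buf.putLong(clusterTimestamp);
        buf.putInt(appId);
        return buf.array();
    }

    // Returns {clusterTimestamp, appId} decoded from the byte form.
    public static long[] deserialize(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        return new long[] { buf.getLong(), buf.getInt() };
    }
}
```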