[jira] [Commented] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy
[ https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789102#comment-13789102 ] Hudson commented on YARN-465: - SUCCESS: Integrated in Hadoop-Yarn-trunk #356 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/356/]) YARN-465. fix coverage org.apache.hadoop.yarn.server.webproxy. Contributed by Aleksey Gorshkov and Andrey Klochkov (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530095) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/main/java/org/apache/hadoop/yarn/server/webproxy/WebAppProxyServer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/TestWebAppProxyServlet.java fix coverage org.apache.hadoop.yarn.server.webproxy Key: YARN-465 URL: https://issues.apache.org/jira/browse/YARN-465 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 3.0.0, 0.23.7, 2.0.4-alpha Reporter: Aleksey Gorshkov Assignee: Andrey Klochkov Fix For: 2.3.0 Attachments: YARN-465-branch-0.23-a.patch, YARN-465-branch-0.23.patch, YARN-465-branch-2-a.patch, YARN-465-branch-2--n3.patch, YARN-465-branch-2--n4.patch, YARN-465-branch-2--n5.patch, YARN-465-branch-2.patch, YARN-465-trunk-a.patch, YARN-465-trunk--n3.patch, YARN-465-trunk--n4.patch, YARN-465-trunk--n5.patch, YARN-465-trunk.patch fix coverage org.apache.hadoop.yarn.server.webproxy: patch YARN-465-trunk.patch for trunk, patch YARN-465-branch-2.patch for branch-2, patch YARN-465-branch-0.23.patch for branch-0.23. There is an issue in branch-0.23: the patch does not create the .keep file. To fix it, run these commands: mkdir yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy and touch yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy/.keep -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy
[ https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789204#comment-13789204 ] Hudson commented on YARN-465: - FAILURE: Integrated in Hadoop-Hdfs-trunk #1546 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1546/]) YARN-465. fix coverage org.apache.hadoop.yarn.server.webproxy. Contributed by Aleksey Gorshkov and Andrey Klochkov (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530095) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/main/java/org/apache/hadoop/yarn/server/webproxy/WebAppProxyServer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/TestWebAppProxyServlet.java fix coverage org.apache.hadoop.yarn.server.webproxy Key: YARN-465 URL: https://issues.apache.org/jira/browse/YARN-465 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 3.0.0, 0.23.7, 2.0.4-alpha Reporter: Aleksey Gorshkov Assignee: Andrey Klochkov Fix For: 2.3.0 Attachments: YARN-465-branch-0.23-a.patch, YARN-465-branch-0.23.patch, YARN-465-branch-2-a.patch, YARN-465-branch-2--n3.patch, YARN-465-branch-2--n4.patch, YARN-465-branch-2--n5.patch, YARN-465-branch-2.patch, YARN-465-trunk-a.patch, YARN-465-trunk--n3.patch, YARN-465-trunk--n4.patch, YARN-465-trunk--n5.patch, YARN-465-trunk.patch fix coverage org.apache.hadoop.yarn.server.webproxy: patch YARN-465-trunk.patch for trunk, patch YARN-465-branch-2.patch for branch-2, patch YARN-465-branch-0.23.patch for branch-0.23. There is an issue in branch-0.23: the patch does not create the .keep file. To fix it, run these commands: mkdir yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy and touch yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy/.keep -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-465) fix coverage org.apache.hadoop.yarn.server.webproxy
[ https://issues.apache.org/jira/browse/YARN-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789212#comment-13789212 ] Hudson commented on YARN-465: - FAILURE: Integrated in Hadoop-Mapreduce-trunk #1572 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1572/]) YARN-465. fix coverage org.apache.hadoop.yarn.server.webproxy. Contributed by Aleksey Gorshkov and Andrey Klochkov (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1530095) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/main/java/org/apache/hadoop/yarn/server/webproxy/WebAppProxyServer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/TestWebAppProxyServlet.java fix coverage org.apache.hadoop.yarn.server.webproxy Key: YARN-465 URL: https://issues.apache.org/jira/browse/YARN-465 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 3.0.0, 0.23.7, 2.0.4-alpha Reporter: Aleksey Gorshkov Assignee: Andrey Klochkov Fix For: 2.3.0 Attachments: YARN-465-branch-0.23-a.patch, YARN-465-branch-0.23.patch, YARN-465-branch-2-a.patch, YARN-465-branch-2--n3.patch, YARN-465-branch-2--n4.patch, YARN-465-branch-2--n5.patch, YARN-465-branch-2.patch, YARN-465-trunk-a.patch, YARN-465-trunk--n3.patch, YARN-465-trunk--n4.patch, YARN-465-trunk--n5.patch, YARN-465-trunk.patch fix coverage org.apache.hadoop.yarn.server.webproxy: patch YARN-465-trunk.patch for trunk, patch YARN-465-branch-2.patch for branch-2, patch YARN-465-branch-0.23.patch for branch-0.23. There is an issue in branch-0.23: the patch does not create the .keep file. To fix it, run these commands: mkdir yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy and touch yhadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/proxy/.keep -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alejandro Abdelnur updated YARN-1284: - Attachment: YARN-1284.patch The patch changes the deleteCgroup() method to retry the delete in a loop (retrying every 20 ms) until it succeeds or it times out (500 ms). Also, this is done for all containers, not only for AM containers. It also introduces a configuration knob for the timeout. Other changes, such as the method signatures and the initConfig() method, are there to enable unit testing of the new logic. LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between the SIGTERM/SIGKILL and the OS doing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like: {code} 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 {code} CgroupsLCEResourcesHandler.clearLimits() has logic to wait 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at a more time-efficient way of doing this, perhaps spinning with a minimal sleep and a timeout while deleteCgroup() cannot complete. -- This message was sent by Atlassian JIRA (v6.1#6144)
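A minimal sketch of the retry-on-timeout approach described above, assuming a 20 ms retry interval and a configurable timeout defaulting to 500 ms (the class and method names here are illustrative, not taken from the actual patch):
{code}
import java.io.File;

// Illustrative only, not the actual patch: retry the cgroup directory delete
// until it succeeds or the configured timeout elapses, instead of giving up
// after a single attempt.
class CgroupDeleteRetrySketch {
  static boolean deleteWithRetry(String cgroupPath, long timeoutMs) {
    File cgroup = new File(cgroupPath);
    long start = System.currentTimeMillis();
    boolean deleted = cgroup.delete();
    while (!deleted && System.currentTimeMillis() - start < timeoutMs) {
      try {
        Thread.sleep(20);  // 20 ms retry interval, per the patch summary above
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        break;
      }
      deleted = cgroup.delete();
    }
    return deleted;
  }
}
{code}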
[jira] [Updated] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alejandro Abdelnur updated YARN-1284: - Target Version/s: 2.2.1 LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between the SIGTERM/SIGKILL and the OS doing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like: {code} 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 {code} CgroupsLCEResourcesHandler.clearLimits() has logic to wait 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at a more time-efficient way of doing this, perhaps spinning with a minimal sleep and a timeout while deleteCgroup() cannot complete. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789262#comment-13789262 ] Hadoop QA commented on YARN-1284: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12607362/YARN-1284.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2144//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/2144//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2144//console This message is automatically generated. LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between the SIGTERM/SIGKILL and the OS doing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like: {code} 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 {code} CgroupsLCEResourcesHandler.clearLimits() has logic to wait 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at a more time-efficient way of doing this, perhaps spinning with a minimal sleep and a timeout while deleteCgroup() cannot complete. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alejandro Abdelnur updated YARN-1284: - Attachment: YARN-1284.patch LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch, YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between the SIGTERM/SIGKILL and the OS doing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like: {code} 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 {code} CgroupsLCEResourcesHandler.clearLimits() has logic to wait 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at a more time-efficient way of doing this, perhaps spinning with a minimal sleep and a timeout while deleteCgroup() cannot complete. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789355#comment-13789355 ] Hadoop QA commented on YARN-1284: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12607374/YARN-1284.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2145//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/2145//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2145//console This message is automatically generated. LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch, YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between the SIGTERM/SIGKILL and the OS doing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like: {code} 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 {code} CgroupsLCEResourcesHandler.clearLimits() has logic to wait 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at a more time-efficient way of doing this, perhaps spinning with a minimal sleep and a timeout while deleteCgroup() cannot complete. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alejandro Abdelnur updated YARN-1284: - Attachment: YARN-1284.patch LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch, YARN-1284.patch, YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between the SIGTERM/SIGKILL and the OS doing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like: {code} 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 {code} CgroupsLCEResourcesHandler.clearLimits() has logic to wait 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at a more time-efficient way of doing this, perhaps spinning with a minimal sleep and a timeout while deleteCgroup() cannot complete. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789429#comment-13789429 ] Hadoop QA commented on YARN-1284: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12607388/YARN-1284.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2146//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2146//console This message is automatically generated. LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch, YARN-1284.patch, YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between the SIGTERM/SIGKILL and the OS doing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like: {code} 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 {code} CgroupsLCEResourcesHandler.clearLimits() has logic to wait 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at a more time-efficient way of doing this, perhaps spinning with a minimal sleep and a timeout while deleteCgroup() cannot complete. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Resolved] (YARN-250) Add a generic mechanism to the resource manager for client communication with the scheduler.
[ https://issues.apache.org/jira/browse/YARN-250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved YARN-250. - Resolution: Won't Fix Add a generic mechanism to the resource manager for client communication with the scheduler. Key: YARN-250 URL: https://issues.apache.org/jira/browse/YARN-250 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, scheduler Affects Versions: 2.0.2-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza In MR1 the fair scheduler allowed the queue of a running job to be changed through the web UI. For feature parity, this should be supported in MR2, but the web UI seems an inappropriate place to put it because of complicated security implications and lack of programmatic access. A command line tool makes more sense. A generic mechanism could be leveraged by other schedulers to support similar types of functionality and would allow us to avoid making changes to all the plumbing each time functionality is added. Other possible uses might include suspending pools or fetching scheduler-specific information. This functionality could be made available through an RPC server within each scheduler, but that would require reserving another port, and would introduce unnecessary confusion if different schedulers implemented the same mechanism in a different way. A client should be able to send an RPC with a set of key/value pairs, which would be passed to the scheduler. The RPC would return a string (or set of key/value pairs). -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-250) Add a generic mechanism to the resource manager for client communication with the scheduler.
[ https://issues.apache.org/jira/browse/YARN-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789585#comment-13789585 ] Bikas Saha commented on YARN-250: - Seemed useful. Any reason for a won't fix? Add a generic mechanism to the resource manager for client communication with the scheduler. Key: YARN-250 URL: https://issues.apache.org/jira/browse/YARN-250 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, scheduler Affects Versions: 2.0.2-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza In MR1 the fair scheduler allowed the queue of a running job to be changed through the web UI. For feature parity, this should be supported in MR2, but the web UI seems an inappropriate place to put it because of complicated security implications and lack of programmatic access. A command line tool makes more sense. A generic mechanism could be leveraged by other schedulers to support similar types of functionality and would allow us to avoid making changes to all the plumbing each time functionality is added. Other possible uses might include suspending pools or fetching scheduler-specific information. This functionality could be made available through an RPC server within each scheduler, but that would require reserving another port, and would introduce unnecessary confusion if different schedulers implemented the same mechanism in a different way. A client should be able to send an RPC with a set of key/value pairs, which would be passed to the scheduler. The RPC would return a string (or set of key/value pairs). -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-250) Add a generic mechanism to the resource manager for client communication with the scheduler.
[ https://issues.apache.org/jira/browse/YARN-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789605#comment-13789605 ] Sandy Ryza commented on YARN-250: - My original thinking was that this would be useful for allowing applications to be moved between Fair Scheduler queues from the command line. But my current thinking on that capability is that it would probably be useful to have for all schedulers. I thought other use cases would come up, but nothing has jumped out at me recently. If there are other things a mechanism like this would be useful for, I'm certainly not against adding it. Add a generic mechanism to the resource manager for client communication with the scheduler. Key: YARN-250 URL: https://issues.apache.org/jira/browse/YARN-250 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, scheduler Affects Versions: 2.0.2-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza In MR1 the fair scheduler allowed the queue of a running job to be changed through the web UI. For feature parity, this should be supported in MR2, but the web UI seems an inappropriate place to put it because of complicated security implications and lack of programmatic access. A command line tool makes more sense. A generic mechanism could be leveraged by other schedulers to support similar types of functionality and would allow us to avoid making changes to all the plumbing each time functionality is added. Other possible uses might include suspending pools or fetching scheduler-specific information. This functionality could be made available through an RPC server within each scheduler, but that would require reserving another port, and would introduce unnecessary confusion if different schedulers implemented the same mechanism in a different way. A client should be able to send an RPC with a set of key/value pairs, which would be passed to the scheduler. The RPC would return a string (or set of key/value pairs). -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-250) Add a generic mechanism to the resource manager for client communication with the scheduler.
[ https://issues.apache.org/jira/browse/YARN-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789644#comment-13789644 ] Bikas Saha commented on YARN-250: - Will this functionality be added to the fair scheduler? If yes, then how would it be exposed in the RM without some discovery logic? Add a generic mechanism to the resource manager for client communication with the scheduler. Key: YARN-250 URL: https://issues.apache.org/jira/browse/YARN-250 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, scheduler Affects Versions: 2.0.2-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza In MR1 the fair scheduler allowed the queue of a running job to be changed through the web UI. For feature parity, this should be supported in MR2, but the web UI seems an inappropriate place to put it because of complicated security implications and lack of programmatic access. A command line tool makes more sense. A generic mechanism could be leveraged by other schedulers to support similar types of functionality and would allow us to avoid making changes to all the plumbing each time functionality is added. Other possible uses might include suspending pools or fetching scheduler-specific information. This functionality could be made available through an RPC server within each scheduler, but that would require reserving another port, and would introduce unnecessary confusion if different schedulers implemented the same mechanism in a different way. A client should be able to send an RPC with a set of key/value pairs, which would be passed to the scheduler. The RPC would return a string (or set of key/value pairs). -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (YARN-1285) Inconsistency of default yarn.acl.enable value
Zhijie Shen created YARN-1285: - Summary: Inconsistency of default yarn.acl.enable value Key: YARN-1285 URL: https://issues.apache.org/jira/browse/YARN-1285 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen In yarn-default.xml, yarn.acl.enable is true while in YarnConfiguration, DEFAULT_YARN_ACL_ENABLE is false. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-250) Add a generic mechanism to the resource manager for client communication with the scheduler.
[ https://issues.apache.org/jira/browse/YARN-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789687#comment-13789687 ] Sandy Ryza commented on YARN-250: - The mechanism this JIRA proposed would make most sense for features specific to a single scheduler. As queue-moving functionality is probably something that the Capacity Scheduler would like at some point as well, I think it would make more sense to directly add it as a ResourceManager API. Add a generic mechanism to the resource manager for client communication with the scheduler. Key: YARN-250 URL: https://issues.apache.org/jira/browse/YARN-250 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, scheduler Affects Versions: 2.0.2-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza In MR1 the fair scheduler allowed the queue of a running job to be changed through the web UI. For feature parity, this should be supported in MR2, but the web UI seems an inappropriate place to put it because of complicated security implications and lack of programmatic access. A command line tool makes more sense. A generic mechanism could be leveraged by other schedulers to support similar types of functionality and would allow us to avoid making changes to all the plumbing each time functionality is added. Other possible uses might include suspending pools or fetching scheduler-specific information. This functionality could be made available through an RPC server within each scheduler, but that would require reserving another port, and would introduce unnecessary confusion if different schedulers implemented the same mechanism in a different way. A client should be able to send an RPC with a set of key/value pairs, which would be passed to the scheduler. The RPC would return a string (or set of key/value pairs). -- This message was sent by Atlassian JIRA (v6.1#6144)
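For illustration only (YARN-250 was resolved as Won't Fix, so no such API exists in YARN): a hypothetical sketch of the key/value scheduler RPC that the description proposes, with all names invented here:
{code}
import java.util.Map;

// Hypothetical illustration of the proposed mechanism: a client sends an
// opaque set of key/value pairs, the RM hands them to the active scheduler,
// and the scheduler returns key/value pairs (or a string) in response.
public interface SchedulerCommandProtocolSketch {
  Map<String, String> invokeSchedulerCommand(Map<String, String> args);
}
{code}
Under this sketch, a scheduler-specific command line tool could, for example, pass a queue name and an application id to move an application between Fair Scheduler queues without adding new RM plumbing for each feature.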
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789769#comment-13789769 ] Alejandro Abdelnur commented on YARN-1284: -- Tested in a cluster using cgroups; it works as expected, both the delete and the timeouts. LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch, YARN-1284.patch, YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between the SIGTERM/SIGKILL and the OS doing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like: {code} 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 {code} CgroupsLCEResourcesHandler.clearLimits() has logic to wait 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at a more time-efficient way of doing this, perhaps spinning with a minimal sleep and a timeout while deleteCgroup() cannot complete. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1283) Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY
[ https://issues.apache.org/jira/browse/YARN-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1283: Attachment: YARN-1283.20131008.1.patch Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY - Key: YARN-1283 URL: https://issues.apache.org/jira/browse/YARN-1283 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.1.1-beta Reporter: Yesha Vora Assignee: Omkar Vinit Joshi Labels: newbie Attachments: YARN-1283.20131007.1.patch, YARN-1283.20131008.1.patch After setting yarn.http.policy=HTTPS_ONLY, the job output shows an incorrect "The url to track the job" value. Currently, it prints http://RM:httpsport/proxy/application_1381162886563_0001/ instead of https://RM:httpsport/proxy/application_1381162886563_0001/; http://hostname:8088/proxy/application_1381162886563_0001/ is invalid. hadoop jar hadoop-mapreduce-client-jobclient-tests.jar sleep -m 1 -r 1 13/10/07 18:39:39 INFO client.RMProxy: Connecting to ResourceManager at hostname/100.00.00.000:8032 13/10/07 18:39:40 INFO mapreduce.JobSubmitter: number of splits:1 13/10/07 18:39:40 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.partitioner.class is deprecated. Instead, use mapreduce.job.partitioner.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.working.dir is deprecated. 
Instead, use mapreduce.job.working.dir 13/10/07 18:39:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1381162886563_0001 13/10/07 18:39:40 INFO impl.YarnClientImpl: Submitted application application_1381162886563_0001 to ResourceManager at hostname/100.00.00.000:8032 13/10/07 18:39:40 INFO mapreduce.Job: The url to track the job: http://hostname:8088/proxy/application_1381162886563_0001/ 13/10/07 18:39:40 INFO mapreduce.Job: Running job: job_1381162886563_0001 13/10/07 18:39:46 INFO mapreduce.Job: Job job_1381162886563_0001 running in uber mode : false 13/10/07 18:39:46 INFO mapreduce.Job: map 0% reduce 0% 13/10/07 18:39:53 INFO mapreduce.Job: map 100% reduce 0% 13/10/07 18:39:58 INFO mapreduce.Job: map 100% reduce 100% 13/10/07 18:39:58 INFO mapreduce.Job: Job job_1381162886563_0001 completed successfully 13/10/07 18:39:58 INFO mapreduce.Job: Counters: 43 File System Counters FILE: Number of bytes read=26 FILE: Number of bytes written=177279 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=48 HDFS: Number of bytes written=0 HDFS: Number of read operations=1 HDFS: Number of large read operations=0 HDFS: Number of write operations=0 Job Counters Launched map tasks=1 Launched reduce tasks=1 Other local map
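Not the actual YARN-1283 patch: a minimal sketch of the scheme selection the fix needs, assuming the policy is read from the yarn.http.policy configuration key mentioned in the report (the helper class and method names are illustrative):
{code}
import org.apache.hadoop.conf.Configuration;

// Illustrative only: build the proxy tracking URL with a scheme that matches
// the configured HTTP policy instead of hard-coding "http://".
class TrackingUrlSketch {
  static String trackingUrl(Configuration conf, String proxyHostPort, String appId) {
    String policy = conf.get("yarn.http.policy", "HTTP_ONLY");
    String scheme = "HTTPS_ONLY".equals(policy) ? "https://" : "http://";
    return scheme + proxyHostPort + "/proxy/" + appId + "/";
  }
}
{code}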
[jira] [Commented] (YARN-1283) Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY
[ https://issues.apache.org/jira/browse/YARN-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789784#comment-13789784 ] Omkar Vinit Joshi commented on YARN-1283: - Fixing the javadoc warning. Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY - Key: YARN-1283 URL: https://issues.apache.org/jira/browse/YARN-1283 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.1.1-beta Reporter: Yesha Vora Assignee: Omkar Vinit Joshi Labels: newbie Attachments: YARN-1283.20131007.1.patch, YARN-1283.20131008.1.patch After setting yarn.http.policy=HTTPS_ONLY, the job output shows an incorrect "The url to track the job" value. Currently, it prints http://RM:httpsport/proxy/application_1381162886563_0001/ instead of https://RM:httpsport/proxy/application_1381162886563_0001/; http://hostname:8088/proxy/application_1381162886563_0001/ is invalid. hadoop jar hadoop-mapreduce-client-jobclient-tests.jar sleep -m 1 -r 1 13/10/07 18:39:39 INFO client.RMProxy: Connecting to ResourceManager at hostname/100.00.00.000:8032 13/10/07 18:39:40 INFO mapreduce.JobSubmitter: number of splits:1 13/10/07 18:39:40 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.partitioner.class is deprecated. Instead, use mapreduce.job.partitioner.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.working.dir is deprecated. 
Instead, use mapreduce.job.working.dir 13/10/07 18:39:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1381162886563_0001 13/10/07 18:39:40 INFO impl.YarnClientImpl: Submitted application application_1381162886563_0001 to ResourceManager at hostname/100.00.00.000:8032 13/10/07 18:39:40 INFO mapreduce.Job: The url to track the job: http://hostname:8088/proxy/application_1381162886563_0001/ 13/10/07 18:39:40 INFO mapreduce.Job: Running job: job_1381162886563_0001 13/10/07 18:39:46 INFO mapreduce.Job: Job job_1381162886563_0001 running in uber mode : false 13/10/07 18:39:46 INFO mapreduce.Job: map 0% reduce 0% 13/10/07 18:39:53 INFO mapreduce.Job: map 100% reduce 0% 13/10/07 18:39:58 INFO mapreduce.Job: map 100% reduce 100% 13/10/07 18:39:58 INFO mapreduce.Job: Job job_1381162886563_0001 completed successfully 13/10/07 18:39:58 INFO mapreduce.Job: Counters: 43 File System Counters FILE: Number of bytes read=26 FILE: Number of bytes written=177279 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=48 HDFS: Number of bytes written=0 HDFS: Number of read operations=1 HDFS: Number of large read operations=0 HDFS: Number of write operations=0 Job Counters Launched map tasks=1 Launched reduce
[jira] [Created] (YARN-1286) Schedulers doesn't check whether ACL is enabled or not when adding an application
Zhijie Shen created YARN-1286: - Summary: Schedulers doesn't check whether ACL is enabled or not when adding an application Key: YARN-1286 URL: https://issues.apache.org/jira/browse/YARN-1286 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen The schedulers don't check whether ACLs are enabled when an application is added. However, QueueACLsManager does check this for ClientRMService when getting application(s) and killing an application. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1283) Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY
[ https://issues.apache.org/jira/browse/YARN-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789854#comment-13789854 ] Hadoop QA commented on YARN-1283: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12607454/YARN-1283.20131008.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.mapred.TestClientServiceDelegate org.apache.hadoop.mapred.TestJobCleanup The following test timeouts occurred in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.mapreduce.v2.TestUberAM The test build failed in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2147//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2147//console This message is automatically generated. Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY - Key: YARN-1283 URL: https://issues.apache.org/jira/browse/YARN-1283 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.1.1-beta Reporter: Yesha Vora Assignee: Omkar Vinit Joshi Labels: newbie Attachments: YARN-1283.20131007.1.patch, YARN-1283.20131008.1.patch After setting yarn.http.policy=HTTPS_ONLY, the job output shows an incorrect "The url to track the job" value. Currently, it prints http://RM:httpsport/proxy/application_1381162886563_0001/ instead of https://RM:httpsport/proxy/application_1381162886563_0001/; http://hostname:8088/proxy/application_1381162886563_0001/ is invalid. hadoop jar hadoop-mapreduce-client-jobclient-tests.jar sleep -m 1 -r 1 13/10/07 18:39:39 INFO client.RMProxy: Connecting to ResourceManager at hostname/100.00.00.000:8032 13/10/07 18:39:40 INFO mapreduce.JobSubmitter: number of splits:1 13/10/07 18:39:40 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.jar is deprecated. 
Instead, use mapreduce.job.jar 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.partitioner.class is deprecated. Instead, use mapreduce.job.partitioner.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated.
[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789862#comment-13789862 ] Tsuyoshi OZAWA commented on YARN-556: - Hi Bikas, can you share the current status of this JIRA? RM Restart phase 2 - Work preserving restart Key: YARN-556 URL: https://issues.apache.org/jira/browse/YARN-556 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Bikas Saha Assignee: Bikas Saha Attachments: Work Preserving RM Restart.pdf YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789861#comment-13789861 ] Sandy Ryza commented on YARN-1284: -- A few nits. Otherwise LGTM.
{code}
+  //package private for testing purposes
+  private long deleteCgroupTimeout;
+  Clock clock;
{code}
The comment should go before the second variable. Also, there should be a space after the //.
{code}
+  //visible for testing
{code}
Should the VisibleForTesting annotation be used? This is in two places.
{code}
+    LOG.debug("deleteCgroup: " + cgroupPath);
{code}
Should be surrounded by if (LOG.isDebugEnabled()).
{code}
+    //file exists
{code}
Space after //?
LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch, YARN-1284.patch, YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between the SIGTERM/SIGKILL and the OS doing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like: {code} 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG 2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11 {code} CgroupsLCEResourcesHandler.clearLimits() has logic to wait 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at a more time-efficient way of doing this, perhaps spinning with a minimal sleep and a timeout while deleteCgroup() cannot complete. -- This message was sent by Atlassian JIRA (v6.1#6144)
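For clarity, the guarded-logging and annotation forms suggested in the review comment above would look roughly like this (a sketch of the suggestions, not the actual patch):
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import com.google.common.annotations.VisibleForTesting;

// Sketch of the review suggestions: use the Guava @VisibleForTesting
// annotation instead of a bare comment, and guard debug logging so the
// string concatenation is skipped when DEBUG is disabled.
class ReviewSuggestionsSketch {
  private static final Log LOG = LogFactory.getLog(ReviewSuggestionsSketch.class);

  @VisibleForTesting
  long deleteCgroupTimeout;

  void logDelete(String cgroupPath) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("deleteCgroup: " + cgroupPath);
    }
  }
}
{code}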
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789864#comment-13789864 ] Sandy Ryza commented on YARN-1284: -- Oh, and also:
{code}
+if (! new File(cgroupPath).delete()) {
+  LOG.warn("Unable to delete cgroup at: " + cgroupPath +
+      ", tried to delete for " + deleteCgroupTimeout + " ms");
+}
{code}
If the file was already deleted, delete() will return false and we'll log the warning even though nothing went wrong. Instead, we should just check if (!deleted). LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch, YARN-1284.patch, YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between sending the SIGTERM/SIGKILL and the OS completing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like:
{code}
2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143
2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG
2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
{code}
CgroupsLCEResourcesHandler.clearLimits() has logic to wait for 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at doing this in a more time-efficient way, perhaps spinning while deleteCgroup() cannot be done, with a minimal sleep and a timeout. -- This message was sent by Atlassian JIRA (v6.1#6144)
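A minimal sketch of the suggested check (hedged; it assumes the surrounding retry loop records its outcome in a {{deleted}} flag, which is not shown here):
{code}
// Rely on the outcome of the retry loop instead of issuing one more delete():
// File.delete() returns false for a path an earlier attempt already removed,
// which would produce a spurious warning.
if (!deleted) {
  LOG.warn("Unable to delete cgroup at: " + cgroupPath
      + ", tried to delete for " + deleteCgroupTimeout + " ms");
}
{code}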
[jira] [Commented] (YARN-321) Generic application history service
[ https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789876#comment-13789876 ] Sandy Ryza commented on YARN-321: - Was a design doc ever written up for this? The HistoryStorageDemo.java is a good start for understanding some of the interfaces, but it would be helpful to have something that explains things like what the Application History Service's role is, how it interacts with the RM, and key differences and similarities with the Job History Server. Generic application history service --- Key: YARN-321 URL: https://issues.apache.org/jira/browse/YARN-321 Project: Hadoop YARN Issue Type: Improvement Reporter: Luke Lu Assignee: Vinod Kumar Vavilapalli Attachments: HistoryStorageDemo.java The mapreduce job history server currently needs to be deployed as a trusted server in sync with the mapreduce runtime. Every new application would need a similar application history server. Having to deploy O(T*V) (where T is number of type of application, V is number of version of application) trusted servers is clearly not scalable. Job history storage handling itself is pretty generic: move the logs and history data into a particular directory for later serving. Job history data is already stored as json (or binary avro). I propose that we create only one trusted application history server, which can have a generic UI (display json as a tree of strings) as well. Specific application/version can deploy untrusted webapps (a la AMs) to query the application history server and interpret the json for its specific UI and/or analytics. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789919#comment-13789919 ] Bikas Saha commented on YARN-556: - Thanks for the reminder. Based on the attached proposal, I am going to create sub-tasks of this jira. Contributors are free to pick up those tasks. RM Restart phase 2 - Work preserving restart Key: YARN-556 URL: https://issues.apache.org/jira/browse/YARN-556 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Bikas Saha Assignee: Bikas Saha Attachments: Work Preserving RM Restart.pdf YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1283) Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY
[ https://issues.apache.org/jira/browse/YARN-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1283: Attachment: YARN-1283.20131008.2.patch Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY - Key: YARN-1283 URL: https://issues.apache.org/jira/browse/YARN-1283 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.1.1-beta Reporter: Yesha Vora Assignee: Omkar Vinit Joshi Labels: newbie Attachments: YARN-1283.20131007.1.patch, YARN-1283.20131008.1.patch, YARN-1283.20131008.2.patch After setting yarn.http.policy=HTTPS_ONLY, the job output shows incorrect The url to track the job. Currently, its printing http://RM:httpsport/proxy/application_1381162886563_0001/ instead https://RM:httpsport/proxy/application_1381162886563_0001/ http://hostname:8088/proxy/application_1381162886563_0001/ is invalid hadoop jar hadoop-mapreduce-client-jobclient-tests.jar sleep -m 1 -r 1 13/10/07 18:39:39 INFO client.RMProxy: Connecting to ResourceManager at hostname/100.00.00.000:8032 13/10/07 18:39:40 INFO mapreduce.JobSubmitter: number of splits:1 13/10/07 18:39:40 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.partitioner.class is deprecated. Instead, use mapreduce.job.partitioner.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.working.dir is deprecated. 
Instead, use mapreduce.job.working.dir 13/10/07 18:39:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1381162886563_0001 13/10/07 18:39:40 INFO impl.YarnClientImpl: Submitted application application_1381162886563_0001 to ResourceManager at hostname/100.00.00.000:8032 13/10/07 18:39:40 INFO mapreduce.Job: The url to track the job: http://hostname:8088/proxy/application_1381162886563_0001/ 13/10/07 18:39:40 INFO mapreduce.Job: Running job: job_1381162886563_0001 13/10/07 18:39:46 INFO mapreduce.Job: Job job_1381162886563_0001 running in uber mode : false 13/10/07 18:39:46 INFO mapreduce.Job: map 0% reduce 0% 13/10/07 18:39:53 INFO mapreduce.Job: map 100% reduce 0% 13/10/07 18:39:58 INFO mapreduce.Job: map 100% reduce 100% 13/10/07 18:39:58 INFO mapreduce.Job: Job job_1381162886563_0001 completed successfully 13/10/07 18:39:58 INFO mapreduce.Job: Counters: 43 File System Counters FILE: Number of bytes read=26 FILE: Number of bytes written=177279 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=48 HDFS: Number of bytes written=0 HDFS: Number of read operations=1 HDFS: Number of large read operations=0 HDFS: Number of write operations=0 Job Counters Launched map tasks=1 Launched reduce tasks=1
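The shape of the fix is to derive the tracking-URL scheme from the configured policy rather than hard-coding http://. A minimal sketch, assuming a hypothetical helper that reads yarn.http.policy directly (the actual patch may plumb this through existing HTTP configuration utilities instead):
{code}
import org.apache.hadoop.conf.Configuration;

// Illustrative only: pick the tracking-URL scheme based on yarn.http.policy.
public final class TrackingUrlScheme {
  private TrackingUrlScheme() {}

  public static String schemePrefix(Configuration conf) {
    String policy = conf.get("yarn.http.policy", "HTTP_ONLY");
    return "HTTPS_ONLY".equals(policy) ? "https://" : "http://";
  }
}
{code}
With HTTPS_ONLY configured, the proxy URL would then be built as schemePrefix(conf) + host + ":" + port + "/proxy/" + appId + "/" instead of always starting with http://.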
[jira] [Commented] (YARN-1283) Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY
[ https://issues.apache.org/jira/browse/YARN-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789949#comment-13789949 ] Omkar Vinit Joshi commented on YARN-1283: - MAPREDUCE-5552 is tracking TestJobCleanup failure Fixed other test case. Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY - Key: YARN-1283 URL: https://issues.apache.org/jira/browse/YARN-1283 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.1.1-beta Reporter: Yesha Vora Assignee: Omkar Vinit Joshi Labels: newbie Attachments: YARN-1283.20131007.1.patch, YARN-1283.20131008.1.patch, YARN-1283.20131008.2.patch After setting yarn.http.policy=HTTPS_ONLY, the job output shows incorrect The url to track the job. Currently, its printing http://RM:httpsport/proxy/application_1381162886563_0001/ instead https://RM:httpsport/proxy/application_1381162886563_0001/ http://hostname:8088/proxy/application_1381162886563_0001/ is invalid hadoop jar hadoop-mapreduce-client-jobclient-tests.jar sleep -m 1 -r 1 13/10/07 18:39:39 INFO client.RMProxy: Connecting to ResourceManager at hostname/100.00.00.000:8032 13/10/07 18:39:40 INFO mapreduce.JobSubmitter: number of splits:1 13/10/07 18:39:40 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.partitioner.class is deprecated. Instead, use mapreduce.job.partitioner.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.working.dir is deprecated. 
Instead, use mapreduce.job.working.dir 13/10/07 18:39:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1381162886563_0001 13/10/07 18:39:40 INFO impl.YarnClientImpl: Submitted application application_1381162886563_0001 to ResourceManager at hostname/100.00.00.000:8032 13/10/07 18:39:40 INFO mapreduce.Job: The url to track the job: http://hostname:8088/proxy/application_1381162886563_0001/ 13/10/07 18:39:40 INFO mapreduce.Job: Running job: job_1381162886563_0001 13/10/07 18:39:46 INFO mapreduce.Job: Job job_1381162886563_0001 running in uber mode : false 13/10/07 18:39:46 INFO mapreduce.Job: map 0% reduce 0% 13/10/07 18:39:53 INFO mapreduce.Job: map 100% reduce 0% 13/10/07 18:39:58 INFO mapreduce.Job: map 100% reduce 100% 13/10/07 18:39:58 INFO mapreduce.Job: Job job_1381162886563_0001 completed successfully 13/10/07 18:39:58 INFO mapreduce.Job: Counters: 43 File System Counters FILE: Number of bytes read=26 FILE: Number of bytes written=177279 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=48 HDFS: Number of bytes written=0 HDFS: Number of read operations=1 HDFS: Number of large read operations=0 HDFS: Number of write operations=0 Job
[jira] [Updated] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alejandro Abdelnur updated YARN-1284: - Attachment: YARN-1284.patch Addressing Sandy's comments. Reworked the while-loop logic using a do-while block; it seems a bit cleaner that way. LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch, YARN-1284.patch, YARN-1284.patch, YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between sending the SIGTERM/SIGKILL and the OS completing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like:
{code}
2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143
2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG
2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
{code}
CgroupsLCEResourcesHandler.clearLimits() has logic to wait for 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at doing this in a more time-efficient way, perhaps spinning while deleteCgroup() cannot be done, with a minimal sleep and a timeout. -- This message was sent by Atlassian JIRA (v6.1#6144)
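A hedged sketch of what such a do-while spin could look like, shown as a method body; the retry interval is a placeholder, the field names come from the snippets quoted earlier, and this is not the patch verbatim:
{code}
// Spin on deletion with a small sleep, and give up once the timeout expires.
boolean deleted = false;
long start = clock.getTime();
do {
  deleted = new java.io.File(cgroupPath).delete();
  if (!deleted) {
    try {
      Thread.sleep(20);
    } catch (InterruptedException ie) {
      // keep trying until the timeout expires
    }
  }
} while (!deleted && (clock.getTime() - start) < deleteCgroupTimeout);
if (!deleted) {
  LOG.warn("Unable to delete cgroup at: " + cgroupPath
      + ", tried to delete for " + deleteCgroupTimeout + " ms");
}
{code}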
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789978#comment-13789978 ] Alejandro Abdelnur commented on YARN-1284: -- For the record, I've spent a couple of hours trying an alternate approach suggested by [~rvs] while chatting offline about this. His suggestion was to initialize a trash cgroup next to the container cgroups and, when a container is cleaned up, transition the container's tasks to the trash cgroup, doing the equivalent of a {{cat container/tasks > trash/tasks}}. I tried doing that, but it seems some of the Java IO native calls make a system call which is not supported by the cgroups filesystem implementation, and I was getting the following stack trace:
{code}
java.io.IOException: Argument list too long
java.io.IOException: Argument list too long
 at java.io.FileOutputStream.writeBytes(Native Method)
 at java.io.FileOutputStream.write(FileOutputStream.java:318)
 at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:80)
 ...
{code}
Given this, besides the fact that I didn't get it to work properly, I would not be comfortable with this approach as it may behave differently on different Linux versions. LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch, YARN-1284.patch, YARN-1284.patch, YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between sending the SIGTERM/SIGKILL and the OS completing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like:
{code}
2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143
2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG
2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
{code}
CgroupsLCEResourcesHandler.clearLimits() has logic to wait for 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at doing this in a more time-efficient way, perhaps spinning while deleteCgroup() cannot be done, with a minimal sleep and a timeout. -- This message was sent by Atlassian JIRA (v6.1#6144)
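For context, the failing bulk copy roughly corresponds to the sketch below. This is hedged: the paths and method wrapper are illustrative, and the one-PID-per-write behaviour of the cgroup tasks pseudo-file is the usual explanation for the "Argument list too long" error rather than something verified here.
{code}
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IOUtils;

class CgroupTaskMoveSketch {
  // Java equivalent of "cat container/tasks > trash/tasks": a single bulk write.
  // The cgroups tasks file generally accepts only one PID per write(), so the
  // bulk write issued by copyBytes fails as seen in the stack trace above.
  static void moveTasksToTrash(String containerTasks, String trashTasks)
      throws IOException {
    FileInputStream in = new FileInputStream(containerTasks);
    FileOutputStream out = new FileOutputStream(trashTasks);
    try {
      IOUtils.copyBytes(in, out, 4096, false);
    } finally {
      in.close();
      out.close();
    }
  }
}
{code}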
[jira] [Commented] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789977#comment-13789977 ] Hadoop QA commented on YARN-1284: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12607499/YARN-1284.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2149//console This message is automatically generated. LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch, YARN-1284.patch, YARN-1284.patch, YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between sending the SIGTERM/SIGKILL and the OS completing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like:
{code}
2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143
2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG
2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
{code}
CgroupsLCEResourcesHandler.clearLimits() has logic to wait for 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at doing this in a more time-efficient way, perhaps spinning while deleteCgroup() cannot be done, with a minimal sleep and a timeout. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1284) LCE: Race condition leaves dangling cgroups entries for killed containers
[ https://issues.apache.org/jira/browse/YARN-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alejandro Abdelnur updated YARN-1284: - Attachment: YARN-1284.patch Updating patch with one last change (which was not in my git cache): the default timeout is now 1000 ms (up from 500 ms). While testing this in a 4-node cluster running pi 500 500, there was one occurrence of a leftover container cgroup because of a timeout. This was done in a cluster running in VMs, which would explain hitting the 500 ms timeout, but I'd still rather bump it up given that the wait breaks as soon as the cgroup is deleted and the attempts happen every 20 ms. LCE: Race condition leaves dangling cgroups entries for killed containers - Key: YARN-1284 URL: https://issues.apache.org/jira/browse/YARN-1284 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Blocker Attachments: YARN-1284.patch, YARN-1284.patch, YARN-1284.patch, YARN-1284.patch, YARN-1284.patch When LCE cgroups are enabled and a container is killed (in this case by its owning AM, an MRAM), there seems to be a race condition at the OS level between sending the SIGTERM/SIGKILL and the OS completing all the necessary cleanup. The LCE code, after sending the SIGTERM/SIGKILL and getting the exit code, immediately attempts to clean up the cgroups entry for the container. But this fails with an error like:
{code}
2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1381179532433_0016_01_11 is : 143
2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1381179532433_0016_01_11 of type UPDATE_DIAGNOSTICS_MSG
2013-10-07 15:21:24,359 DEBUG org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: deleteCgroup: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
2013-10-07 15:21:24,359 WARN org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /run/cgroups/cpu/hadoop-yarn/container_1381179532433_0016_01_11
{code}
CgroupsLCEResourcesHandler.clearLimits() has logic to wait for 500 ms for AM containers to avoid this problem. It seems this should be done for all containers. Still, waiting an extra 500 ms seems too expensive. We should look at doing this in a more time-efficient way, perhaps spinning while deleteCgroup() cannot be done, with a minimal sleep and a timeout. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1258) Allow configuring the Fair Scheduler root queue
[ https://issues.apache.org/jira/browse/YARN-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789989#comment-13789989 ] Alejandro Abdelnur commented on YARN-1258: -- LGTM, just one minor thing: instead of doing:
{code}
if (!(node instanceof Element)) {
  continue;
}
{code}
I would do:
{code}
if (node instanceof Element) {
  // ALL THE REST OF THE FOR-LOOP BLOCK
}
{code}
+1 after this and a Jenkins +1. Allow configuring the Fair Scheduler root queue --- Key: YARN-1258 URL: https://issues.apache.org/jira/browse/YARN-1258 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.1.1-beta Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: YARN-1258.patch This would be useful for acls, maxRunningApps, scheduling modes, etc. The allocation file should be able to accept both: * An implicit root queue * A root queue at the top of the hierarchy with all queues under/inside of it -- This message was sent by Atlassian JIRA (v6.1#6144)
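For concreteness, the two shapes side by side in a hedged sketch; it assumes the surrounding code iterates a DOM NodeList, and the loop bodies are illustrative rather than the actual allocation-file parsing code:
{code}
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class AllocLoopShapes {
  // Current shape: skip non-element nodes with an early continue.
  static void withContinue(NodeList nodes) {
    for (int i = 0; i < nodes.getLength(); i++) {
      Node node = nodes.item(i);
      if (!(node instanceof Element)) {
        continue;
      }
      Element element = (Element) node;
      // ... rest of the for-loop block ...
    }
  }

  // Suggested shape: wrap the loop body in the positive check instead.
  static void withPositiveCheck(NodeList nodes) {
    for (int i = 0; i < nodes.getLength(); i++) {
      Node node = nodes.item(i);
      if (node instanceof Element) {
        Element element = (Element) node;
        // ... rest of the for-loop block ...
      }
    }
  }
}
{code}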
[jira] [Commented] (YARN-1283) Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY
[ https://issues.apache.org/jira/browse/YARN-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13790011#comment-13790011 ] Hadoop QA commented on YARN-1283: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12607494/YARN-1283.20131008.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.mapred.TestJobCleanup org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.TestRMAppAttemptTransitions The following test timeouts occurred in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.mapreduce.v2.TestUberAM {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2148//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2148//console This message is automatically generated. Invalid 'url of job' mentioned in Job output with yarn.http.policy=HTTPS_ONLY - Key: YARN-1283 URL: https://issues.apache.org/jira/browse/YARN-1283 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.1.1-beta Reporter: Yesha Vora Assignee: Omkar Vinit Joshi Labels: newbie Attachments: YARN-1283.20131007.1.patch, YARN-1283.20131008.1.patch, YARN-1283.20131008.2.patch After setting yarn.http.policy=HTTPS_ONLY, the job output shows incorrect The url to track the job. Currently, its printing http://RM:httpsport/proxy/application_1381162886563_0001/ instead https://RM:httpsport/proxy/application_1381162886563_0001/ http://hostname:8088/proxy/application_1381162886563_0001/ is invalid hadoop jar hadoop-mapreduce-client-jobclient-tests.jar sleep -m 1 -r 1 13/10/07 18:39:39 INFO client.RMProxy: Connecting to ResourceManager at hostname/100.00.00.000:8032 13/10/07 18:39:40 INFO mapreduce.JobSubmitter: number of splits:1 13/10/07 18:39:40 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. 
Instead, use mapreduce.job.reduces 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.partitioner.class is deprecated. Instead, use mapreduce.job.partitioner.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class 13/10/07 18:39:40 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
[jira] [Created] (YARN-1287) Consolidate MockClocks
Sandy Ryza created YARN-1287: Summary: Consolidate MockClocks Key: YARN-1287 URL: https://issues.apache.org/jira/browse/YARN-1287 Project: Hadoop YARN Issue Type: Improvement Reporter: Sandy Ryza A bunch of different tests have near-identical implementations of MockClock: TestFairScheduler, TestFSSchedulerApp, and TestCgroupsLCEResourcesHandler, for example. They should be consolidated into a single MockClock. -- This message was sent by Atlassian JIRA (v6.1#6144)
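A minimal sketch of what the consolidated class could look like, assuming it implements org.apache.hadoop.yarn.util.Clock; the existing per-test copies may differ in small ways:
{code}
import org.apache.hadoop.yarn.util.Clock;

// A controllable clock for tests: time advances only when the test says so.
public class MockClock implements Clock {
  private long time;

  public MockClock() {
    this(0);
  }

  public MockClock(long initialTime) {
    this.time = initialTime;
  }

  @Override
  public long getTime() {
    return time;
  }

  // Advance the clock by the given number of milliseconds.
  public void tick(long ms) {
    time += ms;
  }
}
{code}
A test would then construct the scheduler or resources handler with this clock and call tick() to simulate elapsed time instead of sleeping.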
[jira] [Updated] (YARN-321) Generic application history service
[ https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-321: Attachment: ApplicationHistoryServiceHighLevel.pdf Generic application history service --- Key: YARN-321 URL: https://issues.apache.org/jira/browse/YARN-321 Project: Hadoop YARN Issue Type: Improvement Reporter: Luke Lu Assignee: Vinod Kumar Vavilapalli Attachments: AHS Diagram.pdf, ApplicationHistoryServiceHighLevel.pdf, HistoryStorageDemo.java The mapreduce job history server currently needs to be deployed as a trusted server in sync with the mapreduce runtime. Every new application would need a similar application history server. Having to deploy O(T*V) (where T is number of type of application, V is number of version of application) trusted servers is clearly not scalable. Job history storage handling itself is pretty generic: move the logs and history data into a particular directory for later serving. Job history data is already stored as json (or binary avro). I propose that we create only one trusted application history server, which can have a generic UI (display json as a tree of strings) as well. Specific application/version can deploy untrusted webapps (a la AMs) to query the application history server and interpret the json for its specific UI and/or analytics. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-321) Generic application history service
[ https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13790047#comment-13790047 ] Sandy Ryza commented on YARN-321: - Thanks Vinod and Zhijie. Didn't see the comment. I'm going to attach your outline as a pdf to make it a little easier for passers-by to learn about. Here's the google doc it came from if you want to edit: https://docs.google.com/document/d/1cNsdGyLuagR8lzfeQrAclOAd-AdkVwgST6OG8Zzp43M/edit#heading=h.15p1lkmmm9g8 Generic application history service --- Key: YARN-321 URL: https://issues.apache.org/jira/browse/YARN-321 Project: Hadoop YARN Issue Type: Improvement Reporter: Luke Lu Assignee: Vinod Kumar Vavilapalli Attachments: AHS Diagram.pdf, ApplicationHistoryServiceHighLevel.pdf, HistoryStorageDemo.java The mapreduce job history server currently needs to be deployed as a trusted server in sync with the mapreduce runtime. Every new application would need a similar application history server. Having to deploy O(T*V) (where T is number of type of application, V is number of version of application) trusted servers is clearly not scalable. Job history storage handling itself is pretty generic: move the logs and history data into a particular directory for later serving. Job history data is already stored as json (or binary avro). I propose that we create only one trusted application history server, which can have a generic UI (display json as a tree of strings) as well. Specific application/version can deploy untrusted webapps (a la AMs) to query the application history server and interpret the json for its specific UI and/or analytics. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1258) Allow configuring the Fair Scheduler root queue
[ https://issues.apache.org/jira/browse/YARN-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-1258: - Attachment: YARN-1258-1.patch Allow configuring the Fair Scheduler root queue --- Key: YARN-1258 URL: https://issues.apache.org/jira/browse/YARN-1258 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.1.1-beta Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: YARN-1258-1.patch, YARN-1258.patch This would be useful for acls, maxRunningApps, scheduling modes, etc. The allocation file should be able to accept both: * An implicit root queue * A root queue at the top of the hierarchy with all queues under/inside of it -- This message was sent by Atlassian JIRA (v6.1#6144)