[jira] [Commented] (YARN-1621) Add CLI to list rows of task attempt ID, container ID, host of container, state of container
[ https://issues.apache.org/jira/browse/YARN-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342593#comment-14342593 ] Naganarasimha G R commented on YARN-1621: - Hi Bartosz Lugowski, sorry for the delayed response; I will share the review comments today... Add CLI to list rows of task attempt ID, container ID, host of container, state of container -- Key: YARN-1621 URL: https://issues.apache.org/jira/browse/YARN-1621 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.2.0 Reporter: Tassapol Athiapinya Assignee: Bartosz Ługowski Attachments: YARN-1621.1.patch, YARN-1621.2.patch, YARN-1621.3.patch, YARN-1621.4.patch As more applications are moved to YARN, we need a generic CLI to list rows of task attempt ID, container ID, host of container, and state of container. Today, if a YARN application running in a container hangs, there is no way to find out more information because the user does not know where each attempt is running. For each running application, it is useful to differentiate between running/succeeded/failed/killed containers.
{code:title=proposed yarn cli}
$ yarn application -list-containers -applicationId appId [-containerState stateOfContainer]

where containerState is an optional filter that lists containers in the given state only.
The container state can be running/succeeded/killed/failed/all. A user can specify more
than one container state at once, e.g. KILLED,FAILED.

task attempt ID  container ID  host of container  state of container
{code}
The CLI should work with both running and completed applications. If a container runs many task attempts, all attempts should be shown; that will likely be the case for Tez container-reuse applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
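For illustration only, a hypothetical invocation of the proposed CLI; the application id, hosts, and output rows below are made up, since the command does not exist yet:
{code}
$ yarn application -list-containers -applicationId application_1425000000000_0001 -containerState RUNNING,FAILED
attempt_1425000000000_0001_r_000001_0  container_1425000000000_0001_01_000002  host1.example.com:45454  RUNNING
attempt_1425000000000_0001_m_000003_1  container_1425000000000_0001_01_000007  host2.example.com:45454  FAILED
{code}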
[jira] [Updated] (YARN-2764) counters.LimitExceededException shouldn't abort AsyncDispatcher
[ https://issues.apache.org/jira/browse/YARN-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated YARN-2764: - Description: I saw the following in the container log:
{code}
2014-10-25 10:28:55,052 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attempt attempt_1414221548789_0023_r_03_0
2014-10-25 10:28:55,052 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1414221548789_0023_r_03 Task Transitioned from RUNNING to SUCCEEDED
2014-10-25 10:28:55,052 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 24
2014-10-25 10:28:55,053 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1414221548789_0023 Job Transitioned from RUNNING to COMMITTING
2014-10-25 10:28:55,054 INFO [CommitterEvent Processor #1] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: JOB_COMMIT
2014-10-25 10:28:55,177 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many counters: 121 max=120
	at org.apache.hadoop.mapreduce.counters.Limits.checkCounters(Limits.java:101)
	at org.apache.hadoop.mapreduce.counters.Limits.incrCounters(Limits.java:108)
	at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounter(AbstractCounterGroup.java:78)
	at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounterImpl(AbstractCounterGroup.java:95)
	at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.findCounter(AbstractCounterGroup.java:106)
	at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.incrAllCounters(AbstractCounterGroup.java:203)
	at org.apache.hadoop.mapreduce.counters.AbstractCounters.incrAllCounters(AbstractCounters.java:348)
	at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.constructFinalFullcounters(JobImpl.java:1754)
	at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.mayBeConstructFinalFullCounters(JobImpl.java:1737)
	at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.createJobFinishedEvent(JobImpl.java:1718)
	at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.logJobHistoryFinishedEvent(JobImpl.java:1089)
	at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$CommitSucceededTransition.transition(JobImpl.java:2049)
	at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$CommitSucceededTransition.transition(JobImpl.java:2045)
	at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
	at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:996)
	at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:138)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1289)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1285)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
	at java.lang.Thread.run(Thread.java:745)
2014-10-25 10:28:55,185 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye..
{code}
The counter limit was exceeded when the JobFinishedEvent was created. Better handling of LimitExceededException should be provided so that AsyncDispatcher can continue functioning. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
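As a stopgap rather than a fix for the dispatcher abort itself, the limit hit above is configurable. A minimal sketch, assuming the job's driver implements Tool so that -D properties are honored; my-job.jar and MyDriver are placeholders:
{code}
# Raise mapreduce.job.counters.max (the limit checked in Limits.checkCounters()
# in the stack trace above; default 120) for a job known to need more counters.
hadoop jar my-job.jar MyDriver -Dmapreduce.job.counters.max=200 input output
{code}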
[jira] [Commented] (YARN-2828) Enable auto refresh of web pages (using http parameter)
[ https://issues.apache.org/jira/browse/YARN-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342074#comment-14342074 ] Allen Wittenauer commented on YARN-2828: If jsoup is really the only choice here, rather than something else we already have in use, then its version needs to be defined in hadoop-project rather than buried inside yarn. Enable auto refresh of web pages (using http parameter) --- Key: YARN-2828 URL: https://issues.apache.org/jira/browse/YARN-2828 Project: Hadoop YARN Issue Type: Improvement Reporter: Tim Robertson Assignee: Vijay Bhat Priority: Minor Attachments: YARN-2828.001.patch The MR1 JobTracker had a useful HTTP parameter, e.g. refresh=3, that could be appended to URLs to make the page reload itself. This was very useful when developing mapreduce jobs, especially for watching counters change. It is lost in the YARN interface. This could be implemented as a page element (e.g. a drop-down), but I'd recommend not cluttering the page further and simply bringing back the optional refresh HTTP param. It worked really nicely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
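Until such a parameter exists, polling from the shell approximates the requested behaviour. A minimal sketch; the ResourceManager address is a placeholder:
{code}
# Re-fetch the RM applications page every 3 seconds, mimicking MR1's refresh=3.
watch -n 3 curl -s 'http://rm-host:8088/cluster/apps'
{code}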
[jira] [Commented] (YARN-3199) Fair Scheduler documentation improvements
[ https://issues.apache.org/jira/browse/YARN-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342150#comment-14342150 ] Hudson commented on YARN-3199: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #119 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/119/]) YARN-3199. Fair Scheduler documentation improvements (Rohit Agarwal via aw) (aw: rev 8472d729974ea3ccf9fff5ce4f5309aa8e43a49e) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/FairScheduler.md Fair Scheduler documentation improvements - Key: YARN-3199 URL: https://issues.apache.org/jira/browse/YARN-3199 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Affects Versions: 2.6.0 Reporter: Rohit Agarwal Assignee: Rohit Agarwal Priority: Minor Labels: documentation Fix For: 3.0.0 Attachments: YARN-3199-1.patch, YARN-3199.patch {{yarn.scheduler.increment-allocation-mb}} and {{yarn.scheduler.increment-allocation-vcores}} are not documented. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3199) Fair Scheduler documentation improvements
[ https://issues.apache.org/jira/browse/YARN-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342173#comment-14342173 ] Hudson commented on YARN-3199: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #853 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/853/]) YARN-3199. Fair Scheduler documentation improvements (Rohit Agarwal via aw) (aw: rev 8472d729974ea3ccf9fff5ce4f5309aa8e43a49e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/FairScheduler.md * hadoop-yarn-project/CHANGES.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3199) Fair Scheduler documentation improvements
[ https://issues.apache.org/jira/browse/YARN-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342247#comment-14342247 ] Hudson commented on YARN-3199: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #2051 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2051/]) YARN-3199. Fair Scheduler documentation improvements (Rohit Agarwal via aw) (aw: rev 8472d729974ea3ccf9fff5ce4f5309aa8e43a49e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/FairScheduler.md * hadoop-yarn-project/CHANGES.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3199) Fair Scheduler documentation improvements
[ https://issues.apache.org/jira/browse/YARN-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342255#comment-14342255 ] Hudson commented on YARN-3199: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #110 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/110/]) YARN-3199. Fair Scheduler documentation improvements (Rohit Agarwal via aw) (aw: rev 8472d729974ea3ccf9fff5ce4f5309aa8e43a49e) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/FairScheduler.md -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3080) The DockerContainerExecutor could not write the right pid to container pidFile
[ https://issues.apache.org/jira/browse/YARN-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342613#comment-14342613 ] Beckham007 commented on YARN-3080: -- Getting the actual PID from docker inspect is good, but it is too complex. I think the NM should do the same as DefaultContainerExecutor: we solved this the same way (the same as [~chenchun]), and it works well. The DockerContainerExecutor could not write the right pid to container pidFile -- Key: YARN-3080 URL: https://issues.apache.org/jira/browse/YARN-3080 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Beckham007 Assignee: Abin Shahab Attachments: YARN-3080.patch, YARN-3080.patch, YARN-3080.patch, YARN-3080.patch The docker_container_executor_session.sh is like this:
{quote}
#!/usr/bin/env bash
echo `/usr/bin/docker inspect --format {{.State.Pid}} container_1421723685222_0008_01_02` > /data/nm_restart/hadoop-2.4.1/data/yarn/local/nmPrivate/application_1421723685222_0008/container_1421723685222_0008_01_02/container_1421723685222_0008_01_02.pid.tmp
/bin/mv -f /data/nm_restart/hadoop-2.4.1/data/yarn/local/nmPrivate/application_1421723685222_0008/container_1421723685222_0008_01_02/container_1421723685222_0008_01_02.pid.tmp /data/nm_restart/hadoop-2.4.1/data/yarn/local/nmPrivate/application_1421723685222_0008/container_1421723685222_0008_01_02/container_1421723685222_0008_01_02.pid
/usr/bin/docker run --rm --name container_1421723685222_0008_01_02 -e GAIA_HOST_IP=c162 -e GAIA_API_SERVER=10.6.207.226:8080 -e GAIA_CLUSTER_ID=shpc-nm_restart -e GAIA_QUEUE=root.tdwadmin -e GAIA_APP_NAME=test_nm_docker -e GAIA_INSTANCE_ID=1 -e GAIA_CONTAINER_ID=container_1421723685222_0008_01_02 --memory=32M --cpu-shares=1024 -v /data/nm_restart/hadoop-2.4.1/data/yarn/container-logs/application_1421723685222_0008/container_1421723685222_0008_01_02:/data/nm_restart/hadoop-2.4.1/data/yarn/container-logs/application_1421723685222_0008/container_1421723685222_0008_01_02 -v /data/nm_restart/hadoop-2.4.1/data/yarn/local/usercache/tdwadmin/appcache/application_1421723685222_0008/container_1421723685222_0008_01_02:/data/nm_restart/hadoop-2.4.1/data/yarn/local/usercache/tdwadmin/appcache/application_1421723685222_0008/container_1421723685222_0008_01_02 -P -e A=B --privileged=true docker.oa.com:8080/library/centos7 bash /data/nm_restart/hadoop-2.4.1/data/yarn/local/usercache/tdwadmin/appcache/application_1421723685222_0008/container_1421723685222_0008_01_02/launch_container.sh
{quote}
The DockerContainerExecutor runs docker inspect before docker run, so docker inspect cannot get the right pid for the container, and signalContainer() and NM restart would fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
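For illustration, the ordering being discussed: docker inspect can only report the pid once the container exists, so it has to run after docker run, not before it as in the session script above. A sketch, with most docker run arguments elided:
{code}
# Start the container first (detached), then ask the daemon for the host pid.
/usr/bin/docker run -d --name container_1421723685222_0008_01_02 ... launch_container.sh
/usr/bin/docker inspect --format '{{.State.Pid}}' container_1421723685222_0008_01_02
{code}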
[jira] [Commented] (YARN-3080) The DockerContainerExecutor could not write the right pid to container pidFile
[ https://issues.apache.org/jira/browse/YARN-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342744#comment-14342744 ] Abin Shahab commented on YARN-3080: --- [~beckham007] Thanks for your comment. So you're saying that if I send a signal to the pid of the session script (as DefaultContainerExecutor does), it will act on the process that docker is running, and potentially kill it? Please help me clarify my understanding; I ran the following steps. First I create a file similar to the session script. It writes the pid of the session to a pidfile:
{code}
$ cat > bash_session_pid.sh <<'EOF'
#!/bin/bash
echo $$ > /tmp/pidfile
exec setsid bash -c 'docker run -itd ubuntu sleep infinity'
EOF
{code}
I chmod and run this script, which starts a docker container:
{code}
$ chmod a+x bash_session_pid.sh
$ ./bash_session_pid.sh
$ docker ps
1b8ee377e3d2   ubuntu:14.04   sleep infinity   3 minutes ago   Up 3 minutes   cranky_stallman
{code}
Now I cat the pid of the session, and it says the pid is 9281:
{code}
$ cat /tmp/pidfile
9281
{code}
As you've suggested, I send a kill signal to that pid, hoping it would kill the container:
{code}
$ kill -9 9281
{code}
I check whether the docker container was killed:
{code}
$ docker ps
1b8ee377e3d2   ubuntu:14.04   sleep infinity   6 minutes ago   Up 6 minutes   cranky_stallman
{code}
Since your method did not kill the container, I get the pid of the process running under the container:
{code}
$ docker inspect --format {{.State.Pid}} 1b8ee377e3d2
9289
{code}
I check the tree of this process:
{code}
$ pstree -ps 9289
init(1)---docker(6512)---sleep(9289)
{code}
As I had expected, this process is a child of the docker daemon, and therefore, if it is killed, the container will be killed. So I send a kill signal to this pid:
{code}
$ kill -9 9289
{code}
Now I verify whether the container is alive:
{code}
$ docker ps
{code}
The container is dead. From what I understand, the session pid has no relation to the actual pid of the container, so sending it a signal is meaningless. And if that meaningless pid is in the pidfile, the NodeManager/ResourceManager will not be able to send signals to containers as needed. Please let me know where my understanding is mistaken, and I will gladly switch to the simpler implementation.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3080) The DockerContainerExecutor could not write the right pid to container pidFile
[ https://issues.apache.org/jira/browse/YARN-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342760#comment-14342760 ] Beckham007 commented on YARN-3080: -- Sorry, my mistake. The pidfile has two functions: 1. When the NM restarts, it uses the pid to check whether the container has finished. 2. As [~chenchun] said, for signalContainer we can use docker kill --signal=SIGNAL containerId instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
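For illustration, the docker kill alternative mentioned above, using the container name from the session script earlier in the thread:
{code}
# Deliver a signal through the Docker daemon instead of signalling a host pid;
# docker kill sends it to the container's main process.
docker kill --signal=SIGTERM container_1421723685222_0008_01_02
{code}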
[jira] [Commented] (YARN-3080) The DockerContainerExecutor could not write the right pid to container pidFile
[ https://issues.apache.org/jira/browse/YARN-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342828#comment-14342828 ] Abin Shahab commented on YARN-3080: --- [~chengzhendong888], you are implying that the pidfile will never be used for anything else in the future? Container executors are required to provide a correct pidfile; that is the API contract between them and the NodeManager. I don't see why DockerContainerExecutor should violate that contract. Also, how would you derive a containerId from a pid in the NM? The pid that will be sent to signal the container won't even be the correct pid (it will be the session script pid). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3080) The DockerContainerExecutor could not write the right pid to container pidFile
[ https://issues.apache.org/jira/browse/YARN-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342861#comment-14342861 ] Varun Vasudev commented on YARN-3080: - [~ashahab] Sorry for the late response. The kill functionality uses the kill -9 -$PID form, which sends the kill signal to the process as well as all of its children, grandchildren, etc. There's a more detailed explanation here: http://stackoverflow.com/a/15139734. The code that generates the kill command is in Shell.java - look for getSignalKillCommand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
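A quick illustration of the process-group form Varun describes, reusing pid 9281 from the experiment above; whether the docker client is still in that group after exec setsid is exactly the question under discussion:
{code}
# A negative pid targets the whole process group led by that pid, so every
# process still in the group (children, grandchildren, ...) receives the signal.
kill -9 -- -9281
{code}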
[jira] [Updated] (YARN-3265) CapacityScheduler deadlock when computing absolute max avail capacity (fix for trunk/branch-2)
[ https://issues.apache.org/jira/browse/YARN-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3265: - Attachment: YARN-3265.6.patch Attached a new patch that addresses the findbugs warnings; the test failure is not related. CapacityScheduler deadlock when computing absolute max avail capacity (fix for trunk/branch-2) -- Key: YARN-3265 URL: https://issues.apache.org/jira/browse/YARN-3265 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Priority: Blocker Attachments: YARN-3265.1.patch, YARN-3265.2.patch, YARN-3265.3.patch, YARN-3265.5.patch, YARN-3265.6.patch This patch is trying to solve the same problem described in YARN-3251, but as a longer-term fix for trunk and branch-2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3080) The DockerContainerExecutor could not write the right pid to container pidFile
[ https://issues.apache.org/jira/browse/YARN-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342876#comment-14342876 ] Abin Shahab commented on YARN-3080: --- I agree with that, Varun. However, I'm not sure the process launched under docker is a child of the session script (see my example above). I could be wrong, though. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3249) Add the kill application to the Resource Manager Web UI
[ https://issues.apache.org/jira/browse/YARN-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryu Kobayashi updated YARN-3249: Attachment: YARN-3249.4.patch Add the kill application to the Resource Manager Web UI --- Key: YARN-3249 URL: https://issues.apache.org/jira/browse/YARN-3249 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0, 2.7.0 Reporter: Ryu Kobayashi Assignee: Ryu Kobayashi Priority: Minor Attachments: YARN-3249.2.patch, YARN-3249.2.patch, YARN-3249.3.patch, YARN-3249.4.patch, YARN-3249.patch, killapp-failed.log, killapp-failed2.log, screenshot.png, screenshot2.png We want to be able to kill an application from the Web UI, just as the JobTracker allowed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
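For context, an application can already be killed through the RM REST API (the Cluster Application State API), which a Web UI kill button could presumably call; the host, port, and application id below are placeholders:
{code}
# Ask the ResourceManager to transition the application to the KILLED state.
curl -X PUT -H 'Content-Type: application/json' \
     -d '{"state":"KILLED"}' \
     'http://rm-host:8088/ws/v1/cluster/apps/application_1425000000000_0001/state'
{code}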
[jira] [Commented] (YARN-3249) Add the kill application to the Resource Manager Web UI
[ https://issues.apache.org/jira/browse/YARN-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342880#comment-14342880 ] Ryu Kobayashi commented on YARN-3249: - The new patch includes the above update. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3249) Add the kill application to the Resource Manager Web UI
[ https://issues.apache.org/jira/browse/YARN-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryu Kobayashi updated YARN-3249: Attachment: screenshot2.png -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3199) Fair Scheduler documentation improvements
[ https://issues.apache.org/jira/browse/YARN-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342287#comment-14342287 ] Hudson commented on YARN-3199: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2069 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2069/]) YARN-3199. Fair Scheduler documentation improvements (Rohit Agarwal via aw) (aw: rev 8472d729974ea3ccf9fff5ce4f5309aa8e43a49e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/FairScheduler.md * hadoop-yarn-project/CHANGES.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3199) Fair Scheduler documentation improvements
[ https://issues.apache.org/jira/browse/YARN-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342274#comment-14342274 ] Hudson commented on YARN-3199: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #119 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/119/]) YARN-3199. Fair Scheduler documentation improvements (Rohit Agarwal via aw) (aw: rev 8472d729974ea3ccf9fff5ce4f5309aa8e43a49e) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/FairScheduler.md -- This message was sent by Atlassian JIRA (v6.3.4#6332)