[jira] [Created] (YARN-2845) MicroZookeeperService used in Yarn Registry tests doesn't shut down cleanly on windows
Steve Loughran created YARN-2845: Summary: MicroZookeeperService used in Yarn Registry tests doesn't shut down cleanly on windows Key: YARN-2845 URL: https://issues.apache.org/jira/browse/YARN-2845 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Environment: Windows Reporter: Steve Loughran Assignee: Steve Loughran Priority: Minor Fix For: 2.7.0 It's not surfacing in YARN's own tests, but we are seeing this in Slider's Windows testing ... two test methods, each setting up its own ZK micro cluster, see the previous test's data. The class needs the same cleanup logic as HBASE-6820, as perhaps does its origin, Twill's mini ZK cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
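For context, a minimal sketch of the cleanup discipline being asked for here: stop the ZK instance before tearing down, then delete its data directory so the next test method starts from empty state. The MiniZKService interface below is a hypothetical stand-in for the registry's MicroZookeeperService, purely for illustration.
{code}
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical stand-in for the micro ZK service; only the lifecycle matters here.
interface MiniZKService {
  void start() throws IOException;
  void stop() throws IOException;
}

public class MicroZKCleanupSketch {
  /**
   * Run one test's worth of work against a freshly started ZK instance and
   * guarantee the instance is stopped and its data directory removed, even on
   * Windows where open file handles block directory deletion.
   */
  static void runIsolated(MiniZKService zk, Path dataDir, Runnable testBody)
      throws IOException {
    zk.start();
    try {
      testBody.run();
    } finally {
      zk.stop();                  // release file handles before deleting
      deleteRecursively(dataDir); // fresh directory for the next test method
    }
  }

  static void deleteRecursively(Path dir) throws IOException {
    if (!Files.exists(dir)) {
      return;
    }
    try (Stream<Path> paths = Files.walk(dir)) {
      paths.sorted(Comparator.reverseOrder())
           .map(Path::toFile)
           .forEach(File::delete);
    }
  }
}
{code}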
[jira] [Commented] (YARN-2841) RMProxy should retry EOFException
[ https://issues.apache.org/jira/browse/YARN-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206336#comment-14206336 ] Hudson commented on YARN-2841: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #2 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/2/]) YARN-2841. RMProxy should retry EOFException. Contributed by Jian He (xgong: rev 5c9a51f140ba76ddb25580aeb288db25e3f9653f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java YARN-2841: Correct fix version from branch-2.6 to branch-2.7 in the (xgong: rev 58e9bf4b908e0b21309006eba49899b092f38071) * hadoop-yarn-project/CHANGES.txt RMProxy should retry EOFException -- Key: YARN-2841 URL: https://issues.apache.org/jira/browse/YARN-2841 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jian He Assignee: Jian He Priority: Critical Fix For: 2.7.0 Attachments: YARN-2841.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2841) RMProxy should retry EOFException
[ https://issues.apache.org/jira/browse/YARN-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206347#comment-14206347 ] Hudson commented on YARN-2841: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #740 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/740/]) YARN-2841. RMProxy should retry EOFException. Contributed by Jian He (xgong: rev 5c9a51f140ba76ddb25580aeb288db25e3f9653f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java YARN-2841: Correct fix version from branch-2.6 to branch-2.7 in the (xgong: rev 58e9bf4b908e0b21309006eba49899b092f38071) * hadoop-yarn-project/CHANGES.txt RMProxy should retry EOFException -- Key: YARN-2841 URL: https://issues.apache.org/jira/browse/YARN-2841 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jian He Assignee: Jian He Priority: Critical Fix For: 2.7.0 Attachments: YARN-2841.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2841) RMProxy should retry EOFException
[ https://issues.apache.org/jira/browse/YARN-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206459#comment-14206459 ] Hudson commented on YARN-2841: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1930 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1930/]) YARN-2841. RMProxy should retry EOFException. Contributed by Jian He (xgong: rev 5c9a51f140ba76ddb25580aeb288db25e3f9653f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java YARN-2841: Correct fix version from branch-2.6 to branch-2.7 in the (xgong: rev 58e9bf4b908e0b21309006eba49899b092f38071) * hadoop-yarn-project/CHANGES.txt RMProxy should retry EOFException -- Key: YARN-2841 URL: https://issues.apache.org/jira/browse/YARN-2841 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jian He Assignee: Jian He Priority: Critical Fix For: 2.7.0 Attachments: YARN-2841.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2780) Log aggregated resource allocation in rm-appsummary.log
[ https://issues.apache.org/jira/browse/YARN-2780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206532#comment-14206532 ] Jason Lowe commented on YARN-2780: -- +1 lgtm. Will commit this later today if there are no objections. Log aggregated resource allocation in rm-appsummary.log --- Key: YARN-2780 URL: https://issues.apache.org/jira/browse/YARN-2780 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.5.1 Reporter: Koji Noguchi Assignee: Eric Payne Priority: Minor Attachments: YARN-2780.v1.201411031728.txt, YARN-2780.v2.201411061601.txt YARN-415 added useful information about resource usage by applications. Asking to log that info inside rm-appsummary.log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2841) RMProxy should retry EOFException
[ https://issues.apache.org/jira/browse/YARN-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206536#comment-14206536 ] Hudson commented on YARN-2841: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1954 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1954/]) YARN-2841. RMProxy should retry EOFException. Contributed by Jian He (xgong: rev 5c9a51f140ba76ddb25580aeb288db25e3f9653f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java * hadoop-yarn-project/CHANGES.txt YARN-2841: Correct fix version from branch-2.6 to branch-2.7 in the (xgong: rev 58e9bf4b908e0b21309006eba49899b092f38071) * hadoop-yarn-project/CHANGES.txt RMProxy should retry EOFException -- Key: YARN-2841 URL: https://issues.apache.org/jira/browse/YARN-2841 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jian He Assignee: Jian He Priority: Critical Fix For: 2.7.0 Attachments: YARN-2841.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2846) Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart.
Junping Du created YARN-2846: Summary: Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart. Key: YARN-2846 URL: https://issues.apache.org/jira/browse/YARN-2846 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du Priority: Blocker The NM restart work preserving feature could make running AM container get LOST and killed during stop NM daemon. The exception is like below: {code} 2014-11-11 00:48:35,214 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 22140 for container-id container_1415666714233_0001_01_84: 53.8 MB of 512 MB physical memory used; 931.3 MB of 1.0 GB virtual memory used 2014-11-11 00:48:35,223 ERROR nodemanager.NodeManager (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM 2014-11-11 00:48:35,299 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:50060 2014-11-11 00:48:35,337 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - Applications still running : [application_1415666714233_0001] 2014-11-11 00:48:35,338 INFO ipc.Server (Server.java:stop(2437)) - Stopping server on 45454 2014-11-11 00:48:35,344 INFO ipc.Server (Server.java:run(706)) - Stopping IPC Server listener on 45454 2014-11-11 00:48:35,346 INFO logaggregation.LogAggregationService (LogAggregationService.java:serviceStop(141)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit 2014-11-11 00:48:35,347 INFO ipc.Server (Server.java:run(832)) - Stopping IPC Server Responder 2014-11-11 00:48:35,347 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:abortLogAggregation(502)) - Aborting log aggregation for application_1415666714233_0001 2014-11-11 00:48:35,348 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:run(382)) - Aggregation did not complete for application application_1415666714233_0001 2014-11-11 00:48:35,358 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(476)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting. 2014-11-11 00:48:35,406 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(87)) - Unable to recover container container_1415666714233_0001_01_01 java.io.IOException: Interrupted while waiting for process 20001 to exit at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:180) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:82) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.InterruptedException: sleep interrupted at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:177) ... 
6 more {code} In reacquireContainer() of ContainerExecutor.java, the while loop that checks on the container process (the AM container) is interrupted by the NM stop. An IOException is thrown, so no ExitCodeFile is generated for the still-running container. The IOException is then caught in the upper call (RecoveredContainerLaunch.call()) and the exit code (which defaults to LOST when never set) gets persisted in the NMStateStore. After the NM restarts, this container is recovered in COMPLETE state but with exit code LOST (154), which causes this (AM) container to be killed later. We should avoid recording an exit code for running containers when we detect that the process check was interrupted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
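To make the failure mode concrete, here is a simplified, hypothetical sketch of the recovery path described above (not the actual ContainerExecutor or RecoveredContainerLaunch code). The point is that the interrupt arriving during NM shutdown means "stop waiting", and must not be turned into a persisted LOST exit code for a container that is still running.
{code}
import java.io.IOException;

// Simplified sketch; names and signatures are illustrative only.
public class ContainerReacquireSketch {

  static final int LOST = 154; // default used when no exit code was recorded

  /** Waits for an already-running container process to exit. */
  int reacquireContainer(String pid) throws IOException, InterruptedException {
    while (isProcessAlive(pid)) {
      // Let InterruptedException propagate: an NM shutdown interrupt means
      // "stop waiting", not "the container finished".
      Thread.sleep(1000);
    }
    return readExitCodeFile(pid);
  }

  /** Recovery task: only persist an exit code when the process really ended. */
  Integer recover(String pid) throws IOException {
    try {
      int exitCode = reacquireContainer(pid);
      persistExitCode(pid, exitCode);  // genuine completion
      return exitCode;
    } catch (InterruptedException e) {
      // NM is stopping; the container keeps running, so record nothing and
      // let the next NM instance re-acquire it.
      Thread.currentThread().interrupt();
      return null;
    }
  }

  // --- stubs so the sketch is self-contained ---
  boolean isProcessAlive(String pid) { return false; }
  int readExitCodeFile(String pid) { return 0; }
  void persistExitCode(String pid, int code) { }
}
{code}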
[jira] [Assigned] (YARN-2846) Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart.
[ https://issues.apache.org/jira/browse/YARN-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du reassigned YARN-2846: Assignee: Junping Du Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart. --- Key: YARN-2846 URL: https://issues.apache.org/jira/browse/YARN-2846 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du Assignee: Junping Du Priority: Blocker The NM restart work preserving feature could make running AM container get LOST and killed during stop NM daemon. The exception is like below: {code} 2014-11-11 00:48:35,214 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 22140 for container-id container_1415666714233_0001_01_84: 53.8 MB of 512 MB physical memory used; 931.3 MB of 1.0 GB virtual memory used 2014-11-11 00:48:35,223 ERROR nodemanager.NodeManager (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM 2014-11-11 00:48:35,299 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:50060 2014-11-11 00:48:35,337 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - Applications still running : [application_1415666714233_0001] 2014-11-11 00:48:35,338 INFO ipc.Server (Server.java:stop(2437)) - Stopping server on 45454 2014-11-11 00:48:35,344 INFO ipc.Server (Server.java:run(706)) - Stopping IPC Server listener on 45454 2014-11-11 00:48:35,346 INFO logaggregation.LogAggregationService (LogAggregationService.java:serviceStop(141)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit 2014-11-11 00:48:35,347 INFO ipc.Server (Server.java:run(832)) - Stopping IPC Server Responder 2014-11-11 00:48:35,347 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:abortLogAggregation(502)) - Aborting log aggregation for application_1415666714233_0001 2014-11-11 00:48:35,348 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:run(382)) - Aggregation did not complete for application application_1415666714233_0001 2014-11-11 00:48:35,358 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(476)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting. 
2014-11-11 00:48:35,406 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(87)) - Unable to recover container container_1415666714233_0001_01_01 java.io.IOException: Interrupted while waiting for process 20001 to exit at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:180) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:82) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.InterruptedException: sleep interrupted at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:177) ... 6 more {code} In reacquireContainer() of ContainerExecutor.java, the while loop of checking container process (AM container) will be interrupted by NM stop. The IOException get thrown and failed to generate an ExitCodeFile for the running container. Later, the IOException will be caught in upper call (RecoveredContainerLaunch.call()) and the ExitCode (by default to be LOST without any setting) get persistent in NMStateStore. After NM restart again, this container is recovered as COMPLETE state but exit code is LOST (154) - cause this (AM) container get killed later. We should get rid of recording the exit code of running containers if detecting process is interrupted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2846) Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart.
[ https://issues.apache.org/jira/browse/YARN-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-2846: - Attachment: YARN-2846-demo.patch Upload the first demo patch to fix the problem. Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart. --- Key: YARN-2846 URL: https://issues.apache.org/jira/browse/YARN-2846 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du Assignee: Junping Du Priority: Blocker Attachments: YARN-2846-demo.patch The NM restart work preserving feature could make running AM container get LOST and killed during stop NM daemon. The exception is like below: {code} 2014-11-11 00:48:35,214 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 22140 for container-id container_1415666714233_0001_01_84: 53.8 MB of 512 MB physical memory used; 931.3 MB of 1.0 GB virtual memory used 2014-11-11 00:48:35,223 ERROR nodemanager.NodeManager (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM 2014-11-11 00:48:35,299 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:50060 2014-11-11 00:48:35,337 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - Applications still running : [application_1415666714233_0001] 2014-11-11 00:48:35,338 INFO ipc.Server (Server.java:stop(2437)) - Stopping server on 45454 2014-11-11 00:48:35,344 INFO ipc.Server (Server.java:run(706)) - Stopping IPC Server listener on 45454 2014-11-11 00:48:35,346 INFO logaggregation.LogAggregationService (LogAggregationService.java:serviceStop(141)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit 2014-11-11 00:48:35,347 INFO ipc.Server (Server.java:run(832)) - Stopping IPC Server Responder 2014-11-11 00:48:35,347 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:abortLogAggregation(502)) - Aborting log aggregation for application_1415666714233_0001 2014-11-11 00:48:35,348 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:run(382)) - Aggregation did not complete for application application_1415666714233_0001 2014-11-11 00:48:35,358 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(476)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting. 
2014-11-11 00:48:35,406 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(87)) - Unable to recover container container_1415666714233_0001_01_01 java.io.IOException: Interrupted while waiting for process 20001 to exit at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:180) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:82) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.InterruptedException: sleep interrupted at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:177) ... 6 more {code} In reacquireContainer() of ContainerExecutor.java, the while loop of checking container process (AM container) will be interrupted by NM stop. The IOException get thrown and failed to generate an ExitCodeFile for the running container. Later, the IOException will be caught in upper call (RecoveredContainerLaunch.call()) and the ExitCode (by default to be LOST without any setting) get persistent in NMStateStore. After NM restart again, this container is recovered as COMPLETE state but exit code is LOST (154) - cause this (AM) container get killed later. We should get rid of recording the exit code of running containers if detecting process is interrupted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2846) Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart.
[ https://issues.apache.org/jira/browse/YARN-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206601#comment-14206601 ] Hadoop QA commented on YARN-2846: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12680801/YARN-2846-demo.patch against trunk revision 58e9bf4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5814//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5814//console This message is automatically generated. Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart. --- Key: YARN-2846 URL: https://issues.apache.org/jira/browse/YARN-2846 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du Assignee: Junping Du Priority: Blocker Attachments: YARN-2846-demo.patch The NM restart work preserving feature could make running AM container get LOST and killed during stop NM daemon. 
The exception is like below: {code} 2014-11-11 00:48:35,214 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 22140 for container-id container_1415666714233_0001_01_84: 53.8 MB of 512 MB physical memory used; 931.3 MB of 1.0 GB virtual memory used 2014-11-11 00:48:35,223 ERROR nodemanager.NodeManager (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM 2014-11-11 00:48:35,299 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:50060 2014-11-11 00:48:35,337 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - Applications still running : [application_1415666714233_0001] 2014-11-11 00:48:35,338 INFO ipc.Server (Server.java:stop(2437)) - Stopping server on 45454 2014-11-11 00:48:35,344 INFO ipc.Server (Server.java:run(706)) - Stopping IPC Server listener on 45454 2014-11-11 00:48:35,346 INFO logaggregation.LogAggregationService (LogAggregationService.java:serviceStop(141)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit 2014-11-11 00:48:35,347 INFO ipc.Server (Server.java:run(832)) - Stopping IPC Server Responder 2014-11-11 00:48:35,347 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:abortLogAggregation(502)) - Aborting log aggregation for application_1415666714233_0001 2014-11-11 00:48:35,348 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:run(382)) - Aggregation did not complete for application application_1415666714233_0001 2014-11-11 00:48:35,358 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(476)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting. 2014-11-11 00:48:35,406 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(87)) - Unable to recover container container_1415666714233_0001_01_01 java.io.IOException: Interrupted while waiting for process 20001 to exit at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:180) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:82) at
[jira] [Commented] (YARN-2846) Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart.
[ https://issues.apache.org/jira/browse/YARN-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206616#comment-14206616 ] Jason Lowe commented on YARN-2846: -- Thanks for the report and patch, Junping! Nit: If reacquireContainer is going to allow InterruptedException to be thrown then I'd rather remove the try/catch around the Thread.sleep call and just let the exception be thrown directly from there. We can let the code catching the exception deal with any logging/etc as appropriate for that caller. In this case we can move the log message to RecoveredContainerLaunch when it fields the InterruptedException and chooses not to propagate it upwards. I'm curious why we're not seeing a similar issue with regular ContainerLaunch threads, as they should be interrupted as well. Are those threads silently swallowing the interrupt? Because otherwise I would expect us to log a container completion just like we were doing with a recovered container. Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart. --- Key: YARN-2846 URL: https://issues.apache.org/jira/browse/YARN-2846 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du Assignee: Junping Du Priority: Blocker Attachments: YARN-2846-demo.patch The NM restart work preserving feature could make running AM container get LOST and killed during stop NM daemon. The exception is like below: {code} 2014-11-11 00:48:35,214 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 22140 for container-id container_1415666714233_0001_01_84: 53.8 MB of 512 MB physical memory used; 931.3 MB of 1.0 GB virtual memory used 2014-11-11 00:48:35,223 ERROR nodemanager.NodeManager (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM 2014-11-11 00:48:35,299 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:50060 2014-11-11 00:48:35,337 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - Applications still running : [application_1415666714233_0001] 2014-11-11 00:48:35,338 INFO ipc.Server (Server.java:stop(2437)) - Stopping server on 45454 2014-11-11 00:48:35,344 INFO ipc.Server (Server.java:run(706)) - Stopping IPC Server listener on 45454 2014-11-11 00:48:35,346 INFO logaggregation.LogAggregationService (LogAggregationService.java:serviceStop(141)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit 2014-11-11 00:48:35,347 INFO ipc.Server (Server.java:run(832)) - Stopping IPC Server Responder 2014-11-11 00:48:35,347 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:abortLogAggregation(502)) - Aborting log aggregation for application_1415666714233_0001 2014-11-11 00:48:35,348 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:run(382)) - Aggregation did not complete for application application_1415666714233_0001 2014-11-11 00:48:35,358 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(476)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting. 
2014-11-11 00:48:35,406 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(87)) - Unable to recover container container_1415666714233_0001_01_01 java.io.IOException: Interrupted while waiting for process 20001 to exit at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:180) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:82) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.InterruptedException: sleep interrupted at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:177) ... 6 more {code} In reacquireContainer() of ContainerExecutor.java, the while loop of checking
[jira] [Commented] (YARN-2495) Allow admin specify labels from each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206626#comment-14206626 ] Naganarasimha G R commented on YARN-2495: - {quote} The benefits are 1) You don't have to update test cases for that 2) The semantics are clear, create a register request with label or not. {quote} True, and this will let me revert some unwanted testcase modifications. Have corrected it. bq. I suggest to have different option for script-based/config-based, even if we can combine them together. Ok, will have different config params for script-based and config-based. bq. IIUC, NM_NODE_LABELS_FROM_CONFIG is a list of labels, even if we want to separate the two properties, we cannot remove NM_NODE_LABELS_FROM_CONFIG, correct? I had searched for it wrongly and, as you mentioned, the name was not good enough for me to recall it either. Corrected it. bq. I think it's better to leverage existing utility class instead of implement your own. For example, you have set values but not check them, which is incorrect, but using utility class can avoid such problem. Even if you added new fields, tests will cover them without any changes: The problem is that ??TestPBImplRecords?? is in the ??hadoop-yarn-common?? project while ??NodeHeartbeatRequestPBImpl?? and the others are in the ??hadoop-yarn-server-common?? project. Since we can't add a dependency on ??hadoop-yarn-server-common?? in ??hadoop-yarn-common??, shall I create a new class extending TestPBImplRecords in the ??hadoop-yarn-server-common?? project? Allow admin specify labels from each NM (Distributed configuration) --- Key: YARN-2495 URL: https://issues.apache.org/jira/browse/YARN-2495 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Attachments: YARN-2495.20141023-1.patch, YARN-2495.20141024-1.patch, YARN-2495.20141030-1.patch, YARN-2495.20141031-1.patch, YARN-2495_20141022.1.patch Target of this JIRA is to allow admin specify labels in each NM, this covers - User can set labels in each NM (by setting yarn-site.xml or using script suggested by [~aw]) - NM will send labels to RM via ResourceTracker API - RM will set labels in NodeLabelManager when NM register/update labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
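To illustrate the "separate options for script-based and config-based" point being agreed on above, here is a sketch of what the two sets of knobs might look like from an NM's point of view. The property names are hypothetical placeholders, since the actual keys were still under discussion in this thread.
{code}
import org.apache.hadoop.conf.Configuration;

public class NodeLabelProviderConfigSketch {
  // Hypothetical keys, for illustration only.
  static final String NM_LABELS_PROVIDER =
      "yarn.nodemanager.node-labels.provider";
  static final String NM_LABELS_FROM_CONFIG =
      "yarn.nodemanager.node-labels.provider.configured-node-labels";
  static final String NM_LABELS_SCRIPT_PATH =
      "yarn.nodemanager.node-labels.provider.script.path";

  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Config-based provider: labels listed directly in yarn-site.xml.
    conf.set(NM_LABELS_PROVIDER, "config");
    conf.set(NM_LABELS_FROM_CONFIG, "GPU,LARGE_MEM");

    // Script-based provider would instead point at an executable:
    // conf.set(NM_LABELS_PROVIDER, "script");
    // conf.set(NM_LABELS_SCRIPT_PATH, "/etc/hadoop/node-labels.sh");

    System.out.println(conf.get(NM_LABELS_FROM_CONFIG));
  }
}
{code}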
[jira] [Commented] (YARN-2838) Issues with TimeLineServer (Application History)
[ https://issues.apache.org/jira/browse/YARN-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206634#comment-14206634 ] Naganarasimha G R commented on YARN-2838: - Hi [~zjshen], Can you please give feedback on these issues? Some of them require discussion before rectification... Issues with TimeLineServer (Application History) Key: YARN-2838 URL: https://issues.apache.org/jira/browse/YARN-2838 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.5.1 Reporter: Naganarasimha G R Assignee: Naganarasimha G R Attachments: IssuesInTimelineServer.pdf Few issues in usage of Timeline server for generic application history access -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abin Shahab updated YARN-1964: -- Attachment: YARN-1964.patch fixed imports. Create Docker analog of the LinuxContainerExecutor in YARN -- Key: YARN-1964 URL: https://issues.apache.org/jira/browse/YARN-1964 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.2.0 Reporter: Arun C Murthy Assignee: Abin Shahab Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch Docker (https://www.docker.io/) is, increasingly, a very popular container technology. In context of YARN, the support for Docker will provide a very elegant solution to allow applications to *package* their software into a Docker container (entire Linux file system incl. custom versions of perl, python etc.) and use it as a blueprint to launch all their YARN containers with requisite software environment. This provides both consistency (all YARN containers will have the same software environment) and isolation (no interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-1680: -- Target Version/s: 2.7.0 (was: 2.6.0) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory. -- Key: YARN-1680 URL: https://issues.apache.org/jira/browse/YARN-1680 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0, 2.3.0 Environment: SuSE 11 SP2 + Hadoop-2.3 Reporter: Rohith Assignee: Craig Welch Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster slow start is set to 1. A job is running; its reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) became unstable (3 map tasks got killed), so the MRAppMaster blacklisted the unstable NodeManager (NM-4). All reducer tasks are now running in the cluster. The MRAppMaster does not preempt the reducers because, for the reducer preemption calculation, the headroom still includes the blacklisted node's memory. This makes jobs hang forever (the ResourceManager does not assign any new containers on blacklisted nodes, but the availableResources it returns still reflects the whole cluster's free memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2735) diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection
[ https://issues.apache.org/jira/browse/YARN-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2735: --- Priority: Trivial (was: Minor) Labels: newbie (was: ) diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection --- Key: YARN-2735 URL: https://issues.apache.org/jira/browse/YARN-2735 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Labels: newbie Attachments: YARN-2735.000.patch diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection -- This message was sent by Atlassian JIRA (v6.3.4#6332)
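For readers unfamiliar with the issue, this is the general shape of the problem: the same fields are assigned twice along one construction path, so one assignment is dead code. The snippet below is a hypothetical, simplified illustration of that pattern, not the actual DirectoryCollection source.
{code}
public class DirectoryCollectionSketch {
  private float diskUtilizationPercentageCutoff;
  private long diskUtilizationSpaceCutoff;

  public DirectoryCollectionSketch(float percentageCutoff, long spaceCutoff) {
    // First (redundant) initialization...
    this.diskUtilizationPercentageCutoff = percentageCutoff;
    this.diskUtilizationSpaceCutoff = spaceCutoff;
    // ...followed by a second assignment of the same fields (e.g. with
    // bounds checking); only one of the two assignments should remain.
    this.diskUtilizationPercentageCutoff =
        Math.max(0.0F, Math.min(100.0F, percentageCutoff));
    this.diskUtilizationSpaceCutoff = Math.max(0L, spaceCutoff);
  }
}
{code}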
[jira] [Commented] (YARN-2735) diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection
[ https://issues.apache.org/jira/browse/YARN-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206765#comment-14206765 ] Karthik Kambatla commented on YARN-2735: Trivial patch. +1. Checking this in. diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection --- Key: YARN-2735 URL: https://issues.apache.org/jira/browse/YARN-2735 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Labels: newbie Attachments: YARN-2735.000.patch diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206768#comment-14206768 ] Hadoop QA commented on YARN-1964: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12680823/YARN-1964.patch against trunk revision 58e9bf4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5815//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5815//console This message is automatically generated. Create Docker analog of the LinuxContainerExecutor in YARN -- Key: YARN-1964 URL: https://issues.apache.org/jira/browse/YARN-1964 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.2.0 Reporter: Arun C Murthy Assignee: Abin Shahab Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch Docker (https://www.docker.io/) is, increasingly, a very popular container technology. In context of YARN, the support for Docker will provide a very elegant solution to allow applications to *package* their software into a Docker container (entire Linux file system incl. custom versions of perl, python etc.) and use it as a blueprint to launch all their YARN containers with requisite software environment. This provides both consistency (all YARN containers will have the same software environment) and isolation (no interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2735) diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection
[ https://issues.apache.org/jira/browse/YARN-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206788#comment-14206788 ] Hudson commented on YARN-2735: -- FAILURE: Integrated in Hadoop-trunk-Commit #6510 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6510/]) YARN-2735. diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection. (Zhihai Xu via kasha) (kasha: rev 061bc293c8dd3e2605cf150568988bde18407af6) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection --- Key: YARN-2735 URL: https://issues.apache.org/jira/browse/YARN-2735 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Labels: newbie Fix For: 2.7.0 Attachments: YARN-2735.000.patch diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2847) Linux native container executor segfaults if default banned user detected
Jason Lowe created YARN-2847: Summary: Linux native container executor segfaults if default banned user detected Key: YARN-2847 URL: https://issues.apache.org/jira/browse/YARN-2847 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe The check_user function in container-executor.c can cause a segmentation fault if banned.users is not provided but the user is detected as one of the default users. In that scenario it will call free_values on a NULL pointer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2847) Linux native container executor segfaults if default banned user detected
[ https://issues.apache.org/jira/browse/YARN-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206801#comment-14206801 ] Jason Lowe commented on YARN-2847: -- The problem is in this code: {code}
char **banned_users = get_values(BANNED_USERS_KEY);
char **banned_user = (banned_users == NULL) ?
    (char**) DEFAULT_BANNED_USERS : banned_users;
for(; *banned_user; ++banned_user) {
  if (strcmp(*banned_user, user) == 0) {
    free(user_info);
    if (banned_users != (char**)DEFAULT_BANNED_USERS) {
      free_values(banned_users);
    }
    fprintf(LOGFILE, "Requested user %s is banned\n", user);
    return NULL;
  }
}
if (banned_users != NULL && banned_users != (char**)DEFAULT_BANNED_USERS) {
  free_values(banned_users);
}
{code} Note that in one case we check for banned_users != NULL and != DEFAULT_BANNED_USERS but in another case we're missing the NULL check. Lots of ways to fix it:
- free_values could check for NULL
- banned_users could always be non-NULL (i.e.: set it to DEFAULT_BANNED_USERS if get_values returns NULL)
- add a check for != NULL before calling free_values
Linux native container executor segfaults if default banned user detected - Key: YARN-2847 URL: https://issues.apache.org/jira/browse/YARN-2847 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe The check_user function in container-executor.c can cause a segmentation fault if banned.users is not provided but the user is detected as one of the default users. In that scenario it will call free_values on a NULL pointer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206855#comment-14206855 ] Ravi Prakash commented on YARN-1964: Thanks Abin! The patch is looking really good now. However the documentation doesn't seem to be compiling for me. Once that is figured out, I'm a +1. I am looking to commit it EOD today to trunk, branch-2, branch-2.6. I'd like to commit it to 2.6 also and request a respin of the RC. Create Docker analog of the LinuxContainerExecutor in YARN -- Key: YARN-1964 URL: https://issues.apache.org/jira/browse/YARN-1964 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.2.0 Reporter: Arun C Murthy Assignee: Abin Shahab Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch Docker (https://www.docker.io/) is, increasingly, a very popular container technology. In context of YARN, the support for Docker will provide a very elegant solution to allow applications to *package* their software into a Docker container (entire Linux file system incl. custom versions of perl, python etc.) and use it as a blueprint to launch all their YARN containers with requisite software environment. This provides both consistency (all YARN containers will have the same software environment) and isolation (no interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2817) Disk drive as a resource in YARN
[ https://issues.apache.org/jira/browse/YARN-2817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla resolved YARN-2817. Resolution: Duplicate Disk drive as a resource in YARN Key: YARN-2817 URL: https://issues.apache.org/jira/browse/YARN-2817 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Reporter: Arun C Murthy Assignee: Arun C Murthy As YARN continues to cover new ground in terms of new workloads, disk is becoming a very important resource to govern. It might be prudent to start with something very simple - allow applications to request entire drives (e.g. 2 drives out of the 12 available on a node), we can then also add support for specific iops, bandwidth etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2811) Fair Scheduler is violating max memory settings in 2.4
[ https://issues.apache.org/jira/browse/YARN-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated YARN-2811: -- Attachment: YARN-2811.v5.patch Fair Scheduler is violating max memory settings in 2.4 -- Key: YARN-2811 URL: https://issues.apache.org/jira/browse/YARN-2811 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Siqi Li Assignee: Siqi Li Attachments: YARN-2811.v1.patch, YARN-2811.v2.patch, YARN-2811.v3.patch, YARN-2811.v4.patch, YARN-2811.v5.patch This has been seen on several queues showing the allocated MB going significantly above the max MB and it appears to have started with the 2.4 upgrade. It could be a regression bug from 2.0 to 2.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2139) Add support for disk IO isolation/scheduling for containers
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2139: --- Assignee: (was: Wei Yan) Add support for disk IO isolation/scheduling for containers --- Key: YARN-2139 URL: https://issues.apache.org/jira/browse/YARN-2139 Project: Hadoop YARN Issue Type: New Feature Reporter: Wei Yan Attachments: Disk_IO_Scheduling_Design_1.pdf, Disk_IO_Scheduling_Design_2.pdf, YARN-2139-prototype-2.patch, YARN-2139-prototype.patch YARN should support considering disk for scheduling tasks on nodes, and provide isolation for these allocations at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2139) [Umbrella] Support for Disk as a Resource in YARN
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2139: --- Summary: [Umbrella] Support for Disk as a Resource in YARN (was: Add support for disk IO isolation/scheduling for containers) [Umbrella] Support for Disk as a Resource in YARN -- Key: YARN-2139 URL: https://issues.apache.org/jira/browse/YARN-2139 Project: Hadoop YARN Issue Type: New Feature Reporter: Wei Yan Attachments: Disk_IO_Scheduling_Design_1.pdf, Disk_IO_Scheduling_Design_2.pdf, YARN-2139-prototype-2.patch, YARN-2139-prototype.patch YARN should support considering disk for scheduling tasks on nodes, and provide isolation for these allocations at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2139) [Umbrella] Support for Disk as a Resource in YARN
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2139: --- Description: YARN should consider disk as another resource for (1) scheduling tasks on nodes, (2) isolation at runtime, (3) spindle locality. (was: YARN should support considering disk for scheduling tasks on nodes, and provide isolation for these allocations at runtime.) [Umbrella] Support for Disk as a Resource in YARN -- Key: YARN-2139 URL: https://issues.apache.org/jira/browse/YARN-2139 Project: Hadoop YARN Issue Type: New Feature Reporter: Wei Yan Attachments: Disk_IO_Scheduling_Design_1.pdf, Disk_IO_Scheduling_Design_2.pdf, YARN-2139-prototype-2.patch, YARN-2139-prototype.patch YARN should consider disk as another resource for (1) scheduling tasks on nodes, (2) isolation at runtime, (3) spindle locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206993#comment-14206993 ] Karthik Kambatla commented on YARN-2791: Thanks [~sdaingade] for sharing the design doc. Well articulated. The designs on YARN-2139 and YARN-2791 are very similar, except that the disk resources are called vdisks in YARN-2139 and spindles in YARN-2791. In addition to the items specified here, YARN-2139 talks about isolation as well. Other than that, do you see any major items YARN-2791 covers that YARN-2139 doesn't? The WebUI is good and very desirable; we should definitely include it. Also, I suggest we make this (as is - or split into multiple JIRAs) a sub-task of YARN-2139. Discussing the high-level details on one JIRA helps with aligning on one final design doc based on everyone's suggestions. Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Attachments: DiskDriveAsResourceInYARN.pdf Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having large amount of memory on a node can lead to high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization as containers starved for I/O bandwidth hold on to other resources like cpu and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2139) [Umbrella] Support for Disk as a Resource in YARN
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206994#comment-14206994 ] Karthik Kambatla commented on YARN-2139: Thanks for the prototype, Wei. In light of the updates on YARN-2791 and YARN-2817, I propose we incorporate suggestions from [~sdaingade] and [~acmurthy] before posting patches for sub-tasks. Updated JIRA title, description, and marked it unassigned as this is an umbrella JIRA. [Umbrella] Support for Disk as a Resource in YARN -- Key: YARN-2139 URL: https://issues.apache.org/jira/browse/YARN-2139 Project: Hadoop YARN Issue Type: New Feature Reporter: Wei Yan Attachments: Disk_IO_Scheduling_Design_1.pdf, Disk_IO_Scheduling_Design_2.pdf, YARN-2139-prototype-2.patch, YARN-2139-prototype.patch YARN should consider disk as another resource for (1) scheduling tasks on nodes, (2) isolation at runtime, (3) spindle locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2843) NodeLabels manager should trim all inputs for hosts and labels
[ https://issues.apache.org/jira/browse/YARN-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207017#comment-14207017 ] Hudson commented on YARN-2843: -- FAILURE: Integrated in Hadoop-trunk-Commit #6511 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6511/]) YARN-2843. Fixed NodeLabelsManager to trim inputs for hosts and labels so as to make them work correctly. Contributed by Wangda Tan. (vinodkv: rev 0fd97f9c1989a793b882e6678285607472a3f75a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/RMAdminCLI.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ConverterUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/CommonNodeLabelsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/nodelabels/TestCommonNodeLabelsManager.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/nodelabels/NodeLabelTestBase.java NodeLabels manager should trim all inputs for hosts and labels -- Key: YARN-2843 URL: https://issues.apache.org/jira/browse/YARN-2843 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sushmitha Sreenivasan Assignee: Wangda Tan Attachments: YARN-2843-1.patch, YARN-2843-2.patch NodeLabels manager should trim all inputs for hosts and labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
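The fix is conceptually simple; below is a minimal sketch of the kind of trimming being added for user-supplied host and label strings (a hypothetical helper, not the actual CommonNodeLabelsManager code).
{code}
import java.util.HashSet;
import java.util.Set;

public class LabelInputTrimSketch {
  /** Normalize user-supplied host/label strings such as " host1 , gpu ". */
  static Set<String> parseAndTrim(String commaSeparated) {
    Set<String> result = new HashSet<String>();
    if (commaSeparated == null) {
      return result;
    }
    for (String token : commaSeparated.split(",")) {
      String trimmed = token.trim();
      if (!trimmed.isEmpty()) {
        result.add(trimmed);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Prints the same two entries regardless of stray whitespace in the input.
    System.out.println(parseAndTrim(" host1 , gpu "));
  }
}
{code}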
[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207046#comment-14207046 ] Ravi Prakash commented on YARN-1964: Hi Karthik! That's fair. I'll ask Arun if he is willing to re-spin 2.6.0. Create Docker analog of the LinuxContainerExecutor in YARN -- Key: YARN-1964 URL: https://issues.apache.org/jira/browse/YARN-1964 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.2.0 Reporter: Arun C Murthy Assignee: Abin Shahab Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch Docker (https://www.docker.io/) is, increasingly, a very popular container technology. In context of YARN, the support for Docker will provide a very elegant solution to allow applications to *package* their software into a Docker container (entire Linux file system incl. custom versions of perl, python etc.) and use it as a blueprint to launch all their YARN containers with requisite software environment. This provides both consistency (all YARN containers will have the same software environment) and isolation (no interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2848) (FICA) Applications should maintain an application specific 'cluster' resource to calculate headroom and userlimit
Craig Welch created YARN-2848: - Summary: (FICA) Applications should maintain an application specific 'cluster' resource to calculate headroom and userlimit Key: YARN-2848 URL: https://issues.apache.org/jira/browse/YARN-2848 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Likely solutions to [YARN-1680] (properly handling node and rack blacklisting with cluster level node additions and removals) will entail managing an application-level slice of the cluster resource available to the application for use in accurately calculating the application headroom and user limit. There is an assumption that events which impact this resource will occur less frequently than the need to calculate headroom, userlimit, etc. (which is a valid assumption given that occurs per-allocation heartbeat). Given that, the application should (with assistance from cluster-level code...) detect changes to the composition of the cluster (node addition, removal) and, when those have occurred, calculate an application-specific cluster resource by comparing cluster nodes to its own blacklist (both rack and individual node). I think it makes sense to include nodelabel considerations into this calculation as it will be efficient to do both at the same time, and the single resource value reflecting both constraints could then be used for efficient, frequent headroom and userlimit calculations while remaining highly accurate. The application would need to be made aware of nodelabel changes it is interested in (the application or removal of labels of interest to the application to/from nodes). For this purpose, the application submission's nodelabel expression would be used to determine the nodelabel impact on the resource used to calculate userlimit and headroom. (Cases where the application elected to request resources not using the application-level label expression are out of scope for this - but for the common use case of an application which uses a particular expression throughout, userlimit and headroom would be accurate.) This could also provide an overall mechanism for handling application-specific resource constraints which might be added in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
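For illustration, here is a minimal sketch of the kind of calculation the description proposes: recomputing an application-specific "cluster" resource only when the node set changes, by summing the resources of nodes not excluded by the application's node/rack blacklist. The SchedulerNodeView interface and its method names are hypothetical stand-ins (real code would work against the scheduler's node objects and the application's blacklists); Resource and Resources are the standard YARN utility types.
{code}
// Illustrative only: SchedulerNodeView and its accessors are made-up stand-ins
// for the scheduler's node abstraction; the real field/method names would come
// from SchedulerNode / SchedulerApplicationAttempt.
import java.util.Collection;
import java.util.Set;

import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class AppClusterResourceCalculator {

  /**
   * Recompute the application-specific "cluster" resource by summing the
   * resources of all nodes that are neither individually blacklisted nor on a
   * blacklisted rack. Intended to run only when the node set changes, so the
   * cached value can be reused on every headroom/userlimit calculation.
   */
  public Resource computeAppClusterResource(
      Collection<? extends SchedulerNodeView> clusterNodes,
      Set<String> blacklistedNodes,
      Set<String> blacklistedRacks) {
    Resource available = Resources.createResource(0, 0);
    for (SchedulerNodeView node : clusterNodes) {
      if (blacklistedNodes.contains(node.getNodeName())
          || blacklistedRacks.contains(node.getRackName())) {
        continue; // excluded from this application's view of the cluster
      }
      Resources.addTo(available, node.getTotalResource());
    }
    return available;
  }

  /** Minimal view of a node; a hypothetical stand-in for SchedulerNode. */
  public interface SchedulerNodeView {
    String getNodeName();
    String getRackName();
    Resource getTotalResource();
  }
}
{code}
The cached result would then feed the existing headroom and userlimit formulas unchanged, which is what keeps the per-heartbeat cost low.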
[jira] [Commented] (YARN-2843) NodeLabels manager should trim all inputs for hosts and labels
[ https://issues.apache.org/jira/browse/YARN-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207057#comment-14207057 ] Wangda Tan commented on YARN-2843: -- Thanks for [~vinodkv]'s review and commit! NodeLabels manager should trim all inputs for hosts and labels -- Key: YARN-2843 URL: https://issues.apache.org/jira/browse/YARN-2843 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sushmitha Sreenivasan Assignee: Wangda Tan Fix For: 2.7.0 Attachments: YARN-2843-1.patch, YARN-2843-2.patch NodeLabels manager should trim all inputs for hosts and labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2791) Add Disk as a resource for scheduling
[ https://issues.apache.org/jira/browse/YARN-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207080#comment-14207080 ] Swapnil Daingade commented on YARN-2791: Thanks Karthik Kambatla. Sure, let's make this a sub-task of YARN-2139. Add Disk as a resource for scheduling - Key: YARN-2791 URL: https://issues.apache.org/jira/browse/YARN-2791 Project: Hadoop YARN Issue Type: New Feature Components: scheduler Affects Versions: 2.5.1 Reporter: Swapnil Daingade Assignee: Yuliya Feldman Attachments: DiskDriveAsResourceInYARN.pdf Currently, the number of disks present on a node is not considered a factor while scheduling containers on that node. Having a large amount of memory on a node can lead to a high number of containers being launched on that node, all of which compete for I/O bandwidth. This multiplexing of I/O across containers can lead to slower overall progress and sub-optimal resource utilization, as containers starved for I/O bandwidth hold on to other resources like CPU and memory. This problem can be solved by considering disk as a resource and including it in deciding how many containers can be concurrently run on a node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-570) Time strings are formated in different timezone
[ https://issues.apache.org/jira/browse/YARN-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207083#comment-14207083 ] Karthik Kambatla commented on YARN-570: --- The patch looks reasonable. +1, relying on others' testing. Checking this in, will add one comment in Times.java in the process. Time strings are formated in different timezone --- Key: YARN-570 URL: https://issues.apache.org/jira/browse/YARN-570 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.2.0 Reporter: Peng Zhang Assignee: Akira AJISAKA Attachments: MAPREDUCE-5141.patch, YARN-570.2.patch, YARN-570.3.patch, YARN-570.4.patch, YARN-570.5.patch Time strings on different page are displayed in different timezone. If it is rendered by renderHadoopDate() in yarn.dt.plugins.js, it appears as Wed, 10 Apr 2013 08:29:56 GMT If it is formatted by format() in yarn.util.Times, it appears as 10-Apr-2013 16:29:56 Same value, but different timezone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2849) MRAppMaster: Add support for disk I/O request
Wei Yan created YARN-2849: - Summary: MRAppMaster: Add support for disk I/O request Key: YARN-2849 URL: https://issues.apache.org/jira/browse/YARN-2849 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2851) YarnClient: Add support for disk I/O resource/request information
Wei Yan created YARN-2851: - Summary: YarnClient: Add support for disk I/O resource/request information Key: YARN-2851 URL: https://issues.apache.org/jira/browse/YARN-2851 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2852) WebUI: Add disk I/O resource information to the web ui
Wei Yan created YARN-2852: - Summary: WebUI: Add disk I/O resource information to the web ui Key: YARN-2852 URL: https://issues.apache.org/jira/browse/YARN-2852 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2853) Killing app may hang while AM is unregistering
Jian He created YARN-2853: - Summary: Killing app may hang while AM is unregistering Key: YARN-2853 URL: https://issues.apache.org/jira/browse/YARN-2853 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He When killing an app, the app first moves to the KILLING state. If RMAppAttempt receives the attempt_unregister event before the attempt_kill event, it'll ignore the later attempt_kill event. Hence, RMApp won't be able to move to the KILLED state and stays in the KILLING state forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
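To make the race concrete, below is a minimal sketch (not the attached patch) of how such a gap is typically closed in the RMApp state machine: an explicit arc is registered for the unregister event while the app is in KILLING, so the subsequent kill/finish event is not silently dropped. It assumes Hadoop's StateMachineFactory and the RMAppState/RMAppEventType enums as they existed around 2.6; the helper and transition body are hypothetical.
{code}
// Sketch only: the extra arc and its behavior are illustrative, not the
// committed YARN-2853 fix.
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppEvent;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppEventType;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppState;
import org.apache.hadoop.yarn.state.SingleArcTransition;
import org.apache.hadoop.yarn.state.StateMachineFactory;

final class KillingStateTransitions {

  // Extra arc for the KILLING state so that an attempt_unregister arriving
  // first does not leave the app stuck: the app stays in KILLING and the
  // later ATTEMPT_KILLED / ATTEMPT_FINISHED event can still complete it.
  static StateMachineFactory<RMAppImpl, RMAppState, RMAppEventType, RMAppEvent>
      addKillingTransitions(
          StateMachineFactory<RMAppImpl, RMAppState, RMAppEventType, RMAppEvent> factory) {
    return factory.addTransition(
        RMAppState.KILLING, RMAppState.KILLING,
        RMAppEventType.ATTEMPT_UNREGISTERED,
        new SingleArcTransition<RMAppImpl, RMAppEvent>() {
          @Override
          public void transition(RMAppImpl app, RMAppEvent event) {
            // Record that the attempt already unregistered; the kill can then
            // be finalized when the attempt reports its final state.
          }
        });
  }
}
{code}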
[jira] [Commented] (YARN-2729) Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207268#comment-14207268 ] Wangda Tan commented on YARN-2729: -- Hi [~Naganarasimha], IIRC, the script-based patch should be based on YARN-2495, and we should create a script-based labels provider extending NodeLabelsProviderService, correct? But I haven't seen much relationship between this and YARN-2495 besides configuration options. Please let me know if I understood incorrectly. Thanks, Wangda Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup --- Key: YARN-2729 URL: https://issues.apache.org/jira/browse/YARN-2729 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Naganarasimha G R Assignee: Naganarasimha G R Attachments: YARN-2729.20141023-1.patch, YARN-2729.20141024-1.patch, YARN-2729.20141031-1.patch Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2853) Killing app may hang while AM is unregistering
[ https://issues.apache.org/jira/browse/YARN-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2853: -- Attachment: YARN-2853.1.patch Uploaded a patch to handle the possible attempt_unregistered, attempt_failed, attempt_finished events at the app_killing state. Killing app may hang while AM is unregistering -- Key: YARN-2853 URL: https://issues.apache.org/jira/browse/YARN-2853 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2853.1.patch When killing an app, app first moves to KILLING state, If RMAppAttempt receives the attempt_unregister event before attempt_kill event, it'll ignore the later attempt_kill event. Hence, RMApp won't be able to move to KILLED state and stays at KILLING state forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2853) Killing app may hang while AM is unregistering
[ https://issues.apache.org/jira/browse/YARN-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207292#comment-14207292 ] Jian He commented on YARN-2853: --- Instead, we could get rid of the KILLING state completely, let the app stay at its original state, and change RMApp to handle the attempt_killed event at each possible state. This way, we could avoid race conditions like this. I'll file a separate jira to do this. Killing app may hang while AM is unregistering -- Key: YARN-2853 URL: https://issues.apache.org/jira/browse/YARN-2853 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2853.1.patch When killing an app, app first moves to KILLING state, If RMAppAttempt receives the attempt_unregister event before attempt_kill event, it'll ignore the later attempt_kill event. Hence, RMApp won't be able to move to KILLED state and stays at KILLING state forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2853) Killing app may hang while AM is unregistering
[ https://issues.apache.org/jira/browse/YARN-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207303#comment-14207303 ] Hadoop QA commented on YARN-2853: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12680930/YARN-2853.1.patch against trunk revision 163bb55. {color:red}-1 patch{color}. Trunk compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5820//console This message is automatically generated. Killing app may hang while AM is unregistering -- Key: YARN-2853 URL: https://issues.apache.org/jira/browse/YARN-2853 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2853.1.patch When killing an app, app first moves to KILLING state, If RMAppAttempt receives the attempt_unregister event before attempt_kill event, it'll ignore the later attempt_kill event. Hence, RMApp won't be able to move to KILLED state and stays at KILLING state forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2853) Killing app may hang while AM is unregistering
[ https://issues.apache.org/jira/browse/YARN-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2853: -- Attachment: (was: YARN-2853.1.patch) Killing app may hang while AM is unregistering -- Key: YARN-2853 URL: https://issues.apache.org/jira/browse/YARN-2853 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2853.1.patch, YARN-2853.1.patch When killing an app, app first moves to KILLING state, If RMAppAttempt receives the attempt_unregister event before attempt_kill event, it'll ignore the later attempt_kill event. Hence, RMApp won't be able to move to KILLED state and stays at KILLING state forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207390#comment-14207390 ] Ravi Prakash commented on YARN-1964: I'm a +1 on this patch. I'll commit it to trunk and branch-2 soon. Soon as I get confirmation from Arun, I'll commit it into branch-2.6 as well. Create Docker analog of the LinuxContainerExecutor in YARN -- Key: YARN-1964 URL: https://issues.apache.org/jira/browse/YARN-1964 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.2.0 Reporter: Arun C Murthy Assignee: Abin Shahab Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch Docker (https://www.docker.io/) is, increasingly, a very popular container technology. In context of YARN, the support for Docker will provide a very elegant solution to allow applications to *package* their software into a Docker container (entire Linux file system incl. custom versions of perl, python etc.) and use it as a blueprint to launch all their YARN containers with requisite software environment. This provides both consistency (all YARN containers will have the same software environment) and isolation (no interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be violated when resource of AM is minimumAllocation
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207405#comment-14207405 ] Craig Welch commented on YARN-2637: --- I think the fix is fairly straightforward - there is an amResource property on the SchedulerApplicationAttempt / FiCaSchedulerApp; it does not appear to be populated in the CapacityScheduler case (but it should be, and the information is available in the submission / from the resource requests of the application) - populate this value, and then add a Resource property to LeafQueue which represents the resources used by active application masters - when an application starts, add its amResource value to the LeafQueue's active application master resource value; when an application ends, remove it. Before starting an application, compare the sum of the active application masters + the new application's resource to the resource represented by the percentage of cluster resource allowed to be used by AMs in the queue (this can differ by queue...) and if it exceeds the value, do not start the application. The existing trickle-down logic based on the minimum allocation should be removed; there is also logic regarding how many applications can be running based on explicit configuration, which should be retained.
{code}
if ((queue.activeApplicationMasterResourceTotal + readyToStartApplication.applicationMasterResource)
        <= queue.portionOfClusterResourceAllowedForApplicationMaster * clusterResource
    && maxAllowedApplications >= runningApplications + 1) {
  queue.startTheApp
}
{code}
maximum-am-resource-percent could be violated when resource of AM is minimumAllocation Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Priority: Critical Currently, the number of AMs in a leaf queue is calculated in the following way:
{code}
max_am_resource = queue_max_capacity * maximum_am_resource_percent
#max_am_number = max_am_resource / minimum_allocation
#max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor
{code}
And when a new application is submitted to the RM, it will check whether the app can be activated in the following way:
{code}
for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator(); i.hasNext(); ) {
  FiCaSchedulerApp application = i.next();
  // Check queue limit
  if (getNumActiveApplications() >= getMaximumActiveApplications()) {
    break;
  }
  // Check user limit
  User user = getUser(application.getUser());
  if (user.getActiveApplications() < getMaximumActiveApplicationsPerUser()) {
    user.activateApplication();
    activeApplications.add(application);
    i.remove();
    LOG.info("Application " + application.getApplicationId() +
        " from user: " + application.getUser() +
        " activated in queue: " + getQueueName());
  }
}
{code}
For example, if a queue has capacity = 1G and max_am_resource_percent = 0.2, the maximum resource that AMs can use is 200M. Assuming minimum_allocation = 1M, the number of AMs that can be launched is 200, and if a user uses 5M for each AM (> minimum_allocation), all apps can still be activated, and they will occupy all the resources of the queue instead of only a max_am_resource_percent share of the queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
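As a concrete illustration of the check proposed above (not the committed fix), the sketch below expresses the AM-resource comparison with YARN's standard Resources helpers; the parameter names stand in for the LeafQueue/FiCaSchedulerApp fields that would be added.
{code}
// Illustrative sketch of the proposed activation check; the accessor-style
// parameters are hypothetical stand-ins for fields the comment proposes.
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

final class AmLimitCheck {

  /**
   * Returns true if activating the given application would keep the total AM
   * resource in the queue within maximum-am-resource-percent of the cluster.
   */
  static boolean fitsAmLimit(ResourceCalculator rc, Resource clusterResource,
      Resource activeAMResource, Resource applicationAMResource,
      float amResourcePercent) {
    // The queue's AM limit is a configurable fraction of the cluster resource.
    Resource amLimit = Resources.multiply(clusterResource, amResourcePercent);
    Resource afterActivation =
        Resources.add(activeAMResource, applicationAMResource);
    // Compare against the limit using the queue's resource calculator.
    return Resources.lessThanOrEqual(rc, clusterResource, afterActivation, amLimit);
  }
}
{code}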
[jira] [Created] (YARN-2854) The document about timeline service and generic service needs to be updated
Zhijie Shen created YARN-2854: - Summary: The document about timeline service and generic service needs to be updated Key: YARN-2854 URL: https://issues.apache.org/jira/browse/YARN-2854 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-2838) Issues with TimeLineServer (Application History)
[ https://issues.apache.org/jira/browse/YARN-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206671#comment-14206671 ] Zhijie Shen edited comment on YARN-2838 at 11/12/14 12:44 AM: -- [~Naganarasimha], sorry for not responding to you immediately, as I've been busy finalizing 2.6. I took a quick scan through your issue document. Here's my clarification: 1. While the entry point of this sub-module is still called ApplicationHistoryServer, it is actually generalized to be TimelineServer right now (definitely we need to refactor the code at some point). The baseline service provided by the timeline server is to allow the cluster and its apps to store their history information, metrics and so on by complying with the defined timeline data model. Later on, users and admins can query this information to do the analysis. 2. Application history (or, as we prefer to call it, the generic history service) is now a built-in service in the timeline server to record the generic history information of YARN apps. It was on a separate store (on FS), but after YARN-2033, it has been moved to the timeline store too, as a payload. We replaced the old storage layer, but kept the existing interfaces (web UI, services, CLI) unchanged, to be the analog of what RM provides for running apps. We still haven't integrated TimelineClient and AHSClient, the latter of which is the RPC interface for getting generic history information. APPLICATION_HISTORY_ENABLED is the only remaining old config to control whether we also want to pull the app info from the generic history service inside the timeline server. You may want to take a look at YARN-2033 to get more context about the change. Moreover, given a number of limitations of the old history store, we're no longer going to support it. 3. The document is definitely stale. I'll file a separate documentation Jira; however, it's too late for 2.6. Let's target 2.7 for an up-to-date document about the timeline service and its built-in generic history service (YARN-2854). Does it sound good? was (Author: zjshen): [~Naganarasimha], sorry for not responding you immediately as being busy on finalizing 2.6. A quick scan through your issue document. Here's my clarification: 1. While the entry point of the this sub-module is still called ApplicationHistoryServer, it is actually generalized to be TimelineServer right now (definitely we need to refactor the code at some point). The baseline service provided the the timeline server is to allow the cluster and its apps to store their history information, metrics and so on by complying with the defined timeline data model. Later on, users and admins can query this information to do the analysis. 2. Application history (or we prefer to call it generic history service) is now a built-in service in the timeline server to record the generic history information of YARN apps. It was on a separate store (on FS), but after YARN-2033, it has been moved to the timeline store too, as a payload. We replace the old storage layer, but keep the existing interfaces (web UI, services, CLI) not changed to be the analog of what RM provides for running apps. We still didn't integrate TimelineClient and AHSClient, the latter of which is RPC interface of getting generic history information via RPC interface. APPLICATION_HISTORY_ENABLED is the only remaining old config to control whether we also want to pull the app info from the generic history service inside the timeline server. 
You may want to take a look at YARN-2033 to get more context about the change. Moreover, as a number of limitation of the old history store, we're no longer going to support it. 3. The document is definitely staled. I'll file separate document Jira, however, it's too late for 2.6. Let's target 2.7 for an up-to-date document about timeline service and its built-in generic history service. Does it sound good? Issues with TimeLineServer (Application History) Key: YARN-2838 URL: https://issues.apache.org/jira/browse/YARN-2838 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.5.1 Reporter: Naganarasimha G R Assignee: Naganarasimha G R Attachments: IssuesInTimelineServer.pdf Few issues in usage of Timeline server for generic application history access -- This message was sent by Atlassian JIRA (v6.3.4#6332)
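To make the timeline data model mentioned in the clarification above a bit more tangible, here is a minimal, illustrative sketch of how an application could publish an entity to the timeline store through the 2.6-era TimelineClient; the entity type, id, and event name are made up for the example.
{code}
// Minimal sketch of publishing data via the TimelineClient API; the
// "MY_APP_ENTITY" type and ids below are illustrative values only.
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEvent;
import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TimelinePublishExample {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    TimelineClient client = TimelineClient.createTimelineClient();
    client.init(conf);
    client.start();
    try {
      TimelineEntity entity = new TimelineEntity();
      entity.setEntityType("MY_APP_ENTITY");      // made-up entity type
      entity.setEntityId("entity_0001");          // made-up entity id
      entity.setStartTime(System.currentTimeMillis());

      TimelineEvent event = new TimelineEvent();
      event.setEventType("STARTED");
      event.setTimestamp(System.currentTimeMillis());
      entity.addEvent(event);

      // Store the entity (history info, metrics, events) in the timeline store.
      client.putEntities(entity);
    } finally {
      client.stop();
    }
  }
}
{code}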
[jira] [Commented] (YARN-2236) Shared Cache uploader service on the Node Manager
[ https://issues.apache.org/jira/browse/YARN-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207464#comment-14207464 ] Karthik Kambatla commented on YARN-2236: Sorry for the delay on this, Sangjin. Patch looks generally good, but for some minor comments:
# LocalResource - mark the methods Public-Unstable for now; we can mark them Public-Stable once the feature is complete.
# Unrelated to this patch, can we mark BuilderUtils @Private for clarity?
# Also, mark FSDownload#isPublic @Private
# Rename ContainerImpl#storeSharedCacheUploadPolicies to storeSharedCacheUploadPolicy? Also, it should use block comments instead of line comments.
# LocalResourceRequest - LOG is unused; we should probably get rid of it along with its imports.
# SharedCacheChecksumFactory
## In the map, can we use Class instead of String?
## getCheckSum should use conf.getClass for getting the classname, and ReflectionUtils.newInstance for instantiation, to go with the rest of the YARN code. Refer to RMProxy for further information.
# Nit: SharedCacheUploader#call - remove the TODOs
# Instead of creating an event and submitting through the event-handler, would it be simpler to synchronously submit it since we are queueing it up to the executor anyway?
Shared Cache uploader service on the Node Manager - Key: YARN-2236 URL: https://issues.apache.org/jira/browse/YARN-2236 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2236-trunk-v1.patch, YARN-2236-trunk-v2.patch, YARN-2236-trunk-v3.patch, YARN-2236-trunk-v4.patch, YARN-2236-trunk-v5.patch Implement the shared cache uploader service on the node manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
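For reference, the conf.getClass / ReflectionUtils.newInstance pattern suggested for the checksum factory looks roughly like the sketch below; the configuration key, the SharedCacheChecksum interface, and the default implementation are stand-ins, not the names from the actual patch.
{code}
// Sketch of resolving an implementation class from Configuration and
// instantiating it reflectively, as the review comment suggests. The key
// "yarn.sharedcache.checksum.algo.impl" and the types here are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;

public final class ChecksumFactorySketch {

  public interface SharedCacheChecksum {
    String computeChecksum(java.io.InputStream in) throws java.io.IOException;
  }

  public static SharedCacheChecksum getChecksum(Configuration conf) {
    // conf.getClass resolves the configured class (falling back to a default)
    // and verifies it implements the expected interface.
    Class<? extends SharedCacheChecksum> clazz = conf.getClass(
        "yarn.sharedcache.checksum.algo.impl",
        DefaultChecksum.class,
        SharedCacheChecksum.class);
    // ReflectionUtils.newInstance instantiates the class and injects the
    // Configuration if it is Configurable, matching the rest of the YARN code.
    return ReflectionUtils.newInstance(clazz, conf);
  }

  /** Hypothetical default implementation, present only to make the sketch compile. */
  public static class DefaultChecksum implements SharedCacheChecksum {
    @Override
    public String computeChecksum(java.io.InputStream in) {
      return "";
    }
  }
}
{code}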
[jira] [Commented] (YARN-2236) Shared Cache uploader service on the Node Manager
[ https://issues.apache.org/jira/browse/YARN-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207485#comment-14207485 ] Sangjin Lee commented on YARN-2236: --- Thanks Karthik! Let me review them, and see what I can do. Just a quick question, in 2, did you mean marking the entire class BuilderUtils as Private or only the methods that are touched by this JIRA? Shared Cache uploader service on the Node Manager - Key: YARN-2236 URL: https://issues.apache.org/jira/browse/YARN-2236 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2236-trunk-v1.patch, YARN-2236-trunk-v2.patch, YARN-2236-trunk-v3.patch, YARN-2236-trunk-v4.patch, YARN-2236-trunk-v5.patch Implement the shared cache uploader service on the node manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2848) (FICA) Applications should maintain an application specific 'cluster' resource to calculate headroom and userlimit
[ https://issues.apache.org/jira/browse/YARN-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2848: -- Description: Likely solutions to [YARN-1680] (properly handling node and rack blacklisting with cluster level node additions and removals) will entail managing an application-level slice of the cluster resource available to the application for use in accurately calculating the application headroom and user limit. There is an assumption that events which impact this resource will occur less frequently than the need to calculate headroom, userlimit, etc (which is a valid assumption given that occurs per-allocation heartbeat). Given that, the application should (with assistance from cluster-level code...) detect changes to the composition of the cluster (node addition, removal) and when those have occurred, calculate an application specific cluster resource by comparing cluster nodes to it's own blacklist (both rack and individual node). I think it makes sense to include nodelabel considerations into this calculation as it will be efficient to do both at the same time and the single resource value reflecting both constraints could then be used for efficient frequent headroom and userlimit calculations while remaining highly accurate. The application would need to be made aware of nodelabel changes it is interested in (the application or removal of labels of interest to the application to/from nodes). For this purpose, the application submissions's nodelabel expression would be used to determine the nodelabel impact on the resource used to calculate userlimit and headroom (Cases where the application elected to request resources not using the application level label expression are out of scope for this - but for the common usecase of an application which uses a particular expression throughout, userlimit and headroom would be accurate) This could also provide an overall mechanism for handling application-specific resource constraints which might be added in the future. (was: Likely solutions to [YARN-1680] (properly handling node and rack blacklisting with cluster level node additions and removals) will entail managing an application-level slice of the cluster resource available to the application for use in accurately calculating the application headroom and user limit. There is an assumption that events which impact this resource will change less frequently than the need to calculate headroom, userlimit, etc (which is a valid assumption given that occurs per-allocation heartbeat). Given that, the application should (with assistance from cluster-level code...) detect changes to the composition of the cluster (node addition, removal) and when those have occurred, calculate a application specific cluster resource by comparing cluster nodes to it's own blacklist (both rack and individual node). I think it makes sense to include nodelabel considerations into this calculation as it will be efficient to do both at the same time and the single resource value reflecting both constraints could then be used for efficient frequent headroom and userlimit calculations while remaining highly accurate. The application would need to be made aware of nodelabel changes it is interested in (the application or removal of labels of interest to the application to/from nodes). 
For this purpose, the application submissions's nodelabel expression would be used to determine the nodelabel impact on the resource used to calculate userlimit and headroom (Cases where application elected to request resources not using the application level label expression are out of scope for this - but for the common usecase of an application which uses a particular expression throughout, userlimit and headroom would be accurate) This could also provide an overall mechanism for handling application-specific resource constraints which might be added in the future.) (FICA) Applications should maintain an application specific 'cluster' resource to calculate headroom and userlimit -- Key: YARN-2848 URL: https://issues.apache.org/jira/browse/YARN-2848 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Likely solutions to [YARN-1680] (properly handling node and rack blacklisting with cluster level node additions and removals) will entail managing an application-level slice of the cluster resource available to the application for use in accurately calculating the application headroom and user limit. There is an assumption that events which impact this resource will occur less frequently than the need to calculate headroom,
[jira] [Commented] (YARN-2853) Killing app may hang while AM is unregistering
[ https://issues.apache.org/jira/browse/YARN-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207486#comment-14207486 ] Hadoop QA commented on YARN-2853: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12680948/YARN-2853.1.patch against trunk revision 163bb55. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5821//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5821//console This message is automatically generated. Killing app may hang while AM is unregistering -- Key: YARN-2853 URL: https://issues.apache.org/jira/browse/YARN-2853 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2853.1.patch, YARN-2853.1.patch When killing an app, app first moves to KILLING state, If RMAppAttempt receives the attempt_unregister event before attempt_kill event, it'll ignore the later attempt_kill event. Hence, RMApp won't be able to move to KILLED state and stays at KILLING state forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2855) Use local date format to show app date time ,
Li Junjun created YARN-2855: --- Summary: Use local date format to show app date time , Key: YARN-2855 URL: https://issues.apache.org/jira/browse/YARN-2855 Project: Hadoop YARN Issue Type: Wish Components: resourcemanager Affects Versions: 2.5.1 Reporter: Li Junjun Priority: Minor In yarn.dt.plugins.js, the function renderHadoopDate uses toUTCString. I'm in China, so I need to add 8 hours in my head every time! I wish it used toLocaleString() to format the Date instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2855) Wish yarn web app use local date format to show app date time
[ https://issues.apache.org/jira/browse/YARN-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Junjun updated YARN-2855: Summary: Wish yarn web app use local date format to show app date time (was: Use local date format to show app date time ,) Wish yarn web app use local date format to show app date time -- Key: YARN-2855 URL: https://issues.apache.org/jira/browse/YARN-2855 Project: Hadoop YARN Issue Type: Wish Components: resourcemanager Affects Versions: 2.5.1 Reporter: Li Junjun Priority: Minor in yarn.dt.plugins.js function renderHadoopDate use toUTCString . I'm in China, so I need to add 8 hours in my mind every time! I wish use toLocaleString() to format Date instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2838) Issues with TimeLineServer (Application History)
[ https://issues.apache.org/jira/browse/YARN-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207562#comment-14207562 ] Naganarasimha G R commented on YARN-2838: - Hi [~zjshen], I will go through YARN-2033, but I feel some of the issues still stand even if the plan is to continue with the timeline server itself.
{quote}
# Whatever CLI command the user executes, historyserver or timelineserver, it looks like only ApplicationHistoryServer runs. So can we rename the class ApplicationHistoryServer to TimelineHistoryServer (or any other suitable name), since ApplicationHistoryServer is started regardless of which command the user runs?
# Instead of the "Starting the History Server anyway..." deprecation message, can we have "Starting the Timeline History Server anyway..."?
# Based on start or stop, the deprecation message should change to "Starting the Timeline History Server anyway..." or "Stopping the Timeline History Server anyway..."
{quote}
So if you comment on the individual issues/points, I would like to start fixing them as part of this jira. There is also a 4th issue which I mentioned:
{quote}
Missed to add point 4: In YarnClientImpl, history data can be got either from the HistoryServer (old manager) or from the TimelineServer (new one). So the historyServiceEnabled flag needs to check both the Timeline server configurations and the ApplicationHistoryServer configurations, as data can be got from either of them.
{quote}
I think this is also related to the issue you mentioned: ??We still haven't integrated TimelineClient and AHSClient, the latter of which is the RPC interface for getting generic history information.?? But anyway, we need to fix this issue too, right? Has a jira already been raised, or shall I work on it as part of this jira? Also, please let me know if this issue needs to be split into multiple jiras (apart from the documentation one which you have already raised); I would like to split them out and work on them. As I have already started looking into these issues and was also planning to work on the documentation, if you don't mind, can you assign YARN-2854 to me? Issues with TimeLineServer (Application History) Key: YARN-2838 URL: https://issues.apache.org/jira/browse/YARN-2838 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.5.1 Reporter: Naganarasimha G R Assignee: Naganarasimha G R Attachments: IssuesInTimelineServer.pdf Few issues in usage of Timeline server for generic application history access -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2855) Wish yarn web app use local date format to show app date time
[ https://issues.apache.org/jira/browse/YARN-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207610#comment-14207610 ] Karthik Kambatla commented on YARN-2855: Duplicate of YARN-570? Wish yarn web app use local date format to show app date time -- Key: YARN-2855 URL: https://issues.apache.org/jira/browse/YARN-2855 Project: Hadoop YARN Issue Type: Wish Components: resourcemanager Affects Versions: 2.5.1 Reporter: Li Junjun Priority: Minor in yarn.dt.plugins.js function renderHadoopDate use toUTCString . I'm in China, so I need to add 8 hours in my mind every time! I wish use toLocaleString() to format Date instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2855) Wish yarn web app use local date format to show app date time
[ https://issues.apache.org/jira/browse/YARN-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207624#comment-14207624 ] Li Junjun commented on YARN-2855: - yes! I closed it ! Wish yarn web app use local date format to show app date time -- Key: YARN-2855 URL: https://issues.apache.org/jira/browse/YARN-2855 Project: Hadoop YARN Issue Type: Wish Components: resourcemanager Affects Versions: 2.5.1 Reporter: Li Junjun Priority: Minor Fix For: 2.7.0 in yarn.dt.plugins.js function renderHadoopDate use toUTCString . I'm in China, so I need to add 8 hours in my mind every time! I wish use toLocaleString() to format Date instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravi Prakash reassigned YARN-1964: -- Assignee: Ravi Prakash (was: Abin Shahab) Create Docker analog of the LinuxContainerExecutor in YARN -- Key: YARN-1964 URL: https://issues.apache.org/jira/browse/YARN-1964 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.2.0 Reporter: Arun C Murthy Assignee: Ravi Prakash Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch Docker (https://www.docker.io/) is, increasingly, a very popular container technology. In context of YARN, the support for Docker will provide a very elegant solution to allow applications to *package* their software into a Docker container (entire Linux file system incl. custom versions of perl, python etc.) and use it as a blueprint to launch all their YARN containers with requisite software environment. This provides both consistency (all YARN containers will have the same software environment) and isolation (no interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2856) Application recovery throw InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED
Rohith created YARN-2856: Summary: Application recovery throw InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED Key: YARN-2856 URL: https://issues.apache.org/jira/browse/YARN-2856 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Rohith Assignee: Rohith It is observed that recovering an application whose attempt has KILLED as its final state throws the below exception, and the application remains in the ACCEPTED state forever.
{code}
2014-11-12 02:34:10,602 | ERROR | AsyncDispatcher event handler | Can't handle this event at current state | org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:673)
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:671)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:90)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:730)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:714)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:745)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2236) Shared Cache uploader service on the Node Manager
[ https://issues.apache.org/jira/browse/YARN-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-2236: -- Attachment: YARN-2236-trunk-v6.patch v.6 patch posted. Again, to see the diff against the trunk, see https://github.com/ctrezzo/hadoop/compare/trunk...sharedcache-5-YARN-2236-uploader To see the diff between v.5 and v.6, see https://github.com/ctrezzo/hadoop/commit/a74f38cf3e3de824b3c6ced327acbe8e3937aef0 Shared Cache uploader service on the Node Manager - Key: YARN-2236 URL: https://issues.apache.org/jira/browse/YARN-2236 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2236-trunk-v1.patch, YARN-2236-trunk-v2.patch, YARN-2236-trunk-v3.patch, YARN-2236-trunk-v4.patch, YARN-2236-trunk-v5.patch, YARN-2236-trunk-v6.patch Implement the shared cache uploader service on the node manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2856) Application recovery throw InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED
[ https://issues.apache.org/jira/browse/YARN-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207744#comment-14207744 ] Rohith commented on YARN-2856: -- It is possible event ATTEMPT_KILLED can come to RMApp while recovering the attempt with KILLED state. This event need to be handled. Application recovery throw InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED Key: YARN-2856 URL: https://issues.apache.org/jira/browse/YARN-2856 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Rohith Assignee: Rohith It is observed that recovering an application with its attempt KILLED final state throw below exception. And application remain in accepted state forever. {code} 2014-11-12 02:34:10,602 | ERROR | AsyncDispatcher event handler | Can't handle this event at current state | org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:673) org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:671) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:90) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:730) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:714) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2236) Shared Cache uploader service on the Node Manager
[ https://issues.apache.org/jira/browse/YARN-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207748#comment-14207748 ] Sangjin Lee commented on YARN-2236: --- Karthik, the v.6 patch should address all of your comments except #8. As for #8, it is true that the event handler is a bit extraneous. But from the code standpoint, it is pretty clean and elegant. We just initialize the SharedCacheUploadService, and ContainerImpl can simply publish the event when needed. It also keeps the coupling between SharedCacheUploadService and ContainerImpl loose. It is possible to have ContainerImpl use SharedCacheUploadService directly, but then the SharedCacheUploadService needs to be passed into the ContainerImpl constructor so it can be invoked directly. So all in all, I feel that the current approach is as clean as the alternative, if not cleaner. Let me know your thoughts. Thanks! Shared Cache uploader service on the Node Manager - Key: YARN-2236 URL: https://issues.apache.org/jira/browse/YARN-2236 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2236-trunk-v1.patch, YARN-2236-trunk-v2.patch, YARN-2236-trunk-v3.patch, YARN-2236-trunk-v4.patch, YARN-2236-trunk-v5.patch, YARN-2236-trunk-v6.patch Implement the shared cache uploader service on the node manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
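To illustrate the decoupling described above (purely a sketch; the event and type names are hypothetical, not necessarily those in the v.6 patch), the container-side code only needs the NM dispatcher, while the uploader service registers for the event type and queues the work on its own executor.
{code}
// Sketch of event-based decoupling between ContainerImpl and the uploader
// service; SharedCacheUploadEvent/SharedCacheUploadEventType are made-up names.
import org.apache.hadoop.yarn.event.AbstractEvent;
import org.apache.hadoop.yarn.event.Dispatcher;

final class SharedCacheUploadSketch {

  enum SharedCacheUploadEventType { UPLOAD }

  static class SharedCacheUploadEvent extends AbstractEvent<SharedCacheUploadEventType> {
    private final String resourcePath; // stand-in for the localized resource

    SharedCacheUploadEvent(String resourcePath) {
      super(SharedCacheUploadEventType.UPLOAD);
      this.resourcePath = resourcePath;
    }

    String getResourcePath() {
      return resourcePath;
    }
  }

  // Inside ContainerImpl, the upload request is simply published on the NM
  // dispatcher; the uploader service registered for this event type picks it
  // up asynchronously, so ContainerImpl never holds a reference to it.
  static void requestUpload(Dispatcher dispatcher, String resourcePath) {
    dispatcher.getEventHandler().handle(new SharedCacheUploadEvent(resourcePath));
  }
}
{code}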
[jira] [Commented] (YARN-2236) Shared Cache uploader service on the Node Manager
[ https://issues.apache.org/jira/browse/YARN-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207775#comment-14207775 ] Hadoop QA commented on YARN-2236: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681014/YARN-2236-trunk-v6.patch against trunk revision 53f64ee. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5823//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5823//console This message is automatically generated. Shared Cache uploader service on the Node Manager - Key: YARN-2236 URL: https://issues.apache.org/jira/browse/YARN-2236 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2236-trunk-v1.patch, YARN-2236-trunk-v2.patch, YARN-2236-trunk-v3.patch, YARN-2236-trunk-v4.patch, YARN-2236-trunk-v5.patch, YARN-2236-trunk-v6.patch Implement the shared cache uploader service on the node manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2856) Application recovery throw InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED
[ https://issues.apache.org/jira/browse/YARN-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-2856: - Attachment: YARN-2856.patch Application recovery throw InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED Key: YARN-2856 URL: https://issues.apache.org/jira/browse/YARN-2856 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Rohith Assignee: Rohith Attachments: YARN-2856.patch It is observed that recovering an application with its attempt KILLED final state throw below exception. And application remain in accepted state forever. {code} 2014-11-12 02:34:10,602 | ERROR | AsyncDispatcher event handler | Can't handle this event at current state | org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:673) org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:671) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:90) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:730) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:714) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)