[jira] [Commented] (YARN-3396) Handle URISyntaxException in ResourceLocalizationService
[ https://issues.apache.org/jira/browse/YARN-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528371#comment-14528371 ] Junping Du commented on YARN-3396: -- This is a simple fix which doesn't need a unit test. I will go ahead and commit this later today. Handle URISyntaxException in ResourceLocalizationService Key: YARN-3396 URL: https://issues.apache.org/jira/browse/YARN-3396 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Chengbing Liu Assignee: Brahma Reddy Battula Labels: newbie Attachments: YARN-3396-002.patch, YARN-3396.patch There are two occurrences of the following code snippet:
{code}
//TODO fail? Already translated several times...
{code}
It should be handled correctly in case the resource URI is incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
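A minimal sketch of the direction implied above (hypothetical names, not the committed YARN-3396 patch): treat a malformed resource URI as a localization failure for that resource instead of leaving the TODO in place.
{code}
import java.net.URI;
import java.net.URISyntaxException;

public class ResourceUriCheck {
  // Converts a resource string to a URI, failing loudly on bad input rather
  // than falling through a "//TODO fail?" branch.
  static URI toResourceUri(String resource) {
    try {
      return new URI(resource);
    } catch (URISyntaxException e) {
      // Surface the malformed URI as a failure the localizer can report.
      throw new IllegalArgumentException("Invalid resource URI: " + resource, e);
    }
  }

  public static void main(String[] args) {
    System.out.println(toResourceUri("hdfs://nn:8020/user/app/lib.jar"));
  }
}
{code}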
[jira] [Created] (YARN-3577) Misspelling of threshold in log4j.properties for tests
Brahma Reddy Battula created YARN-3577: -- Summary: Misspelling of threshold in log4j.properties for tests Key: YARN-3577 URL: https://issues.apache.org/jira/browse/YARN-3577 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: Brahma Reddy Battula Assignee: Brahma Reddy Battula Priority: Minor The log4j.properties file for tests contains the misspelled property log4j.threshhold. We should use the correct spelling, log4j.threshold. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
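For reference, the fix amounts to a one-character rename in the test log4j.properties; the value shown below is illustrative, and only the key's spelling is the point of the JIRA.
{code}
# Misspelled key, silently ignored by log4j:
# log4j.threshhold=ALL
# Correct spelling:
log4j.threshold=ALL
{code}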
[jira] [Commented] (YARN-3538) TimelineServer doesn't catch/translate all exceptions raised
[ https://issues.apache.org/jira/browse/YARN-3538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528442#comment-14528442 ] Junping Du commented on YARN-3538: -- Maybe it would sound slightly better if we handled RuntimeException the same way as IOException here? At least, we'd add "Error putting domain info". :) TimelineServer doesn't catch/translate all exceptions raised Key: YARN-3538 URL: https://issues.apache.org/jira/browse/YARN-3538 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Steve Loughran Assignee: Steve Loughran Priority: Minor Attachments: YARN-3538-001.patch Not all exceptions in TimelineServer are uprated to web exceptions; only IOEs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
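A minimal sketch of that suggestion (hypothetical helper, assuming the JAX-RS WebApplicationException already used for IOException in the timeline web services): give RuntimeException the same treatment, so the caller sees an HTTP 500 with the original cause attached.
{code}
import javax.ws.rs.WebApplicationException;
import javax.ws.rs.core.Response;

public final class PutDomainSketch {
  // Logs "Error putting domain info" and uprates a RuntimeException to a web
  // exception, mirroring the existing IOException branch.
  static void runTranslated(Runnable storeOp) {
    try {
      storeOp.run();
    } catch (RuntimeException e) {
      System.err.println("Error putting domain info: " + e);
      throw new WebApplicationException(e, Response.Status.INTERNAL_SERVER_ERROR);
    }
  }

  public static void main(String[] args) {
    // Demo: exits with the translated WebApplicationException.
    runTranslated(() -> { throw new IllegalStateException("boom"); });
  }
}
{code}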
[jira] [Commented] (YARN-3396) Handle URISyntaxException in ResourceLocalizationService
[ https://issues.apache.org/jira/browse/YARN-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528282#comment-14528282 ] Junping Du commented on YARN-3396: -- Thanks [~brahmareddy] for updating the patch! The v2 patch LGTM. +1 based on Jenkins' result. Handle URISyntaxException in ResourceLocalizationService Key: YARN-3396 URL: https://issues.apache.org/jira/browse/YARN-3396 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Chengbing Liu Assignee: Brahma Reddy Battula Labels: newbie Attachments: YARN-3396-002.patch, YARN-3396.patch There are two occurrences of the following code snippet:
{code}
//TODO fail? Already translated several times...
{code}
It should be handled correctly in case the resource URI is incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3574) RM hangs on stopping MetricsSinkAdapter when transitioning to standby
[ https://issues.apache.org/jira/browse/YARN-3574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528350#comment-14528350 ] Rohith commented on YARN-3574: -- Very interesting bug!! Going back to Java basics, Thread.interrupt() does not guarantee the interruption of a running thread unless the thread is waiting/sleeping on something. In this issue I think {{queue.consumeAll(this);}} is processing something that never gives the interrupt a chance. Just to reproduce this, I wrote the small program below. If we run it as-is (with Thread.sleep commented out), the thread never gets interrupted; uncommenting the small sleep results in the thread getting interrupted.
{code}
package com.test.basic;

public class Test1 {
  Thread sinkThread;
  private volatile boolean stopping = false;

  public void start() {
    sinkThread = new Thread() {
      public void run() {
        while (!stopping) {
          try {
            while (true) {
              // Thread.sleep(1);
            }
          } catch (Exception e) {
            System.out.println("Interrupted..");
          }
        }
      }
    };
    sinkThread.setDaemon(true);
    sinkThread.start();
  }

  public void stop() {
    stopping = true;
    System.out.println("Interrupting..");
    sinkThread.interrupt();
    try {
      System.out.println("Joining..");
      sinkThread.join();
    } catch (InterruptedException e) {
      System.out.println("Stop interrupted " + e);
    }
    System.out.println("Stopped successfully");
  }

  public static void main(String[] args) throws InterruptedException {
    Test1 t1 = new Test1();
    t1.start();
    Thread.sleep(2000);
    t1.stop();
  }
}
{code}
RM hangs on stopping MetricsSinkAdapter when transitioning to standby - Key: YARN-3574 URL: https://issues.apache.org/jira/browse/YARN-3574 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Brahma Reddy Battula We've seen a situation where one RM hangs on stopping the MetricsSinkAdapter:
{code}
"main-EventThread" daemon prio=10 tid=0x7f9b24031000 nid=0x2d18 in Object.wait() [0x7f9afe7eb000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0xc058dcf8> (a org.apache.hadoop.metrics2.impl.MetricsSinkAdapter$1)
    at java.lang.Thread.join(Thread.java:1281)
    - locked <0xc058dcf8> (a org.apache.hadoop.metrics2.impl.MetricsSinkAdapter$1)
    at java.lang.Thread.join(Thread.java:1355)
    at org.apache.hadoop.metrics2.impl.MetricsSinkAdapter.stop(MetricsSinkAdapter.java:202)
    at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.stopSinks(MetricsSystemImpl.java:472)
    - locked <0xc04cc1a0> (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl)
    at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.stop(MetricsSystemImpl.java:213)
    - locked <0xc04cc1a0> (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl)
    at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.shutdown(MetricsSystemImpl.java:592)
    - locked <0xc04cc1a0> (a org.apache.hadoop.metrics2.impl.MetricsSystemImpl)
    at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.shutdownInstance(DefaultMetricsSystem.java:72)
    at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.shutdown(DefaultMetricsSystem.java:68)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:605)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    - locked <0xc0503568> (a java.lang.Object)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.stopActiveServices(ResourceManager.java:1024)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1076)
    - locked <0xc03fe3b8> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:322)
    - locked <0xc0502b10> (a org.apache.hadoop.yarn.server.resourcemanager.AdminService)
    at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeStandby(EmbeddedElectorService.java:135)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeStandby(ActiveStandbyElector.java:911)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:428)
    - locked <0xc0718940> (a org.apache.hadoop.ha.ActiveStandbyElector)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605)
    at
{code}
[jira] [Updated] (YARN-3552) RM Web UI shows -1 running containers for completed apps
[ https://issues.apache.org/jira/browse/YARN-3552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-3552: - Affects Version/s: 2.8.0 RM Web UI shows -1 running containers for completed apps Key: YARN-3552 URL: https://issues.apache.org/jira/browse/YARN-3552 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.8.0 Reporter: Rohith Assignee: Rohith Priority: Trivial Labels: newbie Fix For: 2.8.0 Attachments: 0001-YARN-3552.patch, 0001-YARN-3552.patch, 0002-YARN-3552.patch, yarn-3352.PNG In RMServerUtils, the default values are negative numbers, which results in the RM web UI also displaying negative numbers.
{code}
public static final ApplicationResourceUsageReport
    DUMMY_APPLICATION_RESOURCE_USAGE_REPORT =
        BuilderUtils.newApplicationResourceUsageReport(-1, -1,
            Resources.createResource(-1, -1), Resources.createResource(-1, -1),
            Resources.createResource(-1, -1), 0, 0);
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3576) In Log - Container getting killed by AM even when work preserving is enabled
Anushri created YARN-3576: - Summary: In Log - Container getting killed by AM even when work preserving is enabled Key: YARN-3576 URL: https://issues.apache.org/jira/browse/YARN-3576 Project: Hadoop YARN Issue Type: Bug Environment: SUSE11 SP3, 3-node cluster Reporter: Anushri Priority: Minor RM in HA mode, one NM running, work-preserving recovery enabled. An application is submitted and an RM switchover happens. In the NM log it is found that the AM kills some of the containers, and those containers have exit code 143; yet logs for those same containers are present in the container logs. Problem: if work preserving is enabled, why are the containers being killed and cleaned up? And if a container is getting killed, why are its logs present in the container logs? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3097) Logging of resource recovery on NM restart has redundancies
[ https://issues.apache.org/jira/browse/YARN-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528322#comment-14528322 ] Hudson commented on YARN-3097: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #918 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/918/]) YARN-3097. Logging of resource recovery on NM restart has redundancies. Contributed by Eric Payne (jlowe: rev 8f65c793f2930bfd16885a2ab188a9970b754974)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
* hadoop-yarn-project/CHANGES.txt
Logging of resource recovery on NM restart has redundancies --- Key: YARN-3097 URL: https://issues.apache.org/jira/browse/YARN-3097 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Eric Payne Priority: Minor Labels: newbie Fix For: 2.8.0 Attachments: YARN-3097.001.patch ResourceLocalizationService logs that it is recovering a resource with the remote and local paths, but then very shortly afterwards the LocalizedResource emits an INIT-LOCALIZED transition that also logs the same remote and local paths. The recovery message should be a debug message, since it's not conveying any useful information that isn't already covered by the resource state transition log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2725) Adding test cases of retrying requests about ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528328#comment-14528328 ] Hudson commented on YARN-2725: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #918 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/918/]) YARN-2725. Added test cases of retrying creating znode in ZKRMStateStore. Contributed by Tsuyoshi Ozawa (jianhe: rev d701acc9c67adc578ba18035edde1166eedaae98)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java
Adding test cases of retrying requests about ZKRMStateStore --- Key: YARN-2725 URL: https://issues.apache.org/jira/browse/YARN-2725 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi Ozawa Assignee: Tsuyoshi Ozawa Fix For: 2.8.0 Attachments: YARN-2725.1.patch, YARN-2725.1.patch YARN-2721 found a race condition in ZK-specific retry semantics. We should add tests covering the retrying of requests to ZK. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3375) NodeHealthScriptRunner.shouldRun() check is performing 3 times for starting NodeHealthScriptRunner
[ https://issues.apache.org/jira/browse/YARN-3375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528326#comment-14528326 ] Hudson commented on YARN-3375: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #918 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/918/]) YARN-3375. NodeHealthScriptRunner.shouldRun() check is performing 3 times for starting NodeHealthScriptRunner (Devaraj K via wangda) (wangda: rev 71f4de220c74bf2c90630bd0442979d92380d304)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java
* hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/NodeHealthScriptRunner.java
* hadoop-yarn-project/CHANGES.txt
NodeHealthScriptRunner.shouldRun() check is performing 3 times for starting NodeHealthScriptRunner -- Key: YARN-3375 URL: https://issues.apache.org/jira/browse/YARN-3375 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Devaraj K Assignee: Devaraj K Priority: Minor Fix For: 2.8.0 Attachments: YARN-3375.patch 1. The NodeHealthScriptRunner.shouldRun() check happens 3 times when starting the NodeHealthScriptRunner.
{code:title=NodeManager.java|borderStyle=solid}
if(!NodeHealthScriptRunner.shouldRun(nodeHealthScript)) {
  LOG.info("Abey khali");
  return null;
}
{code}
{code:title=NodeHealthCheckerService.java|borderStyle=solid}
if (NodeHealthScriptRunner.shouldRun(
    conf.get(YarnConfiguration.NM_HEALTH_CHECK_SCRIPT_PATH))) {
  addService(nodeHealthScriptRunner);
}
{code}
{code:title=NodeHealthScriptRunner.java|borderStyle=solid}
if (!shouldRun(nodeHealthScript)) {
  LOG.info("Not starting node health monitor");
  return;
}
{code}
2. If we don't configure a node health script, or the configured health script doesn't have execute permission, the NM logs the message below.
{code:xml}
2015-03-19 19:55:45,713 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: Abey khali
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3097) Logging of resource recovery on NM restart has redundancies
[ https://issues.apache.org/jira/browse/YARN-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528288#comment-14528288 ] Hudson commented on YARN-3097: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #184 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/184/]) YARN-3097. Logging of resource recovery on NM restart has redundancies. Contributed by Eric Payne (jlowe: rev 8f65c793f2930bfd16885a2ab188a9970b754974)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
* hadoop-yarn-project/CHANGES.txt
Logging of resource recovery on NM restart has redundancies --- Key: YARN-3097 URL: https://issues.apache.org/jira/browse/YARN-3097 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Eric Payne Priority: Minor Labels: newbie Fix For: 2.8.0 Attachments: YARN-3097.001.patch ResourceLocalizationService logs that it is recovering a resource with the remote and local paths, but then very shortly afterwards the LocalizedResource emits an INIT-LOCALIZED transition that also logs the same remote and local paths. The recovery message should be a debug message, since it's not conveying any useful information that isn't already covered by the resource state transition log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2725) Adding test cases of retrying requests about ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528294#comment-14528294 ] Hudson commented on YARN-2725: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #184 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/184/]) YARN-2725. Added test cases of retrying creating znode in ZKRMStateStore. Contributed by Tsuyoshi Ozawa (jianhe: rev d701acc9c67adc578ba18035edde1166eedaae98)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java
Adding test cases of retrying requests about ZKRMStateStore --- Key: YARN-2725 URL: https://issues.apache.org/jira/browse/YARN-2725 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi Ozawa Assignee: Tsuyoshi Ozawa Fix For: 2.8.0 Attachments: YARN-2725.1.patch, YARN-2725.1.patch YARN-2721 found a race condition in ZK-specific retry semantics. We should add tests covering the retrying of requests to ZK. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3484) Fix up yarn top shell code
[ https://issues.apache.org/jira/browse/YARN-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528397#comment-14528397 ] Junping Du commented on YARN-3484: -- Latest patch LGTM. [~aw], do you have any further comments? If not, I will go ahead and commit the v2 patch soon. Thx! Fix up yarn top shell code -- Key: YARN-3484 URL: https://issues.apache.org/jira/browse/YARN-3484 Project: Hadoop YARN Issue Type: Bug Components: scripts Affects Versions: 3.0.0 Reporter: Allen Wittenauer Assignee: Varun Vasudev Labels: newbie Attachments: YARN-3484.001.patch, YARN-3484.002.patch We need to do some work on yarn top's shell code.
a) Just checking for TERM isn't good enough. We really need to check the return of tput, especially since the output will not be a number but an error string, which will likely blow up the java code in horrible ways.
b) All the single-bracket tests should be double brackets to force the bash built-in.
c) I think I'd rather see the shell portion in a function since it's rather large. This will allow args, etc., to get local'ized and clean up the case statement.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3477) TimelineClientImpl swallows exceptions
[ https://issues.apache.org/jira/browse/YARN-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528420#comment-14528420 ] Junping Du commented on YARN-3477: -- Adding a little background (for timeline service v2) on why we might prefer DEBUG over INFO in the retry logic here: in timeline service version 2, the timeline service address (per application, per agent - we call it AppTimelineCollector) is automatically discovered. The current flow is:
1. For a new application, when the AM gets launched on an NM, the container-launch auxiliary service triggers initialization of the AppTimelineCollector, which reports its bind address to the NM (we add a new RPC there);
2. The NM notifies the RM about this new AppTimelineCollector address in the next heartbeat;
3. Other NMs (with containers running for this app) get the address from the RM.
Both the AM and NMs leverage TimelineClient to publish events/metrics to the timeline service, and this auto-discovery process does need some time (several heartbeat intervals) to resolve, unlike a static pre-configured address. So we would always see some disturbing messages if we logged at INFO level there. Thoughts? TimelineClientImpl swallows exceptions -- Key: YARN-3477 URL: https://issues.apache.org/jira/browse/YARN-3477 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0, 2.7.0 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-3477-001.patch, YARN-3477-002.patch If the timeline client fails more than the retry count, the original exception is not thrown. Instead some runtime exception is raised saying retries ran out.
# The failing exception should be rethrown, ideally via NetUtils.wrapException to include the URL of the failing endpoint.
# Otherwise, the raised RTE should (a) state that URL and (b) set the original fault as the inner cause.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
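A minimal retry sketch of points 1 and 2 from the description (hypothetical names, not TimelineClientImpl itself): remember the last failure, log each retry at low verbosity, and once retries run out rethrow with the endpoint named and the original fault as the cause.
{code}
import java.io.IOException;

public class RetrySketch {
  interface Op<T> { T run() throws IOException; }

  static <T> T withRetries(String url, int maxRetries, long intervalMs, Op<T> op)
      throws IOException {
    IOException last = null;
    for (int i = 0; i <= maxRetries; i++) {
      try {
        return op.run();
      } catch (IOException e) {
        last = e; // keep the original fault for the caller
        System.out.println("DEBUG: attempt " + (i + 1) + " against " + url
            + " failed, retrying: " + e);
        try { Thread.sleep(intervalMs); } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
    // State the failing URL and set the original exception as the inner cause.
    throw new IOException("Failed to connect to " + url + " after "
        + (maxRetries + 1) + " attempts", last);
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical endpoint; this demo op succeeds immediately.
    System.out.println(withRetries("http://ats.example:8188/ws/v1/timeline", 3, 100, () -> "ok"));
  }
}
{code}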
[jira] [Updated] (YARN-2123) Progress bars in Web UI always at 100% due to non-US locale
[ https://issues.apache.org/jira/browse/YARN-2123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-2123: Attachment: YARN-2123-branch-2.7.001.patch Thanks [~xgong] for review. Attaching a patch for branch-2.7. Progress bars in Web UI always at 100% due to non-US locale --- Key: YARN-2123 URL: https://issues.apache.org/jira/browse/YARN-2123 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.3.0 Reporter: Johannes Simon Assignee: Akira AJISAKA Attachments: NaN_after_launching_RM.png, YARN-2123-001.patch, YARN-2123-002.patch, YARN-2123-003.patch, YARN-2123-004.patch, YARN-2123-branch-2.7.001.patch, fair-scheduler-ajisaka.xml, screenshot-noPatch.png, screenshot-patch.png, screenshot.png, yarn-site-ajisaka.xml In our cluster setup, the YARN web UI always shows progress bars at 100% (see screenshot, progress of the reduce step is roughly at 32.82%). I opened the HTML source code to check (also see screenshot), and it seems the problem is that it uses a comma as decimal mark, where most browsers expect a dot for floating-point numbers. This could possibly be due to localized number formatting being used in the wrong place, which would also explain why this bug is not always visible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
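The suspected mechanism is easy to demonstrate: Java's locale-sensitive formatters emit a comma as the decimal mark under many European locales, which browsers reject when parsing CSS widths such as width:32.82%. A minimal, self-contained illustration (not the actual webapp code):
{code}
import java.util.Locale;

public class LocaleDemo {
  public static void main(String[] args) {
    float progress = 32.82f;
    // Comma decimal mark - breaks when used inside a CSS width value:
    System.out.println(String.format(Locale.GERMANY, "%.2f%%", progress)); // 32,82%
    // Explicit US locale produces the dot that HTML/CSS expects:
    System.out.println(String.format(Locale.US, "%.2f%%", progress));      // 32.82%
  }
}
{code}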
[jira] [Commented] (YARN-3549) use JNI-based FileStatus implementation from io.nativeio.NativeIO.POSIX#getFstat instead of shell-based implementation from RawLocalFileSystem in checkLocalDir.
[ https://issues.apache.org/jira/browse/YARN-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528392#comment-14528392 ] Junping Du commented on YARN-3549: -- Hi [~zxu] and [~cnauroth], this sounds like a change that needs to happen in the hadoop-common project instead of YARN. Shall we move this JIRA from YARN to HADOOP? use JNI-based FileStatus implementation from io.nativeio.NativeIO.POSIX#getFstat instead of shell-based implementation from RawLocalFileSystem in checkLocalDir. Key: YARN-3549 URL: https://issues.apache.org/jira/browse/YARN-3549 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Use the JNI-based FileStatus implementation from io.nativeio.NativeIO.POSIX#getFstat instead of the shell-based implementation from RawLocalFileSystem in checkLocalDir. As discussed in YARN-3491, the shell-based implementation of getPermission runs the shell command ls -ld to get the permission, which takes 4 or 5 ms (very slow). We should switch to io.nativeio.NativeIO.POSIX#getFstat as the implementation in RawLocalFileSystem to get rid of the shell-based FileStatus implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
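A minimal sketch of the proposed direction (treat the exact API shape as an assumption; it requires the Hadoop native library to be loaded): obtain the file status via the JNI-based fstat call rather than forking ls -ld per file.
{code}
import java.io.FileDescriptor;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.hadoop.io.nativeio.NativeIO;

public class FstatDemo {
  public static void main(String[] args) throws IOException {
    try (FileInputStream in = new FileInputStream(args[0])) {
      FileDescriptor fd = in.getFD();
      // One JNI call instead of a ~4-5 ms shell fork per file:
      NativeIO.POSIX.Stat stat = NativeIO.POSIX.getFstat(fd);
      System.out.println("owner=" + stat.getOwner() + " mode=" + stat.getMode());
    }
  }
}
{code}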
[jira] [Commented] (YARN-3552) RM Web UI shows -1 running containers for completed apps
[ https://issues.apache.org/jira/browse/YARN-3552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528422#comment-14528422 ] Jason Lowe commented on YARN-3552: -- +1 lgtm. Committing this. RM Web UI shows -1 running containers for completed apps Key: YARN-3552 URL: https://issues.apache.org/jira/browse/YARN-3552 Project: Hadoop YARN Issue Type: Bug Components: webapp Reporter: Rohith Assignee: Rohith Priority: Trivial Labels: newbie Attachments: 0001-YARN-3552.patch, 0001-YARN-3552.patch, 0002-YARN-3552.patch, yarn-3352.PNG In RMServerUtils, the default values are negative numbers, which results in the RM web UI also displaying negative numbers.
{code}
public static final ApplicationResourceUsageReport
    DUMMY_APPLICATION_RESOURCE_USAGE_REPORT =
        BuilderUtils.newApplicationResourceUsageReport(-1, -1,
            Resources.createResource(-1, -1), Resources.createResource(-1, -1),
            Resources.createResource(-1, -1), 0, 0);
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1912) ResourceLocalizer started without any jvm memory control
[ https://issues.apache.org/jira/browse/YARN-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529728#comment-14529728 ] Hadoop QA commented on YARN-1912: -
| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 15m 38s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 7m 59s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 47s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 1m 32s | There were no new checkstyle issues. |
| {color:red}-1{color} | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install | 1m 35s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 2m 26s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | yarn tests | 0m 23s | Tests passed in hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests | 5m 53s | Tests passed in hadoop-yarn-server-nodemanager. |
| | | 46m 12s | |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12730662/YARN-1912.003.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 90b3845 |
| whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7717/artifact/patchprocess/whitespace.txt |
| hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/7717/artifact/patchprocess/testrun_hadoop-yarn-api.txt |
| hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7717/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7717/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7717/console |
This message was automatically generated. ResourceLocalizer started without any jvm memory control Key: YARN-1912 URL: https://issues.apache.org/jira/browse/YARN-1912 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: stanley shi Assignee: Masatake Iwasaki Attachments: YARN-1912-0.patch, YARN-1912-1.patch, YARN-1912.003.patch In LinuxContainerExecutor.java#startLocalizer, it does not specify any -Xmx configuration in the command; this causes the ResourceLocalizer to be started with the default memory setting. On server-level hardware, it will use 25% of the system memory as the max heap size, which will cause memory issues in some cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2172) Suspend/Resume Hadoop Jobs
[ https://issues.apache.org/jira/browse/YARN-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2172: --- Labels: BB2015-05-TBR hadoop jobs resume suspend (was: hadoop jobs resume suspend) Suspend/Resume Hadoop Jobs -- Key: YARN-2172 URL: https://issues.apache.org/jira/browse/YARN-2172 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager, webapp Affects Versions: 2.2.0 Environment: CentOS 6.5, Hadoop 2.2.0 Reporter: Richard Chen Labels: BB2015-05-TBR, hadoop, jobs, resume, suspend Attachments: Hadoop Job Suspend Resume Design.docx, hadoop_job_suspend_resume.patch Original Estimate: 336h Remaining Estimate: 336h In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower priority than jobs running outside Hadoop YARN, such as HBase. To give way to the higher-priority jobs, a user or a cluster-level resource scheduling service should be able to suspend and/or resume particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, their already allocated and running task containers continue to run until completion or active preemption by other means, but no new containers are allocated to them. In contrast, when suspended jobs are resumed, they continue from their previous progress and have new task containers allocated to complete the rest of the job. My team has completed an implementation, and our tests showed it works in a rather solid and convenient way. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3259) FairScheduler: Update to fairShare could be triggered early on node events instead of waiting for update interval
[ https://issues.apache.org/jira/browse/YARN-3259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3259: --- Labels: BB2015-05-TBR (was: ) FairScheduler: Update to fairShare could be triggered early on node events instead of waiting for update interval -- Key: YARN-3259 URL: https://issues.apache.org/jira/browse/YARN-3259 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Labels: BB2015-05-TBR Attachments: YARN-3259.001.patch Instead of waiting for the update interval unconditionally, we can trigger early updates on important events, e.g. node join and leave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2069: --- Labels: BB2015-05-TBR (was: ) CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Labels: BB2015-05-TBR Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-10.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch, YARN-2069-trunk-6.patch, YARN-2069-trunk-7.patch, YARN-2069-trunk-8.patch, YARN-2069-trunk-9.patch This is different from (even if related to, and likely sharing code with) YARN-2113. YARN-2113 focuses on making sure that even if a queue has its guaranteed capacity, its individual users are treated in line with their limits, irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2981) DockerContainerExecutor must support a Cluster-wide default Docker image
[ https://issues.apache.org/jira/browse/YARN-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2981: --- Labels: BB2015-05-TBR (was: ) DockerContainerExecutor must support a Cluster-wide default Docker image Key: YARN-2981 URL: https://issues.apache.org/jira/browse/YARN-2981 Project: Hadoop YARN Issue Type: Bug Reporter: Abin Shahab Assignee: Abin Shahab Labels: BB2015-05-TBR Attachments: YARN-2981.patch, YARN-2981.patch, YARN-2981.patch, YARN-2981.patch This allows the YARN administrator to add a cluster-wide default Docker image that will be used when there is no per-job override of the Docker image. With this feature, it would be convenient for newer applications like Slider to launch inside a cluster-default Docker container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3282) DockerContainerExecutor should support environment variables setting
[ https://issues.apache.org/jira/browse/YARN-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3282: --- Labels: BB2015-05-TBR (was: ) DockerContainerExecutor should support environment variables setting Key: YARN-3282 URL: https://issues.apache.org/jira/browse/YARN-3282 Project: Hadoop YARN Issue Type: Improvement Components: applications, nodemanager Affects Versions: 2.6.0 Reporter: Leitao Guo Labels: BB2015-05-TBR Attachments: YARN-3282.01.patch Currently, DockerContainerExecutor mounts yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs into containers automatically. However, applications may need to set more environment variables before launching containers. In our applications, as in the following command, we need to attach several directories and set some environment variables for docker containers.
{code}
docker run -i -t -v /data/transcode:/data/tmp -v /etc/qcs:/etc/qcs -v /mnt:/mnt \
  -e VTC_MQTYPE=rabbitmq -e VTC_APP=ugc -e VTC_LOCATION=sh -e VTC_RUNTIME=vtc \
  sequenceiq/hadoop-docker:2.6.0 /bin/bash
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3176) In Fair Scheduler, child queue should inherit maxApp from its parent
[ https://issues.apache.org/jira/browse/YARN-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3176: --- Labels: BB2015-05-TBR (was: ) In Fair Scheduler, child queue should inherit maxApp from its parent Key: YARN-3176 URL: https://issues.apache.org/jira/browse/YARN-3176 Project: Hadoop YARN Issue Type: Bug Reporter: Siqi Li Assignee: Siqi Li Labels: BB2015-05-TBR Attachments: YARN-3176.v1.patch If a child queue does not have a maxRunningApps limit, it will use queueMaxAppsDefault. This behavior is not quite right, since queueMaxAppsDefault is normally a small number, whereas some parent queues do have maxRunningApps set to more than the default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
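A hypothetical fair-scheduler allocation file illustrating the reported behavior (queue names invented for the example): the child sets no maxRunningApps, so today it falls back to the small queueMaxAppsDefault instead of inheriting its parent's higher limit.
{code:xml}
<allocations>
  <queueMaxAppsDefault>5</queueMaxAppsDefault>
  <queue name="parent">
    <maxRunningApps>100</maxRunningApps>
    <!-- Effective limit today: 5 (the default), not the parent's 100 -->
    <queue name="child"/>
  </queue>
</allocations>
{code}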
[jira] [Updated] (YARN-1912) ResourceLocalizer started without any jvm memory control
[ https://issues.apache.org/jira/browse/YARN-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-1912: --- Labels: BB2015-05-TBR (was: ) ResourceLocalizer started without any jvm memory control Key: YARN-1912 URL: https://issues.apache.org/jira/browse/YARN-1912 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: stanley shi Assignee: Masatake Iwasaki Labels: BB2015-05-TBR Attachments: YARN-1912-0.patch, YARN-1912-1.patch, YARN-1912.003.patch In LinuxContainerExecutor.java#startLocalizer, it does not specify any -Xmx configuration in the command; this causes the ResourceLocalizer to be started with the default memory setting. On server-level hardware, it will use 25% of the system memory as the max heap size, which will cause memory issues in some cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3554: --- Labels: BB2015-05-TBR newbie (was: newbie) Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: BB2015-05-TBR, newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000 msec, or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
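A possible yarn-site.xml override while the default is under discussion (the 3-minute value is illustrative, not a value proposed in this JIRA):
{code:xml}
<property>
  <name>yarn.client.nodemanager-connect.max-wait-ms</name>
  <value>180000</value> <!-- 3 minutes instead of the 15-minute default -->
</property>
{code}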
[jira] [Updated] (YARN-2618) Avoid over-allocation of disk resources
[ https://issues.apache.org/jira/browse/YARN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2618: --- Labels: BB2015-05-TBR (was: ) Avoid over-allocation of disk resources --- Key: YARN-2618 URL: https://issues.apache.org/jira/browse/YARN-2618 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wei Yan Assignee: Wei Yan Labels: BB2015-05-TBR Attachments: YARN-2618-1.patch, YARN-2618-2.patch, YARN-2618-3.patch, YARN-2618-4.patch, YARN-2618-5.patch, YARN-2618-6.patch, YARN-2618-7.patch Subtask of YARN-2139. This should include:
- Add API support for introducing disk I/O as the 3rd type of resource.
- NM should report this information to the RM.
- RM should consider this to avoid over-allocation.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3562) unit tests failures and issues found from findbug from earlier ATS checkins
[ https://issues.apache.org/jira/browse/YARN-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529837#comment-14529837 ] Hadoop QA commented on YARN-3562: -
| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 48s | Pre-patch YARN-2928 compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 6 new or modified test files. |
| {color:green}+1{color} | javac | 7m 38s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 43s | There were no new javadoc warning messages. |
| {color:red}-1{color} | release audit | 0m 58s | The applied patch generated 16 release audit warnings. |
| {color:red}-1{color} | checkstyle | 1m 34s | The applied patch generated 1 new checkstyle issues (total was 9, now 10). |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 35s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 2m 24s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | yarn tests | 52m 40s | Tests passed in hadoop-yarn-server-resourcemanager. |
| {color:green}+1{color} | yarn tests | 2m 33s | Tests passed in hadoop-yarn-server-tests. |
| {color:green}+1{color} | yarn tests | 0m 21s | Tests passed in hadoop-yarn-server-timelineservice. |
| | | 94m 50s | |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12730677/YARN-3562-YARN-2928.002.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | YARN-2928 / 557a395 |
| Release Audit | https://builds.apache.org/job/PreCommit-YARN-Build/7720/artifact/patchprocess/patchReleaseAuditProblems.txt |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7720/artifact/patchprocess/diffcheckstylehadoop-yarn-server-timelineservice.txt |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7720/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| hadoop-yarn-server-tests test log | https://builds.apache.org/job/PreCommit-YARN-Build/7720/artifact/patchprocess/testrun_hadoop-yarn-server-tests.txt |
| hadoop-yarn-server-timelineservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/7720/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7720/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7720/console |
This message was automatically generated. unit tests failures and issues found from findbug from earlier ATS checkins --- Key: YARN-3562 URL: https://issues.apache.org/jira/browse/YARN-3562 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Naganarasimha G R Priority: Minor Labels: BB2015-05-TBR Attachments: YARN-3562-YARN-2928.001.patch, YARN-3562-YARN-2928.002.patch *Issues reported from MAPREDUCE-6337*: A bunch of MR unit tests are failing on our branch whenever the mini YARN cluster needs to bring up multiple node managers. For example, see https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5472/testReport/org.apache.hadoop.mapred/TestClusterMapReduceTestCase/testMapReduceRestarting/ It is because the NMCollectorService is using a fixed port for the RPC (8048). *Issues reported from YARN-3044*: Test case failures and tooling (FindBugs, checkstyle) issues found:
# findbugs issue: Comparison of String objects using == or != in ResourceTrackerService.updateAppCollectorsMap
# findbugs issue: Boxing/unboxing to parse a primitive in RMTimelineCollectorManager.postPut. Called method Long.longValue(); should call Long.parseLong(String) instead.
# findbugs issue: DM_DEFAULT_ENCODING - called method new java.io.FileWriter(String, boolean) at FileSystemTimelineWriterImpl.java:\[line 86\]
# test failures: hadoop.yarn.server.resourcemanager.TestAppManager, hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions, hadoop.yarn.server.resourcemanager.TestClientRMService, hadoop.yarn.server.resourcemanager.logaggregationstatus.TestRMAppLogAggregationStatus; refer https://builds.apache.org/job/PreCommit-YARN-Build/7534/testReport/
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3381) A typographical error in InvalidStateTransitonException
[ https://issues.apache.org/jira/browse/YARN-3381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3381: --- Labels: BB2015-05-TBR (was: ) A typographical error in InvalidStateTransitonException - Key: YARN-3381 URL: https://issues.apache.org/jira/browse/YARN-3381 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Xiaoshuang LU Assignee: Brahma Reddy Battula Labels: BB2015-05-TBR Attachments: YARN-3381-002.patch, YARN-3381-003.patch, YARN-3381.patch It appears that InvalidStateTransitonException should be InvalidStateTransitionException; "transition" was misspelled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion.
[ https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3385: --- Labels: BB2015-05-TBR (was: ) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion. --- Key: YARN-3385 URL: https://issues.apache.org/jira/browse/YARN-3385 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Labels: BB2015-05-TBR Attachments: YARN-3385.000.patch, YARN-3385.001.patch, YARN-3385.002.patch, YARN-3385.003.patch Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion (Op.delete). The race condition is similar to YARN-3023: since the race condition exists for ZK node creation, it should also exist for ZK node deletion. We saw this issue with the following stack trace:
{code}
2015-03-17 19:18:58,958 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
    at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
    at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:745)
2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
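A minimal sketch of the usual fix for this class of race (the general approach, not necessarily the committed patch): when a delete races with an earlier retried operation that actually succeeded on the server, tolerate NoNodeException, since the node being absent is the desired end state.
{code}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class SafeDelete {
  // Deletes the znode at path, treating "already gone" as success, analogous
  // to how the create-side race was handled in YARN-3023.
  static void deleteIfExists(ZooKeeper zk, String path) throws Exception {
    try {
      zk.delete(path, -1); // -1 matches any version
    } catch (KeeperException.NoNodeException e) {
      // Node already deleted, possibly by our own earlier timed-out attempt.
    }
  }
}
{code}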
[jira] [Updated] (YARN-3580) [JDK 8] TestClientRMService.testGetLabelsToNodes fails
[ https://issues.apache.org/jira/browse/YARN-3580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3580: --- Labels: BB2015-05-TBR (was: ) [JDK 8] TestClientRMService.testGetLabelsToNodes fails -- Key: YARN-3580 URL: https://issues.apache.org/jira/browse/YARN-3580 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 2.8.0 Environment: JDK 8 Reporter: Robert Kanter Assignee: Robert Kanter Labels: BB2015-05-TBR Attachments: YARN-3580.001.patch When using JDK 8, {{TestClientRMService.testGetLabelsToNodes}} fails:
{noformat}
java.lang.AssertionError: null
    at org.junit.Assert.fail(Assert.java:86)
    at org.junit.Assert.assertTrue(Assert.java:41)
    at org.junit.Assert.assertTrue(Assert.java:52)
    at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService.testGetLabelsToNodes(TestClientRMService.java:1499)
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3582) NPE in WebAppProxyServlet
[ https://issues.apache.org/jira/browse/YARN-3582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3582: --- Labels: BB2015-05-TBR (was: ) NPE in WebAppProxyServlet - Key: YARN-3582 URL: https://issues.apache.org/jira/browse/YARN-3582 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Labels: BB2015-05-TBR Attachments: YARN-3582.1.patch
{code}
HTTP ERROR 500
Problem accessing /proxy. Reason: INTERNAL_SERVER_ERROR
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:245)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
    at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
    at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
    at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
    at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3513) Remove unused variables in ContainersMonitorImpl
[ https://issues.apache.org/jira/browse/YARN-3513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3513: --- Labels: BB2015-05-TBR newbie (was: newbie) Remove unused variables in ContainersMonitorImpl Key: YARN-3513 URL: https://issues.apache.org/jira/browse/YARN-3513 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Naganarasimha G R Assignee: Naganarasimha G R Priority: Trivial Labels: BB2015-05-TBR, newbie Attachments: YARN-3513.20150421-1.patch, YARN-3513.20150503-1.patch, YARN-3513.20150506-1.patch The class member {{private final Context context;}} and some local variables in MonitoringThread.run() ({{vmemStillInUsage}} and {{pmemStillInUsage}}) are never read, only updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3134: --- Labels: BB2015-05-TBR (was: ) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend --- Key: YARN-3134 URL: https://issues.apache.org/jira/browse/YARN-3134 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Li Lu Labels: BB2015-05-TBR Attachments: SettingupPhoenixstorageforatimelinev2end-to-endtest.pdf, YARN-3134-040915_poc.patch, YARN-3134-041015_poc.patch, YARN-3134-041415_poc.patch, YARN-3134-042115.patch, YARN-3134-042715.patch, YARN-3134-YARN-2928.001.patch, YARN-3134-YARN-2928.002.patch, YARN-3134-YARN-2928.003.patch, YARN-3134-YARN-2928.004.patch, YARN-3134DataSchema.pdf Quote the introduction on the Phoenix web page:
{code}
Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.
{code}
It may simplify how our implementation reads/writes data from/to HBase, and makes it easy to build indexes and compose complex queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
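To make the described access pattern concrete, a minimal sketch of querying HBase through Phoenix's JDBC driver (the connection-string format is Phoenix's; the table and columns are invented for illustration):
{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixQuery {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
         Statement st = conn.createStatement();
         // Phoenix compiles this SQL into a series of HBase scans.
         ResultSet rs = st.executeQuery(
             "SELECT entity_id, created_time FROM timeline_entity LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}
{code}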
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529840#comment-14529840 ] Jian He commented on YARN-1680: --- My thinking is that even if we do the headroom calculation on the client side, the scheduler still requires some corresponding per-app logic for the headroom calculation. And that scheduler piece of logic may end up duplicating a subset of the client-side logic, plus corresponding protocol changes. In that sense, I think it's simpler to do this inside the scheduler. Doing the calculation in one place also gives a more accurate snapshot than doing the calculations in multiple places. Also, changing MapReduce to use AMRMClient is non-trivial work. availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory. -- Key: YARN-1680 URL: https://issues.apache.org/jira/browse/YARN-1680 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.2.0, 2.3.0 Environment: SuSE 11 SP2 + Hadoop-2.3 Reporter: Rohith Assignee: Craig Welch Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch There are 4 NodeManagers with 8GB each, so total cluster capacity is 32GB. Cluster slow start is set to 1. A running job's reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) became unstable (3 maps got killed), so the MRAppMaster blacklisted the unstable NodeManager (NM-4). All reducer tasks are running in the cluster now. The MRAppMaster does not preempt the reducers because the headroom used for the reducer-preemption calculation includes the blacklisted node's memory. This makes jobs hang forever (the ResourceManager does not assign any new containers on blacklisted nodes, but the availableResource it returns considers cluster free memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3523) Cleanup ResourceManagerAdministrationProtocol interface audience
[ https://issues.apache.org/jira/browse/YARN-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3523: --- Labels: BB2015-05-TBR newbie (was: newbie) Cleanup ResourceManagerAdministrationProtocol interface audience Key: YARN-3523 URL: https://issues.apache.org/jira/browse/YARN-3523 Project: Hadoop YARN Issue Type: Bug Components: client, resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Labels: BB2015-05-TBR, newbie Attachments: YARN-3523.20150422-1.patch, YARN-3523.20150504-1.patch, YARN-3523.20150505-1.patch I noticed that ResourceManagerAdministrationProtocol has @Private audience for the class and @Public audience for its methods. That doesn't make sense to me. We should make the class audience and the method audience consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3562) unit tests failures and issues found from findbug from earlier ATS checkins
[ https://issues.apache.org/jira/browse/YARN-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3562: --- Labels: BB2015-05-TBR (was: ) unit tests failures and issues found from findbug from earlier ATS checkins --- Key: YARN-3562 URL: https://issues.apache.org/jira/browse/YARN-3562 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Naganarasimha G R Priority: Minor Labels: BB2015-05-TBR Attachments: YARN-3562-YARN-2928.001.patch, YARN-3562-YARN-2928.002.patch *Issues reported from MAPREDUCE-6337*: A bunch of MR unit tests are failing on our branch whenever the mini YARN cluster needs to bring up multiple node managers. For example, see https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5472/testReport/org.apache.hadoop.mapred/TestClusterMapReduceTestCase/testMapReduceRestarting/ It is because the NMCollectorService is using a fixed port for the RPC (8048). *Issues reported from YARN-3044*: Test case failures and tool (findbugs, checkstyle) issues found: # findbugs issue: Comparison of String objects using == or != in ResourceTrackerService.updateAppCollectorsMap # findbugs issue: Boxing/unboxing to parse a primitive in RMTimelineCollectorManager.postPut: called method Long.longValue(); should call Long.parseLong(String) instead. # findbugs issue: DM_DEFAULT_ENCODING, called method new java.io.FileWriter(String, boolean) at FileSystemTimelineWriterImpl.java:\[line 86\] # Test failures in hadoop.yarn.server.resourcemanager.TestAppManager, hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions, hadoop.yarn.server.resourcemanager.TestClientRMService and hadoop.yarn.server.resourcemanager.logaggregationstatus.TestRMAppLogAggregationStatus; refer to https://builds.apache.org/job/PreCommit-YARN-Build/7534/testReport/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
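For reference, the canonical fixes for the three findbugs patterns listed above look roughly like this (a hedged sketch; the variable names are illustrative, not the patch):
{code}
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class FindbugsFixSketch {
  static void examples(String a, String b, String s, String path) throws Exception {
    // String comparison with == / !=: compare contents, not references.
    boolean same = a.equals(b);          // instead of a == b

    // Boxing/unboxing: parse straight to a primitive.
    long v = Long.parseLong(s);          // instead of Long.valueOf(s).longValue()

    // DM_DEFAULT_ENCODING: name the charset instead of new FileWriter(path, true).
    try (Writer w = new OutputStreamWriter(
        new FileOutputStream(path, true), StandardCharsets.UTF_8)) {
      w.write(same + " " + v);
    }
  }
}
{code}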
[jira] [Updated] (YARN-3301) Fix the format issue of the new RM web UI and AHS web UI
[ https://issues.apache.org/jira/browse/YARN-3301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3301: --- Labels: BB2015-05-TBR (was: ) Fix the format issue of the new RM web UI and AHS web UI Key: YARN-3301 URL: https://issues.apache.org/jira/browse/YARN-3301 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Labels: BB2015-05-TBR Attachments: Screen Shot 2015-04-21 at 5.09.25 PM.png, Screen Shot 2015-04-21 at 5.38.39 PM.png, YARN-3301.1.patch, YARN-3301.2.patch, YARN-3301.3.patch, YARN-3301.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3301) Fix the format issue of the new RM web UI and AHS web UI
[ https://issues.apache.org/jira/browse/YARN-3301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-3301: Attachment: YARN-3301.4.patch Fix the format issue of the new RM web UI and AHS web UI Key: YARN-3301 URL: https://issues.apache.org/jira/browse/YARN-3301 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: Screen Shot 2015-04-21 at 5.09.25 PM.png, Screen Shot 2015-04-21 at 5.38.39 PM.png, YARN-3301.1.patch, YARN-3301.2.patch, YARN-3301.3.patch, YARN-3301.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529591#comment-14529591 ] Naganarasimha G R commented on YARN-3554: - Hi [~vinodkv] [~jlowe], so would configuring yarn.client.nodemanager-connect.max-wait-ms as 1 minute be better? Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 900000 msec, i.e. 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
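If the value were lowered, a client-side override to the proposed 1 minute is a one-liner; a minimal sketch, using the property key quoted above (only the key is from the issue, the rest is illustrative):
{code}
import org.apache.hadoop.conf.Configuration;

public class NmConnectWaitSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // 60 000 ms = the 1 minute proposed in the comment above.
    conf.setLong("yarn.client.nodemanager-connect.max-wait-ms", 60 * 1000L);
  }
}
{code}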
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529644#comment-14529644 ] Craig Welch commented on YARN-1680: --- bq. Please leave out the head-room concerns w.r.t node-labels. IIRC, we had tickets at YARN-796 tracking that. It is very likely a completely different solution, so. I'm not sure that's so - there is already a process of calculating headroom for labels associated with an application, and the above is an extension of that to blacklisted nodes to handle label cases. If we leave it out, then the solution won't work for node-labels; it can be made to do so, so that would be a loss. bq. When I said node-labels above, I meant partitions. Clearly the problem and the corresponding solution will likely be very similar for node-constraints (one type of node-labels). After all, blacklisting is a type of (anti) node-constraint. It could be modeled that way, but then it would be qualitatively different from the solution for non-label cases, which is not a good thing... bq. There is no notion of a cluster-level blacklisting in YARN. We have notions of unhealthy/lost/decommissioned nodes in a cluster. This is what I am referring to when I say: bq. addition/removal at the cluster level I'm not suggesting/referring to anything other than nodes entering/leaving the cluster. bq. Coming to the app-level blacklisting, clearly, the solution proposed is better than dead-locks. But blindly reducing the resources corresponding to blacklisted nodes will result in under-utilization (sometimes massively) and over-conservative scheduling requests by apps. So, that's the point of the recommended approach. The idea is to detect when it is necessary to recalculate the impact of the blacklisting on app headroom, which is when either the blacklisting from the app has changed or the node composition of the cluster has changed (each of which should be relatively infrequent, certainly in relation to headroom calculation), and at that time to accurately calculate the impact by adding into the deduction only the resource value of blacklisted nodes which actually exist. It isn't blindly reducing resources, it's doing it accurately, and should prevent both deadlocks and under-utilization. bq. One way to resolve this is to get the apps (or optionally in the AMRMClient library) to deduct the resource unusable on blacklisted nodes It could be moved into the AMs or the client library, but then they would have to do the same sort of thing, and the logic would need to be duplicated among the AMs or would only be available to those which use the library (do they all?). It's worth considering whether it can be made to cover them all via the library, but I'm not sure this isn't something which should be handled as part of the headroom calculation in the RM, as the RM is meant to provide this accurately and is otherwise aware of the blacklist. Which suggests that we already have the blacklist for the application in the RM, available to the scheduler (I'm not sure why that wasn't obvious to me before...); that does appear to be the case, and it therefore drops the concerns about adding it - it's already there... availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
-- Key: YARN-1680 URL: https://issues.apache.org/jira/browse/YARN-1680 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.2.0, 2.3.0 Environment: SuSE 11 SP2 + Hadoop-2.3 Reporter: Rohith Assignee: Craig Welch Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster slow start is set to 1. A running job's reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) became unstable (3 map tasks got killed), so MRAppMaster blacklisted the unstable NodeManager (NM-4). All reducer tasks are now running in the cluster. MRAppMaster does not preempt the reducers because the headroom used for the reducer preemption calculation includes blacklisted nodes' memory. This makes jobs hang forever (the ResourceManager does not assign any new containers on blacklisted nodes but returns an availableResources value that reflects the whole cluster's free memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3580) [JDK 8] TestClientRMService.testGetLabelsToNodes fails
[ https://issues.apache.org/jira/browse/YARN-3580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-3580: Attachment: YARN-3580.001.patch When setting node labels, port 0 is considered a wildcard port, and the {{CommonNodeLabelsManager}} applies the given label to all NMs that previously had a label on that host. Due to the iteration ordering being different between JDK 7 and JDK 8, this was changing the labeling from:
{noformat:title=JDK7}
z  host1:1 host3:1
y  host2:0 host3:0
x  host1:0
{noformat}
to
{noformat:title=JDK8}
x  host1:1 host1:0
y  host3:0 host2:0
z  host3:1
{noformat}
The patch fixes the problem by using different port numbers. It also cleans the test up a little. [JDK 8] TestClientRMService.testGetLabelsToNodes fails -- Key: YARN-3580 URL: https://issues.apache.org/jira/browse/YARN-3580 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 2.8.0 Environment: JDK 8 Reporter: Robert Kanter Assignee: Robert Kanter Attachments: YARN-3580.001.patch When using JDK 8, {{TestClientRMService.testGetLabelsToNodes}} fails:
{noformat}
java.lang.AssertionError: null
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService.testGetLabelsToNodes(TestClientRMService.java:1499)
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
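The fragility behind this failure is easy to reproduce: any test that depends on the iteration order of a hash-based collection can behave differently across JDK releases. A minimal illustration, not the test's actual data structures:
{code}
import java.util.HashMap;
import java.util.Map;

public class IterationOrderSketch {
  public static void main(String[] args) {
    Map<String, String> labelsToNode = new HashMap<>();
    labelsToNode.put("x", "host1:0");
    labelsToNode.put("y", "host2:0");
    labelsToNode.put("z", "host3:1");
    // HashMap makes no ordering guarantee; JDK 7 and JDK 8 changed the internal
    // hashing, so this loop may visit entries in a different order on each JDK.
    for (Map.Entry<String, String> e : labelsToNode.entrySet()) {
      System.out.println(e.getKey() + " -> " + e.getValue());
    }
  }
}
{code}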
[jira] [Updated] (YARN-3582) NPE in WebAppProxyServlet
[ https://issues.apache.org/jira/browse/YARN-3582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3582: -- Attachment: YARN-3582.1.patch upload a patch to fix the npe NPE in WebAppProxyServlet - Key: YARN-3582 URL: https://issues.apache.org/jira/browse/YARN-3582 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-3582.1.patch {code} HTTP ERROR 500 Problem accessing /proxy. Reason: INTERNAL_SERVER_ERROR Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:245) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3557) Support Intel Trusted Execution Technology(TXT) in YARN scheduler
[ https://issues.apache.org/jira/browse/YARN-3557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529827#comment-14529827 ] Naganarasimha G R commented on YARN-3557: - That's an interesting point. {{Labels GPU, FPGA, LINUX, WINDOWS}} are more like constraints of a node, for which a new JIRA is coming up, so maybe once it comes in we need to support it in such a way that constraints should be supported to be added from both RM and NM and partitions (existing labels) should be allowed either by RM or NM... thoughts? Support Intel Trusted Execution Technology(TXT) in YARN scheduler - Key: YARN-3557 URL: https://issues.apache.org/jira/browse/YARN-3557 Project: Hadoop YARN Issue Type: New Feature Reporter: Dian Fu Attachments: Support TXT in YARN high level design doc.pdf Intel TXT defines platform-level enhancements that provide the building blocks for creating trusted platforms. A TXT aware YARN scheduler can schedule security sensitive jobs on TXT enabled nodes only. YARN-2492 provides the capacity to restrict YARN applications to run only on cluster nodes that have a specified node label. This is a good mechanism that can be utilized for a TXT aware YARN scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3212) RMNode State Transition Update with DECOMMISSIONING state
[ https://issues.apache.org/jira/browse/YARN-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3212: --- Labels: BB2015-05-TBR (was: ) RMNode State Transition Update with DECOMMISSIONING state - Key: YARN-3212 URL: https://issues.apache.org/jira/browse/YARN-3212 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Junping Du Assignee: Junping Du Labels: BB2015-05-TBR Attachments: RMNodeImpl - new.png, YARN-3212-v1.patch, YARN-3212-v2.patch, YARN-3212-v3.patch As proposed in YARN-914, a new state, “DECOMMISSIONING”, will be added; a node can transition into it from the “running” state, triggered by a new event - “decommissioning”. This new state can transition to “decommissioned” on Resource_Update if there are no running apps on this NM, when the NM reconnects after restart, or when it receives a DECOMMISSIONED event (after timeout from the CLI). In addition, it can go back to “running” if the user decides to cancel the previous decommission by calling recommission on the same node. The reaction to other events is similar to the RUNNING state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1572) Low chance to hit NPE issue in AppSchedulingInfo#allocateNodeLocal
[ https://issues.apache.org/jira/browse/YARN-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-1572: --- Labels: BB2015-05-TBR (was: ) Low chance to hit NPE issue in AppSchedulingInfo#allocateNodeLocal -- Key: YARN-1572 URL: https://issues.apache.org/jira/browse/YARN-1572 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Labels: BB2015-05-TBR Attachments: YARN-1572-branch-2.3.0.001.patch, YARN-1572-log.tar.gz, conf.tar.gz, log.tar.gz We have a low chance of hitting an NPE in allocateNodeLocal when running benchmarks (hit 4 out of 20 times).
{code}
2014-07-31 04:18:19,653 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_1406794589275_0001_01_21 of capacity memory:1024, vCores:1 on host datanode10:57281, which has 6 containers, memory:6144, vCores:6 used and memory:2048, vCores:2 available after allocation
2014-07-31 04:18:19,654 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocateNodeLocal(AppSchedulingInfo.java:311)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:268)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.allocate(FiCaSchedulerApp.java:136)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainer(FifoScheduler.java:683)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignNodeLocalContainers(FifoScheduler.java:602)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainersOnNode(FifoScheduler.java:560)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainers(FifoScheduler.java:488)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.nodeUpdate(FifoScheduler.java:729)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:774)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:101)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:599)
at java.lang.Thread.run(Thread.java:662)
2014-07-31 04:18:19,655 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS
[ https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-160: -- Labels: BB2015-05-TBR (was: ) nodemanagers should obtain cpu/memory values from underlying OS --- Key: YARN-160 URL: https://issues.apache.org/jira/browse/YARN-160 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Alejandro Abdelnur Assignee: Varun Vasudev Labels: BB2015-05-TBR Attachments: apache-yarn-160.0.patch, apache-yarn-160.1.patch, apache-yarn-160.2.patch, apache-yarn-160.3.patch As mentioned in YARN-2 *NM memory and CPU configs*: currently these values come from the NM's config, but we should be able to obtain them from the OS (i.e., in the case of Linux, from /proc/meminfo and /proc/cpuinfo). As this is highly OS dependent, we should have an interface that obtains this information. In addition, implementations of this interface should be able to specify a mem/cpu offset (an amount of mem/cpu not to be made available as YARN resources); this would allow reserving mem/cpu for the OS and other services outside of YARN containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
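For the Linux case, the probing such an interface would wrap is straightforward; a minimal sketch of reading MemTotal (parsing details are assumptions, and real code would do the same for /proc/cpuinfo):
{code}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ProcMemInfoSketch {
  /** Returns MemTotal from /proc/meminfo in kB, or -1 if not found. Linux only. */
  static long totalMemoryKb() throws IOException {
    try (BufferedReader r = new BufferedReader(new FileReader("/proc/meminfo"))) {
      String line;
      while ((line = r.readLine()) != null) {
        if (line.startsWith("MemTotal:")) {
          // The line looks like "MemTotal:       16326428 kB".
          String[] parts = line.split("\\s+");
          return Long.parseLong(parts[1]);
        }
      }
    }
    return -1;
  }

  public static void main(String[] args) throws IOException {
    System.out.println("MemTotal = " + totalMemoryKb() + " kB");
  }
}
{code}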
[jira] [Updated] (YARN-868) YarnClient should set the service address in tokens returned by getRMDelegationToken()
[ https://issues.apache.org/jira/browse/YARN-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-868: -- Labels: BB2015-05-TBR (was: ) YarnClient should set the service address in tokens returned by getRMDelegationToken() -- Key: YARN-868 URL: https://issues.apache.org/jira/browse/YARN-868 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Assignee: Varun Saxena Labels: BB2015-05-TBR Attachments: YARN-868.patch Either the client should set this information into the token or the client layer should expose an api that returns the service address. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
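On the first option, Hadoop's SecurityUtil already has a helper for stamping a service address onto a token, so the client-side change could be as small as the sketch below (the method wrapping it and the rmAddress parameter are illustrative):
{code}
import java.net.InetSocketAddress;

import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.security.token.Token;

public class TokenServiceSketch {
  /** Stamp the RM's address into the token's service field before returning it. */
  static void stampService(Token<?> rmDelegationToken, InetSocketAddress rmAddress) {
    // Encodes host:port into the token's service field so later code can
    // find the right RM without extra context.
    SecurityUtil.setTokenService(rmDelegationToken, rmAddress);
  }
}
{code}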
[jira] [Updated] (YARN-2380) The normalizeRequests method in SchedulerUtils always resets the vCore to 1
[ https://issues.apache.org/jira/browse/YARN-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2380: --- Labels: BB2015-05-TBR (was: ) The normalizeRequests method in SchedulerUtils always resets the vCore to 1 --- Key: YARN-2380 URL: https://issues.apache.org/jira/browse/YARN-2380 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Jian Fang Assignee: Jian Fang Priority: Critical Labels: BB2015-05-TBR Attachments: YARN-2380.patch I added some log info to the method normalizeRequest() as follows:
{code}
public static void normalizeRequest(
    ResourceRequest ask, ResourceCalculator resourceCalculator,
    Resource clusterResource, Resource minimumResource,
    Resource maximumResource, Resource incrementResource) {
  LOG.info("Before request normalization, the ask capacity: " + ask.getCapability());
  Resource normalized = Resources.normalize(
      resourceCalculator, ask.getCapability(), minimumResource,
      maximumResource, incrementResource);
  LOG.info("After request normalization, the ask capacity: " + normalized);
  ask.setCapability(normalized);
}
{code}
The resulting log showed that the vCores in the ask were changed from 2 to 1.
{noformat}
2014-08-01 20:54:15,537 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils (IPC Server handler 4 on 9024): Before request normalization, the ask capacity: memory:1536, vCores:2
2014-08-01 20:54:15,537 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils (IPC Server handler 4 on 9024): After request normalization, the ask capacity: memory:1536, vCores:1
{noformat}
The root cause is that DefaultResourceCalculator calls Resources.createResource(normalizedMemory) to regenerate a new resource with vCores = 1. This bug is critical: it leads to a mismatch between the requested resource and the container resource, and to many other potential issues, if the user requests containers with more than 1 vCore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
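One possible shape of a fix (a sketch only, not the attached patch) is to carry the request's vCores through memory normalization instead of letting createResource default them to 1:
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class NormalizeSketch {
  /** Keep the asked vCores instead of calling Resources.createResource(memory). */
  static Resource normalizeMemoryOnly(Resource ask, int normalizedMemory) {
    return Resources.createResource(normalizedMemory, ask.getVirtualCores());
  }
}
{code}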
[jira] [Updated] (YARN-867) Isolation of failures in aux services
[ https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-867: -- Labels: BB2015-05-TBR (was: ) Isolation of failures in aux services -- Key: YARN-867 URL: https://issues.apache.org/jira/browse/YARN-867 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Hitesh Shah Assignee: Xuan Gong Priority: Critical Labels: BB2015-05-TBR Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, YARN-867.4.patch, YARN-867.5.patch, YARN-867.6.patch, YARN-867.sampleCode.2.patch Today, a malicious application can bring down the NM by sending bad data to a service. For example, sending data to the ShuffleService such that it results in any non-IOException will cause the NM's async dispatcher to exit, as the service's INIT APP event is not handled properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
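The isolation being asked for boils down to catching everything an aux service handler throws, so one misbehaving service cannot kill the NM dispatcher. A sketch with stand-in types (AuxService here is not the real NM interface):
{code}
public class AuxServiceIsolationSketch {
  interface AuxService { void handle(Object event); }

  // Dispatch an event to an aux service without letting its failure
  // propagate into the NM's async dispatcher.
  static void dispatchSafely(AuxService service, Object event) {
    try {
      service.handle(event);
    } catch (Throwable t) { // not just IOException: any RuntimeException or Error too
      System.err.println("Aux service failed handling " + event + ": " + t);
      // Real code would log this and possibly disable the offending service.
    }
  }
}
{code}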
[jira] [Updated] (YARN-1050) Document the Fair Scheduler REST API
[ https://issues.apache.org/jira/browse/YARN-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-1050: --- Labels: BB2015-05-TBR (was: ) Document the Fair Scheduler REST API Key: YARN-1050 URL: https://issues.apache.org/jira/browse/YARN-1050 Project: Hadoop YARN Issue Type: Improvement Components: documentation, fairscheduler Reporter: Sandy Ryza Assignee: Kenji Kikushima Labels: BB2015-05-TBR Attachments: YARN-1050-2.patch, YARN-1050-3.patch, YARN-1050.patch The documentation should be placed here along with the Capacity Scheduler documentation: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3261) rewrite resourcemanager restart doc to remove roadmap bits
[ https://issues.apache.org/jira/browse/YARN-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3261: --- Labels: BB2015-05-TBR (was: ) rewrite resourcemanager restart doc to remove roadmap bits --- Key: YARN-3261 URL: https://issues.apache.org/jira/browse/YARN-3261 Project: Hadoop YARN Issue Type: Bug Components: documentation Reporter: Allen Wittenauer Assignee: Gururaj Shetty Labels: BB2015-05-TBR Attachments: YARN-3261.01.patch Another mixture of roadmap and instruction manual that seems to be ever present in a lot of the recently written documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3126) FairScheduler: queue's usedResource is always more than the maxResource limit
[ https://issues.apache.org/jira/browse/YARN-3126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3126: --- Labels: BB2015-05-TBR assignContainer fairscheduler resources (was: assignContainer fairscheduler resources) FairScheduler: queue's usedResource is always more than the maxResource limit - Key: YARN-3126 URL: https://issues.apache.org/jira/browse/YARN-3126 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.3.0 Environment: hadoop2.3.0. fair scheduler. spark 1.1.0. Reporter: Xia Hu Labels: BB2015-05-TBR, assignContainer, fairscheduler, resources Fix For: trunk-win Attachments: resourcelimit-02.patch, resourcelimit.patch When submitting Spark applications (both spark-on-yarn-cluster and spark-on-yarn-client modes), the queue's usedResources as assigned by the fair scheduler can exceed the queue's maxResources limit. From reading the fair scheduler code, I suppose this issue happens because the requested resources are not checked when assigning a container. Here is the detail: 1. Choose a queue. In this process, it checks whether the queue's usedResource is bigger than its max, via assignContainerPreCheck. 2. Then choose an app in that queue. 3. Then choose a container. And here is the problem: there is no check on whether this container would put the queue's resources over its max limit. If a queue's usedResource is 13G and the maxResource limit is 16G, then a container asking for 4G of resources may still be assigned successfully. This problem happens readily with Spark applications, because we can ask for different container resources in different applications. By the way, I have already applied the patch from YARN-2083. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
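For illustration, the guard missing in step 3 could be phrased with Hadoop's Resources utility roughly as below (a hedged sketch; the queue accessors feeding it are assumptions, not the attached patches):
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class MaxResourceGuardSketch {
  /** True if assigning a container of size `ask` keeps the queue within its max. */
  static boolean fitsUnderMax(Resource queueUsage, Resource queueMax, Resource ask) {
    // e.g. usage 13G + ask 4G = 17G does NOT fit under a 16G max, so skip assignment.
    return Resources.fitsIn(Resources.add(queueUsage, ask), queueMax);
  }
}
{code}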
[jira] [Updated] (YARN-2345) yarn rmadmin -report
[ https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2345: --- Labels: BB2015-05-TBR newbie (was: newbie) yarn rmadmin -report Key: YARN-2345 URL: https://issues.apache.org/jira/browse/YARN-2345 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Allen Wittenauer Assignee: Hao Gao Labels: BB2015-05-TBR, newbie Attachments: YARN-2345.1.patch It would be good to have an equivalent of hdfs dfsadmin -report in YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3360) Add JMX metrics to TimelineDataManager
[ https://issues.apache.org/jira/browse/YARN-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3360: --- Labels: BB2015-05-TBR (was: ) Add JMX metrics to TimelineDataManager -- Key: YARN-3360 URL: https://issues.apache.org/jira/browse/YARN-3360 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Labels: BB2015-05-TBR Attachments: YARN-3360.001.patch The TimelineDataManager currently has no metrics, outside of the standard JVM metrics. It would be very useful to at least log basic counts of method calls, time spent in those calls, and number of entities/events involved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
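With Hadoop's metrics2 library, the counts and timings the issue asks for are mostly annotations plus a registration call; a sketch with illustrative metric names (not the eventual patch):
{code}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;
import org.apache.hadoop.metrics2.lib.MutableRate;

@Metrics(about = "Metrics for TimelineDataManager", context = "yarn")
public class TimelineDataManagerMetricsSketch {
  @Metric("getEntities calls") MutableCounterLong getEntitiesOps;
  @Metric("getEntities call time") MutableRate getEntitiesTime;

  static TimelineDataManagerMetricsSketch register() {
    // Registering the annotated source exposes the metrics over JMX.
    return DefaultMetricsSystem.instance().register(
        "TimelineDataManagerMetrics", "Metrics for TimelineDataManager",
        new TimelineDataManagerMetricsSketch());
  }
}
{code}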
[jira] [Updated] (YARN-1462) AHS API and other AHS changes to handle tags for completed MR jobs
[ https://issues.apache.org/jira/browse/YARN-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-1462: --- Labels: BB2015-05-TBR (was: ) AHS API and other AHS changes to handle tags for completed MR jobs -- Key: YARN-1462 URL: https://issues.apache.org/jira/browse/YARN-1462 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Karthik Kambatla Assignee: Xuan Gong Labels: BB2015-05-TBR Attachments: YARN-1462-branch-2.7-1.patch, YARN-1462.1.patch, YARN-1462.2.patch AHS related work for tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3535: --- Labels: BB2015-05-TBR (was: ) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Labels: BB2015-05-TBR Attachments: YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-20) More information for yarn.resourcemanager.webapp.address in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-20: - Labels: BB2015-05-TBR (was: ) More information for yarn.resourcemanager.webapp.address in yarn-default.xml -- Key: YARN-20 URL: https://issues.apache.org/jira/browse/YARN-20 Project: Hadoop YARN Issue Type: Improvement Components: documentation, resourcemanager Affects Versions: 2.0.0-alpha Reporter: Nemon Lou Priority: Trivial Labels: BB2015-05-TBR Attachments: YARN-20.1.patch, YARN-20.patch Original Estimate: 1h Remaining Estimate: 1h The parameter yarn.resourcemanager.webapp.address in yarn-default.xml is in host:port format, which is noted in the cluster setup guide (http://hadoop.apache.org/common/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html). When I read through the code, I found that the host-only format is also supported. In the host-only format, the port will be random. So we could add more documentation in yarn-default.xml to make this easier to understand. I will submit a patch if it's helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
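In other words, both forms below are accepted today; only the first pins the web UI port (a sketch; the example host is made up):
{code}
import org.apache.hadoop.conf.Configuration;

public class WebappAddressSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // host:port form: the web UI binds to the given port.
    conf.set("yarn.resourcemanager.webapp.address", "rm.example.com:8088");
    // Host-only form is also accepted; the port is then chosen at random:
    // conf.set("yarn.resourcemanager.webapp.address", "rm.example.com");
  }
}
{code}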
[jira] [Updated] (YARN-1515) Provide ContainerManagementProtocol#signalContainer processing a batch of signals
[ https://issues.apache.org/jira/browse/YARN-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-1515: --- Labels: BB2015-05-TBR (was: ) Provide ContainerManagementProtocol#signalContainer processing a batch of signals -- Key: YARN-1515 URL: https://issues.apache.org/jira/browse/YARN-1515 Project: Hadoop YARN Issue Type: Sub-task Components: api, nodemanager Reporter: Gera Shegalov Assignee: Gera Shegalov Labels: BB2015-05-TBR Attachments: YARN-1515.v01.patch, YARN-1515.v02.patch, YARN-1515.v03.patch, YARN-1515.v04.patch, YARN-1515.v05.patch, YARN-1515.v06.patch, YARN-1515.v07.patch, YARN-1515.v08.patch This is needed to implement MAPREDUCE-5044 to enable thread diagnostics for timed-out task attempts. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2325) need check whether node is null in nodeUpdate for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2325: --- Labels: BB2015-05-TBR (was: ) need check whether node is null in nodeUpdate for FairScheduler Key: YARN-2325 URL: https://issues.apache.org/jira/browse/YARN-2325 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Labels: BB2015-05-TBR Attachments: YARN-2325.000.patch We need to check whether the node is null in nodeUpdate for FairScheduler. If nodeUpdate is called after removeNode, getFSSchedulerNode will return null. If the node is null, we should return with an error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
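The guard the report describes is small; a sketch with stand-in types that mirrors the description rather than the attached patch:
{code}
public class NodeUpdateGuardSketch {
  static Object getFSSchedulerNode(String nodeId) { return null; } // stand-in lookup

  static void nodeUpdate(String nodeId) {
    Object node = getFSSchedulerNode(nodeId);
    if (node == null) {
      // Node was already removed (nodeUpdate raced with removeNode): log and bail out.
      System.err.println("Node " + nodeId + " not found; ignoring update");
      return;
    }
    // ... normal update path ...
  }
}
{code}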
[jira] [Updated] (YARN-2151) FairScheduler option for global preemption within hierarchical queues
[ https://issues.apache.org/jira/browse/YARN-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2151: --- Labels: BB2015-05-TBR (was: ) FairScheduler option for global preemption within hierarchical queues - Key: YARN-2151 URL: https://issues.apache.org/jira/browse/YARN-2151 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Andrey Stepachev Labels: BB2015-05-TBR Attachments: YARN-2151.patch FairScheduler has hierarchical queues, but fair share calculation and preemption still work within a limited range and are effectively still nonhierarchical. This patch addresses this incompleteness in two aspects: 1. Currently MinShare is not propagated to the parent queue, which leads the fair share calculation to ignore all min shares in deeper queues. Let's take an example (implemented as test case TestFairScheduler#testMinShareInHierarchicalQueues):
{code}
<?xml version="1.0"?>
<allocations>
  <queue name="queue1">
    <maxResources>10240mb, 10vcores</maxResources>
    <queue name="big"/>
    <queue name="sub1">
      <schedulingPolicy>fair</schedulingPolicy>
      <queue name="sub11">
        <minResources>6192mb, 6vcores</minResources>
      </queue>
    </queue>
    <queue name="sub2">
    </queue>
  </queue>
</allocations>
{code}
Then bigApp is started within queue1.big with 10x1GB containers. That effectively eats all of the maximum allowed resources for queue1. Subsequent requests for app1 (queue1.sub1.sub11) and app2 (queue1.sub2) (5x1GB each) will wait for free resources. Note that sub11 has a min share requirement of 6x1GB. Without the patch, fair share is calculated with no knowledge of min share requirements, and app1 and app2 get an equal number of containers. With the patch, resources are split according to min share (in the test it is 5 for app1 and 1 for app2). That behaviour is controlled by the same ‘globalPreemption’ parameter, but that can be changed easily. The implementation is a bit awkward, but it seems the method for min share recalculation could be exposed as a public or protected API and the constructor in FSQueue could call it before using the minShare getter. But right now the current implementation with nulls should work too. 2. Preemption doesn't work between queues on different levels of the queue hierarchy. Moreover, it is not possible to override various parameters for child queues. This patch adds a ‘globalPreemption’ parameter, which enables the global preemption algorithm modifications. In a nutshell, the patch adds a function shouldAttemptPreemption(queue), which can calculate usage for nested queues; if a queue with usage above the specified threshold is found, preemption can be triggered. Aggregated minShare does the rest of the work, and preemption works as expected within a hierarchy of queues with different MinShare/MaxShare specifications on different levels. Test case TestFairScheduler#testGlobalPreemption depicts how it works. One big app gets resources above its fair share, and app1 has a declared min share. On submission, the code detects that starvation and preempts enough containers to give enough room for app1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1782) CLI should let users to query cluster metrics
[ https://issues.apache.org/jira/browse/YARN-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-1782: --- Labels: BB2015-05-TBR (was: ) CLI should let users to query cluster metrics - Key: YARN-1782 URL: https://issues.apache.org/jira/browse/YARN-1782 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Zhijie Shen Assignee: Kenji Kikushima Labels: BB2015-05-TBR Attachments: YARN-1782.patch Like RM webUI and RESTful services, YARN CLI should also enable users to query the cluster metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-126) yarn rmadmin help message contains reference to hadoop cli and JT
[ https://issues.apache.org/jira/browse/YARN-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-126: -- Labels: BB2015-05-TBR usability (was: usability) yarn rmadmin help message contains reference to hadoop cli and JT - Key: YARN-126 URL: https://issues.apache.org/jira/browse/YARN-126 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.0.3-alpha Reporter: Thomas Graves Assignee: Rémy SAISSY Labels: BB2015-05-TBR, usability Attachments: YARN-126.patch The help message has an option to specify a job tracker, and its last line for general command line syntax reads bin/hadoop command [genericOptions] [commandOptions]. Ran yarn rmadmin to get the usage:
{noformat}
RMAdmin Usage: java RMAdmin
   [-refreshQueues]
   [-refreshNodes]
   [-refreshUserToGroupsMappings]
   [-refreshSuperUserGroupsConfiguration]
   [-refreshAdminAcls]
   [-refreshServiceAcl]
   [-help [cmd]]

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3395) [Fair Scheduler] Handle the user name correctly when user name is used as default queue name.
[ https://issues.apache.org/jira/browse/YARN-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3395: --- Labels: BB2015-05-TBR (was: ) [Fair Scheduler] Handle the user name correctly when user name is used as default queue name. - Key: YARN-3395 URL: https://issues.apache.org/jira/browse/YARN-3395 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: zhihai xu Assignee: zhihai xu Labels: BB2015-05-TBR Attachments: YARN-3395.000.patch Handle the user name correctly when user name is used as default queue name in fair scheduler. It will be better to remove the trailing and leading whitespace of the user name when we use user name as default queue name, otherwise it will be rejected by InvalidQueueNameException from QueueManager. I think it is reasonable to make this change, because we already did special handling for '.' in user name. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
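The proposed handling is essentially a trim before the name reaches QueueManager; a minimal sketch (the helper name is made up, and the existing '.' special-casing lives elsewhere):
{code}
public class DefaultQueueNameSketch {
  /** Derive a fair-scheduler default queue name from a user name. */
  static String defaultQueueName(String user) {
    // Trim leading/trailing whitespace so QueueManager does not reject the
    // derived queue name with InvalidQueueNameException.
    return user.trim();
  }
}
{code}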
[jira] [Updated] (YARN-1287) Consolidate MockClocks
[ https://issues.apache.org/jira/browse/YARN-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-1287: --- Labels: BB2015-05-TBR newbie (was: newbie) Consolidate MockClocks -- Key: YARN-1287 URL: https://issues.apache.org/jira/browse/YARN-1287 Project: Hadoop YARN Issue Type: Improvement Reporter: Sandy Ryza Assignee: Sebastian Wong Labels: BB2015-05-TBR, newbie Attachments: YARN-1287-3.patch A bunch of different tests have near-identical implementations of MockClock. TestFairScheduler, TestFSSchedulerApp, and TestCgroupsLCEResourcesHandler for example. They should be consolidated into a single MockClock. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-641) Make AMLauncher in RM Use NMClient
[ https://issues.apache.org/jira/browse/YARN-641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-641: -- Labels: BB2015-05-TBR (was: ) Make AMLauncher in RM Use NMClient -- Key: YARN-641 URL: https://issues.apache.org/jira/browse/YARN-641 Project: Hadoop YARN Issue Type: Improvement Reporter: Zhijie Shen Assignee: Zhijie Shen Labels: BB2015-05-TBR Attachments: YARN-641.1.patch, YARN-641.2.patch, YARN-641.3.patch YARN-422 adds NMClient. RM's AMLauncher is responsible for the interactions with an application's AM container. AMLauncher should also replace the raw ContainerManager proxy with NMClient. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1391) Lost node list should be identify by NodeId
[ https://issues.apache.org/jira/browse/YARN-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-1391: --- Labels: BB2015-05-TBR (was: ) Lost node list should be identify by NodeId --- Key: YARN-1391 URL: https://issues.apache.org/jira/browse/YARN-1391 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.5-alpha Reporter: Siqi Li Assignee: Siqi Li Labels: BB2015-05-TBR Attachments: YARN-1391.v1.patch, YARN-1391.v2.patch In the case of multiple node managers on a single machine, each of them should be identified by NodeId, which is more unique than just the host name. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3557) Support Intel Trusted Execution Technology(TXT) in YARN scheduler
[ https://issues.apache.org/jira/browse/YARN-3557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529870#comment-14529870 ] Dian Fu commented on YARN-3557: --- Hi [~Naganarasimha], {quote}constraints should be supported to be added from both RM and NM and partitions (existing labels) should be allowed either by RM or NM{quote} Agree with you that constraints should be supported to be added from both the RM and the NM. TRUSTED/UNTRUSTED are also more like constraints of a node. BTW, it seems that a JIRA for constraints support has already been created (YARN-3409). Support Intel Trusted Execution Technology(TXT) in YARN scheduler - Key: YARN-3557 URL: https://issues.apache.org/jira/browse/YARN-3557 Project: Hadoop YARN Issue Type: New Feature Reporter: Dian Fu Attachments: Support TXT in YARN high level design doc.pdf Intel TXT defines platform-level enhancements that provide the building blocks for creating trusted platforms. A TXT aware YARN scheduler can schedule security sensitive jobs on TXT enabled nodes only. YARN-2492 provides the capacity to restrict YARN applications to run only on cluster nodes that have a specified node label. This is a good mechanism that can be utilized for a TXT aware YARN scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1813) Better error message for yarn logs when permission denied
[ https://issues.apache.org/jira/browse/YARN-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-1813: -- Target Version/s: 2.8.0 (was: 2.6.0) Better error message for yarn logs when permission denied --- Key: YARN-1813 URL: https://issues.apache.org/jira/browse/YARN-1813 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0, 2.4.1, 2.5.1 Reporter: Andrew Wang Assignee: Tsuyoshi Ozawa Priority: Minor Attachments: YARN-1813.1.patch, YARN-1813.2.patch, YARN-1813.2.patch, YARN-1813.3.patch, YARN-1813.4.patch, YARN-1813.5.patch, YARN-1813.6.patch I ran some MR jobs as the hdfs user, and then forgot to sudo -u when grabbing the logs. yarn logs prints an error message like the following: {noformat} [andrew.wang@a2402 ~]$ yarn logs -applicationId application_1394482121761_0010 14/03/10 16:05:10 INFO client.RMProxy: Connecting to ResourceManager at a2402.halxg.cloudera.com/10.20.212.10:8032 Logs not available at /tmp/logs/andrew.wang/logs/application_1394482121761_0010 Log aggregation has not completed or is not enabled. {noformat} It'd be nicer if it said Permission denied or AccessControlException or something like that instead, since that's the real issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1813) Better error message for yarn logs when permission denied
[ https://issues.apache.org/jira/browse/YARN-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529507#comment-14529507 ] Jian He commented on YARN-1813: --- [~ozawa], the patch unfortunately doesn't apply any more. Mind updating it, please? I'll review and get this in. Better error message for yarn logs when permission denied --- Key: YARN-1813 URL: https://issues.apache.org/jira/browse/YARN-1813 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0, 2.4.1, 2.5.1 Reporter: Andrew Wang Assignee: Tsuyoshi Ozawa Priority: Minor Attachments: YARN-1813.1.patch, YARN-1813.2.patch, YARN-1813.2.patch, YARN-1813.3.patch, YARN-1813.4.patch, YARN-1813.5.patch, YARN-1813.6.patch I ran some MR jobs as the hdfs user, and then forgot to sudo -u when grabbing the logs. yarn logs prints an error message like the following: {noformat} [andrew.wang@a2402 ~]$ yarn logs -applicationId application_1394482121761_0010 14/03/10 16:05:10 INFO client.RMProxy: Connecting to ResourceManager at a2402.halxg.cloudera.com/10.20.212.10:8032 Logs not available at /tmp/logs/andrew.wang/logs/application_1394482121761_0010 Log aggregation has not completed or is not enabled. {noformat} It'd be nicer if it said Permission denied or AccessControlException or something like that instead, since that's the real issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3581) Deprecate -directlyAccessNodeLabelStore in RMAdminCLI
[ https://issues.apache.org/jira/browse/YARN-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3581: - Summary: Deprecate -directlyAccessNodeLabelStore in RMAdminCLI (was: Drop -directlyAccessNodeLabelStore in RMAdminCLI) Deprecate -directlyAccessNodeLabelStore in RMAdminCLI - Key: YARN-3581 URL: https://issues.apache.org/jira/browse/YARN-3581 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan In 2.6.0, we added an option called -directlyAccessNodeLabelStore to make RM can start with label-configured queue settings. After YARN-2918, we don't need this option any more, admin can configure queue setting, start RM and configure node label via RMAdminCLI without any error. In addition, this option is very restrictive, first it needs to run on the same node where RM is running if admin configured to store labels in local disk. Second, when admin run the option when RM is running, multiple process write to a same file can happen, this could make node label store becomes invalid. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3127) Apphistory url crashes when RM switches with ATS enabled
[ https://issues.apache.org/jira/browse/YARN-3127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529606#comment-14529606 ] Naganarasimha G R commented on YARN-3127: - Thanks for reviewing, [~gtCarrera9]. The main cause of the issue mentioned here is already addressed in another JIRA by [~xgong], but when we test in this way we still get to see null in the web UI. More importantly, addressing this JIRA is still required because events are published for every app (start and finished) on RM failover. So if 1 apps are maintained, then that many additional unrequired events get triggered; this we need to address. And for the issue pointed out by [~xgong], I had asked for suggestions on the approach being taken and hence am waiting for that. AFAIK we need to ensure ATS events are sent first and then store the final application state to the RM state store in the FINAL_SAVING transition (and also handle other possible cases where an app is created and killed before an attempt is created, in which case FINAL_SAVING is not called). If this approach is fine, then I will update the patch and the description. Apphistory url crashes when RM switches with ATS enabled Key: YARN-3127 URL: https://issues.apache.org/jira/browse/YARN-3127 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.6.0 Environment: RM HA with ATS Reporter: Bibin A Chundatt Assignee: Naganarasimha G R Attachments: YARN-3127.20150213-1.patch, YARN-3127.20150329-1.patch 1. Start RM with HA and ATS configured and run some YARN applications 2. Once the applications have finished successfully, start the timeline server 3. Now fail over HA from active to standby 4. Access the timeline server URL IP:PORT/applicationhistory Result: The application history URL fails with the below info {quote} 2015-02-03 20:28:09,511 ERROR org.apache.hadoop.yarn.webapp.View: Failed to read the applications. java.lang.reflect.UndeclaredThrowableException at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1643) at org.apache.hadoop.yarn.server.webapp.AppsBlock.render(AppsBlock.java:80) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:67) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) ... Caused by: org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: The entity for application attempt appattempt_1422972608379_0001_01 doesn't exist in the timeline store at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getApplicationAttempt(ApplicationHistoryManagerOnTimelineStore.java:151) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.generateApplicationReport(ApplicationHistoryManagerOnTimelineStore.java:499) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerOnTimelineStore.getAllApplications(ApplicationHistoryManagerOnTimelineStore.java:108) at org.apache.hadoop.yarn.server.webapp.AppsBlock$1.run(AppsBlock.java:84) at org.apache.hadoop.yarn.server.webapp.AppsBlock$1.run(AppsBlock.java:81) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) ...
51 more 2015-02-03 20:28:09,512 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /applicationhistory org.apache.hadoop.yarn.webapp.WebAppException: Error rendering block: nestLevel=6 expected 5 at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:77) {quote} Behaviour of AHS with the file-based history store: - The application history URL works - No attempt entries are shown for each application. Based on initial analysis, when the RM switches, application attempts from the state store are not replayed, only applications are. So when the /applicationhistory URL is accessed, it tries all attempt ids and fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3134) [Storage implementation] Exploiting the option of using Phoenix to access HBase backend
[ https://issues.apache.org/jira/browse/YARN-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-3134: Attachment: YARN-3134-YARN-2928.004.patch I've addressed all comments from [~zjshen] except for two points: I think we need to discuss possible cache settings for performance tuning further, and I leave the work of accurately verifying the content in TestPhoenixTimelineWriterImpl as future work. For now my main focus is on pushing the current version forward into a benchmark-ready state. [Storage implementation] Exploiting the option of using Phoenix to access HBase backend --- Key: YARN-3134 URL: https://issues.apache.org/jira/browse/YARN-3134 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Li Lu Attachments: SettingupPhoenixstorageforatimelinev2end-to-endtest.pdf, YARN-3134-040915_poc.patch, YARN-3134-041015_poc.patch, YARN-3134-041415_poc.patch, YARN-3134-042115.patch, YARN-3134-042715.patch, YARN-3134-YARN-2928.001.patch, YARN-3134-YARN-2928.002.patch, YARN-3134-YARN-2928.003.patch, YARN-3134-YARN-2928.004.patch, YARN-3134DataSchema.pdf Quote the introduction on the Phoenix web page: {code} Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows. {code} It may simplify our implementation of reading/writing data from/to HBase, and make it easy to build indexes and compose complex queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529603#comment-14529603 ] Vinod Kumar Vavilapalli commented on YARN-1680: --- When I said node-labels above, I meant partitions. Clearly the problem and the corresponding solution will likely be very similar for node-constraints (one type of node-labels). After all, blacklisting is a type of (anti) node-constraint. /cc [~leftnoteasy] availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory. -- Key: YARN-1680 URL: https://issues.apache.org/jira/browse/YARN-1680 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.2.0, 2.3.0 Environment: SuSE 11 SP2 + Hadoop-2.3 Reporter: Rohith Assignee: Craig Welch Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster slow start is set to 1. A running job's reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) became unstable (3 map tasks got killed), so MRAppMaster blacklisted the unstable NodeManager (NM-4). All reducer tasks are now running in the cluster. MRAppMaster does not preempt the reducers because the headroom used for the reducer preemption calculation includes blacklisted nodes' memory. This makes jobs hang forever (the ResourceManager does not assign any new containers on blacklisted nodes but returns an availableResources value that reflects the whole cluster's free memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3301) Fix the format issue of the new RM web UI and AHS web UI
[ https://issues.apache.org/jira/browse/YARN-3301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529658#comment-14529658 ] Hadoop QA commented on YARN-3301: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 42s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 44s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 44s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 9s | The applied patch generated 2 new checkstyle issues (total was 58, now 43). | | {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 8s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-server-common. | | {color:red}-1{color} | yarn tests | 54m 42s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 93m 9s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.TestRMRestart | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730644/YARN-3301.4.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 4da8490 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7716/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7716/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7716/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7716/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7716/console | This message was automatically generated. Fix the format issue of the new RM web UI and AHS web UI Key: YARN-3301 URL: https://issues.apache.org/jira/browse/YARN-3301 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: Screen Shot 2015-04-21 at 5.09.25 PM.png, Screen Shot 2015-04-21 at 5.38.39 PM.png, YARN-3301.1.patch, YARN-3301.2.patch, YARN-3301.3.patch, YARN-3301.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3513) Remove unused variables in ContainersMonitorImpl
[ https://issues.apache.org/jira/browse/YARN-3513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529759#comment-14529759 ] Hadoop QA commented on YARN-3513: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 15m 0s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 38s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 52s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 38s | The applied patch generated 1 new checkstyle issues (total was 27, now 27). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 3s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 6m 2s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 42m 46s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730656/YARN-3513.20150506-1.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 90b3845 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7718/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7718/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7718/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7718/console | This message was automatically generated. Remove unused variables in ContainersMonitorImpl Key: YARN-3513 URL: https://issues.apache.org/jira/browse/YARN-3513 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Naganarasimha G R Assignee: Naganarasimha G R Priority: Trivial Labels: newbie Attachments: YARN-3513.20150421-1.patch, YARN-3513.20150503-1.patch, YARN-3513.20150506-1.patch The class member {{private final Context context;}} and some local variables in MonitoringThread.run() ({{vmemStillInUsage}} and {{pmemStillInUsage}}) are never read; they are only updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3169) drop the useless yarn overview document
[ https://issues.apache.org/jira/browse/YARN-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3169: --- Labels: BB2015-05-TBR (was: ) drop the useless yarn overview document --- Key: YARN-3169 URL: https://issues.apache.org/jira/browse/YARN-3169 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Allen Wittenauer Assignee: Brahma Reddy Battula Labels: BB2015-05-TBR Attachments: YARN-3169-002.patch, YARN-3169.patch It's pretty superfluous given there is a site index on the left. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2306) leak of reservation metrics (fair scheduler)
[ https://issues.apache.org/jira/browse/YARN-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2306: --- Labels: BB2015-05-TBR (was: ) leak of reservation metrics (fair scheduler) Key: YARN-2306 URL: https://issues.apache.org/jira/browse/YARN-2306 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Labels: BB2015-05-TBR Attachments: YARN-2306-2.patch, YARN-2306.patch This only applies to the fair scheduler; the capacity scheduler is OK. When an appAttempt or node is removed, the reservation metrics (reservedContainers, reservedMB, reservedVCores) are not reduced back. These are important metrics for administrators, and the wrong values may confuse them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
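A minimal sketch of the cleanup shape this describes: rolling outstanding reservations back out of the queue metrics when the attempt or node goes away. The hook and its call site are illustrative assumptions, not the attached patches.
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics;

// Hypothetical cleanup hook, not the YARN-2306 patch.
public class ReservationMetricsSketch {
  /**
   * On removal of an app attempt (or node) that still holds reservations,
   * unreserve each one so reservedContainers/reservedMB/reservedVCores do
   * not leak. Assumes the caller tracks each reservation's resource.
   */
  static void releaseReservations(QueueMetrics metrics, String user,
      Iterable<Resource> outstandingReservations) {
    for (Resource reserved : outstandingReservations) {
      metrics.unreserveResource(user, reserved);
    }
  }
}
{code}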
[jira] [Updated] (YARN-1126) Add validation of users input nodes-states options to nodes CLI
[ https://issues.apache.org/jira/browse/YARN-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-1126: --- Labels: BB2015-05-TBR (was: ) Add validation of users input nodes-states options to nodes CLI --- Key: YARN-1126 URL: https://issues.apache.org/jira/browse/YARN-1126 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Labels: BB2015-05-TBR Attachments: YARN-905-addendum.patch Following the discussion in YARN-905: (1) case-insensitive checks for all states; (2) validation of user input, exiting with a non-zero code and printing all valid states when the user gives an invalid state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
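A minimal sketch of the case-insensitive parsing and fail-fast behavior described above, written against the real org.apache.hadoop.yarn.api.records.NodeState enum; the helper itself is an illustrative assumption, not the attached addendum patch.
{code}
import java.util.Arrays;
import org.apache.hadoop.yarn.api.records.NodeState;

// Hypothetical CLI helper, not the YARN-1126 patch.
public class NodeStateArgSketch {
  /** Parses a node-state argument case-insensitively, failing fast with help text. */
  static NodeState parseState(String arg) {
    try {
      return NodeState.valueOf(arg.trim().toUpperCase());
    } catch (IllegalArgumentException e) {
      System.err.println("Invalid node state '" + arg + "'. Valid states: "
          + Arrays.toString(NodeState.values()));
      System.exit(1);
      return null; // unreachable; exit(1) above terminates the CLI
    }
  }
}
{code}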
[jira] [Updated] (YARN-2423) TimelineClient should wrap all GET APIs to facilitate Java users
[ https://issues.apache.org/jira/browse/YARN-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2423: --- Labels: BB2015-05-TBR (was: ) TimelineClient should wrap all GET APIs to facilitate Java users Key: YARN-2423 URL: https://issues.apache.org/jira/browse/YARN-2423 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Robert Kanter Labels: BB2015-05-TBR Attachments: YARN-2423.004.patch, YARN-2423.005.patch, YARN-2423.006.patch, YARN-2423.007.patch, YARN-2423.patch, YARN-2423.patch, YARN-2423.patch TimelineClient provides the Java methods to put timeline entities. It would also be good to wrap all the GET APIs (both entity and domain) and deserialize the JSON responses into Java POJO objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
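A minimal sketch of what such a GET wrapper could look like with the Jersey 1.x client; the class, the base-URL handling, and the assumption that a JSON provider is registered on the client are all illustrative, not the actual patch series (which extends TimelineClient itself).
{code}
import com.sun.jersey.api.client.Client;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;

// Hypothetical wrapper, not the YARN-2423 patches.
public class TimelineGetSketch {
  private final Client client = Client.create(); // assumes a JSON provider is configured
  private final String base; // e.g. "http://ats-host:8188/ws/v1/timeline" (assumption)

  TimelineGetSketch(String base) {
    this.base = base;
  }

  /** GETs a single entity and lets the client bind the JSON response to the POJO. */
  TimelineEntity getEntity(String entityType, String entityId) {
    return client.resource(base + "/" + entityType + "/" + entityId)
        .accept("application/json")
        .get(TimelineEntity.class);
  }
}
{code}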
[jira] [Updated] (YARN-535) TestUnmanagedAMLauncher can corrupt target/test-classes/yarn-site.xml during write phase, breaks later test runs
[ https://issues.apache.org/jira/browse/YARN-535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-535: -- Labels: BB2015-05-TBR (was: ) TestUnmanagedAMLauncher can corrupt target/test-classes/yarn-site.xml during write phase, breaks later test runs Key: YARN-535 URL: https://issues.apache.org/jira/browse/YARN-535 Project: Hadoop YARN Issue Type: Bug Components: applications Affects Versions: 2.6.0 Environment: OS/X laptop, HFS+ filesystem Reporter: Steve Loughran Assignee: Steve Loughran Priority: Minor Labels: BB2015-05-TBR Attachments: YARN-535-02.patch, YARN-535.patch The setup phase of {{TestUnmanagedAMLauncher}} overwrites {{yarn-site.xml}}. As {{Configuration.writeXml()}} re-reads all resources, this will break if the (open-for-writing) resource is already visible as an empty file. This leaves a corrupted {{target/test-classes/yarn-site.xml}}, which breaks later test runs because it is not overwritten by later incremental builds, due to timestamps. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
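A minimal sketch of one way to avoid the re-read-while-writing hazard: serialize to a temp file, then atomically rename it over the target, so {{Configuration.writeXml()}} never sees a half-written resource. This is an illustrative mitigation, not the attached YARN-535 patches.
{code}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;

// Hypothetical mitigation sketch, not the YARN-535 patch.
public class SafeConfWriteSketch {
  static void writeConfSafely(Configuration conf, File target) throws IOException {
    File tmp = File.createTempFile("yarn-site", ".xml", target.getParentFile());
    try (OutputStream out = new FileOutputStream(tmp)) {
      // writeXml() may re-read all resources, but the real target stays
      // untouched until the write has fully completed.
      conf.writeXml(out);
    }
    if (!tmp.renameTo(target)) { // rename is atomic on POSIX filesystems
      throw new IOException("Could not replace " + target + " with " + tmp);
    }
  }
}
{code}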
[jira] [Updated] (YARN-2003) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side]
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2003: --- Labels: BB2015-05-TBR (was: ) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side] -- Key: YARN-2003 URL: https://issues.apache.org/jira/browse/YARN-2003 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Labels: BB2015-05-TBR Attachments: 0001-YARN-2003.patch, 00010-YARN-2003.patch, 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 0006-YARN-2003.patch, 0007-YARN-2003.patch, 0008-YARN-2003.patch, 0009-YARN-2003.patch AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from Submission Context and store. Later this can be used by Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3405: --- Labels: BB2015-05-TBR (was: ) FairScheduler's preemption cannot happen between sibling in some case - Key: YARN-3405 URL: https://issues.apache.org/jira/browse/YARN-3405 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Labels: BB2015-05-TBR Attachments: YARN-3405.01.patch, YARN-3405.02.patch Queue hierarchy described as below: {noformat} root / \ queue-1 queue-2 / \ queue-1-1 queue-1-2 {noformat} Assume cluster resource is 100. # queue-1-1 and queue-2 each have an app; each gets 50 usage and 50 fairshare. # When queue-1-2 becomes active, it causes a new preemption request for its fairshare of 25. # When preempting from root, the chosen preemption candidate may be queue-2. If so, preemptContainerPreCheck for queue-2 returns false because its usage equals its fairshare. # Finally, queue-1-2 ends up waiting for a resource release from queue-1-1 itself. What I expect here is that queue-1-2 preempts from queue-1-1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
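A minimal sketch of the selection idea the report points toward: when walking down from the root, descend into the child that is most over its fair share, so preemption lands in the over-allocated subtree (queue-1-1 in the example) rather than on a sibling sitting at its fair share. The queue model interface is an illustrative assumption, not the FairScheduler API.
{code}
import java.util.Comparator;
import java.util.List;

// Hypothetical queue model, not the FairScheduler API.
interface Q {
  List<Q> children(); // empty for leaf queues
  long usage();
  long fairShare();
}

public class PreemptCandidateSketch {
  /** Walks from the root toward the leaf that is most over its fair share. */
  static Q findCandidateLeaf(Q root) {
    Q q = root;
    while (!q.children().isEmpty()) {
      q = q.children().stream()
          .max(Comparator.comparingLong(c -> c.usage() - c.fairShare()))
          .get(); // safe: children() is non-empty here
    }
    return q;
  }
}
{code}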
[jira] [Updated] (YARN-2444) Primary filters added after first submission not indexed, cause exceptions in logs.
[ https://issues.apache.org/jira/browse/YARN-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2444: --- Labels: BB2015-05-TBR (was: ) Primary filters added after first submission not indexed, cause exceptions in logs. --- Key: YARN-2444 URL: https://issues.apache.org/jira/browse/YARN-2444 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.5.0 Reporter: Marcelo Vanzin Assignee: Steve Loughran Labels: BB2015-05-TBR Attachments: YARN-2444-001.patch, ats.java, org.apache.hadoop.yarn.server.timeline.TestTimelineClientPut-output.txt See attached code for an example. The code creates an entity with a primary filter, submits it to the ATS. After that, a new primary filter value is added and the entity is resubmitted. At that point two things can be seen: - Searching for the new primary filter value does not return the entity - The following exception shows up in the logs: {noformat} 14/08/22 11:33:42 ERROR webapp.TimelineWebServices: Error when verifying access for user dr.who (auth:SIMPLE) on the events of the timeline entity { id: testid-48625678-9cbb-4e71-87de-93c50be51d1a, type: test } org.apache.hadoop.yarn.exceptions.YarnException: Owner information of the timeline entity { id: testid-48625678-9cbb-4e71-87de-93c50be51d1a, type: test } is corrupted. at org.apache.hadoop.yarn.server.timeline.security.TimelineACLsManager.checkAccess(TimelineACLsManager.java:67) at org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.getEntities(TimelineWebServices.java:172) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2268) Disallow formatting the RMStateStore when there is an RM running
[ https://issues.apache.org/jira/browse/YARN-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2268: --- Labels: BB2015-05-TBR (was: ) Disallow formatting the RMStateStore when there is an RM running Key: YARN-2268 URL: https://issues.apache.org/jira/browse/YARN-2268 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Rohith Labels: BB2015-05-TBR Attachments: 0001-YARN-2268.patch YARN-2131 adds a way to format the RMStateStore. However, it can be a problem if we format the store while an RM is actively using it. It would be nice to fail the format if there is an RM running and using this store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
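A minimal sketch of the guard being proposed: before formatting, check for a marker that an RM holds the store, modeled here as an ephemeral lock znode. The path and the ephemeral-lock convention are illustrative assumptions, not the attached patch.
{code}
import org.apache.zookeeper.ZooKeeper;

// Hypothetical format guard, not the YARN-2268 patch.
public class FormatGuardSketch {
  /**
   * Refuses to format if an active-RM marker znode exists. Assumes the RM
   * keeps an ephemeral znode (lockPath) alive while it uses the store.
   */
  static void failIfRMActive(ZooKeeper zk, String lockPath) throws Exception {
    if (zk.exists(lockPath, false) != null) {
      throw new IllegalStateException("An RM appears to be using this state "
          + "store (found " + lockPath + "); refusing to format");
    }
  }
}
{code}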
[jira] [Updated] (YARN-3541) Add version info on timeline service / generic history web UI and REST API
[ https://issues.apache.org/jira/browse/YARN-3541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3541: --- Labels: BB2015-05-TBR (was: ) Add version info on timeline service / generic history web UI and REST API - Key: YARN-3541 URL: https://issues.apache.org/jira/browse/YARN-3541 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Labels: BB2015-05-TBR Attachments: YARN-3541.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2556) Tool to measure the performance of the timeline server
[ https://issues.apache.org/jira/browse/YARN-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2556: --- Labels: BB2015-05-TBR (was: ) Tool to measure the performance of the timeline server -- Key: YARN-2556 URL: https://issues.apache.org/jira/browse/YARN-2556 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: Chang Li Labels: BB2015-05-TBR Attachments: YARN-2556-WIP.patch, YARN-2556-WIP.patch, YARN-2556.1.patch, YARN-2556.2.patch, YARN-2556.patch, yarn2556.patch, yarn2556.patch, yarn2556_wip.patch We need to be able to understand the capacity model for the timeline server to give users the tools they need to deploy a timeline server with the correct capacity. I propose we create a mapreduce job that can measure timeline server write and read performance. Transactions per second, I/O for both read and write would be a good start. This could be done as an example or test job that could be tied into gridmix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3170) YARN architecture document needs updating
[ https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3170: --- Labels: BB2015-05-TBR (was: ) YARN architecture document needs updating - Key: YARN-3170 URL: https://issues.apache.org/jira/browse/YARN-3170 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Allen Wittenauer Assignee: Brahma Reddy Battula Labels: BB2015-05-TBR Attachments: YARN-3170.patch The marketing paragraph at the top, NextGen MapReduce, etc. are all marketing rather than actual descriptions. It also needs some general updates, especially given it reads as though 0.23 was just released yesterday. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3458) CPU resource monitoring in Windows
[ https://issues.apache.org/jira/browse/YARN-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3458: --- Labels: BB2015-05-TBR containers metrics windows (was: containers metrics windows) CPU resource monitoring in Windows -- Key: YARN-3458 URL: https://issues.apache.org/jira/browse/YARN-3458 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager Affects Versions: 2.7.0 Environment: Windows Reporter: Inigo Goiri Assignee: Inigo Goiri Priority: Minor Labels: BB2015-05-TBR, containers, metrics, windows Attachments: YARN-3458-1.patch, YARN-3458-2.patch, YARN-3458-3.patch, YARN-3458-4.patch, YARN-3458-5.patch, YARN-3458-6.patch Original Estimate: 168h Remaining Estimate: 168h The current implementation of getCpuUsagePercent() for WindowsBasedProcessTree is left as unavailable. Attached is a proposal for how to do it. I reused the CpuTimeTracker, using 1 jiffy = 1ms. This was left open by YARN-3122. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
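A minimal sketch of the jiffy-based CPU percentage arithmetic this relies on: with 1 jiffy = 1 ms, the delta in cumulative CPU time over the wall-clock delta gives the usage. The class below is an illustrative stand-in for what CpuTimeTracker computes, not the Windows patch itself.
{code}
// Hypothetical tracker sketch, not the YARN-3458 patch.
public class CpuPercentSketch {
  private long lastCpuMs = -1;
  private long lastSampleMs = -1;

  /** Returns CPU usage in percent (can exceed 100 on multi-core hosts). */
  float cpuUsagePercent(long cumulativeCpuMs, long nowMs) {
    float percent = -1f; // "unavailable" until we have two samples
    if (lastCpuMs >= 0 && nowMs > lastSampleMs) {
      percent = 100f * (cumulativeCpuMs - lastCpuMs) / (nowMs - lastSampleMs);
    }
    lastCpuMs = cumulativeCpuMs;
    lastSampleMs = nowMs;
    return percent;
  }
}
{code}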
[jira] [Updated] (YARN-644) Basic null check is not performed on passed in arguments before using them in ContainerManagerImpl.startContainer
[ https://issues.apache.org/jira/browse/YARN-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-644: -- Labels: BB2015-05-TBR (was: ) Basic null check is not performed on passed in arguments before using them in ContainerManagerImpl.startContainer - Key: YARN-644 URL: https://issues.apache.org/jira/browse/YARN-644 Project: Hadoop YARN Issue Type: Sub-task Reporter: Omkar Vinit Joshi Assignee: Varun Saxena Priority: Minor Labels: BB2015-05-TBR Attachments: YARN-644.001.patch, YARN-644.002.patch I see that validation/null checks are not performed on passed-in parameters, e.g. tokenId.getContainerID().getApplicationAttemptId() inside ContainerManagerImpl.authorizeRequest(). I guess we should add these checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
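A minimal sketch of the guard-clause style such checks could take; the helper and the commented usage are illustrative assumptions, not the attached patches.
{code}
import org.apache.hadoop.yarn.exceptions.YarnException;

// Hypothetical guard helper, not the YARN-644 patch.
public class NullCheckSketch {
  /** Fails the request with a clear message instead of a downstream NPE. */
  static <T> T checkNotNull(T value, String what) throws YarnException {
    if (value == null) {
      throw new YarnException("Invalid request: " + what + " is null");
    }
    return value;
  }

  // Usage against the chained call quoted above (names from the report):
  //   ContainerId cid = checkNotNull(tokenId.getContainerID(), "container ID");
  //   checkNotNull(cid.getApplicationAttemptId(), "application attempt ID");
}
{code}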
[jira] [Updated] (YARN-41) The RM should handle the graceful shutdown of the NM.
[ https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-41: - Labels: BB2015-05-TBR (was: ) The RM should handle the graceful shutdown of the NM. - Key: YARN-41 URL: https://issues.apache.org/jira/browse/YARN-41 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Ravi Teja Ch N V Assignee: Devaraj K Labels: BB2015-05-TBR Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, YARN-41-4.patch, YARN-41.patch Instead of waiting for the NM expiry, the RM should remove and handle an NM that has been shut down gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2571) RM to support YARN registry
[ https://issues.apache.org/jira/browse/YARN-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2571: --- Labels: BB2015-05-TBR (was: ) RM to support YARN registry Key: YARN-2571 URL: https://issues.apache.org/jira/browse/YARN-2571 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Steve Loughran Assignee: Steve Loughran Labels: BB2015-05-TBR Attachments: YARN-2571-001.patch, YARN-2571-002.patch, YARN-2571-003.patch, YARN-2571-005.patch, YARN-2571-007.patch, YARN-2571-008.patch, YARN-2571-009.patch, YARN-2571-010.patch The RM needs to (optionally) integrate with the YARN registry: # startup: create the /services and /users paths with system ACLs (yarn, hdfs principals) # app-launch: create the user directory /users/$username with the relevant permissions (CRD) for them to create subnodes. # attempt, container, app completion: remove service records with the matching persistence and ID -- This message was sent by Atlassian JIRA (v6.3.4#6332)
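A minimal sketch of the startup step only (item 1 above: creating the root registry paths), using Apache Curator; the connection string, path names, and the open ACL are illustrative assumptions, and the real integration would set system ACLs for the yarn/hdfs principals instead.
{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.ZooDefs;

// Hypothetical bootstrap sketch, not the YARN-2571 patches.
public class RegistryBootstrapSketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework curator = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
    curator.start();
    // Create the root registry paths idempotently at RM startup.
    for (String path : new String[] {"/registry/services", "/registry/users"}) {
      if (curator.checkExists().forPath(path) == null) {
        curator.create().creatingParentsIfNeeded()
            .withACL(ZooDefs.Ids.OPEN_ACL_UNSAFE) // real code uses system ACLs
            .forPath(path);
      }
    }
    curator.close();
  }
}
{code}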
[jira] [Commented] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion.
[ https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529842#comment-14529842 ] Hadoop QA commented on YARN-3385: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 56s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 37s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 41s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 46s | The applied patch generated 1 new checkstyle issues (total was 42, now 43). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 35s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 15s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:red}-1{color} | yarn tests | 62m 58s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 99m 52s | | \\ \\ || Reason || Tests || | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730672/YARN-3385.003.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 90b3845 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7721/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7721/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7721/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7721/console | This message was automatically generated. Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion. --- Key: YARN-3385 URL: https://issues.apache.org/jira/browse/YARN-3385 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Labels: BB2015-05-TBR Attachments: YARN-3385.000.patch, YARN-3385.001.patch, YARN-3385.002.patch, YARN-3385.003.patch Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion (Op.delete). The race condition is similar to YARN-3023; since it exists for ZK node creation, it should also exist for ZK node deletion. We see this issue with the following stack trace: {code} 2015-03-17 19:18:58,958 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. 
Cause: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647) at
[jira] [Updated] (YARN-1426) YARN Components need to unregister their beans upon shutdown
[ https://issues.apache.org/jira/browse/YARN-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-1426: --- Labels: BB2015-05-TBR (was: ) YARN Components need to unregister their beans upon shutdown Key: YARN-1426 URL: https://issues.apache.org/jira/browse/YARN-1426 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 3.0.0, 2.3.0 Reporter: Jonathan Eagles Assignee: Jonathan Eagles Labels: BB2015-05-TBR Attachments: YARN-1426.patch, YARN-1426.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2421) CapacityScheduler still allocates containers to an app in the FINISHING state
[ https://issues.apache.org/jira/browse/YARN-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529488#comment-14529488 ] Craig Welch commented on YARN-2421: --- Hi [~lichangleo], thanks for working on this fix. Can you resolve the javac warning and run the TestRMContainerImpl test locally with the patch to verify the patch is not the cause? It seems to be persistently failing. CapacityScheduler still allocates containers to an app in the FINISHING state - Key: YARN-2421 URL: https://issues.apache.org/jira/browse/YARN-2421 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.1 Reporter: Thomas Graves Assignee: Chang Li Attachments: yarn2421.patch, yarn2421.patch, yarn2421.patch I saw an instance of a bad application master where it unregistered with the RM but then continued to call into allocate. The RMAppAttempt went to the FINISHING state, but the capacity scheduler kept allocating it containers. We should probably have the capacity scheduler check that the application isn't in one of the terminal states before giving it containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3581) Drop -directlyAccessNodeLabelStore in RMAdminCLI
[ https://issues.apache.org/jira/browse/YARN-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529552#comment-14529552 ] Allen Wittenauer commented on YARN-3581: Removing a command line option is an incompatible change. Drop -directlyAccessNodeLabelStore in RMAdminCLI Key: YARN-3581 URL: https://issues.apache.org/jira/browse/YARN-3581 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan In 2.6.0, we added an option called -directlyAccessNodeLabelStore so that the RM can start with label-configured queue settings. After YARN-2918, we don't need this option any more: an admin can configure queue settings, start the RM, and configure node labels via RMAdminCLI without any error. In addition, this option is very restrictive. First, it needs to run on the same node where the RM is running if the admin configured labels to be stored on local disk. Second, when an admin runs the option while the RM is running, multiple processes can write to the same file, which could make the node label store invalid. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3381) A typographical error in InvalidStateTransitonException
[ https://issues.apache.org/jira/browse/YARN-3381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529621#comment-14529621 ] Sidharta Seethana commented on YARN-3381: - [~brahmareddy], the latest version of the patch appears to be generated differently from the earlier versions (--no-prefix missing?). Could you please fix it? This would make it easier to compare patch versions. A typographical error in InvalidStateTransitonException - Key: YARN-3381 URL: https://issues.apache.org/jira/browse/YARN-3381 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Xiaoshuang LU Assignee: Brahma Reddy Battula Attachments: YARN-3381-002.patch, YARN-3381-003.patch, YARN-3381.patch It appears that InvalidStateTransitonException should be InvalidStateTransitionException; transition was misspelled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion.
[ https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529622#comment-14529622 ] Hadoop QA commented on YARN-3385: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 38s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 35s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 39s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 51s | The applied patch generated 1 new checkstyle issues (total was 42, now 43). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 16s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:red}-1{color} | yarn tests | 52m 28s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 89m 1s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.TestRMRestart | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12730639/YARN-3385.003.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 9809a16 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7715/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7715/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7715/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7715/console | This message was automatically generated. Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion. --- Key: YARN-3385 URL: https://issues.apache.org/jira/browse/YARN-3385 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3385.000.patch, YARN-3385.001.patch, YARN-3385.002.patch, YARN-3385.003.patch Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion (Op.delete). The race condition is similar to YARN-3023; since it exists for ZK node creation, it should also exist for ZK node deletion. We see this issue with the following stack trace: {code} 2015-03-17 19:18:58,958 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. 
Cause: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691) at
[jira] [Created] (YARN-3582) NPE in WebAppProxyServlet
Jian He created YARN-3582: - Summary: NPE in WebAppProxyServlet Key: YARN-3582 URL: https://issues.apache.org/jira/browse/YARN-3582 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He {code} HTTP ERROR 500 Problem accessing /proxy. Reason: INTERNAL_SERVER_ERROR Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:245) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3529) Add miniHBase cluster and Phoenix support to ATS v2 unit tests
[ https://issues.apache.org/jira/browse/YARN-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529689#comment-14529689 ] Li Lu commented on YARN-3529: - Thanks for the help from [~swagle] and [~vrushalic]! Now I have a patch that works for Phoenix locally, but it cannot be applied on the YARN-2928 branch because it's based on my YARN-3134 patch. I'm currently blocked on that patch, but once it is in, this JIRA will be patch available. Add miniHBase cluster and Phoenix support to ATS v2 unit tests -- Key: YARN-3529 URL: https://issues.apache.org/jira/browse/YARN-3529 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Li Lu Assignee: Li Lu Attachments: AbstractMiniHBaseClusterTest.java, output_minicluster2.txt After we have our HBase and Phoenix writer implementations, we may want to find a way to set up HBase and Phoenix in our unit tests. We need to do this integration before the branch gets merged back to trunk. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
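A minimal sketch of the mini-cluster setup direction this JIRA describes, using the public HBaseTestingUtility API; the Phoenix-side wiring is only hinted at in a comment, and the port lookup is an assumption about how a test would build its jdbc:phoenix URL, not the attached AbstractMiniHBaseClusterTest.java.
{code}
import org.apache.hadoop.hbase.HBaseTestingUtility;

// Hypothetical setup sketch, not the attached test harness.
public class MiniClusterSketch {
  public static void main(String[] args) throws Exception {
    HBaseTestingUtility util = new HBaseTestingUtility();
    util.startMiniCluster(); // in-process ZK, HMaster, and one RegionServer
    try {
      int zkPort = util.getZkCluster().getClientPort();
      System.out.println("Mini cluster ZK on port " + zkPort);
      // A test would connect Phoenix here: "jdbc:phoenix:localhost:" + zkPort
    } finally {
      util.shutdownMiniCluster();
    }
  }
}
{code}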