[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142404#comment-14142404 ] Wangda Tan commented on YARN-2056: -- Hi [~eepayne], Thanks for updating. I took a look at your latest patch; some comments:
bq. Even if a leaf queue could inherit its parent's disable preemption value, there will likely be cases where part of the parent queue's over-capacity resources are untouchable and part of them are preemptable.
I don't quite understand this. Is there any issue if we make disable-preemption of the parent queue only a default value, which its children can override?
bq. I collected untouchableExtra instead of preemptableExtra at the TempQueue level, in computeFixpointAllocation.
Makes sense to me.
With
bq. In TempQueue#offer, one of the calculations is current + pending - idealAssigned ... and TempQueue#offer would end up assigning more to that queue than it should.
bq. I looped through each queue, and if one has any untouchableExtra, then the queue's idealAssigned = guaranteed + untouchableExtra
a problem I can see is the following negative example:
{code}
root has A,B,C, total capacity = 90
A.guaranteed = 30, A.pending = 20, A.current = 40
B.guaranteed = 30, B.pending = 0, B.current = 50
C.guaranteed = 30, C.pending = 0, C.current = 0

initial:
A.idealAssigned = 40
B.idealAssigned = 0
unassigned = 50 (90 - 40)

1st round (A/B/C's normalized_guarantee = 1/3)
A.idealAssigned = 40 (unchanged because of the changed TempQueue#offer implementation; A will be removed)
B.idealAssigned = 17 (50/3)
C.idealAssigned = 0 (satisfied already, C will be removed)
unassigned = 33

2nd round (A/B's normalized_guarantee = 0.5)
B.idealAssigned = 17 + 33 = 50
unassigned = 0
{code}
However, since A and B have the same guarantee, they should both end up with ideal_assigned = 45, and A would then be able to preempt 5 from B. So in short, the problem is: a disable-preemption queue cannot preempt more resource than its current usage when some queues in the cluster don't consume their resource. Do you have any idea how to solve this problem? Thanks, Wangda Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt, YARN-2056.201409210049.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
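To make the negative example above easy to poke at, here is a small standalone sketch. It is my own simplification, not the actual ProportionalCapacityPreemptionPolicy code: the per-round share is split evenly rather than by normalized guarantees, and all class and method names are invented. It pins the preemption-disabled queue A at its current usage and drops it from the loop, and reaches the same fixpoint Wangda describes: A=40, B=50, C=0 instead of the fair 45/45.
{code}
// FixpointSketch.java - simulate the questioned behavior on Wangda's example.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FixpointSketch {
  static class TempQueue {
    final String name;
    double guaranteed, current, pending, idealAssigned;
    TempQueue(String n, double g, double c, double p) {
      name = n; guaranteed = g; current = c; pending = p;
    }
    // Simplified offer(): accept up to (current + pending - idealAssigned).
    double offer(double avail) {
      double accepted = Math.min(avail, Math.max(0, current + pending - idealAssigned));
      idealAssigned += accepted;
      return accepted;
    }
    boolean satisfied() { return current + pending - idealAssigned <= 1e-9; }
  }

  public static void main(String[] args) {
    TempQueue a = new TempQueue("A", 30, 40, 20);
    TempQueue b = new TempQueue("B", 30, 50, 0);
    TempQueue c = new TempQueue("C", 30, 0, 0);
    double total = 90;

    // A has preemption disabled: pre-assign its current usage and remove it
    // from the fixpoint loop -- the step being questioned above.
    a.idealAssigned = a.current;
    double unassigned = total - a.idealAssigned;

    List<TempQueue> qs = new ArrayList<>(Arrays.asList(b, c));
    while (unassigned > 1e-9 && !qs.isEmpty()) {
      double share = unassigned / qs.size(); // even split for brevity
      double accepted = 0;
      for (TempQueue q : qs) accepted += q.offer(share);
      unassigned -= accepted;
      qs.removeIf(TempQueue::satisfied);
      if (accepted <= 1e-9) break; // nobody accepted anything; stop
    }
    // Prints A=40 B=50 C=0: A never reclaims the 5 it should get from B.
    System.out.printf("A=%.0f B=%.0f C=%.0f%n",
        a.idealAssigned, b.idealAssigned, c.idealAssigned);
  }
}
{code}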
[jira] [Commented] (YARN-2554) Slider AM Web UI is inaccessible if HTTPS/SSL is specified as the HTTP policy
[ https://issues.apache.org/jira/browse/YARN-2554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142488#comment-14142488 ] Jonathan Maron commented on YARN-2554: -- Let me try to address each of these concerns:
- the key-store needs to be present on all machines to distribute certificates, since AMs may come up anywhere: the client trust store appears to be an already-defined keystore in the secure deployments I've seen so far.
- the key-store used by Hadoop daemons CANNOT be shared with AMs: AMs run user-code as the user.
- the key-store cannot be shared across AMs of different users: assuming I am running three different Slider apps as three different users, you don't want a single key-store instance accessible by all Slider AMs.
Your concern about sharing elements between daemon processes and users is correct in the context of Kerberos - principals are used for authentication and authorization, and it wouldn't be prudent to have those shared. When code is running within an AM, the identity that is established, and that governs its authorized interactions, is established via Kerberos. But SSL is a security offering targeted at the transport layer. Typically with SSL, for client C and server S, SSL only establishes that:
- C believes S is who it intended to contact
- C believes it has a secure connection to S
- S believes it has a secure connection to C
In addition, SSL typically ensures authentication of the server alone (the CN is generally the host name), so the trust between C and S is more along the lines of trusting that C is actually talking to host S. The session key for the RM (C) and AM (S), given the host certificate of the AM, will be unique - it is created by the RM, encrypted using the AM cert (i.e. the AM public key), and sent to the AM to be decrypted with the AM's private key. I guess I don't see the issue with Hadoop applications, whether daemon or AM, using the available Hadoop keystores. The transport-layer mechanism doesn't play a role in application-level authorization, since it has no role in establishing the user's subject etc. Having said all that, I may not be aware of some uses of the keystores in a Hadoop cluster (or perhaps my assumption about the availability of a client trust store is wrong). In addition, it'd be nice to get the perspective of some of the security folks on the role of SSL in the security layers provided in a secure cluster. Slider AM Web UI is inaccessible if HTTPS/SSL is specified as the HTTP policy - Key: YARN-2554 URL: https://issues.apache.org/jira/browse/YARN-2554 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.6.0 Reporter: Jonathan Maron Attachments: YARN-2554.1.patch, YARN-2554.2.patch, YARN-2554.3.patch, YARN-2554.3.patch If the HTTP policy to enable HTTPS is specified, the RM and AM are initialized with SSL listeners. The RM has a web app proxy servlet that acts as a proxy for incoming AM requests. In order to forward the requests to the AM, the proxy servlet makes use of HttpClient. However, the HttpClient utilized is not initialized correctly with the necessary certs to allow for successful one-way SSL invocations to the other nodes in the cluster (it is not configured to access/load the client truststore specified in ssl-client.xml). I imagine SSLFactory.createSSLSocketFactory() could be utilized to create an instance that can be assigned to the HttpClient.
The symptoms of this issue are: AM: Displays unknown_certificate exception RM: Displays an exception such as javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target -- This message was sent by Atlassian JIRA (v6.3.4#6332)
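For readers following the suggestion in the description, a minimal client-side sketch could look like the following. It uses Hadoop's real org.apache.hadoop.security.ssl.SSLFactory in CLIENT mode (which loads the truststore named in ssl-client.xml); the target URL is a placeholder, and plain HttpsURLConnection stands in for the proxy's HttpClient just to keep the example short - an actual fix would hand the socket factory to the HttpClient instance instead.
{code}
import java.net.URL;
import javax.net.ssl.HttpsURLConnection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.ssl.SSLFactory;

public class SslClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // CLIENT mode reads the ssl.client.truststore.* settings from ssl-client.xml
    SSLFactory sslFactory = new SSLFactory(SSLFactory.Mode.CLIENT, conf);
    sslFactory.init();
    try {
      URL amUrl = new URL("https://am-host.example.com:8090/"); // placeholder host
      HttpsURLConnection conn = (HttpsURLConnection) amUrl.openConnection();
      conn.setSSLSocketFactory(sslFactory.createSSLSocketFactory());
      conn.setHostnameVerifier(sslFactory.getHostnameVerifier());
      // with the cluster truststore loaded, no PKIX path-building failure here
      System.out.println("HTTP " + conn.getResponseCode());
    } finally {
      sslFactory.destroy();
    }
  }
}
{code}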
[jira] [Commented] (YARN-2498) [YARN-796] Respect labels in preemption policy of capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142513#comment-14142513 ] Sunil G commented on YARN-2498: --- Hi [~wangda] Great work, Wangda. Very good design to solve the issues. I have a few doubts though :)
1. bq. 2G resource in node1 with label x, and queueA has label y, so queueA cannot access node1
Consider a scenario where *node1* has more than 50% (say 60%) of cluster resources, and queueA is given 50% in CS. In that case, is there any chance of under-utilization?
2. When two or more nodes are merged into a single resource tree, and these nodes share some common labels, I can see a common total sum finally being stored per label in the consolidated tree. Here I feel we may need to split up the resource of a label at each node level. For example, labels *x* and *y* are needed for *queueA*, and labels *y* and *z* are needed for *queueB*. If label *y* is shared between two or more nodes, I am not sure whether it will cause some problem.
3. bq. (container.resource can be used by pendingResourceLeaf)
For preemption, we just calculate to match the totalResourceToPreempt from the over-utilized queues. But we don't check which node this container is from, which label it is under, and which queue that label belongs to. Do we need to do this check for each container?
[YARN-796] Respect labels in preemption policy of capacity scheduler Key: YARN-2498 URL: https://issues.apache.org/jira/browse/YARN-2498 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2498.patch, YARN-2498.patch, YARN-2498.patch, yarn-2498-implementation-notes.pdf There are 3 stages in ProportionalCapacityPreemptionPolicy: # Recursively calculate {{ideal_assigned}} for each queue. This depends on available resource, resource used/pending in each queue, and the guaranteed capacity of each queue. # Mark to-be-preempted containers: for each over-satisfied queue, it will mark some containers to be preempted. # Notify the scheduler about to-be-preempted containers. We need to respect labels in the cluster for both #1 and #2: For #1, when there is resource available in the cluster, we shouldn't assign it to a queue (by increasing {{ideal_assigned}}) if the queue cannot access such labels. For #2, when we make a decision about whether we need to preempt a container, we need to make sure the resource of this container is *possibly* usable by a queue which is under-satisfied and has pending resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
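The stage-#1 rule from the description (idle resource under a label a queue cannot access must not raise that queue's {{ideal_assigned}}) can be made concrete with a toy computation. This is a self-contained illustration with invented names (idleByLabel, accessibleIdle), not code from the attached patch:
{code}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class LabelAwareIdealAssigned {
  // Sum the idle resource (MB here) that a queue may actually be assigned.
  static long accessibleIdle(Map<String, Long> idleByLabel, Set<String> queueLabels) {
    long total = 0;
    for (Map.Entry<String, Long> e : idleByLabel.entrySet()) {
      // "" is the no-label partition, reachable by every queue
      if (e.getKey().isEmpty() || queueLabels.contains(e.getKey())) {
        total += e.getValue();
      }
    }
    return total;
  }

  public static void main(String[] args) {
    Map<String, Long> idleByLabel = new HashMap<>();
    idleByLabel.put("x", 2048L); // e.g. 2G idle on node1, labelled x
    idleByLabel.put("", 1024L);  // unlabelled idle resource
    // queueA only has label y, so the 2G under x must not grow its ideal_assigned
    System.out.println(accessibleIdle(idleByLabel, Collections.singleton("y"))); // 1024
  }
}
{code}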
[jira] [Updated] (YARN-2452) TestRMApplicationHistoryWriter fails with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2452: --- Summary: TestRMApplicationHistoryWriter fails with FairScheduler (was: TestRMApplicationHistoryWriter is failed for FairScheduler) TestRMApplicationHistoryWriter fails with FairScheduler --- Key: YARN-2452 URL: https://issues.apache.org/jira/browse/YARN-2452 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2452.000.patch, YARN-2452.001.patch, YARN-2452.002.patch TestRMApplicationHistoryWriter fails with FairScheduler. The failure is the following: T E S T S --- Running org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 69.311 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter testRMWritingMassiveHistory(org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter) Time elapsed: 66.261 sec FAILURE! java.lang.AssertionError: expected:<1> but was:<200> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:430) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:391) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2452) TestRMApplicationHistoryWriter is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142667#comment-14142667 ] Karthik Kambatla commented on YARN-2452: Patch looks reasonable to me, can't think of an easier way to get the test to work with FS. +1. Committing this. TestRMApplicationHistoryWriter is failed for FairScheduler -- Key: YARN-2452 URL: https://issues.apache.org/jira/browse/YARN-2452 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2452.000.patch, YARN-2452.001.patch, YARN-2452.002.patch TestRMApplicationHistoryWriter fails with FairScheduler. The failure is the following: T E S T S --- Running org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 69.311 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter testRMWritingMassiveHistory(org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter) Time elapsed: 66.261 sec FAILURE! java.lang.AssertionError: expected:<1> but was:<200> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:430) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:391) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2252) Intermittent failure for testcase TestFairScheduler.testContinuousScheduling
[ https://issues.apache.org/jira/browse/YARN-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2252: --- Attachment: yarn-2252-2.patch Here is a patch that modifies Ratandeep's patch. Intermittent failure for testcase TestFairScheduler.testContinuousScheduling Key: YARN-2252 URL: https://issues.apache.org/jira/browse/YARN-2252 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: trunk-win Reporter: Ratandeep Ratti Labels: hadoop2, scheduler, yarn Attachments: YARN-2252-1.patch, yarn-2252-2.patch This test-case is failing sporadically on my machine. I think I have a plausible explanation for this. It seems that when the Scheduler is being asked for resources, the resource requests being constructed have no preference for hosts (nodes). The two mock hosts constructed both have 8192 MB of memory. The containers (resources) being requested each require 1024 MB of memory, hence a single node can execute both resource requests for the application. At the end of the test-case it is asserted that the containers (resource requests) be executed on different nodes, but since we haven't specified any node preferences when requesting the resources, the scheduler (at times) executes both containers (requests) on the same node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
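As an aside on the diagnosis in the description: if a test really needs the two containers on distinct nodes, the requests have to carry a host preference. Here is a hedged sketch using YARN's public ResourceRequest API - it illustrates the point, it is not the attached patch, and the host names are stand-ins for the test's mock NM addresses:
{code}
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class LocalityRequestSketch {
  // One 1024 MB container pinned to the given host, with no fallback to rack/ANY.
  static ResourceRequest onHost(String host) {
    ResourceRequest req = ResourceRequest.newInstance(
        Priority.newInstance(1), host, Resource.newInstance(1024, 1), 1);
    req.setRelaxLocality(false);
    return req;
  }

  public static void main(String[] args) {
    // Two host-specific asks, instead of two ANY requests that one
    // 8192 MB node is free to satisfy on its own.
    List<ResourceRequest> asks =
        Arrays.asList(onHost("127.0.0.1"), onHost("127.0.0.2"));
    System.out.println(asks);
  }
}
{code}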
[jira] [Commented] (YARN-2453) TestProportionalCapacityPreemptionPolicy is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142684#comment-14142684 ] Karthik Kambatla commented on YARN-2453: As of now, CapacityScheduler is the only scheduler that implements PreemptableResourceScheduler. How about we set the scheduler explicitly in the setup method? The test should just reuse the configuration instead of creating another instance. TestProportionalCapacityPreemptionPolicy is failed for FairScheduler Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2453.000.patch TestProportionalCapacityPreemptionPolicy fails with FairScheduler. The following is the error message: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for the capacity scheduler, because the following source code in ResourceManager.java proves it will only work for the capacity scheduler. {code} if (scheduler instanceof PreemptableResourceScheduler && conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) { {code} Because CapacityScheduler is an instance of PreemptableResourceScheduler and FairScheduler is not. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2252) Intermittent failure for testcase TestFairScheduler.testContinuousScheduling
[ https://issues.apache.org/jira/browse/YARN-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142718#comment-14142718 ] Hadoop QA commented on YARN-2252: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670318/yarn-2252-2.patch against trunk revision c50fc92. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart org.apache.hadoop.yarn.server.resourcemanager.security.TestRMDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA org.apache.hadoop.yarn.server.resourcemanager.TestApplicationCleanup org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5066//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5066//console This message is automatically generated. Intermittent failure for testcase TestFairScheduler.testContinuousScheduling Key: YARN-2252 URL: https://issues.apache.org/jira/browse/YARN-2252 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: trunk-win Reporter: Ratandeep Ratti Labels: hadoop2, scheduler, yarn Attachments: YARN-2252-1.patch, yarn-2252-2.patch This test-case is failing sporadically on my machine. I think I have a plausible explanation for this. It seems that when the Scheduler is being asked for resources, the resource requests being constructed have no preference for hosts (nodes). The two mock hosts constructed both have 8192 MB of memory. The containers (resources) being requested each require 1024 MB of memory, hence a single node can execute both resource requests for the application. At the end of the test-case it is asserted that the containers (resource requests) be executed on different nodes, but since we haven't specified any node preferences when requesting the resources, the scheduler (at times) executes both containers (requests) on the same node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2252) Intermittent failure for testcase TestFairScheduler.testContinuousScheduling
[ https://issues.apache.org/jira/browse/YARN-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142720#comment-14142720 ] Karthik Kambatla commented on YARN-2252: The patch touches only TestFairScheduler, and couldn't have caused all these test failures. Intermittent failure for testcase TestFairScheduler.testContinuousScheduling Key: YARN-2252 URL: https://issues.apache.org/jira/browse/YARN-2252 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: trunk-win Reporter: Ratandeep Ratti Labels: hadoop2, scheduler, yarn Attachments: YARN-2252-1.patch, yarn-2252-2.patch This test-case is failing sporadically on my machine. I think I have a plausible explanation for this. It seems that when the Scheduler is being asked for resources, the resource requests being constructed have no preference for hosts (nodes). The two mock hosts constructed both have 8192 MB of memory. The containers (resources) being requested each require 1024 MB of memory, hence a single node can execute both resource requests for the application. At the end of the test-case it is asserted that the containers (resource requests) be executed on different nodes, but since we haven't specified any node preferences when requesting the resources, the scheduler (at times) executes both containers (requests) on the same node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2252) Intermittent failure for testcase TestFairScheduler.testContinuousScheduling
[ https://issues.apache.org/jira/browse/YARN-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142745#comment-14142745 ] Wei Yan commented on YARN-2252: --- Patch looks good to me. Intermittent failure for testcase TestFairScheduler.testContinuousScheduling Key: YARN-2252 URL: https://issues.apache.org/jira/browse/YARN-2252 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: trunk-win Reporter: Ratandeep Ratti Labels: hadoop2, scheduler, yarn Attachments: YARN-2252-1.patch, yarn-2252-2.patch This test-case is failing sporadically on my machine. I think I have a plausible explanation for this. It seems that when the Scheduler is being asked for resources, the resource requests being constructed have no preference for hosts (nodes). The two mock hosts constructed both have 8192 MB of memory. The containers (resources) being requested each require 1024 MB of memory, hence a single node can execute both resource requests for the application. At the end of the test-case it is asserted that the containers (resource requests) be executed on different nodes, but since we haven't specified any node preferences when requesting the resources, the scheduler (at times) executes both containers (requests) on the same node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2453) TestProportionalCapacityPreemptionPolicy is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2453: Attachment: YARN-2453.001.patch TestProportionalCapacityPreemptionPolicy is failed for FairScheduler Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2453.000.patch, YARN-2453.001.patch TestProportionalCapacityPreemptionPolicy fails with FairScheduler. The following is the error message: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for the capacity scheduler, because the following source code in ResourceManager.java proves it will only work for the capacity scheduler. {code} if (scheduler instanceof PreemptableResourceScheduler && conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) { {code} Because CapacityScheduler is an instance of PreemptableResourceScheduler and FairScheduler is not. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2453) TestProportionalCapacityPreemptionPolicy is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2453: Attachment: (was: YARN-2453.001.patch) TestProportionalCapacityPreemptionPolicy is failed for FairScheduler Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2453.000.patch TestProportionalCapacityPreemptionPolicy fails with FairScheduler. The following is the error message: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for the capacity scheduler, because the following source code in ResourceManager.java proves it will only work for the capacity scheduler. {code} if (scheduler instanceof PreemptableResourceScheduler && conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) { {code} Because CapacityScheduler is an instance of PreemptableResourceScheduler and FairScheduler is not. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2453) TestProportionalCapacityPreemptionPolicy is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2453: Attachment: YARN-2453.001.patch TestProportionalCapacityPreemptionPolicy is failed for FairScheduler Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2453.000.patch, YARN-2453.001.patch TestProportionalCapacityPreemptionPolicy fails with FairScheduler. The following is the error message: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for the capacity scheduler, because the following source code in ResourceManager.java proves it will only work for the capacity scheduler. {code} if (scheduler instanceof PreemptableResourceScheduler && conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) { {code} Because CapacityScheduler is an instance of PreemptableResourceScheduler and FairScheduler is not. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2453) TestProportionalCapacityPreemptionPolicy is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142758#comment-14142758 ] zhihai xu commented on YARN-2453: - Hi [~kasha], Your suggestion is good. I made the change to set CapacityScheduler as the scheduler for this test. I attached a new patch, YARN-2453.001.patch; please review it. Thanks. TestProportionalCapacityPreemptionPolicy is failed for FairScheduler Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2453.000.patch, YARN-2453.001.patch TestProportionalCapacityPreemptionPolicy fails with FairScheduler. The following is the error message: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for the capacity scheduler, because the following source code in ResourceManager.java proves it will only work for the capacity scheduler. {code} if (scheduler instanceof PreemptableResourceScheduler && conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) { {code} Because CapacityScheduler is an instance of PreemptableResourceScheduler and FairScheduler is not. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2453) TestProportionalCapacityPreemptionPolicy is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142768#comment-14142768 ] Karthik Kambatla commented on YARN-2453: Thanks for the quick update, Zhihai. Should we set the scheduler to CapacityScheduler in the setup method so the entire class can use it? TestProportionalCapacityPreemptionPolicy is failed for FairScheduler Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2453.000.patch, YARN-2453.001.patch TestProportionalCapacityPreemptionPolicy fails with FairScheduler. The following is the error message: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for the capacity scheduler, because the following source code in ResourceManager.java proves it will only work for the capacity scheduler. {code} if (scheduler instanceof PreemptableResourceScheduler && conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) { {code} Because CapacityScheduler is an instance of PreemptableResourceScheduler and FairScheduler is not. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
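A sketch of what that setup change could look like - a fragment of the test class, not the exact patch; the monitor flag line is only needed by the test that exercises the SchedulingMonitor:
{code}
// In TestProportionalCapacityPreemptionPolicy:
@Before
public void setup() {
  conf = new Configuration(false);
  // Pin the scheduler so the test no longer depends on whichever default
  // scheduler (e.g. FairScheduler) the build is configured with.
  conf.setClass(YarnConfiguration.RM_SCHEDULER,
      CapacityScheduler.class, ResourceScheduler.class);
  conf.setBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);
  // ... existing setup continues unchanged ...
}
{code}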
[jira] [Commented] (YARN-2190) Provide a Windows container executor that can limit memory and CPU
[ https://issues.apache.org/jira/browse/YARN-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142770#comment-14142770 ] Chuan Liu commented on YARN-2190: - [~vvasudev], thanks for the review! Please see my answers below.
bq. 1. What is the behaviour of a process that tries to exceed the allocated memory? Will it start swapping or will it be killed?
The OS will limit the process's memory to the specified value. After the limit is reached, further memory-allocation API calls, e.g. {{malloc()}}, will get an error. What the process does then depends on the runtime: in the JVM, the application will get an OOM error and terminate; in C#, I was told the CLR can detect memory pressure. I think swapping is an orthogonal OS feature and not related to this.
bq. 2. Your code assumes a 1-1 mapping of physical cores to vcores. This assumption is/will be problematic, especially in heterogeneous clusters. You're better off using the ratio of (container-vcores/node-vcores) to determine cpu limits.
The implementation follows the existing {{org.apache.hadoop.yarn.api.records.Resource}}. The existing doc says "A node's capacity should be configured with virtual cores equal to its number of physical cores." (Source: http://hadoop.apache.org/docs/current/api/index.html) I checked the existing {{CgroupsLCEResourcesHandler}} implementation, which also does not use node-vcores, i.e. {{yarn.nodemanager.resource.cpu-vcores}}, to determine CPU shares.
bq. 3. Can you explain why you are modifying DefaultContainerExecutor? You've added a method for the old signature in ContainerExecutor.
I need an additional {{Resource}} parameter, which does not exist in the existing {{getRunCommand()}} method parameters. I added a method with the old signature to maintain backward compatibility, as this is an abstract class and there are other child implementations out there.
bq. 4. Can you modify the comments/usage to specify the units of memory(bytes, MB, GB)?
This is already documented in the existing patch, which I have cited as follows. Where exactly do you want to see more comments?
{noformat}
+ OPTIONS: -c [cores] set virtual core limits on the job object.\n\
+ -m [memory] set the memory limit on the job object.\n\
+ The core limit is an integral value of number of cores. The memory\n\
+ limit is an integral number of memory in MB. The definition\n\
+ follows the org.apache.hadoop.yarn.api.records.Resource model.\n\
+ The limit will not be set if 0 or negative value is passed in as\n\
+ parameter(s).\n\
{noformat}
Provide a Windows container executor that can limit memory and CPU -- Key: YARN-2190 URL: https://issues.apache.org/jira/browse/YARN-2190 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager Reporter: Chuan Liu Assignee: Chuan Liu Attachments: YARN-2190-prototype.patch, YARN-2190.1.patch, YARN-2190.2.patch, YARN-2190.3.patch, YARN-2190.4.patch, YARN-2190.5.patch Yarn's default container executor on Windows does not currently set resource limits on the containers. The memory limit is enforced by a separate monitoring thread. The container implementation on Windows uses Job Objects right now. The latest Windows (8 or later) API allows CPU and memory limits on job objects. We want to create a Windows container executor that sets the limits on job objects, thus providing resource enforcement at the OS level. http://msdn.microsoft.com/en-us/library/windows/desktop/ms686216(v=vs.85).aspx -- This message was sent by Atlassian JIRA (v6.3.4#6332)
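To make the vcore-ratio suggestion in review point 2 concrete: Windows job objects cap CPU via JOBOBJECT_CPU_RATE_CONTROL_INFORMATION, whose CpuRate is expressed in cycles per 10,000 (so 10000 means 100% of the node). Below is my arithmetic sketch of the reviewer's proposal, not code from the patch, showing how a rate could be derived from (container-vcores / node-vcores):
{code}
public class CpuRateSketch {
  // Job-object CPU rate in 1/10000ths: 10000 means all of the node's cycles.
  static int cpuRate(int containerVcores, int nodeVcores) {
    return Math.max(1, (int) (10000L * containerVcores / nodeVcores));
  }

  public static void main(String[] args) {
    System.out.println(cpuRate(2, 8)); // 2500 -> 25% of the node
    System.out.println(cpuRate(4, 4)); // 10000 -> the whole node
  }
}
{code}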
[jira] [Commented] (YARN-2453) TestProportionalCapacityPreemptionPolicy is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142774#comment-14142774 ] Hadoop QA commented on YARN-2453: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670327/YARN-2453.001.patch against trunk revision c50fc92. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart org.apache.hadoop.yarn.server.resourcemanager.security.TestRMDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA org.apache.hadoop.yarn.server.resourcemanager.TestApplicationCleanup org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5067//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5067//console This message is automatically generated. TestProportionalCapacityPreemptionPolicy is failed for FairScheduler Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2453.000.patch, YARN-2453.001.patch TestProportionalCapacityPreemptionPolicy fails with FairScheduler. The following is the error message: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for the capacity scheduler, because the following source code in ResourceManager.java proves it will only work for the capacity scheduler.
{code} if (scheduler instanceof PreemptableResourceScheduler && conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) { {code} Because CapacityScheduler is an instance of PreemptableResourceScheduler and FairScheduler is not. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2453) TestProportionalCapacityPreemptionPolicy is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142775#comment-14142775 ] Hadoop QA commented on YARN-2453: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670328/YARN-2453.001.patch against trunk revision c50fc92. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage org.apache.hadoop.yarn.server.resourcemanager.security.TestRMDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA org.apache.hadoop.yarn.server.resourcemanager.TestApplicationCleanup {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5068//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5068//console This message is automatically generated. TestProportionalCapacityPreemptionPolicy is failed for FairScheduler Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2453.000.patch, YARN-2453.001.patch TestProportionalCapacityPreemptionPolicy fails with FairScheduler. The following is the error message: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for the capacity scheduler, because the following source code in ResourceManager.java proves it will only work for the capacity scheduler.
{code} if (scheduler instanceof PreemptableResourceScheduler && conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) { {code} Because CapacityScheduler is an instance of PreemptableResourceScheduler and FairScheduler is not. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2453) TestProportionalCapacityPreemptionPolicy is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2453: Attachment: YARN-2453.002.patch TestProportionalCapacityPreemptionPolicy is failed for FairScheduler Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2453.000.patch, YARN-2453.001.patch, YARN-2453.002.patch TestProportionalCapacityPreemptionPolicy fails with FairScheduler. The following is the error message: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for the capacity scheduler, because the following source code in ResourceManager.java proves it will only work for the capacity scheduler. {code} if (scheduler instanceof PreemptableResourceScheduler && conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) { {code} Because CapacityScheduler is an instance of PreemptableResourceScheduler and FairScheduler is not. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2453) TestProportionalCapacityPreemptionPolicy is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142826#comment-14142826 ] zhihai xu commented on YARN-2453: - Hi [~Karthik Kambatla], I moved all the configuration to setup in YARN-2453.002.patch. Thanks for the quick response. TestProportionalCapacityPreemptionPolicy is failed for FairScheduler Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2453.000.patch, YARN-2453.001.patch, YARN-2453.002.patch TestProportionalCapacityPreemptionPolicy fails with FairScheduler. The following is the error message: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for the capacity scheduler, because the following source code in ResourceManager.java proves it will only work for the capacity scheduler. {code} if (scheduler instanceof PreemptableResourceScheduler && conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) { {code} Because CapacityScheduler is an instance of PreemptableResourceScheduler and FairScheduler is not. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2453) TestProportionalCapacityPreemptionPolicy is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142836#comment-14142836 ] Hadoop QA commented on YARN-2453: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670339/YARN-2453.002.patch against trunk revision c50fc92. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart org.apache.hadoop.yarn.server.resourcemanager.security.TestRMDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA org.apache.hadoop.yarn.server.resourcemanager.TestApplicationCleanup org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5069//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5069//console This message is automatically generated. TestProportionalCapacityPreemptionPolicy is failed for FairScheduler Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2453.000.patch, YARN-2453.001.patch, YARN-2453.002.patch TestProportionalCapacityPreemptionPolicy fails with FairScheduler. The following is the error message: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for the capacity scheduler, because the following source code in ResourceManager.java proves it will only work for the capacity scheduler.
{code} if (scheduler instanceof PreemptableResourceScheduler && conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) { {code} Because CapacityScheduler is an instance of PreemptableResourceScheduler and FairScheduler is not. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142875#comment-14142875 ] Vinod Kumar Vavilapalli commented on YARN-913: -- I started looking at this. I have done my fair share of reviews in my life, but this is a monster patch. Can we please split this up to have a realistic timeline for commit to 2.6? I propose the following sub-tasks:
- The addition of the registry as a stand-alone module, which in itself is massive and can be further split: server-side abstraction, ZK-based server-side implementation, client side, web-services, etc.
- Integration of the registry into YARN
- Changing YARN apps like distributed-shell
- Documentation
I just applied the massive patch against latest trunk and am digging through it. But if you agree, we should file sub-tasks and divide up the patch. Thanks. Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up - or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to - and not any others in the cluster. Some kind of service registry (in the RM, in ZK) could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2460) Remove obsolete entries from yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-2460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142898#comment-14142898 ] Ray Chiang commented on YARN-2460: -- Thanks for the help committing. Remove obsolete entries from yarn-default.xml - Key: YARN-2460 URL: https://issues.apache.org/jira/browse/YARN-2460 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.0 Reporter: Ray Chiang Assignee: Ray Chiang Priority: Minor Labels: newbie Fix For: 2.6.0 Attachments: YARN-2460-01.patch, YARN-2460-02.patch The following properties are defined in yarn-default.xml, but do not exist in YarnConfiguration. mapreduce.job.hdfs-servers mapreduce.job.jar yarn.ipc.exception.factory.class yarn.ipc.serializer.type yarn.nodemanager.aux-services.mapreduce_shuffle.class yarn.nodemanager.hostname yarn.nodemanager.resourcemanager.connect.retry_interval.secs yarn.nodemanager.resourcemanager.connect.wait.secs yarn.resourcemanager.amliveliness-monitor.interval-ms yarn.resourcemanager.application-tokens.master-key-rolling-interval-secs yarn.resourcemanager.container.liveness-monitor.interval-ms yarn.resourcemanager.nm.liveness-monitor.interval-ms yarn.timeline-service.hostname yarn.timeline-service.http-authentication.simple.anonymous.allowed yarn.timeline-service.http-authentication.type Presumably, the mapreduce.* properties are okay. Similarly, the yarn.timeline-service.* properties are for the future TimelineService. However, the rest are likely fully deprecated. Submitting bug for comment/feedback about which other properties should be kept in yarn-default.xml. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2578) NM does not failover timely if RM node network connection fails
Wilfred Spiegelenburg created YARN-2578: --- Summary: NM does not failover timely if RM node network connection fails Key: YARN-2578 URL: https://issues.apache.org/jira/browse/YARN-2578 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Wilfred Spiegelenburg The NM does not fail over correctly when the network cable of the RM is unplugged, or the failure is simulated by a "service network stop" or a firewall that drops all traffic on the node. The RM fails over to the standby node when the failure is detected, as expected. The NM should then re-register with the new active RM. This re-registration takes a long time (15 minutes or more). Until then the cluster has no nodes for processing and applications are stuck. Reproduction test case which can be used in any environment:
- create a cluster with 3 nodes:
node 1: ZK, NN, JN, ZKFC, DN, RM, NM
node 2: ZK, NN, JN, ZKFC, DN, RM, NM
node 3: ZK, JN, DN, NM
- start all services and make sure they are in good health
- kill the network connection of the active RM using one of the network kills from above
- observe the NN and RM fail over
- the DNs fail over to the new active NN
- the NM does not recover for a long time
- the logs show a long delay and traces show no change at all
The stack traces of the NM all show the same set of threads. The main thread which should be used in the re-register is the "Node Status Updater". This thread is stuck in:
{code}
"Node Status Updater" prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in Object.wait() [0x7f5a51fc1000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
    at java.lang.Object.wait(Object.java:503)
    at org.apache.hadoop.ipc.Client.call(Client.java:1395)
    - locked <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
    at org.apache.hadoop.ipc.Client.call(Client.java:1362)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source)
    at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
{code}
The client connection which goes through the proxy can be traced back to the ResourceTrackerPBClientImpl. The generated proxy does not time out, and we should be using a version which takes the RPC timeout (from the configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated YARN-2578: Attachment: YARN-2578.patch Attached a patch for initial review. In the patch the proxy uses the RPC timeout that is set in the configuration to time out the connection. NM does not failover timely if RM node network connection fails --- Key: YARN-2578 URL: https://issues.apache.org/jira/browse/YARN-2578 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Wilfred Spiegelenburg Attachments: YARN-2578.patch The NM does not fail over correctly when the network cable of the RM is unplugged, or the failure is simulated by a "service network stop" or a firewall that drops all traffic on the node. The RM fails over to the standby node when the failure is detected, as expected. The NM should then re-register with the new active RM. This re-registration takes a long time (15 minutes or more). Until then the cluster has no nodes for processing and applications are stuck. Reproduction test case which can be used in any environment:
- create a cluster with 3 nodes:
node 1: ZK, NN, JN, ZKFC, DN, RM, NM
node 2: ZK, NN, JN, ZKFC, DN, RM, NM
node 3: ZK, JN, DN, NM
- start all services and make sure they are in good health
- kill the network connection of the active RM using one of the network kills from above
- observe the NN and RM fail over
- the DNs fail over to the new active NN
- the NM does not recover for a long time
- the logs show a long delay and traces show no change at all
The stack traces of the NM all show the same set of threads. The main thread which should be used in the re-register is the "Node Status Updater". This thread is stuck in:
{code}
"Node Status Updater" prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in Object.wait() [0x7f5a51fc1000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
    at java.lang.Object.wait(Object.java:503)
    at org.apache.hadoop.ipc.Client.call(Client.java:1395)
    - locked <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
    at org.apache.hadoop.ipc.Client.call(Client.java:1362)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source)
    at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
{code}
The client connection which goes through the proxy can be traced back to the ResourceTrackerPBClientImpl. The generated proxy does not time out, and we should be using a version which takes the RPC timeout (from the configuration) as a parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
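For readers following along, the direction the description points at can be sketched as below. The RPC.getProxy overload with an rpcTimeout parameter is the real Hadoop IPC API being alluded to; the config key and the helper name are my assumptions for illustration, not a quote of the attached patch:
{code}
import java.net.InetSocketAddress;
import javax.net.SocketFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.net.NetUtils;
import org.apache.hadoop.security.UserGroupInformation;

public class TimeoutProxySketch {
  // Create a protocol proxy with an explicit RPC timeout, so a dead RM node
  // fails the nodeHeartbeat call instead of blocking it indefinitely.
  public static <T> T proxyWithTimeout(Class<T> protocol, long version,
      InetSocketAddress addr, Configuration conf) throws Exception {
    int rpcTimeout = conf.getInt("ipc.client.rpc.timeout.ms", 60_000); // assumed key
    SocketFactory factory = NetUtils.getDefaultSocketFactory(conf);
    // The RPC.getProxy overload that takes rpcTimeout, instead of the
    // timeout-less variant the generated ResourceTracker proxy uses today.
    return RPC.getProxy(protocol, version, addr,
        UserGroupInformation.getCurrentUser(), conf, factory, rpcTimeout);
  }
}
{code}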