[jira] [Commented] (YARN-10767) Yarn Logs Command retrying on Standby RM for 30 times
[ https://issues.apache.org/jira/browse/YARN-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363387#comment-17363387 ] Hadoop QA commented on YARN-10767: --
(x) *-1 overall*
|| Vote || Subsystem || Runtime || Logfile || Comment ||
| 0 | reexec | 1m 44s | | Docker mode activated. |
|| || || || *Prechecks* ||
| +1 | dupname | 0m 0s | | No case conflicting files found. |
| +1 | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || *trunk Compile Tests* ||
| +1 | mvninstall | 23m 30s | | trunk passed |
| +1 | compile | 0m 48s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 | compile | 0m 42s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 | checkstyle | 0m 29s | | trunk passed |
| +1 | mvnsite | 0m 45s | | trunk passed |
| +1 | shadedclient | 16m 41s | | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 44s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 | javadoc | 0m 41s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| 0 | spotbugs | 19m 43s | | Both FindBugs and SpotBugs are enabled, using SpotBugs. |
| +1 | spotbugs | 1m 38s | | trunk passed |
|| || || || *Patch Compile Tests* ||
| +1 | mvninstall | 0m 37s | | the patch passed |
| +1 | compile | 0m 41s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 | javac | 0m 41s | | the patch passed |
| +1 | compile | 0m 37s | | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 | javac | 0m 37s | | the patch passed |
| +1 | checkstyle | 0m 24s | | the patch passed |
| +1 | mvnsite | 0m 37s | | the patch passed |
| +1 | whitespace | 0m 0s | | The patch has no whitespace issues. |
| +1 | shadedclient | 16m 8s | | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 39s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
[jira] [Updated] (YARN-10767) Yarn Logs Command retrying on Standby RM for 30 times
[ https://issues.apache.org/jira/browse/YARN-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] D M Murali Krishna Reddy updated YARN-10767: Attachment: YARN-10767.004.patch > Yarn Logs Command retrying on Standby RM for 30 times > - > > Key: YARN-10767 > URL: https://issues.apache.org/jira/browse/YARN-10767 > Project: Hadoop YARN > Issue Type: Bug >Reporter: D M Murali Krishna Reddy >Assignee: D M Murali Krishna Reddy >Priority: Major > Attachments: YARN-10767.001.patch, YARN-10767.002.patch, > YARN-10767.003.patch, YARN-10767.004.patch > > > When ResourceManager HA is enabled and the first RM is unavailable, on > executing "bin/yarn logs -applicationId -am 1", we get a > ConnectionException for connecting to the first RM; the ConnectionException > occurs 30 times before it tries to connect to the second RM. > > This can be optimized by trying to fetch the logs from the Active RM. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10789: Attachment: YARN-10789.branch-3.3.001.patch YARN-10789.branch-3.2.001.patch > RM HA startup can fail due to race conditions in ZKConfigurationStore > - > > Key: YARN-10789 > URL: https://issues.apache.org/jira/browse/YARN-10789 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-10789.001.patch, YARN-10789.002.patch, > YARN-10789.branch-3.2.001.patch, YARN-10789.branch-3.3.001.patch > > > We are observing the below error randomly during Hadoop install and initial RM > startup when HA is enabled and yarn.scheduler.configuration.store.class=zk is > configured. This causes one of the RMs to not start up. > {code:java} > 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: > Service RMActiveServices failed in state INITED > org.apache.hadoop.service.ServiceStateException: java.io.IOException: > org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = > NodeExists for /confstore/CONF_STORE > {code} > We are trying to create the znode /confstore/CONF_STORE when we initialize > the ZKConfigurationStore. But the problem is that the ZKConfigurationStore is > initialized when CapacityScheduler does a serviceInit. This serviceInit is > done by both the Active and Standby RM. So we can run into a race condition when > both Active and Standby try to create the same znode when both RMs are started > at the same time. > ZKRMStateStore on the other hand avoids such race conditions by creating the > znodes only after serviceStart. serviceStart only happens for the active RM > which won the leader election, unlike serviceInit which happens irrespective > of leader election. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
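For readers skimming the thread: the usual idempotent-create pattern that sidesteps this kind of race looks roughly like the sketch below (assumed class and method names, plain ZooKeeper API; per the description, the approach taken by ZKRMStateStore instead defers znode creation to serviceStart of the elected Active RM, which avoids the race entirely rather than tolerating it).
{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

final class ConfStoreZnodeSketch {
  /**
   * Create the znode if missing, treating a concurrent create by the
   * other RM (NodeExistsException) as success, so the Active and the
   * Standby RM can both run this safely during serviceInit.
   */
  static void ensureZnode(ZooKeeper zk, String path, byte[] data)
      throws KeeperException, InterruptedException {
    try {
      zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE,
          CreateMode.PERSISTENT);
    } catch (KeeperException.NodeExistsException e) {
      // The other RM won the race; the znode exists, which is all we need.
    }
  }
}
{code}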
[jira] [Commented] (YARN-10802) Change Capacity Scheduler minimum-user-limit-percent to accept decimal values
[ https://issues.apache.org/jira/browse/YARN-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363215#comment-17363215 ] Eric Payne commented on YARN-10802: --- [~bteke], Thanks for raising this issue and for working on it. I have a question and an observation.
{quote}Capacity Scheduler's minimum-user-limit-percent only accepts integers, which means at most 100 users can use a single queue fairly
{quote}
This isn't exactly accurate. Minimum user limit percent is only enforced when a queue's max capacity is reached _AND_ (100 / {{min-user-limit-pct}}) users are both using resources and asking for more resources. As long as the queue's max capacity is not reached _AND_ there are more resources available in the system, the 101st, 102nd, 103rd, etc., users will be assigned resources. So, my question is, do you have a use case where
1. 100 users are using up the max capacity in the queue
2. All 100 users are active (that is, requesting more resources)
3. The 101st user comes in and is starved because, as containers are released, they are assigned to one of the first 100 (again, because they are all asking for resources)?
We have several very-heavily-used multi-tenant queues that often have 100 or more users running, but only a subset of them are actively requesting resources. My observation is that when we have set the min-user-limit-pct to be 1 in a very highly used multi-tenant queue, the user limit grows way too slowly. The min-user-limit-pct is used in calculating the user limit (seen as "Max Resources" in the queue's pull-down menu in the RM GUI). When the queue grows above its capacity but is still below its max capacity, the calculations for user limit in {{UsersManager#computeUserLimit}} use the min-user-limit-pct to limit how fast the user limit can grow. The smaller the min-user-limit-pct is, the slower it grows. What ends up happening is that a few users want to grow larger, but several smaller users come in, request resources, and leave without ever reaching the current user limit. This process repeats because there are several new active users all the time, so the longer-running, larger users can't grow beyond a certain limit even though there are still available queue and cluster resources. > Change Capacity Scheduler minimum-user-limit-percent to accept decimal values > - > > Key: YARN-10802 > URL: https://issues.apache.org/jira/browse/YARN-10802 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-10802.001.patch, YARN-10802.002.patch, > YARN-10802.003.patch, YARN-10802.004.patch > > > Capacity Scheduler's minimum-user-limit-percent only accepts integers, which > means at most 100 users can use a single queue fairly. Using decimal values > could solve this problem. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
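For readers following the math: a grossly simplified version of the limit Eric describes (the real {{UsersManager#computeUserLimit}} also handles user weights, partitions and resource-vector rounding; all names below are illustrative, not the actual fields).
{code:java}
public class MinUserLimitExample {
  /** max(even split of used resources, MULP share of current capacity). */
  static long userLimitMb(long usedMb, int activeUsers, long capacityMb,
      float minUserLimitPct) {
    long evenSplit = (long) Math.ceil((double) usedMb / activeUsers);
    long mulpFloor = (long) (capacityMb * minUserLimitPct / 100f);
    return Math.max(evenSplit, mulpFloor);
  }

  public static void main(String[] args) {
    // Full 100 GB queue, MULP = 1, 100 active users: ~1 GB each, so a
    // 101st user only gets resources as others' containers are released.
    System.out.println(userLimitMb(102400, 100, 102400, 1f)); // 1024
    // Same queue with only 4 active users: the even split dominates
    // the 1% floor, so the small MULP barely matters here.
    System.out.println(userLimitMb(102400, 4, 102400, 1f));   // 25600
  }
}
{code}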
[jira] [Updated] (YARN-10802) Change Capacity Scheduler minimum-user-limit-percent to accept decimal values
[ https://issues.apache.org/jira/browse/YARN-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-10802: - Description: Capacity Scheduler's minimum-user-limit-percent only accepts integers, which means at most 100 users can use a single queue fairly. Using decimal values could solve this problem. (was: Capacity Scheduler's minimum-user-limit-percent only accepts integers, which means at most 100 users can use a single fairly. Using decimal values could solve this problem.) > Change Capacity Scheduler minimum-user-limit-percent to accept decimal values > - > > Key: YARN-10802 > URL: https://issues.apache.org/jira/browse/YARN-10802 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-10802.001.patch, YARN-10802.002.patch, > YARN-10802.003.patch, YARN-10802.004.patch > > > Capacity Scheduler's minimum-user-limit-percent only accepts integers, which > means at most 100 users can use a single queue fairly. Using decimal values > could solve this problem. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10767) Yarn Logs Command retrying on Standby RM for 30 times
[ https://issues.apache.org/jira/browse/YARN-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363205#comment-17363205 ] Jim Brennan commented on YARN-10767: Thanks for the update [~dmmkr]! I can see that you changed
{noformat}
public static String findActiveRMHAId(YarnConfiguration conf) {
  YarnConfiguration yarnConf = new YarnConfiguration(conf);
{noformat}
to
{noformat}
public static String findActiveRMHAId(YarnConfiguration yarnConf) {
{noformat}
effectively moving the construction of the temporary YarnConfiguration to the caller. I see in the other place where this method is called, it was already doing that. So in that sense this makes sense. I am wondering about the change in behavior for findActiveRMHAId() though. Previously, it did not change the conf that was passed in - it made changes in a local copy. Now, it will modify the passed-in conf whether it succeeds or fails, by setting RM_HA_ID. That is why I suggested changing it to this:
{noformat}
public static String findActiveRMHAId(Configuration conf) {
  YarnConfiguration yarnConf = new YarnConfiguration(conf);
{noformat}
Then you can just use the conf you were passed in. This does not make any functional difference with the current callers, but it could matter to future callers, if they assume findActiveRMHAId won't modify the passed-in conf. > Yarn Logs Command retrying on Standby RM for 30 times > - > > Key: YARN-10767 > URL: https://issues.apache.org/jira/browse/YARN-10767 > Project: Hadoop YARN > Issue Type: Bug >Reporter: D M Murali Krishna Reddy >Assignee: D M Murali Krishna Reddy >Priority: Major > Attachments: YARN-10767.001.patch, YARN-10767.002.patch, > YARN-10767.003.patch > > > When ResourceManager HA is enabled and the first RM is unavailable, on > executing "bin/yarn logs -applicationId -am 1", we get a > ConnectionException for connecting to the first RM; the ConnectionException > occurs 30 times before it tries to connect to the second RM. > > This can be optimized by trying to fetch the logs from the Active RM. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
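Jim's suggestion, sketched out in full (a sketch only, not the attached patch; the isActive probe is a stand-in for the real RPC-based check, and the comma-split of RM_HA_IDS is a simplification):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

final class FindActiveRMSketch {
  /**
   * Probe each configured RM id for the active one. The local copy keeps
   * the RM_HA_ID mutation private, so the caller's conf is left unchanged
   * whether the lookup succeeds or fails.
   */
  static String findActiveRMHAId(Configuration conf) {
    YarnConfiguration yarnConf = new YarnConfiguration(conf);
    String rmIds = yarnConf.get(YarnConfiguration.RM_HA_IDS, "");
    for (String rmId : rmIds.split(",")) {
      yarnConf.set(YarnConfiguration.RM_HA_ID, rmId.trim());
      if (isActive(yarnConf)) { // stand-in for the real active-RM probe
        return rmId.trim();
      }
    }
    return null;
  }

  private static boolean isActive(YarnConfiguration conf) {
    return false; // placeholder: the real check queries the RM's HA state
  }
}
{code}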
[jira] [Updated] (YARN-10802) Change Capacity Scheduler minimum-user-limit-percent to accept decimal values
[ https://issues.apache.org/jira/browse/YARN-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-10802: -- Fix Version/s: 3.4.0 > Change Capacity Scheduler minimum-user-limit-percent to accept decimal values > - > > Key: YARN-10802 > URL: https://issues.apache.org/jira/browse/YARN-10802 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-10802.001.patch, YARN-10802.002.patch, > YARN-10802.003.patch, YARN-10802.004.patch > > > Capacity Scheduler's minimum-user-limit-percent only accepts integers, which > means at most 100 users can use a single fairly. Using decimal values could > solve this problem. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10802) Change Capacity Scheduler minimum-user-limit-percent to accept decimal values
[ https://issues.apache.org/jira/browse/YARN-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363194#comment-17363194 ] Szilard Nemeth commented on YARN-10802: --- Ok, had some quick offline discussion with [~bteke], it turns out I confused decimal with whole numbers (English is not my native language), but still it's a bit embarrassing. Anyways, the primitive data types documentation for Java also mentions double / float as decimal types: https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html I don't think it's worth uploading a new patch to fix the nit, so I fixed it just before committing. Thanks [~bteke] again for the patch, committed to trunk and resolving the jira now. > Change Capacity Scheduler minimum-user-limit-percent to accept decimal values > - > > Key: YARN-10802 > URL: https://issues.apache.org/jira/browse/YARN-10802 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > Attachments: YARN-10802.001.patch, YARN-10802.002.patch, > YARN-10802.003.patch, YARN-10802.004.patch > > > Capacity Scheduler's minimum-user-limit-percent only accepts integers, which > means at most 100 users can use a single fairly. Using decimal values could > solve this problem. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10802) Change Capacity Scheduler minimum-user-limit-percent to accept decimal values
[ https://issues.apache.org/jira/browse/YARN-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363192#comment-17363192 ] Szilard Nemeth commented on YARN-10802: --- Hi [~bteke], Thanks for working on this. Some comments on the latest patch:
1. Checking the description: Capacity Scheduler's minimum-user-limit-percent only accepts integers, which means at most 100 users can use a single fairly. *Using decimal values could solve this problem.* Didn't you want to add "using fractional values could solve this problem"? Also, the name of the testcase org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestLeafQueue#testDecimalUserLimits is saying decimal user limits, but you are setting 50.1%, which is a fractional value.
2. Nit: In org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestLeafQueue#testDecimalUserLimits, you may replace 0*GB with 0 in assertions like:
{code}
assertEquals(0*GB, app1.getCurrentConsumption().getMemorySize());
{code}
Other than these, the patch looks okay. > Change Capacity Scheduler minimum-user-limit-percent to accept decimal values > - > > Key: YARN-10802 > URL: https://issues.apache.org/jira/browse/YARN-10802 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > Attachments: YARN-10802.001.patch, YARN-10802.002.patch, > YARN-10802.003.patch, YARN-10802.004.patch > > > Capacity Scheduler's minimum-user-limit-percent only accepts integers, which > means at most 100 users can use a single fairly. Using decimal values could > solve this problem. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10813) Root queue capacity is not set when using node labels
[ https://issues.apache.org/jira/browse/YARN-10813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363185#comment-17363185 ] Szilard Nemeth commented on YARN-10813: --- Thanks [~gandras] for reporting this. This is quite a trivial fix, but it's good that you spotted it. Some questions / observations:
1. How come our tests didn't catch this? Is it easy to add a unit test to cover the fixed scenario if there isn't one already?
2. I would be baffled if we didn't have a common constant for the queue name "root" anywhere. The thing is, we have many constants; just search for "root" in the package org/apache/hadoop/yarn/server/resourcemanager/scheduler. I know it's not strongly related to this, but could you please file a follow-up to clean those up? I just don't want to increase the number of occurrences of "root" in production code anymore.
Thanks. > Root queue capacity is not set when using node labels > - > > Key: YARN-10813 > URL: https://issues.apache.org/jira/browse/YARN-10813 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10813.001.patch > > > CapacitySchedulerConfiguration#getNonLabeledQueueCapacity handles root in the > following way: > {code:java} > if (absoluteResourceConfigured || configuredWeightAsCapacity( > configuredCapacity)) { > // Return capacity in percentage as 0 for non-root queues and 100 for > // root.From AbstractCSQueue, absolute resource will be parsed and > // updated. Once nodes are added/removed in cluster, capacity in > // percentage will also be re-calculated. > return queue.equals("root") ? 100.0f : 0f; > } > {code} > CapacitySchedulerConfiguration#internalGetLabeledQueueCapacity on the other > hand does not take root queue into consideration: > {code:java} > if (absoluteResourceConfigured || configuredWeightAsCapacity( > configuredCapacity)) { > // Return capacity in percentage as 0 for non-root queues and 100 for > // root.From AbstractCSQueue, absolute resource, and weight will be > parsed > // and updated separately. Once nodes are added/removed in cluster, > // capacity is percentage will also be re-calculated. > return defaultValue; > } > float capacity = getFloat(capacityPropertyName, defaultValue); > {code} > Due to this, labeled root capacity is 0, which is not set in > AbstractCSQueue#derivedCapacityFromAbsoluteConfigurations, because root is > never in Absolute mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
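For context, the shape of the fix the description implies (a sketch against the snippet quoted above, not the attached YARN-10813.001.patch; the "root" literal is kept to match the existing code, pending the constant cleanup raised in point 2):
{code:java}
// In CapacitySchedulerConfiguration#internalGetLabeledQueueCapacity:
if (absoluteResourceConfigured || configuredWeightAsCapacity(
    configuredCapacity)) {
  // Mirror getNonLabeledQueueCapacity: root is never in absolute or
  // weight mode, so its labeled capacity should default to 100, not 0.
  return queue.equals("root") ? 100.0f : defaultValue;
}
float capacity = getFloat(capacityPropertyName, defaultValue);
{code}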
[jira] [Commented] (YARN-10801) Fix Auto Queue template to properly set all configuration properties
[ https://issues.apache.org/jira/browse/YARN-10801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363181#comment-17363181 ] Szilard Nemeth commented on YARN-10801: --- Hi [~gandras], Thanks for working on this. One thing I don't quite understand: ParentQueue#createNewQueue used to call ParentQueue#getConfForAutoCreatedQueue for child queues:
{code}
childQueue = new LeafQueue(csContext,
    getConfForAutoCreatedQueue(childQueuePath, isLeaf), queueShortName,
    this, null);
{code}
The method definition was removed with your patch; the original method:
{code}
// ParentQueue#getConfForAutoCreatedQueue
private CapacitySchedulerConfiguration getConfForAutoCreatedQueue(
    String childQueuePath, boolean isLeaf) {
  // Copy existing config
  CapacitySchedulerConfiguration dupCSConfig =
      new CapacitySchedulerConfiguration(
          csContext.getConfiguration(), false);
  autoCreatedQueueTemplate.setTemplateEntriesForChild(dupCSConfig,
      childQueuePath);
  if (isLeaf) {
    // set to -1, to disable it
    dupCSConfig.setUserLimitFactor(childQueuePath, -1);
    // Set Max AM percentage to a higher value
    dupCSConfig.setMaximumApplicationMasterResourcePerQueuePercent(
        childQueuePath, 0.5f);
  }
  return dupCSConfig;
}
{code}
However, you replaced the calls with:
{code}
if (isLeaf) {
  childQueue = new LeafQueue(csContext,
      csContext.getConfiguration(), queueShortName, this, null, true);
}
{code}
Method definition of LeafQueue#setDynamicQueueProperties:
{code}
@Override
protected void setDynamicQueueProperties(
    CapacitySchedulerConfiguration configuration) {
  super.setDynamicQueueProperties(configuration);
  // set to -1, to disable it
  configuration.setUserLimitFactor(getQueuePath(), -1);
  // Set Max AM percentage to a higher value
  configuration.setMaximumApplicationMasterResourcePerQueuePercent(
      getQueuePath(), 1f);
}
{code}
I can see that the old setMaximumApplicationMasterResourcePerQueuePercent was called with 0.5f and the new one is called with 1f. Could you please explain the intention of this change? Could you also add some more unit test assertions: that the AM resource percentage takes the correct value and that the userLimitFactor is -1? Thanks. > Fix Auto Queue template to properly set all configuration properties > > > Key: YARN-10801 > URL: https://issues.apache.org/jira/browse/YARN-10801 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10801.001.patch, YARN-10801.002.patch, > YARN-10801.003.patch, YARN-10801.004.patch, YARN-10801.005.patch > > > Currently Auto Queue templates set configuration properties only on the > Configuration object passed in the constructor. Due to the fact that a lot > of configuration values are read from the Configuration object in csContext, > template properties are not set in every case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
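The assertions Szilard asks for might look roughly like this (a hedged sketch; the queue path "root.parent.leaf" and the expected constants are placeholders that depend on the final patch):
{code:java}
CapacitySchedulerConfiguration conf = csContext.getConfiguration();
// Dynamic leaf queues created from the template should have the user
// limit factor disabled and the raised AM resource percent applied.
Assert.assertEquals(-1f, conf.getUserLimitFactor("root.parent.leaf"), 1e-6f);
Assert.assertEquals(1f,
    conf.getMaximumApplicationMasterResourcePerQueuePercent(
        "root.parent.leaf"), 1e-6f);
{code}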
[jira] [Commented] (YARN-10821) User limit is not calculated as per definition for preemption
[ https://issues.apache.org/jira/browse/YARN-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363171#comment-17363171 ] Eric Payne commented on YARN-10821: --- {quote} - In UsersManager#computeUserLimit the userLimit is calculated as is (currentCapacity * userLimit) {code} Resource userLimitResource = Resources.max(resourceCalculator, partitionResource, Resources.divideAndCeil(resourceCalculator, resourceUsed, usersSummedByWeight), Resources.divideAndCeil(resourceCalculator, Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()), 100)); {code} {quote} One more thing to note: another difference between the preemption and allocation calculations is that in the preemption path, {{resourceUsed}} in the above algorithm is resources used by all users whereas in the allocation path, it is only resources used by active users (that is, users currently asking for resources). > User limit is not calculated as per definition for preemption > - > > Key: YARN-10821 > URL: https://issues.apache.org/jira/browse/YARN-10821 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10821.001.patch > > > Minimum user limit percent (MULP) is a soft limit by definition. Preemption > uses pending resources to determine the resources needed by a queue, which is > calculated in LeafQueue#getTotalPendingResourcesConsideringUserLimit. This > method involves headroom calculated by UsersManager#computeUserLimit. > However, the pending resources for preemption are limited in an unexpected > fashion. > * In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is > calculated first: > {code:java} > float effectiveUserLimit = Math.max(usersManager.getUserLimit() / 100.0f, > 1.0f / Math.max(getAbstractUsersManager().getNumActiveUsers(), 1)); > {code} > * In UsersManager#computeUserLimit the userLimit is calculated as is > (currentCapacity * userLimit) > {code:java} > Resource userLimitResource = Resources.max(resourceCalculator, > partitionResource, > Resources.divideAndCeil(resourceCalculator, resourceUsed, > usersSummedByWeight), > Resources.divideAndCeil(resourceCalculator, > Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()), > 100)); > {code} > The fewer users occupying the queue, the more prevalent and outstanding this > effect will be in preemption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
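To make the active-vs-all-users difference above concrete, a toy calculation (illustrative numbers only, ignoring weights and rounding, and deliberately much simpler than the real computeUserLimit):
{code:java}
public class UserLimitPathsExample {
  public static void main(String[] args) {
    // Queue capacity 100 GB, min-user-limit-pct 10 => 10 GB floor.
    // Five users hold 10 GB each, but only two are active (still asking).
    long floor = 100 * 10 / 100;                         // 10 GB
    // Preemption path: resourceUsed counts all users' usage.
    long preemptionLimit = Math.max(5 * 10 / 2, floor);  // 25 GB
    // Allocation path: resourceUsed counts active users' usage only.
    long allocationLimit = Math.max(2 * 10 / 2, floor);  // 10 GB
    System.out.println(preemptionLimit + " GB vs " + allocationLimit + " GB");
  }
}
{code}
So for the same queue state, the preemption monitor can compute a noticeably higher per-user limit than the allocator does.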
[jira] [Commented] (YARN-10821) User limit is not calculated as per definition for preemption
[ https://issues.apache.org/jira/browse/YARN-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363169#comment-17363169 ] Eric Payne commented on YARN-10821: --- {quote} In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is calculated first: In UsersManager#computeUserLimit the userLimit is calculated as is (currentCapacity * userLimit) {quote} [~gandras], thanks for raising this issue. {{LeafQueue#getUserAMResourceLimitPerPartition}} and {{UsersManager#computeUserLimit}} are used to calculate different things. {{getUserAMResourceLimitPerPartition}} is used to calculate the maximum resources that can be used for AMs by all apps from a single user in the {{LeafQueue}} {{computeUserLimit}} is used to calculate the maximum total resources that can be used by all apps from a single user in the {{LeafQueue}} {{computeUserLimit}} is used not only during calculations by the preemption monitor, but it is also used to calculate headroom during container allocation and assignment to a queue. In this way, the preemption monitor and the Capacity Scheduler allocations are using the same computations for each users' user limit. The calculations in {{getUserAMResourceLimitPerPartition}} are more lenient than those in {{computeUserLimit}}. But they are calculating different limits. This difference is not between preemption vs. allocation, but between AM resources limit vs. total resources limit per user. > User limit is not calculated as per definition for preemption > - > > Key: YARN-10821 > URL: https://issues.apache.org/jira/browse/YARN-10821 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10821.001.patch > > > Minimum user limit percent (MULP) is a soft limit by definition. Preemption > uses pending resources to determine the resources needed by a queue, which is > calculated in LeafQueue#getTotalPendingResourcesConsideringUserLimit. This > method involves headroom calculated by UsersManager#computeUserLimit. > However, the pending resources for preemption are limited in an unexpected > fashion. > * In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is > calculated first: > {code:java} > float effectiveUserLimit = Math.max(usersManager.getUserLimit() / 100.0f, > 1.0f / Math.max(getAbstractUsersManager().getNumActiveUsers(), 1)); > {code} > * In UsersManager#computeUserLimit the userLimit is calculated as is > (currentCapacity * userLimit) > {code:java} > Resource userLimitResource = Resources.max(resourceCalculator, > partitionResource, > Resources.divideAndCeil(resourceCalculator, resourceUsed, > usersSummedByWeight), > Resources.divideAndCeil(resourceCalculator, > Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()), > 100)); > {code} > The fewer users occupying the queue, the more prevalent and outstanding this > effect will be in preemption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10821) User limit is not calculated as per definition for preemption
[ https://issues.apache.org/jira/browse/YARN-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363164#comment-17363164 ] Hadoop QA commented on YARN-10821: --
(x) *-1 overall*
|| Vote || Subsystem || Runtime || Logfile || Comment ||
| 0 | reexec | 23m 3s | | Docker mode activated. |
|| || || || *Prechecks* ||
| +1 | dupname | 0m 0s | | No case conflicting files found. |
| +1 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|| || || || *trunk Compile Tests* ||
| +1 | mvninstall | 28m 6s | | trunk passed |
| +1 | compile | 1m 4s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 | compile | 0m 53s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 | checkstyle | 0m 47s | | trunk passed |
| +1 | mvnsite | 0m 57s | | trunk passed |
| +1 | shadedclient | 17m 2s | | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 44s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 | javadoc | 0m 39s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| 0 | spotbugs | 20m 20s | | Both FindBugs and SpotBugs are enabled, using SpotBugs. |
| +1 | spotbugs | 1m 56s | | trunk passed |
|| || || || *Patch Compile Tests* ||
| +1 | mvninstall | 0m 50s | | the patch passed |
| +1 | compile | 0m 54s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 | javac | 0m 54s | | the patch passed |
| +1 | compile | 0m 46s | | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 | javac | 0m 46s | | the patch passed |
| -0 | checkstyle | 0m 39s | https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1062/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 10 unchanged - 0 fixed = 11 total (was 10) |
| +1 | mvnsite | 0m 49s | | the patch passed |
| +1 | whitespace | 0m 0s | | The patch has no whitespace issues. |
| +1 | shadedclient | 15m 3s | | patch has no errors when building and testing our client artifacts. |
[jira] [Commented] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363120#comment-17363120 ] Szilard Nemeth commented on YARN-10789: --- Hi [~tarunparimi], Can you please upload the branch-3.3 patch so Jenkins will trigger and run the build? Thanks. > RM HA startup can fail due to race conditions in ZKConfigurationStore > - > > Key: YARN-10789 > URL: https://issues.apache.org/jira/browse/YARN-10789 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-10789.001.patch, YARN-10789.002.patch > > > We are observing the below error randomly during Hadoop install and initial RM > startup when HA is enabled and yarn.scheduler.configuration.store.class=zk is > configured. This causes one of the RMs to not start up. > {code:java} > 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: > Service RMActiveServices failed in state INITED > org.apache.hadoop.service.ServiceStateException: java.io.IOException: > org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = > NodeExists for /confstore/CONF_STORE > {code} > We are trying to create the znode /confstore/CONF_STORE when we initialize > the ZKConfigurationStore. But the problem is that the ZKConfigurationStore is > initialized when CapacityScheduler does a serviceInit. This serviceInit is > done by both the Active and Standby RM. So we can run into a race condition when > both Active and Standby try to create the same znode when both RMs are started > at the same time. > ZKRMStateStore on the other hand avoids such race conditions by creating the > znodes only after serviceStart. serviceStart only happens for the active RM > which won the leader election, unlike serviceInit which happens irrespective > of leader election. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10802) Change Capacity Scheduler minimum-user-limit-percent to accept decimal values
[ https://issues.apache.org/jira/browse/YARN-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363103#comment-17363103 ] Hadoop QA commented on YARN-10802: --
(/) *+1 overall*
|| Vote || Subsystem || Runtime || Logfile || Comment ||
| 0 | reexec | 21m 6s | | Docker mode activated. |
|| || || || *Prechecks* ||
| +1 | dupname | 0m 0s | | No case conflicting files found. |
| +1 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | | The patch appears to include 3 new or modified test files. |
|| || || || *trunk Compile Tests* ||
| +1 | mvninstall | 22m 38s | | trunk passed |
| +1 | compile | 1m 2s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 | compile | 0m 53s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 | checkstyle | 0m 57s | | trunk passed |
| +1 | mvnsite | 0m 56s | | trunk passed |
| +1 | shadedclient | 17m 5s | | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 42s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 | javadoc | 0m 40s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| 0 | spotbugs | 20m 20s | | Both FindBugs and SpotBugs are enabled, using SpotBugs. |
| +1 | spotbugs | 1m 54s | | trunk passed |
|| || || || *Patch Compile Tests* ||
| +1 | mvninstall | 0m 52s | | the patch passed |
| +1 | compile | 0m 56s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 | javac | 0m 56s | | the patch passed |
| +1 | compile | 0m 47s | | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 | javac | 0m 47s | | the patch passed |
| -0 | checkstyle | 0m 53s | https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1060/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 2 new + 672 unchanged - 15 fixed = 674 total (was 687) |
| +1 | mvnsite | 0m 50s | | the patch passed |
| +1 | whitespace | 0m 0s | | The patch has no whitespace issues. |
| +1 | shadedclient | 14m 50s | | patch has no errors when building and testing our client artifacts. |
[jira] [Commented] (YARN-10767) Yarn Logs Command retrying on Standby RM for 30 times
[ https://issues.apache.org/jira/browse/YARN-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363087#comment-17363087 ] D M Murali Krishna Reddy commented on YARN-10767: - [~Jim_Brennan], I have fixed the spotbugs issue in the v3 patch. Can you have a look? > Yarn Logs Command retrying on Standby RM for 30 times > - > > Key: YARN-10767 > URL: https://issues.apache.org/jira/browse/YARN-10767 > Project: Hadoop YARN > Issue Type: Bug >Reporter: D M Murali Krishna Reddy >Assignee: D M Murali Krishna Reddy >Priority: Major > Attachments: YARN-10767.001.patch, YARN-10767.002.patch, > YARN-10767.003.patch > > > When ResourceManager HA is enabled and the first RM is unavailable, on > executing "bin/yarn logs -applicationId -am 1", we get a > ConnectionException for connecting to the first RM; the ConnectionException > occurs 30 times before it tries to connect to the second RM. > > This can be optimized by trying to fetch the logs from the Active RM. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10822) Containers going from New to Scheduled transition even though container is killed before NM restart when NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-10822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Minni Mittal updated YARN-10822: Description:
INFO [91] ContainerImpl: Container container_e1171_1623422468672_2229_01_000738 transitioned from NEW to LOCALIZING
INFO [91] ContainerImpl: Container container_e1171_1623422468672_2229_01_000738 transitioned from LOCALIZING to SCHEDULED
INFO [91] ContainerScheduler: Opportunistic container container_e1171_1623422468672_2229_01_000738 will be queued at the NM.
INFO [127] ContainerManagerImpl: Stopping container with container Id: container_e1171_1623422468672_2229_01_000738
INFO [91] ContainerImpl: Container container_e1171_1623422468672_2229_01_000738 transitioned from SCHEDULED to KILLING
INFO [91] ContainerImpl: Container container_e1171_1623422468672_2229_01_000738 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
INFO [91] NMAuditLogger: USER=defaultcafor1stparty OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1623422468672_2229 CONTAINERID=container_e1171_1623422468672_2229_01_000738
INFO [91] ApplicationImpl: Removing container_e1171_1623422468672_2229_01_000738 from application application_1623422468672_2229
INFO [91] ContainersMonitorImpl: Stopping resource-monitoring for container_e1171_1623422468672_2229_01_000738
INFO [163] NodeStatusUpdaterImpl: Removed completed containers from NM context:[container_e1171_1623422468672_2229_01_000738]
NM restart happened and recovery is attempted
INFO [1] ContainerManagerImpl: Recovering container_e1171_1623422468672_2229_01_000738 in state QUEUED with exit code -1000
INFO [1] ApplicationImpl: Adding container_e1171_1623422468672_2229_01_000738 to application application_1623422468672_2229
INFO [89] ContainerImpl: Container container_e1171_1623422468672_2229_01_000738 transitioned from NEW to SCHEDULED
INFO [89] ContainerImpl: Container container_e1171_1623422468672_2229_01_000738 transitioned from SCHEDULED to KILLING
INFO [89] ContainerImpl: Container container_e1171_1623422468672_2229_01_000738 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
Ideally, when the container was killed before the restart, recovery should finish the container immediately.
> Containers going from New to Scheduled transition even though container is > killed before NM restart when NM recovery is enabled > --- > > Key: YARN-10822 > URL: https://issues.apache.org/jira/browse/YARN-10822 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Minni Mittal >Assignee: Minni Mittal >Priority: Major >
> INFO [91] ContainerImpl: Container container_e1171_1623422468672_2229_01_000738 transitioned from NEW to LOCALIZING
> INFO [91] ContainerImpl: Container container_e1171_1623422468672_2229_01_000738 transitioned from LOCALIZING to SCHEDULED
> INFO [91] ContainerScheduler: Opportunistic container container_e1171_1623422468672_2229_01_000738 will be queued at the NM.
> INFO [127] ContainerManagerImpl: Stopping container with container Id: container_e1171_1623422468672_2229_01_000738
> INFO [91] ContainerImpl: Container container_e1171_1623422468672_2229_01_000738 transitioned from SCHEDULED to KILLING
> INFO [91] ContainerImpl: Container container_e1171_1623422468672_2229_01_000738 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
> INFO [91] NMAuditLogger: USER=defaultcafor1stparty OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1623422468672_2229 CONTAINERID=container_e1171_1623422468672_2229_01_000738
> INFO [91] ApplicationImpl: Removing container_e1171_1623422468672_2229_01_000738 from application application_1623422468672_2229
> INFO [91] ContainersMonitorImpl: Stopping resource-monitoring for container_e1171_1623422468672_2229_01_000738
> INFO [163] NodeStatusUpdaterImpl: Removed completed containers from NM context:[container_e1171_1623422468672_2229_01_000738]
> NM restart happened and recovery is attempted
> INFO [1] ContainerManagerImpl: Recovering container_e1171_1623422468672_2229_01_000738 in state QUEUED with exit code -1000
> INFO [1] ApplicationImpl: Adding container_e1171_1623422468672_2229_01_000738 to application application_1623422468672_2229
> INFO [89] ContainerImpl: Container container_e1171_1623422468672_2229_01_000738 transitioned from NEW to SCHEDULED
> INFO [89] ContainerImpl: Container container_e1171_1623422468672_2229_01_000738 transitioned from SCHEDULED to KILLING
> INFO [89] ContainerImpl: Container container_e1171_1623422468672_2229_01_000738 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
> Ideally, when the container was killed before the restart, recovery should finish the container immediately. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
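One possible shape of the fix (purely speculative; the real recovery path in ContainerManagerImpl and the NM state store are more involved, and the killed-before-restart flag plus the markContainerDone helper below are hypothetical, something the state store would need to persist and expose):
{code:java}
// Hypothetical guard while recovering containers after an NM restart.
for (RecoveredContainerState rcs : stateStore.loadContainersState()) {
  if (rcs.getKilledBeforeRestart()) { // hypothetical accessor
    // The container was already killed before the restart: finish it
    // immediately instead of replaying NEW -> SCHEDULED -> KILLING.
    markContainerDone(rcs); // hypothetical helper
  } else {
    recoverContainer(rcs);
  }
}
{code}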
[jira] [Created] (YARN-10822) Containers going to New to Scheduled transition even though container is killed before NM restart when NM recovery is enabled
Minni Mittal created YARN-10822: --- Summary: Containers going to New to Scheduled transition even though container is killed before NM restart when NM recovery is enabled Key: YARN-10822 URL: https://issues.apache.org/jira/browse/YARN-10822 Project: Hadoop YARN Issue Type: Bug Reporter: Minni Mittal Assignee: Minni Mittal -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10822) Containers going from New to Scheduled transition even though container is killed before NM restart when NM recovery is enabled
[ https://issues.apache.org/jira/browse/YARN-10822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Minni Mittal updated YARN-10822: Summary: Containers going from New to Scheduled transition even though container is killed before NM restart when NM recovery is enabled (was: Containers going to New to Scheduled transition even though container is killed before NM restart when NM recovery is enabled) > Containers going from New to Scheduled transition even though container is > killed before NM restart when NM recovery is enabled > --- > > Key: YARN-10822 > URL: https://issues.apache.org/jira/browse/YARN-10822 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Minni Mittal >Assignee: Minni Mittal >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10767) Yarn Logs Command retrying on Standby RM for 30 times
[ https://issues.apache.org/jira/browse/YARN-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363036#comment-17363036 ] Hadoop QA commented on YARN-10767: --
(x) *-1 overall*
|| Vote || Subsystem || Runtime || Logfile || Comment ||
| 0 | reexec | 1m 31s | | Docker mode activated. |
|| || || || *Prechecks* ||
| +1 | dupname | 0m 0s | | No case conflicting files found. |
| +1 | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || *trunk Compile Tests* ||
| +1 | mvninstall | 31m 17s | | trunk passed |
| +1 | compile | 0m 54s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 | compile | 0m 44s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 | checkstyle | 0m 32s | | trunk passed |
| +1 | mvnsite | 0m 46s | | trunk passed |
| +1 | shadedclient | 17m 6s | | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 44s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 | javadoc | 0m 42s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| 0 | spotbugs | 20m 16s | | Both FindBugs and SpotBugs are enabled, using SpotBugs. |
| +1 | spotbugs | 1m 46s | | trunk passed |
|| || || || *Patch Compile Tests* ||
| +1 | mvninstall | 0m 41s | | the patch passed |
| +1 | compile | 0m 42s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 | javac | 0m 42s | | the patch passed |
| +1 | compile | 0m 36s | | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 | javac | 0m 36s | | the patch passed |
| +1 | checkstyle | 0m 23s | | the patch passed |
| +1 | mvnsite | 0m 39s | | the patch passed |
| +1 | whitespace | 0m 0s | | The patch has no whitespace issues. |
| +1 | shadedclient | 15m 9s | | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 39s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
[jira] [Comment Edited] (YARN-10821) User limit is not calculated as per definition for preemption
[ https://issues.apache.org/jira/browse/YARN-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362940#comment-17362940 ] Andras Gyori edited comment on YARN-10821 at 6/14/21, 3:32 PM: --- I am not entirely convinced that this is the best solution to this problem, and as user limit is heavily used throughout the entire codebase, I am also unsure that it does not break something. Perhaps experts could help here cc [~epayne]. was (Author: gandras): I am not entirely convinced that this is the best solution to this problem, and as user limit is heavily used throughout the entire codebase, I am also not sure that it will not break anything. Perhaps experts could help here cc [~epayne]. > User limit is not calculated as per definition for preemption > - > > Key: YARN-10821 > URL: https://issues.apache.org/jira/browse/YARN-10821 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > > Minimum user limit percent (MULP) is a soft limit by definition. Preemption > uses pending resources to determine the resources needed by a queue, which is > calculated in LeafQueue#getTotalPendingResourcesConsideringUserLimit. This > method involves headroom calculated by UsersManager#computeUserLimit. > However, the pending resources for preemption are limited in an unexpected > fashion. > * In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is > calculated first: > {code:java} > float effectiveUserLimit = Math.max(usersManager.getUserLimit() / 100.0f, > 1.0f / Math.max(getAbstractUsersManager().getNumActiveUsers(), 1)); > {code} > * In UsersManager#computeUserLimit the userLimit is calculated as is > (currentCapacity * userLimit) > {code:java} > Resource userLimitResource = Resources.max(resourceCalculator, > partitionResource, > Resources.divideAndCeil(resourceCalculator, resourceUsed, > usersSummedByWeight), > Resources.divideAndCeil(resourceCalculator, > Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()), > 100)); > {code} > The fewer users occupying the queue, the more prevalent and outstanding this > effect will be in preemption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10821) User limit is not calculated as per definition for preemption
[ https://issues.apache.org/jira/browse/YARN-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363004#comment-17363004 ] Eric Payne commented on YARN-10821: --- Thanks [~gandras] for bringing this up. I will take a look.
[jira] [Commented] (YARN-10821) User limit is not calculated as per definition for preemption
[ https://issues.apache.org/jira/browse/YARN-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362940#comment-17362940 ] Andras Gyori commented on YARN-10821: - I am not entirely convinced that this is the best solution to this problem, and as user limit is heavily used throughout the entire codebase, I am also not sure that it will not break anything. Perhaps experts could help here cc [~epayne].
[jira] [Created] (YARN-10821) User limit is not calculated as per definition for preemption
Andras Gyori created YARN-10821: --- Summary: User limit is not calculated as per definition for preemption Key: YARN-10821 URL: https://issues.apache.org/jira/browse/YARN-10821 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler Reporter: Andras Gyori Assignee: Andras Gyori
Minimum user limit percent (MULP) is a soft limit by definition. Preemption uses pending resources to determine the resources needed by a queue, which is calculated in LeafQueue#getTotalPendingResourcesConsideringUserLimit. This method relies on the headroom calculated by UsersManager#computeUserLimit. However, the pending resources for preemption are limited in an unexpected fashion.
* In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is calculated first:
{code:java}
float effectiveUserLimit = Math.max(usersManager.getUserLimit() / 100.0f,
    1.0f / Math.max(getAbstractUsersManager().getNumActiveUsers(), 1));
{code}
* In UsersManager#computeUserLimit the userLimit percentage is applied as is (currentCapacity * userLimit):
{code:java}
Resource userLimitResource = Resources.max(resourceCalculator,
    partitionResource,
    Resources.divideAndCeil(resourceCalculator, resourceUsed,
        usersSummedByWeight),
    Resources.divideAndCeil(resourceCalculator,
        Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()),
        100));
{code}
The fewer users occupying the queue, the more pronounced this effect is in preemption.
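A minimal standalone sketch of the discrepancy described above, using plain floats rather than the actual Hadoop classes (all names below are illustrative, not the real APIs): with minimum-user-limit-percent = 25 and a single active user, the AM-limit path floors the factor at 1.0, while the computeUserLimit path applies the raw 25%.
{code:java}
public class UserLimitIllustration {
  public static void main(String[] args) {
    float userLimit = 25f;   // minimum-user-limit-percent
    int numActiveUsers = 1;

    // Path 1: mirrors LeafQueue#getUserAMResourceLimitPerPartition,
    // which never lets the factor drop below 1/numActiveUsers.
    float effectiveUserLimit = Math.max(userLimit / 100.0f,
        1.0f / Math.max(numActiveUsers, 1));          // -> 1.00

    // Path 2: mirrors UsersManager#computeUserLimit, which applies
    // the configured percentage as is.
    float rawFactor = userLimit / 100.0f;             // -> 0.25

    System.out.printf("effective=%.2f raw=%.2f%n",
        effectiveUserLimit, rawFactor);
  }
}
{code}
With one active user the two paths disagree by a factor of four, which matches the report's observation that the effect is most visible when few users occupy the queue.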
[jira] [Commented] (YARN-10820) Make GetClusterNodesRequestPBImpl thread safe
[ https://issues.apache.org/jira/browse/YARN-10820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362837#comment-17362837 ] Surendra Singh Lilhore commented on YARN-10820: --- [~Swathi Chandrashekar], Added you as a contributor and assigned this to you. > Make GetClusterNodesRequestPBImpl thread safe > Key: YARN-10820 > URL: https://issues.apache.org/jira/browse/YARN-10820 > Project: Hadoop YARN > Issue Type: Task > Components: client > Affects Versions: 3.1.0, 3.3.0 > Reporter: Prabhu Joseph > Assignee: SwathiChandrashekar > Priority: Major
> yarn node list intermittently fails with the below:
{code:java}
2021-06-13 11:26:42,316 WARN client.RequestHedgingRMFailoverProxyProvider: Invocation returned exception: java.lang.ArrayIndexOutOfBoundsException: 1 on [resourcemanager-1], so propagating back to caller.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
    at java.util.ArrayList.add(ArrayList.java:465)
    at org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009)
    at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124)
    at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82)
    at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy8.getClusterNodes(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider$RMRequestHedgingInvocationHandler$1.call(RequestHedgingRMFailoverProxyProvider.java:159)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2021-06-13 11:27:58,415 WARN client.RequestHedgingRMFailoverProxyProvider: Invocation returned exception: java.lang.UnsupportedOperationException on [resourcemanager-0], so propagating back to caller.
Exception in thread "main" java.lang.UnsupportedOperationException
    at java.util.Collections$UnmodifiableCollection.add(Collections.java:1057)
    at org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009)
    at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124)
    at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82)
    at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at
{code}
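The two exceptions above are the classic signature of unsynchronized lazy proto merging: two threads race through mergeLocalToProto and corrupt the shared builder state. A minimal sketch of the usual remedy, using stand-in types rather than the real protobuf classes (the actual YARN-10820 patch may differ): serialize every mutate-and-build path on the record's own lock.
{code:java}
import java.util.ArrayList;
import java.util.List;

public class ThreadSafeRequestSketch {
  private final List<String> nodeStates = new ArrayList<>();
  private String proto; // stands in for the cached, built protobuf

  public synchronized void addNodeState(String state) {
    proto = null;              // invalidate the cached proto on mutation
    nodeStates.add(state);
  }

  // The whole merge-and-build path is one critical section, so a
  // concurrent caller sees either the old proto or a fully rebuilt one,
  // never a builder in mid-merge.
  public synchronized String getProto() {
    if (proto == null) {
      proto = String.join(",", nodeStates); // mergeLocalToProto stand-in
    }
    return proto;
  }
}
{code}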
[jira] [Assigned] (YARN-10820) Make GetClusterNodesRequestPBImpl thread safe
[ https://issues.apache.org/jira/browse/YARN-10820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore reassigned YARN-10820: - Assignee: SwathiChandrashekar (was: Prabhu Joseph)
[jira] [Commented] (YARN-10802) Change Capacity Scheduler minimum-user-limit-percent to accept decimal values
[ https://issues.apache.org/jira/browse/YARN-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362831#comment-17362831 ] Benjamin Teke commented on YARN-10802: -- Hi [~snemeth], Thanks for checking this. 1. Fixed most of the checkstyle issues; two remain, but fixing those would require a larger effort than the patch itself and they're unrelated, so if that's okay I would skip them. 2. The UT failure seems unrelated; it happened [here|https://issues.apache.org/jira/browse/YARN-10726?jql=project%20%3D%20YARN%20AND%20text%20~%20%22TestCapacitySchedulerAsyncScheduling%22%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC] as well. > Change Capacity Scheduler minimum-user-limit-percent to accept decimal values > Key: YARN-10802 > URL: https://issues.apache.org/jira/browse/YARN-10802 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler > Reporter: Benjamin Teke > Assignee: Benjamin Teke > Priority: Major > Attachments: YARN-10802.001.patch, YARN-10802.002.patch, YARN-10802.003.patch, YARN-10802.004.patch > Capacity Scheduler's minimum-user-limit-percent only accepts integers, which means at most 100 users can fairly use a single queue. Using decimal values could solve this problem.
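A rough sketch of what the improvement buys, using a plain properties map instead of the real CapacitySchedulerConfiguration (the property key is the standard one; the parsing code is illustrative only): once the value is read as a float, a setting like 0.5 extends the guaranteed share to up to 200 users instead of capping out at 100.
{code:java}
import java.util.Map;

public class DecimalMulpSketch {
  public static void main(String[] args) {
    Map<String, String> conf = Map.of(
        "yarn.scheduler.capacity.root.default.minimum-user-limit-percent",
        "0.5");

    // Parsing as float instead of int is the essence of the change.
    float mulp = Float.parseFloat(conf.get(
        "yarn.scheduler.capacity.root.default.minimum-user-limit-percent"));

    int usersWithGuaranteedShare = (int) Math.floor(100f / mulp);
    System.out.println(usersWithGuaranteedShare); // 200
  }
}
{code}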
[jira] [Updated] (YARN-10802) Change Capacity Scheduler minimum-user-limit-percent to accept decimal values
[ https://issues.apache.org/jira/browse/YARN-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-10802: - Attachment: YARN-10802.004.patch
[jira] [Commented] (YARN-10813) Root queue capacity is not set when using node labels
[ https://issues.apache.org/jira/browse/YARN-10813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362825#comment-17362825 ] Benjamin Teke commented on YARN-10813: -- Thanks [~gandras] for the patch, this indeed seems to be a bug. LGTM (non-binding). > Root queue capacity is not set when using node labels > Key: YARN-10813 > URL: https://issues.apache.org/jira/browse/YARN-10813 > Project: Hadoop YARN > Issue Type: Bug > Reporter: Andras Gyori > Assignee: Andras Gyori > Priority: Major > Attachments: YARN-10813.001.patch
> CapacitySchedulerConfiguration#getNonLabeledQueueCapacity handles root in the following way:
{code:java}
if (absoluteResourceConfigured || configuredWeightAsCapacity(
    configuredCapacity)) {
  // Return capacity in percentage as 0 for non-root queues and 100 for
  // root. From AbstractCSQueue, absolute resource will be parsed and
  // updated. Once nodes are added/removed in cluster, capacity in
  // percentage will also be re-calculated.
  return queue.equals("root") ? 100.0f : 0f;
}
{code}
> CapacitySchedulerConfiguration#internalGetLabeledQueueCapacity, on the other hand, does not take the root queue into consideration:
{code:java}
if (absoluteResourceConfigured || configuredWeightAsCapacity(
    configuredCapacity)) {
  // Return capacity in percentage as 0 for non-root queues and 100 for
  // root. From AbstractCSQueue, absolute resource, and weight will be parsed
  // and updated separately. Once nodes are added/removed in cluster,
  // capacity is percentage will also be re-calculated.
  return defaultValue;
}
float capacity = getFloat(capacityPropertyName, defaultValue);
{code}
> Due to this, the labeled root capacity is 0, and it is never corrected in AbstractCSQueue#derivedCapacityFromAbsoluteConfigurations, because root is never in Absolute mode.
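Based on the two snippets quoted above, the likely shape of the fix is to make the labeled path special-case root the same way the non-labeled path does. A minimal standalone sketch with assumed method and parameter names (the committed YARN-10813.001.patch may differ in detail):
{code:java}
public class LabeledCapacitySketch {
  // Stand-in for internalGetLabeledQueueCapacity: when capacity is
  // configured as weight or absolute resources, root must still
  // resolve to 100%, matching getNonLabeledQueueCapacity.
  static float labeledQueueCapacity(String queue,
      boolean absoluteOrWeightConfigured,
      float configuredCapacity, float defaultValue) {
    if (absoluteOrWeightConfigured) {
      return "root".equals(queue) ? 100.0f : defaultValue;
    }
    return configuredCapacity;
  }

  public static void main(String[] args) {
    // Before the fix this path returned 0 for root; with it, 100.
    System.out.println(labeledQueueCapacity("root", true, 0f, 0f));
  }
}
{code}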
[jira] [Commented] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362820#comment-17362820 ] Tarun Parimi commented on YARN-10789: - Thanks [~snemeth] for the review and commit. Thanks [~bteke], [~zhuqi] for your reviews. We can backport it to the 3.3/3.2 branches. The trunk patch applies cleanly on 3.3. Will add a patch for 3.2. > RM HA startup can fail due to race conditions in ZKConfigurationStore > Key: YARN-10789 > URL: https://issues.apache.org/jira/browse/YARN-10789 > Project: Hadoop YARN > Issue Type: Bug > Affects Versions: 3.0.0 > Reporter: Tarun Parimi > Assignee: Tarun Parimi > Priority: Major > Fix For: 3.4.0 > Attachments: YARN-10789.001.patch, YARN-10789.002.patch
> We are observing the below error randomly during Hadoop install and initial RM startup when HA is enabled and yarn.scheduler.configuration.store.class=zk is configured. This causes one of the RMs to not start up.
{code:java}
2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: Service RMActiveServices failed in state INITED
org.apache.hadoop.service.ServiceStateException: java.io.IOException: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /confstore/CONF_STORE
{code}
> We are trying to create the znode /confstore/CONF_STORE when we initialize the ZKConfigurationStore. The problem is that the ZKConfigurationStore is initialized when CapacityScheduler does a serviceInit. This serviceInit is done by both the Active and the Standby RM, so we can run into a race condition where both try to create the same znode when both RMs are started at the same time. > ZKRMStateStore, on the other hand, avoids such race conditions by creating its znodes only after serviceStart. serviceStart only happens for the active RM which won the leader election, unlike serviceInit, which happens irrespective of leader election.
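The fix described above moves znode creation out of the serviceInit path; independent of that, the NodeExists error itself can also be tolerated by making the create idempotent. A minimal sketch, assuming a Curator client is available (the helper below is illustrative, not the committed patch):
{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.KeeperException;

public final class ZkInitSketch {
  static void ensurePath(CuratorFramework zk, String path) throws Exception {
    try {
      zk.create().creatingParentsIfNeeded().forPath(path);
    } catch (KeeperException.NodeExistsException e) {
      // The other RM won the race and created the znode first;
      // the node exists either way, so this is safe to ignore.
    }
  }
}
{code}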
[jira] [Commented] (YARN-10816) Avoid doing delegation token ops when yarn.timeline-service.http-authentication.type=simple
[ https://issues.apache.org/jira/browse/YARN-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362752#comment-17362752 ] Tarun Parimi commented on YARN-10816: - Thanks [~snemeth] for the review and commit. > Avoid doing delegation token ops when yarn.timeline-service.http-authentication.type=simple > Key: YARN-10816 > URL: https://issues.apache.org/jira/browse/YARN-10816 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineclient > Affects Versions: 3.4.0 > Reporter: Tarun Parimi > Assignee: Tarun Parimi > Priority: Major > Fix For: 3.4.0 > Attachments: YARN-10816.001.patch, YARN-10816.002.patch > YARN-10339 introduced changes to ensure that PseudoAuthenticationHandler is used in TimelineClient when yarn.timeline-service.http-authentication.type=simple. > PseudoAuthenticationHandler doesn't support delegation token ops like get, renew and cancel, since those ops strictly require SPNEGO auth to work. We don't use timeline delegation tokens when simple auth is used. > Prior to YARN-10339, Timeline delegation tokens were unnecessarily used when yarn.timeline-service.http-authentication.type=simple but Hadoop security was enabled. After YARN-10339, the tokens are not used when yarn.timeline-service.http-authentication.type=simple. > In a rolling upgrade scenario, a client which doesn't have the YARN-10339 changes can submit an application and request a Timeline delegation token even when yarn.timeline-service.http-authentication.type=simple. The RM, on the other hand, can have the YARN-10339 changes and will then fail with an error while trying to renew the token with PseudoAuthenticationHandler.
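A minimal sketch of the guard described above, with an illustrative helper name (the real change lives in the TimelineClient code paths touched by YARN-10339): delegation-token get/renew/cancel should only be attempted when the authentication type is not simple.
{code:java}
import java.util.Map;

public class TimelineTokenGuardSketch {
  // The property key is the real one discussed above; the helper itself
  // is illustrative, not an actual YARN API.
  static boolean shouldUseTimelineToken(Map<String, String> conf) {
    String auth = conf.getOrDefault(
        "yarn.timeline-service.http-authentication.type", "simple");
    // PseudoAuthenticationHandler (simple auth) cannot serve token ops;
    // only SPNEGO/kerberos can.
    return !"simple".equals(auth);
  }

  public static void main(String[] args) {
    System.out.println(shouldUseTimelineToken(
        Map.of("yarn.timeline-service.http-authentication.type", "simple")));
    // -> false: skip get/renew/cancel of timeline delegation tokens
  }
}
{code}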
[jira] [Commented] (YARN-10820) Make GetClusterNodesRequestPBImpl thread safe
[ https://issues.apache.org/jira/browse/YARN-10820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362724#comment-17362724 ] Prabhu Joseph commented on YARN-10820: -- Hi [~bibinchundatt], Could you please add [~Swathi Chandrashekar] as a contributor? Thanks.
[jira] [Commented] (YARN-10820) Make GetClusterNodesRequestPBImpl thread safe
[ https://issues.apache.org/jira/browse/YARN-10820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362722#comment-17362722 ] Swathi Chandrashekar commented on YARN-10820: - Hi Prabhu, can you please assign it to me?