[ https://issues.apache.org/jira/browse/YARN-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822457#comment-15822457 ]
Wangda Tan commented on YARN-5889: ---------------------------------- Thanks for update the patch, my comments for ver.3 patch. 1) LQ: - getUserAndAddIfAbsent could be move to UM - addUser/removeUser could be private method of UM - getUsers -> getUsersInfo, and move implemenation to UM - getComputedActiveUserLimit -> getComputedResourceLimitForActiveUsers (or shorter name) - getComputedUserLimit -> getComputedResourceLimitForAllUsers (mostly for more consistency naming to above, and more acurate and less ambiguous) - Add more description to above two methods. - Is it better to move recalculateQueueUsageRatio/calculateUserUsageRatio into UM, and recalculateQueueUsageRatio can be renamed to "clusterResourceUpdated" - inc/decUsedResource, keep User reference to avoid hit hashMap twice. - Not caused by your patch: could you add a desc to UsageRatios? 2) UM: - UM#userCountToRecalculateUserlimit could be AtomicLong to avoid writelock and less frequently wrapped to MIN_VALUE. Rename it to latestVersion. - rename UM#getUserCountToRecalculateUserlimit -> getLatestVersion and readlock is not required. - A potential race condition: {code} .... // In allocation thread, cross check whether user became active or not. // Such a change needs recomputation for all user-limits. if (activeUsersManager.isUserAddedOrRemoved()) { // 1 // Clear computed user-limits and ask for recomputation. userLimitNeedsRecompute(); // 2 // reset the flag so that we can further track for active users. activeUsersManager.setUserAddedOrRemoved(false); // 3 } .... {code} If thread#1 at // 2, and thread #2 calls setUserAddedOrRemoved, the {{isUserAddedOrRemoved}} could be overwritten and cannot properly trigger another update. To simplify interface and avoid the race condition, can we change {{isUserAddedOrRemoved}} to use an AtomicBoolean, invokes getAndSet(false) to do the "flip" operation. - Renaming {{userCountToRecalculateUserlimit}} to {{localVersion}}? - Similiarily, {code} if (isRecomputeNeeded(user, userLimitPerSchedulingMode)) { // recompute userLimitPerSchedulingMode = reComputeUserLimits(rc, userName, nodePartition, clusterResource, schedulingMode, false); // update user count to cache so that we can avoid recompute if no major // changes. user.setUserCountToRecalculateUserlimit( getUserCountToRecalculateUserlimit()); } {code} To avoid race condition, is it better to save value of "latestVersion" and set it to "local" version, pesudo code like: {code} int latestVersion = UM#getLatestVersion if (getLocalVersion != latestVersion) { // recompute ... setLocalVersion(latestVersion); } {code} - {{getComputedUserLimit}} / {{getComputedActiveUserLimit}}: I'm wondering if the UserToPartitionRecord is required, since existing computation doesn't need the "user" field. - Does existing logic handle following scenario? Assume userLimitPerSchedulingMode is null and no new changes to user-limit, will NPE happen if we get a RESPECT_EXCLUSIVITY first and then get a IGNORE_EXCLUSIVITY? - Inside {{computeUserLimit}}, it calls {{user.setUserResourceLimit(userLimitResource);}}, I think we need to file a separate JIRA to make sure REST API / web UI can get expected UL. - TODO, after you update the patch, need to double check locks of UM. > Improve user-limit calculation in capacity scheduler > ---------------------------------------------------- > > Key: YARN-5889 > URL: https://issues.apache.org/jira/browse/YARN-5889 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler > Reporter: Sunil G > Assignee: Sunil G > Attachments: YARN-5889.0001.patch, > YARN-5889.0001.suggested.patchnotes, YARN-5889.0002.patch, > YARN-5889.0003.patch, YARN-5889.v0.patch, YARN-5889.v1.patch, > YARN-5889.v2.patch > > > Currently user-limit is computed during every heartbeat allocation cycle with > a write lock. To improve performance, this tickets is focussing on moving > user-limit calculation out of heartbeat allocation flow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org