[
https://issues.apache.org/jira/browse/YARN-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822457#comment-15822457
]
Wangda Tan commented on YARN-5889:
----------------------------------
Thanks for update the patch, my comments for ver.3 patch.
1) LQ:
- getUserAndAddIfAbsent could be move to UM
- addUser/removeUser could be private method of UM
- getUsers -> getUsersInfo, and move implemenation to UM
- getComputedActiveUserLimit -> getComputedResourceLimitForActiveUsers (or
shorter name)
- getComputedUserLimit -> getComputedResourceLimitForAllUsers (mostly for more
consistency naming to above, and more acurate and less ambiguous)
- Add more description to above two methods.
- Is it better to move recalculateQueueUsageRatio/calculateUserUsageRatio into
UM, and recalculateQueueUsageRatio can be renamed to "clusterResourceUpdated"
- inc/decUsedResource, keep User reference to avoid hit hashMap twice.
- Not caused by your patch: could you add a desc to UsageRatios?
2) UM:
- UM#userCountToRecalculateUserlimit could be AtomicLong to avoid writelock and
less frequently wrapped to MIN_VALUE. Rename it to latestVersion.
- rename UM#getUserCountToRecalculateUserlimit -> getLatestVersion and readlock
is not required.
- A potential race condition:
{code}
....
// In allocation thread, cross check whether user became active or not.
// Such a change needs recomputation for all user-limits.
if (activeUsersManager.isUserAddedOrRemoved()) { // 1
// Clear computed user-limits and ask for recomputation.
userLimitNeedsRecompute(); // 2
// reset the flag so that we can further track for active users.
activeUsersManager.setUserAddedOrRemoved(false); // 3
}
....
{code}
If thread#1 at // 2, and thread #2 calls setUserAddedOrRemoved, the
{{isUserAddedOrRemoved}} could be overwritten and cannot properly trigger
another update.
To simplify interface and avoid the race condition, can we change
{{isUserAddedOrRemoved}} to use an AtomicBoolean, invokes getAndSet(false) to
do the "flip" operation.
- Renaming {{userCountToRecalculateUserlimit}} to {{localVersion}}?
- Similiarily,
{code}
if (isRecomputeNeeded(user, userLimitPerSchedulingMode)) {
// recompute
userLimitPerSchedulingMode = reComputeUserLimits(rc, userName,
nodePartition, clusterResource, schedulingMode, false);
// update user count to cache so that we can avoid recompute if no major
// changes.
user.setUserCountToRecalculateUserlimit(
getUserCountToRecalculateUserlimit());
}
{code}
To avoid race condition, is it better to save value of "latestVersion" and set
it to "local" version, pesudo code like:
{code}
int latestVersion = UM#getLatestVersion
if (getLocalVersion != latestVersion) {
// recompute ...
setLocalVersion(latestVersion);
}
{code}
- {{getComputedUserLimit}} / {{getComputedActiveUserLimit}}: I'm wondering if
the UserToPartitionRecord is required, since existing computation doesn't need
the "user" field.
- Does existing logic handle following scenario? Assume
userLimitPerSchedulingMode is null and no new changes to user-limit, will NPE
happen if we get a RESPECT_EXCLUSIVITY first and then get a IGNORE_EXCLUSIVITY?
- Inside {{computeUserLimit}}, it calls
{{user.setUserResourceLimit(userLimitResource);}}, I think we need to file a
separate JIRA to make sure REST API / web UI can get expected UL.
- TODO, after you update the patch, need to double check locks of UM.
> Improve user-limit calculation in capacity scheduler
> ----------------------------------------------------
>
> Key: YARN-5889
> URL: https://issues.apache.org/jira/browse/YARN-5889
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler
> Reporter: Sunil G
> Assignee: Sunil G
> Attachments: YARN-5889.0001.patch,
> YARN-5889.0001.suggested.patchnotes, YARN-5889.0002.patch,
> YARN-5889.0003.patch, YARN-5889.v0.patch, YARN-5889.v1.patch,
> YARN-5889.v2.patch
>
>
> Currently user-limit is computed during every heartbeat allocation cycle with
> a write lock. To improve performance, this tickets is focussing on moving
> user-limit calculation out of heartbeat allocation flow.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]