[ https://issues.apache.org/jira/browse/YARN-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054197#comment-16054197 ]
Jason Lowe commented on YARN-6680:
----------------------------------
There definitely is a bug in the code with respect to locking in ResourceUsage,
both before and after this proposed change. Besides the issues Daryn pointed
out earlier, there's this problem:
- Thread 1 calls getUsed on some label. Whether we lock or not, we can return
the Resource object that is being used for bookkeeping. Once we return from
the get, the caller has access to the bookkeeping object with no locks held.
- Thread 2 calls decUsed on the same label. It proceeds to mutate the _same
Resource object_ with the write lock held. The lock doesn't help for this
scenario, since Thread 1 already has the object being mutated and is not
calling any ResourceUsage code at the time.
- Thread 1 can now see an inconsistent view of the Resource, where the memory
field has been decremented but the vcore field has yet to be decremented. In
other words, it sees a Resource usage that never actually occurred in practice.
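The sequence above can be sketched in a few lines of Java. This is only an
illustration under assumptions: SimpleResource, SimpleUsage and getUsedCopy
are hypothetical stand-ins, not the actual ResourceUsage or Resource code.
It runs single-threaded so the result is deterministic; it shows that the
caller ends up holding the very object that decUsed later mutates, and that
copying under the read lock is one possible way to hand out a consistent
value.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical mutable resource record; not the real YARN Resource class.
class SimpleResource {
    long memory;
    int vcores;

    SimpleResource(long memory, int vcores) {
        this.memory = memory;
        this.vcores = vcores;
    }
}

// Hypothetical, simplified per-label usage bookkeeping.
class SimpleUsage {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, SimpleResource> used = new HashMap<>();

    SimpleUsage(String label, long memory, int vcores) {
        used.put(label, new SimpleResource(memory, vcores));
    }

    // Returns the bookkeeping object itself. The read lock is released
    // before the caller reads any field, so it protects nothing afterwards.
    SimpleResource getUsed(String label) {
        lock.readLock().lock();
        try {
            return used.get(label);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Assumed alternative: copy both fields while the read lock is held.
    SimpleResource getUsedCopy(String label) {
        lock.readLock().lock();
        try {
            SimpleResource r = used.get(label);
            return new SimpleResource(r.memory, r.vcores);
        } finally {
            lock.readLock().unlock();
        }
    }

    void decUsed(String label, long memory, int vcores) {
        lock.writeLock().lock();
        try {
            SimpleResource r = used.get(label);
            r.memory -= memory;
            // A thread holding the reference from getUsed could read the
            // object here and see memory decremented but vcores not yet.
            r.vcores -= vcores;
        } finally {
            lock.writeLock().unlock();
        }
    }
}

class LeakedReferenceDemo {
    public static void main(String[] args) {
        SimpleUsage usage = new SimpleUsage("", 4096, 4);
        SimpleResource leaked = usage.getUsed("");
        SimpleResource copy = usage.getUsedCopy("");
        usage.decUsed("", 1024, 1);
        // The leaked reference reflects the later mutation; the copy does not.
        System.out.println("leaked: " + leaked.memory + " / " + leaked.vcores);
        System.out.println("copy:   " + copy.memory + " / " + copy.vcores);
    }
}
```

Copying on every read has its own allocation cost, so this is a sketch of
the hazard rather than a proposal for the patch.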
This locking bug has been there for quite some time. Daryn is simply
optimizing what it already does today. I'm guessing the inconsistency isn't
much of an issue in practice because the granular scheduler and queue locks
are already held during scheduling. That leaves the UI, which I believe can
grab these values without holding those same granular locks, to show
occasional inconsistent values.
I'm +1 for the patch. It significantly speeds up what is a very common case
for us, and I suspect running without node labels is fairly common among
other users as well.
Eventually we should try to make this as lockless as possible, using a
ConcurrentHashMap whose values are atomic snapshot objects for state where we
need to update many fields at once. But that's a more significant effort for
another JIRA. This is a small change that offers a nice speedup for a common
scenario in the interim.
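As a rough sketch of the snapshot idea mentioned above, assuming
hypothetical names (UsageSnapshot, LocklessUsage) that are not part of the
patch or of YARN: each label maps to an immutable value, and an update swaps
in a whole new value instead of mutating fields in place. Reads on a
ConcurrentHashMap do not block, and merge/compute apply the update
atomically per key (the map still synchronizes internally on updates, so
"lockless" here refers to the caller's code).

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical immutable snapshot: both fields always change together.
final class UsageSnapshot {
    final long memory;
    final int vcores;

    UsageSnapshot(long memory, int vcores) {
        this.memory = memory;
        this.vcores = vcores;
    }

    UsageSnapshot plus(UsageSnapshot other) {
        return new UsageSnapshot(memory + other.memory, vcores + other.vcores);
    }

    UsageSnapshot minus(long m, int v) {
        return new UsageSnapshot(memory - m, vcores - v);
    }
}

// Hypothetical usage tracker with no explicit read/write lock.
class LocklessUsage {
    private static final UsageSnapshot ZERO = new UsageSnapshot(0, 0);
    private final ConcurrentHashMap<String, UsageSnapshot> used =
        new ConcurrentHashMap<>();

    // Non-blocking read; the returned object can never change afterwards.
    UsageSnapshot getUsed(String label) {
        return used.getOrDefault(label, ZERO);
    }

    // Atomically replaces the snapshot for this label with a larger one.
    void incUsed(String label, long memory, int vcores) {
        used.merge(label, new UsageSnapshot(memory, vcores), UsageSnapshot::plus);
    }

    // Atomically replaces the snapshot for this label with a smaller one.
    void decUsed(String label, long memory, int vcores) {
        used.compute(label,
            (k, cur) -> (cur == null ? ZERO : cur).minus(memory, vcores));
    }
}

class LocklessUsageDemo {
    public static void main(String[] args) {
        LocklessUsage usage = new LocklessUsage();
        usage.incUsed("", 4096, 4);
        UsageSnapshot before = usage.getUsed("");
        usage.decUsed("", 1024, 1);
        UsageSnapshot after = usage.getUsed("");
        // 'before' is unchanged by the decrement; 'after' is a new object.
        System.out.println(before.memory + " / " + before.vcores);
        System.out.println(after.memory + " / " + after.vcores);
    }
}
```

A reader holding an old snapshot may see a stale value, but never one with
memory and vcores taken from two different updates.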
> Avoid locking overhead for NO_LABEL lookups
> -------------------------------------------
>
> Key: YARN-6680
> URL: https://issues.apache.org/jira/browse/YARN-6680
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Affects Versions: 2.8.0
> Reporter: Daryn Sharp
> Assignee: Daryn Sharp
> Attachments: YARN-6680.patch
>
>
> Labels are managed via a hash that is protected with a read lock. The lock
> acquire and release are each just as expensive as the hash lookup itself -
> resulting in a 3X slowdown.