[
https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349575#comment-14349575
]
Jason Lowe commented on YARN-3136:
----------------------------------
Thanks for updating the patch, Sunil. Comments:
I'd like to see a comment on the method stating why it's synchronizing and that
it should be overridden for performance if the underlying map supports
concurrent access. Also we're inconsistent about using the new method. There
are plenty of places, both in AbstractYarnScheduler and otherwise, that still
call applications.get rather than the new method to abstract it.
The FairScheduler and FifoScheduler also use concurrent maps, shouldn't they
override the method as well?
Speaking of accessing the applications map without holding a proper lock,
doesn't AbstractYarnScheduler.createReleaseCache() do exactly that? Seems like
the underlying map needs to be concurrent for that code to iterate the map
without holding a lock, and either it's safe to assume it is concurrent (and
thus we don't need all this extra get method overhead and related changes), or
createReleaseCache() is broken for third-party schedulers who didn't bother to
use a concurrent map.
> getTransferredContainers can be a bottleneck during AM registration
> -------------------------------------------------------------------
>
> Key: YARN-3136
> URL: https://issues.apache.org/jira/browse/YARN-3136
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: scheduler
> Affects Versions: 2.6.0
> Reporter: Jason Lowe
> Assignee: Sunil G
> Attachments: 0001-YARN-3136.patch, 0002-YARN-3136.patch,
> 0003-YARN-3136.patch, 0004-YARN-3136.patch
>
>
> While examining RM stack traces on a busy cluster I noticed a pattern of AMs
> stuck waiting for the scheduler lock trying to call getTransferredContainers.
> The scheduler lock is highly contended, especially on a large cluster with
> many nodes heartbeating, and it would be nice if we could find a way to
> eliminate the need to grab this lock during this call. We've already done
> similar work during AM allocate calls to make sure they don't needlessly grab
> the scheduler lock, and it would be good to do so here as well, if possible.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)