Jason Lowe commented on YARN-3136:

Thanks for updating the patch, Sunil.  Comments:

I'd like to see a comment on the method stating why it's synchronizing and that 
it should be overridden for performance if the underlying map supports 
concurrent access.  Also we're inconsistent about using the new method.  There 
are plenty of places, both in AbstractYarnScheduler and otherwise, that still 
call applications.get rather than the new method to abstract it.

The FairScheduler and FifoScheduler also use concurrent maps, shouldn't they 
override the method as well?

Speaking of accessing the applications map without holding a proper lock, 
doesn't AbstractYarnScheduler.createReleaseCache() do exactly that?  Seems like 
the underlying map needs to be concurrent for that code to iterate the map 
without holding a lock, and either it's safe to assume it is concurrent (and 
thus we don't need all this extra get method overhead and related changes), or 
createReleaseCache() is broken for third-party schedulers who didn't bother to 
use a concurrent map.

> getTransferredContainers can be a bottleneck during AM registration
> -------------------------------------------------------------------
>                 Key: YARN-3136
>                 URL: https://issues.apache.org/jira/browse/YARN-3136
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: scheduler
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Sunil G
>         Attachments: 0001-YARN-3136.patch, 0002-YARN-3136.patch, 
> 0003-YARN-3136.patch, 0004-YARN-3136.patch
> While examining RM stack traces on a busy cluster I noticed a pattern of AMs 
> stuck waiting for the scheduler lock trying to call getTransferredContainers. 
>  The scheduler lock is highly contended, especially on a large cluster with 
> many nodes heartbeating, and it would be nice if we could find a way to 
> eliminate the need to grab this lock during this call.  We've already done 
> similar work during AM allocate calls to make sure they don't needlessly grab 
> the scheduler lock, and it would be good to do so here as well, if possible.

This message was sent by Atlassian JIRA

Reply via email to