[ 
https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350487#comment-14350487
 ] 

Jason Lowe commented on YARN-3136:
----------------------------------

bq. createReleaseCache is only called in serviceInit, so I think it should be fine.

But createReleaseCache schedules a timer task that, sometime much later, tries 
to walk the applications map without a lock.  It may set up the timer during 
serviceInit, but is it guaranteed that there's no contention when this timer 
task finally runs?  Maybe I'm missing something.
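
To make the concern concrete, here is a minimal sketch of what can go wrong 
when a plain HashMap is walked while it is being modified.  This is not the 
actual AbstractYarnScheduler code: the class, the method, and the map contents 
are all hypothetical.  The modification happens on the same thread so that the 
failure is deterministic; with a real timer thread it would be intermittent, 
and could also show up as stale or missing entries rather than an exception.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

public class UnlockedWalkSketch {
    // Hypothetical stand-in for a timer task walking the applications map.
    // Returns true if the walk failed with ConcurrentModificationException.
    static boolean walkWhileModifying(Map<String, Integer> apps) {
        try {
            for (String id : apps.keySet()) {
                // Simulates another thread adding an application mid-walk.
                apps.put("late-" + id, 0);
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> apps = new HashMap<>();
        apps.put("app_1", 1);
        apps.put("app_2", 2);
        System.out.println("walk failed: " + walkWhileModifying(apps));
    }
}
```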

bq. I have a general question: is AbstractYarnScheduler supposed to be 
public for external use?

I wondered the same.  By far the simplest thing to do here is to just document 
(or require, by changing the type from Map to ConcurrentMap as I originally 
suggested) that the underlying map must support concurrent access.  If we only 
expect AbstractYarnScheduler to be used by the Fifo, Fair, and Capacity 
schedulers then we don't need to bother with the overhead of an accessor method 
that can be overridden, etc.  Technically AbstractYarnScheduler was not marked 
Public, so we should be able to update it without worrying about third-party 
use.  I agree that we should mark it Private/Unstable going forward, regardless 
of how we eventually fix this.
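
As a rough illustration of the ConcurrentMap option, the sketch below declares 
the field as ConcurrentMap so the concurrency requirement is part of the type, 
and lets a timer task walk it without a lock while another thread keeps adding 
entries.  The class, field, and method names are hypothetical and are not taken 
from any of the attached patches.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class ConcurrentAppsSketch {
    // Typed as ConcurrentMap rather than Map: callers cannot supply a
    // map that is unsafe for concurrent access.
    private final ConcurrentMap<String, Integer> applications =
        new ConcurrentHashMap<>();

    void addApplication(String id, int containers) {
        applications.put(id, containers);
    }

    // Walks the map without a lock. ConcurrentHashMap iterators are weakly
    // consistent, so this never throws ConcurrentModificationException.
    int countContainers() {
        int total = 0;
        for (int c : applications.values()) {
            total += c;
        }
        return total;
    }

    // Schedules a delayed timer task (loosely analogous to one set up during
    // init) while this thread keeps adding zero-container applications.
    // Returns the total seen by the timer task, or -1 if it did not finish.
    static int runDemo() {
        ConcurrentAppsSketch sched = new ConcurrentAppsSketch();
        sched.addApplication("app_1", 2);
        sched.addApplication("app_2", 3);
        CountDownLatch done = new CountDownLatch(1);
        int[] result = new int[1];
        Timer timer = new Timer(true);
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                result[0] = sched.countContainers();
                done.countDown();
            }
        }, 50);
        try {
            for (int i = 0; i < 1000; i++) {
                sched.addApplication("extra_" + i, 0);
            }
            return done.await(5, TimeUnit.SECONDS) ? result[0] : -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        } finally {
            timer.cancel();
        }
    }

    public static void main(String[] args) {
        System.out.println("containers seen by timer task: " + runDemo());
    }
}
```

The trade-off is that the walk sees a weakly consistent view: entries added or 
removed during the iteration may or may not be reflected, which has to be 
acceptable for whatever the timer task does with the result.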

> getTransferredContainers can be a bottleneck during AM registration
> -------------------------------------------------------------------
>
>                 Key: YARN-3136
>                 URL: https://issues.apache.org/jira/browse/YARN-3136
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: scheduler
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Sunil G
>         Attachments: 0001-YARN-3136.patch, 0002-YARN-3136.patch, 
> 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch
>
>
> While examining RM stack traces on a busy cluster I noticed a pattern of AMs 
> stuck waiting for the scheduler lock trying to call getTransferredContainers. 
>  The scheduler lock is highly contended, especially on a large cluster with 
> many nodes heartbeating, and it would be nice if we could find a way to 
> eliminate the need to grab this lock during this call.  We've already done 
> similar work during AM allocate calls to make sure they don't needlessly grab 
> the scheduler lock, and it would be good to do so here as well, if possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
