[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577834#comment-17577834 ]
ASF GitHub Bot commented on YARN-11191:
---------------------------------------

yb12138 commented on PR #4726:
URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1210271008

@luoyuan3471 ![Untitled](https://user-images.githubusercontent.com/29743168/183837904-187ebe71-d5a6-474c-948d-1160f0d3407e.png)

As the image shows, this deadlock occurs when the refresh thread is calling PreemptionManager.refreshQueue while the schedule thread is calling AbstractCSQueue.getTotalKillableResource. At that point the refresh thread requires csqueue.readLock, but that read lock is blocked by the schedule thread and an "other thread" (https://bugs.openjdk.org/browse/JDK-6893626). Meanwhile, the schedule thread requires PreemptionManager.readLock, which is blocked by the write lock the refresh thread already holds. So I use tryLock to let the refresh thread attempt csqueue.readLock without blocking. Once the refresh thread completes PreemptionManager.refreshQueue, the schedule thread can acquire PreemptionManager.readLock and then allocate new containers.

> Global Scheduler refreshQueue cause deadLock
> ---------------------------------------------
>
>                 Key: YARN-11191
>                 URL: https://issues.apache.org/jira/browse/YARN-11191
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>    Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0
>            Reporter: ben yang
>            Priority: Major
>              Labels: pull-request-available
>        Attachments: 1.jstack, YARN-11191.001.patch
>
> This is a potential bug that may impact any cluster with preemption enabled. In our current version, with preemption enabled, the CapacityScheduler calls the refreshQueue method of the PreemptionManager when it refreshes queues. This process holds the PreemptionManager write lock and requires the csqueue read lock. Meanwhile, ParentQueue.canAssignToThisQueue holds the csqueue read lock and requires the PreemptionManager read lock.
> There is a possibility of deadlock at this time, because the read lock has one rule under the non-fair policy: when the lock is already held by readers and the first request in the lock's wait queue is a write-lock request, other read-lock requests cannot acquire the lock.
> So the potential deadlock is:
> {code:java}
> CapacityScheduler.refreshQueue:  holds:    PreemptionManager.writeLock
>                                  requires: csqueue.readLock
> CapacityScheduler.schedule:      holds:    csqueue.readLock
>                                  requires: PreemptionManager.readLock
> other thread (completeContainer, release resource, etc.):
>                                  requires: csqueue.writeLock
> {code}
> The jstack logs at the time were as follows.

-- 
This message was sent by Atlassian Jira
(v8.20.10#820010)
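The fix the commenter describes can be sketched as follows. This is a minimal, self-contained illustration of the idea, not the actual Hadoop patch: the class and method names (TryLockSketch, tryRefreshQueue) are hypothetical, and the real code in PR #4726 operates on the CSQueue and PreemptionManager locks. The point is that the refresh thread, which already holds a write lock, calls tryLock with a timeout on the queue's read lock instead of a blocking lock(), so it can back off and retry rather than wait forever inside the lock cycle.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the tryLock-based fix; names are illustrative only.
public class TryLockSketch {
    public static final ReentrantReadWriteLock queueLock = new ReentrantReadWriteLock();

    // Returns true if the refresh step ran; false means the queue lock was
    // busy, so the caller should release its own locks and retry later
    // instead of blocking (which is what closed the deadlock cycle).
    public static boolean tryRefreshQueue() throws InterruptedException {
        if (queueLock.readLock().tryLock(100, TimeUnit.MILLISECONDS)) {
            try {
                // ... copy the queue state needed for preemption bookkeeping ...
                return true;
            } finally {
                queueLock.readLock().unlock();
            }
        }
        return false; // back off; do not wait behind a queued writer
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch writerHolds = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        // Simulate the "other thread" that holds/queues a write lock.
        Thread writer = new Thread(() -> {
            queueLock.writeLock().lock();
            writerHolds.countDown();
            try { release.await(); } catch (InterruptedException ignored) { }
            queueLock.writeLock().unlock();
        });
        writer.start();
        writerHolds.await();
        System.out.println(tryRefreshQueue()); // false: writer holds the lock
        release.countDown();
        writer.join();
        System.out.println(tryRefreshQueue()); // true: lock is free again
    }
}
```

With a plain queueLock.readLock().lock() here, the refresh thread would park indefinitely once a writer is queued (the non-fair-policy behavior tracked in JDK-6893626), while still holding the PreemptionManager write lock that the schedule thread needs.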