[
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ben yang updated YARN-11191:
----------------------------
Description:
This is a potential bug that may impact any cluster running with preemption enabled. In our current version, when the CapacityScheduler refreshes queues it calls the refreshQueue method of the PreemptionManager. That call holds the PreemptionManager write lock while requiring the csqueue read lock. Meanwhile, ParentQueue.canAssignToThisQueue holds the csqueue read lock and requires the PreemptionManager read lock.
This can deadlock because of a rule the read lock follows even under the non-fair policy: when the lock is already held by a reader and the first request in the lock's wait queue is a write lock request, other read lock requests cannot acquire the lock.
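The rule above can be seen directly with java.util.concurrent's ReentrantReadWriteLock, which the scheduler's locks are built on. The following is a minimal, self-contained sketch (thread roles in the comments are illustrative mappings to the scheduler threads): one thread holds the read lock, a writer parks in the wait queue behind it, and a new reader then fails to acquire even though only a read lock is held.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReaderBlocksDemo {

    // Returns whether a NEW reader can acquire the read lock while another
    // thread already holds it and a writer is parked in the wait queue.
    static boolean tryReadWhileWriterQueued() throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(); // non-fair, the default
        CountDownLatch readerHolds = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // Reader thread: takes and holds the read lock (like schedule()
        // holding the csqueue read lock).
        Thread reader = new Thread(() -> {
            lock.readLock().lock();
            readerHolds.countDown();
            try {
                release.await();
            } catch (InterruptedException ignored) {
            }
            lock.readLock().unlock();
        });
        reader.start();
        readerHolds.await();

        // Writer thread: queues up behind the active reader (like a
        // completeContainer/releaseResource caller wanting csqueue.writeLock).
        Thread writer = new Thread(() -> {
            lock.writeLock().lock();
            lock.writeLock().unlock();
        });
        writer.start();
        while (lock.getQueueLength() == 0) {
            Thread.sleep(10); // wait until the writer is actually parked
        }

        // A new reader cannot barge past the queued writer: the timed tryLock
        // honors the "first queued request is a writer" rule and times out.
        boolean acquired = lock.readLock().tryLock(200, TimeUnit.MILLISECONDS);
        if (acquired) {
            lock.readLock().unlock();
        }
        release.countDown(); // let the original reader finish; the writer then proceeds
        writer.join();
        reader.join();
        return acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("new reader acquired while writer queued: "
                + tryReadWhileWriterQueued()); // prints false
    }
}
```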
So the potential deadlock is:
{code:java}
CapacityScheduler.refreshQueue: holds: PreemptionManager.writeLock
                                requires: csqueue.readLock
CapacityScheduler.schedule:     holds: csqueue.readLock
                                requires: PreemptionManager.readLock
other threads (completeContainer, releaseResource, etc.):
                                require: csqueue.writeLock
{code}
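The full cycle can be reproduced in isolation with two ReentrantReadWriteLocks standing in for the PreemptionManager and csqueue locks. This is a sketch, not the scheduler's actual code: the thread roles, latches, and timeouts are illustrative, and timed tryLock calls replace the real blocking acquisitions so the sketch terminates instead of hanging. Both acquisition attempts time out, showing that every leg of the cycle stalls.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class DeadlockCycleDemo {

    // Stand-ins for the PreemptionManager lock and the csqueue lock.
    static final ReentrantReadWriteLock pmLock = new ReentrantReadWriteLock();
    static final ReentrantReadWriteLock queueLock = new ReentrantReadWriteLock();

    // Reproduces the cycle; returns {scheduleGotPmRead, refreshGotQueueRead}.
    static boolean[] reproduce() throws InterruptedException {
        boolean[] acquired = new boolean[2];
        CountDownLatch queueReadHeld = new CountDownLatch(1);
        CountDownLatch pmWriteHeld = new CountDownLatch(1);
        CountDownLatch refreshAttemptDone = new CountDownLatch(1);

        // "schedule": holds csqueue.readLock, then needs PreemptionManager.readLock.
        Thread schedule = new Thread(() -> {
            queueLock.readLock().lock();
            queueReadHeld.countDown();
            try {
                pmWriteHeld.await();
                acquired[0] = pmLock.readLock().tryLock(300, TimeUnit.MILLISECONDS);
                if (acquired[0]) pmLock.readLock().unlock();
                refreshAttemptDone.await(); // keep the read hold during refresh's attempt
            } catch (InterruptedException ignored) {
            } finally {
                queueLock.readLock().unlock();
            }
        });

        // "completeContainer": queues for csqueue.writeLock behind the reader.
        Thread container = new Thread(() -> {
            queueLock.writeLock().lock();
            queueLock.writeLock().unlock();
        });

        // "refreshQueue": holds PreemptionManager.writeLock, then needs csqueue.readLock.
        Thread refresh = new Thread(() -> {
            pmLock.writeLock().lock();
            pmWriteHeld.countDown();
            try {
                acquired[1] = queueLock.readLock().tryLock(1000, TimeUnit.MILLISECONDS);
                if (acquired[1]) queueLock.readLock().unlock();
            } catch (InterruptedException ignored) {
            } finally {
                refreshAttemptDone.countDown();
                pmLock.writeLock().unlock();
            }
        });

        schedule.start();
        queueReadHeld.await();
        container.start();
        while (queueLock.getQueueLength() == 0) Thread.sleep(10); // writer parked
        refresh.start();

        schedule.join();
        container.join();
        refresh.join();
        return acquired; // both false: each read acquisition timed out
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] a = reproduce();
        System.out.println("schedule acquired PM read lock:     " + a[0]);
        System.out.println("refresh acquired csqueue read lock: " + a[1]);
    }
}
```

In the real scheduler the acquisitions are untimed, so instead of timing out, all three threads block forever.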
The jstack log captured at the time is attached (1.jstack).
> Global Scheduler refreshQueue cause deadLock
> ---------------------------------------------
>
> Key: YARN-11191
> URL: https://issues.apache.org/jira/browse/YARN-11191
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler
> Affects Versions: 2.9.0, 3.3.0
> Reporter: ben yang
> Priority: Major
> Attachments: 1.jstack, YARN-11191.001.patch
>
>
> This is a potential bug that may impact any cluster running with preemption
> enabled. In our current version, when the CapacityScheduler refreshes queues
> it calls the refreshQueue method of the PreemptionManager. That call holds
> the PreemptionManager write lock while requiring the csqueue read lock.
> Meanwhile, ParentQueue.canAssignToThisQueue holds the csqueue read lock and
> requires the PreemptionManager read lock.
> This can deadlock because of a rule the read lock follows even under the
> non-fair policy: when the lock is already held by a reader and the first
> request in the lock's wait queue is a write lock request, other read lock
> requests cannot acquire the lock.
> So the potential deadlock is:
> {code:java}
> CapacityScheduler.refreshQueue: holds: PreemptionManager.writeLock
>                                 requires: csqueue.readLock
> CapacityScheduler.schedule:     holds: csqueue.readLock
>                                 requires: PreemptionManager.readLock
> other threads (completeContainer, releaseResource, etc.):
>                                 require: csqueue.writeLock
> {code}
> The jstack log captured at the time is attached (1.jstack).
--
This message was sent by Atlassian Jira
(v8.20.7#820007)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]