Wangda Tan commented on YARN-2925:

We cannot simply synchronize access to the internal fields used to compute 
user-limit and headroom; doing so leads to a deadlock:
- Thread 1 is CS's message handler. It processes a node's heartbeat and tries 
to allocate some containers. It acquires LeafQueue's synchronized lock first, 
then acquires the corresponding FiCaScheduler's synchronized lock.
- Thread 2 is ApplicationMasterService.allocate, which calls CS.allocate. It 
acquires FiCaScheduler's synchronized lock first, then acquires LeafQueue's 
synchronized lock.
The two threads take the same two locks in opposite order, so threads 1 and 2 
can deadlock.
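
To make the lock-ordering problem concrete, here is a minimal standalone 
sketch. It does not use any YARN classes: the two plain Objects are 
hypothetical stand-ins for the LeafQueue and FiCaScheduler monitors, and the 
class and method names (LockOrderDemo, lockInOrder, reproduceDeadlock) are 
made up for illustration. The latch only forces the bad interleaving so that 
the deadlock happens every time instead of occasionally.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

class LockOrderDemo {

    // Take `first`, wait until the other thread holds its own first lock,
    // then try to take `second`.
    static void lockInOrder(Object first, Object second, CountDownLatch bothHoldFirst) {
        synchronized (first) {
            bothHoldFirst.countDown();
            try {
                bothHoldFirst.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (second) {
                // Not reached: the other thread holds `second` and waits for `first`.
            }
        }
    }

    // Returns true if the JVM reports a monitor deadlock between the two threads.
    static boolean reproduceDeadlock() {
        final Object leafQueue = new Object(); // stand-in for LeafQueue's monitor
        final Object app = new Object();       // stand-in for FiCaScheduler's monitor
        CountDownLatch bothHoldFirst = new CountDownLatch(2);

        // Thread 1: heartbeat path, LeafQueue lock first, then the app lock.
        Thread heartbeat =
            new Thread(() -> lockInOrder(leafQueue, app, bothHoldFirst), "node-heartbeat");
        // Thread 2: allocate path, app lock first, then the LeafQueue lock.
        Thread allocate =
            new Thread(() -> lockInOrder(app, leafQueue, bothHoldFirst), "am-allocate");
        // Daemon threads so the JVM can still exit while they stay blocked.
        heartbeat.setDaemon(true);
        allocate.setDaemon(true);
        heartbeat.start();
        allocate.start();

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        try {
            for (int i = 0; i < 100; i++) {
                long[] ids = mx.findMonitorDeadlockedThreads();
                if (ids != null && ids.length >= 2) {
                    return true;
                }
                Thread.sleep(50);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("deadlock detected: " + reproduceDeadlock());
    }
}
```

If both paths took the two locks in the same order, the cycle could not form; 
the problem is the ordering, not the locks themselves.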

Basically, we have two choices to solve this problem while avoiding the 
deadlock mentioned above:
- Add a synchronized modifier to CapacityScheduler.allocate, so that write 
operations on LeafQueue are protected by the CapacityScheduler lock. But in 
real-world use cases, CapacityScheduler.allocate is called by every 
application within a short period, so locking the whole CS seems too 
inefficient.
- Add a fine-grained lock in LeafQueue that protects only the 
resource/capacity related fields. With this, the fields are protected and the 
CS lock is avoided altogether, so I prefer the 2nd way.
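
A rough sketch of what the second option could look like, assuming a 
read/write lock is used. This is not the actual patch: the class 
QueueUsageSketch, its fields, and the headroom formula (limit minus used, 
floored at zero) are simplified placeholders, and real LeafQueue code works 
on Resource objects rather than plain longs.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

class QueueUsageSketch {
    // Guards only the resource/capacity fields below, nothing else.
    private final ReentrantReadWriteLock usageLock = new ReentrantReadWriteLock();
    private final long limitMb; // hypothetical queue limit
    private long usedMb;        // hypothetical used resource

    QueueUsageSketch(long limitMb) {
        this.limitMb = limitMb;
    }

    // Writer path, e.g. called while containers are being allocated.
    void allocate(long mb) {
        usageLock.writeLock().lock();
        try {
            usedMb += mb;
        } finally {
            usageLock.writeLock().unlock();
        }
    }

    // Writer path, e.g. called when a container completes.
    void release(long mb) {
        usageLock.writeLock().lock();
        try {
            usedMb = Math.max(0, usedMb - mb);
        } finally {
            usageLock.writeLock().unlock();
        }
    }

    // Reader path, e.g. called from the application side to compute headroom.
    long headroomMb() {
        usageLock.readLock().lock();
        try {
            return Math.max(0, limitMb - usedMb);
        } finally {
            usageLock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        QueueUsageSketch queue = new QueueUsageSketch(1000);
        queue.allocate(300);
        System.out.println("headroom: " + queue.headroomMb());
    }
}
```

The point of the design is that no other lock is acquired while usageLock is 
held, so it cannot become part of a lock-order cycle with the LeafQueue or 
application monitors, and concurrent headroom readers do not block each other.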

> Internal fields in LeafQueue access should be protected when accessed from 
> FiCaSchedulerApp to calculate Headroom
> -----------------------------------------------------------------------------------------------------------------
>                 Key: YARN-2925
>                 URL: https://issues.apache.org/jira/browse/YARN-2925
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>            Priority: Critical
> After YARN-2644, FiCaScheduler calculates up-to-date headroom before 
> sending back the Allocation response to the AM.
> Headroom calculation happens on the LeafQueue side and uses fields like used 
> resource, etc. But it is not protected by any lock of LeafQueue, so it might 
> be corrupted if someone else is editing it.

This message was sent by Atlassian JIRA
