Ashwin Shankar created YARN-2959:
------------------------------------

             Summary: Fair Scheduler "fifo" option can violate FIFO behavior 
and cause deadlock among jobs
                 Key: YARN-2959
                 URL: https://issues.apache.org/jira/browse/YARN-2959
             Project: Hadoop YARN
          Issue Type: Bug
          Components: fairscheduler
            Reporter: Ashwin Shankar


We have a cluster that runs jobs in FIFO order (due to the nature of those 
jobs) using the Fair Scheduler's "fifo" option.
Recently we found jobs deadlocked in the cluster. Here is what happened:
There were two jobs, say A and B. A was submitted before B.
Both were in the PENDING state since the cluster was busy.
When containers freed up, the two pending jobs got their AM containers at about 
the same time.
However, Job B's AM (i.e. its appattempt1) registered with the RM a little 
earlier than Job A's, grabbed the containers available at that time, and 
satisfied a fraction of its requirement. Note that Job B can't make progress 
until its entire requirement is satisfied.
Next, Job A's appattempt1 registered with the RM, and since Job A was submitted 
earlier, the RM stopped allocating containers to Job B and started allocating 
to Job A, satisfying a fraction of its requirement as well.
Now Job A and Job B together hold the entire cluster, but neither can progress; 
they are deadlocked because their resource requests are only partially 
satisfied.

Note: The above is an example with 2 jobs; however, the deadlock can happen 
with n jobs J1..Jn if the sequence of AM registration is Jn, J(n-1), ..., J1.
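
The scenario above can be reproduced with a small toy model. This is only a 
sketch, not Fair Scheduler code: the function name, the one-container-per-tick 
allocation, the `gap` between AM registrations, and the all-or-nothing `needed` 
count are assumptions made for illustration, and AM containers are ignored.

```python
def simulate_submit_order(n, capacity, needed, gap):
    """Toy model: containers held per job when Jn registers first and J1 last.

    Job i (1-based) is submitted at time i and its AM registers at time
    (n - i) * gap, so registration order is Jn, J(n-1), ..., J1.
    The scheduler hands out one container per tick to the registered,
    unsatisfied job with the earliest submit time (the "fifo" ordering).
    """
    held = {i: 0 for i in range(1, n + 1)}
    free = capacity
    for now in range(n * gap + capacity):
        if free == 0:
            break
        registered = [
            i for i in held
            if (n - i) * gap <= now and held[i] < needed
        ]
        if not registered:
            continue
        head = min(registered)  # lowest index == earliest submit time
        held[head] += 1
        free -= 1
    return held


if __name__ == "__main__":
    held = simulate_submit_order(n=3, capacity=12, needed=8, gap=4)
    print(held)
    # Every job holds part of the cluster, none holds its full request.
    print(all(h < 8 for h in held.values()))
```

In this model each later-registering (but earlier-submitted) job takes over 
allocation before the previous one is satisfied, so the cluster fills up with 
partially satisfied requests.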
 
Proposed solution: order the fifo queue by appattempt start/register time 
instead of app submit time.
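
To illustrate the difference between the two orderings, here is a hedged toy 
simulation (again not the actual Fair Scheduler implementation). The `Job` 
class, the `run` function, the tick-based allocation, and the rule that a job 
finishes and releases its containers as soon as its full request is met are 
all assumptions for the sake of the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    submit_time: int
    register_time: int
    needed: int          # all-or-nothing container requirement
    held: int = 0
    done: bool = False


def run(jobs, capacity, ticks, order_key):
    """Allocate one container per tick to the head of the queue.

    Returns the names of the jobs that completed, in completion order.
    """
    free = capacity
    finished = []
    for now in range(ticks):
        # A fully satisfied job runs to completion and frees its containers.
        for job in jobs:
            if not job.done and job.held == job.needed:
                job.done = True
                free += job.held
                job.held = 0
                finished.append(job.name)
        waiting = [j for j in jobs if not j.done and j.register_time <= now]
        if waiting and free > 0:
            head = min(waiting, key=order_key)
            head.held += 1
            free -= 1
    return finished


def make_jobs():
    # A is submitted first, but B's AM registers first.
    return [Job("A", submit_time=0, register_time=5, needed=8),
            Job("B", submit_time=1, register_time=0, needed=8)]


by_submit = run(make_jobs(), capacity=10, ticks=40,
                order_key=lambda j: j.submit_time)
by_register = run(make_jobs(), capacity=10, ticks=40,
                  order_key=lambda j: j.register_time)

if __name__ == "__main__":
    print(by_submit)    # [] : neither job ever completes
    print(by_register)  # ['B', 'A'] : both complete
```

Under these assumptions, ordering by register time lets B keep its place at 
the head of the queue until it is fully satisfied, so both jobs eventually 
finish. Note that in this model B then completes before A even though A was 
submitted first.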





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
