[ 
https://issues.apache.org/jira/browse/YARN-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246753#comment-14246753
 ] 

Karthik Kambatla commented on YARN-2959:
----------------------------------------

Thanks for reporting this, Ashwin. A couple of follow-up questions:
# Is preemption enabled? I would have expected Job B to have containers 
preempted and handed over to Job A.
# If Job A was submitted before Job B, we should probably investigate why Job 
B's AM came up first. Are these MR jobs (or managed AMs)? If they are managed 
AMs and the order of requests was not honored, considering the AppAttempt 
start/register time might not help us much.

> Fair Scheduler "fifo" option can violate FIFO behavior and cause deadlock 
> among jobs
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2959
>                 URL: https://issues.apache.org/jira/browse/YARN-2959
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>            Reporter: Ashwin Shankar
>
> We have a cluster that runs jobs in FIFO order (due to the nature of those 
> jobs) using the Fair Scheduler's "fifo" option.
> Recently we found jobs deadlocked in the cluster. Here is what happened:
> There were two jobs, say A and B. A was submitted before B.
> Both were in PENDING state since the cluster was busy.
> When containers freed up, the two pending jobs got their AM containers at 
> about the same time. 
> However, Job B's AM (appattempt1) registered with the RM a little earlier 
> than Job A's and grabbed the containers available at that time, satisfying a 
> fraction of its requirement. Note that Job B can't make progress until its 
> entire requirement is satisfied.
> Next, Job A's appattempt1 registered with the RM and, since Job A was 
> submitted earlier, the RM stopped allocating containers to Job B and started 
> allocating to Job A, satisfying a fraction of its requirement as well.
> Now Job A and Job B together hold the entire cluster, but neither can 
> progress; they are deadlocked because their resource requests are only 
> partially satisfied.
> Note: the above is an example with 2 jobs; however, the deadlock can happen 
> with n jobs J1..Jn if the sequence of AM registration is Jn, J(n-1), .., J1.
>  
> Solution: one proposed solution is to order the fifo queue by appattempt 
> start/register time instead of app submit time.
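
The scenario in the description can be reproduced with a toy model. The sketch below is not the Fair Scheduler's code: Job, simulate, scenario and FREED_AT are hypothetical names, and the container counts (a 10-container cluster, 8 containers needed per job, 4 freed at tick 1 and 6 at tick 2) are made-up numbers chosen only to match the story. It assumes no preemption and that a job holds whatever it is granted until its whole requirement is met.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    submit_time: int
    register_time: int   # tick at which the AM registers with the RM
    need: int            # containers required before the job can progress
    held: int = 0
    done: bool = False


def simulate(jobs, freed_at, order_key, max_ticks=20):
    """Return the names of the jobs that finish, in completion order.

    freed_at maps a tick to the number of containers released at that
    tick by other work that is not modelled here.
    """
    free = 0
    finished = []
    for tick in range(max_ticks):
        free += freed_at.get(tick, 0)
        # A fully satisfied job runs to completion and returns its containers.
        for job in jobs:
            if not job.done and job.held == job.need:
                job.done = True
                free += job.held
                job.held = 0
                finished.append(job.name)
        # Hand free containers to registered jobs in queue order.
        waiting = [j for j in jobs if not j.done and j.register_time <= tick]
        for job in sorted(waiting, key=order_key):
            grant = min(free, job.need - job.held)
            job.held += grant
            free -= grant
    return finished


def scenario():
    # A is submitted first, but B's AM registers first.
    return [Job("A", submit_time=0, register_time=2, need=8),
            Job("B", submit_time=1, register_time=1, need=8)]


FREED_AT = {1: 4, 2: 6}  # 10 containers in total (assumed numbers)

by_submit = simulate(scenario(), FREED_AT, lambda j: j.submit_time)
by_register = simulate(scenario(), FREED_AT, lambda j: j.register_time)
print(by_submit)    # [] : A holds 6, B holds 4, neither reaches 8
print(by_register)  # ['B', 'A']
```

In this model, ordering by submit time leaves both jobs stuck, while ordering by register time lets B finish and release its containers to A. This only illustrates the reported sequence; it does not show that register-time ordering is safe in general, and it ignores preemption, which the comment above asks about.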



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
