[ https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138915#comment-16138915 ]

Jason Lowe commented on YARN-7086:
----------------------------------

We've noticed container release is particularly painful as well, although we 
haven't seen it deadlock.

Whether we do this asynchronously or not, one issue is that releasing a bunch 
of containers requires grabbing a highly contended lock for every container 
released.  Do this in a loop and it ends up taking a long time since getting 
the lock is not cheap.  Async scheduling helps since we can wait in some other 
thread rather than in the AM handler threads or scheduler dispatcher thread, 
but it will still take a long time looping through all those events.  I think 
it would be a lot better if there was a bulk-release interface so we could grab 
the critical lock once.  We can put a limit on how many we do per batch if 
we're worried it will hold that lock for too long, but I don't think it's so 
much the actual work per container as it is the time spent waiting for the lock 
that makes this so painful.
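The bulk-release idea can be illustrated with a small Java sketch. All of the names here are hypothetical: `BulkReleaseSketch`, `releaseOneByOne` and `releaseInBatches` are invented for illustration and are not YARN's scheduler API. A `ReentrantLock` stands in for the contended scheduler lock, and a `HashSet` stands in for the scheduler's container bookkeeping. The sketch only compares how many times each strategy acquires the lock: once per container, or once per batch of at most `maxPerBatch` containers.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch, not YARN code: compares per-container locking with
// batched release under a single stand-in "scheduler lock".
final class BulkReleaseSketch {
    private final ReentrantLock schedulerLock = new ReentrantLock(); // stand-in for the contended lock
    private final Set<String> liveContainers = new HashSet<>();     // guarded by schedulerLock
    private int lockAcquisitions = 0;                                // guarded by schedulerLock

    BulkReleaseSketch(Collection<String> initialLive) {
        liveContainers.addAll(initialLive);
    }

    // Current pattern: one lock acquisition per released container.
    void releaseOneByOne(List<String> ids) {
        for (String id : ids) {
            schedulerLock.lock();
            try {
                lockAcquisitions++;
                liveContainers.remove(id);
            } finally {
                schedulerLock.unlock();
            }
        }
    }

    // Bulk pattern: one lock acquisition per batch; maxPerBatch caps how much
    // work is done while the lock is held in one go.
    void releaseInBatches(List<String> ids, int maxPerBatch) {
        if (maxPerBatch <= 0) {
            throw new IllegalArgumentException("maxPerBatch must be positive");
        }
        for (int start = 0; start < ids.size(); start += maxPerBatch) {
            List<String> batch = ids.subList(start, Math.min(start + maxPerBatch, ids.size()));
            schedulerLock.lock();
            try {
                lockAcquisitions++;
                for (String id : batch) {
                    liveContainers.remove(id);
                }
            } finally {
                schedulerLock.unlock();
            }
        }
    }

    int lockAcquisitions() {
        schedulerLock.lock();
        try { return lockAcquisitions; } finally { schedulerLock.unlock(); }
    }

    int liveCount() {
        schedulerLock.lock();
        try { return liveContainers.size(); } finally { schedulerLock.unlock(); }
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            ids.add("container_" + i);
        }
        BulkReleaseSketch perContainer = new BulkReleaseSketch(ids);
        perContainer.releaseOneByOne(ids);

        BulkReleaseSketch batched = new BulkReleaseSketch(ids);
        batched.releaseInBatches(ids, 100);

        // Prints "1000 vs 10": same containers released, far fewer lock acquisitions.
        System.out.println(perContainer.lockAcquisitions() + " vs " + batched.lockAcquisitions());
    }
}
```

Both strategies do the same per-container work. In the batched version, the thread pays the cost of acquiring the contended lock once per batch rather than once per container. The `maxPerBatch` cap is the knob that bounds how long a single holder keeps the lock.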


> Release all containers asynchronously
> -------------------------------------
>
>                 Key: YARN-7086
>                 URL: https://issues.apache.org/jira/browse/YARN-7086
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Arun Suresh
>            Assignee: Arun Suresh
>
> We have noticed in production two situations that can cause deadlocks and 
> bring scheduling of new containers to a halt, especially with regard 
> to applications that have a lot of live containers:
> # When these applications release these containers in bulk.
> # When these applications terminate abruptly due to some failure, the 
> scheduler releases all of their live containers in a loop.
> To handle the issues mentioned above, we have a patch in production to make 
> sure ALL container releases happen asynchronously - and it has served us well.
> Opening this JIRA to gather feedback on whether this is a good idea in general (cc 
> [~leftnoteasy], [~jlowe], [~curino], [~kasha], [~subru], [~roniburd])
> BTW, in YARN-6251 we already have an asyncReleaseContainer() in the 
> AbstractYarnScheduler and a corresponding scheduler event, which is currently 
> used specifically for the container-update code paths (where the scheduler 
> releases the temporary containers it creates for the update).
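The description proposes routing every container release through an asynchronous path. The Java sketch below illustrates that hand-off, but it is not YARN's `asyncReleaseContainer()` or its scheduler event. `AsyncReleaser` and its methods are hypothetical names invented for illustration. Handler threads only enqueue a container ID and return. A single background thread drains the queue and, following the bulk-release suggestion in the comment above, takes the stand-in scheduler lock once per batch rather than once per container.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch, not YARN's asyncReleaseContainer(): callers enqueue container
// IDs and return at once; one background thread releases them in batches.
final class AsyncReleaser {
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    private final ReentrantLock schedulerLock = new ReentrantLock(); // stand-in for the contended lock
    private final Set<String> liveContainers = new HashSet<>();     // guarded by schedulerLock
    private int lockAcquisitions = 0;                                // guarded by schedulerLock
    private final int maxPerBatch;
    private final Thread releaser = new Thread(this::drainLoop, "container-releaser");
    private volatile boolean running = true;

    AsyncReleaser(Collection<String> initialLive, int maxPerBatch) {
        if (maxPerBatch <= 0) {
            throw new IllegalArgumentException("maxPerBatch must be positive");
        }
        this.maxPerBatch = maxPerBatch;
        liveContainers.addAll(initialLive);
    }

    void start() {
        releaser.start();
    }

    // Called from handler threads; never touches the scheduler lock.
    void asyncRelease(String containerId) {
        pending.add(containerId);
    }

    private void drainLoop() {
        List<String> batch = new ArrayList<>(maxPerBatch);
        try {
            // After shutdown(), keep going until everything already queued is released.
            while (running || !pending.isEmpty()) {
                String first = pending.poll(50, TimeUnit.MILLISECONDS);
                if (first == null) {
                    continue;
                }
                batch.add(first);
                pending.drainTo(batch, maxPerBatch - 1); // at most maxPerBatch per batch
                schedulerLock.lock();                    // one acquisition for the whole batch
                try {
                    lockAcquisitions++;
                    for (String id : batch) {
                        liveContainers.remove(id);
                    }
                } finally {
                    schedulerLock.unlock();
                }
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Assumes callers have stopped calling asyncRelease() before shutdown().
    void shutdown() {
        running = false;
        try {
            releaser.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    int liveCount() {
        schedulerLock.lock();
        try { return liveContainers.size(); } finally { schedulerLock.unlock(); }
    }

    int lockAcquisitions() {
        schedulerLock.lock();
        try { return lockAcquisitions; } finally { schedulerLock.unlock(); }
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            ids.add("container_" + i);
        }
        AsyncReleaser asyncReleaser = new AsyncReleaser(ids, 100);
        asyncReleaser.start();
        ids.forEach(asyncReleaser::asyncRelease);
        asyncReleaser.shutdown();
        System.out.println("live=" + asyncReleaser.liveCount()
                + ", lock acquisitions=" + asyncReleaser.lockAcquisitions());
    }
}
```

Draining races with enqueueing, so the number of lock acquisitions varies from run to run. It is at most one per container and at least one per `maxPerBatch` containers. Either way, the enqueuing threads never wait on the scheduler lock.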



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
