[ https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138915#comment-16138915 ]
Jason Lowe commented on YARN-7086:
----------------------------------
We've noticed container release is particularly painful as well, although we
haven't seen it deadlock.
Whether we do this asynchronously or not, one issue is that releasing a bunch
of containers requires grabbing a highly-contended lock for every container
released. Do this in a loop and it ends up taking a long time since getting
the lock is not cheap. Async scheduling helps since we can wait in some other
thread rather than in the AM handler threads or scheduler dispatcher thread,
but it will still take a long time looping through all those events. I think
it would be a lot better if there was a bulk-release interface so we could grab
the critical lock once. We can put a limit on how many we do per batch if
we're worried it will hold that lock for too long, but I don't think it's so
much the actual work per container as it is the time spent waiting for the lock
that makes this so painful.
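To make the bulk-release idea above concrete, here is a minimal sketch, not taken from any patch on this issue. Every name in it (BulkReleaseSketch, releaseOneByOne, releaseInBatches, maxBatch) is hypothetical, and a plain ReentrantLock stands in for the contended scheduler lock. It only illustrates taking the lock once per batch rather than once per container, with a cap on the batch size so the lock is not held for too long.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch; none of these names are real YARN APIs. */
class BulkReleaseSketch {
    // Stands in for the highly contended scheduler lock (assumption).
    private final ReentrantLock schedulerLock = new ReentrantLock();
    private final Set<String> liveContainers = new HashSet<>();
    // Counts lock acquisitions made by the release paths only.
    private int releaseLockAcquisitions = 0;

    void addContainer(String id) {
        schedulerLock.lock();
        try {
            liveContainers.add(id);
        } finally {
            schedulerLock.unlock();
        }
    }

    /** Per-container release: one lock acquisition for every container. */
    void releaseOneByOne(List<String> ids) {
        for (String id : ids) {
            schedulerLock.lock();
            try {
                releaseLockAcquisitions++;
                liveContainers.remove(id);
            } finally {
                schedulerLock.unlock();
            }
        }
    }

    /** Bulk release: one lock acquisition per batch of at most maxBatch containers. */
    void releaseInBatches(List<String> ids, int maxBatch) {
        if (maxBatch <= 0) {
            throw new IllegalArgumentException("maxBatch must be positive");
        }
        for (int start = 0; start < ids.size(); start += maxBatch) {
            int end = Math.min(start + maxBatch, ids.size());
            schedulerLock.lock();
            try {
                releaseLockAcquisitions++;
                for (String id : ids.subList(start, end)) {
                    liveContainers.remove(id);
                }
            } finally {
                // Unlocking between batches lets other waiters make progress.
                schedulerLock.unlock();
            }
        }
    }

    int releaseLockAcquisitions() {
        schedulerLock.lock();
        try {
            return releaseLockAcquisitions;
        } finally {
            schedulerLock.unlock();
        }
    }

    int liveCount() {
        schedulerLock.lock();
        try {
            return liveContainers.size();
        } finally {
            schedulerLock.unlock();
        }
    }
}
```

In this sketch, releasing 250 containers with maxBatch set to 100 takes the lock 3 times instead of 250. The work done per container is unchanged, so any saving comes from waiting on the lock less often.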
> Release all containers asynchronously
> -------------------------------------
>
> Key: YARN-7086
> URL: https://issues.apache.org/jira/browse/YARN-7086
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Reporter: Arun Suresh
> Assignee: Arun Suresh
>
> We have noticed in production two situations that can cause deadlocks and
> cause scheduling of new containers to come to a halt, especially with regard
> to applications that have a lot of live containers:
> # When these applications release these containers in bulk.
> # When these applications terminate abruptly due to some failure, the
> scheduler releases all its live containers in a loop.
> To handle the issues mentioned above, we have a patch in production to make
> sure ALL container releases happen asynchronously - and it has served us well.
> Opening this JIRA to gather feedback on whether this is a good idea generally (cc
> [~leftnoteasy], [~jlowe], [~curino], [~kasha], [~subru], [~roniburd])
> BTW, in YARN-6251, we already have an asyncReleaseContainer() in the
> AbstractYarnScheduler and a corresponding scheduler event, which is currently
> used specifically for the container-update code paths (where the scheduler
> releases temp containers which it creates for the update).