[
https://issues.apache.org/jira/browse/YARN-8012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16408148#comment-16408148
]
Jason Lowe commented on YARN-8012:
----------------------------------
Thanks for the document! I know very little about Windows, winutils, etc., so
I'm going to have to defer to someone else to comment on most of the patch
details. However I can comment on the approach as a whole.
Having a periodic monitor per container makes sense for handling the case where
the NM suddenly disappears. We already use a lingering process per container
for NM restart, as we need to record the container exit code even when the NM
is temporarily missing. It would be awesome if we could leverage that existing
process rather than create yet another monitoring process to reduce the
per-container overhead, but I understand the reluctance to do this in C for the
native container executors.
It was unclear in the document that the "ping" to the NM was not an RPC call
but a REST query. It would be good to elaborate on the details of how the
checker monitors the NM.
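For concreteness, here is a rough sketch of the kind of REST-based check I
would expect, assuming the checker polls the NM webapp's per-container
endpoint; the address, endpoint, and response handling below are my
assumptions, not necessarily what the patch does:
{code:java}
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Minimal sketch of a REST-based NM/container check. The NM web address and
 * how a non-200 response is interpreted are assumptions for illustration.
 */
public class NmRestChecker {
  private final String nmHttpAddress; // e.g. "localhost:8042"

  public NmRestChecker(String nmHttpAddress) {
    this.nmHttpAddress = nmHttpAddress;
  }

  /**
   * Returns true if the NM is reachable and still reports the container.
   * Throws IOException if the NM cannot be reached at all.
   */
  public boolean isContainerManaged(String containerId) throws IOException {
    URL url = new URL("http://" + nmHttpAddress
        + "/ws/v1/node/containers/" + containerId);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    conn.setConnectTimeout(5000);
    conn.setReadTimeout(5000);
    try {
      // 200: the NM is up and still knows the container; anything else
      // (e.g. 404) suggests the NM no longer tracks it.
      return conn.getResponseCode() == HttpURLConnection.HTTP_OK;
    } finally {
      conn.disconnect();
    }
  }
}
{code}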
I would rather not see all the configurations be windows specific. The design
implies this isn't something only Windows can implement, and I'd hate there to
be separate Windows, Linux, BSD, Solaris, etc. versions of all of these
settings. If the setting doesn't work on a particular platform we can document
the limitations in the property description.
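For example, purely illustrative property names (not keys that exist today)
showing a platform-neutral naming scheme, with any platform limitations noted
in the property description rather than in the key itself:
{code:java}
// Hypothetical, platform-neutral keys; illustration only, these do not exist
// in YarnConfiguration today.
public final class UnmanagedCleanupKeys {
  public static final String NM_UNMANAGED_CONTAINER_CLEANUP_ENABLED =
      "yarn.nodemanager.unmanaged-container-cleanup.enabled";
  public static final String NM_UNMANAGED_CONTAINER_CLEANUP_INTERVAL_MS =
      "yarn.nodemanager.unmanaged-container-cleanup.interval-ms";
}
{code}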
How does the container monitor authenticate with the NM in a secure cluster
setup?
Will the overhead of the new UnmanagedContainerChecker process be counted
against the overall container resource usage?
I didn't follow the logic in the design document for why it doesn't make sense
to retry launching the unmanaged monitor if it exits unexpectedly. It simply
says, "Add the unmanaged container judgement logic (retrypolicy) in winutils is
not suitable, it should be in UnmanagedContainerChecker." However this section
is discussing how to handle an unexpected exit of UnmanagedContainerChecker, so
why would it make sense to put the retry logic in the very thing we are
retrying?
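In other words, the relaunch-on-unexpected-exit policy would naturally live in
whatever launches the checker and outlives it. Roughly along these lines (a
Java sketch just to show the structure; winutils itself is native code, and
all names here are made up):
{code:java}
import java.io.IOException;

// Illustration only: the launcher, not the checker, owns the retry policy,
// because the checker cannot relaunch itself after it has crashed.
public final class CheckerLauncher {
  public static void runWithRetries(ProcessBuilder checkerCmd, int maxRetries)
      throws IOException, InterruptedException {
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      Process checker = checkerCmd.start();
      if (checker.waitFor() == 0) {
        return; // checker finished normally
      }
      // Unexpected exit: decide here whether to relaunch the checker.
    }
  }
}
{code}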
Does it really make sense to catch Throwable in the monitor loop? It seems like
it would make more sense to have this localized to where we are communicating
with the NM, otherwise it could easily suppress OutOfMemoryError or other
Errors that would be better handled by letting this process die and relaunching
a replacement.
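Something along these lines is what I had in mind (a sketch only; the method
names are placeholders, not the actual patch code):
{code:java}
import java.io.IOException;

public class UnmanagedCheckLoop {
  // Only the NM communication is guarded; anything else, including Errors
  // such as OutOfMemoryError, propagates, kills the process, and lets the
  // launcher's relaunch policy take over.
  public void run(String containerId, long checkIntervalMs)
      throws InterruptedException {
    while (true) {
      try {
        if (!checkContainerWithNm(containerId)) {
          cleanupContainer(containerId);
          return;
        }
      } catch (IOException e) {
        // NM unreachable or the query failed: handle it here, close to the
        // call, e.g. with a grace period before declaring the container
        // unmanaged, instead of a blanket catch (Throwable) around the loop.
      }
      Thread.sleep(checkIntervalMs);
    }
  }

  // Placeholder standing in for the real REST query to the NM.
  private boolean checkContainerWithNm(String containerId) throws IOException {
    return true;
  }

  // Placeholder standing in for the real container cleanup.
  private void cleanupContainer(String containerId) {
  }
}
{code}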
> Support Unmanaged Container Cleanup
> -----------------------------------
>
> Key: YARN-8012
> URL: https://issues.apache.org/jira/browse/YARN-8012
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager
> Affects Versions: 2.7.1
> Reporter: Yuqi Wang
> Assignee: Yuqi Wang
> Priority: Major
> Fix For: 2.7.1
>
> Attachments: YARN-8012 - Unmanaged Container Cleanup.pdf,
> YARN-8012-branch-2.7.1.001.patch
>
>
> An *unmanaged container / leaked container* is a container which is no longer
> managed by the NM. Thus, it can no longer be managed by YARN either, i.e. it
> is leaked from YARN's perspective.
> *There are many cases in which a YARN-managed container can become unmanaged, such as:*
> * NM service is disabled or removed on the node.
> * NM is unable to start up again on the node, e.g. because required
> configuration or resources cannot be made ready.
> * NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
> * NM has bugs, such as wrongly marking a live container as complete.
> Note that these cases arise, or become worse, when work-preserving NM restart
> is enabled; see YARN-1336.
> *Bad impacts of an unmanaged container, such as:*
> # Resources cannot be managed by YARN on the node:
> ** Causes a YARN resource leak on the node.
> ** The container cannot be killed to release its YARN resources and free up
> resources for other urgent computations on the node.
> # Container and App killing is not eventually consistent for the App user:
> ** An App which has bugs can continue to produce bad impacts externally even
> long after the App has been killed.