[ https://issues.apache.org/jira/browse/YARN-8012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16408148#comment-16408148 ]

Jason Lowe commented on YARN-8012:
----------------------------------

Thanks for the document!  I know very little about Windows, winutils, etc., so 
I'm going to have to defer to someone else to comment on most of the patch 
details.  However I can comment on the approach as a whole.

Having a periodic monitor per container makes sense for handling the case where 
the NM suddenly disappears.  We already use a lingering process per container 
for NM restart, as we need to record the container exit code even when the NM 
is temporarily missing.  It would be awesome if we could leverage that existing 
process rather than create yet another monitoring process, to reduce the 
per-container overhead, but I understand the reluctance to do this in C for the 
native container executors.

It was unclear from the document that the "ping" to the NM is not an RPC call 
but a REST query.  It would be good to elaborate on the details of how the 
checker monitors the NM.
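
For what it's worth, here is a minimal sketch of the kind of REST check I would 
expect, assuming the checker polls the NM webservice's container endpoint.  The 
class name, timeouts, and response-code handling below are my own assumptions, 
not something from the patch:

{code:java}
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Hypothetical sketch: ask the NM webservice whether it still knows
 * about this container.  A 200 would mean the NM is alive and still
 * manages the container; a 404 (or a connection failure) would suggest
 * the container may have become unmanaged.
 */
public final class NmRestPing {
  public static boolean isContainerManaged(String nmHttpAddress,
      String containerId) throws IOException {
    URL url = new URL("http://" + nmHttpAddress
        + "/ws/v1/node/containers/" + containerId);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    conn.setConnectTimeout(5000);
    conn.setReadTimeout(5000);
    try {
      return conn.getResponseCode() == HttpURLConnection.HTTP_OK;
    } finally {
      conn.disconnect();
    }
  }
}
{code}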

I would rather not see all the configurations be windows specific.  The design 
implies this isn't something only Windows can implement, and I'd hate there to 
be separate Windows, Linux, BSD, Solaris, etc. versions of all of these 
settings.  If the setting doesn't work on a particular platform we can document 
the limitations in the property description.
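
For instance, the settings could live under the generic yarn.nodemanager 
namespace.  The property names in this sketch are hypothetical, purely to 
illustrate the naming I have in mind:

{code:java}
/**
 * Hypothetical property names: platform-neutral, in the style of
 * YarnConfiguration, rather than windows-specific variants.
 */
public final class UnmanagedContainerConfig {
  public static final String UNMANAGED_CLEANUP_ENABLED =
      "yarn.nodemanager.unmanaged-container-cleanup.enabled";
  public static final String UNMANAGED_CLEANUP_INTERVAL_MS =
      "yarn.nodemanager.unmanaged-container-cleanup.interval-ms";
  // Avoid: yarn.nodemanager.windows.unmanaged-container-cleanup.*
}
{code}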

How does the container monitor authenticate with the NM in a secure cluster 
setup?
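
If the checker sticks with the REST interface, one possibility in a Kerberos 
cluster would be SPNEGO via the hadoop-auth client classes.  A rough sketch 
follows; it assumes the checker process already holds a valid Kerberos ticket 
(e.g. from a keytab login), which is exactly the part the design would need to 
spell out:

{code:java}
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

import org.apache.hadoop.security.authentication.client.AuthenticatedURL;
import org.apache.hadoop.security.authentication.client.AuthenticationException;

/**
 * Rough sketch: open the NM webservice connection through hadoop-auth's
 * AuthenticatedURL so the request carries a SPNEGO (Kerberos) token.
 * How the checker obtains its credentials is the open question.
 */
public final class SecureNmPing {
  public static int ping(String nmHttpAddress, String containerId)
      throws IOException, AuthenticationException {
    URL url = new URL("http://" + nmHttpAddress
        + "/ws/v1/node/containers/" + containerId);
    AuthenticatedURL.Token token = new AuthenticatedURL.Token();
    HttpURLConnection conn =
        new AuthenticatedURL().openConnection(url, token);
    try {
      return conn.getResponseCode();
    } finally {
      conn.disconnect();
    }
  }
}
{code}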

Will the overhead of the new UnmanagedContainerChecker process be counted 
against the overall container resource usage?

I didn't follow the logic in the design document for why it doesn't make sense 
to retry launching the unmanaged monitor if it exits unexpectedly.  It simply 
says, "Add the unmanaged container judgement logic (retrypolicy) in winutils is 
not suitable, it should be in UnmanagedContainerChecker."  However this section 
is discussing how to handle an unexpected exit of UnmanagedContainerChecker, so 
why would it make sense to put the retry logic in the very thing we are 
retrying?

Does it really make sense to catch Throwable in the monitor loop?  It seems 
like it would make more sense to have this localized to where we are 
communicating with the NM; otherwise it could easily suppress OOM errors or 
other non-Exception throwables that would be better handled by letting this 
process die and relaunching a replacement.
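
Concretely, I would expect something along these lines (a sketch only; the 
method names and thresholds are hypothetical), where only IOException from the 
NM communication is tolerated and everything else is allowed to kill the 
process:

{code:java}
import java.io.IOException;

/**
 * Sketch of the suggested structure: the try/catch is localized to the
 * NM communication, so an OutOfMemoryError or any other Error still
 * kills this process and lets the relauncher start a replacement.
 */
public final class MonitorLoop {
  private static final int MAX_CONSECUTIVE_FAILURES = 3;
  private static final long POLL_INTERVAL_MS = 10_000L;

  public void run() throws InterruptedException {
    int failures = 0;
    while (true) {
      try {
        if (pingNm()) {            // NM still manages the container
          failures = 0;
        } else {                   // NM no longer knows the container
          cleanUpContainer();
          return;
        }
      } catch (IOException e) {    // only network trouble is tolerated
        if (++failures >= MAX_CONSECUTIVE_FAILURES) {
          cleanUpContainer();      // assume the NM has disappeared
          return;
        }
      }
      Thread.sleep(POLL_INTERVAL_MS);
    }
  }

  private boolean pingNm() throws IOException {
    return true;                   // placeholder for the REST query
  }

  private void cleanUpContainer() {
    // placeholder: kill the container and release its resources
  }
}
{code}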


> Support Unmanaged Container Cleanup
> -----------------------------------
>
>                 Key: YARN-8012
>                 URL: https://issues.apache.org/jira/browse/YARN-8012
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager
>    Affects Versions: 2.7.1
>            Reporter: Yuqi Wang
>            Assignee: Yuqi Wang
>            Priority: Major
>             Fix For: 2.7.1
>
>         Attachments: YARN-8012 - Unmanaged Container Cleanup.pdf, 
> YARN-8012-branch-2.7.1.001.patch
>
>
> An *unmanaged container / leaked container* is a container which is no longer 
> managed by the NM. Thus, it cannot be managed or cleaned up by YARN either.
> *There are many cases in which a YARN-managed container can become unmanaged, such as:*
>  * NM service is disabled or removed on the node.
>  * NM is unable to start up again on the node, e.g. because required 
> configuration or resources are not ready.
>  * NM local leveldb store is corrupted or lost, such as bad disk sectors.
>  * NM has bugs, such as wrongly marking a live container as complete.
> Note: these cases arise, or are made worse, when work-preserving NM restart is 
> enabled; see YARN-1336.
> *Bad impacts of unmanaged containers include:*
>  # Resources on the node cannot be managed by YARN:
>  ** Causes a YARN resource leak on the node.
>  ** The container cannot be killed to release YARN resources on the node and 
> free them up for other urgent computations.
>  # Container and app killing is not eventually consistent for the app user:
>  ** An app with bugs can keep producing bad external impacts long after the 
> app has been killed.


