[ 
https://issues.apache.org/jira/browse/YARN-8012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409152#comment-16409152
 ] 

Yuqi Wang edited comment on YARN-8012 at 3/22/18 7:15 AM:
----------------------------------------------------------

Thanks [~jlowe]:).
{quote}Having a periodic monitor per container makes sense for handling the 
case where the NM suddenly disappears. We already use a lingering process per 
container for NM restart, as we need to record the container exit code even 
when the NM is temporarily missing. It would be awesome if we could leverage 
that existing process rather than create yet another monitoring process to 
reduce the per-container overhead, but I understand the reluctance to do this 
in C for the native container executors.
{quote}
As mentioned in the doc, implementing this as a separate Java process also lets 
every platform's container executor (the ones written in C, such as Windows 
winutils and the Linux container-executor) leverage the feature. And since the 
logic is platform independent, duplicating it inside each platform's container 
executor does not make sense.

 
{quote}It was unclear in the document that the "ping" to the NM was not an RPC 
call but a REST query.
{quote}
 
 * Currently, there seems to be no benefit of RPC (TCP protobuf request) over 
REST (HTTP request). Do you see any benefits?
 * This is only the first stage of the feature; if it makes sense and works 
well in production, we can refine it to use RPC. (A rough sketch of the REST 
ping is shown below.)
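
For illustration, here is a minimal sketch of what the REST "ping" could look 
like, assuming the checker queries the NM webapp's container status endpoint 
(GET /ws/v1/node/containers/<containerid>). The class and method names are 
hypothetical, not the ones in the patch:
{code:java}
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

/** Sketch only: ask the NM webapp whether it still knows the container. */
public class NmRestPing {
  /**
   * Returns true if the NM reports the container as known/managed.
   * Throws IOException if the NM webapp is unreachable, so the caller can
   * distinguish "NM says unknown" from "NM is down".
   */
  public static boolean isContainerManaged(String nmHttpAddress,
      String containerId) throws IOException {
    // NM REST API: GET /ws/v1/node/containers/<containerid>
    URL url = new URL("http://" + nmHttpAddress
        + "/ws/v1/node/containers/" + containerId);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    try {
      conn.setConnectTimeout(5000);
      conn.setReadTimeout(5000);
      return conn.getResponseCode() == HttpURLConnection.HTTP_OK;
    } finally {
      conn.disconnect();
    }
  }
}
{code}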

 
{quote}Would be good to elaborate the details of how the checker monitors the 
NM.
{quote}
This is elaborated mainly in the doc section UNMANAGED CONTAINER DETECTION.

 
{quote}I would rather not see all the configurations be windows specific. The 
design implies this isn't something only Windows can implement, and I'd hate 
there to be separate Windows, Linux, BSD, Solaris, etc. versions of all of 
these settings. If the setting doesn't work on a particular platform we can 
document the limitations in the property description.
{quote}
Agree. The configuration is Windows specific for now simply because this patch 
implements the feature only for Windows. 
 We can generalize it after the first stage. However, we should also keep in 
mind that on Windows the feature depends on DefaultContainerExecutor, while on 
Linux it would depend on LinuxContainerExecutor.

 
{quote}How does the container monitor authenticate with the NM in a secure 
cluster setup?
{quote}
Do you mean the Secure Container Executor?

I have not investigated the Secure Container Executor yet, but we may support 
it after the first stage.

 
{quote}Will the overhead of the new UnmanagedContainerChecker process will be 
counted against the overall container resource usage?
{quote}
Yes, but that is hard to avoid: we want the UnmanagedContainerChecker process 
to be cleaned up when the container job object is killed, so it has to run 
inside the container job object. Note that winutils is also inside the job 
object and is counted into the container resource usage. I expect the usage to 
be very small, and the checker's heap can be configured via the env variable 
YARN_UCC_HEAPSIZE, so we can ignore it at least for the first stage.
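
As a hedged illustration of how YARN_UCC_HEAPSIZE keeps that overhead bounded, 
the checker's launch command could cap the JVM heap like this (the main class 
name and the 64 MB default are assumptions for the sketch, not taken from the 
patch):
{code:java}
import java.util.ArrayList;
import java.util.List;

/** Sketch only: cap the checker JVM's heap via the YARN_UCC_HEAPSIZE env (MB). */
public class UccLaunchCommand {
  public static List<String> build(String containerId) {
    String heapMb = System.getenv("YARN_UCC_HEAPSIZE");
    if (heapMb == null || heapMb.isEmpty()) {
      heapMb = "64"; // small default so the per-container overhead stays negligible
    }
    List<String> cmd = new ArrayList<>();
    cmd.add("java");
    cmd.add("-Xmx" + heapMb + "m");
    // Hypothetical main class; the real one comes from the patch.
    cmd.add("org.apache.hadoop.yarn.server.nodemanager.UnmanagedContainerChecker");
    cmd.add(containerId);
    return cmd;
  }
}
{code}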

 
{quote}I didn't follow the logic in the design document for why it doesn't make 
sense to retry launching the unmanaged monitor if it exits unexpectedly. It 
simply says, "Add the unmanaged container judgement logic (retrypolicy) in 
winutils is not suitable, it should be in UnmanagedContainerChecker." However 
this section is discussing how to handle an unexpected exit of 
UnmanagedContainerChecker, so why would it make sense to put the retry logic in 
the very thing we are retrying?
{quote}
Since the YARN NM does not even retry the container executor process when it 
exits unexpectedly, and such exits are rare, we can skip retrying the ucc 
process in the first stage. If it is really required, we can add a retry 
policy in the batch script start-yarn-ucc.cmd instead of in winutils.
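
If a retry policy ever becomes necessary, it would be an outer, bounded 
relaunch loop around the checker, roughly like the following (sketched in Java 
for readability even though, as noted above, it could equally live in 
start-yarn-ucc.cmd; the class names and the retry bound are hypothetical):
{code:java}
/** Sketch only: bounded relaunch of the ucc process from outside the process itself. */
public class UccRelauncher {
  public static void main(String[] args) throws Exception {
    final int maxRetries = 3; // arbitrary example bound
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      Process ucc = new ProcessBuilder("java",
          "org.apache.hadoop.yarn.server.nodemanager.UnmanagedContainerChecker")
          .inheritIO()
          .start();
      if (ucc.waitFor() == 0) {
        return; // clean exit: the container finished or was cleaned up
      }
      // Unexpected exit: relaunch up to the bound, then give up.
    }
  }
}
{code}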

 
{quote}Does it really make sense to catch Throwable in the monitor loop? Seems 
like it would make more sense to have this localized to where we are 
communicating with the NM, otherwise it could easily suppress OOM errors or 
other non-exceptions that would be better handled by letting this process die 
and relaunching a replacement.
{quote}
Agree, but that requires something outside the process to retry/relaunch it.
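
A rough sketch of localizing the catch to the NM communication, assuming the 
hypothetical NmRestPing.isContainerManaged() helper sketched earlier: only 
IOException from the NM query is tolerated, while OOM and other fatal errors 
propagate so the process dies and can be relaunched from outside:
{code:java}
import java.io.IOException;

/** Monitor-loop sketch: only the NM communication failure is swallowed. */
public class UnmanagedContainerCheckerLoop {
  public void run(String nmHttpAddress, String containerId,
      long intervalMs, int maxConsecutiveFailures) throws InterruptedException {
    int failures = 0;
    while (true) {
      try {
        if (NmRestPing.isContainerManaged(nmHttpAddress, containerId)) {
          failures = 0; // NM still manages the container
        } else {
          // NM no longer knows the container; handle it per the detection
          // policy in the doc (sketched here as immediate cleanup).
          cleanupContainer(containerId);
          return;
        }
      } catch (IOException e) {
        // Only the NM query failure is tolerated here; Errors such as OOM
        // propagate and kill the process.
        if (++failures >= maxConsecutiveFailures) {
          cleanupContainer(containerId);
          return;
        }
      }
      Thread.sleep(intervalMs);
    }
  }

  private void cleanupContainer(String containerId) {
    // Placeholder: the real cleanup (e.g. killing the job object) is platform specific.
  }
}
{code}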

Any thoughts on the whole feature? :)

 


> Support Unmanaged Container Cleanup
> -----------------------------------
>
>                 Key: YARN-8012
>                 URL: https://issues.apache.org/jira/browse/YARN-8012
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager
>    Affects Versions: 2.7.1
>            Reporter: Yuqi Wang
>            Assignee: Yuqi Wang
>            Priority: Major
>             Fix For: 2.7.1
>
>         Attachments: YARN-8012 - Unmanaged Container Cleanup.pdf, 
> YARN-8012-branch-2.7.1.001.patch
>
>
> An *unmanaged container / leaked container* is a container which is no longer 
> managed by the NM. Thus, it cannot be managed by YARN either, i.e. it is 
> leaked from YARN's point of view.
> *There are many cases in which a YARN-managed container can become unmanaged, such as:*
>  * The NM service is disabled or removed on the node.
>  * The NM is unable to start up again on the node, e.g. because a depended-on 
> configuration or resource cannot be made ready.
>  * The NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
>  * The NM has bugs, such as wrongly marking a live container as complete.
> Note that these cases arise, or get worse, when work-preserving NM restart is 
> enabled; see YARN-1336.
> *Bad impacts of an unmanaged container include:*
>  # The resource it occupies can no longer be managed by YARN on the node:
>  ** It causes a YARN resource leak on the node.
>  ** YARN cannot kill the container to release its resource on the node and 
> free it up for other urgent computations on the node.
>  # Container and app killing is not eventually consistent from the app user's 
> point of view:
>  ** An app with bugs can still produce bad impacts to the outside even if the 
> app was killed a long time ago.


