[ https://issues.apache.org/jira/browse/YARN-8012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409737#comment-16409737 ]

Jason Lowe commented on YARN-8012:
----------------------------------

{quote}Agree. The configuration is Windows-specific now because, for this 
patch, I only implemented the feature for Windows. We can expand it after the 
first stage. However, we should also consider that on Windows it depends on 
DefaultContainerExecutor, while on Linux it depends on LinuxContainerExecutor.
{quote}
If we know we're going to get rid of the system-specific configs then we should 
not advertise them even in the initial commit. Otherwise we then have to deal 
with migrating users when we remove those configs. Better to simply use the 
final config names up front and document the systems that are or are not 
supported, IMHO.
{quote}Do you mean Secure Container Executor?
{quote}
No, I mean when the unmanaged container monitor is trying to connect to a 
NodeManager running in a secure cluster. In a secure cluster setup, RPC and 
REST endpoints are authenticated to prevent literally anyone from just seeing 
the information available at those APIs. How will the unmanaged container 
monitor authenticate with the REST endpoint? Is it running as the NM user and 
leveraging the NM's Kerberos keytab, using tokens, or ..? I was under the 
impression it runs as the user running the container.
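For illustration, here is a minimal sketch of one way the monitor could do that 
if it ran as the NM user: log in from the NM's keytab and query the NM web 
services over SPNEGO. The principal, keytab path, host, port and container ID 
below are made-up placeholders, and whether the monitor can even read that 
keytab is exactly the open question.
{code:java}
// Hypothetical sketch: an unmanaged-container monitor authenticating to the
// NM REST API in a secure cluster by logging in from the NM's keytab and
// using SPNEGO. Principal, keytab path, host and container ID are made up.
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.authentication.client.AuthenticatedURL;

public class SecureNmStatusCheck {
  public static void main(String[] args) throws Exception {
    UserGroupInformation ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
        "nm/[email protected]",                             // assumed NM principal
        "/etc/security/keytabs/nm.service.keytab");             // assumed keytab path
    int status = ugi.doAs((java.security.PrivilegedExceptionAction<Integer>) () -> {
      // NM web services expose per-container status under /ws/v1/node/containers/{id}
      URL url = new URL(
          "http://nm-host:8042/ws/v1/node/containers/container_1234_0001_01_000002");
      AuthenticatedURL.Token token = new AuthenticatedURL.Token();
      HttpURLConnection conn = new AuthenticatedURL().openConnection(url, token);
      return conn.getResponseCode();   // a 404 would suggest the NM no longer knows the container
    });
    System.out.println("NM response: " + status);
  }
}
{code}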
{quote}Since the YARN NM does not even retry the container executor process on 
unexpected exit, and it happens rarely, we can skip retrying the ucc process in 
the first stage. And if really required, we can add a retry policy in the 
start-yarn-ucc.cmd batch script instead of winutils.
{quote}
I'm still not following here. We're admitting this is a problem, and it has a 
fairly straightforward fix, which is to have winutils relaunch the command if 
it fails. It's already launching it today, right? If so, what's the concern 
again? I did not see an explanation in the design or in this JIRA of why that's 
not going to work.
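To be concrete about what I mean by relaunching, something along these lines 
would do; this is sketched in Java rather than winutils' native code, the 
command name is the batch script mentioned above, and the retry cap and backoff 
are arbitrary assumptions, not part of the patch.
{code:java}
// Minimal sketch of the relaunch-on-unexpected-exit idea, expressed in Java
// rather than winutils' native code. The retry limit and backoff are
// illustrative assumptions.
import java.io.IOException;

public class MonitorRelauncher {
  public static void main(String[] args) throws IOException, InterruptedException {
    final int maxRetries = 5;                        // assumed cap to avoid a tight crash loop
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      Process p = new ProcessBuilder("cmd", "/c", "start-yarn-ucc.cmd")
          .inheritIO().start();
      int exit = p.waitFor();
      if (exit == 0) {
        return;                                      // monitor exited cleanly, nothing to relaunch
      }
      System.err.println("ucc monitor exited with " + exit
          + ", relaunching (attempt " + (attempt + 1) + ")");
      Thread.sleep(1000L * (attempt + 1));           // simple backoff before the next attempt
    }
  }
}
{code}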
{quote}Agree, but it needs something outside to retry the process.
{quote}
Again, I don't understand the concern with retrying the process.
{quote}Any thoughts for the whole feature?
{quote}
As I said above, I'm OK with the overall approach of a per-container monitor, 
especially since we sort of already have one today (monitoring for the 
container exit code instead of NM existence, but a per-container monitor 
nonetheless). However, I'm not comfortable reviewing most of the patch since 
it's Windows code that I'm not going to be able to review properly. I'm just 
raising specific concerns about the design and how it will work on other, 
non-Windows systems and in secure clusters, but I don't have major concerns 
about the high-level approach.
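For reference, the shape of per-container monitor I have in mind is roughly the 
loop below. The helper methods and the polling interval are hypothetical 
placeholders for discussion, not anything taken from the patch.
{code:java}
// Conceptual sketch of a per-container monitor loop: poll whether the NM still
// manages this container and, if not, clean it up locally. The helper methods
// are hypothetical placeholders, not APIs from the patch.
public class UnmanagedContainerMonitor {
  private final String containerId;
  private final long pollIntervalMs = 10_000L;       // assumed polling interval

  public UnmanagedContainerMonitor(String containerId) {
    this.containerId = containerId;
  }

  public void run() throws InterruptedException {
    while (true) {
      if (!isNmAlive() || !isContainerManagedByNm(containerId)) {
        killContainerProcessTree(containerId);       // container has leaked; reclaim its resources
        return;
      }
      Thread.sleep(pollIntervalMs);
    }
  }

  // Placeholder stubs; a real monitor would probe the NM and the local process tree.
  private boolean isNmAlive() { /* e.g. probe the NM RPC or web port */ return true; }
  private boolean isContainerManagedByNm(String id) { /* e.g. query the NM REST API */ return true; }
  private void killContainerProcessTree(String id) { /* platform-specific kill, e.g. winutils task kill */ }
}
{code}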

> Support Unmanaged Container Cleanup
> -----------------------------------
>
>                 Key: YARN-8012
>                 URL: https://issues.apache.org/jira/browse/YARN-8012
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager
>    Affects Versions: 2.7.1
>            Reporter: Yuqi Wang
>            Assignee: Yuqi Wang
>            Priority: Major
>             Fix For: 2.7.1
>
>         Attachments: YARN-8012 - Unmanaged Container Cleanup.pdf, 
> YARN-8012-branch-2.7.1.001.patch
>
>
> An *unmanaged container / leaked container* is a container which is no longer 
> managed by the NM. Thus, it can no longer be managed by YARN either, i.e. it 
> has leaked from YARN.
> *There are many cases in which a YARN-managed container can become unmanaged, 
> such as:*
>  * The NM service is disabled or removed on the node.
>  * The NM is unable to start up again on the node, e.g. because a depended-on 
> configuration or resource cannot be made ready.
>  * The NM local leveldb store is corrupted or lost, e.g. due to bad disk 
> sectors.
>  * The NM has bugs, such as wrongly marking a live container as complete.
> Note these cases arise, or get worse, when work-preserving NM restart is 
> enabled; see YARN-1336.
> *Bad impacts of an unmanaged container include:*
>  # Resources cannot be managed for YARN on the node:
>  ** YARN resources leak on the node.
>  ** The container cannot be killed to release its YARN resources on the node 
> and free them up for other urgent computations on the node.
>  # Container and App killing is not eventually consistent for the App user:
>  ** An App which has bugs can keep producing bad external impacts even long 
> after the App has been killed.


