[
https://issues.apache.org/jira/browse/YARN-8259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496492#comment-16496492
]
Shane Kumpf commented on YARN-8259:
-----------------------------------
I've been doing additional testing here and could use input from the community,
as all of the solutions have cons. Here is what I've tested and been
considering.
----
1) */proc/pid check as the yarn user*
Pros:
* No c-e changes
* Works with Docker live restore
Cons:
* Breaks down when hidepid is enabled
* Portability (relies on /proc)
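
As a rough illustration of what this check amounts to, here is a minimal
Python sketch. The helper name `proc_entry_exists` and the `proc_root`
parameter are hypothetical and exist only for this example; this is not the
NodeManager or container-executor code.

```python
import os


def proc_entry_exists(pid: int, proc_root: str = "/proc") -> bool:
    """Return True if <proc_root>/<pid> is visible to the calling user.

    Hypothetical helper for illustration only. With hidepid set, entries for
    processes owned by other users may be hidden from the caller, so a False
    result does not prove that the process is gone.
    """
    return os.path.isdir(os.path.join(proc_root, str(pid)))
```

The `proc_root` argument is only there so the check can be pointed at a
different directory (for example in a test); on a Linux host it would stay at
`/proc`.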
----
2) */proc/pid or kill -0 as privileged user*
Pros:
* Works with Docker live restore
Cons:
* Circumvents hidepid: the elevated privileges allow the yarn user to check
the existence of any pid.
* Portability (/proc method)
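
For context on why elevated privileges come up at all, the sketch below shows
the three outcomes of a null signal (kill -0) sent by an unprivileged caller.
The function name `null_signal_status` and its return strings are hypothetical
labels for this example, not anything from YARN.

```python
import os


def null_signal_status(pid: int) -> str:
    """Send signal 0 to pid and classify the result (illustrative only)."""
    try:
        os.kill(pid, 0)  # signal 0: error checking only, nothing is delivered
    except ProcessLookupError:
        # ESRCH: no process with this pid exists.
        return "absent"
    except PermissionError:
        # EPERM: the process exists but is owned by a different user, which is
        # the case a privileged container can produce for the yarn user.
        return "exists-not-permitted"
    return "alive"
```

Running the same check as a privileged user removes the EPERM case, which is
also what allows probing pids that hidepid would otherwise conceal.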
----
3) *docker inspect*
Pros:
* No c-e changes
* Uses the Docker API
Cons:
* Requires retry handling to support Docker live restore.
** In the case of a Docker daemon upgrade, this means the upgrade must
complete before the retries are exhausted, which could mean hundreds of retries.
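
A minimal sketch of the retry handling this option would need, assuming the
`docker` CLI is on the PATH. The function names, the retry count, and the delay
are hypothetical values chosen for the example. As a simplification, any
failure of `docker inspect` is treated as "no answer" and retried; a real
implementation would likely need to tell an unreachable daemon apart from a
container that no longer exists.

```python
import subprocess
import time
from typing import Callable, Optional


def docker_inspect_running(container_id: str) -> Optional[bool]:
    """Return the container's running state, or None if docker gave no answer."""
    try:
        result = subprocess.run(
            ["docker", "inspect", "--format", "{{.State.Running}}", container_id],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None  # CLI missing or the call hung
    if result.returncode != 0:
        return None  # simplification: every CLI failure counts as "no answer"
    return result.stdout.strip() == "true"


def is_alive_with_retries(
    probe: Callable[[], Optional[bool]],
    max_retries: int = 5,          # hypothetical default
    delay_seconds: float = 1.0,    # hypothetical default
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe until it answers; report not alive once retries run out."""
    for attempt in range(max_retries + 1):
        state = probe()
        if state is not None:
            return state
        if attempt < max_retries:
            sleep(delay_seconds)
    return False
```

A caller might use it as
`is_alive_with_retries(lambda: docker_inspect_running(container_id))`. The
retry budget (`max_retries * delay_seconds`) has to outlast the daemon outage,
which is why a long upgrade could require a very large number of retries.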
----
4) *Hybrid* (Keep existing kill -0 for non-privileged, docker inspect for
privileged)
Pros:
* No c-e changes
* Limits the impact on live restore
Cons:
* Requires retry handling to support Docker live restore.
* Different handling based on container type.
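
The split by container type could be sketched as a simple dispatch. The
function `hybrid_is_alive` and both callables passed to it are hypothetical
stand-ins for the existing kill -0 check and a docker inspect based check.

```python
from typing import Callable


def hybrid_is_alive(
    privileged: bool,
    pid: int,
    container_id: str,
    signal_check: Callable[[int], bool],
    inspect_check: Callable[[str], bool],
) -> bool:
    """Pick the liveliness check based on container type (illustrative only)."""
    if privileged:
        # Privileged containers may run as a user other than the yarn user,
        # so the null signal is not reliable; ask Docker instead.
        return inspect_check(container_id)
    # Non-privileged containers keep the existing kill -0 style check.
    return signal_check(pid)
```

Only the privileged branch depends on the Docker daemon being reachable, so
only that branch needs retry handling.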
----
I believe #2 is a non-starter as it silently bypasses the hidepid option. I'm
leaning towards striking #3 from the list as well: we really need the
recovery logic to be solid, so I don't want to unnecessarily impact
non-privileged containers, which appear to be working well.
At this point, I'm leaning towards #4 or #1 (with docs indicating that the NM
user must be whitelisted if hidepid is enabled).
> Revisit liveliness checks for Docker containers
> -----------------------------------------------
>
> Key: YARN-8259
> URL: https://issues.apache.org/jira/browse/YARN-8259
> Project: Hadoop YARN
> Issue Type: Sub-task
> Affects Versions: 3.0.2, 3.2.0, 3.1.1
> Reporter: Shane Kumpf
> Assignee: Shane Kumpf
> Priority: Blocker
> Labels: Docker
> Attachments: YARN-8259.001.patch
>
>
> As privileged containers may execute as a user that does not match the YARN
> run as user, sending the null signal for liveliness checks could fail. We
> need to reconsider how liveliness checks are handled in the Docker case.