[
https://issues.apache.org/jira/browse/YARN-72?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486349#comment-13486349
]
Sandy Ryza commented on YARN-72:
--------------------------------
You're right, we should have some sort of timeout, and just move on and exit
after that.
My worries about the process group approach would be:
* A process to be killed via process group is sent a SIGHUP signal, which it
can choose to catch and ignore. The current NodeManager mechanism that my
patch makes use of ultimately sends a SIGKILL, which cannot be ignored.
* Processes are allowed to change their own process group.
* The proposed solution to YARN-3 also relies on a possibly conflicting use
of process groups (I believe a single one for each container?).
* From cursory Googling, there doesn't seem to be any nice way in Java to deal
with them.
That said, I'd also defer to someone with a better understanding of NM.
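
To make the timeout-then-force-kill idea concrete, here is a minimal sketch
using the standard java.lang.Process API (Java 8 or later). It is not the
NodeManager's actual mechanism and not what the attached patch does; the class
and method names are hypothetical. On typical Unix-like JVMs destroy() sends
SIGTERM and destroyForcibly() sends SIGKILL, but that mapping is
platform-dependent.

```java
import java.util.concurrent.TimeUnit;

final class BoundedShutdown {
    /**
     * Asks the process to exit, waits up to timeoutMillis, then forces it.
     * Returns true if the process exited within the timeout without being forced.
     */
    static boolean stopWithTimeout(Process p, long timeoutMillis)
            throws InterruptedException {
        // Polite request first; typically SIGTERM, which a process may catch or ignore.
        p.destroy();
        if (p.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return true;
        }
        // Escalate; typically SIGKILL, which cannot be caught or ignored.
        p.destroyForcibly();
        // Bounded second wait so the caller can move on and exit regardless.
        p.waitFor(timeoutMillis, TimeUnit.MILLISECONDS);
        return false;
    }
}
```

Note that this only signals the single process it holds a handle for; any
children that process has spawned are not covered.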
> NM should handle cleaning up containers when it shuts down ( and kill
> containers from an earlier instance when it comes back up after an unclean
> shutdown )
> -----------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-72
> URL: https://issues.apache.org/jira/browse/YARN-72
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Hitesh Shah
> Assignee: Sandy Ryza
> Attachments: YARN-72.patch
>
>
> Ideally, when the NM gets a shutdown signal, it should wait for a limited
> amount of time for existing containers to complete and then, if we pick an
> aggressive approach, kill the containers that are still running.
> For NMs which come up after an unclean shutdown, the NM should look through
> its directories for existing container.pids and try to kill any existing
> containers matching the pids found.
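
A rough sketch of the recovery sweep described above, assuming a hypothetical
layout where each container leaves a file ending in ".pid" that contains just
its pid. The class and method names are illustrative, and ProcessHandle
requires Java 9 or later. Pids can be reused by the OS after an unclean
shutdown, so a real implementation would need to confirm that a pid still
belongs to a container before killing it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

final class StalePidSweep {
    /** Collects the pids recorded in every *.pid file under root (assumed layout). */
    static List<Long> readPids(Path root) throws IOException {
        List<Long> pids = new ArrayList<>();
        try (Stream<Path> files = Files.walk(root)) {
            for (Path f : (Iterable<Path>) files::iterator) {
                if (!Files.isRegularFile(f)
                        || !f.getFileName().toString().endsWith(".pid")) {
                    continue;
                }
                String text = new String(Files.readAllBytes(f), StandardCharsets.UTF_8);
                try {
                    pids.add(Long.parseLong(text.trim()));
                } catch (NumberFormatException e) {
                    // Skip empty or partially written pid files.
                }
            }
        }
        return pids;
    }

    /** Force-kills each pid that still exists; returns how many kill requests were accepted. */
    static int killAll(List<Long> pids) {
        int requested = 0;
        for (long pid : pids) {
            // ProcessHandle.of is empty when no such process exists any more.
            if (ProcessHandle.of(pid).map(ProcessHandle::destroyForcibly).orElse(false)) {
                requested++;
            }
        }
        return requested;
    }
}
```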