[jira] [Commented] (YARN-1897) CLI and core support for signal container functionality

Junping Du (JIRA) Thu, 17 Sep 2015 08:03:42 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803039#comment-14803039
 ]


Junping Du commented on YARN-1897:
----------------------------------

Thanks [~mingma] for replying to the comments.
bq. Yes, the approach taken in YARN-4131 is simpler by leveraging the existing 
protocol to accomplish the kill container scenario. But changing the NM-RM 
protocol will allow us to support other useful scenarios besides kill container 
and thread dump.
Agree. I don't mean that the previous approach (YARN-4131) can replace the 
approach here. I just want to understand whether the approach here can cover 
all the cases that YARN-4131 tries to address. It sounds like we still need 
YARN-4131's approach even after the patch here goes in. Please see the comments 
below for details.

bq. Kill container via preemption. This means RM will know about it first 
before NM, different from the signal container order which kills container 
without RM's knowledge first. It seems killing container without RM knowledge 
matches container crash test case better. But killing container via preemption 
can simulate preemption. But does it matter here as long as container is killed?
Yes, it does matter. Preempted containers are not counted as container 
failures from the AM's perspective and do not affect whether the application 
ultimately succeeds. In some tests, we need to emulate both cases, not just one.
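
To illustrate why the two kill paths are not interchangeable in tests, here is 
a minimal sketch of how an AM might tally completed containers. All names 
(ContainerFailureTally, ExitReason, maxFailures) are hypothetical and are not 
the actual YARN or MapReduce AM code; the only assumption taken from the 
discussion is that preempted containers are excluded from the failure count.

```java
import java.util.List;

// Hypothetical sketch, not real YARN code: an AM-side failure tally.
final class ContainerFailureTally {
    // Illustrative exit reasons; these are not the real YARN exit status constants.
    enum ExitReason { SUCCESS, CRASHED, PREEMPTED }

    // Counts only genuine failures; preempted containers are skipped.
    static int countFailures(List<ExitReason> completed) {
        int failures = 0;
        for (ExitReason reason : completed) {
            if (reason == ExitReason.CRASHED) {
                failures++;
            }
        }
        return failures;
    }

    // The application is treated as failed once failures exceed the allowed maximum.
    static boolean applicationFailed(List<ExitReason> completed, int maxFailures) {
        return countFailures(completed) > maxFailures;
    }
}
```

Under this sketch, a test that only kills containers via preemption would 
never exercise the failure path, which is why both cases need to be emulated.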

bq. Container Expiration. Is that only for a container that has been 
allocated/acquired before it is in running state? It seems it is used by RM to 
time out on container allocation/acquisition. It will trigger 
RMContainerEventType.EXPIRE and won't have impact on running container.
Sorry, I meant the container LOST situation: emulating the case where the NM 
is shut down suddenly (kill -9) and never comes back, and the impact of that 
on RMContainers. We may not be able to achieve this through the NM-RM protocol, 
so would it be better to generate a timeout event from the RM directly?

My overall thinking is that there are two kinds of sources that affect 
containers' state (from the RM's standpoint). The first is state update events 
triggered by the container/NM, which include the mainstream cases of the 
container lifecycle and are well addressed by the approach here. The second is 
events generated within the RM itself, such as resource/container preemption or 
losing contact with an NM that has running containers. I would prefer YARN-4131 
to address the second kind of event as an addendum to our approach here. What 
do you think?
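
As a rough sketch of the split proposed above, the two kinds of sources could 
be modeled as follows. The type and constant names (ContainerEventSources, 
Source, Trigger) are hypothetical and only illustrate the classification; they 
do not correspond to existing RM classes or to either patch.

```java
// Hypothetical sketch of the proposed split; names are illustrative only.
final class ContainerEventSources {
    // Where the state change originates, from the RM's standpoint.
    enum Source { NM_REPORTED, RM_GENERATED }

    enum Trigger {
        SIGNAL_KILL(Source.NM_REPORTED),      // container killed by a signal on the NM
        CONTAINER_CRASH(Source.NM_REPORTED),  // container exits on its own and the NM reports it
        PREEMPTION(Source.RM_GENERATED),      // RM decides to preempt the container
        NM_LOST(Source.RM_GENERATED);         // RM loses contact with an NM that has running containers

        final Source source;

        Trigger(Source source) {
            this.source = source;
        }
    }

    // Which JIRA would cover the trigger under the proposed division of work.
    static String coveredBy(Trigger trigger) {
        return trigger.source == Source.NM_REPORTED ? "YARN-1897" : "YARN-4131";
    }
}
```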

BTW, it looks like the test failure in 
TestContainerManager.testForcefulShutdownSignal is related?

> CLI and core support for signal container functionality
> -------------------------------------------------------
>
>                 Key: YARN-1897
>                 URL: https://issues.apache.org/jira/browse/YARN-1897
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: api
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>         Attachments: YARN-1897-2.patch, YARN-1897-3.patch, YARN-1897-4.patch, 
> YARN-1897-5.patch, YARN-1897-6.patch, YARN-1897.1.patch
>
>
> We need to define SignalContainerRequest and SignalContainerResponse first as 
> they are needed by other sub tasks. SignalContainerRequest should use 
> OS-independent commands and provide a way to application to specify "reason" 
> for diagnosis. SignalContainerResponse might be empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
