Junping Du commented on YARN-1897:
Thanks [~mingma] for replying to the comments.
bq. Yes, the approach taken in YARN-4131 is simpler by leveraging the existing
protocol to accomplish the kill container scenario. But changing the NM-RM
protocol will allow us to support other useful scenarios besides kill container
and thread dump.
Agree. I don't mean that the previous approach (YARN-4131) can replace the
approach here. I just want to understand whether the approach here can cover
all the cases that YARN-4131 tries to address. It sounds like we still need
YARN-4131's approach even after the patch here goes in. Please see the comments
below for details.
bq. Kill container via preemption. This means RM will know about it first
before NM, different from the signal container order which kills container
without RM's knowledge first. It seems killing container without RM knowledge
matches container crash test case better. But killing container via preemption
can simulate preemption. But does it matter here as long as container is killed?
Yes, it does matter. Preempted containers are not counted as container
failures from the AM's perspective and won't affect whether the application's
run is reported as successful. In some tests, we need to emulate both cases
rather than just one.
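To make the difference concrete, here is a minimal sketch of AM-side bookkeeping that skips preempted containers when counting failures. The class and method names are hypothetical, the constants are local stand-ins assumed to mirror YARN's ContainerExitStatus values, and 137 is assumed to be the exit code seen after a forceful kill (128 + SIGKILL). It is not the actual AM code of any framework.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical AM-side accounting: preempted containers are not charged as failures. */
final class ContainerExitAccounting {
    // Local stand-ins, assumed to mirror YARN's ContainerExitStatus constants.
    static final int SUCCESS = 0;
    static final int PREEMPTED = -102;
    // Assumed exit code of a process killed with SIGKILL (128 + 9).
    static final int KILLED_BY_SIGKILL = 137;

    /** True only for exits that count against the application's failure limit. */
    static boolean countsAsFailure(int exitStatus) {
        return exitStatus != SUCCESS && exitStatus != PREEMPTED;
    }

    /** Counts the failures in a list of completed-container exit statuses. */
    static int countFailures(List<Integer> exitStatuses) {
        int failures = 0;
        for (int status : exitStatuses) {
            if (countsAsFailure(status)) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<Integer> statuses = Arrays.asList(SUCCESS, PREEMPTED, KILLED_BY_SIGKILL);
        // Only the SIGKILL exit is charged as a failure.
        System.out.println("failures = " + countFailures(statuses)); // prints "failures = 1"
    }
}
```

Under these assumptions, a test that kills a container through the signal path and one that kills it through preemption would see different failure counts, which is why both need to be emulated.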
bq. Container Expiration. Is that only for a container that has been
allocated/acquired before it is in running state? It seems it is used by RM to
time out on container allocation/acquisition. It will trigger
RMContainerEventType.EXPIRE and won't have impact on running container.
Sorry, I meant the container LOST situation, i.e. emulating the case where the
NM is shut down suddenly (kill -9) and never comes back, and the impact of that
on RMContainers. We may not be able to achieve this through the NM-RM protocol,
so would it be better to generate a timeout event from the RM directly?
My overall thinking is that there are two kinds of sources that affect
containers' state (from the RM's standpoint): the first is state update events
triggered by the container/NM, which include the mainstream cases of a
container's lifecycle and are well addressed by the approach here; the other is
events generated in the RM itself, such as resource/container preemption,
losing contact with an NM that has running containers, etc. I would prefer that
YARN-4131 address the second kind of event source as an addendum to our
approach here. What do you think?
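The split I have in mind can be sketched roughly as below. All names here (the enums and the sourceOf helper) are hypothetical and only illustrate the classification; they are not existing YARN types.

```java
/** Hypothetical classification of what can change a container's state as seen by the RM. */
final class ContainerEventSources {
    /** Where the state change originates. */
    enum Source { NM_REPORTED, RM_GENERATED }

    /** Illustrative triggers; not an exhaustive or official list. */
    enum Trigger {
        CONTAINER_LAUNCHED,   // reported by the NM
        CONTAINER_FINISHED,   // reported by the NM, includes crash or signal-induced kill
        PREEMPTION,           // decided inside the RM
        NM_LOST               // RM stops hearing from an NM that has running containers
    }

    /** Maps a trigger to the side that generates it. */
    static Source sourceOf(Trigger trigger) {
        switch (trigger) {
            case PREEMPTION:
            case NM_LOST:
                return Source.RM_GENERATED;
            default:
                return Source.NM_REPORTED;
        }
    }

    public static void main(String[] args) {
        for (Trigger t : Trigger.values()) {
            System.out.println(t + " -> " + sourceOf(t));
        }
    }
}
```

In this picture, the NM_REPORTED triggers are what the signal container approach here can emulate, and the RM_GENERATED ones are what would be left to YARN-4131.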
BTW, it sounds like the test failure in
TestContainerManager.testForcefulShutdownSignal is related?
> CLI and core support for signal container functionality
> Key: YARN-1897
> URL: https://issues.apache.org/jira/browse/YARN-1897
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: api
> Reporter: Ming Ma
> Assignee: Ming Ma
> Attachments: YARN-1897-2.patch, YARN-1897-3.patch, YARN-1897-4.patch,
> YARN-1897-5.patch, YARN-1897-6.patch, YARN-1897.1.patch
> We need to define SignalContainerRequest and SignalContainerResponse first as
> they are needed by other sub tasks. SignalContainerRequest should use
> OS-independent commands and provide a way for the application to specify "reason"
> for diagnosis. SignalContainerResponse might be empty.
This message was sent by Atlassian JIRA