[
https://issues.apache.org/jira/browse/YARN-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734800#comment-14734800
]
Junping Du commented on YARN-3337:
----------------------------------
I think there is one difficulty here: it looks like the RM scheduler does not
keep info about finished containers, only about live containers (in
SchedulerApplicationAttempt). If no dead-container info is preserved in the RM,
the newly added API can only send a kill-container event; it has no way to know
whether the container actually got killed (it cannot differentiate a wrong
container ID from the ID of a finished container). The CLI could do better, as
it can query the list of running containers first, then kill the container and
wait until it is no longer active.
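A rough sketch of that CLI-side flow (query, kill, then poll) is below. It is
written in Python against a hypothetical in-memory stand-in for the RM;
FakeCluster, list_running(), signal_kill() and kill_and_wait() are names made
up for illustration and are not part of any YARN API.

```python
import time


class FakeCluster:
    """Hypothetical stand-in for the RM: it only knows about live containers."""

    def __init__(self, live):
        self.live = set(live)

    def list_running(self):
        # Return a copy so callers cannot mutate the cluster state.
        return set(self.live)

    def signal_kill(self, container_id):
        # Fire-and-forget kill event; this fake completes it immediately.
        self.live.discard(container_id)


def kill_and_wait(cluster, container_id, timeout_s=5.0, poll_s=0.1):
    """Query first, send the kill, then poll until the container is gone."""
    if container_id not in cluster.list_running():
        # Could be a wrong ID or an already finished container;
        # with only live-container info the two cases look the same.
        return "not-running"
    cluster.signal_kill(container_id)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if container_id not in cluster.list_running():
            return "killed"
        time.sleep(poll_s)
    return "timeout"
```

Because the client checks the running list before and after the kill, it can
report "killed" with some confidence without the RM having to remember
finished containers.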
If we want exactly the same semantics as the kill apps API, then we have to make
the RM track info for dead containers, which sounds like overkill to me, as it
forces the RM to track all containers for all applications (the complexity
becomes the same as in MRv1).
Maybe a better trade-off here is: forceKillContainer() only means that a
kill-container event was sent, not that the container actually got killed.
A boolean response from forceKillContainer() indicates whether we found a live
container to kill. So we could lose the idempotent property for this API?
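To make the proposed semantics concrete, here is a minimal sketch in Python. It
is a toy model, not RM code: SchedulerSketch, container_finished() and the
kill_events list are assumptions for illustration, and only the name
forceKillContainer() (rendered here as force_kill_container) comes from the
discussion above.

```python
class SchedulerSketch:
    """Toy model of a scheduler that tracks live containers only."""

    def __init__(self, live_containers):
        self._live = set(live_containers)
        self.kill_events = []  # record of kill events that were sent

    def force_kill_container(self, container_id):
        """Send a kill event if the container is live.

        True means "a live container was found and a kill event was sent",
        not "the container is dead". False covers both an unknown ID and a
        container that has already finished.
        """
        if container_id not in self._live:
            return False
        self.kill_events.append(container_id)
        return True

    def container_finished(self, container_id):
        # Hypothetical callback: the node reports the container as finished,
        # after which the scheduler forgets it entirely.
        self._live.discard(container_id)
```

Calling force_kill_container() twice on the same ID can return True and then
False once the container has finished, and a mistyped ID also returns False,
so the response is not the same across repeated calls even though the end
state is.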
> Provide YARN chaos monkey
> -------------------------
>
> Key: YARN-3337
> URL: https://issues.apache.org/jira/browse/YARN-3337
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: test
> Affects Versions: 2.7.0
> Reporter: Steve Loughran
>
> To test failure resilience today you either need custom scripts or implement
> Chaos Monkey-like logic in your application (SLIDER-202).
> Killing AMs and containers on a schedule & probability is the core activity
> here, one that could be handled by a CLI App/client lib that does this.
> # entry point to have a startup delay before acting
> # frequency of chaos wakeup/polling
> # probability of AM failure generation (0-100)
> # probability of non-AM container kill
> # future: other operations