[
https://issues.apache.org/jira/browse/YARN-6451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959937#comment-15959937
]
Carlo Curino commented on YARN-6451:
------------------------------------
The patch provides an initial implementation of this idea. It does the
following simple thing:
Every time the InvariantsChecker is invoked, it:
# polls the QueueMetrics (we could/should extend this and make it configurable)
# checks a list of invariants (loaded from a config file)
# logs any error as a warning
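A minimal sketch of the three steps above, not the code in the attached patch. It assumes a metrics snapshot can be represented as a map of metric name to value and that each invariant is a named predicate over that snapshot. The class name InvariantsCheckerSketch, the metricsSource supplier (a stand-in for polling QueueMetrics), and the addInvariant method are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.function.Supplier;
import java.util.logging.Logger;

/** Hypothetical sketch; these are not the classes from the patch. */
class InvariantsCheckerSketch {
  private static final Logger LOG =
      Logger.getLogger("InvariantsCheckerSketch");

  // Assumption: stands in for polling QueueMetrics; returns name -> value.
  private final Supplier<Map<String, Long>> metricsSource;

  // Invariant name -> predicate over the snapshot. In the described design
  // these would be loaded from a config file; here they are added in code.
  private final Map<String, Predicate<Map<String, Long>>> invariants =
      new LinkedHashMap<>();

  InvariantsCheckerSketch(Supplier<Map<String, Long>> metricsSource) {
    this.metricsSource = metricsSource;
  }

  void addInvariant(String name, Predicate<Map<String, Long>> check) {
    invariants.put(name, check);
  }

  /** One invocation: poll, check every invariant, log failures as warnings. */
  List<String> checkOnce() {
    Map<String, Long> snapshot = metricsSource.get();
    List<String> violated = new ArrayList<>();
    for (Map.Entry<String, Predicate<Map<String, Long>>> e
        : invariants.entrySet()) {
      if (!e.getValue().test(snapshot)) {
        violated.add(e.getKey());
        LOG.warning("Invariant violated: " + e.getKey()
            + " (metrics: " + snapshot + ")");
      }
    }
    return violated;
  }
}
```

With this shape, an invariant such as "available memory is non-negative" could be registered as `checker.addInvariant("AvailableMB >= 0", s -> s.getOrDefault("AvailableMB", 0L) >= 0)`, where the metric name is only illustrative.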
The idea is to use this in a few ways:
# For SLS-based unit/integration tests that ensure correctness of the overall
RM subsystem. E.g., running for a while, and checking that important
invariants are never violated (e.g., resource being non-negative, or locality
going from usually good to very bad after a check-in).
# Performance-based analysis via SLS (and fixed environments), e.g.,
allocation-latency starting to get worse after a certain change.
# In production environments to "anticipate" customer griping.
An extension of this is to make the "action" triggered when an invariant is
violated configurable, e.g., in some cases a log is all that is needed, while other
times one may want an alert, or even a System.exit() if things are really bad
(and/or the deployment allows it).
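One possible shape for such a configurable action, as a sketch only. It assumes the action is selected by a single config string and that logging, alerting, and termination are injected as callbacks. ViolationActionSketch, ViolationHandlerSketch, and the LOG/ALERT/EXIT values are hypothetical names, not part of the patch.

```java
import java.util.Locale;
import java.util.function.Consumer;

/** Hypothetical set of actions to take when an invariant is violated. */
enum ViolationActionSketch {
  LOG, ALERT, EXIT;

  /** Parses a config value; a missing or blank value falls back to LOG. */
  static ViolationActionSketch fromConfig(String value) {
    if (value == null || value.trim().isEmpty()) {
      return LOG;
    }
    return valueOf(value.trim().toUpperCase(Locale.ROOT));
  }
}

/** Hypothetical handler that dispatches on the configured action. */
class ViolationHandlerSketch {
  private final ViolationActionSketch action;
  private final Consumer<String> logger;   // e.g. writes a warning
  private final Consumer<String> alerter;  // e.g. pages an operator
  private final Runnable terminator;       // e.g. () -> System.exit(1)

  ViolationHandlerSketch(ViolationActionSketch action,
      Consumer<String> logger, Consumer<String> alerter,
      Runnable terminator) {
    this.action = action;
    this.logger = logger;
    this.alerter = alerter;
    this.terminator = terminator;
  }

  void onViolation(String invariant) {
    // Always log, whatever the configured action is.
    logger.accept("Invariant violated: " + invariant);
    switch (action) {
      case ALERT:
        alerter.accept(invariant);
        break;
      case EXIT:
        terminator.run();
        break;
      default:
        break;
    }
  }
}
```

Injecting the terminator as a Runnable keeps the EXIT path testable without actually stopping the JVM; a real deployment could pass `() -> System.exit(1)`.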
[~wangda], [~jlowe], [~kasha], [~subru], [~kkaranasos], [~asuresh],
[~chris.douglas]: Thoughts?
> Create a monitor to check whether we maintain RM (scheduling) invariants
> ------------------------------------------------------------------------
>
> Key: YARN-6451
> URL: https://issues.apache.org/jira/browse/YARN-6451
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Carlo Curino
> Assignee: Carlo Curino
> Attachments: YARN-6451.v0.patch
>
>
> For SLS runs, as well as for live test clusters (and maybe prod), it would be
> useful to have a mechanism to continuously check whether core invariants of
> the RM/Scheduler are respected (e.g., no priority inversions, fairness mostly
> respected, certain latencies within expected range, etc.)