[ https://issues.apache.org/jira/browse/YARN-6451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971709#comment-15971709 ]
Chris Douglas commented on YARN-6451:
-------------------------------------
bq. when invariants are violated the log line is harder to read if combined,
but perf is much better. In the current example of invariants.txt I will leave
this with one invariant per line, so slower but easier to understand---works?
The checker could evaluate the combined expression first and, only if it
detects a violation, iterate over the individual expressions to print specific
error messages. That said, shaving fractions of a millisecond off the
validation check is probably not significant.
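
As a rough illustration of that two-phase idea, here is a minimal Java sketch.
Everything in it is hypothetical: the class name {{TwoPhaseInvariantChecker}},
the metric keys, and the choice to model each invariant as a plain predicate
over a map of bindings are assumptions made for the example, not what the patch
does with {{invariants.txt}}.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch of a combined-then-individual check; not the patch. */
class TwoPhaseInvariantChecker {
  // Invariant text -> predicate over the current bindings (illustrative model).
  private final Map<String, Predicate<Map<String, Double>>> invariants;
  // Conjunction of all invariants, built once up front.
  private final Predicate<Map<String, Double>> combined;

  TwoPhaseInvariantChecker(Map<String, Predicate<Map<String, Double>>> invariants) {
    this.invariants = new LinkedHashMap<>(invariants);
    Predicate<Map<String, Double>> all = b -> true;
    for (Predicate<Map<String, Double>> p : this.invariants.values()) {
      all = all.and(p);
    }
    this.combined = all;
  }

  /** Returns the text of every violated invariant; empty if all hold. */
  List<String> check(Map<String, Double> bindings) {
    List<String> violated = new ArrayList<>();
    // Fast path: a single evaluation of the combined expression.
    if (combined.test(bindings)) {
      return violated;
    }
    // Slow path, taken only on a violation: find out which ones failed.
    for (Map.Entry<String, Predicate<Map<String, Double>>> e : invariants.entrySet()) {
      if (!e.getValue().test(bindings)) {
        violated.add(e.getKey());
      }
    }
    return violated;
  }

  public static void main(String[] args) {
    Map<String, Predicate<Map<String, Double>>> inv = new LinkedHashMap<>();
    // Made-up metric names, for illustration only.
    inv.put("pendingContainers >= 0", b -> b.get("pendingContainers") >= 0);
    inv.put("allocatedMB <= totalMB", b -> b.get("allocatedMB") <= b.get("totalMB"));

    Map<String, Double> bindings = new HashMap<>();
    bindings.put("pendingContainers", 4.0);
    bindings.put("allocatedMB", 9000.0);
    bindings.put("totalMB", 8192.0);

    for (String v : new TwoPhaseInvariantChecker(inv).check(bindings)) {
      System.out.println("Invariant violated: " + v);
    }
  }
}
```

The per-invariant messages stay readable because the slow path runs only after
the combined check has already failed.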
+1 overall. For future versions:
* The invariant checker might want to use bindings across contexts; this would
be hard to express as subtypes of {{InvariantsChecker}}. For example, if one
wanted to check some invariant using values from the scheduler and the metrics,
there isn't a good way to compose the two with inheritance. That said, in the
current RM it's hard to correlate values collected from multiple components
without reasoning about their mutual consistency in a brittle, ad hoc way. How
invariants are loaded and how errors are handled could also be abstracted, but
(IMHO) that'd be premature. This is approachable as-is.
* The unit test is kind of light.
* This could print a warning when it starts up, since it's mostly for testing.
If it's accidentally deployed in a production setting, it should show up in the
log. Does the RM refuse to start if {{invariants.txt}} is missing?
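
To make the first point concrete, one possible shape for cross-context bindings
is composition: merge values from several named sources into one namespaced map,
rather than subclassing {{InvariantsChecker}} per source. The sketch below is
only an assumption about how that might look; {{ComposedBindings}}, the source
prefixes, and the metric keys are all made up for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch: bindings composed from several sources. */
class ComposedBindings {
  // Prefix -> supplier that samples one component's current values.
  private final Map<String, Supplier<Map<String, Double>>> sources =
      new LinkedHashMap<>();

  ComposedBindings add(String prefix, Supplier<Map<String, Double>> source) {
    sources.put(prefix, source);
    return this;
  }

  /**
   * Samples every source and namespaces its keys as "prefix.key".
   * Sources are read one after another, so the merged values are NOT
   * guaranteed to be mutually consistent.
   */
  Map<String, Double> snapshot() {
    Map<String, Double> merged = new HashMap<>();
    for (Map.Entry<String, Supplier<Map<String, Double>>> s : sources.entrySet()) {
      for (Map.Entry<String, Double> v : s.getValue().get().entrySet()) {
        merged.put(s.getKey() + "." + v.getKey(), v.getValue());
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    // Made-up sources and keys, for illustration only.
    ComposedBindings bindings = new ComposedBindings()
        .add("scheduler", () -> Collections.singletonMap("pendingContainers", 3.0))
        .add("metrics", () -> Collections.singletonMap("pendingContainers", 3.0));

    Map<String, Double> b = bindings.snapshot();
    boolean agree = b.get("scheduler.pendingContainers")
        .equals(b.get("metrics.pendingContainers"));
    System.out.println("scheduler and metrics agree: " + agree);
  }
}
```

This does nothing about the consistency problem noted above: each source is
sampled at a slightly different moment, so a cross-component invariant can
report a violation that is only a sampling artifact.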
> Create a monitor to check whether we maintain RM (scheduling) invariants
> ------------------------------------------------------------------------
>
> Key: YARN-6451
> URL: https://issues.apache.org/jira/browse/YARN-6451
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Carlo Curino
> Assignee: Carlo Curino
> Attachments: YARN-6451.v0.patch, YARN-6451.v1.patch,
> YARN-6451.v2.patch, YARN-6451.v3.patch
>
>
> For SLS runs, as well as for live test clusters (and maybe prod), it would be
> useful to have a mechanism to continuously check whether core invariants of
> the RM/Scheduler are respected (e.g., no priority inversions, fairness mostly
> respected, certain latencies within expected range, etc.)