[ 
https://issues.apache.org/jira/browse/YARN-6451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959937#comment-15959937
 ] 

Carlo Curino commented on YARN-6451:
------------------------------------

The patch provides an initial implementation of this idea. It does the 
following simple thing:

Every time the InvariantsChecker is invoked, it: 
 # polls the QueueMetrics (we could/should extend it and make it configurable)
 # checks a list of invariants (loaded from a config file)
 # logs any error as a warning
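
The three steps above can be sketched roughly as follows. This is only an 
illustration, not the code in the attached patch: the class and method names 
are hypothetical, the metric names and values are made up for the example, 
and invariants are modeled as plain Java predicates rather than expressions 
loaded from a config file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch; all names here are assumptions, not the patch's API.
final class InvariantCheckSketch {

  // A named predicate over a snapshot of polled metric values.
  static final class Invariant {
    final String name;
    final Predicate<Map<String, Long>> holds;

    Invariant(String name, Predicate<Map<String, Long>> holds) {
      this.name = name;
      this.holds = holds;
    }
  }

  // One checker pass: evaluate every invariant against the snapshot and
  // return the names of the violated ones, in declaration order.
  static List<String> check(Map<String, Long> metrics,
      List<Invariant> invariants) {
    List<String> violated = new ArrayList<>();
    for (Invariant inv : invariants) {
      if (!inv.holds.test(metrics)) {
        violated.add(inv.name);
      }
    }
    return violated;
  }

  public static void main(String[] args) {
    // Step 1: stand-in for polling the metrics (values are made up).
    Map<String, Long> snapshot = new LinkedHashMap<>();
    snapshot.put("AvailableMB", -512L);
    snapshot.put("PendingContainers", 3L);

    // Step 2: stand-in for invariants loaded from a config file.
    List<Invariant> invariants = Arrays.asList(
        new Invariant("AvailableMB >= 0",
            m -> m.getOrDefault("AvailableMB", 0L) >= 0),
        new Invariant("PendingContainers >= 0",
            m -> m.getOrDefault("PendingContainers", 0L) >= 0));

    // Step 3: report each violation as a warning.
    for (String name : check(snapshot, invariants)) {
      System.err.println("WARN invariant violated: " + name);
    }
  }
}
```

Keeping the check itself free of side effects (it only returns the violated 
names) makes it easy to assert on from an SLS-based test.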

The idea is to use this in a few ways: 
 # For SLS-based unit/integration tests that ensure correctness of the overall 
RM subsystem.  E.g., running for a while, and checking that important 
invariants are never violated (e.g., resources staying non-negative, or 
locality not going from usually good to very bad after a check-in). 
 # Performance-based analysis via SLS (and fixed environments), e.g., 
allocation-latency starting to get worse after a certain change.
 # In production environments to "anticipate" customer griping.

An extension of this is to make the "action" triggered when an invariant is 
violated configurable, e.g., in some cases a log is all that is needed, while 
other times one may want an alert, or even a System.exit() if things are 
really bad (and/or the deployment allows it).
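
A minimal sketch of what such a configurable action could look like. 
Everything here is an assumption for illustration: the enum values, the 
parsing of a config string, and the default of LOG are not part of the 
current patch, and the ALERT branch is only a placeholder for a real 
alerting hook.

```java
import java.util.Locale;

// Hypothetical sketch; names and defaults are assumptions.
final class ViolationActionSketch {

  // Possible reactions to a violated invariant.
  enum Action { LOG, ALERT, EXIT }

  // Map a configured string to an action; fall back to LOG when unset.
  // An unknown value makes Enum.valueOf throw IllegalArgumentException.
  static Action parse(String configured) {
    if (configured == null || configured.trim().isEmpty()) {
      return Action.LOG;
    }
    return Action.valueOf(configured.trim().toUpperCase(Locale.ROOT));
  }

  // Carry out the configured action for one violated invariant.
  static void onViolation(Action action, String invariant) {
    switch (action) {
      case LOG:
        System.err.println("WARN invariant violated: " + invariant);
        break;
      case ALERT:
        // Placeholder: a real deployment would call its alerting system.
        System.err.println("ALERT invariant violated: " + invariant);
        break;
      case EXIT:
        System.err.println("FATAL invariant violated: " + invariant);
        System.exit(1);
        break;
      default:
        throw new IllegalStateException("unhandled action " + action);
    }
  }

  public static void main(String[] args) {
    // With no value configured, the violation is only logged.
    onViolation(parse(null), "AvailableMB >= 0");
  }
}
```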

[~wangda], [~jlowe], [~kasha], [~subru], [~kkaranasos], [~asuresh], 
[~chris.douglas]: Thoughts?




> Create a monitor to check whether we maintain RM (scheduling) invariants
> ------------------------------------------------------------------------
>
>                 Key: YARN-6451
>                 URL: https://issues.apache.org/jira/browse/YARN-6451
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Carlo Curino
>            Assignee: Carlo Curino
>         Attachments: YARN-6451.v0.patch
>
>
> For SLS runs, as well as for live test clusters (and maybe prod), it would be 
> useful to have a mechanism to continuously check whether core invariants of 
> the RM/Scheduler are respected (e.g., no priority inversions, fairness mostly 
> respected, certain latencies within expected range, etc.).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

