[
https://issues.apache.org/jira/browse/YARN-6451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971530#comment-15971530
]
Carlo Curino commented on YARN-6451:
------------------------------------
Thanks [~chris.douglas] for the feedback.
# I did the precompilation as you suggested (I didn't know the JavaScript
engine implements {{Compilable}} in addition to the general {{ScriptEngine}}), and it
helps somewhat. Poking at performance, I also found that the longer I ran it
the slower it got; this was due to the collector accumulating records. I now
clear it at each iteration. Combined, these changes brought us down to about 1ms per
iteration if we keep all invariants separate (one per line of our script file),
and *0.07ms per invocation* if we combine them into a single large invariant
(with all individual invariants joined by && ).
The trade-off: when an invariant is violated, the log line is harder to read if
they are combined, but perf is much better. In the current example of {{invariants.txt}}
I will leave one invariant per line, so slower but easier to
understand. Does that work?
# I added this to the logging/exception message. In particular, I am pruning
the bindings, so that the message contains only the bindings used in the
failing invariant (barring the performance tricks above, this makes for very
readable output).
# As we discussed offline, while it is true we could push the checking deep
into the collector and detect issues a little closer to when they
happen, since we run this, say, every second, it is unlikely we would improve
detection much (we shave sub-millisecond time, but we might still be 0.5sec off on
average from when the violation occurred). Short of checking at every metrics
update (very costly), we can probably only detect issues a little after they
have happened. This still seems much better than days later when a customer
complains :-)
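
For reference, points 1 and 2 above can be sketched roughly as follows. This is
a minimal illustration under stated assumptions, not the code in the patch: the
class {{InvariantSketch}}, its method names, and the metric names are all
hypothetical, and it assumes a JSR-223 JavaScript engine is installed (on JDKs
that ship without one, {{precompile}} simply returns null).

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.script.Compilable;
import javax.script.CompiledScript;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import javax.script.SimpleBindings;

// Hypothetical sketch: names below are illustrative, not taken from the patch.
final class InvariantSketch {
  // Matches identifier-like tokens; the leading \b avoids matching inside "1e5".
  private static final Pattern IDENT =
      Pattern.compile("\\b[A-Za-z_][A-Za-z0-9_]*");

  /**
   * Compiles the invariant once so later checks skip parsing.
   * Returns null when no Compilable JavaScript engine is available.
   */
  public static CompiledScript precompile(String invariant)
      throws ScriptException {
    ScriptEngine engine =
        new ScriptEngineManager().getEngineByName("JavaScript");
    if (!(engine instanceof Compilable)) {
      return null; // also covers engine == null
    }
    return ((Compilable) engine).compile(invariant);
  }

  /** Keeps only the bindings whose names occur in the invariant text. */
  public static Map<String, Object> prune(String invariant,
      Map<String, Object> bindings) {
    Map<String, Object> used = new LinkedHashMap<>();
    Matcher m = IDENT.matcher(invariant);
    while (m.find()) {
      String name = m.group();
      if (bindings.containsKey(name)) {
        used.put(name, bindings.get(name));
      }
    }
    return used;
  }

  /** Returns null if the invariant holds, else a message with pruned bindings. */
  public static String check(String invariant, CompiledScript compiled,
      Map<String, Object> bindings) throws ScriptException {
    // Evaluate against a copy so the engine cannot mutate the caller's map.
    Object result = compiled.eval(new SimpleBindings(new HashMap<>(bindings)));
    if (Boolean.TRUE.equals(result)) {
      return null;
    }
    return "Invariant violated: " + invariant
        + " with bindings " + prune(invariant, bindings);
  }
}
```

With made-up bindings a, b, c and d, pruning against the invariant
"a >= 0 && b < c" keeps only a, b and c, so d never shows up in the log line.
Combining every invariant into one &&-joined expression would go through the
same {{precompile}} call, at the cost of the pruned message covering all of
them.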
> Create a monitor to check whether we maintain RM (scheduling) invariants
> ------------------------------------------------------------------------
>
> Key: YARN-6451
> URL: https://issues.apache.org/jira/browse/YARN-6451
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Carlo Curino
> Assignee: Carlo Curino
> Attachments: YARN-6451.v0.patch, YARN-6451.v1.patch,
> YARN-6451.v2.patch
>
>
> For SLS runs, as well as for live test clusters (and maybe prod), it would be
> useful to have a mechanism to continuously check whether core invariants of
> the RM/Scheduler are respected (e.g., no priority inversions, fairness mostly
> respected, certain latencies within expected range, etc.)