[ 
https://issues.apache.org/jira/browse/YARN-6451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971530#comment-15971530
 ] 

Carlo Curino commented on YARN-6451:
------------------------------------

Thanks [~chris.douglas] for the feedback.

# I did the precompilation as you suggested (I didn't know the JavaScript 
engine implements {{Compilable}} in addition to the general {{ScriptEngine}} 
interface), and it helps somewhat. Poking at performance, I also found that the 
longer I ran it the slower it got; this was due to the collector accumulating 
records, so I now clear it at each iteration. Combined, these changes brought us 
down to about 1ms per iteration if we keep all invariants separate (one per 
line of our script file), and *0.07ms per invocation* if we combine them into a 
single large invariant (with all individual invariants joined by && ). 
There are pros and cons: when invariants are violated the log line is harder to 
read if they are combined, but perf is much better. In the current example of 
{{invariants.txt}} I will leave this with one invariant per line, so slower but 
easier to understand. Does that work?

# I added this to the logging/exception message. In particular, I am pruning 
the bindings, so that the message contains only the bindings used in the 
failing invariant (unless invariants are combined for performance as described 
above, this makes for very readable output).

# As we discussed offline, while it is true that we could push the checking 
deep into the collector and detect issues a little closer to when they happen, 
we already run this check, say, every second, so it is unlikely we would 
improve detection much (we would shave sub-millisecond time, but we might still 
be 0.5sec off on average from when the violation occurred). Short of checking 
at every metrics update (very costly), we can probably only detect issues a 
little after they have happened. In any case, this seems much better than days 
later when a customer complains :-)
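
The precompilation and binding pruning described in the first two points can be sketched roughly as follows. This is a minimal illustration, not the code in the YARN-6451 patches: the class {{InvariantCheckSketch}}, its methods, and the metric names {{allocatedMB}} and {{pendingContainers}} are hypothetical, and the identifier matching is deliberately naive. It assumes a JavaScript engine that implements {{Compilable}} is available (Nashorn on JDK 8 to 14; later JDKs ship none, in which case {{precompile}} returns an empty list and nothing is checked).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.script.Compilable;
import javax.script.CompiledScript;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import javax.script.SimpleBindings;

/** Hypothetical sketch; names and structure are assumptions, not the patch. */
class InvariantCheckSketch {

  // Naive identifier matcher, used only to find which bindings an invariant mentions.
  private static final Pattern IDENTIFIER =
      Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

  /** Joins per-line invariants into one expression: (a) && (b) && ... */
  public static String combine(List<String> invariants) {
    StringBuilder sb = new StringBuilder();
    for (String inv : invariants) {
      if (sb.length() > 0) {
        sb.append(" && ");
      }
      sb.append('(').append(inv).append(')');
    }
    return sb.toString();
  }

  /** Keeps only the bindings whose names occur in the invariant text. */
  public static Map<String, Object> prune(String invariant,
      Map<String, Object> bindings) {
    Map<String, Object> used = new TreeMap<>();
    Matcher m = IDENTIFIER.matcher(invariant);
    while (m.find()) {
      String name = m.group();
      if (bindings.containsKey(name)) {
        used.put(name, bindings.get(name));
      }
    }
    return used;
  }

  /** Compiles each invariant once; empty list if no Compilable JS engine exists. */
  public static List<CompiledScript> precompile(List<String> invariants)
      throws ScriptException {
    List<CompiledScript> compiled = new ArrayList<>();
    ScriptEngine engine =
        new ScriptEngineManager().getEngineByName("JavaScript");
    if (!(engine instanceof Compilable)) {
      return compiled; // engine missing (null) or not compilable
    }
    for (String inv : invariants) {
      compiled.add(((Compilable) engine).compile(inv));
    }
    return compiled;
  }

  /** Evaluates the precompiled invariants; returns one message per violation. */
  public static List<String> check(List<String> invariants,
      List<CompiledScript> compiled, Map<String, Object> bindings)
      throws ScriptException {
    List<String> violations = new ArrayList<>();
    for (int i = 0; i < compiled.size(); i++) {
      // Evaluate against a copy so the engine cannot pollute the caller's map.
      Object result =
          compiled.get(i).eval(new SimpleBindings(new HashMap<>(bindings)));
      if (!Boolean.TRUE.equals(result)) {
        String inv = invariants.get(i);
        violations.add("Invariant violated: " + inv
            + " with " + prune(inv, bindings));
      }
    }
    return violations;
  }
}
```

Passing {{combine(invariants)}} as the single entry to {{precompile}} gives the faster variant, at the cost that a violation message can no longer name the individual invariant that failed.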



> Create a monitor to check whether we maintain RM (scheduling) invariants
> ------------------------------------------------------------------------
>
>                 Key: YARN-6451
>                 URL: https://issues.apache.org/jira/browse/YARN-6451
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Carlo Curino
>            Assignee: Carlo Curino
>         Attachments: YARN-6451.v0.patch, YARN-6451.v1.patch, 
> YARN-6451.v2.patch
>
>
> For SLS runs, as well as for live test clusters (and maybe prod), it would be 
> useful to have a mechanism to continuously check whether core invariants of 
> the RM/Scheduler are respected (e.g., no priority inversions, fairness mostly 
> respected, certain latencies within expected range, etc.)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

