[ https://issues.apache.org/jira/browse/YARN-6451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971530#comment-15971530 ]
Carlo Curino commented on YARN-6451:
------------------------------------

Thanks [~chris.douglas] for the feedback.
# I did the precompilation as you suggested (I didn't know the JavaScript engine also implements {{Compilable}}, on top of the general {{ScriptEngine}} interface), and it helps somewhat. Poking at performance, I also found that the longer I ran it the slower it got; this was due to the collector accumulating records. I now clear it at each iteration. Combined, this brought us down to about 1ms per iteration if we keep all invariants separate (one per line of our script file), and *0.07ms per invocation* if we combine them into a single large invariant (with all individual invariants joined by &&). There are pros and cons: when invariants are violated, the log line is harder to read if they are combined, but perf is much better. In the current example of {{invariants.txt}} I will leave this with one invariant per line, so slower but easier to understand. Does that work?
# I added this to the logging/exception message. In particular, I am pruning the bindings, so that the message contains only the bindings used in the failing invariant (barring the performance tricks above, this makes for a very readable output).
# As we discussed offline, while it is true we could push the checking deep into the collector and detect issues a little closer to when they happen, since we run, say, every second with this, it is unlikely we will improve detection much (we shave sub-millisecond time, but we might still be 0.5sec off on average from when the violation occurred). Short of checking at every metrics update (very costly), we probably can only detect issues a little after they have happened.
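For readers following points 1 and 2, here is a minimal sketch of the three ideas mentioned there: compiling an invariant once through {{Compilable}}, joining per-line invariants with &&, and pruning the bindings to those the failing invariant actually references. It is not the code in the attached patches. The class name {{InvariantSketch}}, its method names, and the identifier-matching regex are assumptions made for illustration. Note also that JDK 15 and later do not bundle a JavaScript engine, so {{precompile}} returns null there unless one is added to the classpath.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.script.Compilable;
import javax.script.CompiledScript;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import javax.script.SimpleBindings;

// Hypothetical helper, not part of the YARN-6451 patches.
class InvariantSketch {
  // Assumption: metric names look like plain identifiers.
  private static final Pattern IDENT =
      Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

  /** Joins one-per-line invariants into a single "(a) && (b)" expression. */
  static String combine(List<String> invariants) {
    StringBuilder sb = new StringBuilder();
    for (String inv : invariants) {
      if (sb.length() > 0) {
        sb.append(" && ");
      }
      sb.append('(').append(inv.trim()).append(')');
    }
    return sb.toString();
  }

  /** Keeps only the bindings whose names appear in the invariant text. */
  static Map<String, Object> prune(String invariant,
                                   Map<String, Object> bindings) {
    Map<String, Object> used = new LinkedHashMap<>();
    Matcher m = IDENT.matcher(invariant);
    while (m.find()) {
      String name = m.group();
      if (bindings.containsKey(name)) {
        used.put(name, bindings.get(name));
      }
    }
    return used;
  }

  /** Compiles the script once; returns null if no compilable engine exists. */
  static CompiledScript precompile(String script) throws ScriptException {
    ScriptEngine engine =
        new ScriptEngineManager().getEngineByName("JavaScript");
    if (engine instanceof Compilable) {
      return ((Compilable) engine).compile(script);
    }
    return null; // no JavaScript engine available on this JVM
  }

  /** Evaluates a precompiled invariant against a snapshot of metric values. */
  static boolean holds(CompiledScript compiled, Map<String, Object> metrics)
      throws ScriptException {
    Object result = compiled.eval(new SimpleBindings(new HashMap<>(metrics)));
    return Boolean.TRUE.equals(result);
  }
}
```

With per-line invariants, {{prune}} can be applied to the one line that failed, which is what keeps the log message short. With a single combined expression, the same call would return every binding any invariant mentions, which is the readability cost described in point 1.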
This seems anyway much better than days later when a customer complains :-)

> Create a monitor to check whether we maintain RM (scheduling) invariants
> ------------------------------------------------------------------------
>
>                 Key: YARN-6451
>                 URL: https://issues.apache.org/jira/browse/YARN-6451
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Carlo Curino
>            Assignee: Carlo Curino
>         Attachments: YARN-6451.v0.patch, YARN-6451.v1.patch, YARN-6451.v2.patch
>
>
> For SLS runs, as well as for live test clusters (and maybe prod), it would be
> useful to have a mechanism to continuously check whether core invariants of
> the RM/Scheduler are respected (e.g., no priority inversions, fairness mostly
> respected, certain latencies within expected range, etc.)