Wei Chen created YARN-7964:
------------------------------
Summary: Yarn Scheduler Load Simulator (SLS): MetricsLogRunnable
stops working when there are too many jobs needed to load from sls
Key: YARN-7964
URL: https://issues.apache.org/jira/browse/YARN-7964
Project: Hadoop YARN
Issue Type: Bug
Components: scheduler-load-simulator
Affects Versions: 3.0.0, 2.7.5
Environment: I am running SLS on a Linux server (Ubuntu 16.04). The
Hadoop version is 3.0.0.
Reporter: Wei Chen
Hi, I am using SLS to simulate a large-scale cluster that consists of more than
100 nodes and runs more than 4k jobs. I found that MetricsLogRunnable
(which periodically flushes real-time metrics to a file) stops working if SLS takes
too long to load the sls file.
More specifically, the exception is thrown here, in the function String
generateRealTimeTrackingMetrics() in SLSWebApp.java:
{code:java}
for (String queue : wrapper.getQueueSet()) {
..........
}
{code}
The exception is reported as:
2018-02-22 17:13:59,450 INFO sls.SLSRunner: newly creaed job:
6263127055conainer size: 10queue: default
java.lang.NullPointerException
at org.apache.hadoop.yarn.sls.web.SLSWebApp.generateRealTimeTrackingMetrics(SLSWebApp.java:438)
at org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler$MetricsLogRunnable.run(SLSCapacityScheduler.java:724)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
So wrapper.getQueueSet() returns null, which causes the exception.
After further analyzing the source code, we noticed the following in SLSRunner.java:
{code:java}
public void start() throws Exception {
  // start resource manager
  startRM();
  // start node managers
  startNM();
  // start application masters
  startAM();
  // set queue & tracked apps information
  ((SchedulerWrapper) rm.getResourceScheduler())
      .setQueueSet(this.queueAppNumMap.keySet());
  ((SchedulerWrapper) rm.getResourceScheduler())
      .setTrackedAppSet(this.trackedApps);
  // print out simulation info
  printSimulationInfo();
  // blocked until all nodes RUNNING
  waitForNodesRunning();
  // starting the runner once everything is ready to go,
  runner.start();
}
{code}
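As you can see, the queue set for tracking is set by
((SchedulerWrapper) rm.getResourceScheduler())
.setQueueSet(this.queueAppNumMap.keySet());, which
happens only after the RM, NMs, and AMs have been started. By the time the queue
set is set, MetricsLogRunnable has already been launched. That is why the queue
set is still null and causes the NullPointerException. The stack trace shows the
task running on a ScheduledThreadPoolExecutor, which suppresses all subsequent
executions of a periodic task once one run throws, so the metrics logging never
resumes.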
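
The ordering problem described above can be reproduced outside of SLS with a minimal sketch. Everything in it is an assumption made for illustration: FakeWrapper, getQueueSetOrEmpty() and ticksThatSawQueues() are hypothetical names, not SLS classes or methods, and the null guard is only one possible mitigation, not necessarily the fix that should go into SLS. The sketch relies on the documented behavior of ScheduledExecutorService.scheduleAtFixedRate, which suppresses later executions once a run throws.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class QueueSetRaceSketch {

    // Hypothetical stand-in for the scheduler wrapper; not the real SLS class.
    static class FakeWrapper {
        private volatile Set<String> queueSet; // stays null until setQueueSet() runs

        void setQueueSet(Set<String> queues) {
            this.queueSet = queues;
        }

        // Unsafe: may return null, as described in the report.
        Set<String> getQueueSet() {
            return queueSet;
        }

        // Defensive variant (an assumption, not the SLS API): never returns null.
        Set<String> getQueueSetOrEmpty() {
            Set<String> q = queueSet;
            return q == null ? Collections.<String>emptySet() : q;
        }
    }

    // Starts a periodic "metrics" task first and sets the queue set only after
    // the task has ticked once. Returns how many queue entries the task saw.
    static int ticksThatSawQueues(boolean defensive) throws InterruptedException {
        FakeWrapper wrapper = new FakeWrapper();
        AtomicInteger seen = new AtomicInteger();
        CountDownLatch firstTick = new CountDownLatch(1);
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();

        pool.scheduleAtFixedRate(() -> {
            try {
                Set<String> queues = defensive
                        ? wrapper.getQueueSetOrEmpty()
                        : wrapper.getQueueSet();
                for (String queue : queues) { // NullPointerException if queues is null
                    seen.incrementAndGet();
                }
            } finally {
                firstTick.countDown();
            }
        }, 0, 10, TimeUnit.MILLISECONDS);

        firstTick.await(); // the task has already run once with no queue set
        wrapper.setQueueSet(Collections.singleton("default")); // set too late
        Thread.sleep(200); // give later ticks a chance to run
        pool.shutdownNow();
        return seen.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Prints 0: the first run throws, so no later run is ever scheduled.
        System.out.println("unsafe:    " + ticksThatSawQueues(false));
        // Prints a positive count: the first run sees an empty set and later runs continue.
        System.out.println("defensive: " + ticksThatSawQueues(true));
    }
}
```

Another option suggested by the same reasoning would be to call setQueueSet() and setTrackedAppSet() before the metrics task is started, so the task never observes a null set; whether that reordering is safe inside SLSRunner.start() would need to be checked against the actual code.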