Wei Chen created YARN-7964:
------------------------------

             Summary: Yarn Scheduler Load Simulator (SLS): MetricsLogRunnable 
stops working when there are too many jobs needed to load from sls
                 Key: YARN-7964
                 URL: https://issues.apache.org/jira/browse/YARN-7964
             Project: Hadoop YARN
          Issue Type: Bug
          Components: scheduler-load-simulator
    Affects Versions: 3.0.0, 2.7.5
         Environment: I am running SLS on a Linux server (Ubuntu 16.04). The 
Hadoop version is 3.0.0.
            Reporter: Wei Chen


Hi, I am using SLS to simulate a large-scale cluster that consists of more than 
100 nodes and runs more than 4k jobs. I found that MetricsLogRunnable 
(which periodically flushes real-time metrics to a file) stops working if SLS 
takes too long to load the SLS file.

More specifically, the exception is thrown here, in the function String 
generateRealTimeTrackingMetrics() in SLSWebApp.java:
{code:java}
for (String queue : wrapper.getQueueSet()) {
..........
}
{code}
 
The exception is reported as:
2018-02-22 17:13:59,450 INFO sls.SLSRunner: newly creaed job: 6263127055conainer size: 10queue: default
java.lang.NullPointerException
at org.apache.hadoop.yarn.sls.web.SLSWebApp.generateRealTimeTrackingMetrics(SLSWebApp.java:438)
at org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler$MetricsLogRunnable.run(SLSCapacityScheduler.java:724)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

 

So wrapper.getQueueSet() returns null, which causes the NullPointerException.
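
One plausible reason the runnable stops for good instead of recovering once the queue set becomes available (this is an inference from the stack trace, not something verified against the SLS code): the trace shows the task running as a periodic task of a ScheduledThreadPoolExecutor, and the JDK suppresses all later runs of a periodic task once one run throws. The sketch below demonstrates that JDK behavior in isolation. The class and method names are made up for illustration and are not part of SLS.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not part of Hadoop/SLS.
final class PeriodicTaskSuppressionSketch {

  /** Returns how often a periodic task ran when its first run throws. */
  static int countRunsWhenFirstRunThrows() {
    ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);
    AtomicInteger runs = new AtomicInteger();
    try {
      ScheduledFuture<?> handle = pool.scheduleAtFixedRate(() -> {
        runs.incrementAndGet();
        // Stand-in for iterating over a queue set that is still null.
        throw new NullPointerException("queue set not initialized yet");
      }, 0, 10, TimeUnit.MILLISECONDS);
      try {
        handle.get(); // blocks until the periodic task terminates abnormally
      } catch (ExecutionException expected) {
        // The cause is the NullPointerException thrown by the task.
      }
      // Leave room for several more runs if the task were still scheduled.
      Thread.sleep(100);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      pool.shutdownNow();
    }
    return runs.get();
  }

  public static void main(String[] args) {
    System.out.println("runs = " + countRunsWhenFirstRunThrows());
  }
}
```

The task runs exactly once: after the first exception the executor never reschedules it, which would match the observed behavior of metrics no longer being flushed.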

After further analyzing the source code, we noticed the following in SLSRunner.java:
{code:java}
public void start() throws Exception {
    // start resource manager
    startRM();
    // start node managers
    startNM();
    // start application masters
    startAM();
    // set queue & tracked apps information
    ((SchedulerWrapper) rm.getResourceScheduler())
                            .setQueueSet(this.queueAppNumMap.keySet());
    ((SchedulerWrapper) rm.getResourceScheduler())
                            .setTrackedAppSet(this.trackedApps);
    // print out simulation info
    printSimulationInfo();
    // blocked until all nodes RUNNING
    waitForNodesRunning();
    // starting the runner once everything is ready to go,
    runner.start();
  }
{code}

As you can see, the queue set for tracking is set by 
((SchedulerWrapper) rm.getResourceScheduler())
                            .setQueueSet(this.queueAppNumMap.keySet()), which 
is done after the RM, NM and AM initialization. Before the queue set is set, 
MetricsLogRunnable has already been launched. That is why the queue set is 
still null at that point, which causes the NullPointerException.
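
As a minimal sketch of one possible mitigation (an assumption on my side, not a confirmed fix), the metrics code could treat a missing queue set as empty until start() has called setQueueSet. FakeSchedulerWrapper and generateMetrics below are hypothetical stand-ins for SchedulerWrapper and generateRealTimeTrackingMetrics(); they only model the two accessors mentioned above. The other option would be to set the queue set before the periodic task is launched.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical demo class, not part of Hadoop/SLS.
final class QueueSetGuardSketch {

  /** Hypothetical stand-in for SchedulerWrapper (only the queue-set accessors). */
  static final class FakeSchedulerWrapper {
    // volatile: written by the startup thread, read by the metrics thread.
    private volatile Set<String> queueSet; // stays null until setQueueSet runs

    void setQueueSet(Set<String> queues) {
      this.queueSet = queues;
    }

    Set<String> getQueueSet() {
      return queueSet;
    }
  }

  /** Simplified stand-in for the metrics generator: lists the queue names. */
  static String generateMetrics(FakeSchedulerWrapper wrapper) {
    Set<String> queues = wrapper.getQueueSet();
    if (queues == null) {
      // Startup has not reached setQueueSet yet: report nothing, do not throw.
      queues = Collections.emptySet();
    }
    StringBuilder sb = new StringBuilder();
    for (String queue : queues) {
      if (sb.length() > 0) {
        sb.append(',');
      }
      sb.append(queue);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    FakeSchedulerWrapper wrapper = new FakeSchedulerWrapper();
    // Before setQueueSet: the guarded loop yields an empty result.
    System.out.println("before: [" + generateMetrics(wrapper) + "]");
    wrapper.setQueueSet(new TreeSet<>(Arrays.asList("default", "batch")));
    // After setQueueSet: queue names are listed in sorted order.
    System.out.println("after:  [" + generateMetrics(wrapper) + "]");
  }
}
```

With the guard in place, an early run of the periodic task simply produces no queue metrics, and later runs pick up the queues once they are set.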
 


