[
https://issues.apache.org/jira/browse/YARN-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591456#comment-16591456
]
Xianghao Lu commented on YARN-7964:
-----------------------------------
[~cxcw] This seems to be a duplicate of YARN-8632.
> Yarn Scheduler Load Simulator (SLS): MetricsLogRunnable stops working when
> there are too many jobs needed to load from sls
> --------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-7964
> URL: https://issues.apache.org/jira/browse/YARN-7964
> Project: Hadoop YARN
> Issue Type: Bug
> Components: scheduler-load-simulator
> Affects Versions: 2.7.5, 3.0.0
> Environment: I am running sls on a linux server (ubuntu-16.04). The
> hadoop version is 3.0.0
> Reporter: Wei Chen
> Priority: Minor
>
> Hi, I am using SLS to simulate a large-scale cluster, which consists of more
> than 100 nodes and runs more than 4k jobs. I found that MetricsLogRunnable
> (which periodically flushes real-time metrics to a file) stops working if SLS
> takes too long to load the SLS file.
> More specifically, the exception is thrown at the following loop in the
> function String generateRealTimeTrackingMetrics() in SLSWebApp.java:
> {code:java}
> for (String queue : wrapper.getQueueSet()) {
> ..........
> }
> {code}
>
> The exception is reported as:
> 2018-02-22 17:13:59,450 INFO sls.SLSRunner: newly creaed job:
> 6263127055conainer size: 10queue: default
> java.lang.NullPointerException
> at
> org.apache.hadoop.yarn.sls.web.SLSWebApp.generateRealTimeTrackingMetrics(SLSWebApp.java:438)
> at
> org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler$MetricsLogRunnable.run(SLSCapacityScheduler.java:724)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> So wrapper.getQueueSet() returns null, which causes the exception.
> After further analyzing the source code, we noticed the following in
> SLSRunner.java:
> {code:java}
> public void start() throws Exception {
>   // start resource manager
>   startRM();
>   // start node managers
>   startNM();
>   // start application masters
>   startAM();
>   // set queue & tracked apps information
>   ((SchedulerWrapper) rm.getResourceScheduler())
>       .setQueueSet(this.queueAppNumMap.keySet());
>   ((SchedulerWrapper) rm.getResourceScheduler())
>       .setTrackedAppSet(this.trackedApps);
>   // print out simulation info
>   printSimulationInfo();
>   // blocked until all nodes RUNNING
>   waitForNodesRunning();
>   // starting the runner once everything is ready to go,
>   runner.start();
> }
> {code}
> As you can see, the queue set used for tracking is set by
> ((SchedulerWrapper)rm.getResourceScheduler())
> .setQueueSet(this.queueAppNumMap.keySet()); which
> runs only after the RM, NM and AM initialization. By the time the queue set
> is set, the MetricsLogRunnable has already been launched. That is why the
> queue set is still unset (null) when it is read, which causes the
> NullPointerException.
>
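The ordering problem described above can be reproduced in isolation. The following is only a minimal sketch, not the actual patch for YARN-7964 or YARN-8632: MetricsSource and generateMetrics() are hypothetical stand-ins for the scheduler wrapper and for generateRealTimeTrackingMetrics(). It assumes the simplest defensive fix, which is to start from an empty queue set rather than null, so that a periodic task launched before setQueueSet() is called iterates over nothing instead of throwing. This may also explain why the logger stops for good: a task scheduled with scheduleAtFixedRate has its later executions suppressed once one execution throws.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class QueueSetGuardSketch {

  /** Hypothetical stand-in for the scheduler wrapper; not the real SLS class. */
  static class MetricsSource {
    // Starts as an empty set instead of null, so early readers never see null.
    private volatile Set<String> queueSet = Collections.emptySet();

    void setQueueSet(Set<String> queues) {
      // Defensive copy so later changes to the caller's set are not visible here.
      this.queueSet = Collections.unmodifiableSet(new LinkedHashSet<>(queues));
    }

    Set<String> getQueueSet() {
      return queueSet;
    }

    /** Builds one metrics line; safe to call before setQueueSet(). */
    String generateMetrics() {
      StringBuilder sb = new StringBuilder("{");
      for (String queue : getQueueSet()) {
        if (sb.length() > 1) {
          sb.append(',');
        }
        // The value 0 is a placeholder; real metrics are out of scope here.
        sb.append('"').append(queue).append("\":0");
      }
      return sb.append('}').toString();
    }
  }

  public static void main(String[] args) throws Exception {
    MetricsSource source = new MetricsSource();
    ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();

    // The periodic task starts before the queue set is populated, as in the report.
    pool.scheduleAtFixedRate(
        () -> System.out.println(source.generateMetrics()),
        0, 100, TimeUnit.MILLISECONDS);

    Thread.sleep(250); // simulates slow loading of a large job trace
    source.setQueueSet(Collections.singleton("default"));
    Thread.sleep(250);
    pool.shutdownNow();
  }
}
```

An alternative, under the same assumptions, is to call setQueueSet() before the metrics task is scheduled at all; the guard shown here has the advantage of not depending on startup order.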