[ https://issues.apache.org/jira/browse/YARN-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591456#comment-16591456 ]
Xianghao Lu commented on YARN-7964:
-----------------------------------

[~cxcw] This seems to be a duplicate of YARN-8632.

> Yarn Scheduler Load Simulator (SLS): MetricsLogRunnable stops working when
> there are too many jobs needed to load from sls
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-7964
>                 URL: https://issues.apache.org/jira/browse/YARN-7964
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: scheduler-load-simulator
>    Affects Versions: 2.7.5, 3.0.0
>         Environment: I am running sls on a linux server (ubuntu-16.04). The hadoop version is 3.0.0
>            Reporter: Wei Chen
>            Priority: Minor
>
> Hi, I am using SLS to simulate a large-scale cluster, which consists of more
> than 100 nodes and runs more than 4k jobs. I found that MetricsLogRunnable
> (which periodically flushes real-time metrics to a file) stops working if SLS
> takes too long to load the sls file.
> More specifically, the exception is thrown here, in the function
> String generateRealTimeTrackingMetrics() in SLSWebApp.java:
> {code:java}
> for (String queue : wrapper.getQueueSet()) {
>     ..........
> }
> {code}
>
> The exception is reported as:
> 2018-02-22 17:13:59,450 INFO sls.SLSRunner: newly creaed job: 6263127055conainer size: 10queue: default
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.sls.web.SLSWebApp.generateRealTimeTrackingMetrics(SLSWebApp.java:438)
>     at org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler$MetricsLogRunnable.run(SLSCapacityScheduler.java:724)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
>
> So wrapper.getQueueSet() returns null, which causes the exception.
> After further analyzing the source code, we noticed the following in SLSRunner.java:
> {code:java}
> public void start() throws Exception {
>   // start resource manager
>   startRM();
>   // start node managers
>   startNM();
>   // start application masters
>   startAM();
>   // set queue & tracked apps information
>   ((SchedulerWrapper) rm.getResourceScheduler())
>       .setQueueSet(this.queueAppNumMap.keySet());
>   ((SchedulerWrapper) rm.getResourceScheduler())
>       .setTrackedAppSet(this.trackedApps);
>   // print out simulation info
>   printSimulationInfo();
>   // blocked until all nodes RUNNING
>   waitForNodesRunning();
>   // starting the runner once everything is ready to go,
>   runner.start();
> }
> {code}
> As you can see, the queue set for tracking is set by
> ((SchedulerWrapper) rm.getResourceScheduler()).setQueueSet(this.queueAppNumMap.keySet()),
> which is done after RM, NM, and app initialization.
> Before the queue set is set, MetricsLogRunnable has already been launched.
> That is why the queue set is still null when the runnable reads it, which
> causes the NullPointerException.
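
For readers following the analysis in the quoted report, here is a minimal, self-contained Java sketch of the failure mode. It does not use the real SLS classes: QueueHolder, getQueueSetOrEmpty, render and countRuns are hypothetical names invented for illustration. It relies on the documented behaviour of ScheduledExecutorService.scheduleAtFixedRate, which suppresses all later executions once one run throws. The FutureTask.runAndReset and ScheduledThreadPoolExecutor frames in the stack trace suggest MetricsLogRunnable is scheduled in a similar way, which would explain why it stops for good after a single NullPointerException. The null guard shown is only one possible mitigation and is not claimed to be the fix adopted in YARN; setting the queue set before the runnable is scheduled would be another.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the scheduler wrapper; not the real SLS class.
class QueueHolder {
    private volatile Set<String> queueSet; // stays null until setQueueSet is called

    void setQueueSet(Set<String> queues) {
        this.queueSet = queues;
    }

    // Unguarded accessor: returns null while the queue set is unset.
    Set<String> getQueueSet() {
        return queueSet;
    }

    // Guarded accessor: reports an unset queue set as empty.
    Set<String> getQueueSetOrEmpty() {
        Set<String> current = queueSet;
        return current == null ? Collections.<String>emptySet() : current;
    }
}

class QueueSetGuardSketch {
    // Builds one line of output by iterating the queue names.
    static String render(QueueHolder holder, boolean guarded) {
        Set<String> queues = guarded ? holder.getQueueSetOrEmpty() : holder.getQueueSet();
        StringBuilder sb = new StringBuilder();
        for (String queue : queues) { // NullPointerException here if queues is null
            sb.append(queue).append(';');
        }
        return sb.toString();
    }

    // Schedules render() periodically against a holder whose queue set was
    // never installed, and returns how many times the task body started.
    static int countRuns(boolean guarded) {
        QueueHolder holder = new QueueHolder(); // queue set deliberately left unset
        AtomicInteger runs = new AtomicInteger();
        CountDownLatch threeRuns = new CountDownLatch(3);
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        ScheduledFuture<?> task = pool.scheduleAtFixedRate(() -> {
            runs.incrementAndGet();
            render(holder, guarded);
            threeRuns.countDown(); // reached only if render() did not throw
        }, 0, 10, TimeUnit.MILLISECONDS);
        try {
            // Stop once three runs succeeded or the task has failed.
            while (threeRuns.getCount() > 0 && !task.isDone()) {
                Thread.sleep(5);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        // Unguarded: the first run throws, so no further runs are scheduled.
        System.out.println("unguarded runs: " + countRuns(false));
        // Guarded: the task keeps running until the sketch shuts the pool down.
        System.out.println("guarded runs: " + countRuns(true));
    }
}
```

In the unguarded case the task body starts exactly once, matching the "stops working" symptom in the report; with the guard, the periodic task survives until the queue set becomes available.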