[jira] [Commented] (YARN-7964) Yarn Scheduler Load Simulator (SLS): MetricsLogRunnable stops working when there are too many jobs needed to load from sls

Xianghao Lu (JIRA) Fri, 24 Aug 2018 03:37:14 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591456#comment-16591456
 ]


Xianghao Lu commented on YARN-7964:
-----------------------------------

[~cxcw] This seems to be a duplicate of YARN-8632.

> Yarn Scheduler Load Simulator (SLS): MetricsLogRunnable stops working when 
> there are too many jobs needed to load from sls
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-7964
>                 URL: https://issues.apache.org/jira/browse/YARN-7964
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: scheduler-load-simulator
>    Affects Versions: 2.7.5, 3.0.0
>         Environment: I am running sls on a linux server (ubuntu-16.04). The 
> hadoop version is 3.0.0
>            Reporter: Wei Chen
>            Priority: Minor
>
> Hi, I am using SLS to simulate a large-scale cluster that consists of more
> than 100 nodes and runs more than 4k jobs. I found that MetricsLogRunnable
> (which periodically flushes real-time metrics to a file) stops working if
> SLS takes too long to load the sls file.
> More specifically, the exception is thrown here, in the function String
> generateRealTimeTrackingMetrics() in SLSWebApp.java:
> {code:java}
> for (String queue : wrapper.getQueueSet()) {
> ..........
> }
> {code}
>  
> The exception is reported as:
> 2018-02-22 17:13:59,450 INFO sls.SLSRunner: newly creaed job: 
> 6263127055conainer size: 10queue: default
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.sls.web.SLSWebApp.generateRealTimeTrackingMetrics(SLSWebApp.java:438)
> at 
> org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler$MetricsLogRunnable.run(SLSCapacityScheduler.java:724)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>  
> So wrapper.getQueueSet() returns null, which causes the exception.
> After further analyzing the source code, we noticed the following in
> SLSRunner.java:
> {code:java}
> public void start() throws Exception {
>     // start resource manager
>     startRM();
>     // start node managers
>     startNM();
>     // start application masters
>     startAM();
>     // set queue & tracked apps information
>     ((SchedulerWrapper) rm.getResourceScheduler())
>                             .setQueueSet(this.queueAppNumMap.keySet());
>     ((SchedulerWrapper) rm.getResourceScheduler())
>                             .setTrackedAppSet(this.trackedApps);
>     // print out simulation info
>     printSimulationInfo();
>     // blocked until all nodes RUNNING
>     waitForNodesRunning();
>     // starting the runner once everything is ready to go,
>     runner.start();
>   }
> {code}
> As you can see, the queue set used for tracking is assigned by
> ((SchedulerWrapper)rm.getResourceScheduler())
>                             .setQueueSet(this.queueAppNumMap.keySet()); which
> happens only after RM, NM and AM initialization. By the time the queue set
> is assigned, MetricsLogRunnable has already been launched. That is why the
> queue set is still unset (null) on its first run, which causes the
> NullPointerException.
>  
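The race described in the report can be reproduced outside SLS with a minimal sketch. Everything in it is hypothetical and for illustration only: the class QueueSetRaceSketch, its static queueSet field, and the two metrics methods are stand-ins, not SLS code, and the null guard is just one possible mitigation, not necessarily the fix that was committed. The sketch relies on the documented behaviour of ScheduledExecutorService.scheduleAtFixedRate, which suppresses all later executions once a run throws. That would explain why the metrics logging stops for good rather than recovering once the queue set is assigned.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class QueueSetRaceSketch {
    // Hypothetical stand-in for the scheduler wrapper's queue set:
    // it stays null until the (slow) startup sequence assigns it.
    static volatile Set<String> queueSet = null;

    // Mirrors the failing loop: iterating over a null set throws
    // NullPointerException.
    static String unguardedMetrics(Set<String> queues) {
        StringBuilder sb = new StringBuilder();
        for (String queue : queues) {
            sb.append(queue).append(';');
        }
        return sb.toString();
    }

    // One possible guard (an assumption, not the committed fix):
    // report nothing until the queue set has been assigned.
    static String guardedMetrics(Set<String> queues) {
        return queues == null ? "" : unguardedMetrics(queues);
    }

    // Starts a periodic task before the queue set exists, assigns the
    // queue set late, and returns how many times the task actually ran.
    static int countRuns(boolean guarded) {
        queueSet = null;
        AtomicInteger runs = new AtomicInteger();
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);
        pool.scheduleAtFixedRate(() -> {
            runs.incrementAndGet();
            if (guarded) {
                guardedMetrics(queueSet);
            } else {
                unguardedMetrics(queueSet); // throws while queueSet is null
            }
        }, 0, 10, TimeUnit.MILLISECONDS);
        try {
            Thread.sleep(100);                          // simulated slow job loading
            queueSet = Collections.singleton("default"); // queue set assigned late
            Thread.sleep(100);
            pool.shutdownNow();
            pool.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println("unguarded runs: " + countRuns(false));
        System.out.println("guarded runs:   " + countRuns(true));
    }
}
```

In the unguarded variant the task runs once, throws, and is never scheduled again, even after the queue set is assigned. In the guarded variant it keeps running and starts reporting queues once they are available.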



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
