The metadata cleanup helped partially. @ShaoFeng, as suspected, the root cause was not related to Kylin. We tuned HBase, which eventually resolved the issue.
Regards
Prashant

On Mon, Jan 25, 2016 at 11:02 AM, ShaoFeng Shi <[email protected]> wrote:

> The "Other" jobs here are the inactive jobs, including completed and
> discarded ones. As time goes on, their number increases. If you find that
> the "Jobs" tab on the web UI loads very slowly, you'd better do a cleanup;
> the job engine does the same thing, retrieving the jobs from HBase. I'm not
> sure whether that is the root cause in your case, because 20 ~ 30 minutes
> is too long for loading 500 records, but it is worth a try.
>
> From v1.2, running "./bin/metadata clean --delete true" will also drop those
> old inactive jobs (older than 30 days, as I remember). Before doing that,
> taking a metadata backup is a must; please check
> https://kylin.apache.org/docs/howto/howto_backup_metadata.html
>
> For v1.1, I'm afraid there is no simple way to clean up the old jobs.
>
> 2016-01-25 1:10 GMT+08:00 Prashant Prakash <[email protected]>:
>
>> Hi,
>>
>> We are running Kylin v1.1.1 on our cluster. Over time we have observed
>> degradation in the cube building process. By debugging we have confirmed
>> that all the Executables are running fine; the issue lies in
>> DefaultScheduler.FetcherRunner. By default FetcherRunner is scheduled to
>> run every 60 sec, but in our case the lag is ~ 20 - 30 min.
>>
>> [kylin@ec2-184-169-138-213 logs]$ tail -1000f kylin_job.log | grep "Job Fetcher"
>>
>> [pool-6-thread-1]:[2016-01-24 *06:12:53*,465][INFO][org.apache.kylin.job.impl.threadpool.DefaultScheduler$FetcherRunner.run(DefaultScheduler.java:112)] - Job Fetcher: 0 running, 1 actual running, 4 ready, 533 others
>>
>> [pool-6-thread-1]:[2016-01-24 *06:34:15*,518][INFO][org.apache.kylin.job.impl.threadpool.DefaultScheduler$FetcherRunner.run(DefaultScheduler.java:112)] - Job Fetcher: 0 running, 1 actual running, 4 ready, 533 others
>>
>> [pool-6-thread-1]:[2016-01-24 *06:55:46*,133][INFO][org.apache.kylin.job.impl.threadpool.DefaultScheduler$FetcherRunner.run(DefaultScheduler.java:112)] - Job Fetcher: 0 running, 1 actual running, 4 ready, 533 others
>>
>> [pool-6-thread-1]:[2016-01-24 07:18:52,040][INFO][org.apache.kylin.job.impl.threadpool.DefaultScheduler$FetcherRunner.run(DefaultScheduler.java:112)] - Job Fetcher: 0 running, 1 actual running, 4 ready, 533 others
>>
>> *1.* One key number in the log is *533 others*, which I suspect mostly
>> consists of jobs in state "SUCCEED" (not verified). The reason behind the
>> guess is that we do not delete jobs after they complete. There is a method
>> ExecutableManager.deleteJob, but as of now it is called only from the
>> test module.
>>
>> *2.* Over time we have observed the number in the *others* column grow.
>> From the thread dump it looks like FetcherRunner spends most of its time
>> fetching jobOutput.
>>
>> For the above two reasons, I suspect the build-up of jobs with state
>> "SUCCEED"/"DISCARDED" might be the reason behind the performance
>> degradation.
>>
>> Please provide feedback on whether my hypothesis has some merit.
>>
>>
>> *Thread dump:*
>>
>> "pool-6-thread-1" #819 prio=5 os_prio=0 tid=0x00007fe264591000 nid=0x25c9 in Object.wait() [0x00007fe237a9d000]
>>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>>         at java.lang.Object.wait(Native Method)
>>         at java.lang.Object.wait(Object.java:460)
>>         at java.util.concurrent.TimeUnit.timedWait(TimeUnit.java:348)
>>         at org.apache.hadoop.hbase.client.ResultBoundedCompletionService.poll(ResultBoundedCompletionService.java:155)
>>         - locked <0x00000003d8026808> (a [Lorg.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture;)
>>         at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:168)
>>         at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:57)
>>         at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
>>         at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:293)
>>         at org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:393)
>>         at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:337)
>>         at org.apache.hadoop.hbase.client.AbstractClientScanner$1.hasNext(AbstractClientScanner.java:94)
>>         at org.apache.kylin.common.persistence.HBaseResourceStore.getByScan(HBaseResourceStore.java:283)
>>         at org.apache.kylin.common.persistence.HBaseResourceStore.getResourceImpl(HBaseResourceStore.java:201)
>>         at org.apache.kylin.common.persistence.ResourceStore.getResource(ResourceStore.java:130)
>>         at org.apache.kylin.job.dao.ExecutableDao.readJobOutputResource(ExecutableDao.java:90)
>>         at org.apache.kylin.job.dao.ExecutableDao.getJobOutput(ExecutableDao.java:179)
>>         at org.apache.kylin.job.manager.ExecutableManager.getOutput(ExecutableManager.java:119)
>>         at org.apache.kylin.job.impl.threadpool.DefaultScheduler$FetcherRunner.run(DefaultScheduler.java:93)
>>         - locked <0x0000000482aa9c98> (a org.apache.kylin.job.impl.threadpool.DefaultScheduler$FetcherRunner)
>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>         at java.lang.Thread.run(Thread.java:745)
>>
>>
>> Regards
>>
>> Prashant
>
> --
> Best regards,
>
> Shaofeng Shi
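
For anyone who wants to quantify the fetcher lag described in this thread, here is a rough sketch in Python that computes the time between consecutive "Job Fetcher" entries. It is only an illustration and not part of Kylin: the function name `fetcher_gaps` and the `SAMPLE` list are hypothetical, and it assumes each entry sits on a single line with the `[YYYY-MM-DD HH:MM:SS,mmm]` timestamp format seen in the excerpt above. The sample lines are the four entries from the thread with the mail client's emphasis markers removed.

```python
import re
from datetime import datetime

# Timestamp as it appears in the kylin_job.log excerpt, e.g. [2016-01-24 06:12:53,465]
TIMESTAMP = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]")


def fetcher_gaps(lines):
    """Return the seconds elapsed between consecutive "Job Fetcher" log entries.

    Hypothetical helper for illustration; lines without a "Job Fetcher"
    message or without a parsable timestamp are skipped.
    """
    stamps = []
    for line in lines:
        if "Job Fetcher" not in line:
            continue
        match = TIMESTAMP.search(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f"))
    # Pair each entry with the next one and measure the distance between them.
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]


# Assumed single-line form of the four entries quoted in the thread.
PREFIX = "[pool-6-thread-1]:"
SUFFIX = (
    "[INFO][org.apache.kylin.job.impl.threadpool.DefaultScheduler$FetcherRunner"
    ".run(DefaultScheduler.java:112)] - Job Fetcher: 0 running, 1 actual running, "
    "4 ready, 533 others"
)
SAMPLE = [
    PREFIX + "[2016-01-24 06:12:53,465]" + SUFFIX,
    PREFIX + "[2016-01-24 06:34:15,518]" + SUFFIX,
    PREFIX + "[2016-01-24 06:55:46,133]" + SUFFIX,
    PREFIX + "[2016-01-24 07:18:52,040]" + SUFFIX,
]

if __name__ == "__main__":
    for gap in fetcher_gaps(SAMPLE):
        print(f"{gap / 60:.1f} min between fetcher runs")
```

On the four entries above this reports gaps of roughly 21.4, 21.5 and 23.1 minutes, against the 60-second schedule mentioned in the original report. To run it against a real log, pass the open file object instead of `SAMPLE`.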
