Hi,

We are running kylin v1.1.1 on our cluster. Over the time we have
observed degradation in cube building process. By debugging we have
confirmed that all the Executable, are running fine, the issue lies in
DefaultScheduler.FetcherRunner. By default FetcherRunner is scheduled to
run at every 60 sec, but in our case the lag is ~ 20 - 30 min.

[kylin@ec2-184-169-138-213 logs]$ tail -1000f kylin_job.log | grep "Job
Fetcher"

[pool-6-thread-1]:[2016-01-24
*06:12:53*,465][INFO][org.apache.kylin.job.impl.threadpool.DefaultScheduler$FetcherRunner.run(DefaultScheduler.java:112)]
- Job Fetcher: 0 running, 1 actual running, 4 ready, 533 others

[pool-6-thread-1]:[2016-01-24
*06:34:15*,518][INFO][org.apache.kylin.job.impl.threadpool.DefaultScheduler$FetcherRunner.run(DefaultScheduler.java:112)]
- Job Fetcher: 0 running, 1 actual running, 4 ready, 533 others

[pool-6-thread-1]:[2016-01-24
*06:55:46*,133][INFO][org.apache.kylin.job.impl.threadpool.DefaultScheduler$FetcherRunner.run(DefaultScheduler.java:112)]
- Job Fetcher: 0 running, 1 actual running, 4 ready, 533
others[pool-6-thread-1]:[2016-01-24
07:18:52,040][INFO][org.apache.kylin.job.impl.threadpool.DefaultScheduler$FetcherRunner.run(DefaultScheduler.java:112)]
- Job Fetcher: 0 running, 1 actual running, 4 ready, 533 others

*1. *One key number in the log is *533 others, *which I suspect mostly
contains jobs with state "SUCCEED" (Have not verified). The reason behind
the guess is that we are not deleting the jobs after they get completed.
There is a method  ExecutableManager.deleteJob but as of now it is being
called only from Test module.

*2. *Over time we have observed the number in *Others* column grow. From
thread dump it looks like FetcherRunner most of the time is fetching
jobOutput.

Because of above the two reason I suspect build of jobs with state "
SUCCEED"/ DISCARDED" might be the reason behind performance degradation.

Please provide feedback on wether my hypothesis have some merit.


*Thread dump:*

"pool-6-thread-1" #819 prio=5 os_prio=0 tid=0x00007fe264591000 nid=0x25c9
in Object.wait() [0x00007fe237a9d000]

   java.lang.Thread.State: TIMED_WAITING (on object monitor)

        at java.lang.Object.wait(Native Method)

        at java.lang.Object.wait(Object.java:460)

        at java.util.concurrent.TimeUnit.timedWait(TimeUnit.java:348)

        at
org.apache.hadoop.hbase.client.ResultBoundedCompletionService.poll(ResultBoundedCompletionService.java:155)

        - locked <0x00000003d8026808> (a
[Lorg.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture;)

        at
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:168)

        at
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:57)

        at
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)

        at
org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:293)

        at
org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:393)

        at
org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:337)

        at
org.apache.hadoop.hbase.client.AbstractClientScanner$1.hasNext(AbstractClientScanner.java:94)

        at
org.apache.kylin.common.persistence.HBaseResourceStore.getByScan(HBaseResourceStore.java:283)

        at
org.apache.kylin.common.persistence.HBaseResourceStore.getResourceImpl(HBaseResourceStore.java:201)

        at
org.apache.kylin.common.persistence.ResourceStore.getResource(ResourceStore.java:130)

        at
org.apache.kylin.job.dao.ExecutableDao.readJobOutputResource(ExecutableDao.java:90)

        at
org.apache.kylin.job.dao.ExecutableDao.getJobOutput(ExecutableDao.java:179)

        at
org.apache.kylin.job.manager.ExecutableManager.getOutput(ExecutableManager.java:119)

        at org.apache.kylin.job.impl.threadpool.DefaultScheduler$
FetcherRunner.run(DefaultScheduler.java:93)

        - locked <0x0000000482aa9c98> (a
org.apache.kylin.job.impl.threadpool.DefaultScheduler$FetcherRunner)

        at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)

        at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)

        at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)

        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

        at java.lang.Thread.run(Thread.java:745)


Regards

Prashant

Reply via email to