Hi Prashant,
Could you give more detail about the root cause and how you fix it?
I think it will benefit others when they run to same issue later.
Thanks.
Luke
Best Regards!
---------------------
Luke Han
On Wed, Jan 27, 2016 at 7:57 PM, Prashant Prakash <[email protected]>
wrote:
> The metadata clean up helped partially.
>
> @ShaoFeng, as suspected the root cause was not related to kylin. We tried
> to tune HBASE, which eventually helped in resolving the issue.
>
> Regards
> Prashant
>
> On Mon, Jan 25, 2016 at 11:02 AM, ShaoFeng Shi <[email protected]>
> wrote:
>
>> The "Other" jobs here means those inactive jobs, including completed and
>> discarded. As time goes on, the number increases. If you found the "Jobs"
>> tab on web UI loading is very slow, you'd better do a cleanup; The job
>> engine is doing the same thing, retrieving the jobs from HBase. I'm not
>> sure whether that is the root cause in your case, because 20 ~ 30 minutes
>> is too long a time for loading 500 records, but it worth a try.
>>
>> From v1.2, run "./bin/metadata clean --delete true" will also drop those
>> old inactive jobs (old than 30 days I remeber). Before doing that, take a
>> metadata backup is a must, please check
>> https://kylin.apache.org/docs/howto/howto_backup_metadata.html
>>
>> For v1.1, I'm afraid there is no simple way to cleanup the old jobs.
>>
>> 2016-01-25 1:10 GMT+08:00 Prashant Prakash <[email protected]>:
>>
>>> Hi,
>>>
>>> We are running kylin v1.1.1 on our cluster. Over the time we have
>>> observed degradation in cube building process. By debugging we have
>>> confirmed that all the Executable, are running fine, the issue lies in
>>> DefaultScheduler.FetcherRunner. By default FetcherRunner is scheduled
>>> to run at every 60 sec, but in our case the lag is ~ 20 - 30 min.
>>>
>>> [kylin@ec2-184-169-138-213 logs]$ tail -1000f kylin_job.log | grep "Job
>>> Fetcher"
>>>
>>> [pool-6-thread-1]:[2016-01-24
>>> *06:12:53*,465][INFO][org.apache.kylin.job.impl.threadpool.DefaultScheduler$FetcherRunner.run(DefaultScheduler.java:112)]
>>> - Job Fetcher: 0 running, 1 actual running, 4 ready, 533 others
>>>
>>> [pool-6-thread-1]:[2016-01-24
>>> *06:34:15*,518][INFO][org.apache.kylin.job.impl.threadpool.DefaultScheduler$FetcherRunner.run(DefaultScheduler.java:112)]
>>> - Job Fetcher: 0 running, 1 actual running, 4 ready, 533 others
>>>
>>> [pool-6-thread-1]:[2016-01-24
>>> *06:55:46*,133][INFO][org.apache.kylin.job.impl.threadpool.DefaultScheduler$FetcherRunner.run(DefaultScheduler.java:112)]
>>> - Job Fetcher: 0 running, 1 actual running, 4 ready, 533
>>> others[pool-6-thread-1]:[2016-01-24
>>> 07:18:52,040][INFO][org.apache.kylin.job.impl.threadpool.DefaultScheduler$FetcherRunner.run(DefaultScheduler.java:112)]
>>> - Job Fetcher: 0 running, 1 actual running, 4 ready, 533 others
>>>
>>> *1. *One key number in the log is *533 others, *which I suspect mostly
>>> contains jobs with state "SUCCEED" (Have not verified). The reason
>>> behind the guess is that we are not deleting the jobs after they get
>>> completed. There is a method ExecutableManager.deleteJob but as of now
>>> it is being called only from Test module.
>>>
>>> *2. *Over time we have observed the number in *Others* column grow.
>>> From thread dump it looks like FetcherRunner most of the time is
>>> fetching jobOutput.
>>>
>>> Because of above the two reason I suspect build of jobs with state "
>>> SUCCEED"/ DISCARDED" might be the reason behind performance
>>> degradation.
>>>
>>> Please provide feedback on wether my hypothesis have some merit.
>>>
>>>
>>> *Thread dump:*
>>>
>>> "pool-6-thread-1" #819 prio=5 os_prio=0 tid=0x00007fe264591000
>>> nid=0x25c9 in Object.wait() [0x00007fe237a9d000]
>>>
>>> java.lang.Thread.State: TIMED_WAITING (on object monitor)
>>>
>>> at java.lang.Object.wait(Native Method)
>>>
>>> at java.lang.Object.wait(Object.java:460)
>>>
>>> at java.util.concurrent.TimeUnit.timedWait(TimeUnit.java:348)
>>>
>>> at
>>> org.apache.hadoop.hbase.client.ResultBoundedCompletionService.poll(ResultBoundedCompletionService.java:155)
>>>
>>> - locked <0x00000003d8026808> (a
>>> [Lorg.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture;)
>>>
>>> at
>>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:168)
>>>
>>> at
>>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:57)
>>>
>>> at
>>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
>>>
>>> at
>>> org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:293)
>>>
>>> at
>>> org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:393)
>>>
>>> at
>>> org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:337)
>>>
>>> at
>>> org.apache.hadoop.hbase.client.AbstractClientScanner$1.hasNext(AbstractClientScanner.java:94)
>>>
>>> at
>>> org.apache.kylin.common.persistence.HBaseResourceStore.getByScan(HBaseResourceStore.java:283)
>>>
>>> at
>>> org.apache.kylin.common.persistence.HBaseResourceStore.getResourceImpl(HBaseResourceStore.java:201)
>>>
>>> at
>>> org.apache.kylin.common.persistence.ResourceStore.getResource(ResourceStore.java:130)
>>>
>>> at
>>> org.apache.kylin.job.dao.ExecutableDao.readJobOutputResource(ExecutableDao.java:90)
>>>
>>> at
>>> org.apache.kylin.job.dao.ExecutableDao.getJobOutput(ExecutableDao.java:179)
>>>
>>> at
>>> org.apache.kylin.job.manager.ExecutableManager.getOutput(ExecutableManager.java:119)
>>>
>>> at org.apache.kylin.job.impl.threadpool.DefaultScheduler$
>>> FetcherRunner.run(DefaultScheduler.java:93)
>>>
>>> - locked <0x0000000482aa9c98> (a
>>> org.apache.kylin.job.impl.threadpool.DefaultScheduler$FetcherRunner)
>>>
>>> at
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>
>>> at
>>> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>>>
>>> at
>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>>>
>>> at
>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>>>
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>>
>>> Regards
>>>
>>> Prashant
>>>
>>
>>
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi
>>
>>
>