The metadata clean up helped partially.

@ShaoFeng, as suspected the root cause was not related to kylin. We tried
to tune HBASE, which eventually helped in resolving the issue.

Regards
Prashant

On Mon, Jan 25, 2016 at 11:02 AM, ShaoFeng Shi <[email protected]>
wrote:

> The "Other" jobs here means those inactive jobs, including completed and
> discarded. As time goes on, the number increases. If you found the "Jobs"
> tab on web UI loading is very slow, you'd better do a cleanup; The job
> engine is doing the same thing, retrieving the jobs from HBase. I'm not
> sure whether that is the root cause in your case, because 20 ~ 30 minutes
> is too long a time for loading 500 records, but it worth a try.
>
> From v1.2, run "./bin/metadata clean --delete true" will also drop those
> old inactive jobs (old than 30 days I remeber). Before doing that, take a
> metadata backup is a must, please check
> https://kylin.apache.org/docs/howto/howto_backup_metadata.html
>
> For v1.1, I'm afraid there is no simple way to cleanup the old jobs.
>
> 2016-01-25 1:10 GMT+08:00 Prashant Prakash <[email protected]>:
>
>> Hi,
>>
>> We are running kylin v1.1.1 on our cluster. Over the time we have
>> observed degradation in cube building process. By debugging we have
>> confirmed that all the Executable, are running fine, the issue lies in
>> DefaultScheduler.FetcherRunner. By default FetcherRunner is scheduled to
>> run at every 60 sec, but in our case the lag is ~ 20 - 30 min.
>>
>> [kylin@ec2-184-169-138-213 logs]$ tail -1000f kylin_job.log | grep "Job
>> Fetcher"
>>                                                               
>> [pool-6-thread-1]:[2016-01-24
>> *06:12:53*,465][INFO][org.apache.kylin.job.impl.threadpool.DefaultScheduler$FetcherRunner.run(DefaultScheduler.java:112)]
>> - Job Fetcher: 0 running, 1 actual running, 4 ready, 533 others
>>
>> [pool-6-thread-1]:[2016-01-24 
>> *06:34:15*,518][INFO][org.apache.kylin.job.impl.threadpool.DefaultScheduler$FetcherRunner.run(DefaultScheduler.java:112)]
>> - Job Fetcher: 0 running, 1 actual running, 4 ready, 533 others
>>
>> [pool-6-thread-1]:[2016-01-24 
>> *06:55:46*,133][INFO][org.apache.kylin.job.impl.threadpool.DefaultScheduler$FetcherRunner.run(DefaultScheduler.java:112)]
>> - Job Fetcher: 0 running, 1 actual running, 4 ready, 533
>> others[pool-6-thread-1]:[2016-01-24
>> 07:18:52,040][INFO][org.apache.kylin.job.impl.threadpool.DefaultScheduler$FetcherRunner.run(DefaultScheduler.java:112)]
>> - Job Fetcher: 0 running, 1 actual running, 4 ready, 533 others
>>
>> *1. *One key number in the log is *533 others, *which I suspect mostly
>> contains jobs with state "SUCCEED" (Have not verified). The reason
>> behind the guess is that we are not deleting the jobs after they get
>> completed. There is a method  ExecutableManager.deleteJob but as of now
>> it is being called only from Test module.
>>
>> *2. *Over time we have observed the number in *Others* column grow. From
>> thread dump it looks like FetcherRunner most of the time is fetching
>> jobOutput.
>>
>> Because of above the two reason I suspect build of jobs with state "
>> SUCCEED"/ DISCARDED" might be the reason behind performance degradation.
>>
>> Please provide feedback on wether my hypothesis have some merit.
>>
>>
>> *Thread dump:*
>>
>> "pool-6-thread-1" #819 prio=5 os_prio=0 tid=0x00007fe264591000 nid=0x25c9
>> in Object.wait() [0x00007fe237a9d000]
>>
>>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>>
>>         at java.lang.Object.wait(Native Method)
>>
>>         at java.lang.Object.wait(Object.java:460)
>>
>>         at java.util.concurrent.TimeUnit.timedWait(TimeUnit.java:348)
>>
>>         at
>> org.apache.hadoop.hbase.client.ResultBoundedCompletionService.poll(ResultBoundedCompletionService.java:155)
>>
>>         - locked <0x00000003d8026808> (a
>> [Lorg.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture;)
>>
>>         at
>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:168)
>>
>>         at
>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:57)
>>
>>         at
>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
>>
>>         at
>> org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:293)
>>
>>         at
>> org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:393)
>>
>>         at
>> org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:337)
>>
>>         at
>> org.apache.hadoop.hbase.client.AbstractClientScanner$1.hasNext(AbstractClientScanner.java:94)
>>
>>         at
>> org.apache.kylin.common.persistence.HBaseResourceStore.getByScan(HBaseResourceStore.java:283)
>>
>>         at
>> org.apache.kylin.common.persistence.HBaseResourceStore.getResourceImpl(HBaseResourceStore.java:201)
>>
>>         at
>> org.apache.kylin.common.persistence.ResourceStore.getResource(ResourceStore.java:130)
>>
>>         at
>> org.apache.kylin.job.dao.ExecutableDao.readJobOutputResource(ExecutableDao.java:90)
>>
>>         at
>> org.apache.kylin.job.dao.ExecutableDao.getJobOutput(ExecutableDao.java:179)
>>
>>         at
>> org.apache.kylin.job.manager.ExecutableManager.getOutput(ExecutableManager.java:119)
>>
>>         at org.apache.kylin.job.impl.threadpool.DefaultScheduler$
>> FetcherRunner.run(DefaultScheduler.java:93)
>>
>>         - locked <0x0000000482aa9c98> (a
>> org.apache.kylin.job.impl.threadpool.DefaultScheduler$FetcherRunner)
>>
>>         at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>
>>         at
>> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>>
>>         at
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>>
>>         at
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>>
>>         at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>
>>         at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>
>>         at java.lang.Thread.run(Thread.java:745)
>>
>>
>> Regards
>>
>> Prashant
>>
>
>
>
> --
> Best regards,
>
> Shaofeng Shi
>
>

Reply via email to