Kylin not only uses HBase for cube storage but also uses it for metadata persistence (cube definitions, job status, etc.).
When reading or writing metadata, Kylin expects the operation to finish quickly. HBase's default timeout is 1 minute, which is too long for this case, so HBaseResourceStore overrides it to 5 seconds. The error you got happens when Kylin tries to read job output from HBase and the call times out. Possible causes include network latency, an HBase runtime issue, etc. The problem may disappear later, at which point Kylin recovers. To find the root cause, you need to check the HBase master and region server logs. The "hbase.rpc.timeout=3600000" setting allows HBase to wait while the cube (hundreds of GB) is uploaded to S3; that is a different scenario, so the two settings do not conflict.

2017-12-17 18:32 GMT+08:00 jxs <[email protected]>:

> Well, I found no other timeout or HBase RPC related errors other than
> these "JobFetcher/DefaultScheduler" timeout errors.
> And I am using HDFS for HBase storage, not S3, so I guess it is not related
> to the setting "hbase.rpc.timeout": "3600000" specified in the doc.
> Also, when the building job failed on a step, if I clicked "Resume" in the
> Kylin Web UI, it showed the step had been done and went on to the next step.
>
> If the full log is needed, please let me know and I will post it.
>
> On 2017-12-17 at 17:21, "Billy Liu" <[email protected]> wrote:
>
> Actually, in your questions there are two HBase timeouts. One is about the
> cube build, the other one is metadata access.
> For the first issue, please check this article:
> http://kylin.apache.org/docs21/install/kylin_aws_emr.html
> It introduces how to increase the HBase RPC timeout.
> For the second issue, as in the previous discussion, we should keep it.
>
> 2017-12-17 10:37 GMT+08:00 jxs <[email protected]>:
>
>> Hi Billy,
>> Thank you for pointing out the previous discussion. But for now we are
>> running a very small HBase cluster for lower cost, which has only one
>> slave node.
>> So the unsteady response time (within a range that is not too bad, e.g.
>> within 1 minute) is somewhat acceptable.
>> The previous timeout error interrupted the cube building procedure, and
>> we don't want that.
>> What is your suggestion for this use case?
>>
>> On 2017-12-16 at 11:48, "Billy Liu" <[email protected]> wrote:
>>
>> Check this: http://apache-kylin.74782.x6.nabble.com/hbase-configed-with-fixed-value-td9241.html
>>
>> 2017-12-15 18:03 GMT+08:00 jxs <[email protected]>:
>>
>>> Hi,
>>>
>>> Finally, I found this in org.apache.kylin.storage.hbase.HBaseResourceStore:
>>>
>>> ```
>>> private StorageURL buildMetadataUrl(KylinConfig kylinConfig) throws IOException {
>>>     StorageURL url = kylinConfig.getMetadataUrl();
>>>     if (!url.getScheme().equals("hbase"))
>>>         throw new IOException("Cannot create HBaseResourceStore. Url not match. Url: " + url);
>>>
>>>     // control timeout for prompt error report
>>>     Map<String, String> newParams = new LinkedHashMap<>();
>>>     newParams.put("hbase.client.scanner.timeout.period", "10000");
>>>     newParams.put("hbase.rpc.timeout", "5000");
>>>     newParams.put("hbase.client.retries.number", "1");
>>>     newParams.putAll(url.getAllParameters());
>>>
>>>     return url.copy(newParams);
>>> }
>>> ```
>>> Is this related to the timeout error? Why are these params hardcoded
>>> instead of read from configuration? Is there any workaround for this
>>> timeout error?
>>>
>>> On 2017-12-15 at 16:03, "jxs" <[email protected]> wrote:
>>>
>>> Hi, Kylin users,
>>>
>>> I encountered a strange timeout error today when building a cube.
>>>
>>> By "strange", I mean the "hbase.rpc.timeout" configuration is set to
>>> 60000 in HBase, but I get "org.apache.hadoop.hbase.ipc.CallTimeoutException:
>>> Call id=8099904, waitTime=5001, operationTimeout=5000 expired" errors.
>>>
>>> Kylin version 2.2.0, running on EMR. It ran without error for about
>>> half a month, then suddenly stopped working; the current cube is not
>>> the biggest one.
>>> I am wondering where I should look. Any help is appreciated.
>>>
>>> The traceback from the log:
>>>
>>> ```
>>> 2017-12-15 06:46:57,892 ERROR [Scheduler 2090031901 Job c9067736-eac7-48ad-88f3-dbd6f4e870ae-167] execution.ExecutableManager:149 : fail to get job output:c9067736-eac7-48ad-88f3-dbd6f4e870ae-14
>>> org.apache.kylin.job.exception.PersistentException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=1, exceptions:
>>> Fri Dec 15 14:46:57 GMT+08:00 2017, RpcRetryingCaller{globalStartTime=1513320412890, pause=100, retries=1}, java.io.IOException: Call to ip-172-31-5-71.cn-north-1.compute.internal/172.31.5.71:16020 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=8099904, waitTime=5001, operationTimeout=5000 expired.
>>>
>>>     at org.apache.kylin.job.dao.ExecutableDao.getJobOutput(ExecutableDao.java:202)
>>>     at org.apache.kylin.job.execution.ExecutableManager.getOutput(ExecutableManager.java:145)
>>>     at org.apache.kylin.job.execution.AbstractExecutable.getOutput(AbstractExecutable.java:312)
>>>     at org.apache.kylin.job.execution.AbstractExecutable.isDiscarded(AbstractExecutable.java:392)
>>>     at org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:149)
>>>     at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:125)
>>>     at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:64)
>>>     at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:125)
>>>     at org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:144)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>     at java.lang.Thread.run(Thread.java:748)
>>> Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=1, exceptions:
>>> Fri Dec 15 14:46:57 GMT+08:00 2017, RpcRetryingCaller{globalStartTime=1513320412890, pause=100, retries=1}, java.io.IOException: Call to ip-172-31-5-71.cn-north-1.compute.internal/172.31.5.71:16020 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=8099904, waitTime=5001, operationTimeout=5000 expired.
>>> ```
>>>
>>
>

--
Best regards,

Shaofeng Shi 史少锋
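[Editor's note] One detail in the `buildMetadataUrl()` snippet quoted above is worth spelling out: `newParams.putAll(url.getAllParameters())` runs *after* the hardcoded `put` calls, so any parameter carried on the metadata URL replaces the 5-second default rather than the other way around. Whether Kylin 2.2.0 actually accepts these parameters on `kylin.metadata.url` is not confirmed in this thread, but the override order itself is plain `java.util.Map` behavior and can be sketched standalone (the `"30000"` value is a hypothetical user-supplied timeout):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ParamOverrideDemo {
    public static void main(String[] args) {
        // Hardcoded defaults, mirroring HBaseResourceStore.buildMetadataUrl()
        Map<String, String> newParams = new LinkedHashMap<>();
        newParams.put("hbase.client.scanner.timeout.period", "10000");
        newParams.put("hbase.rpc.timeout", "5000");
        newParams.put("hbase.client.retries.number", "1");

        // Hypothetical parameters parsed from the metadata URL
        Map<String, String> urlParams = new LinkedHashMap<>();
        urlParams.put("hbase.rpc.timeout", "30000");

        // putAll runs last, so URL-supplied values replace the defaults
        newParams.putAll(urlParams);

        System.out.println(newParams.get("hbase.rpc.timeout")); // prints 30000
    }
}
```

So the parameters are hardcoded as *defaults*, not as an absolute cap, which matches the "prompt error report" comment in the original source.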
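[Editor's note] For the other timeout discussed above (the cube-build upload, not metadata access), the linked EMR article raises `hbase.rpc.timeout` at the cluster level. On EMR this is normally done through a configuration classification; the fragment below is a sketch, assuming the standard `hbase-site` classification and the 3600000 ms value quoted in the thread, not a verified excerpt from the article:

```
[
  {
    "Classification": "hbase-site",
    "Properties": {
      "hbase.rpc.timeout": "3600000"
    }
  }
]
```

This only affects the cluster-wide HBase client configuration; per the discussion above, the metadata path keeps its own short timeout regardless.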
