Kylin not only uses HBase for cube storage but also uses it for metadata persistence (cube definitions, job status, etc.).
When reading or writing metadata, Kylin expects the operation to finish quickly. HBase's default timeout is 1 minute, which is too long for this case, so HBaseResourceStore overrides it to 5 seconds. The error you got happens when Kylin tries to read job output from HBase and the call times out. Possible causes include network latency, an HBase runtime issue, etc. The problem may disappear later, at which point Kylin recovers. To find the root cause, you need to check the HBase master and region server logs. The "hbase.rpc.timeout=3600000" setting allows HBase to wait while the cube (hundreds of GB) is uploaded to S3; that is a different scenario, so the two settings do not conflict.

2017-12-17 18:32 GMT+08:00 jxs <[email protected]>:

> Well, I found no other timeout or HBase RPC related errors other than
> these "JobFetcher/DefaultScheduler" timeout errors.
> And I am using HDFS for HBase storage, not S3, so I guess it is not related
> to the setting "hbase.rpc.timeout": "3600000" specified in the doc.
> Also, when the building job failed on a step, if I clicked "Resume" in the
> Kylin Web UI, it showed the step had been done and went on to the next step.
>
> If the full log is needed, please let me know and I will post it.
>
> On 2017-12-17 at 17:21, "Billy Liu" <[email protected]> wrote:
>
> Actually, in your questions there are two HBase timeouts. One is about the
> cube build, the other one is metadata access.
> For the first issue, please check this article:
> http://kylin.apache.org/docs21/install/kylin_aws_emr.html
> It introduces how to increase the HBase RPC timeout.
> For the second issue, as in the previous discussion, we should keep it.
>
> 2017-12-17 10:37 GMT+08:00 jxs <[email protected]>:
>
>> Hi Billy,
>> Thank you for pointing out the previous discussion. But for now we are
>> running a very small HBase cluster for lower cost, which has only one
>> slave node.
>> So the unsteady response time (within a range that is not too bad, e.g.
>> within 1 minute) is somewhat acceptable.
>> The previous timeout error interrupted the cube building procedure, and
>> we don't want that.
>> What is your suggestion for this use case?
>>
>> On 2017-12-16 at 11:48, "Billy Liu" <[email protected]> wrote:
>>
>> Check this: http://apache-kylin.74782.x6.nabble.com/hbase-configed-with-fixed-value-td9241.html
>>
>> 2017-12-15 18:03 GMT+08:00 jxs <[email protected]>:
>>
>>> Hi,
>>>
>>> Finally, I found this in org.apache.kylin.storage.hbase.HBaseResourceStore:
>>>
>>> ```
>>> private StorageURL buildMetadataUrl(KylinConfig kylinConfig) throws IOException {
>>>     StorageURL url = kylinConfig.getMetadataUrl();
>>>     if (!url.getScheme().equals("hbase"))
>>>         throw new IOException("Cannot create HBaseResourceStore. Url not match. Url: " + url);
>>>
>>>     // control timeout for prompt error report
>>>     Map<String, String> newParams = new LinkedHashMap<>();
>>>     newParams.put("hbase.client.scanner.timeout.period", "10000");
>>>     newParams.put("hbase.rpc.timeout", "5000");
>>>     newParams.put("hbase.client.retries.number", "1");
>>>     newParams.putAll(url.getAllParameters());
>>>
>>>     return url.copy(newParams);
>>> }
>>> ```
>>> Is this related to the timeout error? Why are these params hardcoded
>>> instead of read from configuration? Is there any workaround for this
>>> timeout error?
>>>
>>> On 2017-12-15 at 16:03, "jxs" <[email protected]> wrote:
>>>
>>> Hi, Kylin users,
>>>
>>> I encountered a strange timeout error today when building a cube.
>>>
>>> By "strange", I mean the "hbase.rpc.timeout" configuration is set to
>>> 60000 in HBase, but I get "org.apache.hadoop.hbase.ipc.CallTimeoutException:
>>> Call id=8099904, waitTime=5001, operationTimeout=5000 expired" errors.
>>>
>>> Kylin version 2.2.0, running on EMR. It ran without error for about
>>> half a month, then suddenly stopped working; the current cube is not
>>> the biggest one.
>>> I am wondering where I should look. Any help is appreciated.
>>>
>>> The traceback from the log:
>>>
>>> ```
>>> 2017-12-15 06:46:57,892 ERROR [Scheduler 2090031901 Job c9067736-eac7-48ad-88f3-dbd6f4e870ae-167] execution.ExecutableManager:149 : fail to get job output:c9067736-eac7-48ad-88f3-dbd6f4e870ae-14
>>> org.apache.kylin.job.exception.PersistentException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=1, exceptions:
>>> Fri Dec 15 14:46:57 GMT+08:00 2017, RpcRetryingCaller{globalStartTime=1513320412890, pause=100, retries=1}, java.io.IOException: Call to ip-172-31-5-71.cn-north-1.compute.internal/172.31.5.71:16020 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=8099904, waitTime=5001, operationTimeout=5000 expired.
>>>
>>>     at org.apache.kylin.job.dao.ExecutableDao.getJobOutput(ExecutableDao.java:202)
>>>     at org.apache.kylin.job.execution.ExecutableManager.getOutput(ExecutableManager.java:145)
>>>     at org.apache.kylin.job.execution.AbstractExecutable.getOutput(AbstractExecutable.java:312)
>>>     at org.apache.kylin.job.execution.AbstractExecutable.isDiscarded(AbstractExecutable.java:392)
>>>     at org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:149)
>>>     at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:125)
>>>     at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:64)
>>>     at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:125)
>>>     at org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:144)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>     at java.lang.Thread.run(Thread.java:748)
>>> Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=1, exceptions:
>>> Fri Dec 15 14:46:57 GMT+08:00 2017, RpcRetryingCaller{globalStartTime=1513320412890, pause=100, retries=1}, java.io.IOException: Call to ip-172-31-5-71.cn-north-1.compute.internal/172.31.5.71:16020 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=8099904, waitTime=5001, operationTimeout=5000 expired.
>>> ```
>>>
>>
>

--
Best regards,

Shaofeng Shi 史少锋
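[Editor's note] One detail in the `buildMetadataUrl()` snippet quoted above is worth spelling out: `newParams.putAll(url.getAllParameters())` runs *after* the hardcoded `put` calls, so any parameter carried on the metadata URL replaces the 5-second default rather than the other way around. Whether Kylin 2.2.0 actually accepts these parameters on `kylin.metadata.url` is not confirmed in this thread, but the override order itself is plain `java.util.Map` behavior and can be sketched standalone (the `"30000"` value is a hypothetical user-supplied timeout):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ParamOverrideDemo {
    public static void main(String[] args) {
        // Hardcoded defaults, mirroring HBaseResourceStore.buildMetadataUrl()
        Map<String, String> newParams = new LinkedHashMap<>();
        newParams.put("hbase.client.scanner.timeout.period", "10000");
        newParams.put("hbase.rpc.timeout", "5000");
        newParams.put("hbase.client.retries.number", "1");

        // Hypothetical parameters parsed from the metadata URL
        Map<String, String> urlParams = new LinkedHashMap<>();
        urlParams.put("hbase.rpc.timeout", "30000");

        // putAll runs last, so URL-supplied values replace the defaults
        newParams.putAll(urlParams);

        System.out.println(newParams.get("hbase.rpc.timeout")); // prints 30000
    }
}
```

So the parameters are hardcoded as *defaults*, not as an absolute cap, which matches the "prompt error report" comment in the original source.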
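[Editor's note] For the other timeout discussed above (the cube-build upload, not metadata access), the linked EMR article raises `hbase.rpc.timeout` at the cluster level. On EMR this is normally done through a configuration classification; the fragment below is a sketch, assuming the standard `hbase-site` classification and the 3600000 ms value quoted in the thread, not a verified excerpt from the article:

```
[
  {
    "Classification": "hbase-site",
    "Properties": {
      "hbase.rpc.timeout": "3600000"
    }
  }
]
```

This only affects the cluster-wide HBase client configuration; per the discussion above, the metadata path keeps its own short timeout regardless.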
