Re: Urgent help needed: Kylin query node hangs after queries have run for 20+ minutes

沈鲁威 Tue, 24 Apr 2018 00:04:51 -0700

There are no OOM or overload errors in the region server log.

Our HBase version is 1.2.0-cdh


> On April 24, 2018, at 1:59 PM, Ma Gang <[email protected]> wrote:
> 
> You may want to check the region server log: did the related region server 
> hit an OOM or become overloaded?
> 
> At 2018-04-24 13:47:08, "沈鲁威" <[email protected]> wrote:
> >
> >Additional exception details:
> >kylin.log:Caused by: org.apache.hadoop.hbase.DoNotRetryIOException: 
> >org.apache.hadoop.hbase.DoNotRetryIOException: Coprocessor passed deadline! 
> >Maybe server is overloaded
> >kylin.log-      at 
> >org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService.checkDeadline(CubeVisitService.java:225)
> >kylin.log-      at 
> >org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService.visitCube(CubeVisitService.java:259)
> >kylin.log-      at 
> >org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.generated.CubeVisitProtos$CubeVisitService.callMethod(CubeVisitProtos.java:5555)
> >kylin.log-      at 
> >org.apache.hadoop.hbase.regionserver.HRegion.execService(HRegion.java:7931)
> >kylin.log-      at 
> >org.apache.hadoop.hbase.regionserver.RSRpcServices.execServiceOnRegion(RSRpcServices.java:1969)
> >kylin.log-      at 
> >org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:1951)
> >kylin.log-      at 
> >org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33652)
> >kylin.log-      at 
> >org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2191)
> >kylin.log-      at 
> >org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
> >kylin.log-      at 
> >org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:183)
> >--
> >kylin.log-      at 
> >org.apache.hadoop.hbase.ipc.RegionCoprocessorRpcChannel.callExecService(RegionCoprocessorRpcChannel.java:107)
> >kylin.log-      at 
> >org.apache.hadoop.hbase.ipc.CoprocessorRpcChannel.callMethod(CoprocessorRpcChannel.java:56)
> >kylin.log-      at 
> >org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.generated.CubeVisitProtos$CubeVisitService$Stub.visitCube(CubeVisitProtos.java:5616)
> >kylin.log-      at 
> >org.apache.kylin.storage.hbase.cube.v2.CubeHBaseEndpointRPC$2.call(CubeHBaseEndpointRPC.java:237)
> >kylin.log-      at 
> >org.apache.kylin.storage.hbase.cube.v2.CubeHBaseEndpointRPC$2.call(CubeHBaseEndpointRPC.java:206)
> >kylin.log-      at 
> >org.apache.hadoop.hbase.client.HTable$15.call(HTable.java:1800)
> >kylin.log-      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >kylin.log-      at 
> >java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >kylin.log-      at 
> >java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >kylin.log-      ... 1 more
> >kylin.log:Caused by: 
> >org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.DoNotRetryIOException):
> > org.apache.hadoop.hbase.DoNotRetryIOException: Coprocessor passed deadline! 
> >Maybe server is overloaded
> >kylin.log-      at 
> >org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService.checkDeadline(CubeVisitService.java:225)
> >kylin.log-      at 
> >org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService.visitCube(CubeVisitService.java:259)
> >kylin.log-      at 
> >org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.generated.CubeVisitProtos$CubeVisitService.callMethod(CubeVisitProtos.java:5555)
> >kylin.log-      at 
> >org.apache.hadoop.hbase.regionserver.HRegion.execService(HRegion.java:7931)
> >kylin.log-      at 
> >org.apache.hadoop.hbase.regionserver.RSRpcServices.execServiceOnRegion(RSRpcServices.java:1969)
> >kylin.log-      at 
> >org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:1951)
> >kylin.log-      at 
> >org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33652)
> >kylin.log-      at 
> >org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2191)
> >kylin.log-      at 
> >org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
> >kylin.log-      at 
> >org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:183)
> >> On April 23, 2018, at 10:51 PM, 沈鲁威 <[email protected]> wrote:
> >> 
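> >> Hi all,
> >> We have set up CDH 5.13.1 + Kylin 2.3.0:
> >> one job node and three query nodes behind an SLB load balancer (4 cores, 8 GB).
> >> 
> >> Problem: during normal operation, every so often queries on one of the 
> >> query nodes start timing out, and right after that all queries become 
> >> unavailable.
> >> Only after we stop that query node with kylin.sh stop do the other nodes 
> >> work normally again.
> >> 
> >> The machine load is not high.
> >> Errors that have appeared in the logs: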
> >> 1. Uncategorized SQLException for SQL []; SQL state [null]; error code [0]; 
> >> exception while executing query: java.io.IOException: POST failed, error 
> >> code 500 and response: {"code":"999","data":null,"msg":"Timeout visiting 
> >> cube! Check why coprocessor exception is not sent back? In coprocessor 
> >> Self-termination is checked every 100 scanned rows, the configured 
> >> timeout(54000) cannot support this many scans?\nwhile executing SQL: 
> >> \"select COALESCE(SUM(a.total_sale_money_kpi),0) as total_sale_money_kpi , 
> >> COALESCE(SUM(a.total_sale_count_kpi),0) as 
> >> 2. by total_sale_money_kpi desc ### Cause: java.sql.SQLException: exception 
> >> while executing query: java.io.IOException: POST failed, error code 500 
> >> and response: 
> >> {"code":"999","data":null,"msg":"org.apache.hadoop.hbase.DoNotRetryIOException:
> >>  org.apache.hadoop.hbase.DoNotRetryIOException: Coprocessor passed 
> >> deadline! Maybe server is overloaded at 
> >> org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService.checkDeadline(CubeVisitService.java:225)
> >>  at 
> >> org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService.visitCube(CubeVisitService.java:259)
> >>  at 
> >> org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.generated.CubeVisitProtos$CubeVisitService.callMethod(CubeVisitProtos.java:5555)
> >>  at 
> >> org.apache.hadoop.hbase.regionserver.HRegion.execService(HRegion.java:7931)
> >>  at 
> >> org.apache.hadoop.hbase.regionserver.RSRpcServices.execServiceOnRegion(RSRpcServices.java:1969)
> >>  at 
> >> org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:1951)
> >>  at 
> >> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33652)
> >>  at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2191) at 
> >> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112) at 
> >> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:183) 
> >> at 
> >> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:163)\nwhile
> >>  executing SQL: 
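
The two messages above describe a deadline mechanism: the first says self-termination is checked every 100 scanned rows against a configured timeout (54000), and the second is thrown from CubeVisitService.checkDeadline. A rough sketch of that general pattern follows. It is not Kylin's code: the class, method and exception names are invented, treating the timeout as milliseconds is an assumption, and so is deriving the deadline from a request start time. If the real code works along these lines, a request that waited too long before being served would fail on entry without scanning anything, which may be worth checking alongside region server load.

```java
import java.util.function.LongSupplier;

class DeadlineScanSketch {

    // Taken from the error message: the check runs every 100 scanned rows.
    static final int CHECK_INTERVAL_ROWS = 100;

    // Hypothetical unchecked stand-in for HBase's DoNotRetryIOException.
    static class DeadlinePassedException extends RuntimeException {
        DeadlinePassedException(String message) {
            super(message);
        }
    }

    // Throws once the supplied clock reads later than the deadline.
    static void checkDeadline(long deadlineMillis, LongSupplier clock) {
        if (clock.getAsLong() > deadlineMillis) {
            throw new DeadlinePassedException(
                    "Coprocessor passed deadline! Maybe server is overloaded");
        }
    }

    // Pretends to scan totalRows rows and returns how many were scanned.
    // Assumption: deadline = request start time + timeout, both in milliseconds.
    static long scan(int totalRows, long startMillis, long timeoutMillis, LongSupplier clock) {
        long deadline = startMillis + timeoutMillis;
        checkDeadline(deadline, clock); // fails at once if the request is already too old
        long scanned = 0;
        for (int row = 0; row < totalRows; row++) {
            scanned++;
            if (scanned % CHECK_INTERVAL_ROWS == 0) {
                checkDeadline(deadline, clock);
            }
        }
        return scanned;
    }

    public static void main(String[] args) {
        LongSupplier wallClock = System::currentTimeMillis;
        long now = wallClock.getAsLong();
        System.out.println("rows scanned: " + scan(1_000, now, 54_000, wallClock));
        try {
            // A request that started 60 s ago is past a 54 s deadline before any row is read.
            scan(1_000, now - 60_000, 54_000, wallClock);
        } catch (DeadlinePassedException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Passing the clock in as a LongSupplier is only there so the behaviour can be exercised without waiting in real time.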
> >> 
> >> 
> >> <CE1ED564E277BCD093CB59000F043C9F.png>
> >> 
> >> 
> >> 
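> >> jstack output
> >> 
> >> Case 1:
> >> Many threads are waiting on the same lock, sometimes more than 100.
> >> We suspect a lock is being held, possibly a global one, because once one 
> >> machine has the problem, queries fail on the other machines too.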
> >> 
> >> 
> >> "kylin-coproc--pool2-t82051" #93742 daemon prio=5 os_prio=0 
> >> tid=0x00007f314d435800 nid=0x1fb waiting on condition [0x00007f315abad000]
> >>   java.lang.Thread.State: TIMED_WAITING (parking)
> >>    at sun.misc.Unsafe.park(Native Method)
> >>    - parking to wait for  <0x00000007008eeff8> (a 
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> >>    at 
> >> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> >>    at 
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
> >>    at 
> >> java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
> >>    at 
> >> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
> >>    at 
> >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> >>    at 
> >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >>    at java.lang.Thread.run(Thread.java:748)
> >> 
> >>   Locked ownable synchronizers:
> >>    - None
> >> 
> >> "kylin-coproc--pool2-t82050" #93741 daemon prio=5 os_prio=0 
> >> tid=0x00007f314dc24800 nid=0x1fa waiting on condition [0x00007f315c1bb000]
> >>   java.lang.Thread.State: TIMED_WAITING (parking)
> >>    at sun.misc.Unsafe.park(Native Method)
> >>    - parking to wait for  <0x00000007008eeff8> (a 
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> >>    at 
> >> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> >>    at 
> >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
> >>    at 
> >> java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
> >>    at 
> >> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
> >>    at 
> >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> >>    at 
> >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >>    at java.lang.Thread.run(Thread.java:748)
> >> 
> >>   Locked ownable synchronizers:
> >>    - None
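
A possible reading of the two dumps above, offered as a suggestion rather than a diagnosis: both threads are TIMED_WAITING inside ThreadPoolExecutor.getTask, in LinkedBlockingQueue.poll, and hold no ownable synchronizers. That is the place where an idle pool worker waits for its next task, and the shared ConditionObject would then be the queue's "not empty" condition rather than a contended lock. The sketch below reproduces that state with a plain JDK pool. The pool sizes, keep-alive value and thread name are made up for the demo and say nothing about how Kylin configures its kylin-coproc pool.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class IdleWorkerStateDemo {

    // Runs one trivial task, then reports the state the worker settles into once idle.
    static Thread.State idleWorkerState() {
        AtomicReference<Thread> worker = new AtomicReference<>();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 60L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>(),
                runnable -> {
                    Thread t = new Thread(runnable, "demo-pool-t1");
                    t.setDaemon(true);
                    worker.set(t); // remember the single worker so its state can be read later
                    return t;
                });
        // With a keep-alive on core threads, an idle worker waits in queue.poll(timeout).
        pool.allowCoreThreadTimeOut(true);
        try {
            pool.submit(() -> { }).get(); // the task has finished; the worker now goes idle
            long giveUpAt = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
            Thread.State state = worker.get().getState();
            // Give the worker a moment to get from "task done" to parked in poll().
            while (state != Thread.State.TIMED_WAITING && System.nanoTime() < giveUpAt) {
                Thread.sleep(10);
                state = worker.get().getState();
            }
            return state;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("idle worker state: " + idleWorkerState());
    }
}
```

If this reading applies, a large number of such threads would indicate idle workers left over from earlier load, and the threads that are actually stuck would have to be looked for elsewhere in the dump.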
> >> 
> >> 
> >> 
> >> Case 2:
> >> A thread pool problem, but so far we have not found which setting 
> >> controls the size of this thread pool.
> >> 
> >> 2018-04-22 10:56:13,407 ERROR [pool-10-thread-806] 
> >> v2.CubeHBaseEndpointRPC:340 : <sub-thread for Query 
> >> 492811-3d81d0ee-b6c9-443b-b652-3f94f5072cd1-1524365662180 GTScanRequest 
> >> 1578e6c>Error when visiting cubes by endpoint
> >> java.util.concurrent.RejectedExecutionException: Task 
> >> java.util.concurrent.FutureTask@6006a8c3 rejected from 
> >> java.util.concurrent.ThreadPoolExecutor@276cb5e4[Shutting down, pool size 
> >> = 19, active threads = 19, queued tasks = 0, completed tasks = 90389]
> >> at 
> >> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
> >> at 
> >> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
> >> at 
> >> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
> >> at 
> >> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
> >> at 
> >> org.apache.hadoop.hbase.client.HTable.coprocessorService(HTable.java:1795)
> >> at 
> >> org.apache.kylin.storage.hbase.cube.v2.CubeHBaseEndpointRPC.runEPRange(CubeHBaseEndpointRPC.java:205)
> >> at 
> >> org.apache.kylin.storage.hbase.cube.v2.CubeHBaseEndpointRPC.access$000(CubeHBaseEndpointRPC.java:69)
> >> at 
> >> org.apache.kylin.storage.hbase.cube.v2.CubeHBaseEndpointRPC$1.run(CubeHBaseEndpointRPC.java:186)
> >> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> >> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >> at 
> >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >> at 
> >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >> at java.lang.Thread.run(Thread.java:748)
> >> 2018-04-22 10:56:13,407 DEBUG [Query 
> >> 492811-e0a95289-a23e-4eb2-a1d2-e0000fd66ac4-1524365662196-116] 
> >> gtrecord.GTCubeStorageQueryBase:311 : Need storage aggregation
> >> 2018-04-22 10:56:13,408 INFO  [Query 
> >> 123629-8888aa31-e163-41c7-84d2-4b06a6b8da18-1524365659125-143] 
> >> service.QueryService:1134 : Processed rows for each storageContext: 7 
> >> 2018-04-22 10:56:13,408 ERROR [pool-10-thread-800] 
> >> v2.CubeHBaseEndpointRPC:340 : <sub-thread for Query 
> >> 492811-3d81d0ee-b6c9-443b-b652-3f94f5072cd1-1524365662180 GTScanRequest 
> >> 5677c55d>Error when visiting cubes by endpoint
> >> java.util.concurrent.RejectedExecutionException: Task 
> >> java.util.concurrent.FutureTask@6006a8c3 rejected from 
> >> java.util.concurrent.ThreadPoolExecutor@276cb5e4[Shutting down, pool size 
> >> = 19, active threads = 19, queued tasks = 0, completed tasks = 90389]
> >> at 
> >> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
> >> at 
> >> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
> >> at 
> >> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
> >> at 
> >> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
> >> at 
> >> org.apache.hadoop.hbase.client.HTable.coprocessorService(HTable.java:1795)
> >> at 
> >> org.apache.kylin.storage.hbase.cube.v2.CubeHBaseEndpointRPC.runEPRange(CubeHBaseEndpointRPC.java:205)
> >> at 
> >> org.apache.kylin.storage.hbase.cube.v2.CubeHBaseEndpointRPC.access$000(CubeHBaseEndpointRPC.java:69)
> >> at 
> >> org.apache.kylin.storage.hbase.cube.v2.CubeHBaseEndpointRPC$1.run(CubeHBaseEndpointRPC.java:186)
> >> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> >> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >> at 
> >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >> at 
> >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >> at java.lang.Thread.run(Thread.java:748)
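
One detail in the log above may narrow the search: the executor reports its state as "Shutting down" with queued tasks = 0. In java.util.concurrent that state text appears only after shutdown() has been called on the pool and before it has terminated, and the default AbortPolicy rejects every task submitted in that state regardless of pool size. So the rejection may not come from a thread-count setting at all. The log does not show which component shut the pool down. The sketch below is a plain JDK demo with made-up pool parameters, not Kylin or HBase code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class RejectedAfterShutdownDemo {

    // Submits a task to a pool that is shut down but still busy, and returns the rejection text.
    static String rejectionMessage() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Keep one task running so the pool stays in the shutting-down state.
            pool.execute(() -> {
                started.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            started.await();
            pool.shutdown(); // no new tasks are accepted from here on
            try {
                pool.submit(() -> { }); // the unbounded queue is empty, yet this is rejected
                return "accepted";
            } catch (RejectedExecutionException e) {
                return e.getMessage();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown(); // let the running task finish so the pool can terminate
        }
    }

    public static void main(String[] args) {
        System.out.println(rejectionMessage());
    }
}
```

The message has the same shape as the one in kylin.log: a pool in the shutting-down state, active threads still running, and nothing queued.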
> >> 
> >> <091BC280DBCABF12925C7456BF791602.jpg>
> >> 
> >> 
> >> Case 3: the following error has also appeared
> >> hangzhou.dianjia.io trying to unlock 
> >> /kylin/kylin_metadata/job_engine/global_job_engine_lock
> >> kylin.out:      at 
> >> org.apache.kylin.storage.hbase.util.ZookeeperDistributedLock.unlock(ZookeeperDistributedLock.java:236)
> >> kylin.out:      at 
> >> org.apache.kylin.storage.hbase.util.ZookeeperDistributedLock.unlockJobEngine(ZookeeperDistributedLock.java:311)
> >> kylin.out:      at 
> >> org.apache.kylin.storage.hbase.util.ZookeeperJobLock.unlockJobEngine(ZookeeperJobLock.java:86)
> >> kylin.out-      at 
> >> org.apache.kylin.job.impl.threadpool.DefaultScheduler.shutdown(DefaultScheduler.java:234)
> >> kylin.out-      at 
> >> org.apache.kylin.rest.service.JobService$2.run(JobService.java:140)
> >> kylin.out-      at java.lang.Thread.run(Thread.java:748)
> >> kylin.out-Caused by: java.lang.IllegalStateException: Client is not started
> >> kylin.out-      at 
> >> com.google.common.base.Preconditions.checkState(Preconditions.java:149)
> >> kylin.out:      at 
> >> org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:113)
> >> kylin.out-      at 
> >> org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:477)
> >> kylin.out-      at 
> >> org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:238)
> >> kylin.out-      at 
> >> org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:233)
> >> kylin.out-      at 
> >> org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
> >> kylin.out-      at 
> >> org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:230)
> >> kylin.out-      at 
> >> org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:214)
> >> kylin.out-      at 
> >> org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:41)
> >> kylin.out:      at 
> >> org.apache.kylin.storage.hbase.util.ZookeeperDistributedLock.unlock(ZookeeperDistributedLock.java:231)
> >> 
> >> 
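> >> We suspected the code shown below, but we tested removing the 
> >> synchronized lock and the problem remained.
> >> In most cases, execution is very slow between the 66666 and 77777 
> >> markers in the figure below.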
> >> <B5ED37ABAEE71EB70911E69D10DD3252.png>
> >> 
> >> <kylin配置.txt>
> >> 
> 
> 
>
