I think upgrading to HBase 1.4.10 should fix your problem here. Please see: https://issues.apache.org/jira/browse/HBASE-21464
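To make the failure mode concrete: in 1.4.8 the client keeps using its cached
hbase:meta location even after that server answers with
NotServingRegionException, so it retries the old server until the caller gives
up. The fix amounts to dropping the cached location and re-reading it from
ZooKeeper before retrying. Below is a minimal sketch of that retry shape in
Java; MetaLocator and MetaCall are hypothetical stand-ins for the client
internals, not real HBase classes, so treat it as an illustration rather than
the actual patch:

    import org.apache.hadoop.hbase.NotServingRegionException;
    import org.apache.hadoop.hbase.ServerName;

    public final class MetaRelocationSketch {

      /** Hypothetical view of the client's cached hbase:meta location. */
      interface MetaLocator {
        /** useCache=false should force a fresh read of the meta znode from ZooKeeper. */
        ServerName locateMeta(boolean useCache) throws Exception;
      }

      /** An operation that must be sent to whichever server hosts hbase:meta. */
      interface MetaCall<T> {
        T call(ServerName metaServer) throws Exception;
      }

      static <T> T callWithRelocation(MetaLocator locator, MetaCall<T> op, int maxRetries)
          throws Exception {
        ServerName meta = locator.locateMeta(true); // start from the cached location
        for (int tries = 0;; tries++) {
          try {
            return op.call(meta);
          } catch (NotServingRegionException e) {
            if (tries >= maxRetries) {
              throw e;
            }
            // This is the step that was missing for hbase:meta in 1.4.8: without
            // it the caller keeps retrying the old server (prod006 in your logs).
            meta = locator.locateMeta(false); // invalidate cache, re-read from ZK
          }
        }
      }
    }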
Thanks.

On Wed, Aug 14, 2019 at 10:22 AM OpenInx <[email protected]> wrote:

> Hi Alexander,
>
> Thanks for reporting and sharing the logs. I guess we did not handle the
> NotServingRegionException for hbase:meta correctly; that is, when a client
> encounters a NotServingRegionException from hbase:meta, it won't try to
> relocate the hbase:meta location (this is in 1.4.8).
> Let me have a check.
>
> On Wed, Aug 14, 2019 at 1:11 AM Alexander Batyrshin <[email protected]> wrote:
>
>> Hello,
>> We've got a problem with hbase:meta that caused several servers to die.
>> Our version is HBase-1.4.8.
>>
>> prod006 flushes hbase:meta, and then the region is moved away:
>>
>> Aug 13 15:46:23 prod006 hbase[63435]: 2019-08-13 15:46:23,019 INFO [prod006,60020,1564942121270_ChoreService_1] regionserver.HRegionServer: prod006,60020,1564942121270-MemstoreFlusherChore requesting flush of hbase:meta,,1.1588230740 because info has an old edit so flush to free WALs after random delay 226,666 ms
>> Aug 13 15:50:09 prod006 hbase[63435]: 2019-08-13 15:50:09,709 INFO [MemStoreFlusher.1] regionserver.HRegion: Finished memstore flush of ~2.99 KB/3064, currentsize=0 B/0 for region hbase:meta,,1.1588230740 in 24ms, sequenceid=323883, compaction requested=false
>> Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,943 INFO [StoreCloserThread-hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0.-1] regionserver.HStore: Closed info
>> Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,951 INFO [StoreCloserThread-hbase:meta,,1.1588230740-1] regionserver.HStore: Closed info
>> Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,976 INFO [RS_CLOSE_META-prod006:60020-0] regionserver.HRegion: Closed hbase:meta,,1.1588230740
>> Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,979 INFO [RS_CLOSE_REGION-prod006:60020-0] regionserver.HRegion: Closed hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0.
>> Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,979 INFO [RS_CLOSE_REGION-prod006:60020-0] regionserver.HRegionServer: Adding moved region record: d3d9ad37f8506c02a84458121965c0b0 to prod023,60020,1565701089060 as of 147
>>
>> At the same time, on the master, we have this:
>>
>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,982 INFO [AM.ZK.Worker-pool5-t1878] master.RegionStates: Transition {1588230740 state=PENDING_CLOSE, ts=1565701090850, server=prod006,60020,1564942121270} to {1588230740 state=CLOSED, ts=1565701090982, server=prod006,60020,1564942121270}
>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,982 INFO [AM.-pool3-t178] master.AssignmentManager: Setting node as OFFLINED in ZooKeeper for region {ENCODED => 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}
>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,982 INFO [AM.-pool3-t178] master.RegionStates: Transition {1588230740 state=CLOSED, ts=1565701090982, server=prod006,60020,1564942121270} to {1588230740 state=OFFLINE, ts=1565701090982, server=prod006,60020,1564942121270}
>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,984 INFO [AM.ZK.Worker-pool5-t1879] master.RegionStates: Transition {d3d9ad37f8506c02a84458121965c0b0 state=PENDING_CLOSE, ts=1565701090871, server=prod006,60020,1564942121270} to {d3d9ad37f8506c02a84458121965c0b0 state=CLOSED, ts=1565701090984, server=prod006,60020,1564942121270}
>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,984 INFO [AM.-pool3-t178] master.AssignmentManager: Assigning hbase:meta,,1.1588230740 to prod023,60020,1565701089060
>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,984 INFO [AM.-pool3-t178] master.RegionStates: Transition {1588230740 state=OFFLINE, ts=1565701090982, server=prod006,60020,1564942121270} to {1588230740 state=PENDING_OPEN, ts=1565701090984, server=prod023,60020,1565701089060}
>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,985 INFO [AM.-pool3-t179] master.AssignmentManager: Setting node as OFFLINED in ZooKeeper for region {ENCODED => d3d9ad37f8506c02a84458121965c0b0, NAME => 'hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0.', STARTKEY => '', ENDKEY => ''}
>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,985 INFO [AM.-pool3-t179] master.RegionStates: Transition {d3d9ad37f8506c02a84458121965c0b0 state=CLOSED, ts=1565701090984, server=prod006,60020,1564942121270} to {d3d9ad37f8506c02a84458121965c0b0 state=OFFLINE, ts=1565701090985, server=prod006,60020,1564942121270}
>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,986 INFO [AM.-pool3-t179] master.AssignmentManager: Assigning hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0. to prod023,60020,1565701089060
>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,986 INFO [AM.-pool3-t179] master.RegionStates: Transition {d3d9ad37f8506c02a84458121965c0b0 state=OFFLINE, ts=1565701090985, server=prod006,60020,1564942121270} to {d3d9ad37f8506c02a84458121965c0b0 state=PENDING_OPEN, ts=1565701090986, server=prod023,60020,1565701089060}
>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,039 WARN [AM.-pool3-t179] master.AssignmentManager: Failed assignment of hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0. to prod023,60020,1565701089060, waiting a little before trying on the same region server try=1 of 10
>> Aug 13 15:58:11 prod001 hbase[14953]: org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server prod023,60020,1565701089060 is not running yet
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:1320)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1707)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22737)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>> Aug 13 15:58:11 prod001 hbase[14953]: at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>> Aug 13 15:58:11 prod001 hbase[14953]: at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>> Aug 13 15:58:11 prod001 hbase[14953]: at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>> Aug 13 15:58:11 prod001 hbase[14953]: at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.instantiateException(RemoteWithExtrasException.java:95)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:85)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:368)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:330)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:787)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2248)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1694)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
>> Aug 13 15:58:11 prod001 hbase[14953]: at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> Aug 13 15:58:11 prod001 hbase[14953]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>> Aug 13 15:58:11 prod001 hbase[14953]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>> Aug 13 15:58:11 prod001 hbase[14953]: at java.lang.Thread.run(Thread.java:748)
>> Aug 13 15:58:11 prod001 hbase[14953]: Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.ipc.ServerNotRunningYetException): org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server prod023,60020,1565701089060 is not running yet
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:1320)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1707)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22737)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:386)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:94)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:409)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:405)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.ipc.BlockingRpcConnection.readResponse(BlockingRpcConnection.java:600)
>> Aug 13 15:58:11 prod001 hbase[14953]: at org.apache.hadoop.hbase.ipc.BlockingRpcConnection.run(BlockingRpcConnection.java:334)
>> Aug 13 15:58:11 prod001 hbase[14953]: ... 1 more
>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,217 INFO [AM.ZK.Worker-pool5-t1882] master.RegionStates: Transition {1588230740 state=PENDING_OPEN, ts=1565701090984, server=prod023,60020,1565701089060} to {1588230740 state=OPENING, ts=1565701091217, server=prod023,60020,1565701089060}
>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,244 INFO [AM.-pool3-t179] master.AssignmentManager: Assigning hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0. to prod023,60020,1565701089060
>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,391 INFO [AM.ZK.Worker-pool5-t1883] master.RegionStates: Transition {d3d9ad37f8506c02a84458121965c0b0 state=PENDING_OPEN, ts=1565701091244, server=prod023,60020,1565701089060} to {d3d9ad37f8506c02a84458121965c0b0 state=OPENING, ts=1565701091391, server=prod023,60020,1565701089060}
>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,666 INFO [AM.ZK.Worker-pool5-t1884] master.RegionStates: Transition {1588230740 state=OPENING, ts=1565701091217, server=prod023,60020,1565701089060} to {1588230740 state=OPEN, ts=1565701091666, server=prod023,60020,1565701089060}
>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,666 INFO [AM.ZK.Worker-pool5-t1884] coordination.ZkOpenRegionCoordination: Handling OPENED of 1588230740 from prod001,60000,1565179968855; deleting unassigned node
>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,668 INFO [AM.ZK.Worker-pool5-t1886] master.RegionStates: Offlined 1588230740 from prod006,60020,1564942121270
>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,805 INFO [AM.ZK.Worker-pool5-t1887] master.RegionStates: Transition {d3d9ad37f8506c02a84458121965c0b0 state=OPENING, ts=1565701091391, server=prod023,60020,1565701089060} to {d3d9ad37f8506c02a84458121965c0b0 state=OPEN, ts=1565701091805, server=prod023,60020,1565701089060}
>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,807 INFO [AM.ZK.Worker-pool5-t1889] master.RegionStates: Offlined d3d9ad37f8506c02a84458121965c0b0 from prod006,60020,1564942121270
>>
>> From this moment hbase:meta and hbase:namespace have moved from prod006 to prod023.
>>
>> But some other servers still try to get hbase:meta from prod006:
>>
>> Aug 13 16:02:24 prod021 hbase[103657]: 2019-08-13 16:02:24,832 INFO [regionserver/prod021/10.73.65.28:60020-splits-1564981231292] client.RpcRetryingCaller: Call exception, tries=10, retries=350, started=38462 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is not online on prod006,60020,1564942121270
>> Aug 13 16:02:24 prod021 hbase[103657]:     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3086)
>> Aug 13 16:02:24 prod021 hbase[103657]:     at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
>> Aug 13 16:02:24 prod021 hbase[103657]:     at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:2197)
>> Aug 13 16:02:24 prod021 hbase[103657]:     at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36617)
>> Aug 13 16:02:24 prod021 hbase[103657]:     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2369)
>> Aug 13 16:02:24 prod021 hbase[103657]:     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>> Aug 13 16:02:24 prod021 hbase[103657]:     at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>> Aug 13 16:02:24 prod021 hbase[103657]:     at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>> Aug 13 16:02:24 prod021 hbase[103657]: row 'IDX,\x1Abb6f7662-275e-459d-b0f3-7fd6aac9b6a4\x0000000046218537kV0WbJ_\x00\x13\x80\x00\x01k\x90\xFEzH\x00\x00\x00\x00,1562868892011.a215bcc7b396e5d08563f0a4183ba5cc.44' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=prod023,60020,1565701089060, seqNum=0
>>
>> ….
>>
>> Aug 13 16:21:41 prod021 hbase[103657]: 2019-08-13 16:21:41,136 INFO [regionserver/prod021/10.73.65.28:60020-splits-1564981231292] client.RpcRetryingCaller: Call exception, tries=68, retries=350, started=1194772 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is not online on prod006,60020,1564942121270
>> Aug 13 16:21:41 prod021 hbase[103657]: 2019-08-13 16:21:41,146 INFO [regionserver/prod021/10.73.65.28:60020-splits-1564981231292] regionserver.SplitRequest: Running rollback/cleanup of failed split of IDX,\x1Abb6f7662-275e-459d-b0f3-7fd6aac9b6a4\x0000000046218537kV0WbJ_\x00\x13\x80\x00\x01k\x90\xFEzH\x00\x00\x00\x00,1562868892011.a215bcc7b396e5d08563f0a4183ba5cc.; callTimeout=1200000, callDuration=1214938: org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is not online on prod006,60020,1564942121270
>> Aug 13 16:21:41 prod021 hbase[103657]: 2019-08-13 16:21:41,193 FATAL [regionserver/prod021/10.73.65.28:60020-splits-1564981231292] regionserver.HRegionServer: ABORTING region server prod021,60020,1564943354635: Abort; we got an error after point-of-no-return
>>
>> The same picture appears on the other crashed servers.
>>
>> Any ideas why this happened and how to prevent it in the future?
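P.S. Until you can upgrade, one way to confirm that a stuck caller is holding a
stale location is to ask the cluster where hbase:meta currently lives from a
fresh client and compare it with the server named in the
NotServingRegionException loop. A small sketch against the plain 1.x client API
(reload=true bypasses the client-side cache):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionLocation;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.RegionLocator;

    public class WhereIsMeta {
      public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml from the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             RegionLocator locator = conn.getRegionLocator(TableName.META_TABLE_NAME)) {
          // reload=true forces a fresh lookup instead of trusting the cache.
          HRegionLocation loc = locator.getRegionLocation(new byte[0], true);
          System.out.println("hbase:meta is on " + loc.getServerName());
        }
      }
    }

If this prints prod023 while prod021 keeps retrying prod006, you are looking at
exactly the stale-cache behaviour described above. (Peeking at the
meta-region-server znode with "hbase zkcli" should show the same thing.)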
