What release of HBase do you use? I looked at the two log files but didn't find that information. In the log for node 118, I saw something like the following:

Failed to connect to /10.0.229.16:50010 for block, add to deadNodes and continue
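That "add to deadNodes and continue" message is the DFS client's read-retry path: it gives up on one replica, remembers that datanode as dead for the current stream, and moves on to the next replica. Roughly the following pattern; this is a simplified sketch with hypothetical names, not the actual DFSClient code:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DeadNodesSketch {
    // Per-stream set of datanodes that have already failed for this read.
    private final Set<String> deadNodes = new HashSet<>();

    // Returns the first replica not yet marked dead, or null if none remain.
    String chooseDataNode(List<String> replicas) {
        for (String node : replicas) {
            if (!deadNodes.contains(node)) {
                return node;
            }
        }
        return null; // every replica failed: the read itself now fails
    }

    void markDead(String node) {
        System.out.println("Failed to connect to " + node
                + " for block, add to deadNodes and continue");
        deadNodes.add(node);
    }
}

If reads keep landing in the markDead path, that usually points at datanode or network trouble rather than HBase itself, hence my next question.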
Was HDFS healthy around the time the region servers got stuck?

Cheers

Friday, July 24, 2015, 12:21 AM +0800 from Konstantin Chudinov <[email protected]>:

>Hi all,
>Our team ran into a cascading region server hang. The RS logs are similar to those in HBASE-10499 (https://issues.apache.org/jira/browse/HBASE-10499), except there is no RegionTooBusyException before the flush loop:
>
>2015-07-19 07:32:41,961 INFO org.apache.hadoop.hbase.regionserver.HStore: Completed major compaction of 2 file(s) in s of table4,\xC7
>,1390920313296.9f554d5828cfa9689de27c1a42d844e3. into 65dae45c82264b4d80fc7ed0818a4094(size=1.2 M), total size for store is 1.2 M. This selection was in queue for 0sec, and took 0sec to execute.
>2015-07-19 07:32:41,961 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: Completed compaction: Request = regionName=table4,\xC7
>,1390920313296.9f554d5828cfa9689de27c1a42d844e3., storeName=s, fileCount=2, fileSize=1.2 M, priority=998, time=24425664829680753; duration=0sec
>2015-07-19 07:32:41,962 INFO org.apache.hadoop.hbase.regionserver.compactions.RatioBasedCompactionPolicy: Default compaction algorithm has selected 1 files from 1 candidates
>2015-07-19 07:32:44,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 18943
>2015-07-19 07:32:54,765 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 4851
>2015-07-19 07:33:04,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 7466
>2015-07-19 07:33:14,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 4940
>2015-07-19 07:33:24,765 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 12909
>2015-07-19 07:33:34,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 5897
>2015-07-19 07:33:44,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 9110
>2015-07-19 07:33:54,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 7109
>....
>until we rebooted the RS at 10:08.
>8 servers got stuck at the same time.
>I haven't found anything in the hmaster's logs.
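The repeating periodicFlusher lines are the tell: the chore wakes up every ~10 seconds, sees the same region still unflushed, and re-requests the flush with a fresh random jitter, so the loop itself is a symptom of a flush that never completes. A minimal sketch of that pattern, with hypothetical names (not the actual HRegionServer code):

import java.util.Random;

class PeriodicFlusherSketch {
    private final Random rand = new Random();

    interface FlushRequester {
        void requestDelayedFlush(String region, long delayMillis);
    }

    // Invoked by a chore thread every ~10s (matching the log cadence above).
    // If a region's flush never completes, the region stays in the unflushed
    // set, so the same log line repeats with a new random delay each wakeup.
    void chore(Iterable<String> unflushedRegions, FlushRequester requester) {
        for (String region : unflushedRegions) {
            long jitter = rand.nextInt(20_000); // hypothetical jitter bound
            System.out.printf(
                    "periodicFlusher requesting flush for region %s after a delay of %d%n",
                    region, jitter);
            requester.requestDelayedFlush(region, jitter);
        }
    }
}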
>Thread dumps show that many threads (including the flush thread) are waiting for a read lock while accessing HDFS:
>
>"RpcServer.handler=19,port=60020" - Thread t@90
> java.lang.Thread.State: WAITING
>at java.lang.Object.wait(Native Method)
>- waiting on <77770184> (a org.apache.hadoop.hbase.util.IdLock$Entry)
>at java.lang.Object.wait(Object.java:503)
>at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:79)
>at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:319)
>at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253)
>at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:494)
>at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:542)
>at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:257)
>at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:173)
>at org.apache.hadoop.hbase.regionserver.StoreFileScanner.enforceSeek(StoreFileScanner.java:377)
>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.pollRealKV(KeyValueHeap.java:347)
>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:304)
>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269)
>at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:695)
>at org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
>at org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.requestSeek(NonLazyKeyValueScanner.java:39)
>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:311)
>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269)
>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3987)
>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3814)
>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3805)
>at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3136)
>- locked <1623a240> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
>at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
>at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
>at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
>at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>at java.lang.Thread.run(Thread.java:745)
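The WAITING threads above are parked in IdLock.getLockEntry: HBase lets only one thread load a given HFile block from HDFS, and every other thread asking for the same block id waits on that loader's entry. If the loading thread stalls inside HDFS, all of the waiters stall with it. A minimal sketch of such a per-id lock, assuming simplified semantics rather than the actual org.apache.hadoop.hbase.util.IdLock implementation:

import java.util.concurrent.ConcurrentHashMap;

class IdLockSketch {
    private final ConcurrentHashMap<Long, Object> entries = new ConcurrentHashMap<>();

    // First caller for an id wins and goes on to read the block; everyone else
    // waits on the winner's entry, which is the Object.wait() frame in the dump.
    Object getLockEntry(long id) throws InterruptedException {
        while (true) {
            Object mine = new Object();
            Object existing = entries.putIfAbsent(id, mine);
            if (existing == null) {
                return mine; // we own this block id
            }
            synchronized (existing) {
                if (entries.get(id) == existing) {
                    existing.wait(); // loader still running; block here
                }
            }
        }
    }

    void releaseLockEntry(long id, Object entry) {
        entries.remove(id, entry);
        synchronized (entry) {
            entry.notifyAll(); // wake waiters so they can retry
        }
    }
}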
>"RpcServer.handler=29,port=60020" - Thread t@100
> java.lang.Thread.State: BLOCKED
>at org.apache.hadoop.hdfs.DFSInputStream.getFileLength(DFSInputStream.java:354)
>- waiting to lock <399a6ff3> (a org.apache.hadoop.hdfs.DFSInputStream) owned by "RpcServer.handler=21,port=60020" t@92
>at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1270)
>at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:90)
>at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1224)
>at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockDataInternal(HFileBlock.java:1432)
>at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1314)
>at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:355)
>at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253)
>at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:494)
>at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:542)
>at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:257)
>at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:173)
>at org.apache.hadoop.hbase.regionserver.StoreFileScanner.enforceSeek(StoreFileScanner.java:377)
>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.pollRealKV(KeyValueHeap.java:347)
>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:304)
>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269)
>at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:695)
>at org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:683)
>at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:533)
>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:140)
>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:3866)
>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateFromJoinedHeap(HRegion.java:3840)
>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3995)
>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3814)
>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3805)
>at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3136)
>- locked <3af54140> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
>at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
>at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
>at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
>at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>at java.lang.Thread.run(Thread.java:745)
>
> Locked ownable synchronizers:
>- locked <5320bfc4> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
>
>I have zipped all the logs and dumps and attached them to this mail.
>This problem occurs about once a month on our cluster.
>Does anybody know the reason for this cascading server failure?
>Thank you in advance!
>
>Konstantin Chudinov
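The BLOCKED thread shows why the waiters never wake up: DFSInputStream's stateful read path is synchronized on the stream instance, so scanners sharing one open HFile stream serialize on its monitor, and here handler 21 holds it, quite possibly stuck talking to a bad datanode (compare the deadNodes message earlier in the thread). A self-contained sketch of the effect, using toy classes rather than the real HDFS client:

class SharedStreamSketch {
    static class SyncStream {
        private final long length = 42;

        synchronized long getFileLength() { return length; }

        // Simulates a read stuck on an unresponsive datanode. The monitor is
        // held for the whole call, so concurrent getFileLength() callers show
        // up as BLOCKED in a thread dump, exactly like handler 29 above.
        synchronized int read(byte[] buf, int off, int len) throws InterruptedException {
            Thread.sleep(Long.MAX_VALUE);
            return -1;
        }
    }

    public static void main(String[] args) throws Exception {
        SyncStream stream = new SyncStream();
        Thread stuckReader = new Thread(() -> {
            try { stream.read(new byte[4096], 0, 4096); } catch (InterruptedException ignored) { }
        }, "handler-21");
        stuckReader.start();
        Thread.sleep(100); // let the reader grab the monitor

        Thread blocked = new Thread(() -> stream.getFileLength(), "handler-29");
        blocked.start();
        Thread.sleep(100);
        System.out.println("handler-29 state: " + blocked.getState()); // expected: BLOCKED

        stuckReader.interrupt(); // free the monitor so the demo can finish
        blocked.join();
    }
}

Put together, one stuck HDFS read can wedge the thread loading a block, which wedges every thread waiting on the same IdLock entry, including the flush thread; that would match both the endless periodicFlusher loop and several servers hanging at once if the underlying datanode problem was shared.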
