Perhaps the block was being re-replicated because of a DN failure, and HBase got stuck trying to access that replica? I can see that the DN answers that some blocks are missing. Or maybe you were running the HDFS balancer?
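To illustrate the client-side behavior visible in the logs below ("add to deadNodes and continue"): the DFSClient walks the replica locations in order and blacklists any DataNode that fails, which is why the read eventually succeeds against 10.0.240.163. A minimal sketch of that retry pattern in Python — the names (read_block, fetch, ReplicaNotFound) are made up for illustration and are not the actual DFSClient API:

```python
class ReplicaNotFound(Exception):
    """Stands in for HDFS's ReplicaNotFoundException (illustrative only)."""

def read_block(locations, fetch):
    """Try each replica location in order; blacklist nodes that fail.

    locations: ordered list of DataNode addresses.
    fetch: callable reading the block from one node, or raising ReplicaNotFound.
    """
    dead_nodes = set()
    for node in locations:
        if node in dead_nodes:
            continue
        try:
            return fetch(node)
        except ReplicaNotFound:
            # Mirrors "add to deadNodes and continue" in the DFSClient log.
            dead_nodes.add(node)
    raise IOError("could not obtain block from any replica")

# The first replica is gone (as on 10.0.241.39); the second one works.
replicas = {"10.0.240.163:50010": b"block-bytes"}

def fetch(node):
    if node not in replicas:
        raise ReplicaNotFound(node)
    return replicas[node]

data = read_block(["10.0.241.39:50010", "10.0.240.163:50010"], fetch)
```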
The other thing is that, by design, you should always have read access in HDFS. You are not allowed to modify a file concurrently: the first writer gets a lease on the file, and the NameNode does not grant concurrent leases, if I remember correctly. See what happens with block 1099777976128.

RS:
2015-07-19 07:25:08,533 INFO org.apache.hadoop.hbase.regionserver.HStore: Starting compaction of 2 file(s) in i of table7,\x8C\xA0,1435936455217.12a2d1e37fd8f0f9870fc1b5afd6046d. into tmpdir=hdfs://server1/hbase/data/default/table7/12a2d1e37fd8f0f9870fc1b5afd6046d/.tmp, totalSize=416.0 M
2015-07-19 07:25:08,556 WARN org.apache.hadoop.hdfs.BlockReaderFactory: BlockReaderFactory(fileName=/hbase/data/default/table7/12a2d1e37fd8f0f9870fc1b5afd6046d/i/983cf03fddfa480f92346f25a61b0b9e, block=BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128): unknown response code ERROR while attempting to set up short-circuit access. Block BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128 is not valid
2015-07-19 07:25:08,556 WARN org.apache.hadoop.hdfs.client.ShortCircuitCache: ShortCircuitCache(0x6b1f04e2): failed to load 1195579097_BP-1892992341-10.10.122.111-1352825964285
2015-07-19 07:25:08,557 WARN org.apache.hadoop.hdfs.BlockReaderFactory: I/O error constructing remote block reader.
java.io.IOException: Got error for OP_READ_BLOCK, self=/10.0.241.39:53420, remote=/10.0.241.39:50010, for file /hbase/data/default/table7/12a2d1e37fd8f0f9870fc1b5afd6046d/i/983cf03fddfa480f92346f25a61b0b9e, for pool BP-1892992341-10.10.122.111-1352825964285 block 1195579097_1099777976128
    at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:432)
    at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:397)
    at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:786)
    at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:665)
    at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:325)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:566)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:789)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:836)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1210)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockDataInternal(HFileBlock.java:1483)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1314)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:355)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.seekTo(HFileReaderV2.java:1052)
    at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:244)
    at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:152)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:317)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:240)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:202)
    at org.apache.hadoop.hbase.regionserver.compactions.Compactor.createScanner(Compactor.java:257)
    at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:65)
    at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:109)
    at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1080)
    at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1482)
    at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:475)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
2015-07-19 07:25:08,558 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.0.241.39:50010 for block, add to deadNodes and continue.
java.io.IOException: Got error for OP_READ_BLOCK, self=/10.0.241.39:53420, remote=/10.0.241.39:50010, for file /hbase/data/default/table7/12a2d1e37fd8f0f9870fc1b5afd6046d/i/983cf03fddfa480f92346f25a61b0b9e, for pool BP-1892992341-10.10.122.111-1352825964285 block 1195579097_1099777976128
    at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:432)
    at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:397)
    at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:786)
    at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:665)
    at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:325)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:566)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:789)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:836)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1210)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockDataInternal(HFileBlock.java:1483)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1314)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:355)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.seekTo(HFileReaderV2.java:1052)
    at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:244)
    at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:152)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:317)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:240)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:202)
    at org.apache.hadoop.hbase.regionserver.compactions.Compactor.createScanner(Compactor.java:257)
    at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:65)
    at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:109)
    at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1080)
    at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1482)
    at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:475)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
2015-07-19 07:25:08,559 INFO org.apache.hadoop.hdfs.DFSClient: Successfully connected to /10.0.240.163:50010 for BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128
2015-07-19 07:25:12,382 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,c7000000,1432632712751.736fd216603c2368a7001f34c944a7a0.
after a delay of 17793

DN:
2015-07-19 07:25:08,556 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opReadBlock BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not found for BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128
2015-07-19 07:25:08,557 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.0.241.39, datanodeUuid=b5972c62-421f-4eab-9ae7-641b8019d406, infoPort=50075, ipcPort=50020, storageInfo=lv=-55;cid=cluster10;nsid=1415935480;c=1418135836666):Got exception while serving BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128 to /10.0.241.39:53420
org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not found for BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:419)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:228)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:466)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:110)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)
    at java.lang.Thread.run(Thread.java:745)
2015-07-19 07:25:08,557 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hbase101.nnn.pvt:50010:DataXceiver error processing READ_BLOCK operation src: /10.0.241.39:53420 dest: /10.0.241.39:50010
org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not found for BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:419)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:228)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:466)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:110)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)
    at java.lang.Thread.run(Thread.java:745)
2015-07-19 07:25:09,022 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1892992341-10.10.122.111-1352825964285:blk_1201101758_1099783498791 src: /10.0.241.39:53422 dest: /10.0.241.39:50010

Are all your RSs collocated with DNs, or do you have an RS that doesn't have a local DN? Can we get logs from the "stuck time" from the DN located at 10.0.240.163?

2015-07-24 13:49 GMT+02:00 Ted Yu <yuzhih...@gmail.com>:
>
> Is it possible for you to upgrade to 0.98.10+?
>
> I will take a look at your logs later.
>
> Thanks
>
> Friday, July 24, 2015, 7:15 PM +0800 from Konstantin Chudinov <kchudi...@griddynamics.com>:
> >Hello Ted,
> >Thank you for your answer!
> >The Hadoop and HBase versions are:
> >2.3.0-cdh5.1.0 - Hadoop (and HDFS) version
> >hbase-0.98.1
> >About HDFS: I don't see anything special in the logs. I've attached them to this message. By the way, it's another server, which also crashed (I've lost the HDFS logs of the previous server), so the HBase logs are in the archive as well.
> >
> >Best regards,
> >
> >Konstantin Chudinov
> >
> >On 23 Jul 2015, at 20:44, Ted Yu <yuzhih...@gmail.com> wrote:
> >>
> >>What release of HBase do you use?
> >>
> >>I looked at the two log files but didn't find such information.
> >>In the log for node 118, I saw something such as the following:
> >>Failed to connect to /10.0.229.16:50010 for block, add to deadNodes and continue
> >>
> >>Was HDFS healthy around the time the region server got stuck?
> >>
> >>Cheers
> >>
> >>Friday, July 24, 2015, 12:21 AM +0800 from Konstantin Chudinov <kchudi...@griddynamics.com>:
> >>>Hi all,
> >>>Our team faced a cascading stall of region servers. The RS logs are similar to those in HBASE-10499 (https://issues.apache.org/jira/browse/HBASE-10499), except that there is no RegionTooBusyException before the flush loop:
> >>>2015-07-19 07:32:41,961 INFO org.apache.hadoop.hbase.regionserver.HStore: Completed major compaction of 2 file(s) in s of table4,\xC7,1390920313296.9f554d5828cfa9689de27c1a42d844e3. into 65dae45c82264b4d80fc7ed0818a4094(size=1.2 M), total size for store is 1.2 M. This selection was in queue for 0sec, and took 0sec to execute.
> >>>2015-07-19 07:32:41,961 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: Completed compaction: Request = regionName=table4,\xC7,1390920313296.9f554d5828cfa9689de27c1a42d844e3., storeName=s, fileCount=2, fileSize=1.2 M, priority=998, time=24425664829680753; duration=0sec
> >>>2015-07-19 07:32:41,962 INFO org.apache.hadoop.hbase.regionserver.compactions.RatioBasedCompactionPolicy: Default compaction algorithm has selected 1 files from 1 candidates
> >>>2015-07-19 07:32:44,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 18943
> >>>2015-07-19 07:32:54,765 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 4851
> >>>2015-07-19 07:33:04,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 7466
> >>>2015-07-19 07:33:14,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 4940
> >>>2015-07-19 07:33:24,765 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 12909
> >>>2015-07-19 07:33:34,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 5897
> >>>2015-07-19 07:33:44,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 9110
> >>>2015-07-19 07:33:54,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after a delay of 7109
> >>>....
> >>>until we rebooted the RS at 10:08.
> >>>8 servers got stuck at the same time.
> >>>I haven't found anything in the hmaster logs.
> >>>Thread dumps show that many threads (including the flush thread) are waiting for a read lock while accessing HDFS:
> >>>"RpcServer.handler=19,port=60020" - Thread t@90
> >>>  java.lang.Thread.State: WAITING
> >>>at java.lang.Object.wait(Native Method)
> >>>- waiting on <77770184> (a org.apache.hadoop.hbase.util.IdLock$Entry)
> >>>at java.lang.Object.wait(Object.java:503)
> >>>at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:79)
> >>>at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:319)
> >>>at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253)
> >>>at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:494)
> >>>at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:542)
> >>>at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:257)
> >>>at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:173)
> >>>at org.apache.hadoop.hbase.regionserver.StoreFileScanner.enforceSeek(StoreFileScanner.java:377)
> >>>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.pollRealKV(KeyValueHeap.java:347)
> >>>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:304)
> >>>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269)
> >>>at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:695)
> >>>at org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
> >>>at org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.requestSeek(NonLazyKeyValueScanner.java:39)
> >>>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:311)
> >>>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269)
> >>>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3987)
> >>>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3814)
> >>>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3805)
> >>>at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3136)
> >>>- locked <1623a240> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
> >>>at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
> >>>at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
> >>>at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
> >>>at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
> >>>at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
> >>>at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
> >>>at java.lang.Thread.run(Thread.java:745)
> >>>
> >>>"RpcServer.handler=29,port=60020" - Thread t@100
> >>>  java.lang.Thread.State: BLOCKED
> >>>at org.apache.hadoop.hdfs.DFSInputStream.getFileLength(DFSInputStream.java:354)
> >>>- waiting to lock <399a6ff3> (a org.apache.hadoop.hdfs.DFSInputStream) owned by "RpcServer.handler=21,port=60020" t@92
> >>>at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1270)
> >>>at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:90)
> >>>at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1224)
> >>>at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockDataInternal(HFileBlock.java:1432)
> >>>at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1314)
> >>>at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:355)
> >>>at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253)
> >>>at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:494)
> >>>at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:542)
> >>>at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:257)
> >>>at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:173)
> >>>at org.apache.hadoop.hbase.regionserver.StoreFileScanner.enforceSeek(StoreFileScanner.java:377)
> >>>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.pollRealKV(KeyValueHeap.java:347)
> >>>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:304)
> >>>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269)
> >>>at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:695)
> >>>at org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:683)
> >>>at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:533)
> >>>at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:140)
> >>>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:3866)
> >>>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateFromJoinedHeap(HRegion.java:3840)
> >>>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3995)
> >>>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3814)
> >>>at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3805)
> >>>at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3136)
> >>>- locked <3af54140> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
> >>>at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
> >>>at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
> >>>at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
> >>>at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
> >>>at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
> >>>at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
> >>>at java.lang.Thread.run(Thread.java:745)
> >>>  Locked ownable synchronizers:
> >>>- locked <5320bfc4> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
> >>>I have zipped all logs and dumps and attached them to this mail.
> >>>This problem occurs once a month on our cluster.
> >>>Does anybody know the reason for this cascading server failure?
> >>>Thank you in advance!
> >>>
> >>>Konstantin Chudinov
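For what it's worth, the pile-up in the thread dumps above (one handler inside readBlock, every other handler parked on the same IdLock$Entry or blocked on the shared synchronized DFSInputStream) is exactly what an id-based lock produces when the holder never returns: every reader of that block id waits forever. A minimal illustration of the locking pattern in Python — this is a sketch, not HBase's actual IdLock implementation:

```python
import threading
import time

class IdLock:
    """Exclusive access per id: callers for the same block id queue up,
    so one stuck holder blocks every other reader of that block."""
    def __init__(self):
        self._guard = threading.Lock()
        self._entries = {}  # block id -> per-id lock

    def acquire(self, block_id):
        with self._guard:
            entry = self._entries.setdefault(block_id, threading.Lock())
        entry.acquire()  # handlers park here, like the waits on IdLock$Entry
        return entry

    def release(self, entry):
        entry.release()

lock = IdLock()
finished = []

def handler(n):
    entry = lock.acquire("blk_1195579097")  # everyone wants the same block
    finished.append(n)
    lock.release(entry)

# One "stuck" reader holds the entry; three handlers queue behind it.
holder = lock.acquire("blk_1195579097")
threads = [threading.Thread(target=handler, args=(n,)) for n in range(3)]
for t in threads:
    t.start()
time.sleep(0.2)
assert finished == []  # nobody progresses while the block read is stuck
lock.release(holder)   # once the read completes, the queue drains
for t in threads:
    t.join()
```

If the stuck read never completes (as in the incident above), the queued handlers never drain, which matches the flush loop repeating until the RS was rebooted.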