>> or maybe you run HDFS-balancer?
Yes, we run the HDFS balancer almost all of the time, and we often see similar exceptions in other logs. So I don't think this is the reason for the failure.

>> Are all your RS collocated with DN?
Yes, all of them.

>> can we get logs during "stuck time" from DN located here: /10.0.240.163
Unfortunately, our log rolling strategy doesn't allow us to retrieve logs for the day of the failure.

Question from Ted:
>> > Is it possible for you to upgrade to 0.98.10+
That's a very expensive upgrade for us. To initiate an upgrade like this, we would have to prove that it actually fixes the current issue. That's why I'm asking about the root cause here.
Best regards,
Konstantin Chudinov
Software Engineer
Grid Dynamics

On 24 Jul 2015, at 15:23, Serega Sheypak <[email protected]> wrote:

> Probably the block was being replicated because of a DN failure, and HBase
> was trying to access that replica and got stuck? I can see that the DN
> answers that some blocks are missing.
> Or maybe you run HDFS-balancer?
>
> The other thing is that you should always get read access to HDFS by
> design. You are not allowed to modify a file concurrently: the first writer
> gets a lease, and the NN doesn't grant concurrent leases, if I remember
> correctly...
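
For reference, the single-writer lease behavior described above can be seen
with a minimal sketch like the one below (assumes a reachable HDFS with the
cluster config on the classpath and append enabled; the class name and path
are made up for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LeaseDemo {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/tmp/lease-demo"); // hypothetical test path

        FSDataOutputStream first = fs.create(p, true); // first writer takes the lease
        first.write(new byte[] {1, 2, 3});
        first.hflush(); // flushed data is immediately readable by any client

        try {
          fs.append(p); // second writer: the NN refuses a concurrent lease
        } catch (java.io.IOException expected) {
          // typically an AlreadyBeingCreatedException from the NameNode
          System.out.println("concurrent lease refused: " + expected.getMessage());
        } finally {
          first.close(); // releases the lease
        }
      }
    }

The key point for this thread: leases only constrain writers, so reads should
never be blocked by them.
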
> See what happens with block 1099777976128
>
> RS:
> 2015-07-19 07:25:08,533 INFO org.apache.hadoop.hbase.regionserver.HStore:
> Starting compaction of 2 file(s) in i of
> table7,\x8C\xA0,1435936455217.12a2d1e37fd8f0f9870fc1b5afd6046d. into
> tmpdir=hdfs://server1/hbase/data/default/table7/12a2d1e37fd8f0f9870fc1b5afd6046d/.tmp,
> totalSize=416.0 M
> 2015-07-19 07:25:08,556 WARN org.apache.hadoop.hdfs.BlockReaderFactory:
> BlockReaderFactory(fileName=/hbase/data/default/table7/12a2d1e37fd8f0f9870fc1b5afd6046d/i/983cf03fddfa480f92346f25a61b0b9e,
> block=BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128):
> unknown response code ERROR while attempting to set up short-circuit
> access. Block
> BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128
> is not valid
> 2015-07-19 07:25:08,556 WARN org.apache.hadoop.hdfs.client.ShortCircuitCache:
> ShortCircuitCache(0x6b1f04e2): failed to load
> 1195579097_BP-1892992341-10.10.122.111-1352825964285
> 2015-07-19 07:25:08,557 WARN org.apache.hadoop.hdfs.BlockReaderFactory: I/O
> error constructing remote block reader.
> java.io.IOException: Got error for OP_READ_BLOCK, self=/10.0.241.39:53420,
> remote=/10.0.241.39:50010, for file
> /hbase/data/default/table7/12a2d1e37fd8f0f9870fc1b5afd6046d/i/983cf03fddfa480f92346f25a61b0b9e,
> for pool BP-1892992341-10.10.122.111-1352825964285 block 1195579097_1099777976128
>   at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:432)
>   at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:397)
>   at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:786)
>   at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:665)
>   at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:325)
>   at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:566)
>   at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:789)
>   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:836)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
>   at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1210)
>   at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockDataInternal(HFileBlock.java:1483)
>   at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1314)
>   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:355)
>   at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.seekTo(HFileReaderV2.java:1052)
>   at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:244)
>   at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:152)
>   at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:317)
>   at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:240)
>   at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:202)
>   at org.apache.hadoop.hbase.regionserver.compactions.Compactor.createScanner(Compactor.java:257)
>   at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:65)
>   at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:109)
>   at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1080)
>   at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1482)
>   at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:475)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-07-19 07:25:08,558 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> connect to /10.0.241.39:50010 for block, add to deadNodes and continue.
> java.io.IOException: Got error for OP_READ_BLOCK, self=/10.0.241.39:53420,
> remote=/10.0.241.39:50010, for file
> /hbase/data/default/table7/12a2d1e37fd8f0f9870fc1b5afd6046d/i/983cf03fddfa480f92346f25a61b0b9e,
> for pool BP-1892992341-10.10.122.111-1352825964285 block 1195579097_1099777976128
> java.io.IOException: Got error for OP_READ_BLOCK, self=/10.0.241.39:53420,
> remote=/10.0.241.39:50010, for file
> /hbase/data/default/table7/12a2d1e37fd8f0f9870fc1b5afd6046d/i/983cf03fddfa480f92346f25a61b0b9e,
> for pool BP-1892992341-10.10.122.111-1352825964285 block 1195579097_1099777976128
>   [stack trace identical to the one above]
> 2015-07-19 07:25:08,559 INFO org.apache.hadoop.hdfs.DFSClient: Successfully
> connected to /10.0.240.163:50010 for
> BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128
> 2015-07-19 07:25:12,382 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> regionserver60020.periodicFlusher requesting flush for region
> webpage_table,c7000000,1432632712751.736fd216603c2368a7001f34c944a7a0. after
> a delay of 17793
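
Note the sequence in the RS log above: short-circuit and then TCP reads
against the local DN (10.0.241.39) fail, and the client falls back to the
replica on 10.0.240.163, which succeeds. To cross-check which DataNodes the
NameNode currently reports for the affected HFile's blocks, a minimal sketch
(the class name is made up; run with the cluster configuration on the
classpath):

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WhereAreMyBlocks {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path hfile = new Path(args[0]); // e.g. the HFile path from the WARN above
        FileStatus st = fs.getFileStatus(hfile);
        // One BlockLocation per block; hosts are the DNs the NN currently reports.
        for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
          System.out.println("offset=" + loc.getOffset()
              + " len=" + loc.getLength()
              + " hosts=" + Arrays.toString(loc.getHosts()));
        }
      }
    }

If the local DN still appears in the list while its replica is gone on disk,
the client's failure/fallback above is exactly what you'd expect.
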
>
> DN:
> 2015-07-19 07:25:08,556 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> opReadBlock
> BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128
> received exception
> org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not
> found for
> BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128
> 2015-07-19 07:25:08,557 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(10.0.241.39,
> datanodeUuid=b5972c62-421f-4eab-9ae7-641b8019d406, infoPort=50075,
> ipcPort=50020,
> storageInfo=lv=-55;cid=cluster10;nsid=1415935480;c=1418135836666):Got
> exception while serving
> BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128 to
> /10.0.241.39:53420
> org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not
> found for
> BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128
>   at org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:419)
>   at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:228)
>   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:466)
>   at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:110)
>   at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
>   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)
>   at java.lang.Thread.run(Thread.java:745)
> 2015-07-19 07:25:08,557 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> hbase101.nnn.pvt:50010:DataXceiver error processing READ_BLOCK operation
> src: /10.0.241.39:53420 dest: /10.0.241.39:50010
> org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not
> found for
> BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128
>   [stack trace identical to the one above]
> 2015-07-19 07:25:09,022 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> Receiving
> BP-1892992341-10.10.122.111-1352825964285:blk_1201101758_1099783498791 src:
> /10.0.241.39:53422 dest: /10.0.241.39:50010
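
So this DN no longer has the replica it is being asked for, even though the
read eventually succeeds elsewhere. A cheap client-side check that every
block of a suspect file is still servable from at least one live replica is
to read it end to end through the public FileSystem API; if some block has
no live replica, the read surfaces the same error path as above (e.g. a
BlockMissingException). A minimal sketch, with a made-up class name:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadProbe {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataInputStream in = fs.open(new Path(args[0]))) {
          byte[] buf = new byte[64 * 1024];
          long total = 0;
          // Drain the file; any unservable block fails the read with an IOException.
          for (int n; (n = in.read(buf)) > 0; ) {
            total += n;
          }
          System.out.println("readable bytes: " + total);
        }
      }
    }
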
>
> Are all your RS collocated with DN? Or do you have RS which don't have a
> local DN?
> Can we get logs during the "stuck time" from the DN located here:
> /10.0.240.163
>
> 2015-07-24 13:49 GMT+02:00 Ted Yu <[email protected]>:
>
> > Is it possible for you to upgrade to 0.98.10+ ?
> >
> > I will take a look at your logs later.
> >
> > Thanks
> >
> > Friday, July 24, 2015, 7:15 PM +0800 from Konstantin Chudinov
> > <[email protected]>:
> >Hello Ted,
> >Thank you for your answer!
> >The Hadoop and HBase versions are:
> >2.3.0-cdh5.1.0 - the Hadoop (and HDFS) version
> >hbase-0.98.1
> >About HDFS: I don't see anything special in the logs. I've attached them
> >to this message. By the way, this is a different server that also crashed
> >(I've lost the HDFS logs of the previous server), so its HBase logs are in
> >the archive as well.
> >
> >Best regards,
> >
> >Konstantin Chudinov
> >
> >On 23 Jul 2015, at 20:44, Ted Yu <[email protected]> wrote:
> >>
> >>What release of HBase do you use?
> >>
> >>I looked at the two log files but didn't find that information.
> >>In the log for node 118, I saw something such as the following:
> >>Failed to connect to /10.0.229.16:50010 for block, add to deadNodes and
> >>continue
> >>
> >>Was HDFS healthy around the time the region server got stuck?
> >>
> >>Cheers
> >>
> >>
> >>Friday, July 24, 2015, 12:21 AM +0800 from Konstantin Chudinov
> >><[email protected]>:
> >>>Hi all,
> >>>Our team ran into a cascade of stuck RegionServers. The RS logs are
> >>>similar to those in HBASE-10499
> >>>(https://issues.apache.org/jira/browse/HBASE-10499), except that there
> >>>is no RegionTooBusyException before the flush loop:
> >>>2015-07-19 07:32:41,961 INFO org.apache.hadoop.hbase.regionserver.HStore:
> >>>Completed major compaction of 2 file(s) in s of table4,\xC7
> >>>,1390920313296.9f554d5828cfa9689de27c1a42d844e3. into
> >>>65dae45c82264b4d80fc7ed0818a4094(size=1.2 M), total size for store is
> >>>1.2 M. This selection was in queue for 0sec, and took 0sec to execute.
> >>>2015-07-19 07:32:41,961 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread:
> >>>Completed compaction: Request = regionName=table4,\xC7
> >>>,1390920313296.9f554d5828cfa9689de27c1a42d844e3., storeName=s,
> >>>fileCount=2, fileSize=1.2 M, priority=998, time=24425664829680753;
> >>>duration=0sec
> >>>2015-07-19 07:32:41,962 INFO org.apache.hadoop.hbase.regionserver.compactions.RatioBasedCompactionPolicy:
> >>>Default compaction algorithm has selected 1 files from 1 candidates
> >>>2015-07-19 07:32:44,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> >>>regionserver60020.periodicFlusher requesting flush for region
> >>>webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825.
> >>>after a delay of 18943
> >>>2015-07-19 07:32:54,765 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> >>>regionserver60020.periodicFlusher requesting flush for region
> >>>webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825.
> >>>after a delay of 4851
> >>>2015-07-19 07:33:04,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> >>>regionserver60020.periodicFlusher requesting flush for region
> >>>webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825.
> >>>after a delay of 7466
> >>>2015-07-19 07:33:14,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> >>>regionserver60020.periodicFlusher requesting flush for region
> >>>webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825.
> >>>after a delay of 4940
> >>>2015-07-19 07:33:24,765 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> >>>regionserver60020.periodicFlusher requesting flush for region
> >>>webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825.
> >>>after a delay of 12909
> >>>2015-07-19 07:33:34,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> >>>regionserver60020.periodicFlusher requesting flush for region
> >>>webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825.
> >>>after a delay of 5897
> >>>2015-07-19 07:33:44,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> >>>regionserver60020.periodicFlusher requesting flush for region
> >>>webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825.
> >>>after a delay of 9110
> >>>2015-07-19 07:33:54,764 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> >>>regionserver60020.periodicFlusher requesting flush for region
> >>>webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825.
> >>>after a delay of 7109
> >>>....
> >>>and so on, until we rebooted the RS at 10:08.
> >>>Eight servers got stuck at the same time.
> >>>I haven't found anything in the HMaster's logs. Thread dumps show that
> >>>many threads (including the flush thread) are waiting for a read lock
> >>>while accessing HDFS:
> >>>"RpcServer.handler=19,port=60020" - Thread t@90
> >>>  java.lang.Thread.State: WAITING
> >>>  at java.lang.Object.wait(Native Method)
> >>>  - waiting on <77770184> (a org.apache.hadoop.hbase.util.IdLock$Entry)
> >>>  at java.lang.Object.wait(Object.java:503)
> >>>  at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:79)
> >>>  at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:319)
> >>>  at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253)
> >>>  at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:494)
> >>>  at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:542)
> >>>  at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:257)
> >>>  at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:173)
> >>>  at org.apache.hadoop.hbase.regionserver.StoreFileScanner.enforceSeek(StoreFileScanner.java:377)
> >>>  at org.apache.hadoop.hbase.regionserver.KeyValueHeap.pollRealKV(KeyValueHeap.java:347)
> >>>  at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:304)
> >>>  at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269)
> >>>  at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:695)
> >>>  at org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
> >>>  at org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.requestSeek(NonLazyKeyValueScanner.java:39)
> >>>  at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:311)
> >>>  at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269)
> >>>  at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3987)
> >>>  at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3814)
> >>>  at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3805)
> >>>  at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3136)
> >>>  - locked <1623a240> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
> >>>  at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
> >>>  at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
> >>>  at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
> >>>  at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
> >>>  at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
> >>>  at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
> >>>  at java.lang.Thread.run(Thread.java:745)
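
The WAITING handler above is parked in HBase's per-block id lock: the first
thread to request a given HFile block becomes its loader and holds the
IdLock entry while reading from HDFS, and every other thread asking for the
same block waits on that entry. If the loader never returns from HDFS I/O,
all waiters hang exactly as in this dump. A simplified model of that
contract (not the real org.apache.hadoop.hbase.util.IdLock, just a sketch):

    import java.util.concurrent.ConcurrentHashMap;

    public class SimpleIdLock {
      public static final class Entry {
        private final long id;
        private boolean locked = true; // true while the loader owns this id
        private Entry(long id) { this.id = id; }
      }

      private final ConcurrentHashMap<Long, Entry> entries =
          new ConcurrentHashMap<Long, Entry>();

      /** Returns once the caller exclusively owns the given block id. */
      public Entry getLockEntry(long id) throws InterruptedException {
        Entry mine = new Entry(id);
        while (true) {
          Entry existing = entries.putIfAbsent(id, mine);
          if (existing == null) {
            return mine; // we are the loader for this block id
          }
          synchronized (existing) {
            // If the holder is stuck in HDFS I/O forever, every thread that
            // wants this block parks here (java.lang.Object.wait in the dump).
            while (existing.locked) {
              existing.wait();
            }
          }
        }
      }

      public void releaseLockEntry(Entry entry) {
        synchronized (entry) {
          entry.locked = false;
          entries.remove(entry.id, entry);
          entry.notifyAll(); // wake all waiters; they retry putIfAbsent
        }
      }
    }

So one loader thread blocked inside DFSInputStream is enough to stall an
unbounded number of handlers that happen to want the same block.
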
> >>>"RpcServer.handler=29,port=60020" - Thread t@100
> >>>  java.lang.Thread.State: BLOCKED
> >>>  at org.apache.hadoop.hdfs.DFSInputStream.getFileLength(DFSInputStream.java:354)
> >>>  - waiting to lock <399a6ff3> (a org.apache.hadoop.hdfs.DFSInputStream)
> >>>  owned by "RpcServer.handler=21,port=60020" t@92
> >>>  at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1270)
> >>>  at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:90)
> >>>  at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1224)
> >>>  at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockDataInternal(HFileBlock.java:1432)
> >>>  at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1314)
> >>>  at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:355)
> >>>  at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253)
> >>>  at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:494)
> >>>  at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:542)
> >>>  at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:257)
> >>>  at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:173)
> >>>  at org.apache.hadoop.hbase.regionserver.StoreFileScanner.enforceSeek(StoreFileScanner.java:377)
> >>>  at org.apache.hadoop.hbase.regionserver.KeyValueHeap.pollRealKV(KeyValueHeap.java:347)
> >>>  at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:304)
> >>>  at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269)
> >>>  at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:695)
> >>>  at org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:683)
> >>>  at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:533)
> >>>  at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:140)
> >>>  at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:3866)
> >>>  at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateFromJoinedHeap(HRegion.java:3840)
> >>>  at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3995)
> >>>  at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3814)
> >>>  at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3805)
> >>>  at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3136)
> >>>  - locked <3af54140> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
> >>>  at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
> >>>  at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
> >>>  at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
> >>>  at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
> >>>  at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
> >>>  at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
> >>>  at java.lang.Thread.run(Thread.java:745)
> >>>  Locked ownable synchronizers:
> >>>  - locked <5320bfc4> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
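
The BLOCKED handler above is a different flavor of the same pile-up: it is
waiting for the monitor of a single DFSInputStream held by another handler,
because the stateful read path (seek + read) of DFSInputStream is
synchronized, so all scanners sharing one open stream serialize behind
whichever thread is currently inside it. Positional reads (pread) carry the
offset explicitly and don't contend on that shared cursor. A sketch of the
two styles against the public FSDataInputStream API (the external lock in
the first method just makes explicit the serialization DFSInputStream does
internally):

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;

    public final class ReadStyles {
      // Stateful read: callers contend for the stream's monitor and cursor.
      static int statefulRead(FSDataInputStream in, long off, byte[] buf)
          throws IOException {
        synchronized (in) {
          in.seek(off);
          return in.read(buf, 0, buf.length);
        }
      }

      // Positional read (pread): no shared cursor, safe for concurrent callers.
      static int pread(FSDataInputStream in, long off, byte[] buf)
          throws IOException {
        return in.read(off, buf, 0, buf.length);
      }
    }

That is why one reader stuck inside the stream (here, handler=21) can back
up every other handler using the same stream, which matches this dump.
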
> >>>I have zipped all the logs and dumps and attached them to this mail.
> >>>This problem occurs once a month on our cluster.
> >>>Does anybody know the reason for this cascading server failure?
> >>>Thank you in advance!
> >>>
> >>>Konstantin Chudinov
>
