We run an HBase cluster (version 0.98.1+cdh5.1.0) with automatic
compaction enabled. I have noticed a few times that compaction gets stuck
under the following circumstances:
1. A server in the cluster dies hard and is physically down.
2. At the same time, one or more region servers are running a major
compaction and requesting data blocks from the dead server. The following
exception appears in the region server log:
2015-03-16 03:51:19,621 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.0.xx.xx:50010 for block, add to deadNodes and continue.
java.net.NoRouteToHostException: No route to host
java.net.NoRouteToHostException: No route to host
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
    at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2765)
    at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:746)
    at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:661)
    at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:325)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:566)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:789)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:836)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock.readWithExtra(HFileBlock.java:563)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1215)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockDataInternal(HFileBlock.java:1432)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1314)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:355)
    at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:494)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:515)
    at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:237)
    at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:152)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:317)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:176)
    at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:1761)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:3734)
    at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1950)
    at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1936)
    at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1913)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3068)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
3. After a few retries, the compaction makes no progress and runs for hours
until we kill it manually.
4. During this time, the region is unreachable from clients; they always
see a TimeoutException.
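One workaround I am considering is tightening the HDFS client timeouts and
retry counts so the DFSClient gives up on the dead datanode sooner. This is
only a sketch: I am assuming these standard Hadoop properties govern the
connect/read path the region server's DFSClient uses, and the values below
are illustrative guesses, not something I have tested:

```xml
<!-- hdfs-site.xml / core-site.xml sketch (assumed properties, untested values) -->
<property>
  <!-- DFSClient socket read timeout in ms (Hadoop default: 60000) -->
  <name>dfs.client.socket-timeout</name>
  <value>20000</value>
</property>
<property>
  <!-- Per-attempt connect timeout in ms (Hadoop default: 20000) -->
  <name>ipc.client.connect.timeout</name>
  <value>5000</value>
</property>
<property>
  <!-- Retries after a connect timeout (Hadoop default: 45) -->
  <name>ipc.client.connect.max.retries.on.timeouts</name>
  <value>3</value>
</property>
```

I am not sure whether these alone would unstick the compaction, or whether
the dead node needs to be detected and avoided at a higher level.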
Any thoughts on this issue, or workarounds I could apply? Any feedback
is greatly appreciated.
Chen