Hi,
We are testing HBase for some read-heavy batch operations and are encountering
frequent, silent RegionServer crashes. The application performs many thousands
of very selective row scans, via Thrift, over a dataset of several hundred
million rows (under 200 GB overall).
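For concreteness, the access pattern looks roughly like the sketch below,
written against the 0.94-era Thrift1 interface (the host, table name, and row
keys are placeholders; our real code differs in detail):

    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import org.apache.hadoop.hbase.thrift.generated.Hbase;
    import org.apache.hadoop.hbase.thrift.generated.TRowResult;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class SelectiveScan {
        public static void main(String[] args) throws Exception {
            TTransport transport = new TSocket("thrift-host", 9090); // placeholder host
            transport.open();
            Hbase.Client client = new Hbase.Client(new TBinaryProtocol(transport));

            ByteBuffer table = ByteBuffer.wrap("mytable".getBytes("UTF-8")); // placeholder table
            // Each scan covers a narrow, bounded key range.
            int scannerId = client.scannerOpenWithStop(
                    table,
                    ByteBuffer.wrap("record-00042|".getBytes("UTF-8")), // placeholder start key
                    ByteBuffer.wrap("record-00042}".getBytes("UTF-8")), // placeholder stop key
                    new ArrayList<ByteBuffer>(),                        // all column families
                    new HashMap<ByteBuffer, ByteBuffer>());             // no attributes
            try {
                List<TRowResult> rows;
                while (!(rows = client.scannerGetList(scannerId, 100)).isEmpty()) {
                    // ... process rows ...
                }
            } finally {
                client.scannerClose(scannerId);
                transport.close();
            }
        }
    }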
We have encountered this behavior both on a physical cluster running CDH4 and
on an Amazon EMR HBase cluster. HBase, Thrift, HDFS, etc. all run with very
minimal customization of the default configurations (in EMR's case,
effectively no customization at all).
The logs contain some clues, but nothing that straightforwardly acknowledges
the root cause of the failure. As such it is somewhat scary, and we'd really
appreciate any advice or suggestions on cause, prevention, etc.
The post-mortem pattern goes like this:
1) RegionServer logs a WARN to the effect of:
org.apache.hadoop.ipc.HBaseServer (responseTooLarge):
{"processingtimems":2997,"call":"next(-6754667650407290823, 1000), rpc version=1, client version=29, methodsFingerPrint=1891768260","client":"somehost:55253","starttimems":1359399938999,"queuetimems":0,"class":"HRegionServer","responsesize":231379184,"method":"next"}
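If it helps to read that entry: responsesize works out to roughly 230 MB for a
single next() call, and the 1000 is presumably the row caching the Thrift
server's embedded client scanner uses on our behalf. One thing we're now
experimenting with is a much smaller caching value, so no single RPC response
can grow that large; a minimal sketch against the plain 0.92/0.94 Java client
(table and row keys are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SmallBatchScan {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Client-wide default; the Thrift server reads the same property
            // from its own hbase-site.xml.
            conf.setInt("hbase.client.scanner.caching", 100);

            HTable table = new HTable(conf, "mytable"); // placeholder table name
            Scan scan = new Scan(Bytes.toBytes("start"), Bytes.toBytes("stop")); // placeholder keys
            scan.setCaching(100); // rows fetched per next() RPC, rather than 1000
            scan.setBatch(1000);  // cap KeyValues per Result, in case rows are very wide

            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    // ... process r ...
                }
            } finally {
                scanner.close();
                table.close();
            }
        }
    }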
2) RegionServer logs one or a few of these at DEBUG, in the second or two
before it goes silent:
org.apache.hadoop.hdfs.DFSClient: Error making BlockReader. Closing stale Socket[addr=/thishost,port=50010,localport=51345]
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
    at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:407)
    at org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:105)
    at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:937)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:455)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:645)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:689)
    at java.io.DataInputStream.read(DataInputStream.java:132)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock.readWithExtra(HFileBlock.java:454)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1062)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1318)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:266)
    at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:209)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.seekTo(HFileReaderV2.java:519)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.seekTo(HFileReaderV2.java:534)
    at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:178)
    at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:111)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:83)
    at org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:1779)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:2915)
    at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1453)
    at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1445)
    at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1421)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:2079)
    at sun.reflect.GeneratedMethodAccessor210.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1345)
3) HDFS logs this at INFO:
org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for BP-1555962259-thisIP-1358284735183:blk_-5783891561030721738_125644
java.io.IOException: Premature EOF from inputStream
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:635)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:66)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
    at java.lang.Thread.run(Thread.java:662)
4) HDFS logs this at ERROR:
org.apache.hadoop.hdfs.server.datanode.DataNode: thishost:50010:DataXceiver error processing WRITE_BLOCK operation src: /thishost:50620 dest: /thishost:50010
java.io.IOException: Premature EOF from inputStream
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:635)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:66)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
    at java.lang.Thread.run(Thread.java:662)
5) By now the RegionServer has stopped, usually taking the Thrift server down
with it (connections are refused).
One final note: we found a bug in our code that occasionally performed long,
unbounded table scans. After fixing it, the crashes became much less frequent,
but they still happen.
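For the record, the fix amounted to giving every scan an explicit stop row,
plus a row cap as a safety net. Simplified, with placeholder keys:

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.PageFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BoundedScans {
        // Before the fix, a Scan with only a start row could sweep to the end
        // of the table. Now every scan gets an explicit stop row and a cap.
        static Scan boundedScan(String startKey, String stopKey) {
            Scan scan = new Scan(Bytes.toBytes(startKey), Bytes.toBytes(stopKey));
            // Safety net: each region stops returning rows after ~1000, so a
            // bad key range can no longer turn into a full-table sweep.
            scan.setFilter(new PageFilter(1000));
            return scan;
        }
    }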
Thanks,
Jim