Hi,

We are testing HBase for some read-heavy batch operations and encountering 
frequent, silent RegionServer crashes.  The application performs many thousands 
of very selective row scans against a dataset of several hundred million rows 
(less than 200GB overall), via the Thrift gateway.
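
For concreteness, the access pattern is roughly the following (sketched with 
the happybase Python client purely for illustration; the real client code, 
table name, and key format all differ):

    import happybase

    # Thrift1 gateway; 9090 is the default port, host name is made up.
    connection = happybase.Connection('thrift-gateway-host', 9090)
    table = connection.table('mytable')  # hypothetical table name

    # Each operation is a very selective scan: a narrow key range that
    # typically matches only a handful of the several hundred million rows.
    def fetch_range(start_key, stop_key):
        return list(table.scan(row_start=start_key, row_stop=stop_key))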

We have encountered this behavior both on a physical cluster running CDH4 and 
on an Amazon EMR HBase cluster.  HBase, Thrift, HDFS, etc. are all running 
with very minimal changes to the default configurations (in EMR's case, 
effectively no customization at all).

The logs contain some clues, but nothing that straightforwardly identifies the 
root cause of the failure.  As such it is somewhat scary, and we'd really 
appreciate any advice or suggestions on cause, prevention, etc.


The post-mortem pattern goes like this:


1) RegionServer logs a WARN like the following (a question about the 1000-row 
batch size follows the list):

 org.apache.hadoop.ipc.HBaseServer
 (responseTooLarge):
 {"processingtimems":2997,"call":"next(-6754667650407290823, 1000), rpc version=1, client version=29, methodsFingerPrint=1891768260","client":"somehost:55253","starttimems":1359399938999,"queuetimems":0,"class":"HRegionServer","responsesize":231379184,"method":"next"}

2) RegionServer logs one or a few of these at DEBUG in the second or two 
before it goes silent:

 org.apache.hadoop.hdfs.DFSClient
 Error making BlockReader. Closing stale Socket[addr=/thishost,port=50010,localport=51345]
 java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
        at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:407)
        at org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:105)
        at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:937)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:455)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:645)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:689)
        at java.io.DataInputStream.read(DataInputStream.java:132)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock.readWithExtra(HFileBlock.java:454)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1062)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1318)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:266)
        at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:209)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.seekTo(HFileReaderV2.java:519)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.seekTo(HFileReaderV2.java:534)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:178)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:111)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:83)
        at org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:1779)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:2915)
        at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1453)
        at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1445)
        at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1421)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:2079)
        at sun.reflect.GeneratedMethodAccessor210.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1345)


3) HDFS logs this at INFO:

 org.apache.hadoop.hdfs.server.datanode.DataNode
 Exception in receiveBlock for BP-1555962259-thisIP-1358284735183:blk_-5783891561030721738_125644
 java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:635)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:66)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
        at java.lang.Thread.run(Thread.java:662)

4) HDFS logs this at ERROR (a configuration question about these DataXceiver 
errors also follows the list):

 org.apache.hadoop.hdfs.server.datanode.DataNode
 thishost:50010:DataXceiver error processing WRITE_BLOCK operation  src: /thishost:50620 dest: /thishost:50010
 java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:635)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:66)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
        at java.lang.Thread.run(Thread.java:662)

5) By this point the RegionServer process has stopped, usually taking the 
Thrift server down with it (subsequent connections are refused).
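
On (1): as far as we can tell, the "1000" in next(-6754667650407290823, 1000) 
is the number of rows the client requests per scanner next() call, and those 
1000 rows added up to a ~231MB (231379184-byte) response.  Is capping the 
per-call batch a sensible mitigation?  In happybase terms it would be 
something like this (purely illustrative; 100 is an arbitrary guess, not a 
tested value):

    # Fetch fewer rows per Thrift next() round trip so any single RPC
    # response stays small, at the cost of more round trips.
    for key, data in table.scan(row_start=start_key,
                                row_stop=stop_key,
                                batch_size=100):
        process(key, data)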
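On (3) and (4): we're running with the default datanode transfer-thread limit. 
We've seen dfs.datanode.max.transfer.threads (dfs.datanode.max.xcievers in 
older releases) mentioned in connection with DataXceiver errors under HBase 
load, so we may try raising it, though we have no evidence yet that it's the 
culprit here.  The hdfs-site.xml change we're considering:

    <!-- 4096 is the value commonly suggested for HBase workloads; we
         have not validated it ourselves. -->
    <property>
      <name>dfs.datanode.max.transfer.threads</name>
      <value>4096</value>
    </property>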



One final note: we found a bug in our code that occasionally performed long, 
unbounded table scans.  After fixing it, the crashes became much less 
frequent, but they still happen.
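
For the record, the fix amounted to ensuring every scan carries an explicit 
stop row and a row-count cap, along these lines (illustrative happybase again; 
the 1000 cap is arbitrary):

    # Before: a miscomputed stop condition let table.scan(row_start=...)
    # walk far more of the table than intended.
    # After: every scan is bounded by both key range and row count.
    rows = list(table.scan(row_start=start_key,
                           row_stop=stop_key,  # hard bound on the key range
                           limit=1000))        # cap on total rows returned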


Thanks,
Jim
