On Mon, Jan 28, 2013 at 12:14 PM, Jim Abramson <[email protected]> wrote:
> Hi,
>
> We are testing HBase for some read-heavy batch operations, and
> encountering frequent, silent RegionServer crashes.
'Silent' is interesting. Which files did you check? .log and the .out?
Nothing in the latter?
> The post-mortem pattern goes like this:
>
>
> 1) RegionServer logs a WARN to the effect of:
>
> org.apache.hadoop.ipc.HBaseServer
> (responseTooLarge):
> {"processingtimems":2997,"call":"next(-6754667650407290823, 1000), rpc
> version=1, client version=29,
> methodsFingerPrint=1891768260","client":"somehost:55253","starttimems":1359399938999,"queuetimems":0,"class":"HRegionServer","responsesize":231379184,"method":"next"}
>
>
Seems like we are trying to return > 100MB cell. See around line #386 in
WritableRpcEngine.
Can you figure which region it is when we get this big cell? We could look
at the hfiles under it using hfile too. It reports big cells:
http://hbase.apache.org/book.html#hfile_tool2
St.Ack
> 2) RegionServer logs one or a few of these at DEBUG, in the second or two
> before it goes silent:
>
> org.apache.hadoop.hdfs.DFSClient
>
> Error making BlockReader. Closing stale
> Socket[addr=/thishost,port=50010,localport=51345]
> java.io.EOFException: Premature EOF: no length prefix available
> at
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
>
> at
> org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:407)
> at
> org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:105)
> at
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:937)
> at
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:455)
> at
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:645)
> at
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:689)
> at java.io.DataInputStream.read(DataInputStream.java:132)
> at
> org.apache.hadoop.hbase.io.hfile.HFileBlock.readWithExtra(HFileBlock.java:454)
> at
> org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1062)
> at
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1318)
> at
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:266)
> at
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:209)
> at
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.seekTo(HFileReaderV2.java:519)
> at
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.seekTo(HFileReaderV2.java:534)
> at
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:178)
> at
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:111)
> at
> org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:83)
> at
> org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:1779)
> at
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:2915)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1453)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1445)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1421)
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:2079)
> at sun.reflect.GeneratedMethodAccessor210.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
> org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
> at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1345)
>
>
> 3) HDFS logs this at INFO:
>
> org.apache.hadoop.hdfs.server.datanode.DataNode
> Exception in receiveBlock for
> BP-1555962259-thisIP-1358284735183:blk_-5783891561030721738_125644
> java.io.IOException: Premature EOF from inputStream
> at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
> at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
> at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:635)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:66)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
> at java.lang.Thread.run(Thread.java:662)
>
> 4) HDFS logs this as ERROR:
>
> org.apache.hadoop.hdfs.server.datanode.DataNode
>
> thishost:50010:DataXceiver error processing WRITE_BLOCK operation src:
> /thishost:50620 dest: /thishost:50010
> java.io.IOException: Premature EOF from inputStream
> at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
> at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
> at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:635)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:66)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
> at java.lang.Thread.run(Thread.java:662)
>
> 5) by now RegionServer is stopped, usually taking the Thrift server down
> with it (connection refused).
>
>
>
> One final note; we found a bug in our code which was occasionally
> performing long and unbounded table scans. After fixing that, the crashes
> became much less frequent, but they still happen.
>
>
> Thanks,
> Jim
>
>