Please pastebin the region server log from around the time it died.

What HBase / Hadoop version are you using?
Anything interesting in the master log?

Thanks

On Nov 7, 2014, at 4:57 AM, Jean-Marc Spaggiari <[email protected]> wrote:

> Hi,
>
> Have you checked that your Hadoop is running fine? Have you checked that
> the network between your servers is fine too?
>
> JM
>
> 2014-11-07 5:22 GMT-05:00 [email protected] <[email protected]>:
>
>> I've deployed a "2+4" cluster which has been running normally for a
>> long time. The cluster holds more than 40T of data. When I intentionally
>> shut down the HBase service and try to restart it, the region servers die.
>>
>> The region server log shows that all the regions are opened, but the
>> DataNode logs contain WARN and ERROR entries. Below are the logs in detail:
>>
>> 2014-11-07 14:47:21,584 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.230.63.12:50010, dest: /10.230.63.9:39405, bytes: 4696, op: HDFS_READ, cliID: DFSClient_hb_rs_salve1,60020,1415342303886_-2037622978_29, offset: 31996928, srvID: bb0032a3-1170-4a34-b85b-e2cfa0d56cb2, blockid: BP-1731746090-10.230.63.3-1406195669990:blk_1078709392_4968828, duration: 7978822
>> 2014-11-07 14:47:21,596 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: exception:
>> java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.230.63.12:50010 remote=/10.230.63.11:41511]
>>     at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>>     at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
>>     at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
>>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
>>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:712)
>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:479)
>>     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:110)
>>     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)
>>     at java.lang.Thread.run(Thread.java:744)
>> 2014-11-07 14:47:21,599 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.230.63.12:50010, dest: /10.230.63.11:41511, bytes: 726528, op: HDFS_READ, cliID: DFSClient_hb_rs_salve3,60020,1415342303807_1094119849_29, offset: 0, srvID: bb0032a3-1170-4a34-b85b-e2cfa0d56cb2, blockid: BP-1731746090-10.230.63.3-1406195669990:blk_1078034913_4294168, duration: 480190668115
>> 2014-11-07 14:47:21,599 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.230.63.12, datanodeUuid=bb0032a3-1170-4a34-b85b-e2cfa0d56cb2, infoPort=50075, ipcPort=50020, storageInfo=lv=-55;cid=cluster12;nsid=395652542;c=0):Got exception while serving BP-1731746090-10.230.63.3-1406195669990:blk_1078034913_4294168 to /10.230.63.11:41511
>> java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.230.63.12:50010 remote=/10.230.63.11:41511]
>>     at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>>     at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
>>     at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
>>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
>>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:712)
>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:479)
>>     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:110)
>>     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)
>>     at java.lang.Thread.run(Thread.java:744)
>> 2014-11-07 14:47:21,600 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: salve4:50010:DataXceiver error processing READ_BLOCK operation src: /10.230.63.11:41511 dest: /10.230.63.12:50010
>>
>> I personally think it was caused during the load-on-open stage, where the
>> disk IO of the cluster can be very high and the pressure can be huge.
>>
>> I wonder what causes the read errors while reading HFiles, and what leads
>> to the timeouts. Are there any solutions that can control the speed of
>> load-on-open and reduce the pressure on the cluster?
>>
>> I need help!
>>
>> Thanks!
>>
>> [email protected]
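On the load-on-open question above: assuming this is Hadoop 2.x and a reasonably recent HBase, two settings may be worth looking at. The 480000 ms in the stack traces matches the default DataNode write timeout (dfs.datanode.socket.write.timeout), so the DataNode is giving up on readers that have stalled for 8 minutes, and the number of regions each region server opens in parallel is controlled by hbase.regionserver.executor.openregion.threads. A minimal sketch, assuming those property names apply to your versions; the values are only illustrative, not recommendations:

<!-- hbase-site.xml on the region servers: fewer open-region handler
     threads means fewer regions opened concurrently, which should lower
     the disk IO spike during load-on-open. The usual default is 3; the
     value 1 below is only an example. -->
<property>
  <name>hbase.regionserver.executor.openregion.threads</name>
  <value>1</value>
</property>

<!-- hdfs-site.xml on the DataNodes: this is where the 480000 ms in the
     log comes from (the default write timeout). Raising it would only
     mask the stall, so it is listed here mainly for reference. -->
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>480000</value>
</property>

If lowering the open-region concurrency helps, that would also support the theory that the region servers are simply overwhelming the disks during startup.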
