You should upgrade to 0.94, since you also had issues with row locks; the newer version has an improved miniBatchPut code base.
On Nov 7, 2012, at 9:57 AM, Varun Sharma <[email protected]> wrote:

> Thanks for the response. One more point is that I am running hadoop 1.0.4
> with hbase 0.92 - not sure if that is known to have these issues.
>
> I had one quick question though - these logs are picked from 10.31.138.145
> and from my understanding of the logs below, it is still going to another
> bad datanode for retrieving the block even though it should already have
> the data block - see last line...
>
> 12/11/07 02:17:45 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor
> exception for block blk_2813460962462751946_78454 java.io.IOException: Bad
> response 1 for block blk_2813460962462751946_78454 from datanode
> 10.31.190.107:9200
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:3084)
>
> 12/11/07 02:17:45 WARN hdfs.DFSClient: Error Recovery for block
> blk_2813460962462751946_78454 bad datanode[1] 10.31.190.107:9200
> 12/11/07 02:17:45 WARN hdfs.DFSClient: Error Recovery for block
> blk_2813460962462751946_78454 in pipeline 10.31.138.245:9200,
> 10.31.190.107:9200, 10.159.19.90:9200: bad datanode 10.31.190.107:9200
>
> Looking at the DataNode logs - it seems that the local datanode is trying
> to connect to the remote bad datanode. Is this for replicating the WALEdit?
>
> 2012-11-07 02:17:45,142 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
> (PacketResponder 2 for Block blk_2813460962462751946_78454):
> PacketResponder blk_2813460962462751946_78454 2 Exception
> java.net.SocketTimeoutException: 66000 millis timeout while waiting for
> channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.31.138.245:33965
> remote=/10.31.190.107:9200]
>
> Also, this is preceded by a whole bunch of slow operations with
> processingtimems close to 20 seconds like these - are these other slow
> WALEdit appends (slowed down due to HDFS)?
>
> 12/11/07 02:16:01 WARN ipc.HBaseServer: (responseTooSlow):
> {"processingtimems":21957,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@7198c05d),
> rpc version=1, client version=29, methodsFingerPrint=54742778",
> "client":"10.31.128.131:55327","starttimems":1352254539935,"queuetimems":0,
> "class":"HRegionServer","responsesize":0,"method":"multi"}
>
> Thanks
> Varun
>
> On Wed, Nov 7, 2012 at 7:25 AM, Jeremy Carroll <[email protected]> wrote:
>
>> Sorry. It's early in the morning here. Did not see the 'read timeout'. +1
>> to Nicolas's response.
>>
>> On Wed, Nov 7, 2012 at 7:22 AM, Jeremy Carroll <[email protected]> wrote:
>>
>>> One trick I have used for a while is to
>>> set dfs.datanode.socket.write.timeout in hdfs-site.xml to 0 (disabled).
>>> It's not going to solve your underlying IOPS capacity issue with your
>>> servers, but it can help for short bursty periods. Basically it's hiding
>>> the real issue, but it can help in the short term.
>>>
>>> On Wed, Nov 7, 2012 at 1:43 AM, Varun Sharma <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am seeing extremely long HDFS timeouts - and this seems to be
>>>> associated with the loss of a DataNode.
>>>> Here is the RS log:
>>>>
>>>> 12/11/07 02:17:45 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor
>>>> exception for block blk_2813460962462751946_78454 java.io.IOException: Bad
>>>> response 1 for block blk_2813460962462751946_78454 from datanode
>>>> 10.31.190.107:9200
>>>>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:3084)
>>>>
>>>> 12/11/07 02:17:45 WARN hdfs.DFSClient: Error Recovery for block
>>>> blk_2813460962462751946_78454 bad datanode[1] 10.31.190.107:9200
>>>> 12/11/07 02:17:45 WARN hdfs.DFSClient: Error Recovery for block
>>>> blk_2813460962462751946_78454 in pipeline 10.31.138.245:9200,
>>>> 10.31.190.107:9200, 10.159.19.90:9200: bad datanode 10.31.190.107:9200
>>>> 12/11/07 02:17:45 WARN wal.HLog: IPC Server handler 35 on 60020 took 65955
>>>> ms appending an edit to hlog; editcount=476686, len~=76.0
>>>> 12/11/07 02:17:45 WARN wal.HLog: HDFS pipeline error detected. Found 2
>>>> replicas but expecting no less than 3 replicas. Requesting close of hlog.
>>>>
>>>> The corresponding DN log goes like this:
>>>>
>>>> 2012-11-07 02:17:45,142 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
>>>> (PacketResponder 2 for Block blk_2813460962462751946_78454): PacketResponder
>>>> blk_2813460962462751946_78454 2 Exception java.net.SocketTimeoutException:
>>>> 66000 millis timeout while waiting for channel to be ready for read. ch :
>>>> java.nio.channels.SocketChannel[connected local=/10.31.138.245:33965
>>>> remote=/10.31.190.107:9200]
>>>>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>>>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>>>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>>>>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>>>>         at java.io.DataInputStream.readLong(DataInputStream.java:399)
>>>>         at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:124)
>>>>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:806)
>>>>         at java.lang.Thread.run(Thread.java:662)
>>>>
>>>> It seems like the DataNode local to the region server is trying to grab
>>>> the block from another DN and that is timing out because of this other
>>>> data node being bad. All in all this causes response times to be terribly
>>>> poor. Is there a way around this or am I missing something?
>>>>
>>>> Varun
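For reference, a minimal hdfs-site.xml sketch of the workaround Jeremy describes above. Only dfs.datanode.socket.write.timeout is named in the thread; the dfs.socket.timeout entry is an assumption on my part (it is the stock 60-second read timeout in Hadoop 1.x, which appears to line up with the 66000 ms seen in the logs once a small per-pipeline-node extension is added) and is shown for context only, not as a recommended change.

  <configuration>
    <!-- Workaround discussed above: disable the DataNode write-side socket
         timeout entirely (0 = no timeout). This hides slow or overloaded
         DataNodes rather than fixing the underlying IOPS problem. -->
    <property>
      <name>dfs.datanode.socket.write.timeout</name>
      <value>0</value>
    </property>

    <!-- Assumed for context: the read-side socket timeout used by the DFS
         client and DataNodes. 60000 ms is the stock Hadoop 1.x default; the
         ~66000 ms timeouts in the logs appear to be this base value plus a
         small extension per node remaining in the pipeline. -->
    <property>
      <name>dfs.socket.timeout</name>
      <value>60000</value>
    </property>
  </configuration>

Raising or disabling these timeouts only buys time while the bad DataNode is still in the pipeline; the pipeline recovery itself (excluding the bad node and continuing with fewer replicas) is what ultimately unblocks the WAL append.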
