You should upgrade to 0.94, since you also had issues with row locks; the newer version has an improved miniBatchPut code base.
On Nov 7, 2012, at 9:57 AM, Varun Sharma <[email protected]> wrote:

> Thanks for the response. One more point is that I am running hadoop 1.0.4
> with hbase 0.92 - not sure if that is known to have these issues.
>
> I had one quick question though - these logs are picked from 10.31.138.145
> and from my understanding of the logs below, it is still going to another
> bad datanode for retrieving the block even though it should already have
> the data block - see last line...
>
> 12/11/07 02:17:45 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor
> exception for block blk_2813460962462751946_78454 java.io.IOException: Bad
> response 1 for block blk_2813460962462751946_78454 from datanode
> 10.31.190.107:9200
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:3084)
>
> 12/11/07 02:17:45 WARN hdfs.DFSClient: Error Recovery for block
> blk_2813460962462751946_78454 bad datanode[1] 10.31.190.107:9200
> 12/11/07 02:17:45 WARN hdfs.DFSClient: Error Recovery for block
> blk_2813460962462751946_78454 in pipeline 10.31.138.245:9200,
> 10.31.190.107:9200, 10.159.19.90:9200: bad datanode 10.31.190.107:9200
>
> Looking at the DataNode logs - it seems that the local datanode is trying
> to connect to the remote bad datanode. Is this for replicating the WALEdit?
>
> 2012-11-07 02:17:45,142 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
> (PacketResponder 2 for Block blk_2813460962462751946_78454):
> PacketResponder blk_2813460962462751946_78454 2 Exception
> java.net.SocketTimeoutException: 66000 millis timeout while waiting for
> channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.31.138.245:33965
> remote=/10.31.190.107:9200]
>
> Also, this is preceded by a whole bunch of slow operations with
> processingtimems close to 20 seconds like these - are these other slow
> WALEdit appends (slowed down due to HDFS)?
>
> 12/11/07 02:16:01 WARN ipc.HBaseServer: (responseTooSlow):
> {"processingtimems":21957,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@7198c05d),
> rpc version=1, client version=29, methodsFingerPrint=54742778",
> "client":"10.31.128.131:55327","starttimems":1352254539935,"queuetimems":0,
> "class":"HRegionServer","responsesize":0,"method":"multi"}
>
> Thanks
> Varun
>
> On Wed, Nov 7, 2012 at 7:25 AM, Jeremy Carroll <[email protected]> wrote:
>
>> Sorry. It's early in the morning here. Did not see the 'read timeout'. +1
>> to Nicolas's response.
>>
>> On Wed, Nov 7, 2012 at 7:22 AM, Jeremy Carroll <[email protected]> wrote:
>>
>>> One trick I have used for a while is to
>>> set dfs.datanode.socket.write.timeout in hdfs-site.xml to 0 (disabled).
>>> It's not going to solve your underlying IOPS capacity issue with your
>>> servers, but it can help for short bursty periods. Basically it's hiding
>>> the real issue, but it can help in the short term.
>>>
>>> On Wed, Nov 7, 2012 at 1:43 AM, Varun Sharma <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am seeing extremely long HDFS timeouts - and this seems to be
>>>> associated with the loss of a DataNode.
>>>> Here is the RS log:
>>>>
>>>> 12/11/07 02:17:45 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor
>>>> exception for block blk_2813460962462751946_78454 java.io.IOException: Bad
>>>> response 1 for block blk_2813460962462751946_78454 from datanode
>>>> 10.31.190.107:9200
>>>>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:3084)
>>>>
>>>> 12/11/07 02:17:45 WARN hdfs.DFSClient: Error Recovery for block
>>>> blk_2813460962462751946_78454 bad datanode[1] 10.31.190.107:9200
>>>> 12/11/07 02:17:45 WARN hdfs.DFSClient: Error Recovery for block
>>>> blk_2813460962462751946_78454 in pipeline 10.31.138.245:9200,
>>>> 10.31.190.107:9200, 10.159.19.90:9200: bad datanode 10.31.190.107:9200
>>>> 12/11/07 02:17:45 WARN wal.HLog: IPC Server handler 35 on 60020 took 65955
>>>> ms appending an edit to hlog; editcount=476686, len~=76.0
>>>> 12/11/07 02:17:45 WARN wal.HLog: HDFS pipeline error detected. Found 2
>>>> replicas but expecting no less than 3 replicas. Requesting close of hlog.
>>>>
>>>> The corresponding DN log goes like this:
>>>>
>>>> 2012-11-07 02:17:45,142 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
>>>> (PacketResponder 2 for Block blk_2813460962462751946_78454): PacketResponder
>>>> blk_2813460962462751946_78454 2 Exception java.net.SocketTimeoutException:
>>>> 66000 millis timeout while waiting for channel to be ready for read. ch :
>>>> java.nio.channels.SocketChannel[connected local=/10.31.138.245:33965
>>>> remote=/10.31.190.107:9200]
>>>>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>>>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>>>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>>>>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>>>>         at java.io.DataInputStream.readLong(DataInputStream.java:399)
>>>>         at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:124)
>>>>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:806)
>>>>         at java.lang.Thread.run(Thread.java:662)
>>>>
>>>> It seems like the DataNode local to the region server is trying to grab
>>>> the block from another DN and that is timing out because of this other
>>>> data node being bad. All in all this causes response times to be terribly
>>>> poor. Is there a way around this or am I missing something?
>>>>
>>>> Varun
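For reference, a minimal hdfs-site.xml sketch of the workaround Jeremy describes above. Only dfs.datanode.socket.write.timeout is named in the thread; the dfs.socket.timeout entry is an assumption on my part (it is the stock 60-second read timeout in Hadoop 1.x, which appears to line up with the 66000 ms seen in the logs once a small per-pipeline-node extension is added) and is shown for context only, not as a recommended change.

  <configuration>
    <!-- Workaround discussed above: disable the DataNode write-side socket
         timeout entirely (0 = no timeout). This hides slow or overloaded
         DataNodes rather than fixing the underlying IOPS problem. -->
    <property>
      <name>dfs.datanode.socket.write.timeout</name>
      <value>0</value>
    </property>

    <!-- Assumed for context: the read-side socket timeout used by the DFS
         client and DataNodes. 60000 ms is the stock Hadoop 1.x default; the
         ~66000 ms timeouts in the logs appear to be this base value plus a
         small extension per node remaining in the pipeline. -->
    <property>
      <name>dfs.socket.timeout</name>
      <value>60000</value>
    </property>
  </configuration>

Raising or disabling these timeouts only buys time while the bad DataNode is still in the pipeline; the pipeline recovery itself (excluding the bad node and continuing with fewer replicas) is what ultimately unblocks the WAL append.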
