One trick I have used for a while is to set dfs.datanode.socket.write.timeout
to 0 (disabled) in hdfs-site.xml. It won't solve the underlying IOPS capacity
issue on your servers, and it is really just hiding the real problem, but it
can help you ride out short bursty periods.
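
For reference, the hdfs-site.xml entry is just the following (0 disables the
write timeout; double-check the property name against your Hadoop version):

  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <!-- 0 disables the DataNode socket write timeout; non-zero values are in milliseconds -->
    <value>0</value>
  </property>

If I remember right, both the DataNodes and the DFS client inside the region
server read this property, so push it out to both and restart for it to take
effect.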

On Wed, Nov 7, 2012 at 1:43 AM, Varun Sharma <[email protected]> wrote:

> Hi,
>
> I am seeing extremely long HDFS timeouts - and this seems to be associated
> with the loss of a DataNode. Here is the RS log:
>
> 12/11/07 02:17:45 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor
> exception  for block blk_2813460962462751946_78454java.io.IOException: Bad
> response 1 for block blk_2813460962462751946_78454 from datanode
> 10.31.190.107:9200
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:3084)
>
> 12/11/07 02:17:45 WARN hdfs.DFSClient: Error Recovery for block
> blk_2813460962462751946_78454 bad datanode[1] 10.31.190.107:9200
> 12/11/07 02:17:45 WARN hdfs.DFSClient: Error Recovery for block
> blk_2813460962462751946_78454 in pipeline 10.31.138.245:9200,
> 10.31.190.107:9200, 10.159.19.90:9200: bad datanode 10.31.190.107:9200
> 12/11/07 02:17:45 WARN wal.HLog: IPC Server handler 35 on 60020 took 65955
> ms appending an edit to hlog; editcount=476686, len~=76.0
> 12/11/07 02:17:45 WARN wal.HLog: HDFS pipeline error detected. Found 2
> replicas but expecting no less than 3 replicas.  Requesting close of hlog.
>
> The corresponding DN log goes like this
>
> 2012-11-07 02:17:45,142 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode (PacketResponder 2 for
> Block blk_2813460962462751946_78454): PacketResponder
> blk_2813460962462751946_78454 2 Exception java.net.SocketTimeoutException:
> 66000 millis timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.31.138.245:33965
> remote=/10.31.190.107:9200]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at java.io.DataInputStream.readLong(DataInputStream.java:399)
>         at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:124)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:806)
>         at java.lang.Thread.run(Thread.java:662)
>
> It seems like the DataNode local to the region server is waiting on another
> DN for this block, and that wait is timing out because the other data node
> is bad. All in all, this causes response times to be terribly poor. Is there
> a way around this, or am I missing something?
>
> Varun
>
