Sorry. It's early in the morning here. Did not see the 'read timeout'. +1 to Nicolas's response.
On Wed, Nov 7, 2012 at 7:22 AM, Jeremy Carroll <[email protected]> wrote:

> One trick I have used for a while is to set
> dfs.datanode.socket.write.timeout in hdfs-site.xml to 0 (disabled).
> It's not going to solve your underlying IOPS capacity issue with your
> servers, but it can help for short bursty periods. Basically it's hiding
> the real issue, but it can help in the short term.
>
>
> On Wed, Nov 7, 2012 at 1:43 AM, Varun Sharma <[email protected]> wrote:
>
>> Hi,
>>
>> I am seeing extremely long HDFS timeouts - and this seems to be
>> associated with the loss of a DataNode. Here is the RS log:
>>
>> 12/11/07 02:17:45 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor
>> exception for block blk_2813460962462751946_78454 java.io.IOException: Bad
>> response 1 for block blk_2813460962462751946_78454 from datanode
>> 10.31.190.107:9200
>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:3084)
>>
>> 12/11/07 02:17:45 WARN hdfs.DFSClient: Error Recovery for block
>> blk_2813460962462751946_78454 bad datanode[1] 10.31.190.107:9200
>> 12/11/07 02:17:45 WARN hdfs.DFSClient: Error Recovery for block
>> blk_2813460962462751946_78454 in pipeline 10.31.138.245:9200,
>> 10.31.190.107:9200, 10.159.19.90:9200: bad datanode 10.31.190.107:9200
>> 12/11/07 02:17:45 WARN wal.HLog: IPC Server handler 35 on 60020 took 65955
>> ms appending an edit to hlog; editcount=476686, len~=76.0
>> 12/11/07 02:17:45 WARN wal.HLog: HDFS pipeline error detected. Found 2
>> replicas but expecting no less than 3 replicas. Requesting close of hlog.
>>
>> The corresponding DN log goes like this:
>>
>> 2012-11-07 02:17:45,142 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode (PacketResponder 2 for
>> Block blk_2813460962462751946_78454): PacketResponder
>> blk_2813460962462751946_78454 2 Exception java.net.SocketTimeoutException:
>> 66000 millis timeout while waiting for channel to be ready for read. ch :
>> java.nio.channels.SocketChannel[connected local=/10.31.138.245:33965
>> remote=/10.31.190.107:9200]
>>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>>     at java.io.DataInputStream.readFully(DataInputStream.java:178)
>>     at java.io.DataInputStream.readLong(DataInputStream.java:399)
>>     at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:124)
>>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:806)
>>     at java.lang.Thread.run(Thread.java:662)
>>
>> It seems like the DataNode local to the region server is trying to grab
>> the block from another DN and that is timing out because of this other
>> data node being bad. All in all this causes response times to be terribly
>> poor. Is there a way around this, or am I missing something?
>>
>> Varun
>>
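
For anyone who wants to try the workaround Jeremy describes, here is a minimal hdfs-site.xml sketch. The write-timeout property is the one he names; mapping the 66000 ms read timeout in the DN log to dfs.socket.timeout is my assumption (it is the client/DN read timeout in Hadoop 1.x), so check it against your Hadoop version before relying on it:

  <!-- hdfs-site.xml (sketch) -->

  <!-- Disable the datanode write timeout, as Jeremy suggests.
       This only hides slow datanodes; it does not fix the IOPS issue. -->
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>0</value>
  </property>

  <!-- Assumption: the 66000 ms read timeout is governed by dfs.socket.timeout
       (60000 ms default plus a small per-datanode extension). Raising it trades
       faster failure detection for fewer spurious pipeline errors. -->
  <property>
    <name>dfs.socket.timeout</name>
    <value>120000</value>
  </property>

Either way these settings only delay or suppress the timeout; the pipeline will still stall until the bad datanode is dropped, so they are a short-term mitigation, not a fix.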
