I have xcievers at 5000, and there is no 0 entry for the write timeout, so it's at the standard config, whatever that might be. I did have it set to zero before, but it was causing the RS to hang on threads and would basically break RS restarts.
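For what it's worth, the 480000 ms in the DN exceptions below lines up with the stock dfs.datanode.socket.write.timeout of 8 minutes, so we really are on the default there. Roughly what I believe is in our hdfs-site.xml right now (a sketch from memory, not copy-pasted from the live config):

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>5000</value>
</property>
<!-- dfs.datanode.socket.write.timeout is no longer set at all; we removed the 0 entry
     after it started causing the RS hang/restart problems mentioned above -->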
-Jack

On Mon, Mar 28, 2011 at 5:50 PM, Ashley Taylor <[email protected]> wrote:
> Have you increased the xcievers count in your hdfs-site.xml?
> The default is 256; this needs to be much higher if you want to run HBase. Something like:
>
> <property>
>   <name>dfs.datanode.max.xcievers</name>
>   <value>4096</value>
> </property>
>
> Also check that you have added this, or set it to a high enough number, in your hdfs-site.xml:
>
> <property>
>   <name>dfs.datanode.socket.write.timeout</name>
>   <value>0</value>
> </property>
>
> -----Original Message-----
> From: Jack Levin [mailto:[email protected]]
> Sent: Tuesday, 29 March 2011 12:44 p.m.
> To: [email protected]
> Subject: Re: hdfs /DN errors
>
> Sorry for the continuous emails... I was just able to get a jstack on a high-iowait, erroring DN:
>
> http://pastebin.com/jQHpeYHX
>
> -Jack
>
> On Mon, Mar 28, 2011 at 4:38 PM, Jack Levin <[email protected]> wrote:
>> More data.
>>
>> Before datanode restart:
>>
>> Device:  rrqm/s  wrqm/s    r/s    w/s    rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm   %util
>> sda        0.00   17.00  71.00  15.00  11648.00   448.00    140.65      7.08  133.13  11.62   99.90
>> sdb        0.00    4.00  79.00   4.00  13224.00    64.00    160.10      2.90   40.51   9.13   75.80
>>
>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>           17.44   0.00     3.69    54.05    0.00  24.82
>>
>> Device:  rrqm/s  wrqm/s    r/s    w/s    rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm   %util
>> sda        2.00    8.00  70.00   5.00  10584.00   104.00    142.51      9.37  153.17  13.33  100.00
>> sdb        0.00    0.00  47.00   0.00   7104.00     0.00    151.15      0.73   14.96   9.53   44.80
>>
>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>           12.22   0.00     5.62    59.66    0.00  22.49
>>
>> Device:  rrqm/s  wrqm/s    r/s    w/s    rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm   %util
>> sda        3.00  239.00  78.00   3.00   9352.00  1936.00    139.36      9.01   89.38  12.31   99.70
>> sdb        0.00    0.00  70.00   0.00  11744.00     0.00    167.77      2.39   34.56  10.77   75.40
>>
>> 16:36:16 10.101.6.4 root@rdaf4:/usr/java/latest/bin $ ps uax | grep datano
>> root     24358  0.0  0.0  103152    812 pts/0  S+  16:36   0:00 grep datano
>> hadoop   31249 11.6  3.6 4503764 596992 ?
>>                                           Sl  11:49  33:25 /usr/java/latest/bin/java -Xmx2048m -server
>>
>> After restart:
>>
>> Device:  rrqm/s  wrqm/s    r/s    w/s    rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
>> sda        0.00    0.00   2.00   0.00    272.00     0.00    136.00      0.03   15.50  15.50   3.10
>> sdb        0.00    0.00  12.00   0.00   1176.00     0.00     98.00      0.08    6.83   6.83   8.20
>>
>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>           10.64   0.00     1.73     1.98    0.00  85.64
>>
>> Device:  rrqm/s  wrqm/s    r/s    w/s    rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
>> sda        0.00   18.00   8.00  49.00   1848.00   536.00     41.82      0.46    8.04   1.07   6.10
>> sdb        0.00    0.00   8.00   0.00    720.00     0.00     90.00      0.06    7.75   6.25   5.00
>>
>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>            4.23   0.00     0.75     0.50    0.00  94.53
>>
>> Device:  rrqm/s  wrqm/s    r/s    w/s    rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
>> sda        0.00    0.00   2.00   0.00    272.00     0.00    136.00      0.03   13.00  13.00   2.60
>> sdb        0.00    0.00   0.00   0.00      0.00     0.00      0.00      0.00    0.00   0.00   0.00
>>
>> On Mon, Mar 28, 2011 at 4:28 PM, Jack Levin <[email protected]> wrote:
>>> Also, I can't even jstack the datanode; its CPU is low, and it's not eating RAM:
>>>
>>> 16:21:29 10.103.7.3 root@mtag3:/usr/java/latest/bin $ ./jstack 31771
>>> 31771: Unable to open socket file: target process not responding or HotSpot VM not loaded
>>> The -F option can be used when the target process is not responding
>>> You have new mail in /var/spool/mail/root
>>> 16:21:54 10.103.7.3 root@mtag3:/usr/java/latest/bin $
>>>
>>> When I restart the process, iowait goes back to normal. Right now iowait is insanely high compared to a server that also had high iowait but which I restarted; please see the attached graph.
>>>
>>> The graph with the iowait drop is the datanode I restarted; the other one is the one I can't jstack.
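>>>
>>> Next time it wedges like that I will probably try a forced dump instead (sketch only; 31771 is just the pid from the session above, adjust per host):
>>>
>>>   ./jstack -F 31771 > /tmp/dn.jstack   # -F uses the forced (serviceability agent) attach, which can work when the normal attach socket is unresponsive
>>>   kill -3 31771                        # SIGQUIT makes the JVM write a thread dump into the datanode's .out log rather than the terminal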
>>>
>>> -Jack
>>>
>>> On Mon, Mar 28, 2011 at 4:19 PM, Jack Levin <[email protected]> wrote:
>>>> Hello guys, we are getting these errors:
>>>>
>>>> 2011-03-28 15:08:33,485 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:51365, bytes: 66564, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 4191232, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-3087497822408705276_723501, duration: 14409579
>>>> 2011-03-28 15:08:33,492 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:51366, bytes: 14964, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 67094016, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-3224146686136187733_731011, duration: 8855000
>>>> 2011-03-28 15:08:33,495 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:51368, bytes: 51600, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 0, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-6384334583345199846_731014, duration: 2053969
>>>> 2011-03-28 15:08:33,503 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:42553, bytes: 462336, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 327680, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-4751283294726600221_724785, duration: 480254862706
>>>> 2011-03-28 15:08:33,504 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.101.6.5:50010, storageID=DS-1528941561-10.101.6.5-50010-1299713950021, infoPort=50075, ipcPort=50020):Got exception while serving blk_-4751283294726600221_724785 to /10.101.6.5:
>>>> java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.101.6.5:50010 remote=/10.101.6.5:42553]
>>>>         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>>>>         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>>>>         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>>>>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
>>>>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
>>>>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
>>>>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:110)
>>>>
>>>> 2011-03-28 15:08:33,504 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.101.6.5:50010, storageID=DS-1528941561-10.101.6.5-50010-1299713950021, infoPort=50075, ipcPort=50020):DataXceiver
>>>> java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.101.6.5:50010 remote=/10.101.6.5:42553]
>>>>         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>>>>         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>>>>         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>>>>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
>>>>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
>>>>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
>>>>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:110)
>>>> 2011-03-28 15:08:33,504 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:51369, bytes: 66564, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 4781568, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-3087497822408705276_723501, duration: 11478016
>>>> 2011-03-28 15:08:33,506 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:51370, bytes: 66564, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 66962944, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-3224146686136187733_731011, duration: 7643688
>>>>
>>>> This is the RS talking to the DN, and we are getting timeouts. There are no issues like ulimit as far as I know, since we start the processes with 32k. Any ideas what the deal is?
>>>>
>>>> -Jack
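
Re the ulimit point in that last message: one way to confirm what the running DN actually got (rather than what the init script asked for) is to read the limits straight out of /proc. A sketch, assuming a kernel that exposes /proc/<pid>/limits; the pgrep pattern is just illustrative:

  DN_PID=$(pgrep -f 'org.apache.hadoop.hdfs.server.datanode.DataNode' | head -1)
  grep 'open files' /proc/$DN_PID/limits   # shows the soft/hard nofile limits the live process really has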
