Have you increased the xcievers count in your hdfs-site.xml?
The default is 256; this needs to be much higher if you want to run HBase, something like:
<property>
<name>dfs.datanode.max.xcievers</name>
<value>4096</value>
</property>
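As a quick sanity check you can also see how many DataXceiver threads the DataNode is actually running and compare that with the limit. A rough sketch, assuming the stock DataNode main class on the java command line and the jstack path from your ps output (adjust to your install):

# rough count of active DataXceiver threads on one datanode (PID lookup is just an example)
DN_PID=$(pgrep -f org.apache.hadoop.hdfs.server.datanode.DataNode | head -1)
/usr/java/latest/bin/jstack "$DN_PID" | grep -c DataXceiver

If jstack refuses to attach (as you saw on the bad node), jstack -F sometimes still gets a dump.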
Also check that you have added this to your hdfs-site.xml, or set it high
enough; the 480000 millis timeouts in your log are this write timeout
expiring, and a value of 0 disables it:
<property>
<name>dfs.datanode.socket.write.timeout</name>
<value>0</value>
</property>
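Both settings only take effect after the datanodes are restarted. Something along these lines on each node should do it, assuming a stock hadoop-daemon.sh layout (the HADOOP_HOME path is just an example):

# push out the updated hdfs-site.xml first, then bounce the datanode
$HADOOP_HOME/bin/hadoop-daemon.sh stop datanode
$HADOOP_HOME/bin/hadoop-daemon.sh start datanode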
-----Original Message-----
From: Jack Levin [mailto:[email protected]]
Sent: Tuesday, 29 March 2011 12:44 p.m.
To: [email protected]
Subject: Re: hdfs /DN errors
Sorry for the continuous emails... I was just able to get a jstack on the
high-IOwait, erroring DN:
http://pastebin.com/jQHpeYHX
-Jack
On Mon, Mar 28, 2011 at 4:38 PM, Jack Levin <[email protected]> wrote:
> more data:
>
> before datanode restart -
>
>
> Device:   rrqm/s   wrqm/s    r/s    w/s    rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> sda         0.00    17.00  71.00  15.00  11648.00   448.00    140.65      7.08  133.13  11.62  99.90
> sdb         0.00     4.00  79.00   4.00  13224.00    64.00    160.10      2.90   40.51   9.13  75.80
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 17.44 0.00 3.69 54.05 0.00 24.82
>
> Device:   rrqm/s   wrqm/s    r/s    w/s    rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> sda         2.00     8.00  70.00   5.00  10584.00   104.00    142.51      9.37  153.17  13.33 100.00
> sdb         0.00     0.00  47.00   0.00   7104.00     0.00    151.15      0.73   14.96   9.53  44.80
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 12.22 0.00 5.62 59.66 0.00 22.49
>
> Device:   rrqm/s   wrqm/s    r/s    w/s    rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> sda         3.00   239.00  78.00   3.00   9352.00  1936.00    139.36      9.01   89.38  12.31  99.70
> sdb         0.00     0.00  70.00   0.00  11744.00     0.00    167.77      2.39   34.56  10.77  75.40
>
> 16:36:16 10.101.6.4 root@rdaf4:/usr/java/latest/bin $ ps uax | grep datano
> root 24358 0.0 0.0 103152 812 pts/0 S+ 16:36 0:00 grep datano
> hadoop   31249 11.6  3.6 4503764 596992 ?  Sl   11:49  33:25 /usr/java/latest/bin/java -Xmx2048m -server
>
>
>
> After restart:
>
> Device:   rrqm/s   wrqm/s    r/s    w/s    rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> sda         0.00     0.00   2.00   0.00    272.00     0.00    136.00      0.03   15.50  15.50   3.10
> sdb         0.00     0.00  12.00   0.00   1176.00     0.00     98.00      0.08    6.83   6.83   8.20
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 10.64 0.00 1.73 1.98 0.00 85.64
>
> Device:   rrqm/s   wrqm/s    r/s    w/s    rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> sda         0.00    18.00   8.00  49.00   1848.00   536.00     41.82      0.46    8.04   1.07   6.10
> sdb         0.00     0.00   8.00   0.00    720.00     0.00     90.00      0.06    7.75   6.25   5.00
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 4.23 0.00 0.75 0.50 0.00 94.53
>
> Device:   rrqm/s   wrqm/s    r/s    w/s    rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> sda         0.00     0.00   2.00   0.00    272.00     0.00    136.00      0.03   13.00  13.00   2.60
> sdb         0.00     0.00   0.00   0.00      0.00     0.00      0.00      0.00    0.00   0.00   0.00
>
>
>
>
>
>
> On Mon, Mar 28, 2011 at 4:28 PM, Jack Levin <[email protected]> wrote:
>> Also, I can't even jstack the datanode; its CPU is low, and it's not
>> eating RAM:
>>
>> 16:21:29 10.103.7.3 root@mtag3:/usr/java/latest/bin $ ./jstack 31771
>> 31771: Unable to open socket file: target process not responding or
>> HotSpot VM not loaded
>> The -F option can be used when the target process is not responding
>> You have new mail in /var/spool/mail/root
>> 16:21:54 10.103.7.3 root@mtag3:/usr/java/latest/bin $
>>
>>
>> When I restart the process, iowait goes back to normal. Right now
>> iowait is insanely high compared to a server that also had high IOwait
>> but which I restarted; please see the attached graph.
>>
>> The graph with the IOwait drop is the datanode I restarted; the other one
>> is the node I can't get a jstack from.
>>
>>
>> -Jack
>>
>> On Mon, Mar 28, 2011 at 4:19 PM, Jack Levin <[email protected]> wrote:
>>> Hello guys, we are getting these errors:
>>>
>>>
>>> 2011-03-28 15:08:33,485 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:51365, bytes: 66564, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 4191232, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-3087497822408705276_723501, duration: 14409579
>>> 2011-03-28 15:08:33,492 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:51366, bytes: 14964, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 67094016, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-3224146686136187733_731011, duration: 8855000
>>> 2011-03-28 15:08:33,495 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:51368, bytes: 51600, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 0, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-6384334583345199846_731014, duration: 2053969
>>> 2011-03-28 15:08:33,503 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:42553, bytes: 462336, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 327680, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-4751283294726600221_724785, duration: 480254862706
>>> 2011-03-28 15:08:33,504 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.101.6.5:50010, storageID=DS-1528941561-10.101.6.5-50010-1299713950021, infoPort=50075, ipcPort=50020):Got exception while serving blk_-4751283294726600221_724785 to /10.101.6.5:
>>> java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.101.6.5:50010 remote=/10.101.6.5:42553]
>>>         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>>>         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>>>         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>>>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
>>>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
>>>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
>>>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:110)
>>>
>>> 2011-03-28 15:08:33,504 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.101.6.5:50010, storageID=DS-1528941561-10.101.6.5-50010-1299713950021, infoPort=50075, ipcPort=50020):DataXceiver
>>> java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.101.6.5:50010 remote=/10.101.6.5:42553]
>>>         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>>>         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>>>         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>>>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
>>>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
>>>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
>>>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:110)
>>> 2011-03-28 15:08:33,504 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:51369, bytes: 66564, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 4781568, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-3087497822408705276_723501, duration: 11478016
>>> 2011-03-28 15:08:33,506 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:51370, bytes: 66564, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 66962944, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-3224146686136187733_731011, duration: 7643688
>>>
>>>
>>> So the RS is talking to the DN and we are getting timeouts. There are no
>>> issues like ulimit AFAIK, as we start them with a 32k limit. Any ideas
>>> what the deal is?
>>>
>>> -Jack
>>>
>>
>