I have xcievers at 5000, and there is no 0 entry for the write timeout, so it's at the standard config, whatever that might be. I did have it set to zero before, but it was causing the RS to hang on threads and would basically break RS restarts.
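For what it's worth, the 480000 ms in the DN exceptions below lines up with the stock dfs.datanode.socket.write.timeout of 8 minutes, so we really are on the default there. Roughly what I believe is in our hdfs-site.xml right now (a sketch from memory, not copy-pasted from the live config):

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>5000</value>
</property>
<!-- dfs.datanode.socket.write.timeout is no longer set at all; we removed the 0 entry
     after it started causing the RS hang/restart problems mentioned above -->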
-Jack

On Mon, Mar 28, 2011 at 5:50 PM, Ashley Taylor <[email protected]> wrote:
> Have you increased the xcievers count in your hdfs-site.xml?
> The default is 256; this needs to be much higher if you want to run HBase. Something like:
>
> <property>
>   <name>dfs.datanode.max.xcievers</name>
>   <value>4096</value>
> </property>
>
> Also check that you have added this, or set it to a high enough number, in your hdfs-site.xml:
>
> <property>
>   <name>dfs.datanode.socket.write.timeout</name>
>   <value>0</value>
> </property>
>
> -----Original Message-----
> From: Jack Levin [mailto:[email protected]]
> Sent: Tuesday, 29 March 2011 12:44 p.m.
> To: [email protected]
> Subject: Re: hdfs /DN errors
>
> Sorry for the continuous emails... I was just able to get a jstack on a high-iowait, erroring DN:
>
> http://pastebin.com/jQHpeYHX
>
> -Jack
>
> On Mon, Mar 28, 2011 at 4:38 PM, Jack Levin <[email protected]> wrote:
>> More data.
>>
>> Before datanode restart:
>>
>> Device:  rrqm/s  wrqm/s    r/s    w/s    rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm   %util
>> sda        0.00   17.00  71.00  15.00  11648.00   448.00    140.65      7.08  133.13  11.62   99.90
>> sdb        0.00    4.00  79.00   4.00  13224.00    64.00    160.10      2.90   40.51   9.13   75.80
>>
>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>           17.44   0.00     3.69    54.05    0.00  24.82
>>
>> Device:  rrqm/s  wrqm/s    r/s    w/s    rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm   %util
>> sda        2.00    8.00  70.00   5.00  10584.00   104.00    142.51      9.37  153.17  13.33  100.00
>> sdb        0.00    0.00  47.00   0.00   7104.00     0.00    151.15      0.73   14.96   9.53   44.80
>>
>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>           12.22   0.00     5.62    59.66    0.00  22.49
>>
>> Device:  rrqm/s  wrqm/s    r/s    w/s    rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm   %util
>> sda        3.00  239.00  78.00   3.00   9352.00  1936.00    139.36      9.01   89.38  12.31   99.70
>> sdb        0.00    0.00  70.00   0.00  11744.00     0.00    167.77      2.39   34.56  10.77   75.40
>>
>> 16:36:16 10.101.6.4 root@rdaf4:/usr/java/latest/bin $ ps uax | grep datano
>> root     24358  0.0  0.0  103152    812 pts/0  S+  16:36   0:00 grep datano
>> hadoop   31249 11.6  3.6 4503764 596992 ?
>>                                           Sl  11:49  33:25 /usr/java/latest/bin/java -Xmx2048m -server
>>
>> After restart:
>>
>> Device:  rrqm/s  wrqm/s    r/s    w/s    rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
>> sda        0.00    0.00   2.00   0.00    272.00     0.00    136.00      0.03   15.50  15.50   3.10
>> sdb        0.00    0.00  12.00   0.00   1176.00     0.00     98.00      0.08    6.83   6.83   8.20
>>
>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>           10.64   0.00     1.73     1.98    0.00  85.64
>>
>> Device:  rrqm/s  wrqm/s    r/s    w/s    rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
>> sda        0.00   18.00   8.00  49.00   1848.00   536.00     41.82      0.46    8.04   1.07   6.10
>> sdb        0.00    0.00   8.00   0.00    720.00     0.00     90.00      0.06    7.75   6.25   5.00
>>
>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>            4.23   0.00     0.75     0.50    0.00  94.53
>>
>> Device:  rrqm/s  wrqm/s    r/s    w/s    rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
>> sda        0.00    0.00   2.00   0.00    272.00     0.00    136.00      0.03   13.00  13.00   2.60
>> sdb        0.00    0.00   0.00   0.00      0.00     0.00      0.00      0.00    0.00   0.00   0.00
>>
>> On Mon, Mar 28, 2011 at 4:28 PM, Jack Levin <[email protected]> wrote:
>>> Also, I can't even jstack the datanode; its CPU is low, and it's not eating RAM:
>>>
>>> 16:21:29 10.103.7.3 root@mtag3:/usr/java/latest/bin $ ./jstack 31771
>>> 31771: Unable to open socket file: target process not responding or HotSpot VM not loaded
>>> The -F option can be used when the target process is not responding
>>> You have new mail in /var/spool/mail/root
>>> 16:21:54 10.103.7.3 root@mtag3:/usr/java/latest/bin $
>>>
>>> When I restart the process, iowait goes back to normal. Right now iowait is insanely high compared to a server that also had high iowait but which I restarted; please see the attached graph.
>>>
>>> The graph with the iowait drop is the datanode I restarted; the other one is the one I can't jstack.
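>>>
>>> Next time it wedges like that I will probably try a forced dump instead (sketch only; 31771 is just the pid from the session above, adjust per host):
>>>
>>>   ./jstack -F 31771 > /tmp/dn.jstack   # -F uses the forced (serviceability agent) attach, which can work when the normal attach socket is unresponsive
>>>   kill -3 31771                        # SIGQUIT makes the JVM write a thread dump into the datanode's .out log rather than the terminal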
>>>
>>> -Jack
>>>
>>> On Mon, Mar 28, 2011 at 4:19 PM, Jack Levin <[email protected]> wrote:
>>>> Hello guys, we are getting these errors:
>>>>
>>>> 2011-03-28 15:08:33,485 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:51365, bytes: 66564, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 4191232, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-3087497822408705276_723501, duration: 14409579
>>>> 2011-03-28 15:08:33,492 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:51366, bytes: 14964, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 67094016, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-3224146686136187733_731011, duration: 8855000
>>>> 2011-03-28 15:08:33,495 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:51368, bytes: 51600, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 0, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-6384334583345199846_731014, duration: 2053969
>>>> 2011-03-28 15:08:33,503 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:42553, bytes: 462336, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 327680, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-4751283294726600221_724785, duration: 480254862706
>>>> 2011-03-28 15:08:33,504 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.101.6.5:50010, storageID=DS-1528941561-10.101.6.5-50010-1299713950021, infoPort=50075, ipcPort=50020):Got exception while serving blk_-4751283294726600221_724785 to /10.101.6.5:
>>>> java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.101.6.5:50010 remote=/10.101.6.5:42553]
>>>>         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>>>>         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>>>>         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>>>>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
>>>>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
>>>>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
>>>>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:110)
>>>>
>>>> 2011-03-28 15:08:33,504 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.101.6.5:50010, storageID=DS-1528941561-10.101.6.5-50010-1299713950021, infoPort=50075, ipcPort=50020):DataXceiver
>>>> java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.101.6.5:50010 remote=/10.101.6.5:42553]
>>>>         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>>>>         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>>>>         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>>>>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
>>>>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
>>>>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
>>>>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:110)
>>>> 2011-03-28 15:08:33,504 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:51369, bytes: 66564, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 4781568, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-3087497822408705276_723501, duration: 11478016
>>>> 2011-03-28 15:08:33,506 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:51370, bytes: 66564, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 66962944, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-3224146686136187733_731011, duration: 7643688
>>>>
>>>> This is the RS talking to the DN, and we are getting timeouts. There are no issues like ulimit as far as I know, since we start the processes with 32k. Any ideas what the deal is?
>>>>
>>>> -Jack
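
Re the ulimit point in that last message: one way to confirm what the running DN actually got (rather than what the init script asked for) is to read the limits straight out of /proc. A sketch, assuming a kernel that exposes /proc/<pid>/limits; the pgrep pattern is just illustrative:

  DN_PID=$(pgrep -f 'org.apache.hadoop.hdfs.server.datanode.DataNode' | head -1)
  grep 'open files' /proc/$DN_PID/limits   # shows the soft/hard nofile limits the live process really has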
