Sorry for the continuous emails... I was just able to get a jstack on the high-iowait, erroring DN:
http://pastebin.com/jQHpeYHX

-Jack

On Mon, Mar 28, 2011 at 4:38 PM, Jack Levin <[email protected]> wrote:
> more data:
>
> before datanode restart -
>
> Device:  rrqm/s  wrqm/s    r/s    w/s    rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda        0.00   17.00  71.00  15.00  11648.00   448.00   140.65     7.08  133.13  11.62  99.90
> sdb        0.00    4.00  79.00   4.00  13224.00    64.00   160.10     2.90   40.51   9.13  75.80
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           17.44    0.00    3.69   54.05    0.00   24.82
>
> Device:  rrqm/s  wrqm/s    r/s    w/s    rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda        2.00    8.00  70.00   5.00  10584.00   104.00   142.51     9.37  153.17  13.33 100.00
> sdb        0.00    0.00  47.00   0.00   7104.00     0.00   151.15     0.73   14.96   9.53  44.80
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           12.22    0.00    5.62   59.66    0.00   22.49
>
> Device:  rrqm/s  wrqm/s    r/s    w/s    rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda        3.00  239.00  78.00   3.00   9352.00  1936.00   139.36     9.01   89.38  12.31  99.70
> sdb        0.00    0.00  70.00   0.00  11744.00     0.00   167.77     2.39   34.56  10.77  75.40
>
> 16:36:16 10.101.6.4 root@rdaf4:/usr/java/latest/bin $ ps uax | grep datano
> root     24358  0.0  0.0  103152    812 pts/0 S+ 16:36  0:00 grep datano
> hadoop   31249 11.6  3.6 4503764 596992 ?     Sl 11:49 33:25 /usr/java/latest/bin/java -Xmx2048m -server
>
> After restart:
>
> Device:  rrqm/s  wrqm/s    r/s    w/s    rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda        0.00    0.00   2.00   0.00    272.00     0.00   136.00     0.03   15.50  15.50   3.10
> sdb        0.00    0.00  12.00   0.00   1176.00     0.00    98.00     0.08    6.83   6.83   8.20
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           10.64    0.00    1.73    1.98    0.00   85.64
>
> Device:  rrqm/s  wrqm/s    r/s    w/s    rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda        0.00   18.00   8.00  49.00   1848.00   536.00    41.82     0.46    8.04   1.07   6.10
> sdb        0.00    0.00   8.00   0.00    720.00     0.00    90.00     0.06    7.75   6.25   5.00
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            4.23    0.00    0.75    0.50    0.00   94.53
>
> Device:  rrqm/s  wrqm/s    r/s    w/s    rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda        0.00    0.00   2.00   0.00    272.00     0.00   136.00     0.03   13.00  13.00   2.60
> sdb        0.00    0.00   0.00   0.00      0.00     0.00     0.00     0.00    0.00   0.00   0.00
>
> On Mon, Mar 28, 2011 at 4:28 PM, Jack Levin <[email protected]> wrote:
>> Also, I can't even jstack the datanode; its CPU is low, and it's not eating RAM:
>>
>> 16:21:29 10.103.7.3 root@mtag3:/usr/java/latest/bin $ ./jstack 31771
>> 31771: Unable to open socket file: target process not responding or HotSpot VM not loaded
>> The -F option can be used when the target process is not responding
>> You have new mail in /var/spool/mail/root
>> 16:21:54 10.103.7.3 root@mtag3:/usr/java/latest/bin $
>>
>> When I restart the process, iowait goes back to normal. Right now iowait is insanely high compared to a server that also had high iowait but which I restarted; please see the attached graph.
>>
>> The graph with the iowait drop is the datanode I restarted; the other one I can't jstack.
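(The wrapped columns above are just mail mangling; each `iostat -x` device row has twelve fields. A small awk sketch, using the first "before restart" sda row quoted above, to pull out the two columns that matter in this thread, await and %util:)

```shell
# One iostat -x device row, copied from the "before restart" sample above.
# Fields: Device rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
line="sda 0.00 17.00 71.00 15.00 11648.00 448.00 140.65 7.08 133.13 11.62 99.90"

# await is field 10, %util is field 12
echo "$line" | awk '{printf "%s await=%s util=%s\n", $1, $10, $12}'
# prints: sda await=133.13 util=99.90
```

(Piping live `iostat -x 1` output through the same awk, skipping the header lines, gives a quick per-device view of whether a disk is saturated: here sda sits near 100% util with >100 ms waits before the restart, and single-digit util after.)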
>>
>> -Jack
>>
>> On Mon, Mar 28, 2011 at 4:19 PM, Jack Levin <[email protected]> wrote:
>>> Hello guys, we are getting these errors:
>>>
>>> 2011-03-28 15:08:33,485 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:51365, bytes: 66564, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 4191232, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-3087497822408705276_723501, duration: 14409579
>>> 2011-03-28 15:08:33,492 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:51366, bytes: 14964, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 67094016, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-3224146686136187733_731011, duration: 8855000
>>> 2011-03-28 15:08:33,495 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:51368, bytes: 51600, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 0, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-6384334583345199846_731014, duration: 2053969
>>> 2011-03-28 15:08:33,503 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:42553, bytes: 462336, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 327680, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-4751283294726600221_724785, duration: 480254862706
>>> 2011-03-28 15:08:33,504 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.101.6.5:50010, storageID=DS-1528941561-10.101.6.5-50010-1299713950021, infoPort=50075, ipcPort=50020):Got exception while serving blk_-4751283294726600221_724785 to /10.101.6.5:
>>> java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.101.6.5:50010 remote=/10.101.6.5:42553]
>>>     at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>>>     at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>>>     at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>>>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
>>>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
>>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
>>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:110)
>>>
>>> 2011-03-28 15:08:33,504 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.101.6.5:50010, storageID=DS-1528941561-10.101.6.5-50010-1299713950021, infoPort=50075, ipcPort=50020):DataXceiver
>>> java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.101.6.5:50010 remote=/10.101.6.5:42553]
>>>     at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>>>     at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>>>     at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>>>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
>>>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
>>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
>>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:110)
>>> 2011-03-28 15:08:33,504 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:51369, bytes: 66564, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 4781568, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-3087497822408705276_723501, duration: 11478016
>>> 2011-03-28 15:08:33,506 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.101.6.5:50010, dest: /10.101.6.5:51370, bytes: 66564, op: HDFS_READ, cliID: DFSClient_hb_rs_rdaf5.prod.imageshack.com,60020,1301323415015_1301323415053, offset: 66962944, srvID: DS-1528941561-10.101.6.5-50010-1299713950021, blockid: blk_-3224146686136187733_731011, duration: 7643688
>>>
>>> This is the RS talking to the DN, and we are getting timeouts. There are no issues like ulimit AFAIK, as we start them with a 32k limit. Any ideas what the deal is?
>>>
>>> -Jack
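(For reference: the 480000 ms in the stack traces above is the DataNode's default socket write timeout, i.e. 8 minutes. If the goal were only to quiet the DataXceiver errors while the underlying iowait problem is investigated, that timeout is configurable in hdfs-site.xml; a sketch below, where the 960000 value is just an example, and raising it masks the symptom rather than fixing the slow disk:)

```xml
<!-- hdfs-site.xml: DataNode socket write timeout, in milliseconds.
     The default of 480000 (8 minutes) matches the timeouts in the logs above. -->
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>960000</value>
</property>
```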
