Do you have any more info on the setup to offer? What do the performance metrics (network, disk, etc.) look like on the nodes?
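If you have not captured those yet, a quick way to sample them on each node while the update job runs is something like the following. This is only a minimal sketch; iostat and sar come from the sysstat package, and the 5-second interval is arbitrary:

    # CPU, memory and run-queue pressure, sampled every 5 seconds
    vmstat 5

    # per-disk throughput and utilization
    iostat -x 5

    # per-interface network throughput
    sar -n DEV 5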
What are you using to store the data on the namenode (Hadoop HDFS backend and hardware)? Also, what are the ulimits (-a) on your nodes, and how much memory per task? (A quick way to pull those numbers is sketched below, after the quoted message.)

sg

Sent from my iPhone

On Aug 6, 2010, at 1:58 PM, Emmanuel de Castro Santana <[email protected]> wrote:

> Hi all,
>
> We are running Nutch on a 4-node cluster (3 tasktracker & datanode, 1
> jobtracker & namenode). These machines have pretty strong hardware and
> fetch jobs run easily.
>
> However, sometimes while the update job is running, we see the following
> exception:
>
> 2010-08-05 21:07:19,213 ERROR datanode.DataNode - DatanodeRegistration(
> 172.16.202.172:50010,
> storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075,
> ipcPort=50020):DataXceiver
> java.io.EOFException
>     at java.io.DataInputStream.readShort(DataInputStream.java:298)
>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
>     at java.lang.Thread.run(Thread.java:619)
> 2010-08-05 21:07:19,222 DEBUG mortbay.log - EOF
> 2010-08-05 21:12:19,155 ERROR datanode.DataNode - DatanodeRegistration(
> 172.16.202.172:50010,
> storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075,
> ipcPort=50020):DataXceiver
> java.io.EOFException
>     at java.io.DataInputStream.readShort(DataInputStream.java:298)
>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
>     at java.lang.Thread.run(Thread.java:619)
> 2010-08-05 21:12:19,164 DEBUG mortbay.log - EOF
> 2010-08-05 21:17:19,239 ERROR datanode.DataNode - DatanodeRegistration(
> 172.16.202.172:50010,
> storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075,
> ipcPort=50020):DataXceiver
> java.io.EOFException
>     at java.io.DataInputStream.readShort(DataInputStream.java:298)
>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
>     at java.lang.Thread.run(Thread.java:619)
>
> This exception shows up roughly every 4 or 5 minutes.
> These are the job's I/O counters (map / reduce / total) while the
> exceptions appear:
>
> FILE_BYTES_READ      1,224,570,415              0  1,224,570,415
> HDFS_BYTES_READ      1,405,131,713              0  1,405,131,713
> FILE_BYTES_WRITTEN   2,501,562,342  1,224,570,187  3,726,132,529
>
> Checking the filesystem with "bin/hadoop fsck" mostly shows only HEALTHY
> blocks, although at times the job history files appear to become CORRUPT,
> as I can see with "bin/hadoop fsck -openforwrite".
>
> dfs.block.size is 128 MB
> the system ulimit is set to 16384
>
> The cluster is built on strong hardware, and the network between the nodes
> is fast. There is plenty of disk space and memory on all nodes. Given that,
> I suspect something in my current configuration is not quite right.
>
> A short tip would be very helpful at this point.
>
> Thanks in advance
>
> Emmanuel de Castro Santana
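To gather the numbers asked about at the top of this reply, something like the following, run on each node, would help. This is a rough sketch that assumes a stock 0.20-era Hadoop layout with configuration under $HADOOP_HOME/conf and the usual property names for that version (dfs.name.dir, dfs.data.dir, mapred.child.java.opts); adjust paths and names if your install differs:

    # resource limits for the user that runs the Hadoop daemons
    ulimit -a

    # where the namenode keeps its metadata and the datanodes keep their blocks
    grep -A1 -e dfs.name.dir -e dfs.data.dir $HADOOP_HOME/conf/hdfs-site.xml

    # JVM options (heap size) given to each map/reduce task
    grep -A1 mapred.child.java.opts $HADOOP_HOME/conf/mapred-site.xml

Pasting that output along with the datanode logs would make it easier to see whether the EOFExceptions line up with a resource limit or a configuration setting.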

