On 2010-08-06 22:58, Emmanuel de Castro Santana wrote:
Hi all,

We are running Nutch on a 4-node cluster (3 tasktracker & datanode, 1 jobtracker & namenode). These machines have pretty strong hardware and fetch jobs run easily. However, sometimes while the update job is running, we see the following exception:

2010-08-05 21:07:19,213 ERROR datanode.DataNode - DatanodeRegistration(172.16.202.172:50010, storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException
        at java.io.DataInputStream.readShort(DataInputStream.java:298)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
        at java.lang.Thread.run(Thread.java:619)
2010-08-05 21:07:19,222 DEBUG mortbay.log - EOF
2010-08-05 21:12:19,155 ERROR datanode.DataNode - DatanodeRegistration(172.16.202.172:50010, storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException
        at java.io.DataInputStream.readShort(DataInputStream.java:298)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
        at java.lang.Thread.run(Thread.java:619)
2010-08-05 21:12:19,164 DEBUG mortbay.log - EOF
2010-08-05 21:17:19,239 ERROR datanode.DataNode - DatanodeRegistration(172.16.202.172:50010, storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException
        at java.io.DataInputStream.readShort(DataInputStream.java:298)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
        at java.lang.Thread.run(Thread.java:619)

The exception shows up roughly every 4 or 5 minutes. This is the amount of data being read and written by the job while these exceptions appear:

Counter              Map             Reduce          Total
FILE_BYTES_READ      1,224,570,415   0               1,224,570,415
HDFS_BYTES_READ      1,405,131,713   0               1,405,131,713
FILE_BYTES_WRITTEN   2,501,562,342   1,224,570,187   3,726,132,529

Checking the filesystem with "bin/hadoop fsck" shows only HEALTHY blocks most of the time, although there are moments when job history files seem to become CORRUPT, as I can see with "bin/hadoop fsck -openforwrite".

dfs.block.size is 128 MB and the system ulimit is set to 16384 (a short sketch of these settings is included below). The cluster is built from strong hardware, the network between the nodes is pretty fast, and there is plenty of disk space and memory on all nodes. Given that, I guess it must be something in my current configuration that is not fully appropriate. A short tip would be very helpful at this point.
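For reference, here is roughly how those two settings are expressed on the nodes - a sketch only, using the 0.20-era property name and assuming the 16384 refers to the per-process open-file (nofile) limit:

    <!-- hdfs-site.xml: block size mentioned above -->
    <property>
      <name>dfs.block.size</name>
      <value>134217728</value>   <!-- 128 MB -->
    </property>

    # quick check of the open-file limit on a datanode
    $ ulimit -n
    16384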
Hadoop network usage patterns are sometimes taxing for network equipment - I've seen strange errors pop up in situations with poor-quality cabling, and even one case where everything was perfect except for the gigE switch. The switch was equipped with several gigE ports, and the vendor claimed it could drive all ports simultaneously... but its CPU was too underpowered to actually handle that many packets/sec across all ports, so during peaks it would choke and drop packets.
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

