Hi all,

We are running Nutch on a 4-node cluster (3 nodes running tasktracker &
datanode, 1 running jobtracker & namenode).
These machines have fairly strong hardware and fetch jobs run without problems.

However, sometimes while the update job is running, we see the following
exception:

2010-08-05 21:07:19,213 ERROR datanode.DataNode - DatanodeRegistration(
172.16.202.172:50010,
storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075,
ipcPort=50020):DataXceiver
java.io.EOFException
    at java.io.DataInputStream.readShort(DataInputStream.java:298)
    at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
    at java.lang.Thread.run(Thread.java:619)
2010-08-05 21:07:19,222 DEBUG mortbay.log - EOF
2010-08-05 21:12:19,155 ERROR datanode.DataNode - DatanodeRegistration(
172.16.202.172:50010,
storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075,
ipcPort=50020):DataXceiver
java.io.EOFException
    at java.io.DataInputStream.readShort(DataInputStream.java:298)
    at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
    at java.lang.Thread.run(Thread.java:619)
2010-08-05 21:12:19,164 DEBUG mortbay.log - EOF
2010-08-05 21:17:19,239 ERROR datanode.DataNode - DatanodeRegistration(
172.16.202.172:50010,
storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075,
ipcPort=50020):DataXceiver
java.io.EOFException
    at java.io.DataInputStream.readShort(DataInputStream.java:298)
    at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
    at java.lang.Thread.run(Thread.java:619)


This exception shows up roughly every 4 to 5 minutes.
These are the amounts of data read and written by the job while those
exceptions appear:

Counter              Map             Reduce          Total
FILE_BYTES_READ      1,224,570,415   0               1,224,570,415
HDFS_BYTES_READ      1,405,131,713   0               1,405,131,713
FILE_BYTES_WRITTEN   2,501,562,342   1,224,570,187   3,726,132,529

Checking the filesystem with "bin/hadoop fsck" shows, most of the time, only
HEALTHY blocks, although there are moments when the job history files appear
as CORRUPT, which I can see with "bin/hadoop fsck -openforwrite".
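
In case it helps, these are roughly the commands I run for those checks (I
usually point them at the Nutch crawl directories, "/" is just shown here as
an example):

# overall block health of the filesystem
bin/hadoop fsck /

# also list files currently open for write; this is where the job
# history files sometimes show up as CORRUPT
bin/hadoop fsck / -openforwrite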

dfs.block.size is 128 MB
The system ulimit is set to 16384.
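
For completeness, this is how I double-check those two values on the nodes
(the grep path assumes the setting lives in conf/hdfs-site.xml, and I am
assuming the 16384 refers to the open-file limit):

# block size as configured (128 MB = 134217728 bytes)
grep -A 1 "dfs.block.size" conf/hdfs-site.xml

# open-file limit for the user that runs the datanode and tasktracker
# (assumption: the 16384 above is the nofile limit)
ulimit -n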

The cluster, as mentioned, runs on strong hardware, the network between the
nodes is fast, and there is plenty of disk space and memory on every node.
Given that, I suspect the cause is something in my current configuration that
is not entirely appropriate.

Any tip or pointer would be helpful at this point.


Thanks in advance

Emmanuel de Castro Santana
