Hi all,
We are running Nutch on a 4-node cluster (3 nodes running TaskTracker & DataNode, 1 node running JobTracker & NameNode).
The machines have fairly strong hardware and fetch jobs run without problems.
However, sometimes while the update job is running, we see the following
exception:
2010-08-05 21:07:19,213 ERROR datanode.DataNode - DatanodeRegistration(172.16.202.172:50010, storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException
        at java.io.DataInputStream.readShort(DataInputStream.java:298)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
        at java.lang.Thread.run(Thread.java:619)
2010-08-05 21:07:19,222 DEBUG mortbay.log - EOF

2010-08-05 21:12:19,155 ERROR datanode.DataNode - DatanodeRegistration(172.16.202.172:50010, storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException
        at java.io.DataInputStream.readShort(DataInputStream.java:298)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
        at java.lang.Thread.run(Thread.java:619)
2010-08-05 21:12:19,164 DEBUG mortbay.log - EOF

2010-08-05 21:17:19,239 ERROR datanode.DataNode - DatanodeRegistration(172.16.202.172:50010, storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException
        at java.io.DataInputStream.readShort(DataInputStream.java:298)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
        at java.lang.Thread.run(Thread.java:619)
This exception recurs roughly every 4 to 5 minutes.
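For what it's worth, this is roughly how I confirm that cadence on one of the datanodes (the log location and file name below assume the default layout under our Hadoop install; adjust to your setup):

    # count the DataXceiver EOFException errors per minute in the datanode log
    grep -B1 "java.io.EOFException" $HADOOP_HOME/logs/hadoop-*-datanode-*.log \
      | grep "DataXceiver" \
      | awk '{print $1, substr($2, 1, 5)}' | uniq -c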
These are the job counters (map / reduce / total) for the amount of data read and written while those exceptions appear:

Counter              Map            Reduce         Total
FILE_BYTES_READ      1,224,570,415  0              1,224,570,415
HDFS_BYTES_READ      1,405,131,713  0              1,405,131,713
FILE_BYTES_WRITTEN   2,501,562,342  1,224,570,187  3,726,132,529
Checking the file system with "bin/hadoop fsck" shows only HEALTHY blocks most of the time, although there are moments when the job history files appear to become CORRUPT, which I can see with "bin/hadoop fsck -openforwrite".
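For reference, these are roughly the two invocations I use (run from the Hadoop install directory; I pass the HDFS root as the path to check):

    # regular health check of the whole file system
    bin/hadoop fsck /

    # also report files still open for write, which is where the
    # job history files occasionally show up as CORRUPT
    bin/hadoop fsck / -openforwrite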
dfs.block.size is 128 MB.
The system ulimit is set to 16384.
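In case it matters, this is roughly how I double-check those two settings on a datanode (assuming dfs.block.size is set in conf/hdfs-site.xml and that the relevant ulimit is the open-files limit for the user running the datanode):

    # configured HDFS block size (134217728 bytes = 128 MB)
    grep -A1 "dfs.block.size" conf/hdfs-site.xml

    # open file descriptor limit for the current user/shell
    ulimit -n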
The cluster, as mentioned, has strong hardware and a fast network between the nodes, and there is plenty of disk space and memory on every node. Given that, I suspect the cause is something in my current configuration that is not quite appropriate.
Even a short tip would be helpful at this point.
Thanks in advance,
Emmanuel de Castro Santana