On 2010-08-06 22:58, Emmanuel de Castro Santana wrote:
Hi all,

We are running Nutch on a 4-node cluster (3 tasktracker & datanode, 1
jobtracker & namenode).
These machines have pretty strong hardware and fetch jobs run easily.

However, sometimes while the update job is running, we see the following
exception:

2010-08-05 21:07:19,213 ERROR datanode.DataNode - DatanodeRegistration(172.16.202.172:50010, storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException
     at java.io.DataInputStream.readShort(DataInputStream.java:298)
     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
     at java.lang.Thread.run(Thread.java:619)
2010-08-05 21:07:19,222 DEBUG mortbay.log - EOF
2010-08-05 21:12:19,155 ERROR datanode.DataNode - DatanodeRegistration(172.16.202.172:50010, storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException
     at java.io.DataInputStream.readShort(DataInputStream.java:298)
     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
     at java.lang.Thread.run(Thread.java:619)
2010-08-05 21:12:19,164 DEBUG mortbay.log - EOF
2010-08-05 21:17:19,239 ERROR datanode.DataNode - DatanodeRegistration(172.16.202.172:50010, storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException
     at java.io.DataInputStream.readShort(DataInputStream.java:298)
     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
     at java.lang.Thread.run(Thread.java:619)


This exception shows up roughly every 4 or 5 minutes.
This is the amount of data being read and written by the job while those
exceptions appear:

                     Map            Reduce         Total
FILE_BYTES_READ      1,224,570,415  0              1,224,570,415
HDFS_BYTES_READ      1,405,131,713  0              1,405,131,713
FILE_BYTES_WRITTEN   2,501,562,342  1,224,570,187  3,726,132,529

Checking the filesystem with "bin/hadoop fsck" shows only HEALTHY blocks most
of the time, although there are times when the job history files seem to
become CORRUPT, as I can see with "bin/hadoop fsck -openforwrite".
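
For reference, these are roughly the invocations I'm running (a rough sketch;
it assumes the checks start from the Hadoop install directory and cover the
whole filesystem from the root):

# overall filesystem health
bin/hadoop fsck /

# also report files still open for write, which is where the job history
# files sometimes show up as CORRUPT
bin/hadoop fsck / -openforwrite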

dfs.block.size is 128 MB
system ulimit is set to 16384
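
If it helps, this is how I'm checking those values on the nodes (a rough
sketch; it assumes the standard conf/ layout and that the relevant ulimit is
the open-files limit):

# open file descriptor limit for the user running the Hadoop daemons
ulimit -n

# configured HDFS block size, in bytes (128 MB = 134217728)
grep -A 1 'dfs.block.size' conf/hdfs-site.xml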

The cluster is built from strong hardware and the network between the nodes
is pretty fast too.
There is plenty of disk space and memory on all nodes.
Given that, I suspect something in my current configuration is not quite
right.

Any short tip would be helpful at this point.

Hadoop network usage patterns are sometimes taxing for network equipment.
I've seen strange errors pop up in setups with poor-quality cabling, and even
one case where everything was fine except for the gigE switch: it had several
gigE ports, and the vendor claimed it could drive all of them simultaneously,
but its CPU was too underpowered to actually handle that many packets per
second from all ports, so during peaks it would choke and drop packets.
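
One quick way to rule that out is to watch the interface error/drop counters
on each node while the job runs; roughly something like this (the interface
name here is just an example):

# per-interface RX/TX error and drop counters
netstat -i

# NIC-level statistics, filtered for drops and errors
ethtool -S eth0 | grep -iE 'drop|err'

# TCP retransmissions seen by the kernel
netstat -s | grep -i retrans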

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
