Do you have any more info on the setup to offer? What do the performance
metrics look like on the nodes (network, disk, etc.)?
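
For example (assuming standard Linux tools are available on the nodes, with
sysstat installed for iostat/sar), output like the following, sampled on the
datanodes while the update job is running, would help:

    # CPU, memory and swap activity, sampled every 5 seconds
    vmstat 5
    # per-disk utilization and wait times
    iostat -x 5
    # per-interface network throughput
    sar -n DEV 5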

What are you using to store the data on the namenode? What do the HDFS
backend and hardware look like?
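
For instance (the property names below are the Hadoop 0.20-era ones, and the
paths are placeholders for your actual dfs directories):

    # where the namenode keeps its image/edits and the datanodes keep blocks
    grep -A1 -E 'dfs.name.dir|dfs.data.dir' conf/hdfs-site.xml
    # which disks/filesystems those directories live on
    df -h /path/to/dfs/name /path/to/dfs/data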

Also, what are the ulimits (-a) on your nodes? And how much memory per task?
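
Concretely, something along these lines, run as the user the Hadoop daemons
and tasks run under (mapred.child.java.opts is the 0.20-era knob for the
per-task JVM heap; adjust if your setup configures it elsewhere):

    # open-file, process and other limits for the hadoop user
    ulimit -a
    # per-task JVM settings, e.g. -Xmx512m
    grep -A1 'mapred.child.java.opts' conf/mapred-site.xml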

sg

Sent from my iPhone

On Aug 6, 2010, at 1:58 PM, Emmanuel de Castro Santana 
<[email protected]> wrote:

> Hi all,
> 
> We are running Nutch on a 4-node cluster (3 tasktracker & datanode nodes,
> 1 jobtracker & namenode).
> These machines have pretty strong hardware, and fetch jobs run without
> problems.
> 
> However, sometimes while the update job is running, we see the following
> exception:
> 
> 2010-08-05 21:07:19,213 ERROR datanode.DataNode - DatanodeRegistration(
> 172.16.202.172:50010,
> storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075,
> ipcPort=50020):DataXceiver
> java.io.EOFException
>    at java.io.DataInputStream.readShort(DataInputStream.java:298)
>    at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
>    at java.lang.Thread.run(Thread.java:619)
> 2010-08-05 21:07:19,222 DEBUG mortbay.log - EOF
> 2010-08-05 21:12:19,155 ERROR datanode.DataNode - DatanodeRegistration(
> 172.16.202.172:50010,
> storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075,
> ipcPort=50020):DataXceiver
> java.io.EOFException
>    at java.io.DataInputStream.readShort(DataInputStream.java:298)
>    at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
>    at java.lang.Thread.run(Thread.java:619)
> 2010-08-05 21:12:19,164 DEBUG mortbay.log - EOF
> 2010-08-05 21:17:19,239 ERROR datanode.DataNode - DatanodeRegistration(
> 172.16.202.172:50010,
> storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075,
> ipcPort=50020):DataXceiver
> java.io.EOFException
>    at java.io.DataInputStream.readShort(DataInputStream.java:298)
>    at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
>    at java.lang.Thread.run(Thread.java:619)
> 
> 
> This exception shows up every 4 or 5 minutes.
> This is the amount of data read and written by the job while those
> exceptions appear:
> 
> Counter                Map              Reduce           Total
> FILE_BYTES_READ        1,224,570,415    0                1,224,570,415
> HDFS_BYTES_READ        1,405,131,713    0                1,405,131,713
> FILE_BYTES_WRITTEN     2,501,562,342    1,224,570,187    3,726,132,529
> 
> Checking the filesystem with "bin/hadoop fsck" shows only HEALTHY blocks
> most of the time, although at times the job history files seem to become
> CORRUPT, as I can see with "bin/hadoop fsck -openforwrite".
> 
> dfs.block.size is 128 MB
> the system ulimit is set to 16384
> 
> The cluster is built from strong hardware, and the network between the
> nodes is fast too.
> There is plenty of disk space and memory on all nodes.
> Given that, I suspect something in my current configuration is not quite
> right.
> 
> Any tip would be much appreciated at this point.
> 
> 
> Thanks in advance
> 
> Emmanuel de Castro Santana
