"What do performance metrics look like on the nodes? Network/Disk/etc?"

I do not have exact metrics yet.
However, 'top' tells me that CPU usage gets significantly higher while
parsing; I do not think there is anything to worry about there, though.
Most of the time the cores are mostly idle and the load average does not
go above 0.5 (except when parsing).
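
If concrete numbers would help, I can capture them on each node during the
next parse. A minimal sketch of what I would run (assuming the sysstat
package is installed, which provides iostat and sar):

    # sample CPU, disk and network activity every 5 seconds during a parse
    vmstat 5
    iostat -x 5
    sar -n DEV 5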

"what are the ulimits (-a)"

The ulimits are the same on all nodes:

open files            16384
max user processes    139264
max locked memory     32
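
If 16384 open files turns out to be too low for the datanodes, raising it
would look roughly like this in /etc/security/limits.conf (the 'hadoop'
user name and the 65536 value are assumptions, not what we run today):

    # /etc/security/limits.conf - raise the file descriptor limit for the Hadoop user
    hadoop  soft  nofile  65536
    hadoop  hard  nofile  65536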

"... so during peaks it would choke and drop packets"

All nodes talk directly to each other through a single switch, so there are
no long paths to cross.
I don't really believe the problem is in the network.
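
Still, to rule out dropped packets during peaks, I can watch the interface
error/drop counters on each node while the update job runs; a minimal
sketch (eth0 is just an assumed interface name):

    # per-interface packet, error and drop counters (run before and after a peak)
    netstat -i
    ip -s link show eth0
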
It seems more likely that I am not using the proper Hadoop configuration.
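
For example, one datanode setting that is often raised when DataXceiver
errors show up is dfs.datanode.max.xcievers (the misspelling is
intentional; that is how the property is named in this Hadoop line). A
sketch of the hdfs-site.xml entry, with a value that is only an assumption
on my side:

    <!-- hdfs-site.xml: allow more concurrent block transfer threads per datanode -->
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>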

Emmanuel


2010/8/7 Andrzej Bialecki <[email protected]>

> On 2010-08-06 22:58, Emmanuel de Castro Santana wrote:
>
>> Hi all,
>>
>> We are running Nutch in a 4-node cluster (3 tasktracker & datanode, 1
>> jobtracker & namenode).
>> These machines have pretty strong hardware and fetch jobs run easily.
>>
>> However, sometimes while the update job is running, we see the following
>> exception:
>>
>> 2010-08-05 21:07:19,213 ERROR datanode.DataNode - DatanodeRegistration(
>> 172.16.202.172:50010,
>> storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075,
>> ipcPort=50020):DataXceiver
>> java.io.EOFException
>>     at java.io.DataInputStream.readShort(DataInputStream.java:298)
>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
>>     at java.lang.Thread.run(Thread.java:619)
>> 2010-08-05 21:07:19,222 DEBUG mortbay.log - EOF
>> 2010-08-05 21:12:19,155 ERROR datanode.DataNode - DatanodeRegistration(
>> 172.16.202.172:50010,
>> storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075,
>> ipcPort=50020):DataXceiver
>> java.io.EOFException
>>     at java.io.DataInputStream.readShort(DataInputStream.java:298)
>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
>>     at java.lang.Thread.run(Thread.java:619)
>> 2010-08-05 21:12:19,164 DEBUG mortbay.log - EOF
>> 2010-08-05 21:17:19,239 ERROR datanode.DataNode - DatanodeRegistration(
>> 172.16.202.172:50010,
>> storageID=DS-246829865-172.16.202.172-50010-1280352366878, infoPort=50075,
>> ipcPort=50020):DataXceiver
>> java.io.EOFException
>>     at java.io.DataInputStream.readShort(DataInputStream.java:298)
>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
>>     at java.lang.Thread.run(Thread.java:619)
>>
>>
>> This exception shows up every 4 or 5 minutes.
>> This is the amount of data read and written by the job while those
>> exceptions appear:
>>
>> Counter                Map                Reduce             Total
>> FILE_BYTES_READ        1,224,570,415      0                  1,224,570,415
>> HDFS_BYTES_READ        1,405,131,713      0                  1,405,131,713
>> FILE_BYTES_WRITTEN     2,501,562,342      1,224,570,187      3,726,132,529
>>
>> Checking the filesystem with "bin/hadoop fsck" shows mostly HEALTHY
>> blocks, although there are times when job history files seem to become
>> CORRUPT, as I can see with "bin/hadoop fsck -openforwrite"
>>
>> dfs.block.size is 128 MB
>> the system ulimit for open files is set to 16384
>>
>> The cluster is composed of strong hardware and the network between the
>> nodes is pretty fast too.
>> There is plenty of disk space and memory on all nodes.
>> Given that, I guess something in my current configuration is not fully
>> appropriate.
>>
>> A short tip would be helpful at this moment.
>>
>
> Hadoop network usage patterns are sometimes taxing for the network
> equipment - I've seen strange errors pop up in situations with poor-quality
> cabling, and even one case where everything was perfect except for the
> gigE switch - the switch was equipped with several gigE ports, and the
> vendor claimed it could support all ports simultaneously... but its poor CPU
> was too underpowered to actually handle so many packets/sec from all ports,
> so during peaks it would choke and drop packets.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
Emmanuel de Castro Santana
