Without going too deep into it, one thing that crossed my mind: how are you naming the nodes (DNS)? When you look at the JobTracker's Machine List, what names does it show for the task trackers? Can you reach one machine from another using whatever name appears there?
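
A quick way to sanity-check that from every node (the name "datanode1" is a
placeholder -- use whatever actually appears in the Machine List):

    # the name this node believes it has
    hostname -f
    # forward resolution of a peer, then reverse resolution of its address
    getent hosts datanode1
    host $(getent hosts datanode1 | awk '{print $1}')
    # confirm the peer is reachable under that name
    ping -c 1 datanode1

If forward and reverse lookups disagree, or a name from the Machine List
fails to resolve from some node, HDFS and MapReduce can fail in confusing
ways.
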
Also, what does the status say while it shows that it's parsing? By that I mean: look at the job details and check the status output from each node. It would also help if you pasted plenty of data to the list; that makes it easier to spot obvious or potential issues. A couple of quick checks are sketched below, after the quoted thread.

sg

On Mon, Aug 9, 2010 at 1:42 PM, Emmanuel de Castro Santana <[email protected]> wrote:

> "What do performance metrics look like on the nodes? Network/Disk/etc?"
>
> I do not have exact metrics yet.
> However, the 'top' command tells me that CPU usage gets significantly
> higher while parsing. I guess there is nothing to worry about there,
> though. Most of the time the cores are mostly idle and the load average
> does not surpass 0.5 (except when parsing).
>
> "what are the ulimits (-a)"
>
> The ulimits are the same on all nodes, namely:
>
> 16384 for open files
> 139264 for max user processes
> 32 for max locked memory
>
> "... so during peaks it would choke and drop packets"
>
> All nodes talk directly to each other through a switch; there are no long
> paths to cross. I don't really believe the problem is in the network.
> It seems more likely that I am not using the proper Hadoop configuration.
>
> Emmanuel
>
>
> 2010/8/7 Andrzej Bialecki <[email protected]>
>
> > On 2010-08-06 22:58, Emmanuel de Castro Santana wrote:
> >
> >> Hi all,
> >>
> >> We are running Nutch on a 4-node cluster (3 tasktracker & datanode,
> >> 1 jobtracker & namenode).
> >> These machines have pretty strong hardware, and fetch jobs run easily.
> >>
> >> However, sometimes while the update job is running, we see the
> >> following exception:
> >>
> >> 2010-08-05 21:07:19,213 ERROR datanode.DataNode - DatanodeRegistration(
> >> 172.16.202.172:50010,
> >> storageID=DS-246829865-172.16.202.172-50010-1280352366878,
> >> infoPort=50075, ipcPort=50020):DataXceiver
> >> java.io.EOFException
> >>         at java.io.DataInputStream.readShort(DataInputStream.java:298)
> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:79)
> >>         at java.lang.Thread.run(Thread.java:619)
> >> 2010-08-05 21:07:19,222 DEBUG mortbay.log - EOF
> >> [the same EOFException and DEBUG EOF entries repeat at 21:12:19 and
> >> 21:17:19]
> >>
> >> This exception shows up every 4 or 5 minutes.
> >> This is the amount of data read and written by this job while those
> >> exceptions appear (Map / Reduce / Total):
> >>
> >> FILE_BYTES_READ      1,224,570,415              0  1,224,570,415
> >> HDFS_BYTES_READ      1,405,131,713              0  1,405,131,713
> >> FILE_BYTES_WRITTEN   2,501,562,342  1,224,570,187  3,726,132,529
> >>
> >> Checking the file system with "bin/hadoop fsck" shows only HEALTHY
> >> blocks most of the time, although there are times when job history
> >> files seem to become CORRUPT, as I can see with
> >> "bin/hadoop fsck -openforwrite".
> >>
> >> dfs.block.size is 128 MB, and the system ulimit is set to 16384.
> >>
> >> The cluster is composed of strong hardware, and the network between
> >> the nodes is pretty fast too. There is plenty of disk space and memory
> >> on all nodes. Given that, I guess it must be something in my current
> >> configuration that is not fully appropriate.
> >>
> >> A short tip would be helpful at this moment.
> >
> > Hadoop network usage patterns are sometimes taxing for network
> > equipment - I've seen strange errors pop up in situations with
> > poor-quality cabling, and even one case where everything was perfect
> > except for the gigE switch: it was equipped with several gigE ports,
> > and the vendor claimed it could support all of them simultaneously...
> > but its poor CPU was too underpowered to actually handle that many
> > packets/sec from all ports, so during peaks it would choke and drop
> > packets.
> >
> > --
> > Best regards,
> > Andrzej Bialecki <><
> > http://www.sigram.com  Contact: info at sigram dot com
>
> --
> Emmanuel de Castro Santana
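
A note on the DataXceiver EOFExceptions quoted above: an EOF in readShort()
just means something connected to the datanode port and closed the
connection before sending any data, which can be as benign as a monitoring
probe. Still, if the errors correlate with load, it is worth ruling out the
datanode's cap on concurrent block-transfer threads
(dfs.datanode.max.xcievers -- note the historical spelling -- which
defaulted to 256 on 0.20-era HDFS, if memory serves). A rough way to see
how close you are, assuming the default transfer port 50010:

    # established connections on the datanode transfer port
    netstat -tan | grep ':50010' | grep ESTABLISHED | wc -l
    # check whether the limit is overridden in your config
    grep -A1 xcievers conf/hdfs-site.xml

If the count flirts with the cap, set dfs.datanode.max.xcievers to
something larger (say 4096) in hdfs-site.xml on every datanode and restart
them; your open-files ulimit of 16384 already sits well above that.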
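
And on the switch theory above: before rewriting your Hadoop configuration,
it is cheap to watch the NIC counters on each node while the update job
runs. A rough sketch (eth0 is an assumption -- substitute your actual
interface):

    # interface-level error/drop counters; sample before and after the job
    ifconfig eth0 | grep -E 'errors|dropped'
    # driver-level statistics, where the NIC driver exposes them
    ethtool -S eth0 | grep -iE 'drop|err'
    # host-wide TCP retransmissions, a decent proxy for a choking switch
    netstat -s | grep -i retrans

If those counters climb during the job, the network deserves a closer look;
if they stay flat, your Hadoop configuration becomes the more likely
suspect.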

