Hello everyone, I've been encountering the following problem for some time now and it is really slowing down my work. I would appreciate any help you guys can provide. I am using Hadoop 1.0.3.
I configured the TaskTrackers to send heartbeats to the JobTracker every 1 second. Most of the time the heartbeats are sent as configured. Sometimes, though, there is a big gap between two consecutive heartbeats sent by a TaskTracker. This gap can be as high as 30 seconds, but it is usually on the order of 10 to 15 seconds.
When this happens, there is usually a matching gap in the TaskTracker log: nothing is printed for those 10 to 30 seconds. I added some print statements to the offerService and transmitHeartBeat functions in TaskTracker.java. Often the last print statement I see before the gap is the one that immediately precedes a synchronized block. I have not been able to narrow this down to any particular synchronized block. Given enough runs, the gaps appear at several places in the code.
The TaskTracker thread seems to just wait in front of those synchronized blocks, so it cannot reach the code that actually sends the heartbeat. This makes me think that the locks are perhaps not always released correctly, or are held for a long time by some other thread.
By running many experiments I also noticed that the problem appears more often when the number of concurrent tasks running on a node is larger, perhaps because more task threads mean more locking and unlocking. Before switching to Hadoop 1.0.3 I used version 0.21.0, which showed the same problem far more often than 1.0.3.
Have you guys seen this before? Do you know what could be causing this behavior?
Thank you so much,
Florin Dinu
Rice University
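P.S. In case it helps with the diagnosis, here is a minimal sketch of how the lock holder could be identified during one of the gaps, using the standard java.lang.management API. It is not Hadoop code: the class name BlockedThreadReporter, the method describeBlockedThreads, and the demo thread name are all made up for illustration, and I am assuming a helper like this could be called periodically from inside the TaskTracker JVM. Running jstack against the TaskTracker process during a gap should give similar information without any code changes.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper, not part of Hadoop.
final class BlockedThreadReporter {

    /** Returns one line per thread that is currently BLOCKED entering a monitor. */
    static List<String> describeBlockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> report = new ArrayList<String>();
        // The lock a thread is waiting for and that lock's owner are reported
        // even when the extra locked-monitor details are not requested.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info.getThreadState() == Thread.State.BLOCKED) {
                report.add(info.getThreadName() + " is blocked on " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return report;
    }

    // Small demo: the main thread holds a lock while a second thread tries to take it.
    public static void main(String[] args) throws InterruptedException {
        final Object lock = new Object();
        Thread waiter = new Thread(new Runnable() {
            public void run() {
                synchronized (lock) {
                    // Nothing to do; the point is only to contend for the lock.
                }
            }
        }, "demo-heartbeat-thread");

        synchronized (lock) {
            waiter.start();
            // Wait until the second thread is actually stuck on the monitor.
            while (waiter.getState() != Thread.State.BLOCKED) {
                Thread.sleep(10);
            }
            for (String line : describeBlockedThreads()) {
                System.out.println(line);
            }
        }
        waiter.join();
    }
}
```

If the owner reported during the gaps turns out to be the same thread every time, that would point to one long-held lock rather than locks that are never released.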
