Hi Nat, I should've mentioned this before. We're running MRv1.

On Tue, Jun 16, 2015 at 2:24 AM, nataraj jonnalagadda <[email protected]> wrote:
> Hey Matt,
>
> It's possibly due to your YARN config: YARN/MapReduce ACLs, the YARN
> scheduler config, or cgroups (if enabled) not set up correctly. We could
> dig in more if we have the yarn-site.xml and scheduler conf files.
>
> Thanks,
> Nat.
>
>
> On Mon, Jun 15, 2015 at 10:39 PM, Matt K <[email protected]> wrote:
>
>> I see there are 2 threads - one that kicks off the mappers, and another
>> that kicks off the reducers. The one that kicks off the mappers got stuck.
>> It's not yet clear to me where it got stuck exactly.
>>
>> On Tue, Jun 16, 2015 at 1:11 AM, Matt K <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I'm dealing with a production issue; any help would be appreciated. I am
>>> seeing very strange behavior in the TaskTrackers. After they pick up the
>>> task, it never comes out of the UNASSIGNED state, and the task just gets
>>> killed 10 minutes later.
>>>
>>> 2015-06-16 02:42:21,114 INFO org.apache.hadoop.mapred.TaskTracker:
>>> LaunchTaskAction (registerTask): attempt_201506152116_0046_m_000286_0
>>> task's state:UNASSIGNED
>>> 2015-06-16 02:52:21,805 INFO org.apache.hadoop.mapred.TaskTracker:
>>> attempt_201506152116_0046_m_000286_0: Task
>>> attempt_201506152116_0046_m_000286_0 failed to report status for 600
>>> seconds. Killing!
>>>
>>> Normally, I would see the following in the logs:
>>>
>>> 2015-06-16 04:30:32,328 INFO org.apache.hadoop.mapred.TaskTracker:
>>> Trying to launch : attempt_201506152116_0062_r_000004_0 which needs 1 slots
>>>
>>> However, it doesn't get this far for these particular tasks. I am
>>> perusing the source code here, and this doesn't seem to be possible:
>>>
>>> http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/mapred/TaskTracker.java#TaskTracker.TaskLauncher.0tasksToLaunch
>>>
>>> The code does something like this:
>>>
>>> public void addToTaskQueue(LaunchTaskAction action) {
>>>   synchronized (tasksToLaunch) {
>>>     TaskInProgress tip = registerTask(action, this);
>>>     tasksToLaunch.add(tip);
>>>     tasksToLaunch.notifyAll();
>>>   }
>>> }
>>>
>>> The following should pick it up:
>>>
>>> public void run() {
>>>   while (!Thread.interrupted()) {
>>>     try {
>>>       TaskInProgress tip;
>>>       Task task;
>>>       synchronized (tasksToLaunch) {
>>>         while (tasksToLaunch.isEmpty()) {
>>>           tasksToLaunch.wait();
>>>         }
>>>         // get the TIP
>>>         tip = tasksToLaunch.remove(0);
>>>         task = tip.getTask();
>>>         LOG.info("Trying to launch : " + tip.getTask().getTaskID() +
>>>                  " which needs " + task.getNumSlotsRequired() + " slots");
>>>       }
>>>
>>> What's even stranger is that this is happening for Map tasks only. Reduce
>>> tasks are fine.
>>>
>>> This is only happening on a handful of the nodes, but enough to either
>>> slow down jobs or cause them to fail.
>>>
>>> We're running Hadoop 2.3.0-cdh5.0.2
>>>
>>> Thanks,
>>>
>>> -Matt
>>>
>>
>> --
>> www.calcmachine.com - easy online calculator.
>

--
www.calcmachine.com - easy online calculator.
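For reference, here is a minimal, self-contained sketch of the wait()/notifyAll() hand-off shown in the quoted TaskLauncher code. It is not the actual TaskTracker source: the LauncherHandoffSketch class, the freeSlots latch, and the second attempt ID are invented for illustration. The point is that a single launcher thread which blocks anywhere after tasksToLaunch.remove(0) leaves every task queued behind it registered as UNASSIGNED, with the "Trying to launch" line never logged for them, which matches the symptom described above.

import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch (NOT the actual TaskTracker code) of the wait()/notifyAll()
 * hand-off quoted above. The launcher thread here blocks on a hypothetical
 * freeSlots latch after dequeuing one task, so every task queued behind it
 * is registered but never reaches the "Trying to launch" step.
 */
public class LauncherHandoffSketch {

    private final List<String> tasksToLaunch = new ArrayList<>();
    private int freeSlots = 0; // hypothetical: no slot ever frees up

    // Producer side, analogous to addToTaskQueue(): register, enqueue, notify.
    public void addToTaskQueue(String attemptId) {
        synchronized (tasksToLaunch) {
            System.out.println("registerTask: " + attemptId + " state:UNASSIGNED");
            tasksToLaunch.add(attemptId);
            tasksToLaunch.notifyAll();
        }
    }

    // Consumer side, analogous to TaskLauncher.run().
    public void runLauncher() throws InterruptedException {
        while (!Thread.interrupted()) {
            String task;
            synchronized (tasksToLaunch) {
                while (tasksToLaunch.isEmpty()) {
                    tasksToLaunch.wait();
                }
                task = tasksToLaunch.remove(0);
                System.out.println("Trying to launch : " + task);
            }
            // Stand-in for whatever the real launcher does after the dequeue
            // (acquiring slots, starting the task). If that blocks forever...
            synchronized (this) {
                while (freeSlots < 1) {
                    wait(); // ...the thread parks here and the queue backs up.
                }
                freeSlots--;
            }
            System.out.println("Launched " + task);
        }
    }

    public static void main(String[] args) throws Exception {
        LauncherHandoffSketch tt = new LauncherHandoffSketch();
        Thread launcher = new Thread(() -> {
            try {
                tt.runLauncher();
            } catch (InterruptedException ignored) {
            }
        }, "Map TaskLauncher (sketch)");
        launcher.start();

        // First attempt ID is taken from the log above; the second is made up.
        tt.addToTaskQueue("attempt_201506152116_0046_m_000286_0");
        tt.addToTaskQueue("attempt_201506152116_0046_m_000287_0"); // stays UNASSIGNED
        Thread.sleep(1000);
        launcher.interrupt();
    }
}

A thread dump of an affected TaskTracker (for example via jstack) should show where the map-side launcher thread is actually parked, which would narrow down the cause.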
