Hi John,

Not sure if you ended up getting to the bottom of the issue, but often when
the scheduler gives up and hits this timeout, it's because something funky
happened in Mesos and the scheduler wasn't updated correctly. Could you
describe the state of Mesos (with some logs too, if possible) while this
happens?
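
In case it helps when gathering logs, here is a rough sketch (not part of
hadoop on mesos) for tallying the "Unknown/exited TaskTracker" messages per
tracker URL, so it is easier to see which nodes keep getting rejected. The
function name count_unknown_trackers is made up for illustration, and the
pattern assumes the message format shown in your job tracker log below,
including lines wrapped by the mail client.

```python
import re
from collections import Counter

# Assumed message format, taken from the job tracker log quoted in this thread.
PATTERN = re.compile(r"Unknown/exited TaskTracker:\s*(http://[^\s:]+:\d+)")


def count_unknown_trackers(log_text):
    """Count 'Unknown/exited TaskTracker' messages per tracker URL."""
    # Drop mail quote markers and join wrapped lines so that a message
    # split across two lines can still be matched.
    flat = " ".join(line.lstrip("> ").strip() for line in log_text.splitlines())
    return Counter(PATTERN.findall(flat))


if __name__ == "__main__":
    import sys

    # Usage: python count_trackers.py < jobtracker.log
    for url, n in count_unknown_trackers(sys.stdin.read()).most_common():
        print(n, url)
```

Comparing the hosts it reports against the slaves and tasks Mesos itself
lists should show whether the scheduler and Mesos disagree about what is
running.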

Tom.

On 25 February 2015 at 17:01, John Omernik <j...@omernik.com> wrote:

> I am running hadoop on mesos 0.0.8 on Mesos 0.21.0.  I am running into
> a weird issue where it appears that two of my nodes, when a task
> tracker is run on them, never really complete the check-in process.
> The job tracker is waiting for their heartbeat, they think they are
> running successfully, and tasks that would be assigned to them stay in
> a hung/pending state waiting for the heartbeat.
>
> Basically, in the job tracker log I see the below, where the pending
> tasks is one and the inactive slots is 2 (launched but no heartbeat
> yet), so the jobtracker just sits there waiting while the node thinks
> it's running fine.
>
> Is there a way to have the JobTracker give up on a task tracker
> sooner?  This waiting for timeout period seems odd.
>
> Thanks!
>
> (if there is any other information I can provide, please let me know)
>
>
>
> Job Tracker Log:
>
>    Pending Map Tasks: 0
>
>    Pending Reduce Tasks: 1
>
>       Running Map Tasks: 0
>
>    Running Reduce Tasks: 0
>
>          Idle Map Slots: 2
>
>       Idle Reduce Slots: 0
>
>      Inactive Map Slots: 2 (launched but no hearbeat yet)
>
>   Inactive Reduce Slots: 2 (launched but no hearbeat yet)
>
>        Needed Map Slots: 0
>
>     Needed Reduce Slots: 0
>
>      Unhealthy Trackers: 0
>
> 2015-02-25 10:57:01,930 INFO mapred.ResourcePolicy [Thread-1290]:
> Satisfied map and reduce slots needed.
>
> 2015-02-25 10:57:02,083 INFO mapred.MesosScheduler [IPC Server handler
> 7 on 7676]: Unknown/exited TaskTracker: http://hadoopmapr3:31264.
>
> 2015-02-25 10:57:02,097 INFO mapred.MesosScheduler [IPC Server handler
> 0 on 7676]: Unknown/exited TaskTracker: http://hadoopmapr3:50060.
>
> 2015-02-25 10:57:02,148 INFO mapred.MesosScheduler [IPC Server handler
> 4 on 7676]: Unknown/exited TaskTracker: http://moonman:31182.
>
> 2015-02-25 10:57:02,392 INFO mapred.MesosScheduler [IPC Server handler
> 1 on 7676]: Unknown/exited TaskTracker: http://hadoopmapr3:31264.
>
> 2015-02-25 10:57:02,403 INFO mapred.MesosScheduler [IPC Server handler
> 3 on 7676]: Unknown/exited TaskTracker: http://hadoopmapr3:50060.
>
> 2015-02-25 10:57:02,459 INFO mapred.MesosScheduler [IPC Server handler
> 6 on 7676]: Unknown/exited TaskTracker: http://moonman:31182.
>
> 2015-02-25 10:57:02,702 INFO mapred.MesosScheduler [IPC Server handler
> 4 on 7676]: Unknown/exited TaskTracker: http://hadoopmapr3:31264.
>
> 2015-02-25 10:57:02,714 INFO mapred.MesosScheduler [IPC Server handler
> 5 on 7676]: Unknown/exited TaskTracker: http://hadoopmapr3:50060.
>
