Re: Framework stops to receive the heartbeats and events and gets removed from master

Vova Shelgunov Mon, 23 Jan 2017 08:39:18 -0800

Yes, it works. Sorry for the trouble: the first time I looked at the
logs, I did not notice that failover_timeout was zero.
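
For anyone who hits the same problem, here is a minimal sketch of how a
SUBSCRIBE call body could be built so that failover_timeout is sent on every
subscription, including re-subscriptions. The helper name
build_subscribe_call, the user name, and the one-week timeout are
illustrative assumptions, not taken from my framework; the field names
follow the v1 Scheduler HTTP API as I understand it.

```python
import json


def build_subscribe_call(user, name, failover_timeout, framework_id=None):
    """Build the JSON body of a v1 scheduler SUBSCRIBE call (hypothetical helper).

    failover_timeout is in seconds. If it is left out of framework_info,
    the master falls back to zero and removes the framework as soon as
    it disconnects.
    """
    framework_info = {
        "user": user,
        "name": name,
        "failover_timeout": failover_timeout,
    }
    call = {"type": "SUBSCRIBE", "subscribe": {"framework_info": framework_info}}
    if framework_id is not None:
        # Re-subscription: send the previously assigned framework ID
        # together with the same failover_timeout.
        framework_info["id"] = {"value": framework_id}
        call["framework_id"] = {"value": framework_id}
    return call


if __name__ == "__main__":
    # Illustrative values only; 604800 seconds is one week.
    body = build_subscribe_call(
        "root",
        "Test HTTP Framework",
        604800.0,
        framework_id="3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005",
    )
    print(json.dumps(body, indent=2))
```

The point of routing both the first subscription and every reconnection
through one function is that the timeout cannot be forgotten on the
reconnect path, which was my mistake.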


2017-01-23 19:27 GMT+03:00 Vova Shelgunov <[email protected]>:

> Logs from mesos master:
>
> I0123 15:53:44.523613     7 http.cpp:391] HTTP POST for
> /master/api/v1/scheduler from 172.18.0.1:58864 with User-Agent='AHC/2.0'
> I0123 15:53:44.524159     7 master.cpp:4827] Processing ACKNOWLEDGE call
> ac9a6e5e-67b3-490a-930f-0024eab734b4 for task 10336 of framework
> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework) on agent
> 16c100c1-13fe-47b8-a2a0-aed9bafbbf8c-S0
> I0123 15:53:44.524849     7 master.cpp:7744] Removing task 10336 with
> resources cpus(*):0.1; mem(*):32 of framework 
> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005
> on agent 16c100c1-13fe-47b8-a2a0-aed9bafbbf8c-S0 at slave(1)@
> 172.18.0.3:5051 (mesos-slave)
> I0123 15:53:44.529033     7 master.cpp:1297] Framework
> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)
> disconnected
> I0123 15:53:44.529636     7 master.cpp:2902] Disconnecting framework
> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)
> I0123 15:53:44.529974     7 master.cpp:2926] Deactivating framework
> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)
> I0123 15:53:44.530299     7 master.cpp:1310] Giving framework
> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework) 0ns to
> failover
> I0123 15:53:44.530594     7 hierarchical.cpp:386] Deactivated framework
> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005
> I0123 15:53:44.531962     7 master.cpp:6369] Framework failover timeout,
> removing framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP
> Framework)
> I0123 15:53:44.534992     7 master.cpp:7103] Removing framework
> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)
>
> It seems the failover timeout is set to zero for the framework.
>
> This may be a coding error on my side that shows up when the framework
> loses its connection to the master multiple times (I see that I do not
> pass the failover_timeout value when re-subscribing).
> I will check whether passing it solves my issue.
>
> Thanks
>
> 2017-01-23 19:05 GMT+03:00 Vova Shelgunov <[email protected]>:
>
>> Hi,
>>
>> I have run into a very strange situation with my framework, which talks
>> to the Mesos master via the Scheduler HTTP API:
>>
>> Sometimes my framework stops receiving heartbeats and task updates
>> from the master.
>> I read the *Network partitions* section of the Mesos documentation
>> (http://mesos.apache.org/documentation/latest/scheduler-http-api/)
>> and I see that if a framework does not receive heartbeats within some
>> time, it should reconnect to the master.
>>
>> I have written a heartbeat monitor that reconnects if no heartbeats
>> have arrived in the last n seconds, but after the reconnection I
>> always receive an ERROR from the Mesos master saying that my framework
>> has been removed.
>>
>> Why is this happening?
>>
>> Regards,
>> Uladzimir
>>
>
>
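
The heartbeat monitor described in the original question could look roughly
like the sketch below. The class name HeartbeatMonitor, the
missed_allowed threshold, and the injectable clock are assumptions made for
illustration; the only thing taken from the Scheduler HTTP API is that the
master announces a heartbeat interval and the scheduler should reconnect
when heartbeats stop arriving.

```python
import time


class HeartbeatMonitor:
    """Tracks when the last event arrived on the subscription stream."""

    def __init__(self, interval_seconds, missed_allowed=5, clock=time.monotonic):
        # interval_seconds: heartbeat interval announced by the master.
        # missed_allowed: how many intervals may pass in silence before
        # the connection is considered dead (an assumed policy).
        self._interval = interval_seconds
        self._missed_allowed = missed_allowed
        self._clock = clock
        self._last_seen = clock()

    def record_event(self):
        """Call for every HEARTBEAT (or any other event) that is received."""
        self._last_seen = self._clock()

    def is_stale(self):
        """True if no event arrived within the allowed silence window."""
        silence = self._clock() - self._last_seen
        return silence > self._interval * self._missed_allowed
```

When is_stale() returns True, the scheduler would close the old connection
and send a new SUBSCRIBE call that carries both the framework ID and the
same failover_timeout as the first subscription.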
