Logs from mesos master:

I0123 15:53:44.523613 7 http.cpp:391] HTTP POST for /master/api/v1/scheduler from 172.18.0.1:58864 with User-Agent='AHC/2.0'
I0123 15:53:44.524159 7 master.cpp:4827] Processing ACKNOWLEDGE call ac9a6e5e-67b3-490a-930f-0024eab734b4 for task 10336 of framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework) on agent 16c100c1-13fe-47b8-a2a0-aed9bafbbf8c-S0
I0123 15:53:44.524849 7 master.cpp:7744] Removing task 10336 with resources cpus(*):0.1; mem(*):32 of framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 on agent 16c100c1-13fe-47b8-a2a0-aed9bafbbf8c-S0 at slave(1)@172.18.0.3:5051 (mesos-slave)
I0123 15:53:44.529033 7 master.cpp:1297] Framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework) disconnected
I0123 15:53:44.529636 7 master.cpp:2902] Disconnecting framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)
I0123 15:53:44.529974 7 master.cpp:2926] Deactivating framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)
I0123 15:53:44.530299 7 master.cpp:1310] Giving framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework) 0ns to failover
I0123 15:53:44.530594 7 hierarchical.cpp:386] Deactivated framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005
I0123 15:53:44.531962 7 master.cpp:6369] Framework failover timeout, removing framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)
I0123 15:53:44.534992 7 master.cpp:7103] Removing framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)

It seems the failover timeout is set to zero for the framework. This may be a coding error on my side that shows up when the framework loses its connection to the master multiple times: I see that I do not pass the failover_timeout value during reconnection. I will pass it and see whether that solves my issue.

Thanks

2017-01-23 19:05 GMT+03:00 Vova Shelgunov <[email protected]>:

> Hi,
>
> I have run into a very strange situation with my framework, which talks
> to the mesos master via the Scheduler HTTP API:
>
> Sometimes my framework stops receiving heartbeats and task updates
> from the master.
> I read the mesos documentation
> (http://mesos.apache.org/documentation/latest/scheduler-http-api/),
> *Network partitions* section, and I see that if a framework does not
> receive heartbeats within some time, it should reconnect to the master.
>
> I have written a heartbeat monitor that checks whether any heartbeats
> arrived in the last n seconds and reconnects if none did, but after the
> reconnection I always receive an ERROR from the mesos master saying that
> my framework has been removed.
>
> Why is this happening?
>
> Regards,
> Uladzimir
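
For reference, here is a minimal sketch of the fix I have in mind: build the SUBSCRIBE call so that failover_timeout is always present, on the first subscription and on every reconnection. The helper name build_subscribe_call and the 3600-second value are my own illustrative assumptions; the JSON field names follow the Scheduler HTTP API documentation as I understand it. With the timeout omitted, the master seems to fall back to zero, which would match the "0ns to failover" line in the log above.

```python
import json


def build_subscribe_call(user, name, failover_timeout_secs, framework_id=None):
    """Build the JSON body of a SUBSCRIBE call for the v1 scheduler API.

    Hypothetical helper: it only assembles the request body and does not
    talk to the master.
    """
    framework_info = {
        "user": user,
        "name": name,
        # Sent on every (re)subscription, not only the first one.
        "failover_timeout": failover_timeout_secs,
    }
    call = {"type": "SUBSCRIBE", "subscribe": {"framework_info": framework_info}}

    if framework_id is not None:
        # Re-subscription: identify the existing framework in both places.
        framework_info["id"] = {"value": framework_id}
        call["framework_id"] = {"value": framework_id}

    return call


if __name__ == "__main__":
    # Example re-subscription body; 3600 seconds is an arbitrary choice.
    body = build_subscribe_call(
        "root",
        "Test HTTP Framework",
        3600.0,
        framework_id="3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005",
    )
    print(json.dumps(body, indent=2))
```

The body would then be POSTed to the master's /api/v1/scheduler endpoint with Content-Type: application/json, the same way as the initial subscription.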

