Yes, it works. Sorry for the trouble: when I first looked at the logs, I did not notice that failover_timeout was zero.
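
For anyone who hits the same problem, here is a minimal sketch of the fix, assuming the scheduler sends JSON SUBSCRIBE calls to /master/api/v1/scheduler. The helper name build_subscribe_call, the user "root", and the one-hour timeout are illustrative assumptions, not values from my framework. The point is only that failover_timeout has to be included in framework_info on every (re)subscription, together with the framework ID when reconnecting.

```python
import json
from typing import Any, Dict, Optional


def build_subscribe_call(
    user: str,
    name: str,
    failover_timeout_secs: float,
    framework_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the JSON body of a Scheduler HTTP API SUBSCRIBE call.

    Hypothetical helper: failover_timeout is always set, so the same
    framework_info is sent on the first subscription and on reconnects.
    """
    framework_info: Dict[str, Any] = {
        "user": user,
        "name": name,
        # Seconds the master waits for the scheduler to come back
        # before removing the framework. Leaving it out means 0.
        "failover_timeout": float(failover_timeout_secs),
    }
    call: Dict[str, Any] = {
        "type": "SUBSCRIBE",
        "subscribe": {"framework_info": framework_info},
    }
    if framework_id is not None:
        # Reconnect: identify the already registered framework.
        framework_info["id"] = {"value": framework_id}
        call["framework_id"] = {"value": framework_id}
    return call


if __name__ == "__main__":
    # First subscription, then a reconnect with the ID the master assigned.
    print(json.dumps(build_subscribe_call("root", "Test HTTP Framework", 3600)))
    print(json.dumps(build_subscribe_call(
        "root", "Test HTTP Framework", 3600,
        framework_id="3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005")))
```

Building the body in one place for both cases avoids the mistake I made, where the reconnect path used a different framework_info than the initial subscription.
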
2017-01-23 19:27 GMT+03:00 Vova Shelgunov <[email protected]>:

> Logs from mesos master:
>
> 0123 15:53:44.523613 7 http.cpp:391] HTTP POST for
> /master/api/v1/scheduler from 172.18.0.1:58864 with User-Agent='AHC/2.0'
> I0123 15:53:44.524159 7 master.cpp:4827] Processing ACKNOWLEDGE call
> ac9a6e5e-67b3-490a-930f-0024eab734b4 for task 10336 of framework
> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework) on agent
> 16c100c1-13fe-47b8-a2a0-aed9bafbbf8c-S0
> I0123 15:53:44.524849 7 master.cpp:7744] Removing task 10336 with
> resources cpus(*):0.1; mem(*):32 of framework
> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005
> on agent 16c100c1-13fe-47b8-a2a0-aed9bafbbf8c-S0 at
> slave(1)@172.18.0.3:5051 (mesos-slave)
> I0123 15:53:44.529033 7 master.cpp:1297] Framework
> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)
> disconnected
> I0123 15:53:44.529636 7 master.cpp:2902] Disconnecting framework
> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)
> I0123 15:53:44.529974 7 master.cpp:2926] Deactivating framework
> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)
> I0123 15:53:44.530299 7 master.cpp:1310] Giving framework
> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework) 0ns to
> failover
> I0123 15:53:44.530594 7 hierarchical.cpp:386] Deactivated framework
> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005
> I0123 15:53:44.531962 7 master.cpp:6369] Framework failover timeout,
> removing framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP
> Framework)
> I0123 15:53:44.534992 7 master.cpp:7103] Removing framework
> 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)
>
> It seems the failover timeout is set to zero for the framework.
>
> This may be a coding error on my side that shows up when the framework
> loses its connection to the master multiple times (I see that I do not
> pass the failover_timeout value during reconnection).
> I will try passing it and see whether that solves my issue.
>
> Thanks
>
> 2017-01-23 19:05 GMT+03:00 Vova Shelgunov <[email protected]>:
>
>> Hi,
>>
>> I have run into a very strange situation with my framework, which talks
>> to the mesos master via the Scheduler HTTP API:
>>
>> Sometimes my framework stops receiving heartbeats and task updates
>> from the master.
>> I read the *Network partitions* section of the mesos documentation
>> (http://mesos.apache.org/documentation/latest/scheduler-http-api/)
>> and I see that if a framework does not receive heartbeats within some
>> time, it should reconnect to the master.
>>
>> I have written a heartbeat monitor that checks whether there have been
>> no heartbeats in the last n seconds and, if so, reconnects. But after
>> the reconnection I always receive an ERROR from the mesos master saying
>> that my framework has been removed.
>>
>> Why is this happening?
>>
>> Regards,
>> Uladzimir
>>
>
>
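
The heartbeat monitor described in the original question could look roughly like the sketch below. It is an illustration only, not the code from my framework: the class name HeartbeatWatchdog, the method names, and the default of 5 missed intervals are assumptions, and the clock is injectable so the logic can be checked without waiting.

```python
import time
from typing import Callable


class HeartbeatWatchdog:
    """Tracks the last event seen on the subscription stream.

    Hypothetical helper: the caller decides what "reconnect" means;
    this class only reports when the stream looks dead.
    """

    def __init__(
        self,
        heartbeat_interval_secs: float,
        missed_intervals: int = 5,  # assumed tolerance, tune as needed
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = heartbeat_interval_secs * missed_intervals
        self._clock = clock
        self._last_event = clock()

    def record_event(self) -> None:
        # Call this for every event from the master, HEARTBEAT included.
        self._last_event = self._clock()

    def should_reconnect(self) -> bool:
        # True once no event has arrived for longer than the tolerance.
        return self._clock() - self._last_event > self._timeout


if __name__ == "__main__":
    watchdog = HeartbeatWatchdog(heartbeat_interval_secs=15.0)
    watchdog.record_event()
    print(watchdog.should_reconnect())
```

When should_reconnect() returns True, the scheduler would resubscribe with its framework ID and the same failover_timeout as in the initial subscription. With a failover_timeout of zero the master removes the framework as soon as the connection drops, which is what the logs above show.
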

