Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster

Geoffroy Jabouley Tue, 17 Mar 2015 02:41:13 -0700

Thanks a lot Dario for the workaround! It works fine and can be scripted
with ansible.


For the record, the github issue is available here:
https://github.com/mesosphere/marathon/issues/1292

2015-03-12 17:27 GMT+01:00 Dario Rexin <da...@mesosphere.io>:

> Hi Geoffrey,
>
> we identified the issue and will fix it in Marathon 0.8.2. To prevent this
> behaviour for now, you just have to make sure that in a fresh setup
> (Marathon was never connected to Mesos) you first start up a single
> Marathon and let it register with Mesos and then start the other Marathon
> instances. The problem is a race in first registration with Mesos and
> fetching the FrameworkID from Zookeeper. Please let me know if the
> workaround does not help you.
>
> Cheers,
> Dario
>
> On 12 Mar 2015, at 09:20, Alex Rukletsov <a...@mesosphere.io> wrote:
>
> Geoffroy,
>
> yes, it looks like a marathon issue, so feel free to post it there as well.
>
> On Thu, Mar 12, 2015 at 1:34 AM, Geoffroy Jabouley <
> geoffroy.jabou...@gmail.com> wrote:
>
>> Thanks Alex for your answer. I will have a look.
>>
>> Would it be better to (cross-)post this discussion on the marathon
>> mailing list?
>>
>> Anyway, the issue is "fixed" for 0.8.0, which is the version i'm using.
>>
>> 2015-03-11 22:18 GMT+01:00 Alex Rukletsov <a...@mesosphere.io>:
>>
>>> Geoffroy,
>>>
>>> most probably you're hitting this bug:
>>> https://github.com/mesosphere/marathon/issues/1063. The problem is that
>>> Marathon can register instead of re-registering when a master fails
>>> over. From master point of view, it's a new framework, that's why the
>>> previous task is gone and a new one (that technically belongs to a new
>>> framework) is started. You can see that frameworks have two different IDs
>>> (check lines 11:31:40.055496 and 11:31:40.785038) in your example.
>>>
>>> Hope that helps,
>>> Alex
>>>
>>> On Tue, Mar 10, 2015 at 4:04 AM, Geoffroy Jabouley <
>>> geoffroy.jabou...@gmail.com> wrote:
>>>
>>>> Hello
>>>>
>>>> thanks for your interest. Following are the requested logs, which will
>>>> result in a pretty big mail.
>>>>
>>>> Mesos/Marathon are *NOT running inside docker*, we only use Docker as
>>>> our mesos containerizer.
>>>>
>>>> For reminder, here is the use case performed to get the logs file:
>>>>
>>>> --------------------------------
>>>>
>>>> Our cluster: 3 identical mesos nodes with:
>>>>     + zookeeper
>>>>     + docker 1.5
>>>>     + mesos master 0.21.1 configured in HA mode
>>>>     + mesos slave 0.21.1 configured with checkpointing, strict and
>>>> reconnect
>>>>     + marathon 0.8.0 configured in HA mode with checkpointing
>>>>
>>>> --------------------------------
>>>>
>>>> *Begin State: *
>>>> + the mesos cluster is up (3 machines)
>>>> + mesos master leader is 10.195.30.19
>>>> + marathon leader is 10.195.30.21
>>>> + 1 docker task (let's call it APPTASK) is running on slave 10.195.30.21
>>>>
>>>> *Action*: stop the mesos master leader process (sudo stop mesos-master)
>>>>
>>>> *Expected*: mesos master leader has changed, active tasks / frameworks
>>>> remain unchanged
>>>>
>>>> *End state: *
>>>> + mesos master leader *has changed, now 10.195.30.21*
>>>> + previously running APPTASK on the slave 10.195.30.21 has "disappear"
>>>> (not showing anymore on the mesos UI), but *docker container is still
>>>> running*
>>>> + a n*ew APPTASK is now running on slave 10.195.30.19*
>>>> + marathon framework "registration time" in mesos UI shows "Just now"
>>>> + marathon leader *has changed, now 10.195.30.20*
>>>>
>>>>
>>>> --------------------------------
>>>>
>>>> Now comes the 6 requested logs, which might contain
>>>> interesting/relevant information, but i as a newcomer to mesos it is hard
>>>> to read...
>>>>
>>>>
>>>> *from previous MESOS master leader 10.195.30.19 <http://10.195.30.19/>:*
>>>> W0310 11:31:28.310518 24289 logging.cpp:81] RAW: Received signal
>>>> SIGTERM from process 1 of user 0; exiting
>>>>
>>>>
>>>> *from new MESOS master leader 10.195.30.21 <http://10.195.30.21/>:*
>>>> I0310 11:31:40.011545   922 detector.cpp:138] Detected a new leader:
>>>> (id='2')
>>>> I0310 11:31:40.011823   922 group.cpp:659] Trying to get
>>>> '/mesos/info_0000000002' in ZooKeeper
>>>> I0310 11:31:40.015496   915 network.hpp:424] ZooKeeper group
>>>> memberships changed
>>>> I0310 11:31:40.015847   915 group.cpp:659] Trying to get
>>>> '/mesos/log_replicas/0000000000' in ZooKeeper
>>>> I0310 11:31:40.016047   922 detector.cpp:433] A new leading master
>>>> (UPID=master@10.195.30.21:5050) is detected
>>>> I0310 11:31:40.016074   922 master.cpp:1263] The newly elected leader
>>>> is master@10.195.30.21:5050 with id 20150310-112310-354337546-5050-895
>>>> I0310 11:31:40.016089   922 master.cpp:1276] Elected as the leading
>>>> master!
>>>> I0310 11:31:40.016108   922 master.cpp:1094] Recovering from registrar
>>>> I0310 11:31:40.016188   918 registrar.cpp:313] Recovering registrar
>>>> I0310 11:31:40.016542   918 log.cpp:656] Attempting to start the writer
>>>> I0310 11:31:40.016918   918 replica.cpp:474] Replica received implicit
>>>> promise request with proposal 2
>>>> I0310 11:31:40.017503   915 group.cpp:659] Trying to get
>>>> '/mesos/log_replicas/0000000003' in ZooKeeper
>>>> I0310 11:31:40.017832   918 leveldb.cpp:306] Persisting metadata (8
>>>> bytes) to leveldb took 893672ns
>>>> I0310 11:31:40.017848   918 replica.cpp:342] Persisted promised to 2
>>>> I0310 11:31:40.018817   915 network.hpp:466] ZooKeeper group PIDs: {
>>>> log-replica(1)@10.195.30.20:5050, log-replica(1)@10.195.30.21:5050 }
>>>> I0310 11:31:40.023022   923 coordinator.cpp:230] Coordinator attemping
>>>> to fill missing position
>>>> I0310 11:31:40.023110   923 log.cpp:672] Writer started with ending
>>>> position 8
>>>> I0310 11:31:40.023293   923 leveldb.cpp:438] Reading position from
>>>> leveldb took 13195ns
>>>> I0310 11:31:40.023309   923 leveldb.cpp:438] Reading position from
>>>> leveldb took 3120ns
>>>> I0310 11:31:40.023619   922 registrar.cpp:346] Successfully fetched the
>>>> registry (610B) in 7.385856ms
>>>> I0310 11:31:40.023679   922 registrar.cpp:445] Applied 1 operations in
>>>> 9263ns; attempting to update the 'registry'
>>>> I0310 11:31:40.024238   922 log.cpp:680] Attempting to append 647 bytes
>>>> to the log
>>>> I0310 11:31:40.024279   923 coordinator.cpp:340] Coordinator attempting
>>>> to write APPEND action at position 9
>>>> I0310 11:31:40.024435   923 replica.cpp:508] Replica received write
>>>> request for position 9
>>>> I0310 11:31:40.025707   923 leveldb.cpp:343] Persisting action (666
>>>> bytes) to leveldb took 1.259338ms
>>>> I0310 11:31:40.025722   923 replica.cpp:676] Persisted action at 9
>>>> I0310 11:31:40.026074   923 replica.cpp:655] Replica received learned
>>>> notice for position 9
>>>> I0310 11:31:40.026495   923 leveldb.cpp:343] Persisting action (668
>>>> bytes) to leveldb took 404795ns
>>>> I0310 11:31:40.026507   923 replica.cpp:676] Persisted action at 9
>>>> I0310 11:31:40.026511   923 replica.cpp:661] Replica learned APPEND
>>>> action at position 9
>>>> I0310 11:31:40.026726   923 registrar.cpp:490] Successfully updated the
>>>> 'registry' in 3.029248ms
>>>> I0310 11:31:40.026765   923 registrar.cpp:376] Successfully recovered
>>>> registrar
>>>> I0310 11:31:40.026814   923 log.cpp:699] Attempting to truncate the log
>>>> to 9
>>>> I0310 11:31:40.026880   923 master.cpp:1121] Recovered 3 slaves from
>>>> the Registry (608B) ; allowing 1days for slaves to re-register
>>>> I0310 11:31:40.026897   923 coordinator.cpp:340] Coordinator attempting
>>>> to write TRUNCATE action at position 10
>>>> I0310 11:31:40.026988   923 replica.cpp:508] Replica received write
>>>> request for position 10
>>>> I0310 11:31:40.027640   923 leveldb.cpp:343] Persisting action (16
>>>> bytes) to leveldb took 641018ns
>>>> I0310 11:31:40.027652   923 replica.cpp:676] Persisted action at 10
>>>> I0310 11:31:40.030848   923 replica.cpp:655] Replica received learned
>>>> notice for position 10
>>>> I0310 11:31:40.031883   923 leveldb.cpp:343] Persisting action (18
>>>> bytes) to leveldb took 1.008914ms
>>>> I0310 11:31:40.031963   923 leveldb.cpp:401] Deleting ~2 keys from
>>>> leveldb took 46724ns
>>>> I0310 11:31:40.031977   923 replica.cpp:676] Persisted action at 10
>>>> I0310 11:31:40.031986   923 replica.cpp:661] Replica learned TRUNCATE
>>>> action at position 10
>>>> I0310 11:31:40.055415   918 master.cpp:1383] Received registration
>>>> request for framework 'marathon' at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.055496   918 master.cpp:1447] Registering framework
>>>> 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.055642   919 hierarchical_allocator_process.hpp:329]
>>>> Added framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:40.189151   919 master.cpp:3246] Re-registering slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19)
>>>> I0310 11:31:40.189280   919 registrar.cpp:445] Applied 1 operations in
>>>> 15452ns; attempting to update the 'registry'
>>>> I0310 11:31:40.189949   919 log.cpp:680] Attempting to append 647 bytes
>>>> to the log
>>>> I0310 11:31:40.189978   919 coordinator.cpp:340] Coordinator attempting
>>>> to write APPEND action at position 11
>>>> I0310 11:31:40.190112   919 replica.cpp:508] Replica received write
>>>> request for position 11
>>>> I0310 11:31:40.190563   919 leveldb.cpp:343] Persisting action (666
>>>> bytes) to leveldb took 437440ns
>>>> I0310 11:31:40.190577   919 replica.cpp:676] Persisted action at 11
>>>> I0310 11:31:40.191249   921 replica.cpp:655] Replica received learned
>>>> notice for position 11
>>>> I0310 11:31:40.192159   921 leveldb.cpp:343] Persisting action (668
>>>> bytes) to leveldb took 892767ns
>>>> I0310 11:31:40.192178   921 replica.cpp:676] Persisted action at 11
>>>> I0310 11:31:40.192184   921 replica.cpp:661] Replica learned APPEND
>>>> action at position 11
>>>> I0310 11:31:40.192350   921 registrar.cpp:490] Successfully updated the
>>>> 'registry' in 3.0528ms
>>>> I0310 11:31:40.192387   919 log.cpp:699] Attempting to truncate the log
>>>> to 11
>>>> I0310 11:31:40.192415   919 coordinator.cpp:340] Coordinator attempting
>>>> to write TRUNCATE action at position 12
>>>> I0310 11:31:40.192539   915 replica.cpp:508] Replica received write
>>>> request for position 12
>>>> I0310 11:31:40.192600   921 master.cpp:3314] Re-registered slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148
>>>> I0310 11:31:40.192680   917 hierarchical_allocator_process.hpp:442]
>>>> Added slave 20150310-112310-320783114-5050-24289-S1 (10.195.30.19) with
>>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
>>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
>>>> available)
>>>> I0310 11:31:40.192847   917 master.cpp:3843] Sending 1 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.193164   915 leveldb.cpp:343] Persisting action (16
>>>> bytes) to leveldb took 610664ns
>>>> I0310 11:31:40.193181   915 replica.cpp:676] Persisted action at 12
>>>> I0310 11:31:40.193568   915 replica.cpp:655] Replica received learned
>>>> notice for position 12
>>>> I0310 11:31:40.193948   915 leveldb.cpp:343] Persisting action (18
>>>> bytes) to leveldb took 364062ns
>>>> I0310 11:31:40.193979   915 leveldb.cpp:401] Deleting ~2 keys from
>>>> leveldb took 12256ns
>>>> I0310 11:31:40.193985   915 replica.cpp:676] Persisted action at 12
>>>> I0310 11:31:40.193990   915 replica.cpp:661] Replica learned TRUNCATE
>>>> action at position 12
>>>> I0310 11:31:40.248615   915 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O0 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.248744   915 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S1 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:40.774416   915 master.cpp:3246] Re-registering slave
>>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>>> (10.195.30.21)
>>>> I0310 11:31:40.774976   915 registrar.cpp:445] Applied 1 operations in
>>>> 42342ns; attempting to update the 'registry'
>>>> I0310 11:31:40.777273   921 log.cpp:680] Attempting to append 647 bytes
>>>> to the log
>>>> I0310 11:31:40.777436   921 coordinator.cpp:340] Coordinator attempting
>>>> to write APPEND action at position 13
>>>> I0310 11:31:40.777989   921 replica.cpp:508] Replica received write
>>>> request for position 13
>>>> I0310 11:31:40.779558   921 leveldb.cpp:343] Persisting action (666
>>>> bytes) to leveldb took 1.513714ms
>>>> I0310 11:31:40.779633   921 replica.cpp:676] Persisted action at 13
>>>> I0310 11:31:40.781821   919 replica.cpp:655] Replica received learned
>>>> notice for position 13
>>>> I0310 11:31:40.784417   919 leveldb.cpp:343] Persisting action (668
>>>> bytes) to leveldb took 2.542036ms
>>>> I0310 11:31:40.784446   919 replica.cpp:676] Persisted action at 13
>>>> I0310 11:31:40.784452   919 replica.cpp:661] Replica learned APPEND
>>>> action at position 13
>>>> I0310 11:31:40.784711   920 registrar.cpp:490] Successfully updated the
>>>> 'registry' in 9.68192ms
>>>> I0310 11:31:40.784762   917 log.cpp:699] Attempting to truncate the log
>>>> to 13
>>>> I0310 11:31:40.784808   920 coordinator.cpp:340] Coordinator attempting
>>>> to write TRUNCATE action at position 14
>>>> I0310 11:31:40.784865   917 master.hpp:877] Adding task
>>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 with
>>>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
>>>> 20150310-112310-320783114-5050-24289-S2 (10.195.30.21)
>>>> I0310 11:31:40.784955   919 replica.cpp:508] Replica received write
>>>> request for position 14
>>>> W0310 11:31:40.785038   917 master.cpp:4468] Possibly orphaned task
>>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 of
>>>> framework 20150310-112310-320783114-5050-24289-0000 running on slave
>>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>>> (10.195.30.21)
>>>> I0310 11:31:40.785105   917 master.cpp:3314] Re-registered slave
>>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>>> (10.195.30.21) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148
>>>> I0310 11:31:40.785162   920 hierarchical_allocator_process.hpp:442]
>>>> Added slave 20150310-112310-320783114-5050-24289-S2 (10.195.30.21) with
>>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
>>>> ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148
>>>> available)
>>>> I0310 11:31:40.785679   921 master.cpp:3843] Sending 1 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.786429   919 leveldb.cpp:343] Persisting action (16
>>>> bytes) to leveldb took 1.454211ms
>>>> I0310 11:31:40.786455   919 replica.cpp:676] Persisted action at 14
>>>> I0310 11:31:40.786782   919 replica.cpp:655] Replica received learned
>>>> notice for position 14
>>>> I0310 11:31:40.787833   919 leveldb.cpp:343] Persisting action (18
>>>> bytes) to leveldb took 1.027014ms
>>>> I0310 11:31:40.787873   919 leveldb.cpp:401] Deleting ~2 keys from
>>>> leveldb took 14085ns
>>>> I0310 11:31:40.787883   919 replica.cpp:676] Persisted action at 14
>>>> I0310 11:31:40.787889   919 replica.cpp:661] Replica learned TRUNCATE
>>>> action at position 14
>>>> I0310 11:31:40.792536   922 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O1 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.792670   922 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S2 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:40.819602   921 master.cpp:3246] Re-registering slave
>>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>>> (10.195.30.20)
>>>> I0310 11:31:40.819736   921 registrar.cpp:445] Applied 1 operations in
>>>> 16656ns; attempting to update the 'registry'
>>>> I0310 11:31:40.820617   921 log.cpp:680] Attempting to append 647 bytes
>>>> to the log
>>>> I0310 11:31:40.820726   918 coordinator.cpp:340] Coordinator attempting
>>>> to write APPEND action at position 15
>>>> I0310 11:31:40.820938   918 replica.cpp:508] Replica received write
>>>> request for position 15
>>>> I0310 11:31:40.821641   918 leveldb.cpp:343] Persisting action (666
>>>> bytes) to leveldb took 670583ns
>>>> I0310 11:31:40.821663   918 replica.cpp:676] Persisted action at 15
>>>> I0310 11:31:40.822265   917 replica.cpp:655] Replica received learned
>>>> notice for position 15
>>>> I0310 11:31:40.823463   917 leveldb.cpp:343] Persisting action (668
>>>> bytes) to leveldb took 1.178687ms
>>>> I0310 11:31:40.823490   917 replica.cpp:676] Persisted action at 15
>>>> I0310 11:31:40.823498   917 replica.cpp:661] Replica learned APPEND
>>>> action at position 15
>>>> I0310 11:31:40.823755   917 registrar.cpp:490] Successfully updated the
>>>> 'registry' in 3.97696ms
>>>> I0310 11:31:40.823823   917 log.cpp:699] Attempting to truncate the log
>>>> to 15
>>>> I0310 11:31:40.824147   922 coordinator.cpp:340] Coordinator attempting
>>>> to write TRUNCATE action at position 16
>>>> I0310 11:31:40.824482   922 hierarchical_allocator_process.hpp:442]
>>>> Added slave 20150310-112310-320783114-5050-24289-S0 (10.195.30.20) with
>>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and
>>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148
>>>> available)
>>>> I0310 11:31:40.824597   921 replica.cpp:508] Replica received write
>>>> request for position 16
>>>> I0310 11:31:40.824128   917 master.cpp:3314] Re-registered slave
>>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>>> (10.195.30.20) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148
>>>> I0310 11:31:40.824975   917 master.cpp:3843] Sending 1 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.831900   921 leveldb.cpp:343] Persisting action (16
>>>> bytes) to leveldb took 7.228682ms
>>>> I0310 11:31:40.832031   921 replica.cpp:676] Persisted action at 16
>>>> I0310 11:31:40.832456   917 replica.cpp:655] Replica received learned
>>>> notice for position 16
>>>> I0310 11:31:40.835178   917 leveldb.cpp:343] Persisting action (18
>>>> bytes) to leveldb took 2.674392ms
>>>> I0310 11:31:40.835297   917 leveldb.cpp:401] Deleting ~2 keys from
>>>> leveldb took 45220ns
>>>> I0310 11:31:40.835322   917 replica.cpp:676] Persisted action at 16
>>>> I0310 11:31:40.835341   917 replica.cpp:661] Replica learned TRUNCATE
>>>> action at position 16
>>>> I0310 11:31:40.838281   923 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O2 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:40.838389   923 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S0 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:40.948725   919 http.cpp:344] HTTP request for
>>>> '/master/redirect'
>>>> I0310 11:31:41.479118   918 http.cpp:478] HTTP request for
>>>> '/master/state.json'
>>>> I0310 11:31:45.368074   918 master.cpp:3843] Sending 1 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:45.385144   917 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O3 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:45.385292   917 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S1 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:46.368450   917 master.cpp:3843] Sending 2 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:46.375222   920 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O4 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:46.375360   920 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O5 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:46.375530   920 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S0 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:46.375599   920 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S2 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:48.031230   915 http.cpp:478] HTTP request for
>>>> '/master/state.json'
>>>> I0310 11:31:51.374285   922 master.cpp:3843] Sending 1 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:51.379391   921 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O6 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:51.379487   921 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S1 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:51.482094   923 http.cpp:478] HTTP request for
>>>> '/master/state.json'
>>>> I0310 11:31:52.375326   917 master.cpp:3843] Sending 2 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:52.391376   919 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O7 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051
>>>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:52.391512   919 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S2 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:52.391659   921 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O8 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051
>>>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:52.391751   921 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961;
>>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443];
>>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S0 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:55.062060   918 master.cpp:3611] Performing explicit task
>>>> state reconciliation for 1 tasks of framework
>>>> 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:55.062588   919 master.cpp:3556] Performing implicit task
>>>> state reconciliation for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:56.140990   923 http.cpp:344] HTTP request for
>>>> '/master/redirect'
>>>> I0310 11:31:57.379288   918 master.cpp:3843] Sending 1 offers to
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:57.430888   918 master.cpp:2344] Processing reply for
>>>> offers: [ 20150310-112310-354337546-5050-895-O9 ] on slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000
>>>> (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771
>>>> I0310 11:31:57.431068   918 master.hpp:877] Adding task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 with
>>>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
>>>> 20150310-112310-320783114-5050-24289-S1 (10.195.30.19)
>>>> I0310 11:31:57.431089   918 master.cpp:2503] Launching task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at
>>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 with
>>>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19)
>>>> I0310 11:31:57.431205   918 hierarchical_allocator_process.hpp:563]
>>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833;
>>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000];
>>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave
>>>> 20150310-112310-320783114-5050-24289-S1 from framework
>>>> 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.682133   919 master.cpp:3446] Forwarding status update
>>>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.682186   919 master.cpp:3418] Status update TASK_RUNNING
>>>> (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000 from slave
>>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051
>>>> (10.195.30.19)
>>>> I0310 11:31:57.682199   919 master.cpp:4693] Updating the latest state
>>>> of task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799
>>>> of framework 20150310-112310-354337546-5050-895-0000 to TASK_RUNNING
>>>>
>>>>
>>>> *from MESOS slave 10.195.30.21*
>>>> I0310 11:31:28.750200  1074 slave.cpp:2623] master@10.195.30.19:5050
>>>> exited
>>>> W0310 11:31:28.750249  1074 slave.cpp:2626] Master disconnected!
>>>> Waiting for a new master to be elected
>>>> I0310 11:31:40.012516  1075 detector.cpp:138] Detected a new leader:
>>>> (id='2')
>>>> I0310 11:31:40.012899  1073 group.cpp:659] Trying to get
>>>> '/mesos/info_0000000002' in ZooKeeper
>>>> I0310 11:31:40.017143  1072 detector.cpp:433] A new leading master
>>>> (UPID=master@10.195.30.21:5050) is detected
>>>> I0310 11:31:40.017408  1072 slave.cpp:602] New master detected at
>>>> master@10.195.30.21:5050
>>>> I0310 11:31:40.017546  1076 status_update_manager.cpp:171] Pausing
>>>> sending status updates
>>>> I0310 11:31:40.018673  1072 slave.cpp:627] No credentials provided.
>>>> Attempting to register without authentication
>>>> I0310 11:31:40.018689  1072 slave.cpp:638] Detecting new master
>>>> I0310 11:31:40.785364  1075 slave.cpp:824] Re-registered with master
>>>> master@10.195.30.21:5050
>>>> I0310 11:31:40.785398  1075 status_update_manager.cpp:178] Resuming
>>>> sending status updates
>>>> I0310 11:32:10.639506  1075 slave.cpp:3321] Current usage 12.27%. Max
>>>> allowed age: 5.441217749539572days
>>>>
>>>>
>>>> *from MESOS slave 10.195.30.19*
>>>> I0310 11:31:28.749577 24457 slave.cpp:2623] master@10.195.30.19:5050
>>>> exited
>>>> W0310 11:31:28.749604 24457 slave.cpp:2626] Master disconnected!
>>>> Waiting for a new master to be elected
>>>> I0310 11:31:40.013056 24462 detector.cpp:138] Detected a new leader:
>>>> (id='2')
>>>> I0310 11:31:40.013530 24458 group.cpp:659] Trying to get
>>>> '/mesos/info_0000000002' in ZooKeeper
>>>> I0310 11:31:40.015897 24458 detector.cpp:433] A new leading master
>>>> (UPID=master@10.195.30.21:5050) is detected
>>>> I0310 11:31:40.015976 24458 slave.cpp:602] New master detected at
>>>> master@10.195.30.21:5050
>>>> I0310 11:31:40.016027 24458 slave.cpp:627] No credentials provided.
>>>> Attempting to register without authentication
>>>> I0310 11:31:40.016075 24458 slave.cpp:638] Detecting new master
>>>> I0310 11:31:40.016091 24458 status_update_manager.cpp:171] Pausing
>>>> sending status updates
>>>> I0310 11:31:40.192397 24462 slave.cpp:824] Re-registered with master
>>>> master@10.195.30.21:5050
>>>> I0310 11:31:40.192437 24462 status_update_manager.cpp:178] Resuming
>>>> sending status updates
>>>> I0310 11:31:57.431139 24461 slave.cpp:1083] Got assigned task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.431479 24461 slave.cpp:1193] Launching task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.432144 24461 slave.cpp:3997] Launching executor
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000 in work directory
>>>> '/tmp/mesos/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
>>>> I0310 11:31:57.432318 24461 slave.cpp:1316] Queuing task
>>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' for
>>>> executor
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework '20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.434217 24461 docker.cpp:927] Starting container
>>>> 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9' for task
>>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' (and
>>>> executor
>>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799') of
>>>> framework '20150310-112310-354337546-5050-895-0000'
>>>> I0310 11:31:57.652439 24461 docker.cpp:633] Checkpointing pid 24573 to
>>>> '/tmp/mesos/meta/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9/pids/forked.pid'
>>>> I0310 11:31:57.653270 24461 slave.cpp:2840] Monitoring executor
>>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
>>>> framework '20150310-112310-354337546-5050-895-0000' in container
>>>> 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9'
>>>> I0310 11:31:57.675488 24461 slave.cpp:1860] Got registration for
>>>> executor
>>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
>>>> framework 20150310-112310-354337546-5050-895-0000 from executor(1)@
>>>> 10.195.30.19:56574
>>>> I0310 11:31:57.675696 24461 slave.cpp:1979] Flushing queued task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for
>>>> executor
>>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.678129 24461 slave.cpp:2215] Handling status update
>>>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000 from executor(1)@
>>>> 10.195.30.19:56574
>>>> I0310 11:31:57.678251 24461 status_update_manager.cpp:317] Received
>>>> status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for
>>>> task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.678411 24461 status_update_manager.hpp:346]
>>>> Checkpointing UPDATE for status update TASK_RUNNING (UUID:
>>>> 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.681231 24461 slave.cpp:2458] Forwarding the update
>>>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000 to
>>>> master@10.195.30.21:5050
>>>> I0310 11:31:57.681277 24461 slave.cpp:2391] Sending acknowledgement for
>>>> status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for
>>>> task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000 to executor(1)@
>>>> 10.195.30.19:56574
>>>> I0310 11:31:57.689007 24461 status_update_manager.cpp:389] Received
>>>> status update acknowledgement (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e)
>>>> for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.689028 24461 status_update_manager.hpp:346]
>>>> Checkpointing ACK for status update TASK_RUNNING (UUID:
>>>> 2afb200e-a172-49ec-b807-dc47ea381e1e) for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of
>>>> framework 20150310-112310-354337546-5050-895-0000
>>>> I0310 11:31:57.755231 24461 docker.cpp:1298] Updated 'cpu.shares' to
>>>> 204 at
>>>> /sys/fs/cgroup/cpu/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2
>>>> for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
>>>> I0310 11:31:57.755570 24461 docker.cpp:1333] Updated
>>>> 'memory.soft_limit_in_bytes' to 160MB for container
>>>> a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
>>>> I0310 11:31:57.756013 24461 docker.cpp:1359] Updated
>>>> 'memory.limit_in_bytes' to 160MB at
>>>> /sys/fs/cgroup/memory/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2
>>>> for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9
>>>> I0310 11:32:10.680750 24459 slave.cpp:3321] Current usage 10.64%. Max
>>>> allowed age: 5.555425200437824days
>>>>
>>>>
>>>> *From previous Marathon leader 10.195.30.21 <http://10.195.30.21/>:*
>>>> I0310 11:31:40.017248  1115 detector.cpp:138] Detected a new leader:
>>>> (id='2')
>>>> I0310 11:31:40.017334  1115 group.cpp:659] Trying to get
>>>> '/mesos/info_0000000002' in ZooKeeper
>>>> I0310 11:31:40.017727  1115 detector.cpp:433] A new leading master
>>>> (UPID=master@10.195.30.21:5050) is detected
>>>> [2015-03-10 11:31:40,017] WARN Disconnected
>>>> (mesosphere.marathon.MarathonScheduler:224)
>>>> [2015-03-10 11:31:40,019] INFO Abdicating
>>>> (mesosphere.marathon.MarathonSchedulerService:312)
>>>> [2015-03-10 11:31:40,019] INFO Defeat leadership
>>>> (mesosphere.marathon.MarathonSchedulerService:285)
>>>> [INFO] [03/10/2015 11:31:40.019]
>>>> [marathon-akka.actor.default-dispatcher-6] [akka://marathon/user/$b]
>>>> POSTing to all endpoints.
>>>> [INFO] [03/10/2015 11:31:40.019]
>>>> [marathon-akka.actor.default-dispatcher-5] [
>>>> akka://marathon/user/MarathonScheduler/$a] Suspending scheduler actor
>>>> [2015-03-10 11:31:40,021] INFO Stopping driver
>>>> (mesosphere.marathon.MarathonSchedulerService:221)
>>>> I0310 11:31:40.022001  1115 sched.cpp:1286] Asked to stop the driver
>>>> [2015-03-10 11:31:40,024] INFO Setting framework ID to
>>>> 20150310-112310-320783114-5050-24289-0000
>>>> (mesosphere.marathon.MarathonSchedulerService:73)
>>>> I0310 11:31:40.026274  1115 sched.cpp:234] New master detected at
>>>> master@10.195.30.21:5050
>>>> I0310 11:31:40.026418  1115 sched.cpp:242] No credentials provided.
>>>> Attempting to register without authentication
>>>> I0310 11:31:40.026458  1115 sched.cpp:752] Stopping framework
>>>> '20150310-112310-320783114-5050-24289-0000'
>>>> [2015-03-10 11:31:40,026] INFO Driver future completed. Executing
>>>> optional abdication command.
>>>> (mesosphere.marathon.MarathonSchedulerService:192)
>>>> [2015-03-10 11:31:40,032] INFO Defeated (Leader Interface)
>>>> (mesosphere.marathon.MarathonSchedulerService:246)
>>>> [2015-03-10 11:31:40,032] INFO Defeat leadership
>>>> (mesosphere.marathon.MarathonSchedulerService:285)
>>>> [2015-03-10 11:31:40,032] INFO Stopping driver
>>>> (mesosphere.marathon.MarathonSchedulerService:221)
>>>> I0310 11:31:40.032588  1107 sched.cpp:1286] Asked to stop the driver
>>>> [2015-03-10 11:31:40,033] INFO Will offer leadership after 500
>>>> milliseconds backoff (mesosphere.marathon.MarathonSchedulerService:334)
>>>> [2015-03-10 11:31:40,033] INFO Setting framework ID to
>>>> 20150310-112310-320783114-5050-24289-0000
>>>> (mesosphere.marathon.MarathonSchedulerService:73)
>>>> [2015-03-10 11:31:40,035] ERROR Current member ID member_0000000000 is
>>>> not a candidate for leader, current voting: [member_0000000001,
>>>> member_0000000002] (com.twitter.common.zookeeper.CandidateImpl:144)
>>>> [2015-03-10 11:31:40,552] INFO Using HA and therefore offering
>>>> leadership (mesosphere.marathon.MarathonSchedulerService:341)
>>>> [2015-03-10 11:31:40,563] INFO Set group member ID to member_0000000003
>>>> (com.twitter.common.zookeeper.Group:426)
>>>> [2015-03-10 11:31:40,565] ERROR Current member ID member_0000000000 is
>>>> not a candidate for leader, current voting: [member_0000000001,
>>>> member_0000000002, member_0000000003]
>>>> (com.twitter.common.zookeeper.CandidateImpl:144)
>>>> [2015-03-10 11:31:40,568] INFO Candidate
>>>> /marathon/leader/member_0000000003 waiting for the next leader election,
>>>> current voting: [member_0000000001, member_0000000002, member_0000000003]
>>>> (com.twitter.common.zookeeper.CandidateImpl:165)
>>>>
>>>>
>>>> *From new Marathon leader 10.195.30.20 <http://10.195.30.20/>:*
>>>> [2015-03-10 11:31:40,029] INFO Candidate
>>>> /marathon/leader/member_0000000001 is now leader of group:
>>>> [member_0000000001, member_0000000002]
>>>> (com.twitter.common.zookeeper.CandidateImpl:152)
>>>> [2015-03-10 11:31:40,030] INFO Elected (Leader Interface)
>>>> (mesosphere.marathon.MarathonSchedulerService:253)
>>>> [2015-03-10 11:31:40,044] INFO Elect leadership
>>>> (mesosphere.marathon.MarathonSchedulerService:299)
>>>> [2015-03-10 11:31:40,044] INFO Running driver
>>>> (mesosphere.marathon.MarathonSchedulerService:184)
>>>> I0310 11:31:40.044770 22734 sched.cpp:137] Version: 0.21.1
>>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@712:
>>>> Client environment:zookeeper.version=zookeeper C client 3.4.5
>>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@716:
>>>> Client environment:host.name=srv-d2u-9-virtip20
>>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@723:
>>>> Client environment:os.name=Linux
>>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@724:
>>>> Client environment:os.arch=3.13.0-44-generic
>>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@725:
>>>> Client environment:os.version=#73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
>>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@733:
>>>> Client environment:user.name=(null)
>>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@741:
>>>> Client environment:user.home=/root
>>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@753:
>>>> Client environment:user.dir=/
>>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@zookeeper_init@786:
>>>> Initiating client connection, host=10.195.30.19:2181,10.195.30.20:2181,
>>>> 10.195.30.21:2181 sessionTimeout=10000 watcher=0x7fda9da9a6a0
>>>> sessionId=0 sessionPasswd=<null> context=0x7fdaa400dd10 flags=0
>>>> [2015-03-10 11:31:40,046] INFO Reset offerLeadership backoff
>>>> (mesosphere.marathon.MarathonSchedulerService:329)
>>>> 2015-03-10 11:31:40,047:22509(0x7fda816f2700):ZOO_INFO@check_events@1703:
>>>> initiated connection to server [10.195.30.19:2181]
>>>> 2015-03-10 11:31:40,049:22509(0x7fda816f2700):ZOO_INFO@check_events@1750:
>>>> session establishment complete on server [10.195.30.19:2181],
>>>> sessionId=0x14c0335ad7e000d, negotiated timeout=10000
>>>> I0310 11:31:40.049991 22645 group.cpp:313] Group process (group(1)@
>>>> 10.195.30.20:45771) connected to ZooKeeper
>>>> I0310 11:31:40.050012 22645 group.cpp:790] Syncing group operations:
>>>> queue size (joins, cancels, datas) = (0, 0, 0)
>>>> I0310 11:31:40.050024 22645 group.cpp:385] Trying to create path
>>>> '/mesos' in ZooKeeper
>>>> [INFO] [03/10/2015 11:31:40.047]
>>>> [marathon-akka.actor.default-dispatcher-2] [
>>>> akka://marathon/user/MarathonScheduler/$a] Starting scheduler actor
>>>> I0310 11:31:40.053429 22645 detector.cpp:138] Detected a new leader:
>>>> (id='2')
>>>> I0310 11:31:40.053530 22641 group.cpp:659] Trying to get
>>>> '/mesos/info_0000000002' in ZooKeeper
>>>> [2015-03-10 11:31:40,053] INFO Migration successfully applied for
>>>> version Version(0, 8, 0) (mesosphere.marathon.state.Migration:69)
>>>> I0310 11:31:40.054226 22640 detector.cpp:433] A new leading master
>>>> (UPID=master@10.195.30.21:5050) is detected
>>>> I0310 11:31:40.054281 22640 sched.cpp:234] New master detected at
>>>> master@10.195.30.21:5050
>>>> I0310 11:31:40.054352 22640 sched.cpp:242] No credentials provided.
>>>> Attempting to register without authentication
>>>> I0310 11:31:40.055160 22640 sched.cpp:408] Framework registered with
>>>> 20150310-112310-354337546-5050-895-0000
>>>> [2015-03-10 11:31:40,056] INFO Registered as
>>>> 20150310-112310-354337546-5050-895-0000 to master
>>>> '20150310-112310-354337546-5050-895'
>>>> (mesosphere.marathon.MarathonScheduler:72)
>>>> [2015-03-10 11:31:40,063] INFO Stored framework ID
>>>> '20150310-112310-354337546-5050-895-0000'
>>>> (mesosphere.mesos.util.FrameworkIdUtil:49)
>>>> [INFO] [03/10/2015 11:31:40.065]
>>>> [marathon-akka.actor.default-dispatcher-6] [
>>>> akka://marathon/user/MarathonScheduler/$a] Scheduler actor ready
>>>> [INFO] [03/10/2015 11:31:40.067]
>>>> [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b]
>>>> POSTing to all endpoints.
>>>> ...
>>>> ...
>>>> ...
>>>> [2015-03-10 11:31:55,052] INFO Syncing tasks for all apps
>>>> (mesosphere.marathon.SchedulerActions:403)
>>>> [INFO] [03/10/2015 11:31:55.053]
>>>> [marathon-akka.actor.default-dispatcher-10] [akka://marathon/deadLetters]
>>>> Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
>>>> Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
>>>> Actor[akka://marathon/deadLetters] was not delivered. [1] dead letters
>>>> encountered. This logging can be turned off or adjusted with configuration
>>>> settings 'akka.log-dead-letters' and
>>>> 'akka.log-dead-letters-during-shutdown'.
>>>> [2015-03-10 11:31:55,054] INFO Requesting task reconciliation with the
>>>> Mesos master (mesosphere.marathon.SchedulerActions:430)
>>>> [2015-03-10 11:31:55,064] INFO Received status update for task
>>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799:
>>>> TASK_LOST (Reconciliation: Task is unknown to the slave)
>>>> (mesosphere.marathon.MarathonScheduler:148)
>>>> [2015-03-10 11:31:55,069] INFO Need to scale
>>>> /ffaas-backoffice-app-nopersist from 0 up to 1 instances
>>>> (mesosphere.marathon.SchedulerActions:488)
>>>> [2015-03-10 11:31:55,069] INFO Queueing 1 new tasks for
>>>> /ffaas-backoffice-app-nopersist (0 queued)
>>>> (mesosphere.marathon.SchedulerActions:494)
>>>> [2015-03-10 11:31:55,069] INFO Task
>>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799
>>>> expunged and removed from TaskTracker
>>>> (mesosphere.marathon.tasks.TaskTracker:107)
>>>> [2015-03-10 11:31:55,070] INFO Sending event notification.
>>>> (mesosphere.marathon.MarathonScheduler:262)
>>>> [INFO] [03/10/2015 11:31:55.072]
>>>> [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b]
>>>> POSTing to all endpoints.
>>>> [2015-03-10 11:31:55,073] INFO Need to scale
>>>> /ffaas-backoffice-app-nopersist from 0 up to 1 instances
>>>> (mesosphere.marathon.SchedulerActions:488)
>>>> [2015-03-10 11:31:55,074] INFO Already queued 1 tasks for
>>>> /ffaas-backoffice-app-nopersist. Not scaling.
>>>> (mesosphere.marathon.SchedulerActions:498)
>>>> ...
>>>> ...
>>>> ...
>>>> [2015-03-10 11:31:57,682] INFO Received status update for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
>>>> TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:148)
>>>> [2015-03-10 11:31:57,694] INFO Sending event notification.
>>>> (mesosphere.marathon.MarathonScheduler:262)
>>>> [INFO] [03/10/2015 11:31:57.694]
>>>> [marathon-akka.actor.default-dispatcher-11] [akka://marathon/user/$b]
>>>> POSTing to all endpoints.
>>>> ...
>>>> ...
>>>> ...
>>>> [2015-03-10 11:36:55,047] INFO Expunging orphaned tasks from store
>>>> (mesosphere.marathon.tasks.TaskTracker:170)
>>>> [INFO] [03/10/2015 11:36:55.050]
>>>> [marathon-akka.actor.default-dispatcher-2] [akka://marathon/deadLetters]
>>>> Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
>>>> Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
>>>> Actor[akka://marathon/deadLetters] was not delivered. [2] dead letters
>>>> encountered. This logging can be turned off or adjusted with configuration
>>>> settings 'akka.log-dead-letters' and
>>>> 'akka.log-dead-letters-during-shutdown'.
>>>> [2015-03-10 11:36:55,057] INFO Syncing tasks for all apps
>>>> (mesosphere.marathon.SchedulerActions:403)
>>>> [2015-03-10 11:36:55,058] INFO Requesting task reconciliation with the
>>>> Mesos master (mesosphere.marathon.SchedulerActions:430)
>>>> [2015-03-10 11:36:55,063] INFO Received status update for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
>>>> TASK_RUNNING (Reconciliation: Latest task state)
>>>> (mesosphere.marathon.MarathonScheduler:148)
>>>> [2015-03-10 11:36:55,065] INFO Received status update for task
>>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
>>>> TASK_RUNNING (Reconciliation: Latest task state)
>>>> (mesosphere.marathon.MarathonScheduler:148)
>>>> [2015-03-10 11:36:55,066] INFO Already running 1 instances of
>>>> /ffaas-backoffice-app-nopersist. Not scaling.
>>>> (mesosphere.marathon.SchedulerActions:512)
>>>>
>>>>
>>>>
>>>> -- End of logs
>>>>
>>>>
>>>>
>>>> 2015-03-10 10:25 GMT+01:00 Adam Bordelon <a...@mesosphere.io>:
>>>>
>>>>> This is certainly not the expected/desired behavior when failing over
>>>>> a mesos master in HA mode. In addition to the master logs Alex requested,
>>>>> can you also provide relevant portions of the slave logs for these tasks?
>>>>> If the slave processes themselves never failed over, checkpointing and
>>>>> slave recovery should be irrelevant. Are you running the mesos-slave 
>>>>> itself
>>>>> inside a Docker, or any other non-traditional setup?
>>>>>
>>>>> FYI, --checkpoint defaults to true (and is removed in 0.22), --recover
>>>>> defaults to "reconnect", and --strict defaults to true, so none of those
>>>>> are necessary.
>>>>>
>>>>> On Fri, Mar 6, 2015 at 10:09 AM, Alex Rukletsov <a...@mesosphere.io>
>>>>> wrote:
>>>>>
>>>>>> Geoffroy,
>>>>>>
>>>>>> could you please provide master logs (both from killed and taking
>>>>>> over masters)?
>>>>>>
>>>>>> On Fri, Mar 6, 2015 at 4:26 AM, Geoffroy Jabouley <
>>>>>> geoffroy.jabou...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello
>>>>>>>
>>>>>>> we are facing some unexpecting issues when testing high availability
>>>>>>> behaviors of our mesos cluster.
>>>>>>>
>>>>>>> *Our use case:*
>>>>>>>
>>>>>>> *State*: the mesos cluster is up (3 machines), 1 docker task is
>>>>>>> running on each slave (started from marathon)
>>>>>>>
>>>>>>> *Action*: stop the mesos master leader process
>>>>>>>
>>>>>>> *Expected*: mesos master leader has changed, *active tasks remain
>>>>>>> unchanged*
>>>>>>>
>>>>>>> *Seen*: mesos master leader has changed, *all active tasks are now
>>>>>>> FAILED but docker containers are still running*, marathon detects
>>>>>>> FAILED tasks and starts new tasks. We end with 2 docker containers 
>>>>>>> running
>>>>>>> on each machine, but only one is linked to a RUNNING mesos task.
>>>>>>>
>>>>>>>
>>>>>>> Is the seen behavior correct?
>>>>>>>
>>>>>>> Have we misunderstood the high availability concept? We thought that
>>>>>>> doing this use case would not have any impact on the current cluster 
>>>>>>> state
>>>>>>> (except leader re-election)
>>>>>>>
>>>>>>> Thanks in advance for your help
>>>>>>> Regards
>>>>>>>
>>>>>>> ---------------------------------------------------
>>>>>>>
>>>>>>> our setup is the following:
>>>>>>> 3 identical mesos nodes with:
>>>>>>>     + zookeeper
>>>>>>>     + docker 1.5
>>>>>>>     + mesos master 0.21.1 configured in HA mode
>>>>>>>     + mesos slave 0.21.1 configured with checkpointing, strict and
>>>>>>> reconnect
>>>>>>>     + marathon 0.8.0 configured in HA mode with checkpointing
>>>>>>>
>>>>>>> ---------------------------------------------------
>>>>>>>
>>>>>>> Command lines:
>>>>>>>
>>>>>>>
>>>>>>> *mesos-master*usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,
>>>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos --port=5050
>>>>>>> --cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19 --ip=10.195.30.19
>>>>>>> --quorum=2 --slave_reregister_timeout=1days --work_dir=/var/lib/mesos
>>>>>>>
>>>>>>> *mesos-slave*
>>>>>>> /usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,
>>>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos --checkpoint
>>>>>>> --containerizers=docker,mesos --executor_registration_timeout=5mins
>>>>>>> --hostname=10.195.30.19 --ip=10.195.30.19
>>>>>>> --isolation=cgroups/cpu,cgroups/mem --recover=reconnect
>>>>>>> --recovery_timeout=120mins --strict 
>>>>>>> --resources=ports:[31000-32000,80,443]
>>>>>>>
>>>>>>> *marathon*
>>>>>>> java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64
>>>>>>> -Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp
>>>>>>> /usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000
>>>>>>> --local_port_min 31000 --task_launch_timeout 300000 --http_port 8080
>>>>>>> --hostname 10.195.30.19 --event_subscriber http_callback --ha 
>>>>>>> --https_port
>>>>>>> 8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,
>>>>>>> 10.195.30.21:2181/marathon --master zk://10.195.30.19:2181,
>>>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>

Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster

Reply via email to