Thanks a lot, Dario, for the workaround! It works fine and can be scripted with Ansible.
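For reference, the play we ended up with boils down to something like the sketch below (the service name, inventory group and master address are placeholders from our setup, so adjust as needed):

- hosts: marathon_nodes
  serial: 1                    # bring Marathon up one node at a time
  tasks:
    - name: start marathon
      service: name=marathon state=started

    # crude readiness check: wait until the marathon framework shows up
    # in the leading master's state.json before moving on to the next node
    - name: wait for marathon to register with mesos
      uri: url=http://10.195.30.19:5050/master/state.json return_content=yes
      register: master_state
      until: "'marathon' in master_state.content"
      retries: 30
      delay: 2

Strictly speaking only the very first start has to be serialized, but serial: 1 keeps the play simple.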
For the record, the github issue is available here: https://github.com/mesosphere/marathon/issues/1292 2015-03-12 17:27 GMT+01:00 Dario Rexin <da...@mesosphere.io>: > Hi Geoffrey, > > we identified the issue and will fix it in Marathon 0.8.2. To prevent this > behaviour for now, you just have to make sure that in a fresh setup > (Marathon was never connected to Mesos) you first start up a single > Marathon and let it register with Mesos and then start the other Marathon > instances. The problem is a race in first registration with Mesos and > fetching the FrameworkID from Zookeeper. Please let me know if the > workaround does not help you. > > Cheers, > Dario > > On 12 Mar 2015, at 09:20, Alex Rukletsov <a...@mesosphere.io> wrote: > > Geoffroy, > > yes, it looks like a marathon issue, so feel free to post it there as well. > > On Thu, Mar 12, 2015 at 1:34 AM, Geoffroy Jabouley < > geoffroy.jabou...@gmail.com> wrote: > >> Thanks Alex for your answer. I will have a look. >> >> Would it be better to (cross-)post this discussion on the marathon >> mailing list? >> >> Anyway, the issue is "fixed" for 0.8.0, which is the version i'm using. >> >> 2015-03-11 22:18 GMT+01:00 Alex Rukletsov <a...@mesosphere.io>: >> >>> Geoffroy, >>> >>> most probably you're hitting this bug: >>> https://github.com/mesosphere/marathon/issues/1063. The problem is that >>> Marathon can register instead of re-registering when a master fails >>> over. From master point of view, it's a new framework, that's why the >>> previous task is gone and a new one (that technically belongs to a new >>> framework) is started. You can see that frameworks have two different IDs >>> (check lines 11:31:40.055496 and 11:31:40.785038) in your example. >>> >>> Hope that helps, >>> Alex >>> >>> On Tue, Mar 10, 2015 at 4:04 AM, Geoffroy Jabouley < >>> geoffroy.jabou...@gmail.com> wrote: >>> >>>> Hello >>>> >>>> thanks for your interest. Following are the requested logs, which will >>>> result in a pretty big mail. >>>> >>>> Mesos/Marathon are *NOT running inside docker*, we only use Docker as >>>> our mesos containerizer. 
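(Side note for anyone reproducing this: the two framework IDs Alex points out can be read straight from the leading master's /master/state.json, e.g. before and after the failover. A rough one-liner, assuming the new leader is 10.195.30.21 and a Python 2 interpreter on the box:

curl -s http://10.195.30.21:5050/master/state.json \
  | python -c "import json, sys; d = json.load(sys.stdin); print [(f['name'], f['id']) for f in d['frameworks']]"

If Marathon really re-registered, the ID printed after the failover matches the one from before; a brand new ID is exactly the symptom described above.)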
>>>> >>>> For reminder, here is the use case performed to get the logs file: >>>> >>>> -------------------------------- >>>> >>>> Our cluster: 3 identical mesos nodes with: >>>> + zookeeper >>>> + docker 1.5 >>>> + mesos master 0.21.1 configured in HA mode >>>> + mesos slave 0.21.1 configured with checkpointing, strict and >>>> reconnect >>>> + marathon 0.8.0 configured in HA mode with checkpointing >>>> >>>> -------------------------------- >>>> >>>> *Begin State: * >>>> + the mesos cluster is up (3 machines) >>>> + mesos master leader is 10.195.30.19 >>>> + marathon leader is 10.195.30.21 >>>> + 1 docker task (let's call it APPTASK) is running on slave 10.195.30.21 >>>> >>>> *Action*: stop the mesos master leader process (sudo stop mesos-master) >>>> >>>> *Expected*: mesos master leader has changed, active tasks / frameworks >>>> remain unchanged >>>> >>>> *End state: * >>>> + mesos master leader *has changed, now 10.195.30.21* >>>> + previously running APPTASK on the slave 10.195.30.21 has "disappeared" >>>> (not showing anymore on the mesos UI), but the *docker container is still >>>> running* >>>> + a *new APPTASK is now running on slave 10.195.30.19* >>>> + marathon framework "registration time" in mesos UI shows "Just now" >>>> + marathon leader *has changed, now 10.195.30.20* >>>> >>>> >>>> -------------------------------- >>>> >>>> Now come the 6 requested logs, which might contain >>>> interesting/relevant information, though as a newcomer to Mesos I find them hard >>>> to read... >>>> >>>> >>>> *from previous MESOS master leader 10.195.30.19 <http://10.195.30.19/>:* >>>> W0310 11:31:28.310518 24289 logging.cpp:81] RAW: Received signal >>>> SIGTERM from process 1 of user 0; exiting >>>> >>>> >>>> *from new MESOS master leader 10.195.30.21 <http://10.195.30.21/>:* >>>> I0310 11:31:40.011545 922 detector.cpp:138] Detected a new leader: >>>> (id='2') >>>> I0310 11:31:40.011823 922 group.cpp:659] Trying to get >>>> '/mesos/info_0000000002' in ZooKeeper >>>> I0310 11:31:40.015496 915 network.hpp:424] ZooKeeper group >>>> memberships changed >>>> I0310 11:31:40.015847 915 group.cpp:659] Trying to get >>>> '/mesos/log_replicas/0000000000' in ZooKeeper >>>> I0310 11:31:40.016047 922 detector.cpp:433] A new leading master >>>> (UPID=master@10.195.30.21:5050) is detected >>>> I0310 11:31:40.016074 922 master.cpp:1263] The newly elected leader >>>> is master@10.195.30.21:5050 with id 20150310-112310-354337546-5050-895 >>>> I0310 11:31:40.016089 922 master.cpp:1276] Elected as the leading >>>> master! 
>>>> I0310 11:31:40.016108 922 master.cpp:1094] Recovering from registrar >>>> I0310 11:31:40.016188 918 registrar.cpp:313] Recovering registrar >>>> I0310 11:31:40.016542 918 log.cpp:656] Attempting to start the writer >>>> I0310 11:31:40.016918 918 replica.cpp:474] Replica received implicit >>>> promise request with proposal 2 >>>> I0310 11:31:40.017503 915 group.cpp:659] Trying to get >>>> '/mesos/log_replicas/0000000003' in ZooKeeper >>>> I0310 11:31:40.017832 918 leveldb.cpp:306] Persisting metadata (8 >>>> bytes) to leveldb took 893672ns >>>> I0310 11:31:40.017848 918 replica.cpp:342] Persisted promised to 2 >>>> I0310 11:31:40.018817 915 network.hpp:466] ZooKeeper group PIDs: { >>>> log-replica(1)@10.195.30.20:5050, log-replica(1)@10.195.30.21:5050 } >>>> I0310 11:31:40.023022 923 coordinator.cpp:230] Coordinator attemping >>>> to fill missing position >>>> I0310 11:31:40.023110 923 log.cpp:672] Writer started with ending >>>> position 8 >>>> I0310 11:31:40.023293 923 leveldb.cpp:438] Reading position from >>>> leveldb took 13195ns >>>> I0310 11:31:40.023309 923 leveldb.cpp:438] Reading position from >>>> leveldb took 3120ns >>>> I0310 11:31:40.023619 922 registrar.cpp:346] Successfully fetched the >>>> registry (610B) in 7.385856ms >>>> I0310 11:31:40.023679 922 registrar.cpp:445] Applied 1 operations in >>>> 9263ns; attempting to update the 'registry' >>>> I0310 11:31:40.024238 922 log.cpp:680] Attempting to append 647 bytes >>>> to the log >>>> I0310 11:31:40.024279 923 coordinator.cpp:340] Coordinator attempting >>>> to write APPEND action at position 9 >>>> I0310 11:31:40.024435 923 replica.cpp:508] Replica received write >>>> request for position 9 >>>> I0310 11:31:40.025707 923 leveldb.cpp:343] Persisting action (666 >>>> bytes) to leveldb took 1.259338ms >>>> I0310 11:31:40.025722 923 replica.cpp:676] Persisted action at 9 >>>> I0310 11:31:40.026074 923 replica.cpp:655] Replica received learned >>>> notice for position 9 >>>> I0310 11:31:40.026495 923 leveldb.cpp:343] Persisting action (668 >>>> bytes) to leveldb took 404795ns >>>> I0310 11:31:40.026507 923 replica.cpp:676] Persisted action at 9 >>>> I0310 11:31:40.026511 923 replica.cpp:661] Replica learned APPEND >>>> action at position 9 >>>> I0310 11:31:40.026726 923 registrar.cpp:490] Successfully updated the >>>> 'registry' in 3.029248ms >>>> I0310 11:31:40.026765 923 registrar.cpp:376] Successfully recovered >>>> registrar >>>> I0310 11:31:40.026814 923 log.cpp:699] Attempting to truncate the log >>>> to 9 >>>> I0310 11:31:40.026880 923 master.cpp:1121] Recovered 3 slaves from >>>> the Registry (608B) ; allowing 1days for slaves to re-register >>>> I0310 11:31:40.026897 923 coordinator.cpp:340] Coordinator attempting >>>> to write TRUNCATE action at position 10 >>>> I0310 11:31:40.026988 923 replica.cpp:508] Replica received write >>>> request for position 10 >>>> I0310 11:31:40.027640 923 leveldb.cpp:343] Persisting action (16 >>>> bytes) to leveldb took 641018ns >>>> I0310 11:31:40.027652 923 replica.cpp:676] Persisted action at 10 >>>> I0310 11:31:40.030848 923 replica.cpp:655] Replica received learned >>>> notice for position 10 >>>> I0310 11:31:40.031883 923 leveldb.cpp:343] Persisting action (18 >>>> bytes) to leveldb took 1.008914ms >>>> I0310 11:31:40.031963 923 leveldb.cpp:401] Deleting ~2 keys from >>>> leveldb took 46724ns >>>> I0310 11:31:40.031977 923 replica.cpp:676] Persisted action at 10 >>>> I0310 11:31:40.031986 923 replica.cpp:661] Replica learned TRUNCATE >>>> action at position 10 >>>> I0310 
11:31:40.055415 918 master.cpp:1383] Received registration >>>> request for framework 'marathon' at >>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 >>>> I0310 11:31:40.055496 918 master.cpp:1447] Registering framework >>>> 20150310-112310-354337546-5050-895-0000 (marathon) at >>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 >>>> I0310 11:31:40.055642 919 hierarchical_allocator_process.hpp:329] >>>> Added framework 20150310-112310-354337546-5050-895-0000 >>>> I0310 11:31:40.189151 919 master.cpp:3246] Re-registering slave >>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 >>>> (10.195.30.19) >>>> I0310 11:31:40.189280 919 registrar.cpp:445] Applied 1 operations in >>>> 15452ns; attempting to update the 'registry' >>>> I0310 11:31:40.189949 919 log.cpp:680] Attempting to append 647 bytes >>>> to the log >>>> I0310 11:31:40.189978 919 coordinator.cpp:340] Coordinator attempting >>>> to write APPEND action at position 11 >>>> I0310 11:31:40.190112 919 replica.cpp:508] Replica received write >>>> request for position 11 >>>> I0310 11:31:40.190563 919 leveldb.cpp:343] Persisting action (666 >>>> bytes) to leveldb took 437440ns >>>> I0310 11:31:40.190577 919 replica.cpp:676] Persisted action at 11 >>>> I0310 11:31:40.191249 921 replica.cpp:655] Replica received learned >>>> notice for position 11 >>>> I0310 11:31:40.192159 921 leveldb.cpp:343] Persisting action (668 >>>> bytes) to leveldb took 892767ns >>>> I0310 11:31:40.192178 921 replica.cpp:676] Persisted action at 11 >>>> I0310 11:31:40.192184 921 replica.cpp:661] Replica learned APPEND >>>> action at position 11 >>>> I0310 11:31:40.192350 921 registrar.cpp:490] Successfully updated the >>>> 'registry' in 3.0528ms >>>> I0310 11:31:40.192387 919 log.cpp:699] Attempting to truncate the log >>>> to 11 >>>> I0310 11:31:40.192415 919 coordinator.cpp:340] Coordinator attempting >>>> to write TRUNCATE action at position 12 >>>> I0310 11:31:40.192539 915 replica.cpp:508] Replica received write >>>> request for position 12 >>>> I0310 11:31:40.192600 921 master.cpp:3314] Re-registered slave >>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 >>>> (10.195.30.19) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; >>>> disk(*):89148 >>>> I0310 11:31:40.192680 917 hierarchical_allocator_process.hpp:442] >>>> Added slave 20150310-112310-320783114-5050-24289-S1 (10.195.30.19) with >>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and >>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 >>>> available) >>>> I0310 11:31:40.192847 917 master.cpp:3843] Sending 1 offers to >>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at >>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 >>>> I0310 11:31:40.193164 915 leveldb.cpp:343] Persisting action (16 >>>> bytes) to leveldb took 610664ns >>>> I0310 11:31:40.193181 915 replica.cpp:676] Persisted action at 12 >>>> I0310 11:31:40.193568 915 replica.cpp:655] Replica received learned >>>> notice for position 12 >>>> I0310 11:31:40.193948 915 leveldb.cpp:343] Persisting action (18 >>>> bytes) to leveldb took 364062ns >>>> I0310 11:31:40.193979 915 leveldb.cpp:401] Deleting ~2 keys from >>>> leveldb took 12256ns >>>> I0310 11:31:40.193985 915 replica.cpp:676] Persisted action at 12 >>>> I0310 11:31:40.193990 915 replica.cpp:661] Replica learned TRUNCATE >>>> action at position 12 >>>> I0310 11:31:40.248615 915 master.cpp:2344] Processing reply 
for >>>> offers: [ 20150310-112310-354337546-5050-895-O0 ] on slave >>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 >>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000 >>>> (marathon) at >>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 >>>> I0310 11:31:40.248744 915 hierarchical_allocator_process.hpp:563] >>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; >>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443]; >>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave >>>> 20150310-112310-320783114-5050-24289-S1 from framework >>>> 20150310-112310-354337546-5050-895-0000 >>>> I0310 11:31:40.774416 915 master.cpp:3246] Re-registering slave >>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051 >>>> (10.195.30.21) >>>> I0310 11:31:40.774976 915 registrar.cpp:445] Applied 1 operations in >>>> 42342ns; attempting to update the 'registry' >>>> I0310 11:31:40.777273 921 log.cpp:680] Attempting to append 647 bytes >>>> to the log >>>> I0310 11:31:40.777436 921 coordinator.cpp:340] Coordinator attempting >>>> to write APPEND action at position 13 >>>> I0310 11:31:40.777989 921 replica.cpp:508] Replica received write >>>> request for position 13 >>>> I0310 11:31:40.779558 921 leveldb.cpp:343] Persisting action (666 >>>> bytes) to leveldb took 1.513714ms >>>> I0310 11:31:40.779633 921 replica.cpp:676] Persisted action at 13 >>>> I0310 11:31:40.781821 919 replica.cpp:655] Replica received learned >>>> notice for position 13 >>>> I0310 11:31:40.784417 919 leveldb.cpp:343] Persisting action (668 >>>> bytes) to leveldb took 2.542036ms >>>> I0310 11:31:40.784446 919 replica.cpp:676] Persisted action at 13 >>>> I0310 11:31:40.784452 919 replica.cpp:661] Replica learned APPEND >>>> action at position 13 >>>> I0310 11:31:40.784711 920 registrar.cpp:490] Successfully updated the >>>> 'registry' in 9.68192ms >>>> I0310 11:31:40.784762 917 log.cpp:699] Attempting to truncate the log >>>> to 13 >>>> I0310 11:31:40.784808 920 coordinator.cpp:340] Coordinator attempting >>>> to write TRUNCATE action at position 14 >>>> I0310 11:31:40.784865 917 master.hpp:877] Adding task >>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 with >>>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave >>>> 20150310-112310-320783114-5050-24289-S2 (10.195.30.21) >>>> I0310 11:31:40.784955 919 replica.cpp:508] Replica received write >>>> request for position 14 >>>> W0310 11:31:40.785038 917 master.cpp:4468] Possibly orphaned task >>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 of >>>> framework 20150310-112310-320783114-5050-24289-0000 running on slave >>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051 >>>> (10.195.30.21) >>>> I0310 11:31:40.785105 917 master.cpp:3314] Re-registered slave >>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051 >>>> (10.195.30.21) with ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; >>>> disk(*):89148 >>>> I0310 11:31:40.785162 920 hierarchical_allocator_process.hpp:442] >>>> Added slave 20150310-112310-320783114-5050-24289-S2 (10.195.30.21) with >>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and >>>> ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; disk(*):89148 >>>> available) >>>> I0310 11:31:40.785679 921 master.cpp:3843] Sending 1 offers to >>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at >>>> 
scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 >>>> I0310 11:31:40.786429 919 leveldb.cpp:343] Persisting action (16 >>>> bytes) to leveldb took 1.454211ms >>>> I0310 11:31:40.786455 919 replica.cpp:676] Persisted action at 14 >>>> I0310 11:31:40.786782 919 replica.cpp:655] Replica received learned >>>> notice for position 14 >>>> I0310 11:31:40.787833 919 leveldb.cpp:343] Persisting action (18 >>>> bytes) to leveldb took 1.027014ms >>>> I0310 11:31:40.787873 919 leveldb.cpp:401] Deleting ~2 keys from >>>> leveldb took 14085ns >>>> I0310 11:31:40.787883 919 replica.cpp:676] Persisted action at 14 >>>> I0310 11:31:40.787889 919 replica.cpp:661] Replica learned TRUNCATE >>>> action at position 14 >>>> I0310 11:31:40.792536 922 master.cpp:2344] Processing reply for >>>> offers: [ 20150310-112310-354337546-5050-895-O1 ] on slave >>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051 >>>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000 >>>> (marathon) at >>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 >>>> I0310 11:31:40.792670 922 hierarchical_allocator_process.hpp:563] >>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; >>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000]; >>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave >>>> 20150310-112310-320783114-5050-24289-S2 from framework >>>> 20150310-112310-354337546-5050-895-0000 >>>> I0310 11:31:40.819602 921 master.cpp:3246] Re-registering slave >>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051 >>>> (10.195.30.20) >>>> I0310 11:31:40.819736 921 registrar.cpp:445] Applied 1 operations in >>>> 16656ns; attempting to update the 'registry' >>>> I0310 11:31:40.820617 921 log.cpp:680] Attempting to append 647 bytes >>>> to the log >>>> I0310 11:31:40.820726 918 coordinator.cpp:340] Coordinator attempting >>>> to write APPEND action at position 15 >>>> I0310 11:31:40.820938 918 replica.cpp:508] Replica received write >>>> request for position 15 >>>> I0310 11:31:40.821641 918 leveldb.cpp:343] Persisting action (666 >>>> bytes) to leveldb took 670583ns >>>> I0310 11:31:40.821663 918 replica.cpp:676] Persisted action at 15 >>>> I0310 11:31:40.822265 917 replica.cpp:655] Replica received learned >>>> notice for position 15 >>>> I0310 11:31:40.823463 917 leveldb.cpp:343] Persisting action (668 >>>> bytes) to leveldb took 1.178687ms >>>> I0310 11:31:40.823490 917 replica.cpp:676] Persisted action at 15 >>>> I0310 11:31:40.823498 917 replica.cpp:661] Replica learned APPEND >>>> action at position 15 >>>> I0310 11:31:40.823755 917 registrar.cpp:490] Successfully updated the >>>> 'registry' in 3.97696ms >>>> I0310 11:31:40.823823 917 log.cpp:699] Attempting to truncate the log >>>> to 15 >>>> I0310 11:31:40.824147 922 coordinator.cpp:340] Coordinator attempting >>>> to write TRUNCATE action at position 16 >>>> I0310 11:31:40.824482 922 hierarchical_allocator_process.hpp:442] >>>> Added slave 20150310-112310-320783114-5050-24289-S0 (10.195.30.20) with >>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 (and >>>> ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; disk(*):89148 >>>> available) >>>> I0310 11:31:40.824597 921 replica.cpp:508] Replica received write >>>> request for position 16 >>>> I0310 11:31:40.824128 917 master.cpp:3314] Re-registered slave >>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051 >>>> (10.195.30.20) with ports(*):[31000-32000, 80-443]; 
cpus(*):2; mem(*):6961; >>>> disk(*):89148 >>>> I0310 11:31:40.824975 917 master.cpp:3843] Sending 1 offers to >>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at >>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 >>>> I0310 11:31:40.831900 921 leveldb.cpp:343] Persisting action (16 >>>> bytes) to leveldb took 7.228682ms >>>> I0310 11:31:40.832031 921 replica.cpp:676] Persisted action at 16 >>>> I0310 11:31:40.832456 917 replica.cpp:655] Replica received learned >>>> notice for position 16 >>>> I0310 11:31:40.835178 917 leveldb.cpp:343] Persisting action (18 >>>> bytes) to leveldb took 2.674392ms >>>> I0310 11:31:40.835297 917 leveldb.cpp:401] Deleting ~2 keys from >>>> leveldb took 45220ns >>>> I0310 11:31:40.835322 917 replica.cpp:676] Persisted action at 16 >>>> I0310 11:31:40.835341 917 replica.cpp:661] Replica learned TRUNCATE >>>> action at position 16 >>>> I0310 11:31:40.838281 923 master.cpp:2344] Processing reply for >>>> offers: [ 20150310-112310-354337546-5050-895-O2 ] on slave >>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051 >>>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000 >>>> (marathon) at >>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 >>>> I0310 11:31:40.838389 923 hierarchical_allocator_process.hpp:563] >>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; >>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443]; >>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave >>>> 20150310-112310-320783114-5050-24289-S0 from framework >>>> 20150310-112310-354337546-5050-895-0000 >>>> I0310 11:31:40.948725 919 http.cpp:344] HTTP request for >>>> '/master/redirect' >>>> I0310 11:31:41.479118 918 http.cpp:478] HTTP request for >>>> '/master/state.json' >>>> I0310 11:31:45.368074 918 master.cpp:3843] Sending 1 offers to >>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at >>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 >>>> I0310 11:31:45.385144 917 master.cpp:2344] Processing reply for >>>> offers: [ 20150310-112310-354337546-5050-895-O3 ] on slave >>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 >>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000 >>>> (marathon) at >>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 >>>> I0310 11:31:45.385292 917 hierarchical_allocator_process.hpp:563] >>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; >>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443]; >>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave >>>> 20150310-112310-320783114-5050-24289-S1 from framework >>>> 20150310-112310-354337546-5050-895-0000 >>>> I0310 11:31:46.368450 917 master.cpp:3843] Sending 2 offers to >>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at >>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 >>>> I0310 11:31:46.375222 920 master.cpp:2344] Processing reply for >>>> offers: [ 20150310-112310-354337546-5050-895-O4 ] on slave >>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051 >>>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000 >>>> (marathon) at >>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 >>>> I0310 11:31:46.375360 920 master.cpp:2344] Processing reply for >>>> offers: [ 20150310-112310-354337546-5050-895-O5 ] on slave >>>> 20150310-112310-320783114-5050-24289-S2 at 
slave(1)@10.195.30.21:5051 >>>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000 >>>> (marathon) at >>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 >>>> I0310 11:31:46.375530 920 hierarchical_allocator_process.hpp:563] >>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; >>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443]; >>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave >>>> 20150310-112310-320783114-5050-24289-S0 from framework >>>> 20150310-112310-354337546-5050-895-0000 >>>> I0310 11:31:46.375599 920 hierarchical_allocator_process.hpp:563] >>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; >>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000]; >>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave >>>> 20150310-112310-320783114-5050-24289-S2 from framework >>>> 20150310-112310-354337546-5050-895-0000 >>>> I0310 11:31:48.031230 915 http.cpp:478] HTTP request for >>>> '/master/state.json' >>>> I0310 11:31:51.374285 922 master.cpp:3843] Sending 1 offers to >>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at >>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 >>>> I0310 11:31:51.379391 921 master.cpp:2344] Processing reply for >>>> offers: [ 20150310-112310-354337546-5050-895-O6 ] on slave >>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 >>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000 >>>> (marathon) at >>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 >>>> I0310 11:31:51.379487 921 hierarchical_allocator_process.hpp:563] >>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; >>>> disk(*):89148 (total allocatable: ports(*):[31000-32000, 80-443]; >>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave >>>> 20150310-112310-320783114-5050-24289-S1 from framework >>>> 20150310-112310-354337546-5050-895-0000 >>>> I0310 11:31:51.482094 923 http.cpp:478] HTTP request for >>>> '/master/state.json' >>>> I0310 11:31:52.375326 917 master.cpp:3843] Sending 2 offers to >>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at >>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 >>>> I0310 11:31:52.391376 919 master.cpp:2344] Processing reply for >>>> offers: [ 20150310-112310-354337546-5050-895-O7 ] on slave >>>> 20150310-112310-320783114-5050-24289-S2 at slave(1)@10.195.30.21:5051 >>>> (10.195.30.21) for framework 20150310-112310-354337546-5050-895-0000 >>>> (marathon) at >>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 >>>> I0310 11:31:52.391512 919 hierarchical_allocator_process.hpp:563] >>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; >>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000]; >>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave >>>> 20150310-112310-320783114-5050-24289-S2 from framework >>>> 20150310-112310-354337546-5050-895-0000 >>>> I0310 11:31:52.391659 921 master.cpp:2344] Processing reply for >>>> offers: [ 20150310-112310-354337546-5050-895-O8 ] on slave >>>> 20150310-112310-320783114-5050-24289-S0 at slave(1)@10.195.30.20:5051 >>>> (10.195.30.20) for framework 20150310-112310-354337546-5050-895-0000 >>>> (marathon) at >>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 >>>> I0310 11:31:52.391751 921 hierarchical_allocator_process.hpp:563] >>>> Recovered ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):6961; >>>> disk(*):89148 
(total allocatable: ports(*):[31000-32000, 80-443]; >>>> cpus(*):2; mem(*):6961; disk(*):89148) on slave >>>> 20150310-112310-320783114-5050-24289-S0 from framework >>>> 20150310-112310-354337546-5050-895-0000 >>>> I0310 11:31:55.062060 918 master.cpp:3611] Performing explicit task >>>> state reconciliation for 1 tasks of framework >>>> 20150310-112310-354337546-5050-895-0000 (marathon) at >>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 >>>> I0310 11:31:55.062588 919 master.cpp:3556] Performing implicit task >>>> state reconciliation for framework 20150310-112310-354337546-5050-895-0000 >>>> (marathon) at >>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 >>>> I0310 11:31:56.140990 923 http.cpp:344] HTTP request for >>>> '/master/redirect' >>>> I0310 11:31:57.379288 918 master.cpp:3843] Sending 1 offers to >>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at >>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 >>>> I0310 11:31:57.430888 918 master.cpp:2344] Processing reply for >>>> offers: [ 20150310-112310-354337546-5050-895-O9 ] on slave >>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 >>>> (10.195.30.19) for framework 20150310-112310-354337546-5050-895-0000 >>>> (marathon) at >>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 >>>> I0310 11:31:57.431068 918 master.hpp:877] Adding task >>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 with >>>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave >>>> 20150310-112310-320783114-5050-24289-S1 (10.195.30.19) >>>> I0310 11:31:57.431089 918 master.cpp:2503] Launching task >>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of >>>> framework 20150310-112310-354337546-5050-895-0000 (marathon) at >>>> scheduler-c5ae752d-1ffe-40e2-a89b-41013050bec9@10.195.30.20:45771 with >>>> resources cpus(*):0.1; mem(*):128; ports(*):[31000-31000] on slave >>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 >>>> (10.195.30.19) >>>> I0310 11:31:57.431205 918 hierarchical_allocator_process.hpp:563] >>>> Recovered ports(*):[80-443, 31001-32000]; cpus(*):1.9; mem(*):6833; >>>> disk(*):89148 (total allocatable: ports(*):[80-443, 31001-32000]; >>>> cpus(*):1.9; mem(*):6833; disk(*):89148) on slave >>>> 20150310-112310-320783114-5050-24289-S1 from framework >>>> 20150310-112310-354337546-5050-895-0000 >>>> I0310 11:31:57.682133 919 master.cpp:3446] Forwarding status update >>>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task >>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of >>>> framework 20150310-112310-354337546-5050-895-0000 >>>> I0310 11:31:57.682186 919 master.cpp:3418] Status update TASK_RUNNING >>>> (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task >>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of >>>> framework 20150310-112310-354337546-5050-895-0000 from slave >>>> 20150310-112310-320783114-5050-24289-S1 at slave(1)@10.195.30.19:5051 >>>> (10.195.30.19) >>>> I0310 11:31:57.682199 919 master.cpp:4693] Updating the latest state >>>> of task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 >>>> of framework 20150310-112310-354337546-5050-895-0000 to TASK_RUNNING >>>> >>>> >>>> *from MESOS slave 10.195.30.21* >>>> I0310 11:31:28.750200 1074 slave.cpp:2623] master@10.195.30.19:5050 >>>> exited >>>> W0310 11:31:28.750249 1074 slave.cpp:2626] Master disconnected! 
>>>> Waiting for a new master to be elected >>>> I0310 11:31:40.012516 1075 detector.cpp:138] Detected a new leader: >>>> (id='2') >>>> I0310 11:31:40.012899 1073 group.cpp:659] Trying to get >>>> '/mesos/info_0000000002' in ZooKeeper >>>> I0310 11:31:40.017143 1072 detector.cpp:433] A new leading master >>>> (UPID=master@10.195.30.21:5050) is detected >>>> I0310 11:31:40.017408 1072 slave.cpp:602] New master detected at >>>> master@10.195.30.21:5050 >>>> I0310 11:31:40.017546 1076 status_update_manager.cpp:171] Pausing >>>> sending status updates >>>> I0310 11:31:40.018673 1072 slave.cpp:627] No credentials provided. >>>> Attempting to register without authentication >>>> I0310 11:31:40.018689 1072 slave.cpp:638] Detecting new master >>>> I0310 11:31:40.785364 1075 slave.cpp:824] Re-registered with master >>>> master@10.195.30.21:5050 >>>> I0310 11:31:40.785398 1075 status_update_manager.cpp:178] Resuming >>>> sending status updates >>>> I0310 11:32:10.639506 1075 slave.cpp:3321] Current usage 12.27%. Max >>>> allowed age: 5.441217749539572days >>>> >>>> >>>> *from MESOS slave 10.195.30.19* >>>> I0310 11:31:28.749577 24457 slave.cpp:2623] master@10.195.30.19:5050 >>>> exited >>>> W0310 11:31:28.749604 24457 slave.cpp:2626] Master disconnected! >>>> Waiting for a new master to be elected >>>> I0310 11:31:40.013056 24462 detector.cpp:138] Detected a new leader: >>>> (id='2') >>>> I0310 11:31:40.013530 24458 group.cpp:659] Trying to get >>>> '/mesos/info_0000000002' in ZooKeeper >>>> I0310 11:31:40.015897 24458 detector.cpp:433] A new leading master >>>> (UPID=master@10.195.30.21:5050) is detected >>>> I0310 11:31:40.015976 24458 slave.cpp:602] New master detected at >>>> master@10.195.30.21:5050 >>>> I0310 11:31:40.016027 24458 slave.cpp:627] No credentials provided. 
>>>> Attempting to register without authentication >>>> I0310 11:31:40.016075 24458 slave.cpp:638] Detecting new master >>>> I0310 11:31:40.016091 24458 status_update_manager.cpp:171] Pausing >>>> sending status updates >>>> I0310 11:31:40.192397 24462 slave.cpp:824] Re-registered with master >>>> master@10.195.30.21:5050 >>>> I0310 11:31:40.192437 24462 status_update_manager.cpp:178] Resuming >>>> sending status updates >>>> I0310 11:31:57.431139 24461 slave.cpp:1083] Got assigned task >>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for >>>> framework 20150310-112310-354337546-5050-895-0000 >>>> I0310 11:31:57.431479 24461 slave.cpp:1193] Launching task >>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for >>>> framework 20150310-112310-354337546-5050-895-0000 >>>> I0310 11:31:57.432144 24461 slave.cpp:3997] Launching executor >>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of >>>> framework 20150310-112310-354337546-5050-895-0000 in work directory >>>> '/tmp/mesos/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9' >>>> I0310 11:31:57.432318 24461 slave.cpp:1316] Queuing task >>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' for >>>> executor >>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of >>>> framework '20150310-112310-354337546-5050-895-0000 >>>> I0310 11:31:57.434217 24461 docker.cpp:927] Starting container >>>> 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9' for task >>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' (and >>>> executor >>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799') of >>>> framework '20150310-112310-354337546-5050-895-0000' >>>> I0310 11:31:57.652439 24461 docker.cpp:633] Checkpointing pid 24573 to >>>> '/tmp/mesos/meta/slaves/20150310-112310-320783114-5050-24289-S1/frameworks/20150310-112310-354337546-5050-895-0000/executors/ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799/runs/a8f9aba9-1bc7-4673-854e-82d9fdea8ca9/pids/forked.pid' >>>> I0310 11:31:57.653270 24461 slave.cpp:2840] Monitoring executor >>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of >>>> framework '20150310-112310-354337546-5050-895-0000' in container >>>> 'a8f9aba9-1bc7-4673-854e-82d9fdea8ca9' >>>> I0310 11:31:57.675488 24461 slave.cpp:1860] Got registration for >>>> executor >>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of >>>> framework 20150310-112310-354337546-5050-895-0000 from executor(1)@ >>>> 10.195.30.19:56574 >>>> I0310 11:31:57.675696 24461 slave.cpp:1979] Flushing queued task >>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 for >>>> executor >>>> 'ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799' of >>>> framework 20150310-112310-354337546-5050-895-0000 >>>> I0310 11:31:57.678129 24461 slave.cpp:2215] Handling status update >>>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task >>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of >>>> framework 20150310-112310-354337546-5050-895-0000 from executor(1)@ >>>> 10.195.30.19:56574 >>>> I0310 11:31:57.678251 24461 status_update_manager.cpp:317] Received >>>> status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for >>>> task 
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of >>>> framework 20150310-112310-354337546-5050-895-0000 >>>> I0310 11:31:57.678411 24461 status_update_manager.hpp:346] >>>> Checkpointing UPDATE for status update TASK_RUNNING (UUID: >>>> 2afb200e-a172-49ec-b807-dc47ea381e1e) for task >>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of >>>> framework 20150310-112310-354337546-5050-895-0000 >>>> I0310 11:31:57.681231 24461 slave.cpp:2458] Forwarding the update >>>> TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for task >>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of >>>> framework 20150310-112310-354337546-5050-895-0000 to >>>> master@10.195.30.21:5050 >>>> I0310 11:31:57.681277 24461 slave.cpp:2391] Sending acknowledgement for >>>> status update TASK_RUNNING (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) for >>>> task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of >>>> framework 20150310-112310-354337546-5050-895-0000 to executor(1)@ >>>> 10.195.30.19:56574 >>>> I0310 11:31:57.689007 24461 status_update_manager.cpp:389] Received >>>> status update acknowledgement (UUID: 2afb200e-a172-49ec-b807-dc47ea381e1e) >>>> for task >>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of >>>> framework 20150310-112310-354337546-5050-895-0000 >>>> I0310 11:31:57.689028 24461 status_update_manager.hpp:346] >>>> Checkpointing ACK for status update TASK_RUNNING (UUID: >>>> 2afb200e-a172-49ec-b807-dc47ea381e1e) for task >>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799 of >>>> framework 20150310-112310-354337546-5050-895-0000 >>>> I0310 11:31:57.755231 24461 docker.cpp:1298] Updated 'cpu.shares' to >>>> 204 at >>>> /sys/fs/cgroup/cpu/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2 >>>> for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9 >>>> I0310 11:31:57.755570 24461 docker.cpp:1333] Updated >>>> 'memory.soft_limit_in_bytes' to 160MB for container >>>> a8f9aba9-1bc7-4673-854e-82d9fdea8ca9 >>>> I0310 11:31:57.756013 24461 docker.cpp:1359] Updated >>>> 'memory.limit_in_bytes' to 160MB at >>>> /sys/fs/cgroup/memory/docker/e76080a071fa9cfb57e66df195c7650aee2f08cd9a23b81622a72e85d78f90b2 >>>> for container a8f9aba9-1bc7-4673-854e-82d9fdea8ca9 >>>> I0310 11:32:10.680750 24459 slave.cpp:3321] Current usage 10.64%. Max >>>> allowed age: 5.555425200437824days >>>> >>>> >>>> *From previous Marathon leader 10.195.30.21 <http://10.195.30.21/>:* >>>> I0310 11:31:40.017248 1115 detector.cpp:138] Detected a new leader: >>>> (id='2') >>>> I0310 11:31:40.017334 1115 group.cpp:659] Trying to get >>>> '/mesos/info_0000000002' in ZooKeeper >>>> I0310 11:31:40.017727 1115 detector.cpp:433] A new leading master >>>> (UPID=master@10.195.30.21:5050) is detected >>>> [2015-03-10 11:31:40,017] WARN Disconnected >>>> (mesosphere.marathon.MarathonScheduler:224) >>>> [2015-03-10 11:31:40,019] INFO Abdicating >>>> (mesosphere.marathon.MarathonSchedulerService:312) >>>> [2015-03-10 11:31:40,019] INFO Defeat leadership >>>> (mesosphere.marathon.MarathonSchedulerService:285) >>>> [INFO] [03/10/2015 11:31:40.019] >>>> [marathon-akka.actor.default-dispatcher-6] [akka://marathon/user/$b] >>>> POSTing to all endpoints. 
>>>> [INFO] [03/10/2015 11:31:40.019] >>>> [marathon-akka.actor.default-dispatcher-5] [ >>>> akka://marathon/user/MarathonScheduler/$a] Suspending scheduler actor >>>> [2015-03-10 11:31:40,021] INFO Stopping driver >>>> (mesosphere.marathon.MarathonSchedulerService:221) >>>> I0310 11:31:40.022001 1115 sched.cpp:1286] Asked to stop the driver >>>> [2015-03-10 11:31:40,024] INFO Setting framework ID to >>>> 20150310-112310-320783114-5050-24289-0000 >>>> (mesosphere.marathon.MarathonSchedulerService:73) >>>> I0310 11:31:40.026274 1115 sched.cpp:234] New master detected at >>>> master@10.195.30.21:5050 >>>> I0310 11:31:40.026418 1115 sched.cpp:242] No credentials provided. >>>> Attempting to register without authentication >>>> I0310 11:31:40.026458 1115 sched.cpp:752] Stopping framework >>>> '20150310-112310-320783114-5050-24289-0000' >>>> [2015-03-10 11:31:40,026] INFO Driver future completed. Executing >>>> optional abdication command. >>>> (mesosphere.marathon.MarathonSchedulerService:192) >>>> [2015-03-10 11:31:40,032] INFO Defeated (Leader Interface) >>>> (mesosphere.marathon.MarathonSchedulerService:246) >>>> [2015-03-10 11:31:40,032] INFO Defeat leadership >>>> (mesosphere.marathon.MarathonSchedulerService:285) >>>> [2015-03-10 11:31:40,032] INFO Stopping driver >>>> (mesosphere.marathon.MarathonSchedulerService:221) >>>> I0310 11:31:40.032588 1107 sched.cpp:1286] Asked to stop the driver >>>> [2015-03-10 11:31:40,033] INFO Will offer leadership after 500 >>>> milliseconds backoff (mesosphere.marathon.MarathonSchedulerService:334) >>>> [2015-03-10 11:31:40,033] INFO Setting framework ID to >>>> 20150310-112310-320783114-5050-24289-0000 >>>> (mesosphere.marathon.MarathonSchedulerService:73) >>>> [2015-03-10 11:31:40,035] ERROR Current member ID member_0000000000 is >>>> not a candidate for leader, current voting: [member_0000000001, >>>> member_0000000002] (com.twitter.common.zookeeper.CandidateImpl:144) >>>> [2015-03-10 11:31:40,552] INFO Using HA and therefore offering >>>> leadership (mesosphere.marathon.MarathonSchedulerService:341) >>>> [2015-03-10 11:31:40,563] INFO Set group member ID to member_0000000003 >>>> (com.twitter.common.zookeeper.Group:426) >>>> [2015-03-10 11:31:40,565] ERROR Current member ID member_0000000000 is >>>> not a candidate for leader, current voting: [member_0000000001, >>>> member_0000000002, member_0000000003] >>>> (com.twitter.common.zookeeper.CandidateImpl:144) >>>> [2015-03-10 11:31:40,568] INFO Candidate >>>> /marathon/leader/member_0000000003 waiting for the next leader election, >>>> current voting: [member_0000000001, member_0000000002, member_0000000003] >>>> (com.twitter.common.zookeeper.CandidateImpl:165) >>>> >>>> >>>> *From new Marathon leader 10.195.30.20 <http://10.195.30.20/>:* >>>> [2015-03-10 11:31:40,029] INFO Candidate >>>> /marathon/leader/member_0000000001 is now leader of group: >>>> [member_0000000001, member_0000000002] >>>> (com.twitter.common.zookeeper.CandidateImpl:152) >>>> [2015-03-10 11:31:40,030] INFO Elected (Leader Interface) >>>> (mesosphere.marathon.MarathonSchedulerService:253) >>>> [2015-03-10 11:31:40,044] INFO Elect leadership >>>> (mesosphere.marathon.MarathonSchedulerService:299) >>>> [2015-03-10 11:31:40,044] INFO Running driver >>>> (mesosphere.marathon.MarathonSchedulerService:184) >>>> I0310 11:31:40.044770 22734 sched.cpp:137] Version: 0.21.1 >>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@712: >>>> Client environment:zookeeper.version=zookeeper C client 3.4.5 >>>> 2015-03-10 
11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@716: >>>> Client environment:host.name=srv-d2u-9-virtip20 >>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@723: >>>> Client environment:os.name=Linux >>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@724: >>>> Client environment:os.arch=3.13.0-44-generic >>>> 2015-03-10 11:31:40,045:22509(0x7fda83fff700):ZOO_INFO@log_env@725: >>>> Client environment:os.version=#73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014 >>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@733: >>>> Client environment:user.name=(null) >>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@741: >>>> Client environment:user.home=/root >>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@log_env@753: >>>> Client environment:user.dir=/ >>>> 2015-03-10 11:31:40,046:22509(0x7fda83fff700):ZOO_INFO@zookeeper_init@786: >>>> Initiating client connection, host=10.195.30.19:2181,10.195.30.20:2181, >>>> 10.195.30.21:2181 sessionTimeout=10000 watcher=0x7fda9da9a6a0 >>>> sessionId=0 sessionPasswd=<null> context=0x7fdaa400dd10 flags=0 >>>> [2015-03-10 11:31:40,046] INFO Reset offerLeadership backoff >>>> (mesosphere.marathon.MarathonSchedulerService:329) >>>> 2015-03-10 11:31:40,047:22509(0x7fda816f2700):ZOO_INFO@check_events@1703: >>>> initiated connection to server [10.195.30.19:2181] >>>> 2015-03-10 11:31:40,049:22509(0x7fda816f2700):ZOO_INFO@check_events@1750: >>>> session establishment complete on server [10.195.30.19:2181], >>>> sessionId=0x14c0335ad7e000d, negotiated timeout=10000 >>>> I0310 11:31:40.049991 22645 group.cpp:313] Group process (group(1)@ >>>> 10.195.30.20:45771) connected to ZooKeeper >>>> I0310 11:31:40.050012 22645 group.cpp:790] Syncing group operations: >>>> queue size (joins, cancels, datas) = (0, 0, 0) >>>> I0310 11:31:40.050024 22645 group.cpp:385] Trying to create path >>>> '/mesos' in ZooKeeper >>>> [INFO] [03/10/2015 11:31:40.047] >>>> [marathon-akka.actor.default-dispatcher-2] [ >>>> akka://marathon/user/MarathonScheduler/$a] Starting scheduler actor >>>> I0310 11:31:40.053429 22645 detector.cpp:138] Detected a new leader: >>>> (id='2') >>>> I0310 11:31:40.053530 22641 group.cpp:659] Trying to get >>>> '/mesos/info_0000000002' in ZooKeeper >>>> [2015-03-10 11:31:40,053] INFO Migration successfully applied for >>>> version Version(0, 8, 0) (mesosphere.marathon.state.Migration:69) >>>> I0310 11:31:40.054226 22640 detector.cpp:433] A new leading master >>>> (UPID=master@10.195.30.21:5050) is detected >>>> I0310 11:31:40.054281 22640 sched.cpp:234] New master detected at >>>> master@10.195.30.21:5050 >>>> I0310 11:31:40.054352 22640 sched.cpp:242] No credentials provided. >>>> Attempting to register without authentication >>>> I0310 11:31:40.055160 22640 sched.cpp:408] Framework registered with >>>> 20150310-112310-354337546-5050-895-0000 >>>> [2015-03-10 11:31:40,056] INFO Registered as >>>> 20150310-112310-354337546-5050-895-0000 to master >>>> '20150310-112310-354337546-5050-895' >>>> (mesosphere.marathon.MarathonScheduler:72) >>>> [2015-03-10 11:31:40,063] INFO Stored framework ID >>>> '20150310-112310-354337546-5050-895-0000' >>>> (mesosphere.mesos.util.FrameworkIdUtil:49) >>>> [INFO] [03/10/2015 11:31:40.065] >>>> [marathon-akka.actor.default-dispatcher-6] [ >>>> akka://marathon/user/MarathonScheduler/$a] Scheduler actor ready >>>> [INFO] [03/10/2015 11:31:40.067] >>>> [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b] >>>> POSTing to all endpoints. >>>> ... 
>>>> ... >>>> ... >>>> [2015-03-10 11:31:55,052] INFO Syncing tasks for all apps >>>> (mesosphere.marathon.SchedulerActions:403) >>>> [INFO] [03/10/2015 11:31:55.053] >>>> [marathon-akka.actor.default-dispatcher-10] [akka://marathon/deadLetters] >>>> Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from >>>> Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to >>>> Actor[akka://marathon/deadLetters] was not delivered. [1] dead letters >>>> encountered. This logging can be turned off or adjusted with configuration >>>> settings 'akka.log-dead-letters' and >>>> 'akka.log-dead-letters-during-shutdown'. >>>> [2015-03-10 11:31:55,054] INFO Requesting task reconciliation with the >>>> Mesos master (mesosphere.marathon.SchedulerActions:430) >>>> [2015-03-10 11:31:55,064] INFO Received status update for task >>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799: >>>> TASK_LOST (Reconciliation: Task is unknown to the slave) >>>> (mesosphere.marathon.MarathonScheduler:148) >>>> [2015-03-10 11:31:55,069] INFO Need to scale >>>> /ffaas-backoffice-app-nopersist from 0 up to 1 instances >>>> (mesosphere.marathon.SchedulerActions:488) >>>> [2015-03-10 11:31:55,069] INFO Queueing 1 new tasks for >>>> /ffaas-backoffice-app-nopersist (0 queued) >>>> (mesosphere.marathon.SchedulerActions:494) >>>> [2015-03-10 11:31:55,069] INFO Task >>>> ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 >>>> expunged and removed from TaskTracker >>>> (mesosphere.marathon.tasks.TaskTracker:107) >>>> [2015-03-10 11:31:55,070] INFO Sending event notification. >>>> (mesosphere.marathon.MarathonScheduler:262) >>>> [INFO] [03/10/2015 11:31:55.072] >>>> [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b] >>>> POSTing to all endpoints. >>>> [2015-03-10 11:31:55,073] INFO Need to scale >>>> /ffaas-backoffice-app-nopersist from 0 up to 1 instances >>>> (mesosphere.marathon.SchedulerActions:488) >>>> [2015-03-10 11:31:55,074] INFO Already queued 1 tasks for >>>> /ffaas-backoffice-app-nopersist. Not scaling. >>>> (mesosphere.marathon.SchedulerActions:498) >>>> ... >>>> ... >>>> ... >>>> [2015-03-10 11:31:57,682] INFO Received status update for task >>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799: >>>> TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:148) >>>> [2015-03-10 11:31:57,694] INFO Sending event notification. >>>> (mesosphere.marathon.MarathonScheduler:262) >>>> [INFO] [03/10/2015 11:31:57.694] >>>> [marathon-akka.actor.default-dispatcher-11] [akka://marathon/user/$b] >>>> POSTing to all endpoints. >>>> ... >>>> ... >>>> ... >>>> [2015-03-10 11:36:55,047] INFO Expunging orphaned tasks from store >>>> (mesosphere.marathon.tasks.TaskTracker:170) >>>> [INFO] [03/10/2015 11:36:55.050] >>>> [marathon-akka.actor.default-dispatcher-2] [akka://marathon/deadLetters] >>>> Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from >>>> Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to >>>> Actor[akka://marathon/deadLetters] was not delivered. [2] dead letters >>>> encountered. This logging can be turned off or adjusted with configuration >>>> settings 'akka.log-dead-letters' and >>>> 'akka.log-dead-letters-during-shutdown'. 
>>>> [2015-03-10 11:36:55,057] INFO Syncing tasks for all apps >>>> (mesosphere.marathon.SchedulerActions:403) >>>> [2015-03-10 11:36:55,058] INFO Requesting task reconciliation with the >>>> Mesos master (mesosphere.marathon.SchedulerActions:430) >>>> [2015-03-10 11:36:55,063] INFO Received status update for task >>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799: >>>> TASK_RUNNING (Reconciliation: Latest task state) >>>> (mesosphere.marathon.MarathonScheduler:148) >>>> [2015-03-10 11:36:55,065] INFO Received status update for task >>>> ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799: >>>> TASK_RUNNING (Reconciliation: Latest task state) >>>> (mesosphere.marathon.MarathonScheduler:148) >>>> [2015-03-10 11:36:55,066] INFO Already running 1 instances of >>>> /ffaas-backoffice-app-nopersist. Not scaling. >>>> (mesosphere.marathon.SchedulerActions:512) >>>> >>>> >>>> >>>> -- End of logs >>>> >>>> >>>> >>>> 2015-03-10 10:25 GMT+01:00 Adam Bordelon <a...@mesosphere.io>: >>>> >>>>> This is certainly not the expected/desired behavior when failing over >>>>> a mesos master in HA mode. In addition to the master logs Alex requested, >>>>> can you also provide relevant portions of the slave logs for these tasks? >>>>> If the slave processes themselves never failed over, checkpointing and >>>>> slave recovery should be irrelevant. Are you running the mesos-slave >>>>> itself >>>>> inside a Docker, or any other non-traditional setup? >>>>> >>>>> FYI, --checkpoint defaults to true (and is removed in 0.22), --recover >>>>> defaults to "reconnect", and --strict defaults to true, so none of those >>>>> are necessary. >>>>> >>>>> On Fri, Mar 6, 2015 at 10:09 AM, Alex Rukletsov <a...@mesosphere.io> >>>>> wrote: >>>>> >>>>>> Geoffroy, >>>>>> >>>>>> could you please provide master logs (both from killed and taking >>>>>> over masters)? >>>>>> >>>>>> On Fri, Mar 6, 2015 at 4:26 AM, Geoffroy Jabouley < >>>>>> geoffroy.jabou...@gmail.com> wrote: >>>>>> >>>>>>> Hello >>>>>>> >>>>>>> we are facing some unexpecting issues when testing high availability >>>>>>> behaviors of our mesos cluster. >>>>>>> >>>>>>> *Our use case:* >>>>>>> >>>>>>> *State*: the mesos cluster is up (3 machines), 1 docker task is >>>>>>> running on each slave (started from marathon) >>>>>>> >>>>>>> *Action*: stop the mesos master leader process >>>>>>> >>>>>>> *Expected*: mesos master leader has changed, *active tasks remain >>>>>>> unchanged* >>>>>>> >>>>>>> *Seen*: mesos master leader has changed, *all active tasks are now >>>>>>> FAILED but docker containers are still running*, marathon detects >>>>>>> FAILED tasks and starts new tasks. We end with 2 docker containers >>>>>>> running >>>>>>> on each machine, but only one is linked to a RUNNING mesos task. >>>>>>> >>>>>>> >>>>>>> Is the seen behavior correct? >>>>>>> >>>>>>> Have we misunderstood the high availability concept? 
We thought that >>>>>>> doing this use case would not have any impact on the current cluster >>>>>>> state >>>>>>> (except leader re-election) >>>>>>> >>>>>>> Thanks in advance for your help >>>>>>> Regards >>>>>>> >>>>>>> --------------------------------------------------- >>>>>>> >>>>>>> our setup is the following: >>>>>>> 3 identical mesos nodes with: >>>>>>> + zookeeper >>>>>>> + docker 1.5 >>>>>>> + mesos master 0.21.1 configured in HA mode >>>>>>> + mesos slave 0.21.1 configured with checkpointing, strict and >>>>>>> reconnect >>>>>>> + marathon 0.8.0 configured in HA mode with checkpointing >>>>>>> >>>>>>> --------------------------------------------------- >>>>>>> >>>>>>> Command lines: >>>>>>> >>>>>>> >>>>>>> *mesos-master*usr/sbin/mesos-master --zk=zk://10.195.30.19:2181, >>>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos --port=5050 >>>>>>> --cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19 --ip=10.195.30.19 >>>>>>> --quorum=2 --slave_reregister_timeout=1days --work_dir=/var/lib/mesos >>>>>>> >>>>>>> *mesos-slave* >>>>>>> /usr/sbin/mesos-slave --master=zk://10.195.30.19:2181, >>>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos --checkpoint >>>>>>> --containerizers=docker,mesos --executor_registration_timeout=5mins >>>>>>> --hostname=10.195.30.19 --ip=10.195.30.19 >>>>>>> --isolation=cgroups/cpu,cgroups/mem --recover=reconnect >>>>>>> --recovery_timeout=120mins --strict >>>>>>> --resources=ports:[31000-32000,80,443] >>>>>>> >>>>>>> *marathon* >>>>>>> java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64 >>>>>>> -Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp >>>>>>> /usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000 >>>>>>> --local_port_min 31000 --task_launch_timeout 300000 --http_port 8080 >>>>>>> --hostname 10.195.30.19 --event_subscriber http_callback --ha >>>>>>> --https_port >>>>>>> 8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181, >>>>>>> 10.195.30.21:2181/marathon --master zk://10.195.30.19:2181, >>>>>>> 10.195.30.20:2181,10.195.30.21:2181/mesos >>>>>>> >>>>>> >>>>>> >>>>> >>>> >>> >> > >