Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster

2015-03-17 Thread Geoffroy Jabouley
Thanks a lot, Dario, for the workaround! It works fine and can be scripted
with Ansible.

For the record, the github issue is available here:
https://github.com/mesosphere/marathon/issues/1292
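
In case it is useful to anyone else, below is a rough, untested Python sketch of
the ordering Dario describes: start a single Marathon, wait until a framework
named 'marathon' shows up as registered on the leading Mesos master, then start
the remaining instances. The start_marathon placeholder and the use of the
master's /master/state.json endpoint are assumptions about our particular setup,
not something prescribed by Marathon itself.

import json
import time
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen         # Python 2

LEADING_MASTER_STATE = "http://10.195.30.19:5050/master/state.json"
MARATHON_NODES = ["10.195.30.19", "10.195.30.20", "10.195.30.21"]

def start_marathon(host):
    # Placeholder: in our case an Ansible task / "sudo start marathon" over ssh.
    print("starting marathon on %s" % host)

def marathon_registered():
    # True once a framework named 'marathon' shows up as active on the master.
    state = json.loads(urlopen(LEADING_MASTER_STATE).read().decode("utf-8"))
    return any(fw.get("name") == "marathon" and fw.get("active")
               for fw in state.get("frameworks", []))

# 1. Start exactly one Marathon and let it register (and persist its FrameworkID).
start_marathon(MARATHON_NODES[0])
while not marathon_registered():
    time.sleep(2)

# 2. Only then bring up the remaining instances.
for host in MARATHON_NODES[1:]:
    start_marathon(host)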

2015-03-12 17:27 GMT+01:00 Dario Rexin da...@mesosphere.io:

 Hi Geoffrey,

 we identified the issue and will fix it in Marathon 0.8.2. To prevent this
 behaviour for now, you just have to make sure that in a fresh setup
 (Marathon was never connected to Mesos) you first start up a single
 Marathon and let it register with Mesos and then start the other Marathon
 instances. The problem is a race between the first registration with Mesos
 and fetching the FrameworkID from ZooKeeper. Please let me know if the
 workaround does not help you.

 Cheers,
 Dario

 On 12 Mar 2015, at 09:20, Alex Rukletsov a...@mesosphere.io wrote:

 Geoffroy,

 yes, it looks like a marathon issue, so feel free to post it there as well.

 On Thu, Mar 12, 2015 at 1:34 AM, Geoffroy Jabouley 
 geoffroy.jabou...@gmail.com wrote:

 Thanks Alex for your answer. I will have a look.

 Would it be better to (cross-)post this discussion on the marathon
 mailing list?

 Anyway, the issue is fixed for 0.8.0, which is the version I'm using.

 2015-03-11 22:18 GMT+01:00 Alex Rukletsov a...@mesosphere.io:

 Geoffroy,

 most probably you're hitting this bug:
 https://github.com/mesosphere/marathon/issues/1063. The problem is that
 Marathon can register instead of re-registering when a master fails
 over. From the master's point of view, it's a new framework, which is why the
 previous task is gone and a new one (that technically belongs to a new
 framework) is started. You can see that frameworks have two different IDs
 (check lines 11:31:40.055496 and 11:31:40.785038) in your example.
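
 A quick way to check the same thing without digging through the logs is to list
 the frameworks the new leading master currently knows about: a framework that
 re-registered keeps its old ID, while one that registered anew shows a fresh ID
 and a very recent registered_time. A rough, untested Python sketch of that
 check, assuming the /master/state.json endpoint of the leading master is
 reachable:

 import json
 try:
     from urllib.request import urlopen  # Python 3
 except ImportError:
     from urllib2 import urlopen         # Python 2

 # Point this at the current leading master (10.195.30.21 after the failover).
 state = json.loads(
     urlopen("http://10.195.30.21:5050/master/state.json").read().decode("utf-8"))

 for fw in state.get("frameworks", []) + state.get("completed_frameworks", []):
     # A re-registered framework keeps its ID; a newly registered one gets a fresh one.
     print("%s %s registered_time=%s active=%s"
           % (fw.get("name"), fw.get("id"), fw.get("registered_time"), fw.get("active")))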

 Hope that helps,
 Alex

 On Tue, Mar 10, 2015 at 4:04 AM, Geoffroy Jabouley 
 geoffroy.jabou...@gmail.com wrote:

 Hello

 thanks for your interest. Following are the requested logs, which will
 result in a pretty big mail.

 Mesos/Marathon are *NOT running inside docker*; we only use Docker as
 our Mesos containerizer.

 As a reminder, here is the use case performed to get the log files:

 

 Our cluster: 3 identical mesos nodes with:
 + zookeeper
 + docker 1.5
 + mesos master 0.21.1 configured in HA mode
 + mesos slave 0.21.1 configured with checkpointing, strict and
 reconnect
 + marathon 0.8.0 configured in HA mode with checkpointing

 

 *Begin State: *
 + the mesos cluster is up (3 machines)
 + mesos master leader is 10.195.30.19
 + marathon leader is 10.195.30.21
 + 1 docker task (let's call it APPTASK) is running on slave 10.195.30.21

 *Action*: stop the mesos master leader process (sudo stop mesos-master)

 *Expected*: mesos master leader has changed, active tasks / frameworks
 remain unchanged

 *End state: *
 + mesos master leader *has changed, now 10.195.30.21*
 + the previously running APPTASK on slave 10.195.30.21 has disappeared
 (no longer shown in the Mesos UI), but the *docker container is still
 running* (see the sketch after this list)
 + a *new APPTASK is now running on slave 10.195.30.19*
 + marathon framework registration time in the mesos UI shows "Just now"
 + marathon leader *has changed, now 10.195.30.20*
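
 Below is a rough, untested Python sketch for spotting such an orphaned
 container on a slave: it lists the running Docker containers whose names carry
 the "mesos-" prefix used by the Docker containerizer, so they can be compared
 by hand against the tasks the Mesos UI still shows. The use of docker ps /
 docker inspect is just an assumption about the simplest way to get the names on
 our hosts.

 import subprocess

 def mesos_docker_containers():
     # Names of running containers launched by the Mesos Docker containerizer,
     # which prefixes its container names with "mesos-".
     ids = subprocess.check_output(["docker", "ps", "-q"]).decode().split()
     names = []
     for cid in ids:
         name = subprocess.check_output(
             ["docker", "inspect", "-f", "{{.Name}}", cid]).decode().strip().lstrip("/")
         names.append(name)
     return [n for n in names if n.startswith("mesos-")]

 # Run on slave 10.195.30.21 after the failover: the old APPTASK's container
 # should still be listed even though the Mesos UI no longer shows the task.
 for name in mesos_docker_containers():
     print(name)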


 

 Here are the 6 requested logs, which might contain
 interesting/relevant information, though as a newcomer to Mesos I find them
 hard to read...


 *from previous MESOS master leader 10.195.30.19:*
 W0310 11:31:28.310518 24289 logging.cpp:81] RAW: Received signal
 SIGTERM from process 1 of user 0; exiting


 *from new MESOS master leader 10.195.30.21:*
 I0310 11:31:40.011545   922 detector.cpp:138] Detected a new leader:
 (id='2')
 I0310 11:31:40.011823   922 group.cpp:659] Trying to get
 '/mesos/info_02' in ZooKeeper
 I0310 11:31:40.015496   915 network.hpp:424] ZooKeeper group
 memberships changed
 I0310 11:31:40.015847   915 group.cpp:659] Trying to get
 '/mesos/log_replicas/00' in ZooKeeper
 I0310 11:31:40.016047   922 detector.cpp:433] A new leading master
 (UPID=master@10.195.30.21:5050) is detected
 I0310 11:31:40.016074   922 master.cpp:1263] The newly elected leader
 is master@10.195.30.21:5050 with id 20150310-112310-354337546-5050-895
 I0310 11:31:40.016089   922 master.cpp:1276] Elected as the leading
 master!
 I0310 11:31:40.016108   922 master.cpp:1094] Recovering from registrar
 I0310 11:31:40.016188   918 registrar.cpp:313] Recovering registrar
 I0310 11:31:40.016542   918 log.cpp:656] Attempting to start the writer
 I0310 11:31:40.016918   918 replica.cpp:474] Replica received implicit
 promise request with proposal 2
 I0310 11:31:40.017503   915 group.cpp:659] Trying to get
 '/mesos/log_replicas/03' in ZooKeeper
 I0310 11:31:40.017832   918 leveldb.cpp:306] Persisting metadata (8
 bytes) to leveldb took 893672ns
 I0310 11:31:40.017848   918 replica.cpp:342] Persisted promised to 2
 I0310 11:31:40.018817   915 

Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster

2015-03-12 Thread Geoffroy Jabouley
Thanks Alex for your answer. I will have a look.

Would it be better to (cross-)post this discussion on the marathon mailing
list?

Anyway, the issue is fixed for 0.8.0, which is the version I'm using.

2015-03-11 22:18 GMT+01:00 Alex Rukletsov a...@mesosphere.io:

 Geoffroy,

 most probably you're hitting this bug:
 https://github.com/mesosphere/marathon/issues/1063. The problem is that
 Marathon can register instead of re-registering when a master fails
 over. From the master's point of view, it's a new framework, which is why the
 previous task is gone and a new one (that technically belongs to a new
 framework) is started. You can see that frameworks have two different IDs
 (check lines 11:31:40.055496 and 11:31:40.785038) in your example.

 Hope that helps,
 Alex

 On Tue, Mar 10, 2015 at 4:04 AM, Geoffroy Jabouley 
 geoffroy.jabou...@gmail.com wrote:

 Hello

 thanks for your interest. Following are the requested logs, which will
 result in a pretty big mail.

 Mesos/Marathon are *NOT running inside docker*; we only use Docker as
 our Mesos containerizer.

 As a reminder, here is the use case performed to get the log files:

 

 Our cluster: 3 identical mesos nodes with:
 + zookeeper
 + docker 1.5
 + mesos master 0.21.1 configured in HA mode
 + mesos slave 0.21.1 configured with checkpointing, strict and
 reconnect
 + marathon 0.8.0 configured in HA mode with checkpointing

 

 *Begin State: *
 + the mesos cluster is up (3 machines)
 + mesos master leader is 10.195.30.19
 + marathon leader is 10.195.30.21
 + 1 docker task (let's call it APPTASK) is running on slave 10.195.30.21

 *Action*: stop the mesos master leader process (sudo stop mesos-master)

 *Expected*: mesos master leader has changed, active tasks / frameworks
 remain unchanged

 *End state: *
 + mesos master leader *has changed, now 10.195.30.21*
 + the previously running APPTASK on slave 10.195.30.21 has disappeared
 (no longer shown in the Mesos UI), but the *docker container is still
 running*
 + a *new APPTASK is now running on slave 10.195.30.19*
 + marathon framework registration time in the mesos UI shows "Just now"
 + marathon leader *has changed, now 10.195.30.20*


 

 Here are the 6 requested logs, which might contain interesting/relevant
 information, though as a newcomer to Mesos I find them hard to read...


 *from previous MESOS master leader 10.195.30.19:*
 W0310 11:31:28.310518 24289 logging.cpp:81] RAW: Received signal SIGTERM
 from process 1 of user 0; exiting


 *from new MESOS master leader 10.195.30.21:*
 I0310 11:31:40.011545   922 detector.cpp:138] Detected a new leader:
 (id='2')
 I0310 11:31:40.011823   922 group.cpp:659] Trying to get
 '/mesos/info_02' in ZooKeeper
 I0310 11:31:40.015496   915 network.hpp:424] ZooKeeper group memberships
 changed
 I0310 11:31:40.015847   915 group.cpp:659] Trying to get
 '/mesos/log_replicas/00' in ZooKeeper
 I0310 11:31:40.016047   922 detector.cpp:433] A new leading master (UPID=
 master@10.195.30.21:5050) is detected
 I0310 11:31:40.016074   922 master.cpp:1263] The newly elected leader is
 master@10.195.30.21:5050 with id 20150310-112310-354337546-5050-895
 I0310 11:31:40.016089   922 master.cpp:1276] Elected as the leading
 master!
 I0310 11:31:40.016108   922 master.cpp:1094] Recovering from registrar
 I0310 11:31:40.016188   918 registrar.cpp:313] Recovering registrar
 I0310 11:31:40.016542   918 log.cpp:656] Attempting to start the writer
 I0310 11:31:40.016918   918 replica.cpp:474] Replica received implicit
 promise request with proposal 2
 I0310 11:31:40.017503   915 group.cpp:659] Trying to get
 '/mesos/log_replicas/03' in ZooKeeper
 I0310 11:31:40.017832   918 leveldb.cpp:306] Persisting metadata (8
 bytes) to leveldb took 893672ns
 I0310 11:31:40.017848   918 replica.cpp:342] Persisted promised to 2
 I0310 11:31:40.018817   915 network.hpp:466] ZooKeeper group PIDs: {
 log-replica(1)@10.195.30.20:5050, log-replica(1)@10.195.30.21:5050 }
 I0310 11:31:40.023022   923 coordinator.cpp:230] Coordinator attemping to
 fill missing position
 I0310 11:31:40.023110   923 log.cpp:672] Writer started with ending
 position 8
 I0310 11:31:40.023293   923 leveldb.cpp:438] Reading position from
 leveldb took 13195ns
 I0310 11:31:40.023309   923 leveldb.cpp:438] Reading position from
 leveldb took 3120ns
 I0310 11:31:40.023619   922 registrar.cpp:346] Successfully fetched the
 registry (610B) in 7.385856ms
 I0310 11:31:40.023679   922 registrar.cpp:445] Applied 1 operations in
 9263ns; attempting to update the 'registry'
 I0310 11:31:40.024238   922 log.cpp:680] Attempting to append 647 bytes
 to the log
 I0310 11:31:40.024279   923 coordinator.cpp:340] Coordinator attempting
 to write APPEND action at position 9
 I0310 11:31:40.024435   923 replica.cpp:508] Replica received write
 request for position 9
 I0310 

Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster

2015-03-12 Thread Dario Rexin
Hi Geoffrey,

we identified the issue and will fix it in Marathon 0.8.2. To prevent this 
behaviour for now, you just have to make sure that in a fresh setup (Marathon 
was never connected to Mesos) you first start up a single Marathon and let it 
register with Mesos and then start the other Marathon instances. The problem is
a race between the first registration with Mesos and fetching the FrameworkID
from ZooKeeper. Please let me know if the workaround does not help you.

Cheers,
Dario

 On 12 Mar 2015, at 09:20, Alex Rukletsov a...@mesosphere.io wrote:
 
 Geoffroy,
 
 yes, it looks like a marathon issue, so feel free to post it there as well.
 
 On Thu, Mar 12, 2015 at 1:34 AM, Geoffroy Jabouley 
 geoffroy.jabou...@gmail.com wrote:
 Thanks Alex for your answer. I will have a look.
 
 Would it be better to (cross-)post this discussion on the marathon mailing 
 list?
 
 Anyway, the issue is fixed for 0.8.0, which is the version I'm using.
 
 2015-03-11 22:18 GMT+01:00 Alex Rukletsov a...@mesosphere.io:
 Geoffroy,
 
 most probably you're hitting this bug: 
 https://github.com/mesosphere/marathon/issues/1063. The problem is that
 Marathon can register instead of re-registering when a master fails over.
 From the master's point of view, it's a new framework, which is why the previous task
 is gone and a new one (that technically belongs to a new framework) is 
 started. You can see that frameworks have two different IDs (check lines 
 11:31:40.055496 and 11:31:40.785038) in your example.
 
 Hope that helps,
 Alex
 
 On Tue, Mar 10, 2015 at 4:04 AM, Geoffroy Jabouley 
 geoffroy.jabou...@gmail.com wrote:
 Hello
 
 thanks for your interest. Following are the requested logs, which will result 
 in a pretty big mail.
 
 Mesos/Marathon are NOT running inside docker; we only use Docker as our Mesos
 containerizer.
 
 As a reminder, here is the use case performed to get the log files:
 
 
 
 Our cluster: 3 identical mesos nodes with:
 + zookeeper
 + docker 1.5
 + mesos master 0.21.1 configured in HA mode
 + mesos slave 0.21.1 configured with checkpointing, strict and reconnect
 + marathon 0.8.0 configured in HA mode with checkpointing
 
 
 
 Begin State: 
 + the mesos cluster is up (3 machines)
 + mesos master leader is 10.195.30.19
 + marathon leader is 10.195.30.21
 + 1 docker task (let's call it APPTASK) is running on slave 10.195.30.21
 
 Action: stop the mesos master leader process (sudo stop mesos-master)
 
 Expected: mesos master leader has changed, active tasks / frameworks remain 
 unchanged
 
 End state: 
 + mesos master leader has changed, now 10.195.30.21
 + the previously running APPTASK on slave 10.195.30.21 has disappeared (no
 longer shown in the Mesos UI), but the docker container is still running
 + a new APPTASK is now running on slave 10.195.30.19
 + marathon framework registration time in the mesos UI shows "Just now"
 + marathon leader has changed, now 10.195.30.20
 
 
 
 
 Here are the 6 requested logs, which might contain interesting/relevant
 information, though as a newcomer to Mesos I find them hard to read...
 
 
 from previous MESOS master leader 10.195.30.19:
 W0310 11:31:28.310518 24289 logging.cpp:81] RAW: Received signal SIGTERM from 
 process 1 of user 0; exiting
 
 
 from new MESOS master leader 10.195.30.21:
 I0310 11:31:40.011545   922 detector.cpp:138] Detected a new leader: (id='2')
 I0310 11:31:40.011823   922 group.cpp:659] Trying to get 
 '/mesos/info_02' in ZooKeeper
 I0310 11:31:40.015496   915 network.hpp:424] ZooKeeper group memberships 
 changed
 I0310 11:31:40.015847   915 group.cpp:659] Trying to get 
 '/mesos/log_replicas/00' in ZooKeeper
 I0310 11:31:40.016047   922 detector.cpp:433] A new leading master 
 (UPID=master@10.195.30.21:5050) is detected
 I0310 11:31:40.016074   922 master.cpp:1263] The newly elected leader is 
 master@10.195.30.21:5050 with id
 20150310-112310-354337546-5050-895
 I0310 11:31:40.016089   922 master.cpp:1276] Elected as the leading master!
 I0310 11:31:40.016108   922 master.cpp:1094] Recovering from registrar
 I0310 11:31:40.016188   918 registrar.cpp:313] Recovering registrar
 I0310 11:31:40.016542   918 log.cpp:656] Attempting to start the writer
 I0310 11:31:40.016918   918 replica.cpp:474] Replica received implicit 
 promise request with proposal 2
 I0310 11:31:40.017503   915 group.cpp:659] Trying to get 
 '/mesos/log_replicas/03' in ZooKeeper
 I0310 11:31:40.017832   918 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 893672ns
 I0310 11:31:40.017848   918 replica.cpp:342] Persisted promised to 2
 I0310 11:31:40.018817   915 network.hpp:466] 

Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster

2015-03-10 Thread Geoffroy Jabouley
Hello

thanks for your interest. Following are the requested logs, which will
result in a pretty big mail.

Mesos/Marathon are *NOT running inside docker*; we only use Docker as our
Mesos containerizer.

As a reminder, here is the use case performed to get the log files:



Our cluster: 3 identical mesos nodes with:
+ zookeeper
+ docker 1.5
+ mesos master 0.21.1 configured in HA mode
+ mesos slave 0.21.1 configured with checkpointing, strict and reconnect
+ marathon 0.8.0 configured in HA mode with checkpointing



*Begin State: *
+ the mesos cluster is up (3 machines)
+ mesos master leader is 10.195.30.19
+ marathon leader is 10.195.30.21
+ 1 docker task (let's call it APPTASK) is running on slave 10.195.30.21

*Action*: stop the mesos master leader process (sudo stop mesos-master)

*Expected*: mesos master leader has changed, active tasks / frameworks
remain unchanged

*End state: *
+ mesos master leader *has changed, now 10.195.30.21*
+ the previously running APPTASK on slave 10.195.30.21 has disappeared (no
longer shown in the Mesos UI), but the *docker container is still running*
+ a *new APPTASK is now running on slave 10.195.30.19*
+ marathon framework registration time in the mesos UI shows "Just now"
+ marathon leader *has changed, now 10.195.30.20*




Here are the 6 requested logs, which might contain interesting/relevant
information, though as a newcomer to Mesos I find them hard to read...


*from previous MESOS master leader 10.195.30.19:*
W0310 11:31:28.310518 24289 logging.cpp:81] RAW: Received signal SIGTERM
from process 1 of user 0; exiting


*from new MESOS master leader 10.195.30.21:*
I0310 11:31:40.011545   922 detector.cpp:138] Detected a new leader:
(id='2')
I0310 11:31:40.011823   922 group.cpp:659] Trying to get
'/mesos/info_02' in ZooKeeper
I0310 11:31:40.015496   915 network.hpp:424] ZooKeeper group memberships
changed
I0310 11:31:40.015847   915 group.cpp:659] Trying to get
'/mesos/log_replicas/00' in ZooKeeper
I0310 11:31:40.016047   922 detector.cpp:433] A new leading master (UPID=
master@10.195.30.21:5050) is detected
I0310 11:31:40.016074   922 master.cpp:1263] The newly elected leader is
master@10.195.30.21:5050 with id 20150310-112310-354337546-5050-895
I0310 11:31:40.016089   922 master.cpp:1276] Elected as the leading master!
I0310 11:31:40.016108   922 master.cpp:1094] Recovering from registrar
I0310 11:31:40.016188   918 registrar.cpp:313] Recovering registrar
I0310 11:31:40.016542   918 log.cpp:656] Attempting to start the writer
I0310 11:31:40.016918   918 replica.cpp:474] Replica received implicit
promise request with proposal 2
I0310 11:31:40.017503   915 group.cpp:659] Trying to get
'/mesos/log_replicas/03' in ZooKeeper
I0310 11:31:40.017832   918 leveldb.cpp:306] Persisting metadata (8 bytes)
to leveldb took 893672ns
I0310 11:31:40.017848   918 replica.cpp:342] Persisted promised to 2
I0310 11:31:40.018817   915 network.hpp:466] ZooKeeper group PIDs: {
log-replica(1)@10.195.30.20:5050, log-replica(1)@10.195.30.21:5050 }
I0310 11:31:40.023022   923 coordinator.cpp:230] Coordinator attemping to
fill missing position
I0310 11:31:40.023110   923 log.cpp:672] Writer started with ending
position 8
I0310 11:31:40.023293   923 leveldb.cpp:438] Reading position from leveldb
took 13195ns
I0310 11:31:40.023309   923 leveldb.cpp:438] Reading position from leveldb
took 3120ns
I0310 11:31:40.023619   922 registrar.cpp:346] Successfully fetched the
registry (610B) in 7.385856ms
I0310 11:31:40.023679   922 registrar.cpp:445] Applied 1 operations in
9263ns; attempting to update the 'registry'
I0310 11:31:40.024238   922 log.cpp:680] Attempting to append 647 bytes to
the log
I0310 11:31:40.024279   923 coordinator.cpp:340] Coordinator attempting to
write APPEND action at position 9
I0310 11:31:40.024435   923 replica.cpp:508] Replica received write request
for position 9
I0310 11:31:40.025707   923 leveldb.cpp:343] Persisting action (666 bytes)
to leveldb took 1.259338ms
I0310 11:31:40.025722   923 replica.cpp:676] Persisted action at 9
I0310 11:31:40.026074   923 replica.cpp:655] Replica received learned
notice for position 9
I0310 11:31:40.026495   923 leveldb.cpp:343] Persisting action (668 bytes)
to leveldb took 404795ns
I0310 11:31:40.026507   923 replica.cpp:676] Persisted action at 9
I0310 11:31:40.026511   923 replica.cpp:661] Replica learned APPEND action
at position 9
I0310 11:31:40.026726   923 registrar.cpp:490] Successfully updated the
'registry' in 3.029248ms
I0310 11:31:40.026765   923 registrar.cpp:376] Successfully recovered
registrar
I0310 11:31:40.026814   923 log.cpp:699] Attempting to truncate the log to 9
I0310 11:31:40.026880   923 master.cpp:1121] Recovered 3 slaves from the
Registry (608B) ; allowing 1days for slaves to re-register
I0310 11:31:40.026897   923 coordinator.cpp:340] Coordinator 

Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster

2015-03-10 Thread Adam Bordelon
This is certainly not the expected/desired behavior when failing over a
mesos master in HA mode. In addition to the master logs Alex requested, can
you also provide relevant portions of the slave logs for these tasks? If
the slave processes themselves never failed over, checkpointing and slave
recovery should be irrelevant. Are you running the mesos-slave itself
inside a Docker container, or using any other non-traditional setup?

FYI, --checkpoint defaults to true (and is removed in 0.22), --recover
defaults to reconnect, and --strict defaults to true, so none of those
are necessary.
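
If those defaults hold for 0.21.1, the mesos-slave invocation quoted further
down should behave the same when trimmed to something like this (untested
sketch; only the flags that match their defaults are dropped, everything else
is kept as posted):

/usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,10.195.30.20:2181,
10.195.30.21:2181/mesos --containerizers=docker,mesos
--executor_registration_timeout=5mins --hostname=10.195.30.19
--ip=10.195.30.19 --isolation=cgroups/cpu,cgroups/mem
--recovery_timeout=120mins --resources=ports:[31000-32000,80,443]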

On Fri, Mar 6, 2015 at 10:09 AM, Alex Rukletsov a...@mesosphere.io wrote:

 Geoffroy,

 could you please provide master logs (both from killed and taking over
 masters)?

 On Fri, Mar 6, 2015 at 4:26 AM, Geoffroy Jabouley 
 geoffroy.jabou...@gmail.com wrote:

 Hello

 we are facing some unexpected issues when testing the high-availability
 behavior of our Mesos cluster.

 *Our use case:*

 *State*: the mesos cluster is up (3 machines), 1 docker task is running
 on each slave (started from marathon)

 *Action*: stop the mesos master leader process

 *Expected*: mesos master leader has changed, *active tasks remain
 unchanged*

 *Seen*: mesos master leader has changed, *all active tasks are now
 FAILED but docker containers are still running*; marathon detects FAILED
 tasks and starts new tasks. We end up with 2 docker containers running on each
 machine, but only one is linked to a RUNNING mesos task.


 Is the observed behavior correct?

 Have we misunderstood the high-availability concept? We thought that
 running this use case would not have any impact on the current cluster state
 (except leader re-election)

 Thanks in advance for your help
 Regards

 ---

 our setup is the following:
 3 identical mesos nodes with:
 + zookeeper
 + docker 1.5
 + mesos master 0.21.1 configured in HA mode
 + mesos slave 0.21.1 configured with checkpointing, strict and
 reconnect
 + marathon 0.8.0 configured in HA mode with checkpointing

 ---

 Command lines:


 *mesos-master*
 /usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,
 10.195.30.20:2181,10.195.30.21:2181/mesos --port=5050
 --cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19 --ip=10.195.30.19
 --quorum=2 --slave_reregister_timeout=1days --work_dir=/var/lib/mesos

 *mesos-slave*
 /usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,10.195.30.20:2181,
 10.195.30.21:2181/mesos --checkpoint --containerizers=docker,mesos
 --executor_registration_timeout=5mins --hostname=10.195.30.19
 --ip=10.195.30.19 --isolation=cgroups/cpu,cgroups/mem --recover=reconnect
 --recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]

 *marathon*
 java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64
 -Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp
 /usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000
 --local_port_min 31000 --task_launch_timeout 30 --http_port 8080
 --hostname 10.195.30.19 --event_subscriber http_callback --ha --https_port
 8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,
 10.195.30.21:2181/marathon --master zk://10.195.30.19:2181,
 10.195.30.20:2181,10.195.30.21:2181/mesos





Weird behavior when stopping the mesos master leader of a HA mesos cluster

2015-03-06 Thread Geoffroy Jabouley
Hello

we are facing some unexpected issues when testing the high-availability
behavior of our Mesos cluster.

*Our use case:*

*State*: the mesos cluster is up (3 machines), 1 docker task is running on
each slave (started from marathon)

*Action*: stop the mesos master leader process

*Expected*: mesos master leader has changed, *active tasks remain unchanged*

*Seen*: mesos master leader has changed, *all active tasks are now FAILED
but docker containers are still running*; marathon detects FAILED tasks and
starts new tasks. We end up with 2 docker containers running on each machine,
but only one is linked to a RUNNING mesos task.


Is the observed behavior correct?

Have we misunderstood the high-availability concept? We thought that running
this use case would not have any impact on the current cluster state
(except leader re-election)

Thanks in advance for your help
Regards

---

our setup is the following:
3 identical mesos nodes with:
+ zookeeper
+ docker 1.5
+ mesos master 0.21.1 configured in HA mode
+ mesos slave 0.21.1 configured with checkpointing, strict and reconnect
+ marathon 0.8.0 configured in HA mode with checkpointing

---

Command lines:


*mesos-master*
/usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,
10.195.30.20:2181,10.195.30.21:2181/mesos --port=5050
--cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19 --ip=10.195.30.19
--quorum=2 --slave_reregister_timeout=1days --work_dir=/var/lib/mesos

*mesos-slave*
/usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,10.195.30.20:2181,
10.195.30.21:2181/mesos --checkpoint --containerizers=docker,mesos
--executor_registration_timeout=5mins --hostname=10.195.30.19
--ip=10.195.30.19 --isolation=cgroups/cpu,cgroups/mem --recover=reconnect
--recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]

*marathon*
java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64
-Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp
/usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000
--local_port_min 31000 --task_launch_timeout 30 --http_port 8080
--hostname 10.195.30.19 --event_subscriber http_callback --ha --https_port
8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,
10.195.30.21:2181/marathon --master zk://10.195.30.19:2181,10.195.30.20:2181
,10.195.30.21:2181/mesos
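
For reference, here is a rough, untested Python sketch for checking which
master currently holds leadership before and after the stop; it assumes the
masters' /master/state.json endpoint and the pid/leader fields it exposes:

import json
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen         # Python 2

MASTERS = ["10.195.30.19", "10.195.30.20", "10.195.30.21"]

for host in MASTERS:
    try:
        state = json.loads(urlopen("http://%s:5050/master/state.json" % host,
                                   timeout=2).read().decode("utf-8"))
    except Exception as exc:  # e.g. the master process we just stopped
        print("%s: unreachable (%s)" % (host, exc))
        continue
    # On the leading master, 'pid' and 'leader' report the same master@ip:port value.
    print("%s: pid=%s leader=%s" % (host, state.get("pid"), state.get("leader")))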


Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster

2015-03-06 Thread Alex Rukletsov
Geoffroy,

could you please provide master logs (both from killed and taking over
masters)?

On Fri, Mar 6, 2015 at 4:26 AM, Geoffroy Jabouley 
geoffroy.jabou...@gmail.com wrote:

 Hello

 we are facing some unexpected issues when testing the high-availability
 behavior of our Mesos cluster.

 *Our use case:*

 *State*: the mesos cluster is up (3 machines), 1 docker task is running
 on each slave (started from marathon)

 *Action*: stop the mesos master leader process

 *Expected*: mesos master leader has changed, *active tasks remain
 unchanged*

 *Seen*: mesos master leader has changed, *all active tasks are now FAILED
 but docker containers are still running*; marathon detects FAILED tasks
 and starts new tasks. We end up with 2 docker containers running on each
 machine, but only one is linked to a RUNNING mesos task.


 Is the observed behavior correct?

 Have we misunderstood the high-availability concept? We thought that running
 this use case would not have any impact on the current cluster state
 (except leader re-election)

 Thanks in advance for your help
 Regards

 ---

 our setup is the following:
 3 identical mesos nodes with:
 + zookeeper
 + docker 1.5
 + mesos master 0.21.1 configured in HA mode
 + mesos slave 0.21.1 configured with checkpointing, strict and
 reconnect
 + marathon 0.8.0 configured in HA mode with checkpointing

 ---

 Command lines:


 *mesos-master*
 /usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,
 10.195.30.20:2181,10.195.30.21:2181/mesos --port=5050
 --cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19 --ip=10.195.30.19
 --quorum=2 --slave_reregister_timeout=1days --work_dir=/var/lib/mesos

 *mesos-slave*
 /usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,10.195.30.20:2181,
 10.195.30.21:2181/mesos --checkpoint --containerizers=docker,mesos
 --executor_registration_timeout=5mins --hostname=10.195.30.19
 --ip=10.195.30.19 --isolation=cgroups/cpu,cgroups/mem --recover=reconnect
 --recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]

 *marathon*
 java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64
 -Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp
 /usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000
 --local_port_min 31000 --task_launch_timeout 30 --http_port 8080
 --hostname 10.195.30.19 --event_subscriber http_callback --ha --https_port
 8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,
 10.195.30.21:2181/marathon --master zk://10.195.30.19:2181,
 10.195.30.20:2181,10.195.30.21:2181/mesos