Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster
Thanks a lot Dario for the workaround! It works fine and can be scripted with Ansible. For the record, the GitHub issue is available here: https://github.com/mesosphere/marathon/issues/1292

2015-03-12 17:27 GMT+01:00 Dario Rexin da...@mesosphere.io:

Hi Geoffroy,

we identified the issue and will fix it in Marathon 0.8.2. To prevent this behaviour for now, you just have to make sure that in a fresh setup (one where Marathon has never been connected to Mesos) you first start a single Marathon instance and let it register with Mesos, and only then start the other Marathon instances. The problem is a race between the first registration with Mesos and fetching the FrameworkID from ZooKeeper. Please let me know if the workaround does not help you.

Cheers,
Dario
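For anyone scripting the same workaround (from Ansible or otherwise), here is a minimal shell sketch of the ordered bring-up. The hostnames, the upstart-style "start marathon" command (matching the "sudo stop mesos-master" used elsewhere in this thread), and the use of the leading master's /master/state.json endpoint to detect Marathon's registration are assumptions about the local setup, not anything mandated by Marathon itself:

#!/bin/bash
# Fresh-cluster workaround: start exactly one Marathon first, wait until it has
# registered with the leading Mesos master (so a FrameworkID is persisted in
# ZooKeeper, per Dario's explanation), then start the remaining instances.
set -e

FIRST="10.195.30.21"                 # node that brings up Marathon first (assumption)
OTHERS="10.195.30.19 10.195.30.20"   # remaining Marathon nodes (assumption)
LEADER="10.195.30.19:5050"           # current leading Mesos master (assumption)

ssh "$FIRST" "sudo start marathon"   # adjust to your init system

# Poll the leading master until a framework named "marathon" shows up in its state.
until curl -s "http://$LEADER/master/state.json" | grep -q '"name":"marathon"'; do
  echo "waiting for Marathon to register with Mesos..."
  sleep 5
done

for host in $OTHERS; do
  ssh "$host" "sudo start marathon"
done

An Ansible playbook would express the same ordering with a task against the first host, a uri-based wait (until/retries), and then a task against the remaining hosts.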
Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster
Thanks Alex for your answer. I will have a look. Would it be better to (cross-)post this discussion on the Marathon mailing list? Anyway, that issue is marked as fixed in 0.8.0, which is the version I'm using.

2015-03-11 22:18 GMT+01:00 Alex Rukletsov a...@mesosphere.io:

Geoffroy, most probably you're hitting this bug: https://github.com/mesosphere/marathon/issues/1063. The problem is that Marathon can register instead of re-registering when a master fails over. From the master's point of view it is then a new framework, which is why the previous task is gone and a new one (that technically belongs to a new framework) is started. You can see that the frameworks have two different IDs (check the lines at 11:31:40.055496 and 11:31:40.785038) in your example.

Hope that helps,
Alex
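A quick way to observe the symptom Alex describes is to compare the framework IDs the leading master reports before and after the failover. A sketch, assuming the /master/state.json endpoint of Mesos 0.21 and a python interpreter on the box (jq would do equally well):

# Run against the current leading master *before* stopping it:
curl -s "http://10.195.30.19:5050/master/state.json" \
  | python -c "import json,sys; print('\n'.join(f['name']+' '+f['id'] for f in json.load(sys.stdin)['frameworks']))"

# Run against the newly elected leader *after* the failover:
curl -s "http://10.195.30.21:5050/master/state.json" \
  | python -c "import json,sys; print('\n'.join(f['name']+' '+f['id'] for f in json.load(sys.stdin)['frameworks']))"

# A clean failover shows the same framework ID both times; the bug discussed in
# this thread shows up as a second, different ID, with the old task gone.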
Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster
Hi Geoffroy,

we identified the issue and will fix it in Marathon 0.8.2. To prevent this behaviour for now, you just have to make sure that in a fresh setup (one where Marathon has never been connected to Mesos) you first start a single Marathon instance and let it register with Mesos, and only then start the other Marathon instances. The problem is a race between the first registration with Mesos and fetching the FrameworkID from ZooKeeper. Please let me know if the workaround does not help you.

Cheers,
Dario

On 12 Mar 2015, at 09:20, Alex Rukletsov a...@mesosphere.io wrote:

Geoffroy, yes, it looks like a Marathon issue, so feel free to post it there as well.
Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster
Hello, thanks for your interest. Following are the requested logs, which will make for a pretty big mail. Mesos/Marathon are *NOT running inside Docker*; we only use Docker as our Mesos containerizer.

As a reminder, here is the use case performed to get the log files.

Our cluster: 3 identical mesos nodes with:
+ zookeeper
+ docker 1.5
+ mesos master 0.21.1 configured in HA mode
+ mesos slave 0.21.1 configured with checkpointing, strict and reconnect
+ marathon 0.8.0 configured in HA mode with checkpointing

*Begin state:*
+ the mesos cluster is up (3 machines)
+ mesos master leader is 10.195.30.19
+ marathon leader is 10.195.30.21
+ 1 docker task (let's call it APPTASK) is running on slave 10.195.30.21

*Action*: stop the mesos master leader process (sudo stop mesos-master)

*Expected*: mesos master leader has changed, active tasks / frameworks remain unchanged

*End state:*
+ mesos master leader *has changed, now 10.195.30.21*
+ the previously running APPTASK on slave 10.195.30.21 *has disappeared* (no longer showing in the mesos UI), but its *docker container is still running*
+ a *new APPTASK* is now running on slave 10.195.30.19
+ marathon framework registration time in the mesos UI shows "Just now"
+ marathon leader *has changed, now 10.195.30.20*

Now come the 6 requested logs, which might contain interesting/relevant information, but as a newcomer to mesos I find them hard to read...

*from previous MESOS master leader 10.195.30.19:*
W0310 11:31:28.310518 24289 logging.cpp:81] RAW: Received signal SIGTERM from process 1 of user 0; exiting

*from new MESOS master leader 10.195.30.21:*
I0310 11:31:40.011545 922 detector.cpp:138] Detected a new leader: (id='2')
I0310 11:31:40.011823 922 group.cpp:659] Trying to get '/mesos/info_02' in ZooKeeper
I0310 11:31:40.015496 915 network.hpp:424] ZooKeeper group memberships changed
I0310 11:31:40.015847 915 group.cpp:659] Trying to get '/mesos/log_replicas/00' in ZooKeeper
I0310 11:31:40.016047 922 detector.cpp:433] A new leading master (UPID=master@10.195.30.21:5050) is detected
I0310 11:31:40.016074 922 master.cpp:1263] The newly elected leader is master@10.195.30.21:5050 with id 20150310-112310-354337546-5050-895
I0310 11:31:40.016089 922 master.cpp:1276] Elected as the leading master!
I0310 11:31:40.016108 922 master.cpp:1094] Recovering from registrar
I0310 11:31:40.016188 918 registrar.cpp:313] Recovering registrar
I0310 11:31:40.016542 918 log.cpp:656] Attempting to start the writer
I0310 11:31:40.016918 918 replica.cpp:474] Replica received implicit promise request with proposal 2
I0310 11:31:40.017503 915 group.cpp:659] Trying to get '/mesos/log_replicas/03' in ZooKeeper
I0310 11:31:40.017832 918 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 893672ns
I0310 11:31:40.017848 918 replica.cpp:342] Persisted promised to 2
I0310 11:31:40.018817 915 network.hpp:466] ZooKeeper group PIDs: { log-replica(1)@10.195.30.20:5050, log-replica(1)@10.195.30.21:5050 }
I0310 11:31:40.023022 923 coordinator.cpp:230] Coordinator attemping to fill missing position
I0310 11:31:40.023110 923 log.cpp:672] Writer started with ending position 8
I0310 11:31:40.023293 923 leveldb.cpp:438] Reading position from leveldb took 13195ns
I0310 11:31:40.023309 923 leveldb.cpp:438] Reading position from leveldb took 3120ns
I0310 11:31:40.023619 922 registrar.cpp:346] Successfully fetched the registry (610B) in 7.385856ms
I0310 11:31:40.023679 922 registrar.cpp:445] Applied 1 operations in 9263ns; attempting to update the 'registry'
I0310 11:31:40.024238 922 log.cpp:680] Attempting to append 647 bytes to the log
I0310 11:31:40.024279 923 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 9
I0310 11:31:40.024435 923 replica.cpp:508] Replica received write request for position 9
I0310 11:31:40.025707 923 leveldb.cpp:343] Persisting action (666 bytes) to leveldb took 1.259338ms
I0310 11:31:40.025722 923 replica.cpp:676] Persisted action at 9
I0310 11:31:40.026074 923 replica.cpp:655] Replica received learned notice for position 9
I0310 11:31:40.026495 923 leveldb.cpp:343] Persisting action (668 bytes) to leveldb took 404795ns
I0310 11:31:40.026507 923 replica.cpp:676] Persisted action at 9
I0310 11:31:40.026511 923 replica.cpp:661] Replica learned APPEND action at position 9
I0310 11:31:40.026726 923 registrar.cpp:490] Successfully updated the 'registry' in 3.029248ms
I0310 11:31:40.026765 923 registrar.cpp:376] Successfully recovered registrar
I0310 11:31:40.026814 923 log.cpp:699] Attempting to truncate the log to 9
I0310 11:31:40.026880 923 master.cpp:1121] Recovered 3 slaves from the Registry (608B) ; allowing 1days for slaves to re-register
I0310 11:31:40.026897 923 coordinator.cpp:340] Coordinator
Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster
This is certainly not the expected/desired behavior when failing over a mesos master in HA mode. In addition to the master logs Alex requested, can you also provide the relevant portions of the slave logs for these tasks? If the slave processes themselves never failed over, checkpointing and slave recovery should be irrelevant. Are you running the mesos-slave itself inside a Docker container, or any other non-traditional setup?

FYI, --checkpoint defaults to true (and is removed in 0.22), --recover defaults to reconnect, and --strict defaults to true, so none of those flags are necessary.

On Fri, Mar 6, 2015 at 10:09 AM, Alex Rukletsov a...@mesosphere.io wrote:

Geoffroy, could you please provide master logs (both from killed and taking over masters)?
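Concretely, applying that note about defaults to the mesos-slave command line quoted later in this thread leaves the sketch below (shown purely to illustrate which flags are redundant; keeping them for explicitness does no harm):

# Same slave invocation with the flags that already match their defaults removed:
# --checkpoint (defaults to true, removed in 0.22), --recover (defaults to
# "reconnect") and --strict (defaults to true).
/usr/sbin/mesos-slave \
  --master=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos \
  --containerizers=docker,mesos \
  --executor_registration_timeout=5mins \
  --hostname=10.195.30.19 \
  --ip=10.195.30.19 \
  --isolation=cgroups/cpu,cgroups/mem \
  --recovery_timeout=120mins \
  --resources="ports:[31000-32000,80,443]"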
Weird behavior when stopping the mesos master leader of a HA mesos cluster
Hello, we are facing some unexpected issues when testing the high-availability behavior of our mesos cluster.

*Our use case:*

*State*: the mesos cluster is up (3 machines), 1 docker task is running on each slave (started from marathon)

*Action*: stop the mesos master leader process

*Expected*: mesos master leader has changed, *active tasks remain unchanged*

*Seen*: mesos master leader has changed, *all active tasks are now FAILED but the docker containers are still running*; marathon detects the FAILED tasks and starts new ones. We end up with 2 docker containers running on each machine, but only one is linked to a RUNNING mesos task.

Is the seen behavior correct? Have we misunderstood the high-availability concept? We thought this scenario would not have any impact on the current cluster state (apart from the leader re-election).

Thanks in advance for your help

Regards

---
our setup is the following: 3 identical mesos nodes with:
+ zookeeper
+ docker 1.5
+ mesos master 0.21.1 configured in HA mode
+ mesos slave 0.21.1 configured with checkpointing, strict and reconnect
+ marathon 0.8.0 configured in HA mode with checkpointing

---
Command lines:

*mesos-master*
/usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos --port=5050 --cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19 --ip=10.195.30.19 --quorum=2 --slave_reregister_timeout=1days --work_dir=/var/lib/mesos

*mesos-slave*
/usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos --checkpoint --containerizers=docker,mesos --executor_registration_timeout=5mins --hostname=10.195.30.19 --ip=10.195.30.19 --isolation=cgroups/cpu,cgroups/mem --recover=reconnect --recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]

*marathon*
java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64 -Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp /usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000 --local_port_min 31000 --task_launch_timeout 30 --http_port 8080 --hostname 10.195.30.19 --event_subscriber http_callback --ha --https_port 8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/marathon --master zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos
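For reproducing the *Action* step in a scripted way, the current leader can be read from each master's state endpoint before stopping it. A small sketch, assuming /master/state.json exposes a "leader" field on this 0.21 build (worth verifying locally) and that the masters run under upstart, as implied by the "sudo stop mesos-master" used in this thread:

# Ask each master who it thinks the leader is, then stop the leader to fail over.
for m in 10.195.30.19 10.195.30.20 10.195.30.21; do
  printf '%s reports leader: ' "$m"
  curl -s "http://$m:5050/master/state.json" \
    | python -c "import json,sys; print(json.load(sys.stdin).get('leader',''))"
done

# Then, on the node that is currently leading (10.195.30.19 in this walkthrough):
#   sudo stop mesos-master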
Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster
Geoffroy, could you please provide master logs (both from killed and taking over masters)?