Re: Cluster history wiped after master leader reelection
Hi, thanks for your answer. Too bad the cluster history is wiped out. Is this behavior by design (the history is stored on the current leader and cannot be copied by the new leader)? Any suggestions for a way of persisting it? Maybe outside of Mesos, using some data collection? -- Yes, this is the intended behavior. -- *Rodrick Brown* / Systems Engineer, *Orchard Platform* > On Mar 10 2016, at 11:47 am, Geoffroy Jabouley < > geoffroy.jabou...@gmail.com> wrote: > Hello > > a leader re-election just occurred on our cluster (0.25.0). > > It went fine, except that the entire cluster history was lost. > > All task counters were reset to 0, and the Completed Tasks and Terminated > Frameworks lists are empty. > > Has anybody experienced this? > > Regards > > > PS: this is not a blocking problem, but in our job it is sometimes important to > show figures to our management, and such counters always make a > good impression ;)
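If you want these counters to survive a failover, one pragmatic workaround is to scrape them yourself and keep your own history outside of Mesos. Below is a minimal sketch in Python (the master address and output path are placeholders, not taken from this thread); run it from cron every few minutes to build a time series that survives leader re-elections:

    # Sketch: persist Mesos master task counters outside of Mesos.
    # MASTER_URL and OUT_PATH are placeholders; point MASTER_URL at the
    # leading master (or at a load balancer in front of the masters).
    import json
    import time
    import urllib2  # Python 2, matching the era of this thread

    MASTER_URL = "http://10.195.30.19:5050/metrics/snapshot"
    OUT_PATH = "/var/log/mesos-task-counters.jsonl"

    snapshot = json.load(urllib2.urlopen(MASTER_URL, timeout=10))
    record = {"ts": int(time.time())}
    for key, value in snapshot.items():
        # Keep only the task counters we care about for reporting.
        if key.startswith("master/tasks_"):
            record[key] = value
    with open(OUT_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")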
Re: Mesos 0.25 not increasing Staged/Started counters in the UI
Thanks for the clarification. Does Staged mean "currently in staging state"? In previous versions of Mesos (at least 0.22.1), the Staged value was incremented for each staged task, so you could tell "X tasks have been executed on the cluster". My point is that there is no longer a straightforward way of telling how many tasks have run on the cluster since it came up. Or am I missing something? -- If you have some tasks whose state is TASK_STAGING, it becomes non-zero. On Wed, Feb 24, 2016 at 8:23 PM, Geoffroy Jabouley <geoffroy.jabou...@gmail.com> wrote: > Hi again > > just checked the /metrics/snapshot endpoint. The Staged value is zero. Is this > normal? > > [... full metrics snapshot snipped; it is reproduced in the original message below ...]
Re: Mesos 0.25 not increasing Staged/Started counters in the UI
Hi again, I just checked the /metrics/snapshot endpoint. The Staged value is zero. Is this normal?

{
  "allocator\/event_queue_dispatches":0.0,
  "frameworks\/jenkins\/messages_processed":9.0,
  "frameworks\/jenkins\/messages_received":9.0,
  "master\/cpus_percent":0.0958,
  "master\/cpus_revocable_percent":0.0,
  "master\/cpus_revocable_total":0.0,
  "master\/cpus_revocable_used":0.0,
  "master\/cpus_total":24.0,
  "master\/cpus_used":2.3,
  "master\/disk_percent":0.0,
  "master\/disk_revocable_percent":0.0,
  "master\/disk_revocable_total":0.0,
  "master\/disk_revocable_used":0.0,
  "master\/disk_total":138161.0,
  "master\/disk_used":0.0,
  "master\/dropped_messages":6.0,
  "master\/elected":1.0,
  "master\/event_queue_dispatches":26.0,
  "master\/event_queue_http_requests":0.0,
  "master\/event_queue_messages":0.0,
  "master\/frameworks_active":2.0,
  "master\/frameworks_connected":2.0,
  "master\/frameworks_disconnected":0.0,
  "master\/frameworks_inactive":0.0,
  "master\/invalid_executor_to_framework_messages":0.0,
  "master\/invalid_framework_to_executor_messages":0.0,
  "master\/invalid_status_update_acknowledgements":0.0,
  "master\/invalid_status_updates":256.0,
  "master\/mem_percent":0.268649930174402,
  "master\/mem_revocable_percent":0.0,
  "master\/mem_revocable_total":0.0,
  "master\/mem_revocable_used":0.0,
  "master\/mem_total":92373.0,
  "master\/mem_used":24816.0,
  "master\/messages_authenticate":0.0,
  "master\/messages_deactivate_framework":0.0,
  "master\/messages_decline_offers":45642.0,
  "master\/messages_executor_to_framework":0.0,
  "master\/messages_exited_executor":0.0,
  "master\/messages_framework_to_executor":0.0,
  "master\/messages_kill_task":1401.0,
  "master\/messages_launch_tasks":1525.0,
  "master\/messages_reconcile_tasks":5100.0,
  "master\/messages_register_framework":0.0,
  "master\/messages_register_slave":3.0,
  "master\/messages_reregister_framework":0.0,
  "master\/messages_reregister_slave":3.0,
  "master\/messages_resource_request":0.0,
  "master\/messages_revive_offers":78.0,
  "master\/messages_status_update":3252.0,
  "master\/messages_status_update_acknowledgement":1964.0,
  "master\/messages_suppress_offers":0.0,
  "master\/messages_unregister_framework":183.0,
  "master\/messages_unregister_slave":0.0,
  "master\/messages_update_slave":6.0,
  "master\/outstanding_offers":0.0,
  "master\/recovery_slave_removals":0.0,
  "master\/slave_registrations":3.0,
  "master\/slave_removals":0.0,
  "master\/slave_removals\/reason_registered":0.0,
  "master\/slave_removals\/reason_unhealthy":0.0,
  "master\/slave_removals\/reason_unregistered":0.0,
  "master\/slave_reregistrations":0.0,
  "master\/slave_shutdowns_canceled":0.0,
  "master\/slave_shutdowns_completed":0.0,
  "master\/slave_shutdowns_scheduled":0.0,
  "master\/slaves_active":3.0,
  "master\/slaves_connected":3.0,
  "master\/slaves_disconnected":0.0,
  "master\/slaves_inactive":0.0,
  "master\/task_killed\/source_master\/reason_framework_removed":1065.0,
  "master\/tasks_error":0.0,
  "master\/tasks_failed":4.0,
  "master\/tasks_finished":4.0,
  "master\/tasks_killed":1506.0,
  "master\/tasks_lost":0.0,
  "master\/tasks_running":12.0,
  "master\/tasks_staging":0.0,
  "master\/tasks_starting":0.0,
  "master\/uptime_secs":767193.85357312,
  "master\/valid_executor_to_framework_messages":0.0,
  "master\/valid_framework_to_executor_messages":0.0,
  "master\/valid_status_update_acknowledgements":1964.0,
  "master\/valid_status_updates":2996.0,
  "registrar\/queued_operations":0.0,
  "registrar\/registry_size_bytes":681.0,
  "registrar\/state_fetch_ms":1693.7408,
  "registrar\/state_store_ms":2.151936,
  "registrar\/state_store_ms\/count":4,
  "registrar\/state_store_ms\/max":7.021056,
  "registrar\/state_store_ms\/min":2.151936,
  "registrar\/state_store_ms\/p50":2.361856,
  "registrar\/state_store_ms\/p90":5.6530176,
  "registrar\/state_store_ms\/p95":6.3370368,
  "registrar\/state_store_ms\/p99":6.88425216,
  "registrar\/state_store_ms\/p999":7.007375616,
  "registrar\/state_store_ms\/p9999":7.0196879616,
  "system\/cpus_total":8.0,
  "system\/load_15min":0.06,
  "system\/load_1min":0.08,
  "system\/load_5min":0.09,
  "system\/mem_free_bytes":281022464.0,
  "system\/mem_total_bytes":33360670720.0
}
Mesos 0.25 not increasing Staged/Started counters in the UI
Hello, since we moved to Mesos 0.25, we have noticed that in the left column of the UI, in the TASKS section, the counters for Staged and Started tasks are always equal to 0. [image: inline screenshot 1] Is this normal? Or maybe a known issue? With 0.22.1, the Started counter was always zero, but at least the Staged counter showed the number of tasks executed on the cluster since it had started. Regards
Re: Weird behavior when stopping the mesos master leader of an HA mesos cluster
Thanks a lot Dario for the workaround! It works fine and can be scripted with Ansible. For the record, the GitHub issue is available here: https://github.com/mesosphere/marathon/issues/1292 2015-03-12 17:27 GMT+01:00 Dario Rexin da...@mesosphere.io: Hi Geoffroy, we identified the issue and will fix it in Marathon 0.8.2. To prevent this behaviour for now, you just have to make sure that in a fresh setup (Marathon was never connected to Mesos) you first start up a single Marathon instance, let it register with Mesos, and then start the other Marathon instances. The problem is a race between the first registration with Mesos and fetching the FrameworkID from ZooKeeper. Please let me know if the workaround does not help you. Cheers, Dario On 12 Mar 2015, at 09:20, Alex Rukletsov a...@mesosphere.io wrote: Geoffroy, yes, it looks like a marathon issue, so feel free to post it there as well. On Thu, Mar 12, 2015 at 1:34 AM, Geoffroy Jabouley geoffroy.jabou...@gmail.com wrote: Thanks Alex for your answer. I will have a look. Would it be better to (cross-)post this discussion on the marathon mailing list? Anyway, the issue is fixed for 0.8.0, which is the version I'm using. 2015-03-11 22:18 GMT+01:00 Alex Rukletsov a...@mesosphere.io: Geoffroy, most probably you're hitting this bug: https://github.com/mesosphere/marathon/issues/1063. The problem is that Marathon can register instead of re-registering when a master fails over. From the master's point of view, it's a new framework; that's why the previous task is gone and a new one (that technically belongs to a new framework) is started. You can see that the frameworks have two different IDs (check lines 11:31:40.055496 and 11:31:40.785038) in your example. Hope that helps, Alex On Tue, Mar 10, 2015 at 4:04 AM, Geoffroy Jabouley geoffroy.jabou...@gmail.com wrote: Hello, thanks for your interest. Following are the requested logs, which will make for a pretty big mail. Mesos/Marathon are *NOT running inside Docker*; we only use Docker as our mesos containerizer. As a reminder, here is the use case performed to get the log files: Our cluster: 3 identical mesos nodes with: + zookeeper + docker 1.5 + mesos master 0.21.1 configured in HA mode + mesos slave 0.21.1 configured with checkpointing, strict and reconnect + marathon 0.8.0 configured in HA mode with checkpointing *Begin state:* + the mesos cluster is up (3 machines) + mesos master leader is 10.195.30.19 + marathon leader is 10.195.30.21 + 1 docker task (let's call it APPTASK) is running on slave 10.195.30.21 *Action*: stop the mesos master leader process (sudo stop mesos-master) *Expected*: mesos master leader has changed, active tasks / frameworks remain unchanged *End state:* + mesos master leader *has changed, now 10.195.30.21* + the previously running APPTASK on slave 10.195.30.21 *has disappeared* (no longer shown in the Mesos UI), but its *docker container is still running* + a *new APPTASK is now running on slave 10.195.30.19* + the marathon framework registration time in the Mesos UI shows "Just now" + the marathon leader *has changed, now 10.195.30.20* Now come the 6 requested logs, which might contain interesting/relevant information, but as a newcomer to Mesos I find them hard to read...
*from previous MESOS master leader 10.195.30.19:* W0310 11:31:28.310518 24289 logging.cpp:81] RAW: Received signal SIGTERM from process 1 of user 0; exiting *from new MESOS master leader 10.195.30.21:* I0310 11:31:40.011545 922 detector.cpp:138] Detected a new leader: (id='2') I0310 11:31:40.011823 922 group.cpp:659] Trying to get '/mesos/info_02' in ZooKeeper I0310 11:31:40.015496 915 network.hpp:424] ZooKeeper group memberships changed I0310 11:31:40.015847 915 group.cpp:659] Trying to get '/mesos/log_replicas/00' in ZooKeeper I0310 11:31:40.016047 922 detector.cpp:433] A new leading master (UPID=master@10.195.30.21:5050) is detected I0310 11:31:40.016074 922 master.cpp:1263] The newly elected leader is master@10.195.30.21:5050 with id 20150310-112310-354337546-5050-895 I0310 11:31:40.016089 922 master.cpp:1276] Elected as the leading master! I0310 11:31:40.016108 922 master.cpp:1094] Recovering from registrar I0310 11:31:40.016188 918 registrar.cpp:313] Recovering registrar I0310 11:31:40.016542 918 log.cpp:656] Attempting to start the writer I0310 11:31:40.016918 918 replica.cpp:474] Replica received implicit promise request with proposal 2 I0310 11:31:40.017503 915 group.cpp:659] Trying to get '/mesos/log_replicas/03' in ZooKeeper I0310 11:31:40.017832 918 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 893672ns I0310 11:31:40.017848 918 replica.cpp:342] Persisted promised to 2 I0310 11:31:40.018817 915
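A quick way to check whether you hit this registration race after a failover is to list the framework IDs the new leading master knows about; Marathon should come back with the same ID it had before. A minimal sketch against the standard master state endpoint (the address below is the new leader from the logs above):

    # Sketch: list frameworks as seen by the leading master. If Marathon
    # shows up with a brand-new framework ID right after a failover, you
    # hit the register-instead-of-reregister race described above.
    import json
    import urllib2

    state = json.load(urllib2.urlopen(
        "http://10.195.30.21:5050/master/state.json", timeout=10))
    for fw in state.get("frameworks", []):
        print("%s  id=%s  registered=%s" % (
            fw.get("name"), fw.get("id"), fw.get("registered_time")))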
Re: Weird behavior when stopping the mesos master leader of an HA mesos cluster
Thanks Alex for your answer. I will have a look. Would it be better to (cross-)post this discussion on the marathon mailing list? Anyway, the issue is fixed for 0.8.0, which is the version I'm using. 2015-03-11 22:18 GMT+01:00 Alex Rukletsov a...@mesosphere.io: Geoffroy, most probably you're hitting this bug: https://github.com/mesosphere/marathon/issues/1063. The problem is that Marathon can register instead of re-registering when a master fails over. From the master's point of view, it's a new framework; that's why the previous task is gone and a new one (that technically belongs to a new framework) is started. You can see that the frameworks have two different IDs (check lines 11:31:40.055496 and 11:31:40.785038) in your example. Hope that helps, Alex On Tue, Mar 10, 2015 at 4:04 AM, Geoffroy Jabouley geoffroy.jabou...@gmail.com wrote: Hello, thanks for your interest. Following are the requested logs, which will make for a pretty big mail. Mesos/Marathon are *NOT running inside Docker*; we only use Docker as our mesos containerizer. As a reminder, here is the use case performed to get the log files: Our cluster: 3 identical mesos nodes with: + zookeeper + docker 1.5 + mesos master 0.21.1 configured in HA mode + mesos slave 0.21.1 configured with checkpointing, strict and reconnect + marathon 0.8.0 configured in HA mode with checkpointing *Begin state:* + the mesos cluster is up (3 machines) + mesos master leader is 10.195.30.19 + marathon leader is 10.195.30.21 + 1 docker task (let's call it APPTASK) is running on slave 10.195.30.21 *Action*: stop the mesos master leader process (sudo stop mesos-master) *Expected*: mesos master leader has changed, active tasks / frameworks remain unchanged *End state:* + mesos master leader *has changed, now 10.195.30.21* + the previously running APPTASK on slave 10.195.30.21 *has disappeared* (no longer shown in the Mesos UI), but its *docker container is still running* + a *new APPTASK is now running on slave 10.195.30.19* + the marathon framework registration time in the Mesos UI shows "Just now" + the marathon leader *has changed, now 10.195.30.20* Now come the 6 requested logs, which might contain interesting/relevant information, but as a newcomer to Mesos I find them hard to read... *from previous MESOS master leader 10.195.30.19:* W0310 11:31:28.310518 24289 logging.cpp:81] RAW: Received signal SIGTERM from process 1 of user 0; exiting *from new MESOS master leader 10.195.30.21:* I0310 11:31:40.011545 922 detector.cpp:138] Detected a new leader: (id='2') I0310 11:31:40.011823 922 group.cpp:659] Trying to get '/mesos/info_02' in ZooKeeper I0310 11:31:40.015496 915 network.hpp:424] ZooKeeper group memberships changed I0310 11:31:40.015847 915 group.cpp:659] Trying to get '/mesos/log_replicas/00' in ZooKeeper I0310 11:31:40.016047 922 detector.cpp:433] A new leading master (UPID=master@10.195.30.21:5050) is detected I0310 11:31:40.016074 922 master.cpp:1263] The newly elected leader is master@10.195.30.21:5050 with id 20150310-112310-354337546-5050-895 I0310 11:31:40.016089 922 master.cpp:1276] Elected as the leading master!
I0310 11:31:40.016108 922 master.cpp:1094] Recovering from registrar I0310 11:31:40.016188 918 registrar.cpp:313] Recovering registrar I0310 11:31:40.016542 918 log.cpp:656] Attempting to start the writer I0310 11:31:40.016918 918 replica.cpp:474] Replica received implicit promise request with proposal 2 I0310 11:31:40.017503 915 group.cpp:659] Trying to get '/mesos/log_replicas/03' in ZooKeeper I0310 11:31:40.017832 918 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 893672ns I0310 11:31:40.017848 918 replica.cpp:342] Persisted promised to 2 I0310 11:31:40.018817 915 network.hpp:466] ZooKeeper group PIDs: { log-replica(1)@10.195.30.20:5050, log-replica(1)@10.195.30.21:5050 } I0310 11:31:40.023022 923 coordinator.cpp:230] Coordinator attemping to fill missing position I0310 11:31:40.023110 923 log.cpp:672] Writer started with ending position 8 I0310 11:31:40.023293 923 leveldb.cpp:438] Reading position from leveldb took 13195ns I0310 11:31:40.023309 923 leveldb.cpp:438] Reading position from leveldb took 3120ns I0310 11:31:40.023619 922 registrar.cpp:346] Successfully fetched the registry (610B) in 7.385856ms I0310 11:31:40.023679 922 registrar.cpp:445] Applied 1 operations in 9263ns; attempting to update the 'registry' I0310 11:31:40.024238 922 log.cpp:680] Attempting to append 647 bytes to the log I0310 11:31:40.024279 923 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 9 I0310 11:31:40.024435 923 replica.cpp:508] Replica received write request for position 9 I0310 11:31
Re: CPU resource allocation: ignore?
Ok, so it seems better to keep the cpu isolator and use a small cpu share. BTW, when trying to create a mesos task using Marathon with cpu=0.0, I get the following errors: [2015-03-11 17:05:48,395] INFO Received status update for task test-app.7b6ad5d9-c808-11e4-946b-56847afe9799: *TASK_LOST (Task uses invalid resources: cpus(*):0)* (mesosphere.marathon.MarathonScheduler:148) [2015-03-11 17:05:48,402] INFO Task test-app.7b6ad5d9-c808-11e4-946b-56847afe9799 expunged and removed from TaskTracker (mesosphere.marathon.tasks.TaskTracker:107) so I guess this is not possible. 2015-03-11 17:05 GMT+01:00 Ian Downes idow...@twitter.com: Sorry, I meant that no cpu isolator only means no isolation. *The allocator does enforce a non-zero cpu allocation*; specifically, see MIN_CPUS, defined in src/master/constants.cpp to be 0.01 and used by the allocator:

bool HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::allocatable(
    const Resources& resources)
{
  Option<double> cpus = resources.cpus();
  Option<Bytes> mem = resources.mem();

  return (cpus.isSome() && cpus.get() >= MIN_CPUS) ||
         (mem.isSome() && mem.get() >= MIN_MEM);
}

On Wed, Mar 11, 2015 at 8:54 AM, Connor Doyle con...@mesosphere.io wrote: If you don't care at all about accounting usage of that resource then you should be able to set it to 0.0. As Ian mentioned, this won't be enforced with the cpu isolator disabled. -- Connor On Mar 11, 2015, at 08:43, Ian Downes idow...@twitter.com wrote: The --isolation flag for the slave determines how resources are *isolated*, i.e., by not specifying any cpu isolator there will be no isolation between executors for cpu usage; the Linux scheduler will try to balance their execution. Cpu and memory are considered required resources for executors, and I believe the master enforces this. What behavior are you trying to achieve? If your jobs don't require much cpu, can you not just set a small value, like 0.25 cpu? On Wed, Mar 11, 2015 at 7:20 AM, Geoffroy Jabouley geoffroy.jabou...@gmail.com wrote: Hello As CPU relative shares are *not very* relevant in our heterogeneous cluster, we would like to get rid of CPU resource management and only use MEM resources for our cluster and task allocation. Even when modifying the isolation flag of our slave to --isolation=cgroups/mem, we see these in the logs: *from the slave, at startup:* I0311 15:09:55.006750 50906 slave.cpp:289] Slave resources: ports(*):[31000-32000, 80-443]; *cpus(*):2*; mem(*):1979; disk(*):22974 *from the master:* I0311 15:15:16.764714 50884 hierarchical_allocator_process.hpp:563] Recovered ports(*):[31000-32000, 80-443]; *cpus(*):2*; mem(*):1979; disk(*):22974 (total allocatable: ports(*):[31000-32000, 80-443]; *cpus(*):2*; mem(*):1979; disk(*):22974) on slave 20150311-150951-3982541578-5050-50860-S0 from framework 20150311-150951-3982541578-5050-50860- And the mesos master UI is showing both CPU and MEM resource status. BTW, we are using the Marathon and Jenkins frameworks to start our mesos tasks, and the cpus field seems mandatory (set to 1.0 by default). So I guess you cannot easily bypass cpu resource allocation... Any idea? Regards 2015-02-19 15:15 GMT+01:00 Ryan Thomas r.n.tho...@gmail.com: Hey Don, Have you tried only setting the 'cgroups/mem' isolation flag on the slave and not the cpu one? http://mesosphere.com/docs/reference/mesos-slave/ ryan On 19 February 2015 at 14:13, Donald Laidlaw donlaid...@me.com wrote: I am using Mesos 0.21.1 with Marathon 0.8.0 and running everything in docker containers.
Is there a way to have mesos ignore the cpu relative shares? That is, not limit the docker container CPU at all when it runs. I would still want to have the memory resource limitation, but would rather just let the Linux system under the containers schedule all the CPU. This would allow us to allocate tasks to mesos slaves based on available memory only, and to let those tasks get whatever CPU they could when they needed it. This is desirable where there can be lots of relatively high-memory tasks that have very low CPU requirements, especially if we do not know the capabilities of the slave machines with regard to CPU. Some of them may have fast CPUs, some slow, so it is hard to pick a relative number for that slave. Thanks, Don Laidlaw
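For completeness: since the allocator's floor is MIN_CPUS (0.01), the closest you can get to "ignore CPU" without patching Mesos appears to be requesting a near-zero share per task. A minimal sketch posting such an app to Marathon's standard /v2/apps endpoint (the app id, command, and addresses are hypothetical):

    # Sketch: launch a Marathon app with the smallest CPU share the
    # allocator still accepts. cpus=0.0 is rejected by the master with
    # "Task uses invalid resources: cpus(*):0", as seen above.
    import json
    import urllib2

    app = {
        "id": "/low-cpu-app",   # hypothetical app id
        "cmd": "sleep 3600",
        "cpus": 0.01,           # MIN_CPUS from src/master/constants.cpp
        "mem": 256,
        "instances": 1,
    }
    req = urllib2.Request("http://10.195.30.19:8080/v2/apps",
                          json.dumps(app),
                          {"Content-Type": "application/json"})
    print(urllib2.urlopen(req).read())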
Re: CPU resource allocation: ignore?
Hello As CPU relative shares are *not very* relevant in our heterogeneous cluster, we would like to get rid of CPU resource management and only use MEM resources for our cluster and task allocation. Even when modifying the isolation flag of our slave to --isolation=cgroups/mem, we see these in the logs: *from the slave, at startup:* I0311 15:09:55.006750 50906 slave.cpp:289] Slave resources: ports(*):[31000-32000, 80-443]; *cpus(*):2*; mem(*):1979; disk(*):22974 *from the master:* I0311 15:15:16.764714 50884 hierarchical_allocator_process.hpp:563] Recovered ports(*):[31000-32000, 80-443]; *cpus(*):2*; mem(*):1979; disk(*):22974 (total allocatable: ports(*):[31000-32000, 80-443]; *cpus(*):2*; mem(*):1979; disk(*):22974) on slave 20150311-150951-3982541578-5050-50860-S0 from framework 20150311-150951-3982541578-5050-50860- And the mesos master UI is showing both CPU and MEM resource status. BTW, we are using the Marathon and Jenkins frameworks to start our mesos tasks, and the cpus field seems mandatory (set to 1.0 by default). So I guess you cannot easily bypass cpu resource allocation... Any idea? Regards 2015-02-19 15:15 GMT+01:00 Ryan Thomas r.n.tho...@gmail.com: Hey Don, Have you tried only setting the 'cgroups/mem' isolation flag on the slave and not the cpu one? http://mesosphere.com/docs/reference/mesos-slave/ ryan On 19 February 2015 at 14:13, Donald Laidlaw donlaid...@me.com wrote: I am using Mesos 0.21.1 with Marathon 0.8.0 and running everything in docker containers. Is there a way to have mesos ignore the cpu relative shares? That is, not limit the docker container CPU at all when it runs. I would still want to have the memory resource limitation, but would rather just let the Linux system under the containers schedule all the CPU. This would allow us to allocate tasks to mesos slaves based on available memory only, and to let those tasks get whatever CPU they could when they needed it. This is desirable where there can be lots of relatively high-memory tasks that have very low CPU requirements, especially if we do not know the capabilities of the slave machines with regard to CPU. Some of them may have fast CPUs, some slow, so it is hard to pick a relative number for that slave. Thanks, Don Laidlaw
Re: Weird behavior when stopping the mesos master leader of an HA mesos cluster
[INFO] [03/10/2015 11:31:55.053] [marathon-akka.actor.default-dispatcher-10] [akka://marathon/deadLetters] Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to Actor[akka://marathon/deadLetters] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. [2015-03-10 11:31:55,054] INFO Requesting task reconciliation with the Mesos master (mesosphere.marathon.SchedulerActions:430) [2015-03-10 11:31:55,064] INFO Received status update for task ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799: TASK_LOST (Reconciliation: Task is unknown to the slave) (mesosphere.marathon.MarathonScheduler:148) [2015-03-10 11:31:55,069] INFO Need to scale /ffaas-backoffice-app-nopersist from 0 up to 1 instances (mesosphere.marathon.SchedulerActions:488) [2015-03-10 11:31:55,069] INFO Queueing 1 new tasks for /ffaas-backoffice-app-nopersist (0 queued) (mesosphere.marathon.SchedulerActions:494) [2015-03-10 11:31:55,069] INFO Task ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799 expunged and removed from TaskTracker (mesosphere.marathon.tasks.TaskTracker:107) [2015-03-10 11:31:55,070] INFO Sending event notification. (mesosphere.marathon.MarathonScheduler:262) [INFO] [03/10/2015 11:31:55.072] [marathon-akka.actor.default-dispatcher-7] [akka://marathon/user/$b] POSTing to all endpoints. [2015-03-10 11:31:55,073] INFO Need to scale /ffaas-backoffice-app-nopersist from 0 up to 1 instances (mesosphere.marathon.SchedulerActions:488) [2015-03-10 11:31:55,074] INFO Already queued 1 tasks for /ffaas-backoffice-app-nopersist. Not scaling. (mesosphere.marathon.SchedulerActions:498) ... ... ... [2015-03-10 11:31:57,682] INFO Received status update for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799: TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:148) [2015-03-10 11:31:57,694] INFO Sending event notification. (mesosphere.marathon.MarathonScheduler:262) [INFO] [03/10/2015 11:31:57.694] [marathon-akka.actor.default-dispatcher-11] [akka://marathon/user/$b] POSTing to all endpoints. ... ... ... [2015-03-10 11:36:55,047] INFO Expunging orphaned tasks from store (mesosphere.marathon.tasks.TaskTracker:170) [INFO] [03/10/2015 11:36:55.050] [marathon-akka.actor.default-dispatcher-2] [akka://marathon/deadLetters] Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to Actor[akka://marathon/deadLetters] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[2015-03-10 11:36:55,057] INFO Syncing tasks for all apps (mesosphere.marathon.SchedulerActions:403) [2015-03-10 11:36:55,058] INFO Requesting task reconciliation with the Mesos master (mesosphere.marathon.SchedulerActions:430) [2015-03-10 11:36:55,063] INFO Received status update for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799: TASK_RUNNING (Reconciliation: Latest task state) (mesosphere.marathon.MarathonScheduler:148) [2015-03-10 11:36:55,065] INFO Received status update for task ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799: TASK_RUNNING (Reconciliation: Latest task state) (mesosphere.marathon.MarathonScheduler:148) [2015-03-10 11:36:55,066] INFO Already running 1 instances of /ffaas-backoffice-app-nopersist. Not scaling. (mesosphere.marathon.SchedulerActions:512) -- End of logs 2015-03-10 10:25 GMT+01:00 Adam Bordelon a...@mesosphere.io: This is certainly not the expected/desired behavior when failing over a mesos master in HA mode. In addition to the master logs Alex requested, can you also provide the relevant portions of the slave logs for these tasks? If the slave processes themselves never failed over, checkpointing and slave recovery should be irrelevant. Are you running the mesos-slave itself inside a Docker container, or any other non-traditional setup? FYI, --checkpoint defaults to true (and is removed in 0.22), --recover defaults to reconnect, and --strict defaults to true, so none of those are necessary. On Fri, Mar 6, 2015 at 10:09 AM, Alex Rukletsov a...@mesosphere.io wrote: Geoffroy, could you please provide master logs (both from the killed master and the one taking over)? On Fri, Mar 6, 2015 at 4:26 AM, Geoffroy Jabouley geoffroy.jabou...@gmail.com wrote: Hello, we are facing some unexpected issues when testing the high-availability behavior of our mesos cluster. *Our use case:* *State*: the mesos cluster is up (3 machines), 1 docker task is running on each slave (started from marathon) *Action*: stop the mesos master leader process *Expected*: mesos master leader has changed, *active tasks remain unchanged* *Seen*: mesos master leader
Weird behavior when stopping the mesos master leader of an HA mesos cluster
Hello, we are facing some unexpected issues when testing the high-availability behavior of our mesos cluster. *Our use case:* *State*: the mesos cluster is up (3 machines), 1 docker task is running on each slave (started from marathon) *Action*: stop the mesos master leader process *Expected*: mesos master leader has changed, *active tasks remain unchanged* *Seen*: mesos master leader has changed, *all active tasks are now FAILED but the docker containers are still running*; marathon detects the FAILED tasks and starts new tasks. We end up with 2 docker containers running on each machine, but only one is linked to a RUNNING mesos task. Is the observed behavior correct? Have we misunderstood the high-availability concept? We thought this use case would not have any impact on the current cluster state (apart from the leader re-election). Thanks in advance for your help. Regards --- our setup is the following: 3 identical mesos nodes with: + zookeeper + docker 1.5 + mesos master 0.21.1 configured in HA mode + mesos slave 0.21.1 configured with checkpointing, strict and reconnect + marathon 0.8.0 configured in HA mode with checkpointing --- Command lines: *mesos-master*: /usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos --port=5050 --cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19 --ip=10.195.30.19 --quorum=2 --slave_reregister_timeout=1days --work_dir=/var/lib/mesos *mesos-slave*: /usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos --checkpoint --containerizers=docker,mesos --executor_registration_timeout=5mins --hostname=10.195.30.19 --ip=10.195.30.19 --isolation=cgroups/cpu,cgroups/mem --recover=reconnect --recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443] *marathon*: java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64 -Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp /usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000 --local_port_min 31000 --task_launch_timeout 30 --http_port 8080 --hostname 10.195.30.19 --event_subscriber http_callback --ha --https_port 8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/marathon --master zk://10.195.30.19:2181,10.195.30.20:2181,10.195.30.21:2181/mesos
Re: Is mesos spamming me?
Hello, let's have a look at the message displayed in the Jenkins log: INFO: Offer not sufficient for slave request: [name: cpus type: SCALAR scalar { value: 1.6 } role: * *== The Mesos slave is currently offering 1.6 CPUs* name: mem type: SCALAR scalar { value: 455.0 } role: * *== The Mesos slave is currently offering 455 MB of RAM* name: disk type: SCALAR scalar { value: 32833.0 } role: * == The Mesos slave is currently offering 32 GB of disk, name: ports type: RANGES ranges { range { begin: 31000 end: 32000 } } role: * == The Mesos slave is currently offering ports between 31000 and 32000 (default) ] [] *Requested for Jenkins slave: cpus: 0.2 mem: 704.0* *== Your Jenkins slave is requesting 0.2 CPU and 704 MB of RAM* So for me it is normal that your Jenkins slave request cannot be fulfilled, at least by *this* mesos slave, as it *only has 455 MB of RAM to offer and you need 704 MB*. FYI, the requested memory for a Jenkins slave is derived from the following calculation: *Jenkins Slave Memory in MB + (Maximum number of Executors per Slave * Jenkins Executor Memory in MB)* (a worked example follows this message). Maybe that is why you are seeing 704 MB here and not the 512 MB you expected. But if you have several other Mesos slaves each offering 2 CPU / 2 GB RAM, then this should not be a problem, and the Jenkins slave should be created on another Mesos slave (the log message is something like "offers match"). Are there any other apps running on your Mesos slave (another jenkins slave, a jenkins master, ...) that would consume the missing resources? 2015-02-02 6:11 GMT+01:00 Hepple, Robert rhep...@tnsi.com: On Sun, 2015-02-01 at 21:02 -0800, Vinod Kone wrote: On Sun, Feb 1, 2015 at 8:58 PM, Vinod Kone vinodk...@gmail.com wrote: By default the mesos slave leaves some RAM and CPU for system processes. You can override this behavior with the --resources flag. Yeah but ... the slave is reporting 1863 MB RAM and 2 CPUs - so how come that is rejected by jenkins, which is asking for the default 0.1 cpu and 512 MB RAM??? Thanks Bob On Sun, Feb 1, 2015 at 6:05 PM, Hepple, Robert rhep...@tnsi.com wrote: On Fri, 2015-01-30 at 10:00 +0100, Geoffroy Jabouley wrote: Hello The message means that the resource offer received from the Mesos cluster does not meet your jenkins slave requirements (memory or cpu). This is a normal message. ... and here's another thing - the mesos master registers the slave as having 2 cpus and 1.8 GB RAM: I0202 11:43:47.623059 25809 hierarchical_allocator_process.hpp:442] Added slave 20150129-120204-1408111020-5050-10811-S18 (ci00bldslv02v.ss.corp.cnp.tnsi.com) with cpus(*):2; mem(*):1863; disk(*):32961; ports(*):[31000-32000] (and cpus(*):2; mem(*):1863; disk(*):32961; ports(*):[31000-32000] available)
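To make the memory formula above concrete, here is the arithmetic that would reproduce the 704 MB request. The 512/1/192 split is an assumption about the plugin settings, not something read from Bob's configuration:

    # Sketch of the Jenkins mesos-plugin memory request:
    #   slave_mem + max_executors * executor_mem
    # The values below are assumed, chosen only to reproduce the 704 MB
    # request seen in the log; check the slave template in Jenkins.
    slave_mem_mb = 512
    max_executors = 1
    executor_mem_mb = 192
    print(slave_mem_mb + max_executors * executor_mem_mb)  # 704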
Re: Is mesos spamming me?
Hello The message means that the resource offer received from the Mesos cluster does not meet your jenkins slave requirements (memory or cpu). This is a normal message. You can filter logs from specific classes in Jenkins: 1. from the web UI, in the jenkins_url/log/levels panel, set the logging level for org.jenkinsci.plugins.mesos.JenkinsScheduler to *WARNING* 2. use a logging.properties file We use the second solution. The content of the logging.properties file is:
--
# Global logging handlers
handlers=java.util.logging.ConsoleHandler
# Define custom logger for the Jenkins mesos plugin (too verbose!)
org.jenkinsci.plugins.mesos.JenkinsScheduler.handlers=java.util.logging.ConsoleHandler
org.jenkinsci.plugins.mesos.JenkinsScheduler.useParentHandlers=FALSE
org.jenkinsci.plugins.mesos.JenkinsScheduler.level=WARNING
# Define common logging configuration
java.util.logging.ConsoleHandler.level=INFO
java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter
--
The Jenkins instance is then started using: *java -Djava.util.logging.config.file=/path/to/logging.properties -jar $HOME/jenkins.war* One drawback of this solution is that it also filters other interesting logs from the mesos JenkinsScheduler class... Hope this helps Regards 2015-01-30 6:06 GMT+01:00 Hepple, Robert rhep...@tnsi.com: I have a single mesos master and 19 slaves. I have several jenkins servers making on-demand requests using the jenkins-mesos plugin - it all seems to be working correctly: mesos slaves are assigned to the jenkins servers, they execute jobs and eventually they detach. Except. Except the jenkins servers are getting spammed about every 1 or 2 seconds with this in /var/log/jenkins/jenkins.log: Jan 30, 2015 2:59:15 PM org.jenkinsci.plugins.mesos.JenkinsScheduler matches WARNING: Ignoring disk resources from offer Jan 30, 2015 2:59:15 PM org.jenkinsci.plugins.mesos.JenkinsScheduler matches INFO: Ignoring ports resources from offer Jan 30, 2015 2:59:15 PM org.jenkinsci.plugins.mesos.JenkinsScheduler matches INFO: Offer not sufficient for slave request: [name: cpus type: SCALAR scalar { value: 1.6 } role: * , name: mem type: SCALAR scalar { value: 455.0 } role: * , name: disk type: SCALAR scalar { value: 32833.0 } role: * , name: ports type: RANGES ranges { range { begin: 31000 end: 32000 } } role: * ] [] Requested for Jenkins slave: cpus: 0.2 mem: 704.0 attributes: The mesos master side is also hitting the logs with e.g.: I0130 14:59:43.789172 10828 master.cpp:2344] Processing reply for offers: [ 20150129-120204-1408111020-5050-10811-O665754 ] on slave 20150129-120204-1408111020-5050-10811-S2 at slave(1)@172.17.238.75:5051 ( ci00bldslv15v.ss.corp.cnp.tnsi.com) for framework 20150129-120204-1408111020-5050-10811-0001 (Jenkins Scheduler) at scheduler-1aab9acc-fba9-4123-b1ac-56ce74c0365b@172.17.152.201:54503 I0130 14:59:43.789654 10828 master.cpp:2344] Processing reply for offers: [ 20150129-120204-1408111020-5050-10811-O665755 ] on slave 20150129-120204-1408111020-5050-10811-S13 at slave(1)@172.17.238.98:5051 ( ci00bldslv12v.ss.corp.cnp.tnsi.com) for framework 20150129-120204-1408111020-5050-10811-0001 (Jenkins Scheduler) at scheduler-1aab9acc-fba9-4123-b1ac-56ce74c0365b@172.17.152.201:54503 I0130 14:59:43.790004 10828 master.cpp:2344] Processing reply for offers: [ 20150129-120204-1408111020-5050-10811-O665756 ] on slave 20150129-120204-1408111020-5050-10811-S11 at slave(1)@172.17.238.95:5051 ( ci00bldslv11v.ss.corp.cnp.tnsi.com) for framework 20150129-120204-1408111020-5050-10811-0001 (Jenkins Scheduler) at
scheduler-1aab9acc-fba9-4123-b1ac-56ce74c0365b@172.17.152.201:54503 I0130 14:59:43.790349 10828 master.cpp:2344] Processing reply for offers: [ 20150129-120204-1408111020-5050-10811-O665757 ] on slave 20150129-120204-1408111020-5050-10811-S7 at slave(1)@172.17.238.108:5051 ( ci00bldslv19v.ss.corp.cnp.tnsi.com) for framework 20150129-120204-1408111020-5050-10811-0001 (Jenkins Scheduler) at scheduler-1aab9acc-fba9-4123-b1ac-56ce74c0365b@172.17.152.201:54503 I0130 14:59:43.790670 10828 master.cpp:2344] Processing reply for offers: [ 20150129-120204-1408111020-5050-10811-O665758 ] on slave 20150129-120204-1408111020-5050-10811-S14 at slave(1)@172.17.238.78:5051 ( ci00bldslv06v.ss.corp.cnp.tnsi.com) for framework 20150129-120204-1408111020-5050-10811-0001 (Jenkins Scheduler) at scheduler-1aab9acc-fba9-4123-b1ac-56ce74c0365b@172.17.152.201:54503 I0130 14:59:43.791192 10828 hierarchical_allocator_process.hpp:563] Recovered cpus(*):1.6; mem(*):453; disk(*):32961; ports(*):[31000-32000] (total allocatable: cpus(*):1.6; mem(*):453; disk(*):32961; ports(*):[31000-32000]) on slave 20150129-120204-1408111020-5050-10811-S2 from framework 20150129-120204-1408111020-5050-10811-0001 I0130 14:59:43.791507 10828 hierarchical_allocator_process.hpp:563] Recovered
Re: Unable to follow Sandbox links from Mesos UI.
Hello, just in case, which internet browser are you using? Have you installed any extensions (NoScript, Ghostery, ...) that could prevent the static/pailer display? I personally use NoScript with Firefox, and I have to turn it off for all the IPs of our cluster to correctly access slave information from the Mesos UI. My 2 cents Regards 2015-01-26 21:08 GMT+01:00 Suijian Zhou suijian.z...@ige-project.eu: Hi, Alex, Yes, I can see the link points to the slave machine when I hover over the Download button, and stdout/stderr can be downloaded. So do you mean it is expected/designed that clicking on 'stdout/stderr' themselves will not show you anything? Thanks! Cheers, Dan 2015-01-26 7:44 GMT-06:00 Alex Rukletsov a...@mesosphere.io: Dan, that's correct. The 'static/pailer.html' is a page that lives on the master and it gets a url to the actual slave as a parameter. The url is computed in 'controllers.js' based on where the associated executor lives. You should see this 'actual' url if you hover over the Download button. Please check this url for correctness and that you can access it from your browser. On Fri, Jan 23, 2015 at 9:24 PM, Dan Dong dongda...@gmail.com wrote: I see the problem: when I move the cursor onto the link, e.g. stderr, it actually points to the IP address of the master machine, so it tries to follow links like Master_IP:/tmp/mesos/slaves/... which is not there. So why does the link not point to the IP address of the slaves (config problem somewhere?)? Cheers, Dan 2015-01-23 11:25 GMT-06:00 Dick Davies d...@hellooperator.net: Start with 'inspect element' in the browser and see if that gives any clues. Sounds like your network is a little strict, so it may be that something else needs opening up. On 23 January 2015 at 16:56, Dan Dong dongda...@gmail.com wrote: Hi, Alex, That is what I expected, but when I click on it, it pops up a new blank window (pailer.html) without the content of the file (9KB size). Any hints? Cheers, Dan 2015-01-23 4:37 GMT-06:00 Alex Rukletsov a...@mesosphere.io: Dan, you should be able to view file contents just by clicking on the link. On Thu, Jan 22, 2015 at 9:57 PM, Dan Dong dongda...@gmail.com wrote: Yes, --hostname solves the problem. Now I can see all the files there, like stdout, stderr etc, but when I click on e.g. stdout, it pops up a new blank window (pailer.html) without the content of the file (9KB size). Although it provides a Download link beside, it would be much more convenient if one could view stdout and stderr directly. Is this normal, or is there still a problem in my environment? Thanks! Cheers, Dan 2015-01-22 11:33 GMT-06:00 Adam Bordelon a...@mesosphere.io: Try the --hostname parameter for master/slave. If you want to be extra explicit about the IP (e.g. publish the public IP instead of the private one in a cloud environment), you can also set the --ip parameter on master/slave. On Thu, Jan 22, 2015 at 8:43 AM, Dan Dong dongda...@gmail.com wrote: Thanks Ryan, yes, from the machine where the browser is, the slave hostnames could not be resolved, so that's why it fails, but it can reach them by IP address (I don't think the sys admin would like to add those VM entries to /etc/hosts on the server). I tried to change the masters and slaves of mesos to IP addresses instead of hostnames, but the UI still points to the hostnames of the slaves. Is there a way to let mesos only use the IP addresses of the master and slaves?
Cheers, Dan 2015-01-22 9:48 GMT-06:00 Ryan Thomas r.n.tho...@gmail.com: It is a request from your browser session, not from the master, that goes to the slaves - so in order to view the sandbox you need to ensure that the machine your browser is on can resolve and route to the masters _and_ the slaves. The master doesn't proxy the sandbox requests through itself (yet) - they are made directly from your browser instance to the slaves. Make sure you can resolve the slaves from the machine you're browsing the UI on. Cheers, ryan On 22 January 2015 at 15:42, Dan Dong dongda...@gmail.com wrote: Thank you all, the master and slaves can resolve each other's hostnames and ssh login without a password, and firewalls have been switched off on all the machines too. So I'm confused about what could block the UI from pulling this info from the slaves. Cheers, Dan 2015-01-21 16:35 GMT-06:00 Cody Maloney c...@mesosphere.io: Also see https://issues.apache.org/jira/browse/MESOS-2129 if you want to track progress on changing this. Unfortunately it is on hold for me at the moment to fix. Cody On Wed, Jan 21, 2015 at 2:07 PM, Ryan Thomas r.n.tho...@gmail.com wrote: Hey Dan, The UI will attempt to pull that info directly from the
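To confirm this is just a resolution/routing problem, you can try fetching each slave's state endpoint from the machine where the browser runs, using exactly the addresses the master advertises. A minimal sketch (standard endpoints; the master address is a placeholder):

    # Sketch: check that the slave addresses advertised by the master are
    # reachable from the browser machine. Run this where the browser runs.
    import json
    import urllib2

    state = json.load(urllib2.urlopen(
        "http://master-host:5050/master/state.json", timeout=10))
    for slave in state.get("slaves", []):
        port = slave["pid"].split(":")[-1]   # pid looks like slave(1)@host:5051
        url = "http://%s:%s/state.json" % (slave["hostname"], port)
        try:
            urllib2.urlopen(url, timeout=5)
            print("OK      " + url)
        except Exception as e:
            print("FAILED  %s  (%s)" % (url, e))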
Re: Task Checkpointing with Mesos, Marathon and Docker containers
Hello, the idea is to be able to tune the mesos slave configuration (attributes, resource offers, general options, ... upgrades?) without altering the tasks currently running on this mesos slave (a dockerized jenkins instance + docker jenkins slaves, for example). I am setting up a test cluster with the latest mesos/marathon releases to check whether the behavior is identical. 2014-12-01 19:28 GMT+01:00 Benjamin Mahler benjamin.mah...@gmail.com: I would like to be able to shut down a mesos-slave for maintenance without altering the current tasks. What are you trying to do? If your maintenance operation does not affect the tasks, why do you need to stop the slave in the first place? On Wed, Nov 26, 2014 at 1:36 AM, Geoffroy Jabouley geoffroy.jabou...@gmail.com wrote: Hello all, thanks for your answers. Is there a way of configuring this 75s timeout for slave reconnection? I think that my problem is that, as the task status is lost: - the marathon framework detects the loss and starts another instance - the mesos-slave, when restarting, detects the lost task and restarts a new one == 2 tasks on the mesos cluster, 2 running docker containers, 1 app instance in marathon So a solution would be to extend the 75s timeout. I thought that my command lines for starting the cluster were fine, but it seems they are incomplete... I would like to be able to shut down a mesos-slave for maintenance without altering the current tasks. 2014-11-25 18:30 GMT+01:00 Connor Doyle con...@mesosphere.io: Hi Geoffroy, For the Marathon instances: in all released versions of Marathon you must supply the --checkpoint flag to turn on task checkpointing for the framework. We've changed the default to true starting with the next release. There is a bug in Mesos where the FrameworkInfo does not get updated when a framework re-registers. This means that if you shut down Marathon and restart it with --checkpoint, the Mesos master (with the same FrameworkId, which Marathon picks up from ZK) will ignore the new setting. For reference, here is the design doc to address that: https://cwiki.apache.org/confluence/display/MESOS/Design+doc%3A+Updating+Framework+Info Fortunately, there is an easy workaround. 1) Shut down Marathon (tasks keep running) 2) Restart the leading Mesos master (tasks keep running) 3) Start Marathon with --checkpoint enabled This works by clearing the Mesos master's in-memory state. It is rebuilt as the slave nodes and frameworks re-register. Please report back if this doesn't solve the issue for you. -- Connor On Nov 25, 2014, at 07:43, Geoffroy Jabouley geoffroy.jabou...@gmail.com wrote: Hello, I am currently trying to activate checkpointing for my Mesos cloud. Starting from an application running in a docker container on the cluster, launched from marathon, my use cases are the following: UC1: kill the marathon service, then restart it after 2 minutes. Expected: the mesos task is still active, the docker container is running. When the marathon service restarts, it gets back its tasks. Result: OK UC2: kill the mesos slave, then restart it after 2 minutes. Expected: the mesos task remains active, the docker container is running. When the mesos slave service restarts, it gets back its tasks. Marathon does not show an error. Results: the task gets status LOST when the slave is killed. The docker container is still running. Marathon detects that the application went down and spawns a new one on another available mesos slave. When the slave restarts, it kills the previously running container and starts a new one.
So I end up with 2 applications on my cluster, one spawned by Marathon, and another orphaned one. Is this behavior normal? Can you please explain what I am doing wrong? --- Here is the configuration I have so far: Mesos 0.19.1 (not dockerized) Marathon 0.6.1 (not dockerized) Docker 1.3 + Deimos 0.4.2 Mesos master is started: /usr/local/sbin/mesos-master --zk=zk://...:2181/mesos --port=5050 --log_dir=/var/log/mesos --cluster=CLUSTER_POC --hostname=... --ip=... --quorum=1 --work_dir=/var/lib/mesos Mesos slave is started: /usr/local/sbin/mesos-slave --master=zk://...:2181/mesos --log_dir=/var/log/mesos --checkpoint=true --containerizer_path=/usr/local/bin/deimos --executor_registration_timeout=5mins --hostname=... --ip=... --isolation=external --recover=reconnect --recovery_timeout=120mins --strict=true Marathon is started: java -Xmx512m -Djava.library.path=/usr/local/lib -Djava.util.logging.SimpleFormatter.format=%2$s %5$s%6$s%n -cp /usr/local/bin/marathon mesosphere.marathon.Main --zk zk://...:2181/marathon --master zk://...:2181/mesos --local_port_min 3 --hostname ... --event_subscriber http_callback --http_port 8080 --task_launch_timeout 30 --local_port_max 4 --ha --checkpoint
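After applying Connor's workaround, you can verify that the master actually picked up the new FrameworkInfo: the master's state reports a per-framework checkpoint flag. A minimal sketch (the master address is a placeholder; field names are as exposed by masters of that era, so check your version if the key is missing):

    # Sketch: confirm the master sees checkpointing enabled for each
    # framework. If Marathon still reports checkpoint=False after being
    # restarted with --checkpoint, the stale-FrameworkInfo bug described
    # above is still in effect.
    import json
    import urllib2

    state = json.load(urllib2.urlopen(
        "http://master-host:5050/master/state.json", timeout=10))
    for fw in state.get("frameworks", []):
        print("%s  checkpoint=%s" % (fw.get("name"), fw.get("checkpoint")))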
Re: Task Checkpointing with Mesos, Marathon and Docker containers
Hello all, thanks for your answers. Is there a way of configuring this 75s timeout for slave reconnection? I think that my problem is that, as the task status is lost: - the marathon framework detects the loss and starts another instance - the mesos-slave, when restarting, detects the lost task and restarts a new one == 2 tasks on the mesos cluster, 2 running docker containers, 1 app instance in marathon So a solution would be to extend the 75s timeout. I thought that my command lines for starting the cluster were fine, but it seems they are incomplete... I would like to be able to shut down a mesos-slave for maintenance without altering the current tasks. 2014-11-25 18:30 GMT+01:00 Connor Doyle con...@mesosphere.io: Hi Geoffroy, For the Marathon instances: in all released versions of Marathon you must supply the --checkpoint flag to turn on task checkpointing for the framework. We've changed the default to true starting with the next release. There is a bug in Mesos where the FrameworkInfo does not get updated when a framework re-registers. This means that if you shut down Marathon and restart it with --checkpoint, the Mesos master (with the same FrameworkId, which Marathon picks up from ZK) will ignore the new setting. For reference, here is the design doc to address that: https://cwiki.apache.org/confluence/display/MESOS/Design+doc%3A+Updating+Framework+Info Fortunately, there is an easy workaround. 1) Shut down Marathon (tasks keep running) 2) Restart the leading Mesos master (tasks keep running) 3) Start Marathon with --checkpoint enabled This works by clearing the Mesos master's in-memory state. It is rebuilt as the slave nodes and frameworks re-register. Please report back if this doesn't solve the issue for you. -- Connor On Nov 25, 2014, at 07:43, Geoffroy Jabouley geoffroy.jabou...@gmail.com wrote: Hello, I am currently trying to activate checkpointing for my Mesos cloud. Starting from an application running in a docker container on the cluster, launched from marathon, my use cases are the following: UC1: kill the marathon service, then restart it after 2 minutes. Expected: the mesos task is still active, the docker container is running. When the marathon service restarts, it gets back its tasks. Result: OK UC2: kill the mesos slave, then restart it after 2 minutes. Expected: the mesos task remains active, the docker container is running. When the mesos slave service restarts, it gets back its tasks. Marathon does not show an error. Results: the task gets status LOST when the slave is killed. The docker container is still running. Marathon detects that the application went down and spawns a new one on another available mesos slave. When the slave restarts, it kills the previously running container and starts a new one. So I end up with 2 applications on my cluster, one spawned by Marathon, and another orphaned one. Is this behavior normal? Can you please explain what I am doing wrong? --- Here is the configuration I have so far: Mesos 0.19.1 (not dockerized) Marathon 0.6.1 (not dockerized) Docker 1.3 + Deimos 0.4.2 Mesos master is started: /usr/local/sbin/mesos-master --zk=zk://...:2181/mesos --port=5050 --log_dir=/var/log/mesos --cluster=CLUSTER_POC --hostname=... --ip=... --quorum=1 --work_dir=/var/lib/mesos Mesos slave is started: /usr/local/sbin/mesos-slave --master=zk://...:2181/mesos --log_dir=/var/log/mesos --checkpoint=true --containerizer_path=/usr/local/bin/deimos --executor_registration_timeout=5mins --hostname=... --ip=...
--isolation=external --recover=reconnect --recovery_timeout=120mins --strict=true Marathon is started: java -Xmx512m -Djava.library.path=/usr/local/lib -Djava.util.logging.SimpleFormatter.format=%2$s %5$s%6$s%n -cp /usr/local/bin/marathon mesosphere.marathon.Main --zk zk://...:2181/marathon --master zk://...:2181/mesos --local_port_min 3 --hostname ... --event_subscriber http_callback --http_port 8080 --task_launch_timeout 30 --local_port_max 4 --ha --checkpoint
Task Checkpointing with Mesos, Marathon and Docker containers
Hello, I am currently trying to activate checkpointing for my Mesos cloud. Starting from an application running in a docker container on the cluster, launched from marathon, my use cases are the following: *UC1: kill the marathon service, then restart it after 2 minutes.* *Expected*: the mesos task is still active, the docker container is running. When the marathon service restarts, it gets back its tasks. *Result*: OK *UC2: kill the mesos slave, then restart it after 2 minutes.* *Expected*: the mesos task remains active, the docker container is running. When the mesos slave service restarts, it gets back its tasks. Marathon does not show an error. *Results*: the task gets status LOST when the slave is killed. The docker container is still running. Marathon detects that the application went down and spawns a new one on another available mesos slave. When the slave restarts, it kills the previously running container and starts a new one. So I end up with 2 applications on my cluster, one spawned by Marathon, and another orphaned one. Is this behavior normal? Can you please explain what I am doing wrong? --- Here is the configuration I have so far: Mesos 0.19.1 (not dockerized) Marathon 0.6.1 (not dockerized) Docker 1.3 + Deimos 0.4.2 Mesos master is started: */usr/local/sbin/mesos-master --zk=zk://...:2181/mesos --port=5050 --log_dir=/var/log/mesos --cluster=CLUSTER_POC --hostname=... --ip=... --quorum=1 --work_dir=/var/lib/mesos* Mesos slave is started: */usr/local/sbin/mesos-slave --master=zk://...:2181/mesos --log_dir=/var/log/mesos --checkpoint=true --containerizer_path=/usr/local/bin/deimos --executor_registration_timeout=5mins --hostname=... --ip=... --isolation=external --recover=reconnect --recovery_timeout=120mins --strict=true* Marathon is started: *java -Xmx512m -Djava.library.path=/usr/local/lib -Djava.util.logging.SimpleFormatter.format=%2$s %5$s%6$s%n -cp /usr/local/bin/marathon mesosphere.marathon.Main --zk zk://...:2181/marathon --master zk://...:2181/mesos --local_port_min 3 --hostname ... --event_subscriber http_callback --http_port 8080 --task_launch_timeout 30 --local_port_max 4 --ha --checkpoint*