Hi there,

Some of our Mesos slaves (running version 0.23) got into a strange state last week. A networking blip from ~20:59 to ~21:03 in AWS caused a number of slaves to lose connectivity to the Mesos master:
I0421 21:00:46.351019 85618 slave.cpp:3077] No pings from master received within 75secs
I0421 21:00:46.355203 85594 status_update_manager.cpp:176] Pausing sending status updates
I0421 21:00:46.355406 85622 slave.cpp:673] Re-detecting master
I0421 21:00:46.355630 85622 slave.cpp:720] Detecting new master
I0421 21:00:46.356101 85603 status_update_manager.cpp:176] Pausing sending status updates
I0421 21:00:46.356115 85622 slave.cpp:684] New master detected at master@172.16.22.2:5050
I0421 21:00:46.357239 85622 slave.cpp:709] No credentials provided. Attempting to register without authentication
I0421 21:00:46.357364 85622 slave.cpp:720] Detecting new master

These slaves were shut down and removed by the master, and their corresponding tasks were all marked as TASK_LOST:

I0421 21:01:01.355435 17076 master.cpp:241] Shutting down slave 20151116-203245-4077719724-5050-17017-S208 due to health check timeout
W0421 21:01:01.355556 17076 master.cpp:3913] Shutting down slave 20151116-203245-4077719724-5050-17017-S208 at slave(1)@172.16.3.103:5051 (lively-rice) with message 'health check timed out'
I0421 21:01:01.355660 17076 master.cpp:4974] Removing slave 20151116-203245-4077719724-5050-17017-S208 at slave(1)@172.16.3.103:5051 (lively-rice): health check timed out
...snip...
I0421 21:01:01.498541 17073 master.cpp:5079] Removed slave 20151116-203245-4077719724-5050-17017-S208 (lively-rice): health check timed out
I0421 21:01:01.501723 17073 master.cpp:5102] Notifying framework sy3x4 (sy3x4) at scheduler-6a46b6f2-ccf8-416b-b8ba-7bef99576197@172.16.40.17:38483 of lost slave 20151116-203245-4077719724-5050-17017-S208 (lively-rice) after recovering

The networking issues eventually clear up. The slaves attempt to re-register with the master, but are shut down due to the master having removed them:

I0421 21:03:13.789948 85612 slave.cpp:606] Slave asked to shut down by master@172.16.22.2:5050 because 'health check timed out'
I0421 21:03:13.791801 85612 slave.cpp:1946] Asked to shut down framework sy3x4 by master@172.16.22.2:5050
I0421 21:03:13.791960 85612 slave.cpp:1971] Shutting down framework sy3x4
I0421 21:03:13.793388 85612 slave.cpp:3667] Shutting down executor '4ki18' of framework sy3x4
I0421 21:03:13.793678 85612 slave.cpp:3667] Shutting down executor '8cjp8' of framework sy3x4
I0421 21:03:13.793822 85612 slave.cpp:3667] Shutting down executor 't4ila' of framework sy3x4
I0421 21:03:13.794312 85612 slave.cpp:3667] Shutting down executor '1al5a' of framework sy3x4
I0421 21:03:13.794628 85612 slave.cpp:3667] Shutting down executor 'i4qp9' of framework sy3x4
...snip...
I0421 21:03:13.820853 85612 slave.cpp:606] Slave asked to shut down by master@172.16.22.2:5050 because 'Slave attempted to re-register after removal'
I0421 21:03:13.821146 85612 slave.cpp:1946] Asked to shut down framework sy3x4 by master@172.16.22.2:5050
W0421 21:03:13.821462 85612 slave.cpp:1967] Ignoring shutdown framework sy3x4 because it is terminating
...snip...
I0421 21:03:19.281539 85617 slave.cpp:606] Slave asked to shut down by master@172.16.22.2:5050 because 'Executor exited message from unknown slave'
I0421 21:03:19.281738 85617 slave.cpp:1946] Asked to shut down framework sy3x4 by master@172.16.22.2:5050
W0421 21:03:19.281782 85617 slave.cpp:1967] Ignoring shutdown framework sy3x4 because it is terminating
...snip...
I0421 21:03:23.154587 85619 slave.cpp:564] Slave terminating
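(As an aside on the timing: the 75-second window and the removal at 21:01:01 seem consistent with what I believe are the stock master ping settings -- a rough back-of-the-envelope sketch below, assuming the default --slave_ping_timeout of 15secs and --max_slave_ping_timeouts of 5; those flag names and defaults are my assumption, I haven't checked them against our actual config.)

# Back-of-the-envelope health-check window, assuming Mesos defaults
# (--slave_ping_timeout=15secs, --max_slave_ping_timeouts=5 on the master;
# these are assumptions, not values pulled from our config).
slave_ping_timeout_secs = 15
max_slave_ping_timeouts = 5

health_check_window = slave_ping_timeout_secs * max_slave_ping_timeouts
print(health_check_window)  # 75 -- matches "No pings from master received within 75secs"

# Timeline from the logs above:
#   ~20:59     network blip begins
#   21:00:46   slave hits the 75s ping timeout and starts re-detecting the master
#   21:01:01   master removes the slave for 'health check timed out'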
Monit starts up the mesos-slave process again, and the affected slaves successfully register with the master with new slave IDs:

2016-04-21 21:03:31,210:53195(0x7ff160eb8700):ZOO_INFO@check_events@1750: session establishment complete on server [172.16.5.8:2181], sessionId=0x1751d0ed4b004cbb, negotiated timeout=10000
I0421 21:03:31.210914 53209 group.cpp:313] Group process (group(1)@172.16.3.103:5051) connected to ZooKeeper
I0421 21:03:31.210963 53209 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0421 21:03:31.210979 53209 group.cpp:385] Trying to create path '/mesos/mesos_prod_3x4' in ZooKeeper
I0421 21:03:31.212005 53198 state.cpp:36] Recovering state from '/usr/share/hubspot/mesos/meta'
I0421 21:03:31.213176 53198 state.cpp:79] Failed to find the latest slave from '/usr/share/hubspot/mesos/meta'
I0421 21:03:31.213376 53198 status_update_manager.cpp:202] Recovering status update manager
...snip...
I0421 21:03:31.228364 53229 status_update_manager.cpp:176] Pausing sending status updates
I0421 21:03:31.228436 53218 slave.cpp:684] New master detected at master@172.16.22.2:5050
I0421 21:03:31.228768 53218 slave.cpp:709] No credentials provided. Attempting to register without authentication
I0421 21:03:31.228844 53218 slave.cpp:720] Detecting new master
I0421 21:03:31.228996 53218 slave.cpp:4193] Received oversubscribable resources from the resource estimator
I0421 21:03:31.229069 53214 cgroups.cpp:1415] Successfully froze cgroup /cgroup/freezer/mesos/89e71e43-c7ea-4a86-a2b9-f528e8bb0346 after 3.120896ms
I0421 21:03:31.230046 53211 cgroups.cpp:1444] Successfullly thawed cgroup /cgroup/freezer/mesos/2efbe99b-6b36-411e-b351-af670398b4ae after 2.013952ms
I0421 21:03:31.230401 53196 cgroups.cpp:1444] Successfullly thawed cgroup /cgroup/freezer/mesos/5daaa7d5-ffb7-45b6-bd90-30343f90a3c4 after 2.036992ms
I0421 21:03:31.231226 53220 cgroups.cpp:2399] Thawing cgroup /cgroup/freezer/mesos/89e71e43-c7ea-4a86-a2b9-f528e8bb0346
I0421 21:03:31.233616 53217 cgroups.cpp:1444] Successfullly thawed cgroup /cgroup/freezer/mesos/89e71e43-c7ea-4a86-a2b9-f528e8bb0346 after 2.216192ms
I0421 21:03:32.016067 53204 slave.cpp:859] Registered with master master@172.16.22.2:5050; given slave ID 20151116-203437-35000492-5050-17068-S70
I0421 21:03:32.016317 53215 status_update_manager.cpp:183] Resuming sending status updates

Everything makes perfect sense up to this point. The slaves appear to be online and connected to the cluster, but we quickly noticed that no resource offers for these slaves were being sent to our framework. This seemed especially wrong because these slaves had recently started up and had no tasks running on them. The only thing that ended up fixing the situation was bouncing our scheduler (~10 minutes after the restarted slaves joined the cluster) -- the act of failing over the framework appeared to "recover" the missing resources:

I0421 21:13:13.806171 17072 hierarchical.hpp:761] Recovered mem(*):217609; cpus(*):210; ports(*):[2048-3048]; disk(*):639829 (total: mem(*):217609; cpus(*):210; ports(*):[2048-3048]; disk(*):639829, allocated: ) on slave 20151116-203437-35000492-5050-17068-S70 from framework sy3x4

I assume this situation isn't expected -- does anyone have any insight into this?

Thanks,
Tom