Hi there,

Some of our Mesos slaves (running version 0.23) got into a strange state
last week. A networking blip in AWS from ~20:59 to ~21:03 caused a number
of slaves to lose connectivity to the Mesos master:

I0421 21:00:46.351019 85618 slave.cpp:3077] No pings from master
received within 75secs
I0421 21:00:46.355203 85594 status_update_manager.cpp:176] Pausing
sending status updates
I0421 21:00:46.355406 85622 slave.cpp:673] Re-detecting master
I0421 21:00:46.355630 85622 slave.cpp:720] Detecting new master
I0421 21:00:46.356101 85603 status_update_manager.cpp:176] Pausing
sending status updates
I0421 21:00:46.356115 85622 slave.cpp:684] New master detected at
master@172.16.22.2:5050
I0421 21:00:46.357239 85622 slave.cpp:709] No credentials provided.
Attempting to register without authentication
I0421 21:00:46.357364 85622 slave.cpp:720] Detecting new master

These slaves were shut down and removed by the master, and their
corresponding tasks were all marked as TASK_LOST:

I0421 21:01:01.355435 17076 master.cpp:241] Shutting down slave
20151116-203245-4077719724-5050-17017-S208 due to health check timeout
W0421 21:01:01.355556 17076 master.cpp:3913] Shutting down slave
20151116-203245-4077719724-5050-17017-S208 at
slave(1)@172.16.3.103:5051 (lively-rice) with message 'health check
timed out'
I0421 21:01:01.355660 17076 master.cpp:4974] Removing slave
20151116-203245-4077719724-5050-17017-S208 at
slave(1)@172.16.3.103:5051 (lively-rice): health check timed out
...snip...
I0421 21:01:01.498541 17073 master.cpp:5079] Removed slave
20151116-203245-4077719724-5050-17017-S208 (lively-rice): health check
timed out
I0421 21:01:01.501723 17073 master.cpp:5102] Notifying framework sy3x4
(sy3x4) at scheduler-6a46b6f2-ccf8-416b-b8ba-7bef99576197@172.16.40.17:38483
of lost slave 20151116-203245-4077719724-5050-17017-S208 (lively-rice)
after recovering

The networking issues eventually cleared up. The slaves attempted to
re-register with the master, but were shut down because the master had
already removed them:

I0421 21:03:13.789948 85612 slave.cpp:606] Slave asked to shut down by
master@172.16.22.2:5050 because 'health check timed out'
I0421 21:03:13.791801 85612 slave.cpp:1946] Asked to shut down
framework sy3x4 by master@172.16.22.2:5050
I0421 21:03:13.791960 85612 slave.cpp:1971] Shutting down framework sy3x4
I0421 21:03:13.793388 85612 slave.cpp:3667] Shutting down executor
'4ki18' of framework sy3x4
I0421 21:03:13.793678 85612 slave.cpp:3667] Shutting down executor
'8cjp8' of framework sy3x4
I0421 21:03:13.793822 85612 slave.cpp:3667] Shutting down executor
't4ila' of framework sy3x4
I0421 21:03:13.794312 85612 slave.cpp:3667] Shutting down executor
'1al5a' of framework sy3x4
I0421 21:03:13.794628 85612 slave.cpp:3667] Shutting down executor
'i4qp9' of framework sy3x4
...snip...
I0421 21:03:13.820853 85612 slave.cpp:606] Slave asked to shut down by
master@172.16.22.2:5050 because 'Slave attempted to re-register after
removal'
I0421 21:03:13.821146 85612 slave.cpp:1946] Asked to shut down
framework sy3x4 by master@172.16.22.2:5050
W0421 21:03:13.821462 85612 slave.cpp:1967] Ignoring shutdown
framework sy3x4 because it is terminating
...snip...
I0421 21:03:19.281539 85617 slave.cpp:606] Slave asked to shut down by
master@172.16.22.2:5050 because 'Executor exited message from unknown
slave'
I0421 21:03:19.281738 85617 slave.cpp:1946] Asked to shut down
framework sy3x4 by master@172.16.22.2:5050
W0421 21:03:19.281782 85617 slave.cpp:1967] Ignoring shutdown
framework sy3x4 because it is terminating
...snip...
I0421 21:03:23.154587 85619 slave.cpp:564] Slave terminating

Monit then started the mesos-slave process back up, and the affected slaves
successfully registered with the master under new slave IDs:

2016-04-21 21:03:31,210:53195(0x7ff160eb8700):ZOO_INFO@check_events@1750:
session establishment complete on server [172.16.5.8:2181],
sessionId=0x1751d0ed4b004cbb, negotiated timeout=10000
I0421 21:03:31.210914 53209 group.cpp:313] Group process
(group(1)@172.16.3.103:5051) connected to ZooKeeper
I0421 21:03:31.210963 53209 group.cpp:787] Syncing group operations:
queue size (joins, cancels, datas) = (0, 0, 0)
I0421 21:03:31.210979 53209 group.cpp:385] Trying to create path
'/mesos/mesos_prod_3x4' in ZooKeeper
I0421 21:03:31.212005 53198 state.cpp:36] Recovering state from
'/usr/share/hubspot/mesos/meta'
I0421 21:03:31.213176 53198 state.cpp:79] Failed to find the latest
slave from '/usr/share/hubspot/mesos/meta'
I0421 21:03:31.213376 53198 status_update_manager.cpp:202] Recovering
status update manager
...snip...
I0421 21:03:31.228364 53229 status_update_manager.cpp:176] Pausing
sending status updates
I0421 21:03:31.228436 53218 slave.cpp:684] New master detected at
master@172.16.22.2:5050
I0421 21:03:31.228768 53218 slave.cpp:709] No credentials provided.
Attempting to register without authentication
I0421 21:03:31.228844 53218 slave.cpp:720] Detecting new master
I0421 21:03:31.228996 53218 slave.cpp:4193] Received oversubscribable
resources  from the resource estimator
I0421 21:03:31.229069 53214 cgroups.cpp:1415] Successfully froze
cgroup /cgroup/freezer/mesos/89e71e43-c7ea-4a86-a2b9-f528e8bb0346
after 3.120896ms
I0421 21:03:31.230046 53211 cgroups.cpp:1444] Successfullly thawed
cgroup /cgroup/freezer/mesos/2efbe99b-6b36-411e-b351-af670398b4ae
after 2.013952ms
I0421 21:03:31.230401 53196 cgroups.cpp:1444] Successfullly thawed
cgroup /cgroup/freezer/mesos/5daaa7d5-ffb7-45b6-bd90-30343f90a3c4
after 2.036992ms
I0421 21:03:31.231226 53220 cgroups.cpp:2399] Thawing cgroup
/cgroup/freezer/mesos/89e71e43-c7ea-4a86-a2b9-f528e8bb0346
I0421 21:03:31.233616 53217 cgroups.cpp:1444] Successfullly thawed
cgroup /cgroup/freezer/mesos/89e71e43-c7ea-4a86-a2b9-f528e8bb0346
after 2.216192ms
I0421 21:03:32.016067 53204 slave.cpp:859] Registered with master
master@172.16.22.2:5050; given slave ID
20151116-203437-35000492-5050-17068-S70
I0421 21:03:32.016317 53215 status_update_manager.cpp:183] Resuming
sending status updates

Everything makes perfect sense up to this point. The slaves appeared to be
online and connected to the cluster, but we quickly noticed that the master
was not making any resource offers from these slaves to our framework. This
seemed especially wrong because these slaves had only just started up and
had no tasks running on them.
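
For reference, one way to see where each slave's resources are sitting is
to pull the master's /master/state.json and total up the outstanding offers
and running tasks per slave. The sketch below does this for cpus only; it
assumes the 0.23-era state.json layout (slaves[].resources,
frameworks[].offers, frameworks[].tasks), and I'm writing the field names
from memory, so treat it as approximate:

# Rough diagnostic sketch: for each slave, compare its advertised cpus
# against the cpus sitting in outstanding offers and running tasks.
# Field names assume the Mesos 0.23 /master/state.json layout.
import json
from collections import defaultdict
from urllib.request import urlopen

MASTER = "http://172.16.22.2:5050"  # master address from the logs above

response = urlopen(MASTER + "/master/state.json").read()
state = json.loads(response.decode("utf-8"))

offered = defaultdict(float)  # cpus in outstanding offers, keyed by slave ID
used = defaultdict(float)     # cpus held by running tasks, keyed by slave ID

for framework in state.get("frameworks", []):
    for offer in framework.get("offers", []):
        offered[offer["slave_id"]] += offer["resources"].get("cpus", 0)
    for task in framework.get("tasks", []):
        used[task["slave_id"]] += task["resources"].get("cpus", 0)

for slave in state.get("slaves", []):
    sid = slave["id"]
    total = slave["resources"].get("cpus", 0)
    unaccounted = total - offered[sid] - used[sid]
    print("%s (%s): total=%s offered=%s used=%s unaccounted=%s"
          % (sid, slave.get("hostname", "?"), total, offered[sid],
             used[sid], unaccounted))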

The only thing that ended up fixing the situation was bouncing our
scheduler (~10 minutes after the restarted slaves joined the cluster) --
failing over the framework (sketched after the log excerpt below) appeared
to "recover" the missing resources:

I0421 21:13:13.806171 17072 hierarchical.hpp:761] Recovered
mem(*):217609; cpus(*):210; ports(*):[2048-3048]; disk(*):639829
(total: mem(*):217609; cpus(*):210; ports(*):[2048-3048];
disk(*):639829, allocated: ) on slave
20151116-203437-35000492-5050-17068-S70 from framework sy3x4
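
To be concrete about what "bouncing the scheduler" means here: restarting
it causes it to re-register with the master using its existing FrameworkID,
which the master treats as a framework failover and which (as I understand
it) rescinds the framework's outstanding offers and returns those resources
to the allocator. A minimal illustration of that mechanism, using the old
Python scheduler bindings, is below -- this is obviously not our real
scheduler, and the framework ID / ZooKeeper path are just the values from
the logs:

# Illustration only: registering a scheduler with an existing FrameworkID
# triggers a framework failover on the master, which recovers any offers
# that were outstanding to the previous scheduler instance.
import mesos.interface
import mesos.native
from mesos.interface import mesos_pb2

class NoopScheduler(mesos.interface.Scheduler):
    def registered(self, driver, frameworkId, masterInfo):
        print("Registered as %s" % frameworkId.value)

    def resourceOffers(self, driver, offers):
        # Decline everything; we only care about seeing offers arrive.
        for offer in offers:
            driver.declineOffer(offer.id)

framework = mesos_pb2.FrameworkInfo()
framework.user = ""
framework.name = "failover-test"
framework.id.value = "sy3x4"           # existing framework ID -> failover
framework.failover_timeout = 604800.0  # arbitrary value for the sketch

driver = mesos.native.MesosSchedulerDriver(
    NoopScheduler(), framework,
    "zk://172.16.5.8:2181/mesos/mesos_prod_3x4")
driver.run()

(I wouldn't run something like this against a production master, since it
steals the framework away from the real scheduler -- it's only meant to
show which code path the restart exercised.)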

I assume this situation isn't expected -- does anyone have any insight into
this?

Thanks,
Tom
