Re: Reconnected slaves not sending resource offers?

2016-04-25 Thread Thomas Petr
Ah, thanks for the clarification. I can't find any logs from the framework
indicating that we got the initial offer, so it looks like it could have
been dropped. We haven't set --offer_timeout on our masters, so your
explanation makes sense. Thanks!

On Mon, Apr 25, 2016 at 4:17 PM, Vinod Kone wrote:

>> I0421 21:03:32.014999 17071 master.cpp:4290] Sending 1 offers to
>> framework sy3x4 (sy3x4) at
>> scheduler-6bb2bcf0-d060-4072-a25b-917d8007fb1c@172.16.13.243:56861
>>
>
> This shows that the slave's resources were sent to a framework. Looks like
> the framework is holding on to the offer for a long time?
>
>
>> I0421 21:03:32.019800 17076 hierarchical.hpp:588] Slave
>> 20151116-203437-35000492-5050-17068-S70 (lively-rice) updated with
>> oversubscribed resources  (total: mem(*):217609; cpus(*):210;
>> ports(*):[2048-3048]; disk(*):639829, allocated: mem(*):217609;
>> cpus(*):210; ports(*):[2048-3048]; disk(*):639829)
>>
>
> This says that from the viewpoint of the master/allocator, all the resources
> are allocated. This is because the framework hasn't replied to the offer.
> Did the framework receive the offer or was it dropped by the network due to
> the networking issues?
>
>
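
To check whether the master sent an offer to a given framework, the master log can be scanned for the "Sending N offers to framework" records quoted above. The following is only a sketch: the regular expressions are inferred from the log lines in this thread, not from any Mesos specification, and the helper names (join_wrapped, offers_sent) are made up for illustration. It also re-joins lines that a mail client has wrapped.

```python
import re

# Assumed glog prefix shape, inferred from the lines quoted in this thread:
# severity letter + MMDD, time, thread id, source file:line, then "] ".
GLOG_PREFIX = re.compile(
    r"^[IWEF]\d{4} \d{2}:\d{2}:\d{2}\.\d+\s+\d+ [\w.]+:\d+\] "
)
# Assumed shape of the master's offer record, taken from the quoted log.
OFFER = re.compile(r"Sending (\d+) offers to framework (\S+)")


def join_wrapped(lines):
    """Merge continuation lines into the glog record they belong to."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if GLOG_PREFIX.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def offers_sent(lines):
    """Return (timestamp, offer_count, framework_id) for each offer record."""
    found = []
    for record in join_wrapped(lines):
        match = OFFER.search(record)
        if match:
            timestamp = record.split()[1]
            found.append((timestamp, int(match.group(1)), match.group(2)))
    return found


sample = """\
I0421 21:03:32.014999 17071 master.cpp:4290] Sending 1 offers to
framework sy3x4 (sy3x4) at
scheduler-6bb2bcf0-d060-4072-a25b-917d8007fb1c@172.16.13.243:56861
""".splitlines()

print(offers_sent(sample))
```

Comparing these timestamps with the scheduler's own log of received offers would show whether an offer left the master but never reached the framework.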


Re: Reconnected slaves not sending resource offers?

2016-04-25 Thread Vinod Kone
On Mon, Apr 25, 2016 at 8:40 AM, Thomas Petr wrote:

> The only thing that ended up fixing the situation was bouncing our
> scheduler (~10 minutes after the restarted slaves joined the cluster) --
> the act of failing over the framework appeared to "recover" the missing
> resources:
>

What do the master logs say when the slave is registered with a new id?
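
One way to find the relevant master log records is to group slave IDs by hostname, so that a host that came back under a new ID stands out. This is a rough sketch, not a Mesos tool: the pattern is inferred from the "slave ID (hostname)" shape of the log lines quoted in this thread, the function name is hypothetical, and the records are assumed to be unwrapped (one per line).

```python
import re
from collections import defaultdict

# Assumed shape: "slave <id ending in -S<number>> (<hostname>)", as seen in
# the quoted master log lines. Other record shapes are simply ignored.
SLAVE = re.compile(r"[Ss]lave (\S+-S\d+) \(([^)]+)\)")


def slave_ids_by_host(records):
    """Map each hostname to the distinct slave IDs seen for it, in order."""
    ids = defaultdict(list)
    for record in records:
        for slave_id, host in SLAVE.findall(record):
            if slave_id not in ids[host]:
                ids[host].append(slave_id)
    return dict(ids)


# Two records from this thread, unwrapped (the second is cut short here).
records = [
    "I0421 21:01:01.498541 17073 master.cpp:5079] Removed slave "
    "20151116-203245-4077719724-5050-17017-S208 (lively-rice): "
    "health check timed out",
    "I0421 21:03:32.019800 17076 hierarchical.hpp:588] Slave "
    "20151116-203437-35000492-5050-17068-S70 (lively-rice) updated with "
    "oversubscribed resources",
]

for host, seen in slave_ids_by_host(records).items():
    if len(seen) > 1:
        print(host, "changed slave ID:", " -> ".join(seen))
```

Any host listed with more than one ID re-registered as a new slave, and the master log lines mentioning the newer ID are the ones to look at.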


Reconnected slaves not sending resource offers?

2016-04-25 Thread Thomas Petr
Hi there,

Some of our Mesos slaves (running version 0.23) got into a strange state
last week. A networking blip from ~20:59 to ~21:03 in AWS caused a number
of slaves to lose connectivity to the Mesos master:

I0421 21:00:46.351019 85618 slave.cpp:3077] No pings from master
received within 75secs
I0421 21:00:46.355203 85594 status_update_manager.cpp:176] Pausing
sending status updates
I0421 21:00:46.355406 85622 slave.cpp:673] Re-detecting master
I0421 21:00:46.355630 85622 slave.cpp:720] Detecting new master
I0421 21:00:46.356101 85603 status_update_manager.cpp:176] Pausing
sending status updates
I0421 21:00:46.356115 85622 slave.cpp:684] New master detected at
master@172.16.22.2:5050
I0421 21:00:46.357239 85622 slave.cpp:709] No credentials provided.
Attempting to register without authentication
I0421 21:00:46.357364 85622 slave.cpp:720] Detecting new master

These slaves were shut down and removed by the master, and their
corresponding tasks were all marked as TASK_LOST:

I0421 21:01:01.355435 17076 master.cpp:241] Shutting down slave
20151116-203245-4077719724-5050-17017-S208 due to health check timeout
W0421 21:01:01.36 17076 master.cpp:3913] Shutting down slave
20151116-203245-4077719724-5050-17017-S208 at
slave(1)@172.16.3.103:5051 (lively-rice) with message 'health check
timed out'
I0421 21:01:01.355660 17076 master.cpp:4974] Removing slave
20151116-203245-4077719724-5050-17017-S208 at
slave(1)@172.16.3.103:5051 (lively-rice): health check timed out
...snip...
I0421 21:01:01.498541 17073 master.cpp:5079] Removed slave
20151116-203245-4077719724-5050-17017-S208 (lively-rice): health check
timed out
I0421 21:01:01.501723 17073 master.cpp:5102] Notifying framework sy3x4
(sy3x4) at scheduler-6a46b6f2-ccf8-416b-b8ba-7bef99576197@172.16.40.17:38483
of lost slave 20151116-203245-4077719724-5050-17017-S208 (lively-rice)
after recovering

The networking issues eventually clear up. The slaves attempt to
re-register with the master, but are shut down due to the master having
removed them:

I0421 21:03:13.789948 85612 slave.cpp:606] Slave asked to shut down by
master@172.16.22.2:5050 because 'health check timed out'
I0421 21:03:13.791801 85612 slave.cpp:1946] Asked to shut down
framework sy3x4 by master@172.16.22.2:5050
I0421 21:03:13.791960 85612 slave.cpp:1971] Shutting down framework sy3x4
I0421 21:03:13.793388 85612 slave.cpp:3667] Shutting down executor
'4ki18' of framework sy3x4
I0421 21:03:13.793678 85612 slave.cpp:3667] Shutting down executor
'8cjp8' of framework sy3x4
I0421 21:03:13.793822 85612 slave.cpp:3667] Shutting down executor
't4ila' of framework sy3x4
I0421 21:03:13.794312 85612 slave.cpp:3667] Shutting down executor
'1al5a' of framework sy3x4
I0421 21:03:13.794628 85612 slave.cpp:3667] Shutting down executor
'i4qp9' of framework sy3x4
...snip...
I0421 21:03:13.820853 85612 slave.cpp:606] Slave asked to shut down by
master@172.16.22.2:5050 because 'Slave attempted to re-register after
removal'
I0421 21:03:13.821146 85612 slave.cpp:1946] Asked to shut down
framework sy3x4 by master@172.16.22.2:5050
W0421 21:03:13.821462 85612 slave.cpp:1967] Ignoring shutdown
framework sy3x4 because it is terminating
...snip...
I0421 21:03:19.281539 85617 slave.cpp:606] Slave asked to shut down by
master@172.16.22.2:5050 because 'Executor exited message from unknown
slave'
I0421 21:03:19.281738 85617 slave.cpp:1946] Asked to shut down
framework sy3x4 by master@172.16.22.2:5050
W0421 21:03:19.281782 85617 slave.cpp:1967] Ignoring shutdown
framework sy3x4 because it is terminating
...snip...
I0421 21:03:23.154587 85619 slave.cpp:564] Slave terminating

Monit starts up the mesos-slave process again, and the affected slaves
successfully register with the master with new slave IDs:

2016-04-21 21:03:31,210:53195(0x7ff160eb8700):ZOO_INFO@check_events@1750:
session establishment complete on server [172.16.5.8:2181],
sessionId=0x1751d0ed4b004cbb, negotiated timeout=1
I0421 21:03:31.210914 53209 group.cpp:313] Group process
(group(1)@172.16.3.103:5051) connected to ZooKeeper
I0421 21:03:31.210963 53209 group.cpp:787] Syncing group operations:
queue size (joins, cancels, datas) = (0, 0, 0)
I0421 21:03:31.210979 53209 group.cpp:385] Trying to create path
'/mesos/mesos_prod_3x4' in ZooKeeper
I0421 21:03:31.212005 53198 state.cpp:36] Recovering state from
'/usr/share/hubspot/mesos/meta'
I0421 21:03:31.213176 53198 state.cpp:79] Failed to find the latest
slave from '/usr/share/hubspot/mesos/meta'
I0421 21:03:31.213376 53198 status_update_manager.cpp:202] Recovering
status update manager
...snip...
I0421 21:03:31.228364 53229 status_update_manager.cpp:176] Pausing
sending status updates
I0421 21:03:31.228436 53218 slave.cpp:684] New master detected at
master@172.16.22.2:5050
I0421 21:03:31.228768 53218 slave.cpp:709] No credentials provided.
Attempting to register without authentication
I0421 21:03:31.228844 53218 slave.cpp:720] Detecting new master
I0421 21:03:31.228996 53218 slave.cpp:4193] Received