I am still on Mesos 0.18.

I think I found a bug. I wrote a simple program to reproduce this, and
there's a new twist as well.

Again, although I have worked around this for now in my framework by
removing all previous leases after re-registration, the same problem can
show up again once Mesos starts rescinding offers in the future.
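
A minimal sketch of that workaround, for illustration only. The callback
names (resourceOffers, offerRescinded, disconnected, reregistered) mirror
the Mesos scheduler callbacks, but the classes below are standalone and
hypothetical: OfferCache and CachingScheduler are not Mesos APIs, and
offers are simplified to (offer_id, slave_id) tuples instead of protobuf
messages.

```python
class OfferCache:
    """Hypothetical helper: remembers unused offer IDs per slave."""

    def __init__(self):
        self._by_slave = {}  # slave_id -> list of offer IDs

    def add(self, offer_id, slave_id):
        self._by_slave.setdefault(slave_id, []).append(offer_id)

    def remove(self, offer_id):
        # Drop a single offer, e.g. when the master rescinds it.
        for ids in self._by_slave.values():
            if offer_id in ids:
                ids.remove(offer_id)

    def take(self, slave_id):
        # Hand over (and forget) every cached offer for this slave.
        return self._by_slave.pop(slave_id, [])

    def clear(self):
        self._by_slave.clear()


class CachingScheduler:
    """Only the callbacks relevant to this thread are shown."""

    def __init__(self):
        self.cache = OfferCache()

    def resourceOffers(self, driver, offers):
        # Assumption: each offer is a simplified (offer_id, slave_id) tuple.
        for offer_id, slave_id in offers:
            self.cache.add(offer_id, slave_id)

    def offerRescinded(self, driver, offer_id):
        self.cache.remove(offer_id)

    def disconnected(self, driver):
        # All outstanding offers become invalid once we are disconnected.
        self.cache.clear()

    def reregistered(self, driver, master_info):
        # Belt and braces: also clear on re-registration.
        self.cache.clear()


# Replay of the sequence from this thread (the driver is unused here).
scheduler = CachingScheduler()
scheduler.resourceOffers(None, [("O1", "A")])
scheduler.disconnected(None)
scheduler.reregistered(None, None)
scheduler.resourceOffers(None, [("O2", "A")])
print(scheduler.cache.take("A"))  # ['O2']
```

With the cache cleared on both callbacks, only O2 is left to launch with,
so the stale O1 never reaches the master.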

Here's what I do:

1. Register with a Mesos cluster that has only one master and one slave.
2. Get an offer, O1.
3. Kill and restart the Mesos master.
4. Get a new offer for the only slave, O2.
5. Launch a task with both offers O1 and O2.
6. Receive TASK_LOST.
7. Wait for a new offer, which never comes.

Here's the new twist:

8. Kill my framework and restart it.
9. Get no offers from Mesos at all.

Here are the relevant Mesos master logs:

I1106 19:31:55.734485 10423 master.cpp:770] Elected as the leading master!
I1106 19:31:55.737759 10423 master.cpp:1936] Attempting to re-register
slave 20141029-125131-16842879-5050-18827-1 at slave(1)@127.0.1.1:5051
(lgud-spodila2)
I1106 19:31:55.737788 10423 master.cpp:2818] Adding slave
20141029-125131-16842879-5050-18827-1 at lgud-spodila2 with cpus(*):8;
mem(*):39209; disk(*):219127; ports(*):[31000-32000]
I1106 19:31:55.738088 10422 hierarchical_allocator_process.hpp:445] Added
slave 20141029-125131-16842879-5050-18827-1 (lgud-spodila2) with cpus(*):8;
mem(*):39209; disk(*):219127; ports(*):[31000-32000] (and cpus(*):8;
mem(*):39209; disk(*):219127; ports(*):[31000-32000] available)
I1106 19:31:56.733850 10423 master.cpp:906] Re-registering framework
20141106-193136-16842879-5050-10308-0000 at scheduler(1)@127.0.1.1:55515
I1106 19:31:56.734544 10424 hierarchical_allocator_process.hpp:332] Added
framework 20141106-193136-16842879-5050-10308-0000
I1106 19:31:56.735044 10424 master.cpp:2285] Sending 1 offers to framework
20141106-193136-16842879-5050-10308-0000
I1106 19:31:59.627913 10423 http.cpp:391] HTTP request for
'/master/state.json'
I1106 19:32:09.634088 10421 http.cpp:391] HTTP request for
'/master/state.json'
W1106 19:32:10.377226 10425 master.cpp:1556] Failed to validate offer  :
Offer 20141106-193136-16842879-5050-10308-0 is no longer valid
I1106 19:32:10.378697 10425 master.cpp:1567] Sending status update
TASK_LOST (UUID: afadf504-f606-47f2-82cc-5af2e532afcd) for task Job123 of
framework 20141106-193136-16842879-5050-10308-0000 for launch task attempt
on invalid offers: [ 20141106-193147-16842879-5050-10406-0,
20141106-193136-16842879-5050-10308-0 ]

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The master treats both offers as invalid and basically leaks the valid one (O2).

I1106 19:32:19.640913 10422 http.cpp:391] HTTP request for
'/master/state.json'
I1106 19:32:22.667037 10424 master.cpp:595] Framework
20141106-193136-16842879-5050-10308-0000 disconnected
I1106 19:32:22.667280 10424 master.cpp:1079] Deactivating framework
20141106-193136-16842879-5050-10308-0000
I1106 19:32:22.668009 10424 master.cpp:617] Giving framework
20141106-193136-16842879-5050-10308-0000 0ns to failover
I1106 19:32:22.668124 10427 hierarchical_allocator_process.hpp:408]
Deactivated framework 20141106-193136-16842879-5050-10308-0000
I1106 19:32:22.668252 10425 master.cpp:2201] Framework failover timeout,
removing framework 20141106-193136-16842879-5050-10308-0000
I1106 19:32:22.668443 10425 master.cpp:2688] Removing framework
20141106-193136-16842879-5050-10308-0000
I1106 19:32:22.668829 10425 hierarchical_allocator_process.hpp:363] Removed
framework 20141106-193136-16842879-5050-10308-0000
I1106 19:32:24.739157 10426 master.cpp:818] Received registration request
from scheduler(1)@127.0.1.1:37122
I1106 19:32:24.739328 10426 master.cpp:836] Registering framework
20141106-193147-16842879-5050-10406-0000 at scheduler(1)@127.0.1.1:37122
I1106 19:32:24.739753 10426 hierarchical_allocator_process.hpp:332] Added
framework 20141106-193147-16842879-5050-10406-0000
I1106 19:32:29.647886 10423 http.cpp:391] HTTP request for
'/master/state.json'
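
For reference, the behavior requested in my original message quoted below
(if any offer in a launch is invalid, treat all of them as given up so the
valid ones get re-offered) can be sketched as a toy model. This is not
Mesos master code: the launch function, the outstanding and available
dictionaries, and the cpus-only accounting are all assumptions made for
illustration.

```python
def launch(outstanding, offer_ids, available):
    """Toy model of the proposed handling of a multi-offer launch.

    outstanding: dict offer_id -> (slave_id, cpus), offers still valid
    offer_ids:   offer IDs the framework used for the launch
    available:   dict slave_id -> cpus that can be offered again
    Returns the status the framework would see for the task.
    """
    any_invalid = any(o not in outstanding for o in offer_ids)
    if any_invalid:
        # Treat every still-valid offer as declined: return its
        # resources to the pool instead of silently dropping it.
        for o in offer_ids:
            if o in outstanding:
                slave_id, cpus = outstanding.pop(o)
                available[slave_id] = available.get(slave_id, 0) + cpus
        return "TASK_LOST"
    # All offers valid: they are consumed by the launch.
    for o in offer_ids:
        outstanding.pop(o)
    return "TASK_STAGING"


# O1 was invalidated by the master restart; O2 is still outstanding.
outstanding = {"O2": ("A", 8)}
available = {}
status = launch(outstanding, ["O1", "O2"], available)
print(status, available)  # TASK_LOST {'A': 8}
```

In this model the framework still gets TASK_LOST, but the resources behind
O2 go back to the pool and can show up in a later offer.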


On Thu, Nov 6, 2014 at 6:53 PM, Benjamin Mahler <[email protected]>
wrote:

> Which version of the master are you using and do you have the logs? The
> fact that no offers were coming back sounds like a bug!
>
> As for using O1 after a disconnection, all offers are invalid once a
> disconnection occurs. The scheduler driver does not automatically rescind
> offers upon disconnection, so I'd recommend clearing all cached offers when
> your scheduler gets disconnected, to avoid the unnecessary TASK_LOST
> updates.
>
> On Thu, Nov 6, 2014 at 6:25 PM, Sharma Podila <[email protected]> wrote:
>
>> We had an interesting problem with resource offers today and I would like
>> to confirm this problem and request an enhancement. Here's the summary in
>> the right sequence of events:
>>
>> 1. resource offer O1 for slave A arrives
>> 2. mesos disconnects
>> 3. mesos reregisters
>> 4. mesos offer O2 for slave A arrives
>>     (our framework keeps offers for some time if unused; therefore, we now
>> have both O1 and O2, incorrectly)
>> 5. launch task T1 using offers O1 and O2
>> 6. framework thinks it has no offers with it now for slave A, will wait
>> for new offer after mesos consumes resources for task T1
>> 7. mesos sends TASK_LOST for T1 saying it was using an invalid offer
>>     (even though only O1 was invalid, O2 is gone missing silently)
>> 8. no more offers come for slave A
>> 9. basically we have an offer leak problem.
>>
>> To work around this, I am changing my framework so that when it receives
>> mesos reregistration callback (step 3 above), it removes all existing
>> offers. This should fix the problem.
>>
>> However, I am wondering if #7 can be improved in Mesos. When a task is
>> (or set of tasks are) launched using multiple offers, if at least one of
>> the offers is invalid, then Mesos should treat all offers as given up by
>> the framework. This will send TASK_LOST to the framework, but, also make
>> the valid offers available again through new offers.
>>
>> I am thinking this will be critical to do when Mesos starts rescinding
>> offers. Because in that case the frameworks cannot rely on the strategy
>> like the one I am using with reregistration.
>>
>> Sharma
>>
>>
>
