Fixed in 0.19: https://issues.apache.org/jira/browse/MESOS-1400

On Thu, Nov 6, 2014 at 7:59 PM, Timothy Chen <t...@mesosphere.io> wrote:

> Hi Sharma,
>
> Can you try out the latest master and see if you can repro it?
>
> Tim
>
> Sent from my iPhone
>
> On Nov 6, 2014, at 7:41 PM, Sharma Podila <spod...@netflix.com> wrote:
>
> I am on 0.18 still.
>
> I think I found a bug. I wrote a simple program to reproduce this, and
> there's a new twist as well.
>
> Again, although I have fixed this for now in my framework by removing all
> previous leases after re-registration, this can show up when mesos starts
> rescinding offers in the future.
>
> Here's what I do:
>
> 1. register with a mesos cluster that has just one slave and only one
> master
> 2. get an offer, O1
> 3. kill and restart mesos master
> 4. get new offer for the only slave, O2
> 5. launch a task with both offers O1 and O2
> 6. receive TASK_LOST
> 7. wait for new offer, that never comes.
> Here's the new twist:
> 8. kill my framework and restart
> 9. get no offers from mesos at all.
>
> Here's the relevant mesos master logs:
>
> I1106 19:31:55.734485 10423 master.cpp:770] Elected as the leading master!
> I1106 19:31:55.737759 10423 master.cpp:1936] Attempting to re-register
> slave 20141029-125131-16842879-5050-18827-1 at slave(1)@127.0.1.1:5051
> (lgud-spodila2)
> I1106 19:31:55.737788 10423 master.cpp:2818] Adding slave
> 20141029-125131-16842879-5050-18827-1 at lgud-spodila2 with cpus(*):8;
> mem(*):39209; disk(*):219127; ports(*):[31000-32000]
> I1106 19:31:55.738088 10422 hierarchical_allocator_process.hpp:445] Added
> slave 20141029-125131-16842879-5050-18827-1 (lgud-spodila2) with cpus(*):8;
> mem(*):39209; disk(*):219127; ports(*):[31000-32000] (and cpus(*):8;
> mem(*):39209; disk(*):219127; ports(*):[31000-32000] available)
> I1106 19:31:56.733850 10423 master.cpp:906] Re-registering framework
> 20141106-193136-16842879-5050-10308-0000 at scheduler(1)@127.0.1.1:55515
> I1106 19:31:56.734544 10424 hierarchical_allocator_process.hpp:332] Added
> framework 20141106-193136-16842879-5050-10308-0000
> I1106 19:31:56.735044 10424 master.cpp:2285] Sending 1 offers to framework
> 20141106-193136-16842879-5050-10308-0000
> I1106 19:31:59.627913 10423 http.cpp:391] HTTP request for
> '/master/state.json'
> I1106 19:32:09.634088 10421 http.cpp:391] HTTP request for
> '/master/state.json'
> W1106 19:32:10.377226 10425 master.cpp:1556] Failed to validate offer  :
> Offer 20141106-193136-16842879-5050-10308-0 is no longer valid
> I1106 19:32:10.378697 10425 master.cpp:1567] Sending status update
> TASK_LOST (UUID: afadf504-f606-47f2-82cc-5af2e532afcd) for task Job123 of
> framework 20141106-193136-16842879-5050-10308-0000 for launch task attempt
> on invalid offers: [ 20141106-193147-16842879-5050-10406-0,
> 20141106-193136-16842879-5050-10308-0 ]
>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> Master thinks both offers are invalid and basically leaks the valid one.
>
> I1106 19:32:19.640913 10422 http.cpp:391] HTTP request for
> '/master/state.json'
> I1106 19:32:22.667037 10424 master.cpp:595] Framework
> 20141106-193136-16842879-5050-10308-0000 disconnected
> I1106 19:32:22.667280 10424 master.cpp:1079] Deactivating framework
> 20141106-193136-16842879-5050-10308-0000
> I1106 19:32:22.668009 10424 master.cpp:617] Giving framework
> 20141106-193136-16842879-5050-10308-0000 0ns to failover
> I1106 19:32:22.668124 10427 hierarchical_allocator_process.hpp:408]
> Deactivated framework 20141106-193136-16842879-5050-10308-0000
> I1106 19:32:22.668252 10425 master.cpp:2201] Framework failover timeout,
> removing framework 20141106-193136-16842879-5050-10308-0000
> I1106 19:32:22.668443 10425 master.cpp:2688] Removing framework
> 20141106-193136-16842879-5050-10308-0000
> I1106 19:32:22.668829 10425 hierarchical_allocator_process.hpp:363]
> Removed framework 20141106-193136-16842879-5050-10308-0000
> I1106 19:32:24.739157 10426 master.cpp:818] Received registration request
> from scheduler(1)@127.0.1.1:37122
> I1106 19:32:24.739328 10426 master.cpp:836] Registering framework
> 20141106-193147-16842879-5050-10406-0000 at scheduler(1)@127.0.1.1:37122
> I1106 19:32:24.739753 10426 hierarchical_allocator_process.hpp:332] Added
> framework 20141106-193147-16842879-5050-10406-0000
> I1106 19:32:29.647886 10423 http.cpp:391] HTTP request for
> '/master/state.json'
>
>
> On Thu, Nov 6, 2014 at 6:53 PM, Benjamin Mahler <benjamin.mah...@gmail.com> wrote:
>
>> Which version of the master are you using and do you have the logs? The
>> fact that no offers were coming back sounds like a bug!
>>
>> As for using O1 after a disconnection, all offers are invalid once a
>> disconnection occurs. The scheduler driver does not automatically rescind
>> offers upon disconnection, so I'd recommend clearing all cached offers when
>> your scheduler gets disconnected, to avoid the unnecessary TASK_LOST
>> updates.
>>
>> On Thu, Nov 6, 2014 at 6:25 PM, Sharma Podila <spod...@netflix.com>
>> wrote:
>>
>>> We had an interesting problem with resource offers today and I would
>>> like to confirm this problem and request an enhancement. Here's the summary
>>> in the right sequence of events:
>>>
>>> 1. resource offer O1 for slave A arrives
>>> 2. mesos disconnects
>>> 3. mesos reregisters
>>> 4. mesos offer O2 for slave A arrives
>>>     (our framework keeps unused offers for some time, so we now
>>> incorrectly hold both O1 and O2)
>>> 5. launch task T1 using offers O1 and O2
>>> 6. framework now thinks it holds no offers for slave A, and will wait
>>> for a new offer after mesos consumes resources for task T1
>>> 7. mesos sends TASK_LOST for T1 saying it was using an invalid offer
>>>     (even though only O1 was invalid, O2 is gone missing silently)
>>> 8. no more offers come for slave A
>>> 9. basically we have an offer leak problem.
>>>
>>> To work around this, I am changing my framework so that when it receives
>>> mesos reregistration callback (step 3 above), it removes all existing
>>> offers. This should fix the problem.
>>>
>>> However, I am wondering if #7 can be improved in Mesos. When a task is
>>> (or set of tasks are) launched using multiple offers, if at least one of
>>> the offers is invalid, then Mesos should treat all offers as given up by
>>> the framework. This will send TASK_LOST to the framework, but, also make
>>> the valid offers available again through new offers.
>>>
>>> I think this will be critical once Mesos starts rescinding offers,
>>> because in that case frameworks cannot rely on a strategy like the one
>>> I am using with reregistration.
>>>
>>> Sharma
>>>
>>>
>>
>
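
For anyone hitting this on a pre-0.19 master: the workaround discussed in the quoted thread (drop every cached offer when the scheduler is disconnected or re-registers) could look roughly like the sketch below. This is only an illustration under stated assumptions. `OfferCache` and `take_offers_for` are hypothetical names, offers are reduced to plain id/slave strings, and the methods merely mirror the scheduler callbacks of the same names rather than calling into the real Mesos bindings.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OfferCache:
    """Hypothetical framework-side cache of unused offers (offer id -> slave id)."""
    offers: Dict[str, str] = field(default_factory=dict)

    def resource_offers(self, offers: Dict[str, str]) -> None:
        # Remember newly received offers until they are used or invalidated.
        self.offers.update(offers)

    def offer_rescinded(self, offer_id: str) -> None:
        # A single offer was withdrawn by the master.
        self.offers.pop(offer_id, None)

    def disconnected(self) -> None:
        # All outstanding offers are invalid once the connection is lost.
        self.offers.clear()

    def reregistered(self) -> None:
        # Defensive: also drop anything cached before re-registration.
        self.offers.clear()

    def take_offers_for(self, slave_id: str) -> List[str]:
        # Hand out (and forget) every cached offer id for one slave.
        ids = [oid for oid, sid in self.offers.items() if sid == slave_id]
        for oid in ids:
            del self.offers[oid]
        return ids


if __name__ == "__main__":
    cache = OfferCache()
    cache.resource_offers({"O1": "A"})   # offer O1 for slave A arrives
    cache.disconnected()                 # master goes away, O1 is dropped
    cache.reregistered()
    cache.resource_offers({"O2": "A"})   # fresh offer O2 after re-registration
    print(cache.take_offers_for("A"))    # only O2 is used for the launch
```

With a cache like this, the launch in the reproduction steps would only ever carry O2, so the master has no stale offer to reject.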
