Re: A problem with resource offers

Sharma Podila Fri, 07 Nov 2014 09:22:13 -0800

Thanks, Adam. I should've looked at the fixed issues for this.
Things work fine with a later version, confirmed with 0.20.


On Fri, Nov 7, 2014 at 1:29 AM, Adam Bordelon <a...@mesosphere.io> wrote:

> Fixed in 0.19: https://issues.apache.org/jira/browse/MESOS-1400
>
> On Thu, Nov 6, 2014 at 7:59 PM, Timothy Chen <t...@mesosphere.io> wrote:
>
>> Hi Sharma,
>>
>> Can you try out the latest master and see if you can repro it?
>>
>> Tim
>>
>> Sent from my iPhone
>>
>> On Nov 6, 2014, at 7:41 PM, Sharma Podila <spod...@netflix.com> wrote:
>>
>> 
>> I am on 0.18 still.
>>
>> I think I found a bug. I wrote a simple program to reproduce this and
>> there's a new twist as well.
>>
>> Again, although I have fixed this for now in my framework by removing all
>> previous leases after re-registration, this can show up when mesos starts
>> rescinding offers in the future.
>>
>> Here's what I do:
>>
>> 1. register with mesos that has just one slave in the cluster and only
>> one master
>> 2. get an offer, O1
>> 3. kill and restart mesos master
>> 4. get new offer for the only slave, O2
>> 5. launch a task with both offers O1 and O2
>> 6. receive TASK_LOST
>> 7. wait for new offer, that never comes.
>> Here's the new twist:
>> 8. kill my framework and restart
>> 9. get no offers from mesos at all.
>>
>> Here's the relevant mesos master logs:
>>
>> I1106 19:31:55.734485 10423 master.cpp:770] Elected as the leading master!
>> I1106 19:31:55.737759 10423 master.cpp:1936] Attempting to re-register
>> slave 20141029-125131-16842879-5050-18827-1 at slave(1)@127.0.1.1:5051
>> (lgud-spodila2)
>> I1106 19:31:55.737788 10423 master.cpp:2818] Adding slave
>> 20141029-125131-16842879-5050-18827-1 at lgud-spodila2 with cpus(*):8;
>> mem(*):39209; disk(*):219127; ports(*):[31000-32000]
>> I1106 19:31:55.738088 10422 hierarchical_allocator_process.hpp:445] Added
>> slave 20141029-125131-16842879-5050-18827-1 (lgud-spodila2) with cpus(*):8;
>> mem(*):39209; disk(*):219127; ports(*):[31000-32000] (and cpus(*):8;
>> mem(*):39209; disk(*):219127; ports(*):[31000-32000] available)
>> I1106 19:31:56.733850 10423 master.cpp:906] Re-registering framework
>> 20141106-193136-16842879-5050-10308-0000 at scheduler(1)@127.0.1.1:55515
>> I1106 19:31:56.734544 10424 hierarchical_allocator_process.hpp:332] Added
>> framework 20141106-193136-16842879-5050-10308-0000
>> I1106 19:31:56.735044 10424 master.cpp:2285] Sending 1 offers to
>> framework 20141106-193136-16842879-5050-10308-0000
>> I1106 19:31:59.627913 10423 http.cpp:391] HTTP request for
>> '/master/state.json'
>> I1106 19:32:09.634088 10421 http.cpp:391] HTTP request for
>> '/master/state.json'
>> W1106 19:32:10.377226 10425 master.cpp:1556] Failed to validate offer  :
>> Offer 20141106-193136-16842879-5050-10308-0 is no longer valid
>> I1106 19:32:10.378697 10425 master.cpp:1567] Sending status update
>> TASK_LOST (UUID: afadf504-f606-47f2-82cc-5af2e532afcd) for task Job123 of
>> framework 20141106-193136-16842879-5050-10308-0000 for launch task attempt
>> on invalid offers: [ 20141106-193147-16842879-5050-10406-0,
>> 20141106-193136-16842879-5050-10308-0 ]
>>
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> Master treats both offers as invalid and effectively leaks the valid one.
>>
>> I1106 19:32:19.640913 10422 http.cpp:391] HTTP request for
>> '/master/state.json'
>> I1106 19:32:22.667037 10424 master.cpp:595] Framework
>> 20141106-193136-16842879-5050-10308-0000 disconnected
>> I1106 19:32:22.667280 10424 master.cpp:1079] Deactivating framework
>> 20141106-193136-16842879-5050-10308-0000
>> I1106 19:32:22.668009 10424 master.cpp:617] Giving framework
>> 20141106-193136-16842879-5050-10308-0000 0ns to failover
>> I1106 19:32:22.668124 10427 hierarchical_allocator_process.hpp:408]
>> Deactivated framework 20141106-193136-16842879-5050-10308-0000
>> I1106 19:32:22.668252 10425 master.cpp:2201] Framework failover timeout,
>> removing framework 20141106-193136-16842879-5050-10308-0000
>> I1106 19:32:22.668443 10425 master.cpp:2688] Removing framework
>> 20141106-193136-16842879-5050-10308-0000
>> I1106 19:32:22.668829 10425 hierarchical_allocator_process.hpp:363]
>> Removed framework 20141106-193136-16842879-5050-10308-0000
>> I1106 19:32:24.739157 10426 master.cpp:818] Received registration request
>> from scheduler(1)@127.0.1.1:37122
>> I1106 19:32:24.739328 10426 master.cpp:836] Registering framework
>> 20141106-193147-16842879-5050-10406-0000 at scheduler(1)@127.0.1.1:37122
>> I1106 19:32:24.739753 10426 hierarchical_allocator_process.hpp:332] Added
>> framework 20141106-193147-16842879-5050-10406-0000
>> I1106 19:32:29.647886 10423 http.cpp:391] HTTP request for
>> '/master/state.json'
>>
>>
>> On Thu, Nov 6, 2014 at 6:53 PM, Benjamin Mahler <benjamin.mah...@gmail.com> wrote:
>>
>>> Which version of the master are you using and do you have the logs? The
>>> fact that no offers were coming back sounds like a bug!
>>>
>>> As for using O1 after a disconnection, all offers are invalid once a
>>> disconnection occurs. The scheduler driver does not automatically rescind
>>> offers upon disconnection, so I'd recommend clearing all cached offers when
>>> your scheduler gets disconnected, to avoid the unnecessary TASK_LOST
>>> updates.
>>>
>>> On Thu, Nov 6, 2014 at 6:25 PM, Sharma Podila <spod...@netflix.com> wrote:
>>>
>>>> We had an interesting problem with resource offers today. I would like
>>>> to confirm the problem and request an enhancement. Here's a summary of
>>>> the sequence of events:
>>>>
>>>> 1. resource offer O1 for slave A arrives
>>>> 2. mesos disconnects
>>>> 3. mesos reregisters
>>>> 4. mesos offer O2 for slave A arrives
>>>>     (our framework keeps unused offers for some time, so we now hold
>>>> both O1 and O2, incorrectly)
>>>> 5. launch task T1 using offers O1 and O2
>>>> 6. framework now thinks it holds no offers for slave A and waits for a
>>>> new offer after mesos consumes resources for task T1
>>>> 7. mesos sends TASK_LOST for T1 saying it was using an invalid offer
>>>>     (even though only O1was invalid, O2 goes missing silently)
>>>> 8. no more offers come for slave A
>>>> 9. basically we have an offer leak problem.
>>>>
>>>> To work around this, I am changing my framework so that when it
>>>> receives the mesos reregistration callback (step 3 above), it removes
>>>> all existing offers. This should fix the problem.
>>>>
>>>> However, I am wondering if #7 can be improved in Mesos. When a task (or
>>>> a set of tasks) is launched using multiple offers and at least one of
>>>> the offers is invalid, Mesos should treat all of the offers as given up
>>>> by the framework. It would still send TASK_LOST to the framework, but it
>>>> would also make the valid offers available again through new offers.
>>>>
>>>> I think this will be critical once Mesos starts rescinding offers,
>>>> because in that case frameworks cannot rely on a strategy like the one
>>>> I am using with reregistration.
>>>>
>>>> Sharma
>>>>
>>>>
>>>
>>
>
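
The framework-side workaround discussed in this thread (dropping every cached offer when the scheduler disconnects or re-registers) can be sketched roughly as below. This is only an illustration, not code from any of the posters: `OfferCache`, `CachingScheduler` and `replay_thread_sequence` are hypothetical names, offers are simplified to `(offer_id, slave_id)` pairs, and the driver argument is treated as opaque. The callback names follow the Mesos Scheduler interface (`resourceOffers`, `offerRescinded`, `disconnected`, `reregistered`).

```python
class OfferCache:
    """Hypothetical helper (not part of Mesos): unused offers held by the framework."""

    def __init__(self):
        self._offers = {}  # offer_id -> slave_id

    def add(self, offer_id, slave_id):
        self._offers[offer_id] = slave_id

    def remove(self, offer_id):
        self._offers.pop(offer_id, None)

    def clear(self):
        self._offers.clear()

    def offers_for(self, slave_id):
        return sorted(o for o, s in self._offers.items() if s == slave_id)


class CachingScheduler:
    """Callback names mirror the Mesos Scheduler interface; arguments are simplified."""

    def __init__(self):
        self.cache = OfferCache()

    def resourceOffers(self, driver, offers):
        # Assumption: each offer is an (offer_id, slave_id) pair, not a real Offer message.
        for offer_id, slave_id in offers:
            self.cache.add(offer_id, slave_id)

    def offerRescinded(self, driver, offer_id):
        self.cache.remove(offer_id)

    def disconnected(self, driver):
        # All outstanding offers are invalid once a disconnection occurs.
        self.cache.clear()

    def reregistered(self, driver, master_info):
        # Also clear here, in case offers were cached before re-registration.
        self.cache.clear()


def replay_thread_sequence():
    """Replays steps 1-4 of the original report and returns the offers left for slave A."""
    sched = CachingScheduler()
    sched.resourceOffers(None, [("O1", "A")])  # step 1: O1 arrives
    sched.disconnected(None)                   # step 2: mesos disconnects
    sched.reregistered(None, None)             # step 3: mesos reregisters
    sched.resourceOffers(None, [("O2", "A")])  # step 4: O2 arrives
    return sched.cache.offers_for("A")
```

With the cache cleared on disconnection, only O2 remains for slave A, so a later launch would not mix the stale O1 with it.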

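
The enhancement requested in the original message (if any offer in a launch is invalid, treat all of the named offers as given up so that the valid ones can be re-offered) can be modelled with a toy function. This is a sketch of the proposal as described in the thread, not the Mesos master's implementation and not a description of what MESOS-1400 changed; `handle_launch`, the `outstanding` dictionary and the returned fields are all assumptions made for illustration.

```python
def handle_launch(outstanding, offer_ids):
    """Toy model of the proposed master-side handling of a launch request.

    outstanding: dict mapping offer_id -> resources for offers the master
                 still considers valid (hypothetical structure).
    offer_ids:   offers named by the framework in its launch request.
    """
    invalid = [o for o in offer_ids if o not in outstanding]
    # Every valid offer named in the request leaves the outstanding set:
    # either the task consumes it, or its resources are reclaimed.
    resources = [outstanding.pop(o) for o in offer_ids if o in outstanding]
    if invalid:
        # Proposed behaviour: the task is lost, but resources from the valid
        # offers go back to the allocator so they can be offered again.
        return {"status": "TASK_LOST", "invalid": invalid, "reclaimed": resources}
    # All offers were valid; the task would be launched on their resources.
    return {"status": None, "invalid": [], "reclaimed": []}
```

In the scenario from the thread, calling this with only O2 outstanding and a launch naming both O1 and O2 reports O1 as invalid and returns O2's resources in `reclaimed` instead of leaking them.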