Hi Sharma,

Can you try out the latest master and see if you can repro it?

Tim

> On Nov 6, 2014, at 7:41 PM, Sharma Podila <[email protected]> wrote:
> 
> I am on 0.18 still.
> 
> I think I found a bug. I wrote a simple program to reproduce this, and
> there's a new twist as well.
> 
> Again, although I have fixed this for now in my framework by removing all
> previous leases after re-registration, the same problem can show up once
> Mesos starts rescinding offers in the future.
> 
> Here's what I do:
> 
> 1. register with a mesos cluster that has just one slave and only one
> master
> 2. get an offer, O1
> 3. kill and restart mesos master
> 4. get new offer for the only slave, O2
> 5. launch a task with both offers O1 and O2
> 6. receive TASK_LOST
> 7. wait for a new offer, which never comes.
>
> Here's the new twist:
>
> 8. kill my framework and restart
> 9. get no offers from mesos at all.
> 
> Here's the relevant mesos master logs:
> 
> I1106 19:31:55.734485 10423 master.cpp:770] Elected as the leading master!
> I1106 19:31:55.737759 10423 master.cpp:1936] Attempting to re-register slave 
> 20141029-125131-16842879-5050-18827-1 at slave(1)@127.0.1.1:5051 
> (lgud-spodila2)
> I1106 19:31:55.737788 10423 master.cpp:2818] Adding slave 
> 20141029-125131-16842879-5050-18827-1 at lgud-spodila2 with cpus(*):8; 
> mem(*):39209; disk(*):219127; ports(*):[31000-32000]
> I1106 19:31:55.738088 10422 hierarchical_allocator_process.hpp:445] Added 
> slave 20141029-125131-16842879-5050-18827-1 (lgud-spodila2) with cpus(*):8; 
> mem(*):39209; disk(*):219127; ports(*):[31000-32000] (and cpus(*):8; 
> mem(*):39209; disk(*):219127; ports(*):[31000-32000] available)
> I1106 19:31:56.733850 10423 master.cpp:906] Re-registering framework 
> 20141106-193136-16842879-5050-10308-0000 at scheduler(1)@127.0.1.1:55515
> I1106 19:31:56.734544 10424 hierarchical_allocator_process.hpp:332] Added 
> framework 20141106-193136-16842879-5050-10308-0000
> I1106 19:31:56.735044 10424 master.cpp:2285] Sending 1 offers to framework 
> 20141106-193136-16842879-5050-10308-0000
> I1106 19:31:59.627913 10423 http.cpp:391] HTTP request for 
> '/master/state.json'
> I1106 19:32:09.634088 10421 http.cpp:391] HTTP request for 
> '/master/state.json'
> W1106 19:32:10.377226 10425 master.cpp:1556] Failed to validate offer  : 
> Offer 20141106-193136-16842879-5050-10308-0 is no longer valid
> I1106 19:32:10.378697 10425 master.cpp:1567] Sending status update TASK_LOST 
> (UUID: afadf504-f606-47f2-82cc-5af2e532afcd) for task Job123 of framework 
> 20141106-193136-16842879-5050-10308-0000 for launch task attempt on invalid 
> offers: [ 20141106-193147-16842879-5050-10406-0, 
> 20141106-193136-16842879-5050-10308-0 ]
> 
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> The master treats both offers as invalid and basically leaks the valid one.
> 
> I1106 19:32:19.640913 10422 http.cpp:391] HTTP request for 
> '/master/state.json'
> I1106 19:32:22.667037 10424 master.cpp:595] Framework 
> 20141106-193136-16842879-5050-10308-0000 disconnected
> I1106 19:32:22.667280 10424 master.cpp:1079] Deactivating framework 
> 20141106-193136-16842879-5050-10308-0000
> I1106 19:32:22.668009 10424 master.cpp:617] Giving framework 
> 20141106-193136-16842879-5050-10308-0000 0ns to failover
> I1106 19:32:22.668124 10427 hierarchical_allocator_process.hpp:408] 
> Deactivated framework 20141106-193136-16842879-5050-10308-0000
> I1106 19:32:22.668252 10425 master.cpp:2201] Framework failover timeout, 
> removing framework 20141106-193136-16842879-5050-10308-0000
> I1106 19:32:22.668443 10425 master.cpp:2688] Removing framework 
> 20141106-193136-16842879-5050-10308-0000
> I1106 19:32:22.668829 10425 hierarchical_allocator_process.hpp:363] Removed 
> framework 20141106-193136-16842879-5050-10308-0000
> I1106 19:32:24.739157 10426 master.cpp:818] Received registration request 
> from scheduler(1)@127.0.1.1:37122
> I1106 19:32:24.739328 10426 master.cpp:836] Registering framework 
> 20141106-193147-16842879-5050-10406-0000 at scheduler(1)@127.0.1.1:37122
> I1106 19:32:24.739753 10426 hierarchical_allocator_process.hpp:332] Added 
> framework 20141106-193147-16842879-5050-10406-0000
> I1106 19:32:29.647886 10423 http.cpp:391] HTTP request for 
> '/master/state.json'
> 
> 
>> On Thu, Nov 6, 2014 at 6:53 PM, Benjamin Mahler <[email protected]> 
>> wrote:
>> Which version of the master are you using and do you have the logs? The fact 
>> that no offers were coming back sounds like a bug!
>> 
>> As for using O1 after a disconnection, all offers are invalid once a 
>> disconnection occurs. The scheduler driver does not automatically rescind 
>> offers upon disconnection, so I'd recommend clearing all cached offers when 
>> your scheduler gets disconnected, to avoid the unnecessary TASK_LOST updates.
>> 
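The recommendation above (drop every cached offer when the scheduler is disconnected) can be sketched as follows. This is a minimal illustration, not code from any framework in this thread: `OfferCache` and its method names are hypothetical, and the assumption is that the two hooks would be called from the scheduler's `disconnected` and `reregistered` callbacks.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OfferCache:
    """Hypothetical framework-side cache of unused offers (offer ID -> slave ID)."""

    offers: Dict[str, str] = field(default_factory=dict)

    def add(self, offer_id: str, slave_id: str) -> None:
        # Remember an offer that has not been used for a launch yet.
        self.offers[offer_id] = slave_id

    def offers_for(self, slave_id: str) -> List[str]:
        # Offer IDs currently held for one slave, in a stable order.
        return sorted(o for o, s in self.offers.items() if s == slave_id)

    def on_disconnected(self) -> None:
        # All offers are invalid once a disconnection occurs, so forget them.
        self.offers.clear()

    def on_reregistered(self) -> None:
        # Defensive second clear, in case the disconnect hook never ran.
        self.offers.clear()


cache = OfferCache()
cache.add("O1", "A")        # offer received before the master restart
cache.on_disconnected()     # connection to the master is lost
cache.on_reregistered()     # scheduler re-registers with the new master
cache.add("O2", "A")        # fresh offer from the new master
print(cache.offers_for("A"))
```

With the cache cleared at both points, a later launch for slave A can only reference O2, so the stale O1 never reaches the master.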
>>> On Thu, Nov 6, 2014 at 6:25 PM, Sharma Podila <[email protected]> wrote:
>>> We had an interesting problem with resource offers today and I would like
>>> to confirm this problem and request an enhancement. Here's a summary of
>>> the sequence of events:
>>> 
>>> 1. resource offer O1 for slave A arrives
>>> 2. mesos disconnects
>>> 3. mesos reregisters
>>> 4. mesos offer O2 for slave A arrives
>>>     (our framework keeps offers for some time if unused, so we now
>>> incorrectly hold both O1 and O2)
>>> 5. launch task T1 using offers O1 and O2
>>> 6. framework thinks it now holds no offers for slave A, and waits for a
>>> new offer once mesos has consumed resources for task T1
>>> 7. mesos sends TASK_LOST for T1 saying it was using an invalid offer
>>>     (even though only O1 was invalid; O2 goes missing silently)
>>> 8. no more offers come for slave A
>>> 9. basically we have an offer leak problem.
>>> 
>>> To work around this, I am changing my framework so that when it receives
>>> the mesos re-registration callback (step 3 above), it removes all existing
>>> offers. This should fix the problem.
>>> 
>>> However, I am wondering if #7 can be improved in Mesos. When a task is (or
>>> a set of tasks are) launched using multiple offers, and at least one of
>>> the offers is invalid, Mesos should treat all of the offers as given up by
>>> the framework. This would still send TASK_LOST to the framework, but would
>>> also make the resources of the valid offers available again through new
>>> offers.
>>> 
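The enhancement proposed above can be sketched with a toy model. This is not Mesos master code: `ToyMaster`, its fields, and the simplified status strings are all hypothetical, and the model only tracks which slaves have resources free to be offered again.

```python
from typing import Dict, List, Set


class ToyMaster:
    """Toy model of the proposed behavior only; not how the Mesos master is written."""

    def __init__(self) -> None:
        self.outstanding: Dict[str, str] = {}  # valid offer ID -> slave ID
        self.available: Set[str] = set()       # slaves free to be offered again

    def offer(self, offer_id: str, slave_id: str) -> None:
        # Hand the slave's resources to the framework as an outstanding offer.
        self.available.discard(slave_id)
        self.outstanding[offer_id] = slave_id

    def launch(self, offer_ids: List[str]) -> str:
        requested = set(offer_ids)
        valid = [o for o in requested if o in self.outstanding]
        if len(valid) != len(requested):
            # Proposed behavior: at least one offer is invalid, so treat every
            # valid offer as given up and return its slave to the pool.
            for o in valid:
                self.available.add(self.outstanding.pop(o))
            return "TASK_LOST"
        # Every offer is valid: consume them all for the task (simplified).
        for o in valid:
            del self.outstanding[o]
        return "TASK_RUNNING"


master = ToyMaster()
master.offer("O2", "A")               # O1 predates the restart and is unknown
status = master.launch(["O1", "O2"])  # launch T1 with the stale and fresh offer
print(status, sorted(master.available))
```

In this model slave A returns to the available pool when the launch is rejected, so it can be offered again instead of being leaked.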
>>> I am thinking this will be critical once Mesos starts rescinding offers,
>>> because in that case frameworks cannot rely on a strategy like the one I
>>> am using with re-registration.
>>> 
>>> Sharma
> 
