Fixed in 0.19: https://issues.apache.org/jira/browse/MESOS-1400

On Thu, Nov 6, 2014 at 7:59 PM, Timothy Chen <t...@mesosphere.io> wrote:

> Hi Sharma,
>
> Can you try out the latest master and see if you can repro it?
>
> Tim
>
> Sent from my iPhone
>
> On Nov 6, 2014, at 7:41 PM, Sharma Podila <spod...@netflix.com> wrote:
>
> I am on 0.18 still.
>
> I think I found a bug. I wrote a simple program to reproduce it, and there's a new twist as well.
>
> Again, although I have fixed this for now in my framework by removing all previous leases after re-registration, this can show up again when Mesos starts rescinding offers in the future.
>
> Here's what I do:
>
> 1. Register with a Mesos cluster that has just one slave and only one master.
> 2. Get an offer, O1.
> 3. Kill and restart the Mesos master.
> 4. Get a new offer for the only slave, O2.
> 5. Launch a task with both offers O1 and O2.
> 6. Receive TASK_LOST.
> 7. Wait for a new offer, which never comes.
>
> Here's the new twist:
>
> 8. Kill my framework and restart it.
> 9. Get no offers from Mesos at all.
>
> Here are the relevant Mesos master logs:
>
> I1106 19:31:55.734485 10423 master.cpp:770] Elected as the leading master!
> I1106 19:31:55.737759 10423 master.cpp:1936] Attempting to re-register slave 20141029-125131-16842879-5050-18827-1 at slave(1)@127.0.1.1:5051 (lgud-spodila2)
> I1106 19:31:55.737788 10423 master.cpp:2818] Adding slave 20141029-125131-16842879-5050-18827-1 at lgud-spodila2 with cpus(*):8; mem(*):39209; disk(*):219127; ports(*):[31000-32000]
> I1106 19:31:55.738088 10422 hierarchical_allocator_process.hpp:445] Added slave 20141029-125131-16842879-5050-18827-1 (lgud-spodila2) with cpus(*):8; mem(*):39209; disk(*):219127; ports(*):[31000-32000] (and cpus(*):8; mem(*):39209; disk(*):219127; ports(*):[31000-32000] available)
> I1106 19:31:56.733850 10423 master.cpp:906] Re-registering framework 20141106-193136-16842879-5050-10308-0000 at scheduler(1)@127.0.1.1:55515
> I1106 19:31:56.734544 10424 hierarchical_allocator_process.hpp:332] Added framework 20141106-193136-16842879-5050-10308-0000
> I1106 19:31:56.735044 10424 master.cpp:2285] Sending 1 offers to framework 20141106-193136-16842879-5050-10308-0000
> I1106 19:31:59.627913 10423 http.cpp:391] HTTP request for '/master/state.json'
> I1106 19:32:09.634088 10421 http.cpp:391] HTTP request for '/master/state.json'
> W1106 19:32:10.377226 10425 master.cpp:1556] Failed to validate offer : Offer 20141106-193136-16842879-5050-10308-0 is no longer valid
> I1106 19:32:10.378697 10425 master.cpp:1567] Sending status update TASK_LOST (UUID: afadf504-f606-47f2-82cc-5af2e532afcd) for task Job123 of framework 20141106-193136-16842879-5050-10308-0000 for launch task attempt on invalid offers: [ 20141106-193147-16842879-5050-10406-0, 20141106-193136-16842879-5050-10308-0 ]
>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> The master thinks both offers are invalid and basically leaks them.
>
> I1106 19:32:19.640913 10422 http.cpp:391] HTTP request for '/master/state.json'
> I1106 19:32:22.667037 10424 master.cpp:595] Framework 20141106-193136-16842879-5050-10308-0000 disconnected
> I1106 19:32:22.667280 10424 master.cpp:1079] Deactivating framework 20141106-193136-16842879-5050-10308-0000
> I1106 19:32:22.668009 10424 master.cpp:617] Giving framework 20141106-193136-16842879-5050-10308-0000 0ns to failover
> I1106 19:32:22.668124 10427 hierarchical_allocator_process.hpp:408] Deactivated framework 20141106-193136-16842879-5050-10308-0000
> I1106 19:32:22.668252 10425 master.cpp:2201] Framework failover timeout, removing framework 20141106-193136-16842879-5050-10308-0000
> I1106 19:32:22.668443 10425 master.cpp:2688] Removing framework 20141106-193136-16842879-5050-10308-0000
> I1106 19:32:22.668829 10425 hierarchical_allocator_process.hpp:363] Removed framework 20141106-193136-16842879-5050-10308-0000
> I1106 19:32:24.739157 10426 master.cpp:818] Received registration request from scheduler(1)@127.0.1.1:37122
> I1106 19:32:24.739328 10426 master.cpp:836] Registering framework 20141106-193147-16842879-5050-10406-0000 at scheduler(1)@127.0.1.1:37122
> I1106 19:32:24.739753 10426 hierarchical_allocator_process.hpp:332] Added framework 20141106-193147-16842879-5050-10406-0000
> I1106 19:32:29.647886 10423 http.cpp:391] HTTP request for '/master/state.json'
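
Sharma's repro program is not attached to the thread, but step 5 above maps to the multi-offer launchTasks() call in the scheduler driver. A rough sketch of that call against the Mesos Java bindings follows; the class, method, and variable names are illustrative (not from the actual repro), and the two-argument launchTasks overload taking a collection of offer IDs is assumed from the Java API of that era.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.mesos.Protos.Offer;
    import org.apache.mesos.Protos.OfferID;
    import org.apache.mesos.Protos.TaskInfo;
    import org.apache.mesos.SchedulerDriver;

    // Illustrative sketch only; names are not taken from the thread.
    public final class CombinedOfferLaunch {

        // Launch one task against every offer the framework has cached for a slave.
        // If any of those offers predates a master failover (O1 above), the master
        // declines the whole launch: it sends TASK_LOST for the task and, on the
        // affected versions, the still-valid offer (O2) silently disappears.
        static void launchOnCachedOffers(SchedulerDriver driver,
                                         List<Offer> cachedOffersForSlave,
                                         TaskInfo task) {
            List<OfferID> offerIds = new ArrayList<OfferID>();
            for (Offer offer : cachedOffersForSlave) {
                offerIds.add(offer.getId()); // both the stale O1 and the fresh O2 end up here
            }
            // The multi-offer launch; this is the call behind the
            // "launch task attempt on invalid offers: [ ..., ... ]" line in the master log.
            driver.launchTasks(offerIds, Arrays.asList(task));
        }
    }

Because the master validates the offer IDs as a single set, one stale ID is enough for the entire launch to be declined.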
> On Thu, Nov 6, 2014 at 6:53 PM, Benjamin Mahler <benjamin.mah...@gmail.com> wrote:
>
>> Which version of the master are you using, and do you have the logs? The fact that no offers were coming back sounds like a bug!
>>
>> As for using O1 after a disconnection: all offers are invalid once a disconnection occurs. The scheduler driver does not automatically rescind offers upon disconnection, so I'd recommend clearing all cached offers when your scheduler gets disconnected, to avoid the unnecessary TASK_LOST updates.
>>
>> On Thu, Nov 6, 2014 at 6:25 PM, Sharma Podila <spod...@netflix.com> wrote:
>>
>>> We had an interesting problem with resource offers today, and I would like to confirm the problem and request an enhancement. Here's a summary of the events in sequence:
>>>
>>> 1. Resource offer O1 for slave A arrives.
>>> 2. Mesos disconnects.
>>> 3. Mesos re-registers.
>>> 4. Mesos offer O2 for slave A arrives.
>>>    (Our framework keeps offers for some time if unused; therefore, we now hold both O1 and O2, incorrectly.)
>>> 5. Launch task T1 using offers O1 and O2.
>>> 6. The framework thinks it now has no offers for slave A and will wait for a new offer once Mesos consumes the resources for task T1.
>>> 7. Mesos sends TASK_LOST for T1, saying it was using an invalid offer (even though only O1 was invalid; O2 goes missing silently).
>>> 8. No more offers come for slave A.
>>> 9. Basically, we have an offer leak problem.
>>>
>>> To work around this, I am changing my framework so that when it receives the Mesos re-registration callback (step 3 above), it removes all existing offers. This should fix the problem.
>>>
>>> However, I am wondering if #7 can be improved in Mesos. When a task (or set of tasks) is launched using multiple offers and at least one of the offers is invalid, Mesos should treat all of the offers as given up by the framework. This would still send TASK_LOST to the framework, but it would also make the valid offers available again through new offers.
>>>
>>> I am thinking this will be critical once Mesos starts rescinding offers, because in that case frameworks cannot rely on a strategy like the one I am using with re-registration.
>>>
>>> Sharma
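
Benjamin's suggestion (drop cached offers as soon as the scheduler is disconnected) and Sharma's workaround (drop them again on re-registration) are both framework-side guards. A minimal sketch of that pattern against the Mesos Java bindings, assuming the framework caches unused offers in a map keyed by offer ID; the class and field names are illustrative, not from the actual Netflix framework.

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.mesos.Protos.MasterInfo;
    import org.apache.mesos.Protos.Offer;
    import org.apache.mesos.Protos.OfferID;
    import org.apache.mesos.Scheduler;
    import org.apache.mesos.SchedulerDriver;

    // Illustrative sketch only; names are not taken from the thread. Only the
    // callbacks relevant to the discussion are shown; the remaining Scheduler
    // methods are left to a concrete subclass.
    public abstract class OfferCachingScheduler implements Scheduler {

        // Offers held for later use, keyed by offer ID.
        private final Map<OfferID, Offer> cachedOffers =
            new ConcurrentHashMap<OfferID, Offer>();

        @Override
        public void resourceOffers(SchedulerDriver driver, List<Offer> offers) {
            for (Offer offer : offers) {
                cachedOffers.put(offer.getId(), offer);
            }
        }

        @Override
        public void offerRescinded(SchedulerDriver driver, OfferID offerId) {
            cachedOffers.remove(offerId);
        }

        @Override
        public void disconnected(SchedulerDriver driver) {
            // Benjamin's recommendation: every outstanding offer is invalid once the
            // scheduler is disconnected, and the driver does not rescind them for us.
            cachedOffers.clear();
        }

        @Override
        public void reregistered(SchedulerDriver driver, MasterInfo masterInfo) {
            // Sharma's workaround: also drop anything cached before the master
            // re-registered us, so a stale offer is never combined with a fresh one.
            cachedOffers.clear();
        }

        protected Map<OfferID, Offer> cachedOffers() {
            return cachedOffers;
        }
    }

Clearing in both callbacks is slightly redundant, but either one alone would have prevented the stale O1 from being combined with the fresh O2 in the launch that triggered the TASK_LOST.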