Thanks, Adam. I should've looked at the fixed issues for this. Things work fine with a later version; I confirmed it with 0.20.
On Fri, Nov 7, 2014 at 1:29 AM, Adam Bordelon <a...@mesosphere.io> wrote:

> Fixed in 0.19: https://issues.apache.org/jira/browse/MESOS-1400
>
> On Thu, Nov 6, 2014 at 7:59 PM, Timothy Chen <t...@mesosphere.io> wrote:
>
>> Hi Sharma,
>>
>> Can you try out the latest master and see if you can repro it?
>>
>> Tim
>>
>> Sent from my iPhone
>>
>> On Nov 6, 2014, at 7:41 PM, Sharma Podila <spod...@netflix.com> wrote:
>>
>> I am on 0.18 still.
>>
>> I think I found a bug. I wrote a simple program to repeat this and
>> there's a new twist as well.
>>
>> Again, although I have fixed this for now in my framework by removing all
>> previous leases after re-registration, this can show up when mesos starts
>> rescinding offers in the future.
>>
>> Here's what I do:
>>
>> 1. register with mesos that has just one slave in the cluster and only
>> one master
>> 2. get an offer, O1
>> 3. kill and restart mesos master
>> 4. get new offer for the only slave, O2
>> 5. launch a task with both offers O1 and O2
>> 6. receive TASK_LOST
>> 7. wait for new offer, that never comes.
>> Here's the new twist:
>> 8. kill my framework and restart
>> 9. get no offers from mesos at all.
>>
>> Here's the relevant mesos master logs:
>>
>> I1106 19:31:55.734485 10423 master.cpp:770] Elected as the leading master!
>> I1106 19:31:55.737759 10423 master.cpp:1936] Attempting to re-register
>> slave 20141029-125131-16842879-5050-18827-1 at slave(1)@127.0.1.1:5051
>> (lgud-spodila2)
>> I1106 19:31:55.737788 10423 master.cpp:2818] Adding slave
>> 20141029-125131-16842879-5050-18827-1 at lgud-spodila2 with cpus(*):8;
>> mem(*):39209; disk(*):219127; ports(*):[31000-32000]
>> I1106 19:31:55.738088 10422 hierarchical_allocator_process.hpp:445] Added
>> slave 20141029-125131-16842879-5050-18827-1 (lgud-spodila2) with cpus(*):8;
>> mem(*):39209; disk(*):219127; ports(*):[31000-32000] (and cpus(*):8;
>> mem(*):39209; disk(*):219127; ports(*):[31000-32000] available)
>> I1106 19:31:56.733850 10423 master.cpp:906] Re-registering framework
>> 20141106-193136-16842879-5050-10308-0000 at scheduler(1)@127.0.1.1:55515
>> I1106 19:31:56.734544 10424 hierarchical_allocator_process.hpp:332] Added
>> framework 20141106-193136-16842879-5050-10308-0000
>> I1106 19:31:56.735044 10424 master.cpp:2285] Sending 1 offers to
>> framework 20141106-193136-16842879-5050-10308-0000
>> I1106 19:31:59.627913 10423 http.cpp:391] HTTP request for
>> '/master/state.json'
>> I1106 19:32:09.634088 10421 http.cpp:391] HTTP request for
>> '/master/state.json'
>> W1106 19:32:10.377226 10425 master.cpp:1556] Failed to validate offer :
>> Offer 20141106-193136-16842879-5050-10308-0 is no longer valid
>> I1106 19:32:10.378697 10425 master.cpp:1567] Sending status update
>> TASK_LOST (UUID: afadf504-f606-47f2-82cc-5af2e532afcd) for task Job123 of
>> framework 20141106-193136-16842879-5050-10308-0000 for launch task attempt
>> on invalid offers: [ 20141106-193147-16842879-5050-10406-0,
>> 20141106-193136-16842879-5050-10308-0 ]
>>
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> Master thinks both offers are invalid and basically leaks it.
>>
>> I1106 19:32:19.640913 10422 http.cpp:391] HTTP request for
>> '/master/state.json'
>> I1106 19:32:22.667037 10424 master.cpp:595] Framework
>> 20141106-193136-16842879-5050-10308-0000 disconnected
>> I1106 19:32:22.667280 10424 master.cpp:1079] Deactivating framework
>> 20141106-193136-16842879-5050-10308-0000
>> I1106 19:32:22.668009 10424 master.cpp:617] Giving framework
>> 20141106-193136-16842879-5050-10308-0000 0ns to failover
>> I1106 19:32:22.668124 10427 hierarchical_allocator_process.hpp:408]
>> Deactivated framework 20141106-193136-16842879-5050-10308-0000
>> I1106 19:32:22.668252 10425 master.cpp:2201] Framework failover timeout,
>> removing framework 20141106-193136-16842879-5050-10308-0000
>> I1106 19:32:22.668443 10425 master.cpp:2688] Removing framework
>> 20141106-193136-16842879-5050-10308-0000
>> I1106 19:32:22.668829 10425 hierarchical_allocator_process.hpp:363]
>> Removed framework 20141106-193136-16842879-5050-10308-0000
>> I1106 19:32:24.739157 10426 master.cpp:818] Received registration request
>> from scheduler(1)@127.0.1.1:37122
>> I1106 19:32:24.739328 10426 master.cpp:836] Registering framework
>> 20141106-193147-16842879-5050-10406-0000 at scheduler(1)@127.0.1.1:37122
>> I1106 19:32:24.739753 10426 hierarchical_allocator_process.hpp:332] Added
>> framework 20141106-193147-16842879-5050-10406-0000
>> I1106 19:32:29.647886 10423 http.cpp:391] HTTP request for
>> '/master/state.json'
>>
>>
>> On Thu, Nov 6, 2014 at 6:53 PM, Benjamin Mahler <
>> benjamin.mah...@gmail.com> wrote:
>>
>>> Which version of the master are you using and do you have the logs? The
>>> fact that no offers were coming back sounds like a bug!
>>>
>>> As for using O1 after a disconnection, all offers are invalid once a
>>> disconnection occurs. The scheduler driver does not automatically rescind
>>> offers upon disconnection, so I'd recommend clearing all cached offers when
>>> your scheduler gets disconnected, to avoid the unnecessary TASK_LOST
>>> updates.
>>>
>>> On Thu, Nov 6, 2014 at 6:25 PM, Sharma Podila <spod...@netflix.com>
>>> wrote:
>>>
>>>> We had an interesting problem with resource offers today and I would
>>>> like to confirm this problem and request an enhancement. Here's the summary
>>>> in the right sequence of events:
>>>>
>>>> 1. resource offer O1 for slave A arrives
>>>> 2. mesos disconnects
>>>> 3. mesos reregisters
>>>> 4. mesos offer O2 for slave A arrives
>>>> (our framework keeps offers for sometime if unused, therefore, we
>>>> now have both O1 and O2, incorrectly)
>>>> 5. launch task T1 using offers O1 and O2
>>>> 6. framework thinks it has no offers with it now for slave A, will wait
>>>> for new offer after mesos consumes resources for task T1
>>>> 7. mesos sends TASK_LOST for T1 saying it was using an invalid offer
>>>> (even though only O1 was invalid, O2 is gone missing silently)
>>>> 8. no more offers come for slave A
>>>> 9. basically we have an offer leak problem.
>>>>
>>>> To work around this, I am changing my framework so that when it
>>>> receives mesos reregistration callback (step 3 above), it removes all
>>>> existing offers. This should fix the problem.
>>>>
>>>> However, I am wondering if #7 can be improved in Mesos. When a task is
>>>> (or set of tasks are) launched using multiple offers, if at least one of
>>>> the offers is invalid, then Mesos should treat all offers as given up by
>>>> the framework. This will send TASK_LOST to the framework, but, also make
>>>> the valid offers available again through new offers.
>>>>
>>>> I am thinking this will be critical to do when Mesos starts rescinding
>>>> offers. Because in that case the frameworks cannot rely on the strategy
>>>> like the one I am using with reregistration.
>>>>
>>>> Sharma
>>>>
>>>>
>>>
>>
>
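
For anyone still on 0.18, the framework-side workaround discussed in this thread (dropping every cached offer on disconnect or re-registration) might look roughly like the sketch below. It is plain Python with no Mesos dependency. OfferCache and its method names are hypothetical, chosen only for illustration; they are not part of the Mesos API or of the framework code discussed here.

```python
from collections import defaultdict


class OfferCache:
    """Hypothetical framework-side cache of unused offers, keyed by slave ID."""

    def __init__(self):
        # slave_id -> {offer_id: offer}
        self._by_slave = defaultdict(dict)

    def add(self, slave_id, offer_id, offer):
        # Remember an offer that was received but not yet used.
        self._by_slave[slave_id][offer_id] = offer

    def rescind(self, offer_id):
        # Drop one offer, for example when the master rescinds it.
        for offers in self._by_slave.values():
            offers.pop(offer_id, None)

    def clear(self):
        # Intended to be called on disconnect and on re-registration:
        # every offer held at that point is treated as invalid.
        self._by_slave.clear()

    def take(self, slave_id):
        # Remove and return the IDs of all offers held for one slave,
        # so they can be passed together to a single launch.
        return sorted(self._by_slave.pop(slave_id, {}))


# Replaying the sequence from the thread (assumed offer contents):
cache = OfferCache()
cache.add("slave-A", "O1", {"cpus": 8})  # offer O1 arrives
cache.clear()                            # master restarts, framework re-registers
cache.add("slave-A", "O2", {"cpus": 8})  # offer O2 arrives
launch_with = cache.take("slave-A")      # only O2 is left to launch with
```

With the clear() call in place, the launch uses only the post-re-registration offer, so the stale O1 never reaches the master alongside O2.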