[ 
https://issues.apache.org/jira/browse/MESOS-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-8524:
------------------------------------
    Comment: was deleted

(was: Review: https://reviews.apache.org/r/65506/)

> When `UPDATE_SLAVE` messages are received, offers might not be rescinded due 
> to a race 
> ---------------------------------------------------------------------------------------
>
>                 Key: MESOS-8524
>                 URL: https://issues.apache.org/jira/browse/MESOS-8524
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation, master
>    Affects Versions: 1.5.0
>         Environment: Master + Agent running with enabled 
> {{RESOURCE_PROVIDER}} capability
>            Reporter: Jan Schlicht
>            Assignee: Benjamin Bannier
>            Priority: Major
>              Labels: mesosphere
>
> When an agent with the {{RESOURCE_PROVIDER}} capability enabled (re-)registers 
> with the master, it sends an {{UPDATE_SLAVE}} message after being 
> (re-)registered. In the master, the agent is added (back) to the allocator as 
> soon as it is (re-)registered, i.e. before {{UPDATE_SLAVE}} is sent. This 
> triggers an allocation, and offers might get sent out to frameworks. When 
> {{UPDATE_SLAVE}} is handled in the master, these offers have to be rescinded, 
> as they are based on an outdated agent state.
> Internally, the allocator defers an offer callback in the master 
> ({{Master::offer}}). In rare cases, an {{UPDATE_SLAVE}} message might arrive 
> at the same time, and its handler in the master is called before the offer 
> callback (but after the actual allocation took place). In this case, the 
> (outdated) offer is still sent to frameworks and never rescinded.
> Here are the relevant log lines; this was discovered while working on 
> https://reviews.apache.org/r/65045/:
> {noformat}
> I0201 14:17:47.041093 242208768 hierarchical.cpp:1517] Performed allocation 
> for 1 agents in 704915ns
> I0201 14:17:47.041738 242745344 master.cpp:7235] Received update of agent 
> 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 at slave(540)@172.18.8.20:60469 
> (172.18.8.20) with total oversubscribed resources {}
> I0201 14:17:47.042778 242745344 master.cpp:8808] Sending 1 offers to 
> framework 53c557e7-3161-449b-bacc-a4f8c02e78e7-0000 (default) at 
> scheduler-798f476b-b099-443e-bd3b-9e7333f29672@172.18.8.20:60469
> I0201 14:17:47.043102 243281920 sched.cpp:921] Scheduler::resourceOffers took 
> 40444ns
> I0201 14:17:47.043427 243818496 hierarchical.cpp:712] Grew agent 
> 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 by disk[MOUNT]:200 (total), {  } 
> (used)
> I0201 14:17:47.043643 243818496 hierarchical.cpp:669] Agent 
> 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 (172.18.8.20) updated with total 
> resources disk[MOUNT]:200; cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> {noformat}
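
The ordering described above can be reproduced in a toy model. The following is a minimal sketch and not Mesos code: `ToyMaster`, `Offer`, `agentVersion`, `dropStale`, and `staleOffersSent` are hypothetical names introduced for illustration, and the version check is only an assumed mitigation, not the fix that was applied for this issue. It queues the `UPDATE_SLAVE` handler ahead of the deferred offer callback, as in the log lines, and counts how many offers based on the old agent state still reach the framework.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-ins; none of these are real Mesos types.
struct Offer { int agentVersion; };  // agent state the offer was computed from

struct ToyMaster {
  int agentVersion = 0;            // bumped on every UPDATE_SLAVE
  std::vector<Offer> outstanding;  // offers the master currently tracks
  std::vector<Offer> sent;         // offers delivered to frameworks
  int rescinded = 0;
  bool dropStale = false;          // assumed mitigation, off by default

  // UPDATE_SLAVE handler: can only rescind offers it already knows about.
  void updateSlave() {
    rescinded += static_cast<int>(outstanding.size());
    outstanding.clear();
    ++agentVersion;
  }

  // Deferred offer callback (the role Master::offer plays in the report).
  void offer(const Offer& o) {
    if (dropStale && o.agentVersion != agentVersion) {
      return;  // offer was computed against an older agent state
    }
    outstanding.push_back(o);
    sent.push_back(o);
  }
};

// Replays the interleaving from the report and returns the number of
// offers sent to frameworks that are based on an outdated agent state.
int staleOffersSent(bool dropStale) {
  ToyMaster master;
  master.dropStale = dropStale;
  std::deque<std::function<void()>> mailbox;  // the master's event queue

  // 1. The allocation runs against the current agent state.
  Offer allocated{master.agentVersion};

  // 2. UPDATE_SLAVE is enqueued before the deferred offer callback.
  mailbox.push_back([&] { master.updateSlave(); });
  mailbox.push_back([&] { master.offer(allocated); });

  // 3. The master processes its events in arrival order.
  while (!mailbox.empty()) {
    std::function<void()> event = std::move(mailbox.front());
    mailbox.pop_front();
    event();
  }

  int stale = 0;
  for (const Offer& o : master.sent) {
    if (o.agentVersion != master.agentVersion) ++stale;
  }
  return stale;
}
```

In this model, `staleOffersSent(false)` returns 1: the handler finds nothing to rescind because the offer has not been recorded yet, and the outdated offer goes out afterwards. With the assumed version check enabled, `staleOffersSent(true)` returns 0.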



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
