[jira] [Commented] (MESOS-9542) Hierarchical allocator check failure when an operation on a shutdown framework finishes

2019-02-26 Thread Benjamin Bannier (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16778553#comment-16778553
 ] 

Benjamin Bannier commented on MESOS-9542:
-

Note that while this crash is disruptive, after the master is restarted it will 
reconcile with the agent in question just fine and correctly reflect its state.

> Hierarchical allocator check failure when an operation on a shutdown 
> framework finishes
> ---
>
>                 Key: MESOS-9542
>                 URL: https://issues.apache.org/jira/browse/MESOS-9542
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.7.0, 1.7.1, 1.8.0
>            Reporter: Benjamin Bannier
>            Assignee: Joseph Wu
>            Priority: Blocker
>              Labels: foundations, mesosphere, mesosphere-dss-ga, operation-feedback
>             Fix For: 1.8.0
>
>
> When a non-speculated operation such as {{CREATE_DISK}} becomes terminal
> after the originating framework was torn down, we run into an assertion
> failure in the allocator.
> {noformat}
> I0129 11:55:35.764394 57857 master.cpp:11373] Updating the state of operation 'operation' (uuid: 10a782bd-9e60-42da-90d6-c00997a25645) for framework a4d0499b-c0d3-4abf-8458-73e595d061ce- (latest state: OPERATION_PENDING, status update state: OPERATION_FINISHED)
> F0129 11:55:35.764744 57925 hierarchical.cpp:834] Check failed: frameworks.contains(frameworkId)
> {noformat}
> With non-speculated operations such as {{CREATE_DISK}}, it became possible
> for operations to outlive their originating framework. This was not possible
> with speculated operations like {{RESERVE}}, which were always applied
> immediately by the master.
> The master does not take this into account, but instead unconditionally calls
> {{Allocator::updateAllocation}}, which asserts that the framework is still
> known to the allocator.
> Reproducer:
>  * register a framework with the master.
>  * add an agent with a resource provider.
>  * let the framework trigger a non-speculated operation like {{CREATE_DISK}}.
>  * tear down the framework before a terminal operation status update reaches
> the master; this causes the master to, among other things, remove the
> framework from the allocator.
>  * let a terminal, successful operation status update reach the master.
> To solve this we should clean up the lifetimes of operations. Since operations
> can outlive their framework (unlike, e.g., tasks), we probably need a
> different approach here.
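
The failure mode described above can be illustrated with a small,
self-contained C++ model. This is only a sketch under stated assumptions: the
names ToyAllocator and onTerminalOperationUpdate are hypothetical and are not
part of Mesos, and the guard shown is one possible way to avoid the check
failure, not necessarily the approach taken in the reviews. It also does not
model how the resources changed by the operation would be accounted for once
the framework is gone.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for the hierarchical allocator; NOT the Mesos class.
class ToyAllocator {
public:
  void addFramework(const std::string& id) { frameworks_.insert(id); }
  void removeFramework(const std::string& id) { frameworks_.erase(id); }

  bool knows(const std::string& id) const {
    return frameworks_.count(id) > 0;
  }

  // Models the precondition seen in the log: the framework must be known.
  void updateAllocation(const std::string& id) {
    assert(knows(id) && "frameworks.contains(frameworkId)");
    ++updates_;
  }

  int updates() const { return updates_; }

private:
  std::set<std::string> frameworks_;
  int updates_ = 0;
};

// Hypothetical master-side handler for a terminal operation status update.
// Returns true if the allocator was updated, and false if the update was
// skipped because the originating framework had already been removed.
bool onTerminalOperationUpdate(
    ToyAllocator& allocator, const std::string& frameworkId) {
  if (!allocator.knows(frameworkId)) {
    // The framework was torn down while the operation was pending, so
    // calling updateAllocation here would trip the assertion.
    return false;
  }

  allocator.updateAllocation(frameworkId);
  return true;
}
```

Following the reproducer, calling onTerminalOperationUpdate after
removeFramework returns false and leaves the allocator untouched, whereas an
unconditional call to updateAllocation at that point would abort on the
assertion.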



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9542) Hierarchical allocator check failure when an operation on a shutdown framework finishes

2019-02-12 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1676#comment-1676
 ] 

Joseph Wu commented on MESOS-9542:
--

Still in progress, but some reviews are up starting at: 
https://reviews.apache.org/r/69960/
