[jira] [Commented] (MESOS-9542) Hierarchical allocator check failure when an operation on a shutdown framework finishes
[ https://issues.apache.org/jira/browse/MESOS-9542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778553#comment-16778553 ]

Benjamin Bannier commented on MESOS-9542:
-----------------------------------------

Note that while this crash is disruptive, once the master is restarted it will reconcile with the agent in question and correctly reflect its state.

> Hierarchical allocator check failure when an operation on a shutdown framework finishes
> ---------------------------------------------------------------------------------------
>
> Key: MESOS-9542
> URL: https://issues.apache.org/jira/browse/MESOS-9542
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.7.0, 1.7.1, 1.8.0
> Reporter: Benjamin Bannier
> Assignee: Joseph Wu
> Priority: Blocker
> Labels: foundations, mesosphere, mesosphere-dss-ga, operation-feedback
> Fix For: 1.8.0
>
>
> When a non-speculated operation such as {{CREATE_DISK}} becomes terminal
> after the originating framework was torn down, we run into an assertion
> failure in the allocator.
> {noformat}
> I0129 11:55:35.764394 57857 master.cpp:11373] Updating the state of operation 'operation' (uuid: 10a782bd-9e60-42da-90d6-c00997a25645) for framework a4d0499b-c0d3-4abf-8458-73e595d061ce- (latest state: OPERATION_PENDING, status update state: OPERATION_FINISHED)
> F0129 11:55:35.764744 57925 hierarchical.cpp:834] Check failed: frameworks.contains(frameworkId)
> {noformat}
> With non-speculated operations such as {{CREATE_DISK}} it became possible
> for operations to outlive their originating framework. This was not possible
> with speculated operations such as {{RESERVE}}, which were always applied
> immediately by the master.
> The master does not take this into account, but instead unconditionally calls
> {{Allocator::updateAllocation}}, which asserts that the framework is still
> known to the allocator.
> Reproducer:
> * register a framework with the master.
> * add an agent with a resource provider.
> * let the framework trigger a non-speculated operation such as {{CREATE_DISK}}.
> * tear down the framework before a terminal operation status update reaches
> the master; this causes the master to, among other things, remove the
> framework from the allocator.
> * let a terminal, successful operation status update reach the master.
>
> To solve this we should clean up the lifetimes of operations. Since operations
> can outlive their framework (unlike e.g., tasks), we probably need a
> different approach here.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
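
The failure mode described in the issue can be illustrated with a small, self-contained C++ model. This is only a sketch under stated assumptions, not Mesos code and not the fix that was eventually applied: `ToyAllocator` and `applyTerminalOperation` are hypothetical names, framework IDs are plain strings, and the allocator's check is modelled as a thrown exception rather than a process abort. It shows an allocator update that requires the framework to be known, and one possible master-side guard that skips the update once the framework has been removed.

```cpp
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the allocator; it tracks only the set of known
// frameworks and how many allocation updates were applied.
class ToyAllocator {
public:
  void addFramework(const std::string& id) { frameworks.insert(id); }
  void removeFramework(const std::string& id) { frameworks.erase(id); }

  bool contains(const std::string& id) const {
    return frameworks.count(id) > 0;
  }

  // Models the precondition from the log line
  // "Check failed: frameworks.contains(frameworkId)".
  // The model throws so the failure can be observed without aborting.
  void updateAllocation(const std::string& frameworkId) {
    if (!contains(frameworkId)) {
      throw std::logic_error("Check failed: frameworks.contains(frameworkId)");
    }
    ++updates;
  }

  int updates = 0;

private:
  std::set<std::string> frameworks;
};

// Hypothetical master-side handler for a terminal operation status update.
// Returns true if the allocator was updated, and false if the update was
// skipped because the originating framework is no longer known.
bool applyTerminalOperation(ToyAllocator& allocator,
                            const std::string& frameworkId) {
  if (!allocator.contains(frameworkId)) {
    return false;  // The operation outlived its framework; skip the update.
  }
  allocator.updateAllocation(frameworkId);
  return true;
}
```

In this model, calling `updateAllocation` directly after `removeFramework` reproduces the failed check, while `applyTerminalOperation` returns `false` instead. Whether skipping the update is enough in the real master depends on how the operation's resources are accounted for, which the sketch deliberately leaves out.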
[jira] [Commented] (MESOS-9542) Hierarchical allocator check failure when an operation on a shutdown framework finishes
[ https://issues.apache.org/jira/browse/MESOS-9542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1676#comment-1676 ]

Joseph Wu commented on MESOS-9542:
----------------------------------

Still in progress, but some reviews are up, starting at: https://reviews.apache.org/r/69960/