[jira] [Created] (MESOS-9847) Docker executor doesn't wait for status updates to be ack'd before shutting down.
Meng Zhu created MESOS-9847:
---

             Summary: Docker executor doesn't wait for status updates to be ack'd before shutting down.
                 Key: MESOS-9847
                 URL: https://issues.apache.org/jira/browse/MESOS-9847
             Project: Mesos
          Issue Type: Bug
          Components: executor
            Reporter: Meng Zhu

The Docker executor doesn't wait for pending status updates to be acknowledged before shutting down; instead, it sleeps for one second and then terminates:

{noformat}
  void _stop()
  {
    // A hack for now ... but we need to wait until the status update
    // is sent to the slave before we shut ourselves down.
    // TODO(tnachen): Remove this hack and also the same hack in the
    // command executor when we have the new HTTP APIs to wait until
    // an ack.
    os::sleep(Seconds(1));
    driver.get()->stop();
  }
{noformat}

This results in a race between the task status update (e.g. TASK_FINISHED) and the executor exit. The latter would lead the agent to generate a `TASK_FAILED` status update by itself, leading to the confusing case where the agent handles two different terminal status updates.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
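A minimal sketch of the alternative the TODO in the quoted code points at: wait until every sent status update has been acknowledged, with a bounded timeout, instead of sleeping for a fixed second. `AckTrackingExecutor` and its methods are hypothetical names invented for illustration; this is standalone standard-library C++, not the Mesos executor API, and it makes no claim about how the issue was actually resolved.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <set>
#include <string>

// Hypothetical stand-in for an executor; not the Mesos executor API.
class AckTrackingExecutor
{
public:
  // Record a status update that has been sent but not yet acknowledged.
  void sendUpdate(const std::string& uuid)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    pending_.insert(uuid);
  }

  // Called when the agent acknowledges the update with this UUID.
  void acknowledge(const std::string& uuid)
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      pending_.erase(uuid);
    }
    allAcked_.notify_all();
  }

  // Block until every sent update is acknowledged or the timeout expires.
  // Returns true if nothing is pending (safe to stop), false on timeout.
  bool waitForAcks(std::chrono::milliseconds timeout)
  {
    assert(timeout.count() >= 0); // A negative timeout is a caller bug.
    std::unique_lock<std::mutex> lock(mutex_);
    return allAcked_.wait_for(
        lock, timeout, [this] { return pending_.empty(); });
  }

private:
  std::mutex mutex_;
  std::condition_variable allAcked_;
  std::set<std::string> pending_;
};
```

In this sketch, a shutdown path would call `waitForAcks()` before stopping the driver and treat a `false` return as "give up after the timeout", so a lost acknowledgement cannot block the exit forever.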
[jira] [Created] (MESOS-9846) Update UI for agent draining
Greg Mann created MESOS-9846:

             Summary: Update UI for agent draining
                 Key: MESOS-9846
                 URL: https://issues.apache.org/jira/browse/MESOS-9846
             Project: Mesos
          Issue Type: Task
          Components: webui
            Reporter: Greg Mann

We should expose the new agent metadata in the web UI:
* Drain info
* Deactivation state

It may also be worth exposing unreachable and gone agents in some way, so that agents do not simply disappear from the UI when they transition to unreachable and/or gone, during or after maintenance.
[jira] [Created] (MESOS-9845) Add docs for automatic agent draining
Greg Mann created MESOS-9845:

             Summary: Add docs for automatic agent draining
                 Key: MESOS-9845
                 URL: https://issues.apache.org/jira/browse/MESOS-9845
             Project: Mesos
          Issue Type: Task
          Components: documentation
            Reporter: Greg Mann
[jira] [Comment Edited] (MESOS-9763) Race between two re-subscriptions against an empty master.
    [ https://issues.apache.org/jira/browse/MESOS-9763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863008#comment-16863008 ]

Andrei Sekretenko edited comment on MESOS-9763 at 6/13/19 12:19 PM:

In [https://reviews.apache.org/r/70668] the validation of the new FrameworkInfo against the current one was moved into the `_subscribe()` continuation (which also performs applying the update).

This fixes the race.

No deterministic test against this race has been implemented yet, though.

was (Author: asekretenko):
In [https://reviews.apache.org/r/70668] the validation of the new FrameworkInfo against the current one was moved into the `_subscribe()` continuation (which also performs applying the update).

This fixes the race.

No deterministic test against this race has been implemened yet, though.

> Race between two re-subscriptions against an empty master.
> --
>
>                 Key: MESOS-9763
>                 URL: https://issues.apache.org/jira/browse/MESOS-9763
>             Project: Mesos
>          Issue Type: Bug
>          Components: master, scheduler api
>            Reporter: Andrei Sekretenko
>            Priority: Major
>              Labels: foundations
>
> Currently, subscription (and re-subscription) is not atomic.
> It consists of three steps performed by two actors:
>  - Validating the supplied FrameworkInfo against the master state (which
> possibly includes an existing FrameworkInfo)
>  - Authorizing the (re-)subscribing framework
>  - Applying the update
> A partitioned or buggy (or both) framework can trigger a race by sending two
> SUBSCRIBE calls with differing FrameworkInfos on master failover.
> One of the possible sequences of events:
>  1. FrameworkInfo A is validated by the master (which has no data about this
> framework)
>  2. Conflicting FrameworkInfo B is validated by the master (which stores no
> data about this framework, as Scheduler A is not even authorized yet)
>  3. Scheduler A is authorized
>  4. Scheduler B is authorized
>  5. FrameworkInfo A is applied
>  6. The master attempts to apply FrameworkInfo B, which is no longer valid
> after the previous step.
> One simple example is an attempt to re-subscribe with two different
> principals: currently Scheduler B's principal will be silently ignored at
> step 6 (instead of a validation error being sent to B).
> At the time of writing I'm not sure if there are other problems caused by
> this race.
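The sequence of events in the description, and the fix mentioned in the comment (validating inside the continuation that also applies the update), can be sketched with a toy model. Everything here is an assumption made for illustration: `ToyMaster`, `subscribe()`, `runPending()` and `storedPrincipal()` are hypothetical names, asynchronous authorization is modeled as a plain queue of deferred continuations, and the only validation rule is "the principal must not change". It is not the Mesos master code.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Hypothetical, simplified model; not the Mesos master code.
struct FrameworkInfo
{
  std::string principal;
};

class ToyMaster
{
public:
  // Start a subscription. Authorization is modeled as asynchronous: the
  // continuation is queued and only runs later, in runPending().
  void subscribe(const FrameworkInfo& info)
  {
    pending_.push([this, info] { return continueSubscribe(info); });
  }

  // Run all queued continuations in order. Returns one result per call:
  // true if the FrameworkInfo was applied, false if it was rejected.
  std::vector<bool> runPending()
  {
    std::vector<bool> results;
    while (!pending_.empty()) {
      results.push_back(pending_.front()());
      pending_.pop();
    }
    return results;
  }

  const std::string& storedPrincipal() const { return stored_.principal; }

private:
  // Validate against the state as it is now and apply in the same step,
  // so no other subscription can run between validation and application.
  bool continueSubscribe(const FrameworkInfo& info)
  {
    assert(!info.principal.empty()); // The sketch assumes a principal is set.
    if (subscribed_ && stored_.principal != info.principal) {
      return false; // Conflicts with what an earlier subscription applied.
    }
    stored_ = info;
    subscribed_ = true;
    return true;
  }

  std::queue<std::function<bool()>> pending_;
  bool subscribed_ = false;
  FrameworkInfo stored_;
};
```

With two conflicting `subscribe()` calls queued before either continuation runs, the second one is rejected explicitly in this model rather than being silently ignored, which is the behavior the description asks for at step 6.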
[jira] [Commented] (MESOS-9763) Race between two re-subscriptions against an empty master.
    [ https://issues.apache.org/jira/browse/MESOS-9763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863008#comment-16863008 ]

Andrei Sekretenko commented on MESOS-9763:
--

In [https://reviews.apache.org/r/70668] the validation of the new FrameworkInfo against the current one was moved into the `_subscribe()` continuation (which also performs applying the update).

This fixes the race.

No deterministic test against this race has been implemented yet, though.