[jira] [Created] (MESOS-9847) Docker executor doesn't wait for status updates to be ack'd before shutting down.

2019-06-13 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9847:
---

 Summary: Docker executor doesn't wait for status updates to be 
ack'd before shutting down.
 Key: MESOS-9847
 URL: https://issues.apache.org/jira/browse/MESOS-9847
 Project: Mesos
  Issue Type: Bug
  Components: executor
Reporter: Meng Zhu


The docker executor doesn't wait for pending status updates to be acknowledged 
before shutting down. Instead, it sleeps for one second and then terminates:

{noformat}
  void _stop()
  {
    // A hack for now ... but we need to wait until the status update
    // is sent to the slave before we shut ourselves down.
    // TODO(tnachen): Remove this hack and also the same hack in the
    // command executor when we have the new HTTP APIs to wait until
    // an ack.
    os::sleep(Seconds(1));
    driver.get()->stop();
  }
{noformat}

This results in a race between the task status update (e.g. TASK_FINISHED) and 
the executor exit. The latter leads the agent to generate a `TASK_FAILED` status 
update by itself, resulting in the confusing case where the agent handles two 
different terminal status updates.
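
As a rough illustration of what waiting for acknowledgements (rather than a 
fixed one-second sleep) could look like, here is a minimal, self-contained 
sketch. `PendingUpdates` and its methods are hypothetical names invented for 
this example and are not part of the Mesos executor code; a bounded timeout is 
assumed so that shutdown cannot hang forever if an ack never arrives.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <set>
#include <string>

// Hypothetical helper (not a Mesos class): tracks status updates that have
// been sent but not yet acknowledged.
class PendingUpdates
{
public:
  // Record that an update with the given UUID was sent.
  void sent(const std::string& uuid)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    pending_.insert(uuid);
  }

  // Record an acknowledgement and wake any thread waiting in awaitDrained().
  void acknowledged(const std::string& uuid)
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      pending_.erase(uuid);
    }
    drained_.notify_all();
  }

  // Block until every sent update has been acknowledged or the timeout
  // elapses. Returns true if nothing is pending, false on timeout.
  bool awaitDrained(std::chrono::milliseconds timeout)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    return drained_.wait_for(
        lock, timeout, [this] { return pending_.empty(); });
  }

private:
  std::mutex mutex_;
  std::condition_variable drained_;
  std::set<std::string> pending_;
};
```

Under these assumptions, an executor would call `sent()` when forwarding an 
update, `acknowledged()` when the ack arrives, and `awaitDrained()` in place of 
the sleep before stopping the driver.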



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9846) Update UI for agent draining

2019-06-13 Thread Greg Mann (JIRA)
Greg Mann created MESOS-9846:


 Summary: Update UI for agent draining
 Key: MESOS-9846
 URL: https://issues.apache.org/jira/browse/MESOS-9846
 Project: Mesos
  Issue Type: Task
  Components: webui
Reporter: Greg Mann


We should expose the new agent metadata in the web UI:
* Drain info
* Deactivation state

It may also be worth exposing unreachable and gone agents in some way, so that 
agents do not simply disappear from the UI when they transition to unreachable 
and/or gone, during or after maintenance.





[jira] [Created] (MESOS-9845) Add docs for automatic agent draining

2019-06-13 Thread Greg Mann (JIRA)
Greg Mann created MESOS-9845:


 Summary: Add docs for automatic agent draining
 Key: MESOS-9845
 URL: https://issues.apache.org/jira/browse/MESOS-9845
 Project: Mesos
  Issue Type: Task
  Components: documentation
Reporter: Greg Mann








[jira] [Comment Edited] (MESOS-9763) Race between two re-subscriptions against an empty master.

2019-06-13 Thread Andrei Sekretenko (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863008#comment-16863008
 ] 

Andrei Sekretenko edited comment on MESOS-9763 at 6/13/19 12:19 PM:


In [https://reviews.apache.org/r/70668] the validation of the new FrameworkInfo 
against the current one was moved into the `_subscribe()` continuation (which 
also applies the update). This fixes the race.

No deterministic test against this race has been implemented yet, though.


was (Author: asekretenko):
In [https://reviews.apache.org/r/70668] the validation of the new FrameworkInfo 
against the current one was moved into the `_subscribe()` continuation (which 
also performs applying the update).  This fixes the race.

No deterministic test against this race has been implemened yet, though.

> Race between two re-subscriptions against an empty master.
> --
>
> Key: MESOS-9763
> URL: https://issues.apache.org/jira/browse/MESOS-9763
> Project: Mesos
>  Issue Type: Bug
>  Components: master, scheduler api
>Reporter: Andrei Sekretenko
>Priority: Major
>  Labels: foundations
>
> Currently, subscription (and re-subscription) is not atomic.
> It consists of three steps performed by two actors:
>   - Validating the supplied FrameworkInfo against the master state (which 
> possibly includes an existing FrameworkInfo)
>   - Authorizing the (re-)subscribing framework
>   - Applying the update
> A partitioned or buggy (or both) framework can trigger a race by sending two 
> SUBSCRIBE calls with differing FrameworkInfos on master failover.
> One of the possible sequences of events:
>  1. FrameworkInfo A is validated by the master (which has no data about this 
> framework)
>  2. Conflicting FrameworkInfo B is validated by the master (which still stores 
> no data about this framework, as Scheduler A is not even authorized yet)
>  3. Scheduler A is authorized
>  4. Scheduler B is authorized
>  5. FrameworkInfo A is applied
>  6. The master attempts to apply FrameworkInfo B, which is no longer valid 
> after the previous step.
> One simple example is an attempt to re-subscribe with two different 
> principals: currently scheduler B's principal will be silently ignored at 
> step 6 (instead of a validation error being sent to B).
> At the time of writing, I'm not sure whether there are other problems caused 
> by this race.
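
The sequence above, and the fix described in the comment (validating inside the 
continuation that also applies the update), can be sketched roughly as follows. 
This is a deliberately simplified, single-threaded model: `FrameworkInfo`, 
`Master`, `validate()` and `subscribeContinuation()` here are hypothetical 
stand-ins invented for the example, not the actual Mesos types or signatures, 
and authorization is omitted.

```cpp
#include <map>
#include <string>

// Hypothetical, simplified stand-in for the real FrameworkInfo.
struct FrameworkInfo
{
  std::string id;
  std::string principal;
};

// Hypothetical, simplified stand-in for the master's framework registry.
class Master
{
public:
  // A (re-)subscription is valid if the framework is unknown or if it
  // keeps the principal that is already stored.
  bool validate(const FrameworkInfo& info) const
  {
    std::map<std::string, FrameworkInfo>::const_iterator it =
      frameworks_.find(info.id);
    return it == frameworks_.end() ||
           it->second.principal == info.principal;
  }

  // Continuation that runs after authorization. Validating here, right
  // before applying, leaves no window in which another subscription can
  // change the master state between the two steps.
  bool subscribeContinuation(const FrameworkInfo& info)
  {
    if (!validate(info)) {
      return false; // The caller would send a validation error.
    }
    frameworks_[info.id] = info;
    return true;
  }

private:
  std::map<std::string, FrameworkInfo> frameworks_;
};
```

In this model, validating A and B early against an empty master accepts both, 
which corresponds to steps 1 and 2. Validating again inside the continuation 
accepts A and then rejects B, instead of silently ignoring B's principal.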





[jira] [Commented] (MESOS-9763) Race between two re-subscriptions against an empty master.

2019-06-13 Thread Andrei Sekretenko (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863008#comment-16863008
 ] 

Andrei Sekretenko commented on MESOS-9763:
--

In [https://reviews.apache.org/r/70668] the validation of the new FrameworkInfo 
against the current one was moved into the `_subscribe()` continuation (which 
also applies the update). This fixes the race.

No deterministic test against this race has been implemented yet, though.



