[jira] [Created] (MESOS-10031) Agent's 'executorTerminated()' can cause double task status update

2019-11-06 Thread Greg Mann (Jira)
Greg Mann created MESOS-10031:
---------------------------------

         Summary: Agent's 'executorTerminated()' can cause double task status update
             Key: MESOS-10031
             URL: https://issues.apache.org/jira/browse/MESOS-10031
         Project: Mesos
      Issue Type: Bug
Affects Versions: 1.9.0
        Reporter: Greg Mann
        Assignee: Greg Mann


When the agent first receives a task status update from an executor, it 
executes {{Slave::statusUpdate()}}, which adds the task ID to the 
{{Executor::pendingStatusUpdates}} map, but leaves the ID in 
{{Executor::launchedTasks}}.

Meanwhile, the code in {{Slave::executorTerminated()}} is not capable of 
handling the intermediate task state which exists in between the execution of 
{{Slave::statusUpdate()}} and {{Slave::_statusUpdate()}}. If 
{{Slave::executorTerminated()}} executes at that point in time, it's possible 
that the task will be transitioned to a terminal state twice (for example, it 
could be transitioned to TASK_FINISHED by the executor, then to TASK_FAILED by 
the agent if the executor suddenly terminates).

If the agent has already received a status update from an executor, that state 
transition should be honored even if the executor terminates immediately after 
it's sent. We should ensure that {{Slave::executorTerminated()}} cannot cause a 
valid update received from an executor to be ignored.
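
The intermediate state described above can be illustrated with a minimal sketch. This is a simplified model, not the agent's actual code: `ExecutorModel`, `TaskState`, `isTerminal()` and `tasksToFailOnTermination()` are hypothetical names, and the pending-update map is reduced to one state per task. It only shows one way termination handling could honor a terminal update that has been received but not yet fully processed.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

enum class TaskState { RUNNING, FINISHED, FAILED };

inline bool isTerminal(TaskState s) {
  return s == TaskState::FINISHED || s == TaskState::FAILED;
}

// Hypothetical, simplified stand-in for the agent's per-executor bookkeeping.
struct ExecutorModel {
  std::set<std::string> launchedTasks;
  std::map<std::string, TaskState> pendingStatusUpdates;
};

// Models the first half of update handling: the update is recorded as
// pending, while the task ID stays in launchedTasks.
void statusUpdate(ExecutorModel& e, const std::string& taskId, TaskState s) {
  e.pendingStatusUpdates[taskId] = s;
}

// Returns the launched tasks that termination handling would still need to
// transition to a terminal state. Tasks with a pending terminal update are
// skipped so the executor's own update is not followed by a second one.
std::vector<std::string> tasksToFailOnTermination(const ExecutorModel& e) {
  std::vector<std::string> result;
  for (const std::string& id : e.launchedTasks) {
    auto it = e.pendingStatusUpdates.find(id);
    if (it != e.pendingStatusUpdates.end() && isTerminal(it->second)) {
      continue;
    }
    result.push_back(id);
  }
  return result;
}

// Usage: t1 has already reported a terminal state, t2 has reported nothing.
void example() {
  ExecutorModel e;
  e.launchedTasks = {"t1", "t2"};
  statusUpdate(e, "t1", TaskState::FINISHED);
  const std::vector<std::string> toFail = tasksToFailOnTermination(e);
  assert(toFail.size() == 1 && toFail[0] == "t2");
}
```

In this model only t2 would receive an agent-generated terminal update; t1 keeps the state the executor reported.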



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-9497) Parallel reads for master v1 state calls

2019-11-06 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-9497:
--

Assignee: Meng Zhu

> Parallel reads for master v1 state calls
> ----------------------------------------
>
>                 Key: MESOS-9497
>                 URL: https://issues.apache.org/jira/browse/MESOS-9497
>             Project: Mesos
>          Issue Type: Improvement
>          Components: HTTP API, master
>            Reporter: Greg Mann
>            Assignee: Meng Zhu
>            Priority: Major
>              Labels: foundations, mesosphere, performance
>
> Similar to MESOS-9158 - we should make the operator API calls which serve 
> master state perform computation of multiple such responses in parallel to 
> reduce the performance impact on the master actor.
> Note that this includes the initial expensive SUBSCRIBE payload for the event 
> streaming API, which is less straightforward to incorporate into the parallel 
> serving logic since it performs writes (to track the subscriber) and produces 
> an infinite response, unlike the other state related calls.





[jira] [Assigned] (MESOS-10026) Improve v1 operator API read performance.

2019-11-06 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-10026:
---

Assignee: Benjamin Mahler

> Improve v1 operator API read performance.
> -----------------------------------------
>
>                 Key: MESOS-10026
>                 URL: https://issues.apache.org/jira/browse/MESOS-10026
>             Project: Mesos
>          Issue Type: Improvement
>          Components: HTTP API
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>            Priority: Major
>              Labels: foundations
>
> Currently, the v1 operator API has poor performance relative to the v0 json 
> API. The following initial numbers were provided by [~Will Mahler] from our 
> state serving benchmark:
>  
> |OPTIMIZED - Master (baseline)| | | | |
> |Test setup|1000 agents with a total of 1 running tasks and 1 completed tasks|1 agents with a total of 10 running tasks and 10 completed tasks|2 agents with a total of 20 running tasks and 20 completed tasks|4 agents with a total of 40 running tasks and 40 completed tasks|
> |v0 'state' response|0.17|1.66|8.96|12.42|
> |v1 x-protobuf|0.35|3.21|9.47|19.09|
> |v1 json|0.45|4.72|10.81|31.43|
> There is quite a lot of variance, but v1 protobuf is consistently slower than 
> v0 (sometimes significantly so), and v1 json is consistently slower than v1 
> protobuf (sometimes significantly so).
> The reason that the v1 operator API is slower is that it does the following:
> (1) Construct a temporary unversioned state response object by copying the 
> in-memory unversioned state into an overall response object. (expensive!)
> (2) Evolve it to v1: serialize, then de-serialize into the v1 overall state 
> object. (expensive!)
> (3) Serialize the overall v1 state object to protobuf or json.
> (4) Destruct the temporaries (expensive! but this is done after the response 
> starts serving)
> On the other hand, the v0 jsonify approach does the following:
> (1) Serialize the in-memory unversioned state into json, by traversing the 
> state and accumulating the overall serialized json.
> This means that v1 has substantial overhead vs v0, and we need to remove it 
> to bring v1 on par with or better than v0. v1 should serialize directly to 
> json (straightforward with jsonify) or protobuf (this can be done via an 
> io::CodedOutputStream).
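
The difference between the two approaches can be sketched in a few lines. This is a toy model under stated assumptions, not Mesos code: `Task`, `MasterState`, `Response`, `serializeViaTemporary()` and `serializeDirect()` are hypothetical, it uses plain `std::ostringstream` rather than jsonify or protobuf, and it assumes IDs and state names need no JSON escaping. Both functions produce the same bytes; only the first pays for building and destroying a temporary copy of the state.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

struct Task {
  std::string id;
  std::string state;
};

// Hypothetical stand-in for the in-memory state.
struct MasterState {
  std::vector<Task> tasks;
};

// Hypothetical stand-in for the temporary response object.
struct Response {
  std::vector<Task> tasks;
};

// Writes the tasks as a JSON array. Assumes no escaping is needed.
void writeTasks(std::ostream& out, const std::vector<Task>& tasks) {
  out << "[";
  bool first = true;
  for (const Task& t : tasks) {
    if (!first) {
      out << ",";
    }
    first = false;
    out << "{\"id\":\"" << t.id << "\",\"state\":\"" << t.state << "\"}";
  }
  out << "]";
}

// Copy-then-serialize: builds a temporary response object first.
std::string serializeViaTemporary(const MasterState& s) {
  Response r;
  r.tasks = s.tasks;  // Deep copy of every task.
  std::ostringstream out;
  writeTasks(out, r.tasks);
  return out.str();  // The temporary copy is destroyed on return.
}

// Direct serialization: traverses the in-memory state without copying it.
std::string serializeDirect(const MasterState& s) {
  std::ostringstream out;
  writeTasks(out, s.tasks);
  return out.str();
}

// Usage: both paths yield identical output.
void example() {
  MasterState s;
  s.tasks = {{"t1", "TASK_RUNNING"}, {"t2", "TASK_FINISHED"}};
  assert(serializeDirect(s) == serializeViaTemporary(s));
}
```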





[jira] [Assigned] (MESOS-10023) Allocator method dispatches can be reordered (relative to scheduler API calls which triggered them).

2019-11-06 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-10023:
---

Assignee: Andrei Sekretenko

> Allocator method dispatches can be reordered (relative to scheduler API calls 
> which triggered them).
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-10023
>                 URL: https://issues.apache.org/jira/browse/MESOS-10023
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.9.0
>            Reporter: Andrei Sekretenko
>            Assignee: Andrei Sekretenko
>            Priority: Major
>              Labels: foundations
>
> Observed an example of such reordering on a testing cluster with a V1 
> framework.
> Framework side:
>  - framework issues ACCEPT for a slave with no operations and a 365+ days 
> filter 
>  - framework issues REVIVE call for all roles (which should clear all filters)
>  - framework waits for an offer for that slave and never receives it
> Master side:
>  - master receives ACCEPT, processes the first part and starts authorization
>  - master receives REVIVE and dispatches reviveOffers() to the allocator
>  - master receives a response from authorizer (for ACCEPT) and dispatches 
> recoverResources() with a 365-day filter to the allocator
> *We need to give frameworks a way to avoid this kind of reordering.*
> Things to consider:
>  - v1 frameworks are not required to use a single connection for API requests; 
> even if they were, there is still the reconnection case, during which the 
> framework's and the master's views of the connection state might differ. This 
> means that we cannot completely avoid this problem by sequencing the 
> processing of requests from the same connection.
>  - Currently, all calls directly influencing the allocator (except for 
> UPDATE_FRAMEWORK) return '202 ACCEPTED' at an early stage of processing. 
> _Unconditionally_ changing this might break compatibility with some existing 
> frameworks.
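
The effect of the reordering described above can be reproduced with a toy model. This is a sketch under stated assumptions, not the real allocator: `AllocatorModel` and `filteredAfter()` are hypothetical, and `recoverResources()` and `reviveOffers()` are reduced to their effect on a per-agent filter. It shows why the order in which the two dispatches reach the allocator decides whether the framework ever sees an offer for that agent again.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, heavily simplified allocator: it only tracks offer filters.
struct AllocatorModel {
  std::map<std::string, int> filterDays;  // Agent ID -> filter duration.

  // Models recoverResources() with a filter: installs a filter for the agent.
  void recoverResources(const std::string& agent, int days) {
    if (days > 0) {
      filterDays[agent] = days;
    }
  }

  // Models reviveOffers() for all roles: clears every filter.
  void reviveOffers() { filterDays.clear(); }

  bool isFiltered(const std::string& agent) const {
    return filterDays.count(agent) > 0;
  }
};

// Applies the two dispatches in the given order and reports whether the
// agent is still filtered afterwards.
bool filteredAfter(bool reviveFirst) {
  AllocatorModel allocator;
  if (reviveFirst) {
    // Reordered case: the ACCEPT's dispatch is delayed by authorization.
    allocator.reviveOffers();
    allocator.recoverResources("agent1", 365);
  } else {
    // Order in which the framework issued the calls.
    allocator.recoverResources("agent1", 365);
    allocator.reviveOffers();
  }
  return allocator.isFiltered("agent1");
}

// Usage: only the reordered case leaves the long filter in place.
void example() {
  assert(filteredAfter(true));
  assert(!filteredAfter(false));
}
```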





[jira] [Assigned] (MESOS-9992) Add end-to-end test exercising re-reservation operator API

2019-11-06 Thread Benno Evers (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9992:
--

Assignee: Benno Evers

> Add end-to-end test exercising re-reservation operator API
> ----------------------------------------------------------
>
>                 Key: MESOS-9992
>                 URL: https://issues.apache.org/jira/browse/MESOS-9992
>             Project: Mesos
>          Issue Type: Task
>            Reporter: Benjamin Bannier
>            Assignee: Benno Evers
>            Priority: Major
>






[jira] [Assigned] (MESOS-9993) Update operator API documentation for re-reservations

2019-11-06 Thread Benjamin Bannier (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier reassigned MESOS-9993:
---

Assignee: Benjamin Bannier

> Update operator API documentation for re-reservations
> -----------------------------------------------------
>
>                 Key: MESOS-9993
>                 URL: https://issues.apache.org/jira/browse/MESOS-9993
>             Project: Mesos
>          Issue Type: Task
>            Reporter: Benjamin Bannier
>            Assignee: Benjamin Bannier
>            Priority: Major
>



