[jira] [Created] (MESOS-10031) Agent's 'executorTerminated()' can cause double task status update
Greg Mann created MESOS-10031:
---------------------------------

Summary: Agent's 'executorTerminated()' can cause double task status update
Key: MESOS-10031
URL: https://issues.apache.org/jira/browse/MESOS-10031
Project: Mesos
Issue Type: Bug
Affects Versions: 1.9.0
Reporter: Greg Mann
Assignee: Greg Mann

When the agent first receives a task status update from an executor, it executes {{Slave::statusUpdate()}}, which adds the task ID to the {{Executor::pendingStatusUpdates}} map, but leaves the ID in {{Executor::launchedTasks}}. Meanwhile, the code in {{Slave::executorTerminated()}} is not capable of handling the intermediate task state which exists in between the execution of {{Slave::statusUpdate()}} and {{Slave::_statusUpdate()}}. If {{Slave::executorTerminated()}} executes at that point in time, it's possible that the task will be transitioned to a terminal state twice (for example, it could be transitioned to TASK_FINISHED by the executor, then to TASK_FAILED by the agent if the executor suddenly terminates).

If the agent has already received a status update from an executor, that state transition should be honored even if the executor terminates immediately after it's sent. We should ensure that {{Slave::executorTerminated()}} cannot cause a valid update received from an executor to be ignored.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
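One way to picture the intermediate state described in MESOS-10031 is the toy model below. It is only a sketch under stated assumptions: `ExecutorModel`, `TaskState`, `isTerminal()` and `tasksNeedingSyntheticUpdate()` are hypothetical names invented for illustration and are not the agent's actual types or the fix that was applied. The idea shown is that, when an executor terminates, a task which already has a terminal update waiting in the pending map should not receive a second, agent-generated terminal update.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, heavily reduced set of task states (assumption).
enum class TaskState { RUNNING, FINISHED, FAILED };

bool isTerminal(TaskState s) {
  return s == TaskState::FINISHED || s == TaskState::FAILED;
}

// Toy stand-in for the agent's per-executor bookkeeping (assumption).
struct ExecutorModel {
  // Tasks still considered launched, as in the issue description.
  std::set<std::string> launchedTasks;
  // Updates received from the executor but not yet fully processed.
  std::map<std::string, std::vector<TaskState>> pendingStatusUpdates;
};

// Returns the IDs of launched tasks for which the agent would still need
// to generate its own terminal update when the executor terminates.
// Tasks with a pending terminal update from the executor are skipped,
// so the executor's update is the only terminal transition.
std::vector<std::string> tasksNeedingSyntheticUpdate(const ExecutorModel& e) {
  std::vector<std::string> result;
  for (const std::string& id : e.launchedTasks) {
    bool pendingTerminal = false;
    auto it = e.pendingStatusUpdates.find(id);
    if (it != e.pendingStatusUpdates.end()) {
      for (TaskState s : it->second) {
        if (isTerminal(s)) {
          pendingTerminal = true;
          break;
        }
      }
    }
    if (!pendingTerminal) {
      result.push_back(id);
    }
  }
  return result;
}
```

In this model, a task with a pending TASK_FINISHED is left alone, while a task whose only pending update is non-terminal still gets a terminal update generated on its behalf.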
[jira] [Assigned] (MESOS-9497) Parallel reads for master v1 state calls
[ https://issues.apache.org/jira/browse/MESOS-9497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler reassigned MESOS-9497:
--------------------------------------

Assignee: Meng Zhu

> Parallel reads for master v1 state calls
> ----------------------------------------
>
> Key: MESOS-9497
> URL: https://issues.apache.org/jira/browse/MESOS-9497
> Project: Mesos
> Issue Type: Improvement
> Components: HTTP API, master
> Reporter: Greg Mann
> Assignee: Meng Zhu
> Priority: Major
> Labels: foundations, mesosphere, performance
>
> Similar to MESOS-9158 - we should make the operator API calls which serve
> master state perform computation of multiple such responses in parallel to
> reduce the performance impact on the master actor.
>
> Note that this includes the initial expensive SUBSCRIBE payload for the event
> streaming API, which is less straightforward to incorporate into the parallel
> serving logic since it performs writes (to track the subscriber) and produces
> an infinite response, unlike the other state related calls.
[jira] [Assigned] (MESOS-10026) Improve v1 operator API read performance.
[ https://issues.apache.org/jira/browse/MESOS-10026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler reassigned MESOS-10026:
---------------------------------------

Assignee: Benjamin Mahler

> Improve v1 operator API read performance.
> -----------------------------------------
>
> Key: MESOS-10026
> URL: https://issues.apache.org/jira/browse/MESOS-10026
> Project: Mesos
> Issue Type: Improvement
> Components: HTTP API
> Reporter: Benjamin Mahler
> Assignee: Benjamin Mahler
> Priority: Major
> Labels: foundations
>
> Currently, the v1 operator API has poor performance relative to the v0 json
> API. The following initial numbers were provided by [~Will Mahler] from our
> state serving benchmark:
>
> |OPTIMIZED - Master (baseline)| | | | |
> |Test setup|1000 agents with a total of 1 running tasks and 1 completed tasks|1 agents with a total of 10 running tasks and 10 completed tasks|2 agents with a total of 20 running tasks and 20 completed tasks|4 agents with a total of 40 running tasks and 40 completed tasks|
> |v0 'state' response|0.17|1.66|8.96|12.42|
> |v1 x-protobuf|0.35|3.21|9.47|19.09|
> |v1 json|0.45|4.72|10.81|31.43|
>
> There is quite a lot of variance, but v1 protobuf is consistently slower than
> v0 (sometimes significantly so) and v1 json is consistently slower than v1
> protobuf (sometimes significantly so).
>
> The reason that the v1 operator API is slower is that it does the following:
>
> (1) Construct a temporary unversioned state response object by copying
> in-memory unversioned state into an overall response object. (expensive!)
> (2) Evolve it to v1: serialize, de-serialize into a v1 overall state object.
> (expensive!)
> (3) Serialize the overall v1 state object to protobuf or json.
> (4) Destruct the temporaries (expensive! but is done after the response
> starts serving)
>
> On the other hand, the v0 jsonify approach does the following:
>
> (1) Serialize the in-memory unversioned state into json, by traversing state
> and accumulating the overall serialized json.
>
> This means that v1 has substantial overhead vs v0, and we need to remove it
> to bring v1 on par with or better than v0. v1 should serialize directly to
> json (straightforward with jsonify) or protobuf (this can be done via an
> io::CodedOutputStream).
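The single-pass approach described for v0 can be illustrated with a small, self-contained sketch. It does not use jsonify or protobuf: `Task`, `Agent`, `writeState()` and `serialize()` are hypothetical names made up for this example, and the strings are assumed not to need JSON escaping. The point is only that the output is accumulated while traversing the in-memory state once, with no intermediate response object being copied, converted and destructed.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical in-memory state (assumption; not the master's real types).
struct Task {
  std::string id;
  std::string state;
};

struct Agent {
  std::string id;
  std::vector<Task> tasks;
};

// Writes JSON straight from the in-memory state in one traversal.
// Assumes ids and states contain no characters that require escaping.
void writeState(std::ostream& out, const std::vector<Agent>& agents) {
  out << "{\"agents\":[";
  for (std::size_t i = 0; i < agents.size(); ++i) {
    if (i > 0) out << ',';
    out << "{\"id\":\"" << agents[i].id << "\",\"tasks\":[";
    const std::vector<Task>& tasks = agents[i].tasks;
    for (std::size_t j = 0; j < tasks.size(); ++j) {
      if (j > 0) out << ',';
      out << "{\"id\":\"" << tasks[j].id << "\",\"state\":\""
          << tasks[j].state << "\"}";
    }
    out << "]}";
  }
  out << "]}";
}

// Convenience wrapper that collects the output into a string.
std::string serialize(const std::vector<Agent>& agents) {
  std::ostringstream out;
  writeState(out, agents);
  return out.str();
}
```

Because `writeState()` takes the state by const reference and streams as it goes, the only allocation proportional to the response size is the output buffer itself.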
[jira] [Assigned] (MESOS-10023) Allocator method dispatches can be reordered (relative to scheduler API calls which triggered them).
[ https://issues.apache.org/jira/browse/MESOS-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler reassigned MESOS-10023:
---------------------------------------

Assignee: Andrei Sekretenko

> Allocator method dispatches can be reordered (relative to scheduler API calls
> which triggered them).
> -----------------------------------------------------------------------------
>
> Key: MESOS-10023
> URL: https://issues.apache.org/jira/browse/MESOS-10023
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 1.9.0
> Reporter: Andrei Sekretenko
> Assignee: Andrei Sekretenko
> Priority: Major
> Labels: foundations
>
> Observed an example of such reordering on a testing cluster with a V1
> framework.
>
> Framework side:
> - framework issues ACCEPT for a slave with no operations and a 365+ day
> filter
> - framework issues REVIVE call for all roles (which should clear all filters)
> - framework waits for an offer for that slave and never receives it
>
> Master side:
> - master receives ACCEPT, processes the first part and starts authorization
> - master receives REVIVE and dispatches reviveOffers() to the allocator
> - master receives a response from authorizer (for ACCEPT) and dispatches
> recoverResources() with a 365-day filter to the allocator
>
> *We need to provide an ability for the framework to avoid this kind of
> reordering.*
>
> Things to consider:
> - v1 frameworks are not required to use a single connection for API requests;
> even if they were, there still is a reconnection case, during which the views
> of the framework and the master on the state of connection might differ. This
> means that we cannot completely avoid this problem by sequencing processing
> of requests from the same connection.
> - Currently, all calls directly influencing the allocator (except for
> UPDATE_FRAMEWORK) return '202 ACCEPTED' at an early stage of processing.
> _Unconditionally_ changing this might break compatibility with some existing
> frameworks.
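The effect of the reordering in MESOS-10023 can be reproduced with a toy model. Everything here is an assumption made for illustration: `AllocatorModel`, `Call` and `agentStillFiltered()` are hypothetical names, and the real allocator's filters carry durations and resources rather than a plain set of agent IDs. The sketch only shows that the final filter state depends on the order in which the two dispatches reach the allocator.

```cpp
#include <set>
#include <string>
#include <vector>

// Toy allocator that only tracks which agents are currently filtered.
struct AllocatorModel {
  std::set<std::string> filteredAgents;

  // Models recoverResources() carrying a long-lived filter for one agent.
  void recoverResources(const std::string& agent) {
    filteredAgents.insert(agent);
  }

  // Models reviveOffers(): clears every filter.
  void reviveOffers() { filteredAgents.clear(); }
};

// The two scheduler calls from the report, as seen by the allocator.
enum class Call { ACCEPT_WITH_FILTER, REVIVE };

// Applies the dispatches in the given order and reports whether the
// agent is still filtered afterwards.
bool agentStillFiltered(const std::vector<Call>& dispatchOrder,
                        const std::string& agent) {
  AllocatorModel allocator;
  for (Call call : dispatchOrder) {
    if (call == Call::ACCEPT_WITH_FILTER) {
      allocator.recoverResources(agent);
    } else {
      allocator.reviveOffers();
    }
  }
  return allocator.filteredAgents.count(agent) > 0;
}
```

With the order the framework intended (ACCEPT, then REVIVE) the agent ends up unfiltered; with the order observed on the master side (REVIVE dispatched first because ACCEPT was waiting on authorization) the filter is installed after the revive and the agent stays filtered.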
[jira] [Assigned] (MESOS-9992) Add end-to-end test exercising re-reservation operator API
[ https://issues.apache.org/jira/browse/MESOS-9992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benno Evers reassigned MESOS-9992:
----------------------------------

Assignee: Benno Evers

> Add end-to-end test exercising re-reservation operator API
> ----------------------------------------------------------
>
> Key: MESOS-9992
> URL: https://issues.apache.org/jira/browse/MESOS-9992
> Project: Mesos
> Issue Type: Task
> Reporter: Benjamin Bannier
> Assignee: Benno Evers
> Priority: Major
[jira] [Assigned] (MESOS-9993) Update operator API documentation for re-reservations
[ https://issues.apache.org/jira/browse/MESOS-9993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Bannier reassigned MESOS-9993:
---------------------------------------

Assignee: Benjamin Bannier

> Update operator API documentation for re-reservations
> -----------------------------------------------------
>
> Key: MESOS-9993
> URL: https://issues.apache.org/jira/browse/MESOS-9993
> Project: Mesos
> Issue Type: Task
> Reporter: Benjamin Bannier
> Assignee: Benjamin Bannier
> Priority: Major