[jira] [Assigned] (MESOS-8968) Wire `UPDATE_QUOTA` call.
[ https://issues.apache.org/jira/browse/MESOS-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Meng Zhu reassigned MESOS-8968: --- Assignee: Meng Zhu > Wire `UPDATE_QUOTA` call. > - > > Key: MESOS-8968 > URL: https://issues.apache.org/jira/browse/MESOS-8968 > Project: Mesos > Issue Type: Bug >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: Quota, allocator, multitenancy > > Wire the existing master, auth, registrar, and allocator pieces together to > complete the `UPDATE_QUOTA` call. > This would enable the master capability `QUOTA_V2`. > This also fixes the "ignoring zero resource quota" bug in the old quota > implementation, namely: > Currently, Mesos discards resource objects with zero scalar values when parsing > resources. This means a quota set to zero would be ignored and not enforced. > For example, a role with quota set to "cpu:10;mem:10;gpu:0" intends to get no > GPUs. Due to the above issue, the allocator can only see the quota as > "cpu:10;mem:10", and no GPU quota means no guarantee and NO limit. Thus GPUs > may still be allocated to this role. > With the completion of `UPDATE_QUOTA`, which takes a map of names to scalar > values, zero values will no longer be dropped. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
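The "ignoring zero resource quota" bug described above can be sketched roughly as follows (an illustrative Python sketch, not the actual Mesos C++ implementation; the parser function names are hypothetical):

```python
# Hypothetical sketch of the zero-quota bug; not the actual Mesos code.

def parse_resources_old(spec):
    """Old-style parsing: resource objects with a zero scalar are discarded."""
    result = {}
    for item in spec.split(";"):
        name, value = item.split(":")
        if float(value) != 0:  # zero-valued entries are silently dropped here
            result[name] = float(value)
    return result

def parse_quota_config(spec):
    """UPDATE_QUOTA-style config: a map of names to scalars keeps zeros."""
    return {name: float(value)
            for name, value in (item.split(":") for item in spec.split(";"))}

quota = "cpus:10;mem:10;gpus:0"

old = parse_resources_old(quota)   # the 'gpus:0' entry vanishes, so the
                                   # allocator sees no GPU limit at all
new = parse_quota_config(quota)    # the explicit zero survives, so the
                                   # allocator can enforce a GPU limit of 0
```

With the map-based config, "gpus: 0" is an ordinary entry rather than a resource object that gets filtered out during parsing.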
[jira] [Comment Edited] (MESOS-8968) Wire `UPDATE_QUOTA` call.
[ https://issues.apache.org/jira/browse/MESOS-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882508#comment-16882508 ] Meng Zhu edited comment on MESOS-8968 at 7/10/19 11:54 PM: --- {noformat} commit 0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8 (apache/master) Author: Meng Zhu Date: Fri Jul 5 18:05:59 2019 -0700 Implemented `UPDATE_QUOTA` operator call. This patch wires up the master, auth, registrar and allocator pieces for the `UPDATE_QUOTA` call. This enables the master capability `QUOTA_V2`. The capability implies that the quota v2 API is capable of writes (`UPDATE_QUOTA`) and that the master is capable of recovering from V2 quota (`QuotaConfig`) in the registry. This patch lacks the rescind offer logic. When quota limits and guarantees are configured, it might be necessary to rescind offers on the fly to satisfy new guarantees or be constrained by the new limits. A TODO is left and will be tackled in subsequent patches. Also enabled the test `MasterQuotaTest.RecoverQuotaEmptyCluster`. Review: https://reviews.apache.org/r/71021 {noformat} {noformat} commit dcd73437549413790751d1ff127989dbb29bd753 (HEAD -> update_quota, apache/master) Author: Meng Zhu Date: Sun Jul 7 14:27:14 2019 -0700 Added tests for `UPDATE_QUOTA`. These tests reuse the existing tests for the `SET_QUOTA` and `REMOVE_QUOTA` calls. In general, an `UPDATE_QUOTA` request should fail where `SET_QUOTA` fails. When an existing test expects the `SET_QUOTA` call to succeed, we test the `UPDATE_QUOTA` call by first removing the set quota and then sending the `UPDATE_QUOTA` request. Review: https://reviews.apache.org/r/71022 {noformat} was (Author: mzhu): {noformat} commit 0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8 (apache/master) Author: Meng Zhu Date: Fri Jul 5 18:05:59 2019 -0700 Implemented `UPDATE_QUOTA` operator call. This patch wires up the master, auth, registrar and allocator pieces for the `UPDATE_QUOTA` call. This enables the master capability `QUOTA_V2`.
The capability implies that the quota v2 API is capable of writes (`UPDATE_QUOTA`) and that the master is capable of recovering from V2 quota (`QuotaConfig`) in the registry. This patch lacks the rescind offer logic. When quota limits and guarantees are configured, it might be necessary to rescind offers on the fly to satisfy new guarantees or be constrained by the new limits. A TODO is left and will be tackled in subsequent patches. Also enabled the test `MasterQuotaTest.RecoverQuotaEmptyCluster`. Review: https://reviews.apache.org/r/71021 {noformat} > Wire `UPDATE_QUOTA` call. > - > > Key: MESOS-8968 > URL: https://issues.apache.org/jira/browse/MESOS-8968 > Project: Mesos > Issue Type: Bug >Reporter: Meng Zhu >Priority: Major > Labels: Quota, allocator, multitenancy > > Wire the existing master, auth, registrar, and allocator pieces together to > complete the `UPDATE_QUOTA` call. > This would enable the master capability `QUOTA_V2`. > This also fixes the "ignoring zero resource quota" bug in the old quota > implementation, namely: > Currently, Mesos discards resource objects with zero scalar values when parsing > resources. This means a quota set to zero would be ignored and not enforced. > For example, a role with quota set to "cpu:10;mem:10;gpu:0" intends to get no > GPUs. Due to the above issue, the allocator can only see the quota as > "cpu:10;mem:10", and no GPU quota means no guarantee and NO limit. Thus GPUs > may still be allocated to this role. > With the completion of `UPDATE_QUOTA`, which takes a map of names to scalar > values, zero values will no longer be dropped. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9812) Add achievability validation for update quota call.
[ https://issues.apache.org/jira/browse/MESOS-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882512#comment-16882512 ] Meng Zhu commented on MESOS-9812: - This covers the guarantee overcommitment check and the hierarchical guarantees check: {noformat} commit 16f0b0c295960e397e56f6d504b8075cb62e6e4f Author: Meng Zhu Date: Fri Jul 5 15:41:01 2019 -0700 Added overcommit and hierarchical inclusion check for `UPDATE_QUOTA`. The overcommit check validates that the total quota guarantees in the cluster are contained by the cluster capacity. The hierarchical inclusion check validates that the sum of children's guarantees is contained by the parent guarantee. Further validation is needed for: - Check whether a role's limit is less than its current consumption. - Check whether a role's limit is less than its parent's limit. Review: https://reviews.apache.org/r/71020 {noformat} Leaving the ticket open for now for: limits < consumption, and the hierarchical limits invariant. > Add achievability validation for update quota call. > --- > > Key: MESOS-9812 > URL: https://issues.apache.org/jira/browse/MESOS-9812 > Project: Mesos > Issue Type: Improvement >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Major > Labels: resource-management > > Add an overcommit check, hierarchical quota validation and a force flag override > for the update quota call. > Right now, we only have validation for each individual quota config. We need to add > further validation for the update quota call regarding: > 1. Check whether the role's resource limits are already breached. To achieve this, > we need to first rescind offers until the role's allocated resources are below > limits. If, after all rescinds, allocated resources are still above the > requested limits, we will return an error unless the `force` flag is used. > 2. Check whether the aggregated quota guarantees of all roles exceed the cluster > capacity. If so, we will return an error unless the `force` flag is used. > 3. Hierarchical limits validation: > a. Check that a role's limit does not exceed its parent's limit. > b. Check that the sum of children's guarantees does not exceed the parent's > guarantees. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
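The two checks that landed in the commit above can be sketched as follows (hypothetical Python, assuming quotas and capacity are simple name-to-scalar maps; not the actual Mesos validation code):

```python
# Rough sketch of the overcommit and hierarchical inclusion checks.
# Function names and data shapes are assumptions for illustration.

def contained(inner, outer):
    """True if every resource amount in `inner` fits within `outer`."""
    return all(value <= outer.get(name, 0) for name, value in inner.items())

def total(quotas):
    """Sum a collection of name -> scalar quota maps, per resource name."""
    result = {}
    for quota in quotas:
        for name, value in quota.items():
            result[name] = result.get(name, 0) + value
    return result

def validate_overcommit(guarantees_by_role, cluster_capacity):
    """Overcommit check: total guarantees must fit in the cluster capacity."""
    return contained(total(guarantees_by_role.values()), cluster_capacity)

def validate_hierarchy(parent_guarantee, child_guarantees):
    """Hierarchical check: children's summed guarantees fit in the parent's."""
    return contained(total(child_guarantees), parent_guarantee)

capacity = {"cpus": 16, "mem": 1024}
ok = validate_overcommit({"dev": {"cpus": 8}, "prod": {"cpus": 8}}, capacity)
bad = validate_overcommit({"dev": {"cpus": 12}, "prod": {"cpus": 8}}, capacity)
```

In this sketch `ok` passes (8 + 8 cpus fit in 16) while `bad` fails (20 cpus exceed 16), which is the case where the real call would return an error unless `force` is used.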
[jira] [Commented] (MESOS-8968) Wire `UPDATE_QUOTA` call.
[ https://issues.apache.org/jira/browse/MESOS-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882508#comment-16882508 ] Meng Zhu commented on MESOS-8968: - {noformat} commit 0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8 (apache/master) Author: Meng Zhu Date: Fri Jul 5 18:05:59 2019 -0700 Implemented `UPDATE_QUOTA` operator call. This patch wires up the master, auth, registrar and allocator pieces for the `UPDATE_QUOTA` call. This enables the master capability `QUOTA_V2`. The capability implies that the quota v2 API is capable of writes (`UPDATE_QUOTA`) and that the master is capable of recovering from V2 quota (`QuotaConfig`) in the registry. This patch lacks the rescind offer logic. When quota limits and guarantees are configured, it might be necessary to rescind offers on the fly to satisfy new guarantees or be constrained by the new limits. A TODO is left and will be tackled in subsequent patches. Also enabled the test `MasterQuotaTest.RecoverQuotaEmptyCluster`. Review: https://reviews.apache.org/r/71021 {noformat} > Wire `UPDATE_QUOTA` call. > - > > Key: MESOS-8968 > URL: https://issues.apache.org/jira/browse/MESOS-8968 > Project: Mesos > Issue Type: Bug >Reporter: Meng Zhu >Priority: Major > Labels: Quota, allocator, multitenancy > > Wire the existing master, auth, registrar, and allocator pieces together to > complete the `UPDATE_QUOTA` call. > This would enable the master capability `QUOTA_V2`. > This also fixes the "ignoring zero resource quota" bug in the old quota > implementation, namely: > Currently, Mesos discards resource objects with zero scalar values when parsing > resources. This means a quota set to zero would be ignored and not enforced. > For example, a role with quota set to "cpu:10;mem:10;gpu:0" intends to get no > GPUs. Due to the above issue, the allocator can only see the quota as > "cpu:10;mem:10", and no GPU quota means no guarantee and NO limit. Thus GPUs > may still be allocated to this role. > With the completion of `UPDATE_QUOTA`, which takes a map of names to scalar > values, zero values will no longer be dropped. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8968) Wire `UPDATE_QUOTA` call.
[ https://issues.apache.org/jira/browse/MESOS-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882509#comment-16882509 ] Meng Zhu commented on MESOS-8968: - Leaving it open for now, until more tests have landed. > Wire `UPDATE_QUOTA` call. > - > > Key: MESOS-8968 > URL: https://issues.apache.org/jira/browse/MESOS-8968 > Project: Mesos > Issue Type: Bug >Reporter: Meng Zhu >Priority: Major > Labels: Quota, allocator, multitenancy > > Wire the existing master, auth, registrar, and allocator pieces together to > complete the `UPDATE_QUOTA` call. > This would enable the master capability `QUOTA_V2`. > This also fixes the "ignoring zero resource quota" bug in the old quota > implementation, namely: > Currently, Mesos discards resource objects with zero scalar values when parsing > resources. This means a quota set to zero would be ignored and not enforced. > For example, a role with quota set to "cpu:10;mem:10;gpu:0" intends to get no > GPUs. Due to the above issue, the allocator can only see the quota as > "cpu:10;mem:10", and no GPU quota means no guarantee and NO limit. Thus GPUs > may still be allocated to this role. > With the completion of `UPDATE_QUOTA`, which takes a map of names to scalar > values, zero values will no longer be dropped. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8503) Improve UI when displaying frameworks with many roles.
[ https://issues.apache.org/jira/browse/MESOS-8503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-8503: -- Assignee: (was: Armand Grillet) > Improve UI when displaying frameworks with many roles. > -- > > Key: MESOS-8503 > URL: https://issues.apache.org/jira/browse/MESOS-8503 > Project: Mesos > Issue Type: Task >Reporter: Armand Grillet >Priority: Major > Attachments: Screen Shot 2018-01-29 à 10.38.05.png > > > The /frameworks UI endpoint displays all the roles of each framework in a > table: > !Screen Shot 2018-01-29 à 10.38.05.png! > This is not readable if a framework has many roles. We thus need to provide a > solution to only display a few roles per framework and show more when a user > wants to see all of them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9887) Race condition between two terminal task status updates for Docker executor.
Andrei Budnik created MESOS-9887: Summary: Race condition between two terminal task status updates for Docker executor. Key: MESOS-9887 URL: https://issues.apache.org/jira/browse/MESOS-9887 Project: Mesos Issue Type: Bug Components: agent, containerization Reporter: Andrei Budnik Attachments: race_example.txt h2. Overview Expected behavior: The task successfully finishes and sends a TASK_FINISHED status update. Observed behavior: The task successfully finishes, but the agent sends TASK_FAILED with the reason "REASON_EXECUTOR_TERMINATED". In normal circumstances, the Docker executor [sends|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/docker/executor.cpp#L758] the final TASK_FINISHED status update to the agent, which then [gets processed|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5543] by the agent before termination of the executor's process. However, if the processing of the initial TASK_FINISHED gets delayed, then there is a chance that the Docker executor terminates and the agent [triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662] TASK_FAILED, which will [be handled|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5816-L5826] prior to the TASK_FINISHED status update. See the attached logs, which contain an example of the race condition. h2. Reproducing the bug 1. Add the following code:
{code:java}
static int c = 0;
if (++c == 3) { // Skip the TASK_STARTING and TASK_RUNNING status updates.
  ::sleep(2);
}
{code}
to the [`ComposingContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L578] and to the [`DockerContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/docker.cpp#L2167]. 2. Recompile Mesos. 3.
Launch the Mesos master and agent locally. 4. Launch a simple Docker task via `mesos-execute`:
{code}
# cd build
./src/mesos-execute --master="`hostname`:5050" --name="a" --containerizer=docker --docker_image=alpine --resources="cpus:1;mem:32" --command="ls"
{code}
h2. Race condition - description 1. The Mesos agent receives the TASK_FINISHED status update and then subscribes to [`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761]. 2. The `containerizer->status()` operation for the TASK_FINISHED status update gets delayed in the composing containerizer (e.g. due to a switch of the worker thread that executes the `status` method). 3. The Docker executor terminates and the agent [triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662] TASK_FAILED. 4. The Docker containerizer destroys the container. A registered callback for the `containerizer->wait` call in the composing containerizer dispatches a [lambda function|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L368-L373] that will clean up the `containers_` map. 5. The composing c'zer resumes and dispatches the `[status()|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L579]` method to the Docker containerizer for TASK_FINISHED, which in turn hangs for a few seconds. 6. The corresponding `containerId` gets removed from the `containers_` map of the composing c'zer. 7. The Mesos agent subscribes to [`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761] for the TASK_FAILED status update. 8. The composing c'zer returns ["Container not found"|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L576] for TASK_FAILED. 9.
`[Slave::_statusUpdate|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5826]` stores the terminal TASK_FAILED status update in the executor's data structure. 10. The Docker containerizer resumes and finishes processing the `status()` method for TASK_FINISHED. Finally, it returns control to the `Slave::_statusUpdate` continuation. This method [discovers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5808-L5814] that the executor has already been destroyed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
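The ten steps above boil down to one ordering property: the agent records whichever terminal update's `containerizer->status()` call completes first. A condensed, deterministic model (hypothetical Python; in Mesos the "latency" is the worker-thread switch in the composing containerizer):

```python
# Condensed model of the race; names and the latency mechanism are
# illustrative, not Mesos code.

pending = []  # (status_latency, arrival_order, update)

def on_terminal_update(update, status_latency):
    pending.append((status_latency, len(pending), update))

# TASK_FINISHED from the Docker executor arrives first, but its
# containerizer->status() call is delayed in the composing containerizer.
on_terminal_update("TASK_FINISHED", status_latency=2.0)

# The executor then terminates, and the agent synthesizes TASK_FAILED
# (REASON_EXECUTOR_TERMINATED) whose status() call completes immediately.
on_terminal_update("TASK_FAILED", status_latency=0.0)

# The futures complete in latency order, and only the first terminal
# update to complete is stored for the executor.
stored = min(pending)[2]
```

Here `stored` ends up as "TASK_FAILED" even though the task actually finished, which matches the observed behavior in the bug report.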
[jira] [Assigned] (MESOS-9618) Display quota consumption in the webui.
[ https://issues.apache.org/jira/browse/MESOS-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9618: -- Assignee: Benjamin Mahler > Display quota consumption in the webui. > --- > > Key: MESOS-9618 > URL: https://issues.apache.org/jira/browse/MESOS-9618 > Project: Mesos > Issue Type: Improvement > Components: webui >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: resource-management > > Currently, the Roles table in the webui displays allocation and quota > guarantees / limits. However, quota "consumption" is different from > allocation, in that reserved resources are always considered consumed against > the quota. > This discrepancy has led to confusion from users. One example occurred when > an agent was added with a large reservation exceeding the memory quota > guarantee. The user sees memory chopping in offers, and since the scheduler > didn't want to use the reservation, it can't launch its tasks. > If consumption is shown in the UI, we should include a tooltip that > indicates how consumption is calculated so that users know how to interpret it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
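The allocation-versus-consumption discrepancy can be illustrated with a small sketch (hypothetical Python; the accounting rule, reservations always counting against quota, is the one described above, but the function itself is not Mesos code):

```python
# Illustrative accounting sketch: reserved resources count as "consumed"
# against quota even when nothing is allocated from them.

def quota_consumption(allocated, unallocated_reservations):
    names = set(allocated) | set(unallocated_reservations)
    return {n: allocated.get(n, 0) + unallocated_reservations.get(n, 0)
            for n in names}

# The confusing scenario from the description: an agent joins with a large
# memory reservation for the role, but the scheduler never uses it.
allocated = {"mem": 0}
reservations = {"mem": 4096}
guarantee = {"mem": 2048}  # for context: the role's quota guarantee

consumed = quota_consumption(allocated, reservations)
# consumed["mem"] is 4096, above the 2048 guarantee, even though the
# allocation shown in the webui is 0; hence the user confusion.
```

A webui column showing only `allocated` would read 0 here, while the allocator treats the guarantee as more than fully consumed, which is exactly the gap the proposed tooltip should explain.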
[jira] [Created] (MESOS-9886) RoleTest.RolesEndpointContainsConsumedQuota is flaky.
Benjamin Mahler created MESOS-9886: -- Summary: RoleTest.RolesEndpointContainsConsumedQuota is flaky. Key: MESOS-9886 URL: https://issues.apache.org/jira/browse/MESOS-9886 Project: Mesos Issue Type: Bug Reporter: Benjamin Mahler Assignee: Benjamin Mahler {noformat} [ RUN ] RoleTest.RolesEndpointContainsConsumedQuota I0710 07:05:42.670790 9995 cluster.cpp:176] Creating default 'local' authorizer I0710 07:05:42.672238 master.cpp:440] Master 8db40cec-43ef-41a1-89a4-4f7b877d8f13 (ip-172-16-10-69.ec2.internal) started on 172.16.10.69:37082 I0710 07:05:42.672256 master.cpp:443] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/1d0m6o/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_operator_event_stream_subscribers="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework_metrics="true" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/1d0m6o/master" --zk_session_timeout="10secs" I0710 07:05:42.672351 master.cpp:492] Master only allowing authenticated frameworks to register I0710 07:05:42.672356 master.cpp:498] Master only allowing authenticated agents to register I0710 07:05:42.672360 master.cpp:504] Master only allowing authenticated HTTP frameworks to register I0710 07:05:42.672364 credentials.hpp:37] Loading credentials for authentication from '/tmp/1d0m6o/credentials' I0710 07:05:42.672430 master.cpp:548] Using default 'crammd5' authenticator I0710 07:05:42.672466 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I0710 07:05:42.672508 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I0710 07:05:42.672538 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I0710 07:05:42.672569 master.cpp:629] Authorization enabled I0710 07:05:42.672658 10001 hierarchical.cpp:241] Initialized hierarchical allocator process I0710 07:05:42.672685 10001 whitelist_watcher.cpp:77] No whitelist given I0710 07:05:42.673316 10001 master.cpp:2150] Elected as the leading master! I0710 07:05:42.673331 10001 master.cpp:1664] Recovering from registrar I0710 07:05:42.673616 10001 registrar.cpp:339] Recovering registrar I0710 07:05:42.673874 10001 registrar.cpp:383] Successfully fetched the registry (0B) in 239104ns I0710 07:05:42.673923 10001 registrar.cpp:487] Applied 1 operations in 7745ns; attempting to update the registry I0710 07:05:42.674052 registrar.cpp:544] Successfully updated the registry in 108032ns I0710 07:05:42.674082 registrar.cpp:416] Successfully recovered registrar I0710 07:05:42.674152 master.cpp:1799] Recovered 0 agents from the registry (180B); allowing 10mins for agents to reregister I0710 07:05:42.674185 9996 hierarchical.cpp:280] Skipping recovery of hierarchical allocator: nothing to recover W0710 07:05:42.676100 9995 process.cpp:2877] Attempted to spawn already running process files@172.16.10.69:37082 I0710 07:05:42.676537 9995 containerizer.cpp:314] Using isolation { environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni } I0710 07:05:42.678514 9995 linux_launcher.cpp:144] Using /cgroup/freezer as the freezer hierarchy for the Linux launcher I0710 07:05:42.678980 9995 provisioner.cpp:298] Using default backend 'copy' I0710 07:05:42.680043 9995 cluster.cpp:510] Creating default 'local' authorizer I0710 07:05:42.680832 9998 slave.cpp:265] Mesos agent started on (522)@172.16.10.69:37082 I0710 07:05:42.680850 9998 slave.cpp:266] Flags at startup: --acls="" --appc_simple_discovery_uri_prefix="http://" --appc_s
[jira] [Commented] (MESOS-9849) Add support for per-role REVIVE / SUPPRESS to V0 scheduler driver.
[ https://issues.apache.org/jira/browse/MESOS-9849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882214#comment-16882214 ] Andrei Sekretenko commented on MESOS-9849: -- Exercising this functionality in the Java V0 test framework in the Mesos tests: [https://reviews.apache.org/r/71047 |https://reviews.apache.org/r/71047] (to ensure that the bindings don't fail to pass these parameters through due to someone's mistake) > Add support for per-role REVIVE / SUPPRESS to V0 scheduler driver. > -- > > Key: MESOS-9849 > URL: https://issues.apache.org/jira/browse/MESOS-9849 > Project: Mesos > Issue Type: Task > Components: scheduler driver >Reporter: Benjamin Mahler >Assignee: Andrei Sekretenko >Priority: Major > Labels: resource-management > > Unfortunately, there are still schedulers that are using the v0 bindings and > are unable to move to v1 before wanting to use the per-role REVIVE / SUPPRESS > calls. > We'll need to add per-role REVIVE / SUPPRESS into the v0 scheduler driver. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9885) Resource provider configurations are only removed after their containers are stopped, causing issues in failover scenarios
Jan Schlicht created MESOS-9885: --- Summary: Resource provider configurations are only removed after their containers are stopped, causing issues in failover scenarios Key: MESOS-9885 URL: https://issues.apache.org/jira/browse/MESOS-9885 Project: Mesos Issue Type: Bug Components: resource provider Affects Versions: 1.8.0 Reporter: Jan Schlicht An agent could crash while it is handling a {{REMOVE_RESOURCE_PROVIDER_CONFIG}} call. In that case, the resource provider won't be removed. This is because its configuration is only removed if the actual resource provider container has been stopped, i.e., in {{LocalResourceProviderDaemonProcess::remove}}, {{os::rm}} is only called if {{cleanupContainers}} was successful. After an agent failover, the resource provider will still be running. This can be a problem for frameworks/operators, because there isn't a feedback channel that informs them whether their removal request was successful or not. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
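The failure mode can be condensed into a few lines (a hypothetical sketch of the ordering described above, not the actual `LocalResourceProviderDaemonProcess` code):

```python
# Sketch: the config is deleted only AFTER container cleanup succeeds, so a
# crash in between leaves the config on disk and the provider is relaunched.

state = {"container_running": True, "config_on_disk": True}

def remove_resource_provider(crash_before_rm=False):
    state["container_running"] = False   # cleanupContainers() succeeds
    if crash_before_rm:
        return                           # agent crashes before removing config
    state["config_on_disk"] = False      # config removed only on this path

def agent_failover():
    # On recovery, any resource provider config still on disk is launched again.
    if state["config_on_disk"]:
        state["container_running"] = True

remove_resource_provider(crash_before_rm=True)
agent_failover()
# The provider is running again, and the caller of the removal request
# never learns that the removal did not stick.
```

Making the config removal survive a crash (for example, recording the removal intent before stopping the container) would close this window, but as the report notes, the deeper issue is the missing feedback channel for removal requests.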