[jira] [Created] (MESOS-9882) Mesos.UpdateFrameworkV0Test.SuppressedRoles is flaky.
Meng Zhu created MESOS-9882:
---

Summary: Mesos.UpdateFrameworkV0Test.SuppressedRoles is flaky.
Key: MESOS-9882
URL: https://issues.apache.org/jira/browse/MESOS-9882
Project: Mesos
Issue Type: Bug
Components: flaky
Reporter: Meng Zhu
Attachments: UpdateFrameworkV0Test.SuppressedRoles_badrun.txt

Observed in CI, log attached.

{noformat}
mesos-ec2-ubuntu-14.04-SSL.Mesos.UpdateFrameworkV0Test.SuppressedRoles (from UpdateFrameworkV0Test)

Error Message
../../src/tests/master/update_framework_tests.cpp:1117
Mock function called more times than expected - returning directly.
    Function call: agentAdded(@0x7fb254001c40 32-byte object <90-7A 6C-85 B2-7F 00-00 00-00 00-00 00-00 00-00 01-00 00-00 00-00 00-00 F0-85 00-54 B2-7F 00-00>)
         Expected: to be called once
           Actual: called twice - over-saturated and active

Stacktrace
../../src/tests/master/update_framework_tests.cpp:1117
Mock function called more times than expected - returning directly.
    Function call: agentAdded(@0x7fb254001c40 32-byte object <90-7A 6C-85 B2-7F 00-00 00-00 00-00 00-00 00-00 01-00 00-00 00-00 00-00 F0-85 00-54 B2-7F 00-00>)
         Expected: to be called once
           Actual: called twice - over-saturated and active
{noformat}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9881) StorageLocalResourceProviderTest.RetryOperationStatusUpdateAfterRecovery is flaky.
Benjamin Mahler created MESOS-9881: -- Summary: StorageLocalResourceProviderTest.RetryOperationStatusUpdateAfterRecovery is flaky. Key: MESOS-9881 URL: https://issues.apache.org/jira/browse/MESOS-9881 Project: Mesos Issue Type: Improvement Reporter: Benjamin Mahler This failed in CI: {noformat} 1 tests failed. FAILED: CSIVersion/StorageLocalResourceProviderTest.RetryOperationStatusUpdateAfterRecovery/v0 Error Message: ../../../3rdparty/libprocess/include/process/gmock.hpp:667 Mock function called more times than expected - returning default value. Function call: filter(@0x5617542ee270 master@172.17.0.3:35735, @0x7f83cc053c30 264-byte object <48-23 06-32 84-7F 00-00 40-DE 07-CC 83-7F 00-00 2B-00 00-00 00-00 00-00 2B-00 00-00 00-00 00-00 4C-65 6E-67 74-68 00-6F 20-AF 00-54 17-56 00-00 10-AF 00-54 17-56 00-00 02-00 00-00 AC-11 00-03 ... 20-20 05-CC 83-7F 00-00 00-00 00-00 6E-20 76-61 50-2B 4B-53 17-56 00-00 40-2B 4B-53 17-56 00-00 60-DA 07-CC 83-7F 00-00 CA-03 00-00 00-00 00-00 CA-03 00-00 00-00 00-00 10-01 00-00 00-00 00-00>) Returns: false Expected: to be never called Actual: called once - over-saturated and active Stack Trace: ../../../3rdparty/libprocess/include/process/gmock.hpp:667 Mock function called more times than expected - returning default value. Function call: filter(@0x5617542ee270 master@172.17.0.3:35735, @0x7f83cc053c30 264-byte object <48-23 06-32 84-7F 00-00 40-DE 07-CC 83-7F 00-00 2B-00 00-00 00-00 00-00 2B-00 00-00 00-00 00-00 4C-65 6E-67 74-68 00-6F 20-AF 00-54 17-56 00-00 10-AF 00-54 17-56 00-00 02-00 00-00 AC-11 00-03 ... 
20-20 05-CC 83-7F 00-00 00-00 00-00 6E-20 76-61 50-2B 4B-53 17-56 00-00 40-2B 4B-53 17-56 00-00 60-DA 07-CC 83-7F 00-00 CA-03 00-00 00-00 00-00 CA-03 00-00 00-00 00-00 10-01 00-00 00-00 00-00>) Returns: false Expected: to be never called Actual: called once - over-saturated and active {noformat} Full test output: {noformat} [ RUN ] CSIVersion/StorageLocalResourceProviderTest.RetryOperationStatusUpdateAfterRecovery/v0 I0702 06:51:02.172196 6961 cluster.cpp:176] Creating default 'local' authorizer I0702 06:51:02.183229 17274 master.cpp:440] Master c310f701-ca24-4ea8-a4be-df3aa3637194 (005dc56bde82) started on 172.17.0.3:35735 I0702 06:51:02.184095 17274 master.cpp:443] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="50ms" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/Pq6bYz/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_operator_event_stream_subscribers="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework_metrics="true" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" --webui_dir="/tmp/SRC/build/mesos-1.9.0/_inst/share/mesos/webui" --work_dir="/tmp/Pq6bYz/master" --zk_session_timeout="10secs" I0702 06:51:02.185236 17274 master.cpp:492] Master only allowing authenticated frameworks to register I0702 06:51:02.185819 17274 master.cpp:498] Master only allowing authenticated agents to register I0702 06:51:02.186395 17274 master.cpp:504] Master only allowing authenticated HTTP frameworks to register I0702 06:51:02.186951 17274 credentials.hpp:37] Loading credentials for authentication from '/tmp/Pq6bYz/credentials' I0702 06:51:02.187907 17274 master.cpp:548] Using default 'crammd5' authenticator I0702 06:51:02.188771 17274 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I0702 06:51:02.189630 17274 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I0702 06:51:02.190573 17274 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I0702 06:51:02.191690 17274 master.cpp:629] Authorization enabled I0702 06:51:02.195374 17265
[jira] [Created] (MESOS-9880) Update SUPPRESS/REVIVE calls to return error codes / 200 OK.
Benjamin Mahler created MESOS-9880:
--

Summary: Update SUPPRESS/REVIVE calls to return error codes / 200 OK.
Key: MESOS-9880
URL: https://issues.apache.org/jira/browse/MESOS-9880
Project: Mesos
Issue Type: Improvement
Components: master, scheduler api
Reporter: Benjamin Mahler

Currently, the SUPPRESS/REVIVE calls always return '202 Accepted' even if the call is invalid. Instead, to be aligned with UPDATE_FRAMEWORK, these calls should:

- Return 200 OK if successful.
- Return an appropriate error response if invalid or erroneous.

For the v0 driver, this means:

- Send back a FrameworkErrorMessage if invalid or erroneous.
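To make the proposed behavior concrete, here is a minimal sketch of a handler that answers 200 on success and an error status otherwise, instead of an unconditional 202. It is not Mesos code: the `Response` struct and `handleSuppressOrRevive` are hypothetical names, the rule that a call is invalid when it names a role the framework is not subscribed to is an assumption, and so is the choice of 400 as the error status (the ticket only says "appropriate error response").

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical response type for illustration; not a Mesos class.
struct Response {
  int status;        // HTTP status code.
  std::string body;  // Error description; empty on success.
};

// Sketch of a SUPPRESS/REVIVE handler that validates before answering.
// Assumption: a call is invalid if it names a role that is not among the
// framework's roles. The 400 status is also an assumption.
Response handleSuppressOrRevive(
    const std::set<std::string>& frameworkRoles,
    const std::vector<std::string>& requestedRoles)
{
  for (const std::string& role : requestedRoles) {
    if (frameworkRoles.count(role) == 0) {
      return Response{400, "Unknown role '" + role + "'"};
    }
  }

  // The call is valid, so report success rather than '202 Accepted'.
  return Response{200, ""};
}
```

Under these assumptions, a call naming only subscribed roles yields 200, and a call naming any other role yields 400 with a message identifying the role.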
[jira] [Assigned] (MESOS-9816) Add draining state information to master event stream and state endpoints
[ https://issues.apache.org/jira/browse/MESOS-9816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Mann reassigned MESOS-9816:

Assignee: Joseph Wu

> Add draining state information to master event stream and state endpoints
> -
>
> Key: MESOS-9816
> URL: https://issues.apache.org/jira/browse/MESOS-9816
> Project: Mesos
> Issue Type: Task
> Components: master
> Reporter: Joseph Wu
> Assignee: Joseph Wu
> Priority: Major
> Labels: foundations, mesosphere
>
> The response for {{GET_STATE}} and {{GET_AGENTS}} should include the new
> fields indicating deactivation or draining states:
> {code}
> message Response {
>   . . .
>   message GetAgents {
>     message Agent {
>       . . .
>       optional bool deactivated = 12;
>       optional DrainInfo drain_info = 13;
>       . . .
>     }
>   }
>   . . .
> }
> {code}
> Additionally, the master's event stream should get a new event whenever these
> states change:
> {code}
> message Event {
>   . . .
>   enum Type {
>     . . .
>     AGENT_UPDATED = 10;
>   }
>   message AgentUpdated {
>     optional bool deactivated = 1;
>     optional DrainInfo drain_info = 2;
>   }
>   . . .
>   optional AgentUpdated agent_updated = 10;
> }
> {code}
[jira] [Comment Edited] (MESOS-9870) Simultaneous adding/removal of a role from framework's roles and its suppressed roles crashes the master.
[ https://issues.apache.org/jira/browse/MESOS-9870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874489#comment-16874489 ]

Andrei Sekretenko edited comment on MESOS-9870 at 7/3/19 3:18 PM:
--

There is one more problem which might be much more serious: removing a role from the framework's roles and the framework's suppressed roles in the same call also crashes the master.

Test via UPDATE_FRAMEWORK: [https://reviews.apache.org/r/70966/]

Re-subscribing also crashes (no wonder: they use the same code path) - I don't have the test completed.

Patches with a fix:
-[https://reviews.apache.org/r/70967/]-
-[https://reviews.apache.org/r/70968/]-
[https://reviews.apache.org/r/70994/]
[https://reviews.apache.org/r/70995/]

was (Author: asekretenko):

There is one more problem which might be much more serious: removing a role from framework's roles and framework's suppressed roles in the same call also crashes the master.

Test via UPDATE_FRAMEWORK: [https://reviews.apache.org/r/70966/]

Re-subscribing also crashes (no wonder: they use the same code path) - I don't have the test completed.

Patches with a fix:
[https://reviews.apache.org/r/70967/]
[https://reviews.apache.org/r/70968/]

> Simultaneous adding/removal of a role from framework's roles and its
> suppressed roles crashes the master.
> -
>
> Key: MESOS-9870
> URL: https://issues.apache.org/jira/browse/MESOS-9870
> Project: Mesos
> Issue Type: Bug
> Reporter: Andrei Sekretenko
> Assignee: Andrei Sekretenko
> Priority: Blocker
> Labels: resource-management
>
> Calling UPDATE_FRAMEWORK with a new role added both to `FrameworkInfo.roles`
> and `suppressed_roles` crashes the master.
> The first place which doesn't expect this is the code that increments the
> `suppressed` allocator metric:
> [https://github.com/apache/mesos/blob/fe7be9701e92d863734621ae1a3d339bb8598044/src/master/allocator/mesos/hierarchical.cpp#L507]
> [https://github.com/apache/mesos/blob/fe7be9701e92d863734621ae1a3d339bb8598044/src/master/allocator/mesos/metrics.cpp#L255]
> There are probably other similar places.
> Adding a new role in a suppressed state via re-subscribing should also
> trigger this bug, but I haven't checked it.
[jira] [Commented] (MESOS-9793) Implement UPDATE_FRAMEWORK call in V0 API
[ https://issues.apache.org/jira/browse/MESOS-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877867#comment-16877867 ]

Andrei Sekretenko commented on MESOS-9793:
--

{code}
commit 89a16ae04a2e4ed10a4d49bff467cb066883febb
Author: Andrei Sekretenko
Date: Mon Jul 1 22:56:06 2019 -0400

    Provided ability to pass suppressed roles via V0 updateFramework().

    This patch adds a list of suppressed roles to the arguments of
    scheduler driver's updateFramework() method to make it possible for
    V0 frameworks to selectively suppress/revive offers, to reach parity
    with V1 UPDATE_FRAMEWORK.

    To keep re-registration consistent with updateFramework(), the set
    of suppressed roles is now stored by the driver and used when it
    re-registers.

    To prevent re-registration from implicitly altering the effects of
    reviveOffers() and suppressOffers(), reviveOffers() now clears the
    stored suppressed roles and suppressOffers() now fills the stored
    suppressed roles. If these calls are issued in a disconnected state
    of the driver, the driver will perform an UPDATE_FRAMEWORK call upon
    re-connection.

    Review: https://reviews.apache.org/r/70894/
{code}
{code}
commit f2deec5e57a84cef0033bda45ee87ee4259ab4c7
Author: Andrei Sekretenko
Date: Mon Jul 1 23:21:25 2019 -0400

    Supported suppressedRoles in updateFramework() in V0 Java bindings.

    Review: https://reviews.apache.org/r/70897/
{code}
{code}
commit 3b3ee085582b0a1d116627178cb898c4b6d5c64d
Author: Andrei Sekretenko
Date: Mon Jul 1 23:40:11 2019 -0400

    Added a test for suppressing roles via V0 updateFramework().

    Review: https://reviews.apache.org/r/70895/
{code}
{code}
commit 285a1b1896a9feb3095b266ab9bbda300103e6eb
Author: Andrei Sekretenko
Date: Mon Jul 1 23:40:24 2019 -0400

    Added a test that driver re-registration does not unsuppress offers.

    Review: https://reviews.apache.org/r/70982/
{code}
{code}
commit 57b98ab790a8b2ed3e91cca825e1a38ebd51150a
Author: Andrei Sekretenko
Date: Tue Jul 2 00:25:22 2019 -0400

    Added scheduler driver constructors which set initial suppressed roles.

    Review: https://reviews.apache.org/r/70943/
{code}
{code}
commit 7b2811217257d49818eebf2e519f92d0354c6e20
Author: Andrei Sekretenko
Date: Tue Jul 2 00:27:55 2019 -0400

    Added a test for scheduler driver registering with a suppressed role.

    Review: https://reviews.apache.org/r/70944/
{code}
{code}
commit 5d2732692541ec9aecba20c501fb5e35a19ab49f
Author: Andrei Sekretenko
Date: Tue Jul 2 00:26:37 2019 -0400

    Added constructors with a list of suppressed roles to Java V0 bindings.

    Review: https://reviews.apache.org/r/70945/
{code}

> Implement UPDATE_FRAMEWORK call in V0 API
> -
>
> Key: MESOS-9793
> URL: https://issues.apache.org/jira/browse/MESOS-9793
> Project: Mesos
> Issue Type: Task
> Environment: Reviews for adding suppressed roles to
> updateFramework() in the scheduler driver:
> [https://reviews.apache.org/r/70894/]
> [https://reviews.apache.org/r/70897/]
> [https://reviews.apache.org/r/70895/]
> [https://reviews.apache.org/r/70982/]
> Reporter: Andrei Sekretenko
> Assignee: Andrei Sekretenko
> Priority: Major
> Labels: multitenancy, resource-management
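The bookkeeping described in the first commit message can be modeled roughly as follows. This is only a sketch of the stated rules, not the real scheduler driver: the class `SuppressedRolesModel` and all of its member names are hypothetical, and it assumes that "fills the stored suppressed roles" means suppressOffers() marks every role of the framework as suppressed.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

// Hypothetical model of the driver-side state; not MesosSchedulerDriver.
class SuppressedRolesModel {
public:
  explicit SuppressedRolesModel(std::set<std::string> roles)
    : roles_(std::move(roles)) {}

  // updateFramework() carries an explicit list of suppressed roles,
  // which the driver stores for later re-registration.
  void updateFramework(
      std::set<std::string> roles,
      std::set<std::string> suppressed)
  {
    roles_ = std::move(roles);
    suppressed_ = std::move(suppressed);
  }

  // reviveOffers() clears the stored suppressed roles.
  void reviveOffers() {
    suppressed_.clear();
    pendingUpdate_ = pendingUpdate_ || !connected_;
  }

  // suppressOffers() fills the stored suppressed roles.
  // Assumption: "fills" means all of the framework's roles.
  void suppressOffers() {
    suppressed_ = roles_;
    pendingUpdate_ = pendingUpdate_ || !connected_;
  }

  void disconnected() { connected_ = false; }

  // Returns true if an UPDATE_FRAMEWORK call should be sent because
  // revive/suppress was issued while the driver was disconnected.
  bool reconnected() {
    connected_ = true;
    bool send = pendingUpdate_;
    pendingUpdate_ = false;
    return send;
  }

  // The set that would be used when the driver re-registers.
  const std::set<std::string>& suppressedRoles() const {
    return suppressed_;
  }

private:
  std::set<std::string> roles_;
  std::set<std::string> suppressed_;
  bool connected_ = true;
  bool pendingUpdate_ = false;
};
```

The point of keeping the set inside the driver is that re-registration replays the stored value instead of resetting it, which is the property the "re-registration does not unsuppress offers" test checks.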
[jira] [Commented] (MESOS-9876) Use geteuid to determine subprocess' user when launching task.
[ https://issues.apache.org/jira/browse/MESOS-9876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877752#comment-16877752 ]

longfei commented on MESOS-9876:

[https://reviews.apache.org/r/71005/]

> Use geteuid to determine subprocess' user when launching task.
> --
>
> Key: MESOS-9876
> URL: https://issues.apache.org/jira/browse/MESOS-9876
> Project: Mesos
> Issue Type: Improvement
> Reporter: longfei
> Assignee: longfei
> Priority: Major
>
> I have to run mesos-agent as root (or as some user with root privileges) to
> isolate tasks' execution environment. For security, we
> # chmod +s mesos-agent and then run it as some user A (we ssh as user A
> to do some ops, but not everyone has root privileges).
> # use --switch_user to restrict tasks' capabilities (e.g. "rm -rf /" is not
> allowed).
> The problem is that if we set CommandInfo.User to A (the same user running
> mesos-agent), the check in MesosContainerizerLaunch::execute()
> {code:java}
> if (uid.get() != os::getuid().get()) {
>   // some code
> }{code}
> will always be false. As a result, all subprocesses will run as root.
> So I suggest that we use geteuid here instead of getuid, namely
> {code:java}
> if (uid.get() != ::geteuid()) {
>   // some code
> }
> {code}
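The difference between the two checks can be illustrated with plain POSIX calls. This is a sketch under the assumption of a POSIX system; the helper names `needsSwitch`, `needsSwitchByRealUid` and `needsSwitchByEffectiveUid` are hypothetical and do not appear in Mesos. Only `getuid()` and `geteuid()` are real library calls.

```cpp
#include <cassert>
#include <sys/types.h>
#include <unistd.h>

// Returns true when the launcher would switch to `target` before running
// the task, given the uid it compares against.
bool needsSwitch(uid_t target, uid_t current) {
  return target != current;
}

// Check as described in the ticket: compares against the real uid.
// For a setuid-root binary started by user A, the real uid is A, so a
// task requested for A skips the switch and keeps the effective uid.
bool needsSwitchByRealUid(uid_t target) {
  return needsSwitch(target, ::getuid());
}

// Proposed check: compares against the effective uid. In the same
// scenario the effective uid is root, so a task requested for A
// triggers the switch.
bool needsSwitchByEffectiveUid(uid_t target) {
  return needsSwitch(target, ::geteuid());
}
```

For example, with a hypothetical user A at uid 1000 and the binary running with effective uid 0, `needsSwitch(1000, 1000)` is false (the real-uid comparison) while `needsSwitch(1000, 0)` is true (the effective-uid comparison).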
[jira] [Assigned] (MESOS-9876) Use geteuid to determine subprocess' user when launching task.
[ https://issues.apache.org/jira/browse/MESOS-9876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

longfei reassigned MESOS-9876:
--

Assignee: longfei

> Use geteuid to determine subprocess' user when launching task.
> --
>
> Key: MESOS-9876
> URL: https://issues.apache.org/jira/browse/MESOS-9876
> Project: Mesos
> Issue Type: Improvement
> Reporter: longfei
> Assignee: longfei
> Priority: Major