[jira] [Created] (MESOS-9882) Mesos.UpdateFrameworkV0Test.SuppressedRoles is flaky.

2019-07-03 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9882:
---

 Summary: Mesos.UpdateFrameworkV0Test.SuppressedRoles is flaky.
 Key: MESOS-9882
 URL: https://issues.apache.org/jira/browse/MESOS-9882
 Project: Mesos
  Issue Type: Bug
  Components: flaky
Reporter: Meng Zhu
 Attachments: UpdateFrameworkV0Test.SuppressedRoles_badrun.txt

Observed in CI, log attached.

{noformat}
mesos-ec2-ubuntu-14.04-SSL.Mesos.UpdateFrameworkV0Test.SuppressedRoles (from 
UpdateFrameworkV0Test)


Error Message
../../src/tests/master/update_framework_tests.cpp:1117
Mock function called more times than expected - returning directly.
Function call: agentAdded(@0x7fb254001c40 32-byte object <90-7A 6C-85 B2-7F 
00-00 00-00 00-00 00-00 00-00 01-00 00-00 00-00 00-00 F0-85 00-54 B2-7F 00-00>)
 Expected: to be called once
   Actual: called twice - over-saturated and active
Stacktrace
../../src/tests/master/update_framework_tests.cpp:1117
Mock function called more times than expected - returning directly.
Function call: agentAdded(@0x7fb254001c40 32-byte object <90-7A 6C-85 B2-7F 
00-00 00-00 00-00 00-00 00-00 01-00 00-00 00-00 00-00 F0-85 00-54 B2-7F 00-00>)
 Expected: to be called once
   Actual: called twice - over-saturated and active
{noformat}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9881) StorageLocalResourceProviderTest.RetryOperationStatusUpdateAfterRecovery is flaky.

2019-07-03 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-9881:
--

 Summary: 
StorageLocalResourceProviderTest.RetryOperationStatusUpdateAfterRecovery is 
flaky.
 Key: MESOS-9881
 URL: https://issues.apache.org/jira/browse/MESOS-9881
 Project: Mesos
  Issue Type: Improvement
Reporter: Benjamin Mahler


This failed in CI:

{noformat}
1 tests failed.
FAILED:  
CSIVersion/StorageLocalResourceProviderTest.RetryOperationStatusUpdateAfterRecovery/v0

Error Message:
../../../3rdparty/libprocess/include/process/gmock.hpp:667
Mock function called more times than expected - returning default value.
Function call: filter(@0x5617542ee270 master@172.17.0.3:35735, 
@0x7f83cc053c30 264-byte object <48-23 06-32 84-7F 00-00 40-DE 07-CC 83-7F 
00-00 2B-00 00-00 00-00 00-00 2B-00 00-00 00-00 00-00 4C-65 6E-67 74-68 00-6F 
20-AF 00-54 17-56 00-00 10-AF 00-54 17-56 00-00 02-00 00-00 AC-11 00-03 ... 
20-20 05-CC 83-7F 00-00 00-00 00-00 6E-20 76-61 50-2B 4B-53 17-56 00-00 40-2B 
4B-53 17-56 00-00 60-DA 07-CC 83-7F 00-00 CA-03 00-00 00-00 00-00 CA-03 00-00 
00-00 00-00 10-01 00-00 00-00 00-00>)
  Returns: false
 Expected: to be never called
   Actual: called once - over-saturated and active

Stack Trace:
../../../3rdparty/libprocess/include/process/gmock.hpp:667
Mock function called more times than expected - returning default value.
Function call: filter(@0x5617542ee270 master@172.17.0.3:35735, 
@0x7f83cc053c30 264-byte object <48-23 06-32 84-7F 00-00 40-DE 07-CC 83-7F 
00-00 2B-00 00-00 00-00 00-00 2B-00 00-00 00-00 00-00 4C-65 6E-67 74-68 00-6F 
20-AF 00-54 17-56 00-00 10-AF 00-54 17-56 00-00 02-00 00-00 AC-11 00-03 ... 
20-20 05-CC 83-7F 00-00 00-00 00-00 6E-20 76-61 50-2B 4B-53 17-56 00-00 40-2B 
4B-53 17-56 00-00 60-DA 07-CC 83-7F 00-00 CA-03 00-00 00-00 00-00 CA-03 00-00 
00-00 00-00 10-01 00-00 00-00 00-00>)
  Returns: false
 Expected: to be never called
   Actual: called once - over-saturated and active
{noformat}

Full test output:

{noformat}
[ RUN  ] 
CSIVersion/StorageLocalResourceProviderTest.RetryOperationStatusUpdateAfterRecovery/v0
I0702 06:51:02.172196  6961 cluster.cpp:176] Creating default 'local' authorizer
I0702 06:51:02.183229 17274 master.cpp:440] Master 
c310f701-ca24-4ea8-a4be-df3aa3637194 (005dc56bde82) started on 172.17.0.3:35735
I0702 06:51:02.184095 17274 master.cpp:443] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="50ms" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/Pq6bYz/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_operator_event_stream_subscribers="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
--publish_per_framework_metrics="true" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" 
--webui_dir="/tmp/SRC/build/mesos-1.9.0/_inst/share/mesos/webui" 
--work_dir="/tmp/Pq6bYz/master" --zk_session_timeout="10secs"
I0702 06:51:02.185236 17274 master.cpp:492] Master only allowing authenticated 
frameworks to register
I0702 06:51:02.185819 17274 master.cpp:498] Master only allowing authenticated 
agents to register
I0702 06:51:02.186395 17274 master.cpp:504] Master only allowing authenticated 
HTTP frameworks to register
I0702 06:51:02.186951 17274 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/Pq6bYz/credentials'
I0702 06:51:02.187907 17274 master.cpp:548] Using default 'crammd5' 
authenticator
I0702 06:51:02.188771 17274 http.cpp:975] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0702 06:51:02.189630 17274 http.cpp:975] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0702 06:51:02.190573 17274 http.cpp:975] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0702 06:51:02.191690 17274 master.cpp:629] Authorization enabled
I0702 06:51:02.195374 17265 

[jira] [Created] (MESOS-9880) Update SUPPRESS/REVIVE calls to return error codes / 200 OK.

2019-07-03 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-9880:
--

 Summary: Update SUPPRESS/REVIVE calls to return error codes / 200 
OK.
 Key: MESOS-9880
 URL: https://issues.apache.org/jira/browse/MESOS-9880
 Project: Mesos
  Issue Type: Improvement
  Components: master, scheduler api
Reporter: Benjamin Mahler


Currently, the SUPPRESS/REVIVE calls always return '202 Accepted' even if the 
call is invalid.

Instead, to be aligned with UPDATE_FRAMEWORK, these calls should:

- Return 200 OK if successful.
- Return an appropriate error response if invalid or erroneous.

For the v0 driver, this means:

- Send back a FrameworkErrorMessage if invalid or erroneous.
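
The proposed HTTP behavior could look roughly like the sketch below. It is only an illustration, not the master's code: `Response`, `handleSuppress`, the validation rule (every role in the call must be one of the framework's roles) and the choice of 400 as the error code are all assumptions made for this example.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical response type; the real master uses its own HTTP types.
struct Response {
  int code;
  std::string body;
};

// Sketch of a SUPPRESS handler. Assumption: a call is invalid if it
// names a role that the framework is not subscribed to.
Response handleSuppress(
    const std::set<std::string>& frameworkRoles,
    const std::vector<std::string>& rolesToSuppress)
{
  for (const std::string& role : rolesToSuppress) {
    if (frameworkRoles.count(role) == 0) {
      // Today such a call would still be answered with '202 Accepted'.
      return Response{400, "Unknown role '" + role + "'"};
    }
  }

  // Valid call: report success explicitly instead of '202 Accepted'.
  return Response{200, ""};
}
```

A REVIVE handler would follow the same shape; for the v0 driver the error branch would correspond to sending a FrameworkErrorMessage.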





[jira] [Assigned] (MESOS-9816) Add draining state information to master event stream and state endpoints

2019-07-03 Thread Greg Mann (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-9816:


Assignee: Joseph Wu

> Add draining state information to master event stream and state endpoints
> -
>
> Key: MESOS-9816
> URL: https://issues.apache.org/jira/browse/MESOS-9816
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations, mesosphere
>
> The response for {{GET_STATE}} and {{GET_AGENTS}} should include the new 
> fields indicating deactivation or draining states:
> {code}
> message Response {
>   . . .
>   message GetAgents {
> message Agent {
>   . . .
>   optional bool deactivated = 12;
>   optional DrainInfo drain_info = 13;
>   . . .
> }
>   }
>   . . .
> }
> {code}
> Additionally, the master's event stream should get a new event whenever these 
> states change:
> {code}
> message Event {
>   . . .
>   enum Type {
> . . .
> AGENT_UPDATED = 10;
>   }
>   message AgentUpdated {
> optional bool deactivated = 1;
> optional DrainInfo drain_info = 2;
>   }
>   . . .
>   optional AgentUpdated agent_updated = 10;
> }
> {code}
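
The "new event whenever these states change" rule from the description can be sketched as below. This is a toy model under stated assumptions, not master code: `Drain` is a simplified stand-in for `DrainInfo`, and `AgentTracker` and its subscriber callback are hypothetical names.

```cpp
#include <functional>
#include <utility>

// Simplified, hypothetical stand-in for the DrainInfo message.
enum class Drain { NONE, DRAINING, DRAINED };

// Mirrors the two fields of the proposed AgentUpdated message.
struct AgentUpdated {
  bool deactivated;
  Drain drain;
};

class AgentTracker {
public:
  explicit AgentTracker(std::function<void(const AgentUpdated&)> subscriber)
    : state_{false, Drain::NONE}, subscriber_(std::move(subscriber)) {}

  // Publishes an event only if one of the two states actually changed.
  void update(bool deactivated, Drain drain) {
    if (deactivated == state_.deactivated && drain == state_.drain) {
      return;
    }
    state_ = AgentUpdated{deactivated, drain};
    subscriber_(state_);
  }

private:
  AgentUpdated state_;
  std::function<void(const AgentUpdated&)> subscriber_;
};
```

Calling `update()` twice with the same values produces a single event, which is the property a subscriber to the event stream would rely on.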





[jira] [Comment Edited] (MESOS-9870) Simultaneous adding/removal of a role from framework's roles and its suppressed roles crashes the master.

2019-07-03 Thread Andrei Sekretenko (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874489#comment-16874489
 ] 

Andrei Sekretenko edited comment on MESOS-9870 at 7/3/19 3:18 PM:
--

There is one more problem, which might be much more serious: removing a role 
from a framework's roles and from its suppressed roles in the same call also 
crashes the master.

Test via UPDATE_FRAMEWORK: [https://reviews.apache.org/r/70966/]

Re-subscribing also crashes (no wonder: it uses the same code path), but I 
have not finished a test for that yet.

Patches with a fix:
-[https://reviews.apache.org/r/70967/]-
-[https://reviews.apache.org/r/70968/]-

[https://reviews.apache.org/r/70994/]
[https://reviews.apache.org/r/70995/]


was (Author: asekretenko):
There is one more problem which might be much more serious: removing a role 
from framework's roles and framework's suppressed roles in the same call also 
crashes the master.

Test via UPDATE_FRAMEWORK: [https://reviews.apache.org/r/70966/]

Re-subscribing also crashes (no wonder: they use the same code path) - I don't 
have the test completed.

Patches with a fix:
[https://reviews.apache.org/r/70967/]
[https://reviews.apache.org/r/70968/]

> Simultaneous adding/removal of a role from framework's roles and its 
> suppressed roles crashes the master.
> -
>
> Key: MESOS-9870
> URL: https://issues.apache.org/jira/browse/MESOS-9870
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrei Sekretenko
>Assignee: Andrei Sekretenko
>Priority: Blocker
>  Labels: resource-management
>
> Calling UPDATE_FRAMEWORK with a new role added to both `FrameworkInfo.roles` 
> and `suppressed_roles` crashes the master.
> The first place that does not expect this is the code that increments the 
> `suppressed` allocator metric:
> [https://github.com/apache/mesos/blob/fe7be9701e92d863734621ae1a3d339bb8598044/src/master/allocator/mesos/hierarchical.cpp#L507]
> [https://github.com/apache/mesos/blob/fe7be9701e92d863734621ae1a3d339bb8598044/src/master/allocator/mesos/metrics.cpp#L255]
> There are probably other similar places.
> Adding a new role in a suppressed state via re-subscribing should also 
> trigger this bug, but I have not checked that.





[jira] [Commented] (MESOS-9793) Implement UPDATE_FRAMEWORK call in V0 API

2019-07-03 Thread Andrei Sekretenko (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877867#comment-16877867
 ] 

Andrei Sekretenko commented on MESOS-9793:
--

{code}
commit 89a16ae04a2e4ed10a4d49bff467cb066883febb
Author: Andrei Sekretenko 
Date:   Mon Jul 1 22:56:06 2019 -0400

Provided ability to pass suppressed roles via V0 updateFramework().

This patch adds a list of suppressed roles to the arguments of
scheduler driver's updateFramework() method to make it possible
for V0 frameworks to selectively suppress/revive offers, to reach
parity with V1 UPDATE_FRAMEWORK.

To keep re-registration consistent with updateFramework(), the set
of suppressed roles is now stored by the driver and used when it
re-registers.

To prevent re-registration from implicitly altering the effects of
reviveOffers() and suppressOffers(), reviveOffers() now clears the
stored suppressed roles and suppressOffers() now fills the stored
suppressed roles. If these calls are issued in a disconnected state
of the driver, the driver will perform an UPDATE_FRAMEWORK call upon
re-connection.

Review: https://reviews.apache.org/r/70894/

{code}
{code}
commit f2deec5e57a84cef0033bda45ee87ee4259ab4c7
Author: Andrei Sekretenko 
Date:   Mon Jul 1 23:21:25 2019 -0400

Supported suppressedRoles in updateFramework() in V0 Java bindings.

Review: https://reviews.apache.org/r/70897/

{code}
{code}
commit 3b3ee085582b0a1d116627178cb898c4b6d5c64d
Author: Andrei Sekretenko 
Date:   Mon Jul 1 23:40:11 2019 -0400

Added a test for suppressing roles via V0 updateFramework().

Review: https://reviews.apache.org/r/70895/

{code}
{code}
commit 285a1b1896a9feb3095b266ab9bbda300103e6eb
Author: Andrei Sekretenko 
Date:   Mon Jul 1 23:40:24 2019 -0400

Added a test that driver re-registration does not unsuppress offers.

Review: https://reviews.apache.org/r/70982/

{code}
{code}
commit 57b98ab790a8b2ed3e91cca825e1a38ebd51150a
Author: Andrei Sekretenko 
Date:   Tue Jul 2 00:25:22 2019 -0400

Added scheduler driver constructors which set initial suppressed roles.

Review: https://reviews.apache.org/r/70943/

{code}
{code}
commit 7b2811217257d49818eebf2e519f92d0354c6e20
Author: Andrei Sekretenko 
Date:   Tue Jul 2 00:27:55 2019 -0400

Added a test for scheduler driver registering with a suppressed role.

Review: https://reviews.apache.org/r/70944/

{code}
{code}
commit 5d2732692541ec9aecba20c501fb5e35a19ab49f
Author: Andrei Sekretenko 
Date:   Tue Jul 2 00:26:37 2019 -0400

Added constructors with a list of suppressed roles to Java V0 bindings.

Review: https://reviews.apache.org/r/70945/
{code}
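
The bookkeeping described in the first commit message can be modeled roughly as follows. This is a toy sketch, not the scheduler driver: `ToyDriver` and its members are hypothetical, and it assumes that suppressOffers() marks all of the framework's roles as suppressed.

```cpp
#include <set>
#include <string>
#include <utility>

// Toy model of a driver that remembers its suppressed roles.
class ToyDriver {
public:
  explicit ToyDriver(std::set<std::string> roles)
    : roles_(std::move(roles)) {}

  // Assumption: suppressing offers stores every role as suppressed.
  void suppressOffers() { suppressed_ = roles_; markChanged(); }

  // Reviving offers clears the stored suppressed roles.
  void reviveOffers() { suppressed_.clear(); markChanged(); }

  // Stores the suppressed roles so that re-registration can reuse them.
  void updateFramework(std::set<std::string> suppressed) {
    suppressed_ = std::move(suppressed);
    markChanged();
  }

  void disconnected() { connected_ = false; }

  // Returns true if an UPDATE_FRAMEWORK call has to be sent because
  // the stored state changed while the driver was disconnected.
  bool reconnected() {
    connected_ = true;
    bool sendUpdate = pendingUpdate_;
    pendingUpdate_ = false;
    return sendUpdate;
  }

  const std::set<std::string>& suppressedRoles() const { return suppressed_; }

private:
  void markChanged() {
    if (!connected_) {
      pendingUpdate_ = true;
    }
  }

  std::set<std::string> roles_;
  std::set<std::string> suppressed_;
  bool connected_ = true;
  bool pendingUpdate_ = false;
};
```

Because the set survives a disconnect, re-registration neither drops a suppression nor undoes a revive, which is what the test added in r/70982 checks for the real driver.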

> Implement UPDATE_FRAMEWORK call in V0 API
> -
>
> Key: MESOS-9793
> URL: https://issues.apache.org/jira/browse/MESOS-9793
> Project: Mesos
>  Issue Type: Task
> Environment: Reviews for adding suppressed roles to 
> updateFramework() in the scheduler driver:
> [https://reviews.apache.org/r/70894/]
> [https://reviews.apache.org/r/70897/]
> [https://reviews.apache.org/r/70895/]
> [https://reviews.apache.org/r/70982/]
>Reporter: Andrei Sekretenko
>Assignee: Andrei Sekretenko
>Priority: Major
>  Labels: multitenancy, resource-management
>






[jira] [Commented] (MESOS-9876) Use geteuid to determine subprocess' user when launching task.

2019-07-03 Thread longfei (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877752#comment-16877752
 ] 

longfei commented on MESOS-9876:


[https://reviews.apache.org/r/71005/]

> Use geteuid to determine subprocess' user when launching task.
> --
>
> Key: MESOS-9876
> URL: https://issues.apache.org/jira/browse/MESOS-9876
> Project: Mesos
>  Issue Type: Improvement
>Reporter: longfei
>Assignee: longfei
>Priority: Major
>
> I have to run mesos-agent as root (or as some user with root privileges) to 
> isolate the tasks' execution environment. For security, we 
>  # set the setuid bit (chmod +s) on mesos-agent and then run it as some user A 
> (we ssh as user A to do some ops, but not everyone has root privileges).
>  # use --switch_user to restrict the tasks' capabilities (e.g. "rm -rf /" is 
> not allowed).
> The problem is that if we set CommandInfo.User to A (the same user that runs 
> mesos-agent), the check in MesosContainerizerLaunch::execute()
> {code:java}
> if (uid.get() != os::getuid().get()) {
>   // some code
> }
> {code}
> will always be false. As a result, all subprocesses will run as root. 
> So I suggest that we use geteuid instead of getuid here, namely
> {code:java}
> if (uid.get() != ::geteuid()) {
>   // some code
> }
> {code}
>  
>  
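
The difference between the two checks can be illustrated with a small sketch. It assumes a setuid-root binary started by a user A with uid 1000 (a made-up value), so the real uid is 1000 and the effective uid is 0; `needsUserSwitch`, `ProcessIds` and `currentIds` are hypothetical names, not Mesos code.

```cpp
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper mirroring the condition quoted above: a user
// switch is attempted only if the target uid differs from `current`.
bool needsUserSwitch(uid_t target, uid_t current) {
  return target != current;
}

// Real and effective user ids of the calling process.
struct ProcessIds {
  uid_t real;
  uid_t effective;
};

ProcessIds currentIds() {
  // For a setuid-root binary run by user A, getuid() reports A while
  // geteuid() reports 0 (root).
  return ProcessIds{::getuid(), ::geteuid()};
}
```

With a target uid of 1000, comparing against the real uid (1000) yields false, so no switch happens and the task keeps the effective root identity. Comparing against the effective uid (0) yields true, so the switch to user A takes place.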





[jira] [Assigned] (MESOS-9876) Use geteuid to determine subprocess' user when launching task.

2019-07-03 Thread longfei (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

longfei reassigned MESOS-9876:
--

Assignee: longfei



