[jira] [Assigned] (MESOS-9812) Add achievability validation for update quota call.

2019-07-02 Thread Meng Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meng Zhu reassigned MESOS-9812:
---

Assignee: Meng Zhu

> Add achievability validation for update quota call.
> ---
>
> Key: MESOS-9812
> URL: https://issues.apache.org/jira/browse/MESOS-9812
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: resource-management
>
> Add overcommit check and force flag override for update quota call.
> Right now, we only have validation for each individual quota config. We need 
> to add further validation for the update quota call regarding:
> 1. Whether the role's resource limits are already breached. To achieve this, 
> we need to first rescind offers until the role's allocated resources are below 
> its limits. If, after all rescinds, allocated resources are still above the 
> requested limits, we will return an error unless the `force` flag is used.
> 2. Whether the aggregated quota guarantees of all roles exceed the cluster 
> capacity. If so, we will return an error unless the `force` flag is used.
> 3. Hierarchical quota validity (we could probably punt on this given that we 
> only support flat role quotas at the moment).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9879) Create a unit test ensuring that client certificate requests are properly ignored

2019-07-02 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9879:
--

 Summary: Create a unit test ensuring that client certificate 
requests are properly ignored
 Key: MESOS-9879
 URL: https://issues.apache.org/jira/browse/MESOS-9879
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


When a TLS server sends a Client Certificate Request as part of the handshake 
and the client does not have a certificate available, the TLS specification 
mandates that the client attempt to continue the handshake by sending a 
zero-length certificate.

We should write a unit test verifying that libprocess handles this correctly 
when acting as a client, although it's not completely clear how this might be 
implemented.





[jira] [Created] (MESOS-9878) Enable libprocess users to pass a custom SSL context when using Socket

2019-07-02 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9878:
--

 Summary: Enable libprocess users to pass a custom SSL context when 
using Socket
 Key: MESOS-9878
 URL: https://issues.apache.org/jira/browse/MESOS-9878
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


Connections made through the `Socket::connect()` API always use the 
libprocess-global SSL configuration set through the `LIBPROCESS_SSL_*` 
environment variables.

Libprocess users might want to override these options while still using the 
generic socket class.

Therefore we should provide a way to pass custom configuration to the 
`Socket::connect()` function.
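One possible shape for this, sketched below: a per-call options struct whose set fields win over the process-global `LIBPROCESS_SSL_*` defaults. Everything here (`OpenSSLOptions`, `effectiveConfig`, the field names, and the default paths) is a hypothetical illustration, not the actual libprocess API:

```cpp
#include <optional>
#include <string>

// Hypothetical per-call TLS options; an unset field means "fall back to the
// process-global setting derived from LIBPROCESS_SSL_*".
struct OpenSSLOptions {
  std::optional<std::string> cert_file;
  std::optional<std::string> key_file;
  bool verify_server = true;
};

// Stand-in for the global configuration parsed from the environment.
struct GlobalSSLConfig {
  std::string cert_file = "/etc/ssl/global.crt"; // e.g. LIBPROCESS_SSL_CERT_FILE
  bool verify_server = false;                    // e.g. LIBPROCESS_SSL_VERIFY_CERT
};

// Resolve the effective configuration: per-call options override globals.
GlobalSSLConfig effectiveConfig(
    const GlobalSSLConfig& global,
    const std::optional<OpenSSLOptions>& perCall)
{
  GlobalSSLConfig result = global;
  if (perCall) {
    if (perCall->cert_file) {
      result.cert_file = *perCall->cert_file;
    }
    result.verify_server = perCall->verify_server;
  }
  return result;
}
```

A `Socket::connect()` overload could then accept such a struct and build its SSL context from the resolved configuration instead of the globals.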





[jira] [Commented] (MESOS-9875) Mesos did not respond correctly when operations should fail

2019-07-02 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877190#comment-16877190
 ] 

Greg Mann commented on MESOS-9875:
--

Perhaps we can fix this in the short term by simply moving the 
{{updateOperation()}} call after the call to {{checkpointResourceState()}}… 
although with the current agent behavior, this would result in the agent 
crashing, then reconciling with the master, and the scheduler would receive an 
{{OPERATION_DROPPED}} update for that operation, which isn’t accurate (but 
better than {{FINISHED}}, I would say).

I think our current code isn’t going to handle this type of operation failure 
well; rather than crashing when checkpointing fails, I think we could simply 
send an {{OPERATION_FAILED}} update and allow the agent to continue running.

> Mesos did not respond correctly when operations should fail
> ---
>
> Key: MESOS-9875
> URL: https://issues.apache.org/jira/browse/MESOS-9875
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yifan Xing
>Priority: Major
>
> For testing persistent volumes with {{OPERATION_FAILED/ERROR}} feedback, we 
> SSHed into the mesos-agent and made it unable to create subdirectories in 
> {{/srv/mesos/work/volumes}}; however, mesos did not return any 
> operation-failed response. Instead, we received {{OPERATION_FINISHED}} 
> feedback.
> Steps to recreate the issue:
> 1. SSH into a mesos-agent.
> 2. Make it impossible to create a persistent volume (we expect the agent to 
> crash and reregister, and the master to realize that the operation is 
> {{OPERATION_DROPPED}}):
> * cd /srv/mesos/work (if it doesn't exist, mkdir /srv/mesos/work/volumes)
> * chattr -RV +i volumes (then no subdirectories can be created)
> 3. Launch a service with persistent volumes with the constraint of only using 
> the mesos-agent modified above.





[jira] [Created] (MESOS-9877) Possible segfault due to spurious EPOLLHUP.

2019-07-02 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9877:
--

 Summary: Possible segfault due to spurious EPOLLHUP.
 Key: MESOS-9877
 URL: https://issues.apache.org/jira/browse/MESOS-9877
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


On Linux, adding a TCP socket to an epoll instance before calling `connect()` 
will cause `epoll_wait()` to return an EPOLLHUP event for that socket. This 
can be verified with the following code snippet:

{noformat}
#include <sys/epoll.h>
#include <sys/socket.h>

#include <netinet/in.h>

int main() {
  int epfd = epoll_create1(0);
  int s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
  struct epoll_event event;
  event.events = EPOLLIN;
  event.data.u64 = s; // user data
  epoll_ctl(epfd, EPOLL_CTL_ADD, s, &event);

  struct epoll_event events[128];
  epoll_wait(epfd, events, 128, 500 /*ms*/);
}

// Run using `strace ./a.out`.
{noformat}

Libevent then turns EPOLLHUP into a read/write event:
{noformat}
// epoll.c
if (what & (EPOLLHUP|EPOLLERR)) {
  ev = EV_READ | EV_WRITE;
}
[...]
{noformat}

This means that when another thread is inside `epoll_wait()` while such an fd 
is added, the wait returns immediately for the new fd.

Apparently, some of our own code or libevent's does not handle this case 
correctly. For example, here is the syscall sequence of a failing 
`SSLTest.VerifyBadCA` run:
{noformat}
[pid 12012] 1562077806.912193 socket(AF_INET, 
SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 8
[pid 12012] 1562077806.912244 epoll_ctl(3, EPOLL_CTL_ADD, 8, {EPOLLIN, {u32=8, 
u64=8}}) = 0
[pid 12021] 1562077806.912261 <... epoll_wait resumed> [{EPOLLHUP, {u32=8, 
u64=8}}], 32, 100) = 1
[pid 12012] 1562077806.912269 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 12012] 1562077806.912303 epoll_ctl(3, EPOLL_CTL_MOD, 8, {EPOLLIN|EPOLLOUT, 
{u32=8, u64=8}}) = 0
[pid 12021] 1562077806.912371 write(8, 
"\26\3\1\0k\1\0\0g\3\3\r~\336VZ\227I\216\260\304\356\10\200\327\271\320\td\304'O"...,
 112) = -1 EPIPE (Broken pipe)
[pid 12021] 1562077806.912395 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, 
si_pid=12012, si_uid=1000} ---
[pid 12021] 1562077806.912415 epoll_ctl(3, EPOLL_CTL_MOD, 8, {EPOLLOUT, {u32=8, 
u64=8}}) = 0
[pid 12021] 1562077806.912435 epoll_ctl(3, EPOLL_CTL_DEL, 8, 0x7fc35be23afc) = 0
[pid 12021] 1562077806.912460 connect(8, {sa_family=AF_INET, 
sin_port=htons(45067), sin_addr=inet_addr("127.0.1.1")}, 16) = -1 EINPROGRESS 
(Operation now in progress)
[pid 12011] 1562077806.912533 <... epoll_wait resumed> [{EPOLLIN, {u32=7, 
u64=7}}], 32, 11) = 1
[pid 12021] 1562077806.912543 epoll_ctl(3, EPOLL_CTL_ADD, 8, {EPOLLIN, {u32=8, 
u64=8}}) = 0
[pid 12011] 1562077806.912562 epoll_ctl(3, EPOLL_CTL_DEL, 7, 0x7f1dbcee0a9c 

[pid 12021] 1562077806.912571 epoll_ctl(3, EPOLL_CTL_MOD, 8, {EPOLLIN|EPOLLOUT, 
{u32=8, u64=8}} 
[pid 12011] 1562077806.912580 <... epoll_ctl resumed> ) = 0
[pid 12021] 1562077806.912586 <... epoll_ctl resumed> ) = 0
[pid 12021] 1562077806.912599 epoll_wait(3, [{EPOLLIN, {u32=6, u64=6}}, 
{EPOLLOUT, {u32=8, u64=8}}], 32, 100) = 2
[pid 12021] 1562077806.912636 write(8, 
"\26\3\1\0k\1\0\0g\3\3\r~\336VZ\227I\216\260\304\356\10\200\327\271\320\td\304'O"...,
 112) = 112
[pid 12021] 1562077806.912684 epoll_ctl(3, EPOLL_CTL_MOD, 8, {EPOLLIN, {u32=8, 
u64=8}}) = 0
[pid 12021] 1562077806.912705 epoll_wait(3,  
[pid 12011] 1562077806.912954 write(2, "W0702 16:30:06.912921 12011 proc"..., 
113W0702 16:30:06.912921 12011 process.cpp:844] Failed to recv on socket 9 to 
peer '127.0.0.1:52578': Decoder error
) = 113
[pid 12011] 1562077806.913004 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, 
u64=7}}) = 0
[pid 12021] 1562077806.913088 <... epoll_wait resumed> [{EPOLLIN, {u32=8, 
u64=8}}], 32, 100) = 1
[pid 12021] 1562077806.913119 epoll_ctl(3, EPOLL_CTL_DEL, 8, 0x7fc35be23afc) = 0
[pid 12011] 1562077806.913159 epoll_wait(3,  
[pid 12021] 1562077806.913168 write(2, "SETTING bev TO NULL 1\n", 22SETTING bev 
TO NULL 1
) = 22
[pid 12021] 1562077806.913219 epoll_wait(3,  
[pid 12003] 1562077806.913233 write(6, "\1\0\0\0\0\0\0\0", 8 
[pid 12011] 1562077806.913253 <... epoll_wait resumed> [{EPOLLIN, {u32=6, 
u64=6}}], 32, 14990) = 1
[pid 12003] 1562077806.913293 <... write resumed> ) = 8
[pid 12011] 1562077806.913375 epoll_wait(3,  
[pid 12012] 1562077806.913412 write(1, "../../../3rdparty/libprocess/src"..., 
122) = 122
[pid 12012] 1562077806.913449 write(6, "\1\0\0\0\0\0\0\0", 8 
[pid 12021] 1562077806.913464 <... epoll_wait resumed> [{EPOLLIN, {u32=6, 
u64=6}}], 32, 99) = 1
[pid 12012] 1562077806.913475 <... write resumed> ) = 8
[pid 12021] 1562077806.913515 --- SIGSEGV {si_signo=SIGSEGV, 
si_code=SEGV_MAPERR, si_addr=0x128} ---
[pid 12020] 1562077807.003305 +++ killed by SIGSEGV (core dumped) +++
{noformat}

As we can see from the above, the first wakeup triggered the `ssl-client` to 
attempt to write the SSL Client Hello to the socket.

[jira] [Assigned] (MESOS-9874) Add environment variable `MESOS_ALLOCATION_ROLE` to the task/container.

2019-07-02 Thread Qian Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-9874:
-

Assignee: Qian Zhang

> Add environment variable `MESOS_ALLOCATION_ROLE` to the task/container.
> ---
>
> Key: MESOS-9874
> URL: https://issues.apache.org/jira/browse/MESOS-9874
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Qian Zhang
>Priority: Major
>  Labels: containerization
>
> Set this env var to the role of the task's allocated resources. Here is an 
> example:
> https://github.com/apache/mesos/blob/master/src/master/readonly_handler.cpp#L197
> We probably want to set this env var for executors by adding it to the 
> CommandInfo environment.
> Both the Mesos and Docker containerizers should be supported.
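A minimal sketch of the merge step, with hypothetical names and a plain std::map standing in for CommandInfo's Environment message:

```cpp
#include <map>
#include <string>

// Illustrative sketch, not Mesos code: merge MESOS_ALLOCATION_ROLE, derived
// from the role of the task's allocated resources, into the environment
// that would be passed to the executor via CommandInfo.
std::map<std::string, std::string> injectAllocationRole(
    std::map<std::string, std::string> env,
    const std::string& allocationRole)
{
  env["MESOS_ALLOCATION_ROLE"] = allocationRole;
  return env;
}
```

The same merge would apply in both the Mesos and Docker containerizer paths.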





[jira] [Created] (MESOS-9876) Use geteuid to determine subprocess' user when launching task.

2019-07-02 Thread longfei (JIRA)
longfei created MESOS-9876:
--

 Summary: Use geteuid to determine subprocess' user when launching 
task.
 Key: MESOS-9876
 URL: https://issues.apache.org/jira/browse/MESOS-9876
 Project: Mesos
  Issue Type: Improvement
Reporter: longfei


I have to run mesos-agent as root (or some user with root privileges) to 
isolate tasks' execution environments. For security, we
 # chmod +s the mesos-agent binary and then run it as some user A.
 # use `--switch_user` to restrict tasks' capabilities (e.g. `rm -rf /` is not 
allowed).

The problem is that if we set the task user to A (the same user running 
mesos-agent), the check in MesosContainerizerLaunch::execute() (i.e. 
`uid.get() != os::getuid().get()`) will always be false. As a result, all 
subprocesses will be run as root.

So I suggest using `geteuid` here instead of `getuid`, namely:

{noformat}
if (uid.get() != ::geteuid()) {
  // some code
}
{noformat}
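The distinction matters only when a privilege transition such as the setuid bit is in play: `getuid()` returns the real (invoking) user while `geteuid()` returns the effective user. A tiny sketch with a hypothetical helper name:

```cpp
#include <unistd.h>

// For a setuid binary, getuid() reports the invoking (real) user while
// geteuid() reports the effective user (the binary's owner). The two only
// differ under a privilege transition such as the setuid bit, which is
// exactly the case the proposed check needs to catch.
bool effectiveUidDiffersFromReal() {
  return ::getuid() != ::geteuid();
}
```

Run as an ordinary (non-setuid) process, the two are equal and the helper returns false.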





[jira] [Commented] (MESOS-9875) Mesos did not respond correctly when operations should fail

2019-07-02 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876736#comment-16876736
 ] 

James Peach commented on MESOS-9875:


{{f9330006-d885-4ef0-b2c7-c9c6fcc239e5}} is the persistence ID.
{{5fa5c810-2dd3-41cb-9633-a3ef404b08c4}} is the operation UUID.
{{honvr62494cqk_ff4e953f-0eca-4b41-a08d-ddea27980b14}} is the operation ID.

{noformat}

I0627 22:03:17.360236 3529210 slave.cpp:4282] Updated checkpointed operations 
from [ cfd6b624-996f-45d7-9aaf-9a13ab9714b4 (RESERVE for framework 
efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525, ID: 
honvr62494cqk_a5b92fff-5491-4616-8970-8c390265c009, latest state: 
OPERATION_FINISHED) ] to [ cfd6b624-996f-45d7-9aaf-9a13ab9714b4 (RESERVE for 
framework efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525, ID: 
honvr62494cqk_a5b92fff-5491-4616-8970-8c390265c009, latest state: 
OPERATION_FINISHED), 5fa5c810-2dd3-41cb-9633-a3ef404b08c4 (CREATE for framework 
efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525, ID: 
honvr62494cqk_ff4e953f-0eca-4b41-a08d-ddea27980b14, latest state: 
OPERATION_PENDING) ]
...
I0627 22:03:17.360723 3529210 slave.cpp:8670] Updating the state of operation 
'honvr62494cqk_ff4e953f-0eca-4b41-a08d-ddea27980b14' (uuid: 
5fa5c810-2dd3-41cb-9633-a3ef404b08c4) for framework 
efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525 (latest state: OPERATION_FINISHED, 
status update state: OPERATION_FINISHED)
...
E0627 22:03:17.365811 3529210 slave.cpp:4257] EXIT with status 1: Failed to 
sync checkpointed resources: Failed to create the persistent volume 
f9330006-d885-4ef0-b2c7-c9c6fcc239e5 at 
'/srv/mesos/work/volumes/roles/test-3/f9330006-d885-4ef0-b2c7-c9c6fcc239e5': 
Operation not permitted
{noformat}


The relevant code sequence is in Slave::applyOperation, and looks roughly like 
this:

{noformat}
track the new operation

checkpointResourceState() (1)

apply the operation (2)
report that the operation was applied

checkpointResourceState() (3)
{noformat}

The operation is checkpointed as pending in (1), but no resource changes have 
been made yet. In (2), the operation is applied by making changes to the 
agent's resources. At (3), the checkpointed-resources discrepancy is 
discovered and the agent tries to create the persistent volume and fails.
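A toy model of that ordering (hypothetical names, not Mesos code) makes the failure mode concrete: the operation's visible state reaches FINISHED at (2) before the checkpoint at (3) runs, so a checkpoint failure cannot retroactively correct what was reported:

```cpp
// Toy model of the applyOperation sequence above.
enum class State { PENDING, FINISHED };

State applyOperation(bool checkpointSucceeds) {
  State state = State::PENDING; // (1) operation checkpointed as pending

  state = State::FINISHED;      // (2) operation applied and reported

  if (!checkpointSucceeds) {
    // (3) checkpointing the resulting resources fails, but the state
    // reported to the scheduler is already FINISHED. The current agent
    // exits here; the comment above proposes sending OPERATION_FAILED
    // and continuing instead.
  }
  return state;
}
```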


> Mesos did not respond correctly when operations should fail
> ---
>
> Key: MESOS-9875
> URL: https://issues.apache.org/jira/browse/MESOS-9875
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yifan Xing
>Priority: Major
>
> For testing persistent volumes with `OPERATION_FAILED/ERROR` feedback, we 
> SSHed into the mesos-agent and made it unable to create subdirectories in 
> /srv/mesos/work/volumes; however, mesos did not return any operation-failed 
> response. Instead, we received `OPERATION_FINISHED` feedback.
> Steps to recreate the issue:
> 1. SSH into a mesos-agent.
> 2. Make it impossible to create a persistent volume (we expect the agent to 
> crash and reregister, and the master to realize that the operation is 
> `OPERATION_DROPPED`):
> * cd /srv/mesos/work (if it doesn't exist, mkdir /srv/mesos/work/volumes)
>  * chattr -RV +i volumes (then no subdirectories can be created)
> 3. Launch a service with persistent volumes with the constraint of only using 
> the mesos-agent modified above.


