[jira] [Assigned] (MESOS-9812) Add achievability validation for update quota call.
[ https://issues.apache.org/jira/browse/MESOS-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Meng Zhu reassigned MESOS-9812:
-------------------------------

    Assignee: Meng Zhu

> Add achievability validation for update quota call.
> ---------------------------------------------------
>
>                 Key: MESOS-9812
>                 URL: https://issues.apache.org/jira/browse/MESOS-9812
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Meng Zhu
>            Assignee: Meng Zhu
>            Priority: Major
>              Labels: resource-management
>
> Add an overcommit check and a `force` flag override for the update quota call.
> Right now, we only validate each quota config on its own. We need to add
> further validation for the update quota call regarding:
> 1. Whether the role's resource limits are already breached. To achieve this, we
> need to first rescind offers until the role's allocated resources are below the
> limits. If, after all rescinds, allocated resources are still above the
> requested limits, we return an error unless the `force` flag is used.
> 2. Whether the aggregated quota guarantees of all roles fit within the cluster
> capacity. If they exceed it (i.e. the cluster would be overcommitted), we
> return an error unless the `force` flag is used.
> 3. Hierarchical quota validity (we could probably punt on this given that we
> only support flat role quotas at the moment).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (MESOS-9879) Create a unit test ensuring that client certificate requests are properly ignored
Benno Evers created MESOS-9879:
-----------------------------------

             Summary: Create a unit test ensuring that client certificate requests are properly ignored
                 Key: MESOS-9879
                 URL: https://issues.apache.org/jira/browse/MESOS-9879
             Project: Mesos
          Issue Type: Improvement
            Reporter: Benno Evers

When a TLS server sends a Client Certificate Request as part of the handshake and the client does not have a certificate available, the TLS specification mandates that the client shall attempt to continue the connection by sending a zero-length certificate.

We should write a unit test verifying that libprocess handles this correctly when acting as a client, although it is not completely clear how this might be implemented.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (MESOS-9878) Enable libprocess users to pass a custom SSL context when using Socket
Benno Evers created MESOS-9878:
-----------------------------------

             Summary: Enable libprocess users to pass a custom SSL context when using Socket
                 Key: MESOS-9878
                 URL: https://issues.apache.org/jira/browse/MESOS-9878
             Project: Mesos
          Issue Type: Improvement
            Reporter: Benno Evers

Connections made through the `Socket::connect()` API will always use the libprocess-global SSL configuration made through the `LIBPROCESS_SSL_*` environment variables.

Libprocess users might want to override these options while still using the generic socket class. Therefore, we should provide a way to pass custom configuration to the `Socket::connect()` function.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (MESOS-9875) Mesos did not respond correctly when operations should fail
[ https://issues.apache.org/jira/browse/MESOS-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877190#comment-16877190 ]

Greg Mann commented on MESOS-9875:
----------------------------------

Perhaps we can fix this in the short term by simply moving the {{updateOperation()}} call after the call to {{checkpointResourceState()}}... although with current agent behavior, this would result in the agent crashing, then reconciling with the master, and the scheduler would receive an {{OPERATION_DROPPED}} update for that operation, which isn't accurate (but better than {{FINISHED}}, I would say).

I think our current code isn't going to handle this type of operation failure well; rather than crashing when checkpointing fails, I think we could simply send an {{OPERATION_FAILED}} update and allow the agent to continue running.

> Mesos did not respond correctly when operations should fail
> -----------------------------------------------------------
>
>                 Key: MESOS-9875
>                 URL: https://issues.apache.org/jira/browse/MESOS-9875
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Yifan Xing
>            Priority: Major
>
> For testing persistent volumes with {{OPERATION_FAILED/ERROR}} feedback, we
> sshed into the mesos-agent and made it unable to create subdirectories in
> {{/srv/mesos/work/volumes}}; however, Mesos did not return any operation-failed
> response. Instead, we received {{OPERATION_FINISHED}} feedback.
> Steps to recreate the issue:
> 1. Ssh into a magent.
> 2. Make it impossible to create a persistent volume (we expect the agent to
> crash and reregister, and the master to realize that the operation is
> {{OPERATION_DROPPED}}):
> * cd /srv/mesos/work (if it doesn't exist, mkdir /srv/mesos/work/volumes)
> * chattr -RV +i volumes (then no subdirectories can be created)
> 3. Launch a service with persistent volumes with the constraint of only using
> the magent modified above.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (MESOS-9877) Possible segfault due to spurious EPOLLHUP.
Benno Evers created MESOS-9877:
-----------------------------------

             Summary: Possible segfault due to spurious EPOLLHUP.
                 Key: MESOS-9877
                 URL: https://issues.apache.org/jira/browse/MESOS-9877
             Project: Mesos
          Issue Type: Bug
            Reporter: Benno Evers

On Linux, adding a TCP socket to an epoll set and calling `epoll_wait()` before calling `connect()` will return an EPOLLHUP event on that socket. This can be verified with the following code snippet:

{noformat}
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main() {
  int epfd = epoll_create1(0);
  int s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);

  struct epoll_event event;
  event.events = EPOLLIN;
  event.data.u64 = s; // user data

  epoll_ctl(epfd, EPOLL_CTL_ADD, s, &event);

  struct epoll_event events[128];
  epoll_wait(epfd, events, 128, 500 /*ms*/);
}
// Run using `strace ./a.out`.
{noformat}

Libevent then turns EPOLLHUP into a read/write event:

{noformat}
// epoll.c
if (what & (EPOLLHUP|EPOLLERR)) {
  ev = EV_READ | EV_WRITE;
}
[...]
{noformat}

This means that when another thread was inside `epoll_wait()` while that fd was added, the wait will return immediately for that new fd. Apparently, some of either our own or libevent's code does not handle this case correctly. For example, here is a syscall sequence of `SSLTest.VerifyBadCA` failing:

{noformat}
[pid 12012] 1562077806.912193 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 8
[pid 12012] 1562077806.912244 epoll_ctl(3, EPOLL_CTL_ADD, 8, {EPOLLIN, {u32=8, u64=8}}) = 0
[pid 12021] 1562077806.912261 <... epoll_wait resumed> [{EPOLLHUP, {u32=8, u64=8}}], 32, 100) = 1
[pid 12012] 1562077806.912269 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 12012] 1562077806.912303 epoll_ctl(3, EPOLL_CTL_MOD, 8, {EPOLLIN|EPOLLOUT, {u32=8, u64=8}}) = 0
[pid 12021] 1562077806.912371 write(8, "\26\3\1\0k\1\0\0g\3\3\r~\336VZ\227I\216\260\304\356\10\200\327\271\320\td\304'O"..., 112) = -1 EPIPE (Broken pipe)
[pid 12021] 1562077806.912395 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=12012, si_uid=1000} ---
[pid 12021] 1562077806.912415 epoll_ctl(3, EPOLL_CTL_MOD, 8, {EPOLLOUT, {u32=8, u64=8}}) = 0
[pid 12021] 1562077806.912435 epoll_ctl(3, EPOLL_CTL_DEL, 8, 0x7fc35be23afc) = 0
[pid 12021] 1562077806.912460 connect(8, {sa_family=AF_INET, sin_port=htons(45067), sin_addr=inet_addr("127.0.1.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 12011] 1562077806.912533 <... epoll_wait resumed> [{EPOLLIN, {u32=7, u64=7}}], 32, 11) = 1
[pid 12021] 1562077806.912543 epoll_ctl(3, EPOLL_CTL_ADD, 8, {EPOLLIN, {u32=8, u64=8}}) = 0
[pid 12011] 1562077806.912562 epoll_ctl(3, EPOLL_CTL_DEL, 7, 0x7f1dbcee0a9c
[pid 12021] 1562077806.912571 epoll_ctl(3, EPOLL_CTL_MOD, 8, {EPOLLIN|EPOLLOUT, {u32=8, u64=8}}
[pid 12011] 1562077806.912580 <... epoll_ctl resumed> ) = 0
[pid 12021] 1562077806.912586 <... epoll_ctl resumed> ) = 0
[pid 12021] 1562077806.912599 epoll_wait(3, [{EPOLLIN, {u32=6, u64=6}}, {EPOLLOUT, {u32=8, u64=8}}], 32, 100) = 2
[pid 12021] 1562077806.912636 write(8, "\26\3\1\0k\1\0\0g\3\3\r~\336VZ\227I\216\260\304\356\10\200\327\271\320\td\304'O"..., 112) = 112
[pid 12021] 1562077806.912684 epoll_ctl(3, EPOLL_CTL_MOD, 8, {EPOLLIN, {u32=8, u64=8}}) = 0
[pid 12021] 1562077806.912705 epoll_wait(3,
[pid 12011] 1562077806.912954 write(2, "W0702 16:30:06.912921 12011 proc"..., 113W0702 16:30:06.912921 12011 process.cpp:844] Failed to recv on socket 9 to peer '127.0.0.1:52578': Decoder error
) = 113
[pid 12011] 1562077806.913004 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=7}}) = 0
[pid 12021] 1562077806.913088 <... epoll_wait resumed> [{EPOLLIN, {u32=8, u64=8}}], 32, 100) = 1
[pid 12021] 1562077806.913119 epoll_ctl(3, EPOLL_CTL_DEL, 8, 0x7fc35be23afc) = 0
[pid 12011] 1562077806.913159 epoll_wait(3,
[pid 12021] 1562077806.913168 write(2, "SETTING bev TO NULL 1\n", 22SETTING bev TO NULL 1
) = 22
[pid 12021] 1562077806.913219 epoll_wait(3,
[pid 12003] 1562077806.913233 write(6, "\1\0\0\0\0\0\0\0", 8
[pid 12011] 1562077806.913253 <... epoll_wait resumed> [{EPOLLIN, {u32=6, u64=6}}], 32, 14990) = 1
[pid 12003] 1562077806.913293 <... write resumed> ) = 8
[pid 12011] 1562077806.913375 epoll_wait(3,
[pid 12012] 1562077806.913412 write(1, "../../../3rdparty/libprocess/src"..., 122) = 122
[pid 12012] 1562077806.913449 write(6, "\1\0\0\0\0\0\0\0", 8
[pid 12021] 1562077806.913464 <... epoll_wait resumed> [{EPOLLIN, {u32=6, u64=6}}], 32, 99) = 1
[pid 12012] 1562077806.913475 <... write resumed> ) = 8
[pid 12021] 1562077806.913515 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x128} ---
[pid 12020] 1562077807.003305 +++ killed by SIGSEGV (core dumped) +++
{noformat}

As we can see from the above, the first wakeup triggered the `ssl-client` to attempt to write the SSL Client Hello to the socket
[jira] [Assigned] (MESOS-9874) Add environment variable `MESOS_ALLOCATION_ROLE` to the task/container.
[ https://issues.apache.org/jira/browse/MESOS-9874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Qian Zhang reassigned MESOS-9874:
---------------------------------

    Assignee: Qian Zhang

> Add environment variable `MESOS_ALLOCATION_ROLE` to the task/container.
> ------------------------------------------------------------------------
>
>                 Key: MESOS-9874
>                 URL: https://issues.apache.org/jira/browse/MESOS-9874
>             Project: Mesos
>          Issue Type: Task
>          Components: containerization
>            Reporter: Gilbert Song
>            Assignee: Qian Zhang
>            Priority: Major
>              Labels: containerization
>
> Set this env var to the role from the task's resources. Here is an example:
> https://github.com/apache/mesos/blob/master/src/master/readonly_handler.cpp#L197
> We probably want to set this env var from the executors, by adding it to
> CommandInfo.
> Both the Mesos and Docker containerizers should be supported.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (MESOS-9876) Use geteuid to determine subprocess' user when launching task.
longfei created MESOS-9876:
-------------------------------

             Summary: Use geteuid to determine subprocess' user when launching task.
                 Key: MESOS-9876
                 URL: https://issues.apache.org/jira/browse/MESOS-9876
             Project: Mesos
          Issue Type: Improvement
            Reporter: longfei

I have to run mesos-agent as root (or some user with root privileges) to isolate tasks' execution environments. For security, we:
# chmod +s the mesos-agent binary and then run it as some user A.
# use switch_user to restrict tasks' capabilities (e.g. "rm -rf /" is not allowed).

The problem is that if we set the task's user to A (the same user running mesos-agent), the check in MesosContainerizerLaunch::execute() (i.e. `uid.get() != os::getuid().get()`) will always be false. As a result, all subprocesses will be run as root.

So I suggest that we use geteuid here instead of getuid, namely:

{noformat}
if (uid.get() != ::geteuid()) {
  // some code
}
{noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (MESOS-9875) Mesos did not respond correctly when operations should fail
[ https://issues.apache.org/jira/browse/MESOS-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876736#comment-16876736 ]

James Peach commented on MESOS-9875:
------------------------------------

{{f9330006-d885-4ef0-b2c7-c9c6fcc239e5}} is the persistence ID. {{5fa5c810-2dd3-41cb-9633-a3ef404b08c4}} is the operation UUID. {{honvr62494cqk_ff4e953f-0eca-4b41-a08d-ddea27980b14}} is the operation ID.

{noformat}
I0627 22:03:17.360236 3529210 slave.cpp:4282] Updated checkpointed operations from [ cfd6b624-996f-45d7-9aaf-9a13ab9714b4 (RESERVE for framework efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525, ID: honvr62494cqk_a5b92fff-5491-4616-8970-8c390265c009, latest state: OPERATION_FINISHED) ] to [ cfd6b624-996f-45d7-9aaf-9a13ab9714b4 (RESERVE for framework efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525, ID: honvr62494cqk_a5b92fff-5491-4616-8970-8c390265c009, latest state: OPERATION_FINISHED), 5fa5c810-2dd3-41cb-9633-a3ef404b08c4 (CREATE for framework efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525, ID: honvr62494cqk_ff4e953f-0eca-4b41-a08d-ddea27980b14, latest state: OPERATION_PENDING) ]
...
I0627 22:03:17.360723 3529210 slave.cpp:8670] Updating the state of operation 'honvr62494cqk_ff4e953f-0eca-4b41-a08d-ddea27980b14' (uuid: 5fa5c810-2dd3-41cb-9633-a3ef404b08c4) for framework efd8f75d-25a9-4346-8c7b-d8c8c95ba328-22525 (latest state: OPERATION_FINISHED, status update state: OPERATION_FINISHED)
...
E0627 22:03:17.365811 3529210 slave.cpp:4257] EXIT with status 1: Failed to sync checkpointed resources: Failed to create the persistent volume f9330006-d885-4ef0-b2c7-c9c6fcc239e5 at '/srv/mesos/work/volumes/roles/test-3/f9330006-d885-4ef0-b2c7-c9c6fcc239e5': Operation not permitted
{noformat}

The relevant code sequence is in Slave::applyOperation, and looks roughly like this:

{noformat}
track the new operation
checkpointResourceState()             (1)
apply the operation                   (2)
report that the operation was applied
checkpointResourceState()             (3)
{noformat}

The operation is checkpointed as pending in (1), but no resource changes are made yet. In (2), the operation is applied by making changes to the agent's resources. At (3), the checkpointed-resources discrepancy is discovered, and the agent tries to create the persistent volume and fails.

> Mesos did not respond correctly when operations should fail
> -----------------------------------------------------------
>
>                 Key: MESOS-9875
>                 URL: https://issues.apache.org/jira/browse/MESOS-9875
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Yifan Xing
>            Priority: Major
>
> For testing persistent volumes with `OPERATION_FAILED/ERROR` feedback, we
> sshed into the mesos-agent and made it unable to create subdirectories in
> /srv/mesos/work/volumes; however, Mesos did not return any operation-failed
> response. Instead, we received `OPERATION_FINISHED` feedback.
> Steps to recreate the issue:
> 1. Ssh into a magent.
> 2. Make it impossible to create a persistent volume (we expect the agent to
> crash and reregister, and the master to realize that the operation is
> `OPERATION_DROPPED`):
> * cd /srv/mesos/work (if it doesn't exist, mkdir /srv/mesos/work/volumes)
> * chattr -RV +i volumes (then no subdirectories can be created)
> 3. Launch a service with persistent volumes with the constraint of only using
> the magent modified above.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)