[jira] [Assigned] (MESOS-9753) Agent Draining
[ https://issues.apache.org/jira/browse/MESOS-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-9753: Assignee: Greg Mann > Agent Draining > -- > > Key: MESOS-9753 > URL: https://issues.apache.org/jira/browse/MESOS-9753 > Project: Mesos > Issue Type: Epic >Reporter: Greg Mann >Assignee: Greg Mann >Priority: Major > Labels: foundations, mesosphere > > This epic holds tickets related to maintenance primitive improvements which > facilitate draining of agent nodes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10192) Recent Nvidia CUDA changes break Mesos GPU support
Greg Mann created MESOS-10192: - Summary: Recent Nvidia CUDA changes break Mesos GPU support Key: MESOS-10192 URL: https://issues.apache.org/jira/browse/MESOS-10192 Project: Mesos Issue Type: Bug Components: agent, containerization, gpu Reporter: Greg Mann Recently it seems that the layout of the Nvidia device files has changed: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/ This prevents GPU tasks from launching: {noformat} W0929 17:27:21.002178 65691 http.cpp:3436] Failed to launch container c08e1fc7-53c4-427e-a1a1-85b770e77d69.738440a3-f4cc-42ce-8978-418ba0011160: Failed to copy device '/dev/nvidia-caps': Failed to get source dev: Not a special file: /dev/nvidia-caps {noformat} due to this code, which detects the nvidia device files: https://github.com/apache/mesos/blob/8700dd8d5ece658804d7b7a40863800dcc5c72bc/src/slave/containerizer/mesos/isolators/gpu/isolator.cpp#L438-L454 -- This message was sent by Atlassian Jira (v8.3.4#803005)
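The failing check apparently assumes everything matching `/dev/nvidia*` is a device node, while MIG-enabled drivers add the directory `/dev/nvidia-caps`. A minimal sketch of a stat-based check that skips non-device entries instead of failing on them (helper names here are illustrative, not the actual Mesos code):

```python
import os
import stat

def is_device_node(path):
    """Return True only for character or block device nodes."""
    mode = os.lstat(path).st_mode
    return stat.S_ISCHR(mode) or stat.S_ISBLK(mode)

def nvidia_device_nodes(dev_dir="/dev"):
    """Collect Nvidia device nodes, skipping directories such as
    /dev/nvidia-caps rather than raising a 'Not a special file' error."""
    nodes = []
    for name in os.listdir(dev_dir):
        if not name.startswith("nvidia"):
            continue
        path = os.path.join(dev_dir, name)
        if os.path.isdir(path):
            continue  # e.g. /dev/nvidia-caps on MIG-enabled drivers
        if is_device_node(path):
            nodes.append(path)
    return nodes
```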
[jira] [Assigned] (MESOS-10167) Mesos-websitebot fails due to wrong permissions of volumes mounted into Docker container
[ https://issues.apache.org/jira/browse/MESOS-10167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-10167: - Assignee: Vinod Kone > Mesos-websitebot fails due to wrong permissions of volumes mounted into > Docker container > > > Key: MESOS-10167 > URL: https://issues.apache.org/jira/browse/MESOS-10167 > Project: Mesos > Issue Type: Bug > Components: project website >Reporter: Andrei Sekretenko >Assignee: Vinod Kone >Priority: Minor > > Last successful run was on Apr 7: > https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Websitebot/2464/ > First failure: > https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Websitebot/2465/console > Build with added permissions dump > https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Websitebot/2525/console > shows that while the build scripts in the container are, as expected, running > under "tempuser" (with the same uid as the user outside the container that pulls > the git repositories), > the directories with git repositories mounted into the container are owned by > root: > {noformat} > 19:06:21 uid=910(tempuser) gid=1001(tempuser) groups=1001(tempuser) > 19:06:21 total 836 > 19:06:21 drwxr-xr-x 12 root root 4096 Jul 3 17:02 . > 19:06:21 drwxr-xr-x 1 root root 4096 Jul 3 17:04 .. 
> 19:06:21 drwxr-xr-x 6 root root 4096 Jun 29 14:12 3rdparty > 19:06:21 drwxr-xr-x 2 root root 4096 Apr 15 14:33 bin > 19:06:21 -rwxr-xr-x 1 root root 1294 Jul 3 17:02 bootstrap > 19:06:21 -rw-r--r-- 1 root root 536015 May 29 09:21 CHANGELOG > 19:06:21 drwxr-xr-x 2 root root 4096 May 29 11:30 cmake > 19:06:21 -rw-r--r-- 1 root root 3990 May 7 13:40 CMakeLists.txt > 19:06:21 -rw-r--r-- 1 root root 105737 May 7 13:40 configure.ac > 19:06:21 lrwxrwxrwx 1 root root 31 Apr 15 14:33 CONTRIBUTING.md -> > ./docs/beginner-contribution.md > 19:06:21 drwxr-xr-x 6 root root 4096 May 28 19:18 docs > 19:06:21 -rw-r--r-- 1 root root 63778 Apr 15 14:33 Doxyfile > 19:06:21 drwxr-xr-x 8 root root 4096 Jul 3 17:02 .git > 19:06:21 -rw-r--r-- 1 root root 99 Apr 15 14:33 .gitattributes > 19:06:21 drwxr-xr-x 3 root root 4096 Aug 27 2019 include > 19:06:21 -rw-r--r-- 1 root root 66156 Apr 15 14:33 LICENSE > 19:06:21 drwxr-xr-x 2 root root 4096 Apr 15 14:33 m4 > 19:06:21 -rw-r--r-- 1 root root 3842 Apr 15 14:33 Makefile.am > 19:06:21 -rw-r--r-- 1 root root 426 Apr 15 14:33 mesos.pc.in > 19:06:21 -rw-r--r-- 1 root root 162 Apr 15 14:33 NOTICE > 19:06:21 -rw-r--r-- 1 root root 1103 Apr 15 14:33 README.md > 19:06:21 drwxr-xr-x 5 root root 4096 Jul 3 17:04 site > 19:06:21 drwxr-xr-x 48 root root 4096 Jun 30 19:30 src > 19:06:21 drwxr-xr-x 9 root root 4096 Jul 3 17:02 support > 19:06:21 autoreconf: Entering directory `.' > 19:06:21 autoreconf: configure.ac: not using Gettext > 19:06:22 autoreconf: running: aclocal --warnings=all -I m4 > 19:06:23 autom4te: cannot create autom4te.cache: No such file or directory > {noformat} > Note that the Dockerfile specifies "USER root" > https://github.com/apache/mesos/blob/master/support/mesos-website/Dockerfile > and the permissions are dropped to the "tempuser" only inside the > entrypoint.sh script. -- This message was sent by Atlassian Jira (v8.3.4#803005)
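The mismatch above (build scripts running as tempuser while the bind-mounted checkout stays root-owned) could be caught before the build starts. A hedged preflight sketch, not part of the actual entrypoint.sh:

```python
import os

def mount_usable_by_current_user(path):
    """Check that a bind-mounted directory is owned by, or at least
    writable by, the uid the build runs under. A root-owned mount inside
    a container that drops privileges fails exactly like the autom4te
    'cannot create autom4te.cache' error in the log above."""
    st = os.stat(path)
    return st.st_uid == os.getuid() or os.access(path, os.W_OK)
```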
[jira] [Commented] (MESOS-10156) Enable the `volume/csi` isolator in UCR
[ https://issues.apache.org/jira/browse/MESOS-10156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190876#comment-17190876 ] Greg Mann commented on MESOS-10156: --- {noformat} commit a8059a78473774e3d95e8e908f360ee5e9aadd0d Author: Greg Mann Date: Fri Sep 4 10:39:10 2020 -0700 Added tests for 'volume/csi' isolator recovery. Review: https://reviews.apache.org/r/72806/ {noformat} > Enable the `volume/csi` isolator in UCR > --- > > Key: MESOS-10156 > URL: https://issues.apache.org/jira/browse/MESOS-10156 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > Fix For: 1.11.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10157) Add documentation for the `volume/csi` isolator
[ https://issues.apache.org/jira/browse/MESOS-10157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-10157: - Assignee: Greg Mann > Add documentation for the `volume/csi` isolator > --- > > Key: MESOS-10157 > URL: https://issues.apache.org/jira/browse/MESOS-10157 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Greg Mann >Priority: Major > Labels: docs, documentation > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10156) Enable the `volume/csi` isolator in UCR
[ https://issues.apache.org/jira/browse/MESOS-10156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190390#comment-17190390 ] Greg Mann commented on MESOS-10156: --- Adding test patches for the CSI isolator here: {noformat} commit a3fe939616fe13f34bd3555d613a0e1323730424 Author: Greg Mann Date: Thu Sep 3 12:06:31 2020 -0700 Updated the test CSI plugin for CSI server testing. This patch adds additional configuration flags to the test CSI plugin which are necessary in order to test the agent's CSI server. Review: https://reviews.apache.org/r/72727/ {noformat} {noformat} commit f0ce0f1d8601228f16efbb98420693af42b19d43 Author: Greg Mann Date: Thu Sep 3 12:06:34 2020 -0700 Added a test helper for CSI volumes. Review: https://reviews.apache.org/r/72805/ {noformat} {noformat} commit fc22984de558302029a8cad0655e375653208448 Author: Greg Mann Date: Thu Sep 3 12:06:38 2020 -0700 Added tests for the 'volume/csi' isolator. Review: https://reviews.apache.org/r/72728/ {noformat} > Enable the `volume/csi` isolator in UCR > --- > > Key: MESOS-10156 > URL: https://issues.apache.org/jira/browse/MESOS-10156 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > Fix For: 1.11.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-10156) Enable the `volume/csi` isolator in UCR
[ https://issues.apache.org/jira/browse/MESOS-10156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190390#comment-17190390 ] Greg Mann edited comment on MESOS-10156 at 9/3/20, 9:02 PM: Adding test patches for the CSI isolator here: {noformat} commit a3fe939616fe13f34bd3555d613a0e1323730424 Author: Greg Mann Date: Thu Sep 3 12:06:31 2020 -0700 Updated the test CSI plugin for CSI server testing. This patch adds additional configuration flags to the test CSI plugin which are necessary in order to test the agent's CSI server. Review: https://reviews.apache.org/r/72727/ {noformat} {noformat} commit f0ce0f1d8601228f16efbb98420693af42b19d43 Author: Greg Mann Date: Thu Sep 3 12:06:34 2020 -0700 Added a test helper for CSI volumes. Review: https://reviews.apache.org/r/72805/ {noformat} {noformat} commit fc22984de558302029a8cad0655e375653208448 Author: Greg Mann Date: Thu Sep 3 12:06:38 2020 -0700 Added tests for the 'volume/csi' isolator. Review: https://reviews.apache.org/r/72728/ {noformat} was (Author: greggomann): Adding test patches for the CSI isolator here: {noformat} commit a3fe939616fe13f34bd3555d613a0e1323730424 Author: Greg Mann Date: Thu Sep 3 12:06:31 2020 -0700 Updated the test CSI plugin for CSI server testing. This patch adds additional configuration flags to the test CSI plugin which are necessary in order to test the agent's CSI server. Review: https://reviews.apache.org/r/72727/ {noformat} {noformat} commit f0ce0f1d8601228f16efbb98420693af42b19d43 Author: Greg Mann Date: Thu Sep 3 12:06:34 2020 -0700 Added a test helper for CSI volumes. Review: https://reviews.apache.org/r/72805/ {noformat} commit fc22984de558302029a8cad0655e375653208448 Author: Greg Mann Date: Thu Sep 3 12:06:38 2020 -0700 Added tests for the 'volume/csi' isolator. 
Review: https://reviews.apache.org/r/72728/ {noformat} > Enable the `volume/csi` isolator in UCR > --- > > Key: MESOS-10156 > URL: https://issues.apache.org/jira/browse/MESOS-10156 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > Fix For: 1.11.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10163) Implement a new component to launch CSI plugins as standalone containers and make CSI gRPC calls
[ https://issues.apache.org/jira/browse/MESOS-10163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182170#comment-17182170 ] Greg Mann commented on MESOS-10163: --- {noformat} commit 68b481085fb82b475e108b9aa39935a8d7729983 Author: Greg Mann Date: Thu Aug 20 19:26:48 2020 -0700 Fixed a bug in CSI volume manager initialization. Previously, the volume managers would assume that they could make CONTROLLER_SERVICE calls during plugin initialization, regardless of whether or not the plugin provides that service. Review: https://reviews.apache.org/r/72726/ {noformat} {noformat} commit 5ed30db48785007e35805886a024ebb8a61a7037 Author: Greg Mann Date: Thu Aug 20 19:27:02 2020 -0700 Added the CSI server to the Mesos agent. This patch adds a CSI server to the Mesos agent in both the agent binary and in tests. Review: https://reviews.apache.org/r/72761/ {noformat} {noformat} commit 4ff51041df860dbcc2247ef47a0596e5132da190 Author: Greg Mann g...@mesosphere.io Date: Thu Aug 20 19:27:23 2020 -0700 Initialized plugins lazily in the CSI server. Review: https://reviews.apache.org/r/72779/ {noformat} > Implement a new component to launch CSI plugins as standalone containers and > make CSI gRPC calls > > > Key: MESOS-10163 > URL: https://issues.apache.org/jira/browse/MESOS-10163 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Greg Mann >Priority: Major > > *Background:* > Originally we want `volume/csi` isolator to leverage the existing [service > manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L51] > to launch CSI plugins as standalone containers and currently service manager > needs to call the following agent HTTP APIs: > # `GET_CONTAINERS` to get all standalone containers in its `recover` method. > # `KILL_CONTAINER` and `WAIT_CONTAINER` to kill the outdated standalone > containers in its `recover` method. 
> # `LAUNCH_CONTAINER` via the existing > [ContainerDaemon|https://github.com/apache/mesos/blob/1.10.0/src/slave/container_daemon.hpp#L41:L46] > to launch CSI plugin as standalone container when its `getEndpoint` method > is called. > The problem with the above design is that the `volume/csi` isolator may need to clean > up orphan containers during agent recovery, which is triggered by the containerizer > (see > [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/mesos/containerizer.cpp#L1272:L1275] > for details). To clean up an orphan container which is using a CSI volume, the > `volume/csi` isolator needs to instantiate and recover the service manager > and get the CSI plugin’s endpoint from it (i.e., the service manager’s `getEndpoint` > method will be called by the `volume/csi` isolator during agent recovery). And as > mentioned above, the service manager’s `getEndpoint` may need to call > `LAUNCH_CONTAINER` to launch the CSI plugin as a standalone container; since the agent > is still in a recovering state, such an agent HTTP call will just be rejected by the > agent. So we have to instantiate and recover the service manager *after agent > recovery is done*, but in the `volume/csi` isolator we do not have that > information (i.e. the signal that agent recovery is done). 
> * Since this new component relies on the service manager, which will call agent > HTTP APIs, we need to pass the agent URL to it, like `process::http::URL(scheme, > agentIP, agentPort, agentLibprocessId + "/api/v1")`, see > [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L459:L471] > for an example. > * When the agent registers/reregisters with the master (`Slave::registered` and > `Slave::reregistered`), we should call this new component’s `start` method > (see > [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1740:L1742] > and > [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1825:L1827] > as examples), which will scan the directory `--csi_plugin_config_dir` and > create the `service manager - volume manager` pair for each CSI plugin loaded > from that directory. > * For the `volume/csi` isolator, it needs to call this new component’s > `publishVolume` and `unpublishVolume` methods in its `prepare` and `cleanup` > methods. > In the case of cleaning up orphan containers during agent recovery, the `volume/csi` > isolator will just call this new component’s `unpublishVolume` method as > usual, and it is this new component’s responsibility to only make the actual > CSI gRPC call after agent recovery is done and the agent has registered with the > master (e.g., when this new component’s start method is called). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-9609) Master check failure when marking agent unreachable
[ https://issues.apache.org/jira/browse/MESOS-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180677#comment-17180677 ] Greg Mann commented on MESOS-9609: -- Hi [~arostami], my apologies for the delay. This came up again recently; I understand it's a serious bug, so we'll start working on a fix soon and will update here. > Master check failure when marking agent unreachable > --- > > Key: MESOS-9609 > URL: https://issues.apache.org/jira/browse/MESOS-9609 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.5.0 >Reporter: Greg Mann >Priority: Critical > Labels: foundations, mesosphere > > {code} > Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815433 13 > http.cpp:1185] HTTP POST for /master/api/v1/scheduler from 10.142.0.5:55133 > Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815588 13 > master.cpp:5467] Processing DECLINE call for offers: [ > 5e57f633-a69c-4009-b773-990b4b8984ad-O58323 ] for framework > 5e57f633-a69c-4009-b7 > Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815693 13 > master.cpp:10703] Removing offer 5e57f633-a69c-4009-b773-990b4b8984ad-O58323 > Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820142 10 > master.cpp:8227] Marking agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at > slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engi > Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820367 10 > registrar.cpp:495] Applied 1 operations in 86528ns; attempting to update the > registry > Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820572 10 > registrar.cpp:552] Successfully updated the registry in 175872ns > Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820642 11 > master.cpp:8275] Marked agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at > slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engin > Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820957 9 > hierarchical.cpp:609] Removed agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 > Mar 11 10:04:35 research 
docker[4503]: F0311 10:04:35.851961 11 > master.cpp:10018] Check failed: 'framework' Must be non NULL > Mar 11 10:04:35 research docker[4503]: *** Check failure stack trace: *** > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044a7d > google::LogMessage::Fail() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6046830 > google::LogMessage::SendToLog() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044663 > google::LogMessage::Flush() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6047259 > google::LogMessageFatal::~LogMessageFatal() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5258e14 > google::CheckNotNull<>() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521dfc8 > mesos::internal::master::Master::__removeSlave() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521f1a2 > mesos::internal::master::Master::_markUnreachable() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5f98f11 > process::ProcessBase::consume() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb2a4a > process::ProcessManager::resume() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb65d6 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c35d4c80 (unknown) > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2de76ba start_thread > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2b1d41d (unknown) > Mar 11 10:04:36 research docker[4503]: *** Aborted at 1520762676 (unix time) > try "date -d @1520762676" if you are using GNU date *** > Mar 11 10:04:36 research docker[4503]: PC: @ 0x7f96c2a4d196 (unknown) > Mar 11 10:04:36 research docker[4503]: *** SIGSEGV (@0x0) received by PID 1 > (TID 0x7f96b986d700) from PID 0; stack trace: *** > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2df1390 (unknown) > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2a4d196 (unknown) > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c604ce2c > google::DumpStackTraceAndExit() > Mar 11 10:04:36 research 
docker[4503]: @ 0x7f96c6044a7d > google::LogMessage::Fail() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6046830 > google::LogMessage::SendToLog() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044663 > google::LogMessage::Flush() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6047259 > google::LogMessageFatal::~LogMessageFatal() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5258e14 > google::CheckNotNull<>() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521dfc8 > mesos::internal::master::Master::__removeSlave() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521f1a2 >
[jira] [Commented] (MESOS-10163) Implement a new component to launch CSI plugins as standalone containers and make CSI gRPC calls
[ https://issues.apache.org/jira/browse/MESOS-10163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175226#comment-17175226 ] Greg Mann commented on MESOS-10163: --- {noformat} commit fe0cd02a0697a4c4fcf5087fcafd6729beec0b41 (HEAD -> master, origin/master, origin/HEAD, merge) Author: Greg Mann Date: Mon Aug 10 20:11:50 2020 -0700 Added implementation of the CSI server. Review: https://reviews.apache.org/r/72716/ {noformat} > Implement a new component to launch CSI plugins as standalone containers and > make CSI gRPC calls > > > Key: MESOS-10163 > URL: https://issues.apache.org/jira/browse/MESOS-10163 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Greg Mann >Priority: Major > > *Background:* > Originally we want `volume/csi` isolator to leverage the existing [service > manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L51] > to launch CSI plugins as standalone containers and currently service manager > needs to call the following agent HTTP APIs: > # `GET_CONTAINERS` to get all standalone containers in its `recover` method. > # `KILL_CONTAINER` and `WAIT_CONTAINER` to kill the outdated standalone > containers in its `recover` method. > # `LAUNCH_CONTAINER` via the existing > [ContainerDaemon|https://github.com/apache/mesos/blob/1.10.0/src/slave/container_daemon.hpp#L41:L46] > to launch CSI plugin as standalone container when its `getEndpoint` method > is called. 
> The problem with the above design is that the `volume/csi` isolator may need to clean > up orphan containers during agent recovery, which is triggered by the containerizer > (see > [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/mesos/containerizer.cpp#L1272:L1275] > for details). To clean up an orphan container which is using a CSI volume, the > `volume/csi` isolator needs to instantiate and recover the service manager > and get the CSI plugin’s endpoint from it (i.e., the service manager’s `getEndpoint` > method will be called by the `volume/csi` isolator during agent recovery). And as > mentioned above, the service manager’s `getEndpoint` may need to call > `LAUNCH_CONTAINER` to launch the CSI plugin as a standalone container; since the agent > is still in a recovering state, such an agent HTTP call will just be rejected by the > agent. So we have to instantiate and recover the service manager *after agent > recovery is done*, but in the `volume/csi` isolator we do not have that > information (i.e. the signal that agent recovery is done). > *Solution* > We need to implement a new component (like `CSIVolumeManager` or a better > name?) in the Mesos agent which is responsible for launching CSI plugins as > standalone containers (via the existing [service > manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L51]) > and making CSI gRPC calls (via the existing [volume > manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/volume_manager.hpp#L55:L56]). > * We can instantiate this new component in the `main` method of the agent and > pass it to both the containerizer and the agent (i.e. it will be a member of the > `Slave` object), and the containerizer will in turn pass it to the `volume/csi` > isolator. 
> * Since this new component relies on the service manager, which will call agent > HTTP APIs, we need to pass the agent URL to it, like `process::http::URL(scheme, > agentIP, agentPort, agentLibprocessId + "/api/v1")`, see > [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L459:L471] > for an example. > * When the agent registers/reregisters with the master (`Slave::registered` and > `Slave::reregistered`), we should call this new component’s `start` method > (see > [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1740:L1742] > and > [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1825:L1827] > as examples), which will scan the directory `--csi_plugin_config_dir` and > create the `service manager - volume manager` pair for each CSI plugin loaded > from that directory. > * For the `volume/csi` isolator, it needs to call this new component’s > `publishVolume` and `unpublishVolume` methods in its `prepare` and `cleanup` > methods. > In the case of cleaning up orphan containers during agent recovery, the `volume/csi` > isolator will just call this new component’s `unpublishVolume` method as > usual, and it is this new component’s responsibility to only make the actual > CSI gRPC call after agent recovery is done and the agent has registered with the > master (e.g., when this new component’s start method is called). -- This message was sent by Atlassian Jira (v8.3.4#803005)
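The key idea in the design above — accepting volume calls at any time but only issuing the actual CSI gRPC calls once the agent has (re)registered — can be sketched as a component that queues requests until `start` is invoked. Names here (`CsiServer`, `publish_volume`) are illustrative, not the actual Mesos API:

```python
from collections import deque

class CsiServer:
    """Sketch of a component that defers CSI calls until the agent has
    registered with the master (the point where start() is called)."""

    def __init__(self, make_call):
        self._make_call = make_call  # stand-in for the real gRPC call
        self._started = False
        self._pending = deque()

    def publish_volume(self, volume_id):
        if self._started:
            self._make_call("publish", volume_id)
        else:
            self._pending.append(("publish", volume_id))

    def unpublish_volume(self, volume_id):
        # During recovery the isolator can call this as usual; the
        # request is simply queued until the agent is registered.
        if self._started:
            self._make_call("unpublish", volume_id)
        else:
            self._pending.append(("unpublish", volume_id))

    def start(self):
        """Would be called from Slave::registered / Slave::reregistered."""
        self._started = True
        while self._pending:
            self._make_call(*self._pending.popleft())
```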
[jira] [Assigned] (MESOS-10168) Add secrets support to the CSI service and volume managers
[ https://issues.apache.org/jira/browse/MESOS-10168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-10168: - Assignee: Greg Mann > Add secrets support to the CSI service and volume managers > -- > > Key: MESOS-10168 > URL: https://issues.apache.org/jira/browse/MESOS-10168 > Project: Mesos > Issue Type: Task >Reporter: Greg Mann >Assignee: Greg Mann >Priority: Major > Labels: csi > > We must update our CSI code to pass secrets to CSI drivers when > staging/unstaging and publishing/unpublishing volumes. We must ensure that we > avoid writing any secrets to disk by holding a secret resolver in the > appropriate component to resolve secrets associated with already-attached > volumes during/after recovery. -- This message was sent by Atlassian Jira (v8.3.4#803005)
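Keeping secrets off disk while still supporting recovery can be approximated by checkpointing only a secret *reference* and re-resolving the value in memory through a held resolver. A hedged sketch; the class and method names are hypothetical, not the actual Mesos interfaces:

```python
class SecretResolver:
    """Stand-in for the agent's secret resolver; resolved values live
    only in memory, never in checkpointed state."""
    def __init__(self, store):
        self._store = store

    def resolve(self, reference):
        return self._store[reference]

class CsiVolumeState:
    """Checkpointed volume state holds only the secret reference; the
    value is re-resolved during/after recovery via the resolver."""
    def __init__(self, volume_id, secret_ref):
        self.volume_id = volume_id
        self.secret_ref = secret_ref

    def node_stage_secrets(self, resolver):
        # Resolved just-in-time for the staging/publishing call.
        return {"stage-secret": resolver.resolve(self.secret_ref)}
```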
[jira] [Assigned] (MESOS-10156) Enable the `volume/csi` isolator in UCR
[ https://issues.apache.org/jira/browse/MESOS-10156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-10156: - Assignee: (was: Greg Mann) > Enable the `volume/csi` isolator in UCR > --- > > Key: MESOS-10156 > URL: https://issues.apache.org/jira/browse/MESOS-10156 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10156) Enable the `volume/csi` isolator in UCR
[ https://issues.apache.org/jira/browse/MESOS-10156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-10156: - Assignee: Greg Mann > Enable the `volume/csi` isolator in UCR > --- > > Key: MESOS-10156 > URL: https://issues.apache.org/jira/browse/MESOS-10156 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Greg Mann >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10163) Implement a new component to launch CSI plugins as standalone containers and make CSI gRPC calls
[ https://issues.apache.org/jira/browse/MESOS-10163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170261#comment-17170261 ] Greg Mann commented on MESOS-10163: --- {noformat} commit c78dc333fc893a43d40dc33299a61987198a6ea9 (HEAD -> master, origin/master, origin/HEAD) Author: Greg Mann Date: Mon Aug 3 10:11:57 2020 -0700 Added interface for the CSI server. This component will hold objects associated with CSI plugins running on the agent. Review: https://reviews.apache.org/r/72707/ {noformat} > Implement a new component to launch CSI plugins as standalone containers and > make CSI gRPC calls > > > Key: MESOS-10163 > URL: https://issues.apache.org/jira/browse/MESOS-10163 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Greg Mann >Priority: Major > > *Background:* > Originally we want `volume/csi` isolator to leverage the existing [service > manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L51] > to launch CSI plugins as standalone containers and currently service manager > needs to call the following agent HTTP APIs: > # `GET_CONTAINERS` to get all standalone containers in its `recover` method. > # `KILL_CONTAINER` and `WAIT_CONTAINER` to kill the outdated standalone > containers in its `recover` method. > # `LAUNCH_CONTAINER` via the existing > [ContainerDaemon|https://github.com/apache/mesos/blob/1.10.0/src/slave/container_daemon.hpp#L41:L46] > to launch CSI plugin as standalone container when its `getEndpoint` method > is called. 
> The problem with the above design is that the `volume/csi` isolator may need to clean > up orphan containers during agent recovery, which is triggered by the containerizer > (see > [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/mesos/containerizer.cpp#L1272:L1275] > for details). To clean up an orphan container which is using a CSI volume, the > `volume/csi` isolator needs to instantiate and recover the service manager > and get the CSI plugin’s endpoint from it (i.e., the service manager’s `getEndpoint` > method will be called by the `volume/csi` isolator during agent recovery). And as > mentioned above, the service manager’s `getEndpoint` may need to call > `LAUNCH_CONTAINER` to launch the CSI plugin as a standalone container; since the agent > is still in a recovering state, such an agent HTTP call will just be rejected by the > agent. So we have to instantiate and recover the service manager *after agent > recovery is done*, but in the `volume/csi` isolator we do not have that > information (i.e. the signal that agent recovery is done). > *Solution* > We need to implement a new component (like `CSIVolumeManager` or a better > name?) in the Mesos agent which is responsible for launching CSI plugins as > standalone containers (via the existing [service > manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L51]) > and making CSI gRPC calls (via the existing [volume > manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/volume_manager.hpp#L55:L56]). > * We can instantiate this new component in the `main` method of the agent and > pass it to both the containerizer and the agent (i.e. it will be a member of the > `Slave` object), and the containerizer will in turn pass it to the `volume/csi` > isolator. 
> * Since this new component relies on the service manager, which will call agent > HTTP APIs, we need to pass the agent URL to it, like `process::http::URL(scheme, > agentIP, agentPort, agentLibprocessId + "/api/v1")`, see > [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L459:L471] > for an example. > * When the agent registers/reregisters with the master (`Slave::registered` and > `Slave::reregistered`), we should call this new component’s `start` method > (see > [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1740:L1742] > and > [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1825:L1827] > as examples), which will scan the directory `--csi_plugin_config_dir` and > create the `service manager - volume manager` pair for each CSI plugin loaded > from that directory. > * For the `volume/csi` isolator, it needs to call this new component’s > `publishVolume` and `unpublishVolume` methods in its `prepare` and `cleanup` > methods. > In the case of cleaning up orphan containers during agent recovery, the `volume/csi` > isolator will just call this new component’s `unpublishVolume` method as > usual, and it is this new component’s responsibility to only make the actual > CSI gRPC call after agent recovery is done and the agent has registered with the > master (e.g., when this new component’s start method is called). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10168) Add secrets support to the CSI service and volume managers
Greg Mann created MESOS-10168: - Summary: Add secrets support to the CSI service and volume managers Key: MESOS-10168 URL: https://issues.apache.org/jira/browse/MESOS-10168 Project: Mesos Issue Type: Task Reporter: Greg Mann We must update our CSI code to pass secrets to CSI drivers when staging/unstaging and publishing/unpublishing volumes. We must ensure that we avoid writing any secrets to disk by holding a secret resolver in the appropriate component to resolve secrets associated with already-attached volumes during/after recovery. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10150) Refactor CSI volume manager to support pre-provisioned CSI volumes
[ https://issues.apache.org/jira/browse/MESOS-10150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-10150: - Assignee: Greg Mann > Refactor CSI volume manager to support pre-provisioned CSI volumes > -- > > Key: MESOS-10150 > URL: https://issues.apache.org/jira/browse/MESOS-10150 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Greg Mann >Priority: Major > > The existing > [VolumeManager|https://github.com/apache/mesos/blob/1.10.0/src/csi/volume_manager.hpp#L55:L138] > is like a wrapper for various CSI gRPC calls; we could consider leveraging > it to call CSI plugins rather than making raw CSI gRPC calls in the `volume/csi` > isolator. But there is a problem: the lifecycle of the volumes managed by > VolumeManager starts from the > `[createVolume|https://github.com/apache/mesos/blob/1.10.0/src/csi/v1_volume_manager.cpp#L281:L329]` > CSI call, while what we plan to support in the MVP is pre-provisioned volumes, so > we need to refactor VolumeManager to support pre-provisioned > volumes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
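Supporting pre-provisioned volumes essentially means letting a volume enter the manager's lifecycle at a point other than `createVolume`. A toy sketch of that refactor idea (state names and method names are illustrative, not the actual VolumeManager interface):

```python
class VolumeManager:
    """Tracks volume lifecycle state; pre-provisioned volumes are
    admitted directly into the state machine, skipping createVolume."""

    def __init__(self):
        self._volumes = {}

    def create_volume(self, volume_id):
        # Normal path: the manager provisions the volume itself via the
        # CSI CreateVolume call.
        self._volumes[volume_id] = "CREATED"

    def admit_preprovisioned(self, volume_id):
        # Refactored path: an externally provisioned volume enters the
        # same lifecycle without a CreateVolume call.
        self._volumes[volume_id] = "CREATED"

    def publish(self, volume_id):
        # Both kinds of volume are published identically from here on.
        if self._volumes.get(volume_id) != "CREATED":
            raise ValueError("volume not in CREATED state")
        self._volumes[volume_id] = "PUBLISHED"
```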
[jira] [Commented] (MESOS-10140) CMake Error: Problem with archive_read_open_file(): Unrecognized archive format
[ https://issues.apache.org/jira/browse/MESOS-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152815#comment-17152815 ] Greg Mann commented on MESOS-10140: --- [~QuellaZhang] could you try building again on the latest master branch of Mesos? We believe the issue should be fixed now. If so, please close out this ticket, otherwise let us know. Thanks! > CMake Error: Problem with archive_read_open_file(): Unrecognized archive > format > --- > > Key: MESOS-10140 > URL: https://issues.apache.org/jira/browse/MESOS-10140 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: QuellaZhang >Priority: Major > Labels: windows > Attachments: mesos_build.log > > > Hi All, > We tried to build Mesos on Windows with VS2019. It failed to build due to > "CUSTOMBUILD : CMake error : Problem with archive_read_open_file(): > Unrecognized archive format > [F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP.vcxproj]" on Windows > using MSVC. It can be reproduced on the latest revision d4634f4 on the master > branch. Could you help confirm? We use cmake version 3.17.2. > > Reproduce steps: > 1. git clone -c core.autocrlf=true [https://github.com/apache/mesos] > F:\gitP\apache\mesos > 2. Open a VS 2019 x64 command prompt as admin and browse to > F:\gitP\apache\mesos > 3. mkdir build_amd64 && pushd build_amd64 > 4. cmake -G "Visual Studio 16 2019" -A x64 > -DCMAKE_SYSTEM_VERSION=10.0.18362.0 -DENABLE_LIBEVENT=1 > -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="F:\tools\gnuwin32\bin" -T host=x64 .. > 5. set _CL_=/D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING > 6. msbuild /maxcpucount:4 /p:Platform=x64 /p:Configuration=Debug Mesos.sln > /t:Rebuild > > ErrorMessage: > *manual run:* > F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP\src>cmake --version > cmake version 3.17.2 > CMake suite maintained and supported by Kitware (kitware.com/cmake). 
> F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP\src>cmake -E tar xjf > archive.tar > CMake Error: Problem with archive_read_open_file(): Unrecognized archive > format > CMake Error: Problem extracting tar: archive.tar > *build log: (see attachment)* > 59>CUSTOMBUILD : CMake error : Problem with archive_read_open_file(): > Unrecognized archive format > [F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP.vcxproj] > 59>CUSTOMBUILD : CMake error : Problem extracting tar: > F:/gitP/apache/mesos/build_amd64/3rdparty/wclayer-WIP/src/archive.tar > [F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP.vcxproj] > – extracting... [error clean up] > CMake Error at wclayer-WIP-stamp/extract-wclayer-WIP.cmake:33 (message): > 59>CUSTOMBUILD : error : extract of > [F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP.vcxproj] > 'F:/gitP/apache/mesos/build_amd64/3rdparty/wclayer-WIP/src/archive.tar' > failed -- This message was sent by Atlassian Jira (v8.3.4#803005)
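The `j` flag in `cmake -E tar xjf` implies a bzip2-compressed tarball, so "Unrecognized archive format" usually means the downloaded `archive.tar` is something else entirely (often an HTML error page from a failed download). One way to check, independent of the Mesos build itself, is to sniff the file's leading magic bytes; this diagnostic helper is an illustration, not part of the build:

```python
def sniff_archive(header: bytes) -> str:
    """Best-effort identification of an archive from its leading bytes;
    pass at least the first 265 bytes of the file."""
    if header.startswith(b'BZh'):
        return 'bzip2'
    if header.startswith(b'\x1f\x8b'):
        return 'gzip'
    if header.startswith(b'\xfd7zXZ\x00'):
        return 'xz'
    if header.startswith(b'PK\x03\x04'):
        return 'zip'
    # Uncompressed POSIX tar carries "ustar" at offset 257.
    if len(header) >= 262 and header[257:262] == b'ustar':
        return 'tar'
    if header.lstrip()[:1] == b'<':
        return 'html/xml (likely an error page, not an archive)'
    return 'unknown'
```

Usage: `with open('archive.tar', 'rb') as f: print(sniff_archive(f.read(265)))` tells you whether the file really is what the `j` flag expects.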
[jira] [Commented] (MESOS-10143) Outstanding Offers accumulating
[ https://issues.apache.org/jira/browse/MESOS-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152811#comment-17152811 ] Greg Mann commented on MESOS-10143: --- [~puneetku287] it's unclear to me from the description if this is an issue in Mesos or in your scheduler. A more precise description of the framework's behavior during the incidents would help - what does the scheduler do with the offers during this time? Feel free to find us on Mesos Slack, that might be an easier place to have a synchronous discussion about your issue. > Outstanding Offers accumulating > --- > > Key: MESOS-10143 > URL: https://issues.apache.org/jira/browse/MESOS-10143 > Project: Mesos > Issue Type: Bug > Components: master, scheduler driver >Affects Versions: 1.7.0 > Environment: Mesos Version 1.7.0 > JDK 8.0 >Reporter: Puneet Kumar >Priority: Minor > > We manage an Apache Mesos cluster version 1.7.0. We have written a framework > in Java that schedules tasks to Mesos master at a rate of 300 TPS. Everything > works fine for almost 24 hours but then outstanding offers accumulate & > saturate within 15 minutes. Outstanding offers aren't reclaimed by Mesos > master. We observe "RescindOffer" messages in verbose (GLOG v=3) framework > logs but outstanding offers don't reduce. New resources aren't offered to > framework when outstanding offers saturate. We have to restart the scheduler > to reset outstanding offers to zero. > Any suggestions to debug this issue are welcome. -- This message was sent by Atlassian Jira (v8.3.4#803005)
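Whatever the root cause turns out to be here, a common source of accumulating outstanding offers is a scheduler that neither accepts nor declines them, since the master holds an offer out until the framework responds or the offer is rescinded. The expected scheduler behavior can be sketched like this (a toy model with a hypothetical `driver` object, not the actual Java scheduler driver API):

```python
def resource_offers(driver, offers, pending_tasks):
    """Launch what we can, and explicitly decline everything else so
    the master can re-offer those resources instead of counting them
    as outstanding indefinitely."""
    for offer in offers:
        task = pending_tasks.pop() if pending_tasks else None
        if task is not None:
            driver.launch_tasks(offer.id, [task])
        else:
            # Declining (optionally with a refuse_seconds filter) is
            # what keeps offers from piling up on the master.
            driver.decline_offer(offer.id)
```

A scheduler that simply ignores offers it cannot use will see exactly the "outstanding offers saturate" behavior described in this ticket.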
[jira] [Commented] (MESOS-10146) Removing task from slave when framework is disconnected causes master to crash
[ https://issues.apache.org/jira/browse/MESOS-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152809#comment-17152809 ] Greg Mann commented on MESOS-10146: --- [~sunshine123] thank you for the bug report! Would it be possible to get a full verbose master log from an incident? The logs surrounding the check failure may help us pinpoint the issue more precisely. > Removing task from slave when framework is disconnected causes master to crash > -- > > Key: MESOS-10146 > URL: https://issues.apache.org/jira/browse/MESOS-10146 > Project: Mesos > Issue Type: Bug > Components: c++ api, framework >Affects Versions: 1.9.0 > Environment: Mesos master with three master nodes >Reporter: Naveen >Priority: Major > > Hello, > we want to report an issue we observed when removing tasks from a slave. > There is a condition that checks for a valid framework before tasks can be removed. > There are several reasons a framework can be disconnected. This check fails > and crashes the Mesos master node. > [https://github.com/apache/mesos/blob/1.9.0/src/master/master.cpp#L11842] > There is also unguarded access to the internal framework state on line 11853. 
> Error logs - > {noformat} > mesos-master[5483]: I0618 14:05:20.859189 5491 master.cpp:9512] Marked agent > 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 (10.160.73.79) unreachable: health > check timed out > mesos-master[5483]: F0618 14:05:20.859347 5491 master.cpp:11842] Check > failed: framework != nullptr Framework > 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-0067 not found while removing agent > 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 at slave(1)@10.160.73.79:5051 > (10.160.73.79); agent tasks: { 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-0067: { } > } > mesos-master[5483]: *** Check failure stack trace: *** > mesos-master[5483]: I0618 14:05:20.859781 5490 hierarchical.cpp:1013] Removed > all filters for agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 > mesos-master[5483]: I0618 14:05:20.872217 5490 hierarchical.cpp:890] Removed > agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 > mesos-master[5483]: I0618 14:05:20.859922 5487 replica.cpp:695] Replica > received learned notice for position 42070 from > log-network(1)@10.160.73.212:5050 > mesos-master[5483]: @ 0x7f2fdf6a5b1d google::LogMessage::Fail() > mesos-master[5483]: @ 0x7f2fdf6a7dfd google::LogMessage::SendToLog() > mesos-master[5483]: @ 0x7f2fdf6a56ab google::LogMessage::Flush() > mesos-master[5483]: @ 0x7f2fdf6a8859 > google::LogMessageFatal::~LogMessageFatal() > mesos-master[5483]: @ 0x7f2fde2677f2 > mesos::internal::master::Master::__removeSlave() > mesos-master[5483]: @ 0x7f2fde267ebe > mesos::internal::master::Master::_markUnreachable() > mesos-master[5483]: @ 0x7f2fde268215 > _ZNO6lambda12CallableOnceIFN7process6FutureIbEEvEE10CallableFnINS_8internal7PartialIZN5mesos8internal6master6Master15markUnreachableERKNS9_9SlaveInfoEbRKSsEUlbE_JbclEv > mesos-master[5483]: @ 0x7f2fddf30688 > 
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureIbEEEclINS0_IFSC_vESC_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseIbEESt14default_deleteISO_EEOSG_S3_E_ISR_SG_St12_PlaceholderILi1EEclEOS3_ > mesos-master[5483]: @ 0x7f2fdf5e3b91 process::ProcessBase::consume() > mesos-master[5483]: @ 0x7f2fdf608f77 process::ProcessManager::resume() > mesos-master[5483]: @ 0x7f2fdf60cb36 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > mesos-master[5483]: @ 0x7f2fdf8c34d0 execute_native_thread_routine > mesos-master[5483]: @ 0x7f2fdba02ea5 start_thread > mesos-master[5483]: @ 0x7f2fdb20e8dd __clone > systemd[1]: mesos-master.service: main process exited, code=killed, > status=6/ABRT > systemd[1]: Unit mesos-master.service entered failed state. > systemd[1]: mesos-master.service failed. > systemd[1]: mesos-master.service holdoff time over, scheduling restart. > systemd[1]: Stopped Mesos Master. > systemd[1]: Started Mesos Master. > mesos-master[28757]: I0618 14:05:41.461403 28748 logging.cpp:201] INFO level > logging started! > mesos-master[28757]: I0618 14:05:41.461712 28748 main.cpp:243] Build: > 2020-05-09 10:42:00 by centos > mesos-master[28757]: I0618 14:05:41.461721 28748 main.cpp:244] Version: 1.9.0 > mesos-master[28757]: I0618 14:05:41.461726 28748 main.cpp:247] Git tag: 1.9.0 > mesos-master[28757]: I0618 14:05:41.461730 28748 main.cpp:251] Git SHA: > 5e79a584e6ec3e9e2f96e8bf418411df9dafac2e{noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
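The check failure above fires because task removal assumes every framework referenced by the agent's task map is still tracked by the master. The general defensive pattern, sketched abstractly here rather than as the actual `master.cpp` fix, is to tolerate a missing framework instead of treating it as a fatal invariant:

```python
def remove_agent_tasks(frameworks, agent_tasks, log):
    """Remove an agent's tasks, tolerating frameworks that are no
    longer tracked (e.g. disconnected and already removed) instead of
    aborting on the missing entry."""
    for framework_id, tasks in agent_tasks.items():
        framework = frameworks.get(framework_id)
        if framework is None:
            # The crashing path instead asserted framework != nullptr;
            # a framework may legitimately be gone while its (possibly
            # empty) task map lingers on the agent record.
            log.append(f'skipping unknown framework {framework_id}')
            continue
        for task_id in tasks:
            framework['tasks'].discard(task_id)
```

Whether skipping is the right semantics for Mesos is exactly what the requested master logs would help determine; this only illustrates the shape of a non-crashing alternative.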
[jira] [Commented] (MESOS-9271) DockerContainerizerHealthCheckTest.ROOT_DOCKER_USERNETWORK_NETNAMESPACE_HealthyTaskViaHTTP is flaky
[ https://issues.apache.org/jira/browse/MESOS-9271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142521#comment-17142521 ] Greg Mann commented on MESOS-9271: -- Observed again - attached another log from internal CI. CentOS 7, cmake build, no libevent and no SSL. > DockerContainerizerHealthCheckTest.ROOT_DOCKER_USERNETWORK_NETNAMESPACE_HealthyTaskViaHTTP > is flaky > --- > > Key: MESOS-9271 > URL: https://issues.apache.org/jira/browse/MESOS-9271 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Priority: Major > Labels: flaky-test > > Observed in an internal CI run (4498): > {noformat} > ../../src/tests/health_check_tests.cpp:2080 > Failed to wait 15secs for statusHealthy > {noformat} > Full log: > {noformat} > [ RUN ] > NetworkProtocol/DockerContainerizerHealthCheckTest.ROOT_DOCKER_USERNETWORK_NETNAMESPACE_HealthyTaskViaHTTP/1 > I0927 00:57:43.336710 27845 docker.cpp:1659] Running docker -H > unix:///var/run/docker.sock inspect zhq527725/https-server:latest > I0927 00:57:43.340283 27845 docker.cpp:1659] Running docker -H > unix:///var/run/docker.sock inspect alpine:latest > I0927 00:57:43.343433 27845 docker.cpp:1659] Running docker -H > unix:///var/run/docker.sock inspect alpine:latest > I0927 00:57:43.857142 27845 cluster.cpp:173] Creating default 'local' > authorizer > I0927 00:57:43.858705 19628 master.cpp:413] Master > f9e9ac63-826d-4d08-b216-c5f352afc25d (ip-172-16-10-217.ec2.internal) started > on 172.16.10.217:32836 > I0927 00:57:43.858727 19628 master.cpp:416] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="hierarchical" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/QIaitl/credentials" 
--filter_gpu_resources="true" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" --http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" > --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" > --version="false" --webui_dir="/usr/local/share/mesos/webui" > --work_dir="/tmp/QIaitl/master" --zk_session_timeout="10secs" > I0927 00:57:43.858912 19628 master.cpp:465] Master only allowing > authenticated frameworks to register > I0927 00:57:43.858942 19628 master.cpp:471] Master only allowing > authenticated agents to register > I0927 00:57:43.858948 19628 master.cpp:477] Master only allowing > authenticated HTTP frameworks to register > I0927 00:57:43.858955 19628 credentials.hpp:37] Loading credentials for > authentication from '/tmp/QIaitl/credentials' > I0927 00:57:43.859072 19628 master.cpp:521] Using default 'crammd5' > authenticator > I0927 00:57:43.859141 19628 http.cpp:1037] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I0927 00:57:43.859200 19628 http.cpp:1037] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I0927 00:57:43.859246 19628 http.cpp:1037] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I0927 00:57:43.859268 19628 master.cpp:602] Authorization enabled > I0927 
00:57:43.859541 19629 hierarchical.cpp:182] Initialized hierarchical > allocator process > I0927 00:57:43.859582 19629 whitelist_watcher.cpp:77] No whitelist given > I0927 00:57:43.860060 19628 master.cpp:2083] Elected as the leading master! > I0927 00:57:43.860078 19628 master.cpp:1638] Recovering from registrar > I0927 00:57:43.860117 19628 registrar.cpp:339] Recovering registrar > I0927 00:57:43.860285 19628 registrar.cpp:383] Successfully fetched the > registry (0B) in 144128ns > I0927 00:57:43.860328 19628 registrar.cpp:487] Applied 1 operations in > 8246ns; attempting to update the registry > I0927 00:57:43.860527 19624 registrar.cpp:544] Successfully updated the > registry in 167168ns > I0927 00:57:43.860571 19624
[jira] [Created] (MESOS-10144) MasterQuotaTest.ValidateLimitAgainstConsumed is flaky
Greg Mann created MESOS-10144: - Summary: MasterQuotaTest.ValidateLimitAgainstConsumed is flaky Key: MESOS-10144 URL: https://issues.apache.org/jira/browse/MESOS-10144 Project: Mesos Issue Type: Bug Components: test Affects Versions: 1.10.0 Environment: Debian 8 with libevent & SSL enabled. Reporter: Greg Mann Observed in internal CI. Log attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10142) CSI External Volumes MVP Design Doc
[ https://issues.apache.org/jira/browse/MESOS-10142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-10142: - Assignee: Qian Zhang > CSI External Volumes MVP Design Doc > --- > > Key: MESOS-10142 > URL: https://issues.apache.org/jira/browse/MESOS-10142 > Project: Mesos > Issue Type: Task >Reporter: Greg Mann >Assignee: Qian Zhang >Priority: Major > Labels: csi, external-volumes, storage > > This ticket tracks the design doc for our initial implementation of external > volume support in Mesos using the CSI standard. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10142) CSI External Volumes MVP Design Doc
Greg Mann created MESOS-10142: - Summary: CSI External Volumes MVP Design Doc Key: MESOS-10142 URL: https://issues.apache.org/jira/browse/MESOS-10142 Project: Mesos Issue Type: Task Reporter: Greg Mann This ticket tracks the design doc for our initial implementation of external volume support in Mesos using the CSI standard. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10141) CSI External Volume Support
Greg Mann created MESOS-10141: - Summary: CSI External Volume Support Key: MESOS-10141 URL: https://issues.apache.org/jira/browse/MESOS-10141 Project: Mesos Issue Type: Epic Reporter: Greg Mann This epic tracks work for our MVP of external volume support in Mesos using the CSI standard. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10136) MasterDrainingTest.DrainAgentUnreachable is flaky
Greg Mann created MESOS-10136: - Summary: MasterDrainingTest.DrainAgentUnreachable is flaky Key: MESOS-10136 URL: https://issues.apache.org/jira/browse/MESOS-10136 Project: Mesos Issue Type: Bug Components: test Affects Versions: 1.10.0 Environment: CentOS 7, built with cmake. Reporter: Greg Mann Observed in internal CI. Log attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10118) Agent incorrectly handles draining when empty
[ https://issues.apache.org/jira/browse/MESOS-10118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-10118: - Assignee: Greg Mann > Agent incorrectly handles draining when empty > - > > Key: MESOS-10118 > URL: https://issues.apache.org/jira/browse/MESOS-10118 > Project: Mesos > Issue Type: Bug > Components: agent >Affects Versions: 1.9.0 >Reporter: Greg Mann >Assignee: Greg Mann >Priority: Major > > When the agent receives a {{DrainSlaveMessage}} and does not have any tasks > or operations, it writes the {{DrainConfig}} to disk and is then implicitly > stuck in a "draining" state indefinitely. For example, if an agent > reregistration is triggered at such a time, the master may think the agent is > operating normally and send a task to it, at which point the task will fail > because the agent thinks it's draining (see this test for an example: > https://reviews.apache.org/r/72364/). > If the agent receives a {{DrainSlaveMessage}} when it has no tasks or > operations, it should avoid writing any {{DrainConfig}} to disk so that it > immediately "transitions" into the already-drained state. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10118) Agent incorrectly handles draining when empty
Greg Mann created MESOS-10118: - Summary: Agent incorrectly handles draining when empty Key: MESOS-10118 URL: https://issues.apache.org/jira/browse/MESOS-10118 Project: Mesos Issue Type: Bug Components: agent Affects Versions: 1.9.0 Reporter: Greg Mann When the agent receives a {{DrainSlaveMessage}} and does not have any tasks or operations, it writes the {{DrainConfig}} to disk and is then implicitly stuck in a "draining" state indefinitely. For example, if an agent reregistration is triggered at such a time, the master may think the agent is operating normally and send a task to it, at which point the task will fail because the agent thinks it's draining (see this test for an example: https://reviews.apache.org/r/72364/). If the agent receives a {{DrainSlaveMessage}} when it has no tasks or operations, it should avoid writing any {{DrainConfig}} to disk so that it immediately "transitions" into the already-drained state. -- This message was sent by Atlassian Jira (v8.3.4#803005)
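The fix proposed in the ticket is a small decision at drain time: skip checkpointing `DrainConfig` when there is nothing to drain. A sketch of that logic, with a hypothetical function shape rather than the real agent code:

```python
def handle_drain_message(drain_config, tasks, operations, checkpoint, state):
    """If the agent has no tasks or operations, avoid writing
    DrainConfig to disk entirely so the agent immediately counts as
    drained; otherwise persist it and enter the draining state."""
    if not tasks and not operations:
        state['drained'] = True  # Nothing to do: already drained.
        return
    checkpoint('DrainConfig', drain_config)
    state['draining'] = True
```

With this shape, an empty agent never gets implicitly stuck in "draining", so a later reregistration and task launch behave normally.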
[jira] [Assigned] (MESOS-10116) Attempt to reactivate disconnected agent crashes the master
[ https://issues.apache.org/jira/browse/MESOS-10116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-10116: - Assignee: Andrei Sekretenko (was: Greg Mann) > Attempt to reactivate disconnected agent crashes the master > --- > > Key: MESOS-10116 > URL: https://issues.apache.org/jira/browse/MESOS-10116 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.9.0, 1.10.0 >Reporter: Andrei Sekretenko >Assignee: Andrei Sekretenko >Priority: Critical > > Observed the following scenario on a production cluster: > - operator performs agent draining > - draining completes, operator disconnects the agent > - operator reactivates agent via REACTIVATE_AGENT call > - *master issues an offer for a reactivated disconnected agent* > - a framework issues ACCEPT call with this offer > - master crashes with the following stack trace: > {noformat} > F0311 09:06:18.852365 11289 validation.cpp:2123] Check failed: > slave->connected Offer 4067082c-ec7a-4efc-ac2d-c6e7cbc77356-O13981526 > outlived disconnected agent 968ea9b2-374d-45cb-b5b3-c4ffb45a4a78-S0 at > slave(1)@10.50.7.59:5051 (10.50.7.59) > *** Check failure stack trace: *** > @ 0x7feac6a1dc6d google::LogMessage::Fail() > @ 0x7feac6a1fec8 google::LogMessage::SendToLog() > @ 0x7feac6a1d803 google::LogMessage::Flush() > @ 0x7feac6a20809 google::LogMessageFatal::~LogMessageFatal() > @ 0x7feac57cdea0 mesos::internal::master::validation::offer::validateSlave() > @ 0x7feac57d09c1 std::_Function_handler<>::_M_invoke() > @ 0x7feac57d0fd1 std::function<>::operator()() > @ 0x7feac57cea3c mesos::internal::master::validation::offer::validate() > @ 0x7feac56d5565 mesos::internal::master::Master::accept() > @ 0x7feac56468f0 mesos::internal::master::Master::Http::scheduler() > @ 0x7feac5689797 > 
_ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestERK6OptionINS2_14authentication9PrincipalEEEZN5mesos8internal6master6Master10initializeEvEUlS7_SD_E1_E9_M_invokeERKSt9_Any_dataS7_SD_ > @ 0x7feac697038c > _ZNO6lambda12CallableOnceIFN7process6FutureINS1_4http8ResponseEEEvEE10CallableFnINS_8internal7PartialIZZNS1_11ProcessBase8_consumeERKNSB_12HttpEndpointERKSsRKNS1_5OwnedINS3_7RequestNKUlRK6OptionINS3_14authentication20AuthenticationResultEEE0_clESR_EUlbE0_IbclEv > @ 0x7feac53f30e7 > _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureINS1_4http8ResponseclINS0_IFSE_vESE_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseISD_EESt14default_deleteISQ_EEOSI_S3_E_IST_SI_St12_PlaceholderILi1EEclEOS3_ > @ 0x7feac6966561 process::ProcessBase::consume() > @ 0x7feac697db5b process::ProcessManager::resume() > @ 0x7feac69837f6 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > @ 0x7feac262f070 (unknown) > @ 0x7feac1e4de65 start_thread > @ 0x7feac1b7688d __clone > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10116) Attempt to reactivate disconnected agent crashes the master
[ https://issues.apache.org/jira/browse/MESOS-10116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-10116: - Assignee: Greg Mann (was: Andrei Sekretenko) > Attempt to reactivate disconnected agent crashes the master > --- > > Key: MESOS-10116 > URL: https://issues.apache.org/jira/browse/MESOS-10116 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.9.0, 1.10.0 >Reporter: Andrei Sekretenko >Assignee: Greg Mann >Priority: Critical > > Observed the following scenario on a production cluster: > - operator performs agent draining > - draining completes, operator disconnects the agent > - operator reactivates agent via REACTIVATE_AGENT call > - *master issues an offer for a reactivated disconnected agent* > - a framework issues ACCEPT call with this offer > - master crashes with the following stack trace: > {noformat} > F0311 09:06:18.852365 11289 validation.cpp:2123] Check failed: > slave->connected Offer 4067082c-ec7a-4efc-ac2d-c6e7cbc77356-O13981526 > outlived disconnected agent 968ea9b2-374d-45cb-b5b3-c4ffb45a4a78-S0 at > slave(1)@10.50.7.59:5051 (10.50.7.59) > *** Check failure stack trace: *** > @ 0x7feac6a1dc6d google::LogMessage::Fail() > @ 0x7feac6a1fec8 google::LogMessage::SendToLog() > @ 0x7feac6a1d803 google::LogMessage::Flush() > @ 0x7feac6a20809 google::LogMessageFatal::~LogMessageFatal() > @ 0x7feac57cdea0 mesos::internal::master::validation::offer::validateSlave() > @ 0x7feac57d09c1 std::_Function_handler<>::_M_invoke() > @ 0x7feac57d0fd1 std::function<>::operator()() > @ 0x7feac57cea3c mesos::internal::master::validation::offer::validate() > @ 0x7feac56d5565 mesos::internal::master::Master::accept() > @ 0x7feac56468f0 mesos::internal::master::Master::Http::scheduler() > @ 0x7feac5689797 > _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestERK6OptionINS2_14authentication9PrincipalEEEZN5mesos8internal6master6Master10initializeEvEUlS7_SD_E1_E9_M_invokeERKSt9_Any_dataS7_SD_ > @ 
0x7feac697038c > _ZNO6lambda12CallableOnceIFN7process6FutureINS1_4http8ResponseEEEvEE10CallableFnINS_8internal7PartialIZZNS1_11ProcessBase8_consumeERKNSB_12HttpEndpointERKSsRKNS1_5OwnedINS3_7RequestNKUlRK6OptionINS3_14authentication20AuthenticationResultEEE0_clESR_EUlbE0_IbclEv > @ 0x7feac53f30e7 > _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureINS1_4http8ResponseclINS0_IFSE_vESE_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseISD_EESt14default_deleteISQ_EEOSI_S3_E_IST_SI_St12_PlaceholderILi1EEclEOS3_ > @ 0x7feac6966561 process::ProcessBase::consume() > @ 0x7feac697db5b process::ProcessManager::resume() > @ 0x7feac69837f6 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > @ 0x7feac262f070 (unknown) > @ 0x7feac1e4de65 start_thread > @ 0x7feac1b7688d __clone > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
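The invariant violated above is that an offer should never be backed by a disconnected agent: reactivating a drained-and-disconnected agent must not make its resources allocatable until it reregisters. A toy expression of that guard, not the actual allocator or master fix:

```python
def allocatable_agents(agents):
    """Only connected, activated agents may back an offer. An agent
    reactivated while disconnected stays excluded until it reconnects."""
    return [a['id'] for a in agents if a['connected'] and a['active']]
```

Under this rule, the REACTIVATE_AGENT call in the scenario above would flip `active` but the agent would remain non-allocatable, so no offer could outlive it.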
[jira] [Commented] (MESOS-10111) Failed check in libevent_ssl_socket.cpp: 'self->bev' Must be non NULL
[ https://issues.apache.org/jira/browse/MESOS-10111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082654#comment-17082654 ] Greg Mann commented on MESOS-10111: --- Review here: https://reviews.apache.org/r/72354/ > Failed check in libevent_ssl_socket.cpp: 'self->bev' Must be non NULL > - > > Key: MESOS-10111 > URL: https://issues.apache.org/jira/browse/MESOS-10111 > Project: Mesos > Issue Type: Bug > Components: libprocess >Affects Versions: 1.10.0 >Reporter: Andrei Sekretenko >Assignee: Greg Mann >Priority: Critical > > Observing the following master crash on a testing cluster roughly once an > hour: > {noformat} > F0408 14:17:33.470850 18423 libevent_ssl_socket.cpp:193] Check failed: > 'self->bev' Must be non NULL > @ 0x7fa7db12e2ad google::LogMessage::Fail() > @ 0x7fa7db130508 google::LogMessage::SendToLog() > @ 0x7fa7db12de43 google::LogMessage::Flush() > @ 0x7fa7db130e49 google::LogMessageFatal::~LogMessageFatal() > @ 0x7fa7db1004de google::CheckNotNull<>() > @ 0x7fa7db0fb6ca > _ZNSt17_Function_handlerIFvvEZN7process7network8internal21LibeventSSLSocketImpl8shutdownEiEUlvE_E9_M_invokeERKSt9_Any_data > @ 0x7fa7db107091 process::async_function() > @ 0x7fa7d7178978 event_process_active_single_queue > @ 0x7fa7d7178e5d event_process_active > @ 0x7fa7d71795b9 event_base_loop > @ 0x7fa7db106bed process::EventLoop::run() > @ 0x7fa7d6cfe2b0 (unknown) > @ 0x7fa7d651ce65 start_thread > @ 0x7fa7d624588d __clone > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10111) Failed check in libevent_ssl_socket.cpp: 'self->bev' Must be non NULL
[ https://issues.apache.org/jira/browse/MESOS-10111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-10111: - Assignee: Greg Mann > Failed check in libevent_ssl_socket.cpp: 'self->bev' Must be non NULL > - > > Key: MESOS-10111 > URL: https://issues.apache.org/jira/browse/MESOS-10111 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.10.0 >Reporter: Andrei Sekretenko >Assignee: Greg Mann >Priority: Critical > > Observing the following master crash on a testing cluster roughly once an > hour: > {noformat} > F0408 14:17:33.470850 18423 libevent_ssl_socket.cpp:193] Check failed: > 'self->bev' Must be non NULL > @ 0x7fa7db12e2ad google::LogMessage::Fail() > @ 0x7fa7db130508 google::LogMessage::SendToLog() > @ 0x7fa7db12de43 google::LogMessage::Flush() > @ 0x7fa7db130e49 google::LogMessageFatal::~LogMessageFatal() > @ 0x7fa7db1004de google::CheckNotNull<>() > @ 0x7fa7db0fb6ca > _ZNSt17_Function_handlerIFvvEZN7process7network8internal21LibeventSSLSocketImpl8shutdownEiEUlvE_E9_M_invokeERKSt9_Any_data > @ 0x7fa7db107091 process::async_function() > @ 0x7fa7d7178978 event_process_active_single_queue > @ 0x7fa7d7178e5d event_process_active > @ 0x7fa7d71795b9 event_base_loop > @ 0x7fa7db106bed process::EventLoop::run() > @ 0x7fa7d6cfe2b0 (unknown) > @ 0x7fa7d651ce65 start_thread > @ 0x7fa7d624588d __clone > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
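The failed check suggests `shutdown` can be dispatched to the event loop after the bufferevent (`bev`) has already been torn down by a close. A toy model of the defensive alternative (treat shutdown-after-close as a no-op instead of a fatal CHECK); this illustrates the pattern only and is not the libprocess code:

```python
class SocketImpl:
    """Toy model of the race: shutdown may run on the event loop after
    close has already freed the bufferevent (bev)."""

    def __init__(self):
        self.bev = object()  # Stand-in for the libevent bufferevent.
        self.shutdown_calls = 0

    def close(self):
        self.bev = None  # Frees the bufferevent.

    def shutdown(self):
        if self.bev is None:
            # The crashing path instead did CHECK_NOTNULL(self->bev),
            # aborting the whole master process.
            return  # Socket already closed; nothing left to shut down.
        self.shutdown_calls += 1
```

The real fix must of course also keep the close/shutdown ordering race-free under the event loop's threading model.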
[jira] [Created] (MESOS-10108) SSL Improvements
Greg Mann created MESOS-10108: - Summary: SSL Improvements Key: MESOS-10108 URL: https://issues.apache.org/jira/browse/MESOS-10108 Project: Mesos Issue Type: Epic Reporter: Greg Mann -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10048) Update the memory subsystem in the cgroup isolator to set container’s memory resource limits and `oom_score_adj`
[ https://issues.apache.org/jira/browse/MESOS-10048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066083#comment-17066083 ] Greg Mann commented on MESOS-10048: --- {noformat} commit 12e5e870c38681bfc0455960f89a41127dac3daf (HEAD -> master, origin/master, origin/HEAD) Author: Qian Zhang Date: Tue Mar 24 10:44:39 2020 -0700 Moved containerizer utils in CMakeLists. This is to ensure the function `calculateOOMScoreAdj()` can be resolved on Windows. Review: https://reviews.apache.org/r/72263/ {noformat} > Update the memory subsystem in the cgroup isolator to set container’s memory > resource limits and `oom_score_adj` > > > Key: MESOS-10048 > URL: https://issues.apache.org/jira/browse/MESOS-10048 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > Fix For: 1.10.0 > > > Update the memory subsystem in the cgroup isolator to set container’s memory > resource limits and `oom_score_adj` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10055) Update Mesos UI to display the resource limits of tasks
[ https://issues.apache.org/jira/browse/MESOS-10055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-10055: - Assignee: Greg Mann > Update Mesos UI to display the resource limits of tasks > --- > > Key: MESOS-10055 > URL: https://issues.apache.org/jira/browse/MESOS-10055 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Greg Mann >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10087) Update master & agent's HTTP endpoints for showing resource limits
[ https://issues.apache.org/jira/browse/MESOS-10087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-10087: - Assignee: Greg Mann > Update master & agent's HTTP endpoints for showing resource limits > -- > > Key: MESOS-10087 > URL: https://issues.apache.org/jira/browse/MESOS-10087 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Greg Mann >Priority: Major > > We need to update Mesos master's `/state`, `/frameworks`, `/tasks` endpoints > and agent's `/state` endpoint to show task's resource limits in their outputs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10093) Libprocess does not properly escape subprocess argument strings on Windows
[ https://issues.apache.org/jira/browse/MESOS-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-10093: - Assignee: Benjamin Mahler (was: Greg Mann) > Libprocess does not properly escape subprocess argument strings on Windows > -- > > Key: MESOS-10093 > URL: https://issues.apache.org/jira/browse/MESOS-10093 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.9.0 >Reporter: Greg Mann >Assignee: Benjamin Mahler >Priority: Major > Labels: containerization, docker, mesosphere, windows > > When running some tests of Mesos on Windows, I discovered that the following > command would not execute successfully when passed to the Docker > containerizer in {{TaskInfo.command}}: > {noformat} > python -c "print('hello world')" > {noformat} > The following error is found in the task sandbox: > {noformat} > File "", line 1 > "print('hello > ^ > SyntaxError: EOL while scanning string literal > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
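The truncated `"print('hello` in the error is the classic symptom of joining argv strings naively on Windows instead of quoting them per the MSVC runtime's parsing rules (the scheme that `CommandLineToArgvW` reverses). A sketch of the standard quoting algorithm, shown in Python for illustration rather than as the actual libprocess fix:

```python
def quote_windows_arg(arg: str) -> str:
    """Quote one argument per the MSVC runtime parsing rules
    (compatible with CommandLineToArgvW)."""
    if arg and not any(c in arg for c in ' \t"'):
        return arg  # No quoting needed.
    out = ['"']
    backslashes = 0
    for c in arg:
        if c == '\\':
            backslashes += 1
        elif c == '"':
            # Backslashes preceding a quote must be doubled, and the
            # quote itself escaped.
            out.append('\\' * (backslashes * 2 + 1) + '"')
            backslashes = 0
        else:
            out.append('\\' * backslashes + c)
            backslashes = 0
    # Backslashes before the closing quote must also be doubled.
    out.append('\\' * (backslashes * 2) + '"')
    return ''.join(out)
```

With this quoting, `print('hello world')` survives as a single argument instead of being split at the space, which is exactly the failure shown in the sandbox error.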
[jira] [Commented] (MESOS-10045) Validate task’s resource limits and the `share_cgroups` field
[ https://issues.apache.org/jira/browse/MESOS-10045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056218#comment-17056218 ] Greg Mann commented on MESOS-10045: --- Patches for agent-side validation of shared cgroups: https://reviews.apache.org/r/72221/ https://reviews.apache.org/r/7/ > Validate task’s resources limits and the `share_cgroups` field > -- > > Key: MESOS-10045 > URL: https://issues.apache.org/jira/browse/MESOS-10045 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Greg Mann >Priority: Major > > When launching a task, we need to validate: > # Only CPU and memory are supported as resource limits. > # Resource limit must be larger than resource request. > ** We need to be careful about the command task case, in which case we add > an allowance (0.1 CPUs and 32MB memory, see > [here|https://github.com/apache/mesos/blob/1.9.0/src/slave/slave.cpp#L6663:L6677] > for details) for the executor, so we need to validate task resource limit is > larger than task resource request + this allowance, otherwise the executor > will be launched with limits < requests. > # `TaskInfo` can only include resource limits when the relevant agent > possesses the TASK_RESOURCE_LIMITS capability. > # The value of the field `share_cgroups` should be same for all the tasks > launched by a single default executor. > # It is not allowed to set resource limits for the task which has the field > `share_cgroups` set as true. > We also need to add validation to the agent which will ensure that non-debug > 2nd-or-lower-level nested containers cannot be launched via the > {{LaunchContainer}} call. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10045) Validate task’s resource limits and the `share_cgroups` field
[ https://issues.apache.org/jira/browse/MESOS-10045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055083#comment-17055083 ] Greg Mann commented on MESOS-10045: --- Patches for master-side validation: https://reviews.apache.org/r/72216/ https://reviews.apache.org/r/72217/ > Validate task’s resources limits and the `share_cgroups` field > -- > > Key: MESOS-10045 > URL: https://issues.apache.org/jira/browse/MESOS-10045 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Greg Mann >Priority: Major > > When launching a task, we need to validate: > # Only CPU and memory are supported as resource limits. > # Resource limit must be larger than resource request. > ** We need to be careful about the command task case, in which case we add > an allowance (0.1 CPUs and 32MB memory, see > [here|https://github.com/apache/mesos/blob/1.9.0/src/slave/slave.cpp#L6663:L6677] > for details) for the executor, so we need to validate task resource limit is > larger than task resource request + this allowance, otherwise the executor > will be launched with limits < requests. > # `TaskInfo` can only include resource limits when the relevant agent > possesses the TASK_RESOURCE_LIMITS capability. > # The value of the field `share_cgroups` should be same for all the tasks > launched by a single default executor. > # It is not allowed to set resource limits for the task which has the field > `share_cgroups` set as true. > We also need to add validation to the agent which will ensure that non-debug > 2nd-or-lower-level nested containers cannot be launched via the > {{LaunchContainer}} call. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10072) Windows: Curl requires zlib when built with SSL support on Windows
[ https://issues.apache.org/jira/browse/MESOS-10072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051366#comment-17051366 ] Greg Mann commented on MESOS-10072: --- [~ddary] I'm not sure; if we haven't seen issues in testing on Windows agents, then probably not a blocker. > Windows: Curl requires zlib when built with SSL support on Windows > -- > > Key: MESOS-10072 > URL: https://issues.apache.org/jira/browse/MESOS-10072 > Project: Mesos > Issue Type: Task >Reporter: Joseph Wu >Priority: Major > Labels: curl, foundations, windows > Attachments: Screen Shot 2019-12-17 at 1.38.43 PM.png > > > After building Windows with --enable-ssl, some curl-related tests, like > health check tests, start failing with the odd exit code {{-1073741515}}. > Running curl directly with the Visual Studio debugger yields this error: > !Screen Shot 2019-12-17 at 1.38.43 PM.png|width=343,height=164! > Some documentation online seems to support this additional requirement: > [https://wiki.dlang.org/Curl_on_Windows] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10044) Add a new capability `TASK_RESOURCE_LIMITS` into Mesos agent
[ https://issues.apache.org/jira/browse/MESOS-10044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17050260#comment-17050260 ] Greg Mann commented on MESOS-10044: --- {noformat} commit f445e3aea44b4060292fa5e029dbb2c19e219c25 Author: Greg Mann Date: Tue Mar 3 06:03:57 2020 -0800 Added the 'TASK_RESOURCE_LIMITS' agent capability. This capability will be used by the master to detect whether or not an agent can handle task resource limits. Review: https://reviews.apache.org/r/71991/ {noformat} > Add a new capability `TASK_RESOURCE_LIMITS` into Mesos agent > > > Key: MESOS-10044 > URL: https://issues.apache.org/jira/browse/MESOS-10044 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Greg Mann >Priority: Major > Fix For: 1.10.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10093) Docker containerizer does not handle whitespace correctly on Windows
[ https://issues.apache.org/jira/browse/MESOS-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-10093: - Assignee: Greg Mann > Docker containerizer does handle whitespace correctly on Windows > > > Key: MESOS-10093 > URL: https://issues.apache.org/jira/browse/MESOS-10093 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.9.0 >Reporter: Greg Mann >Assignee: Greg Mann >Priority: Major > Labels: containerization, docker, mesosphere, windows > > When running some tests of Mesos on Windows, I discovered that the following > command would not execute successfully when passed to the Docker > containerizer in {{TaskInfo.command}}: > {noformat} > python -c "print('hello world')" > {noformat} > The following error is found in the task sandbox: > {noformat} > File "", line 1 > "print('hello > ^ > SyntaxError: EOL while scanning string literal > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10093) Docker containerizer does not handle whitespace correctly on Windows
[ https://issues.apache.org/jira/browse/MESOS-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031650#comment-17031650 ] Greg Mann commented on MESOS-10093: --- I heard from [~kaysoky] that this may have been a relic from earlier days in Mesos-on-Windows development, when PowerShell was the intended default shell on that platform. This was later changed to {{cmd}} to reduce the resource overhead. > Docker containerizer does handle whitespace correctly on Windows > > > Key: MESOS-10093 > URL: https://issues.apache.org/jira/browse/MESOS-10093 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.9.0 >Reporter: Greg Mann >Priority: Major > Labels: containerization, docker, mesosphere, windows > > When running some tests of Mesos on Windows, I discovered that the following > command would not execute successfully when passed to the Docker > containerizer in {{TaskInfo.command}}: > {noformat} > python -c "print('hello world')" > {noformat} > The following error is found in the task sandbox: > {noformat} > File "", line 1 > "print('hello > ^ > SyntaxError: EOL while scanning string literal > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-10093) Docker containerizer does not handle whitespace correctly on Windows
[ https://issues.apache.org/jira/browse/MESOS-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031100#comment-17031100 ] Greg Mann edited comment on MESOS-10093 at 2/5/20 10:53 PM: On windows, we execute shell commands as arguments to {{cmd.exe}}. I ran the following test in the command prompt: {noformat} C:\Users\Administrator>cmd /c "python -c \"print('hello world')\"" File "", line 1 "print('hello ^ SyntaxError: EOL while scanning string literal C:\Users\Administrator>cmd /c "python -c ^"print('hello world')^"" hello world {noformat} In libprocess, it looks like we currently escape double quotes using a backslash: https://github.com/apache/mesos/blob/4990d2cd6e76da340b30e200be0d700124dac2b1/3rdparty/stout/include/stout/os/windows/shell.hpp#L188-L191 Based on the above test, it appears that we should be escaping them with caret instead. NOTE that before merging such a change, we should confirm that changing this escaping behavior doesn't break Mesos containerizer tasks. was (Author: greggomann): On windows, we execute shell commands as arguments to {{cmd.exe}}. I ran the following test in the command prompt: {noformat} C:\Users\Administrator>cmd /c "python -c \"print('hello world')\"" File "", line 1 "print('hello ^ SyntaxError: EOL while scanning string literal C:\Users\Administrator>cmd /c "python -c ^"print('hello world')^"" hello world {noformat} In libprocess, it looks like we currently escape double quotes using a backslash: https://github.com/apache/mesos/blob/4990d2cd6e76da340b30e200be0d700124dac2b1/3rdparty/stout/include/stout/os/windows/shell.hpp#L188-L191 Based on the above test, it appears that escaping them with caret instead. NOTE that before merging such a change, we should confirm that changing this escaping behavior doesn't break Mesos containerizer tasks. 
> Docker containerizer does handle whitespace correctly on Windows > > > Key: MESOS-10093 > URL: https://issues.apache.org/jira/browse/MESOS-10093 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.9.0 >Reporter: Greg Mann >Priority: Major > Labels: containerization, docker, mesosphere, windows > > When running some tests of Mesos on Windows, I discovered that the following > command would not execute successfully when passed to the Docker > containerizer in {{TaskInfo.command}}: > {noformat} > python -c "print('hello world')" > {noformat} > The following error is found in the task sandbox: > {noformat} > File "", line 1 > "print('hello > ^ > SyntaxError: EOL while scanning string literal > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-10093) Docker containerizer does not handle whitespace correctly on Windows
[ https://issues.apache.org/jira/browse/MESOS-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031100#comment-17031100 ] Greg Mann edited comment on MESOS-10093 at 2/5/20 10:53 PM: On windows, we execute shell commands as arguments to {{cmd.exe}}. I ran the following test in the command prompt: {noformat} C:\Users\Administrator>cmd /c "python -c \"print('hello world')\"" File "", line 1 "print('hello ^ SyntaxError: EOL while scanning string literal C:\Users\Administrator>cmd /c "python -c ^"print('hello world')^"" hello world {noformat} In libprocess, it looks like we currently escape double quotes using a backslash: https://github.com/apache/mesos/blob/4990d2cd6e76da340b30e200be0d700124dac2b1/3rdparty/stout/include/stout/os/windows/shell.hpp#L188-L191 Based on the above test, it appears that we should be escaping them with a caret instead. NOTE that before merging such a change, we should confirm that changing this escaping behavior doesn't break Mesos containerizer tasks. was (Author: greggomann): On windows, we execute shell commands as arguments to {{cmd.exe}}. I ran the following test in the command prompt: {noformat} C:\Users\Administrator>cmd /c "python -c \"print('hello world')\"" File "", line 1 "print('hello ^ SyntaxError: EOL while scanning string literal C:\Users\Administrator>cmd /c "python -c ^"print('hello world')^"" hello world {noformat} In libprocess, it looks like we currently escape double quotes using a backslash: https://github.com/apache/mesos/blob/4990d2cd6e76da340b30e200be0d700124dac2b1/3rdparty/stout/include/stout/os/windows/shell.hpp#L188-L191 Based on the above test, it appears that we should be escaping them with caret instead. NOTE that before merging such a change, we should confirm that changing this escaping behavior doesn't break Mesos containerizer tasks. 
> Docker containerizer does handle whitespace correctly on Windows > > > Key: MESOS-10093 > URL: https://issues.apache.org/jira/browse/MESOS-10093 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.9.0 >Reporter: Greg Mann >Priority: Major > Labels: containerization, docker, mesosphere, windows > > When running some tests of Mesos on Windows, I discovered that the following > command would not execute successfully when passed to the Docker > containerizer in {{TaskInfo.command}}: > {noformat} > python -c "print('hello world')" > {noformat} > The following error is found in the task sandbox: > {noformat} > File "", line 1 > "print('hello > ^ > SyntaxError: EOL while scanning string literal > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10093) Docker containerizer does not handle whitespace correctly on Windows
[ https://issues.apache.org/jira/browse/MESOS-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031100#comment-17031100 ] Greg Mann commented on MESOS-10093: --- On Windows, we execute shell commands as arguments to {{cmd.exe}}. I ran the following test in the command prompt: {noformat} C:\Users\Administrator>cmd /c "python -c \"print('hello world')\"" File "", line 1 "print('hello ^ SyntaxError: EOL while scanning string literal C:\Users\Administrator>cmd /c "python -c ^"print('hello world')^"" hello world {noformat} In libprocess, it looks like we currently escape double quotes using a backslash: https://github.com/apache/mesos/blob/4990d2cd6e76da340b30e200be0d700124dac2b1/3rdparty/stout/include/stout/os/windows/shell.hpp#L188-L191 Based on the above test, it appears that we should be escaping them with a caret instead. NOTE that before merging such a change, we should confirm that changing this escaping behavior doesn't break Mesos containerizer tasks. > Docker containerizer does not handle whitespace correctly on Windows > > > Key: MESOS-10093 > URL: https://issues.apache.org/jira/browse/MESOS-10093 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.9.0 >Reporter: Greg Mann >Priority: Major > Labels: containerization, docker, mesosphere, windows > > When running some tests of Mesos on Windows, I discovered that the following > command would not execute successfully when passed to the Docker > containerizer in {{TaskInfo.command}}: > {noformat} > python -c "print('hello world')" > {noformat} > The following error is found in the task sandbox: > {noformat} > File "", line 1 > "print('hello > ^ > SyntaxError: EOL while scanning string literal > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10093) Docker containerizer does not handle whitespace correctly on Windows
Greg Mann created MESOS-10093: - Summary: Docker containerizer does not handle whitespace correctly on Windows Key: MESOS-10093 URL: https://issues.apache.org/jira/browse/MESOS-10093 Project: Mesos Issue Type: Bug Affects Versions: 1.9.0 Reporter: Greg Mann When running some tests of Mesos on Windows, I discovered that the following command would not execute successfully when passed to the Docker containerizer in {{TaskInfo.command}}: {noformat} python -c "print('hello world')" {noformat} The following error is found in the task sandbox: {noformat} File "", line 1 "print('hello ^ SyntaxError: EOL while scanning string literal {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10068) Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state
[ https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025386#comment-17025386 ] Greg Mann commented on MESOS-10068: --- [~daltonmatos] regarding this ticket, yea I think it makes sense to close this one and mention it in MESOS-10089. Time is tight over here, but I'd be happy to mentor you a bit in the codebase :) Would you like to start by addressing MESOS-10089? If so, we could do an intro call to get started. Feel free to find me on Mesos slack if you're on there. > Mesos Master doesn't send AGENT_REMOVED when removing agent from internal > state > --- > > Key: MESOS-10068 > URL: https://issues.apache.org/jira/browse/MESOS-10068 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.7.3, 1.8.2, 1.9.1 >Reporter: Dalton Matos Coelho Barreto >Priority: Major > Attachments: master-full-logs.log > > > Hello, > > Looking at the documentation of the master {{/api/v1}} endpoint, the > {{SUBSCRIBE}} message says that only {{TASK_ADDED}} and {{TASK_UPDATED}} are > supported for this endpoint, but when a new agent joins the cluster an > {{AGENT_ADDED}} event is received. > The problem is that when this agent is stopped the {{AGENT_REMOVED}} is not > received by clients subscribed to the master API. > > I tested this behavior with versions: {{1.7.3}}, {{1.8.2}} and {{1.9.1}}. All > using the docker image {{mesos/mesos-centos}}. > The only way I saw an {{AGENT_REMOVED}} event was when a new agent joined the > cluster but the master couldn't communicate with this agent; in this specific > test there was a firewall blocking port {{5051}} on the slave, that is, > nobody was able to talk to the slave on port {{5051}}. > > h2. 
Here are the steps to reproduce the problem > * Start a new Mesos master > * Connect to the {{/api/v1}} endpoint, sending a {{SUBSCRIBE}} message: > ** > {noformat} > curl --no-buffer -Ld '{"type": "SUBSCRIBE"}' -H "Content-Type: > application/json" http://MASTER_IP:5050/api/v1{noformat} > * Start a new slave and confirm the {{AGENT_ADDED}} event is delivered; > * Stop this slave; > * Check that {{/slaves?slave_id=AGENT_ID}} returns a JSON response with the > field {{active=false}}. > * Wait for the Mesos master to stop listing this slave, that is, until > {{/slaves?slave_id=AGENT_ID}} returns an empty response; > Even after the empty response, the event never reaches the subscriber. > > The Mesos master logs show this: > {noformat} > I1213 15:03:10.33893513 master.cpp:1297] Agent > 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 > (86813ca2a964) disconnected > I1213 15:03:10.33908913 master.cpp:3399] Disconnecting agent > 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 > (86813ca2a964) > I1213 15:03:10.33920713 master.cpp:3418] Deactivating agent > 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 > (86813ca2a964) > {noformat} > And then: > {noformat} > W1213 15:04:40.72667015 process.cpp:1917] Failed to send > 'mesos.internal.PingSlaveMessage' to '172.18.0.51:5051', connect: Failed to > connect to 172.18.0.51:5051: No route to host{noformat} > And some time after this: > {noformat} > I1213 15:04:37.685007 7 hierarchical.cpp:900] Removed agent > 2cd23025-c09d-401b-8f26-9265eda8f800-S1 {noformat} > > Even after this removal, the {{AGENT_REMOVED}} event is not delivered. > > I will attach the full master logs also. > > Do you think this could be a bug? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10068) Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state
[ https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022346#comment-17022346 ] Greg Mann commented on MESOS-10068: --- Yea we should definitely be sending AGENT_REMOVED when agents are marked gone, sounds like a bug to me. I created a ticket to track this: MESOS-10089 Regarding the unreachable agents, we may want to have an AGENT_UNREACHABLE event to indicate this. [~daltonmatos], we have a ticket here to track the design of the full agent state diagram: MESOS-9556 That would be a great place to continue discussion, feel free to ping us there. Unfortunately, I'm not sure when we might find time to work on that, but it's definitely something we've been wanting to do for a while now. > Mesos Master doesn't send AGENT_REMOVED when removing agent from internal > state > --- > > Key: MESOS-10068 > URL: https://issues.apache.org/jira/browse/MESOS-10068 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.7.3, 1.8.2, 1.9.1 >Reporter: Dalton Matos Coelho Barreto >Priority: Major > Attachments: master-full-logs.log > > > Hello, > > Looking at the documentation of the master {{/api/v1}} endpoint, the > {{SUBSCRIBE}} message says that only {{TASK_ADDED}} and {{TASK_UPDATED}} are > supported for this endpoint, but when a new agent joins the cluster an > {{AGENT_ADDED}} event is received. > The problem is that when this agent is stopped the {{AGENT_REMOVED}} is not > received by clients subscribed to the master API. > > I tested this behavior with versions: {{1.7.3}}, {{1.8.2}} and {{1.9.1}}. All > using the docker image {{mesos/mesos-centos}}. > The only way I saw an {{AGENT_REMOVED}} event was when a new agent joined the > cluster but the master couldn't communicate with this agent; in this specific > test there was a firewall blocking port {{5051}} on the slave, that is, > nobody was able to talk to the slave on port {{5051}}. > > h2. 
Here are the steps to reproduce the problem > * Start a new Mesos master > * Connect to the {{/api/v1}} endpoint, sending a {{SUBSCRIBE}} message: > ** > {noformat} > curl --no-buffer -Ld '{"type": "SUBSCRIBE"}' -H "Content-Type: > application/json" http://MASTER_IP:5050/api/v1{noformat} > * Start a new slave and confirm the {{AGENT_ADDED}} event is delivered; > * Stop this slave; > * Check that {{/slaves?slave_id=AGENT_ID}} returns a JSON response with the > field {{active=false}}. > * Wait for the Mesos master to stop listing this slave, that is, until > {{/slaves?slave_id=AGENT_ID}} returns an empty response; > Even after the empty response, the event never reaches the subscriber. > > The Mesos master logs show this: > {noformat} > I1213 15:03:10.33893513 master.cpp:1297] Agent > 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 > (86813ca2a964) disconnected > I1213 15:03:10.33908913 master.cpp:3399] Disconnecting agent > 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 > (86813ca2a964) > I1213 15:03:10.33920713 master.cpp:3418] Deactivating agent > 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 > (86813ca2a964) > {noformat} > And then: > {noformat} > W1213 15:04:40.72667015 process.cpp:1917] Failed to send > 'mesos.internal.PingSlaveMessage' to '172.18.0.51:5051', connect: Failed to > connect to 172.18.0.51:5051: No route to host{noformat} > And some time after this: > {noformat} > I1213 15:04:37.685007 7 hierarchical.cpp:900] Removed agent > 2cd23025-c09d-401b-8f26-9265eda8f800-S1 {noformat} > > Even after this removal, the {{AGENT_REMOVED}} event is not delivered. > > I will attach the full master logs also. > > Do you think this could be a bug? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10089) AGENT_REMOVED event not sent when agents marked GONE
Greg Mann created MESOS-10089: - Summary: AGENT_REMOVED event not sent when agents marked GONE Key: MESOS-10089 URL: https://issues.apache.org/jira/browse/MESOS-10089 Project: Mesos Issue Type: Bug Components: master Affects Versions: 1.9.0 Reporter: Greg Mann The master currently does not send subscribers the AGENT_REMOVED event when agents are marked GONE, but it should. Since the {{__removeSlave}} method is used to handle both the UNREACHABLE and GONE cases, we could update it to conditionally send this event. However, it's worth noting that the {{_removeSlave}}/{{__removeSlave}} logic is messy and unintuitive and in need of refactoring - I suspect we can turn these into a single method which handles all cases with the help of an auxiliary function or two. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-9847) Docker executor doesn't wait for status updates to be ack'd before shutting down.
[ https://issues.apache.org/jira/browse/MESOS-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-9847: Assignee: Andrei Budnik > Docker executor doesn't wait for status updates to be ack'd before shutting > down. > - > > Key: MESOS-9847 > URL: https://issues.apache.org/jira/browse/MESOS-9847 > Project: Mesos > Issue Type: Bug > Components: executor >Reporter: Meng Zhu >Assignee: Andrei Budnik >Priority: Major > Labels: containerization > > The docker executor doesn't wait for pending status updates to be > acknowledged before shutting down, instead it sleeps for one second and then > terminates: > {noformat} > void _stop() > { > // A hack for now ... but we need to wait until the status update > // is sent to the slave before we shut ourselves down. > // TODO(tnachen): Remove this hack and also the same hack in the > // command executor when we have the new HTTP APIs to wait until > // an ack. > os::sleep(Seconds(1)); > driver.get()->stop(); > } > {noformat} > This would result in racing between task status update (e.g. TASK_FINISHED) > and executor exit. The latter would lead agent generating a `TASK_FAILED` > status update by itself, leading to the confusing case where the agent > handles two different terminal status updates. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10045) Validate task’s resource limits and the `shared_cgroups` field in Mesos master
[ https://issues.apache.org/jira/browse/MESOS-10045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-10045: - Assignee: Greg Mann > Validate task’s resources limits and the `shared_cgroups` field in Mesos > master > --- > > Key: MESOS-10045 > URL: https://issues.apache.org/jira/browse/MESOS-10045 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Greg Mann >Priority: Major > > When launching a task, we need to validate: > # Only CPU and memory are supported as resource limits. > # Resource limit must be larger than resource request. > ** We need to be careful about the command task case, in which case we add > an allowance (0.1 CPUs and 32MB memory, see > [here|https://github.com/apache/mesos/blob/1.9.0/src/slave/slave.cpp#L6663:L6677] > for details) for the executor, so we need to validate task resource limit is > larger than task resource request + this allowance, otherwise the executor > will be launched with limits < requests. > # `TaskInfo` can only include resource limits when the relevant agent > possesses the TASK_RESOURCE_LIMITS capability. > # The value of the field `shared_cgroups` should be same for all the tasks > launched by a single default executor. > # It is not allowed to set resource limits for the task which has the field > `shared_cgroups` set as true. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10044) Add a new capability `TASK_RESOURCE_LIMITS` into Mesos agent
[ https://issues.apache.org/jira/browse/MESOS-10044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-10044: - Assignee: Greg Mann > Add a new capability `TASK_RESOURCE_LIMITS` into Mesos agent > > > Key: MESOS-10044 > URL: https://issues.apache.org/jira/browse/MESOS-10044 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Greg Mann >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10049) Add a new reason in `TaskStatus::Reason` for the case that a task is OOM-killed due to exceeding its memory request
[ https://issues.apache.org/jira/browse/MESOS-10049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001636#comment-17001636 ] Greg Mann commented on MESOS-10049: --- Review here: https://reviews.apache.org/r/71935/ > Add a new reason in `TaskStatus::Reason` for the case that a task is > OOM-killed due to exceeding its memory request > --- > > Key: MESOS-10049 > URL: https://issues.apache.org/jira/browse/MESOS-10049 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Greg Mann >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10049) Add a new reason in `TaskStatus::Reason` for the case that a task is OOM-killed due to exceeding its memory request
[ https://issues.apache.org/jira/browse/MESOS-10049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-10049: - Assignee: Greg Mann > Add a new reason in `TaskStatus::Reason` for the case that a task is > OOM-killed due to exceeding its memory request > --- > > Key: MESOS-10049 > URL: https://issues.apache.org/jira/browse/MESOS-10049 > Project: Mesos > Issue Type: Task >Reporter: Qian Zhang >Assignee: Greg Mann >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10041) Libprocess SSL verification can leak memory
[ https://issues.apache.org/jira/browse/MESOS-10041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16980584#comment-16980584 ] Greg Mann commented on MESOS-10041: --- {noformat} commit e52d0d1f25a91f9940bea4329eb5359373ee0ed0 Author: Benno Evers Date: Fri Nov 22 12:00:43 2019 -0800 Fixed memory leak in openssl verification function. When the hostname validation scheme was set to 'openssl', the `openssl::verify()` function would return without freeing a previously allocated `X509*` object. To fix the leak, a long-standing TODO to switch to RAII-based memory management for the certificate was resolved. Review: https://reviews.apache.org/r/71805/ {noformat} > Libprocess SSL verification can leak memory > --- > > Key: MESOS-10041 > URL: https://issues.apache.org/jira/browse/MESOS-10041 > Project: Mesos > Issue Type: Bug > Components: libprocess >Affects Versions: 1.9.0 >Reporter: Greg Mann >Assignee: Benno Evers >Priority: Major > Labels: libprocess, ssl > > In {{process::network::openssl::verify()}}, when the SSL hostname validation > scheme is set to "openssl", the function can return without freeing an > {{X509}} object, leading to a memory leak. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-10041) Libprocess SSL verification can leak memory
[ https://issues.apache.org/jira/browse/MESOS-10041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-10041: - Assignee: Benno Evers > Libprocess SSL verification can leak memory > --- > > Key: MESOS-10041 > URL: https://issues.apache.org/jira/browse/MESOS-10041 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Greg Mann >Assignee: Benno Evers >Priority: Major > Labels: libprocess, ssl > > In {{process::network::openssl::verify()}}, when the SSL hostname validation > scheme is set to "openssl", the function can return without freeing an > {{X509}} object, leading to a memory leak. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10041) Libprocess SSL verification can leak memory
Greg Mann created MESOS-10041: - Summary: Libprocess SSL verification can leak memory Key: MESOS-10041 URL: https://issues.apache.org/jira/browse/MESOS-10041 Project: Mesos Issue Type: Bug Components: libprocess Reporter: Greg Mann In {{process::network::openssl::verify()}}, when the SSL hostname validation scheme is set to "openssl", the function can return without freeing an {{X509}} object, leading to a memory leak. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10033) Design per-task cgroup isolation
Greg Mann created MESOS-10033: - Summary: Design per-task cgroup isolation Key: MESOS-10033 URL: https://issues.apache.org/jira/browse/MESOS-10033 Project: Mesos Issue Type: Task Reporter: Greg Mann To provide container resource isolation which more closely matches the isolation implied by the Mesos nested container API, we should limit CPU and memory on a per-task basis. The current Mesos containerizer implementation limits CPU and memory at the level of the executor only, which means that tasks within a task group can burst above their CPU or memory resources. Instead, we should apply these limits using per-task cgroups. -- This message was sent by Atlassian Jira (v8.3.4#803005)
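Per-task cgroup limits of the kind the ticket proposes boil down to translating a task's declared resources into cgroup control-file values. A minimal sketch, with hypothetical helper names; 100000 microseconds is the kernel's default CFS period:

```cpp
#include <cassert>
#include <cstdint>

// Default CFS period used by the kernel (100ms, in microseconds).
constexpr int64_t CFS_PERIOD_US = 100000;

// Value a per-task cgroup would write to `cpu.cfs_quota_us` for a task
// declaring a fractional `cpus` resource.
int64_t cpuQuotaUs(double cpus)
{
  return static_cast<int64_t>(cpus * CFS_PERIOD_US);
}

// Value a per-task cgroup would write to `memory.limit_in_bytes` for a
// task declaring `mem` in megabytes (Mesos expresses memory in MB).
int64_t memLimitBytes(double megabytes)
{
  return static_cast<int64_t>(megabytes * 1024 * 1024);
}
```

Writing these per-task values (rather than one executor-level value) is what would prevent a task in a group from bursting above its own share.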
[jira] [Created] (MESOS-10031) Agent's 'executorTerminated()' can cause double task status update
Greg Mann created MESOS-10031: - Summary: Agent's 'executorTerminated()' can cause double task status update Key: MESOS-10031 URL: https://issues.apache.org/jira/browse/MESOS-10031 Project: Mesos Issue Type: Bug Affects Versions: 1.9.0 Reporter: Greg Mann Assignee: Greg Mann When the agent first receives a task status update from an executor, it executes {{Slave::statusUpdate()}}, which adds the task ID to the {{Executor::pendingStatusUpdates}} map, but leaves the ID in {{Executor::launchedTasks}}. Meanwhile, the code in {{Slave::executorTerminated()}} is not capable of handling the intermediate task state which exists in between the execution of {{Slave::statusUpdate()}} and {{Slave::_statusUpdate()}}. If {{Slave::executorTerminated()}} executes at that point in time, it's possible that the task will be transitioned to a terminal state twice (for example, it could be transitioned to TASK_FINISHED by the executor, then to TASK_FAILED by the agent if the executor suddenly terminates). If the agent has already received a status update from an executor, that state transition should be honored even if the executor terminates immediately after it's sent. We should ensure that {{Slave::executorTerminated()}} cannot cause a valid update received from an executor to be ignored. -- This message was sent by Atlassian Jira (v8.3.4#803005)
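The invariant the ticket asks for can be stated in a few lines: once a terminal state from the executor has been accepted, a later agent-generated terminal state must be dropped. A hypothetical sketch (not the agent's actual data structures):

```cpp
#include <cassert>
#include <string>

// Simplified task state tracker illustrating the desired invariant.
struct TaskStatusTracker
{
  std::string state = "TASK_RUNNING";
  bool terminal = false;

  // Returns true if the update was applied, false if it was ignored.
  bool update(const std::string& newState, bool newTerminal)
  {
    if (terminal) {
      // A terminal transition was already accepted (e.g. TASK_FINISHED
      // from the executor); honor it even if the executor terminates
      // immediately afterwards.
      return false;
    }

    state = newState;
    terminal = newTerminal;
    return true;
  }
};
```

In the bug, `Slave::executorTerminated()` effectively skips this check during the window between `Slave::statusUpdate()` and `Slave::_statusUpdate()`, allowing a second terminal transition (e.g. TASK_FAILED after TASK_FINISHED).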
[jira] [Commented] (MESOS-10002) Design doc for container bursting
[ https://issues.apache.org/jira/browse/MESOS-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966898#comment-16966898 ] Greg Mann commented on MESOS-10002: --- Design doc is in progress here: https://docs.google.com/document/d/1iEXn2dBg07HehbNZunJWsIY6iaFezXiRsvpNw4dVQII/edit?usp=sharing > Design doc for container bursting > - > > Key: MESOS-10002 > URL: https://issues.apache.org/jira/browse/MESOS-10002 > Project: Mesos > Issue Type: Task > Components: agent, containerization >Reporter: Greg Mann >Assignee: Greg Mann >Priority: Major > Labels: foundations > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-9977) Agent does not check for immutable files while removing persistent volumes (and possibly in other GC operations)
[ https://issues.apache.org/jira/browse/MESOS-9977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965152#comment-16965152 ] Greg Mann commented on MESOS-9977: -- I can think of two options for handling this: 1) Force the persistent volume removal by having the agent unset the immutable attribute 2) Fail the DESTROY operation In the case of persistent volumes, I think that #2 might make more sense - this is the more conservative thing to do, which seems prudent in the case of potentially critical data. Perhaps we could surface the presence of the immutable attribute in the volume via logging somewhere. [~kaysoky] you mentioned sandbox GC in the description as well - in this case, I might be OK with just forcing the directory removal by having the agent unset the immutable attribute. > Agent does not check for immutable files while removing persistent volumes > (and possibly in other GC operations) > > > Key: MESOS-9977 > URL: https://issues.apache.org/jira/browse/MESOS-9977 > Project: Mesos > Issue Type: Bug > Components: agent >Affects Versions: 1.6.2, 1.7.2, 1.8.1, 1.9.0 >Reporter: Joseph Wu >Priority: Major > Labels: foundations > > We observed an exit/crash loop on an agent originating from deleting a > persistent volume: > {code} > slave.cpp:4557] Deleting persistent volume '' at > '/path/to/mesos/slave/volumes/roles/my-role/' > {code} > This persistent volume happened to have one (or more) files within marked as > {{immutable}}. > When the agent went to delete this persistent volume, via {{os::rmdir(...)}}, > it encountered these immutable file(s) and exits like: > {code} > slave.cpp:4423] EXIT with status 1: Failed to sync checkpointed resources: > Failed to remove persistent volume '' at > '/path/to/mesos/slave/volumes/roles/my-role/': Operation not permitted > {code} > The agent would then be unable to start up again, because during recovery, > the agent would attempt to delete the same persistent volume and fail to do > so. 
> Manually removing the immutable attribute from files within the persistent > volume allows the agent to recover: > {code} > chattr -R -i /path/to/mesos/slave/volumes/roles/my-role/ > {code} > Immutable attributes can be easily introduced by any tasks running on the > agent. As long as the task has sufficient permissions, it could easily call > {{chattr +i ...}}. This attribute could also affect sandbox GC, which also > uses {{os::rmdir}} to clean up. However, sandbox GC tends to warn rather > than exit on failure. -- This message was sent by Atlassian Jira (v8.3.4#803005)
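If the agent were to unset the attribute itself (option #1 above, or for sandbox GC), it would do what `chattr -i` does: read the inode flags via `ioctl(fd, FS_IOC_GETFLAGS, &flags)`, clear `FS_IMMUTABLE_FL`, and write them back via `ioctl(fd, FS_IOC_SETFLAGS, &flags)`. The bit manipulation is shown here as a pure function so it stays portable and testable; the ioctl calls themselves require Linux, `<linux/fs.h>`, and `CAP_LINUX_IMMUTABLE`:

```cpp
#include <cassert>
#include <cstdint>

// Value of FS_IMMUTABLE_FL from <linux/fs.h>; redefined locally so this
// sketch compiles anywhere.
constexpr uint32_t IMMUTABLE_FL = 0x00000010;

// Given the flag word read via ioctl(fd, FS_IOC_GETFLAGS, &flags), compute
// the word to write back via ioctl(fd, FS_IOC_SETFLAGS, &flags) so that
// the file becomes deletable again. All other attributes are preserved.
uint32_t clearImmutable(uint32_t flags)
{
  return flags & ~IMMUTABLE_FL;
}
```

Note that this must be applied recursively (as `chattr -R -i` does), since any file inside the volume can carry the attribute.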
[jira] [Assigned] (MESOS-10002) Design doc for container bursting
[ https://issues.apache.org/jira/browse/MESOS-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-10002: - Assignee: Greg Mann > Design doc for container bursting > - > > Key: MESOS-10002 > URL: https://issues.apache.org/jira/browse/MESOS-10002 > Project: Mesos > Issue Type: Task > Components: agent, containerization >Reporter: Greg Mann >Assignee: Greg Mann >Priority: Major > Labels: foundations > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-9609) Master check failure when marking agent unreachable
[ https://issues.apache.org/jira/browse/MESOS-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-9609: Shepherd: (was: Benno Evers) Assignee: (was: Greg Mann) > Master check failure when marking agent unreachable > --- > > Key: MESOS-9609 > URL: https://issues.apache.org/jira/browse/MESOS-9609 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.5.0 >Reporter: Greg Mann >Priority: Critical > Labels: foundations, mesosphere > > {code} > Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.81543313 > http.cpp:1185] HTTP POST for /master/api/v1/scheduler from 10.142.0.5:55133 > Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.81558813 > master.cpp:5467] Processing DECLINE call for offers: [ > 5e57f633-a69c-4009-b773-990b4b8984ad-O58323 ] for framework > 5e57f633-a69c-4009-b7 > Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.81569313 > master.cpp:10703] Removing offer 5e57f633-a69c-4009-b773-990b4b8984ad-O58323 > Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82014210 > master.cpp:8227] Marking agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at > slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engi > Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82036710 > registrar.cpp:495] Applied 1 operations in 86528ns; attempting to update the > registry > Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82057210 > registrar.cpp:552] Successfully updated the registry in 175872ns > Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82064211 > master.cpp:8275] Marked agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at > slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engin > Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820957 9 > hierarchical.cpp:609] Removed agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 > Mar 11 10:04:35 research docker[4503]: F0311 10:04:35.85196111 > master.cpp:10018] Check failed: 'framework' Must be non NULL > Mar 11 10:04:35 research docker[4503]: 
*** Check failure stack trace: *** > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044a7d > google::LogMessage::Fail() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6046830 > google::LogMessage::SendToLog() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044663 > google::LogMessage::Flush() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6047259 > google::LogMessageFatal::~LogMessageFatal() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5258e14 > google::CheckNotNull<>() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521dfc8 > mesos::internal::master::Master::__removeSlave() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521f1a2 > mesos::internal::master::Master::_markUnreachable() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5f98f11 > process::ProcessBase::consume() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb2a4a > process::ProcessManager::resume() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb65d6 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c35d4c80 (unknown) > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2de76ba start_thread > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2b1d41d (unknown) > Mar 11 10:04:36 research docker[4503]: *** Aborted at 1520762676 (unix time) > try "date -d @1520762676" if you are using GNU date *** > Mar 11 10:04:36 research docker[4503]: PC: @ 0x7f96c2a4d196 (unknown) > Mar 11 10:04:36 research docker[4503]: *** SIGSEGV (@0x0) received by PID 1 > (TID 0x7f96b986d700) from PID 0; stack trace: *** > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2df1390 (unknown) > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2a4d196 (unknown) > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c604ce2c > google::DumpStackTraceAndExit() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044a7d > google::LogMessage::Fail() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6046830 > 
google::LogMessage::SendToLog() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044663 > google::LogMessage::Flush() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6047259 > google::LogMessageFatal::~LogMessageFatal() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5258e14 > google::CheckNotNull<>() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521dfc8 > mesos::internal::master::Master::__removeSlave() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521f1a2 > mesos::internal::master::Master::_markUnreachable() > Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5f98f11 > process::ProcessBase::consume() > Mar 11
[jira] [Comment Edited] (MESOS-9609) Master check failure when marking agent unreachable
[ https://issues.apache.org/jira/browse/MESOS-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962491#comment-16962491 ] Greg Mann edited comment on MESOS-9609 at 10/29/19 9:58 PM: [~arostami] thanks so much for the repro and excellent logs! Much appreciated :) I took a close look and I believe the following sequence of events leads to the crash: 1) The last of the framework’s tasks is removed: {noformat} Oct 27 23:21:18 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:21:18.493418 15 master.cpp:12171] Removing task 2 with resources cpus(allocated: *):1; disk(allocated: *):4024; mem(allocated: *):2048 of framework 522424c1-2fac-42ab-9a70-b424266218a9- on agent 522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 (10.0.143.144) {noformat} which means the framework’s entry in {{slave->tasks}} is erased: https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L13527-L13529 2) Later, the agent disconnects and since the framework is not checkpointing, it is removed from the {{Slave}} struct: {noformat} Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248260 14 master.cpp:1321] Removing framework 522424c1-2fac-42ab-9a70-b424266218a9- (toil) from disconnected agent 522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 (10.0.143.144) because the framework is not checkpointing Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248289 14 master.cpp:11436] Removing framework 522424c1-2fac-42ab-9a70-b424266218a9- (toil) from agent 522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 (10.0.143.144) Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248311 14 master.cpp:12211] Removing executor 'toil-440' with resources {} of framework 522424c1-2fac-42ab-9a70-b424266218a9- on agent 522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 (10.0.143.144) {noformat} We see no 
logging related to task removal since {{slave->tasks[framework->id()]}} was empty this time. However, since we use {{operator[]}} to inspect the task map here, we perform an insertion and it has a side effect: https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L11416 This means that {{slave->tasks[framework->id()]}} now exists but has been initialized to an empty map. ruh roh. 3) Very soon after, the framework failover timeout elapses and the framework is removed: {noformat} Oct 27 23:23:22 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:22.890070 11 master.cpp:10224] Framework failover timeout, removing framework 522424c1-2fac-42ab-9a70-b424266218a9- (toil) {noformat} 4) Now when {{__removeSlave()}} iterates over the keys of {{slave->tasks}}, it finds a key which points to a framework that has already been removed: https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L11796-L11800 We need to prevent that unintended map insertion to avoid the crash. I'll prioritize this fix in the very near future; will update here soon. was (Author: greggomann): [~arostami] thanks so much for the repro and excellent logs! 
Much appreciated :) I took a close look and I believe the following sequence of events leads to the crash: 1) The last of the framework’s tasks is removed: {noformat} Oct 27 23:21:18 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:21:18.493418 15 master.cpp:12171] Removing task 2 with resources cpus(allocated: *):1; disk(allocated: *):4024; mem(allocated: *):2048 of framework 522424c1-2fac-42ab-9a70-b424266218a9- on agent 522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 (10.0.143.144) {noformat} which means the framework’s entry in slave->tasks is erased: https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L13527-L13529 2) Later, the agent disconnects and since the framework is not checkpointing, it is removed from the Slave struct: {noformat} Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248260 14 master.cpp:1321] Removing framework 522424c1-2fac-42ab-9a70-b424266218a9- (toil) from disconnected agent 522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 (10.0.143.144) because the framework is not checkpointing Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248289 14 master.cpp:11436] Removing framework 522424c1-2fac-42ab-9a70-b424266218a9- (toil) from agent 522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 (10.0.143.144) Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248311 14 master.cpp:12211] Removing executor 'toil-440' with resources {} of framework
[jira] [Comment Edited] (MESOS-9609) Master check failure when marking agent unreachable
[ https://issues.apache.org/jira/browse/MESOS-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962491#comment-16962491 ] Greg Mann edited comment on MESOS-9609 at 10/29/19 9:57 PM: [~arostami] thanks so much for the repro and excellent logs! Much appreciated :) I took a close look and I believe the following sequence of events leads to the crash: 1) The last of the framework’s tasks is removed: {noformat} Oct 27 23:21:18 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:21:18.493418 15 master.cpp:12171] Removing task 2 with resources cpus(allocated: *):1; disk(allocated: *):4024; mem(allocated: *):2048 of framework 522424c1-2fac-42ab-9a70-b424266218a9- on agent 522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 (10.0.143.144) {noformat} which means the framework’s entry in slave->tasks is erased: https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L13527-L13529 2) Later, the agent disconnects and since the framework is not checkpointing, it is removed from the Slave struct: {noformat} Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248260 14 master.cpp:1321] Removing framework 522424c1-2fac-42ab-9a70-b424266218a9- (toil) from disconnected agent 522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 (10.0.143.144) because the framework is not checkpointing Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248289 14 master.cpp:11436] Removing framework 522424c1-2fac-42ab-9a70-b424266218a9- (toil) from agent 522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 (10.0.143.144) Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248311 14 master.cpp:12211] Removing executor 'toil-440' with resources {} of framework 522424c1-2fac-42ab-9a70-b424266218a9- on agent 522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 (10.0.143.144) {noformat} We see no logging 
related to task removal since slave->tasks[framework->id()] was empty this time. However, since we use operator[] to inspect the task map here, we perform an insertion and it has a side effect :face-palm: : https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L11416 This means that slave->tasks[framework->id()] now exists but has been initialized to an empty map. ruh roh. 3) Very soon after, the framework failover timeout elapses and the framework is removed: {noformat} Oct 27 23:23:22 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:22.890070 11 master.cpp:10224] Framework failover timeout, removing framework 522424c1-2fac-42ab-9a70-b424266218a9- (toil) {noformat} 4) Now when __removeSlave() iterates over the keys of slave->tasks, it finds a key which points to a framework that has already been removed: https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L11796-L11800 We need to prevent that unintended map insertion to avoid the crash. I'll prioritize this fix in the very near future; will update here soon. was (Author: greggomann): [~arostami] thanks so much for the repro and excellent logs! 
Much appreciated :) I took a close look and I believe the following sequence of events leads to the crash: 1) The last of the framework’s tasks is removed: Oct 27 23:21:18 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:21:18.493418 15 master.cpp:12171] Removing task 2 with resources cpus(allocated: *):1; disk(allocated: *):4024; mem(allocated: *):2048 of framework 522424c1-2fac-42ab-9a70-b424266218a9- on agent 522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 (10.0.143.144) which means the framework’s entry in slave->tasks is erased: https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L13527-L13529 2) Later, the agent disconnects and since the framework is not checkpointing, it is removed from the Slave struct: Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248260 14 master.cpp:1321] Removing framework 522424c1-2fac-42ab-9a70-b424266218a9- (toil) from disconnected agent 522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 (10.0.143.144) because the framework is not checkpointing Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248289 14 master.cpp:11436] Removing framework 522424c1-2fac-42ab-9a70-b424266218a9- (toil) from agent 522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 (10.0.143.144) Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248311 14 master.cpp:12211] Removing executor 'toil-440' with resources {} of framework 522424c1-2fac-42ab-9a70-b424266218a9- on agent
[jira] [Commented] (MESOS-9609) Master check failure when marking agent unreachable
[ https://issues.apache.org/jira/browse/MESOS-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962491#comment-16962491 ] Greg Mann commented on MESOS-9609: -- [~arostami] thanks so much for the repro and excellent logs! Much appreciated :) I took a close look and I believe the following sequence of events leads to the crash: 1) The last of the framework’s tasks is removed: Oct 27 23:21:18 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:21:18.493418 15 master.cpp:12171] Removing task 2 with resources cpus(allocated: *):1; disk(allocated: *):4024; mem(allocated: *):2048 of framework 522424c1-2fac-42ab-9a70-b424266218a9- on agent 522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 (10.0.143.144) which means the framework’s entry in slave->tasks is erased: https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L13527-L13529 2) Later, the agent disconnects and since the framework is not checkpointing, it is removed from the Slave struct: Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248260 14 master.cpp:1321] Removing framework 522424c1-2fac-42ab-9a70-b424266218a9- (toil) from disconnected agent 522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 (10.0.143.144) because the framework is not checkpointing Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248289 14 master.cpp:11436] Removing framework 522424c1-2fac-42ab-9a70-b424266218a9- (toil) from agent 522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 (10.0.143.144) Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248311 14 master.cpp:12211] Removing executor 'toil-440' with resources {} of framework 522424c1-2fac-42ab-9a70-b424266218a9- on agent 522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 (10.0.143.144) We see no logging related to task removal since slave->tasks[framework->id()] was empty 
this time. However, since we use operator[] to inspect the task map here, we perform an insertion and it has a side effect :face-palm: : https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L11416 This means that slave->tasks[framework->id()] now exists but has been initialized to an empty map. ruh roh. 3) Very soon after, the framework failover timeout elapses and the framework is removed: Oct 27 23:23:22 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:22.890070 11 master.cpp:10224] Framework failover timeout, removing framework 522424c1-2fac-42ab-9a70-b424266218a9- (toil) 4) Now when __removeSlave() iterates over the keys of slave->tasks, it finds a key which points to a framework that has already been removed: https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L11796-L11800 We need to prevent that unintended map insertion to avoid the crash. I'll prioritize this fix in the very near future; will update here soon. 
> Master check failure when marking agent unreachable > --- > > Key: MESOS-9609 > URL: https://issues.apache.org/jira/browse/MESOS-9609 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.5.0 >Reporter: Greg Mann >Assignee: Greg Mann >Priority: Critical > Labels: foundations, mesosphere > Fix For: 1.9.0 > > > {code} > Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.81543313 > http.cpp:1185] HTTP POST for /master/api/v1/scheduler from 10.142.0.5:55133 > Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.81558813 > master.cpp:5467] Processing DECLINE call for offers: [ > 5e57f633-a69c-4009-b773-990b4b8984ad-O58323 ] for framework > 5e57f633-a69c-4009-b7 > Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.81569313 > master.cpp:10703] Removing offer 5e57f633-a69c-4009-b773-990b4b8984ad-O58323 > Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82014210 > master.cpp:8227] Marking agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at > slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engi > Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82036710 > registrar.cpp:495] Applied 1 operations in 86528ns; attempting to update the > registry > Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82057210 > registrar.cpp:552] Successfully updated the registry in 175872ns > Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82064211 > master.cpp:8275] Marked agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at > slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engin > Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820957 9 > hierarchical.cpp:609] Removed agent
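The `operator[]` pitfall identified in the analysis above is easy to demonstrate in isolation: on an absent key, `std::map::operator[]` default-constructs and inserts an entry as a side effect, so read-only inspection should go through `find()` (or `count()`/`contains()`). A minimal sketch with hypothetical names, using a plain `std::map` in place of Mesos's `hashmap`:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using TaskMap = std::map<std::string, std::vector<int>>;

// The buggy pattern: using operator[] merely to *inspect* the map
// inserts an empty entry for a missing framework ID.
size_t inspectWithBrackets(TaskMap& tasks, const std::string& frameworkId)
{
  return tasks[frameworkId].size(); // Side effect: inserts if absent!
}

// The safe pattern: find() never mutates the map.
size_t inspectWithFind(const TaskMap& tasks, const std::string& frameworkId)
{
  auto it = tasks.find(frameworkId);
  return it == tasks.end() ? 0 : it->second.size();
}
```

The stale empty entry left behind by the first pattern is exactly what later let `__removeSlave()` look up a framework that no longer existed.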
[jira] [Commented] (MESOS-10010) Implement an SSL socket for Windows, using OpenSSL directly
[ https://issues.apache.org/jira/browse/MESOS-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953103#comment-16953103 ] Greg Mann commented on MESOS-10010: --- [~kaysoky] I think this should be more fine-grained; will this really complete in a single sprint? > Implement an SSL socket for Windows, using OpenSSL directly > --- > > Key: MESOS-10010 > URL: https://issues.apache.org/jira/browse/MESOS-10010 > Project: Mesos > Issue Type: Task > Components: libprocess >Reporter: Joseph Wu >Assignee: Joseph Wu >Priority: Major > Labels: foundations > > {code} > class WindowsSSLSocketImpl : public SocketImpl > { > public: > // This will be the entry point for Socket::create(SSL). > static Try<std::shared_ptr<SocketImpl>> create(int_fd s); > WindowsSSLSocketImpl(int_fd _s); > ~WindowsSSLSocketImpl() override; > // Overrides for the 'SocketImpl' interface below. > // Unreachable. > Future<Nothing> connect(const Address& address) override; > // This will initialize SSL objects then call windows::connect() > // and chain that onto the appropriate call to SSL_do_handshake. > Future<Nothing> connect( > const Address& address, > const openssl::TLSClientConfig& config) override; > // These will call SSL_read or SSL_write as appropriate. > // As long as the SSL context is set up correctly, these will be > // thin wrappers. (More details after the code block.) > Future<size_t> recv(char* data, size_t size) override; > Future<size_t> send(const char* data, size_t size) override; > Future<size_t> sendfile(int_fd fd, off_t offset, size_t size) override; > // Nothing SSL here, just a plain old listener. > Try<Nothing> listen(int backlog) override; > // This will initialize SSL objects then call windows::accept() > // and then perform handshaking. Any downgrading will > // happen here. Since we control the event loop, we can > // easily peek at the first few bytes to check SSL-ness. 
> Future<std::shared_ptr<SocketImpl>> accept() override; > SocketImpl::Kind kind() const override { return SocketImpl::Kind::SSL; } > }; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10004) Enable SSL on Windows
Greg Mann created MESOS-10004: - Summary: Enable SSL on Windows Key: MESOS-10004 URL: https://issues.apache.org/jira/browse/MESOS-10004 Project: Mesos Issue Type: Epic Reporter: Greg Mann -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10003) Design doc for SSL on Windows
Greg Mann created MESOS-10003: - Summary: Design doc for SSL on Windows Key: MESOS-10003 URL: https://issues.apache.org/jira/browse/MESOS-10003 Project: Mesos Issue Type: Task Components: libprocess Reporter: Greg Mann -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10002) Design doc for container bursting
Greg Mann created MESOS-10002: - Summary: Design doc for container bursting Key: MESOS-10002 URL: https://issues.apache.org/jira/browse/MESOS-10002 Project: Mesos Issue Type: Task Components: agent, containerization Reporter: Greg Mann -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10001) Container bursting for CPU/mem
Greg Mann created MESOS-10001: - Summary: Container bursting for CPU/mem Key: MESOS-10001 URL: https://issues.apache.org/jira/browse/MESOS-10001 Project: Mesos Issue Type: Epic Components: agent, containerization Reporter: Greg Mann -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-9971) Mesos failed to build due to error MSB6006 on Windows with MSVC.
[ https://issues.apache.org/jira/browse/MESOS-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932747#comment-16932747 ] Greg Mann commented on MESOS-9971: -- [~kaysoky] have you seen this before? > Mesos failed to build due to error MSB6006 on Windows with MSVC. > > > Key: MESOS-9971 > URL: https://issues.apache.org/jira/browse/MESOS-9971 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: master > Environment: VS 2017 + Windows Server 2016 >Reporter: LinGao >Priority: Major > Attachments: log_x64_build.log > > > Mesos failed to build due to error MSB6006: "cmd.exe" exited with code 1 on > Windows using MSVC. It can be first reproduced at > revision e0f7e2d on the master branch. Could you please > take a look at this issue? Thanks a lot! > Reproduce steps: > 1. git clone -c core.autocrlf=true [https://github.com/apache/mesos] > D:\mesos\src > 2. Open a VS 2017 x64 command prompt as admin and browse to D:\mesos > 3. cd src > 4. .\bootstrap.bat > 5. cd .. > 6. mkdir build_x64 && pushd build_x64 > 7. cmake ..\src -G "Visual Studio 15 2017 Win64" > -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 > -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64 > 8. msbuild Mesos.sln /p:Configuration=Debug /p:Platform=x64 /maxcpucount:4 > /t:Rebuild > > ErrorMessage: > 67>PrepareForBuild: > Creating directory "x64\Debug\dist\dist.tlog\". > InitializeBuildStatus: > Creating "x64\Debug\dist\dist.tlog\unsuccessfulbuild" because > "AlwaysCreate" was specified. > 67>C:\Program Files (x86)\Microsoft Visual > Studio\2017\Enterprise\Common7\IDE\VC\VCTargets\Microsoft.CppCommon.targets(209,5): > error MSB6006: "cmd.exe" exited with code 1. > [D:\Mesos\build_x64\dist.vcxproj] > 67>Done Building Project "D:\Mesos\build_x64\dist.vcxproj" (Rebuild > target(s)) -- FAILED. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-9965) agent should not send `TASK_GONE_BY_OPERATOR` if the framework is not partition aware.
[ https://issues.apache.org/jira/browse/MESOS-9965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928984#comment-16928984 ] Greg Mann commented on MESOS-9965: -- 1.9.x: {noformat} commit d8520b0b4bf52fd27be45817934e2af1b871c399 Author: Greg Mann Date: Thu Sep 12 16:33:20 2019 -0700 Fixed a bug for non-partition-aware schedulers. Previously, the agent would send task status updates with the state TASK_GONE_BY_OPERATOR to all schedulers when an agent was drained with the `mark_gone` parameter set to `true`. This patch updates this code to ensure that TASK_GONE_BY_OPERATOR is only sent to partition-aware schedulers. Review: https://reviews.apache.org/r/71480/ {noformat} > agent should not send `TASK_GONE_BY_OPERATOR` if the framework is not > partition aware. > -- > > Key: MESOS-9965 > URL: https://issues.apache.org/jira/browse/MESOS-9965 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Gilbert Song >Assignee: Greg Mann >Priority: Major > Labels: foundations > Fix For: 1.10, 1.9.1 > > > The Mesos agent should not send `TASK_GONE_BY_OPERATOR` if the framework is > not partition-aware. We should distinguish the framework capability and send > different updates to legacy frameworks. > The issue is exposed from here: > https://github.com/apache/mesos/blob/f0be23765531b05661ed7f1b124faf96744aa80b/src/slave/slave.cpp#L5803 > An example to follow: > https://github.com/apache/mesos/blob/f0be23765531b05661ed7f1b124faf96744aa80b/src/master/master.cpp#L9921 -- This message was sent by Atlassian Jira (v8.3.2#803003)
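The fix described in the commit message reduces to choosing the task state based on the scheduler's capability, as the master already does in the code linked from the ticket. A simplified sketch; the real code inspects the protobuf `FrameworkInfo::Capability` list rather than a bool:

```cpp
#include <cassert>
#include <string>

// Pick the terminal state to send when an agent is drained with
// `mark_gone=true`. Legacy (non-partition-aware) schedulers do not
// understand TASK_GONE_BY_OPERATOR, so they get TASK_LOST instead.
std::string goneUpdateState(bool partitionAware)
{
  return partitionAware ? "TASK_GONE_BY_OPERATOR" : "TASK_LOST";
}
```

The bug was effectively the absence of this branch: TASK_GONE_BY_OPERATOR was sent unconditionally.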
[jira] [Comment Edited] (MESOS-9965) agent should not send `TASK_GONE_BY_OPERATOR` if the framework is not partition aware.
[ https://issues.apache.org/jira/browse/MESOS-9965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928979#comment-16928979 ] Greg Mann edited comment on MESOS-9965 at 9/13/19 3:03 AM: --- master: {noformat} commit 8e1a51207304589a6521cff3540e0705fe1533ff Author: Greg Mann Date: Thu Sep 12 16:33:20 2019 -0700 Fixed a bug for non-partition-aware schedulers. Previously, the agent would send task status updates with the state TASK_GONE_BY_OPERATOR to all schedulers when an agent was drained with the `mark_gone` parameter set to `true`. This patch updates this code to ensure that TASK_GONE_BY_OPERATOR is only sent to partition-aware schedulers. Review: https://reviews.apache.org/r/71480/ {noformat} was (Author: greggomann): {noformat} commit 8e1a51207304589a6521cff3540e0705fe1533ff Author: Greg Mann Date: Thu Sep 12 16:33:20 2019 -0700 Fixed a bug for non-partition-aware schedulers. Previously, the agent would send task status updates with the state TASK_GONE_BY_OPERATOR to all schedulers when an agent was drained with the `mark_gone` parameter set to `true`. This patch updates this code to ensure that TASK_GONE_BY_OPERATOR is only sent to partition-aware schedulers. Review: https://reviews.apache.org/r/71480/ {noformat} > agent should not send `TASK_GONE_BY_OPERATOR` if the framework is not > partition aware. > -- > > Key: MESOS-9965 > URL: https://issues.apache.org/jira/browse/MESOS-9965 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Gilbert Song >Assignee: Greg Mann >Priority: Major > Labels: foundations > Fix For: 1.10, 1.9.1 > > > The Mesos agent should not send `TASK_GONE_BY_OPERATOR` if the framework is > not partition-aware. We should distinguish the framework capability and send > different updates to legacy frameworks. 
> The issue is exposed from here: > https://github.com/apache/mesos/blob/f0be23765531b05661ed7f1b124faf96744aa80b/src/slave/slave.cpp#L5803 > An example to follow: > https://github.com/apache/mesos/blob/f0be23765531b05661ed7f1b124faf96744aa80b/src/master/master.cpp#L9921 -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Assigned] (MESOS-9965) agent should not send `TASK_GONE_BY_OPERATOR` if the framework is not partition aware.
[ https://issues.apache.org/jira/browse/MESOS-9965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-9965: Assignee: Greg Mann > agent should not send `TASK_GONE_BY_OPERATOR` if the framework is not > partition aware. > -- > > Key: MESOS-9965 > URL: https://issues.apache.org/jira/browse/MESOS-9965 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Gilbert Song >Assignee: Greg Mann >Priority: Major > Labels: foundations > > The Mesos agent should not send `TASK_GONE_BY_OPERATOR` if the framework is > not partition-aware. We should distinguish the framework capability and send > different updates to legacy frameworks. > The issue is exposed from here: > https://github.com/apache/mesos/blob/f0be23765531b05661ed7f1b124faf96744aa80b/src/slave/slave.cpp#L5803 > An example to follow: > https://github.com/apache/mesos/blob/f0be23765531b05661ed7f1b124faf96744aa80b/src/master/master.cpp#L9921 -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (MESOS-9957) Sequence all operations on the agent
[ https://issues.apache.org/jira/browse/MESOS-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919881#comment-16919881 ] Greg Mann commented on MESOS-9957: -- See the following review, which includes a test illustrating this type of failure: https://reviews.apache.org/r/71417/ > Sequence all operations on the agent > > > Key: MESOS-9957 > URL: https://issues.apache.org/jira/browse/MESOS-9957 > Project: Mesos > Issue Type: Task >Reporter: Greg Mann >Priority: Major > Labels: foundations, mesosphere > > The resolution of MESOS-8582 requires that an asynchronous step be added to > the code path which applies speculative operations like RESERVE and CREATE on > the agent. In order to ensure that the {{FrameworkInfo}} associated with an > incoming operation will be successfully retained, we must first unschedule GC > on the framework meta directory if the framework struct does not exist but > that directory does. By introducing this asynchronous step, we allow the > possibility that an operation may be executed out-of-order with respect to an > incoming dependent LAUNCH or LAUNCH_GROUP. > For example, if a scheduler issues an ACCEPT call containing both a RESERVE > operation as well as a LAUNCH operation containing a task which consumes the > new reserved resources, it's possible that this task will be launched on the > agent before the reserved resources exist. > While we already [sequence task launches on a per-executor > basis|https://github.com/apache/mesos/blob/9297e2d3b0d44b553fc89bcf5f6109c76cc53668/src/slave/slave.cpp#L2337-L2408], > the aforementioned corner case requires that we sequence _all_ offer > operations on a per-framework basis. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (MESOS-9957) Sequence all operations on the agent
[ https://issues.apache.org/jira/browse/MESOS-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919879#comment-16919879 ] Greg Mann commented on MESOS-9957: -- One approach that could be taken here is to eliminate the per-executor {{Sequence}}s in the {{taskLaunchSequences}} map, and instead put a single {{Sequence operationSequence}} member in the {{Framework}} struct. The {{taskLaunch}} futures from the {{run()}} code path could likely be added into that sequence as-is, with the {{applyOperation()}} code path adding new futures to that sequence as well. > Sequence all operations on the agent > > > Key: MESOS-9957 > URL: https://issues.apache.org/jira/browse/MESOS-9957 > Project: Mesos > Issue Type: Task >Reporter: Greg Mann >Priority: Major > Labels: foundations, mesosphere > > The resolution of MESOS-8582 requires that an asynchronous step be added to > the code path which applies speculative operations like RESERVE and CREATE on > the agent. In order to ensure that the {{FrameworkInfo}} associated with an > incoming operation will be successfully retained, we must first unschedule GC > on the framework meta directory if the framework struct does not exist but > that directory does. By introducing this asynchronous step, we allow the > possibility that an operation may be executed out-of-order with respect to an > incoming dependent LAUNCH or LAUNCH_GROUP. > For example, if a scheduler issues an ACCEPT call containing both a RESERVE > operation as well as a LAUNCH operation containing a task which consumes the > new reserved resources, it's possible that this task will be launched on the > agent before the reserved resources exist. > While we already [sequence task launches on a per-executor > basis|https://github.com/apache/mesos/blob/9297e2d3b0d44b553fc89bcf5f6109c76cc53668/src/slave/slave.cpp#L2337-L2408], > the aforementioned corner case requires that we sequence _all_ offer > operations on a per-framework basis. 
-- This message was sent by Atlassian Jira (v8.3.2#803003)
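The single-sequence approach suggested in the comment above can be sketched as follows. This is a simplified stand-in: the class name and FIFO queue here are illustrative only, not the libprocess `process::Sequence` API (which chains futures asynchronously); the point is just that RESERVE and a dependent LAUNCH submitted to one per-framework sequence run in submission order.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical sketch of a per-framework operation sequence. In Mesos this
// role would be played by a `process::Sequence` member on the `Framework`
// struct; here a plain FIFO queue stands in for the future-chaining.
class OperationSequence {
public:
  // Enqueue an operation; it runs only after all previously added ones.
  void add(std::function<void()> op) { pending.push(std::move(op)); }

  // Drain in FIFO order, as the agent would once each asynchronous
  // prerequisite (e.g. unscheduling GC on the framework meta directory)
  // has completed.
  void run() {
    while (!pending.empty()) {
      pending.front()();
      pending.pop();
    }
  }

private:
  std::queue<std::function<void()>> pending;
};
```

With both the RESERVE and the dependent LAUNCH added to the same sequence, the launch cannot observe the agent's resources before the reservation has been applied, which is exactly the reordering the ticket describes.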
[jira] [Created] (MESOS-9957) Sequence all operations on the agent
Greg Mann created MESOS-9957: Summary: Sequence all operations on the agent Key: MESOS-9957 URL: https://issues.apache.org/jira/browse/MESOS-9957 Project: Mesos Issue Type: Task Reporter: Greg Mann The resolution of MESOS-8582 requires that an asynchronous step be added to the code path which applies speculative operations like RESERVE and CREATE on the agent. In order to ensure that the {{FrameworkInfo}} associated with an incoming operation will be successfully retained, we must first unschedule GC on the framework meta directory if the framework struct does not exist but that directory does. By introducing this asynchronous step, we allow the possibility that an operation may be executed out-of-order with respect to an incoming dependent LAUNCH or LAUNCH_GROUP. For example, if a scheduler issues an ACCEPT call containing both a RESERVE operation as well as a LAUNCH operation containing a task which consumes the new reserved resources, it's possible that this task will be launched on the agent before the reserved resources exist. While we already [sequence task launches on a per-executor basis|https://github.com/apache/mesos/blob/9297e2d3b0d44b553fc89bcf5f6109c76cc53668/src/slave/slave.cpp#L2337-L2408], the aforementioned corner case requires that we sequence _all_ offer operations on a per-framework basis. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (MESOS-9954) Flapping tasks with large sandboxes can fill agent disk
Greg Mann created MESOS-9954: Summary: Flapping tasks with large sandboxes can fill agent disk Key: MESOS-9954 URL: https://issues.apache.org/jira/browse/MESOS-9954 Project: Mesos Issue Type: Bug Reporter: Greg Mann If a task on an agent is repeatedly re-launched after failing and pulls a large artifact into its sandbox, it can quickly fill the agent disk. This may happen on a time scale shorter than the disk watch interval, leading to the agent disk filling up. We should evaluate solutions to this issue. A couple options: * Perhaps an aggressive (short) disk watch interval is sufficient? We should investigate the performance impact of this approach. * If the former doesn't work, then maybe polling free disk space whenever a task is launched makes sense? (Rate-limiting this might be necessary) * Perhaps we can come up with some fundamentally different approach for detecting free disk space which would solve this issue? -- This message was sent by Atlassian Jira (v8.3.2#803003)
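The second option listed above (polling free disk space whenever a task is launched, with rate limiting) could look roughly like the sketch below. The class and method names are hypothetical and this is not Mesos code; it uses POSIX `statvfs` and caches the result so that a burst of task launches does not hammer the filesystem.

```cpp
#include <cassert>
#include <chrono>
#include <sys/statvfs.h>

// Hypothetical rate-limited free-space probe: at most one statvfs() call
// per `interval`; launches inside the window reuse the cached value.
class RateLimitedDiskCheck {
public:
  explicit RateLimitedDiskCheck(std::chrono::seconds interval)
    : interval(interval) {}

  // Returns free bytes available to unprivileged users at `path`,
  // or the cached value if called again within `interval`.
  // Returns 0 if statvfs() fails (and retries on the next call).
  unsigned long long freeBytes(const char* path) {
    const auto now = std::chrono::steady_clock::now();
    if (cached == 0 || now - last >= interval) {
      struct statvfs vfs;
      cached = (statvfs(path, &vfs) == 0)
        ? static_cast<unsigned long long>(vfs.f_bavail) * vfs.f_frsize
        : 0;
      last = now;
    }
    return cached;
  }

private:
  const std::chrono::seconds interval;
  std::chrono::steady_clock::time_point last{};
  unsigned long long cached = 0;
};
```

An agent could consult such a probe in the launch path and refuse (or delay) launches once free space drops below a threshold, closing the window between disk-watch cycles.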
[jira] [Comment Edited] (MESOS-9545) Marking an unreachable agent as gone should transition the tasks to terminal state
[ https://issues.apache.org/jira/browse/MESOS-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911836#comment-16911836 ] Greg Mann edited comment on MESOS-9545 at 8/21/19 1:40 AM: --- 1.8.x: {noformat} commit 13e4cd1c42ae88094f14d6b05cfb9832d4494193 Author: Greg Mann Date: Tue Apr 23 22:25:29 2019 -0700 Transitioned tasks when an unreachable agent is marked as gone. This patch updates the master code responsible for marking agents as gone to properly transition tasks on agents which were previously marked as unreachable. Review: https://reviews.apache.org/r/70519/ {noformat} {noformat} commit 6f90cc334701fad10e721312cd4cbd0690e1c6ec Author: Greg Mann Date: Tue Apr 23 22:25:21 2019 -0700 Fixed a memory leak in the master's 'removeTask()' helper. Previously, all removed tasks were added to the `slaves.unreachableTasks` map. This patch adds a conditional so that removed tasks are only added to that structure when they are being marked unreachable. Review: https://reviews.apache.org/r/70518/ {noformat} 1.7.x: {noformat} commit 61f1155675bd3bc5312e0501ea6182d2ee7434af Author: Greg Mann Date: Tue Apr 23 22:25:29 2019 -0700 Transitioned tasks when an unreachable agent is marked as gone. This patch updates the master code responsible for marking agents as gone to properly transition tasks on agents which were previously marked as unreachable. Review: https://reviews.apache.org/r/70519/ {noformat} {noformat} commit 0c5e78bc26653d26a03b08b82923ea517de46fc0 Author: Greg Mann Date: Tue Apr 23 22:25:21 2019 -0700 Fixed a memory leak in the master's 'removeTask()' helper. Previously, all removed tasks were added to the `slaves.unreachableTasks` map. This patch adds a conditional so that removed tasks are only added to that structure when they are being marked unreachable. 
Review: https://reviews.apache.org/r/70518/ {noformat} 1.6.x: {noformat} commit c6da50d10511a1046b8d4bc563dc3ccee875 Author: Greg Mann Date: Tue Apr 23 22:25:29 2019 -0700 Transitioned tasks when an unreachable agent is marked as gone. This patch updates the master code responsible for marking agents as gone to properly transition tasks on agents which were previously marked as unreachable. Review: https://reviews.apache.org/r/70519/ {noformat} {noformat} commit 6a9cee7999be0a3a4f89d21ec58947fe90c01eeb Author: Greg Mann Date: Tue Apr 23 22:25:21 2019 -0700 Fixed a memory leak in the master's 'removeTask()' helper. Previously, all removed tasks were added to the `slaves.unreachableTasks` map. This patch adds a conditional so that removed tasks are only added to that structure when they are being marked unreachable. Review: https://reviews.apache.org/r/70518/ {noformat} was (Author: greggomann): 1.8.x: {noformat} commit 13e4cd1c42ae88094f14d6b05cfb9832d4494193 Author: Greg Mann Date: Tue Apr 23 22:25:29 2019 -0700 Transitioned tasks when an unreachable agent is marked as gone. This patch updates the master code responsible for marking agents as gone to properly transition tasks on agents which were previously marked as unreachable. Review: https://reviews.apache.org/r/70519/ {noformat} {noformat} commit 6f90cc334701fad10e721312cd4cbd0690e1c6ec Author: Greg Mann Date: Tue Apr 23 22:25:21 2019 -0700 Fixed a memory leak in the master's 'removeTask()' helper. Previously, all removed tasks were added to the `slaves.unreachableTasks` map. This patch adds a conditional so that removed tasks are only added to that structure when they are being marked unreachable. Review: https://reviews.apache.org/r/70518/ {noformat} 1.7.x: {noformat} commit 61f1155675bd3bc5312e0501ea6182d2ee7434af Author: Greg Mann Date: Tue Apr 23 22:25:29 2019 -0700 Transitioned tasks when an unreachable agent is marked as gone. 
This patch updates the master code responsible for marking agents as gone to properly transition tasks on agents which were previously marked as unreachable. Review: https://reviews.apache.org/r/70519/ {noformat} {noformat} commit 0c5e78bc26653d26a03b08b82923ea517de46fc0 Author: Greg Mann Date: Tue Apr 23 22:25:21 2019 -0700 Fixed a memory leak in the master's 'removeTask()' helper. Previously, all removed tasks were added to the `slaves.unreachableTasks` map. This patch adds a conditional so that removed tasks are only added to that structure when they are being marked unreachable. Review: https://reviews.apache.org/r/70518/ {noformat} 1.6.x: {noformat} commit c6da50d10511a1046b8d4bc563dc3ccee875 (HEAD -> 1.6.x, origin/1.6.x, mesos-private/ci/greg/mesos-9545-1.6.x, ci/greg/mesos-9545-1.6.x) Author: Greg Mann Date: Tue Apr 23 22:25:29 2019 -0700 Transitioned tasks when an
[jira] [Comment Edited] (MESOS-9545) Marking an unreachable agent as gone should transition the tasks to terminal state
[ https://issues.apache.org/jira/browse/MESOS-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911836#comment-16911836 ] Greg Mann edited comment on MESOS-9545 at 8/21/19 1:40 AM: --- 1.8.x: {noformat} commit 13e4cd1c42ae88094f14d6b05cfb9832d4494193 Author: Greg Mann Date: Tue Apr 23 22:25:29 2019 -0700 Transitioned tasks when an unreachable agent is marked as gone. This patch updates the master code responsible for marking agents as gone to properly transition tasks on agents which were previously marked as unreachable. Review: https://reviews.apache.org/r/70519/ {noformat} {noformat} commit 6f90cc334701fad10e721312cd4cbd0690e1c6ec Author: Greg Mann Date: Tue Apr 23 22:25:21 2019 -0700 Fixed a memory leak in the master's 'removeTask()' helper. Previously, all removed tasks were added to the `slaves.unreachableTasks` map. This patch adds a conditional so that removed tasks are only added to that structure when they are being marked unreachable. Review: https://reviews.apache.org/r/70518/ {noformat} 1.7.x: {noformat} commit 61f1155675bd3bc5312e0501ea6182d2ee7434af Author: Greg Mann Date: Tue Apr 23 22:25:29 2019 -0700 Transitioned tasks when an unreachable agent is marked as gone. This patch updates the master code responsible for marking agents as gone to properly transition tasks on agents which were previously marked as unreachable. Review: https://reviews.apache.org/r/70519/ {noformat} {noformat} commit 0c5e78bc26653d26a03b08b82923ea517de46fc0 Author: Greg Mann Date: Tue Apr 23 22:25:21 2019 -0700 Fixed a memory leak in the master's 'removeTask()' helper. Previously, all removed tasks were added to the `slaves.unreachableTasks` map. This patch adds a conditional so that removed tasks are only added to that structure when they are being marked unreachable. 
Review: https://reviews.apache.org/r/70518/ {noformat} 1.6.x: {noformat} commit c6da50d10511a1046b8d4bc563dc3ccee875 (HEAD -> 1.6.x, origin/1.6.x, mesos-private/ci/greg/mesos-9545-1.6.x, ci/greg/mesos-9545-1.6.x) Author: Greg Mann Date: Tue Apr 23 22:25:29 2019 -0700 Transitioned tasks when an unreachable agent is marked as gone. This patch updates the master code responsible for marking agents as gone to properly transition tasks on agents which were previously marked as unreachable. Review: https://reviews.apache.org/r/70519/ {noformat} {noformat} commit 6a9cee7999be0a3a4f89d21ec58947fe90c01eeb Author: Greg Mann Date: Tue Apr 23 22:25:21 2019 -0700 Fixed a memory leak in the master's 'removeTask()' helper. Previously, all removed tasks were added to the `slaves.unreachableTasks` map. This patch adds a conditional so that removed tasks are only added to that structure when they are being marked unreachable. Review: https://reviews.apache.org/r/70518/ {noformat} was (Author: greggomann): 1.8.x: {noformat} commit 13e4cd1c42ae88094f14d6b05cfb9832d4494193 Author: Greg Mann Date: Tue Apr 23 22:25:29 2019 -0700 Transitioned tasks when an unreachable agent is marked as gone. This patch updates the master code responsible for marking agents as gone to properly transition tasks on agents which were previously marked as unreachable. Review: https://reviews.apache.org/r/70519/ {noformat} {noformat} commit 6f90cc334701fad10e721312cd4cbd0690e1c6ec Author: Greg Mann Date: Tue Apr 23 22:25:21 2019 -0700 Fixed a memory leak in the master's 'removeTask()' helper. Previously, all removed tasks were added to the `slaves.unreachableTasks` map. This patch adds a conditional so that removed tasks are only added to that structure when they are being marked unreachable. 
Review: https://reviews.apache.org/r/70518/ {noformat} 1.7.x: {noformat} commit 61f1155675bd3bc5312e0501ea6182d2ee7434af Author: Greg Mann Date: Tue Apr 23 22:25:29 2019 -0700 Transitioned tasks when an unreachable agent is marked as gone. This patch updates the master code responsible for marking agents as gone to properly transition tasks on agents which were previously marked as unreachable. Review: https://reviews.apache.org/r/70519/ {noformat} {noformat} commit 0c5e78bc26653d26a03b08b82923ea517de46fc0 Author: Greg Mann Date: Tue Apr 23 22:25:21 2019 -0700 Fixed a memory leak in the master's 'removeTask()' helper. Previously, all removed tasks were added to the `slaves.unreachableTasks` map. This patch adds a conditional so that removed tasks are only added to that structure when they are being marked unreachable. Review: https://reviews.apache.org/r/70518/ {noformat} > Marking an unreachable agent as gone should transition the tasks to terminal > state >
[jira] [Commented] (MESOS-9937) 53598228fe should be backported to 1.7.x
[ https://issues.apache.org/jira/browse/MESOS-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911854#comment-16911854 ] Greg Mann commented on MESOS-9937: -- [~carlone], this is done, see the commit below: {noformat} commit 0c5e78bc26653d26a03b08b82923ea517de46fc0 Author: Greg Mann Date: Tue Apr 23 22:25:21 2019 -0700 Fixed a memory leak in the master's 'removeTask()' helper. Previously, all removed tasks were added to the `slaves.unreachableTasks` map. This patch adds a conditional so that removed tasks are only added to that structure when they are being marked unreachable. Review: https://reviews.apache.org/r/70518/ {noformat} > 53598228fe should be backported to 1.7.x > > > Key: MESOS-9937 > URL: https://issues.apache.org/jira/browse/MESOS-9937 > Project: Mesos > Issue Type: Bug >Reporter: longfei >Assignee: Greg Mann >Priority: Blocker > Labels: foundations > > Commit 53598228fe on the master branch should be backported to 1.7.x. > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Comment Edited] (MESOS-9545) Marking an unreachable agent as gone should transition the tasks to terminal state
[ https://issues.apache.org/jira/browse/MESOS-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911836#comment-16911836 ] Greg Mann edited comment on MESOS-9545 at 8/21/19 1:25 AM: --- 1.8.x: {noformat} commit 13e4cd1c42ae88094f14d6b05cfb9832d4494193 Author: Greg Mann Date: Tue Apr 23 22:25:29 2019 -0700 Transitioned tasks when an unreachable agent is marked as gone. This patch updates the master code responsible for marking agents as gone to properly transition tasks on agents which were previously marked as unreachable. Review: https://reviews.apache.org/r/70519/ {noformat} {noformat} commit 6f90cc334701fad10e721312cd4cbd0690e1c6ec Author: Greg Mann Date: Tue Apr 23 22:25:21 2019 -0700 Fixed a memory leak in the master's 'removeTask()' helper. Previously, all removed tasks were added to the `slaves.unreachableTasks` map. This patch adds a conditional so that removed tasks are only added to that structure when they are being marked unreachable. Review: https://reviews.apache.org/r/70518/ {noformat} 1.7.x: {noformat} commit 61f1155675bd3bc5312e0501ea6182d2ee7434af Author: Greg Mann Date: Tue Apr 23 22:25:29 2019 -0700 Transitioned tasks when an unreachable agent is marked as gone. This patch updates the master code responsible for marking agents as gone to properly transition tasks on agents which were previously marked as unreachable. Review: https://reviews.apache.org/r/70519/ {noformat} {noformat} commit 0c5e78bc26653d26a03b08b82923ea517de46fc0 Author: Greg Mann Date: Tue Apr 23 22:25:21 2019 -0700 Fixed a memory leak in the master's 'removeTask()' helper. Previously, all removed tasks were added to the `slaves.unreachableTasks` map. This patch adds a conditional so that removed tasks are only added to that structure when they are being marked unreachable. 
Review: https://reviews.apache.org/r/70518/ {noformat} was (Author: greggomann): 1.8.x: {noformat} commit 13e4cd1c42ae88094f14d6b05cfb9832d4494193 Author: Greg Mann Date: Tue Apr 23 22:25:29 2019 -0700 Transitioned tasks when an unreachable agent is marked as gone. This patch updates the master code responsible for marking agents as gone to properly transition tasks on agents which were previously marked as unreachable. Review: https://reviews.apache.org/r/70519/ {noformat} {noformat} commit 6f90cc334701fad10e721312cd4cbd0690e1c6ec Author: Greg Mann Date: Tue Apr 23 22:25:21 2019 -0700 Fixed a memory leak in the master's 'removeTask()' helper. Previously, all removed tasks were added to the `slaves.unreachableTasks` map. This patch adds a conditional so that removed tasks are only added to that structure when they are being marked unreachable. Review: https://reviews.apache.org/r/70518/ {noformat} > Marking an unreachable agent as gone should transition the tasks to terminal > state > -- > > Key: MESOS-9545 > URL: https://issues.apache.org/jira/browse/MESOS-9545 > Project: Mesos > Issue Type: Improvement >Reporter: Vinod Kone >Assignee: Greg Mann >Priority: Major > Labels: foundations > Fix For: 1.9.0 > > > If an unreachable agent is marked as gone, currently master just marks that > agent in the registry but doesn't do anything about its tasks. So the tasks > are in UNREACHABLE state in the master forever, until the master fails over. > This is not great UX. We should transition these to terminal state instead. > This fix should also include a test to verify. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (MESOS-9545) Marking an unreachable agent as gone should transition the tasks to terminal state
[ https://issues.apache.org/jira/browse/MESOS-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911836#comment-16911836 ] Greg Mann commented on MESOS-9545: -- 1.8.x: {noformat} commit 13e4cd1c42ae88094f14d6b05cfb9832d4494193 Author: Greg Mann Date: Tue Apr 23 22:25:29 2019 -0700 Transitioned tasks when an unreachable agent is marked as gone. This patch updates the master code responsible for marking agents as gone to properly transition tasks on agents which were previously marked as unreachable. Review: https://reviews.apache.org/r/70519/ {noformat} {noformat} commit 6f90cc334701fad10e721312cd4cbd0690e1c6ec Author: Greg Mann Date: Tue Apr 23 22:25:21 2019 -0700 Fixed a memory leak in the master's 'removeTask()' helper. Previously, all removed tasks were added to the `slaves.unreachableTasks` map. This patch adds a conditional so that removed tasks are only added to that structure when they are being marked unreachable. Review: https://reviews.apache.org/r/70518/ {noformat} > Marking an unreachable agent as gone should transition the tasks to terminal > state > -- > > Key: MESOS-9545 > URL: https://issues.apache.org/jira/browse/MESOS-9545 > Project: Mesos > Issue Type: Improvement >Reporter: Vinod Kone >Assignee: Greg Mann >Priority: Major > Labels: foundations > Fix For: 1.9.0 > > > If an unreachable agent is marked as gone, currently master just marks that > agent in the registry but doesn't do anything about its tasks. So the tasks > are in UNREACHABLE state in the master forever, until the master fails over. > This is not great UX. We should transition these to terminal state instead. > This fix should also include a test to verify. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (MESOS-9946) DefaultExecutorTest.ROOT_INTERNET_CURL_DockerTaskWithFileURI is flaky
Greg Mann created MESOS-9946: Summary: DefaultExecutorTest.ROOT_INTERNET_CURL_DockerTaskWithFileURI is flaky Key: MESOS-9946 URL: https://issues.apache.org/jira/browse/MESOS-9946 Project: Mesos Issue Type: Bug Components: test Reporter: Greg Mann Observed this on a 1.8.x build. I suspect it's due to a slow image pull based on the logs. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (MESOS-9945) Use streaming response in the checker process
Greg Mann created MESOS-9945: Summary: Use streaming response in the checker process Key: MESOS-9945 URL: https://issues.apache.org/jira/browse/MESOS-9945 Project: Mesos Issue Type: Improvement Reporter: Greg Mann Because we do not currently use a streaming response for nested container command health checks in the checker process, we are not able to display the output of failed checks (MESOS-7903), and we are not able to begin the health check timeout at the appropriate moment (MESOS-9944). We should update the checker process to use a streaming response for the LAUNCH_NESTED_CONTAINER_SESSION call that it uses to initiate command health checks. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (MESOS-9944) Command health check timeout begins too early
Greg Mann created MESOS-9944: Summary: Command health check timeout begins too early Key: MESOS-9944 URL: https://issues.apache.org/jira/browse/MESOS-9944 Project: Mesos Issue Type: Bug Components: agent Affects Versions: 1.9.0 Reporter: Greg Mann The checker process begins the timer for the command health check timeout when the LAUNCH_NESTED_CONTAINER_SESSION request is first sent, which means any delay in the execution of the health check command is included in the health check timeout. This can be an issue when the agent is under heavy load, and it may take a few seconds for the health check command to be run. Once we have a streaming response for the ATTACH_CONTAINER_OUTPUT call which follows the nested container launch, we can initiate the health check timeout once the first byte of the response is received; this is a more accurate signal that the health check command has begun running. -- This message was sent by Atlassian Jira (v8.3.2#803003)
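The timing change proposed in this ticket can be illustrated with a small sketch (names hypothetical; this is not the Mesos checker code): the timeout clock starts only when the first byte of the streamed response arrives, so launch delay under load is not charged against the check.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical timer for a streamed health check: arm the deadline on the
// first chunk of response data, not when the request is sent; later chunks
// do not reset it.
struct CheckTimer {
  std::chrono::steady_clock::time_point deadline{};
  bool started = false;

  // Called for each chunk of the streaming response.
  void onData(std::chrono::seconds timeout) {
    if (!started) {
      started = true;
      deadline = std::chrono::steady_clock::now() + timeout;
    }
  }

  // The check can only time out after it has demonstrably begun running.
  bool expired() const {
    return started && std::chrono::steady_clock::now() > deadline;
  }
};
```

Before the first byte, `expired()` is false no matter how long the agent takes to launch the nested container, which is the behavior the ticket argues for.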
[jira] [Commented] (MESOS-9545) Marking an unreachable agent as gone should transition the tasks to terminal state
[ https://issues.apache.org/jira/browse/MESOS-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906587#comment-16906587 ] Greg Mann commented on MESOS-9545: -- [~vinodkone] thanks for the ping - I have these backports in progress but got distracted, will make this happen this week. > Marking an unreachable agent as gone should transition the tasks to terminal > state > -- > > Key: MESOS-9545 > URL: https://issues.apache.org/jira/browse/MESOS-9545 > Project: Mesos > Issue Type: Improvement >Reporter: Vinod Kone >Assignee: Greg Mann >Priority: Major > Labels: foundations > Fix For: 1.9.0 > > > If an unreachable agent is marked as gone, currently master just marks that > agent in the registry but doesn't do anything about its tasks. So the tasks > are in UNREACHABLE state in the master forever, until the master fails over. > This is not great UX. We should transition these to terminal state instead. > This fix should also include a test to verify. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9938) Standalone container documentation
[ https://issues.apache.org/jira/browse/MESOS-9938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906579#comment-16906579 ] Greg Mann commented on MESOS-9938: -- Review here: https://reviews.apache.org/r/65112/ > Standalone container documentation > -- > > Key: MESOS-9938 > URL: https://issues.apache.org/jira/browse/MESOS-9938 > Project: Mesos > Issue Type: Documentation > Components: documentation >Reporter: Greg Mann >Assignee: Joseph Wu >Priority: Major > Labels: foundations, mesosphere > > We should add documentation for standalone containers. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9938) Standalone container documentation
[ https://issues.apache.org/jira/browse/MESOS-9938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-9938: Assignee: Joseph Wu > Standalone container documentation > -- > > Key: MESOS-9938 > URL: https://issues.apache.org/jira/browse/MESOS-9938 > Project: Mesos > Issue Type: Documentation > Components: documentation >Reporter: Greg Mann >Assignee: Joseph Wu >Priority: Major > Labels: foundations, mesosphere > > We should add documentation for standalone containers. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (MESOS-9938) Standalone container documentation
Greg Mann created MESOS-9938: Summary: Standalone container documentation Key: MESOS-9938 URL: https://issues.apache.org/jira/browse/MESOS-9938 Project: Mesos Issue Type: Documentation Components: documentation Reporter: Greg Mann We should add documentation for standalone containers. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9937) 53598228fe should be backported to 1.7.x
[ https://issues.apache.org/jira/browse/MESOS-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906559#comment-16906559 ] Greg Mann commented on MESOS-9937: -- [~carlone] good timing! I was already planning to backport that commit as part of backporting MESOS-9545, which I previously overlooked backporting. Should happen in the next couple days. > 53598228fe should be backported to 1.7.x > > > Key: MESOS-9937 > URL: https://issues.apache.org/jira/browse/MESOS-9937 > Project: Mesos > Issue Type: Bug >Reporter: longfei >Assignee: Greg Mann >Priority: Blocker > Labels: foundations > > Commit 53598228fe on the master branch should be backported to 1.7.x. > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (MESOS-9931) StorageLocalResourceProviderTest.ROOT_NewVolumeRecovery is flaky
Greg Mann created MESOS-9931: Summary: StorageLocalResourceProviderTest.ROOT_NewVolumeRecovery is flaky Key: MESOS-9931 URL: https://issues.apache.org/jira/browse/MESOS-9931 Project: Mesos Issue Type: Bug Components: resource provider, test Environment: Ubuntu 14.04, SSL-enabled Reporter: Greg Mann {noformat} 20:03:04 [ RUN ] StorageLocalResourceProviderTest.ROOT_NewVolumeRecovery 20:03:04 I0808 20:03:04.748040 10423 cluster.cpp:172] Creating default 'local' authorizer 20:03:04 I0808 20:03:04.749110 23206 master.cpp:467] Master 6180f181-f8df-4125-a7bd-a1b5ff9f16f1 (ip-172-16-10-204.ec2.internal) started on 172.16.10.204:57833 20:03:04 I0808 20:03:04.749131 23206 master.cpp:469] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/nLxkqj/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/nLxkqj/master" --zk_session_timeout="10secs" 20:03:04 I0808 20:03:04.749269 23206 master.cpp:518] Master only allowing authenticated frameworks to register 20:03:04 I0808 20:03:04.749276 23206 master.cpp:524] Master only allowing authenticated agents to register 20:03:04 I0808 20:03:04.749281 23206 master.cpp:530] Master only allowing authenticated HTTP frameworks to register 20:03:04 I0808 20:03:04.749287 23206 credentials.hpp:37] Loading credentials for authentication from '/tmp/nLxkqj/credentials' 20:03:04 I0808 20:03:04.749378 23206 master.cpp:574] Using default 'crammd5' authenticator 20:03:04 I0808 20:03:04.749428 23206 http.cpp:1049] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' 20:03:04 I0808 20:03:04.749473 23206 http.cpp:1049] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' 20:03:04 I0808 20:03:04.749497 23206 http.cpp:1049] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' 20:03:04 I0808 20:03:04.749522 23206 master.cpp:653] Authorization enabled 20:03:04 I0808 20:03:04.749619 23208 hierarchical.cpp:219] Initialized hierarchical allocator process 20:03:04 I0808 20:03:04.749645 23208 whitelist_watcher.cpp:77] No whitelist given 20:03:04 I0808 20:03:04.750432 23206 master.cpp:2265] Elected as the leading master! 
20:03:04 I0808 20:03:04.750452 23206 master.cpp:1730] Recovering from registrar 20:03:04 I0808 20:03:04.750491 23206 registrar.cpp:347] Recovering registrar 20:03:04 I0808 20:03:04.750617 23206 registrar.cpp:391] Successfully fetched the registry (0B) in 112896ns 20:03:04 I0808 20:03:04.750648 23206 registrar.cpp:495] Applied 1 operations in 7349ns; attempting to update the registry 20:03:04 I0808 20:03:04.750763 23204 registrar.cpp:552] Successfully updated the registry in 97024ns 20:03:04 I0808 20:03:04.750794 23204 registrar.cpp:424] Successfully recovered registrar 20:03:04 I0808 20:03:04.750921 23204 master.cpp:1843] Recovered 0 agents from the registry (176B); allowing 10mins for agents to re-register 20:03:04 I0808 20:03:04.750958 23204 hierarchical.cpp:257] Skipping recovery of hierarchical allocator: nothing to recover 20:03:04 W0808 20:03:04.752594 10423 process.cpp:2745] Attempted to spawn already running process files@172.16.10.204:57833 20:03:04 I0808 20:03:04.753041 10423 containerizer.cpp:309] Using isolation { environment_secret, volume/sandbox_path, filesystem/linux, network/cni, volume/image, volume/host_path } 20:03:04 I0808 20:03:04.756779 10423 linux_launcher.cpp:145] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher 20:03:04 I0808 20:03:04.757201 10423 provisioner.cpp:299] Using default backend 'aufs' 20:03:04 I0808 20:03:04.757702 10423 linux.cpp:152] Bind
[jira] [Commented] (MESOS-9875) Mesos did not respond correctly when operations should fail
[ https://issues.apache.org/jira/browse/MESOS-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901216#comment-16901216 ] Greg Mann commented on MESOS-9875: -- I'm trying to figure out how to address this with the current information that we checkpoint on the agent. The old-style checkpointing on the agent went like this: 1) Checkpoint resources to a "target file" 2) Sync checkpointed resources to disk, which creates persistent volumes 3) If #2 succeeds, move the "target file" to the actual checkpoint location When implementing operation feedback, we thought we could get away without this two-phase checkpointing, since we now have the operation feedback streams which we can use as another source of information. When recovering in the agent, we have some logic which inspects both the checkpointed resources/operations as well as the operation streams checkpointed by the operation status update manager in order to recover properly. It's possible that we could use the old-style checkpointed resource files in order to accomplish recovery now (we still write those to disk to enable agent downgrades), but I'm worried that this will be confusing. But perhaps it's already confusing :) I'll try to have a patch up by EOD with a solution for you to look at. > Mesos did not respond correctly when operations should fail > --- > > Key: MESOS-9875 > URL: https://issues.apache.org/jira/browse/MESOS-9875 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Yifan Xing >Assignee: Greg Mann >Priority: Major > Labels: foundations, mesosphere > Attachments: Screen Shot 2019-06-27 at 15.07.20.png > > > For testing persistent volumes with {{OPERATION_FAILED/ERROR}} feedbacks, we > sshed into the mesos-agent and made it unable to create subdirectories in > {{/srv/mesos/work/volumes}}, however, mesos did not respond any operation > failed response. Instead, we received {{OPERATION_FINISHED}} feedback. > Steps to recreate the issue: > 1. 
Ssh into a Mesos agent. > 2. Make it impossible to create a persistent volume (we expect the agent to > crash and reregister, and the master to realize that the operation is > {{OPERATION_DROPPED}}): > * cd /srv/mesos/work (if it doesn't exist mkdir /srv/mesos/work/volumes) > * chattr -RV +i volumes (then no subdirectories can be created) > 3. Launch a service with persistent volumes with the constraint of only using > the Mesos agent modified above. > > > Logs for the scheduler for receiving `OPERATION_FINISHED`: > (Also see screenshot) > > 2019-06-27 21:57:11.879 [12768651|rdar://12768651] > [Jarvis-mesos-dispatcher-105] INFO c.a.j.s.ServicePodInstance - Stored > operation=4g3k02s1gjb0q_5f912b59-a32d-462c-9c46-8401eba4d2c1 and > feedback=OPERATION_FINISHED in podInstanceID=4g3k02s1gjb0q on > serviceID=yifan-badagents-1 > > * 2019-06-27 21:55:23: task reached state TASK_FAILED for mesos reason: > REASON_CONTAINER_LAUNCH_FAILED with mesos message: Failed to launch > container: Failed to change the ownership of the persistent volume at > '/srv/mesos/work/volumes/roles/test-2/19b564e8-3a90-4f2f-981d-b3dd2a5d9f90' > with uid 264 and gid 264: No such file or directory -- This message was sent by Atlassian JIRA (v7.6.14#76016)
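The old-style two-phase checkpointing described in the comment above (write a "target file", sync to disk, then promote the target file only on success) can be sketched roughly as follows. This is an illustrative model under stated assumptions, not the actual Mesos agent code; {{writeFile()}}, {{syncToDisk()}}, and {{checkpoint()}} are hypothetical stand-ins:

```cpp
// Illustrative model of the agent's old-style two-phase checkpointing.
// All names here are hypothetical stand-ins, not the Mesos API.
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

bool writeFile(const std::string& path, const std::string& data) {
  std::ofstream out(path);
  out << data;
  return static_cast<bool>(out);
}

// Step 2 stand-in: syncing resources creates persistent volumes on disk,
// which can fail (e.g. when the `volumes` directory has been made
// immutable, as in the reproduction steps above).
bool syncToDisk(bool simulateFailure) {
  return !simulateFailure;
}

// Returns true only if the checkpoint was fully committed.
bool checkpoint(const std::string& target,
                const std::string& committed,
                const std::string& resources,
                bool simulateFailure) {
  // 1) Checkpoint the new resources to a "target file".
  if (!writeFile(target, resources)) {
    return false;
  }

  // 2) Sync the checkpointed resources to disk (creates the volumes).
  if (!syncToDisk(simulateFailure)) {
    return false;  // Target file remains, but nothing was committed.
  }

  // 3) Only after #2 succeeds, move the "target file" to the actual
  //    checkpoint location.
  return std::rename(target.c_str(), committed.c_str()) == 0;
}
```

The point of the protocol is that a crash between steps 1 and 3 leaves the committed checkpoint untouched, so recovery never sees resources whose volumes were not actually created.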
[jira] [Comment Edited] (MESOS-9875) Mesos did not respond correctly when operations should fail
[ https://issues.apache.org/jira/browse/MESOS-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900866#comment-16900866 ] Greg Mann edited comment on MESOS-9875 at 8/6/19 12:31 PM: --- It looks like the {{OPERATION_FINISHED}} update should only be sent after the agent fails over and recovers its checkpointed operations. We need to make sure that if the agent's call to `syncCheckpointedResources()` fails, which is the function that actually creates the persistent volume, then the operation in state OPERATION_FINISHED should not be recovered by the agent. Currently, it looks like the agent will fail to create the persistent volume, crash, and then recover the operation in state OPERATION_FINISHED and send the update. was (Author: greggomann): [~jamespeach] could you tell me what Mesos SHA this was observed on? Looking at the code, I'm having trouble identifying how this would happen, since we don't send the operation feedback until after the operation has been committed to disk; feedback is sent to the master via the final call to {{operationStatusUpdateManager.update(update);}} in {{Slave::applyOperation()}}. > Mesos did not respond correctly when operations should fail > --- > > Key: MESOS-9875 > URL: https://issues.apache.org/jira/browse/MESOS-9875 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Yifan Xing >Assignee: Greg Mann >Priority: Major > Labels: foundations, mesosphere > Attachments: Screen Shot 2019-06-27 at 15.07.20.png > > > For testing persistent volumes with {{OPERATION_FAILED/ERROR}} feedbacks, we > sshed into the mesos-agent and made it unable to create subdirectories in > {{/srv/mesos/work/volumes}}, however, mesos did not respond any operation > failed response. Instead, we received {{OPERATION_FINISHED}} feedback. > Steps to recreate the issue: > 1. Ssh into a magent. > 2. 
Make it impossible to create a persistent volume (we expect the agent to > crash and reregister, and the master to release that the operation is > {{OPERATION_DROPPED}}): > * cd /srv/mesos/work (if it doesn't exist mkdir /srv/mesos/work/volumes) > * chattr -RV +i volumes (then no subdirectories can be created) > 3. Launch a service with persistent volumes with the constraint of only using > the magent modified above. > > > Logs for the scheduler for receiving `OPERATION_FINISHED`: > (Also see screenshot) > > 2019-06-27 21:57:11.879 [12768651|rdar://12768651] > [Jarvis-mesos-dispatcher-105] INFO c.a.j.s.ServicePodInstance - Stored > operation=4g3k02s1gjb0q_5f912b59-a32d-462c-9c46-8401eba4d2c1 and > feedback=OPERATION_FINISHED in podInstanceID=4g3k02s1gjb0q on > serviceID=yifan-badagents-1 > > * 2019-06-27 21:55:23: task reached state TASK_FAILED for mesos reason: > REASON_CONTAINER_LAUNCH_FAILED with mesos message: Failed to launch > container: Failed to change the ownership of the persistent volume at > '/srv/mesos/work/volumes/roles/test-2/19b564e8-3a90-4f2f-981d-b3dd2a5d9f90' > with uid 264 and gid 264: No such file or directory -- This message was sent by Atlassian JIRA (v7.6.14#76016)
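The rule proposed in the edited comment above, that an operation must not be recovered in state OPERATION_FINISHED if {{syncCheckpointedResources()}} failed, can be sketched as a small decision function. This is a hypothetical sketch, not the agent's real recovery path; {{recoverOperation()}} and the {{syncSucceeded}} flag are illustrative stand-ins for whatever on-disk evidence the agent would consult:

```cpp
// Hypothetical sketch of the proposed recovery rule; not Mesos code.
#include <cassert>

enum class OperationState { PENDING, FINISHED, FAILED };

// `syncSucceeded` stands in for evidence (e.g. a fully committed
// checkpoint) that syncCheckpointedResources() completed before the
// agent crashed.
OperationState recoverOperation(OperationState checkpointed,
                                bool syncSucceeded) {
  if (checkpointed == OperationState::FINISHED && !syncSucceeded) {
    // The persistent volume was never created on disk, so replaying an
    // OPERATION_FINISHED update would be wrong; surface a failure.
    return OperationState::FAILED;
  }
  return checkpointed;
}
```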
[jira] [Comment Edited] (MESOS-9875) Mesos did not respond correctly when operations should fail
[ https://issues.apache.org/jira/browse/MESOS-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900866#comment-16900866 ] Greg Mann edited comment on MESOS-9875 at 8/6/19 10:30 AM: --- [~jamespeach] could you tell me what Mesos SHA this was observed on? Looking at the code, I'm having trouble identifying how this would happen, since we don't send the operation feedback until after the operation has been committed to disk; feedback is sent to the master via the final call to {{operationStatusUpdateManager.update(update);}} in {{Slave::applyOperation()}}. was (Author: greggomann): [~jamespeach] could you tell me what Mesos SHA this was observed on? Looking at the code, I'm having trouble identifying how this would happen, since we don't send the operation feedback until after the operation has been committed to disk. > Mesos did not respond correctly when operations should fail > --- > > Key: MESOS-9875 > URL: https://issues.apache.org/jira/browse/MESOS-9875 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Yifan Xing >Assignee: Greg Mann >Priority: Major > Labels: foundations, mesosphere > Attachments: Screen Shot 2019-06-27 at 15.07.20.png > > > For testing persistent volumes with {{OPERATION_FAILED/ERROR}} feedbacks, we > sshed into the mesos-agent and made it unable to create subdirectories in > {{/srv/mesos/work/volumes}}, however, mesos did not respond any operation > failed response. Instead, we received {{OPERATION_FINISHED}} feedback. > Steps to recreate the issue: > 1. Ssh into a magent. > 2. Make it impossible to create a persistent volume (we expect the agent to > crash and reregister, and the master to release that the operation is > {{OPERATION_DROPPED}}): > * cd /srv/mesos/work (if it doesn't exist mkdir /srv/mesos/work/volumes) > * chattr -RV +i volumes (then no subdirectories can be created) > 3. Launch a service with persistent volumes with the constraint of only using > the magent modified above. 
> > > Logs for the scheduler for receiving `OPERATION_FINISHED`: > (Also see screenshot) > > 2019-06-27 21:57:11.879 [12768651|rdar://12768651] > [Jarvis-mesos-dispatcher-105] INFO c.a.j.s.ServicePodInstance - Stored > operation=4g3k02s1gjb0q_5f912b59-a32d-462c-9c46-8401eba4d2c1 and > feedback=OPERATION_FINISHED in podInstanceID=4g3k02s1gjb0q on > serviceID=yifan-badagents-1 > > * 2019-06-27 21:55:23: task reached state TASK_FAILED for mesos reason: > REASON_CONTAINER_LAUNCH_FAILED with mesos message: Failed to launch > container: Failed to change the ownership of the persistent volume at > '/srv/mesos/work/volumes/roles/test-2/19b564e8-3a90-4f2f-981d-b3dd2a5d9f90' > with uid 264 and gid 264: No such file or directory -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9875) Mesos did not respond correctly when operations should fail
[ https://issues.apache.org/jira/browse/MESOS-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900866#comment-16900866 ] Greg Mann commented on MESOS-9875: -- [~jamespeach] could you tell me what Mesos SHA this was observed on? Looking at the code, I'm having trouble identifying how this would happen, since we don't send the operation feedback until after the operation has been committed to disk. > Mesos did not respond correctly when operations should fail > --- > > Key: MESOS-9875 > URL: https://issues.apache.org/jira/browse/MESOS-9875 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Yifan Xing >Assignee: Greg Mann >Priority: Major > Labels: foundations, mesosphere > Attachments: Screen Shot 2019-06-27 at 15.07.20.png > > > For testing persistent volumes with {{OPERATION_FAILED/ERROR}} feedbacks, we > sshed into the mesos-agent and made it unable to create subdirectories in > {{/srv/mesos/work/volumes}}, however, mesos did not respond any operation > failed response. Instead, we received {{OPERATION_FINISHED}} feedback. > Steps to recreate the issue: > 1. Ssh into a magent. > 2. Make it impossible to create a persistent volume (we expect the agent to > crash and reregister, and the master to release that the operation is > {{OPERATION_DROPPED}}): > * cd /srv/mesos/work (if it doesn't exist mkdir /srv/mesos/work/volumes) > * chattr -RV +i volumes (then no subdirectories can be created) > 3. Launch a service with persistent volumes with the constraint of only using > the magent modified above. 
> > > Logs for the scheduler for receiving `OPERATION_FINISHED`: > (Also see screenshot) > > 2019-06-27 21:57:11.879 [12768651|rdar://12768651] > [Jarvis-mesos-dispatcher-105] INFO c.a.j.s.ServicePodInstance - Stored > operation=4g3k02s1gjb0q_5f912b59-a32d-462c-9c46-8401eba4d2c1 and > feedback=OPERATION_FINISHED in podInstanceID=4g3k02s1gjb0q on > serviceID=yifan-badagents-1 > > * 2019-06-27 21:55:23: task reached state TASK_FAILED for mesos reason: > REASON_CONTAINER_LAUNCH_FAILED with mesos message: Failed to launch > container: Failed to change the ownership of the persistent volume at > '/srv/mesos/work/volumes/roles/test-2/19b564e8-3a90-4f2f-981d-b3dd2a5d9f90' > with uid 264 and gid 264: No such file or directory -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9875) Mesos did not respond correctly when operations should fail
[ https://issues.apache.org/jira/browse/MESOS-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-9875: Assignee: Greg Mann > Mesos did not respond correctly when operations should fail > --- > > Key: MESOS-9875 > URL: https://issues.apache.org/jira/browse/MESOS-9875 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Yifan Xing >Assignee: Greg Mann >Priority: Major > Labels: foundations, mesosphere > Attachments: Screen Shot 2019-06-27 at 15.07.20.png > > > For testing persistent volumes with {{OPERATION_FAILED/ERROR}} feedbacks, we > sshed into the mesos-agent and made it unable to create subdirectories in > {{/srv/mesos/work/volumes}}, however, mesos did not respond any operation > failed response. Instead, we received {{OPERATION_FINISHED}} feedback. > Steps to recreate the issue: > 1. Ssh into a magent. > 2. Make it impossible to create a persistent volume (we expect the agent to > crash and reregister, and the master to release that the operation is > {{OPERATION_DROPPED}}): > * cd /srv/mesos/work (if it doesn't exist mkdir /srv/mesos/work/volumes) > * chattr -RV +i volumes (then no subdirectories can be created) > 3. Launch a service with persistent volumes with the constraint of only using > the magent modified above. 
> > > Logs for the scheduler for receiving `OPERATION_FINISHED`: > (Also see screenshot) > > 2019-06-27 21:57:11.879 [12768651|rdar://12768651] > [Jarvis-mesos-dispatcher-105] INFO c.a.j.s.ServicePodInstance - Stored > operation=4g3k02s1gjb0q_5f912b59-a32d-462c-9c46-8401eba4d2c1 and > feedback=OPERATION_FINISHED in podInstanceID=4g3k02s1gjb0q on > serviceID=yifan-badagents-1 > > * 2019-06-27 21:55:23: task reached state TASK_FAILED for mesos reason: > REASON_CONTAINER_LAUNCH_FAILED with mesos message: Failed to launch > container: Failed to change the ownership of the persistent volume at > '/srv/mesos/work/volumes/roles/test-2/19b564e8-3a90-4f2f-981d-b3dd2a5d9f90' > with uid 264 and gid 264: No such file or directory -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9919) Health check performance decreases on large machines
[ https://issues.apache.org/jira/browse/MESOS-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-9919: Assignee: Greg Mann > Health check performance decreases on large machines > > > Key: MESOS-9919 > URL: https://issues.apache.org/jira/browse/MESOS-9919 > Project: Mesos > Issue Type: Task > Components: agent, containerization >Reporter: Greg Mann >Assignee: Greg Mann >Priority: Major > Labels: foundations, mesosphere > > In recent testing, it appears that the performance of Mesos command health > checks decreases dramatically on nodes with large numbers of cores and lots > of memory. This may be due to the changes in the cost of forking the agent > process on such nodes. We need to investigate this issue to understand the > root cause. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (MESOS-9919) Health check performance decreases on large machines
Greg Mann created MESOS-9919: Summary: Health check performance decreases on large machines Key: MESOS-9919 URL: https://issues.apache.org/jira/browse/MESOS-9919 Project: Mesos Issue Type: Task Components: agent, containerization Reporter: Greg Mann In recent testing, it appears that the performance of Mesos command health checks decreases dramatically on nodes with large numbers of cores and lots of memory. This may be due to the changes in the cost of forking the agent process on such nodes. We need to investigate this issue to understand the root cause. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
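The hypothesis above, that fork cost grows with the agent process's footprint on large nodes, can be probed with a small timing sketch. This is a standalone measurement idea for Linux, not the Mesos health-check code path:

```cpp
// Times a single fork()+_exit() pair in the parent, in microseconds.
// Standalone measurement sketch; not related to Mesos internals.
#include <cassert>
#include <chrono>
#include <sys/wait.h>
#include <unistd.h>

long timeForkOnce() {
  auto start = std::chrono::steady_clock::now();
  pid_t pid = fork();
  if (pid == 0) {
    _exit(0);  // Child exits immediately.
  }
  auto end = std::chrono::steady_clock::now();
  waitpid(pid, nullptr, 0);  // Reap the child.
  return std::chrono::duration_cast<std::chrono::microseconds>(
      end - start).count();
}
```

Usage idea: measure once with a small heap, then allocate and touch a few GB in the parent (e.g. a large {{std::vector<char>}} written end to end) and measure again; if fork cost scales with the parent's page tables on the kernel in question, the second measurement should be noticeably larger.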