[jira] [Assigned] (MESOS-9753) Agent Draining

2020-12-10 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-9753:


Assignee: Greg Mann

> Agent Draining
> --
>
> Key: MESOS-9753
> URL: https://issues.apache.org/jira/browse/MESOS-9753
> Project: Mesos
>  Issue Type: Epic
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Major
>  Labels: foundations, mesosphere
>
> This epic holds tickets related to maintenance primitive improvements that 
> facilitate draining of agent nodes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10192) Recent Nvidia CUDA changes break Mesos GPU support

2020-10-03 Thread Greg Mann (Jira)
Greg Mann created MESOS-10192:
-

 Summary: Recent Nvidia CUDA changes break Mesos GPU support
 Key: MESOS-10192
 URL: https://issues.apache.org/jira/browse/MESOS-10192
 Project: Mesos
  Issue Type: Bug
  Components: agent, containerization, gpu
Reporter: Greg Mann


Recently it seems that the layout of the Nvidia device files has changed:  
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/

This prevents GPU tasks from launching:
{noformat}
W0929 17:27:21.002178 65691 http.cpp:3436] Failed to launch container 
c08e1fc7-53c4-427e-a1a1-85b770e77d69.738440a3-f4cc-42ce-8978-418ba0011160: 
Failed to copy device '/dev/nvidia-caps': Failed to get source dev: Not a 
special file: /dev/nvidia-caps
{noformat}

due to this code, which detects the Nvidia device files: 
https://github.com/apache/mesos/blob/8700dd8d5ece658804d7b7a40863800dcc5c72bc/src/slave/containerizer/mesos/isolators/gpu/isolator.cpp#L438-L454
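
The linked code requires each entry it copies under /dev to be a character or 
block device. Below is a minimal standalone C++ sketch (not the isolator code 
itself; the path and behavior are inferred from the log above) of the kind of 
check that now fails, since `/dev/nvidia-caps` is a directory in the new 
layout. A fix would presumably need to descend into such directories (e.g., 
`/dev/nvidia-caps/nvidia-cap1`) rather than treating them as devices:
{code}
#include <sys/stat.h>

#include <iostream>

// Returns true only for character/block special files, mirroring the
// "Not a special file" failure in the log above.
bool isSpecialFile(const char* path)
{
  struct stat s;
  if (::stat(path, &s) != 0) {
    return false;
  }
  return S_ISCHR(s.st_mode) || S_ISBLK(s.st_mode);
}

int main()
{
  // On hosts with the new Nvidia driver layout, /dev/nvidia-caps is a
  // directory, so this prints "not a special file".
  std::cout << (isSpecialFile("/dev/nvidia-caps")
                  ? "special file"
                  : "not a special file")
            << std::endl;
  return 0;
}
{code}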



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10167) Mesos-websitebot fails due to wrong permissions of volumes mounted into Docker container

2020-09-30 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-10167:
-

Assignee: Vinod Kone

> Mesos-websitebot fails due to wrong permissions of volumes mounted into 
> Docker container
> 
>
> Key: MESOS-10167
> URL: https://issues.apache.org/jira/browse/MESOS-10167
> Project: Mesos
>  Issue Type: Bug
>  Components: project website
>Reporter: Andrei Sekretenko
>Assignee: Vinod Kone
>Priority: Minor
>
> Last successful run was on Apr 7:
> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Websitebot/2464/
> First failure:
> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Websitebot/2465/console
> Build with added permissions dump 
> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Websitebot/2525/console
> shows that while the build scripts in the container are, as expected, running 
> under "tempuser" (with the same uid as the user outside the container that 
> pulls the git repositories), the directories with git repositories mounted 
> into the container are owned by 
> root:
> {noformat}
> 19:06:21 uid=910(tempuser) gid=1001(tempuser) groups=1001(tempuser)
> 19:06:21 total 836
> 19:06:21 drwxr-xr-x 12 root root   4096 Jul  3 17:02 .
> 19:06:21 drwxr-xr-x  1 root root   4096 Jul  3 17:04 ..
> 19:06:21 drwxr-xr-x  6 root root   4096 Jun 29 14:12 3rdparty
> 19:06:21 drwxr-xr-x  2 root root   4096 Apr 15 14:33 bin
> 19:06:21 -rwxr-xr-x  1 root root   1294 Jul  3 17:02 bootstrap
> 19:06:21 -rw-r--r--  1 root root 536015 May 29 09:21 CHANGELOG
> 19:06:21 drwxr-xr-x  2 root root   4096 May 29 11:30 cmake
> 19:06:21 -rw-r--r--  1 root root   3990 May  7 13:40 CMakeLists.txt
> 19:06:21 -rw-r--r--  1 root root 105737 May  7 13:40 configure.ac
> 19:06:21 lrwxrwxrwx  1 root root 31 Apr 15 14:33 CONTRIBUTING.md -> 
> ./docs/beginner-contribution.md
> 19:06:21 drwxr-xr-x  6 root root   4096 May 28 19:18 docs
> 19:06:21 -rw-r--r--  1 root root  63778 Apr 15 14:33 Doxyfile
> 19:06:21 drwxr-xr-x  8 root root   4096 Jul  3 17:02 .git
> 19:06:21 -rw-r--r--  1 root root 99 Apr 15 14:33 .gitattributes
> 19:06:21 drwxr-xr-x  3 root root   4096 Aug 27  2019 include
> 19:06:21 -rw-r--r--  1 root root  66156 Apr 15 14:33 LICENSE
> 19:06:21 drwxr-xr-x  2 root root   4096 Apr 15 14:33 m4
> 19:06:21 -rw-r--r--  1 root root   3842 Apr 15 14:33 Makefile.am
> 19:06:21 -rw-r--r--  1 root root    426 Apr 15 14:33 mesos.pc.in
> 19:06:21 -rw-r--r--  1 root root    162 Apr 15 14:33 NOTICE
> 19:06:21 -rw-r--r--  1 root root   1103 Apr 15 14:33 README.md
> 19:06:21 drwxr-xr-x  5 root root   4096 Jul  3 17:04 site
> 19:06:21 drwxr-xr-x 48 root root   4096 Jun 30 19:30 src
> 19:06:21 drwxr-xr-x  9 root root   4096 Jul  3 17:02 support
> 19:06:21 autoreconf: Entering directory `.'
> 19:06:21 autoreconf: configure.ac: not using Gettext
> 19:06:22 autoreconf: running: aclocal --warnings=all -I m4
> 19:06:23 autom4te: cannot create autom4te.cache: No such file or directory
> {noformat}
> Note that the Dockerfile specifies "USER root" 
> https://github.com/apache/mesos/blob/master/support/mesos-website/Dockerfile 
> and privileges are dropped to "tempuser" only inside the 
> entrypoint.sh script.
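
The likely remedy is for entrypoint.sh to hand ownership of the mounted 
checkout to "tempuser" before dropping privileges. As a rough illustration 
only (the real fix belongs in the shell entrypoint; the mount path `/mesos` 
and the uid/gid 910/1001 are assumptions taken from the dump above), the 
equivalent recursive chown in C++:
{code}
#define _XOPEN_SOURCE 500

#include <ftw.h>
#include <sys/types.h>
#include <unistd.h>

#include <cstdio>

static const uid_t UID = 910;   // tempuser, per the "id" output above
static const gid_t GID = 1001;  // tempuser's group

// nftw() callback: chown a single directory entry.
static int chownEntry(const char* path, const struct stat*, int, struct FTW*)
{
  return ::chown(path, UID, GID);
}

int main()
{
  // Walk the mounted repository (path assumed) and reassign ownership so
  // that autoreconf can create autom4te.cache as "tempuser".
  if (::nftw("/mesos", chownEntry, 16, FTW_PHYS) != 0) {
    std::perror("nftw");
    return 1;
  }
  return 0;
}
{code}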



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10156) Enable the `volume/csi` isolator in UCR

2020-09-04 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190876#comment-17190876
 ] 

Greg Mann commented on MESOS-10156:
---

{noformat}
commit a8059a78473774e3d95e8e908f360ee5e9aadd0d
Author: Greg Mann 
Date:   Fri Sep 4 10:39:10 2020 -0700

Added tests for 'volume/csi' isolator recovery.

Review: https://reviews.apache.org/r/72806/
{noformat}

> Enable the `volume/csi` isolator in UCR
> ---
>
> Key: MESOS-10156
> URL: https://issues.apache.org/jira/browse/MESOS-10156
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10157) Add documentation for the `volume/csi` isolator

2020-09-04 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-10157:
-

Assignee: Greg Mann

> Add documentation for the `volume/csi` isolator
> ---
>
> Key: MESOS-10157
> URL: https://issues.apache.org/jira/browse/MESOS-10157
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Greg Mann
>Priority: Major
>  Labels: docs, documentation
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10156) Enable the `volume/csi` isolator in UCR

2020-09-03 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190390#comment-17190390
 ] 

Greg Mann commented on MESOS-10156:
---

Adding test patches for the CSI isolator here:
{noformat}
commit a3fe939616fe13f34bd3555d613a0e1323730424
Author: Greg Mann 
Date:   Thu Sep 3 12:06:31 2020 -0700

Updated the test CSI plugin for CSI server testing.

This patch adds additional configuration flags to the
test CSI plugin which are necessary in order to test
the agent's CSI server.

Review: https://reviews.apache.org/r/72727/
{noformat}
{noformat}
commit f0ce0f1d8601228f16efbb98420693af42b19d43
Author: Greg Mann 
Date:   Thu Sep 3 12:06:34 2020 -0700

Added a test helper for CSI volumes.

Review: https://reviews.apache.org/r/72805/
{noformat}
{noformat}
commit fc22984de558302029a8cad0655e375653208448
Author: Greg Mann 
Date:   Thu Sep 3 12:06:38 2020 -0700

Added tests for the 'volume/csi' isolator.

Review: https://reviews.apache.org/r/72728/
{noformat}

> Enable the `volume/csi` isolator in UCR
> ---
>
> Key: MESOS-10156
> URL: https://issues.apache.org/jira/browse/MESOS-10156
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (MESOS-10156) Enable the `volume/csi` isolator in UCR

2020-09-03 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190390#comment-17190390
 ] 

Greg Mann edited comment on MESOS-10156 at 9/3/20, 9:02 PM:


Adding test patches for the CSI isolator here:
{noformat}
commit a3fe939616fe13f34bd3555d613a0e1323730424
Author: Greg Mann 
Date:   Thu Sep 3 12:06:31 2020 -0700

Updated the test CSI plugin for CSI server testing.

This patch adds additional configuration flags to the
test CSI plugin which are necessary in order to test
the agent's CSI server.

Review: https://reviews.apache.org/r/72727/
{noformat}
{noformat}
commit f0ce0f1d8601228f16efbb98420693af42b19d43
Author: Greg Mann 
Date:   Thu Sep 3 12:06:34 2020 -0700

Added a test helper for CSI volumes.

Review: https://reviews.apache.org/r/72805/
{noformat}
{noformat}
commit fc22984de558302029a8cad0655e375653208448
Author: Greg Mann 
Date:   Thu Sep 3 12:06:38 2020 -0700

Added tests for the 'volume/csi' isolator.

Review: https://reviews.apache.org/r/72728/
{noformat}


was (Author: greggomann):
Adding test patches for the CSI isolator here:
{noformat}
commit a3fe939616fe13f34bd3555d613a0e1323730424
Author: Greg Mann 
Date:   Thu Sep 3 12:06:31 2020 -0700

Updated the test CSI plugin for CSI server testing.

This patch adds additional configuration flags to the
test CSI plugin which are necessary in order to test
the agent's CSI server.

Review: https://reviews.apache.org/r/72727/
{noformat}
{noformat}
commit f0ce0f1d8601228f16efbb98420693af42b19d43
Author: Greg Mann 
Date:   Thu Sep 3 12:06:34 2020 -0700

Added a test helper for CSI volumes.

Review: https://reviews.apache.org/r/72805/
{noformat}
commit fc22984de558302029a8cad0655e375653208448
Author: Greg Mann 
Date:   Thu Sep 3 12:06:38 2020 -0700

Added tests for the 'volume/csi' isolator.

Review: https://reviews.apache.org/r/72728/
{noformat}

> Enable the `volume/csi` isolator in UCR
> ---
>
> Key: MESOS-10156
> URL: https://issues.apache.org/jira/browse/MESOS-10156
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10163) Implement a new component to launch CSI plugins as standalone containers and make CSI gRPC calls

2020-08-21 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182170#comment-17182170
 ] 

Greg Mann commented on MESOS-10163:
---

{noformat}
commit 68b481085fb82b475e108b9aa39935a8d7729983
Author: Greg Mann 
Date:   Thu Aug 20 19:26:48 2020 -0700

Fixed a bug in CSI volume manager initialization.

Previously, the volume managers would assume that they could
make CONTROLLER_SERVICE calls during plugin initialization,
regardless of whether or not the plugin provides that service.

Review: https://reviews.apache.org/r/72726/
{noformat}
{noformat}
commit 5ed30db48785007e35805886a024ebb8a61a7037
Author: Greg Mann 
Date:   Thu Aug 20 19:27:02 2020 -0700

Added the CSI server to the Mesos agent.

This patch adds a CSI server to the Mesos agent in both
the agent binary and in tests.

Review: https://reviews.apache.org/r/72761/
{noformat}
{noformat}
commit 4ff51041df860dbcc2247ef47a0596e5132da190
Author: Greg Mann g...@mesosphere.io
Date:   Thu Aug 20 19:27:23 2020 -0700


Initialized plugins lazily in the CSI server.

Review: https://reviews.apache.org/r/72779/
{noformat}

> Implement a new component to launch CSI plugins as standalone containers and 
> make CSI gRPC calls
> 
>
> Key: MESOS-10163
> URL: https://issues.apache.org/jira/browse/MESOS-10163
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Greg Mann
>Priority: Major
>
> *Background:*
> Originally we wanted the `volume/csi` isolator to leverage the existing 
> [service 
> manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L51]
>  to launch CSI plugins as standalone containers; currently, the service 
> manager needs to call the following agent HTTP APIs:
>  # `GET_CONTAINERS` to get all standalone containers in its `recover` method.
>  # `KILL_CONTAINER` and `WAIT_CONTAINER` to kill the outdated standalone 
> containers in its `recover` method.
>  # `LAUNCH_CONTAINER` via the existing 
> [ContainerDaemon|https://github.com/apache/mesos/blob/1.10.0/src/slave/container_daemon.hpp#L41:L46]
>  to launch a CSI plugin as a standalone container when its `getEndpoint` 
> method is called.
> The problem with the above design is that the `volume/csi` isolator may need 
> to clean up orphan containers during agent recovery, which is triggered by the 
> containerizer (see 
> [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/mesos/containerizer.cpp#L1272:L1275]
>  for details). To clean up an orphan container which is using a CSI volume, 
> the `volume/csi` isolator needs to instantiate and recover the service manager 
> and get the CSI plugin’s endpoint from it (i.e., the service manager’s 
> `getEndpoint` method will be called by the `volume/csi` isolator during agent 
> recovery). As mentioned above, the service manager’s `getEndpoint` may need to 
> call `LAUNCH_CONTAINER` to launch a CSI plugin as a standalone container; 
> since the agent is still in the recovering state, such an agent HTTP call will 
> simply be rejected by the agent. So we have to instantiate and recover the 
> service manager *after agent recovery is done*, but in the `volume/csi` 
> isolator we do not have that information (i.e., the signal that agent recovery 
> is done).
> *Solution*
> We need to implement a new component (perhaps `CSIVolumeManager`, or a better 
> name?) in the Mesos agent which is responsible for launching CSI plugins as 
> standalone containers (via the existing [service 
> manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L51])
>  and making CSI gRPC calls (via the existing [volume 
> manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/volume_manager.hpp#L55:L56]).
>  * We can instantiate this new component in the `main` method of the agent and 
> pass it to both the containerizer and the agent (i.e., it will be a member of 
> the `Slave` object); the containerizer will in turn pass it to the 
> `volume/csi` isolator.
>  * Since this new component relies on the service manager, which calls agent 
> HTTP APIs, we need to pass the agent URL to it, like 
> `process::http::URL(scheme, agentIP, agentPort, agentLibprocessId + 
> "/api/v1")`; see 
> [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L459:L471]
>  for an example.
>  * When the agent registers/reregisters with the master (`Slave::registered` 
> and `Slave::reregistered`), we should call this new component’s `start` method 
> (see 
> [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1740:L1742]
>  and 
> [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1825:L1827]
>  as examples), which will scan the directory `--csi_plugin_config_dir` and 
> create the `service manager - volume manager` pair for each CSI plugin loaded 
> from that directory.

[jira] [Commented] (MESOS-9609) Master check failure when marking agent unreachable

2020-08-19 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180677#comment-17180677
 ] 

Greg Mann commented on MESOS-9609:
--

Hi [~arostami], my apologies for the delay. This came up again recently; I 
understand it's a serious bug, so we'll start working on a fix soon and will 
update here.

> Master check failure when marking agent unreachable
> ---
>
> Key: MESOS-9609
> URL: https://issues.apache.org/jira/browse/MESOS-9609
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Greg Mann
>Priority: Critical
>  Labels: foundations, mesosphere
>
> {code}
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815433 13 
> http.cpp:1185] HTTP POST for /master/api/v1/scheduler from 10.142.0.5:55133
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815588 13 
> master.cpp:5467] Processing DECLINE call for offers: [ 
> 5e57f633-a69c-4009-b773-990b4b8984ad-O58323 ] for framework 
> 5e57f633-a69c-4009-b7
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815693 13 
> master.cpp:10703] Removing offer 5e57f633-a69c-4009-b773-990b4b8984ad-O58323
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820142 10 
> master.cpp:8227] Marking agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at 
> slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engi
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820367 10 
> registrar.cpp:495] Applied 1 operations in 86528ns; attempting to update the 
> registry
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820572 10 
> registrar.cpp:552] Successfully updated the registry in 175872ns
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820642 11 
> master.cpp:8275] Marked agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at 
> slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engin
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820957 9 
> hierarchical.cpp:609] Removed agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49
> Mar 11 10:04:35 research docker[4503]: F0311 10:04:35.851961 11 
> master.cpp:10018] Check failed: 'framework' Must be non NULL
> Mar 11 10:04:35 research docker[4503]: *** Check failure stack trace: ***
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044a7d  
> google::LogMessage::Fail()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6046830  
> google::LogMessage::SendToLog()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044663  
> google::LogMessage::Flush()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6047259  
> google::LogMessageFatal::~LogMessageFatal()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5258e14  
> google::CheckNotNull<>()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521dfc8  
> mesos::internal::master::Master::__removeSlave()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521f1a2  
> mesos::internal::master::Master::_markUnreachable()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5f98f11  
> process::ProcessBase::consume()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb2a4a  
> process::ProcessManager::resume()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb65d6  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c35d4c80  (unknown)
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2de76ba  start_thread
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2b1d41d  (unknown)
> Mar 11 10:04:36 research docker[4503]: *** Aborted at 1520762676 (unix time) 
> try "date -d @1520762676" if you are using GNU date ***
> Mar 11 10:04:36 research docker[4503]: PC: @ 0x7f96c2a4d196 (unknown)
> Mar 11 10:04:36 research docker[4503]: *** SIGSEGV (@0x0) received by PID 1 
> (TID 0x7f96b986d700) from PID 0; stack trace: ***
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2df1390 (unknown)
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2a4d196 (unknown)
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c604ce2c 
> google::DumpStackTraceAndExit()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044a7d 
> google::LogMessage::Fail()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6046830 
> google::LogMessage::SendToLog()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044663 
> google::LogMessage::Flush()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6047259 
> google::LogMessageFatal::~LogMessageFatal()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5258e14 
> google::CheckNotNull<>()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521dfc8 
> mesos::internal::master::Master::__removeSlave()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521f1a2 
> mesos::internal::master::Master::_markUnreachable()

[jira] [Commented] (MESOS-10163) Implement a new component to launch CSI plugins as standalone containers and make CSI gRPC calls

2020-08-10 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175226#comment-17175226
 ] 

Greg Mann commented on MESOS-10163:
---

{noformat}
commit fe0cd02a0697a4c4fcf5087fcafd6729beec0b41 (HEAD -> master, origin/master, 
origin/HEAD, merge)
Author: Greg Mann 
Date:   Mon Aug 10 20:11:50 2020 -0700

Added implementation of the CSI server.

Review: https://reviews.apache.org/r/72716/
{noformat}

> Implement a new component to launch CSI plugins as standalone containers and 
> make CSI gRPC calls
> 
>
> Key: MESOS-10163
> URL: https://issues.apache.org/jira/browse/MESOS-10163
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Greg Mann
>Priority: Major
>
> *Background:*
> Originally we wanted the `volume/csi` isolator to leverage the existing 
> [service 
> manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L51]
>  to launch CSI plugins as standalone containers; currently, the service 
> manager needs to call the following agent HTTP APIs:
>  # `GET_CONTAINERS` to get all standalone containers in its `recover` method.
>  # `KILL_CONTAINER` and `WAIT_CONTAINER` to kill the outdated standalone 
> containers in its `recover` method.
>  # `LAUNCH_CONTAINER` via the existing 
> [ContainerDaemon|https://github.com/apache/mesos/blob/1.10.0/src/slave/container_daemon.hpp#L41:L46]
>  to launch a CSI plugin as a standalone container when its `getEndpoint` 
> method is called.
> The problem with the above design is that the `volume/csi` isolator may need 
> to clean up orphan containers during agent recovery, which is triggered by the 
> containerizer (see 
> [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/mesos/containerizer.cpp#L1272:L1275]
>  for details). To clean up an orphan container which is using a CSI volume, 
> the `volume/csi` isolator needs to instantiate and recover the service manager 
> and get the CSI plugin’s endpoint from it (i.e., the service manager’s 
> `getEndpoint` method will be called by the `volume/csi` isolator during agent 
> recovery). As mentioned above, the service manager’s `getEndpoint` may need to 
> call `LAUNCH_CONTAINER` to launch a CSI plugin as a standalone container; 
> since the agent is still in the recovering state, such an agent HTTP call will 
> simply be rejected by the agent. So we have to instantiate and recover the 
> service manager *after agent recovery is done*, but in the `volume/csi` 
> isolator we do not have that information (i.e., the signal that agent recovery 
> is done).
> *Solution*
> We need to implement a new component (perhaps `CSIVolumeManager`, or a better 
> name?) in the Mesos agent which is responsible for launching CSI plugins as 
> standalone containers (via the existing [service 
> manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L51])
>  and making CSI gRPC calls (via the existing [volume 
> manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/volume_manager.hpp#L55:L56]).
>  * We can instantiate this new component in the `main` method of the agent and 
> pass it to both the containerizer and the agent (i.e., it will be a member of 
> the `Slave` object); the containerizer will in turn pass it to the 
> `volume/csi` isolator.
>  * Since this new component relies on the service manager, which calls agent 
> HTTP APIs, we need to pass the agent URL to it, like 
> `process::http::URL(scheme, agentIP, agentPort, agentLibprocessId + 
> "/api/v1")`; see 
> [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L459:L471]
>  for an example.
>  * When the agent registers/reregisters with the master (`Slave::registered` 
> and `Slave::reregistered`), we should call this new component’s `start` method 
> (see 
> [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1740:L1742]
>  and 
> [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1825:L1827]
>  as examples), which will scan the directory `--csi_plugin_config_dir` and 
> create the `service manager - volume manager` pair for each CSI plugin loaded 
> from that directory.
>  * The `volume/csi` isolator needs to call this new component’s 
> `publishVolume` and `unpublishVolume` methods in its `prepare` and `cleanup` 
> methods.
> When cleaning up orphan containers during agent recovery, the `volume/csi` 
> isolator will just call this new component’s `unpublishVolume` method as 
> usual, and it is this component’s responsibility to make the actual CSI gRPC 
> call only after agent recovery is done and the agent has registered with the 
> master (e.g., when this component’s `start` method is called).
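
For concreteness, here is a minimal C++ interface sketch of the component 
described above (the names follow the discussion, but the exact shape is an 
assumption, not the final Mesos API):
{code}
#include <map>
#include <memory>
#include <string>

// Stand-ins for the real Mesos CSI types.
struct ServiceManager {};
struct VolumeManager {};

class CSIServer
{
public:
  // Called from Slave::registered()/Slave::reregistered(): scans
  // --csi_plugin_config_dir and creates one manager pair per plugin.
  void start(const std::string& csiPluginConfigDir);

  // Called by the `volume/csi` isolator from prepare()/cleanup(). Before
  // start() has run, these only queue the request; the actual CSI gRPC
  // call is made once the agent has recovered and registered.
  void publishVolume(const std::string& volumeId);
  void unpublishVolume(const std::string& volumeId);

private:
  struct Plugin
  {
    std::unique_ptr<ServiceManager> serviceManager;
    std::unique_ptr<VolumeManager> volumeManager;
  };

  std::map<std::string, Plugin> plugins;  // keyed by plugin name
};
{code}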



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10168) Add secrets support to the CSI service and volume managers

2020-08-03 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-10168:
-

Assignee: Greg Mann

> Add secrets support to the CSI service and volume managers
> --
>
> Key: MESOS-10168
> URL: https://issues.apache.org/jira/browse/MESOS-10168
> Project: Mesos
>  Issue Type: Task
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Major
>  Labels: csi
>
> We must update our CSI code to pass secrets to CSI drivers when 
> staging/unstaging and publishing/unpublishing volumes. We must ensure that we 
> avoid writing any secrets to disk by holding a secret resolver in the 
> appropriate component to resolve secrets associated with already-attached 
> volumes during/after recovery.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10156) Enable the `volume/csi` isolator in UCR

2020-08-03 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-10156:
-

Assignee: (was: Greg Mann)

> Enable the `volume/csi` isolator in UCR
> ---
>
> Key: MESOS-10156
> URL: https://issues.apache.org/jira/browse/MESOS-10156
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10156) Enable the `volume/csi` isolator in UCR

2020-08-03 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-10156:
-

Assignee: Greg Mann

> Enable the `volume/csi` isolator in UCR
> ---
>
> Key: MESOS-10156
> URL: https://issues.apache.org/jira/browse/MESOS-10156
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Greg Mann
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10163) Implement a new component to launch CSI plugins as standalone containers and make CSI gRPC calls

2020-08-03 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170261#comment-17170261
 ] 

Greg Mann commented on MESOS-10163:
---

{noformat}
commit c78dc333fc893a43d40dc33299a61987198a6ea9 (HEAD -> master, origin/master, 
origin/HEAD)
Author: Greg Mann 
Date:   Mon Aug 3 10:11:57 2020 -0700

Added interface for the CSI server.

This component will hold objects associated with CSI plugins
running on the agent.

Review: https://reviews.apache.org/r/72707/
{noformat}

> Implement a new component to launch CSI plugins as standalone containers and 
> make CSI gRPC calls
> 
>
> Key: MESOS-10163
> URL: https://issues.apache.org/jira/browse/MESOS-10163
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Greg Mann
>Priority: Major
>
> *Background:*
> Originally we wanted the `volume/csi` isolator to leverage the existing 
> [service 
> manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L51]
>  to launch CSI plugins as standalone containers; currently, the service 
> manager needs to call the following agent HTTP APIs:
>  # `GET_CONTAINERS` to get all standalone containers in its `recover` method.
>  # `KILL_CONTAINER` and `WAIT_CONTAINER` to kill the outdated standalone 
> containers in its `recover` method.
>  # `LAUNCH_CONTAINER` via the existing 
> [ContainerDaemon|https://github.com/apache/mesos/blob/1.10.0/src/slave/container_daemon.hpp#L41:L46]
>  to launch a CSI plugin as a standalone container when its `getEndpoint` 
> method is called.
> The problem with the above design is that the `volume/csi` isolator may need 
> to clean up orphan containers during agent recovery, which is triggered by the 
> containerizer (see 
> [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/mesos/containerizer.cpp#L1272:L1275]
>  for details). To clean up an orphan container which is using a CSI volume, 
> the `volume/csi` isolator needs to instantiate and recover the service manager 
> and get the CSI plugin’s endpoint from it (i.e., the service manager’s 
> `getEndpoint` method will be called by the `volume/csi` isolator during agent 
> recovery). As mentioned above, the service manager’s `getEndpoint` may need to 
> call `LAUNCH_CONTAINER` to launch a CSI plugin as a standalone container; 
> since the agent is still in the recovering state, such an agent HTTP call will 
> simply be rejected by the agent. So we have to instantiate and recover the 
> service manager *after agent recovery is done*, but in the `volume/csi` 
> isolator we do not have that information (i.e., the signal that agent recovery 
> is done).
> *Solution*
> We need to implement a new component (perhaps `CSIVolumeManager`, or a better 
> name?) in the Mesos agent which is responsible for launching CSI plugins as 
> standalone containers (via the existing [service 
> manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L51])
>  and making CSI gRPC calls (via the existing [volume 
> manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/volume_manager.hpp#L55:L56]).
>  * We can instantiate this new component in the `main` method of the agent and 
> pass it to both the containerizer and the agent (i.e., it will be a member of 
> the `Slave` object); the containerizer will in turn pass it to the 
> `volume/csi` isolator.
>  * Since this new component relies on the service manager, which calls agent 
> HTTP APIs, we need to pass the agent URL to it, like 
> `process::http::URL(scheme, agentIP, agentPort, agentLibprocessId + 
> "/api/v1")`; see 
> [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L459:L471]
>  for an example.
>  * When the agent registers/reregisters with the master (`Slave::registered` 
> and `Slave::reregistered`), we should call this new component’s `start` method 
> (see 
> [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1740:L1742]
>  and 
> [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1825:L1827]
>  as examples), which will scan the directory `--csi_plugin_config_dir` and 
> create the `service manager - volume manager` pair for each CSI plugin loaded 
> from that directory.
>  * The `volume/csi` isolator needs to call this new component’s 
> `publishVolume` and `unpublishVolume` methods in its `prepare` and `cleanup` 
> methods.
> When cleaning up orphan containers during agent recovery, the `volume/csi` 
> isolator will just call this new component’s `unpublishVolume` method as 
> usual, and it is this component’s responsibility to make the actual CSI gRPC 
> call only after agent recovery is done and the agent has registered with the 
> master (e.g., when this component’s `start` method is called).
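
As a tiny standalone illustration of the agent URL mentioned in the second 
bullet above (the scheme, IP, and libprocess ID values here are assumptions 
for demonstration only):
{code}
#include <iostream>
#include <string>

int main()
{
  // Mirrors process::http::URL(scheme, agentIP, agentPort,
  // agentLibprocessId + "/api/v1") from the description above.
  const std::string scheme = "http";       // "https" when SSL is enabled
  const std::string agentIP = "10.0.0.5";  // hypothetical agent IP
  const int agentPort = 5051;              // default agent port
  const std::string agentLibprocessId = "slave(1)";

  std::cout << scheme << "://" << agentIP << ":" << agentPort << "/"
            << agentLibprocessId << "/api/v1" << std::endl;
  // Prints: http://10.0.0.5:5051/slave(1)/api/v1
  return 0;
}
{code}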



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10168) Add secrets support to the CSI service and volume managers

2020-07-28 Thread Greg Mann (Jira)
Greg Mann created MESOS-10168:
-

 Summary: Add secrets support to the CSI service and volume managers
 Key: MESOS-10168
 URL: https://issues.apache.org/jira/browse/MESOS-10168
 Project: Mesos
  Issue Type: Task
Reporter: Greg Mann


We must update our CSI code to pass secrets to CSI drivers when 
staging/unstaging and publishing/unpublishing volumes. We must ensure that we 
avoid writing any secrets to disk by holding a secret resolver in the 
appropriate component to resolve secrets associated with already-attached 
volumes during/after recovery.
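
As a rough sketch of the intended shape (hypothetical types; Mesos' actual 
SecretResolver interface differs), the key idea is to checkpoint only a 
secret reference and resolve it at call time:
{code}
#include <functional>
#include <string>

using Secret = std::string;  // stand-in for the real secret type
using SecretResolver = std::function<Secret(const std::string& reference)>;

struct VolumeState
{
  std::string volumeId;
  std::string secretReference;  // checkpointed reference, never the value
};

// Resolve at publish time so the plaintext secret is never written to
// disk, even for volumes attached before an agent restart.
Secret secretForNodePublish(
    const VolumeState& state, const SecretResolver& resolver)
{
  return resolver(state.secretReference);
}
{code}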



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10150) Refactor CSI volume manager to support pre-provisioned CSI volumes

2020-07-09 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-10150:
-

Assignee: Greg Mann

> Refactor CSI volume manager to support pre-provisioned CSI volumes
> --
>
> Key: MESOS-10150
> URL: https://issues.apache.org/jira/browse/MESOS-10150
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Greg Mann
>Priority: Major
>
> The existing 
> [VolumeManager|https://github.com/apache/mesos/blob/1.10.0/src/csi/volume_manager.hpp#L55:L138]
>  is essentially a wrapper around the various CSI gRPC calls, so we could 
> consider leveraging it to call CSI plugins rather than making raw CSI gRPC 
> calls in the `volume/csi` isolator. But there is a problem: the lifecycle of 
> the volumes managed by VolumeManager starts with the 
> `[createVolume|https://github.com/apache/mesos/blob/1.10.0/src/csi/v1_volume_manager.cpp#L281:L329]`
>  CSI call, whereas what we plan to support in the MVP is pre-provisioned 
> volumes, so we need to refactor VolumeManager to support pre-provisioned 
> volumes.
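
A rough sketch of the refactoring idea (the names here are assumptions): 
track how each volume entered its lifecycle so pre-provisioned volumes can 
skip the `createVolume` call and begin at node staging/publishing:
{code}
#include <string>

enum class VolumeOrigin
{
  CREATED,          // lifecycle began with the CreateVolume CSI call
  PRE_PROVISIONED,  // volume already exists; Mesos only publishes it
};

struct CsiVolume
{
  std::string volumeId;
  VolumeOrigin origin;
};

// Pre-provisioned volumes must bypass volume creation (and deletion).
bool needsCreateVolumeCall(const CsiVolume& volume)
{
  return volume.origin == VolumeOrigin::CREATED;
}
{code}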



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10140) CMake Error: Problem with archive_read_open_file(): Unrecognized archive format

2020-07-07 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152815#comment-17152815
 ] 

Greg Mann commented on MESOS-10140:
---

[~QuellaZhang] could you try building again on the latest master branch of 
Mesos? We believe the issue should be fixed now. If so, please close out this 
ticket; otherwise, let us know. Thanks!

> CMake Error: Problem with archive_read_open_file(): Unrecognized archive 
> format
> ---
>
> Key: MESOS-10140
> URL: https://issues.apache.org/jira/browse/MESOS-10140
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: QuellaZhang
>Priority: Major
>  Labels: windows
> Attachments: mesos_build.log
>
>
> Hi All,
> We tried to build Mesos on Windows with VS2019 using MSVC. The build failed 
> with "CUSTOMBUILD : CMake error : Problem with archive_read_open_file(): 
> Unrecognized archive format 
> [F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP.vcxproj]". It can be 
> reproduced at the latest revision d4634f4 on the master branch. Could you help 
> confirm? We use cmake version 3.17.2.
>  
> Reproduce steps:
> 1.  git clone -c core.autocrlf=true [https://github.com/apache/mesos] 
> F:\gitP\apache\mesos
>  2.  Open a VS 2019 x64 command prompt as admin and browse to 
> F:\gitP\apache\mesos
>  3.  mkdir build_amd64 && pushd build_amd64
> 4.  cmake -G "Visual Studio 16 2019" -A x64 
> -DCMAKE_SYSTEM_VERSION=10.0.18362.0 -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="F:\tools\gnuwin32\bin" -T host=x64 ..
> 5.  set _CL_=/D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING
> 6.  msbuild /maxcpucount:4 /p:Platform=x64 /p:Configuration=Debug Mesos.sln 
> /t:Rebuild
>  
> ErrorMessage:
> *manual run:*
> F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP\src>cmake --version
>  cmake version 3.17.2
> CMake suite maintained and supported by Kitware (kitware.com/cmake).
> F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP\src>cmake -E tar xjf 
> archive.tar
>  CMake Error: Problem with archive_read_open_file(): Unrecognized archive 
> format
>  CMake Error: Problem extracting tar: archive.tar
> *build log: (see attachment)*
> 59>CUSTOMBUILD : CMake error : Problem with archive_read_open_file(): 
> Unrecognized archive format 
> [F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP.vcxproj]
>  59>CUSTOMBUILD : CMake error : Problem extracting tar: 
> F:/gitP/apache/mesos/build_amd64/3rdparty/wclayer-WIP/src/archive.tar 
> [F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP.vcxproj]
>  – extracting... [error clean up]
>  CMake Error at wclayer-WIP-stamp/extract-wclayer-WIP.cmake:33 (message):
>  59>CUSTOMBUILD : error : extract of 
> [F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP.vcxproj]
>  'F:/gitP/apache/mesos/build_amd64/3rdparty/wclayer-WIP/src/archive.tar'
>  failed



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10143) Outstanding Offers accumulating

2020-07-07 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152811#comment-17152811
 ] 

Greg Mann commented on MESOS-10143:
---

[~puneetku287] it's unclear to me from the description whether this is an issue 
in Mesos or in your scheduler. A more precise description of the framework's 
behavior during the incidents would help - what does the scheduler do with the 
offers during this time? Feel free to find us on Mesos Slack; that might be an 
easier place to have a synchronous discussion about your issue.

> Outstanding Offers accumulating
> ---
>
> Key: MESOS-10143
> URL: https://issues.apache.org/jira/browse/MESOS-10143
> Project: Mesos
>  Issue Type: Bug
>  Components: master, scheduler driver
>Affects Versions: 1.7.0
> Environment: Mesos Version 1.7.0
> JDK 8.0
>Reporter: Puneet Kumar
>Priority: Minor
>
> We manage an Apache Mesos cluster, version 1.7.0. We have written a framework 
> in Java that schedules tasks to the Mesos master at a rate of 300 TPS. 
> Everything works fine for almost 24 hours, but then outstanding offers 
> accumulate & saturate within 15 minutes. Outstanding offers aren't reclaimed 
> by the Mesos master. We observe "RescindOffer" messages in verbose (GLOG v=3) 
> framework logs, but outstanding offers don't decrease. New resources aren't 
> offered to the framework when outstanding offers saturate. We have to restart 
> the scheduler to reset outstanding offers to zero.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10146) Removing task from slave when framework is disconnected causes master to crash

2020-07-07 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152809#comment-17152809
 ] 

Greg Mann commented on MESOS-10146:
---

[~sunshine123] thank you for the bug report! Would it be possible to get a full 
verbose master log from an incident? The logs surrounding the check failure may 
help us pinpoint the issue more precisely.

> Removing task from slave when framework is disconnected causes master to crash
> --
>
> Key: MESOS-10146
> URL: https://issues.apache.org/jira/browse/MESOS-10146
> Project: Mesos
>  Issue Type: Bug
>  Components: c++ api, framework
>Affects Versions: 1.9.0
> Environment: Mesos master with three master nodes
>Reporter: Naveen
>Priority: Major
>
> Hello, 
>     we want to report an issue we observed when removing tasks from a slave. 
> There is a condition that checks for a valid framework before tasks can be 
> removed. There are several reasons a framework can be disconnected; this check 
> fails and crashes the Mesos master node. 
> [https://github.com/apache/mesos/blob/1.9.0/src/master/master.cpp#L11842]
> There is also unguarded access to the internal framework state on line 11853.
> Error logs - 
> {noformat}
> mesos-master[5483]: I0618 14:05:20.859189 5491 master.cpp:9512] Marked agent 
> 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 (10.160.73.79) unreachable: health 
> check timed out
> mesos-master[5483]: F0618 14:05:20.859347 5491 master.cpp:11842] Check 
> failed: framework != nullptr Framework 
> 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-0067 not found while removing agent 
> 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 at slave(1)@10.160.73.79:5051 
> (10.160.73.79); agent tasks: { 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-0067: { } 
> }
> mesos-master[5483]: *** Check failure stack trace: ***
> mesos-master[5483]: I0618 14:05:20.859781 5490 hierarchical.cpp:1013] Removed 
> all filters for agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303
> mesos-master[5483]: I0618 14:05:20.872217 5490 hierarchical.cpp:890] Removed 
> agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303
> mesos-master[5483]: I0618 14:05:20.859922 5487 replica.cpp:695] Replica 
> received learned notice for position 42070 from 
> log-network(1)@10.160.73.212:5050
> mesos-master[5483]: @ 0x7f2fdf6a5b1d google::LogMessage::Fail()
> mesos-master[5483]: @ 0x7f2fdf6a7dfd google::LogMessage::SendToLog()
> mesos-master[5483]: @ 0x7f2fdf6a56ab google::LogMessage::Flush()
> mesos-master[5483]: @ 0x7f2fdf6a8859 
> google::LogMessageFatal::~LogMessageFatal()
> mesos-master[5483]: @ 0x7f2fde2677f2 
> mesos::internal::master::Master::__removeSlave()
> mesos-master[5483]: @ 0x7f2fde267ebe 
> mesos::internal::master::Master::_markUnreachable()
> mesos-master[5483]: @ 0x7f2fde268215 
> _ZNO6lambda12CallableOnceIFN7process6FutureIbEEvEE10CallableFnINS_8internal7PartialIZN5mesos8internal6master6Master15markUnreachableERKNS9_9SlaveInfoEbRKSsEUlbE_JbclEv
> mesos-master[5483]: @ 0x7f2fddf30688 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureIbEEEclINS0_IFSC_vESC_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseIbEESt14default_deleteISO_EEOSG_S3_E_ISR_SG_St12_PlaceholderILi1EEclEOS3_
> mesos-master[5483]: @ 0x7f2fdf5e3b91 process::ProcessBase::consume()
> mesos-master[5483]: @ 0x7f2fdf608f77 process::ProcessManager::resume()
> mesos-master[5483]: @ 0x7f2fdf60cb36 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> mesos-master[5483]: @ 0x7f2fdf8c34d0 execute_native_thread_routine
> mesos-master[5483]: @ 0x7f2fdba02ea5 start_thread
> mesos-master[5483]: @ 0x7f2fdb20e8dd __clone
> systemd[1]: mesos-master.service: main process exited, code=killed, 
> status=6/ABRT
> systemd[1]: Unit mesos-master.service entered failed state.
> systemd[1]: mesos-master.service failed.
> systemd[1]: mesos-master.service holdoff time over, scheduling restart.
> systemd[1]: Stopped Mesos Master.
> systemd[1]: Started Mesos Master.
> mesos-master[28757]: I0618 14:05:41.461403 28748 logging.cpp:201] INFO level 
> logging started!
> mesos-master[28757]: I0618 14:05:41.461712 28748 main.cpp:243] Build: 
> 2020-05-09 10:42:00 by centos
> mesos-master[28757]: I0618 14:05:41.461721 28748 main.cpp:244] Version: 1.9.0
> mesos-master[28757]: I0618 14:05:41.461726 28748 main.cpp:247] Git tag: 1.9.0
> mesos-master[28757]: I0618 14:05:41.461730 28748 main.cpp:251] Git SHA: 
> 5e79a584e6ec3e9e2f96e8bf418411df9dafac2e{noformat}
>  
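
A standalone sketch of the defensive pattern being requested (hypothetical 
names, not the actual master code): look the framework up and tolerate its 
absence instead of CHECK-failing, since a framework can legitimately be 
disconnected or removed by the time the agent's tasks are cleaned up:
{code}
#include <iostream>
#include <map>
#include <string>

struct Framework {};

// Stand-in for the master's framework registry.
std::map<std::string, Framework*> frameworks;

void removeAgentTasks(const std::string& frameworkId)
{
  auto it = frameworks.find(frameworkId);
  if (it == frameworks.end()) {
    // master.cpp currently CHECK-fails here, aborting the process.
    std::cerr << "Framework " << frameworkId
              << " not found while removing agent tasks; skipping"
              << std::endl;
    return;
  }
  // ... remove the framework's tasks running on the agent ...
}
{code}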



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9271) DockerContainerizerHealthCheckTest.ROOT_DOCKER_USERNETWORK_NETNAMESPACE_HealthyTaskViaHTTP is flaky

2020-06-22 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142521#comment-17142521
 ] 

Greg Mann commented on MESOS-9271:
--

Observed again - attached another log from internal CI. CentOS 7, cmake build, 
no libevent and no SSL.

> DockerContainerizerHealthCheckTest.ROOT_DOCKER_USERNETWORK_NETNAMESPACE_HealthyTaskViaHTTP
>  is flaky
> ---
>
> Key: MESOS-9271
> URL: https://issues.apache.org/jira/browse/MESOS-9271
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: flaky-test
>
> Observed in an internal CI run (4498):
> {noformat}
> ../../src/tests/health_check_tests.cpp:2080
> Failed to wait 15secs for statusHealthy
> {noformat}
> Full log:
> {noformat}
> [ RUN  ] 
> NetworkProtocol/DockerContainerizerHealthCheckTest.ROOT_DOCKER_USERNETWORK_NETNAMESPACE_HealthyTaskViaHTTP/1
> I0927 00:57:43.336710 27845 docker.cpp:1659] Running docker -H 
> unix:///var/run/docker.sock inspect zhq527725/https-server:latest
> I0927 00:57:43.340283 27845 docker.cpp:1659] Running docker -H 
> unix:///var/run/docker.sock inspect alpine:latest
> I0927 00:57:43.343433 27845 docker.cpp:1659] Running docker -H 
> unix:///var/run/docker.sock inspect alpine:latest
> I0927 00:57:43.857142 27845 cluster.cpp:173] Creating default 'local' 
> authorizer
> I0927 00:57:43.858705 19628 master.cpp:413] Master 
> f9e9ac63-826d-4d08-b216-c5f352afc25d (ip-172-16-10-217.ec2.internal) started 
> on 172.16.10.217:32836
> I0927 00:57:43.858727 19628 master.cpp:416] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="hierarchical" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/QIaitl/credentials" --filter_gpu_resources="true" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
> --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/QIaitl/master" --zk_session_timeout="10secs"
> I0927 00:57:43.858912 19628 master.cpp:465] Master only allowing 
> authenticated frameworks to register
> I0927 00:57:43.858942 19628 master.cpp:471] Master only allowing 
> authenticated agents to register
> I0927 00:57:43.858948 19628 master.cpp:477] Master only allowing 
> authenticated HTTP frameworks to register
> I0927 00:57:43.858955 19628 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/QIaitl/credentials'
> I0927 00:57:43.859072 19628 master.cpp:521] Using default 'crammd5' 
> authenticator
> I0927 00:57:43.859141 19628 http.cpp:1037] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0927 00:57:43.859200 19628 http.cpp:1037] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0927 00:57:43.859246 19628 http.cpp:1037] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0927 00:57:43.859268 19628 master.cpp:602] Authorization enabled
> I0927 00:57:43.859541 19629 hierarchical.cpp:182] Initialized hierarchical 
> allocator process
> I0927 00:57:43.859582 19629 whitelist_watcher.cpp:77] No whitelist given
> I0927 00:57:43.860060 19628 master.cpp:2083] Elected as the leading master!
> I0927 00:57:43.860078 19628 master.cpp:1638] Recovering from registrar
> I0927 00:57:43.860117 19628 registrar.cpp:339] Recovering registrar
> I0927 00:57:43.860285 19628 registrar.cpp:383] Successfully fetched the 
> registry (0B) in 144128ns
> I0927 00:57:43.860328 19628 registrar.cpp:487] Applied 1 operations in 
> 8246ns; attempting to update the registry
> I0927 00:57:43.860527 19624 registrar.cpp:544] Successfully updated the 
> registry in 167168ns
> I0927 00:57:43.860571 19624 

[jira] [Created] (MESOS-10144) MasterQuotaTest.ValidateLimitAgainstConsumed is flaky

2020-06-22 Thread Greg Mann (Jira)
Greg Mann created MESOS-10144:
-

 Summary: MasterQuotaTest.ValidateLimitAgainstConsumed is flaky
 Key: MESOS-10144
 URL: https://issues.apache.org/jira/browse/MESOS-10144
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 1.10.0
 Environment: Debian 8 with libevent & SSL enabled.
Reporter: Greg Mann


Observed in internal CI. Log attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10142) CSI External Volumes MVP Design Doc

2020-06-17 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-10142:
-

Assignee: Qian Zhang

> CSI External Volumes MVP Design Doc
> ---
>
> Key: MESOS-10142
> URL: https://issues.apache.org/jira/browse/MESOS-10142
> Project: Mesos
>  Issue Type: Task
>Reporter: Greg Mann
>Assignee: Qian Zhang
>Priority: Major
>  Labels: csi, external-volumes, storage
>
> This ticket tracks the design doc for our initial implementation of external 
> volume support in Mesos using the CSI standard.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10142) CSI External Volumes MVP Design Doc

2020-06-17 Thread Greg Mann (Jira)
Greg Mann created MESOS-10142:
-

 Summary: CSI External Volumes MVP Design Doc
 Key: MESOS-10142
 URL: https://issues.apache.org/jira/browse/MESOS-10142
 Project: Mesos
  Issue Type: Task
Reporter: Greg Mann


This ticket tracks the design doc for our initial implementation of external 
volume support in Mesos using the CSI standard.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10141) CSI External Volume Support

2020-06-17 Thread Greg Mann (Jira)
Greg Mann created MESOS-10141:
-

 Summary: CSI External Volume Support
 Key: MESOS-10141
 URL: https://issues.apache.org/jira/browse/MESOS-10141
 Project: Mesos
  Issue Type: Epic
Reporter: Greg Mann


This epic tracks work for our MVP of external volume support in Mesos using the 
CSI standard.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10136) MasterDrainingTest.DrainAgentUnreachable is flaky

2020-05-27 Thread Greg Mann (Jira)
Greg Mann created MESOS-10136:
-

 Summary: MasterDrainingTest.DrainAgentUnreachable is flaky
 Key: MESOS-10136
 URL: https://issues.apache.org/jira/browse/MESOS-10136
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 1.10.0
 Environment: CentOS 7, built with cmake.
Reporter: Greg Mann


Observed in internal CI. Log attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10118) Agent incorrectly handles draining when empty

2020-04-15 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-10118:
-

Assignee: Greg Mann

> Agent incorrectly handles draining when empty
> -
>
> Key: MESOS-10118
> URL: https://issues.apache.org/jira/browse/MESOS-10118
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.9.0
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Major
>
> When the agent receives a {{DrainSlaveMessage}} and does not have any tasks 
> or operations, it writes the {{DrainConfig}} to disk and is then implicitly 
> stuck in a "draining" state indefinitely. For example, if an agent 
> reregistration is triggered at such a time, the master may think the agent is 
> operating normally and send a task to it, at which point the task will fail 
> because the agent thinks it's draining (see this test for an example: 
> https://reviews.apache.org/r/72364/).
> If the agent receives a {{DrainSlaveMessage}} when it has no tasks or 
> operations, it should avoid writing any {{DrainConfig}} to disk so that it 
> immediately "transitions" into the already-drained state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10118) Agent incorrectly handles draining when empty

2020-04-15 Thread Greg Mann (Jira)
Greg Mann created MESOS-10118:
-

 Summary: Agent incorrectly handles draining when empty
 Key: MESOS-10118
 URL: https://issues.apache.org/jira/browse/MESOS-10118
 Project: Mesos
  Issue Type: Bug
  Components: agent
Affects Versions: 1.9.0
Reporter: Greg Mann


When the agent receives a {{DrainSlaveMessage}} and does not have any tasks or 
operations, it writes the {{DrainConfig}} to disk and is then implicitly stuck 
in a "draining" state indefinitely. For example, if an agent reregistration is 
triggered at such a time, the master may think the agent is operating normally 
and send a task to it, at which point the task will fail because the agent 
thinks it's draining (see this test for an example: 
https://reviews.apache.org/r/72364/).

If the agent receives a {{DrainSlaveMessage}} when it has no tasks or 
operations, it should avoid writing any {{DrainConfig}} to disk so that it 
immediately "transitions" into the already-drained state.
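
A minimal sketch of the proposed fix (hypothetical names): checkpoint the 
{{DrainConfig}} only when there is something left to drain, so an empty agent 
transitions straight to the drained state:
{code}
#include <vector>

struct Task {};
struct Operation {};

// Skip writing DrainConfig when nothing needs draining, so the agent is
// not left in a persistent "draining" state across reregistration.
bool shouldCheckpointDrainConfig(
    const std::vector<Task>& tasks,
    const std::vector<Operation>& operations)
{
  return !tasks.empty() || !operations.empty();
}
{code}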



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10116) Attempt to reactivate disconnected agent crashes the master

2020-04-14 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-10116:
-

Assignee: Andrei Sekretenko  (was: Greg Mann)

> Attempt to reactivate disconnected agent crashes the master
> ---
>
> Key: MESOS-10116
> URL: https://issues.apache.org/jira/browse/MESOS-10116
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.9.0, 1.10.0
>Reporter: Andrei Sekretenko
>Assignee: Andrei Sekretenko
>Priority: Critical
>
> Observed the following scenario on a production cluster:
>  - operator performs agent draining
>  - draining completes, operator disconnects the agent
>  - operator reactivates agent via REACTIVATE_AGENT call
>  - *master issues an offer for a reactivated disconnected agent*
>  - a framework issues ACCEPT call with this offer
>  - master crashes with the following stack trace:
> {noformat}
> F0311 09:06:18.852365 11289 validation.cpp:2123] Check failed: 
> slave->connected Offer 4067082c-ec7a-4efc-ac2d-c6e7cbc77356-O13981526 
> outlived disconnected agent 968ea9b2-374d-45cb-b5b3-c4ffb45a4a78-S0 at 
> slave(1)@10.50.7.59:5051 (10.50.7.59)
> *** Check failure stack trace: ***
> @ 0x7feac6a1dc6d google::LogMessage::Fail()
> @ 0x7feac6a1fec8 google::LogMessage::SendToLog()
> @ 0x7feac6a1d803 google::LogMessage::Flush()
> @ 0x7feac6a20809 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7feac57cdea0 mesos::internal::master::validation::offer::validateSlave()
> @ 0x7feac57d09c1 std::_Function_handler<>::_M_invoke()
> @ 0x7feac57d0fd1 std::function<>::operator()()
> @ 0x7feac57cea3c mesos::internal::master::validation::offer::validate()
> @ 0x7feac56d5565 mesos::internal::master::Master::accept()
> @ 0x7feac56468f0 mesos::internal::master::Master::Http::scheduler()
> @ 0x7feac5689797 
> _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestERK6OptionINS2_14authentication9PrincipalEEEZN5mesos8internal6master6Master10initializeEvEUlS7_SD_E1_E9_M_invokeERKSt9_Any_dataS7_SD_
> @ 0x7feac697038c 
> _ZNO6lambda12CallableOnceIFN7process6FutureINS1_4http8ResponseEEEvEE10CallableFnINS_8internal7PartialIZZNS1_11ProcessBase8_consumeERKNSB_12HttpEndpointERKSsRKNS1_5OwnedINS3_7RequestNKUlRK6OptionINS3_14authentication20AuthenticationResultEEE0_clESR_EUlbE0_IbclEv
> @ 0x7feac53f30e7 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureINS1_4http8ResponseclINS0_IFSE_vESE_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseISD_EESt14default_deleteISQ_EEOSI_S3_E_IST_SI_St12_PlaceholderILi1EEclEOS3_
> @ 0x7feac6966561 process::ProcessBase::consume()
> @ 0x7feac697db5b process::ProcessManager::resume()
> @ 0x7feac69837f6 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> @ 0x7feac262f070 (unknown)
> @ 0x7feac1e4de65 start_thread
> @ 0x7feac1b7688d __clone
> {noformat}
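
A standalone sketch of a non-fatal variant of the failing validation 
(hypothetical names, not the actual master code): reject the ACCEPT call with 
a validation error instead of CHECK-failing when an offer outlives a 
disconnected agent:
{code}
#include <optional>
#include <string>

struct Slave
{
  bool connected;
};

// Returns an error message instead of aborting the master, so a stale
// offer for a disconnected agent fails the ACCEPT call gracefully.
std::optional<std::string> validateSlave(
    const Slave& slave, const std::string& offerId)
{
  if (!slave.connected) {
    return "Offer " + offerId + " outlived disconnected agent";
  }
  return std::nullopt;
}
{code}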



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10116) Attempt to reactivate disconnected agent crashes the master

2020-04-14 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-10116:
-

Assignee: Greg Mann  (was: Andrei Sekretenko)

> Attempt to reactivate disconnected agent crashes the master
> ---
>
> Key: MESOS-10116
> URL: https://issues.apache.org/jira/browse/MESOS-10116
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.9.0, 1.10.0
>Reporter: Andrei Sekretenko
>Assignee: Greg Mann
>Priority: Critical
>
> Observed the following scenario on a production cluster:
>  - operator performs agent draining
>  - draining completes, operator disconnects the agent
>  - operator reactivates agent via REACTIVATE_AGENT call
>  - *master issues an offer for a reactivated disconnected agent*
>  - a framework issues ACCEPT call with this offer
>  - master crashes with the following stack trace:
> {noformat}
> F0311 09:06:18.852365 11289 validation.cpp:2123] Check failed: 
> slave->connected Offer 4067082c-ec7a-4efc-ac2d-c6e7cbc77356-O13981526 
> outlived disconnected agent 968ea9b2-374d-45cb-b5b3-c4ffb45a4a78-S0 at 
> slave(1)@10.50.7.59:5051 (10.50.7.59)
> *** Check failure stack trace: ***
> @ 0x7feac6a1dc6d google::LogMessage::Fail()
> @ 0x7feac6a1fec8 google::LogMessage::SendToLog()
> @ 0x7feac6a1d803 google::LogMessage::Flush()
> @ 0x7feac6a20809 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7feac57cdea0 mesos::internal::master::validation::offer::validateSlave()
> @ 0x7feac57d09c1 std::_Function_handler<>::_M_invoke()
> @ 0x7feac57d0fd1 std::function<>::operator()()
> @ 0x7feac57cea3c mesos::internal::master::validation::offer::validate()
> @ 0x7feac56d5565 mesos::internal::master::Master::accept()
> @ 0x7feac56468f0 mesos::internal::master::Master::Http::scheduler()
> @ 0x7feac5689797 
> _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestERK6OptionINS2_14authentication9PrincipalEEEZN5mesos8internal6master6Master10initializeEvEUlS7_SD_E1_E9_M_invokeERKSt9_Any_dataS7_SD_
> @ 0x7feac697038c 
> _ZNO6lambda12CallableOnceIFN7process6FutureINS1_4http8ResponseEEEvEE10CallableFnINS_8internal7PartialIZZNS1_11ProcessBase8_consumeERKNSB_12HttpEndpointERKSsRKNS1_5OwnedINS3_7RequestNKUlRK6OptionINS3_14authentication20AuthenticationResultEEE0_clESR_EUlbE0_IbclEv
> @ 0x7feac53f30e7 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureINS1_4http8ResponseclINS0_IFSE_vESE_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseISD_EESt14default_deleteISQ_EEOSI_S3_E_IST_SI_St12_PlaceholderILi1EEclEOS3_
> @ 0x7feac6966561 process::ProcessBase::consume()
> @ 0x7feac697db5b process::ProcessManager::resume()
> @ 0x7feac69837f6 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> @ 0x7feac262f070 (unknown)
> @ 0x7feac1e4de65 start_thread
> @ 0x7feac1b7688d __clone
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10111) Failed check in libevent_ssl_socket.cpp: 'self->bev' Must be non NULL

2020-04-13 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082654#comment-17082654
 ] 

Greg Mann commented on MESOS-10111:
---

Review here: https://reviews.apache.org/r/72354/

> Failed check in libevent_ssl_socket.cpp: 'self->bev' Must be non NULL
> -
>
> Key: MESOS-10111
> URL: https://issues.apache.org/jira/browse/MESOS-10111
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.10.0
>Reporter: Andrei Sekretenko
>Assignee: Greg Mann
>Priority: Critical
>
> Observing the following master crash on a testing cluster roughly once an 
> hour:
> {noformat}
> F0408 14:17:33.470850 18423 libevent_ssl_socket.cpp:193] Check failed: 
> 'self->bev' Must be non NULL
> @ 0x7fa7db12e2ad  google::LogMessage::Fail()
> @ 0x7fa7db130508  google::LogMessage::SendToLog()
> @ 0x7fa7db12de43  google::LogMessage::Flush()
> @ 0x7fa7db130e49  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7fa7db1004de  google::CheckNotNull<>()
> @ 0x7fa7db0fb6ca  
> _ZNSt17_Function_handlerIFvvEZN7process7network8internal21LibeventSSLSocketImpl8shutdownEiEUlvE_E9_M_invokeERKSt9_Any_data
> @ 0x7fa7db107091  process::async_function()
> @ 0x7fa7d7178978  event_process_active_single_queue
> @ 0x7fa7d7178e5d  event_process_active
> @ 0x7fa7d71795b9  event_base_loop
> @ 0x7fa7db106bed  process::EventLoop::run()
> @ 0x7fa7d6cfe2b0  (unknown)
> @ 0x7fa7d651ce65  start_thread
> @ 0x7fa7d624588d  __clone
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10111) Failed check in libevent_ssl_socket.cpp: 'self->bev' Must be non NULL

2020-04-08 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-10111:
-

Assignee: Greg Mann

> Failed check in libevent_ssl_socket.cpp: 'self->bev' Must be non NULL
> -
>
> Key: MESOS-10111
> URL: https://issues.apache.org/jira/browse/MESOS-10111
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Andrei Sekretenko
>Assignee: Greg Mann
>Priority: Critical
>
> Observing the following master crash on a testing cluster roughly once an 
> hour:
> {noformat}
> F0408 14:17:33.470850 18423 libevent_ssl_socket.cpp:193] Check failed: 
> 'self->bev' Must be non NULL
> @ 0x7fa7db12e2ad  google::LogMessage::Fail()
> @ 0x7fa7db130508  google::LogMessage::SendToLog()
> @ 0x7fa7db12de43  google::LogMessage::Flush()
> @ 0x7fa7db130e49  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7fa7db1004de  google::CheckNotNull<>()
> @ 0x7fa7db0fb6ca  
> _ZNSt17_Function_handlerIFvvEZN7process7network8internal21LibeventSSLSocketImpl8shutdownEiEUlvE_E9_M_invokeERKSt9_Any_data
> @ 0x7fa7db107091  process::async_function()
> @ 0x7fa7d7178978  event_process_active_single_queue
> @ 0x7fa7d7178e5d  event_process_active
> @ 0x7fa7d71795b9  event_base_loop
> @ 0x7fa7db106bed  process::EventLoop::run()
> @ 0x7fa7d6cfe2b0  (unknown)
> @ 0x7fa7d651ce65  start_thread
> @ 0x7fa7d624588d  __clone
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10108) SSL Improvements

2020-04-02 Thread Greg Mann (Jira)
Greg Mann created MESOS-10108:
-

 Summary: SSL Improvements
 Key: MESOS-10108
 URL: https://issues.apache.org/jira/browse/MESOS-10108
 Project: Mesos
  Issue Type: Epic
Reporter: Greg Mann






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10048) Update the memory subsystem in the cgroup isolator to set container’s memory resource limits and `oom_score_adj`

2020-03-24 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066083#comment-17066083
 ] 

Greg Mann commented on MESOS-10048:
---

{noformat}
commit 12e5e870c38681bfc0455960f89a41127dac3daf (HEAD -> master, origin/master, 
origin/HEAD)
Author: Qian Zhang 
Date:   Tue Mar 24 10:44:39 2020 -0700

Moved containerizer utils in CMakeLists.

This is to ensure the function `calculateOOMScoreAdj()` can be resolved
on Windows.

Review: https://reviews.apache.org/r/72263/
{noformat}

> Update the memory subsystem in the cgroup isolator to set container’s memory 
> resource limits and `oom_score_adj`
> 
>
> Key: MESOS-10048
> URL: https://issues.apache.org/jira/browse/MESOS-10048
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.10.0
>
>
> Update the memory subsystem in the cgroup isolator to set container’s memory 
> resource limits and `oom_score_adj`



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10055) Update Mesos UI to display the resource limits of tasks

2020-03-23 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-10055:
-

Assignee: Greg Mann

> Update Mesos UI to display the resource limits of tasks
> ---
>
> Key: MESOS-10055
> URL: https://issues.apache.org/jira/browse/MESOS-10055
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Greg Mann
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10087) Update master & agent's HTTP endpoints for showing resource limits

2020-03-23 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-10087:
-

Assignee: Greg Mann

> Update master & agent's HTTP endpoints for showing resource limits
> --
>
> Key: MESOS-10087
> URL: https://issues.apache.org/jira/browse/MESOS-10087
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Greg Mann
>Priority: Major
>
> We need to update Mesos master's `/state`, `/frameworks`, `/tasks` endpoints 
> and agent's `/state` endpoint to show task's resource limits in their outputs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10093) Libprocess does not properly escape subprocess argument strings on Windows

2020-03-11 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-10093:
-

Assignee: Benjamin Mahler  (was: Greg Mann)

> Libprocess does not properly escape subprocess argument strings on Windows
> --
>
> Key: MESOS-10093
> URL: https://issues.apache.org/jira/browse/MESOS-10093
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Greg Mann
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: containerization, docker, mesosphere, windows
>
> When running some tests of Mesos on Windows, I discovered that the following 
> command would not execute successfully when passed to the Docker 
> containerizer in {{TaskInfo.command}}:
> {noformat}
> python -c "print('hello world')"
> {noformat}
> The following error is found in the task sandbox:
> {noformat}
>   File "", line 1
> "print('hello
> ^
> SyntaxError: EOL while scanning string literal
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10045) Validate task’s resources limits and the `share_cgroups` field

2020-03-10 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056218#comment-17056218
 ] 

Greg Mann commented on MESOS-10045:
---

Patches for agent-side validation of shared cgroups:
https://reviews.apache.org/r/72221/
https://reviews.apache.org/r/7/

> Validate task’s resources limits and the `share_cgroups` field
> --
>
> Key: MESOS-10045
> URL: https://issues.apache.org/jira/browse/MESOS-10045
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Greg Mann
>Priority: Major
>
> When launching a task, we need to validate:
>  # Only CPU and memory are supported as resource limits.
>  # Resource limit must be larger than resource request.
>  ** We need to be careful about the command task case, in which case we add 
> an allowance (0.1 CPUs and 32MB memory, see 
> [here|https://github.com/apache/mesos/blob/1.9.0/src/slave/slave.cpp#L6663:L6677]
>  for details) for the executor, so we need to validate that the task resource 
> limit is larger than the task resource request + this allowance; otherwise the 
> executor will be launched with limits < requests.
>  # `TaskInfo` can only include resource limits when the relevant agent 
> possesses the TASK_RESOURCE_LIMITS capability.
>  # The value of the field `share_cgroups` should be the same for all the tasks 
> launched by a single default executor.
>  # It is not allowed to set resource limits for the task which has the field 
> `share_cgroups` set as true.
> We also need to add validation to the agent which will ensure that non-debug 
> 2nd-or-lower-level nested containers cannot be launched via the 
> {{LaunchContainer}} call.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10045) Validate task’s resources limits and the `share_cgroups` field

2020-03-09 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055083#comment-17055083
 ] 

Greg Mann commented on MESOS-10045:
---

Patches for master-side validation:
https://reviews.apache.org/r/72216/
https://reviews.apache.org/r/72217/
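
As a standalone illustration of the request-vs-limit rule in the quoted ticket
below (plain structs with assumed field names, not {{mesos::TaskInfo}} or the
actual patch):
{noformat}
#include <string>

struct TaskResources
{
  double cpusRequest, cpusLimit;     // CPUs
  double memRequestMB, memLimitMB;   // megabytes
  bool isCommandTask;
};

// Allowance added for the command executor (0.1 CPUs, 32 MB): per the
// ticket, limits must exceed requests plus this allowance, otherwise
// the executor would be launched with limits < requests.
constexpr double CPU_ALLOWANCE = 0.1;
constexpr double MEM_ALLOWANCE_MB = 32.0;

bool validateLimits(const TaskResources& t, std::string* error)
{
  const double cpuFloor =
    t.cpusRequest + (t.isCommandTask ? CPU_ALLOWANCE : 0.0);
  const double memFloor =
    t.memRequestMB + (t.isCommandTask ? MEM_ALLOWANCE_MB : 0.0);

  if (t.cpusLimit <= cpuFloor || t.memLimitMB <= memFloor) {
    *error = "Task resource limits must be larger than requests"
             " (plus the command executor allowance)";
    return false;
  }

  return true;
}
{noformat}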

> Validate task’s resources limits and the `share_cgroups` field
> --
>
> Key: MESOS-10045
> URL: https://issues.apache.org/jira/browse/MESOS-10045
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Greg Mann
>Priority: Major
>
> When launching a task, we need to validate:
>  # Only CPU and memory are supported as resource limits.
>  # Resource limit must be larger than resource request.
>  ** We need to be careful about the command task case, in which case we add 
> an allowance (0.1 CPUs and 32MB memory, see 
> [here|https://github.com/apache/mesos/blob/1.9.0/src/slave/slave.cpp#L6663:L6677]
>  for details) for the executor, so we need to validate that the task resource 
> limit is larger than the task resource request + this allowance; otherwise the 
> executor will be launched with limits < requests.
>  # `TaskInfo` can only include resource limits when the relevant agent 
> possesses the TASK_RESOURCE_LIMITS capability.
>  # The value of the field `share_cgroups` should be the same for all the tasks 
> launched by a single default executor.
>  # It is not allowed to set resource limits for the task which has the field 
> `share_cgroups` set as true.
> We also need to add validation to the agent which will ensure that non-debug 
> 2nd-or-lower-level nested containers cannot be launched via the 
> {{LaunchContainer}} call.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10072) Windows: Curl requires zlib when built with SSL support on Windows

2020-03-04 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051366#comment-17051366
 ] 

Greg Mann commented on MESOS-10072:
---

[~ddary] I'm not sure; if we haven't seen issues in testing on Windows agents, 
then probably not a blocker.

> Windows: Curl requires zlib when built with SSL support on Windows
> --
>
> Key: MESOS-10072
> URL: https://issues.apache.org/jira/browse/MESOS-10072
> Project: Mesos
>  Issue Type: Task
>Reporter: Joseph Wu
>Priority: Major
>  Labels: curl, foundations, windows
> Attachments: Screen Shot 2019-12-17 at 1.38.43 PM.png
>
>
> After building Windows with --enable-ssl, some curl-related tests, like 
> health check tests, start failing with the odd exit code {{-1073741515}}.
> Running curl directly with the Visual Studio debugger yields this error:
>  !Screen Shot 2019-12-17 at 1.38.43 PM.png|width=343,height=164!
> Some documentation online seems to support this additional requirement:
>  [https://wiki.dlang.org/Curl_on_Windows]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10044) Add a new capability `TASK_RESOURCE_LIMITS` into Mesos agent

2020-03-03 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17050260#comment-17050260
 ] 

Greg Mann commented on MESOS-10044:
---

{noformat}
commit f445e3aea44b4060292fa5e029dbb2c19e219c25
Author: Greg Mann 
Date:   Tue Mar 3 06:03:57 2020 -0800

Added the 'TASK_RESOURCE_LIMITS' agent capability.

This capability will be used by the master to detect whether
or not an agent can handle task resource limits.

Review: https://reviews.apache.org/r/71991/
{noformat}

> Add a new capability `TASK_RESOURCE_LIMITS` into Mesos agent
> 
>
> Key: MESOS-10044
> URL: https://issues.apache.org/jira/browse/MESOS-10044
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Greg Mann
>Priority: Major
> Fix For: 1.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10093) Docker containerizer does not handle whitespace correctly on Windows

2020-02-06 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-10093:
-

Assignee: Greg Mann

> Docker containerizer does not handle whitespace correctly on Windows
> 
>
> Key: MESOS-10093
> URL: https://issues.apache.org/jira/browse/MESOS-10093
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Major
>  Labels: containerization, docker, mesosphere, windows
>
> When running some tests of Mesos on Windows, I discovered that the following 
> command would not execute successfully when passed to the Docker 
> containerizer in {{TaskInfo.command}}:
> {noformat}
> python -c "print('hello world')"
> {noformat}
> The following error is found in the task sandbox:
> {noformat}
>   File "", line 1
> "print('hello
> ^
> SyntaxError: EOL while scanning string literal
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10093) Docker containerizer does not handle whitespace correctly on Windows

2020-02-06 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031650#comment-17031650
 ] 

Greg Mann commented on MESOS-10093:
---

I heard from [~kaysoky] that this may have been a relic from earlier days in 
Mesos-on-Windows development, when PowerShell was the intended default shell on 
that platform. This was later changed to {{cmd}} to reduce the resource 
overhead.

> Docker containerizer does not handle whitespace correctly on Windows
> 
>
> Key: MESOS-10093
> URL: https://issues.apache.org/jira/browse/MESOS-10093
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Greg Mann
>Priority: Major
>  Labels: containerization, docker, mesosphere, windows
>
> When running some tests of Mesos on Windows, I discovered that the following 
> command would not execute successfully when passed to the Docker 
> containerizer in {{TaskInfo.command}}:
> {noformat}
> python -c "print('hello world')"
> {noformat}
> The following error is found in the task sandbox:
> {noformat}
>   File "", line 1
> "print('hello
> ^
> SyntaxError: EOL while scanning string literal
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (MESOS-10093) Docker containerizer does not handle whitespace correctly on Windows

2020-02-05 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031100#comment-17031100
 ] 

Greg Mann edited comment on MESOS-10093 at 2/5/20 10:53 PM:


On Windows, we execute shell commands as arguments to {{cmd.exe}}. I ran the 
following test in the command prompt:
{noformat}
C:\Users\Administrator>cmd /c "python -c \"print('hello world')\""
  File "", line 1
"print('hello
^
SyntaxError: EOL while scanning string literal

C:\Users\Administrator>cmd /c "python -c ^"print('hello world')^""
hello world
{noformat}

In libprocess, it looks like we currently escape double quotes using a 
backslash: 
https://github.com/apache/mesos/blob/4990d2cd6e76da340b30e200be0d700124dac2b1/3rdparty/stout/include/stout/os/windows/shell.hpp#L188-L191

Based on the above test, it appears that we should be escaping them with caret 
instead.

NOTE that before merging such a change, we should confirm that changing this 
escaping behavior doesn't break Mesos containerizer tasks.
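
For illustration, caret-based escaping of double quotes amounts to something
like this (a standalone sketch, not the current stout {{shell.hpp}} code):
{noformat}
#include <string>

// Escape double quotes for cmd.exe with `^` (cmd's escape character)
// instead of `\`, per the experiment above.
std::string escapeForCmd(const std::string& arg)
{
  std::string out;
  for (char c : arg) {
    if (c == '"') {
      out += "^\"";
    } else {
      out += c;
    }
  }
  return out;
}

// escapeForCmd("python -c \"print('hello world')\"") yields:
//   python -c ^"print('hello world')^"
{noformat}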


was (Author: greggomann):
On Windows, we execute shell commands as arguments to {{cmd.exe}}. I ran the 
following test in the command prompt:
{noformat}
C:\Users\Administrator>cmd /c "python -c \"print('hello world')\""
  File "", line 1
"print('hello
^
SyntaxError: EOL while scanning string literal

C:\Users\Administrator>cmd /c "python -c ^"print('hello world')^""
hello world
{noformat}

In libprocess, it looks like we currently escape double quotes using a 
backslash: 
https://github.com/apache/mesos/blob/4990d2cd6e76da340b30e200be0d700124dac2b1/3rdparty/stout/include/stout/os/windows/shell.hpp#L188-L191

Based on the above test, it appears that escaping them with caret instead.

NOTE that before merging such a change, we should confirm that changing this 
escaping behavior doesn't break Mesos containerizer tasks.

> Docker containerizer does not handle whitespace correctly on Windows
> 
>
> Key: MESOS-10093
> URL: https://issues.apache.org/jira/browse/MESOS-10093
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Greg Mann
>Priority: Major
>  Labels: containerization, docker, mesosphere, windows
>
> When running some tests of Mesos on Windows, I discovered that the following 
> command would not execute successfully when passed to the Docker 
> containerizer in {{TaskInfo.command}}:
> {noformat}
> python -c "print('hello world')"
> {noformat}
> The following error is found in the task sandbox:
> {noformat}
>   File "", line 1
> "print('hello
> ^
> SyntaxError: EOL while scanning string literal
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (MESOS-10093) Docker containerizer does not handle whitespace correctly on Windows

2020-02-05 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031100#comment-17031100
 ] 

Greg Mann edited comment on MESOS-10093 at 2/5/20 10:53 PM:


On Windows, we execute shell commands as arguments to {{cmd.exe}}. I ran the 
following test in the command prompt:
{noformat}
C:\Users\Administrator>cmd /c "python -c \"print('hello world')\""
  File "", line 1
"print('hello
^
SyntaxError: EOL while scanning string literal

C:\Users\Administrator>cmd /c "python -c ^"print('hello world')^""
hello world
{noformat}

In libprocess, it looks like we currently escape double quotes using a 
backslash: 
https://github.com/apache/mesos/blob/4990d2cd6e76da340b30e200be0d700124dac2b1/3rdparty/stout/include/stout/os/windows/shell.hpp#L188-L191

Based on the above test, it appears that we should be escaping them with a 
caret instead.

NOTE that before merging such a change, we should confirm that changing this 
escaping behavior doesn't break Mesos containerizer tasks.


was (Author: greggomann):
On Windows, we execute shell commands as arguments to {{cmd.exe}}. I ran the 
following test in the command prompt:
{noformat}
C:\Users\Administrator>cmd /c "python -c \"print('hello world')\""
  File "", line 1
"print('hello
^
SyntaxError: EOL while scanning string literal

C:\Users\Administrator>cmd /c "python -c ^"print('hello world')^""
hello world
{noformat}

In libprocess, it looks like we currently escape double quotes using a 
backslash: 
https://github.com/apache/mesos/blob/4990d2cd6e76da340b30e200be0d700124dac2b1/3rdparty/stout/include/stout/os/windows/shell.hpp#L188-L191

Based on the above test, it appears that we should be escaping them with caret 
instead.

NOTE that before merging such a change, we should confirm that changing this 
escaping behavior doesn't break Mesos containerizer tasks.

> Docker containerizer does not handle whitespace correctly on Windows
> 
>
> Key: MESOS-10093
> URL: https://issues.apache.org/jira/browse/MESOS-10093
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Greg Mann
>Priority: Major
>  Labels: containerization, docker, mesosphere, windows
>
> When running some tests of Mesos on Windows, I discovered that the following 
> command would not execute successfully when passed to the Docker 
> containerizer in {{TaskInfo.command}}:
> {noformat}
> python -c "print('hello world')"
> {noformat}
> The following error is found in the task sandbox:
> {noformat}
>   File "", line 1
> "print('hello
> ^
> SyntaxError: EOL while scanning string literal
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10093) Docker containerizer does not handle whitespace correctly on Windows

2020-02-05 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031100#comment-17031100
 ] 

Greg Mann commented on MESOS-10093:
---

On Windows, we execute shell commands as arguments to {{cmd.exe}}. I ran the 
following test in the command prompt:
{noformat}
C:\Users\Administrator>cmd /c "python -c \"print('hello world')\""
  File "", line 1
"print('hello
^
SyntaxError: EOL while scanning string literal

C:\Users\Administrator>cmd /c "python -c ^"print('hello world')^""
hello world
{noformat}

In libprocess, it looks like we currently escape double quotes using a 
backslash: 
https://github.com/apache/mesos/blob/4990d2cd6e76da340b30e200be0d700124dac2b1/3rdparty/stout/include/stout/os/windows/shell.hpp#L188-L191

Based on the above test, it appears that we should be escaping them with a 
caret instead.

NOTE that before merging such a change, we should confirm that changing this 
escaping behavior doesn't break Mesos containerizer tasks.

> Docker containerizer does not handle whitespace correctly on Windows
> 
>
> Key: MESOS-10093
> URL: https://issues.apache.org/jira/browse/MESOS-10093
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Greg Mann
>Priority: Major
>  Labels: containerization, docker, mesosphere, windows
>
> When running some tests of Mesos on Windows, I discovered that the following 
> command would not execute successfully when passed to the Docker 
> containerizer in {{TaskInfo.command}}:
> {noformat}
> python -c "print('hello world')"
> {noformat}
> The following error is found in the task sandbox:
> {noformat}
>   File "", line 1
> "print('hello
> ^
> SyntaxError: EOL while scanning string literal
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10093) Docker containerizer does not handle whitespace correctly on Windows

2020-02-05 Thread Greg Mann (Jira)
Greg Mann created MESOS-10093:
-

 Summary: Docker containerizer does not handle whitespace correctly on 
Windows
 Key: MESOS-10093
 URL: https://issues.apache.org/jira/browse/MESOS-10093
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.9.0
Reporter: Greg Mann


When running some tests of Mesos on Windows, I discovered that the following 
command would not execute successfully when passed to the Docker containerizer 
in {{TaskInfo.command}}:
{noformat}
python -c "print('hello world')"
{noformat}

The following error is found in the task sandbox:
{noformat}
  File "", line 1
"print('hello
^
SyntaxError: EOL while scanning string literal
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10068) Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state

2020-01-28 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025386#comment-17025386
 ] 

Greg Mann commented on MESOS-10068:
---

[~daltonmatos] regarding this ticket, yea I think it makes sense to close this 
one and mention it in MESOS-10089.

Time is tight over here, but I'd be happy to mentor you a bit in the codebase 
:) Would you like to start by addressing MESOS-10089? If so, we could do an 
intro call to get started. Feel free to find me on Mesos slack if you're on 
there.

> Mesos Master doesn't send AGENT_REMOVED when removing agent from internal 
> state
> ---
>
> Key: MESOS-10068
> URL: https://issues.apache.org/jira/browse/MESOS-10068
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.7.3, 1.8.2, 1.9.1
>Reporter: Dalton Matos Coelho Barreto
>Priority: Major
> Attachments: master-full-logs.log
>
>
> Hello,
>  
> Looking at the documentation of the master {{/api/v1}} endpoint, the 
> {{SUBSCRIBE}} message says that only {{TASK_ADDED}} and {{TASK_UPDATED}} are 
> supported for this endpoint, but when a new agent joins the cluster an 
> {{AGENT_ADDED}} event is received.
> The problem is that when this agent is stopped the {{AGENT_REMOVED}} is not 
> received by clients subscribed to the master API.
>  
> I tested this behavior with versions {{1.7.3}}, {{1.8.2}}, and {{1.9.1}}, all 
> using the Docker image {{mesos/mesos-centos}}.
> The only way I saw an {{AGENT_REMOVED}} event was when a new agent joined the 
> cluster but the master couldn't communicate with it; in that specific test, a 
> firewall was blocking port {{5051}} on the slave, that is, nobody was able to 
> talk to the slave on port {{5051}}.
>  
> h2. Here are the steps to reproduce the problem
>  * Start a new Mesos master
>  * Connect to the {{/api/v1}} endpoint, sending a {{SUBSCRIBE}} message:
>  ** 
> {noformat}
> curl --no-buffer -Ld '{"type": "SUBSCRIBE"}' -H "Content-Type: 
> application/json" http://MASTER_IP:5050/api/v1{noformat}
>  * Start a new slave and confirm the {{AGENT_ADDED}} event is delivered;
>  * Stop this slave;
>  * Check that {{/slaves?slave_id=AGENT_ID}} returns a JSON response with the 
> field {{active=false}}.
>  * Wait for the Mesos master to stop listing this slave, that is, until 
> {{/slaves?slave_id=AGENT_ID}} returns an empty response;
> Even after the empty response, the event never reaches the subscriber.
>  
> The Mesos master logs show this:
> {noformat}
>  I1213 15:03:10.33893513 master.cpp:1297] Agent 
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 
> (86813ca2a964) disconnected
> I1213 15:03:10.33908913 master.cpp:3399] Disconnecting agent 
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 
> (86813ca2a964)
> I1213 15:03:10.33920713 master.cpp:3418] Deactivating agent 
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 
> (86813ca2a964)
> {noformat}
> And then:
> {noformat}
> W1213 15:04:40.72667015 process.cpp:1917] Failed to send 
> 'mesos.internal.PingSlaveMessage' to '172.18.0.51:5051', connect: Failed to 
> connect to 172.18.0.51:5051: No route to host{noformat}
> And some time after this:
> {noformat}
> I1213 15:04:37.685007 7 hierarchical.cpp:900] Removed agent 
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1   {noformat}
>  
> Even after this removal, the {{AGENT_REMOVED}} event is not delivered.
>  
> I will attach the full master logs also.
>  
> Do you think this could be a bug?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10068) Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state

2020-01-23 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022346#comment-17022346
 ] 

Greg Mann commented on MESOS-10068:
---

Yea we should definitely be sending AGENT_REMOVED when agents are marked gone, 
sounds like a bug to me. I created a ticket to track this: MESOS-10089

Regarding the unreachable agents, we may want to have an AGENT_UNREACHABLE 
event to indicate this.

[~daltonmatos], we have a ticket here to track the design of the full agent 
state diagram: MESOS-9556
That would be a great place to continue discussion, feel free to ping us there. 
Unfortunately, I'm not sure when we might find time to work on that, but it's 
definitely something we've been wanting to do for a while now.

> Mesos Master doesn't send AGENT_REMOVED when removing agent from internal 
> state
> ---
>
> Key: MESOS-10068
> URL: https://issues.apache.org/jira/browse/MESOS-10068
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.7.3, 1.8.2, 1.9.1
>Reporter: Dalton Matos Coelho Barreto
>Priority: Major
> Attachments: master-full-logs.log
>
>
> Hello,
>  
> Looking at the documentation of the master {{/api/v1}} endpoint, the 
> {{SUBSCRIBE}} message says that only {{TASK_ADDED}} and {{TASK_UPDATED}} are 
> supported for this endpoint, but when a new agent joins the cluster an 
> {{AGENT_ADDED}} event is received.
> The problem is that when this agent is stopped the {{AGENT_REMOVED}} is not 
> received by clients subscribed to the master API.
>  
> I tested this behavior with versions {{1.7.3}}, {{1.8.2}}, and {{1.9.1}}, all 
> using the Docker image {{mesos/mesos-centos}}.
> The only way I saw an {{AGENT_REMOVED}} event was when a new agent joined the 
> cluster but the master couldn't communicate with it; in that specific test, a 
> firewall was blocking port {{5051}} on the slave, that is, nobody was able to 
> talk to the slave on port {{5051}}.
>  
> h2. Here are the steps to reproduce the problem
>  * Start a new Mesos master
>  * Connect to the {{/api/v1}} endpoint, sending a {{SUBSCRIBE}} message:
>  ** 
> {noformat}
> curl --no-buffer -Ld '{"type": "SUBSCRIBE"}' -H "Content-Type: 
> application/json" http://MASTER_IP:5050/api/v1{noformat}
>  * Start a new slave and confirm the {{AGENT_ADDED}} event is delivered;
>  * Stop this slave;
>  * Check that {{/slaves?slave_id=AGENT_ID}} returns a JSON response with the 
> field {{active=false}}.
>  * Wait for the Mesos master to stop listing this slave, that is, until 
> {{/slaves?slave_id=AGENT_ID}} returns an empty response;
> Even after the empty response, the event never reaches the subscriber.
>  
> The Mesos master logs show this:
> {noformat}
>  I1213 15:03:10.33893513 master.cpp:1297] Agent 
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 
> (86813ca2a964) disconnected
> I1213 15:03:10.33908913 master.cpp:3399] Disconnecting agent 
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 
> (86813ca2a964)
> I1213 15:03:10.33920713 master.cpp:3418] Deactivating agent 
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 
> (86813ca2a964)
> {noformat}
> And then:
> {noformat}
> W1213 15:04:40.72667015 process.cpp:1917] Failed to send 
> 'mesos.internal.PingSlaveMessage' to '172.18.0.51:5051', connect: Failed to 
> connect to 172.18.0.51:5051: No route to host{noformat}
> And some time after this:
> {noformat}
> I1213 15:04:37.685007 7 hierarchical.cpp:900] Removed agent 
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1   {noformat}
>  
> Even after this removal, the {{AGENT_REMOVED}} event is not delivered.
>  
> I will attach the full master logs also.
>  
> Do you think this could be a bug?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10089) AGENT_REMOVED event not sent when agents marked GONE

2020-01-23 Thread Greg Mann (Jira)
Greg Mann created MESOS-10089:
-

 Summary: AGENT_REMOVED event not sent when agents marked GONE
 Key: MESOS-10089
 URL: https://issues.apache.org/jira/browse/MESOS-10089
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 1.9.0
Reporter: Greg Mann


The master currently does not send subscribers the AGENT_REMOVED event when 
agents are marked GONE, but it should.

Since the {{__removeSlave}} method is used to handle both the UNREACHABLE and 
GONE cases, we could update it to conditionally send this event. However, it's 
worth noting that the {{_removeSlave}}/{{__removeSlave}} logic is messy and 
unintuitive and in need of refactoring - I suspect we can turn these into a 
single method which handles all cases with the help of an auxiliary function or 
two.
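
Schematically, the conditional send could look like this (all names below are
assumptions about the refactored shape, not the current {{__removeSlave}}
code):
{noformat}
#include <string>

enum class RemovalCause { UNREACHABLE, GONE };

struct Master
{
  // Stand-in for broadcasting an AGENT_REMOVED event to subscribers.
  void notifyAgentRemoved(const std::string& slaveId) { (void) slaveId; }

  void removeSlave(const std::string& slaveId, RemovalCause cause)
  {
    // ... common teardown: tasks, offers, allocator state ...

    if (cause == RemovalCause::GONE) {
      // The missing piece this ticket tracks: GONE agents are removed
      // for good, so subscribers should see AGENT_REMOVED.
      notifyAgentRemoved(slaveId);
    }
  }
};
{noformat}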



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-9847) Docker executor doesn't wait for status updates to be ack'd before shutting down.

2020-01-15 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-9847:


Assignee: Andrei Budnik

> Docker executor doesn't wait for status updates to be ack'd before shutting 
> down.
> -
>
> Key: MESOS-9847
> URL: https://issues.apache.org/jira/browse/MESOS-9847
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Reporter: Meng Zhu
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerization
>
> The docker executor doesn't wait for pending status updates to be 
> acknowledged before shutting down; instead, it sleeps for one second and then 
> terminates:
> {noformat}
>   void _stop()
>   {
> // A hack for now ... but we need to wait until the status update
> // is sent to the slave before we shut ourselves down.
> // TODO(tnachen): Remove this hack and also the same hack in the
> // command executor when we have the new HTTP APIs to wait until
> // an ack.
> os::sleep(Seconds(1));
> driver.get()->stop();
>   }
> {noformat}
> This would result in a race between the task status update (e.g. TASK_FINISHED) 
> and executor exit. The latter would lead to the agent generating a `TASK_FAILED` 
> status update by itself, leading to the confusing case where the agent 
> handles two different terminal status updates.
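
For reference, the ack-then-stop behavior the TODO above asks for can be
modeled with a plain condition variable (a standalone sketch, not libprocess
or executor-driver code):
{noformat}
#include <chrono>
#include <condition_variable>
#include <mutex>

// Models "wait until the pending status update is acknowledged, with a
// bounded timeout" instead of sleeping a fixed second before stopping.
struct AckTracker
{
  std::mutex m;
  std::condition_variable cv;
  bool acked = false;

  void acknowledge()
  {
    {
      std::lock_guard<std::mutex> lock(m);
      acked = true;
    }
    cv.notify_all();
  }

  // Returns true if the ack arrived before the timeout elapsed.
  bool waitForAck(std::chrono::seconds timeout)
  {
    std::unique_lock<std::mutex> lock(m);
    return cv.wait_for(lock, timeout, [this] { return acked; });
  }
};
{noformat}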



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10045) Validate task’s resources limits and the `shared_cgroups` field in Mesos master

2020-01-13 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-10045:
-

Assignee: Greg Mann

> Validate task’s resources limits and the `shared_cgroups` field in Mesos 
> master
> ---
>
> Key: MESOS-10045
> URL: https://issues.apache.org/jira/browse/MESOS-10045
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Greg Mann
>Priority: Major
>
> When launching a task, we need to validate:
>  # Only CPU and memory are supported as resource limits.
>  # Resource limit must be larger than resource request.
>  ** We need to be careful about the command task case, in which case we add 
> an allowance (0.1 CPUs and 32MB memory, see 
> [here|https://github.com/apache/mesos/blob/1.9.0/src/slave/slave.cpp#L6663:L6677]
>  for details) for the executor, so we need to validate that the task resource 
> limit is larger than the task resource request + this allowance; otherwise the 
> executor will be launched with limits < requests.
>  # `TaskInfo` can only include resource limits when the relevant agent 
> possesses the TASK_RESOURCE_LIMITS capability.
>  # The value of the field `shared_cgroups` should be the same for all the tasks 
> launched by a single default executor.
>  # It is not allowed to set resource limits for the task which has the field 
> `shared_cgroups` set as true.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10044) Add a new capability `TASK_RESOURCE_LIMITS` into Mesos agent

2020-01-13 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-10044:
-

Assignee: Greg Mann

> Add a new capability `TASK_RESOURCE_LIMITS` into Mesos agent
> 
>
> Key: MESOS-10044
> URL: https://issues.apache.org/jira/browse/MESOS-10044
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Greg Mann
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10049) Add a new reason in `TaskStatus::Reason` for the case that a task is OOM-killed due to exceeding its memory request

2019-12-21 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001636#comment-17001636
 ] 

Greg Mann commented on MESOS-10049:
---

Review here: https://reviews.apache.org/r/71935/

> Add a new reason in `TaskStatus::Reason` for the case that a task is 
> OOM-killed due to exceeding its memory request
> ---
>
> Key: MESOS-10049
> URL: https://issues.apache.org/jira/browse/MESOS-10049
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Greg Mann
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10049) Add a new reason in `TaskStatus::Reason` for the case that a task is OOM-killed due to exceeding its memory request

2019-12-21 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-10049:
-

Assignee: Greg Mann

> Add a new reason in `TaskStatus::Reason` for the case that a task is 
> OOM-killed due to exceeding its memory request
> ---
>
> Key: MESOS-10049
> URL: https://issues.apache.org/jira/browse/MESOS-10049
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Greg Mann
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10041) Libprocess SSL verification can leak memory

2019-11-22 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16980584#comment-16980584
 ] 

Greg Mann commented on MESOS-10041:
---

{noformat}
commit e52d0d1f25a91f9940bea4329eb5359373ee0ed0
Author: Benno Evers 
Date:   Fri Nov 22 12:00:43 2019 -0800

Fixed memory leak in openssl verification function.

When the hostname validation scheme was set to 'openssl',
the `openssl::verify()` function would return without
freeing a previously allocated `X509*` object.

To fix the leak, a long-standing TODO to switch to
RAII-based memory management for the certificate was
resolved.

Review: https://reviews.apache.org/r/71805/
{noformat}
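
For reference, the RAII pattern mentioned in the commit message typically
looks like this with OpenSSL (a generic sketch of the technique, not the
exact patch):
{noformat}
#include <memory>

#include <openssl/ssl.h>
#include <openssl/x509.h>

// RAII ownership of an X509*: the certificate is freed on every return
// path, including early error returns, which is what plugs the leak.
using X509Ptr = std::unique_ptr<X509, decltype(&X509_free)>;

bool verifyPeer(SSL* ssl)
{
  X509Ptr cert(SSL_get_peer_certificate(ssl), &X509_free);
  if (cert == nullptr) {
    return false;  // No leak: nothing was allocated.
  }

  // ... hostname checks; any early `return` still frees the cert ...
  return true;
}
{noformat}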

> Libprocess SSL verification can leak memory
> ---
>
> Key: MESOS-10041
> URL: https://issues.apache.org/jira/browse/MESOS-10041
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.9.0
>Reporter: Greg Mann
>Assignee: Benno Evers
>Priority: Major
>  Labels: libprocess, ssl
>
> In {{process::network::openssl::verify()}}, when the SSL hostname validation 
> scheme is set to "openssl", the function can return without freeing an 
> {{X509}} object, leading to a memory leak.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10041) Libprocess SSL verification can leak memory

2019-11-22 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-10041:
-

Assignee: Benno Evers

> Libprocess SSL verification can leak memory
> ---
>
> Key: MESOS-10041
> URL: https://issues.apache.org/jira/browse/MESOS-10041
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Greg Mann
>Assignee: Benno Evers
>Priority: Major
>  Labels: libprocess, ssl
>
> In {{process::network::openssl::verify()}}, when the SSL hostname validation 
> scheme is set to "openssl", the function can return without freeing an 
> {{X509}} object, leading to a memory leak.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10041) Libprocess SSL verification can leak memory

2019-11-22 Thread Greg Mann (Jira)
Greg Mann created MESOS-10041:
-

 Summary: Libprocess SSL verification can leak memory
 Key: MESOS-10041
 URL: https://issues.apache.org/jira/browse/MESOS-10041
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Reporter: Greg Mann


In {{process::network::openssl::verify()}}, when the SSL hostname validation 
scheme is set to "openssl", the function can return without freeing an {{X509}} 
object, leading to a memory leak.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10033) Design per-task cgroup isolation

2019-11-20 Thread Greg Mann (Jira)
Greg Mann created MESOS-10033:
-

 Summary: Design per-task cgroup isolation
 Key: MESOS-10033
 URL: https://issues.apache.org/jira/browse/MESOS-10033
 Project: Mesos
  Issue Type: Task
Reporter: Greg Mann


To provide container resource isolation which more closely matches the 
isolation implied by the Mesos nested container API, we should limit CPU and 
memory on a per-task basis. The current Mesos containerizer implementation 
limits CPU and memory at the level of the executor only, which means that 
tasks within a task group can burst above their CPU or memory resources. 
Instead, we should apply these limits using per-task cgroups.
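
As a rough sketch of what a per-task limit means at the cgroupfs level
(assumed cgroups-v1 layout, illustrative only, not the isolator code):
{noformat}
#include <filesystem>
#include <fstream>
#include <string>

// Create a memory cgroup for one task nested under the executor's
// cgroup and set its hard limit, so the task cannot burst above it.
void createTaskMemoryCgroup(
    const std::string& executorCgroup,  // e.g. ".../memory/mesos/<executor>"
    const std::string& taskId,
    long long limitBytes)
{
  const std::string path = executorCgroup + "/" + taskId;
  std::filesystem::create_directories(path);
  std::ofstream(path + "/memory.limit_in_bytes") << limitBytes;
}
{noformat}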



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10031) Agent's 'executorTerminated()' can cause double task status update

2019-11-06 Thread Greg Mann (Jira)
Greg Mann created MESOS-10031:
-

 Summary: Agent's 'executorTerminated()' can cause double task 
status update
 Key: MESOS-10031
 URL: https://issues.apache.org/jira/browse/MESOS-10031
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.9.0
Reporter: Greg Mann
Assignee: Greg Mann


When the agent first receives a task status update from an executor, it 
executes {{Slave::statusUpdate()}}, which adds the task ID to the 
{{Executor::pendingStatusUpdates}} map, but leaves the ID in 
{{Executor::launchedTasks}}.

Meanwhile, the code in {{Slave::executorTerminated()}} is not capable of 
handling the intermediate task state which exists in between the execution of 
{{Slave::statusUpdate()}} and {{Slave::_statusUpdate()}}. If 
{{Slave::executorTerminated()}} executes at that point in time, it's possible 
that the task will be transitioned to a terminal state twice (for example, it 
could be transitioned to TASK_FINISHED by the executor, then to TASK_FAILED by 
the agent if the executor suddenly terminates).

If the agent has already received a status update from an executor, that state 
transition should be honored even if the executor terminates immediately after 
it's sent. We should ensure that {{Slave::executorTerminated()}} cannot cause a 
valid update received from an executor to be ignored.
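
Schematically (stand-in types, not the real {{Slave}}/{{Executor}} structs),
the guard would be: only synthesize TASK_FAILED for tasks with no update
already received from the executor:
{noformat}
#include <set>
#include <string>

// Stand-ins for the agent's per-executor bookkeeping.
struct ExecutorState
{
  std::set<std::string> launchedTasks;
  std::set<std::string> pendingStatusUpdates;  // received, not yet ack'd
};

// If a terminal update from the executor is already pending, executor
// termination must not override it with a synthesized TASK_FAILED.
bool shouldSynthesizeTaskFailed(
    const ExecutorState& executor,
    const std::string& taskId)
{
  return executor.launchedTasks.count(taskId) > 0 &&
         executor.pendingStatusUpdates.count(taskId) == 0;
}
{noformat}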



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10002) Design doc for container bursting

2019-11-04 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966898#comment-16966898
 ] 

Greg Mann commented on MESOS-10002:
---

Design doc is in progress here: 
https://docs.google.com/document/d/1iEXn2dBg07HehbNZunJWsIY6iaFezXiRsvpNw4dVQII/edit?usp=sharing

> Design doc for container bursting
> -
>
> Key: MESOS-10002
> URL: https://issues.apache.org/jira/browse/MESOS-10002
> Project: Mesos
>  Issue Type: Task
>  Components: agent, containerization
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Major
>  Labels: foundations
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9977) Agent does not check for immutable files while removing persistent volumes (and possibly in other GC operations)

2019-11-01 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965152#comment-16965152
 ] 

Greg Mann commented on MESOS-9977:
--

I can think of two options for handling this:
1) Force the persistent volume removal by having the agent unset the immutable 
attribute
2) Fail the DESTROY operation

In the case of persistent volumes, I think that #2 might make more sense - this 
is the more conservative thing to do, which seems prudent in the case of 
potentially critical data. Perhaps we could surface the presence of the 
immutable attribute in the volume via logging somewhere.

[~kaysoky] you mentioned sandbox GC in the description as well - in this case, 
I might be OK with just forcing the directory removal by having the agent unset 
the immutable attribute.
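
For reference, option #1 (unsetting the attribute, i.e. what {{chattr -i}}
does) boils down to an ioctl on Linux; a standalone sketch:
{noformat}
#include <fcntl.h>
#include <unistd.h>

#include <linux/fs.h>
#include <sys/ioctl.h>

// Clear FS_IMMUTABLE_FL on a single path (requires CAP_LINUX_IMMUTABLE).
// Illustrative only; a real fix would recurse like `chattr -R -i`.
bool clearImmutable(const char* path)
{
  int fd = open(path, O_RDONLY | O_NONBLOCK);
  if (fd < 0) {
    return false;
  }

  int flags = 0;
  bool ok = ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0;

  if (ok && (flags & FS_IMMUTABLE_FL)) {
    flags &= ~FS_IMMUTABLE_FL;
    ok = ioctl(fd, FS_IOC_SETFLAGS, &flags) == 0;
  }

  close(fd);
  return ok;
}
{noformat}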

> Agent does not check for immutable files while removing persistent volumes 
> (and possibly in other GC operations)
> 
>
> Key: MESOS-9977
> URL: https://issues.apache.org/jira/browse/MESOS-9977
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.6.2, 1.7.2, 1.8.1, 1.9.0
>Reporter: Joseph Wu
>Priority: Major
>  Labels: foundations
>
> We observed an exit/crash loop on an agent originating from deleting a 
> persistent volume:
> {code}
> slave.cpp:4557] Deleting persistent volume '' at 
> '/path/to/mesos/slave/volumes/roles/my-role/'
> {code}
> This persistent volume happened to have one (or more) files within marked as 
> {{immutable}}.
> When the agent went to delete this persistent volume, via {{os::rmdir(...)}}, 
> it encountered these immutable file(s) and exited like:
> {code}
> slave.cpp:4423] EXIT with status 1: Failed to sync checkpointed resources: 
> Failed to remove persistent volume '' at 
> '/path/to/mesos/slave/volumes/roles/my-role/': Operation not permitted
> {code}
> The agent would then be unable to start up again, because during recovery, 
> the agent would attempt to delete the same persistent volume and fail to do 
> so.
> Manually removing the immutable attribute from files within the persistent 
> volume allows the agent to recover:
> {code}
> chattr -R -i /path/to/mesos/slave/volumes/roles/my-role/
> {code}
> Immutable attributes can be easily introduced by any tasks running on the 
> agent.  As long as the task has sufficient permissions, it could easily call 
> {{chattr +i ...}}.  This attribute could also affect sandbox GC, which also 
> uses {{os::rmdir}} to clean up.  However, sandbox GC tends to warn rather 
> than exit on failure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10002) Design doc for container bursting

2019-10-30 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-10002:
-

Assignee: Greg Mann

> Design doc for container bursting
> -
>
> Key: MESOS-10002
> URL: https://issues.apache.org/jira/browse/MESOS-10002
> Project: Mesos
>  Issue Type: Task
>  Components: agent, containerization
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Major
>  Labels: foundations
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-9609) Master check failure when marking agent unreachable

2019-10-29 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-9609:


Shepherd:   (was: Benno Evers)
Assignee: (was: Greg Mann)

> Master check failure when marking agent unreachable
> ---
>
> Key: MESOS-9609
> URL: https://issues.apache.org/jira/browse/MESOS-9609
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Greg Mann
>Priority: Critical
>  Labels: foundations, mesosphere
>
> {code}
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.81543313 
> http.cpp:1185] HTTP POST for /master/api/v1/scheduler from 10.142.0.5:55133
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.81558813 
> master.cpp:5467] Processing DECLINE call for offers: [ 
> 5e57f633-a69c-4009-b773-990b4b8984ad-O58323 ] for framework 
> 5e57f633-a69c-4009-b7
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.81569313 
> master.cpp:10703] Removing offer 5e57f633-a69c-4009-b773-990b4b8984ad-O58323
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82014210 
> master.cpp:8227] Marking agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at 
> slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engi
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82036710 
> registrar.cpp:495] Applied 1 operations in 86528ns; attempting to update the 
> registry
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82057210 
> registrar.cpp:552] Successfully updated the registry in 175872ns
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82064211 
> master.cpp:8275] Marked agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at 
> slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engin
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820957 9 
> hierarchical.cpp:609] Removed agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49
> Mar 11 10:04:35 research docker[4503]: F0311 10:04:35.85196111 
> master.cpp:10018] Check failed: 'framework' Must be non NULL
> Mar 11 10:04:35 research docker[4503]: *** Check failure stack trace: ***
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044a7d  
> google::LogMessage::Fail()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6046830  
> google::LogMessage::SendToLog()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044663  
> google::LogMessage::Flush()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6047259  
> google::LogMessageFatal::~LogMessageFatal()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5258e14  
> google::CheckNotNull<>()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521dfc8  
> mesos::internal::master::Master::__removeSlave()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521f1a2  
> mesos::internal::master::Master::_markUnreachable()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5f98f11  
> process::ProcessBase::consume()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb2a4a  
> process::ProcessManager::resume()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb65d6  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c35d4c80  (unknown)
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2de76ba  start_thread
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2b1d41d  (unknown)
> Mar 11 10:04:36 research docker[4503]: *** Aborted at 1520762676 (unix time) 
> try "date -d @1520762676" if you are using GNU date ***
> Mar 11 10:04:36 research docker[4503]: PC: @ 0x7f96c2a4d196 (unknown)
> Mar 11 10:04:36 research docker[4503]: *** SIGSEGV (@0x0) received by PID 1 
> (TID 0x7f96b986d700) from PID 0; stack trace: ***
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2df1390 (unknown)
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2a4d196 (unknown)
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c604ce2c 
> google::DumpStackTraceAndExit()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044a7d 
> google::LogMessage::Fail()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6046830 
> google::LogMessage::SendToLog()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044663 
> google::LogMessage::Flush()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6047259 
> google::LogMessageFatal::~LogMessageFatal()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5258e14 
> google::CheckNotNull<>()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521dfc8 
> mesos::internal::master::Master::__removeSlave()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521f1a2 
> mesos::internal::master::Master::_markUnreachable()
> Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5f98f11 
> process::ProcessBase::consume()
> Mar 11 

[jira] [Comment Edited] (MESOS-9609) Master check failure when marking agent unreachable

2019-10-29 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962491#comment-16962491
 ] 

Greg Mann edited comment on MESOS-9609 at 10/29/19 9:58 PM:


[~arostami] thanks so much for the repro and excellent logs! Much appreciated :)

I took a close look and I believe the following sequence of events leads to the 
crash:

1) The last of the framework’s tasks is removed:
{noformat}
Oct 27 23:21:18 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:21:18.493418 
   15 master.cpp:12171] Removing task 2 with resources cpus(allocated: *):1; 
disk(allocated: *):4024; mem(allocated: *):2048 of framework 
522424c1-2fac-42ab-9a70-b424266218a9- on agent 
522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 
(10.0.143.144)
{noformat}

which means the framework’s entry in {{slave->tasks}} is erased: 
https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L13527-L13529

2) Later, the agent disconnects and since the framework is not checkpointing, 
it is removed from the {{Slave}} struct:
{noformat}
Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248260 
   14 master.cpp:1321] Removing framework 
522424c1-2fac-42ab-9a70-b424266218a9- (toil) from disconnected agent 
522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 
(10.0.143.144) because the framework is not checkpointing
Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248289 
   14 master.cpp:11436] Removing framework 
522424c1-2fac-42ab-9a70-b424266218a9- (toil) from agent 
522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 
(10.0.143.144)
Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248311 
   14 master.cpp:12211] Removing executor 'toil-440' with resources {} of 
framework 522424c1-2fac-42ab-9a70-b424266218a9- on agent 
522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 
(10.0.143.144)
{noformat}

We see no logging related to task removal since 
{{slave->tasks[framework->id()]}} was empty this time. However, since we use 
{{operator[]}} to inspect the task map here, we perform an insertion as a side 
effect: 
https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L11416
This means that {{slave->tasks[framework->id()]}} now exists but has been 
initialized to an empty map. Ruh roh.

3) Very soon after, the framework failover timeout elapses and the framework is 
removed:
{noformat}
Oct 27 23:23:22 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:22.890070 
   11 master.cpp:10224] Framework failover timeout, removing framework 
522424c1-2fac-42ab-9a70-b424266218a9- (toil)
{noformat}

4) Now when {{__removeSlave()}} iterates over the keys of {{slave->tasks}}, it 
finds a key which points to a framework that has already been removed: 
https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L11796-L11800

We need to prevent that unintended map insertion to avoid the crash.
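
For illustration, here is a minimal standalone sketch (hypothetical 
framework-ID strings, not the actual master types) of the {{operator[]}} side 
effect and the side-effect-free {{find()}} alternative:
{code}
#include <cassert>
#include <map>
#include <string>
#include <vector>

int main()
{
  std::map<std::string, std::vector<int>> tasks;

  // Looking up a missing key with operator[] default-constructs and
  // inserts an empty value as a side effect.
  bool empty = tasks["framework-1"].empty();
  assert(empty);
  assert(tasks.count("framework-1") == 1);  // The key now exists!

  // Looking up with find() has no side effect.
  bool absent = (tasks.find("framework-2") == tasks.end());
  assert(absent);
  assert(tasks.count("framework-2") == 0);

  return 0;
}
{code}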

I'll prioritize this fix in the very near future; will update here soon.


was (Author: greggomann):
[~arostami] thanks so much for the repro and excellent logs! Much appreciated :)

I took a close look and I believe the following sequence of events leads to the 
crash:

1) The last of the framework’s tasks is removed:
{noformat}
Oct 27 23:21:18 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:21:18.493418 
   15 master.cpp:12171] Removing task 2 with resources cpus(allocated: *):1; 
disk(allocated: *):4024; mem(allocated: *):2048 of framework 
522424c1-2fac-42ab-9a70-b424266218a9- on agent 
522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 
(10.0.143.144)
{noformat}

which means the framework’s entry in slave->tasks is erased: 
https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L13527-L13529

2) Later, the agent disconnects and since the framework is not checkpointing, 
it is removed from the Slave struct:
{noformat}
Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248260 
   14 master.cpp:1321] Removing framework 
522424c1-2fac-42ab-9a70-b424266218a9- (toil) from disconnected agent 
522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 
(10.0.143.144) because the framework is not checkpointing
Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248289 
   14 master.cpp:11436] Removing framework 
522424c1-2fac-42ab-9a70-b424266218a9- (toil) from agent 
522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 
(10.0.143.144)
Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248311 
   14 master.cpp:12211] Removing executor 'toil-440' with resources {} of 
framework 

[jira] [Comment Edited] (MESOS-9609) Master check failure when marking agent unreachable

2019-10-29 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962491#comment-16962491
 ] 

Greg Mann edited comment on MESOS-9609 at 10/29/19 9:57 PM:


[~arostami] thanks so much for the repro and excellent logs! Much appreciated :)

I took a close look and I believe the following sequence of events leads to the 
crash:

1) The last of the framework’s tasks is removed:
{noformat}
Oct 27 23:21:18 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:21:18.493418 
   15 master.cpp:12171] Removing task 2 with resources cpus(allocated: *):1; 
disk(allocated: *):4024; mem(allocated: *):2048 of framework 
522424c1-2fac-42ab-9a70-b424266218a9- on agent 
522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 
(10.0.143.144)
{noformat}

which means the framework’s entry in slave->tasks is erased: 
https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L13527-L13529

2) Later, the agent disconnects and since the framework is not checkpointing, 
it is removed from the Slave struct:
{noformat}
Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248260 
   14 master.cpp:1321] Removing framework 
522424c1-2fac-42ab-9a70-b424266218a9- (toil) from disconnected agent 
522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 
(10.0.143.144) because the framework is not checkpointing
Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248289 
   14 master.cpp:11436] Removing framework 
522424c1-2fac-42ab-9a70-b424266218a9- (toil) from agent 
522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 
(10.0.143.144)
Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248311 
   14 master.cpp:12211] Removing executor 'toil-440' with resources {} of 
framework 522424c1-2fac-42ab-9a70-b424266218a9- on agent 
522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 
(10.0.143.144)
{noformat}

We see no logging related to task removal since slave->tasks[framework->id()] 
was empty this time. However, since we use operator[] to inspect the task map 
here, we perform an insertion as a side effect: 
https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L11416
This means that slave->tasks[framework->id()] now exists but has been 
initialized to an empty map. Ruh roh.

3) Very soon after, the framework failover timeout elapses and the framework is 
removed:
{noformat}
Oct 27 23:23:22 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:22.890070 
   11 master.cpp:10224] Framework failover timeout, removing framework 
522424c1-2fac-42ab-9a70-b424266218a9- (toil)
{noformat}

4) Now when __removeSlave() iterates over the keys of slave->tasks, it finds a 
key which points to a framework that has already been removed: 
https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L11796-L11800

We need to prevent that unintended map insertion to avoid the crash.

I'll prioritize this fix in the very near future; will update here soon.


was (Author: greggomann):
[~arostami] thanks so much for the repro and excellent logs! Much appreciated :)

I took a close look and I believe the following sequence of events leads to the 
crash:

1) The last of the framework’s tasks is removed:
Oct 27 23:21:18 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:21:18.493418 
   15 master.cpp:12171] Removing task 2 with resources cpus(allocated: *):1; 
disk(allocated: *):4024; mem(allocated: *):2048 of framework 
522424c1-2fac-42ab-9a70-b424266218a9- on agent 
522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 
(10.0.143.144)
which means the framework’s entry in slave->tasks is erased: 
https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L13527-L13529

2) Later, the agent disconnects and since the framework is not checkpointing, 
it is removed from the Slave struct:
Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248260 
   14 master.cpp:1321] Removing framework 
522424c1-2fac-42ab-9a70-b424266218a9- (toil) from disconnected agent 
522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 
(10.0.143.144) because the framework is not checkpointing
Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248289 
   14 master.cpp:11436] Removing framework 
522424c1-2fac-42ab-9a70-b424266218a9- (toil) from agent 
522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 
(10.0.143.144)
Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248311 
   14 master.cpp:12211] Removing executor 'toil-440' with resources {} of 
framework 522424c1-2fac-42ab-9a70-b424266218a9- on agent 

[jira] [Commented] (MESOS-9609) Master check failure when marking agent unreachable

2019-10-29 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962491#comment-16962491
 ] 

Greg Mann commented on MESOS-9609:
--

[~arostami] thanks so much for the repro and excellent logs! Much appreciated :)

I took a close look and I believe the following sequence of events leads to the 
crash:

1) The last of the framework’s tasks is removed:
Oct 27 23:21:18 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:21:18.493418 
   15 master.cpp:12171] Removing task 2 with resources cpus(allocated: *):1; 
disk(allocated: *):4024; mem(allocated: *):2048 of framework 
522424c1-2fac-42ab-9a70-b424266218a9- on agent 
522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 
(10.0.143.144)
which means the framework’s entry in slave->tasks is erased: 
https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L13527-L13529

2) Later, the agent disconnects and since the framework is not checkpointing, 
it is removed from the Slave struct:
Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248260 
   14 master.cpp:1321] Removing framework 
522424c1-2fac-42ab-9a70-b424266218a9- (toil) from disconnected agent 
522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 
(10.0.143.144) because the framework is not checkpointing
Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248289 
   14 master.cpp:11436] Removing framework 
522424c1-2fac-42ab-9a70-b424266218a9- (toil) from agent 
522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 
(10.0.143.144)
Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248311 
   14 master.cpp:12211] Removing executor 'toil-440' with resources {} of 
framework 522424c1-2fac-42ab-9a70-b424266218a9- on agent 
522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 
(10.0.143.144)
We see no logging related to task removal since slave->tasks[framework->id()] 
was empty this time. However, since we use operator[] to inspect the task map 
here, we perform an insertion as a side effect: 
https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L11416
This means that slave->tasks[framework->id()] now exists but has been 
initialized to an empty map. Ruh roh.

3) Very soon after, the framework failover timeout elapses and the framework is 
removed:
Oct 27 23:23:22 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:22.890070 
   11 master.cpp:10224] Framework failover timeout, removing framework 
522424c1-2fac-42ab-9a70-b424266218a9- (toil)

4) Now when __removeSlave() iterates over the keys of slave->tasks, it finds a 
key which points to a framework that has already been removed: 
https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L11796-L11800

We need to prevent that unintended map insertion to avoid the crash.

I'll prioritize this fix in the very near future; will update here soon.

> Master check failure when marking agent unreachable
> ---
>
> Key: MESOS-9609
> URL: https://issues.apache.org/jira/browse/MESOS-9609
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Critical
>  Labels: foundations, mesosphere
> Fix For: 1.9.0
>
>
> {code}
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.81543313 
> http.cpp:1185] HTTP POST for /master/api/v1/scheduler from 10.142.0.5:55133
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.81558813 
> master.cpp:5467] Processing DECLINE call for offers: [ 
> 5e57f633-a69c-4009-b773-990b4b8984ad-O58323 ] for framework 
> 5e57f633-a69c-4009-b7
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.81569313 
> master.cpp:10703] Removing offer 5e57f633-a69c-4009-b773-990b4b8984ad-O58323
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82014210 
> master.cpp:8227] Marking agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at 
> slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engi
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82036710 
> registrar.cpp:495] Applied 1 operations in 86528ns; attempting to update the 
> registry
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82057210 
> registrar.cpp:552] Successfully updated the registry in 175872ns
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.82064211 
> master.cpp:8275] Marked agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at 
> slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engin
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820957 9 
> hierarchical.cpp:609] Removed agent 

[jira] [Commented] (MESOS-10010) Implement an SSL socket for Windows, using OpenSSL directly

2019-10-16 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953103#comment-16953103
 ] 

Greg Mann commented on MESOS-10010:
---

[~kaysoky] I think this should be more fine-grained; will this really complete 
in a single sprint?

> Implement an SSL socket for Windows, using OpenSSL directly
> ---
>
> Key: MESOS-10010
> URL: https://issues.apache.org/jira/browse/MESOS-10010
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations
>
> {code}
> class WindowsSSLSocketImpl : public SocketImpl
> {
> public:
>   // This will be the entry point for Socket::create(SSL).
>   static Try<std::shared_ptr<SocketImpl>> create(int_fd s);
>   WindowsSSLSocketImpl(int_fd _s);
>   ~WindowsSSLSocketImpl() override;
>   // Overrides for the 'SocketImpl' interface below.
>   // Unreachable.
>   Future<Nothing> connect(const Address& address) override;
>   // This will initialize SSL objects then call windows::connect()
>   // and chain that onto the appropriate call to SSL_do_handshake.
>   Future<Nothing> connect(
>       const Address& address,
>       const openssl::TLSClientConfig& config) override;
>   // These will call SSL_read or SSL_write as appropriate.
>   // As long as the SSL context is set up correctly, these will be
>   // thin wrappers.  (More details after the code block.)
>   Future<size_t> recv(char* data, size_t size) override;
>   Future<size_t> send(const char* data, size_t size) override;
>   Future<size_t> sendfile(int_fd fd, off_t offset, size_t size) override;
>   // Nothing SSL here, just a plain old listener.
>   Try<Nothing> listen(int backlog) override;
>   // This will initialize SSL objects then call windows::accept()
>   // and then perform handshaking.  Any downgrading will
>   // happen here.  Since we control the event loop, we can
>   // easily peek at the first few bytes to check SSL-ness.
>   Future<std::shared_ptr<SocketImpl>> accept() override;
>   SocketImpl::Kind kind() const override { return SocketImpl::Kind::SSL; }
> };
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10004) Enable SSL on Windows

2019-10-02 Thread Greg Mann (Jira)
Greg Mann created MESOS-10004:
-

 Summary: Enable SSL on Windows
 Key: MESOS-10004
 URL: https://issues.apache.org/jira/browse/MESOS-10004
 Project: Mesos
  Issue Type: Epic
Reporter: Greg Mann






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10003) Design doc for SSL on Windows

2019-10-02 Thread Greg Mann (Jira)
Greg Mann created MESOS-10003:
-

 Summary: Design doc for SSL on Windows
 Key: MESOS-10003
 URL: https://issues.apache.org/jira/browse/MESOS-10003
 Project: Mesos
  Issue Type: Task
  Components: libprocess
Reporter: Greg Mann






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10002) Design doc for container bursting

2019-10-02 Thread Greg Mann (Jira)
Greg Mann created MESOS-10002:
-

 Summary: Design doc for container bursting
 Key: MESOS-10002
 URL: https://issues.apache.org/jira/browse/MESOS-10002
 Project: Mesos
  Issue Type: Task
  Components: agent, containerization
Reporter: Greg Mann






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10001) Container bursting for CPU/mem

2019-10-02 Thread Greg Mann (Jira)
Greg Mann created MESOS-10001:
-

 Summary: Container bursting for CPU/mem
 Key: MESOS-10001
 URL: https://issues.apache.org/jira/browse/MESOS-10001
 Project: Mesos
  Issue Type: Epic
  Components: agent, containerization
Reporter: Greg Mann






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9971) Mesos failed to build due to error MSB6006 on Windows with MSVC.

2019-09-18 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932747#comment-16932747
 ] 

Greg Mann commented on MESOS-9971:
--

[~kaysoky] have you seen this before?

> Mesos failed to build due to error MSB6006 on Windows with MSVC.
> 
>
> Key: MESOS-9971
> URL: https://issues.apache.org/jira/browse/MESOS-9971
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: master
> Environment: VS 2017 + Windows Server 2016
>Reporter: LinGao
>Priority: Major
> Attachments: log_x64_build.log
>
>
> Mesos failed to build due to error MSB6006: "cmd.exe" exited with code 1 on 
> Windows using MSVC. It can first be reproduced at revision e0f7e2d on the 
> master branch. Could you please take a look at this issue? Thanks a lot!
> Reproduce steps:
> 1. git clone -c core.autocrlf=true [https://github.com/apache/mesos] 
> D:\mesos\src
>  2. Open a VS 2017 x64 command prompt as admin and browse to D:\mesos
>  3. cd src
>  4. .\bootstrap.bat
>  5. cd ..
>  6. mkdir build_x64 && pushd build_x64
>  7. cmake ..\src -G "Visual Studio 15 2017 Win64" 
> -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64
>  8. msbuild Mesos.sln /p:Configuration=Debug /p:Platform=x64 /maxcpucount:4 
> /t:Rebuild
>  
> ErrorMessage:
> 67>PrepareForBuild:
>  Creating directory "x64\Debug\dist\dist.tlog\".
>    InitializeBuildStatus:
>  Creating "x64\Debug\dist\dist.tlog\unsuccessfulbuild" because 
> "AlwaysCreate" was specified.
> 67>C:\Program Files (x86)\Microsoft Visual 
> Studio\2017\Enterprise\Common7\IDE\VC\VCTargets\Microsoft.CppCommon.targets(209,5):
>  error MSB6006: "cmd.exe" exited with code 1. 
> [D:\Mesos\build_x64\dist.vcxproj]
> 67>Done Building Project "D:\Mesos\build_x64\dist.vcxproj" (Rebuild 
> target(s)) -- FAILED.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9965) agent should not send `TASK_GONE_BY_OPERATOR` if the framework is not partition aware.

2019-09-12 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928984#comment-16928984
 ] 

Greg Mann commented on MESOS-9965:
--

1.9.x:
{noformat}
commit d8520b0b4bf52fd27be45817934e2af1b871c399
Author: Greg Mann 
Date:   Thu Sep 12 16:33:20 2019 -0700

Fixed a bug for non-partition-aware schedulers.

Previously, the agent would send task status updates with the state
TASK_GONE_BY_OPERATOR to all schedulers when an agent was drained
with the `mark_gone` parameter set to `true`.

This patch updates this code to ensure that TASK_GONE_BY_OPERATOR
is only sent to partition-aware schedulers.

Review: https://reviews.apache.org/r/71480/
{noformat}
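
As an illustration of the fix's idea, here is a minimal self-contained sketch 
(hypothetical types, not the actual agent code) of downgrading the terminal 
state for schedulers lacking the PARTITION_AWARE capability:
{code}
#include <iostream>

enum class TaskState { TASK_LOST, TASK_GONE_BY_OPERATOR };

struct Capabilities { bool partitionAware; };

// Legacy (non-partition-aware) schedulers only understand TASK_LOST,
// so the terminal state is downgraded for them.
TaskState stateForUpdate(const Capabilities& capabilities)
{
  return capabilities.partitionAware
    ? TaskState::TASK_GONE_BY_OPERATOR
    : TaskState::TASK_LOST;
}

int main()
{
  // A non-partition-aware scheduler receives TASK_LOST.
  std::cout
    << (stateForUpdate({false}) == TaskState::TASK_LOST) << std::endl;  // 1
  return 0;
}
{code}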

> agent should not send `TASK_GONE_BY_OPERATOR` if the framework is not 
> partition aware.
> --
>
> Key: MESOS-9965
> URL: https://issues.apache.org/jira/browse/MESOS-9965
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Gilbert Song
>Assignee: Greg Mann
>Priority: Major
>  Labels: foundations
> Fix For: 1.10, 1.9.1
>
>
> The Mesos agent should not send `TASK_GONE_BY_OPERATOR` if the framework is 
> not partition-aware. We should check the framework's capabilities and send 
> different updates to legacy frameworks.
> The issue is exposed here:
> https://github.com/apache/mesos/blob/f0be23765531b05661ed7f1b124faf96744aa80b/src/slave/slave.cpp#L5803
> An example to follow:
> https://github.com/apache/mesos/blob/f0be23765531b05661ed7f1b124faf96744aa80b/src/master/master.cpp#L9921



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (MESOS-9965) agent should not send `TASK_GONE_BY_OPERATOR` if the framework is not partition aware.

2019-09-12 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928979#comment-16928979
 ] 

Greg Mann edited comment on MESOS-9965 at 9/13/19 3:03 AM:
---

master:
{noformat}
commit 8e1a51207304589a6521cff3540e0705fe1533ff
Author: Greg Mann 
Date:   Thu Sep 12 16:33:20 2019 -0700

Fixed a bug for non-partition-aware schedulers.

Previously, the agent would send task status updates with the state
TASK_GONE_BY_OPERATOR to all schedulers when an agent was drained
with the `mark_gone` parameter set to `true`.

This patch updates this code to ensure that TASK_GONE_BY_OPERATOR
is only sent to partition-aware schedulers.

Review: https://reviews.apache.org/r/71480/
{noformat}


was (Author: greggomann):
{noformat}
commit 8e1a51207304589a6521cff3540e0705fe1533ff
Author: Greg Mann 
Date:   Thu Sep 12 16:33:20 2019 -0700

Fixed a bug for non-partition-aware schedulers.

Previously, the agent would send task status updates with the state
TASK_GONE_BY_OPERATOR to all schedulers when an agent was drained
with the `mark_gone` parameter set to `true`.

This patch updates this code to ensure that TASK_GONE_BY_OPERATOR
is only sent to partition-aware schedulers.

Review: https://reviews.apache.org/r/71480/
{noformat}

> agent should not send `TASK_GONE_BY_OPERATOR` if the framework is not 
> partition aware.
> --
>
> Key: MESOS-9965
> URL: https://issues.apache.org/jira/browse/MESOS-9965
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Gilbert Song
>Assignee: Greg Mann
>Priority: Major
>  Labels: foundations
> Fix For: 1.10, 1.9.1
>
>
> The Mesos agent should not send `TASK_GONE_BY_OPERATOR` if the framework is 
> not partition-aware. We should check the framework's capabilities and send 
> different updates to legacy frameworks.
> The issue is exposed here:
> https://github.com/apache/mesos/blob/f0be23765531b05661ed7f1b124faf96744aa80b/src/slave/slave.cpp#L5803
> An example to follow:
> https://github.com/apache/mesos/blob/f0be23765531b05661ed7f1b124faf96744aa80b/src/master/master.cpp#L9921



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (MESOS-9965) agent should not send `TASK_GONE_BY_OPERATOR` if the framework is not partition aware.

2019-09-12 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-9965:


Assignee: Greg Mann

> agent should not send `TASK_GONE_BY_OPERATOR` if the framework is not 
> partition aware.
> --
>
> Key: MESOS-9965
> URL: https://issues.apache.org/jira/browse/MESOS-9965
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Gilbert Song
>Assignee: Greg Mann
>Priority: Major
>  Labels: foundations
>
> The Mesos agent should not send `TASK_GONE_BY_OPERATOR` if the framework is 
> not partition-aware. We should check the framework's capabilities and send 
> different updates to legacy frameworks.
> The issue is exposed here:
> https://github.com/apache/mesos/blob/f0be23765531b05661ed7f1b124faf96744aa80b/src/slave/slave.cpp#L5803
> An example to follow:
> https://github.com/apache/mesos/blob/f0be23765531b05661ed7f1b124faf96744aa80b/src/master/master.cpp#L9921



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (MESOS-9957) Sequence all operations on the agent

2019-08-30 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919881#comment-16919881
 ] 

Greg Mann commented on MESOS-9957:
--

See the following review, which includes a test illustrating this type of 
failure: https://reviews.apache.org/r/71417/

> Sequence all operations on the agent
> 
>
> Key: MESOS-9957
> URL: https://issues.apache.org/jira/browse/MESOS-9957
> Project: Mesos
>  Issue Type: Task
>Reporter: Greg Mann
>Priority: Major
>  Labels: foundations, mesosphere
>
> The resolution of MESOS-8582 requires that an asynchronous step be added to 
> the code path which applies speculative operations like RESERVE and CREATE on 
> the agent. In order to ensure that the {{FrameworkInfo}} associated with an 
> incoming operation will be successfully retained, we must first unschedule GC 
> on the framework meta directory if the framework struct does not exist but 
> that directory does. By introducing this asynchronous step, we allow the 
> possibility that an operation may be executed out-of-order with respect to an 
> incoming dependent LAUNCH or LAUNCH_GROUP.
> For example, if a scheduler issues an ACCEPT call containing both a RESERVE 
> operation and a LAUNCH operation with a task which consumes the newly 
> reserved resources, it's possible that this task will be launched on the 
> agent before the reserved resources exist.
> While we already [sequence task launches on a per-executor 
> basis|https://github.com/apache/mesos/blob/9297e2d3b0d44b553fc89bcf5f6109c76cc53668/src/slave/slave.cpp#L2337-L2408],
>  the aforementioned corner case requires that we sequence _all_ offer 
> operations on a per-framework basis.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (MESOS-9957) Sequence all operations on the agent

2019-08-30 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919879#comment-16919879
 ] 

Greg Mann commented on MESOS-9957:
--

One approach that could be taken here is to eliminate the per-executor 
{{Sequence}}s in the {{taskLaunchSequences}} map, and instead put a single 
{{Sequence operationSequence}} member in the {{Framework}} struct. The 
{{taskLaunch}} futures from the {{run()}} code path could likely be added into 
that sequence as-is, with the {{applyOperation()}} code path adding new futures 
to that sequence as well.
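
To illustrate the idea (a standalone sketch, not the libprocess {{Sequence}} 
API), here every added operation is chained onto the previous one's future, so 
a RESERVE enqueued before a dependent LAUNCH is guaranteed to apply first:
{code}
#include <functional>
#include <future>
#include <iostream>

class OperationSequence
{
public:
  // Each added operation runs only after all previously added ones.
  void add(std::function<void()> op)
  {
    tail = std::async(
        std::launch::async,
        [prev = std::move(tail), op = std::move(op)]() {
          if (prev.valid()) {
            prev.wait();  // Wait for the previous operation to finish.
          }
          op();
        });
  }

  void wait()
  {
    if (tail.valid()) {
      tail.wait();
    }
  }

private:
  std::future<void> tail;
};

int main()
{
  OperationSequence sequence;
  sequence.add([] { std::cout << "RESERVE applied" << std::endl; });
  sequence.add([] { std::cout << "LAUNCH runs after RESERVE" << std::endl; });
  sequence.wait();
  return 0;
}
{code}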

> Sequence all operations on the agent
> 
>
> Key: MESOS-9957
> URL: https://issues.apache.org/jira/browse/MESOS-9957
> Project: Mesos
>  Issue Type: Task
>Reporter: Greg Mann
>Priority: Major
>  Labels: foundations, mesosphere
>
> The resolution of MESOS-8582 requires that an asynchronous step be added to 
> the code path which applies speculative operations like RESERVE and CREATE on 
> the agent. In order to ensure that the {{FrameworkInfo}} associated with an 
> incoming operation will be successfully retained, we must first unschedule GC 
> on the framework meta directory if the framework struct does not exist but 
> that directory does. By introducing this asynchronous step, we allow the 
> possibility that an operation may be executed out-of-order with respect to an 
> incoming dependent LAUNCH or LAUNCH_GROUP.
> For example, if a scheduler issues an ACCEPT call containing both a RESERVE 
> operation and a LAUNCH operation with a task which consumes the newly 
> reserved resources, it's possible that this task will be launched on the 
> agent before the reserved resources exist.
> While we already [sequence task launches on a per-executor 
> basis|https://github.com/apache/mesos/blob/9297e2d3b0d44b553fc89bcf5f6109c76cc53668/src/slave/slave.cpp#L2337-L2408],
>  the aforementioned corner case requires that we sequence _all_ offer 
> operations on a per-framework basis.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (MESOS-9957) Sequence all operations on the agent

2019-08-30 Thread Greg Mann (Jira)
Greg Mann created MESOS-9957:


 Summary: Sequence all operations on the agent
 Key: MESOS-9957
 URL: https://issues.apache.org/jira/browse/MESOS-9957
 Project: Mesos
  Issue Type: Task
Reporter: Greg Mann


The resolution of MESOS-8582 requires that an asynchronous step be added to the 
code path which applies speculative operations like RESERVE and CREATE on the 
agent. In order to ensure that the {{FrameworkInfo}} associated with an 
incoming operation will be successfully retained, we must first unschedule GC 
on the framework meta directory if the framework struct does not exist but that 
directory does. By introducing this asynchronous step, we allow the possibility 
that an operation may be executed out-of-order with respect to an incoming 
dependent LAUNCH or LAUNCH_GROUP.

For example, if a scheduler issues an ACCEPT call containing both a RESERVE 
operation and a LAUNCH operation with a task which consumes the newly reserved 
resources, it's possible that this task will be launched on the agent before 
the reserved resources exist.

While we already [sequence task launches on a per-executor 
basis|https://github.com/apache/mesos/blob/9297e2d3b0d44b553fc89bcf5f6109c76cc53668/src/slave/slave.cpp#L2337-L2408],
 the aforementioned corner case requires that we sequence _all_ offer 
operations on a per-framework basis.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (MESOS-9954) Flapping tasks with large sandboxes can fill agent disk

2019-08-26 Thread Greg Mann (Jira)
Greg Mann created MESOS-9954:


 Summary: Flapping tasks with large sandboxes can fill agent disk
 Key: MESOS-9954
 URL: https://issues.apache.org/jira/browse/MESOS-9954
 Project: Mesos
  Issue Type: Bug
Reporter: Greg Mann


If a task on an agent is repeatedly re-launched after failing and pulls a large 
artifact into its sandbox each time, it can fill the agent disk on a time scale 
shorter than the disk watch interval, before the agent's periodic disk check 
can react.

We should evaluate solutions to this issue. A couple options:
* Perhaps an aggressive (short) disk watch interval is sufficient? We should 
investigate the performance impact of this approach.
* If the former doesn't work, then maybe polling free disk space whenever a 
task is launched makes sense? (Rate-limiting this might be necessary; see the 
sketch below.)
* Perhaps we can come up with some fundamentally different approach for 
detecting free disk space which would solve this issue?
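
As a sketch of the rate-limited polling option (hypothetical interval and 
names, not agent code), a limiter that lets a task-launch-triggered disk poll 
through at most once per interval:
{code}
#include <chrono>
#include <iostream>

class RateLimitedDiskCheck
{
public:
  explicit RateLimitedDiskCheck(std::chrono::seconds interval)
    : interval(interval) {}

  // Returns true if enough time has elapsed to poll disk usage again.
  bool shouldPoll()
  {
    const auto now = std::chrono::steady_clock::now();
    if (now - last < interval) {
      return false;
    }
    last = now;
    return true;
  }

private:
  std::chrono::seconds interval;
  std::chrono::steady_clock::time_point last{};
};

int main()
{
  RateLimitedDiskCheck check(std::chrono::seconds(5));
  std::cout << check.shouldPoll() << std::endl;  // 1: first poll allowed
  std::cout << check.shouldPoll() << std::endl;  // 0: suppressed by limiter
  return 0;
}
{code}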



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (MESOS-9545) Marking an unreachable agent as gone should transition the tasks to terminal state

2019-08-20 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911836#comment-16911836
 ] 

Greg Mann edited comment on MESOS-9545 at 8/21/19 1:40 AM:
---

1.8.x:
{noformat}
commit 13e4cd1c42ae88094f14d6b05cfb9832d4494193
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 6f90cc334701fad10e721312cd4cbd0690e1c6ec
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}


1.7.x:
{noformat}
commit 61f1155675bd3bc5312e0501ea6182d2ee7434af
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 0c5e78bc26653d26a03b08b82923ea517de46fc0
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}


1.6.x:
{noformat}
commit c6da50d10511a1046b8d4bc563dc3ccee875
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 6a9cee7999be0a3a4f89d21ec58947fe90c01eeb
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}


was (Author: greggomann):
1.8.x:
{noformat}
commit 13e4cd1c42ae88094f14d6b05cfb9832d4494193
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 6f90cc334701fad10e721312cd4cbd0690e1c6ec
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}

1.7.x:
{noformat}
commit 61f1155675bd3bc5312e0501ea6182d2ee7434af
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 0c5e78bc26653d26a03b08b82923ea517de46fc0
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}

1.6.x:
{noformat}
commit c6da50d10511a1046b8d4bc563dc3ccee875 (HEAD -> 1.6.x, origin/1.6.x, 
mesos-private/ci/greg/mesos-9545-1.6.x, ci/greg/mesos-9545-1.6.x)
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an 

[jira] [Comment Edited] (MESOS-9545) Marking an unreachable agent as gone should transition the tasks to terminal state

2019-08-20 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911836#comment-16911836
 ] 

Greg Mann edited comment on MESOS-9545 at 8/21/19 1:40 AM:
---

1.8.x:
{noformat}
commit 13e4cd1c42ae88094f14d6b05cfb9832d4494193
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 6f90cc334701fad10e721312cd4cbd0690e1c6ec
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}

1.7.x:
{noformat}
commit 61f1155675bd3bc5312e0501ea6182d2ee7434af
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 0c5e78bc26653d26a03b08b82923ea517de46fc0
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}

1.6.x:
{noformat}
commit c6da50d10511a1046b8d4bc563dc3ccee875 (HEAD -> 1.6.x, origin/1.6.x, 
mesos-private/ci/greg/mesos-9545-1.6.x, ci/greg/mesos-9545-1.6.x)
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 6a9cee7999be0a3a4f89d21ec58947fe90c01eeb
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}


was (Author: greggomann):
1.8.x:
{noformat}
commit 13e4cd1c42ae88094f14d6b05cfb9832d4494193
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 6f90cc334701fad10e721312cd4cbd0690e1c6ec
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}

1.7.x:
{noformat}
commit 61f1155675bd3bc5312e0501ea6182d2ee7434af
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 0c5e78bc26653d26a03b08b82923ea517de46fc0
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}

> Marking an unreachable agent as gone should transition the tasks to terminal 
> state
> 

[jira] [Commented] (MESOS-9937) 53598228fe should be backported to 1.7.x

2019-08-20 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911854#comment-16911854
 ] 

Greg Mann commented on MESOS-9937:
--

[~carlone], this is done, see the commit below:
{noformat}
commit 0c5e78bc26653d26a03b08b82923ea517de46fc0
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}

> 53598228fe should be backported to 1.7.x
> 
>
> Key: MESOS-9937
> URL: https://issues.apache.org/jira/browse/MESOS-9937
> Project: Mesos
>  Issue Type: Bug
>Reporter: longfei
>Assignee: Greg Mann
>Priority: Blocker
>  Labels: foundations
>
> Commit 53598228fe on the master branch should be backported to 1.7.x. 
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (MESOS-9545) Marking an unreachable agent as gone should transition the tasks to terminal state

2019-08-20 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911836#comment-16911836
 ] 

Greg Mann edited comment on MESOS-9545 at 8/21/19 1:25 AM:
---

1.8.x:
{noformat}
commit 13e4cd1c42ae88094f14d6b05cfb9832d4494193
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 6f90cc334701fad10e721312cd4cbd0690e1c6ec
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}

1.7.x:
{noformat}
commit 61f1155675bd3bc5312e0501ea6182d2ee7434af
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 0c5e78bc26653d26a03b08b82923ea517de46fc0
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}


was (Author: greggomann):
1.8.x:
{noformat}
commit 13e4cd1c42ae88094f14d6b05cfb9832d4494193
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 6f90cc334701fad10e721312cd4cbd0690e1c6ec
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}

> Marking an unreachable agent as gone should transition the tasks to terminal 
> state
> --
>
> Key: MESOS-9545
> URL: https://issues.apache.org/jira/browse/MESOS-9545
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Greg Mann
>Priority: Major
>  Labels: foundations
> Fix For: 1.9.0
>
>
> If an unreachable agent is marked as gone, currently the master just marks 
> that agent in the registry but doesn't do anything about its tasks. So the 
> tasks remain in UNREACHABLE state in the master forever, until the master 
> fails over. This is not great UX. We should transition them to a terminal 
> state instead.
> This fix should also include a test to verify the new behavior.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (MESOS-9545) Marking an unreachable agent as gone should transition the tasks to terminal state

2019-08-20 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911836#comment-16911836
 ] 

Greg Mann commented on MESOS-9545:
--

1.8.x:
{noformat}
commit 13e4cd1c42ae88094f14d6b05cfb9832d4494193
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 6f90cc334701fad10e721312cd4cbd0690e1c6ec
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}

> Marking an unreachable agent as gone should transition the tasks to terminal 
> state
> --
>
> Key: MESOS-9545
> URL: https://issues.apache.org/jira/browse/MESOS-9545
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Greg Mann
>Priority: Major
>  Labels: foundations
> Fix For: 1.9.0
>
>
> If an unreachable agent is marked as gone, currently the master just marks 
> that agent in the registry but doesn't do anything about its tasks. So the 
> tasks remain in UNREACHABLE state in the master forever, until the master 
> fails over. This is not great UX. We should transition them to a terminal 
> state instead.
> This fix should also include a test to verify the new behavior.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (MESOS-9946) DefaultExecutorTest.ROOT_INTERNET_CURL_DockerTaskWithFileURI is flaky

2019-08-20 Thread Greg Mann (Jira)
Greg Mann created MESOS-9946:


 Summary: 
DefaultExecutorTest.ROOT_INTERNET_CURL_DockerTaskWithFileURI is flaky
 Key: MESOS-9946
 URL: https://issues.apache.org/jira/browse/MESOS-9946
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Greg Mann


Observed this on a 1.8.x build. Based on the logs, I suspect it's due to a 
slow image pull.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (MESOS-9945) Use streaming response in the checker process

2019-08-20 Thread Greg Mann (Jira)
Greg Mann created MESOS-9945:


 Summary: Use streaming response in the checker process
 Key: MESOS-9945
 URL: https://issues.apache.org/jira/browse/MESOS-9945
 Project: Mesos
  Issue Type: Improvement
Reporter: Greg Mann


Because we do not currently use a streaming response for nested container 
command health checks in the checker process, we are not able to display the 
output of failed checks (MESOS-7903), and we are not able to begin the health 
check timeout at the appropriate moment (MESOS-9944).

We should update the checker process to use a streaming response for the 
LAUNCH_NESTED_CONTAINER_SESSION call that it uses to initiate command health 
checks.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (MESOS-9944) Command health check timeout begins too early

2019-08-20 Thread Greg Mann (Jira)
Greg Mann created MESOS-9944:


 Summary: Command health check timeout begins too early
 Key: MESOS-9944
 URL: https://issues.apache.org/jira/browse/MESOS-9944
 Project: Mesos
  Issue Type: Bug
  Components: agent
Affects Versions: 1.9.0
Reporter: Greg Mann


The checker process begins the timer for the command health check timeout when 
the LAUNCH_NESTED_CONTAINER_SESSION request is first sent, which means any 
delay in the execution of the health check command is included in the health 
check timeout. This can be an issue when the agent is under heavy load, and it 
may take a few seconds for the health check command to be run.

Once we have a streaming response for the ATTACH_CONTAINER_OUTPUT call which 
follows the nested container launch, we can initiate the health check timeout 
once the first byte of the response is received; this is a more accurate signal 
that the health check command has begun running.
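
A minimal standalone sketch (hypothetical names, not the checker's actual 
interface) of starting the timeout clock at the first received byte rather 
than at request send:
{code}
#include <chrono>
#include <iostream>
#include <optional>

class HealthCheckTimer
{
public:
  // Called for each chunk of the streaming response.
  void onData()
  {
    if (!start) {
      // First byte observed: the check command has actually started.
      start = std::chrono::steady_clock::now();
    }
  }

  // The timeout only applies once the clock has started.
  bool timedOut(std::chrono::seconds timeout) const
  {
    return start &&
      (std::chrono::steady_clock::now() - *start > timeout);
  }

private:
  std::optional<std::chrono::steady_clock::time_point> start;
};

int main()
{
  HealthCheckTimer timer;
  timer.onData();  // Timeout clock starts here, not at request send.
  std::cout << timer.timedOut(std::chrono::seconds(10)) << std::endl;  // 0
  return 0;
}
{code}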



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (MESOS-9545) Marking an unreachable agent as gone should transition the tasks to terminal state

2019-08-13 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906587#comment-16906587
 ] 

Greg Mann commented on MESOS-9545:
--

[~vinodkone] thanks for the ping - I have these backports in progress but got 
distracted, will make this happen this week.

> Marking an unreachable agent as gone should transition the tasks to terminal 
> state
> --
>
> Key: MESOS-9545
> URL: https://issues.apache.org/jira/browse/MESOS-9545
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Greg Mann
>Priority: Major
>  Labels: foundations
> Fix For: 1.9.0
>
>
> If an unreachable agent is marked as gone, currently the master just marks 
> that agent in the registry but doesn't do anything about its tasks. So the 
> tasks remain in UNREACHABLE state in the master forever, until the master 
> fails over. This is not great UX. We should transition them to a terminal 
> state instead.
> This fix should also include a test to verify the new behavior.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (MESOS-9938) Standalone container documentation

2019-08-13 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906579#comment-16906579
 ] 

Greg Mann commented on MESOS-9938:
--

Review here: https://reviews.apache.org/r/65112/

> Standalone container documentation
> --
>
> Key: MESOS-9938
> URL: https://issues.apache.org/jira/browse/MESOS-9938
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Greg Mann
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations, mesosphere
>
> We should add documentation for standalone containers.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (MESOS-9938) Standalone container documentation

2019-08-13 Thread Greg Mann (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-9938:


Assignee: Joseph Wu

> Standalone container documentation
> --
>
> Key: MESOS-9938
> URL: https://issues.apache.org/jira/browse/MESOS-9938
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Greg Mann
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations, mesosphere
>
> We should add documentation for standalone containers.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (MESOS-9938) Standalone container documentation

2019-08-13 Thread Greg Mann (JIRA)
Greg Mann created MESOS-9938:


 Summary: Standalone container documentation
 Key: MESOS-9938
 URL: https://issues.apache.org/jira/browse/MESOS-9938
 Project: Mesos
  Issue Type: Documentation
  Components: documentation
Reporter: Greg Mann


We should add documentation for standalone containers.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (MESOS-9937) 53598228fe should be backported to 1.7.x

2019-08-13 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906559#comment-16906559
 ] 

Greg Mann commented on MESOS-9937:
--

[~carlone] good timing! I was already planning to backport that commit as part 
of backporting MESOS-9545, which I had previously overlooked. Should happen in 
the next couple of days.

> 53598228fe should be backported to 1.7.x
> 
>
> Key: MESOS-9937
> URL: https://issues.apache.org/jira/browse/MESOS-9937
> Project: Mesos
>  Issue Type: Bug
>Reporter: longfei
>Assignee: Greg Mann
>Priority: Blocker
>  Labels: foundations
>
> Commit 53598228fe on the master branch should be backported to 1.7.x. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (MESOS-9931) StorageLocalResourceProviderTest.ROOT_NewVolumeRecovery is flaky

2019-08-09 Thread Greg Mann (JIRA)
Greg Mann created MESOS-9931:


 Summary: StorageLocalResourceProviderTest.ROOT_NewVolumeRecovery 
is flaky
 Key: MESOS-9931
 URL: https://issues.apache.org/jira/browse/MESOS-9931
 Project: Mesos
  Issue Type: Bug
  Components: resource provider, test
 Environment: Ubuntu 14.04, SSL-enabled
Reporter: Greg Mann


{noformat}
20:03:04 [ RUN  ] StorageLocalResourceProviderTest.ROOT_NewVolumeRecovery
20:03:04 I0808 20:03:04.748040 10423 cluster.cpp:172] Creating default 'local' 
authorizer
20:03:04 I0808 20:03:04.749110 23206 master.cpp:467] Master 
6180f181-f8df-4125-a7bd-a1b5ff9f16f1 (ip-172-16-10-204.ec2.internal) started on 
172.16.10.204:57833
20:03:04 I0808 20:03:04.749131 23206 master.cpp:469] Flags at startup: 
--acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/nLxkqj/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/nLxkqj/master" --zk_session_timeout="10secs"
20:03:04 I0808 20:03:04.749269 23206 master.cpp:518] Master only allowing 
authenticated frameworks to register
20:03:04 I0808 20:03:04.749276 23206 master.cpp:524] Master only allowing 
authenticated agents to register
20:03:04 I0808 20:03:04.749281 23206 master.cpp:530] Master only allowing 
authenticated HTTP frameworks to register
20:03:04 I0808 20:03:04.749287 23206 credentials.hpp:37] Loading credentials 
for authentication from '/tmp/nLxkqj/credentials'
20:03:04 I0808 20:03:04.749378 23206 master.cpp:574] Using default 'crammd5' 
authenticator
20:03:04 I0808 20:03:04.749428 23206 http.cpp:1049] Creating default 'basic' 
HTTP authenticator for realm 'mesos-master-readonly'
20:03:04 I0808 20:03:04.749473 23206 http.cpp:1049] Creating default 'basic' 
HTTP authenticator for realm 'mesos-master-readwrite'
20:03:04 I0808 20:03:04.749497 23206 http.cpp:1049] Creating default 'basic' 
HTTP authenticator for realm 'mesos-master-scheduler'
20:03:04 I0808 20:03:04.749522 23206 master.cpp:653] Authorization enabled
20:03:04 I0808 20:03:04.749619 23208 hierarchical.cpp:219] Initialized 
hierarchical allocator process
20:03:04 I0808 20:03:04.749645 23208 whitelist_watcher.cpp:77] No whitelist 
given
20:03:04 I0808 20:03:04.750432 23206 master.cpp:2265] Elected as the leading 
master!
20:03:04 I0808 20:03:04.750452 23206 master.cpp:1730] Recovering from registrar
20:03:04 I0808 20:03:04.750491 23206 registrar.cpp:347] Recovering registrar
20:03:04 I0808 20:03:04.750617 23206 registrar.cpp:391] Successfully fetched 
the registry (0B) in 112896ns
20:03:04 I0808 20:03:04.750648 23206 registrar.cpp:495] Applied 1 operations in 
7349ns; attempting to update the registry
20:03:04 I0808 20:03:04.750763 23204 registrar.cpp:552] Successfully updated 
the registry in 97024ns
20:03:04 I0808 20:03:04.750794 23204 registrar.cpp:424] Successfully recovered 
registrar
20:03:04 I0808 20:03:04.750921 23204 master.cpp:1843] Recovered 0 agents from 
the registry (176B); allowing 10mins for agents to re-register
20:03:04 I0808 20:03:04.750958 23204 hierarchical.cpp:257] Skipping recovery of 
hierarchical allocator: nothing to recover
20:03:04 W0808 20:03:04.752594 10423 process.cpp:2745] Attempted to spawn 
already running process files@172.16.10.204:57833
20:03:04 I0808 20:03:04.753041 10423 containerizer.cpp:309] Using isolation { 
environment_secret, volume/sandbox_path, filesystem/linux, network/cni, 
volume/image, volume/host_path }
20:03:04 I0808 20:03:04.756779 10423 linux_launcher.cpp:145] Using 
/sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
20:03:04 I0808 20:03:04.757201 10423 provisioner.cpp:299] Using default backend 
'aufs'
20:03:04 I0808 20:03:04.757702 10423 linux.cpp:152] Bind 
{noformat}

[jira] [Commented] (MESOS-9875) Mesos did not respond correctly when operations should fail

2019-08-06 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901216#comment-16901216
 ] 

Greg Mann commented on MESOS-9875:
--

I'm trying to figure out how to address this with the information that we 
currently checkpoint on the agent. The old-style checkpointing on the agent 
went like this:
1) Checkpoint resources to a "target file"
2) Sync the checkpointed resources to disk, which creates the persistent volumes
3) If #2 succeeds, move the "target file" to the actual checkpoint location 
(see the sketch below)
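For illustration, here is a minimal sketch of that two-phase pattern. This is 
not the actual Mesos code; {{applySideEffects()}} is a hypothetical stand-in 
for step #2:

{noformat}
#include <cstdio>   // std::rename
#include <fstream>
#include <string>

// Hypothetical stand-in for step #2 above (e.g., creating the persistent
// volume directories on disk).
bool applySideEffects(const std::string& state)
{
  (void)state;
  return true;
}

// Two-phase checkpoint: a crash at any point leaves either the old
// checkpoint or a complete new one on disk, never a torn state.
bool checkpoint(const std::string& path, const std::string& state)
{
  const std::string target = path + ".target";

  // Phase 1: write the new state to a temporary "target file".
  std::ofstream out(target, std::ios::trunc);
  out << state;
  out.close();
  if (!out.good()) {
    return false;
  }

  // Phase 2: apply the side effects; only if they succeed do we atomically
  // move the target file into the real checkpoint location.
  if (!applySideEffects(state)) {
    return false;
  }

  return std::rename(target.c_str(), path.c_str()) == 0;
}
{noformat}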

When implementing operation feedback, we thought we could get away without this 
two-phase checkpointing, since we now have the operation feedback streams which 
we can use as another source of information. When recovering in the agent, we 
have some logic which inspects both the checkpointed resources/operations as 
well as the operation streams checkpointed by the operation status update 
manager in order to recover properly.

It's possible that we could use the old-style checkpointed resource files to 
accomplish recovery now (we still write those to disk to enable agent 
downgrades), but I'm worried that this would be confusing. Then again, perhaps 
it's already confusing :)

I'll try to have a patch up by EOD with a solution for you to look at.

> Mesos did not respond correctly when operations should fail
> ---
>
> Key: MESOS-9875
> URL: https://issues.apache.org/jira/browse/MESOS-9875
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Yifan Xing
>Assignee: Greg Mann
>Priority: Major
>  Labels: foundations, mesosphere
> Attachments: Screen Shot 2019-06-27 at 15.07.20.png
>
>
> For testing persistent volumes with {{OPERATION_FAILED/ERROR}} feedback, we 
> sshed into the Mesos agent and made it unable to create subdirectories in 
> {{/srv/mesos/work/volumes}}; however, Mesos did not return any operation-failed 
> response. Instead, we received {{OPERATION_FINISHED}} feedback.
> Steps to recreate the issue:
> 1. Ssh into a Mesos agent.
>  2. Make it impossible to create a persistent volume (we expect the agent to 
> crash and reregister, and the master to realize that the operation is 
> {{OPERATION_DROPPED}}):
>  * cd /srv/mesos/work (if it doesn't exist, mkdir /srv/mesos/work/volumes)
>  * chattr -RV +i volumes (then no subdirectories can be created)
> 3. Launch a service with persistent volumes, with the constraint of only using 
> the agent modified above.
>  
>  
> Logs for the scheduler for receiving `OPERATION_FINISHED`:
> (Also see screenshot)
>  
> 2019-06-27 21:57:11.879 [12768651|rdar://12768651] 
> [Jarvis-mesos-dispatcher-105] INFO c.a.j.s.ServicePodInstance - Stored 
> operation=4g3k02s1gjb0q_5f912b59-a32d-462c-9c46-8401eba4d2c1 and 
> feedback=OPERATION_FINISHED in podInstanceID=4g3k02s1gjb0q on 
> serviceID=yifan-badagents-1
>  
> * 2019-06-27 21:55:23: task reached state TASK_FAILED for mesos reason: 
> REASON_CONTAINER_LAUNCH_FAILED with mesos message: Failed to launch 
> container: Failed to change the ownership of the persistent volume at 
> '/srv/mesos/work/volumes/roles/test-2/19b564e8-3a90-4f2f-981d-b3dd2a5d9f90' 
> with uid 264 and gid 264: No such file or directory



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (MESOS-9875) Mesos did not respond correctly when operations should fail

2019-08-06 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900866#comment-16900866
 ] 

Greg Mann edited comment on MESOS-9875 at 8/6/19 12:31 PM:
---

It looks like the {{OPERATION_FINISHED}} update should only be sent after the 
agent fails over and recovers its checkpointed operations. We need to make sure 
that if the agent's call to {{syncCheckpointedResources()}} (the function that 
actually creates the persistent volume) fails, the operation is not recovered 
by the agent in state OPERATION_FINISHED. Currently, it looks like the agent 
will fail to create the persistent volume, crash, and then recover the 
operation in state OPERATION_FINISHED and send the update. (A sketch of such a 
recovery guard follows below.)
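
A hedged sketch of the kind of recovery-time guard described above; all of the 
types and helpers here are illustrative stand-ins, not actual Mesos APIs:

{noformat}
// Illustrative only: hypothetical types/helpers, not the Mesos implementation.
#include <string>
#include <vector>

enum class OperationState { PENDING, FINISHED };

struct CheckpointedOperation
{
  std::string id;
  OperationState state;
};

// Hypothetical check: does the persistent volume that this operation
// claims to have created actually exist on disk? (Stubbed for illustration.)
bool volumeExistsOnDisk(const CheckpointedOperation&) { return false; }

std::vector<CheckpointedOperation> recoverOperations(
    const std::vector<CheckpointedOperation>& checkpointed)
{
  std::vector<CheckpointedOperation> recovered;

  for (const CheckpointedOperation& operation : checkpointed) {
    // If the agent crashed after checkpointing OPERATION_FINISHED but
    // before the volume was actually created, skip the operation so the
    // spurious OPERATION_FINISHED update is not recovered and re-sent.
    if (operation.state == OperationState::FINISHED &&
        !volumeExistsOnDisk(operation)) {
      continue;
    }

    recovered.push_back(operation);
  }

  return recovered;
}
{noformat}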


was (Author: greggomann):
[~jamespeach] could you tell me what Mesos SHA this was observed on? Looking at 
the code, I'm having trouble identifying how this would happen, since we don't 
send the operation feedback until after the operation has been committed to 
disk; feedback is sent to the master via the final call to 
{{operationStatusUpdateManager.update(update);}} in {{Slave::applyOperation()}}.

> Mesos did not respond correctly when operations should fail
> ---
>
> Key: MESOS-9875
> URL: https://issues.apache.org/jira/browse/MESOS-9875
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Yifan Xing
>Assignee: Greg Mann
>Priority: Major
>  Labels: foundations, mesosphere
> Attachments: Screen Shot 2019-06-27 at 15.07.20.png
>
>
> For testing persistent volumes with {{OPERATION_FAILED/ERROR}} feedback, we 
> sshed into the Mesos agent and made it unable to create subdirectories in 
> {{/srv/mesos/work/volumes}}; however, Mesos did not return any operation-failed 
> response. Instead, we received {{OPERATION_FINISHED}} feedback.
> Steps to recreate the issue:
> 1. Ssh into a Mesos agent.
>  2. Make it impossible to create a persistent volume (we expect the agent to 
> crash and reregister, and the master to realize that the operation is 
> {{OPERATION_DROPPED}}):
>  * cd /srv/mesos/work (if it doesn't exist, mkdir /srv/mesos/work/volumes)
>  * chattr -RV +i volumes (then no subdirectories can be created)
> 3. Launch a service with persistent volumes, with the constraint of only using 
> the agent modified above.
>  
>  
> Logs for the scheduler for receiving `OPERATION_FINISHED`:
> (Also see screenshot)
>  
> 2019-06-27 21:57:11.879 [12768651|rdar://12768651] 
> [Jarvis-mesos-dispatcher-105] INFO c.a.j.s.ServicePodInstance - Stored 
> operation=4g3k02s1gjb0q_5f912b59-a32d-462c-9c46-8401eba4d2c1 and 
> feedback=OPERATION_FINISHED in podInstanceID=4g3k02s1gjb0q on 
> serviceID=yifan-badagents-1
>  
> * 2019-06-27 21:55:23: task reached state TASK_FAILED for mesos reason: 
> REASON_CONTAINER_LAUNCH_FAILED with mesos message: Failed to launch 
> container: Failed to change the ownership of the persistent volume at 
> '/srv/mesos/work/volumes/roles/test-2/19b564e8-3a90-4f2f-981d-b3dd2a5d9f90' 
> with uid 264 and gid 264: No such file or directory



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (MESOS-9875) Mesos did not respond correctly when operations should fail

2019-08-06 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900866#comment-16900866
 ] 

Greg Mann edited comment on MESOS-9875 at 8/6/19 10:30 AM:
---

[~jamespeach] could you tell me what Mesos SHA this was observed on? Looking at 
the code, I'm having trouble identifying how this would happen, since we don't 
send the operation feedback until after the operation has been committed to 
disk; feedback is sent to the master via the final call to 
{{operationStatusUpdateManager.update(update);}} in {{Slave::applyOperation()}}.


was (Author: greggomann):
[~jamespeach] could you tell me what Mesos SHA this was observed on? Looking at 
the code, I'm having trouble identifying how this would happen, since we don't 
send the operation feedback until after the operation has been committed to 
disk.

> Mesos did not respond correctly when operations should fail
> ---
>
> Key: MESOS-9875
> URL: https://issues.apache.org/jira/browse/MESOS-9875
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Yifan Xing
>Assignee: Greg Mann
>Priority: Major
>  Labels: foundations, mesosphere
> Attachments: Screen Shot 2019-06-27 at 15.07.20.png
>
>
> For testing persistent volumes with {{OPERATION_FAILED/ERROR}} feedback, we 
> sshed into the Mesos agent and made it unable to create subdirectories in 
> {{/srv/mesos/work/volumes}}; however, Mesos did not return any operation-failed 
> response. Instead, we received {{OPERATION_FINISHED}} feedback.
> Steps to recreate the issue:
> 1. Ssh into a Mesos agent.
>  2. Make it impossible to create a persistent volume (we expect the agent to 
> crash and reregister, and the master to realize that the operation is 
> {{OPERATION_DROPPED}}):
>  * cd /srv/mesos/work (if it doesn't exist, mkdir /srv/mesos/work/volumes)
>  * chattr -RV +i volumes (then no subdirectories can be created)
> 3. Launch a service with persistent volumes, with the constraint of only using 
> the agent modified above.
>  
>  
> Logs for the scheduler for receiving `OPERATION_FINISHED`:
> (Also see screenshot)
>  
> 2019-06-27 21:57:11.879 [12768651|rdar://12768651] 
> [Jarvis-mesos-dispatcher-105] INFO c.a.j.s.ServicePodInstance - Stored 
> operation=4g3k02s1gjb0q_5f912b59-a32d-462c-9c46-8401eba4d2c1 and 
> feedback=OPERATION_FINISHED in podInstanceID=4g3k02s1gjb0q on 
> serviceID=yifan-badagents-1
>  
> * 2019-06-27 21:55:23: task reached state TASK_FAILED for mesos reason: 
> REASON_CONTAINER_LAUNCH_FAILED with mesos message: Failed to launch 
> container: Failed to change the ownership of the persistent volume at 
> '/srv/mesos/work/volumes/roles/test-2/19b564e8-3a90-4f2f-981d-b3dd2a5d9f90' 
> with uid 264 and gid 264: No such file or directory



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (MESOS-9875) Mesos did not respond correctly when operations should fail

2019-08-06 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900866#comment-16900866
 ] 

Greg Mann commented on MESOS-9875:
--

[~jamespeach] could you tell me what Mesos SHA this was observed on? Looking at 
the code, I'm having trouble identifying how this would happen, since we don't 
send the operation feedback until after the operation has been committed to 
disk.

> Mesos did not respond correctly when operations should fail
> ---
>
> Key: MESOS-9875
> URL: https://issues.apache.org/jira/browse/MESOS-9875
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Yifan Xing
>Assignee: Greg Mann
>Priority: Major
>  Labels: foundations, mesosphere
> Attachments: Screen Shot 2019-06-27 at 15.07.20.png
>
>
> For testing persistent volumes with {{OPERATION_FAILED/ERROR}} feedback, we 
> sshed into the Mesos agent and made it unable to create subdirectories in 
> {{/srv/mesos/work/volumes}}; however, Mesos did not return any operation-failed 
> response. Instead, we received {{OPERATION_FINISHED}} feedback.
> Steps to recreate the issue:
> 1. Ssh into a Mesos agent.
>  2. Make it impossible to create a persistent volume (we expect the agent to 
> crash and reregister, and the master to realize that the operation is 
> {{OPERATION_DROPPED}}):
>  * cd /srv/mesos/work (if it doesn't exist, mkdir /srv/mesos/work/volumes)
>  * chattr -RV +i volumes (then no subdirectories can be created)
> 3. Launch a service with persistent volumes, with the constraint of only using 
> the agent modified above.
>  
>  
> Logs for the scheduler for receiving `OPERATION_FINISHED`:
> (Also see screenshot)
>  
> 2019-06-27 21:57:11.879 [12768651|rdar://12768651] 
> [Jarvis-mesos-dispatcher-105] INFO c.a.j.s.ServicePodInstance - Stored 
> operation=4g3k02s1gjb0q_5f912b59-a32d-462c-9c46-8401eba4d2c1 and 
> feedback=OPERATION_FINISHED in podInstanceID=4g3k02s1gjb0q on 
> serviceID=yifan-badagents-1
>  
> * 2019-06-27 21:55:23: task reached state TASK_FAILED for mesos reason: 
> REASON_CONTAINER_LAUNCH_FAILED with mesos message: Failed to launch 
> container: Failed to change the ownership of the persistent volume at 
> '/srv/mesos/work/volumes/roles/test-2/19b564e8-3a90-4f2f-981d-b3dd2a5d9f90' 
> with uid 264 and gid 264: No such file or directory



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (MESOS-9875) Mesos did not respond correctly when operations should fail

2019-08-05 Thread Greg Mann (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-9875:


Assignee: Greg Mann

> Mesos did not respond correctly when operations should fail
> ---
>
> Key: MESOS-9875
> URL: https://issues.apache.org/jira/browse/MESOS-9875
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Yifan Xing
>Assignee: Greg Mann
>Priority: Major
>  Labels: foundations, mesosphere
> Attachments: Screen Shot 2019-06-27 at 15.07.20.png
>
>
> For testing persistent volumes with {{OPERATION_FAILED/ERROR}} feedback, we 
> sshed into the Mesos agent and made it unable to create subdirectories in 
> {{/srv/mesos/work/volumes}}; however, Mesos did not return any operation-failed 
> response. Instead, we received {{OPERATION_FINISHED}} feedback.
> Steps to recreate the issue:
> 1. Ssh into a Mesos agent.
>  2. Make it impossible to create a persistent volume (we expect the agent to 
> crash and reregister, and the master to realize that the operation is 
> {{OPERATION_DROPPED}}):
>  * cd /srv/mesos/work (if it doesn't exist, mkdir /srv/mesos/work/volumes)
>  * chattr -RV +i volumes (then no subdirectories can be created)
> 3. Launch a service with persistent volumes, with the constraint of only using 
> the agent modified above.
>  
>  
> Logs for the scheduler for receiving `OPERATION_FINISHED`:
> (Also see screenshot)
>  
> 2019-06-27 21:57:11.879 [12768651|rdar://12768651] 
> [Jarvis-mesos-dispatcher-105] INFO c.a.j.s.ServicePodInstance - Stored 
> operation=4g3k02s1gjb0q_5f912b59-a32d-462c-9c46-8401eba4d2c1 and 
> feedback=OPERATION_FINISHED in podInstanceID=4g3k02s1gjb0q on 
> serviceID=yifan-badagents-1
>  
> * 2019-06-27 21:55:23: task reached state TASK_FAILED for mesos reason: 
> REASON_CONTAINER_LAUNCH_FAILED with mesos message: Failed to launch 
> container: Failed to change the ownership of the persistent volume at 
> '/srv/mesos/work/volumes/roles/test-2/19b564e8-3a90-4f2f-981d-b3dd2a5d9f90' 
> with uid 264 and gid 264: No such file or directory



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (MESOS-9919) Health check performance decreases on large machines

2019-08-02 Thread Greg Mann (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-9919:


Assignee: Greg Mann

> Health check performance decreases on large machines
> 
>
> Key: MESOS-9919
> URL: https://issues.apache.org/jira/browse/MESOS-9919
> Project: Mesos
>  Issue Type: Task
>  Components: agent, containerization
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Major
>  Labels: foundations, mesosphere
>
> In recent testing, it appears that the performance of Mesos command health 
> checks decreases dramatically on nodes with large numbers of cores and lots 
> of memory. This may be due to the increased cost of forking the agent 
> process on such nodes. We need to investigate this issue to understand the 
> root cause.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (MESOS-9919) Health check performance decreases on large machines

2019-07-31 Thread Greg Mann (JIRA)
Greg Mann created MESOS-9919:


 Summary: Health check performance decreases on large machines
 Key: MESOS-9919
 URL: https://issues.apache.org/jira/browse/MESOS-9919
 Project: Mesos
  Issue Type: Task
  Components: agent, containerization
Reporter: Greg Mann


In recent testing, it appears that the performance of Mesos command health 
checks decreases dramatically on nodes with large numbers of cores and lots of 
memory. This may be due to the increased cost of forking the agent process on 
such nodes. We need to investigate this issue to understand the root cause. (An 
illustrative way to test the fork-cost hypothesis is sketched below.)
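
As a purely illustrative sketch (not part of this ticket), one could time bare 
{{fork()}}/{{waitpid()}} cycles from a process with a large resident set; the 
ballast size and iteration count below are arbitrary:

{noformat}
// Illustrative micro-benchmark sketch; not Mesos code.
#include <sys/wait.h>
#include <unistd.h>

#include <chrono>
#include <iostream>
#include <vector>

int main()
{
  // Simulate a large agent process: touch ~1 GiB so fork() has a big
  // address space (page tables, etc.) to duplicate.
  std::vector<char> ballast(1UL << 30, 1);

  const int iterations = 100;
  const auto start = std::chrono::steady_clock::now();

  for (int i = 0; i < iterations; i++) {
    const pid_t pid = fork();
    if (pid == 0) {
      _exit(0); // Child exits immediately; we only measure the fork itself.
    }
    waitpid(pid, nullptr, 0);
  }

  const auto elapsed = std::chrono::steady_clock::now() - start;

  std::cout << std::chrono::duration_cast<std::chrono::microseconds>(
                   elapsed).count() / iterations
            << " us per fork (ballast byte: " << static_cast<int>(ballast[0])
            << ")" << std::endl;

  return 0;
}
{noformat}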



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

