[jira] [Assigned] (MESOS-8383) Add metrics for operations in Storage Local Resource Provider (SLRP).

2018-02-14 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-8383:
--

Assignee: Chun-Hung Hsiao

> Add metrics for operations in Storage Local Resource Provider (SLRP).
> -
>
> Key: MESOS-8383
> URL: https://issues.apache.org/jira/browse/MESOS-8383
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Chun-Hung Hsiao
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8584) Move volume file attach/detach from the agent to the containerizer.

2018-02-14 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-8584:
---

 Summary: Move volume file attach/detach from the agent to the 
containerizer.
 Key: MESOS-8584
 URL: https://issues.apache.org/jira/browse/MESOS-8584
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Gilbert Song


Volumes are a container-level concept, supported via isolators in the 
containerizer. We should consider moving the file endpoint attach/detach into 
the containerizer. A refactoring is needed.

/cc [~jieyu] [~qianzhang]





[jira] [Assigned] (MESOS-8573) Container stuck in PULLING when Docker daemon hangs

2018-02-14 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-8573:
---

Assignee: Gilbert Song

> Container stuck in PULLING when Docker daemon hangs
> ---
>
> Key: MESOS-8573
> URL: https://issues.apache.org/jira/browse/MESOS-8573
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.5.0
>Reporter: Greg Mann
>Assignee: Gilbert Song
>Priority: Major
>  Labels: mesosphere
>
> When the {{force}} argument is not set to {{true}}, {{Docker::pull}} will 
> always perform a {{docker inspect}} call before it does a {{docker pull}}. If 
> either of these two Docker CLI calls hangs indefinitely, the Docker container 
> will be stuck in the PULLING state. This means that we make no further 
> progress in the {{launch()}} call path, so the executor binary is never 
> executed, the {{Future}} associated with the {{launch()}} call is never 
> failed or satisfied, and {{wait()}} is never called on the container. Thus, 
> when the executor registration timeout elapses, the agent's call to 
> {{containerizer->destroy()}} gets stuck waiting on the container status, and 
> its continuation is never invoked.
> This leaves the task destined for that Docker executor stuck in TASK_STAGING 
> from the framework's perspective, and attempts to kill the task will fail.
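The hang described above is ultimately a missing-deadline problem: the launch path blocks forever on a CLI call that never returns. Below is a minimal, hypothetical sketch (plain C++, not Mesos code) of bounding such a blocking call with a timeout; `slowDockerCall` is a stand-in for a `docker inspect`/`docker pull` invocation.

```cpp
#include <chrono>
#include <future>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical stand-in for a Docker CLI invocation ("docker inspect" or
// "docker pull"); sleeps to simulate a daemon that is slow or wedged.
std::string slowDockerCall(std::chrono::milliseconds hang) {
  std::this_thread::sleep_for(hang);
  return "inspect-output";
}

// Runs the call on a separate thread and gives up after `timeout`, so the
// caller is not blocked indefinitely. Note: a future from std::async joins
// its thread on destruction, so a real implementation would hand the
// still-running call to a detached reaper instead of just throwing.
std::string callWithTimeout(std::chrono::milliseconds hang,
                            std::chrono::milliseconds timeout) {
  std::future<std::string> f =
      std::async(std::launch::async, slowDockerCall, hang);
  if (f.wait_for(timeout) == std::future_status::timeout) {
    throw std::runtime_error("docker call timed out");
  }
  return f.get();
}
```

This only illustrates the shape of a fix; the actual discussion in Mesos involves how a timed-out `docker` subprocess and the container state machine should then be cleaned up.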





[jira] [Comment Edited] (MESOS-8565) Persistent volumes are not visible in Mesos UI when launching a pod using default executor.

2018-02-14 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363643#comment-16363643
 ] 

Gilbert Song edited comment on MESOS-8565 at 2/14/18 6:41 PM:
--

commit 9d4c6d9576741cc480c75f8e59cc8d1adc9849fc
Author: Qian Zhang
Date:   Wed Feb 14 00:17:37 2018 -0800

    Attached/detached volume directory for task which has volume specified.

    Review: https://reviews.apache.org/r/65570/

was (Author: gilbert):
commit a7714536fad1140fd0c07c47e32b40e9ed00a3c3
Author: Qian Zhang
Date:   Mon Feb 5 20:42:07 2018 +0800

    Reaped the container process directly in Docker executor.

    Due to a Docker issue (https://github.com/moby/moby/issues/33820),
    the Docker daemon can fail to catch a container exit: the container
    process has already exited, but `docker ps` still shows the
    container running. This causes the `docker run` command we execute
    in the Docker executor to never return, and also renders
    `docker stop` ineffective (it returns without error while
    `docker ps` still shows the container running), so the task gets
    stuck in `TASK_KILLING`.

    To work around this Docker issue, this patch makes the Docker
    executor reap the container process directly, so the executor is
    notified as soon as the container process exits.

    Review: https://reviews.apache.org/r/65518

> Persistent volumes are not visible in Mesos UI when launching a pod using 
> default executor.
> ---
>
> Key: MESOS-8565
> URL: https://issues.apache.org/jira/browse/MESOS-8565
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.2, 1.3.1, 1.4.1
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.6.0, 1.5.1
>
>
> When user launches a pod to use a persistent volume in DC/OS, the nested 
> containers in the pod can access the PV successfully and the PV directory of 
> the executor shown in Mesos UI has all the contents written by the tasks, but 
> the PV directory of the tasks shown in DC/OS UI and Mesos UI is empty.





[jira] [Created] (MESOS-8583) Autotools and Cmake do not give the same permissions on files in build/bin/.

2018-02-14 Thread Armand Grillet (JIRA)
Armand Grillet created MESOS-8583:
-

 Summary: Autotools and Cmake do not give the same permissions on 
files in build/bin/.
 Key: MESOS-8583
 URL: https://issues.apache.org/jira/browse/MESOS-8583
 Project: Mesos
  Issue Type: Bug
  Components: build
Affects Versions: 1.6.0
Reporter: Armand Grillet


Using CMake, the files in {{build/bin}} have the following access rights:
{code:java}
ls -l
total 84
-rw--x. 1 agrillet agrillet 1583 Feb 14 10:13 gdb-mesos-agent.sh
-rw--x. 1 agrillet agrillet 1577 Feb 14 10:13 gdb-mesos-local.sh
-rw--x. 1 agrillet agrillet 1586 Feb 14 10:13 gdb-mesos-master.sh
-rw--x. 1 agrillet agrillet 1554 Feb 14 10:13 gdb-mesos-tests.sh
-rw--x. 1 agrillet agrillet 1543 Feb 14 10:13 lldb-mesos-agent.sh
-rw--x. 1 agrillet agrillet 1545 Feb 14 10:13 lldb-mesos-local.sh
-rw--x. 1 agrillet agrillet 1548 Feb 14 10:13 lldb-mesos-master.sh
-rw--x. 1 agrillet agrillet 1522 Feb 14 10:13 lldb-mesos-tests.sh
-rw--x. 1 agrillet agrillet 1840 Feb 14 10:13 mesos-agent-flags.sh
-rw--x. 1 agrillet agrillet 1047 Feb 14 10:13 mesos-agent.sh
-rw--x. 1 agrillet agrillet 929 Feb 14 10:13 mesos-local-flags.sh
-rw--x. 1 agrillet agrillet 1053 Feb 14 10:13 mesos-local.sh
-rw--x. 1 agrillet agrillet 892 Feb 14 10:13 mesos-master-flags.sh
-rw--x. 1 agrillet agrillet 1056 Feb 14 10:13 mesos-master.sh
-rw--x. 1 agrillet agrillet 1200 Feb 14 10:13 mesos.sh
-rw--x. 1 agrillet agrillet 901 Feb 14 10:13 mesos-tests-flags.sh
-rw--x. 1 agrillet agrillet 1056 Feb 14 10:13 mesos-tests.sh
-rw--x. 1 agrillet agrillet 1825 Feb 14 10:13 valgrind-mesos-agent.sh
-rw--x. 1 agrillet agrillet 1825 Feb 14 10:13 valgrind-mesos-local.sh
-rw--x. 1 agrillet agrillet 1828 Feb 14 10:13 valgrind-mesos-master.sh
-rw--x. 1 agrillet agrillet 1825 Feb 14 10:13 valgrind-mesos-tests.sh{code}

Using Autotools, the permissions are not the same:
{code}
ls -l
total 104
-rwxrwxr-x. 1 agrillet agrillet 1592 Feb 14 10:32 gdb-mesos-agent.sh
-rwxrwxr-x. 1 agrillet agrillet 1586 Feb 14 10:32 gdb-mesos-local.sh
-rwxrwxr-x. 1 agrillet agrillet 1595 Feb 14 10:32 gdb-mesos-master.sh
-rwxrwxr-x. 1 agrillet agrillet 1592 Feb 14 10:32 gdb-mesos-slave.sh
-rwxrwxr-x. 1 agrillet agrillet 1563 Feb 14 10:32 gdb-mesos-tests.sh
-rwxrwxr-x. 1 agrillet agrillet 1552 Feb 14 10:32 lldb-mesos-agent.sh
-rwxrwxr-x. 1 agrillet agrillet 1554 Feb 14 10:32 lldb-mesos-local.sh
-rwxrwxr-x. 1 agrillet agrillet 1557 Feb 14 10:32 lldb-mesos-master.sh
-rwxrwxr-x. 1 agrillet agrillet 1552 Feb 14 10:32 lldb-mesos-slave.sh
-rwxrwxr-x. 1 agrillet agrillet 1531 Feb 14 10:32 lldb-mesos-tests.sh
-rw-rw-r--. 1 agrillet agrillet 1840 Feb 14 10:32 mesos-agent-flags.sh
-rwxrwxr-x. 1 agrillet agrillet 1047 Feb 14 10:32 mesos-agent.sh
-rw-rw-r--. 1 agrillet agrillet  929 Feb 14 10:32 mesos-local-flags.sh
-rwxrwxr-x. 1 agrillet agrillet 1053 Feb 14 10:32 mesos-local.sh
-rw-rw-r--. 1 agrillet agrillet  901 Feb 14 10:32 mesos-master-flags.sh
-rwxrwxr-x. 1 agrillet agrillet 1056 Feb 14 10:32 mesos-master.sh
-rwxrwxr-x. 1 agrillet agrillet 1209 Feb 14 10:32 mesos.sh
-rw-rw-r--. 1 agrillet agrillet 1840 Feb 14 10:32 mesos-slave-flags.sh
-rwxrwxr-x. 1 agrillet agrillet 1047 Feb 14 10:32 mesos-slave.sh
-rw-rw-r--. 1 agrillet agrillet  901 Feb 14 10:32 mesos-tests-flags.sh
-rwxrwxr-x. 1 agrillet agrillet 1056 Feb 14 10:32 mesos-tests.sh
-rwxrwxr-x. 1 agrillet agrillet 1834 Feb 14 10:32 valgrind-mesos-agent.sh
-rwxrwxr-x. 1 agrillet agrillet 1834 Feb 14 10:32 valgrind-mesos-local.sh
-rwxrwxr-x. 1 agrillet agrillet 1837 Feb 14 10:32 valgrind-mesos-master.sh
-rwxrwxr-x. 1 agrillet agrillet 1834 Feb 14 10:32 valgrind-mesos-slave.sh
-rwxrwxr-x. 1 agrillet agrillet 1834 Feb 14 10:32 valgrind-mesos-tests.sh
{code}
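The practical impact of the CMake permissions is that the generated scripts cannot be run directly. As an illustrative workaround only (not the project's actual fix, which belongs in the CMake build rules), a small C++17 sketch that adds the execute bits Autotools would have set on the wrapper scripts:

```cpp
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

// Creates an empty file (helper for demonstration/testing only).
void touchFile(const fs::path& p) { std::ofstream out(p); }

// Adds execute permission for owner, group, and others to every *.sh
// file directly inside `dir`, mirroring the rwxrwxr-x bits Autotools
// sets on its generated scripts. Returns the number of files changed.
int markScriptsExecutable(const fs::path& dir) {
  int changed = 0;
  for (const auto& entry : fs::directory_iterator(dir)) {
    if (entry.is_regular_file() && entry.path().extension() == ".sh") {
      fs::permissions(entry.path(),
                      fs::perms::owner_exec | fs::perms::group_exec |
                          fs::perms::others_exec,
                      fs::perm_options::add);
      ++changed;
    }
  }
  return changed;
}
```

The real fix would instead make the CMake build emit the scripts with execute permission in the first place, so both build systems produce the same bits.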





[jira] [Created] (MESOS-8582) Add a way to make sure an agent always knows the full framework information of all frameworks executing operations on its resources

2018-02-14 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-8582:
---

 Summary: Add a way to make sure an agent always knows the full 
framework information of all frameworks executing operations on its resources
 Key: MESOS-8582
 URL: https://issues.apache.org/jira/browse/MESOS-8582
 Project: Mesos
  Issue Type: Bug
  Components: agent, master, storage
Affects Versions: 1.5.0
Reporter: Benjamin Bannier


Currently an {{Operation}} only contains a {{FrameworkID}} of originating 
frameworks, but e.g., not the full {{FrameworkInfo}}. This is problematic in 
master failover scenarios where a master might learn about an operation 
triggered by a framework unknown to it. The way the master implementation is 
structured, we would like to create tracking structures for that framework 
(e.g., to sync with the allocator down the line), but cannot do so since we can 
only learn this information when either the framework reregisters, or an agent 
running tasks of that framework reconciles with the master. We also cannot use 
conjured up dummy information until we learn the true {{FrameworkInfo}} since 
some required fields in {{FrameworkInfo}} (namely {{FrameworkInfo.user}}) 
cannot be updated, see MESOS-703.

We should introduce a channel for agents to learn the full {{FrameworkInfo}} 
for all frameworks executing operations on its resources. For simplicity and 
symmetry with {{RunTaskMessage}} it seems that adding an explicit 
{{FrameworkInfo}} field to {{Operation}} would do the job (e.g., allowing atomic 
information transfer when operations are sent to the agent or on reconciliation 
with newly elected masters).
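The proposal can be sketched with simplified stand-in types (hypothetical names, not the real Mesos protobufs): when the full info is embedded in the operation, a failed-over master can create its tracking structures immediately; when it is absent, the operation has to be parked until the framework reregisters.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Simplified stand-ins for the Mesos protobufs (illustrative only).
struct FrameworkInfo {
  std::string user;  // Required in the real proto and not updatable
  std::string name;  // (see MESOS-703), hence the need to learn it.
};

struct Operation {
  std::string frameworkId;
  // The proposed addition: the full info carried alongside the ID, so
  // the receiver never has to conjure up dummy values.
  std::optional<FrameworkInfo> frameworkInfo;
};

// Sketch of a failed-over master learning about recovered operations.
struct Master {
  std::map<std::string, FrameworkInfo> frameworks;  // Tracked frameworks.
  std::vector<Operation> pending;                   // Unknown originators.

  void recoverOperation(const Operation& op) {
    if (frameworks.count(op.frameworkId) > 0) {
      return;  // Already tracked; nothing to learn.
    }
    if (op.frameworkInfo.has_value()) {
      // Atomic: the operation itself carries everything we need.
      frameworks.emplace(op.frameworkId, *op.frameworkInfo);
    } else {
      // Status quo: must wait for reregistration or agent reconciliation.
      pending.push_back(op);
    }
  }
};
```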





[jira] [Commented] (MESOS-8468) `LAUNCH_GROUP` failure tears down the default executor.

2018-02-14 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363885#comment-16363885
 ] 

Qian Zhang commented on MESOS-8468:
---

commit 632ff7f7f8e32d3f9507e9199c8a253ff755224e
Author: Gaston Kleiman 
Date: Wed Feb 14 14:35:34 2018 +0800

Removed outdated executor-wide launched flag from the default executor.
 
 Review: https://reviews.apache.org/r/65616/

src/launcher/default_executor.cpp | 13 -
 1 file changed, 4 insertions(+), 9 deletions(-)

commit 54b6c5b9c7cb059ebd87ee0f9927cfa6ff73129d
Author: Gaston Kleiman 
Date: Wed Feb 14 14:35:22 2018 +0800

Made the default executor treat agent disconnections more gracefully.
 
 This patch makes the default executor not shutdown if there are active
 child containers, and it fails to connect or is not subscribed to the
 agent when starting to launch a task group.
 
 Review: https://reviews.apache.org/r/65556/

src/launcher/default_executor.cpp | 43 +++
 1 file changed, 35 insertions(+), 8 deletions(-)

commit 656196eeca4ab6449c4b9f329b5b9cac2f69a885
Author: Gaston Kleiman 
Date: Wed Feb 14 14:35:17 2018 +0800

Added a regression test for MESOS-8468.
 
 Review: https://reviews.apache.org/r/65552/

src/tests/default_executor_tests.cpp | 252 +
 1 file changed, 252 insertions(+)

commit c3f3542e7ecce82cad8b75fdc2db14fe8c43a5da
Author: Gaston Kleiman 
Date: Wed Feb 14 14:35:11 2018 +0800

Stopped shutting down the whole default executor on task launch failure.

The default executor would be completely shutdown on a
 `LAUNCH_NESTED_CONTAINER` failure.
 
 This patch makes it kill the affected task group instead of shutting
 down and killing all task groups.
 
 Review: https://reviews.apache.org/r/65551/

src/launcher/default_executor.cpp | 165 ++--
 1 file changed, 103 insertions(+), 62 deletions(-)

commit 5c8852b244b09b4ae57e00abcd940482927d57e6
Author: Gaston Kleiman 
Date: Wed Feb 14 14:35:01 2018 +0800

Made default executor not shutdown if unsubscribed during task launch.
 
 The default executor would unnecessarily shutdown if, while launching a
 task group, it gets unsubscribed after having successfully launched the
 task group's containers.
 
 Review: https://reviews.apache.org/r/65550/

src/launcher/default_executor.cpp | 24 +---
 1 file changed, 13 insertions(+), 11 deletions(-)

commit 2e570b709dc7d15c73c8d728ef0b32e2416b0a08
Author: Gaston Kleiman 
Date: Wed Feb 14 14:34:56 2018 +0800

Improved some default executor log messages.
 
 Review: https://reviews.apache.org/r/65549/

src/launcher/default_executor.cpp | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

commit 29d1e4e1a1b894da78c2033f1932b282ee794f4b
Author: Gaston Kleiman 
Date: Wed Feb 14 14:34:50 2018 +0800

Added `Event::Update` and `v1::scheduler::TaskStatus` ostream operators.
 
These operators make gtest print a human-readable representation of the
 protos on test failures.
 
 Review: https://reviews.apache.org/r/65548/

include/mesos/v1/mesos.hpp | 3 +++
 include/mesos/v1/scheduler/scheduler.hpp | 10 ++
 src/v1/mesos.cpp | 37 +
 3 files changed, 50 insertions(+)

> `LAUNCH_GROUP` failure tears down the default executor.
> ---
>
> Key: MESOS-8468
> URL: https://issues.apache.org/jira/browse/MESOS-8468
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0
>Reporter: Chun-Hung Hsiao
>Assignee: Gastón Kleiman
>Priority: Critical
>  Labels: default-executor, mesosphere
> Fix For: 1.6.0, 1.5.1
>
>
> The following code in the default executor 
> (https://github.com/apache/mesos/blob/12be4ba002f2f5ff314fbc16af51d095b0d90e56/src/launcher/default_executor.cpp#L525-L535)
>  shows that if a `LAUNCH_NESTED_CONTAINER` call fails (say, due to a 
> fetcher failure), the whole executor will be shut down:
> {code:cpp}
> // Check if we received a 200 OK response for all the
> // `LAUNCH_NESTED_CONTAINER` calls. Shutdown the executor
> // if this is not the case.
> foreach (const Response& response, responses.get()) {
>   if (response.code != process::http::Status::OK) {
> LOG(ERROR) << "Received '" << response.status << "' ("
><< response.body << ") while launching child container";
> _shutdown();
> return;
>   }
> }
> {code}
> This is not expected by a user. Instead, one would expect that a failed 
> `LAUNCH_GROUP` won't affect other task groups launched by the same executor, 
> similar to the case that a task failure only takes down its own task group. 
> We should adjust the semantics to make a failed `LAUNCH_GROUP` not take down 
> the executor and affect other task groups.

[jira] [Commented] (MESOS-8468) `LAUNCH_GROUP` failure tears down the default executor.

2018-02-14 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363883#comment-16363883
 ] 

Qian Zhang commented on MESOS-8468:
---

https://reviews.apache.org/r/65616/

> `LAUNCH_GROUP` failure tears down the default executor.
> ---
>
> Key: MESOS-8468
> URL: https://issues.apache.org/jira/browse/MESOS-8468
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0
>Reporter: Chun-Hung Hsiao
>Assignee: Gastón Kleiman
>Priority: Critical
>  Labels: default-executor, mesosphere
>
> The following code in the default executor 
> (https://github.com/apache/mesos/blob/12be4ba002f2f5ff314fbc16af51d095b0d90e56/src/launcher/default_executor.cpp#L525-L535)
>  shows that if a `LAUNCH_NESTED_CONTAINER` call fails (say, due to a 
> fetcher failure), the whole executor will be shut down:
> {code:cpp}
> // Check if we received a 200 OK response for all the
> // `LAUNCH_NESTED_CONTAINER` calls. Shutdown the executor
> // if this is not the case.
> foreach (const Response& response, responses.get()) {
>   if (response.code != process::http::Status::OK) {
> LOG(ERROR) << "Received '" << response.status << "' ("
><< response.body << ") while launching child container";
> _shutdown();
> return;
>   }
> }
> {code}
> This is not expected by a user. Instead, one would expect that a failed 
> `LAUNCH_GROUP` won't affect other task groups launched by the same executor, 
> similar to the case that a task failure only takes down its own task group. 
> We should adjust the semantics to make a failed `LAUNCH_GROUP` not take down 
> the executor and affect other task groups.
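The behavioral change these patches make can be illustrated with a toy model (hypothetical types, not the real default executor): the old behavior reacts to any launch failure by tearing down every task group, while the patched behavior scopes the reaction to the group that actually failed.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Simplified model of the executor's bookkeeping (illustrative only).
struct TaskGroup {
  std::string id;
  bool launchFailed = false;
};

// Returns the IDs of the task groups that get killed.
// scopedToGroup == false models the old behavior: any failure dooms all
// groups (executor-wide shutdown). scopedToGroup == true models the
// MESOS-8468 fix: only the affected group is killed.
std::vector<std::string> groupsToKill(const std::vector<TaskGroup>& groups,
                                      bool scopedToGroup) {
  const bool anyFailure = std::any_of(
      groups.begin(), groups.end(),
      [](const TaskGroup& g) { return g.launchFailed; });

  std::vector<std::string> doomed;
  for (const TaskGroup& g : groups) {
    if (scopedToGroup ? g.launchFailed : anyFailure) {
      doomed.push_back(g.id);
    }
  }
  return doomed;
}
```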





[jira] [Commented] (MESOS-1720) Slave should send exited executor message when the executor is never launched.

2018-02-14 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363865#comment-16363865
 ] 

Greg Mann commented on MESOS-1720:
--

Patches on 1.5.x:
{code}
commit 2bdf4935b7929d0dce614d76461cddb991df89da
Author: Meng Zhu 
Date:   Tue Feb 13 22:45:07 2018 -0800

Fixed a bug where executor info lingers on master if failed to launch.

Master relies on `ExitedExecutorMessage` from the agent to remove
executor entries. However, this message won't be sent if an executor
never actually launched (due to transient error), leaving executor
info on the master and the executor's resources claimed.
See MESOS-1720.

This patch fixes this issue by sending the `ExitedExecutorMessage`
from the agent if the executor is never launched.

Review: https://reviews.apache.org/r/65449/
{code}
{code}
commit fb0e2f1f81b2256a76cae83893e2a69fdd91fcd7
Author: Meng Zhu 
Date:   Tue Feb 13 22:45:03 2018 -0800

Added helper function for the agent to send `ExitedExecutorMessage`.

Review: https://reviews.apache.org/r/65446/
{code}
{code}
commit 10aa875df8947f8cbfb318820101984d99259070
Author: Meng Zhu 
Date:   Tue Feb 13 22:44:58 2018 -0800

Made master set `launch_executor` in the RunTask(Group)Message.

By setting a new field `launch_executor` in the RunTask(Group)Message,
the master is able to control executor creation on the agent.

Also refactored the `addTask()` logic. Added two new functions:
`isTaskLaunchExecutor()` checks if a task needs to launch an executor;
`addExecutor()` adds an executor to the framework and slave.

Review: https://reviews.apache.org/r/65504/
{code}
{code}
commit 08e0ceb84e4bf353e1f938482bb6766bf73310c7
Author: Meng Zhu 
Date:   Tue Feb 13 22:44:48 2018 -0800

Added new protobuf field `launch_executor` in RunTask(Group)Message.

This boolean flag is used for the master to specify whether a
new executor should be launched for the task or task group (with
the exception of the command executor). This allows the master
to control executor creation on the agent.

Also updated the relevant message handlers and mock functions.

Review: https://reviews.apache.org/r/65445/
{code}
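The core of the fix above can be sketched as a toy model (illustrative types, not Mesos code): when a task fails before its executor ever launches, the agent additionally emits `ExitedExecutorMessage`, so the master removes its executor bookkeeping entry instead of charging its resources forever.

```cpp
#include <string>
#include <vector>

// Toy stand-in for an agent-to-master message (illustrative only).
struct Message {
  std::string type;
  std::string executorId;
};

// Models the agent's reaction to a task launch failure. Previously only
// the TASK_LOST update was sent; the MESOS-1720 fix adds an
// ExitedExecutorMessage whenever the executor never actually started.
std::vector<Message> onTaskLaunchFailed(const std::string& executorId,
                                        bool executorLaunched) {
  std::vector<Message> out;
  out.push_back({"StatusUpdate:TASK_LOST", executorId});
  if (!executorLaunched) {
    // The fix: tell the master the executor is gone even though it
    // never ran, so its bookkeeping entry (and claimed resources) are
    // released.
    out.push_back({"ExitedExecutorMessage", executorId});
  }
  return out;
}
```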

> Slave should send exited executor message when the executor is never launched.
> --
>
> Key: MESOS-1720
> URL: https://issues.apache.org/jira/browse/MESOS-1720
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, master
>Reporter: Benjamin Mahler
>Assignee: Meng Zhu
>Priority: Major
>  Labels: mesosphere
> Fix For: 1.6.0, 1.5.1
>
>
> When the slave sends TASK_LOST before launching an executor for a task, the 
> slave does not send an exited executor message to the master.
> Since the master receives no exited executor message, it still thinks the 
> executor's resources are consumed on the slave.
> One possible fix for this would be to send the exited executor message to the 
> master in these cases.





[jira] [Commented] (MESOS-1720) Slave should send exited executor message when the executor is never launched.

2018-02-14 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363851#comment-16363851
 ] 

Greg Mann commented on MESOS-1720:
--

Patches on master:
{code}
commit 3e3c582f10e8154e4a76c2b481cc33c8d4d0310c
Author: Meng Zhu 
Date:   Tue Feb 13 22:45:23 2018 -0800

Added tests to check that executors which fail to launch are removed.

These tests ensure that the agent sends `ExitedExecutorMessage` when
a task group fails to launch due to unschedule GC failure, or when a
task fails to launch due to task authorization failure.

Review: https://reviews.apache.org/r/65593/
{code}
{code}
commit a8e723b6ca5a268cc97e39919f7a6b4aedfc3222
Author: Meng Zhu 
Date:   Tue Feb 13 22:45:21 2018 -0800

Added a mock method for `__run()` to the mock slave.

Review: https://reviews.apache.org/r/65626/
{code}
{code}
commit a6c065060d94dc04dcdc81021035d846ad7040a0
Author: Meng Zhu 
Date:   Tue Feb 13 22:45:16 2018 -0800

Added a test to ensure master removes executors that never launched.

This test ensures that the agent sends `ExitedExecutorMessage` when
the executor is never launched so that the master's executor
bookkeeping entry is removed. See MESOS-1720.

Review: https://reviews.apache.org/r/65448/
{code}
{code}
commit b5350fecc8604bdddb45303d9363aff4ca60cfcc
Author: Meng Zhu 
Date:   Tue Feb 13 22:45:07 2018 -0800

Fixed a bug where executor info lingers on master if failed to launch.

Master relies on `ExitedExecutorMessage` from the agent to remove
executor entries. However, this message won't be sent if an executor
never actually launched (due to transient error), leaving executor
info on the master and the executor's resources claimed.
See MESOS-1720.

This patch fixes this issue by sending the `ExitedExecutorMessage`
from the agent if the executor is never launched.

Review: https://reviews.apache.org/r/65449/
{code}
{code}
commit 0321b85ce66f21e9cb6990a3032cb7f8f709c6e6
Author: Meng Zhu 
Date:   Tue Feb 13 22:45:03 2018 -0800

Added helper function for the agent to send `ExitedExecutorMessage`.

Review: https://reviews.apache.org/r/65446/
{code}
{code}
commit ce7f1f6a0807b96b92cb4c755c52f36e1a8e2853
Author: Meng Zhu 
Date:   Tue Feb 13 22:44:58 2018 -0800

Made master set `launch_executor` in the RunTask(Group)Message.

By setting a new field `launch_executor` in the RunTask(Group)Message,
the master is able to control executor creation on the agent.

Also refactored the `addTask()` logic. Added two new functions:
`isTaskLaunchExecutor()` checks if a task needs to launch an executor;
`addExecutor()` adds an executor to the framework and slave.

Review: https://reviews.apache.org/r/65504/
{code}
{code}
commit 7c29031bf35232a9e8b0c88c826d0185673a
Author: Meng Zhu 
Date:   Tue Feb 13 22:44:48 2018 -0800

Added new protobuf field `launch_executor` in RunTask(Group)Message.

This boolean flag is used for the master to specify whether a
new executor should be launched for the task or task group (with
the exception of the command executor). This allows the master
to control executor creation on the agent.

Also updated the relevant message handlers and mock functions.

Review: https://reviews.apache.org/r/65445/
{code}

> Slave should send exited executor message when the executor is never launched.
> --
>
> Key: MESOS-1720
> URL: https://issues.apache.org/jira/browse/MESOS-1720
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, master
>Reporter: Benjamin Mahler
>Assignee: Meng Zhu
>Priority: Major
>  Labels: mesosphere
>
> When the slave sends TASK_LOST before launching an executor for a task, the 
> slave does not send an exited executor message to the master.
> Since the master receives no exited executor message, it still thinks the 
> executor's resources are consumed on the slave.
> One possible fix for this would be to send the exited executor message to the 
> master in these cases.





[jira] [Commented] (MESOS-8565) Persistent volumes are not visible in Mesos UI when launching a pod using default executor.

2018-02-14 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363651#comment-16363651
 ] 

Gilbert Song commented on MESOS-8565:
-

Please note that this fix addresses the issue by sharing the persistent volume 
with the nested containers through a `SANDBOX_PATH` volume that the framework 
defines for each nested container, so that the persistent volume shows up in 
the Mesos UI.

However, there is a related *limitation*: when only a `SANDBOX_PATH` volume is 
specified for a nested container (no persistent volume or any other volume is 
specified at the executor container level), the pure `SANDBOX_PATH` volume is 
not yet reflected in the UI. We should create a separate Jira for this case. 
/cc [~qianzhang]

> Persistent volumes are not visible in Mesos UI when launching a pod using 
> default executor.
> ---
>
> Key: MESOS-8565
> URL: https://issues.apache.org/jira/browse/MESOS-8565
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.2, 1.3.1, 1.4.1
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.6.0, 1.5.1
>
>
> When user launches a pod to use a persistent volume in DC/OS, the nested 
> containers in the pod can access the PV successfully and the PV directory of 
> the executor shown in Mesos UI has all the contents written by the tasks, but 
> the PV directory of the tasks shown in DC/OS UI and Mesos UI is empty.


