[jira] [Assigned] (MESOS-4965) Support resizing of an existing persistent volume

2018-03-02 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li reassigned MESOS-4965:


Assignee: Zhitao Li

> Support resizing of an existing persistent volume
> -
>
> Key: MESOS-4965
> URL: https://issues.apache.org/jira/browse/MESOS-4965
> Project: Mesos
>  Issue Type: Improvement
>  Components: storage
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Major
>  Labels: mesosphere, persistent-volumes, storage
>
> We need a mechanism to update the size of a persistent volume.
> The increase case is generally more interesting to us (as long as there is 
> still available disk resource on the same disk).
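The increase-case condition above can be sketched as a feasibility check. This is a minimal illustration only; the function and parameter names are hypothetical and not part of the Mesos API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not Mesos API): a grow request for a persistent
// volume is only satisfiable if the extra space is still available as
// unreserved disk on the same disk.
bool canGrowVolume(uint64_t currentMb,
                   uint64_t requestedMb,
                   uint64_t freeOnDiskMb)
{
  // Shrinking (or a no-op) needs no extra space; growing needs headroom.
  if (requestedMb <= currentMb) {
    return true;
  }
  return requestedMb - currentMb <= freeOnDiskMb;
}
```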



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8488) Docker bug can cause unkillable tasks.

2018-03-02 Thread Andrew Schwartzmeyer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384432#comment-16384432
 ] 

Andrew Schwartzmeyer edited comment on MESOS-8488 at 3/3/18 2:17 AM:
-

Commit 1daf6cb03
Author: Akash Gupta akash-gu...@hotmail.com
Date:   Sun Feb 25 13:37:42 2018 -0800
Windows: Fixed flaky Docker command health check test.

The `DockerContainerizerHealthCheckTest.ROOT_DOCKER_
DockerHealthStatusChange` test was flaky on Windows, because
the Docker executor manually reaps the container exit code in
case that `docker run` fails to get the exit code. This logic
doesn't work on Windows, since the process might not be visible to
the container host machine, causing `TASK_FAILED` to get sent. By
removing the reaping logic on Windows, the test is much more reliable.

Review: https://reviews.apache.org/r/65733/


was (Author: andschwa):
{noformat}
commit 1daf6cb03
Author: Akash Gupta akash-gu...@hotmail.com
Date:   Sun Feb 25 13:37:42 2018 -0800
Windows: Fixed flaky Docker command health check test.

The `DockerContainerizerHealthCheckTest.ROOT_DOCKER_
DockerHealthStatusChange` test was flaky on Windows, because
the Docker executor manually reaps the container exit code in
case that `docker run` fails to get the exit code. This logic
doesn't work on Windows, since the process might not be visible to
the container host machine, causing `TASK_FAILED` to get sent. By
removing the reaping logic on Windows, the test is much more reliable.

Review: https://reviews.apache.org/r/65733/
{noformat}

> Docker bug can cause unkillable tasks.
> --
>
> Key: MESOS-8488
> URL: https://issues.apache.org/jira/browse/MESOS-8488
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Affects Versions: 1.5.0
>Reporter: Greg Mann
>Assignee: Qian Zhang
>Priority: Major
>  Labels: mesosphere
> Fix For: 1.6.0
>
>
> Due to an [issue on the Moby 
> project|https://github.com/moby/moby/issues/33820], it's possible for Docker 
> versions 1.13 and later to fail to catch a container exit, so that the 
> {{docker run}} command which was used to launch the container will never 
> return. This can lead to the Docker executor becoming stuck in a state where 
> it believes the container is still running and cannot be killed.
> We should update the Docker executor to ensure that containers stuck in such 
> a state cannot cause unkillable Docker executors/tasks.
> One way to do this would be a timeout, after which the Docker executor will 
> commit suicide if a kill task attempt has not succeeded. However, if we do 
> this we should also ensure that in the case that the container was actually 
> still running, either the Docker daemon or the DockerContainerizer would 
> clean up the container when it does exit.
> Another option might be for the Docker executor to directly {{wait()}} on the 
> container's Linux PID, in order to notice when the container exits.
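The timeout option described in the issue could be sketched as a small decision helper. This is a hypothetical illustration of the escalation logic, not the Docker executor's actual code.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical sketch (not the actual Docker executor): decide whether
// the executor should give up on a graceful kill and commit suicide,
// based on how long the kill attempt has been outstanding.
enum class KillAction { WaitLonger, CommitSuicide };

KillAction escalateKill(std::chrono::seconds sinceKillSent,
                        std::chrono::seconds killTimeout)
{
  // If the kill has not produced a terminal status update within the
  // timeout, assume the Docker daemon lost track of the container and
  // exit, so the task cannot stay unkillable forever.
  return sinceKillSent >= killTimeout ? KillAction::CommitSuicide
                                      : KillAction::WaitLonger;
}
```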



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8488) Docker bug can cause unkillable tasks.

2018-03-02 Thread Andrew Schwartzmeyer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384432#comment-16384432
 ] 

Andrew Schwartzmeyer commented on MESOS-8488:
-

{noformat}
commit 1daf6cb03
Author: Akash Gupta akash-gu...@hotmail.com
Date:   Sun Feb 25 13:37:42 2018 -0800
Windows: Fixed flaky Docker command health check test.

The `DockerContainerizerHealthCheckTest.ROOT_DOCKER_
DockerHealthStatusChange` test was flaky on Windows, because
the Docker executor manually reaps the container exit code in
case that `docker run` fails to get the exit code. This logic
doesn't work on Windows, since the process might not be visible to
the container host machine, causing `TASK_FAILED` to get sent. By
removing the reaping logic on Windows, the test is much more reliable.

Review: https://reviews.apache.org/r/65733/
{noformat}

> Docker bug can cause unkillable tasks.
> --
>
> Key: MESOS-8488
> URL: https://issues.apache.org/jira/browse/MESOS-8488
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Affects Versions: 1.5.0
>Reporter: Greg Mann
>Assignee: Qian Zhang
>Priority: Major
>  Labels: mesosphere
> Fix For: 1.6.0
>
>
> Due to an [issue on the Moby 
> project|https://github.com/moby/moby/issues/33820], it's possible for Docker 
> versions 1.13 and later to fail to catch a container exit, so that the 
> {{docker run}} command which was used to launch the container will never 
> return. This can lead to the Docker executor becoming stuck in a state where 
> it believes the container is still running and cannot be killed.
> We should update the Docker executor to ensure that containers stuck in such 
> a state cannot cause unkillable Docker executors/tasks.
> One way to do this would be a timeout, after which the Docker executor will 
> commit suicide if a kill task attempt has not succeeded. However, if we do 
> this we should also ensure that in the case that the container was actually 
> still running, either the Docker daemon or the DockerContainerizer would 
> clean up the container when it does exit.
> Another option might be for the Docker executor to directly {{wait()}} on the 
> container's Linux PID, in order to notice when the container exits.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8633) Investigate using HTTP server library (e.g. h2o) in libprocess.

2018-03-02 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-8633:
--

 Summary: Investigate using HTTP server library (e.g. h2o) in 
libprocess.
 Key: MESOS-8633
 URL: https://issues.apache.org/jira/browse/MESOS-8633
 Project: Mesos
  Issue Type: Task
  Components: libprocess
Reporter: Benjamin Mahler


Currently, libprocess provides its own HTTP server implementation, leveraging 
libraries (e.g. http-parser) where possible to simplify it. However, it would be 
even simpler from a maintainability perspective to leverage a complete HTTP 
server library.

https://github.com/h2o/h2o/ is a high performance HTTP server that can be used 
as a library. This would also provide a significant performance benefit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-5460) Add HDFS support in Windows builds.

2018-03-02 Thread Jeff Coffler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Coffler reassigned MESOS-5460:
---

Assignee: (was: Jeff Coffler)

> Add HDFS support in Windows builds.
> ---
>
> Key: MESOS-5460
> URL: https://issues.apache.org/jira/browse/MESOS-5460
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrew Schwartzmeyer
>Priority: Minor
>  Labels: agent, fetcher, mesosphere, windows
>
> Right now we have a bunch of #ifdefs throughout the codebase around the HDFS 
> code, because Windows doesn't support it. We should explore adding support 
> for (e.g.) fetching from HDFS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-5460) Add HDFS support in Windows builds.

2018-03-02 Thread Jeff Coffler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384399#comment-16384399
 ] 

Jeff Coffler commented on MESOS-5460:
-

More information from Joe:

The Hadoop (soft) dependency is a catch-all in case the URI is not supported by 
other fetch methods (like HTTP). We used to see this used for `hdfs://` and 
`s3://` URIs. There are definitely ways to fetch these URIs without the Hadoop 
client.

We will defer this for now.

> Add HDFS support in Windows builds.
> ---
>
> Key: MESOS-5460
> URL: https://issues.apache.org/jira/browse/MESOS-5460
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrew Schwartzmeyer
>Assignee: Jeff Coffler
>Priority: Minor
>  Labels: agent, fetcher, mesosphere, windows
>
> Right now we have a bunch of #ifdefs throughout the codebase around the HDFS 
> code, because Windows doesn't support it. We should explore adding support 
> for (e.g.) fetching from HDFS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8576) Improve discard handling of 'Docker::inspect()'

2018-03-02 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384340#comment-16384340
 ] 

Gilbert Song commented on MESOS-8576:
-

{noformat}
commit 8346ab0c812559ef73e1bbd30718f6c74a023079
Author: Greg Mann
Date:   Fri Mar 2 15:39:58 2018 -0800

    Avoided orphan subprocess in the Docker library.

    This patch ensures that `Docker::inspect` will not leave orphan
    subprocesses behind.

    Review: https://reviews.apache.org/r/65887/
{noformat}

> Improve discard handling of 'Docker::inspect()'
> ---
>
> Key: MESOS-8576
> URL: https://issues.apache.org/jira/browse/MESOS-8576
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, docker
>Affects Versions: 1.5.0
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Major
>  Labels: mesosphere
> Fix For: 1.6.0
>
>
> In the call path of {{Docker::inspect()}}, each continuation currently checks 
> if {{promise->future().hasDiscard()}}, where the {{promise}} is associated 
> with the output of the {{docker inspect}} call. However, if the call to 
> {{docker inspect}} becomes hung indefinitely, then continuations are never 
> invoked, and a subsequent discard of the returned {{Future}} will have no 
> effect. We should add proper {{onDiscard}} handling to that {{Future}} so 
> that appropriate cleanup is performed in such cases.
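The {{onDiscard}} idea can be illustrated with a stripped-down future type. This is a sketch of the concept only; libprocess's real {{Future}}/{{Promise}} API differs.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Illustrative only (not the libprocess API): registering an onDiscard
// callback lets cleanup (e.g. killing a hung `docker inspect`
// subprocess) run even though no continuation will ever fire.
class Future {
public:
  void onDiscard(std::function<void()> cb)
  {
    callbacks.push_back(std::move(cb));
  }

  void discard()
  {
    discarded = true;
    for (auto& cb : callbacks) {
      cb();  // Run registered cleanup immediately on discard.
    }
  }

  bool hasDiscard() const { return discarded; }

private:
  bool discarded = false;
  std::vector<std::function<void()>> callbacks;
};
```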



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8632) Add unit tests which exercise resources dynamically

2018-03-02 Thread Andrew Schwartzmeyer (JIRA)
Andrew Schwartzmeyer created MESOS-8632:
---

 Summary: Add unit tests which exercise resources dynamically
 Key: MESOS-8632
 URL: https://issues.apache.org/jira/browse/MESOS-8632
 Project: Mesos
  Issue Type: Task
Reporter: Andrew Schwartzmeyer
Assignee: Andrew Schwartzmeyer


As discovered in MESOS-8631, we lack tests which exercise the actual reported 
resources of a machine. We want a test that:

(1) Launches an agent that auto-detects resources
(2) Is parameterized to use different sets of isolators
(3) Runs a task that accepts the offered resources and uses all of them



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6422) cgroups_tests not correctly tearing down testing hierarchies

2018-03-02 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384307#comment-16384307
 ] 

Yan Xu commented on MESOS-6422:
---

Sorry this is low priority for me right now so I am unassigning.

> cgroups_tests not correctly tearing down testing hierarchies
> 
>
> Key: MESOS-6422
> URL: https://issues.apache.org/jira/browse/MESOS-6422
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups, containerization
>Reporter: Yan Xu
>Assignee: Yan Xu
>Priority: Minor
>  Labels: cgroups
>
> We currently do the following in 
> [CgroupsTest::TearDownTestCase()|https://github.com/apache/mesos/blob/5e850a362edbf494921fedff4037cf4b53088c10/src/tests/containerizer/cgroups_tests.cpp#L83]
> {code:title=}
> static void TearDownTestCase()
> {
>   AWAIT_READY(cgroups::cleanup(TEST_CGROUPS_HIERARCHY));
> }
> {code}
> One of its derived test {{CgroupsNoHierarchyTest}} treats 
> {{TEST_CGROUPS_HIERARCHY}} as a hierarchy so it's able to clean it up as a 
> hierarchy.
> However another derived test {{CgroupsAnyHierarchyTest}} would create new 
> hierarchies (if none is available) using {{TEST_CGROUPS_HIERARCHY}} as a 
> parent directory (i.e., base hierarchy) and not as a hierarchy, so when it's 
> time to clean up, it fails:
> {noformat:title=}
> [   OK ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_Subsystems (1 ms)
> ../../src/tests/containerizer/cgroups_tests.cpp:88: Failure
> (cgroups::cleanup(TEST_CGROUPS_HIERARCHY)).failure(): Operation not permitted
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-6422) cgroups_tests not correctly tearing down testing hierarchies

2018-03-02 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu reassigned MESOS-6422:
-

Assignee: (was: Yan Xu)

> cgroups_tests not correctly tearing down testing hierarchies
> 
>
> Key: MESOS-6422
> URL: https://issues.apache.org/jira/browse/MESOS-6422
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups, containerization
>Reporter: Yan Xu
>Priority: Minor
>  Labels: cgroups
>
> We currently do the following in 
> [CgroupsTest::TearDownTestCase()|https://github.com/apache/mesos/blob/5e850a362edbf494921fedff4037cf4b53088c10/src/tests/containerizer/cgroups_tests.cpp#L83]
> {code:title=}
> static void TearDownTestCase()
> {
>   AWAIT_READY(cgroups::cleanup(TEST_CGROUPS_HIERARCHY));
> }
> {code}
> One of its derived test {{CgroupsNoHierarchyTest}} treats 
> {{TEST_CGROUPS_HIERARCHY}} as a hierarchy so it's able to clean it up as a 
> hierarchy.
> However another derived test {{CgroupsAnyHierarchyTest}} would create new 
> hierarchies (if none is available) using {{TEST_CGROUPS_HIERARCHY}} as a 
> parent directory (i.e., base hierarchy) and not as a hierarchy, so when it's 
> time to clean up, it fails:
> {noformat:title=}
> [   OK ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_Subsystems (1 ms)
> ../../src/tests/containerizer/cgroups_tests.cpp:88: Failure
> (cgroups::cleanup(TEST_CGROUPS_HIERARCHY)).failure(): Operation not permitted
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8631) Can't start a task with every CPU on a Windows machine

2018-03-02 Thread Andrew Schwartzmeyer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384301#comment-16384301
 ] 

Andrew Schwartzmeyer commented on MESOS-8631:
-

I believe this should be back-ported to Mesos 1.5.1.

> Can't start a task with every CPU on a Windows machine
> --
>
> Key: MESOS-8631
> URL: https://issues.apache.org/jira/browse/MESOS-8631
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.5.0
> Environment: Windows 10 using the CPU isolator
>Reporter: Andrew Schwartzmeyer
>Assignee: Andrew Schwartzmeyer
>Priority: Critical
>  Labels: cpu, isolator, windows
> Fix For: 1.5.1
>
>
> We have an edge case that existing unit tests don't cover: starting a single 
> task which consumes every reported CPU. The problem is that the executor 
> overcommits by 0.1 CPUs, so when we go to set the CPU limit on, say, a 12-core 
> machine, we set it to 12.1/12 * 10,000 CPU cycles, which results in an invalid 
> number of cycles (10,083, but the max is 10,000).
> We were bounds-checking the minimum (1 cycle) but not the maximum, as the 
> function did not expect the user to request more CPUs than available.
> We'll fix this by checking the max bound. So if, for example, 12.1 CPUs are 
> requested, we set the limit to 10,000 CPU cycles, no more.
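The fix amounts to clamping the computed cycle count on both ends. A standalone sketch with illustrative names (the real Windows isolator code differs):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Illustrative constants: the rate is expressed in cycles per 10,000.
const uint64_t kMinCycles = 1;
const uint64_t kMaxCycles = 10000;

uint64_t cpuLimitCycles(double requestedCpus, double totalCpus)
{
  // 12.1 requested CPUs on a 12-core machine would yield
  // 12.1 / 12 * 10,000 = 10,083 cycles without the upper clamp.
  uint64_t cycles =
    static_cast<uint64_t>(requestedCpus / totalCpus * 10000);

  // Enforce both the minimum and the previously missing maximum bound.
  return std::min(std::max(cycles, kMinCycles), kMaxCycles);
}
```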



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8631) Can't start a task with every CPU on a Windows machine

2018-03-02 Thread Andrew Schwartzmeyer (JIRA)
Andrew Schwartzmeyer created MESOS-8631:
---

 Summary: Can't start a task with every CPU on a Windows machine
 Key: MESOS-8631
 URL: https://issues.apache.org/jira/browse/MESOS-8631
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.5.0
 Environment: Windows 10 using the CPU isolator
Reporter: Andrew Schwartzmeyer
Assignee: Andrew Schwartzmeyer
 Fix For: 1.5.1


We have an edge case that existing unit tests don't cover: starting a single 
task which consumes every reported CPU. The problem is that the executor 
overcommits by 0.1 CPUs, so when we go to set the CPU limit on, say, a 12-core 
machine, we set it to 12.1/12 * 10,000 CPU cycles, which results in an invalid 
number of cycles (10,083, but the max is 10,000).

We were bounds-checking the minimum (1 cycle) but not the maximum, as the 
function did not expect the user to request more CPUs than available.

We'll fix this by checking the max bound. So if, for example, 12.1 CPUs are 
requested, we set the limit to 10,000 CPU cycles, no more.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8619) Docker on Windows uses USERPROFILE instead of HOME for credentials

2018-03-02 Thread Andrew Schwartzmeyer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383927#comment-16383927
 ] 

Andrew Schwartzmeyer edited comment on MESOS-8619 at 3/2/18 6:21 PM:
-

Review: https://reviews.apache.org/r/65872


was (Author: andschwa):
{noformat}
commit 9bd5d8f9b (HEAD -> master, apache/master)
Author: Andrew Schwartzmeyer 
Date:   Wed Feb 28 16:45:49 2018 -0800

Windows: Fixed location of Docker's `config.json` file.

Per MESOS-8619, Docker checks `$USERPROFILE/.docker/config.json`
instead of `$HOME`. Mesos overrides this environment variable in order
to point Docker to a `config.json` file in another location, so we
have to fix the assumption we made about Docker.

We do not add this constant to stout, because it is not consistent
across Windows applications. This particular logic is specific to the
implementation of Docker. Other applications might check `$HOME` or
`$HOMEPATH` on Windows.

Review: https://reviews.apache.org/r/65872
{noformat}

> Docker on Windows uses USERPROFILE instead of HOME for credentials
> --
>
> Key: MESOS-8619
> URL: https://issues.apache.org/jira/browse/MESOS-8619
> Project: Mesos
>  Issue Type: Bug
> Environment: Windows 10 with Docker version 17.12.0-ce, build c97c6d6.
>Reporter: Andrew Schwartzmeyer
>Assignee: Andrew Schwartzmeyer
>Priority: Major
>  Labels: docker, windows
> Fix For: 1.6.0
>
>
> The logic for doing a {{docker pull}} of an image for a private registry 
> assumes that the {{.docker/config.json}} is to be found in {{$HOME}} 
> (according to the [Mesosphere 
> instructions|https://mesosphere.github.io/marathon/docs/native-docker-private-registry.html#docker-containerizer]
>  and the 
> [code|https://github.com/apache/mesos/blob/b7933c176d719766bdb6459048ede6e94f6a7763/src/docker/docker.cpp#L1710]).
> However, this assumption was only true for Linux per the [Docker 
> code|https://github.com/moby/moby/blob/3a633a712c8bbb863fe7e57ec132dd87a9c4eff7/pkg/homedir/homedir_unix.go#L14],
>  but on Windows, Docker explicitly looks at the {{USERPROFILE}} environment 
> variable, again [per the Docker 
> code|https://github.com/moby/moby/blob/3a633a712c8bbb863fe7e57ec132dd87a9c4eff7/pkg/homedir/homedir_windows.go#L10].
> So in order for Docker to pick up the config file correctly, we need to 
> change the variable used on Windows in the Docker containerizer.
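The platform split can be sketched with a small helper (illustrative only, not the actual patch):

```cpp
#include <cassert>
#include <string>

// Illustrative helper: Docker resolves `.docker/config.json` relative
// to $USERPROFILE on Windows and $HOME everywhere else, so Mesos must
// override the right variable when pointing Docker at its own
// config.json location.
std::string dockerHomeVariable()
{
#ifdef _WIN32
  return "USERPROFILE";
#else
  return "HOME";
#endif
}
```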



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8630) All subsequent registry operations fail after the registrar is aborted after a failed update

2018-03-02 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8630:
-

 Summary: All subsequent registry operations fail after the 
registrar is aborted after a failed update
 Key: MESOS-8630
 URL: https://issues.apache.org/jira/browse/MESOS-8630
 Project: Mesos
  Issue Type: Bug
  Components: master
Reporter: Yan Xu


Failure to update the registry always aborts the registrar but doesn't always 
abort the master process.

When the registrar fails to update the registry, it aborts the actor and fails 
all future operations. The rationale, as explained here: 
[https://github.com/apache/mesos/commit/5eaf1eb346fc2f46c852c1246bdff12a89216b60]
{quote}In this event, the Master won't commit suicide until the initial
 failure is processed. However, in the interim, subsequent operations
 are potentially being performed against the Registrar. This could lead
 to fighting between masters if a "demoted" master re-attempts to
 acquire log-leadership!
{quote}
However, when a registry update is requested by an operator API (maintenance, 
quota update, etc.), the master process doesn't shut down (a 500 error is 
returned to the client instead) and all subsequent operations will fail!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8240) Add an option to build the new CLI and run unit tests.

2018-03-02 Thread Armand Grillet (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383600#comment-16383600
 ] 

Armand Grillet commented on MESOS-8240:
---

https://reviews.apache.org/r/65705/

> Add an option to build the new CLI and run unit tests.
> --
>
> Key: MESOS-8240
> URL: https://issues.apache.org/jira/browse/MESOS-8240
> Project: Mesos
>  Issue Type: Improvement
>  Components: cli
>Reporter: Armand Grillet
>Assignee: Armand Grillet
>Priority: Major
>
> An update of the discarded [https://reviews.apache.org/r/52543/]
> Also needs to be available for CMake.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-5158) Provide XFS quota support for persistent volumes.

2018-03-02 Thread Harold Dost III (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383595#comment-16383595
 ] 

Harold Dost III commented on MESOS-5158:


Hey [~jamespeach],

On a related note, one of the things mentioned in the source is that the 
isolator does not necessarily know whether the persistent volume is available. 
Should we consider making some general utilities, inspired by the ones in the 
XFS isolator, that could be refactored so that they may be used by frameworks 
and isolators alike?

It sounds like right now there is no reference count of the number of 
containers accessing a persistent volume. Maybe we would want some sort of 
reference counting so that volumes, in essence, manage themselves. I am still 
learning the lifecycles of some of the different components, but I think it may 
be better, when persistent volumes are created ({{src/common/sources.cpp}}), to 
allocate a project ID for the directory, and then have tasks using the same 
volume increase its usage count. As containers are cleaned up, reduce the count.
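The reference-counting idea could be sketched as follows. All names here are hypothetical; none of them exist in Mesos, and the real design would also have to persist the counts across agent restarts.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical sketch: each persistent volume is assigned an XFS
// project ID; containers using the volume bump a count on attach, and
// the project ID (and its quota) is only reclaimed when the last
// container detaches. Assumes attach/detach calls are balanced.
class ProjectRefCounter {
public:
  void attach(uint32_t projectId) { ++counts[projectId]; }

  // Returns true when the last user detached and the project ID (and
  // its XFS quota) can be reclaimed.
  bool detach(uint32_t projectId)
  {
    if (--counts[projectId] == 0) {
      counts.erase(projectId);
      return true;
    }
    return false;
  }

private:
  std::map<uint32_t, uint32_t> counts;
};
```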

> Provide XFS quota support for persistent volumes.
> -
>
> Key: MESOS-5158
> URL: https://issues.apache.org/jira/browse/MESOS-5158
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Yan Xu
>Priority: Major
>
> Given that the lifecycle of persistent volumes is managed outside of the 
> isolator, we may need to further abstract out the quota management 
> functionality to do it outside the XFS isolator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8158) Mesos Agent in docker neglects to retry discovering Task docker containers

2018-03-02 Thread Stefan Eder (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383511#comment-16383511
 ] 

Stefan Eder commented on MESOS-8158:


We experience the same issue with Mesos 1.4.1 (it did not work with previous 
versions, e.g. 1.0 and 1.3, either).

Mesos agent started like:
{noformat}
ExecStart=/usr/bin/docker run \
  --name=mesos_agent \
  --net=host \
  --pid=host \ # does not work without this as well
  --privileged \
  -v /cgroup:/cgroup \
  -v /sys:/sys \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /mnt/data/mesos/slave/work:/mnt/data/mesos/slave/work \
  -v /mnt/data/mesos/slave/logs:/mnt/data/mesos/slave/logs \
    infonova/mesos-agent:1.4.1-docker-17.09.0-ce \
  --ip=10.0.0.100 \
  --logging_level=INFO \
  --advertise_ip=10.0.0.100 \
  --port=5051 \
  --advertise_port=5051 \
  --master=zk://10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181/MesosDevelopment \
  --containerizers=docker \
  --recover=reconnect \
  --resources=cpus:6;mem:45000;ports:[8000-9000] \
  --log_dir=/mnt/data/mesos/slave/logs \
  --work_dir=/mnt/data/mesos/slave/work \
  --docker_remove_delay=10mins \
  --executor_registration_timeout=20mins \
  --executor_shutdown_grace_period=10mins \
  --hostname=mesos-dev-agent-01 \
  
--attributes=host:mesos-dev-agent-01;type:openstacknode;dockerfs:overlayfs \
  --credential=file:///etc/mesos-slave/password \
  --recovery_timeout=24hrs \
  --docker_config=file:///home/jenkins/.docker/config.json \
  --no-systemd_enable_support \
  --docker_mesos_image=infonova/mesos:1.4.1-docker-17.09.0-ce{noformat}
Logs of mesos-agent indicate that it does not find the actual task container:
{noformat}
I0302 09:33:34.609871 6 docker.cpp:1136] Starting container 
'f00102b3-f1bd-4575-86fd-769f198db674' for task 'jenkins-mesos-agent-01' (and 
executor 'jenkins-mesos-agent-01') of framework 
806cbd1f-6dcf-43b8-b8ed-682995a66dbe-0012
I0302 09:33:35.321282 7 slave.cpp:3928] Got registration for executor 
'jenkins-mesos-agent-01' of framework 806cbd1f-6dcf-43b8-b8ed-682995a66dbe-0012 
from executor(1)@10.0.0.100:45619
I0302 09:33:35.322042    11 docker.cpp:1616] Ignoring updating container 
f00102b3-f1bd-4575-86fd-769f198db674 because resources passed to update are 
identical to existing resources
I0302 09:33:35.322141    11 slave.cpp:2598] Sending queued task 
'jenkins-mesos-agent-01' to executor 'jenkins-mesos-agent-01' of framework 
806cbd1f-6dcf-43b8-b8ed-682995a66dbe-0012 at executor(1)@10.0.0.100:45619
E0302 09:33:56.000396    10 slave.cpp:5285] Container 
'f00102b3-f1bd-4575-86fd-769f198db674' for executor 'jenkins-mesos-agent-01' of 
framework 806cbd1f-6dcf-43b8-b8ed-682995a66dbe-0012 failed to start: Failed to 
run 'docker -H unix:///var/run/docker.sock inspect 
mesos-f00102b3-f1bd-4575-86fd-769f198db674': exited with status 1; 
stderr='Error: No such object: mesos-f00102b3-f1bd-4575-86fd-769f198db674'
I0302 09:33:56.000726 7 slave.cpp:5398] Executor 'jenkins-mesos-agent-01' 
of framework 806cbd1f-6dcf-43b8-b8ed-682995a66dbe-0012 has terminated with 
unknown status
I0302 09:33:56.000792 7 slave.cpp:4392] Handling status update TASK_FAILED 
(UUID: d5c38f0b-80b8-4b4b-8bf0-34d08196f053) for task jenkins-mesos-agent-01 of 
framework 806cbd1f-6dcf-43b8-b8ed-682995a66dbe-0012 from @0.0.0.0:0
W0302 09:33:56.001066    12 docker.cpp:1603] Ignoring updating unknown 
container f00102b3-f1bd-4575-86fd-769f198db674
I0302 09:33:56.001164    10 status_update_manager.cpp:323] Received status 
update TASK_FAILED (UUID: d5c38f0b-80b8-4b4b-8bf0-34d08196f053) for task 
jenkins-mesos-agent-01 of framework 806cbd1f-6dcf-43b8-b8ed-682995a66dbe-0012
I0302 09:33:56.001391    10 status_update_manager.cpp:834] Checkpointing UPDATE 
for status update TASK_FAILED (UUID: d5c38f0b-80b8-4b4b-8bf0-34d08196f053) for 
task jenkins-mesos-agent-01 of framework 
806cbd1f-6dcf-43b8-b8ed-682995a66dbe-0012
I0302 09:33:56.001545    10 slave.cpp:4873] Forwarding the update TASK_FAILED 
(UUID: d5c38f0b-80b8-4b4b-8bf0-34d08196f053) for task jenkins-mesos-agent-01 of 
framework 806cbd1f-6dcf-43b8-b8ed-682995a66dbe-0012 to master@10.0.0.81:5050
I0302 09:33:56.023033 7 status_update_manager.cpp:395] Received status 
update acknowledgement (UUID: d5c38f0b-80b8-4b4b-8bf0-34d08196f053) for task 
jenkins-mesos-agent-01 of framework 806cbd1f-6dcf-43b8-b8ed-682995a66dbe-0012
I0302 09:33:56.023077 7 status_update_manager.cpp:834] Checkpointing ACK 
for status update TASK_FAILED (UUID: d5c38f0b-80b8-4b4b-8bf0-34d08196f053) for 
task jenkins-mesos-agent-01 of framework 
806cbd1f-6dcf-43b8-b8ed-682995a66dbe-0012
I0302 09:33:56.023187 8 slave.cpp:5509] Cleaning up executor 
'jenkins-mesos-agent-01' of framework 806cbd1f-6dcf-43b8-b8ed-682995a66dbe-0012 
at executor(1)@10.0.0.100:45619{noformat}
Funny thing is though, the 

[jira] [Comment Edited] (MESOS-8190) Update the master to accept OfferOperationIDs from frameworks.

2018-03-02 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270038#comment-16270038
 ] 

Greg Mann edited comment on MESOS-8190 at 3/2/18 8:09 AM:
--

Reviews here:
https://reviews.apache.org/r/63991/
https://reviews.apache.org/r/63992/
https://reviews.apache.org/r/63994/


was (Author: greggomann):
Reviews here:
https://reviews.apache.org/r/63990/
https://reviews.apache.org/r/63991/
https://reviews.apache.org/r/63992/
https://reviews.apache.org/r/63994/

> Update the master to accept OfferOperationIDs from frameworks.
> --
>
> Key: MESOS-8190
> URL: https://issues.apache.org/jira/browse/MESOS-8190
> Project: Mesos
>  Issue Type: Task
>Reporter: Gastón Kleiman
>Assignee: Greg Mann
>Priority: Major
>  Labels: mesosphere
>
> Master’s {{ACCEPT}} handler should send failed operation updates when a 
> framework sets the {{OfferOperationID}} on an operation destined for an agent 
> without the {{RESOURCE_PROVIDER}} capability.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)