[jira] [Commented] (MESOS-6410) Fail to mount persistent volume when run mesos slave in docker

2016-10-19 Thread Lei Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15590575#comment-15590575
 ] 

Lei Xu commented on MESOS-6410:
---

Hi [~haosd...@gmail.com], It's OK now with `--privileged=true`, thanks very 
much.

> Fail to mount persistent volume when run mesos slave in docker
> --
>
> Key: MESOS-6410
> URL: https://issues.apache.org/jira/browse/MESOS-6410
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, volumes
>Affects Versions: 0.28.2
> Environment: Mesos 0.28.2
> Docker 1.12.1
>Reporter: Lei Xu
>Priority: Critical
>
> Here are some error logs from the slave:
> {code}
> E1018 07:52:06.186926 30 slave.cpp:3758] Container 'fbfd5e46-4460-45af-bd64-e03e8664f575' for executor 'storm_nimbus_mpubpushsmart.d60e9066-94ec-11e6-99ff-0242d43b0395' of framework 06ccc047-7137-41ef-a4ac-4090b9cd9e42-0023 failed to start: Failed to mount persistent volume from '/var/lib/mesos/volumes/roles/storm/storm_nimbus_mpubpushsmart#tmp#d60e4245-94ec-11e6-99ff-0242d43b0395' to '/var/lib/mesos/slaves/06ccc047-7137-41ef-a4ac-4090b9cd9e42-S45/frameworks/06ccc047-7137-41ef-a4ac-4090b9cd9e42-0023/executors/storm_nimbus_mpubpushsmart.d60e9066-94ec-11e6-99ff-0242d43b0395/runs/fbfd5e46-4460-45af-bd64-e03e8664f575/tmp': Operation not permitted
> E1018 07:52:09.916877 25 slave.cpp:3758] Container 'bb8ca08b-1cbf-450d-93e2-18a6322cb5be' for executor 'storm_nimbus_mpubpushsmart.d60e9066-94ec-11e6-99ff-0242d43b0395' of framework 06ccc047-7137-41ef-a4ac-4090b9cd9e42-0023 failed to start: Failed to mount persistent volume from '/var/lib/mesos/volumes/roles/storm/storm_nimbus_mpubpushsmart#tmp#d60e4245-94ec-11e6-99ff-0242d43b0395' to '/var/lib/mesos/slaves/06ccc047-7137-41ef-a4ac-4090b9cd9e42-S45/frameworks/06ccc047-7137-41ef-a4ac-4090b9cd9e42-0023/executors/storm_nimbus_mpubpushsmart.d60e9066-94ec-11e6-99ff-0242d43b0395/runs/bb8ca08b-1cbf-450d-93e2-18a6322cb5be/tmp': Operation not permitted
> {code}
> Outside of Docker, the Mesos slave works fine with persistent volumes.





[jira] [Comment Edited] (MESOS-5218) Fetcher should not chown the entire sandbox.

2016-10-19 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15590385#comment-15590385
 ] 

Yan Xu edited comment on MESOS-5218 at 10/20/16 12:54 AM:
--

{noformat:title=}
commit e65b40d48b25ecf45805c8a740a412074da00d1f
Author: Megha Sharma 
Date:   Wed Oct 19 17:34:19 2016 -0700

Fixed a bug that causes the fetcher to not chown the sandbox.

Moved the `uri.size() == 0` check in fetcher so that the chown to
task user of stdout/stderr in sandbox directory happens even when
there is no uri to be fetched.

Review: https://reviews.apache.org/r/52828/

commit 09a1cd10278992360c63a77d2712b9d047ce0e67
Author: Megha Sharma 
Date:   Wed Oct 19 17:37:16 2016 -0700

Fixed fetcher to not recursively chown the entire sandbox.

Fetcher currently changes the ownership of entire sandbox directory
recursively to the task user and as a result also changes the
ownership of files laid down by other entities in the sandbox, which
leads to unintended side-effects.

Review: https://reviews.apache.org/r/52058/
{noformat}


was (Author: xujyan):
{noformat:title=}
commit 09a1cd10278992360c63a77d2712b9d047ce0e67
Author: Megha Sharma 
Date:   Wed Oct 19 17:37:16 2016 -0700

Fixed fetcher to not recursively chown the entire sandbox.

Fetcher currently changes the ownership of entire sandbox directory
recursively to the task user and as a result also changes the
ownership of files laid down by other entities in the sandbox, which
leads to unintended side-effects.

Review: https://reviews.apache.org/r/52058/
{noformat}

> Fetcher should not chown the entire sandbox.
> 
>
> Key: MESOS-5218
> URL: https://issues.apache.org/jira/browse/MESOS-5218
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Reporter: Yan Xu
>Assignee: Megha
> Fix For: 1.2.0
>
>
> The real intention of this action is to make sure all decompressed files are 
> chowned to the task user, but this has side effects if other things are laid 
> down in the sandbox.
> The fetcher should only chown the individual files if they are not to be 
> extracted or it should run the extraction commands as the task user so the 
> files get extracted with the task user as the owner.





[jira] [Commented] (MESOS-6420) Mesos Agent leaking sockets when port mapping network isolator is ON

2016-10-19 Thread Ian Downes (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15590388#comment-15590388
 ] 

Ian Downes commented on MESOS-6420:
---

What about these sockets?
{code}
[1750][idownes:mesos]$ git grep -n "socket(.F_INET" src/linux/routing/
src/linux/routing/link/internal.hpp:144:  int fd = ::socket(AF_INET, SOCK_STREAM, 0);
src/linux/routing/link/link.cpp:258:  int fd = ::socket(AF_INET, SOCK_STREAM, 0);
src/linux/routing/link/link.cpp:332:  int fd = ::socket(AF_INET, SOCK_STREAM, 0);
{code}
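
For illustration only (this is not the Mesos routing code; the guard and function names below are hypothetical), a small RAII wrapper is one way to make sure a {{::socket(AF_INET, SOCK_STREAM, 0)}} fd is closed on every exit path, which is the kind of leak being discussed here:

{code}
// Hypothetical sketch, not Mesos code: an RAII guard that closes a
// socket fd on every return path.
#include <sys/socket.h>
#include <unistd.h>

class SocketGuard
{
public:
  explicit SocketGuard(int fd) : fd_(fd) {}
  ~SocketGuard() { if (fd_ >= 0) { ::close(fd_); } }

  // Non-copyable so the fd is closed exactly once.
  SocketGuard(const SocketGuard&) = delete;
  SocketGuard& operator=(const SocketGuard&) = delete;

  int get() const { return fd_; }

private:
  int fd_;
};

bool probeLink()
{
  SocketGuard fd(::socket(AF_INET, SOCK_STREAM, 0));
  if (fd.get() < 0) {
    return false;
  }

  // ... use fd.get() with ioctl() etc. Early returns no longer leak the fd.
  return true;
}
{code}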

> Mesos Agent leaking sockets when port mapping network isolator is ON
> 
>
> Key: MESOS-6420
> URL: https://issues.apache.org/jira/browse/MESOS-6420
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation, network, slave
>Affects Versions: 1.0.2
>Reporter: Santhosh Shanmugham
>
> Mesos Agent leaks one socket per task launched and eventually runs out of 
> sockets. We were able to track it down to the network isolator 
> (port_mapping.cpp). When we turned off the port mapping isolator, no file 
> descriptors were leaked. The leaked fd is a SOCK_STREAM socket.
> Leaked Sockets:
> $ sudo lsof -p $(pgrep -u root -o -f /usr/local/sbin/mesos-slave) -nP | grep 
> "can't"
> [sudo] password for sshanmugham:
> mesos-sla 57688 root   19u  sock0,6  0t0 2993216948 can't 
> identify protocol
> mesos-sla 57688 root   27u  sock0,6  0t0 2993216468 can't 
> identify protocol
> Extract from strace:
> ...
> [pid 57701] 19:14:02.493718 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494395 close(19)   = 0
> [pid 57701] 19:14:02.494448 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494844 close(19)   = 0
> [pid 57701] 19:14:02.494913 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.495565 close(19)   = 0
> [pid 57701] 19:14:02.495617 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496072 close(19)   = 0
> [pid 57701] 19:14:02.496128 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496758 close(19)   = 0
> [pid 57701] 19:14:02.496812 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497270 close(19)   = 0
> [pid 57701] 19:14:02.497319 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497698 close(19)   = 0
> [pid 57701] 19:14:02.497750 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498407 close(19)   = 0
> [pid 57701] 19:14:02.498456 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498899 close(19)   = 0
> [pid 57701] 19:14:02.498963 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 63682] 19:14:02.499091 close(18 
> [pid 57701] 19:14:02.499634 close(19)   = 0
> [pid 57701] 19:14:02.499689 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500044 close(19)   = 0
> [pid 57701] 19:14:02.500093 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500734 close(19)   = 0
> [pid 57701] 19:14:02.500782 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.501271 close(19)   = 0
> [pid 57701] 19:14:02.501339 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.502030 close(19)   = 0
> [pid 57701] 19:14:02.502101 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 19
> ...
> ...
> [pid 57691] 19:18:03.461022 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461345 open("/etc/selinux/config", O_RDONLY  ...>
> [pid 57691] 19:18:03.461460 close(27)   = 0
> [pid 57691] 19:18:03.461520 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461632 close(3 
> [pid  6138] 19:18:03.461781 open("/proc/mounts", O_RDONLY 
> [pid  6138] 19:18:03.462190 close(3 
> [pid 57691] 19:18:03.462374 close(27)   = 0
> [pid 57691] 19:18:03.462430 socket(PF_NETLINK, SOCK_RAW, 0 
> [pid  6138] 19:18:03.462456 open("/proc/net/psched", O_RDONLY 
> [pid  6138] 19:18:03.462678 close(3 
> [pid  6138] 19:18:03.462915 open("/etc/libnl/classid", O_RDONLY  ...>
> [pid 57691] 19:18:03.463046 close(27)   = 0
> [pid 57691] 19:18:03.463111 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.463225 close(3 
> [pid 57691] 19:18:03.463845 close(27)   = 0
> [pid 57691] 19:18:03.463911 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.464604 close(27)   = 0
> [pid 57691] 19:18:03.464664 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465074 close(27)   = 0
> [pid 57691] 19:18:03.465132 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465862 close(27)   = 0
> [pid 57691] 19:18:03.465928 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.466713 close(27)   = 0
> [pid 57691] 19:18:03.466780 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.467472 close(27)   = 0
> [pid 57691] 19:18:03.467524 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.468012 close(27)   = 0
> [pid 57691] 19:18:03.468075 socket(PF_NETLINK, SOCK_RAW, 0) = 27

[jira] [Created] (MESOS-6423) Establish error message guidelines in the style guide.

2016-10-19 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-6423:
--

 Summary: Establish error message guidelines in the style guide.
 Key: MESOS-6423
 URL: https://issues.apache.org/jira/browse/MESOS-6423
 Project: Mesos
  Issue Type: Improvement
  Components: documentation, technical debt
Reporter: Benjamin Mahler


We currently have a "pattern" for writing error messages that enables 
composition. The rule for synchronous error composition is as follows:

{code}
Try<FileDescriptor> open(string path)
{
  return Error("File not found"));
}

Try read(int fd)
{
  return Error("Invalid file descriptor");
}


{
  Try<FileDescriptor> open = ::open("path");
  
  if (open.isError()) {
return Error("Failed to open 'path': " + open.error());
  }

  FileDescriptor fd = open.get();

  Try read = ::read(fd);

  if (read.isError()) {
return Error("Failed to read from file at 'path' (fd " + stringify(fd) + 
"): " + read.error());
  }

  return read.get();
}
{code}

This leads to the following error messages:

{code}
Failed to open 'path': File not found
Failed to read from opened file at 'path' (fd 4): Invalid file descriptor
{code}

The pattern in use for error messages is:

* Callees do not include caller-available context (i.e., the arguments provided), 
as this can easily lead to double logging. That is, callees only include the 
reason that they are surfacing an error.
* Callers add their context to the reason provided by the callee.

This ticket is to document the pattern in use for synchronous code paths.

Note that this pattern can't be used with the design of {{Future::then}}. We 
need to determine how to provide error composition for Futures as well. See [my 
comment|https://issues.apache.org/jira/browse/MESOS-785?focusedCommentId=15590263=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15590263]
 in MESOS-785 for additional context.





[jira] [Commented] (MESOS-785) Extend stout/try to support functional composition

2016-10-19 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15590263#comment-15590263
 ] 

Benjamin Mahler commented on MESOS-785:
---

(Some additional context from a discussion a while back with [~benjaminhindman])

In order to preserve our error composition technique, [~benjaminhindman] had 
shown me an approach like this:

{code}
Try<T> t;

Try<X> x = t.bind(
  [=](const T& t) { return t.toX(); },
  [=](const Error& error) { return Error("Additional context: " + error.message); }
);
{code}

We were trying to avoid the issues around error message composition that 
currently exist with Future chaining (where the design of .then prevents the 
caller from adding additional error message context).
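
To make the shape of that API concrete, here is a small self-contained sketch. This is not stout's actual {{Try}}; the {{Result}} type and its method names are illustrative only, but it shows a {{bind}} that takes both a success continuation and an error continuation, so the caller can prepend its own context to the callee's error message:

{code}
// Illustrative only: a Try-like type with a two-continuation bind.
#include <functional>
#include <iostream>
#include <string>

template <typename T>
class Result
{
public:
  static Result success(T value) { return Result(std::move(value), "", false); }
  static Result error(std::string message) { return Result(T(), std::move(message), true); }

  bool isError() const { return failed_; }
  const T& get() const { return value_; }
  const std::string& message() const { return message_; }

  // Apply `onSuccess` if a value is held; otherwise let `onError` rewrap
  // the error, e.g. to prepend caller context.
  template <typename U>
  Result<U> bind(
      const std::function<Result<U>(const T&)>& onSuccess,
      const std::function<Result<U>(const std::string&)>& onError) const
  {
    return failed_ ? onError(message_) : onSuccess(value_);
  }

private:
  Result(T value, std::string message, bool failed)
    : value_(std::move(value)), message_(std::move(message)), failed_(failed) {}

  T value_;
  std::string message_;
  bool failed_;
};

int main()
{
  Result<int> t = Result<int>::error("File not found");

  Result<std::string> x = t.bind<std::string>(
      [](const int& v) { return Result<std::string>::success(std::to_string(v)); },
      [](const std::string& e) {
        return Result<std::string>::error("Additional context: " + e);
      });

  std::cout << x.message() << std::endl;  // "Additional context: File not found"
  return 0;
}
{code}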

> Extend stout/try to support functional composition
> --
>
> Key: MESOS-785
> URL: https://issues.apache.org/jira/browse/MESOS-785
> Project: Mesos
>  Issue Type: Improvement
>  Components: stout
>Reporter: Ian Downes
>Priority: Minor
>  Labels: c++11
>
> The motivating example was fetching a list of URIs, where each 'fetch' is actually 
> a sequence of operations: fetch the URI to a local file, pass the local file 
> on to chmod or extract, and then chown the file or the directory. Individual 
> operations needn't be asynchronous.
> This can be written using Futures but is potentially confusing when the code 
> is actually synchronous.
> {code}
> Future fetch(
> const CommandInfo& commandInfo,
> const string& directory,
> const HDFS& hdfs,
> const Option& frameworks_home,
> const Option& user)
> {
>   foreach (const CommandInfo::URI& uri, commandInfo.uris()) {
> bool executable = uri.has_executable() && uri.executable();
> // This code is synchronous!
> Future result =
> _fetch(uri, directory, hdfs, frameworks_home)
>   .then(lambda::bind(_chmod, lambda::_1, directory, executable))
>   .then(lambda::bind(_extract, lambda::_1, directory, !executable))
>   .then(lambda::bind(_chown, lambda::_1, directory, user));
> if (result.isFailed()) {
>   LOG(ERROR) << "Fetch of uri '" << uri.value() << "' failed: " << 
> result.failure();
>   return Future::failed(result.failure());
> }
>   }
>   return Nothing();
> }
> {code}
> [~bmahler] and I had these thoughts:
> * .ifSome() and .ifError() to express control flow.
> * .and() for chaining to make it clear the code is synchronous and also that 
> it will short-circuit error.
> e.g.
> {code}
> Try result =
>   _fetch(...)
>     .and(chmod...)
>     .and(extract...)
>     .and(chown...);
> {code}
> Thoughts? How do other languages express this?





[jira] [Updated] (MESOS-6420) Mesos Agent leaking sockets when port mapping network isolator is ON

2016-10-19 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-6420:
--
Summary: Mesos Agent leaking sockets when port mapping network isolator is 
ON  (was: Mesos Agent leaking sockets when network isolation is ON)

> Mesos Agent leaking sockets when port mapping network isolator is ON
> 
>
> Key: MESOS-6420
> URL: https://issues.apache.org/jira/browse/MESOS-6420
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation, network, slave
>Affects Versions: 1.0.2
>Reporter: Santhosh Shanmugham
>
> Mesos Agent leaks one socket per task launched and eventually runs out of 
> sockets. We were able to track it down to the network isolator 
> (port_mapping.cpp). When we turned off the port mapping isolator, no file 
> descriptors were leaked. The leaked fd is a SOCK_STREAM socket.
> Leaked Sockets:
> sshanmugham[5]smf1-aeu-07-sr3(mesos.test.slave) ~ $ sudo lsof -p $(pgrep -u 
> root -o -f /usr/local/sbin/mesos-slave) -nP | grep "can't"
> [sudo] password for sshanmugham:
> mesos-sla 57688 root   19u  sock0,6  0t0 2993216948 can't 
> identify protocol
> mesos-sla 57688 root   27u  sock0,6  0t0 2993216468 can't 
> identify protocol
> Extract from strace:
> ...
> [pid 57701] 19:14:02.493718 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494395 close(19)   = 0
> [pid 57701] 19:14:02.494448 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494844 close(19)   = 0
> [pid 57701] 19:14:02.494913 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.495565 close(19)   = 0
> [pid 57701] 19:14:02.495617 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496072 close(19)   = 0
> [pid 57701] 19:14:02.496128 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496758 close(19)   = 0
> [pid 57701] 19:14:02.496812 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497270 close(19)   = 0
> [pid 57701] 19:14:02.497319 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497698 close(19)   = 0
> [pid 57701] 19:14:02.497750 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498407 close(19)   = 0
> [pid 57701] 19:14:02.498456 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498899 close(19)   = 0
> [pid 57701] 19:14:02.498963 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 63682] 19:14:02.499091 close(18 
> [pid 57701] 19:14:02.499634 close(19)   = 0
> [pid 57701] 19:14:02.499689 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500044 close(19)   = 0
> [pid 57701] 19:14:02.500093 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500734 close(19)   = 0
> [pid 57701] 19:14:02.500782 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.501271 close(19)   = 0
> [pid 57701] 19:14:02.501339 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.502030 close(19)   = 0
> [pid 57701] 19:14:02.502101 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 19
> ...
> ...
> [pid 57691] 19:18:03.461022 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461345 open("/etc/selinux/config", O_RDONLY  ...>
> [pid 57691] 19:18:03.461460 close(27)   = 0
> [pid 57691] 19:18:03.461520 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461632 close(3 
> [pid  6138] 19:18:03.461781 open("/proc/mounts", O_RDONLY 
> [pid  6138] 19:18:03.462190 close(3 
> [pid 57691] 19:18:03.462374 close(27)   = 0
> [pid 57691] 19:18:03.462430 socket(PF_NETLINK, SOCK_RAW, 0 
> [pid  6138] 19:18:03.462456 open("/proc/net/psched", O_RDONLY 
> [pid  6138] 19:18:03.462678 close(3 
> [pid  6138] 19:18:03.462915 open("/etc/libnl/classid", O_RDONLY  ...>
> [pid 57691] 19:18:03.463046 close(27)   = 0
> [pid 57691] 19:18:03.463111 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.463225 close(3 
> [pid 57691] 19:18:03.463845 close(27)   = 0
> [pid 57691] 19:18:03.463911 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.464604 close(27)   = 0
> [pid 57691] 19:18:03.464664 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465074 close(27)   = 0
> [pid 57691] 19:18:03.465132 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465862 close(27)   = 0
> [pid 57691] 19:18:03.465928 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.466713 close(27)   = 0
> [pid 57691] 19:18:03.466780 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.467472 close(27)   = 0
> [pid 57691] 19:18:03.467524 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.468012 close(27)   = 0
> [pid 57691] 19:18:03.468075 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.468799 close(27)   = 0
> [pid 57691] 19:18:03.468950 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.469505 close(27)   = 0
> [pid 57691] 19:18:03.469578 socket(PF_NETLINK, 

[jira] [Commented] (MESOS-6420) Mesos Agent leaking sockets when network isolation is ON

2016-10-19 Thread Santhosh Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15590126#comment-15590126
 ] 

Santhosh Shanmugham commented on MESOS-6420:


Something about the file 
'/sys/fs/cgroup/freezer/mesos/0f5238be-4dd6-4ffe-9250-ceddc904174f' already 
existing. See the full log for the lifetime of the job below.

W1019 22:50:47.115586 1117 subprocess.hpp:422] Failed to execute 
Subprocess::Hook in parent for child '17427': Failed to assign process to its 
freezer cgroup: Failed to create freezer cgroup: Failed to create directory 
'/sys/fs/cgroup/freezer/mesos/0f5238be-4dd6-4ffe-9250-ceddc904174f': File exists
E1019 22:50:47.121644 1117 slave.cpp:3976] Container 
'0f5238be-4dd6-4ffe-9250-ceddc904174f' for executor 
'thermos-sshanmugham-devel-hello-0-f70e5cd5-d93c-45c5-85ec-434cb9c527ab' of 
framework 201103282247-19- failed to start: Failed to fork 
executor: Failed to clone child process: Failed to execute Subprocess::Hook in 
parent for child '17427': Failed to assign process to its freezer cgroup: 
Failed to create freezer cgroup: Failed to create directory 
'/sys/fs/cgroup/freezer/mesos/0f5238be-4dd6-4ffe-9250-ceddc904174f': File exists
I1019 22:50:47.122350 1117 containerizer.cpp:1622] Destroying container 
'0f5238be-4dd6-4ffe-9250-ceddc904174f'

> Mesos Agent leaking sockets when network isolation is ON
> 
>
> Key: MESOS-6420
> URL: https://issues.apache.org/jira/browse/MESOS-6420
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation, network, slave
>Affects Versions: 1.0.2
>Reporter: Santhosh Shanmugham
>
> Mesos Agent leaks one socket per task launched and eventually runs out of 
> sockets. We were able to track it down to the network isolator 
> (port_mapping.cpp). When we turned off the port mapping isolator, no file 
> descriptors were leaked. The leaked fd is a SOCK_STREAM socket.
> Leaked Sockets:
> sshanmugham[5]smf1-aeu-07-sr3(mesos.test.slave) ~ $ sudo lsof -p $(pgrep -u 
> root -o -f /usr/local/sbin/mesos-slave) -nP | grep "can't"
> [sudo] password for sshanmugham:
> mesos-sla 57688 root   19u  sock0,6  0t0 2993216948 can't 
> identify protocol
> mesos-sla 57688 root   27u  sock0,6  0t0 2993216468 can't 
> identify protocol
> Extract from strace:
> ...
> [pid 57701] 19:14:02.493718 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494395 close(19)   = 0
> [pid 57701] 19:14:02.494448 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494844 close(19)   = 0
> [pid 57701] 19:14:02.494913 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.495565 close(19)   = 0
> [pid 57701] 19:14:02.495617 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496072 close(19)   = 0
> [pid 57701] 19:14:02.496128 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496758 close(19)   = 0
> [pid 57701] 19:14:02.496812 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497270 close(19)   = 0
> [pid 57701] 19:14:02.497319 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497698 close(19)   = 0
> [pid 57701] 19:14:02.497750 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498407 close(19)   = 0
> [pid 57701] 19:14:02.498456 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498899 close(19)   = 0
> [pid 57701] 19:14:02.498963 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 63682] 19:14:02.499091 close(18 
> [pid 57701] 19:14:02.499634 close(19)   = 0
> [pid 57701] 19:14:02.499689 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500044 close(19)   = 0
> [pid 57701] 19:14:02.500093 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500734 close(19)   = 0
> [pid 57701] 19:14:02.500782 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.501271 close(19)   = 0
> [pid 57701] 19:14:02.501339 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.502030 close(19)   = 0
> [pid 57701] 19:14:02.502101 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 19
> ...
> ...
> [pid 57691] 19:18:03.461022 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461345 open("/etc/selinux/config", O_RDONLY  ...>
> [pid 57691] 19:18:03.461460 close(27)   = 0
> [pid 57691] 19:18:03.461520 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461632 close(3 
> [pid  6138] 19:18:03.461781 open("/proc/mounts", O_RDONLY 
> [pid  6138] 19:18:03.462190 close(3 
> [pid 57691] 19:18:03.462374 close(27)   = 0
> [pid 57691] 19:18:03.462430 socket(PF_NETLINK, SOCK_RAW, 0 
> [pid  6138] 19:18:03.462456 open("/proc/net/psched", O_RDONLY 
> [pid  6138] 19:18:03.462678 close(3 
> [pid  6138] 19:18:03.462915 open("/etc/libnl/classid", O_RDONLY  ...>
> [pid 57691] 19:18:03.463046 close(27)   = 0
> [pid 57691] 19:18:03.463111 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 

[jira] [Commented] (MESOS-6420) Mesos Agent leaking sockets when network isolation is ON

2016-10-19 Thread Santhosh Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15590120#comment-15590120
 ] 

Santhosh Shanmugham commented on MESOS-6420:


I1019 22:50:45.495558  1117 slave.cpp:1495] Got assigned task 
sshanmugham-devel-hello-0-f70e5cd5-d93c-45c5-85ec-434cb9c527ab for framework 
201103282247-19-
I1019 22:50:45.504261  1117 gc.cpp:83] Unscheduling 
'/var/lib/mesos/slaves/7f9be60e-9bcf-47e0-8c6b-8b8ab278ecbe-S0/frameworks/201103282247-19-'
 from gc
I1019 22:50:45.505200  1117 gc.cpp:83] Unscheduling 
'/var/lib/mesos/meta/slaves/7f9be60e-9bcf-47e0-8c6b-8b8ab278ecbe-S0/frameworks/201103282247-19-'
 from gc
I1019 22:50:45.506726  1116 slave.cpp:1614] Launching task 
sshanmugham-devel-hello-0-f70e5cd5-d93c-45c5-85ec-434cb9c527ab for framework 
201103282247-19-
I1019 22:50:45.510138  1116 paths.cpp:528] Trying to chown 
'/var/lib/mesos/slaves/7f9be60e-9bcf-47e0-8c6b-8b8ab278ecbe-S0/frameworks/201103282247-19-/executors/thermos-sshanmugham-devel-hello-0-f70e5cd5-d93c-45c5-85ec-434cb9c527ab/runs/0f5238be-4dd6-4ffe-9250-ceddc904174f'
 to user 'root'
==17403== Memcheck, a memory error detector
==17403== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==17403== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==17403== Command: /bin/sh -c chown\ -R\ 0:0\ 
'/var/lib/mesos/slaves/7f9be60e-9bcf-47e0-8c6b-8b8ab278ecbe-S0/frameworks/201103282247-19-/executors/thermos-sshanmugham-devel-hello-0-f70e5cd5-d93c-45c5-85ec-434cb9c527ab/runs/0f5238be-4dd6-4ffe-9250-ceddc904174f'
==17403==
==17403== Memcheck, a memory error detector
==17403== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==17403== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==17403== Command: /bin/chown -R 0:0 
/var/lib/mesos/slaves/7f9be60e-9bcf-47e0-8c6b-8b8ab278ecbe-S0/frameworks/201103282247-19-/executors/thermos-sshanmugham-devel-hello-0-f70e5cd5-d93c-45c5-85ec-434cb9c527ab/runs/0f5238be-4dd6-4ffe-9250-ceddc904174f
==17403==
==17403==
==17403== FILE DESCRIPTORS: 23 open at exit.
==17403== Open file descriptor 33:
==17403==
==17403==
==17403== Open file descriptor 32:
==17403==
==17403==
==17403== Open file descriptor 31:
==17403==
==17403==
==17403== Open file descriptor 30:
==17403==
==17403==
==17403== Open file descriptor 29:
==17403==
==17403==
==17403== Open file descriptor 28:
==17403==
==17403==
==17403== Open file descriptor 27:
==17403==
==17403==
==17403== Open file descriptor 25:
==17403==
==17403==
==17403== Open file descriptor 23:
==17403==
==17403==
==17403== Open file descriptor 22:
==17403==
==17403==
==17403== Open file descriptor 21:
==17403==
==17403==
==17403== Open file descriptor 20:
==17403==
==17403==
==17403== Open file descriptor 19:
==17403==
==17403==
==17403== Open file descriptor 17:
==17403==
==17403==
==17403== Open file descriptor 16:
==17403==
==17403==
==17403== Open file descriptor 15:
==17403==
==17403==
==17403== Open file descriptor 14:
==17403==
==17403==
==17403== Open file descriptor 12:
==17403==
==17403==
==17403== Open AF_INET socket 10: 10.34.124.106:60994 <-> 10.35.95.111:2181
==17403==
==17403==
==17403== Open file descriptor 8:
==17403==
==17403==
==17403== Open file descriptor 7:
==17403==
==17403==
==17403== Open file descriptor 2:
==17403==
==17403==
==17403== Open file descriptor 0: /dev/null
==17403==
==17403==
==17403==
==17403== HEAP SUMMARY:
==17403== in use at exit: 4 bytes in 2 blocks
==17403==   total heap usage: 106 allocs, 104 frees, 575,581 bytes allocated
==17403==
==17403== LEAK SUMMARY:
==17403==definitely lost: 0 bytes in 0 blocks
==17403==indirectly lost: 0 bytes in 0 blocks
==17403==  possibly lost: 0 bytes in 0 blocks
==17403==still reachable: 4 bytes in 2 blocks
==17403== suppressed: 0 bytes in 0 blocks
==17403== Reachable blocks (those to which a pointer was found) are not shown.
==17403== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==17403==
==17403== For counts of detected and suppressed errors, rerun with: -v
==17403== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 4 from 4)
I1019 22:50:46.946646  1116 slave.cpp:5674] Launching executor 
thermos-sshanmugham-devel-hello-0-f70e5cd5-d93c-45c5-85ec-434cb9c527ab of 
framework 201103282247-19- with resources mem(*):128 in work 
directory 
'/var/lib/mesos/slaves/7f9be60e-9bcf-47e0-8c6b-8b8ab278ecbe-S0/frameworks/201103282247-19-/executors/thermos-sshanmugham-devel-hello-0-f70e5cd5-d93c-45c5-85ec-434cb9c527ab/runs/0f5238be-4dd6-4ffe-9250-ceddc904174f'
I1019 22:50:46.952791  1117 containerizer.cpp:781] Starting container 
'0f5238be-4dd6-4ffe-9250-ceddc904174f' for executor 

[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2016-10-19 Thread Megha (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15590108#comment-15590108
 ] 

Megha commented on MESOS-6223:
--

Recovery of the agent after a host reboot is required to support restarting 
restartable tasks when the executor dies as a result of the reboot. Here's the 
detailed design doc for Restartable Tasks:

https://docs.google.com/document/d/1YS_EBUNLkzpSru0dwn_hPUIeTATiWckSaosXSIaHUCo/edit?usp=sharing


> Allow agents to re-register post a host reboot
> --
>
> Key: MESOS-6223
> URL: https://issues.apache.org/jira/browse/MESOS-6223
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Reporter: Megha
>Assignee: Megha
>
> The agent doesn't recover its state after a host reboot; it registers with the 
> master and gets a new SlaveID. With partition awareness, agents are now 
> allowed to re-register after they have been marked Unreachable. The executors 
> are terminated anyway on the agent when it reboots, so there is no harm in 
> letting the agent keep its SlaveID, re-register with the master, and reconcile 
> the lost executors. This is a prerequisite for supporting 
> persistent/restartable tasks in Mesos (MESOS-3545).





[jira] [Updated] (MESOS-6422) cgroups_tests not correctly tearing down testing hierarchies

2016-10-19 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-6422:
--
Description: 
We currently do the following in 
[CgroupsTest::TearDownTestCase()|https://github.com/apache/mesos/blob/5e850a362edbf494921fedff4037cf4b53088c10/src/tests/containerizer/cgroups_tests.cpp#L83]

{code:title=}
static void TearDownTestCase()
{
  AWAIT_READY(cgroups::cleanup(TEST_CGROUPS_HIERARCHY));
}
{code}

One of its derived tests, {{CgroupsNoHierarchyTest}}, treats 
{{TEST_CGROUPS_HIERARCHY}} as a hierarchy, so it's able to clean it up as a 
hierarchy.

However, another derived test, {{CgroupsAnyHierarchyTest}}, creates new 
hierarchies (if none is available) using {{TEST_CGROUPS_HIERARCHY}} as a parent 
directory (i.e., base hierarchy) and not as a hierarchy, so when it's time to 
clean up, it fails:

{noformat:title=}
[   OK ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_Subsystems (1 ms)
../../src/tests/containerizer/cgroups_tests.cpp:88: Failure
(cgroups::cleanup(TEST_CGROUPS_HIERARCHY)).failure(): Operation not permitted
{noformat}

  was:
We currently do the following in 
[CgroupsTest::TearDownTestCase()|https://github.com/apache/mesos/blob/5e850a362edbf494921fedff4037cf4b53088c10/src/tests/containerizer/cgroups_tests.cpp#L83]

{code:title=}
static void TearDownTestCase()
{
  AWAIT_READY(cgroups::cleanup(TEST_CGROUPS_HIERARCHY));
}
{code}

One of its derived test {{CgroupsNoHierarchyTest}} treats 
{TEST_CGROUPS_HIERARCHY} as a hierarchy so it's able to clean it up as a 
hierarchy.

However another derived test {{CgroupsAnyHierarchyTest}} would create new 
hierarchies (if none is available) using {{TEST_CGROUPS_HIERARCHY}} as a parent 
directory (i.e., base hierarchy) and not as a hierarchy, so when it's time to 
clean up, it fails:

{noformat:title=}
[   OK ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_Subsystems (1 ms)
../../src/tests/containerizer/cgroups_tests.cpp:88: Failure
(cgroups::cleanup(TEST_CGROUPS_HIERARCHY)).failure(): Operation not permitted
{noformat}


> cgroups_tests not correctly tearing down testing hierarchies
> 
>
> Key: MESOS-6422
> URL: https://issues.apache.org/jira/browse/MESOS-6422
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Yan Xu
>
> We currently do the following in 
> [CgroupsTest::TearDownTestCase()|https://github.com/apache/mesos/blob/5e850a362edbf494921fedff4037cf4b53088c10/src/tests/containerizer/cgroups_tests.cpp#L83]
> {code:title=}
> static void TearDownTestCase()
> {
>   AWAIT_READY(cgroups::cleanup(TEST_CGROUPS_HIERARCHY));
> }
> {code}
> One of its derived test {{CgroupsNoHierarchyTest}} treats 
> {{TEST_CGROUPS_HIERARCHY}} as a hierarchy so it's able to clean it up as a 
> hierarchy.
> However another derived test {{CgroupsAnyHierarchyTest}} would create new 
> hierarchies (if none is available) using {{TEST_CGROUPS_HIERARCHY}} as a 
> parent directory (i.e., base hierarchy) and not as a hierarchy, so when it's 
> time to clean up, it fails:
> {noformat:title=}
> [   OK ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_Subsystems (1 ms)
> ../../src/tests/containerizer/cgroups_tests.cpp:88: Failure
> (cgroups::cleanup(TEST_CGROUPS_HIERARCHY)).failure(): Operation not permitted
> {noformat}





[jira] [Created] (MESOS-6422) cgroups_tests not correctly tearing down testing hierarchies

2016-10-19 Thread Yan Xu (JIRA)
Yan Xu created MESOS-6422:
-

 Summary: cgroups_tests not correctly tearing down testing 
hierarchies
 Key: MESOS-6422
 URL: https://issues.apache.org/jira/browse/MESOS-6422
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Reporter: Yan Xu


We currently do the following in 
[CgroupsTest::TearDownTestCase()|https://github.com/apache/mesos/blob/5e850a362edbf494921fedff4037cf4b53088c10/src/tests/containerizer/cgroups_tests.cpp#L83]

{code:title=}
static void TearDownTestCase()
{
  AWAIT_READY(cgroups::cleanup(TEST_CGROUPS_HIERARCHY));
}
{code}

One of its derived test {{CgroupsNoHierarchyTest}} treats 
{TEST_CGROUPS_HIERARCHY} as a hierarchy so it's able to clean it up as a 
hierarchy.

However another derived test {{CgroupsAnyHierarchyTest}} would create new 
hierarchies (if none is available) using {{TEST_CGROUPS_HIERARCHY}} as a parent 
directory (i.e., base hierarchy) and not as a hierarchy, so when it's time to 
clean up, it fails:

{noformat:title=}
[   OK ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_Subsystems (1 ms)
../../src/tests/containerizer/cgroups_tests.cpp:88: Failure
(cgroups::cleanup(TEST_CGROUPS_HIERARCHY)).failure(): Operation not permitted
{noformat}
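
Purely as a sketch of one possible direction (this is not the current test code, and it assumes helpers such as {{cgroups::mounted()}}, {{os::exists()}} and {{os::rmdir()}} behave with their usual Mesos/stout semantics), the teardown could distinguish the two usages:

{code:title=}
static void TearDownTestCase()
{
  // Hypothetical sketch: handle both ways the derived tests use
  // TEST_CGROUPS_HIERARCHY.
  Try<bool> mounted = cgroups::mounted(TEST_CGROUPS_HIERARCHY);

  if (mounted.isSome() && mounted.get()) {
    // Used as an actual hierarchy (the CgroupsNoHierarchyTest case).
    AWAIT_READY(cgroups::cleanup(TEST_CGROUPS_HIERARCHY));
  } else if (os::exists(TEST_CGROUPS_HIERARCHY)) {
    // Used only as a parent directory for newly created hierarchies
    // (the CgroupsAnyHierarchyTest case).
    ASSERT_SOME(os::rmdir(TEST_CGROUPS_HIERARCHY));
  }
}
{code}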





[jira] [Commented] (MESOS-6420) Mesos Agent leaking sockets when network isolation is ON

2016-10-19 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15590083#comment-15590083
 ] 

Jie Yu commented on MESOS-6420:
---

Can you find out from the agent log why it failed?

> Mesos Agent leaking sockets when network isolation is ON
> 
>
> Key: MESOS-6420
> URL: https://issues.apache.org/jira/browse/MESOS-6420
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation, network, slave
>Affects Versions: 1.0.2
>Reporter: Santhosh Shanmugham
>
> Mesos Agent leaks one socket per task launched and eventually runs out of 
> sockets. We were able to track it down to the network isolator 
> (port_mapping.cpp). When we turned off the port mapping isolator, no file 
> descriptors were leaked. The leaked fd is a SOCK_STREAM socket.
> Leaked Sockets:
> sshanmugham[5]smf1-aeu-07-sr3(mesos.test.slave) ~ $ sudo lsof -p $(pgrep -u 
> root -o -f /usr/local/sbin/mesos-slave) -nP | grep "can't"
> [sudo] password for sshanmugham:
> mesos-sla 57688 root   19u  sock0,6  0t0 2993216948 can't 
> identify protocol
> mesos-sla 57688 root   27u  sock0,6  0t0 2993216468 can't 
> identify protocol
> Extract from strace:
> ...
> [pid 57701] 19:14:02.493718 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494395 close(19)   = 0
> [pid 57701] 19:14:02.494448 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494844 close(19)   = 0
> [pid 57701] 19:14:02.494913 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.495565 close(19)   = 0
> [pid 57701] 19:14:02.495617 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496072 close(19)   = 0
> [pid 57701] 19:14:02.496128 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496758 close(19)   = 0
> [pid 57701] 19:14:02.496812 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497270 close(19)   = 0
> [pid 57701] 19:14:02.497319 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497698 close(19)   = 0
> [pid 57701] 19:14:02.497750 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498407 close(19)   = 0
> [pid 57701] 19:14:02.498456 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498899 close(19)   = 0
> [pid 57701] 19:14:02.498963 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 63682] 19:14:02.499091 close(18 
> [pid 57701] 19:14:02.499634 close(19)   = 0
> [pid 57701] 19:14:02.499689 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500044 close(19)   = 0
> [pid 57701] 19:14:02.500093 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500734 close(19)   = 0
> [pid 57701] 19:14:02.500782 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.501271 close(19)   = 0
> [pid 57701] 19:14:02.501339 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.502030 close(19)   = 0
> [pid 57701] 19:14:02.502101 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 19
> ...
> ...
> [pid 57691] 19:18:03.461022 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461345 open("/etc/selinux/config", O_RDONLY  ...>
> [pid 57691] 19:18:03.461460 close(27)   = 0
> [pid 57691] 19:18:03.461520 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461632 close(3 
> [pid  6138] 19:18:03.461781 open("/proc/mounts", O_RDONLY 
> [pid  6138] 19:18:03.462190 close(3 
> [pid 57691] 19:18:03.462374 close(27)   = 0
> [pid 57691] 19:18:03.462430 socket(PF_NETLINK, SOCK_RAW, 0 
> [pid  6138] 19:18:03.462456 open("/proc/net/psched", O_RDONLY 
> [pid  6138] 19:18:03.462678 close(3 
> [pid  6138] 19:18:03.462915 open("/etc/libnl/classid", O_RDONLY  ...>
> [pid 57691] 19:18:03.463046 close(27)   = 0
> [pid 57691] 19:18:03.463111 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.463225 close(3 
> [pid 57691] 19:18:03.463845 close(27)   = 0
> [pid 57691] 19:18:03.463911 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.464604 close(27)   = 0
> [pid 57691] 19:18:03.464664 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465074 close(27)   = 0
> [pid 57691] 19:18:03.465132 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465862 close(27)   = 0
> [pid 57691] 19:18:03.465928 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.466713 close(27)   = 0
> [pid 57691] 19:18:03.466780 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.467472 close(27)   = 0
> [pid 57691] 19:18:03.467524 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.468012 close(27)   = 0
> [pid 57691] 19:18:03.468075 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.468799 close(27)   = 0
> [pid 57691] 19:18:03.468950 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.469505 close(27)   = 0
> [pid 57691] 19:18:03.469578 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.470301 close(27)   = 0
> [pid 57691] 19:18:03.470353 

[jira] [Commented] (MESOS-6420) Mesos Agent leaking sockets when network isolation is ON

2016-10-19 Thread Santhosh Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15590075#comment-15590075
 ] 

Santhosh Shanmugham commented on MESOS-6420:


When running with Valgrind, the slave fails to prepare isolators for the 
containers.

`/usr/local/bin/valgrind --trace-children=yes --tool=memcheck --leak-check=full 
--track-fds=yes /usr/local/bin/mesos-slave.sh 
/usr/local/mesos/conf/mesos-slave-config.sh`

Failed to launch container: Failed to fork executor: Failed to clone child 
process: Failed to synchronize child process; Container destroyed while 
preparing isolators

> Mesos Agent leaking sockets when network isolation is ON
> 
>
> Key: MESOS-6420
> URL: https://issues.apache.org/jira/browse/MESOS-6420
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation, network, slave
>Affects Versions: 1.0.2
>Reporter: Santhosh Shanmugham
>
> Mesos Agent leaks one socket per task launched and eventually runs out of 
> sockets. We were able to track it down to the network isolator 
> (port_mapping.cpp). When we turned off the port mapping isolator, no file 
> descriptors were leaked. The leaked fd is a SOCK_STREAM socket.
> Leaked Sockets:
> sshanmugham[5]smf1-aeu-07-sr3(mesos.test.slave) ~ $ sudo lsof -p $(pgrep -u 
> root -o -f /usr/local/sbin/mesos-slave) -nP | grep "can't"
> [sudo] password for sshanmugham:
> mesos-sla 57688 root   19u  sock0,6  0t0 2993216948 can't 
> identify protocol
> mesos-sla 57688 root   27u  sock0,6  0t0 2993216468 can't 
> identify protocol
> Extract from strace:
> ...
> [pid 57701] 19:14:02.493718 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494395 close(19)   = 0
> [pid 57701] 19:14:02.494448 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494844 close(19)   = 0
> [pid 57701] 19:14:02.494913 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.495565 close(19)   = 0
> [pid 57701] 19:14:02.495617 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496072 close(19)   = 0
> [pid 57701] 19:14:02.496128 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496758 close(19)   = 0
> [pid 57701] 19:14:02.496812 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497270 close(19)   = 0
> [pid 57701] 19:14:02.497319 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497698 close(19)   = 0
> [pid 57701] 19:14:02.497750 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498407 close(19)   = 0
> [pid 57701] 19:14:02.498456 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498899 close(19)   = 0
> [pid 57701] 19:14:02.498963 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 63682] 19:14:02.499091 close(18 
> [pid 57701] 19:14:02.499634 close(19)   = 0
> [pid 57701] 19:14:02.499689 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500044 close(19)   = 0
> [pid 57701] 19:14:02.500093 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500734 close(19)   = 0
> [pid 57701] 19:14:02.500782 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.501271 close(19)   = 0
> [pid 57701] 19:14:02.501339 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.502030 close(19)   = 0
> [pid 57701] 19:14:02.502101 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 19
> ...
> ...
> [pid 57691] 19:18:03.461022 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461345 open("/etc/selinux/config", O_RDONLY  ...>
> [pid 57691] 19:18:03.461460 close(27)   = 0
> [pid 57691] 19:18:03.461520 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461632 close(3 
> [pid  6138] 19:18:03.461781 open("/proc/mounts", O_RDONLY 
> [pid  6138] 19:18:03.462190 close(3 
> [pid 57691] 19:18:03.462374 close(27)   = 0
> [pid 57691] 19:18:03.462430 socket(PF_NETLINK, SOCK_RAW, 0 
> [pid  6138] 19:18:03.462456 open("/proc/net/psched", O_RDONLY 
> [pid  6138] 19:18:03.462678 close(3 
> [pid  6138] 19:18:03.462915 open("/etc/libnl/classid", O_RDONLY  ...>
> [pid 57691] 19:18:03.463046 close(27)   = 0
> [pid 57691] 19:18:03.463111 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.463225 close(3 
> [pid 57691] 19:18:03.463845 close(27)   = 0
> [pid 57691] 19:18:03.463911 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.464604 close(27)   = 0
> [pid 57691] 19:18:03.464664 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465074 close(27)   = 0
> [pid 57691] 19:18:03.465132 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465862 close(27)   = 0
> [pid 57691] 19:18:03.465928 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.466713 close(27)   = 0
> [pid 57691] 19:18:03.466780 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.467472 close(27)   = 0
> [pid 57691] 19:18:03.467524 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> 

[jira] [Assigned] (MESOS-3505) Support specifying Docker image by Image ID.

2016-10-19 Thread Ilya Pronin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin reassigned MESOS-3505:
--

Assignee: Ilya Pronin

> Support specifying Docker image by Image ID.
> 
>
> Key: MESOS-3505
> URL: https://issues.apache.org/jira/browse/MESOS-3505
> Project: Mesos
>  Issue Type: Story
>Reporter: Yan Xu
>Assignee: Ilya Pronin
>  Labels: mesosphere
>
> A common way to specify a Docker image with the docker engine is through 
> {{repo:tag}}, which is convenient and sufficient for most people in most 
> scenarios. However this combination is neither precise nor immutable.
> For this reason, when an image with a given {{repo:tag}} is already cached 
> locally on an agent host and a task requiring this {{repo:tag}} arrives, it's 
> possible the task ends up using an image different from the one the user intended.
> Docker CLI already supports referring to an image by {{repo@id}}, where the 
> ID can have two forms:
> * v1 Image ID
> * digest
> Native Mesos provisioner should support the same for Docker images. IMO it's 
> fine if image discovery by ID is not supported (and thus still requiring 
> {{repo:tag}} to be specified) (looks like [v2 
> registry|http://docs.docker.com/registry/spec/api/] does support it) but the 
> user can optionally specify an image ID and match it against the cached / 
> newly pulled image. If the ID doesn't match the cached image, the store can 
> re-pull it; if the ID doesn't match the newly pulled image (manifest), the 
> provisioner can fail the request without having the user unknowingly running 
> its task on the wrong image.





[jira] [Comment Edited] (MESOS-6420) Mesos Agent leaking sockets when network isolation is ON

2016-10-19 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15589882#comment-15589882
 ] 

Jie Yu edited comment on MESOS-6420 at 10/19/16 9:41 PM:
-

Looks like a TCP socket. Port mapping isolator does not open tcp sockets. So 
either libnl is doing that, or something else.


was (Author: jieyu):
Hum, the last socket is a raw IP socket. Mesos definitely does not do that. 
This might be a leak in libnl that is used by port mapping isolator.

> Mesos Agent leaking sockets when network isolation is ON
> 
>
> Key: MESOS-6420
> URL: https://issues.apache.org/jira/browse/MESOS-6420
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation, network, slave
>Affects Versions: 1.0.2
>Reporter: Santhosh Shanmugham
>
> Mesos Agent leaks one socket per task launched and eventually runs out of 
> sockets. We were able to track it down to the network isolator 
> (port_mapping.cpp). When we turned off the port mapping isolator, no file 
> descriptors were leaked. The leaked fd is a SOCK_STREAM socket.
> Leaked Sockets:
> sshanmugham[5]smf1-aeu-07-sr3(mesos.test.slave) ~ $ sudo lsof -p $(pgrep -u 
> root -o -f /usr/local/sbin/mesos-slave) -nP | grep "can't"
> [sudo] password for sshanmugham:
> mesos-sla 57688 root   19u  sock0,6  0t0 2993216948 can't 
> identify protocol
> mesos-sla 57688 root   27u  sock0,6  0t0 2993216468 can't 
> identify protocol
> Extract from strace:
> ...
> [pid 57701] 19:14:02.493718 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494395 close(19)   = 0
> [pid 57701] 19:14:02.494448 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494844 close(19)   = 0
> [pid 57701] 19:14:02.494913 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.495565 close(19)   = 0
> [pid 57701] 19:14:02.495617 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496072 close(19)   = 0
> [pid 57701] 19:14:02.496128 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496758 close(19)   = 0
> [pid 57701] 19:14:02.496812 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497270 close(19)   = 0
> [pid 57701] 19:14:02.497319 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497698 close(19)   = 0
> [pid 57701] 19:14:02.497750 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498407 close(19)   = 0
> [pid 57701] 19:14:02.498456 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498899 close(19)   = 0
> [pid 57701] 19:14:02.498963 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 63682] 19:14:02.499091 close(18 
> [pid 57701] 19:14:02.499634 close(19)   = 0
> [pid 57701] 19:14:02.499689 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500044 close(19)   = 0
> [pid 57701] 19:14:02.500093 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500734 close(19)   = 0
> [pid 57701] 19:14:02.500782 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.501271 close(19)   = 0
> [pid 57701] 19:14:02.501339 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.502030 close(19)   = 0
> [pid 57701] 19:14:02.502101 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 19
> ...
> ...
> [pid 57691] 19:18:03.461022 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461345 open("/etc/selinux/config", O_RDONLY  ...>
> [pid 57691] 19:18:03.461460 close(27)   = 0
> [pid 57691] 19:18:03.461520 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461632 close(3 
> [pid  6138] 19:18:03.461781 open("/proc/mounts", O_RDONLY 
> [pid  6138] 19:18:03.462190 close(3 
> [pid 57691] 19:18:03.462374 close(27)   = 0
> [pid 57691] 19:18:03.462430 socket(PF_NETLINK, SOCK_RAW, 0 
> [pid  6138] 19:18:03.462456 open("/proc/net/psched", O_RDONLY 
> [pid  6138] 19:18:03.462678 close(3 
> [pid  6138] 19:18:03.462915 open("/etc/libnl/classid", O_RDONLY  ...>
> [pid 57691] 19:18:03.463046 close(27)   = 0
> [pid 57691] 19:18:03.463111 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.463225 close(3 
> [pid 57691] 19:18:03.463845 close(27)   = 0
> [pid 57691] 19:18:03.463911 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.464604 close(27)   = 0
> [pid 57691] 19:18:03.464664 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465074 close(27)   = 0
> [pid 57691] 19:18:03.465132 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465862 close(27)   = 0
> [pid 57691] 19:18:03.465928 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.466713 close(27)   = 0
> [pid 57691] 19:18:03.466780 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.467472 close(27)   = 0
> [pid 57691] 19:18:03.467524 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.468012 close(27)   = 0
> [pid 57691] 19:18:03.468075 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> 

[jira] [Commented] (MESOS-6400) Not able to remove Orphan Tasks

2016-10-19 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15589950#comment-15589950
 ] 

Gilbert Song commented on MESOS-6400:
-

[~mithril], it seems like there are two separate issues in your description:

1. After a network partition or reboot, Marathon should not register with the 
Mesos master using a new FrameworkID, since that results in the old FrameworkID 
being regarded as an unregistered framework whose orphan tasks still occupy 
resources, leaving the new tasks unable to get enough resources to launch. (We 
should contact the Marathon team to figure out why a new FrameworkID is used.)

2. The 'master/teardown' endpoint should support tearing down an unregistered 
framework. I created MESOS-6419 to track this issue.

Side note:
Currently the Mesos master does not persist any state about registered 
frameworks, so when a new framework ID tries to register, the master cannot 
tell whether that framework existed before. Ideally, the Mesos master should 
persist all framework information on disk (as it currently does with agent 
information). There should be an earlier JIRA describing this issue; I will 
link it once I find it.

> Not able to remove Orphan Tasks
> ---
>
> Key: MESOS-6400
> URL: https://issues.apache.org/jira/browse/MESOS-6400
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
> Environment: centos 7 x64
>Reporter: kasim
>
> The problem may be caused by Mesos and Marathon being out of sync:
> https://github.com/mesosphere/marathon/issues/616
> When I found orphan tasks happening, I:
> 1. restarted Marathon
> 2. Marathon did not sync the orphan tasks, but started new tasks.
> 3. The orphan tasks still held the resources, so I had to delete them.
> 4. I found all orphan tasks are under framework 
> `ef169d8a-24fc-41d1-8b0d-c67718937a48-`;
> curl -XGET `http://c196:5050/master/frameworks` shows that framework under 
> `unregistered_frameworks`
> {code}
> {
> "frameworks": [
> .
> ],
> "completed_frameworks": [ ],
> "unregistered_frameworks": [
> "ef169d8a-24fc-41d1-8b0d-c67718937a48-",
> "ef169d8a-24fc-41d1-8b0d-c67718937a48-",
> "ef169d8a-24fc-41d1-8b0d-c67718937a48-"
> ]
> }
> {code}
> 5. Tried {code}curl -XPOST http://c196:5050/master/teardown -d 
> 'frameworkId=ef169d8a-24fc-41d1-8b0d-c67718937a48-' {code}
> but got `No framework found with specified ID`.
> So I have no way to delete the orphan tasks.





[jira] [Commented] (MESOS-6420) Mesos Agent leaking sockets when network isolation is ON

2016-10-19 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15589882#comment-15589882
 ] 

Jie Yu commented on MESOS-6420:
---

Hum, the last socket is a raw IP socket. Mesos definitely does not do that. 
This might be a leak in libnl that is used by port mapping isolator.

> Mesos Agent leaking sockets when network isolation is ON
> 
>
> Key: MESOS-6420
> URL: https://issues.apache.org/jira/browse/MESOS-6420
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation, network, slave
>Affects Versions: 1.0.2
>Reporter: Santhosh Shanmugham
>
> Mesos Agent leaks one socket per task launched and eventually runs out of 
> sockets. We were able to track it down to the network isolator 
> (port_mapping.cpp). When we turned off the port mapping isolator, no file 
> descriptors were leaked. The leaked fd is a SOCK_STREAM socket.
> Leaked Sockets:
> sshanmugham[5]smf1-aeu-07-sr3(mesos.test.slave) ~ $ sudo lsof -p $(pgrep -u 
> root -o -f /usr/local/sbin/mesos-slave) -nP | grep "can't"
> [sudo] password for sshanmugham:
> mesos-sla 57688 root   19u  sock0,6  0t0 2993216948 can't 
> identify protocol
> mesos-sla 57688 root   27u  sock0,6  0t0 2993216468 can't 
> identify protocol
> Extract from strace:
> ...
> [pid 57701] 19:14:02.493718 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494395 close(19)   = 0
> [pid 57701] 19:14:02.494448 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494844 close(19)   = 0
> [pid 57701] 19:14:02.494913 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.495565 close(19)   = 0
> [pid 57701] 19:14:02.495617 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496072 close(19)   = 0
> [pid 57701] 19:14:02.496128 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496758 close(19)   = 0
> [pid 57701] 19:14:02.496812 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497270 close(19)   = 0
> [pid 57701] 19:14:02.497319 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497698 close(19)   = 0
> [pid 57701] 19:14:02.497750 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498407 close(19)   = 0
> [pid 57701] 19:14:02.498456 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498899 close(19)   = 0
> [pid 57701] 19:14:02.498963 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 63682] 19:14:02.499091 close(18 
> [pid 57701] 19:14:02.499634 close(19)   = 0
> [pid 57701] 19:14:02.499689 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500044 close(19)   = 0
> [pid 57701] 19:14:02.500093 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500734 close(19)   = 0
> [pid 57701] 19:14:02.500782 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.501271 close(19)   = 0
> [pid 57701] 19:14:02.501339 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.502030 close(19)   = 0
> [pid 57701] 19:14:02.502101 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 19
> ...
> ...
> [pid 57691] 19:18:03.461022 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461345 open("/etc/selinux/config", O_RDONLY  ...>
> [pid 57691] 19:18:03.461460 close(27)   = 0
> [pid 57691] 19:18:03.461520 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461632 close(3 
> [pid  6138] 19:18:03.461781 open("/proc/mounts", O_RDONLY 
> [pid  6138] 19:18:03.462190 close(3 
> [pid 57691] 19:18:03.462374 close(27)   = 0
> [pid 57691] 19:18:03.462430 socket(PF_NETLINK, SOCK_RAW, 0 
> [pid  6138] 19:18:03.462456 open("/proc/net/psched", O_RDONLY 
> [pid  6138] 19:18:03.462678 close(3 
> [pid  6138] 19:18:03.462915 open("/etc/libnl/classid", O_RDONLY  ...>
> [pid 57691] 19:18:03.463046 close(27)   = 0
> [pid 57691] 19:18:03.463111 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.463225 close(3 
> [pid 57691] 19:18:03.463845 close(27)   = 0
> [pid 57691] 19:18:03.463911 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.464604 close(27)   = 0
> [pid 57691] 19:18:03.464664 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465074 close(27)   = 0
> [pid 57691] 19:18:03.465132 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465862 close(27)   = 0
> [pid 57691] 19:18:03.465928 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.466713 close(27)   = 0
> [pid 57691] 19:18:03.466780 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.467472 close(27)   = 0
> [pid 57691] 19:18:03.467524 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.468012 close(27)   = 0
> [pid 57691] 19:18:03.468075 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.468799 close(27)   = 0
> [pid 57691] 19:18:03.468950 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.469505 close(27)   = 0
> [pid 57691] 19:18:03.469578 

[jira] [Created] (MESOS-6421) Agent enters re-registration loop after "--recovery=cleanup"

2016-10-19 Thread Neil Conway (JIRA)
Neil Conway created MESOS-6421:
--

 Summary: Agent enters re-registration loop after 
"--recovery=cleanup"
 Key: MESOS-6421
 URL: https://issues.apache.org/jira/browse/MESOS-6421
 Project: Mesos
  Issue Type: Bug
  Components: general
Reporter: Neil Conway
Assignee: Neil Conway


Repro:

1. Start a master
2. Start an agent. The agent should register with the master.
3. Ctrl-C agent.
4. Start agent with {{--recover=cleanup}}. The agent will clean up {{work_dir}} and 
terminate.
5. Start agent without any recovery flag. Agent will then proceed to register 
and obtain a new slave ID. However, it will repeatedly disconnect and 
re-register with the master:

{noformat}
I1019 16:59:21.859954 3747840 slave.cpp:915] New master detected at 
master@10.0.9.176:5050
I1019 16:59:21.859997 3747840 slave.cpp:936] No credentials provided. 
Attempting to register without authentication
I1019 16:59:21.860038 3747840 slave.cpp:947] Detecting new master
I1019 16:59:22.712779 528384 slave.cpp:1115] Registered with master 
master@10.0.9.176:5050; given agent ID 3615d417-b3fd-4e00-b794-9396456c3f6a-S1
I1019 16:59:22.712908 2138112 status_update_manager.cpp:184] Resuming sending 
status updates
I1019 16:59:22.713953 528384 slave.cpp:1175] Forwarding total oversubscribed 
resources {}
I1019 16:59:25.082317 3747840 slave.cpp:4147] Master marked the agent as 
disconnected but the agent considers itself registered! Forcing re-registration.
I1019 16:59:25.082756 3747840 slave.cpp:904] Re-detecting master
I1019 16:59:25.082767 2138112 status_update_manager.cpp:177] Pausing sending 
status updates
I1019 16:59:25.082801 3747840 slave.cpp:947] Detecting new master
I1019 16:59:25.083293 4284416 status_update_manager.cpp:177] Pausing sending 
status updates
I1019 16:59:25.083300 528384 slave.cpp:915] New master detected at 
master@10.0.9.176:5050
I1019 16:59:25.083349 528384 slave.cpp:936] No credentials provided. Attempting 
to register without authentication
I1019 16:59:25.083395 528384 slave.cpp:947] Detecting new master
I1019 16:59:25.869060 1064960 slave.cpp:1217] Re-registered with master 
master@10.0.9.176:5050
I1019 16:59:25.869246 4284416 status_update_manager.cpp:184] Resuming sending 
status updates
I1019 16:59:25.869246 1064960 slave.cpp:1253] Forwarding total oversubscribed 
resources {}
I1019 16:59:40.087697 4284416 slave.cpp:4147] Master marked the agent as 
disconnected but the agent considers itself registered! Forcing re-registration.
I1019 16:59:40.088105 4284416 slave.cpp:904] Re-detecting master
I1019 16:59:40.088120 3747840 status_update_manager.cpp:177] Pausing sending 
status updates
I1019 16:59:40.088145 4284416 slave.cpp:947] Detecting new master
I1019 16:59:40.088599 2138112 status_update_manager.cpp:177] Pausing sending 
status updates
I1019 16:59:40.088623 1601536 slave.cpp:915] New master detected at 
master@10.0.9.176:5050
I1019 16:59:40.088656 1601536 slave.cpp:936] No credentials provided. 
Attempting to register without authentication
I1019 16:59:40.088716 1601536 slave.cpp:947] Detecting new master
I1019 16:59:41.006837 2138112 slave.cpp:1217] Re-registered with master 
master@10.0.9.176:5050
I1019 16:59:41.007017 2674688 status_update_manager.cpp:184] Resuming sending 
status updates
I1019 16:59:41.007035 2138112 slave.cpp:1253] Forwarding total oversubscribed 
resources {}
I1019 16:59:55.089197 2674688 slave.cpp:4147] Master marked the agent as 
disconnected but the agent considers itself registered! Forcing re-registration.
{noformat}

This continues on, seemingly indefinitely. Master logs during this period:

{noformat}
I1019 16:59:04.307692 528384 master.cpp:5579] Received update of agent 
3615d417-b3fd-4e00-b794-9396456c3f6a-S0 at slave(1)@10.0.9.176:5051 
(10.0.9.176) with total oversubscribed resources {}
I1019 16:59:04.307929 528384 hierarchical.cpp:555] Agent 
3615d417-b3fd-4e00-b794-9396456c3f6a-S0 (10.0.9.176) updated with 
oversubscribed resources {} (total: cpus(*):8; mem(*):15360; disk(*):470832; 
ports(*):[31000-32000], allocated: {})
E1019 16:59:06.411664 4820992 process.cpp:2154] Failed to shutdown socket with 
fd 12: Socket is not connected
I1019 16:59:06.411717 2138112 master.cpp:1259] Agent 
3615d417-b3fd-4e00-b794-9396456c3f6a-S0 at slave(1)@10.0.9.176:5051 
(10.0.9.176) disconnected
I1019 16:59:06.411767 2138112 master.cpp:2948] Disconnecting agent 
3615d417-b3fd-4e00-b794-9396456c3f6a-S0 at slave(1)@10.0.9.176:5051 (10.0.9.176)
E1019 16:59:06.411854 4820992 process.cpp:2154] Failed to shutdown socket with 
fd 9: Socket is not connected
I1019 16:59:06.411887 2138112 master.cpp:2967] Deactivating agent 
3615d417-b3fd-4e00-b794-9396456c3f6a-S0 at slave(1)@10.0.9.176:5051 (10.0.9.176)
I1019 16:59:06.412001 2674688 hierarchical.cpp:584] Agent 
3615d417-b3fd-4e00-b794-9396456c3f6a-S0 deactivated
E1019 16:59:10.078735 4820992 process.cpp:2154] Failed to shutdown 

[jira] [Created] (MESOS-6420) Mesos Agent leaking sockets when network isolation is ON

2016-10-19 Thread Santhosh Shanmugham (JIRA)
Santhosh Shanmugham created MESOS-6420:
--

 Summary: Mesos Agent leaking sockets when network isolation is ON
 Key: MESOS-6420
 URL: https://issues.apache.org/jira/browse/MESOS-6420
 Project: Mesos
  Issue Type: Bug
  Components: isolation, network, slave
Affects Versions: 1.0.2
Reporter: Santhosh Shanmugham


Mesos Agent leaks one socket per task launched and eventually runs out of 
sockets. We were able to track it down to the network isolator 
(port_mapping.cpp). When we turned off the port mapping isolator, no file 
descriptors were leaked. The leaked fd is a SOCK_STREAM socket.

Leaked Sockets:
sshanmugham[5]smf1-aeu-07-sr3(mesos.test.slave) ~ $ sudo lsof -p $(pgrep -u 
root -o -f /usr/local/sbin/mesos-slave) -nP | grep "can't"
[sudo] password for sshanmugham:
mesos-sla 57688 root   19u  sock0,6  0t0 2993216948 can't identify 
protocol
mesos-sla 57688 root   27u  sock0,6  0t0 2993216468 can't identify 
protocol

Extract from strace:

...
[pid 57701] 19:14:02.493718 socket(PF_NETLINK, SOCK_RAW, 0) = 19
[pid 57701] 19:14:02.494395 close(19)   = 0
[pid 57701] 19:14:02.494448 socket(PF_NETLINK, SOCK_RAW, 0) = 19
[pid 57701] 19:14:02.494844 close(19)   = 0
[pid 57701] 19:14:02.494913 socket(PF_NETLINK, SOCK_RAW, 0) = 19
[pid 57701] 19:14:02.495565 close(19)   = 0
[pid 57701] 19:14:02.495617 socket(PF_NETLINK, SOCK_RAW, 0) = 19
[pid 57701] 19:14:02.496072 close(19)   = 0
[pid 57701] 19:14:02.496128 socket(PF_NETLINK, SOCK_RAW, 0) = 19
[pid 57701] 19:14:02.496758 close(19)   = 0
[pid 57701] 19:14:02.496812 socket(PF_NETLINK, SOCK_RAW, 0) = 19
[pid 57701] 19:14:02.497270 close(19)   = 0
[pid 57701] 19:14:02.497319 socket(PF_NETLINK, SOCK_RAW, 0) = 19
[pid 57701] 19:14:02.497698 close(19)   = 0
[pid 57701] 19:14:02.497750 socket(PF_NETLINK, SOCK_RAW, 0) = 19
[pid 57701] 19:14:02.498407 close(19)   = 0
[pid 57701] 19:14:02.498456 socket(PF_NETLINK, SOCK_RAW, 0) = 19
[pid 57701] 19:14:02.498899 close(19)   = 0
[pid 57701] 19:14:02.498963 socket(PF_NETLINK, SOCK_RAW, 0) = 19
[pid 63682] 19:14:02.499091 close(18 
[pid 57701] 19:14:02.499634 close(19)   = 0
[pid 57701] 19:14:02.499689 socket(PF_NETLINK, SOCK_RAW, 0) = 19
[pid 57701] 19:14:02.500044 close(19)   = 0
[pid 57701] 19:14:02.500093 socket(PF_NETLINK, SOCK_RAW, 0) = 19
[pid 57701] 19:14:02.500734 close(19)   = 0
[pid 57701] 19:14:02.500782 socket(PF_NETLINK, SOCK_RAW, 0) = 19
[pid 57701] 19:14:02.501271 close(19)   = 0
[pid 57701] 19:14:02.501339 socket(PF_NETLINK, SOCK_RAW, 0) = 19
[pid 57701] 19:14:02.502030 close(19)   = 0
[pid 57701] 19:14:02.502101 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 19
...

...
[pid 57691] 19:18:03.461022 socket(PF_NETLINK, SOCK_RAW, 0) = 27
[pid  6138] 19:18:03.461345 open("/etc/selinux/config", O_RDONLY 
[pid 57691] 19:18:03.461460 close(27)   = 0
[pid 57691] 19:18:03.461520 socket(PF_NETLINK, SOCK_RAW, 0) = 27
[pid  6138] 19:18:03.461632 close(3 
[pid  6138] 19:18:03.461781 open("/proc/mounts", O_RDONLY 
[pid  6138] 19:18:03.462190 close(3 
[pid 57691] 19:18:03.462374 close(27)   = 0
[pid 57691] 19:18:03.462430 socket(PF_NETLINK, SOCK_RAW, 0 
[pid  6138] 19:18:03.462456 open("/proc/net/psched", O_RDONLY 
[pid  6138] 19:18:03.462678 close(3 
[pid  6138] 19:18:03.462915 open("/etc/libnl/classid", O_RDONLY 
[pid 57691] 19:18:03.463046 close(27)   = 0
[pid 57691] 19:18:03.463111 socket(PF_NETLINK, SOCK_RAW, 0) = 27
[pid  6138] 19:18:03.463225 close(3 
[pid 57691] 19:18:03.463845 close(27)   = 0
[pid 57691] 19:18:03.463911 socket(PF_NETLINK, SOCK_RAW, 0) = 27
[pid 57691] 19:18:03.464604 close(27)   = 0
[pid 57691] 19:18:03.464664 socket(PF_NETLINK, SOCK_RAW, 0) = 27
[pid 57691] 19:18:03.465074 close(27)   = 0
[pid 57691] 19:18:03.465132 socket(PF_NETLINK, SOCK_RAW, 0) = 27
[pid 57691] 19:18:03.465862 close(27)   = 0
[pid 57691] 19:18:03.465928 socket(PF_NETLINK, SOCK_RAW, 0) = 27
[pid 57691] 19:18:03.466713 close(27)   = 0
[pid 57691] 19:18:03.466780 socket(PF_NETLINK, SOCK_RAW, 0) = 27
[pid 57691] 19:18:03.467472 close(27)   = 0
[pid 57691] 19:18:03.467524 socket(PF_NETLINK, SOCK_RAW, 0) = 27
[pid 57691] 19:18:03.468012 close(27)   = 0
[pid 57691] 19:18:03.468075 socket(PF_NETLINK, SOCK_RAW, 0) = 27
[pid 57691] 19:18:03.468799 close(27)   = 0
[pid 57691] 19:18:03.468950 socket(PF_NETLINK, SOCK_RAW, 0) = 27
[pid 57691] 19:18:03.469505 close(27)   = 0
[pid 57691] 19:18:03.469578 socket(PF_NETLINK, SOCK_RAW, 0) = 27
[pid 57691] 19:18:03.470301 close(27)   = 0
[pid 57691] 19:18:03.470353 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 27
...

The last socket that was created never has a corresponding close().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (MESOS-6414) cgroups isolator cleanup failed when the hierarchy is cleanup by docker daemon

2016-10-19 Thread Anindya Sinha (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anindya Sinha updated MESOS-6414:
-
Comment: was deleted

(was: RR published for review:
https://reviews.apache.org/r/53031/)

> cgroups isolator cleanup failed when the hierarchy is cleanup by docker 
> daemon 
> ---
>
> Key: MESOS-6414
> URL: https://issues.apache.org/jira/browse/MESOS-6414
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups
>Reporter: Anindya Sinha
>Assignee: Anindya Sinha
>Priority: Minor
>  Labels: containerizer
>
> Now if we launch a docker container in the Mesos containerizer, a race may 
> happen between the docker daemon and the Mesos containerizer during cgroups 
> operations. For example, when a docker container running in the Mesos 
> containerizer exits due to OOM, the Mesos containerizer destroys the 
> following hierarchies:
> {code}
> /sys/fs/cgroup/freezer/mesos//
> /sys/fs/cgroup/freezer/mesos/
> {code}
> But the docker daemon may destroy 
> {code}
> /sys/fs/cgroup/freezer/mesos//
> {code}
> at the same time.
> If the docker daemon destroys the hierarchy first, the Mesos containerizer 
> fails during {{CgroupsIsolatorProcess::cleanup}} because it cannot find 
> that hierarchy when destroying it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6419) The 'master/teardown' endpoint should support tearing down 'unregistered_frameworks'.

2016-10-19 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-6419:
---

 Summary: The 'master/teardown' endpoint should support tearing 
down 'unregistered_frameworks'.
 Key: MESOS-6419
 URL: https://issues.apache.org/jira/browse/MESOS-6419
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 1.0.1, 0.28.2, 0.27.3, 0.26.2
Reporter: Gilbert Song
Priority: Critical


This issue is exposed by 
[MESOS-6400](https://issues.apache.org/jira/browse/MESOS-6400). When a user 
tries to tear down an 'unregistered_framework' via the 'master/teardown' 
endpoint, a bad request is returned: `No framework found with specified ID`.

Ideally, we should support tearing down an unregistered framework, since such 
frameworks may occur due to a network partition, and all their orphan tasks 
still occupy resources. It would be a nightmare if a user had to wait for the 
unregistered framework to re-register before getting those resources back.

This may be the initial implementation: 
https://github.com/apache/mesos/commit/bb8375975e92ee722befb478ddc3b2541d1ccaa9



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6414) Task cleanup fails when the containers includes cgroups not owned by Mesos

2016-10-19 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-6414:

Description: 
Now if we launch a docker container in the Mesos containerizer, a race may happen
between the docker daemon and the Mesos containerizer during cgroups operations.
For example, when a docker container running in the Mesos containerizer exits
due to OOM, the Mesos containerizer destroys the following hierarchies:

{code}
/sys/fs/cgroup/freezer/mesos//
/sys/fs/cgroup/freezer/mesos/
{code}

But the docker daemon may destroy 

{code}
/sys/fs/cgroup/freezer/mesos//
{code}

at the same time.

If the docker daemon destroys the hierarchy first, the Mesos containerizer
fails during {{CgroupsIsolatorProcess::cleanup}} because it cannot find
that hierarchy when destroying it.

  was:
If a mesos task is launched in a cgroup outside of the context of Mesos,  Mesos 
is unaware of that cgroup created in the task context.

Now when the Mesos task terminates: Mesos tries to cleanup all cgroups within 
the top level cgroup it knows about. If the cgroup created in the task context 
exists when LinuxLauncherProcess::destroy() is called but is eventually cleaned 
up by the container before we do a freeze() or thaw() or remove(), it fails at 
those stages leading to an incomplete cleanup of the container.


> Task cleanup fails when the containers includes cgroups not owned by Mesos
> --
>
> Key: MESOS-6414
> URL: https://issues.apache.org/jira/browse/MESOS-6414
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups
>Reporter: Anindya Sinha
>Assignee: Anindya Sinha
>Priority: Minor
>
> Now if we launch a docker container in the Mesos containerizer, a race may 
> happen between the docker daemon and the Mesos containerizer during cgroups 
> operations. For example, when a docker container running in the Mesos 
> containerizer exits due to OOM, the Mesos containerizer destroys the 
> following hierarchies:
> {code}
> /sys/fs/cgroup/freezer/mesos//
> /sys/fs/cgroup/freezer/mesos/
> {code}
> But the docker daemon may destroy 
> {code}
> /sys/fs/cgroup/freezer/mesos//
> {code}
> at the same time.
> If the docker daemon destroys the hierarchy first, the Mesos containerizer 
> fails during {{CgroupsIsolatorProcess::cleanup}} because it cannot find 
> that hierarchy when destroying it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6414) cgroups isolator cleanup failed when the hierarchy is cleanup by docker daemon

2016-10-19 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-6414:

Summary: cgroups isolator cleanup failed when the hierarchy is cleanup by 
docker daemon   (was: Task cleanup fails when the containers includes cgroups 
not owned by Mesos)

> cgroups isolator cleanup failed when the hierarchy is cleanup by docker 
> daemon 
> ---
>
> Key: MESOS-6414
> URL: https://issues.apache.org/jira/browse/MESOS-6414
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups
>Reporter: Anindya Sinha
>Assignee: Anindya Sinha
>Priority: Minor
>
> Now if we launch a docker container in the Mesos containerizer, a race may 
> happen between the docker daemon and the Mesos containerizer during cgroups 
> operations. For example, when a docker container running in the Mesos 
> containerizer exits due to OOM, the Mesos containerizer destroys the 
> following hierarchies:
> {code}
> /sys/fs/cgroup/freezer/mesos//
> /sys/fs/cgroup/freezer/mesos/
> {code}
> But the docker daemon may destroy 
> {code}
> /sys/fs/cgroup/freezer/mesos//
> {code}
> at the same time.
> If the docker daemon destroys the hierarchy first, the Mesos containerizer 
> fails during {{CgroupsIsolatorProcess::cleanup}} because it cannot find 
> that hierarchy when destroying it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6414) Task cleanup fails when the containers includes cgroups not owned by Mesos

2016-10-19 Thread Anindya Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15589394#comment-15589394
 ] 

Anindya Sinha commented on MESOS-6414:
--

Let us assume a task is launched which creates a sub-cgroup through an external 
service. So, the cgroup hierarchy is something like:
/sys/fs/cgroup/freezer/mesos//

Say the task fails, so the container exits, and when launcher->destroy() is 
called, we do a recursive cgroups::get() to get all cgroups and we get absolute 
paths for both  as well as . And then the 
TasksKiller() is initiated for  as well as  resulting 
in freeze(), thaw(), etc. for each of them in parallel, followed by a killed().

However, since the  is created by an external service, that service 
may do a cleanup of  without Mesos' knowledge.  If that happens, 
any of the cleanup operations (freeze(), thaw(), etc) for the  may 
fail in the flow of TasksKiller() for the  (since the external 
service removed /sys/fs/cgroup/freezer/mesos// before 
Mesos could do a cleanup in TasksKiller). As a result, we exit out of cleanup 
of  at that point which seems incorrect since all cleanup has 
actually happened.

To avoid this issue (i.e. the race of cleanup of  between the external 
service and Mesos), I am suggesting that we treat a failure in any of these steps as 
a failure in all cases except when the failure is due to the non-existence of 
 (i.e. it has already been cleaned up by an external service, so we 
treat this as a success).
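
A minimal sketch of that suggestion, assuming the stout/libprocess helpers 
({{os::exists}}, {{path::join}}, {{cgroups::destroy}}, {{Future::repair}}) 
behave as named; this illustrates the idea only and is not the patch under 
review:

{code}
// Hypothetical helper built around the suggestion above: if the sub-cgroup
// has already been removed by the external service, treat the cleanup as a
// success instead of failing the whole TasksKiller flow.
#include <string>

#include <process/future.hpp>

#include <stout/nothing.hpp>
#include <stout/os.hpp>
#include <stout/path.hpp>

#include "linux/cgroups.hpp"

using process::Future;

Future<Nothing> destroyIfPresent(
    const std::string& hierarchy,
    const std::string& cgroup)
{
  if (!os::exists(path::join(hierarchy, cgroup))) {
    return Nothing();  // Already gone: nothing left to clean up.
  }

  return cgroups::destroy(hierarchy, cgroup)
    .repair([=](const Future<Nothing>& failure) -> Future<Nothing> {
      // The cgroup may have vanished between the check and the destroy;
      // treat that race as a success and propagate any other failure.
      if (!os::exists(path::join(hierarchy, cgroup))) {
        return Nothing();
      }

      return failure;
    });
}
{code}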



> Task cleanup fails when the containers includes cgroups not owned by Mesos
> --
>
> Key: MESOS-6414
> URL: https://issues.apache.org/jira/browse/MESOS-6414
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups
>Reporter: Anindya Sinha
>Assignee: Anindya Sinha
>Priority: Minor
>
> If a Mesos task is launched in a cgroup outside of the context of Mesos, 
> Mesos is unaware of that cgroup created in the task context.
> Now when the Mesos task terminates, Mesos tries to clean up all cgroups within 
> the top-level cgroup it knows about. If the cgroup created in the task 
> context exists when LinuxLauncherProcess::destroy() is called but is 
> eventually cleaned up by the container before we do a freeze(), thaw() or 
> remove(), those stages fail, leading to an incomplete cleanup of the 
> container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6357) `NestedMesosContainerizerTest.ROOT_CGROUPS_ParentExit` is flaky in Debian 8.

2016-10-19 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15589383#comment-15589383
 ] 

Gilbert Song commented on MESOS-6357:
-

This patch should fix the fd syntax error.
https://reviews.apache.org/r/53024/

> `NestedMesosContainerizerTest.ROOT_CGROUPS_ParentExit` is flaky in Debian 8.
> 
>
> Key: MESOS-6357
> URL: https://issues.apache.org/jira/browse/MESOS-6357
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Affects Versions: 1.1.0
> Environment: Debian 8 with SSL enabled
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>  Labels: flaky-test
>
> {noformat}
> [00:21:51] :   [Step 10/10] [ RUN  ] 
> NestedMesosContainerizerTest.ROOT_CGROUPS_ParentExit
> [00:21:51]W:   [Step 10/10] I1008 00:21:51.357839 23530 
> containerizer.cpp:202] Using isolation: 
> cgroups/cpu,filesystem/linux,namespaces/pid,network/cni,volume/image
> [00:21:51]W:   [Step 10/10] I1008 00:21:51.361143 23530 
> linux_launcher.cpp:150] Using /sys/fs/cgroup/freezer as the freezer hierarchy 
> for the Linux launcher
> [00:21:51]W:   [Step 10/10] I1008 00:21:51.366930 23547 
> containerizer.cpp:557] Recovering containerizer
> [00:21:51]W:   [Step 10/10] I1008 00:21:51.367962 23551 provisioner.cpp:253] 
> Provisioner recovery complete
> [00:21:51]W:   [Step 10/10] I1008 00:21:51.368253 23549 
> containerizer.cpp:954] Starting container 
> 42589936-56b2-4e41-86d8-447bfaba4666 for executor 'executor' of framework 
> [00:21:51]W:   [Step 10/10] I1008 00:21:51.368577 23548 cgroups.cpp:404] 
> Creating cgroup at 
> '/sys/fs/cgroup/cpu,cpuacct/mesos_test_458f8018-67e7-4cc6-8126-a535974db35d/42589936-56b2-4e41-86d8-447bfaba4666'
>  for container 42589936-56b2-4e41-86d8-447bfaba4666
> [00:21:51]W:   [Step 10/10] I1008 00:21:51.369863 23544 cpu.cpp:103] Updated 
> 'cpu.shares' to 1024 (cpus 1) for container 
> 42589936-56b2-4e41-86d8-447bfaba4666
> [00:21:51]W:   [Step 10/10] I1008 00:21:51.370384 23545 
> containerizer.cpp:1443] Launching 'mesos-containerizer' with flags 
> '--command="{"shell":true,"value":"read key <&30"}" --help="false" 
> --pipe_read="30" --pipe_write="34" 
> --pre_exec_commands="[{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/mnt\/teamcity\/work\/4240ba9ddd0997c3\/build\/src\/mesos-containerizer"},{"shell":true,"value":"mount
>  -n -t proc proc \/proc -o nosuid,noexec,nodev"}]" 
> --runtime_directory="/mnt/teamcity/temp/buildTmp/NestedMesosContainerizerTest_ROOT_CGROUPS_ParentExit_sEbtvQ/containers/42589936-56b2-4e41-86d8-447bfaba4666"
>  --unshare_namespace_mnt="false" 
> --working_directory="/mnt/teamcity/temp/buildTmp/NestedMesosContainerizerTest_ROOT_CGROUPS_ParentExit_MqjHi0"'
> [00:21:51]W:   [Step 10/10] I1008 00:21:51.370483 23544 
> linux_launcher.cpp:421] Launching container 
> 42589936-56b2-4e41-86d8-447bfaba4666 and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWPID
> [00:21:51]W:   [Step 10/10] I1008 00:21:51.374867 23545 
> containerizer.cpp:1480] Checkpointing container's forked pid 14139 to 
> '/mnt/teamcity/temp/buildTmp/NestedMesosContainerizerTest_ROOT_CGROUPS_ParentExit_gzjeKG/meta/slaves/frameworks/executors/executor/runs/42589936-56b2-4e41-86d8-447bfaba4666/pids/forked.pid'
> [00:21:51]W:   [Step 10/10] I1008 00:21:51.376519 23551 
> containerizer.cpp:1648] Starting nested container 
> 42589936-56b2-4e41-86d8-447bfaba4666.a5bc9913-c32c-40c6-ab78-2b08910847f8
> [00:21:51]W:   [Step 10/10] I1008 00:21:51.377296 23549 
> containerizer.cpp:1443] Launching 'mesos-containerizer' with flags 
> '--command="{"shell":true,"value":"sleep 1000"}" --help="false" 
> --pipe_read="30" --pipe_write="34" 
> --pre_exec_commands="[{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/mnt\/teamcity\/work\/4240ba9ddd0997c3\/build\/src\/mesos-containerizer"},{"shell":true,"value":"mount
>  -n -t proc proc \/proc -o nosuid,noexec,nodev"}]" 
> --runtime_directory="/mnt/teamcity/temp/buildTmp/NestedMesosContainerizerTest_ROOT_CGROUPS_ParentExit_sEbtvQ/containers/42589936-56b2-4e41-86d8-447bfaba4666/containers/a5bc9913-c32c-40c6-ab78-2b08910847f8"
>  --unshare_namespace_mnt="false" 
> --working_directory="/mnt/teamcity/temp/buildTmp/NestedMesosContainerizerTest_ROOT_CGROUPS_ParentExit_MqjHi0/containers/a5bc9913-c32c-40c6-ab78-2b08910847f8"'
> [00:21:51]W:   [Step 10/10] I1008 00:21:51.377424 23548 
> linux_launcher.cpp:421] Launching nested container 
> 42589936-56b2-4e41-86d8-447bfaba4666.a5bc9913-c32c-40c6-ab78-2b08910847f8 and 
> cloning with namespaces CLONE_NEWNS | CLONE_NEWPID
> [00:21:51] :   [Step 10/10] Executing pre-exec command 
> 

[jira] [Commented] (MESOS-6414) Task cleanup fails when the containers includes cgroups not owned by Mesos

2016-10-19 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15589364#comment-15589364
 ] 

haosdent commented on MESOS-6414:
-

Hi [~gilbert], I chatted with [~anindya.sinha] before. He means the cgroups 
destroy race between the docker daemon and the mesos agent when a docker 
container is launched inside a Mesos container. 
Let me update the ticket. 

> Task cleanup fails when the containers includes cgroups not owned by Mesos
> --
>
> Key: MESOS-6414
> URL: https://issues.apache.org/jira/browse/MESOS-6414
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups
>Reporter: Anindya Sinha
>Assignee: Anindya Sinha
>Priority: Minor
>
> If a Mesos task is launched in a cgroup outside of the context of Mesos, 
> Mesos is unaware of that cgroup created in the task context.
> Now when the Mesos task terminates, Mesos tries to clean up all cgroups within 
> the top-level cgroup it knows about. If the cgroup created in the task 
> context exists when LinuxLauncherProcess::destroy() is called but is 
> eventually cleaned up by the container before we do a freeze(), thaw() or 
> remove(), those stages fail, leading to an incomplete cleanup of the 
> container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6400) Not able to remove Orphan Tasks

2016-10-19 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15589327#comment-15589327
 ] 

Gilbert Song commented on MESOS-6400:
-

Did you try tearing down the old framework (whose tasks occupied your 
resources)?

> Not able to remove Orphan Tasks
> ---
>
> Key: MESOS-6400
> URL: https://issues.apache.org/jira/browse/MESOS-6400
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
> Environment: centos 7 x64
>Reporter: kasim
>
> The problem may be caused by Mesos and Marathon being out of sync
> https://github.com/mesosphere/marathon/issues/616
> When I found the Orphan Tasks, I
> 1. restarted Marathon
> 2. Marathon did not re-sync the Orphan Tasks, but started new tasks.
> 3. The Orphan Tasks still held their resources, so I have to delete them.
> 4. I found that all Orphan Tasks are under framework 
> `ef169d8a-24fc-41d1-8b0d-c67718937a48-`;
> curl -XGET `http://c196:5050/master/frameworks` shows that framework under 
> `unregistered_frameworks`
> {code}
> {
> "frameworks": [
> .
> ],
> "completed_frameworks": [ ],
> "unregistered_frameworks": [
> "ef169d8a-24fc-41d1-8b0d-c67718937a48-",
> "ef169d8a-24fc-41d1-8b0d-c67718937a48-",
> "ef169d8a-24fc-41d1-8b0d-c67718937a48-"
> ]
> }
> {code}
> 5. Tried {code}curl -XPOST http://c196:5050/master/teardown -d 
> 'frameworkId=ef169d8a-24fc-41d1-8b0d-c67718937a48-' {code}
> but got `No framework found with specified ID`.
> So I have no way to delete the Orphan Tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6223) Allow agents to re-register post a host reboot

2016-10-19 Thread Megha (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Megha updated MESOS-6223:
-
Shepherd: Yan Xu

> Allow agents to re-register post a host reboot
> --
>
> Key: MESOS-6223
> URL: https://issues.apache.org/jira/browse/MESOS-6223
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Reporter: Megha
>Assignee: Megha
>
> The agent doesn't recover its state after a host reboot; it registers with the 
> master and gets a new SlaveID. With partition awareness, agents are now 
> allowed to re-register after they have been marked Unreachable. The executors 
> are terminated on the agent when it reboots anyway, so there is no harm in 
> letting the agent keep its SlaveID, re-register with the master and reconcile 
> the lost executors. This is a prerequisite for supporting 
> persistent/restartable tasks in Mesos (MESOS-3545).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-6223) Allow agents to re-register post a host reboot

2016-10-19 Thread Megha (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Megha reassigned MESOS-6223:


Assignee: Megha

> Allow agents to re-register post a host reboot
> --
>
> Key: MESOS-6223
> URL: https://issues.apache.org/jira/browse/MESOS-6223
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Reporter: Megha
>Assignee: Megha
>
> The agent doesn't recover its state after a host reboot; it registers with the 
> master and gets a new SlaveID. With partition awareness, agents are now 
> allowed to re-register after they have been marked Unreachable. The executors 
> are terminated on the agent when it reboots anyway, so there is no harm in 
> letting the agent keep its SlaveID, re-register with the master and reconcile 
> the lost executors. This is a prerequisite for supporting 
> persistent/restartable tasks in Mesos (MESOS-3545).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6414) Task cleanup fails when the containers includes cgroups not owned by Mesos

2016-10-19 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15589302#comment-15589302
 ] 

Gilbert Song commented on MESOS-6414:
-

[~anindya.sinha], would you mind providing more context about why you want a 
Mesos task launched in a cgroup that is not created by Mesos? 
LinuxLauncher::destroy() would clean up all cgroups that were created by 
fork(). It assumes all cgroups under the freezer hierarchy were previously 
created by Mesos.

Or, as [~haosd...@gmail.com] mentioned, are you asking for cgroup namespace 
support?

> Task cleanup fails when the containers includes cgroups not owned by Mesos
> --
>
> Key: MESOS-6414
> URL: https://issues.apache.org/jira/browse/MESOS-6414
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups
>Reporter: Anindya Sinha
>Assignee: Anindya Sinha
>Priority: Minor
>
> If a Mesos task is launched in a cgroup outside of the context of Mesos, 
> Mesos is unaware of that cgroup created in the task context.
> Now when the Mesos task terminates, Mesos tries to clean up all cgroups within 
> the top-level cgroup it knows about. If the cgroup created in the task 
> context exists when LinuxLauncherProcess::destroy() is called but is 
> eventually cleaned up by the container before we do a freeze(), thaw() or 
> remove(), those stages fail, leading to an incomplete cleanup of the 
> container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6418) Avoid popup a new window when open stdout/stderr of the executor

2016-10-19 Thread haosdent (JIRA)
haosdent created MESOS-6418:
---

 Summary: Avoid popup a new window when open stdout/stderr of the 
executor
 Key: MESOS-6418
 URL: https://issues.apache.org/jira/browse/MESOS-6418
 Project: Mesos
  Issue Type: Improvement
Reporter: haosdent






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4945) Garbage collect unused docker layers in the store.

2016-10-19 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586813#comment-15586813
 ] 

Zhitao Li edited comment on MESOS-4945 at 10/19/16 4:39 PM:


Revised plan in rough steps:
* For each image, checkpoint a) container IDs, b) the time the last container using 
it was destroyed, and c) the size of each layer;
** TODO: how to deal with migration? The idea is to pass more info into the 
recover() chain of containerizer -> provisioner -> store;
* Change the store interface (see the sketch after the open questions below):
** "get(Image)" to "get(Image, ContainerID)": the added containerID field 
can be used to implement ref counting and further bookkeeping (i.e. get local 
image information);
** add a "remove(Image, ContainerID)" virtual function: this is optional in 
that a store which does not do ref counting can have an empty implementation.
* Make sure provisioner::destroy() calls store::remove(Image, ContainerID);
* Add a command line flag for the docker store capacity limit (in bytes);
* In (docker) store::get(Image, ContainerID), after a pull is done, calculate the 
total layer sizes; if above the store capacity, remove unused images (determined by 
empty container IDs), sorted by the last time they were used. Any layer not shared by 
the remaining images is also removed, until the total size drops below the capacity.

Open questions: 

1) In this design, we have explicit reference counting between 
{{Container}} and {{Image}} in the store. However, this information could be 
constructed on the fly from all containers in the {{Containerizer}} class. Do we 
consider this "double accounting" problematic or error-prone?
2) Is calling the new {{remove(Image, ContainerID)}} from 
{{Provisioner::destroy()}} sufficient to make sure all bookkeeping is 
done properly?
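
A rough sketch of the proposed interface shape, as referenced above. 
{{mesos::Image}} and {{mesos::ContainerID}} are the existing protobuf types; 
the layer-path return type and the default no-op {{remove}} are 
simplifications for illustration, not committed code:

{code}
// Hypothetical shape of the ref-counted store API from the plan above.
// Returning layer paths keeps the sketch self-contained; the real store
// would keep returning its ImageInfo result.
#include <string>
#include <vector>

#include <mesos/mesos.hpp>

#include <process/future.hpp>

#include <stout/nothing.hpp>

class Store
{
public:
  virtual ~Store() {}

  // Pulls (or reuses) the image and records `containerId` as a user, so
  // the store can ref count layers per container.
  virtual process::Future<std::vector<std::string>> get(
      const mesos::Image& image,
      const mesos::ContainerID& containerId) = 0;

  // Drops the reference held by `containerId`. A store that does not do
  // ref counting can keep this default no-op.
  virtual process::Future<Nothing> remove(
      const mesos::Image& image,
      const mesos::ContainerID& containerId)
  {
    return Nothing();
  }
};
{code}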


was (Author: zhitao):
Revised plan in rough steps:
* For each image, checkpoint a) container ids, b) time of last container using 
it being destroyed, and c) size of each layer;
** TODO: how do deal with migration? idea is passing in more info in 
recover() chain of containerizer -> provisioner -> store;
* Change store interface:
** "get(Image)" to "get(Image, ContainerID)": The containerID field added 
can be used to implement ref counting and further book keeping (i.e. get local 
images information);
** add "remove(Image, ContainerID)" virtual function: this is optional in 
that store which does not do ref counting can skip implementing.
*  Make sure provisioner::destroy() call store::remove(Image, ContainerID);
* Add command line flag for docker store capacity limit (in bytes);
* In (docker) store::get(Image, ContainerID), after a pull is done, calculate 
total layer sizes, if above store capacity, remove unused images (determined by 
empty container ids), sorted by last time not used. Any layer not shared by 
leftover images is also removed, until total size is dropped below capacity.

Open question: 

1) In this design, we have one explicit reference counting between 
{{Container}} and {{Image}} in store. However, this information could be 
constructed on-the-fly with all containers in {{Containerizer}} class. Do we 
consider this "double accounting" problematic, or error-prone?
2) Is calling new {{remove(Image, ContainerID)}} from 
{{Provisioner::destroy()}} sufficient to make sure all book keepings are 
properly done?

> Garbage collect unused docker layers in the store.
> --
>
> Key: MESOS-4945
> URL: https://issues.apache.org/jira/browse/MESOS-4945
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Jie Yu
>Assignee: Zhitao Li
>
> Right now, we don't have any garbage collection in place for docker layers. 
> It's not straightforward to implement because we don't know what container is 
> currently using the layer. We probably need a way to track the current usage 
> of layers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6404) My program cannot access a .so file while being run with mesos containerization on a docker image.

2016-10-19 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15589198#comment-15589198
 ] 

Jie Yu commented on MESOS-6404:
---

Resolving this ticket. Please follow MESOS-6360 for the fix.

> My program cannot access a .so file while being run with mesos 
> containerization on a docker image.
> --
>
> Key: MESOS-6404
> URL: https://issues.apache.org/jira/browse/MESOS-6404
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
> Environment: CentOS Linux release 7.2.1511 (Core) 
>Reporter: Mark Hammons
>Priority: Minor
> Attachments: Dockerfile, IUWT_140926aR_t000_ch00.log
>
>
> I have an application compiled within a docker environment called 
> ubuntu-mesos:0.11-17102016-IUWT. I've defined the executor for said 
> application with the following code: 
> val iuwtURI = CommandInfo.URI.newBuilder()
>   .setValue("http://***/IUWT.tar.gz").setExtract(true).setCache(false).build()
> val iuwtjURI = CommandInfo.URI.newBuilder()
>   .setValue("http://***/iuwtExecutor-assembly-0.1-SNAPSHOT.jar")
>   .setExecutable(false).setCache(false).build()
> val iuwtExec = "java -jar iuwtExecutor-assembly-0.1-SNAPSHOT.jar -Xmx1024M -Xmx128M"
> val iuwtCommand = CommandInfo.newBuilder.setValue(iuwtExec)
>   .addAllUris(List(iuwtjURI, iuwtURI).asJava).setShell(true).build()
> val iuwtImageInfo = Image.newBuilder().setType(Image.Type.DOCKER)
>   .setDocker(Image.Docker.newBuilder.setName("ubuntu-mesos:0.11-17102016-IUWT").build())
>   .build()
> val iuwtContInfo = ContainerInfo.MesosInfo.newBuilder().setImage(iuwtImageInfo).build()
> val iuwtContainer = ContainerInfo.newBuilder()
>   .setMesos(iuwtContInfo)
>   .setType(ContainerInfo.Type.MESOS)
>   .build()
> val iuwtExecutor = ExecutorInfo.newBuilder()
>   .setCommand(iuwtCommand)
>   .setContainer(iuwtContainer)
>   .setExecutorId(ExecutorID.newBuilder().setValue("iuwt-executor"))
>   .setName("iuwt-executor").build()
> My executor then downloads some additional data and tries to launch the 
> application with the input data. Unfortunately, the application fails to 
> launch with "exec: error while loading shared libraries: libtiff.so.5: 
> cannot open shared object file: No such file or directory". I've attached 
> logs showing libtiff.so.5 is present both in /usr/lib/x86_64-linux-gnu and in 
> /usr/lib.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-6212) Validate the name format of mesos-managed docker containers

2016-10-19 Thread Manuwela Kanade (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manuwela Kanade reassigned MESOS-6212:
--

Assignee: Manuwela Kanade

> Validate the name format of mesos-managed docker containers
> ---
>
> Key: MESOS-6212
> URL: https://issues.apache.org/jira/browse/MESOS-6212
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Affects Versions: 1.0.1
>Reporter: Marc Villacorta
>Assignee: Manuwela Kanade
>Priority: Minor
>
> Validate the name format of mesos-managed docker containers in order to avoid 
> false positives when looking for orphaned mesos tasks.
> Currently names such as _'mesos-master'_, _'mesos-agent'_ and _'mesos-dns'_ 
> are wrongly terminated when {{--docker_kill_orphans}} is set to true 
> (default).
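
For illustration, a hedged sketch of the kind of check this asks for. The 
"mesos-<slaveId>.<containerId>" naming scheme and the regex below are 
assumptions, not the docker containerizer's actual validation:

{code}
// Hypothetical helper: only treat containers whose names match the full
// Mesos-managed pattern as orphan candidates, instead of the bare "mesos-"
// prefix, so names like "mesos-master" or "mesos-dns" are left alone.
#include <regex>
#include <string>

bool isMesosManagedName(const std::string& name)
{
  // Assumed format: optional leading '/', "mesos-", a slave ID ending in
  // "-S<n>", a '.', then a container UUID.
  static const std::regex pattern(
      "^/?mesos-[0-9a-f-]+-S[0-9]+\\.[0-9a-f-]+$");

  return std::regex_match(name, pattern);
}
{code}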



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4292) Tests for quota with implicit roles.

2016-10-19 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4292:
---
Shepherd: Alexander Rukletsov
Assignee: Zhitao Li

> Tests for quota with implicit roles.
> 
>
> Key: MESOS-4292
> URL: https://issues.apache.org/jira/browse/MESOS-4292
> Project: Mesos
>  Issue Type: Task
>  Components: test
>Reporter: Alexander Rukletsov
>Assignee: Zhitao Li
>  Labels: mesosphere
>
> With the introduction of implicit roles (MESOS-3988), we should make sure 
> quota can be set for an inactive role (unknown to the master) and maybe 
> transition it to the active state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6417) Introduce an extra 'unknown' health check state.

2016-10-19 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-6417:
--

 Summary: Introduce an extra 'unknown' health check state.
 Key: MESOS-6417
 URL: https://issues.apache.org/jira/browse/MESOS-6417
 Project: Mesos
  Issue Type: Improvement
Reporter: Alexander Rukletsov


There are three logical states regarding health checks:
1) no health checks;
2) a health check is defined, but no result is available yet;
3) a health check is defined, and its result is either healthy or not.

Currently, we do not distinguish between 1) and 2), which can be problematic 
for framework authors.
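
One possible way to surface the distinction, sketched as a plain enum for 
illustration; the names are hypothetical and this is not a committed proto 
change:

{code}
// Hypothetical states a framework-facing API could report; 'UNKNOWN'
// covers case 2) above, which today is indistinguishable from case 1).
enum class HealthCheckState
{
  NONE,      // 1) no health check defined
  UNKNOWN,   // 2) health check defined, no result available yet
  HEALTHY,   // 3) defined and currently passing
  UNHEALTHY  // 3) defined and currently failing
};
{code}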



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-6352) Expose information about unreachable agents via operator API

2016-10-19 Thread Abhishek Dasgupta (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Dasgupta reassigned MESOS-6352:


Assignee: Abhishek Dasgupta

> Expose information about unreachable agents via operator API
> 
>
> Key: MESOS-6352
> URL: https://issues.apache.org/jira/browse/MESOS-6352
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Neil Conway
>Assignee: Abhishek Dasgupta
>  Labels: mesosphere
>
> Operators would probably find information about the set of unreachable agents 
> useful. Two main use cases I can see: (a) identifying which agents are 
> currently unreachable and when they were marked unreachable, (b) 
> understanding the size/content of the registry as a way to debug registry 
> perf issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4945) Garbage collect unused docker layers in the store.

2016-10-19 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586813#comment-15586813
 ] 

Zhitao Li edited comment on MESOS-4945 at 10/19/16 9:03 AM:


Revised plan in rough steps:
* For each image, checkpoint a) container IDs, b) the time the last container using 
it was destroyed, and c) the size of each layer;
** TODO: how to deal with migration? The idea is to pass more info into the 
recover() chain of containerizer -> provisioner -> store;
* Change the store interface:
** "get(Image)" to "get(Image, ContainerID)": the added containerID field 
can be used to implement ref counting and further bookkeeping (i.e. get local 
image information);
** add a "remove(Image, ContainerID)" virtual function: this is optional in 
that a store which does not do ref counting can skip implementing it.
* Make sure provisioner::destroy() calls store::remove(Image, ContainerID);
* Add a command line flag for the docker store capacity limit (in bytes);
* In (docker) store::get(Image, ContainerID), after a pull is done, calculate the 
total layer sizes; if above the store capacity, remove unused images (determined by 
empty container IDs), sorted by the last time they were used. Any layer not shared by 
the remaining images is also removed, until the total size drops below the capacity.

Open questions: 

1) In this design, we have explicit reference counting between 
{{Container}} and {{Image}} in the store. However, this information could be 
constructed on the fly from all containers in the {{Containerizer}} class. Do we 
consider this "double accounting" problematic or error-prone?
2) Is calling the new {{remove(Image, ContainerID)}} from 
{{Provisioner::destroy()}} sufficient to make sure all bookkeeping is 
done properly?


was (Author: zhitao):
Revised plan in rough steps:
* For each image, checkpoint a) container ids, b) time of last container using 
it being destroyed, and c) size of each layer;
** TODO: how do deal with migration? idea is passing in more info in 
recover() chain of containerizer -> provisioner -> store;
* Change store interface:
** "get(Image)" to "get(Image, ContainerID)",
***The containerID field added can be used to implement ref counting 
and further book keeping (i.e. get local images information);
**add "remove(Image, ContainerID)" virtual function;
  *** this is optional: store which does not do ref counting can skip 
implementing.
*  Make sure provisioner::destroy() call store::remove(Image, ContainerID);
* Add command line flag for docker store capacity limit;
* In (docker) store::get(Image, ContainerID), after a pull is done, calculate 
total layer sizes, if above store capacity, remove images with empty container 
ids (aka not used), sorted by last time not used. Any layer not used is also 
removed, until total size is dropped below capacity.

Open question: 

1) In this design, we have one explicit reference counting between 
{{Container}} and {{Image}} in store. However, this information could be 
constructed on-the-fly with all containers in {{Containerizer}} class. Do we 
consider this "double accounting" problematic, or error-prone?
2) Is calling new {{remove(Image, ContainerID)}} from 
{{Provisioner::destroy()}} sufficient to make sure all book keepings are 
properly done?

> Garbage collect unused docker layers in the store.
> --
>
> Key: MESOS-4945
> URL: https://issues.apache.org/jira/browse/MESOS-4945
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Jie Yu
>Assignee: Zhitao Li
>
> Right now, we don't have any garbage collection in place for docker layers. 
> It's not straightforward to implement because we don't know what container is 
> currently using the layer. We probably need a way to track the current usage 
> of layers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4945) Garbage collect unused docker layers in the store.

2016-10-19 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586813#comment-15586813
 ] 

Zhitao Li edited comment on MESOS-4945 at 10/19/16 8:58 AM:


Revised plan in rough steps:
* For each image, checkpoint a) container IDs, b) the time the last container using 
it was destroyed, and c) the size of each layer;
** TODO: how to deal with migration? The idea is to pass more info into the 
recover() chain of containerizer -> provisioner -> store;
* Change the store interface:
** "get(Image)" to "get(Image, ContainerID)":
*** the added containerID field can be used to implement ref counting 
and further bookkeeping (i.e. get local image information);
** add a "remove(Image, ContainerID)" virtual function;
*** this is optional: a store which does not do ref counting can skip 
implementing it.
* Make sure provisioner::destroy() calls store::remove(Image, ContainerID);
* Add a command line flag for the docker store capacity limit;
* In (docker) store::get(Image, ContainerID), after a pull is done, calculate the 
total layer sizes; if above the store capacity, remove images with empty container 
IDs (i.e. not in use), sorted by the last time they were used. Any layer no longer 
used is also removed, until the total size drops below the capacity.

Open questions: 

1) In this design, we have explicit reference counting between 
{{Container}} and {{Image}} in the store. However, this information could be 
constructed on the fly from all containers in the {{Containerizer}} class. Do we 
consider this "double accounting" problematic or error-prone?
2) Is calling the new {{remove(Image, ContainerID)}} from 
{{Provisioner::destroy()}} sufficient to make sure all bookkeeping is 
done properly?


was (Author: zhitao):
Current plan:

- Add a "cleanup" method to store interface, which takes a {{vector}} 
for "images in use";
- store can choose its own implementation of what it wants to cleanup. Deleted 
images will be returned in a {{Future}};
- it's the job of Containerizer/Provisioner to actively prepare the list of 
"images in use"
- initially this can simply be done by traversing all active containers, if 
provisioner already has all information in its memory;
- Initial implementation will add a new flag indicating upper limit of size for 
docker store directory, and docker::store will delete images until it drops 
below there;
- The invocation to store::cleanup can happen either in a background timer, 
upon provisioner::destroy, or before the pull? (I have no real preference, but 
calling it before pull seems safest if we use space based policy?);
- Initial implementation on store will traverse all images in the store;
- Further optimization including implementing a reference counting and size 
counting of all images in store, and checkpointing them. We might also need 
some kind of LRU implementation here.

> Garbage collect unused docker layers in the store.
> --
>
> Key: MESOS-4945
> URL: https://issues.apache.org/jira/browse/MESOS-4945
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Jie Yu
>Assignee: Zhitao Li
>
> Right now, we don't have any garbage collection in place for docker layers. 
> It's not straightforward to implement because we don't know what container is 
> currently using the layer. We probably need a way to track the current usage 
> of layers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-6294) TaskInfo should allow CommandInfo and ExecutorInfo

2016-10-19 Thread Abhishek Dasgupta (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Dasgupta reassigned MESOS-6294:


Assignee: Abhishek Dasgupta

> TaskInfo should allow CommandInfo and ExecutorInfo
> --
>
> Key: MESOS-6294
> URL: https://issues.apache.org/jira/browse/MESOS-6294
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Gabriel Hartmann
>Assignee: Abhishek Dasgupta
>
> It is awkward and difficult to support development of a generic custom 
> executor when TaskInfos may not contain both a CommandInfo and an 
> ExecutorInfo.
> A generic CustomExecutor would like to be able to use the CommandInfo of a 
> TaskInfo in the launchTask call to determine what action to take.  The mutual 
> exclusion of those two elements of a TaskInfo is not a good way to indicate a 
> desire to use the CmdExecutor or a CustomExecutor.
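
For illustration, a sketch of what a framework would like to express if 
TaskInfo accepted both fields. The helper, the task values, and the 
{{customExecutor}} argument are hypothetical, and today's master validation 
would reject a task that sets both:

{code}
// Illustration only: what a framework would like to express if TaskInfo
// accepted both fields. Current validation requires exactly one of
// CommandInfo/ExecutorInfo, so this would be rejected today.
#include <mesos/mesos.hpp>

mesos::TaskInfo makeTask(
    const mesos::Offer& offer,
    const mesos::ExecutorInfo& customExecutor)  // assumed to be built elsewhere
{
  mesos::TaskInfo task;
  task.set_name("example-task");
  task.mutable_task_id()->set_value("task-1");
  task.mutable_slave_id()->CopyFrom(offer.slave_id());
  task.mutable_resources()->CopyFrom(offer.resources());

  // The action the custom executor should take for this task...
  task.mutable_command()->set_value("run-my-job --input /tmp/data");

  // ...and the executor that should interpret that command.
  task.mutable_executor()->CopyFrom(customExecutor);

  return task;
}
{code}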



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5368) Consider introducing persistent agent ID

2016-10-19 Thread Abhishek Dasgupta (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Dasgupta updated MESOS-5368:
-
Assignee: (was: Abhishek Dasgupta)

> Consider introducing persistent agent ID
> 
>
> Key: MESOS-5368
> URL: https://issues.apache.org/jira/browse/MESOS-5368
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Neil Conway
>  Labels: mesosphere
>
> Currently, agent IDs identify a single "session" by an agent: that is, an 
> agent receives an agent ID when it registers with the master; it reuses that 
> agent ID if it disconnects and successfully reregisters; if the agent shuts 
> down and restarts, it registers anew and receives a new agent ID.
> It would be convenient to have a "persistent agent ID" that remains the same 
> for the duration of a given agent {{work_dir}}. This would mean that a given 
> persistent volume would not migrate between different persistent agent IDs 
> over time, for example (see MESOS-4894). If we supported permanently removing 
> an agent from the cluster (i.e., the {{work_dir}} and any volumes used by the 
> agent will never be reused), we could use the persistent agent ID to report 
> which agent has been removed.
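
A minimal sketch of the idea, assuming the ID is checkpointed in a file under 
the agent {{work_dir}}; the path, helper name, and ID format are illustrative 
assumptions, not Mesos code:

{code}
// Hypothetical sketch: derive a persistent agent ID from a file checkpointed
// under the work_dir, so it survives agent restarts but changes when the
// work_dir is wiped.
#include <fstream>
#include <random>
#include <sstream>
#include <string>

std::string persistentAgentId(const std::string& workDir)
{
  const std::string path = workDir + "/meta/persistent_agent_id";

  std::ifstream in(path);
  std::string id;
  if (in && std::getline(in, id) && !id.empty()) {
    return id;  // Reuse the ID from a previous run of this work_dir.
  }

  // First run for this work_dir: generate and checkpoint a new ID.
  std::random_device rd;
  std::ostringstream out;
  out << std::hex << rd() << rd() << rd() << rd();
  id = out.str();

  std::ofstream file(path);
  file << id;
  return id;
}
{code}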



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)