[jira] [Created] (MESOS-7653) Support launching slave using unprivileged user.

2017-06-09 Thread Jie Yu (JIRA)
Jie Yu created MESOS-7653:
-

 Summary: Support launching slave using unprivileged user.
 Key: MESOS-7653
 URL: https://issues.apache.org/jira/browse/MESOS-7653
 Project: Mesos
  Issue Type: Improvement
Reporter: Jie Yu
Priority: Minor


This ticket captures the work needed to support launching the agent as an 
unprivileged user.

1) The agent binary needs to have file capabilities set. Given that the agent 
needs to manipulate cgroups (if using the linux launcher or the cgroups 
isolator) and clone namespaces (if using the linux launcher), the CAP_SYS_ADMIN 
capability must be in the agent process's effective set. Either the "Effective" 
bit should be set on the agent binary, so that the permitted capabilities 
gained on exec'ing the binary are put into the effective set of the agent 
process automatically, or the agent should raise the capability itself, which 
it can do as long as the capability is in its permitted set.
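
For illustration only, a minimal sketch (not Mesos code) of granting the agent 
binary file capabilities with libcap, equivalent to running 
{{setcap cap_sys_admin+eip /path/to/mesos-agent}}; the binary path below is 
hypothetical:

{code}
// Sketch only: attach CAP_SYS_ADMIN (effective/permitted/inheritable) to the
// agent binary as a file capability. Requires libcap (-lcap) and CAP_SETFCAP
// to run; the path is a placeholder.
#include <sys/capability.h>

#include <cstdio>

int main()
{
  const char* binary = "/usr/local/sbin/mesos-agent";  // hypothetical path

  cap_t caps = cap_from_text("cap_sys_admin=eip");
  if (caps == NULL) {
    perror("cap_from_text");
    return 1;
  }

  // With the "Effective" bit set, exec'ing the binary puts CAP_SYS_ADMIN into
  // the process's effective set automatically.
  if (cap_set_file(binary, caps) == -1) {
    perror("cap_set_file");
    cap_free(caps);
    return 1;
  }

  cap_free(caps);
  return 0;
}
{code}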

2) The launch of the user task is done by the `mesos-containerizer` binary. 
Either the agent must raise ambient capabilities (using prctl 
PR_CAP_AMBIENT_RAISE), or we need to make sure the `mesos-containerizer` binary 
has file capabilities set so that it is able to do things like `setuid` after 
the agent exec's the helper. If the ambient capabilities route is chosen, the 
agent process should have the required capabilities in its inheritable set (at 
least) as well as in its permitted set.
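
A rough sketch of the ambient-capabilities route (again illustrative, not the 
actual agent code), assuming CAP_SYS_ADMIN is already in both the permitted and 
inheritable sets and the kernel is >= 4.3; the helper path is hypothetical:

{code}
// Sketch only: raise CAP_SYS_ADMIN into the ambient set so that it survives
// the execve() of the mesos-containerizer helper without the helper needing
// file capabilities of its own. prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, ..)
// fails unless the capability is in both the permitted and inheritable sets.
#include <sys/prctl.h>
#include <linux/capability.h>
#include <unistd.h>

#include <cstdio>

int main()
{
  if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, CAP_SYS_ADMIN, 0, 0) == -1) {
    perror("prctl(PR_CAP_AMBIENT_RAISE)");
    return 1;
  }

  // Hypothetical helper path and arguments.
  const char* helper = "/usr/local/libexec/mesos/mesos-containerizer";
  char* const argv[] = {const_cast<char*>(helper), nullptr};

  execv(helper, argv);
  perror("execv");  // Only reached if exec failed.
  return 1;
}
{code}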

3) If the linux capabilities isolator is enabled, then in order for frameworks 
to gain any capabilities they like, the process launching the agent should have 
all capabilities in its inheritable set and its bounding set, so that those 
capabilities can be regained later.
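
For completeness, a small sketch of how the process launching the agent could 
verify that a capability is present in its bounding and inheritable sets before 
starting the agent (illustrative only, using prctl and libcap):

{code}
// Sketch only: check that CAP_SYS_ADMIN is in the calling process's bounding
// set (prctl) and inheritable set (libcap), so that descendants can regain it.
#include <sys/capability.h>
#include <sys/prctl.h>

#include <cstdio>

int main()
{
  // PR_CAPBSET_READ returns 1 if the capability is in the bounding set.
  int inBounding = prctl(PR_CAPBSET_READ, CAP_SYS_ADMIN, 0, 0, 0);

  // Inspect the inheritable flag via the process's capability state.
  cap_flag_value_t inInheritable = CAP_CLEAR;
  cap_t caps = cap_get_proc();
  if (caps != NULL) {
    cap_get_flag(caps, CAP_SYS_ADMIN, CAP_INHERITABLE, &inInheritable);
    cap_free(caps);
  }

  printf("CAP_SYS_ADMIN: bounding=%d inheritable=%d\n",
         inBounding == 1,
         inInheritable == CAP_SET);

  return 0;
}
{code}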




See also: http://man7.org/linux/man-pages/man7/capabilities.7.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7652) docker image not working with universal containerizer

2017-06-09 Thread michael beisiegel (JIRA)
michael beisiegel created MESOS-7652:


 Summary: docker image not working with universal containerizer
 Key: MESOS-7652
 URL: https://issues.apache.org/jira/browse/MESOS-7652
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Affects Versions: 1.2.1
Reporter: michael beisiegel
Priority: Minor


Hello,
I recently used the following Docker image:

quay.io/spinnaker/front50:master
https://quay.io/repository/spinnaker/front50

Here the link to the Dockerfile
https://github.com/spinnaker/front50/blob/master/Dockerfile

The image works fine with the Docker containerizer, but the universal 
containerizer shows the following in stderr:

"Failed to chdir into current working directory '/workdir': No such file or 
directory"

The problem comes from the fact that the Dockerfile creates a WORKDIR but then 
later removes the created directory as part of a RUN step. The Docker 
containerizer has no problem with this; if you do

docker run -ti --rm quay.io/spinnaker/front50:master bash

you get into the working dir, but the universal containerizer fails with the 
error above.

thanks for your help,
Michael



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7651) Consider a more explicit way to bind reservations / volumes to a framework.

2017-06-09 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045268#comment-16045268
 ] 

Benjamin Mahler commented on MESOS-7651:


[~xujyan] Updated the description to mention lifecycle.

> Consider a more explicit way to bind reservations / volumes to a framework.
> ---
>
> Key: MESOS-7651
> URL: https://issues.apache.org/jira/browse/MESOS-7651
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>
> Currently, when a framework creates a reservation or a persistent volume, and 
> it wants exclusive access to this volume or reservation, it must take a few 
> steps:
> * Ensure that no other frameworks are running within the reservation role (or 
> the other frameworks are co-operative).
> * With hierarchical roles, frameworks must also ensure that the role is a 
> leaf so that no descendant roles will have access to the reservation/volume. 
> This could be done by generating a role (e.g. eng/kafka/).
> It's not easy for the framework to ensure these things, since role ACLs are 
> controlled by the operator.
> We should consider a more direct way for a framework to ensure that their 
> reservation/volume cannot be shared. E.g. by binding it to their framework id 
> (perhaps re-using roles for this rather than introducing something new?)
> We should also consider binding the reservation / volumes, much like other 
> objects (tasks, executors), to the framework's lifecycle. So that if the 
> framework is removed, the reservations / volumes it left behind are cleaned 
> up.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7651) Consider a more explicit way to bind reservations / volumes to a framework.

2017-06-09 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7651:
---
Description: 
Currently, when a framework creates a reservation or a persistent volume, and 
it wants exclusive access to this volume or reservation, it must take a few 
steps:

* Ensure that no other frameworks are running within the reservation role (or 
the other frameworks are co-operative).
* With hierarchical roles, frameworks must also ensure that the role is a leaf 
so that no descendant roles will have access to the reservation/volume. This 
could be done by generating a role (e.g. eng/kafka/).

It's not easy for the framework to ensure these things, since role ACLs are 
controlled by the operator.

We should consider a more direct way for a framework to ensure that its 
reservation/volume cannot be shared, e.g. by binding it to the framework id 
(perhaps re-using roles for this rather than introducing something new?).

We should also consider binding the reservations / volumes, much like other 
objects (tasks, executors), to the framework's lifecycle, so that if the 
framework is removed, the reservations / volumes it leaves behind are cleaned up.

  was:
Currently, when a framework creates a reservation or a persistent volume, and 
it wants exclusive access to this volume or reservation, it must take a few 
steps:

* Ensure that no other frameworks are running within the reservation role (or 
the other frameworks are co-operative).
* With hierarchical roles, frameworks must also ensure that the role is a leaf 
so that no descendant roles will have access to the reservation/volume. This 
could be done by generating a role (e.g. eng/kafka/).

It's not easy for the framework to ensure these things, since role ACLs are 
controlled by the operator.

We should consider a more direct way for a framework to ensure that their 
reservation/volume cannot be shared. E.g. by binding it to their framework id 
(perhaps re-using roles for this rather than introducing something new?)


> Consider a more explicit way to bind reservations / volumes to a framework.
> ---
>
> Key: MESOS-7651
> URL: https://issues.apache.org/jira/browse/MESOS-7651
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>
> Currently, when a framework creates a reservation or a persistent volume, and 
> it wants exclusive access to this volume or reservation, it must take a few 
> steps:
> * Ensure that no other frameworks are running within the reservation role (or 
> the other frameworks are co-operative).
> * With hierarchical roles, frameworks must also ensure that the role is a 
> leaf so that no descendant roles will have access to the reservation/volume. 
> This could be done by generating a role (e.g. eng/kafka/).
> It's not easy for the framework to ensure these things, since role ACLs are 
> controlled by the operator.
> We should consider a more direct way for a framework to ensure that their 
> reservation/volume cannot be shared. E.g. by binding it to their framework id 
> (perhaps re-using roles for this rather than introducing something new?)
> We should also consider binding the reservation / volumes, much like other 
> objects (tasks, executors), to the framework's lifecycle. So that if the 
> framework is removed, the reservations / volumes it left behind are cleaned 
> up.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-6162) Add support for cgroups blkio subsystem

2017-06-09 Thread Jason Lai (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045210#comment-16045210
 ] 

Jason Lai commented on MESOS-6162:
--

I had a long-standing diff that I hadn't gotten around to submitting yet. It is 
now rebased onto master and squashed into one commit at: 
https://reviews.apache.org/r/59960/ [~gilbert] [~jieyu]

> Add support for cgroups blkio subsystem
> ---
>
> Key: MESOS-6162
> URL: https://issues.apache.org/jira/browse/MESOS-6162
> Project: Mesos
>  Issue Type: Task
>Reporter: haosdent
>Assignee: Jason Lai
>
> Note that the cgroups blkio subsystem may have performance issues; refer to 
> https://github.com/opencontainers/runc/issues/861



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MESOS-7524) Basic fetcher success metrics

2017-06-09 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16039578#comment-16039578
 ] 

James Peach edited comment on MESOS-7524 at 6/9/17 6:58 PM:


| [r/59952|https://reviews.apache.org/r/59952] | Split FetcherProcess into its 
own source files. |
| [r/59467|https://reviews.apache.org/r/59467] | Document new Fetcher metrics. |
| [r/59466|https://reviews.apache.org/r/59466] | Add metrics check to Fetcher 
tests. |
| [r/59464|https://reviews.apache.org/r/59464] | Add Fetcher task total and 
failed fetch metrics. |
| [r/59855|https://reviews.apache.org/r/59855] | Set the fetcher cache size at 
construction time. |
| [r/59854|https://reviews.apache.org/r/59854] | Make additional Fetcher and 
FetcherProcess methods const. |


was (Author: jamespeach):
| [r/59467|https://reviews.apache.org/r/59467] | Document new Fetcher metrics. |
| [r/59464|https://reviews.apache.org/r/59464] | Add Fetcher task total and 
failed fetch metrics. |
| [r/59466|https://reviews.apache.org/r/59466] | Add metrics check to Fetcher 
tests. |
| [r/59855|https://reviews.apache.org/r/59855] | Set the fetcher cache size at 
construction time. |
| [r/59854|https://reviews.apache.org/r/59854] | Make additional Fetcher and 
FetcherProcess methods const. |

> Basic fetcher success metrics
> -
>
> Key: MESOS-7524
> URL: https://issues.apache.org/jira/browse/MESOS-7524
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Reporter: James Peach
>Assignee: James Peach
>
> There are no metrics for the fetcher. As a minimum, we should have counters for:
> * successful fetcher invocations
> * failed fetcher invocations
> It would also be useful to know the fetch time, though that could be highly 
> variable depending on the cluster usage.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7651) Consider a more explicit way to bind reservations / volumes to a framework.

2017-06-09 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044848#comment-16044848
 ] 

Yan Xu commented on MESOS-7651:
---

+1. Related to this are the headaches around the lifecycle of reservations and 
volumes. I'm not sure what you meant by "perhaps re-using roles for this" above, 
but I think as part of this we should bind the lifecycle of reservations to the 
lifecycle of the framework, the same way tasks are bound to the lifecycle of the 
framework.

> Consider a more explicit way to bind reservations / volumes to a framework.
> ---
>
> Key: MESOS-7651
> URL: https://issues.apache.org/jira/browse/MESOS-7651
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>
> Currently, when a framework creates a reservation or a persistent volume, and 
> it wants exclusive access to this volume or reservation, it must take a few 
> steps:
> * Ensure that no other frameworks are running within the reservation role (or 
> the other frameworks are co-operative).
> * With hierarchical roles, frameworks must also ensure that the role is a 
> leaf so that no descendant roles will have access to the reservation/volume. 
> This could be done by generating a role (e.g. eng/kafka/).
> It's not easy for the framework to ensure these things, since role ACLs are 
> controlled by the operator.
> We should consider a more direct way for a framework to ensure that their 
> reservation/volume cannot be shared. E.g. by binding it to their framework id 
> (perhaps re-using roles for this rather than introducing something new?)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-3826) Add an optional unique identifier for resource reservations

2017-06-09 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044833#comment-16044833
 ] 

Benjamin Mahler commented on MESOS-3826:


Filed a related issue: https://issues.apache.org/jira/browse/MESOS-7651

> Add an optional unique identifier for resource reservations
> ---
>
> Key: MESOS-3826
> URL: https://issues.apache.org/jira/browse/MESOS-3826
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Sargun Dhillon
>  Labels: mesosphere, reservations
>
> Thanks to the resource reservation primitives, frameworks can reserve 
> resources. These reservations are per role, which means multiple frameworks 
> can share reservations. This can get very hairy, as multiple reservations can 
> occur on each agent. 
> It would be nice to be able to optionally, uniquely identify reservations by 
> ID, much like persistent volumes are today. This could be done by adding a 
> new protobuf field, such as Resource.ReservationInfo.id, that, if set at 
> reservation time, would come back when the reservation is advertised.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7651) Consider a more explicit way to bind reservations / volumes to a framework.

2017-06-09 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7651:
--

 Summary: Consider a more explicit way to bind reservations / 
volumes to a framework.
 Key: MESOS-7651
 URL: https://issues.apache.org/jira/browse/MESOS-7651
 Project: Mesos
  Issue Type: Improvement
Reporter: Benjamin Mahler


Currently, when a framework creates a reservation or a persistent volume, and 
it wants exclusive access to this volume or reservation, it must take a few 
steps:

* Ensure that no other frameworks are running within the reservation role (or 
the other frameworks are co-operative).
* With hierarchical roles, frameworks must also ensure that the role is a leaf 
so that no descendant roles will have access to the reservation/volume. This 
could be done by generating a role (e.g. eng/kafka/).

It's not easy for the framework to ensure these things, since role ACLs are 
controlled by the operator.

We should consider a more direct way for a framework to ensure that its 
reservation/volume cannot be shared, e.g. by binding it to the framework id 
(perhaps re-using roles for this rather than introducing something new?).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7630) Add simple filtering to unversioned operator API

2017-06-09 Thread Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quinn updated MESOS-7630:
-
 Epic Name: Operator API filtering
Issue Type: Improvement  (was: Epic)

> Add simple filtering to unversioned operator API
> 
>
> Key: MESOS-7630
> URL: https://issues.apache.org/jira/browse/MESOS-7630
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, master
>Reporter: Quinn
>Assignee: Quinn
>  Labels: agent, api, http, master, mesosphere
>
> Add filtering for the following endpoints:
> - {{/frameworks}}
> - {{/slaves}}
> - {{/tasks}}
> - {{/containers}}
> We should investigate whether we should use RESTful style or query string to 
> filter the specific resource. We should also figure out whether it's 
> necessary to filter a list of resources.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7630) Add simple filtering to unversioned operator API

2017-06-09 Thread Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quinn updated MESOS-7630:
-
Issue Type: Epic  (was: Improvement)

> Add simple filtering to unversioned operator API
> 
>
> Key: MESOS-7630
> URL: https://issues.apache.org/jira/browse/MESOS-7630
> Project: Mesos
>  Issue Type: Epic
>  Components: agent, master
>Reporter: Quinn
>Assignee: Quinn
>  Labels: agent, api, http, master, mesosphere
>
> Add filtering for the following endpoints:
> - {{/frameworks}}
> - {{/slaves}}
> - {{/tasks}}
> - {{/containers}}
> We should investigate whether we should use RESTful style or query string to 
> filter the specific resource. We should also figure out whether it's 
> necessary to filter a list of resources.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-7630) Add simple filtering to unversioned operator API

2017-06-09 Thread Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quinn reassigned MESOS-7630:


Assignee: Quinn

> Add simple filtering to unversioned operator API
> 
>
> Key: MESOS-7630
> URL: https://issues.apache.org/jira/browse/MESOS-7630
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, master
>Reporter: Quinn
>Assignee: Quinn
>  Labels: agent, api, http, master, mesosphere
>
> Add filtering for the following endpoints:
> - {{/frameworks}}
> - {{/slaves}}
> - {{/tasks}}
> - {{/containers}}
> We should investigate whether we should use RESTful style or query string to 
> filter the specific resource. We should also figure out whether it's 
> necessary to filter a list of resources.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7033) Update documentation for hierarchical roles.

2017-06-09 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7033:
---
Description: 
A few things to be sure to cover:

* How to ensure that a volume is not shared with other frameworks. Previously, 
this meant running only 1 framework in the role and using ACLs to prevent other 
frameworks from running in the role. With hierarchical roles, this now also 
includes using ACLs to prevent any child roles from being created beneath the 
role (as these children would be able to obtain the reserved resources). We've 
been advising frameworks to generate a role (e.g. eng/kafka/) to 
ensure that they own their reservations (but the dynamic nature of this makes 
setting up ACLs difficult). Longer term, we may need a more explicit way to 
bind reservations or volumes to frameworks.

> Update documentation for hierarchical roles.
> 
>
> Key: MESOS-7033
> URL: https://issues.apache.org/jira/browse/MESOS-7033
> Project: Mesos
>  Issue Type: Task
>  Components: documentation
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: mesosphere
>
> A few things to be sure to cover:
> * How to ensure that a volume is not shared with other frameworks. 
> Previously, this meant running only 1 framework in the role and using ACLs to 
> prevent other frameworks from running in the role. With hierarchical roles, 
> this now also includes using ACLs to prevent any child roles from being 
> created beneath the role (as these children would be able to obtain the 
> reserved resources). We've been advising frameworks to generate a role (e.g. 
> eng/kafka/) to ensure that they own their reservations (but the 
> dynamic nature of this makes setting up ACLs difficult). Longer term, we may 
> need a more explicit way to bind reservations or volumes to frameworks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7650) Timer::cancel doesn't completely prevent spurious agent reregister loops

2017-06-09 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-7650:
--
Affects Version/s: 1.3.0
   1.2.0

> Timer::cancel doesn't completely prevent spurious agent reregister loops
> 
>
> Key: MESOS-7650
> URL: https://issues.apache.org/jira/browse/MESOS-7650
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Yan Xu
>
> See MESOS-6803 for the previous attempt to address this issue, but Timer 
> cancellation does not prevent the already dispatched {{doReliableRegistration}} 
> event from being executed, thus creating spurious agent reregister loops.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7650) Timer::cancel doesn't completely prevent spurious agent reregister loops

2017-06-09 Thread Yan Xu (JIRA)
Yan Xu created MESOS-7650:
-

 Summary: Timer::cancel doesn't completely prevent spurious agent 
reregister loops
 Key: MESOS-7650
 URL: https://issues.apache.org/jira/browse/MESOS-7650
 Project: Mesos
  Issue Type: Bug
  Components: agent
Reporter: Yan Xu


See MESOS-6803 for the previous attempt to address this issue, but Timer 
cancellation does not prevent the already dispatched {{doReliableRegistration}} 
event from being executed, thus creating spurious agent reregister loops.
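
Not libprocess code, just a self-contained illustration of the race: 
cancellation is checked when the timer fires, not when the already queued event 
finally runs, so an event dispatched before the cancel still executes.

{code}
// Illustration only (plain C++): once the "timer" has fired and pushed the
// registration attempt onto the event queue, setting `cancelled` afterwards
// cannot undo that dispatch; only future firings are prevented.
#include <atomic>
#include <functional>
#include <iostream>
#include <queue>
#include <thread>

int main()
{
  std::atomic<bool> cancelled{false};
  std::queue<std::function<void()>> dispatched;  // events already dispatched

  // The timer fires; it is not yet cancelled, so the event is enqueued.
  std::thread timer([&]() {
    if (!cancelled.load()) {
      dispatched.push([]() {
        std::cout << "doReliableRegistration still runs" << std::endl;
      });
    }
  });
  timer.join();

  // The cancellation arrives only after the event was dispatched: too late.
  cancelled.store(true);

  // The worker drains the queue; the queued event executes regardless.
  while (!dispatched.empty()) {
    dispatched.front()();
    dispatched.pop();
  }

  return 0;
}
{code}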



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7649) GPF in mesos-executor

2017-06-09 Thread Charles Allen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Allen updated MESOS-7649:
-
Description: 
We are running mesos 1.2.0 on a CoreOS system and having the following gpf show 
up:

{code}
[57807.639274] traps: mesos-executor[63400] general protection ip:7f4bdfd1b05a 
sp:7ffdafce3500 error:0
[57807.648470]  in libstdc++.so.6.0.20[7f4bdfc2+155000]
{code}


Stack trace:

{code}
#0  0x7f59c20cd054 in std::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::basic_string(std::string const&) () from 
/media/root/lib64/libstdc++.so.6
#1  0x7f59c401150d in process::UPID::UPID(process::ProcessBase const&) () 
from /media/root/lib64/libmesos-1.2.0.so
#2  0x7f59c403e623 in process::SocketManager::close(int) () from 
/media/root/lib64/libmesos-1.2.0.so
#3  0x7f59c403f904 in process::SocketManager::finalize() () from 
/media/root/lib64/libmesos-1.2.0.so
#4  0x7f59c403fc59 in process::finalize(bool) () from 
/media/root/lib64/libmesos-1.2.0.so
#5  0x55c02473c1bd in ?? ()
#6  0x7f59c172b93c in __libc_start_main () from /media/root/lib64/libc.so.6
#7  0x55c02473c789 in ?? ()
{code}

  was:
We are running mesos 1.2.0 on a CoreOS system and having the following gpf show 
up:

{code}
[57807.639274] traps: mesos-executor[63400] general protection ip:7f4bdfd1b05a 
sp:7ffdafce3500 error:0
[57807.648470]  in libstdc++.so.6.0.20[7f4bdfc2+155000]
{code}

I have the core dumps and am working on getting more info.


> GPF in mesos-executor
> -
>
> Key: MESOS-7649
> URL: https://issues.apache.org/jira/browse/MESOS-7649
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Affects Versions: 1.2.0
>Reporter: Charles Allen
>
> We are running mesos 1.2.0 on a CoreOS system and having the following gpf 
> show up:
> {code}
> [57807.639274] traps: mesos-executor[63400] general protection 
> ip:7f4bdfd1b05a sp:7ffdafce3500 error:0
> [57807.648470]  in libstdc++.so.6.0.20[7f4bdfc2+155000]
> {code}
> Stack trace:
> {code}
> #0  0x7f59c20cd054 in std::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >::basic_string(std::string const&) () from 
> /media/root/lib64/libstdc++.so.6
> #1  0x7f59c401150d in process::UPID::UPID(process::ProcessBase const&) () 
> from /media/root/lib64/libmesos-1.2.0.so
> #2  0x7f59c403e623 in process::SocketManager::close(int) () from 
> /media/root/lib64/libmesos-1.2.0.so
> #3  0x7f59c403f904 in process::SocketManager::finalize() () from 
> /media/root/lib64/libmesos-1.2.0.so
> #4  0x7f59c403fc59 in process::finalize(bool) () from 
> /media/root/lib64/libmesos-1.2.0.so
> #5  0x55c02473c1bd in ?? ()
> #6  0x7f59c172b93c in __libc_start_main () from 
> /media/root/lib64/libc.so.6
> #7  0x55c02473c789 in ?? ()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7649) GPF in mesos-executor

2017-06-09 Thread Charles Allen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Allen updated MESOS-7649:
-
Description: 
We are running mesos 1.2.0 on a CoreOS system and having the following gpf show 
up on occasion:

{code}
[57807.639274] traps: mesos-executor[63400] general protection ip:7f4bdfd1b05a 
sp:7ffdafce3500 error:0
[57807.648470]  in libstdc++.so.6.0.20[7f4bdfc2+155000]
{code}


Stack trace:

{code}
#0  0x7f59c20cd054 in std::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::basic_string(std::string const&) () from 
/media/root/lib64/libstdc++.so.6
#1  0x7f59c401150d in process::UPID::UPID(process::ProcessBase const&) () 
from /media/root/lib64/libmesos-1.2.0.so
#2  0x7f59c403e623 in process::SocketManager::close(int) () from 
/media/root/lib64/libmesos-1.2.0.so
#3  0x7f59c403f904 in process::SocketManager::finalize() () from 
/media/root/lib64/libmesos-1.2.0.so
#4  0x7f59c403fc59 in process::finalize(bool) () from 
/media/root/lib64/libmesos-1.2.0.so
#5  0x55c02473c1bd in ?? ()
#6  0x7f59c172b93c in __libc_start_main () from /media/root/lib64/libc.so.6
#7  0x55c02473c789 in ?? ()
{code}

  was:
We are running mesos 1.2.0 on a CoreOS system and having the following gpf show 
up:

{code}
[57807.639274] traps: mesos-executor[63400] general protection ip:7f4bdfd1b05a 
sp:7ffdafce3500 error:0
[57807.648470]  in libstdc++.so.6.0.20[7f4bdfc2+155000]
{code}


Stack trace:

{code}
#0  0x7f59c20cd054 in std::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::basic_string(std::string const&) () from 
/media/root/lib64/libstdc++.so.6
#1  0x7f59c401150d in process::UPID::UPID(process::ProcessBase const&) () 
from /media/root/lib64/libmesos-1.2.0.so
#2  0x7f59c403e623 in process::SocketManager::close(int) () from 
/media/root/lib64/libmesos-1.2.0.so
#3  0x7f59c403f904 in process::SocketManager::finalize() () from 
/media/root/lib64/libmesos-1.2.0.so
#4  0x7f59c403fc59 in process::finalize(bool) () from 
/media/root/lib64/libmesos-1.2.0.so
#5  0x55c02473c1bd in ?? ()
#6  0x7f59c172b93c in __libc_start_main () from /media/root/lib64/libc.so.6
#7  0x55c02473c789 in ?? ()
{code}


> GPF in mesos-executor
> -
>
> Key: MESOS-7649
> URL: https://issues.apache.org/jira/browse/MESOS-7649
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Affects Versions: 1.2.0
>Reporter: Charles Allen
>
> We are running mesos 1.2.0 on a CoreOS system and having the following gpf 
> show up on occasion:
> {code}
> [57807.639274] traps: mesos-executor[63400] general protection 
> ip:7f4bdfd1b05a sp:7ffdafce3500 error:0
> [57807.648470]  in libstdc++.so.6.0.20[7f4bdfc2+155000]
> {code}
> Stack trace:
> {code}
> #0  0x7f59c20cd054 in std::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >::basic_string(std::string const&) () from 
> /media/root/lib64/libstdc++.so.6
> #1  0x7f59c401150d in process::UPID::UPID(process::ProcessBase const&) () 
> from /media/root/lib64/libmesos-1.2.0.so
> #2  0x7f59c403e623 in process::SocketManager::close(int) () from 
> /media/root/lib64/libmesos-1.2.0.so
> #3  0x7f59c403f904 in process::SocketManager::finalize() () from 
> /media/root/lib64/libmesos-1.2.0.so
> #4  0x7f59c403fc59 in process::finalize(bool) () from 
> /media/root/lib64/libmesos-1.2.0.so
> #5  0x55c02473c1bd in ?? ()
> #6  0x7f59c172b93c in __libc_start_main () from 
> /media/root/lib64/libc.so.6
> #7  0x55c02473c789 in ?? ()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7342) Port Docker tests

2017-06-09 Thread Andrew Schwartzmeyer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044673#comment-16044673
 ] 

Andrew Schwartzmeyer commented on MESOS-7342:
-

This is odd. I have two separate builds of Mesos on Windows right now, and in 
one of them, these tests try to run:

{{.\src\mesos-tests.exe --gtest_filter="ROOT_DOCKER*"}}

{noformat}
[==========] Running 8 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 8 tests from ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest
[ RUN      ] ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskRunning/0

C:\Users\andschwa\src\mesos\3rdparty\libprocess\include\process/gmock.hpp(209): ERROR: this mock object (used in test ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskRunning/0) should be deleted but never is. Its address is @02024AEB5888.
C:\Users\andschwa\src\mesos\src\tests\default_executor_tests.cpp(131): ERROR: this mock object (used in test ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskRunning/0) should be deleted but never is. Its address is @02024CBAF220.
C:\Users\andschwa\src\mesos\src\tests\mock_registrar.cpp(54): ERROR: this mock object (used in test ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskRunning/0) should be deleted but never is. Its address is @02024D6B4E70.
ERROR: 3 leaked mock objects found at program exit.
{noformat}


And in the other:

{noformat}
[==========] Running 0 tests from 0 test cases.
[==========] 0 tests from 0 test cases ran. (16 ms total)
[  PASSED  ] 0 tests.
{noformat}

I'm trying to identify the difference between the two builds that is causing 
this.

> Port Docker tests
> -
>
> Key: MESOS-7342
> URL: https://issues.apache.org/jira/browse/MESOS-7342
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
> Environment: Windows 10
>Reporter: Andrew Schwartzmeyer
>Assignee: John Kordich
>  Labels: microsoft, windows
>
> While one of Daniel Pravat's last acts was introducing the Docker 
> containerizer for Windows, we don't have tests. We need to port 
> `docker_tests.cpp` and `docker_containerizer_tests.cpp` to Windows.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7649) GPF in mesos-executor

2017-06-09 Thread Charles Allen (JIRA)
Charles Allen created MESOS-7649:


 Summary: GPF in mesos-executor
 Key: MESOS-7649
 URL: https://issues.apache.org/jira/browse/MESOS-7649
 Project: Mesos
  Issue Type: Bug
  Components: executor
Affects Versions: 1.2.0
Reporter: Charles Allen


We are running mesos 1.2.0 on a CoreOS system and having the following gpf show 
up:

{code}
[57807.639274] traps: mesos-executor[63400] general protection ip:7f4bdfd1b05a 
sp:7ffdafce3500 error:0
[57807.648470]  in libstdc++.so.6.0.20[7f4bdfc2+155000]
{code}

I have the core dumps and am working on getting more info.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7648) Mesos master should not return `/state` before finishing recovering agents from registry

2017-06-09 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-7648:


 Summary: Mesos master should not return `/state` before finishing 
recovering agents from registry
 Key: MESOS-7648
 URL: https://issues.apache.org/jira/browse/MESOS-7648
 Project: Mesos
  Issue Type: Bug
Reporter: Zhitao Li


We are working on relying on {{recovered_agents}} in MESOS-6177. However, we 
discovered that the master can start responding to the {{/state.json}} endpoint 
before it finishes processing the result of registry recovery.

The sequence seems to be: registrar recovered -> /state query comes in -> 
agents recovered from the registry.

See the following logs:

{noformat}
I0608 22:29:57.147212  6407 master.cpp:2124] Elected as the leading master!
I0608 22:29:57.147274  6407 master.cpp:1646] Recovering from registrar
I0608 22:29:57.148114  6412 log.cpp:553] Attempting to start the writer
I0608 22:29:57.149339  6411 replica.cpp:495] Replica received implicit promise 
request from __req_res__(2)@10.162.9.54:5050 with proposal 105
I0608 22:29:57.149860  6411 replica.cpp:344] Persisted promised to 105
I0608 22:29:57.151495  6410 coordinator.cpp:238] Coordinator attempting to fill 
missing positions
I0608 22:29:57.151595  6412 log.cpp:569] Writer started with ending position 
36816
I0608 22:29:58.111565  6423 registrar.cpp:362] Successfully fetched the 
registry (1200222B) in 934048us
I0608 22:29:58.214422  6423 registrar.cpp:461] Applied 1 operations in 
25.893664ms; attempting to update the registry
I0608 22:29:58.300578  6421 coordinator.cpp:348] Coordinator attempting to 
write APPEND action at position 36817
I0608 22:29:58.307567  6410 replica.cpp:539] Replica received write request for 
position 36817 from __req_res__(7)@10.162.9.54:5050
I0608 22:29:58.344857  6421 replica.cpp:693] Replica received learned notice 
for position 36817 from @0.0.0.0:0
I0608 22:29:58.378731  6408 coordinator.cpp:348] Coordinator attempting to 
write TRUNCATE action at position 36818
I0608 22:29:58.382043  6416 replica.cpp:539] Replica received write request for 
position 36818 from __req_res__(12)@10.162.9.54:5050
I0608 22:29:58.384946  6410 replica.cpp:693] Replica received learned notice 
for position 36818 from @0.0.0.0:0
I0608 22:29:59.507297  6423 registrar.cpp:506] Successfully updated the 
registry in 1.282937088secs
I0608 22:29:59.580960  6423 registrar.cpp:392] Successfully recovered registrar
I0608 22:29:59.940066  6415 http.cpp:420] HTTP GET for /master/state from 
10.67.139.161:57197 with User-Agent='mesos-uns-bridge'
I0608 22:30:00.342932  6425 master.cpp:1762] Recovered 3549 agents from the 
registry (1200220B); allowing 15mins for agents to re-register
{noformat}

We found that the request corresponding to the second-to-last line above 
returned 0 registered or recovered agents, which incorrectly led its client to 
believe the cluster was empty.

[~anandmazumdar] [~vinodkone]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-6916) Improve health checks validation.

2017-06-09 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6916:
---
Summary: Improve health checks validation.  (was: Improve health checks 
validation)

> Improve health checks validation.
> -
>
> Key: MESOS-6916
> URL: https://issues.apache.org/jira/browse/MESOS-6916
> Project: Mesos
>  Issue Type: Bug
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>  Labels: health-check, mesosphere
>
> The "general"  fields should also be validated (i.e., `timeout_seconds`), 
> similar to what's done in https://reviews.apache.org/r/55458/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-5886) FUTURE_DISPATCH may react on irrelevant dispatch.

2017-06-09 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-5886:


Assignee: Andrei Budnik

> FUTURE_DISPATCH may react on irrelevant dispatch.
> -
>
> Key: MESOS-5886
> URL: https://issues.apache.org/jira/browse/MESOS-5886
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: mesosphere, tech-debt, tech-debt-test
>
> [{{FUTURE_DISPATCH}}|https://github.com/apache/mesos/blob/e8ebbe5fe4189ef7ab046da2276a6abee41deeb2/3rdparty/libprocess/include/process/gmock.hpp#L50]
>  uses 
> [{{DispatchMatcher}}|https://github.com/apache/mesos/blob/e8ebbe5fe4189ef7ab046da2276a6abee41deeb2/3rdparty/libprocess/include/process/gmock.hpp#L350]
>  to figure out whether a processed {{DispatchEvent}} is the same one the user 
> is waiting for. However, comparing the {{std::type_info}} of function pointers 
> is not enough: different class methods with the same signature will be 
> matched. Here is the test that proves this:
> {noformat}
> class DispatchProcess : public Process<DispatchProcess>
> {
> public:
>   MOCK_METHOD0(func0, void());
>   MOCK_METHOD1(func1, bool(bool));
>   MOCK_METHOD1(func1_same_but_different, bool(bool));
>   MOCK_METHOD1(func2, Future(bool));
>   MOCK_METHOD1(func3, int(int));
>   MOCK_METHOD2(func4, Future(bool, int));
> };
> {noformat}
> {noformat}
> TEST(ProcessTest, DispatchMatch)
> {
>   DispatchProcess process;
>   PID<DispatchProcess> pid = spawn(&process);
>
>   Future<Nothing> future = FUTURE_DISPATCH(
>       pid,
>       &DispatchProcess::func1_same_but_different);
>
>   EXPECT_CALL(process, func1(_))
>     .WillOnce(ReturnArg<0>());
>
>   dispatch(pid, &DispatchProcess::func1, true);
>
>   AWAIT_READY(future);
>
>   terminate(pid);
>   wait(pid);
> }
> {noformat}
> The test passes:
> {noformat}
> [ RUN      ] ProcessTest.DispatchMatch
> [       OK ] ProcessTest.DispatchMatch (1 ms)
> {noformat}
> This change was introduced in https://reviews.apache.org/r/28052/.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-6162) Add support for cgroups blkio subsystem

2017-06-09 Thread Jason Lai (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044040#comment-16044040
 ] 

Jason Lai commented on MESOS-6162:
--

[~haoyixin] Hi! We didn't get to prioritize this while the diff was pending 
review, but we'll resurrect this task given the incoming demand for it. Will 
keep you updated as we progress.

> Add support for cgroups blkio subsystem
> ---
>
> Key: MESOS-6162
> URL: https://issues.apache.org/jira/browse/MESOS-6162
> Project: Mesos
>  Issue Type: Task
>Reporter: haosdent
>Assignee: Jason Lai
>
> Note that the cgroups blkio subsystem may have performance issues; refer to 
> https://github.com/opencontainers/runc/issues/861



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-6162) Add support for cgroups blkio subsystem

2017-06-09 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044032#comment-16044032
 ] 

Gilbert Song commented on MESOS-6162:
-

[~haoyixin], sorry for the delay. I chatted with Jason. He already has a local 
implementation. Considering that a couple of companies are interested in this 
feature, we will try to ship it by the end of next week. I will shepherd.

> Add support for cgroups blkio subsystem
> ---
>
> Key: MESOS-6162
> URL: https://issues.apache.org/jira/browse/MESOS-6162
> Project: Mesos
>  Issue Type: Task
>Reporter: haosdent
>Assignee: Jason Lai
>
> Note that the cgroups blkio subsystem may have performance issues; refer to 
> https://github.com/opencontainers/runc/issues/861



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-6162) Add support for cgroups blkio subsystem

2017-06-09 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-6162:

Shepherd: Gilbert Song

> Add support for cgroups blkio subsystem
> ---
>
> Key: MESOS-6162
> URL: https://issues.apache.org/jira/browse/MESOS-6162
> Project: Mesos
>  Issue Type: Task
>Reporter: haosdent
>Assignee: Jason Lai
>
> Note that the cgroups blkio subsystem may have performance issues; refer to 
> https://github.com/opencontainers/runc/issues/861



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)