[jira] [Updated] (MESOS-7374) Running DOCKER images in Mesos Container Runtime without `linux/filesystem` isolation enabled renders host unusable

2017-04-25 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-7374:
--
Target Version/s: 1.3.0  (was: 1.2.1, 1.3.0)

> Running DOCKER images in Mesos Container Runtime without `linux/filesystem` 
> isolation enabled renders host unusable
> ---
>
> Key: MESOS-7374
> URL: https://issues.apache.org/jira/browse/MESOS-7374
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Affects Versions: 1.2.0
>Reporter: Tim Harper
>Priority: Critical
>  Labels: containerizer, mesosphere
>
> If I run the pod below (using Marathon 1.4.2) against a Mesos agent that has 
> the flags listed below, then the overlay filesystem replaces the system root 
> mount, effectively rendering the host unusable until reboot.
> flags:
> - {{--containerizers mesos,docker}}
> - {{--image_providers APPC,DOCKER}}
> - {{--isolation cgroups/cpu,cgroups/mem,docker/runtime}}
> pod definition for Marathon:
> {code}
> {
>   "id": "/simplepod",
>   "scaling": { "kind": "fixed", "instances": 1 },
>   "containers": [
> {
>   "name": "sleep1",
>   "exec": { "command": { "shell": "sleep 1000" } },
>   "resources": { "cpus": 0.1, "mem": 32 },
>   "image": {
> "id": "alpine",
> "kind": "DOCKER"
>   }
> }
>   ],
>   "networks": [ {"mode": "host"} ]
> }
> {code}
> Mesos should probably check for this and avoid replacing the system root 
> mount point at startup or launch time.
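> A hedged illustration of the presumably safe configuration (untested here; 
> note the isolator's actual name is {{filesystem/linux}}): that isolator is 
> required for container image support with the Mesos containerizer, and it 
> mounts the provisioned rootfs inside the container's own mount namespace 
> instead of on top of the host's root mount:
> {noformat}
> mesos-agent --containerizers=mesos,docker \
>   --image_providers=APPC,DOCKER \
>   --isolation=filesystem/linux,cgroups/cpu,cgroups/mem,docker/runtime
> {noformat}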



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7223) Linux filesystem isolator cannot mount host volume /dev/log.

2017-04-25 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-7223:
--
Target Version/s:   (was: 1.2.1)

> Linux filesystem isolator cannot mount host volume /dev/log.
> 
>
> Key: MESOS-7223
> URL: https://issues.apache.org/jira/browse/MESOS-7223
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: Haralds Ulmanis
>  Labels: volumes
>
> I'm trying to mount /dev/log.
> ls -l /dev/log
> lrwxrwxrwx 1 root root 28 Mar  9 01:49 /dev/log -> 
> /run/systemd/journal/dev-log
> # ls -l /run/systemd/journal/dev-log
> srw-rw-rw- 1 root root 0 Mar  9 01:49 /run/systemd/journal/dev-log
> I have tried mounting both /dev/log and /run/systemd/journal/dev-log; both 
> produce the same errors:
> from stdout:
> Executing pre-exec command 
> '{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/usr\/lib\/mesos\/mesos-containerizer"}'
> Executing pre-exec command 
> '{"arguments":["mount","-n","--rbind","\/data\/mesos-agent\/slaves\/9b7ad711-9381-4338-b3c0-dac86253701e-S93\/frameworks\/a872f621-d10f-4021-a886-c5d564df104e-\/executors\/services_dev-2_lb-6.b8202973-04b0-11e7-be02-0a2b9a5c33cf\/runs\/cfb170f0-6c69-4475-9dbe-bb9967e19b42","\/data\/mesos-agent\/provisioner\/containers\/cfb170f0-6c69-4475-9dbe-bb9967e19b42\/backends\/overlay\/rootfses\/890a25e6-cb15-42e3-be9c-0aa3baf889f8\/data\/mesos-agent\/sandbox"],"shell":false,"value":"mount"}'
> Executing pre-exec command 
> '{"arguments":["mount","-n","--rbind","\/run\/systemd\/journal\/dev-log","\/data\/mesos-agent\/provisioner\/containers\/cfb170f0-6c69-4475-9dbe-bb9967e19b42\/backends\/overlay\/rootfses\/890a25e6-cb15-42e3-be9c-0aa3baf889f8\/dev\/log"],"shell":false,"value":"mount"}'
> from stderr:
> mount: mount(2) failed: 
> /data/mesos-agent/provisioner/containers/cfb170f0-6c69-4475-9dbe-bb9967e19b42/backends/overlay/rootfses/890a25e6-cb15-42e3-be9c-0aa3baf889f8/dev/log:
>  Not a directory
> Failed to execute pre-exec command 
> '{"arguments":["mount","-n","--rbind","\/run\/systemd\/journal\/dev-log","\/data\/mesos-agent\/provisioner\/containers\/cfb170f0-6c69-4475-9dbe-bb9967e19b42\/backends\/overlay\/rootfses\/890a25e6-cb15-42e3-be9c-0aa3baf889f8\/dev\/log"],"shell":false,"value":"mount"}'
> I start this particular job from Marathon with the following definition (if 
> I change MESOS to DOCKER, it works): 
> "container": {
> "type": "MESOS",
> "volumes": [
>   {
> "hostPath": "/run/systemd/journal/dev-log",
> "containerPath": "/dev/log",
> "mode": "RW"
>   }
> ],
> "docker": {
>   "image": "",
>   "credential": null,
>   "forcePullImage": true
> }
>   },
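> For context, a minimal sketch of the underlying mount semantics (outside 
> Mesos; {{$ROOTFS}} is a stand-in for the provisioned container root 
> filesystem): bind-mounting a non-directory such as this socket onto a 
> directory fails with ENOTDIR ("Not a directory"), so the mount point inside 
> the rootfs must be a regular file:
> {noformat}
> touch "$ROOTFS/dev/log"   # create a file, not a directory, as the mount point
> mount --rbind /run/systemd/journal/dev-log "$ROOTFS/dev/log"
> {noformat}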



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-5172) Registry puller cannot fetch blobs correctly from http Redirect 3xx urls.

2017-04-25 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-5172:
--
Fix Version/s: 1.2.1
   1.1.2

> Registry puller cannot fetch blobs correctly from http Redirect 3xx urls.
> -
>
> Key: MESOS-5172
> URL: https://issues.apache.org/jira/browse/MESOS-5172
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>Priority: Blocker
>  Labels: containerizer, mesosphere
> Fix For: 1.1.2, 1.2.1, 1.3.0
>
>
> When the registry puller is pulling a private repository from some private 
> registry (e.g., quay.io), errors may occur when fetching blobs, even though 
> fetching the repo's manifest completes correctly. The error message is 
> `Unexpected HTTP response '400 Bad Request' when trying to download the 
> blob`. This may arise from the blob-fetching logic, or from an incorrectly 
> formatted URI when requesting blobs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7228) Upgrade Mesos to build with proto3.

2017-04-25 Thread Jay Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984112#comment-15984112
 ] 

Jay Guo commented on MESOS-7228:


[~zhitao] So no more {{required}} fields in Mesos?

> Upgrade Mesos to build with proto3.
> ---
>
> Key: MESOS-7228
> URL: https://issues.apache.org/jira/browse/MESOS-7228
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Anand Mazumdar
>Assignee: Zhitao Li
>Priority: Critical
>
> We currently build Mesos with protobuf 2.6.1 and bundle it as a dependency. 
> We should upgrade it to use v3.2.0 instead. This would help us use arenas to 
> improve performance (MESOS-6971) and also help resolve some bugs around the 
> Mesos master being unable to handle large protobufs (>64 MB in size, 
> MESOS-4210). 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-5896) When start Mesos container and docker images, it does not work.

2017-04-25 Thread Chun-Hung Hsiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983905#comment-15983905
 ] 

Chun-Hung Hsiao commented on MESOS-5896:


[~Sunzhe] It seems that you should quote the resources string when running 
{{mesos-execute}}: {{--resources="cpus:4;mem:1024;disk:2048;gpus:1"}}. 
Otherwise the shell breaks the line into multiple commands at the semicolons.
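
For example, the corrected invocation would look like this (illustrative; the 
master address and any other required flags are placeholders):
{noformat}
mesos-execute --master=<master-ip>:5050 \
  --name=gpu-test \
  --command="nvidia-smi" \
  --resources="cpus:4;mem:1024;disk:2048;gpus:1"
{noformat}
Without the quotes, each {{;}} acts as a command separator, so the shell tries 
to run {{mem:1024}}, {{disk:2048}}, and {{gpus:1}} as commands of their own.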

> When start Mesos container and docker images, it does not work.
> ---
>
> Key: MESOS-5896
> URL: https://issues.apache.org/jira/browse/MESOS-5896
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.0.0
>Reporter: Sunzhe
>  Labels: containerizer
>
> When I create Mesos container with docker image, like this:
> {code:title=test.json|borderStyle=solid}
> {
>   "id": "test-mesos-container-docker-image",
>   "cmd": "while [ true ]; do uname -a; sleep 3; done",
>   "cpus": 0.5,
>   "mem": 32.0,
>   "container": {
>   "type": "MESOS",
>   "mesos": {
>   "image": {
>   "type": "DOCKER",
>   "docker": {
>   "name": "ubuntu:14.04"
>   }
>   },
>   "network": "BRIDGE",
> "portMappings": [
>   {
> "containerPort": 8080,
> "hostPort": 0,
> "servicePort": 10008,
> "protocol": "tcp",
> "labels": {}
>   }
> ],
> "privileged": false,
> "parameters": [],
> "forcePullImage": false
>   }
>   }
> }
> {code}
> It does not work! The Docker image seems to have no effect: the container 
> uses the host filesystem, not the Docker image's filesystem.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7428) Report exit code of tasks from default and command executors

2017-04-25 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-7428:
-
Description: 
Use case: some tasks should only be retried if the exit code matches a 
user-specified requirement.

According to [~gilbert], we already checkpoint the exit code in the 
containerizer; we need to clarify how to report the exit code for executor 
containers vs. nested containers, and we should do this consistently for the 
command and default executors.

> Report exit code of tasks from default and command executors
> 
>
> Key: MESOS-7428
> URL: https://issues.apache.org/jira/browse/MESOS-7428
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> Use case: some tasks should only be retried if the exit code matches a 
> user-specified requirement.
> According to [~gilbert], we already checkpoint the exit code in the 
> containerizer; we need to clarify how to report the exit code for executor 
> containers vs. nested containers, and we should do this consistently for the 
> command and default executors.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7428) Report exit code of tasks from default and command executors

2017-04-25 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-7428:


 Summary: Report exit code of tasks from default and command 
executors
 Key: MESOS-7428
 URL: https://issues.apache.org/jira/browse/MESOS-7428
 Project: Mesos
  Issue Type: Improvement
Reporter: Zhitao Li
Assignee: Zhitao Li






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-7426) Support for agent lifecycle management.

2017-04-25 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar reassigned MESOS-7426:
-

Assignee: Anand Mazumdar

> Support for agent lifecycle management.
> ---
>
> Key: MESOS-7426
> URL: https://issues.apache.org/jira/browse/MESOS-7426
> Project: Mesos
>  Issue Type: Epic
>  Components: agent
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>  Labels: agent-lifecycle, mesosphere
>
> This epic coordinates the work for introducing agent lifecycle management in 
> Mesos, allowing a framework to be notified in case of agent node failures. 
> The existing {{Event::Failure}} is not enough for frameworks to know that the 
> given agent node isn't ever coming back.
> The primary motivations for introducing such a feature would be:
> - Currently, when an agent running a task fails, manual operator 
> intervention is needed to remove the node via a configuration API exposed by 
> the framework, e.g., {{dcos cassandra node replace}} for the Cassandra 
> framework. This needs to be done once for every stateful framework running 
> on the cluster.
> - When an agent is marked as unhealthy, the removal rate is bounded if the 
> `--agent_removal_rate_limit` option is set. This is specifically problematic 
> for operators relying on EC2 autoscaling groups or for workload bursting to 
> another cloud.
> - When the fault domain associated with an agent changes (e.g., it is moved 
> from an unallocated rack to an allocated rack), there is no feedback 
> mechanism for the framework.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7427) Registry puller cannot fetch manifests from Amazon ECR: 405 Unsupported.

2017-04-25 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-7427:
--

 Summary: Registry puller cannot fetch manifests from Amazon ECR: 
405 Unsupported.
 Key: MESOS-7427
 URL: https://issues.apache.org/jira/browse/MESOS-7427
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao
 Fix For: 1.3.0


When the registry puller is pulling a repository from Amazon ECR, a '405 
Unsupported' error occurs when fetching manifests. The error message is as 
follows:
{code}
{"errors":[{"code":"UNSUPPORTED","message":"Invalid parameter at 
'acceptedMediaTypes' failed to satisfy constraint: 'Member must satisfy 
constraint: [Member must satisfy regular expression pattern: 
\\w{1,127}\\/[-+.\\w]{1,127}]'"}]}}
{code}
The reason is that Amazon ECR strictly validates the 'Accept' header, which 
the registry puller does not set when fetching manifests. See the following 
link for more details:
https://forums.aws.amazon.com/thread.jspa?threadID=254382
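
For reference, a manifest request carrying the media type ECR expects in the 
'Accept' header would look roughly like this (illustrative; the registry host, 
repository, tag, and token are placeholders):
{noformat}
# $TOKEN stands for a valid registry bearer token
curl -H "Authorization: Bearer $TOKEN" \
     -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
     https://<account>.dkr.ecr.<region>.amazonaws.com/v2/<repo>/manifests/<tag>
{noformat}
The media type above satisfies the {{\w{1,127}\/[-+.\w]{1,127}}} pattern from 
the error message.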



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7426) Support for agent lifecycle management.

2017-04-25 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-7426:
-

 Summary: Support for agent lifecycle management.
 Key: MESOS-7426
 URL: https://issues.apache.org/jira/browse/MESOS-7426
 Project: Mesos
  Issue Type: Epic
  Components: agent
Reporter: Anand Mazumdar


This epic coordinates the work for introducing agent lifecycle management in 
Mesos, allowing a framework to be notified in case of agent node failures. The 
existing {{Event::Failure}} is not enough for frameworks to know that the given 
agent node isn't ever coming back.

The primary motivations for introducing such a feature would be:

- Currently, when an agent running a task fails, manual operator intervention 
is needed to remove the node via a configuration API exposed by the framework, 
e.g., {{dcos cassandra node replace}} for the Cassandra framework. This needs 
to be done once for every stateful framework running on the cluster.

- When an agent is marked as unhealthy, the removal rate is bounded if the 
`--agent_removal_rate_limit` option is set. This is specifically problematic 
for operators relying on EC2 autoscaling groups or for workload bursting to 
another cloud.

- When the fault domain associated with an agent changes (e.g., it is moved 
from an unallocated rack to an allocated rack), there is no feedback mechanism 
for the framework.





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MESOS-7375) provide additional insight for framework developers re: GPU_RESOURCES capability

2017-04-25 Thread Kevin Klues (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983645#comment-15983645
 ] 

Kevin Klues edited comment on MESOS-7375 at 4/25/17 9:24 PM:
-

The flag you are thinking of is 
{{\-\-allocator_fairness_excluded_resource_names}} (i.e. you can set it as 
{{\-\-allocator_fairness_excluded_resource_names=gpus}}).

Regarding motivation for the GPU_RESOURCES capability -- here is an excerpt from 
an email I sent out recently:

"""
Ideally, marathon (and any other frameworks -- SDKs included) should do some sort 
of preferential scheduling when they opt-in to use GPUs.  That is, they should 
*prefer* to run GPU jobs on GPU machines and non-GPU jobs on non-GPU machines 
(falling back to running them on GPU machines only if that is all that is 
available).

Additionally, we need a way for an operator to indicate whether GPUs are a 
scarce resource in their cluster or not. We have a flag in mesos that allows us 
to set this ( `--allocator_fairness_excluded_resource_names=gpus`), but we 
don't yet have a way of setting this through DC/OS. If we don't set this flag, 
we run the risk of Mesos's DRF algorithm choosing to very rarely send out 
offers from GPU machines once the first GPU job has been launched on them.

As a concrete example, imagine you have a machine with only 1 GPU and you 
launch a task that consumes it -- from DRF's perspective that node now has 100% 
usage of one of its resources. Even if you have 2 GPUs, and one gets consumed, 
DRF still thinks you have consumed 50% of one of its resources. Out of 
fairness, DRF will choose not to send offers from you until some other resource 
on *all* other nodes approaches 50% as well (which may take a while if you are 
allocating CPUs, memory, and disk in small increments).

Right now we don't set {{\-\-allocator_fairness_excluded_resource_names=gpus}} 
in DC/OS (but maybe we should?). Is it the case that most DC/OS users only 
install GPUs on a small number of nodes in their cluster? If so, we should 
consider it a scarce resource and set this flag by default. If not, then GPUs 
aren't actually a scarce resource and we shouldn't be setting this flag -- DRF 
will perform as expected without it.
"""


was (Author: klueska):
The flag you are thinking of is 
{{\-\-allocator_fairness_excluded_resource_names}} (i.e. you can set it as 
{{\-\-allocator_fairness_excluded_resource_names=gpus}}).

Regarding motivation for the GPU_RESOURCES capability-- here is an excerpt from 
an email I sent out recently:

"""
Ideally, marathon (and any other frameworks -- SDK include) should do some sort 
of preferential scheduling when they opt-in to use GPUs.  That is, they should 
*prefer* to run GPU jobs on GPU machines and non-GPU jobs on non-GPU machines 
(falling back to running them on GPU machines only if that is all that is 
available).

Additionally, we need a way for an operator to indicate whether GPUs are a 
scarce resource in their cluster or not. We have a flag in mesos that allows us 
to set this ( `--allocator_fairness_excluded_resource_names=gpus`), but we 
don't yet have a way of setting this through DC/OS. If we don't set this flag, 
we run the risk of Mesos's DRF algorithm choosing to very rarely send out 
offers from GPU machines once the first GPU job has been launched on them.

As a concrete example, imagine you have a machine with only 1 GPU and you 
launch a task that consumes it -- from DRF's perspective that node now has 100% 
usage of one of its resources. Even if you have 2 GPUs, and one gets consumed, 
DRF still thinks you have consumed 50% of one of its resources. Out of 
fairness, DRF will choose not to send offers from you until some other resource 
on *all* other nodes approaches 50% as well (which may take a while if you are 
allocating CPUs, memory, and disk in small increments).

Right now we don't set `--allocator_fairness_excluded_resource_names=gpus` in 
DC/OS (but maybe we should?). Is it the case that most DC/OS users only install 
GPUs on a small number of nodes in their cluster? If so, we should consider it 
a scarce resource and set this flag by default. If not, then GPUs aren't 
actually a scarce resource and we shouldn't be setting this flag-- DRF will 
perform as expected without it.
"""

> provide additional insight for framework developers re: GPU_RESOURCES 
> capability
> 
>
> Key: MESOS-7375
> URL: https://issues.apache.org/jira/browse/MESOS-7375
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: James DeFelice
>  Labels: mesosphere
>
> On clusters where all nodes are equal and every node has a GPU, frameworks 
> that **don't** opt-in to the `GPU_RESOURCES` capability won't get any offers. 
> This is surprising for operators.

[jira] [Comment Edited] (MESOS-7375) provide additional insight for framework developers re: GPU_RESOURCES capability

2017-04-25 Thread Kevin Klues (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983645#comment-15983645
 ] 

Kevin Klues edited comment on MESOS-7375 at 4/25/17 9:24 PM:
-

The flag you are thinking of is 
{{\-\-allocator_fairness_excluded_resource_names}} (i.e. you can set it as 
{{\-\-allocator_fairness_excluded_resource_names=gpus}}).

Regarding motivation for the GPU_RESOURCES capability -- here is an excerpt from 
an email I sent out recently:

"""
Ideally, marathon (and any other frameworks -- SDKs included) should do some sort 
of preferential scheduling when they opt-in to use GPUs.  That is, they should 
*prefer* to run GPU jobs on GPU machines and non-GPU jobs on non-GPU machines 
(falling back to running them on GPU machines only if that is all that is 
available).

Additionally, we need a way for an operator to indicate whether GPUs are a 
scarce resource in their cluster or not. We have a flag in mesos that allows us 
to set this ( {{\-\-allocator_fairness_excluded_resource_names=gpus}}), but we 
don't yet have a way of setting this through DC/OS. If we don't set this flag, 
we run the risk of Mesos's DRF algorithm choosing to very rarely send out 
offers from GPU machines once the first GPU job has been launched on them.

As a concrete example, imagine you have a machine with only 1 GPU and you 
launch a task that consumes it -- from DRF's perspective that node now has 100% 
usage of one of its resources. Even if you have 2 GPUs, and one gets consumed, 
DRF still thinks you have consumed 50% of one of its resources. Out of 
fairness, DRF will choose not to send offers from you until some other resource 
on *all* other nodes approaches 50% as well (which may take a while if you are 
allocating CPUs, memory, and disk in small increments).

Right now we don't set {{\-\-allocator_fairness_excluded_resource_names=gpus}} 
in DC/OS (but maybe we should?). Is it the case that most DC/OS users only 
install GPUs on a small number of nodes in their cluster? If so, we should 
consider it a scarce resource and set this flag by default. If not, then GPUs 
aren't actually a scarce resource and we shouldn't be setting this flag -- DRF 
will perform as expected without it.
"""


was (Author: klueska):
The flag you are thinking of is 
{{\-\-allocator_fairness_excluded_resource_names}} (i.e. you can set it as 
{{\-\-allocator_fairness_excluded_resource_names=gpus}}).

Regarding motivation for the GPU_RESOURCES capability-- here is an excerpt from 
an email I sent out recently:

"""
Ideally, marathon (and any other frameworks -- SDK include) should do some sort 
of preferential scheduling when they opt-in to use GPUs.  That is, they should 
*prefer* to run GPU jobs on GPU machines and non-GPU jobs on non-GPU machines 
(falling back to running them on GPU machines only if that is all that is 
available).

Additionally, we need a way for an operator to indicate whether GPUs are a 
scarce resource in their cluster or not. We have a flag in mesos that allows us 
to set this ( `--allocator_fairness_excluded_resource_names=gpus`), but we 
don't yet have a way of setting this through DC/OS. If we don't set this flag, 
we run the risk of Mesos's DRF algorithm choosing to very rarely send out 
offers from GPU machines once the first GPU job has been launched on them.

As a concrete example, imagine you have a machine with only 1 GPU and you 
launch a task that consumes it -- from DRF's perspective that node now has 100% 
usage of one of its resources. Even if you have 2 GPUs, and one gets consumed, 
DRF still thinks you have consumed 50% of one of its resources. Out of 
fairness, DRF will choose not to send offers from you until some other resource 
on *all* other nodes approaches 50% as well (which may take a while if you are 
allocating CPUs, memory, and disk in small increments).

Right now we don't set {{\-\-allocator_fairness_excluded_resource_names=gpus}} 
in DC/OS (but maybe we should?). Is it the case that most DC/OS users only 
install GPUs on a small number of nodes in their cluster? If so, we should 
consider it a scarce resource and set this flag by default. If not, then GPUs 
aren't actually a scarce resource and we shouldn't be setting this flag -- DRF 
will perform as expected without it.
"""

> provide additional insight for framework developers re: GPU_RESOURCES 
> capability
> 
>
> Key: MESOS-7375
> URL: https://issues.apache.org/jira/browse/MESOS-7375
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: James DeFelice
>  Labels: mesosphere
>
> On clusters where all nodes are equal and every node has a GPU, frameworks 
> that **don't** opt-in to the `GPU_RESOURCES` capability won't get any offers. 
> This is surprising for operators.

[jira] [Comment Edited] (MESOS-7375) provide additional insight for framework developers re: GPU_RESOURCES capability

2017-04-25 Thread Kevin Klues (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983645#comment-15983645
 ] 

Kevin Klues edited comment on MESOS-7375 at 4/25/17 9:23 PM:
-

The flag you are thinking of is 
{{\-\-allocator_fairness_excluded_resource_names}} (i.e. you can set it as 
{{\-\-allocator_fairness_excluded_resource_names=gpus}}).

Regarding motivation for the GPU_RESOURCES capability -- here is an excerpt from 
an email I sent out recently:
{noformat}
Ideally, marathon (and any other frameworks -- SDKs included) should do some sort 
of preferential scheduling when they opt-in to use GPUs.  That is, they should 
*prefer* to run GPU jobs on GPU machines and non-GPU jobs on non-GPU machines 
(falling back to running them on GPU machines only if that is all that is 
available).

Additionally, we need a way for an operator to indicate whether GPUs are a 
scarce resource in their cluster or not. We have a flag in mesos that allows us 
to set this ( `--allocator_fairness_excluded_resource_names=gpus`), but we 
don't yet have a way of setting this through DC/OS. If we don't set this flag, 
we run the risk of Mesos's DRF algorithm choosing to very rarely send out 
offers from GPU machines once the first GPU job has been launched on them.

As a concrete example, imagine you have a machine with only 1 GPU and you 
launch a task that consumes it -- from DRF's perspective that node now has 100% 
usage of one of its resources. Even if you have 2 GPUs, and one gets consumed, 
DRF still thinks you have consumed 50% of one of its resources. Out of 
fairness, DRF will choose not to send offers from you until some other resource 
on *all* other nodes approaches 50% as well (which may take a while if you are 
allocating CPUs, memory, and disk in small increments).

Right now we don't set `--allocator_fairness_excluded_resource_names=gpus` in 
DC/OS (but maybe we should?). Is it the case that most DC/OS users only install 
GPUs on a small number of nodes in their cluster? If so, we should consider it 
a scarce resource and set this flag by default. If not, then GPUs aren't 
actually a scarce resource and we shouldn't be setting this flag -- DRF will 
perform as expected without it.
{noformat}


was (Author: klueska):
The flag you are thinking of is 
{{--allocator_fairness_excluded_resource_names}} (i.e. you can set it as 
{{--allocator_fairness_excluded_resource_names=gpus}}).

Regarding motivation for the GPU_RESOURCES capability-- here is an excerpt from 
an email I sent out recently:
{noformat}
Ideally, marathon (and any other frameworks -- SDK include) should do some sort 
of preferential scheduling when they opt-in to use GPUs.  That is, they should 
*prefer* to run GPU jobs on GPU machines and non-GPU jobs on non-GPU machines 
(falling back to running them on GPU machines only if that is all that is 
available).

Additionally, we need a way for an operator to indicate whether GPUs are a 
scarce resource in their cluster or not. We have a flag in mesos that allows us 
to set this ( `--allocator_fairness_excluded_resource_names=gpus`), but we 
don't yet have a way of setting this through DC/OS. If we don't set this flag, 
we run the risk of Mesos's DRF algorithm choosing to very rarely send out 
offers from GPU machines once the first GPU job has been launched on them.

As a concrete example, imagine you have a machine with only 1 GPU and you 
launch a task that consumes it -- from DRF's perspective that node now has 100% 
usage of one of its resources. Even if you have 2 GPUs, and one gets consumed, 
DRF still thinks you have consumed 50% of one of its resources. Out of 
fairness, DRF will choose not to send offers from you until some other resource 
on *all* other nodes approaches 50% as well (which may take a while if you are 
allocating CPUs, memory, and disk in small increments).

Right now we don't set `--allocator_fairness_excluded_resource_names=gpus` in 
DC/OS (but maybe we should?). Is it the case that most DC/OS users only install 
GPUs on a small number of nodes in their cluster? If so, we should consider it 
a scarce resource and set this flag by default. If not, then GPUs aren't 
actually a scarce resource and we shouldn't be setting this flag-- DRF will 
perform as expected without it.
{noformat}

> provide additional insight for framework developers re: GPU_RESOURCES 
> capability
> 
>
> Key: MESOS-7375
> URL: https://issues.apache.org/jira/browse/MESOS-7375
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: James DeFelice
>  Labels: mesosphere
>
> On clusters where all nodes are equal and every node has a GPU, frameworks 
> that **don't** opt-in to the `GPU_RESOURCES` capability won't get any offers. 
> This is surprising for operators.

[jira] [Comment Edited] (MESOS-7375) provide additional insight for framework developers re: GPU_RESOURCES capability

2017-04-25 Thread Kevin Klues (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983645#comment-15983645
 ] 

Kevin Klues edited comment on MESOS-7375 at 4/25/17 9:23 PM:
-

The flag you are thinking of is 
{{\-\-allocator_fairness_excluded_resource_names}} (i.e. you can set it as 
{{\-\-allocator_fairness_excluded_resource_names=gpus}}).

Regarding motivation for the GPU_RESOURCES capability -- here is an excerpt from 
an email I sent out recently:

"""
Ideally, marathon (and any other frameworks -- SDKs included) should do some sort 
of preferential scheduling when they opt-in to use GPUs.  That is, they should 
*prefer* to run GPU jobs on GPU machines and non-GPU jobs on non-GPU machines 
(falling back to running them on GPU machines only if that is all that is 
available).

Additionally, we need a way for an operator to indicate whether GPUs are a 
scarce resource in their cluster or not. We have a flag in mesos that allows us 
to set this ( `--allocator_fairness_excluded_resource_names=gpus`), but we 
don't yet have a way of setting this through DC/OS. If we don't set this flag, 
we run the risk of Mesos's DRF algorithm choosing to very rarely send out 
offers from GPU machines once the first GPU job has been launched on them.

As a concrete example, imagine you have a machine with only 1 GPU and you 
launch a task that consumes it -- from DRF's perspective that node now has 100% 
usage of one of its resources. Even if you have 2 GPUs, and one gets consumed, 
DRF still thinks you have consumed 50% of one of its resources. Out of 
fairness, DRF will choose not to send offers from you until some other resource 
on *all* other nodes approaches 50% as well (which may take a while if you are 
allocating CPUs, memory, and disk in small increments).

Right now we don't set `--allocator_fairness_excluded_resource_names=gpus` in 
DC/OS (but maybe we should?). Is it the case that most DC/OS users only install 
GPUs on a small number of nodes in their cluster? If so, we should consider it 
a scarce resource and set this flag by default. If not, then GPUs aren't 
actually a scarce resource and we shouldn't be setting this flag -- DRF will 
perform as expected without it.
"""


was (Author: klueska):
The flag you are thinking of is 
{{\-\-allocator_fairness_excluded_resource_names}} (i.e. you can set it as 
{{\-\-allocator_fairness_excluded_resource_names=gpus}}).

Regarding motivation for the GPU_RESOURCES capability-- here is an excerpt from 
an email I sent out recently:
{noformat}
Ideally, marathon (and any other frameworks -- SDK include) should do some sort 
of preferential scheduling when they opt-in to use GPUs.  That is, they should 
*prefer* to run GPU jobs on GPU machines and non-GPU jobs on non-GPU machines 
(falling back to running them on GPU machines only if that is all that is 
available).

Additionally, we need a way for an operator to indicate whether GPUs are a 
scarce resource in their cluster or not. We have a flag in mesos that allows us 
to set this ( `--allocator_fairness_excluded_resource_names=gpus`), but we 
don't yet have a way of setting this through DC/OS. If we don't set this flag, 
we run the risk of Mesos's DRF algorithm choosing to very rarely send out 
offers from GPU machines once the first GPU job has been launched on them.

As a concrete example, imagine you have a machine with only 1 GPU and you 
launch a task that consumes it -- from DRF's perspective that node now has 100% 
usage of one of its resources. Even if you have 2 GPUs, and one gets consumed, 
DRF still thinks you have consumed 50% of one of its resources. Out of 
fairness, DRF will choose not to send offers from you until some other resource 
on *all* other nodes approaches 50% as well (which may take a while if you are 
allocating CPUs, memory, and disk in small increments).

Right now we don't set `--allocator_fairness_excluded_resource_names=gpus` in 
DC/OS (but maybe we should?). Is it the case that most DC/OS users only install 
GPUs on a small number of nodes in their cluster? If so, we should consider it 
a scarce resource and set this flag by default. If not, then GPUs aren't 
actually a scarce resource and we shouldn't be setting this flag-- DRF will 
perform as expected without it.
{noformat}

> provide additional insight for framework developers re: GPU_RESOURCES 
> capability
> 
>
> Key: MESOS-7375
> URL: https://issues.apache.org/jira/browse/MESOS-7375
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: James DeFelice
>  Labels: mesosphere
>
> On clusters where all nodes are equal and every node has a GPU, frameworks 
> that **don't** opt-in to the `GPU_RESOURCES` capability won't get any offers. 
> This is surprising for operators.

[jira] [Commented] (MESOS-7375) provide additional insight for framework developers re: GPU_RESOURCES capability

2017-04-25 Thread Kevin Klues (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983645#comment-15983645
 ] 

Kevin Klues commented on MESOS-7375:


The flag you are thinking of is 
{{--allocator_fairness_excluded_resource_names}} (i.e. you can set it as 
{{--allocator_fairness_excluded_resource_names=gpus}}).

Regarding motivation for the GPU_RESOURCES capability -- here is an excerpt from 
an email I sent out recently:
{noformat}
Ideally, marathon (and any other frameworks -- SDKs included) should do some sort 
of preferential scheduling when they opt-in to use GPUs.  That is, they should 
*prefer* to run GPU jobs on GPU machines and non-GPU jobs on non-GPU machines 
(falling back to running them on GPU machines only if that is all that is 
available).

Additionally, we need a way for an operator to indicate whether GPUs are a 
scarce resource in their cluster or not. We have a flag in mesos that allows us 
to set this ( `--allocator_fairness_excluded_resource_names=gpus`), but we 
don't yet have a way of setting this through DC/OS. If we don't set this flag, 
we run the risk of Mesos's DRF algorithm choosing to very rarely send out 
offers from GPU machines once the first GPU job has been launched on them.

As a concrete example, imagine you have a machine with only 1 GPU and you 
launch a task that consumes it -- from DRF's perspective that node now has 100% 
usage of one of its resources. Even if you have 2 GPUs, and one gets consumed, 
DRF still thinks you have consumed 50% of one of its resources. Out of 
fairness, DRF will choose not to send offers from you until some other resource 
on *all* other nodes approaches 50% as well (which may take a while if you are 
allocating CPUs, memory, and disk in small increments).

Right now we don't set `--allocator_fairness_excluded_resource_names=gpus` in 
DC/OS (but maybe we should?). Is it the case that most DC/OS users only install 
GPUs on a small number of nodes in their cluster? If so, we should consider it 
a scarce resource and set this flag by default. If not, then GPUs aren't 
actually a scarce resource and we shouldn't be setting this flag -- DRF will 
perform as expected without it.
{noformat}
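
For concreteness, the flag goes on the master command line (illustrative 
invocation; other flags elided):
{noformat}
mesos-master --work_dir=/var/lib/mesos \
  --allocator_fairness_excluded_resource_names=gpus
{noformat}
With the flag set, the consumed GPU in the example above no longer counts 
toward the fairness calculation, so offers keep flowing based on CPU, memory, 
and disk allocations alone.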

> provide additional insight for framework developers re: GPU_RESOURCES 
> capability
> 
>
> Key: MESOS-7375
> URL: https://issues.apache.org/jira/browse/MESOS-7375
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: James DeFelice
>  Labels: mesosphere
>
> On clusters where all nodes are equal and every node has a GPU, frameworks 
> that **don't** opt-in to the `GPU_RESOURCES` capability won't get any offers. 
> This is surprising for operators.
> Even when a framework doesn't **need** GPU resources, it may make sense for a 
> framework scheduler to provide a `--gpu-cluster-compat` (or similar) flag 
> that results in the framework advertising the `GPU_RESOURCES` capability even 
> though it does not intend to consume any GPU. The effect being that said 
> framework will now receive offers on clusters where all nodes have GPU 
> resources.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MESOS-5918) Replace jsonp with a more secure alternative

2017-04-25 Thread Jacob Janco (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983568#comment-15983568
 ] 

Jacob Janco edited comment on MESOS-5918 at 4/25/17 8:40 PM:
-

[~greggomann] [~anandmazumdar] [~mlunoe] [~xujyan] Reopening a bit of 
discussion on replacing the jsonp workaround with CORS handling on the server 
side. An initial idea is to have a configurable regex for the domains allowed 
to make cross-origin requests, matched against the Origin headers that are 
sent. At this point I don't think we'll have to support preflight requests to 
add this functionality. Another consideration: should this be a 
libprocess-level configuration, or perhaps a flag set on masters and agents?


was (Author: jjanco):
[~greggomann] [~anandmazumdar][~mlunoe] Reopening a bit of discussion on 
replacing the jsonp workaround with CORS handling server side. An initial idea 
is to have a configurable regex for domains available for cross origin requests 
which will match against sent Origin headers. At this point I don't think we'll 
have to support preflighting requests to add this functionality. Another 
consideration, should this be a libprocess level configuration or perhaps a 
flag set on masters and agents?

> Replace jsonp with a more secure alternative
> 
>
> Key: MESOS-5918
> URL: https://issues.apache.org/jira/browse/MESOS-5918
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Reporter: Yan Xu
>
> We currently use the {{jsonp}} technique to bypass CORS check. This practice 
> has many security concerns (see discussions on MESOS-5911) so we should 
> replace it with a better alternative.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MESOS-5918) Replace jsonp with a more secure alternative

2017-04-25 Thread Jacob Janco (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983568#comment-15983568
 ] 

Jacob Janco edited comment on MESOS-5918 at 4/25/17 8:40 PM:
-

[~greggomann] [~anandmazumdar] [~mlunoe] [~xujyan] Reopening a bit of 
discussion on replacing the jsonp workaround with CORS handling on the server 
side. An initial idea is to have a configurable regex for the domains allowed 
to make cross-origin requests, matched against the Origin headers that are 
sent. At this point I don't think we'll have to support preflight requests to 
add this functionality. Another consideration: should this be a 
libprocess-level configuration, or perhaps a flag set on masters and agents?


was (Author: jjanco):
[~greggomann] [~anandmazumdar][~mlunoe][~xujyan] Reopening a bit of discussion 
on replacing the jsonp workaround with CORS handling server side. An initial 
idea is to have a configurable regex for domains available for cross origin 
requests which will match against sent Origin headers. At this point I don't 
think we'll have to support preflighting requests to add this functionality. 
Another consideration, should this be a libprocess level configuration or 
perhaps a flag set on masters and agents?

> Replace jsonp with a more secure alternative
> 
>
> Key: MESOS-5918
> URL: https://issues.apache.org/jira/browse/MESOS-5918
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Reporter: Yan Xu
>
> We currently use the {{jsonp}} technique to bypass CORS check. This practice 
> has many security concerns (see discussions on MESOS-5911) so we should 
> replace it with a better alternative.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-5918) Replace jsonp with a more secure alternative

2017-04-25 Thread Jacob Janco (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983568#comment-15983568
 ] 

Jacob Janco commented on MESOS-5918:


[~greggomann] [~anandmazumdar] [~mlunoe] Reopening a bit of discussion on 
replacing the jsonp workaround with CORS handling on the server side. An 
initial idea is to have a configurable regex for the domains allowed to make 
cross-origin requests, matched against the Origin headers that are sent. At 
this point I don't think we'll have to support preflight requests to add this 
functionality. Another consideration: should this be a libprocess-level 
configuration, or perhaps a flag set on masters and agents?
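
As a rough sketch of the flag variant (hypothetical flag name -- nothing like 
it exists in Mesos or libprocess today):
{noformat}
mesos-master --cors_allowed_origin_regex='^https://([a-z0-9-]+\.)?example\.com$'
{noformat}
A request whose {{Origin}} header matches the regex would get an 
{{Access-Control-Allow-Origin}} header echoed back in the response; anything 
else would receive no CORS headers, so the browser blocks the cross-origin 
read.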

> Replace jsonp with a more secure alternative
> 
>
> Key: MESOS-5918
> URL: https://issues.apache.org/jira/browse/MESOS-5918
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Reporter: Yan Xu
>
> We currently use the {{jsonp}} technique to bypass CORS check. This practice 
> has many security concerns (see discussions on MESOS-5911) so we should 
> replace it with a better alternative.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MESOS-7280) Unified containerizer provisions docker image error with COPY backend

2017-04-25 Thread Chun-Hung Hsiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983446#comment-15983446
 ] 

Chun-Hung Hsiao edited comment on MESOS-7280 at 4/25/17 7:00 PM:
-

Fixed by Commit 
https://github.com/apache/mesos/commit/3c8deedc9a1bce617965c3442713ebdc6691d1ae.
Unit test: https://reviews.apache.org/r/58640/


was (Author: chhsia0):
Fixed by Commit 3c8deedc9a1bce617965c3442713ebdc6691d1ae.
Unit test: https://reviews.apache.org/r/58640/

> Unified containerizer provisions docker image error with COPY backend
> -
>
> Key: MESOS-7280
> URL: https://issues.apache.org/jira/browse/MESOS-7280
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.2, 1.2.0
> Environment: CentOS 7.2, ext4, COPY
>Reporter: depay
>Assignee: Chun-Hung Hsiao
>Priority: Critical
>  Labels: copy-backend
>
> The error occurs with some specific Docker images when using the COPY 
> backend, on both 1.0.2 and 1.2.0. It works well with the OVERLAY backend on 
> 1.2.0.
> {quote}
> I0321 09:36:07.308830 27613 paths.cpp:528] Trying to chown 
> '/data/mesos/slaves/55f6df5e-2812-40a0-baf5-ce96f20677d3-S102/frameworks/20151223-150303-2677017098-5050-30032-/executors/ct:Transcoding_Test_114489497_1490060156172:3/runs/7e518538-7b56-4b14-a3c9-bee43c669bd7'
>  to user 'root'
> I0321 09:36:07.319628 27613 slave.cpp:5703] Launching executor 
> ct:Transcoding_Test_114489497_1490060156172:3 of framework 
> 20151223-150303-2677017098-5050-30032- with resources cpus(*):0.1; 
> mem(*):32 in work directory 
> '/data/mesos/slaves/55f6df5e-2812-40a0-baf5-ce96f20677d3-S102/frameworks/20151223-150303-2677017098-5050-30032-/executors/ct:Transcoding_Test_114489497_1490060156172:3/runs/7e518538-7b56-4b14-a3c9-bee43c669bd7'
> I0321 09:36:07.321436 27615 containerizer.cpp:781] Starting container 
> '7e518538-7b56-4b14-a3c9-bee43c669bd7' for executor 
> 'ct:Transcoding_Test_114489497_1490060156172:3' of framework 
> '20151223-150303-2677017098-5050-30032-'
> I0321 09:36:37.902195 27600 provisioner.cpp:294] Provisioning image rootfs 
> '/data/mesos/provisioner/containers/7e518538-7b56-4b14-a3c9-bee43c669bd7/backends/copy/rootfses/8d2f7fe8-71ff-4317-a33c-a436241a93d9'
>  for container 7e518538-7b56-4b14-a3c9-bee43c669bd7
> *E0321 09:36:58.707718 27606 slave.cpp:4000] Container 
> '7e518538-7b56-4b14-a3c9-bee43c669bd7' for executor 
> 'ct:Transcoding_Test_114489497_1490060156172:3' of framework 
> 20151223-150303-2677017098-5050-30032- failed to start: Collect failed: 
> Failed to copy layer: cp: cannot create regular file 
> ‘/data/mesos/provisioner/containers/7e518538-7b56-4b14-a3c9-bee43c669bd7/backends/copy/rootfses/8d2f7fe8-71ff-4317-a33c-a436241a93d9/usr/bin/python’:
>  Text file busy*
> I0321 09:36:58.707991 27608 containerizer.cpp:1622] Destroying container 
> '7e518538-7b56-4b14-a3c9-bee43c669bd7'
> I0321 09:36:58.708468 27607 provisioner.cpp:434] Destroying container rootfs 
> at 
> '/data/mesos/provisioner/containers/7e518538-7b56-4b14-a3c9-bee43c669bd7/backends/copy/rootfses/8d2f7fe8-71ff-4317-a33c-a436241a93d9'
>  for container 7e518538-7b56-4b14-a3c9-bee43c669bd7
> {quote}
> The Docker image is a private one, so I will have to try to reproduce this 
> bug with a sample Dockerfile if possible.
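> For context, the "Text file busy" (ETXTBSY) failure can be reproduced outside 
> Mesos (illustrative commands): Linux refuses to open a currently executing 
> binary for writing, which is what {{cp}} attempts when a layer copy 
> overwrites a running file:
> {noformat}
> cp /bin/sleep /tmp/busy
> /tmp/busy 1000 &
> cp /bin/sleep /tmp/busy   # fails with ETXTBSY: Text file busy
> {noformat}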



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7423) Information on Mesos CI

2017-04-25 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983442#comment-15983442
 ] 

Vinod Kone commented on MESOS-7423:
---

This discussion is probably more appropriate for the dev@mesos.apache.org 
mailing list. Please send an email to dev-subscribe@mesos.apache.org to 
subscribe.

The Mesos CI uses ASF Jenkins infrastructure located at 
http://builds.apache.org.

AFAIK, ASF Jenkins doesn't have s390x machines. Someone needs to donate such 
machines to ASF so that they can be added to the CI pool. Please file an INFRA 
ticket (https://issues.apache.org/jira/browse/INFRA) for that and link here.

> Information on Mesos CI
> ---
>
> Key: MESOS-7423
> URL: https://issues.apache.org/jira/browse/MESOS-7423
> Project: Mesos
>  Issue Type: Task
>Reporter: Nayana Thorat
>
> Hi Vinod,
> We had raised an issue to add s390x support for Mesos, which was fixed and 
> resolved:
> https://issues.apache.org/jira/browse/MESOS-6742
> We also want to know about the Mesos CI. 
> We need the following details about the current Mesos CI:
> 1. What is the current Mesos CI infrastructure? Travis/Jenkins?
> 2. Can the Mesos CI be extended to support s390x systems?
> We are not sure if this is the right channel to discuss this topic. 
> Please let us know if you want to start this discussion on some other channel.
> Thanks,



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7378) Build failure with missing gnu_dev_major and gnu_dev_minor symbols

2017-04-25 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983357#comment-15983357
 ] 

James Peach commented on MESOS-7378:


FWIW the Fedora glibc changelog says this was fixed in 2.13.90-12:

{noformat}
* Fri May 13 2011 Andreas Schwab  - 2.13.90-12
- Update from master
  - Fix resizing table for unique symbols when adding symbol for copy
relocation (BZ#12511)
  - Fix sched_setscheduler call in spawn implementation (BZ#12052)
  - Report write error in addmnt even for cached streams (BZ#12625)
  - Translate kernel error into what pthread_create should return
(BZ#386)
  - More configurability for secondary group lookup (BZ#11257)
  - Several locale data updates (BZ#11258, BZ#11487, BZ#11532,
BZ#11578, BZ#11653, BZ#11668, BZ#11945, BZ#11947, BZ#12158,
BZ#12200, BZ#12178, BZ#12178, BZ#12346, BZ#12449, BZ#12545,
BZ#12551, BZ#12611, BZ#12660, BZ#12681, BZ#12541, BZ#12711,
BZ#12738)
  - Fix Linux getcwd for long paths (BZ#12713)
  - static tls memory leak on TLS_DTV_AT_TP archs
  - Actually undefine ARG_MAX from 
  - Backport BIND code to query name as TLD (BZ#12734)
  - Allow $ORIGIN to reference trusted directoreis in SUID binaries
(BZ #12393)
  - Add missing {__BEGIN,__END}_DECLS to sys/sysmacros.h
  - Report if no record is found by initgroups in nss_files
- Never leave $ORIGIN unexpanded
- Revert "Ignore origin of privileged program"
- Reexport RPC interface
{noformat}
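
For anyone hitting this, the mangled references are visible in the unresolved 
symbol table (illustrative check; {{nm -C}} demangles, matching the link 
errors quoted in the description):
{noformat}
nm -C .libs/libmesos.so | grep 'gnu_dev_m'
{noformat}
On glibc versions that include the "Add missing {__BEGIN,__END}_DECLS to 
sys/sysmacros.h" fix from the changelog above, the declarations get C linkage 
and the symbols resolve against libc as plain C names.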

> Build failure with missing gnu_dev_major and gnu_dev_minor symbols
> --
>
> Key: MESOS-7378
> URL: https://issues.apache.org/jira/browse/MESOS-7378
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: James Peach
>
> {noformat}
> 03:46:16 - ./.libs/libmesos.so: undefined reference to 
> `gnu_dev_minor(unsigned long long)'
> 03:46:16 - ./.libs/libmesos.so: undefined reference to 
> `gnu_dev_major(unsigned long long)'
> {noformat}
> This is caused by the change in MESOS-7365.
> Including {{sys/sysmacros.h}} directly works on modern systems, but on our 
> older version of glibc, the {{sys/sysmacros.h}} header does not contain C++ 
> decls. This means that the inline symbols get C++ name mangling applied and 
> they don't get found at link time.
> {noformat}
> [vagrant@mesos ~]$ cat /etc/redhat-release
> CentOS release 6.8 (Final)
> [vagrant@mesos ~]$ rpm -qa | grep glibc
> glibc-common-2.12-1.192.el6.x86_64
> glibc-devel-2.12-1.192.el6.x86_64
> glibc-2.12-1.192.el6.x86_64
> glibc-headers-2.12-1.192.el6.x86_64
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7223) Linux filesystem isolator cannot mount host volume /dev/log.

2017-04-25 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983356#comment-15983356
 ] 

Gilbert Song commented on MESOS-7223:
-

[~adam-mesos], I can look into this in two weeks; I don't have cycles for it 
right now. Let's remove the target version 1.2.1.

> Linux filesystem isolator cannot mount host volume /dev/log.
> 
>
> Key: MESOS-7223
> URL: https://issues.apache.org/jira/browse/MESOS-7223
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: Haralds Ulmanis
>  Labels: volumes
>
> I'm trying to mount /dev/log.
> ls -l /dev/log
> lrwxrwxrwx 1 root root 28 Mar  9 01:49 /dev/log -> 
> /run/systemd/journal/dev-log
> # ls -l /run/systemd/journal/dev-log
> srw-rw-rw- 1 root root 0 Mar  9 01:49 /run/systemd/journal/dev-log
> I have tried mounting both /dev/log and /run/systemd/journal/dev-log; both 
> produce the same errors:
> from stdout:
> Executing pre-exec command 
> '{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/usr\/lib\/mesos\/mesos-containerizer"}'
> Executing pre-exec command 
> '{"arguments":["mount","-n","--rbind","\/data\/mesos-agent\/slaves\/9b7ad711-9381-4338-b3c0-dac86253701e-S93\/frameworks\/a872f621-d10f-4021-a886-c5d564df104e-\/executors\/services_dev-2_lb-6.b8202973-04b0-11e7-be02-0a2b9a5c33cf\/runs\/cfb170f0-6c69-4475-9dbe-bb9967e19b42","\/data\/mesos-agent\/provisioner\/containers\/cfb170f0-6c69-4475-9dbe-bb9967e19b42\/backends\/overlay\/rootfses\/890a25e6-cb15-42e3-be9c-0aa3baf889f8\/data\/mesos-agent\/sandbox"],"shell":false,"value":"mount"}'
> Executing pre-exec command 
> '{"arguments":["mount","-n","--rbind","\/run\/systemd\/journal\/dev-log","\/data\/mesos-agent\/provisioner\/containers\/cfb170f0-6c69-4475-9dbe-bb9967e19b42\/backends\/overlay\/rootfses\/890a25e6-cb15-42e3-be9c-0aa3baf889f8\/dev\/log"],"shell":false,"value":"mount"}'
> from stderr:
> mount: mount(2) failed: 
> /data/mesos-agent/provisioner/containers/cfb170f0-6c69-4475-9dbe-bb9967e19b42/backends/overlay/rootfses/890a25e6-cb15-42e3-be9c-0aa3baf889f8/dev/log:
>  Not a directory
> Failed to execute pre-exec command 
> '{"arguments":["mount","-n","--rbind","\/run\/systemd\/journal\/dev-log","\/data\/mesos-agent\/provisioner\/containers\/cfb170f0-6c69-4475-9dbe-bb9967e19b42\/backends\/overlay\/rootfses\/890a25e6-cb15-42e3-be9c-0aa3baf889f8\/dev\/log"],"shell":false,"value":"mount"}'
> I start this particular job from Marathon with the following definition (if 
> I change MESOS to DOCKER, it works): 
> "container": {
> "type": "MESOS",
> "volumes": [
>   {
> "hostPath": "/run/systemd/journal/dev-log",
> "containerPath": "/dev/log",
> "mode": "RW"
>   }
> ],
> "docker": {
>   "image": "",
>   "credential": null,
>   "forcePullImage": true
> }
>   },



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7223) Linux filesystem isolator cannot mount host volume /dev/log.

2017-04-25 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983353#comment-15983353
 ] 

Adam B commented on MESOS-7223:
---

[~gilbert], [~jieyu] Do you think we can get this fixed in master soon, so we 
can backport to 1.2.1?

> Linux filesystem isolator cannot mount host volume /dev/log.
> 
>
> Key: MESOS-7223
> URL: https://issues.apache.org/jira/browse/MESOS-7223
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: Haralds Ulmanis
>  Labels: volumes
>
> I'm trying to mount /dev/log.
> ls -l /dev/log
> lrwxrwxrwx 1 root root 28 Mar  9 01:49 /dev/log -> 
> /run/systemd/journal/dev-log
> # ls -l /run/systemd/journal/dev-log
> srw-rw-rw- 1 root root 0 Mar  9 01:49 /run/systemd/journal/dev-log
> I have tried mounting both /dev/log and /run/systemd/journal/dev-log; both 
> produce the same errors:
> from stdout:
> Executing pre-exec command 
> '{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/usr\/lib\/mesos\/mesos-containerizer"}'
> Executing pre-exec command 
> '{"arguments":["mount","-n","--rbind","\/data\/mesos-agent\/slaves\/9b7ad711-9381-4338-b3c0-dac86253701e-S93\/frameworks\/a872f621-d10f-4021-a886-c5d564df104e-\/executors\/services_dev-2_lb-6.b8202973-04b0-11e7-be02-0a2b9a5c33cf\/runs\/cfb170f0-6c69-4475-9dbe-bb9967e19b42","\/data\/mesos-agent\/provisioner\/containers\/cfb170f0-6c69-4475-9dbe-bb9967e19b42\/backends\/overlay\/rootfses\/890a25e6-cb15-42e3-be9c-0aa3baf889f8\/data\/mesos-agent\/sandbox"],"shell":false,"value":"mount"}'
> Executing pre-exec command 
> '{"arguments":["mount","-n","--rbind","\/run\/systemd\/journal\/dev-log","\/data\/mesos-agent\/provisioner\/containers\/cfb170f0-6c69-4475-9dbe-bb9967e19b42\/backends\/overlay\/rootfses\/890a25e6-cb15-42e3-be9c-0aa3baf889f8\/dev\/log"],"shell":false,"value":"mount"}'
> from stderr:
> mount: mount(2) failed: 
> /data/mesos-agent/provisioner/containers/cfb170f0-6c69-4475-9dbe-bb9967e19b42/backends/overlay/rootfses/890a25e6-cb15-42e3-be9c-0aa3baf889f8/dev/log:
>  Not a directory
> Failed to execute pre-exec command 
> '{"arguments":["mount","-n","--rbind","\/run\/systemd\/journal\/dev-log","\/data\/mesos-agent\/provisioner\/containers\/cfb170f0-6c69-4475-9dbe-bb9967e19b42\/backends\/overlay\/rootfses\/890a25e6-cb15-42e3-be9c-0aa3baf889f8\/dev\/log"],"shell":false,"value":"mount"}'
> I start this particular job from Marathon with the following definition (if 
> I change MESOS to DOCKER, it works): 
> "container": {
> "type": "MESOS",
> "volumes": [
>   {
> "hostPath": "/run/systemd/journal/dev-log",
> "containerPath": "/dev/log",
> "mode": "RW"
>   }
> ],
> "docker": {
>   "image": "",
>   "credential": null,
>   "forcePullImage": true
> }
>   },



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-7378) Build failure with missing gnu_dev_major and gnu_dev_minor symbols

2017-04-25 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-7378:
--

Assignee: (was: James Peach)

> Build failure with missing gnu_dev_major and gnu_dev_minor symbols
> --
>
> Key: MESOS-7378
> URL: https://issues.apache.org/jira/browse/MESOS-7378
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: James Peach
>
> {noformat}
> 03:46:16 - ./.libs/libmesos.so: undefined reference to 
> `gnu_dev_minor(unsigned long long)'
> 03:46:16 - ./.libs/libmesos.so: undefined reference to 
> `gnu_dev_major(unsigned long long)'
> {noformat}
> This is caused by the change in MESOS-7365.
> Including {{<sys/sysmacros.h>}} directly works on modern systems, but on our 
> older version of glibc, the {{<sys/sysmacros.h>}} header does not contain C++ 
> declarations. This means that the inline symbols get C++ name mangling applied 
> and are not found at link time.
> {noformat}
> [vagrant@mesos ~]$ cat /etc/redhat-release
> CentOS release 6.8 (Final)
> [vagrant@mesos ~]$ rpm -qa | grep glibc
> glibc-common-2.12-1.192.el6.x86_64
> glibc-devel-2.12-1.192.el6.x86_64
> glibc-2.12-1.192.el6.x86_64
> glibc-headers-2.12-1.192.el6.x86_64
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7424) Secure UPID bindings in libprocess.

2017-04-25 Thread James Peach (JIRA)
James Peach created MESOS-7424:
--

 Summary: Secure UPID bindings in libprocess.
 Key: MESOS-7424
 URL: https://issues.apache.org/jira/browse/MESOS-7424
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Reporter: James Peach


{{libprocess}} has no way to securely enforce that a message comes from the 
{{UPID}} it claims to come from. This makes it easy for a malicious entity 
to spoof messages from a legitimate {{UPID}}.
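
To illustrate the exposure (the message name and UPIDs below are made up): a
libprocess message is an ordinary HTTP POST, and the sender's {{UPID}} is
carried in a {{Libprocess-From}} request header that the receiver does not
verify, so any client that can reach the port can claim to be any actor.
{code}
// A sketch of the spoofing risk. The endpoint and UPIDs are illustrative;
// the point is that the claimed sender is just an unauthenticated header.
#include <iostream>
#include <string>

int main() {
  std::string spoofed =
      "POST /master/mesos.internal.ExampleMessage HTTP/1.1\r\n"
      "Host: master.example.com:5050\r\n"
      "Libprocess-From: slave(1)@10.0.0.2:5051\r\n"  // forged sender UPID
      "Connection: close\r\n"
      "Content-Length: 0\r\n"
      "\r\n";

  // Writing `spoofed` to a TCP socket connected to the master's libprocess
  // port would deliver a message that appears to come from the agent at
  // 10.0.0.2, and the receiver has no way to check that claim.
  std::cout << spoofed;
  return 0;
}
{code}
Fixing this presumably means binding some authenticated transport identity to
the claimed {{UPID}} rather than trusting the header.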



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-6976) Disallow (re-)registration attempts by old agents

2017-04-25 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-6976:
---
Shepherd: Benjamin Mahler
Target Version/s: 1.2.1, 1.3.0

> Disallow (re-)registration attempts by old agents
> -
>
> Key: MESOS-6976
> URL: https://issues.apache.org/jira/browse/MESOS-6976
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: mesosphere
>
> The master should detect this situation and prevent the (re-)registration 
> attempt from succeeding. Should we shut down the agent as well? If we don't 
> shut it down, the agent will loop and keep trying to connect. There's no 
> reason we _need_ to shut the agent down, but it doesn't seem unreasonable.
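
A minimal sketch of the master-side check being discussed (the names and the
minimum version are hypothetical, and whether to also send a shutdown is the
open question above):
{code}
// Hypothetical sketch: refuse (re-)registration from agents older than a
// minimum supported version. Not the actual master code.
#include <iostream>
#include <string>
#include <tuple>

struct AgentVersion { int major, minor, patch; };

const AgentVersion MINIMUM_AGENT_VERSION{1, 0, 0};

bool tooOld(const AgentVersion& v) {
  return std::tie(v.major, v.minor, v.patch) <
         std::tie(MINIMUM_AGENT_VERSION.major,
                  MINIMUM_AGENT_VERSION.minor,
                  MINIMUM_AGENT_VERSION.patch);
}

void onReregister(const AgentVersion& version, const std::string& agentId) {
  if (tooOld(version)) {
    std::cout << "Refusing re-registration of agent " << agentId << "\n";
    // Optionally also reply with a shutdown-style message here; otherwise
    // the agent will loop and keep retrying the registration.
    return;  // drop the attempt
  }
  // ... proceed with normal re-registration ...
}
{code}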



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-1987) Add better support for handling arbitrary long versions strings in stout/version.hpp

2017-04-25 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway reassigned MESOS-1987:
--

Assignee: Neil Conway  (was: Kapil Arya)

> Add better support for handling arbitrary long versions strings in 
> stout/version.hpp
> 
>
> Key: MESOS-1987
> URL: https://issues.apache.org/jira/browse/MESOS-1987
> Project: Mesos
>  Issue Type: Bug
>Reporter: Kapil Arya
>Assignee: Neil Conway
>  Labels: mesosphere
>
> Currently, the Version class handles only strings of the form X.Y.Z. A recent 
> patch (https://reviews.apache.org/r/27115/) allows for strings of the form 
> X.Y.Z-* by discarding the "-" along with the rest of the string following it. 
> This means that the check `Version("0.20.1") == Version("0.20.1-rc2")` will 
> succeed.
> A better fix is to allow an arbitrary number of components in the version 
> string and still do the right thing w.r.t. comparisons. To standardize it a 
> bit, we can consider Semantic Versioning (http://semver.org/).
> Semantic Versioning allows for strings of the following format:
> 
> "MAJOR.MINOR.PATCH-IDENTIFIER[.IDENTIFIER]*"
> 
> An IDENTIFIER must comprise only ASCII alphanumerics and hyphens [0-9A-Za-z-].
> One way to implement this in the Version class is to keep a vector of (string) 
> identifiers along with the major, minor, and patch variables. Another 
> alternative is to drop the major, minor, and patch variables and keep a 
> single vector of strings.
> The comparison can be tricky: one has to consider pre-release versions, etc., 
> as explained in the SemVer 2.0.0 specification.
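
A sketch of the comparison under SemVer 2.0.0 precedence rules (this is not
stout's implementation, just an illustration of the "vector of identifiers"
approach): identifiers compare numerically when both are numeric and lexically
otherwise, numeric identifiers sort below alphanumeric ones, and a version
with any pre-release sorts below the bare release.
{code}
// Illustrative SemVer 2.0.0 precedence comparison, assuming the version
// string has already been parsed into the struct below.
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string>
#include <vector>

struct Version {
  int major = 0, minor = 0, patch = 0;
  std::vector<std::string> prerelease;  // identifiers after the '-'
};

static bool isNumeric(const std::string& s) {
  return !s.empty() && std::all_of(s.begin(), s.end(), [](unsigned char c) {
    return std::isdigit(c) != 0;
  });
}

// Returns true if `a` has lower precedence than `b`.
bool lessThan(const Version& a, const Version& b) {
  if (a.major != b.major) return a.major < b.major;
  if (a.minor != b.minor) return a.minor < b.minor;
  if (a.patch != b.patch) return a.patch < b.patch;

  // "1.0.0-alpha" < "1.0.0": a pre-release sorts below the release itself.
  if (a.prerelease.empty() || b.prerelease.empty()) {
    return !a.prerelease.empty() && b.prerelease.empty();
  }

  for (size_t i = 0; i < a.prerelease.size() && i < b.prerelease.size(); ++i) {
    const std::string& x = a.prerelease[i];
    const std::string& y = b.prerelease[i];
    if (x == y) continue;
    if (isNumeric(x) && isNumeric(y)) {
      return std::atol(x.c_str()) < std::atol(y.c_str());
    }
    if (isNumeric(x) != isNumeric(y)) return isNumeric(x);  // numeric first
    return x < y;  // plain ASCII comparison
  }

  // "1.0.0-alpha" < "1.0.0-alpha.1": fewer identifiers sort first.
  return a.prerelease.size() < b.prerelease.size();
}
{code}
Under these rules {{Version("0.20.1-rc2")}} compares below, rather than equal
to, {{Version("0.20.1")}}, which fixes the example above.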



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-6933) Executor does not respect grace period

2017-04-25 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982998#comment-15982998
 ] 

Alexander Rukletsov commented on MESOS-6933:


[~janisz], this is, unfortunately, a known issue that has been around for a 
while (I've linked the original ticket). Surprisingly, we haven't seen many 
requests to fix it (do folks avoid wrapping their tasks in {{sh}}?) and never 
got around to working on it.

Do you want to suggest a patch? I'll be happy to shepherd.

> Executor does not respect grace period
> --
>
> Key: MESOS-6933
> URL: https://issues.apache.org/jira/browse/MESOS-6933
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Reporter: Tomasz Janiszewski
>
> The Mesos command executor tries to support the grace period by escalating 
> from SIGTERM to SIGKILL, but unfortunately this does not work. It launches 
> {{command}} by wrapping it in {{sh -c}}, which makes the process tree look 
> like this:
> {code}
> Received killTask
> Shutting down
> Sending SIGTERM to process tree at pid 18
> Sent SIGTERM to the following process trees:
> [ 
> -+- 18 sh -c cd offer-i18n-0.1.24 && LD_PRELOAD=../librealresources.so 
> ./bin/offer-i18n -e prod -p $PORT0 
>  \--- 19 command...
> ]
> Command terminated with signal Terminated (pid: 18)
> {code}
> This causes {{sh}} to exit immediately, and the executor with it, while the 
> wrapped {{command}} might need more time to finish. The executor then thinks 
> the command terminated gracefully, so it won't 
> [escalate|https://github.com/apache/mesos/blob/1.1.0/src/launcher/executor.cpp#L695]
>  to SIGKILL.
> This leaks processes when the POSIX containerizer is used, because if the 
> command ignores SIGTERM it is reparented to {{init}} and never gets killed. 
> Using a PID namespace only masks the problem, because the hanging process is 
> killed before it can shut down gracefully.
> The fix is to send SIGTERM only to the children of {{sh}} (see the sketch 
> below): {{sh}} will exit when all of its child processes finish, and if they 
> don't, they will be killed by the escalation to SIGKILL.
> All versions from 0.20 are affected.
> This test should pass 
> [src/tests/command_executor_tests.cpp:342|https://github.com/apache/mesos/blob/2c856178b59593ff8068ea8d6c6593943c33008c/src/tests/command_executor_tests.cpp#L342-L343]
> [Mailing list 
> thread|https://lists.apache.org/thread.html/1025dca0cf4418aee50b14330711500af864f08b53eb82d10cd5c04c@%3Cuser.mesos.apache.org%3E]
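
A sketch of the proposed fix (assuming Linux with
{{/proc/<pid>/task/<pid>/children}} available; the helper name is made up):
signal only the children of the {{sh}} wrapper, so {{sh}} stays alive until
they exit and the executor's existing escalation still SIGKILLs children that
ignore SIGTERM.
{code}
// Send SIGTERM to the direct children of the `sh` wrapper instead of to
// the whole tree, so `sh` itself keeps running until its children exit.
// Requires /proc/<pid>/task/<pid>/children (reasonably recent kernels).
#include <fstream>
#include <sstream>
#include <signal.h>
#include <sys/types.h>

void terminateChildrenOf(pid_t shPid) {
  std::ostringstream path;
  path << "/proc/" << shPid << "/task/" << shPid << "/children";

  std::ifstream children(path.str());
  pid_t child;
  while (children >> child) {
    kill(child, SIGTERM);  // the wrapped command, not the sh wrapper
  }
}
{code}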



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7423) Information on Mesos CI

2017-04-25 Thread Nayana Thorat (JIRA)
Nayana Thorat created MESOS-7423:


 Summary: Information on Mesos CI
 Key: MESOS-7423
 URL: https://issues.apache.org/jira/browse/MESOS-7423
 Project: Mesos
  Issue Type: Task
Reporter: Nayana Thorat


Hi Vinod,

We had raised an issue to add s390x support for Mesos, which was fixed and 
resolved:
https://issues.apache.org/jira/browse/MESOS-6742

We would also like to know more about the Mesos CI.

We need the following details about the current Mesos CI:
1. What does the current Mesos CI infrastructure look like? Travis or Jenkins?
2. Can the Mesos CI be extended to support s390x systems?

We are not sure if this is the right channel to discuss this topic.
Please let us know if you would prefer to have this discussion on another 
channel.

Thanks,



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7378) Build failure with missing gnu_dev_major and gnu_dev_minor symbols

2017-04-25 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982510#comment-15982510
 ] 

Yan Xu commented on MESOS-7378:
---

Maybe this is OK as a general fix? We don't have a minimum glibc requirement.
Should we at least reopen this to find a solution? Thoughts?

> Build failure with missing gnu_dev_major and gnu_dev_minor symbols
> --
>
> Key: MESOS-7378
> URL: https://issues.apache.org/jira/browse/MESOS-7378
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: James Peach
>Assignee: James Peach
>
> {noformat}
> 03:46:16 - ./.libs/libmesos.so: undefined reference to 
> `gnu_dev_minor(unsigned long long)'
> 03:46:16 - ./.libs/libmesos.so: undefined reference to 
> `gnu_dev_major(unsigned long long)'
> {noformat}
> This is caused by the change in MESOS-7365.
> Including {{<sys/sysmacros.h>}} directly works on modern systems, but on our 
> older version of glibc, the {{<sys/sysmacros.h>}} header does not contain C++ 
> declarations. This means that the inline symbols get C++ name mangling applied 
> and are not found at link time.
> {noformat}
> [vagrant@mesos ~]$ cat /etc/redhat-release
> CentOS release 6.8 (Final)
> [vagrant@mesos ~]$ rpm -qa | grep glibc
> glibc-common-2.12-1.192.el6.x86_64
> glibc-devel-2.12-1.192.el6.x86_64
> glibc-2.12-1.192.el6.x86_64
> glibc-headers-2.12-1.192.el6.x86_64
> {noformat}
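
One way to see the mechanism (an illustration of the linkage problem, not
necessarily the fix that should land): without C-linkage guards in the header,
the compiler emits references to C++-mangled symbols, which is exactly what
the link errors above show. Forcing C linkage around the include makes the
references match the C symbols glibc actually exports:
{code}
// Illustrative sketch only. On old glibc, <sys/sysmacros.h> lacks C++
// declarations, so without C linkage the reference becomes the mangled
// `gnu_dev_major(unsigned long long)`, which libc does not export.
extern "C" {
#include <sys/sysmacros.h>
}

int main() {
  // With C linkage forced, this resolves against libc's C symbols.
  return static_cast<int>(gnu_dev_major(0x0123456789ABCDEFULL)) & 0xff;
}
{code}
A general fix along these lines would not require raising the minimum glibc
version.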



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)