[jira] [Commented] (MESOS-5879) cgroups/net_cls isolator causing agent recovery issues

2016-07-22 Thread Silas Snider (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390482#comment-15390482
 ] 

Silas Snider commented on MESOS-5879:
-

No, I'm talking about the newly identified issue where a child container that 
shares the net_cls classid of its parent (which is the default) is detected as 
a 'duplicate' during container recovery.

> cgroups/net_cls isolator causing agent recovery issues
> --
>
> Key: MESOS-5879
> URL: https://issues.apache.org/jira/browse/MESOS-5879
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups, isolation, slave
>Reporter: Silas Snider
>Assignee: Avinash Sridharan
>
> We run with 'cgroups/net_cls' in our isolator list, and when we restart any 
> agent process in a cluster running an experimental custom isolator as well, 
> the agents are unable to recover from checkpoint, because net_cls reports 
> that unknown orphan containers have duplicate net_cls handles.
> While this is a problem that needs to be solved (probably by fixing our 
> custom isolator), it's also a problem that the net_cls isolator fails 
> recovery just for duplicate handles in cgroups that it is literally about to 
> unconditionally destroy during recovery. Can this be fixed?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2602) Provide a way to "push" cluster state updates to a registered service.

2016-07-22 Thread Deshi Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390471#comment-15390471
 ] 

Deshi Xiao commented on MESOS-2602:
---

Cool

> Provide a way to "push" cluster state updates to a registered service.
> --
>
> Key: MESOS-2602
> URL: https://issues.apache.org/jira/browse/MESOS-2602
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Christos Kozyrakis
>Assignee: Zhitao Li
> Fix For: 1.0.0
>
>
> At the moment, service discovery systems like Mesos-DNS must periodically 
> pull the cluster state through state.json. This is extremely wasteful and 
> will not scale to large clusters. If the polling interval is low, the master 
> load will grow significantly. If the polling interval is high, there will be 
> added latency to service discovery. Moreover, the way state.json works right 
> now, one keeps reading the same information over and over again, including 
> info about tasks no longer running. 
> We can design an endpoint that allows a "push" approach for state 
> information. Here is one of the possible ways to set it up:
> - a service can hit the end point at (re)start to get information for all 
> currently running tasks. 
> - a service can also register itself to receive updates to task state 
> beyond that (i.e., notifications of tasks starting/ending/etc.). We may want to 
> add some qualifiers here, since service discovery systems care only about 
> certain types of updates.  
> This can be implemented through direct messaging, through a message queue, by 
> putting messages in etcd/zookeeper, etc. We should pick the way that is most 
> scalable. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5879) cgroups/net_cls isolator causing agent recovery issues

2016-07-22 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390432#comment-15390432
 ] 

Avinash Sridharan commented on MESOS-5879:
--

You mean the issue of throwing an error when we are not able to reserve a 
handle for the orphan? I don't see that as a bug. Given that the orphan 
containers are still around, we shouldn't break the semantics by allowing 
duplicates. This is an erroneous situation, something that will give 
unpredictable behavior if we allow it, and hence it should be caught early. 

I don't think just printing out a LOG(WARNING) is enough in this situation.

> cgroups/net_cls isolator causing agent recovery issues
> --
>
> Key: MESOS-5879
> URL: https://issues.apache.org/jira/browse/MESOS-5879
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups, isolation, slave
>Reporter: Silas Snider
>Assignee: Avinash Sridharan
>
> We run with 'cgroups/net_cls' in our isolator list, and when we restart any 
> agent process in a cluster running an experimental custom isolator as well, 
> the agents are unable to recover from checkpoint, because net_cls reports 
> that unknown orphan containers have duplicate net_cls handles.
> While this is a problem that needs to be solved (probably by fixing our 
> custom isolator), it's also a problem that the net_cls isolator fails 
> recovery just for duplicate handles in cgroups that it is literally about to 
> unconditionally destroy during recovery. Can this be fixed?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2346) Docker tasks exiting normally, but returning TASK_FAILED

2016-07-22 Thread Huadong Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390416#comment-15390416
 ] 

Huadong Liu commented on MESOS-2346:


[~anandmazumdar] I made a 0.28.x patch (thanks [~vinodkone] and [~vladap2016]) 
at https://reviews.apache.org/r/50360/ and I am currently running longevity 
tests on it in our mesos/chronos environment. The patch overlaps with Vladimir 
Petrovic's MESOS-243 effort (https://reviews.apache.org/r/50034/) on the master 
branch. Please help coordinate the commits. I'd prefer a commit to go to either 
0.28.x or master first and then be ported to other branches. To me, 0.28.x has 
a higher priority than a post-1.0 release, and is easier to test. Please take a 
look when you have a chance. Thank you.

> Docker tasks exiting normally, but returning TASK_FAILED
> 
>
> Key: MESOS-2346
> URL: https://issues.apache.org/jira/browse/MESOS-2346
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 0.22.0
>Reporter: Brenden Matthews
>Priority: Critical
>
> Docker tasks which exit normally will return TASK_FAILED, as opposed to 
> TASK_FINISHED. This problem seems to occur only after `mesos-slave` has been 
> running for some time. If the slave is restarted, it will begin returning 
> TASK_FINISHED correctly.
> Sample slave log:
> {noformat}
> Feb 11 23:22:13 ip-10-102-188-213.ec2.internal mesos-slave[793]: I0211 
> 23:22:13.483464   798 slave.cpp:1138] Got assigned task 
> ct:1423696932164:2:canary: for framework 
> 20150211-045421-1401302794-5050-714-0001
> Feb 11 23:22:13 ip-10-102-188-213.ec2.internal mesos-slave[793]: I0211 
> 23:22:13.483667   798 slave.cpp:3854] Checkpointing FrameworkInfo to 
> '/tmp/mesos/meta/slaves/20150211-045421-1401302794-5050-714-S0/frameworks/20150211-045421-1401302794-5050-714-0001/framework.info'
> Feb 11 23:22:13 ip-10-102-188-213.ec2.internal mesos-slave[793]: I0211 
> 23:22:13.483894   798 slave.cpp:3861] Checkpointing framework pid 
> 'scheduler-f4679749-d7ad-4d8c-b610-f7043332d243@10.102.188.213:56385' to 
> '/tmp/mesos/meta/slaves/20150211-045421-1401302794-5050-714-S0/frameworks/20150211-045421-1401302794-5050-714-0001/framework.pid'
> Feb 11 23:22:13 ip-10-102-188-213.ec2.internal mesos-slave[793]: I0211 
> 23:22:13.484426   798 gc.cpp:84] Unscheduling 
> '/tmp/mesos/slaves/20150211-045421-1401302794-5050-714-S0/frameworks/20150211-045421-1401302794-5050-714-0001'
>  from gc
> Feb 11 23:22:13 ip-10-102-188-213.ec2.internal mesos-slave[793]: I0211 
> 23:22:13.484648   797 gc.cpp:84] Unscheduling 
> '/tmp/mesos/meta/slaves/20150211-045421-1401302794-5050-714-S0/frameworks/20150211-045421-1401302794-5050-714-0001'
>  from gc
> Feb 11 23:22:13 ip-10-102-188-213.ec2.internal mesos-slave[793]: I0211 
> 23:22:13.484748   797 slave.cpp:1253] Launching task 
> ct:1423696932164:2:canary: for framework 
> 20150211-045421-1401302794-5050-714-0001
> Feb 11 23:22:13 ip-10-102-188-213.ec2.internal mesos-slave[793]: I0211 
> 23:22:13.485697   797 slave.cpp:4297] Checkpointing ExecutorInfo to 
> '/tmp/mesos/meta/slaves/20150211-045421-1401302794-5050-714-S0/frameworks/20150211-045421-1401302794-5050-714-0001/executors/ct:1423696932164:2:canary:/executor.info'
> Feb 11 23:22:13 ip-10-102-188-213.ec2.internal mesos-slave[793]: I0211 
> 23:22:13.485999   797 slave.cpp:3929] Launching executor 
> ct:1423696932164:2:canary: of framework 
> 20150211-045421-1401302794-5050-714-0001 in work directory 
> '/tmp/mesos/slaves/20150211-045421-1401302794-5050-714-S0/frameworks/20150211-045421-1401302794-5050-714-0001/executors/ct:1423696932164:2:canary:/runs/5395b133-d10d-4204-999e-4a38c03c55f5'
> Feb 11 23:22:13 ip-10-102-188-213.ec2.internal mesos-slave[793]: I0211 
> 23:22:13.486212   797 slave.cpp:4320] Checkpointing TaskInfo to 
> '/tmp/mesos/meta/slaves/20150211-045421-1401302794-5050-714-S0/frameworks/20150211-045421-1401302794-5050-714-0001/executors/ct:1423696932164:2:canary:/runs/5395b133-d10d-4204-999e-4a38c03c55f5/tasks/ct:1423696932164:2:canary:/task.info'
> Feb 11 23:22:13 ip-10-102-188-213.ec2.internal mesos-slave[793]: I0211 
> 23:22:13.509457   797 slave.cpp:1376] Queuing task 
> 'ct:1423696932164:2:canary:' for executor ct:1423696932164:2:canary: of 
> framework '20150211-045421-1401302794-5050-714-0001
> Feb 11 23:22:13 ip-10-102-188-213.ec2.internal mesos-slave[793]: I0211 
> 23:22:13.510926   797 slave.cpp:574] Successfully attached file 
> '/tmp/mesos/slaves/20150211-045421-1401302794-5050-714-S0/frameworks/20150211-045421-1401302794-5050-714-0001/executors/ct:1423696932164:2:canary:/runs/5395b133-d10d-4204-999e-4a38c03c55f5'
> Feb 11 23:22:13 ip-10-102-188-213.ec2.internal mesos-slave[793]: I0211 
> 23:22:13.516738   799 docker.cpp:581] Starting container 
> '5395b133-d10d-4204-999e-4a38c03c55f5' for task 

[jira] [Created] (MESOS-5892) Volume container_path should be forbidden to be the container sandbox.

2016-07-22 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-5892:
---

 Summary: Volume container_path should be forbidden to be the 
container sandbox.
 Key: MESOS-5892
 URL: https://issues.apache.org/jira/browse/MESOS-5892
 Project: Mesos
  Issue Type: Bug
  Components: containerization, volumes
Reporter: Gilbert Song
Assignee: Gilbert Song


For either local volumes or docker external volumes, the container path should 
not be identical to the container sandbox. Otherwise, persistent volumes and 
logs inside the sandbox will be overwritten.
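
A minimal sketch of the check this implies, with hypothetical helper names 
rather than the actual Mesos validation code: reject a volume whose 
container_path would be mounted directly over the sandbox.

{code}
#include <iostream>
#include <string>

// Hypothetical helper (not the actual Mesos validation code): reject a
// volume whose container_path would be mounted directly over the sandbox.
// Paths are assumed to be already normalized.
bool isValidContainerPath(
    const std::string& containerPath,
    const std::string& sandboxPath)
{
  // "." as a relative path, or an absolute path equal to the sandbox,
  // would shadow the sandbox directory and its logs.
  return containerPath != "." && containerPath != sandboxPath;
}

int main()
{
  const std::string sandbox = "/var/lib/mesos/agent/slaves/S0/.../runs/R0";

  std::cout << isValidContainerPath("data", sandbox) << std::endl;   // 1: ok
  std::cout << isValidContainerPath(sandbox, sandbox) << std::endl;  // 0: rejected
  return 0;
}
{code}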



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5879) cgroups/net_cls isolator causing agent recovery issues

2016-07-22 Thread Silas Snider (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390377#comment-15390377
 ] 

Silas Snider commented on MESOS-5879:
-

Do you want to track this bug in this issue? Or should I file a new one?

> cgroups/net_cls isolator causing agent recovery issues
> --
>
> Key: MESOS-5879
> URL: https://issues.apache.org/jira/browse/MESOS-5879
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups, isolation, slave
>Reporter: Silas Snider
>Assignee: Avinash Sridharan
>
> We run with 'cgroups/net_cls' in our isolator list, and when we restart any 
> agent process in a cluster running an experimental custom isolator as well, 
> the agents are unable to recover from checkpoint, because net_cls reports 
> that unknown orphan containers have duplicate net_cls handles.
> While this is a problem that needs to be solved (probably by fixing our 
> custom isolator), it's also a problem that the net_cls isolator fails 
> recovery just for duplicate handles in cgroups that it is literally about to 
> unconditionally destroy during recovery. Can this be fixed?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5879) cgroups/net_cls isolator causing agent recovery issues

2016-07-22 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390376#comment-15390376
 ] 

Avinash Sridharan commented on MESOS-5879:
--

I think we can close this issue?

> cgroups/net_cls isolator causing agent recovery issues
> --
>
> Key: MESOS-5879
> URL: https://issues.apache.org/jira/browse/MESOS-5879
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups, isolation, slave
>Reporter: Silas Snider
>Assignee: Avinash Sridharan
>
> We run with 'cgroups/net_cls' in our isolator list, and when we restart any 
> agent process in a cluster running an experimental custom isolator as well, 
> the agents are unable to recover from checkpoint, because net_cls reports 
> that unknown orphan containers have duplicate net_cls handles.
> While this is a problem that needs to be solved (probably by fixing our 
> custom isolator), it's also a problem that the net_cls isolator fails 
> recovery just for duplicate handles in cgroups that it is literally about to 
> unconditionally destroy during recovery. Can this be fixed?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5879) cgroups/net_cls isolator causing agent recovery issues

2016-07-22 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390375#comment-15390375
 ] 

Avinash Sridharan commented on MESOS-5879:
--

Great! Thanks for triaging this!

> cgroups/net_cls isolator causing agent recovery issues
> --
>
> Key: MESOS-5879
> URL: https://issues.apache.org/jira/browse/MESOS-5879
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups, isolation, slave
>Reporter: Silas Snider
>Assignee: Avinash Sridharan
>
> We run with 'cgroups/net_cls' in our isolator list, and when we restart any 
> agent process in a cluster running an experimental custom isolator as well, 
> the agents are unable to recover from checkpoint, because net_cls reports 
> that unknown orphan containers have duplicate net_cls handles.
> While this is a problem that needs to be solved (probably by fixing our 
> custom isolator), it's also a problem that the net_cls isolator fails 
> recovery just for duplicate handles in cgroups that it is literally about to 
> unconditionally destroy during recovery. Can this be fixed?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5834) Mesos may pass --volume-driver to the Docker daemon multiple times.

2016-07-22 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-5834:
--
Fix Version/s: (was: 1.1.0)

> Mesos may pass --volume-driver to the Docker daemon multiple times.
> ---
>
> Key: MESOS-5834
> URL: https://issues.apache.org/jira/browse/MESOS-5834
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.0.0
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>  Labels: mesosphere
> Fix For: 1.0.0
>
>
> https://github.com/apache/mesos/blob/master/src/docker/docker.cpp#L590 will 
> append the "--volume-driver" flag to argv once per Volume.
> According to https://github.com/docker/docker/issues/16069 this flag can only 
> be specified once.
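
A rough sketch of the fix this suggests, using a hypothetical helper rather 
than the actual docker.cpp code: collect the driver across all volumes, 
require it to be unique, and emit the flag at most once instead of once per 
volume.

{code}
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sketch (not the actual docker.cpp code): gather the driver
// across all volumes, require it to be unique, and emit the
// "--volume-driver" flag at most once instead of once per volume.
std::vector<std::string> volumeDriverArgs(
    const std::vector<std::string>& driversPerVolume)
{
  std::string driver;

  for (const std::string& d : driversPerVolume) {
    if (d.empty()) {
      continue;  // This volume does not specify a driver.
    }
    if (!driver.empty() && driver != d) {
      throw std::runtime_error(
          "Docker only supports a single --volume-driver per container");
    }
    driver = d;
  }

  if (driver.empty()) {
    return {};
  }
  return {"--volume-driver=" + driver};
}
{code}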



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5834) Mesos may pass --volume-driver to the Docker daemon multiple times.

2016-07-22 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-5834:
--
Fix Version/s: 1.0.0

> Mesos may pass --volume-driver to the Docker daemon multiple times.
> ---
>
> Key: MESOS-5834
> URL: https://issues.apache.org/jira/browse/MESOS-5834
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.0.0
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>  Labels: mesosphere
> Fix For: 1.0.0
>
>
> https://github.com/apache/mesos/blob/master/src/docker/docker.cpp#L590 will 
> append the "--volume-driver" flag to argv once per Volume.
> According to https://github.com/docker/docker/issues/16069 this flag can only 
> be specified once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5389) docker containerizer should prefix relative volume.container_path values with the path to the sandbox

2016-07-22 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-5389:
--
Fix Version/s: (was: 1.1.0)

> docker containerizer should prefix relative volume.container_path values with 
> the path to the sandbox
> -
>
> Key: MESOS-5389
> URL: https://issues.apache.org/jira/browse/MESOS-5389
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, isolation
>Reporter: James DeFelice
>Assignee: Gilbert Song
>  Labels: docker, mesosphere, storage, volumes
> Fix For: 1.00
>
>
> The Docker containerizer currently requires absolute paths for values of 
> volume.container_path. This is inconsistent with the Mesos containerizer, 
> which requires a relative container_path, and it makes for a confusing API, 
> both at the Mesos level and at the Marathon level.
> Ideally the Docker containerizer would allow a framework to specify a 
> relative path for volume.container_path and in such cases automatically 
> convert it to an absolute path by prepending the sandbox directory to it.
> /cc [~jieyu]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5879) cgroups/net_cls isolator causing agent recovery issues

2016-07-22 Thread Silas Snider (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390373#comment-15390373
 ] 

Silas Snider commented on MESOS-5879:
-

We've discovered the issue. When we launch tasks, we create a child cgroup in 
all the mesos cgroups to run part of our tasks inside. When we do this in 
net_cls, the agent fails to recover because it's detecting a child container 
that has the same classid as the parent, which is totally valid in this case.

We didn't realize this was happening because our child container names are 
similar to mesos container names.
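
For illustration, a small standalone program (hypothetical cgroup names; 
assumes the cgroup v1 net_cls layout) showing the parent and a nested cgroup 
carrying the same classid, which is the default behavior described above.

{code}
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>

// Reads a cgroup's net_cls classid (cgroup v1 layout assumed); returns 0
// on error to keep the sketch short.
static uint32_t readClassid(const std::string& cgroup)
{
  std::ifstream in("/sys/fs/cgroup/net_cls/" + cgroup + "/net_cls.classid");
  uint32_t classid = 0;
  in >> classid;
  return classid;
}

int main()
{
  // Hypothetical cgroup names: a Mesos container cgroup and a nested
  // cgroup created inside it by a custom isolator. By default the nested
  // cgroup carries the same classid as its parent.
  uint32_t parent = readClassid("mesos/container-abc");
  uint32_t child  = readClassid("mesos/container-abc/my-nested-cgroup");

  std::cout << "parent=" << parent << " child=" << child
            << (parent == child ? " (same classid, the default)" : "")
            << std::endl;
  return 0;
}
{code}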

> cgroups/net_cls isolator causing agent recovery issues
> --
>
> Key: MESOS-5879
> URL: https://issues.apache.org/jira/browse/MESOS-5879
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups, isolation, slave
>Reporter: Silas Snider
>Assignee: Avinash Sridharan
>
> We run with 'cgroups/net_cls' in our isolator list, and when we restart any 
> agent process in a cluster running an experimental custom isolator as well, 
> the agents are unable to recover from checkpoint, because net_cls reports 
> that unknown orphan containers have duplicate net_cls handles.
> While this is a problem that needs to be solved (probably by fixing our 
> custom isolator), it's also a problem that the net_cls isolator fails 
> recovery just for duplicate handles in cgroups that it is literally about to 
> unconditionally destroy during recovery. Can this be fixed?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5389) docker containerizer should prefix relative volume.container_path values with the path to the sandbox

2016-07-22 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-5389:
--
Fix Version/s: 1.00

> docker containerizer should prefix relative volume.container_path values with 
> the path to the sandbox
> -
>
> Key: MESOS-5389
> URL: https://issues.apache.org/jira/browse/MESOS-5389
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, isolation
>Reporter: James DeFelice
>Assignee: Gilbert Song
>  Labels: docker, mesosphere, storage, volumes
> Fix For: 1.1.0, 1.00
>
>
> The Docker containerizer currently requires absolute paths for values of 
> volume.container_path. This is inconsistent with the Mesos containerizer, 
> which requires a relative container_path, and it makes for a confusing API, 
> both at the Mesos level and at the Marathon level.
> Ideally the Docker containerizer would allow a framework to specify a 
> relative path for volume.container_path and in such cases automatically 
> convert it to an absolute path by prepending the sandbox directory to it.
> /cc [~jieyu]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4188) Executing "mesos-slave.sh" should use root privilege

2016-07-22 Thread Venkatesh Jayapal (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390361#comment-15390361
 ] 

Venkatesh Jayapal commented on MESOS-4188:
--

Hi, 

I created a new user (user1, added to the sudoers list) on my Ubuntu 14 Vagrant 
box and followed the instructions to install and build Mesos from scratch: 
http://mesos.apache.org/gettingstarted/

When I ran mesos-slave, I got the same message. But the Python examples 
(provided in the installation bundle) run fine without any error. 

user1@vagrant-ubuntu-trusty-64:~/mesos-0.28.2/build/bin$ ./mesos-slave.sh 
--master=127.0.0.1:5050 --work_dir=/home/user1/messos
I0722 22:16:33.808300 20527 main.cpp:223] Build: 2016-07-22 21:49:46 by user1
I0722 22:16:33.810518 20527 main.cpp:225] Version: 0.28.2
I0722 22:16:33.817015 20527 containerizer.cpp:149] Using isolation: 
posix/cpu,posix/mem,filesystem/posix
W0722 22:16:33.819242 20527 backend.cpp:66] Failed to create 'bind' backend: 
BindBackend requires root privileges
I0722 22:16:33.822891 20527 main.cpp:328] Starting Mesos slave
I0722 22:16:33.824800 20527 slave.cpp:193] Slave started on 1)@10.0.2.15:5051

So, can I ignore this warning message, or is it a serious issue?


> Executing "mesos-slave.sh" should use root privilege
> 
>
> Key: MESOS-4188
> URL: https://issues.apache.org/jira/browse/MESOS-4188
> Project: Mesos
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 0.25.0
>Reporter: Nan Xiao
>
> In the Examples section of the "Getting Started" document 
> (http://mesos.apache.org/gettingstarted/):  
> # Start mesos slave.
> $ ./bin/mesos-slave.sh --master=127.0.0.1:5050
> But without "root" privilege, it will output:  
> W1217 05:52:42.213497 24074 backend.cpp:50] Failed to create 'bind' backend: 
> BindBackend requires root privileges
> This will affect some scenarios, such as "kubernetes on Mesos".
> Please check it, thanks very much!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5885) max_executors_per_agent does not take effect on mesos docker executor

2016-07-22 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-5885:
--
Affects Version/s: 0.27.3
   0.28.2

> max_executors_per_agent does not take effect on mesos docker executor
> -
>
> Key: MESOS-5885
> URL: https://issues.apache.org/jira/browse/MESOS-5885
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.27.3, 0.28.2, 1.0.0
> Environment: centos 7.2
>Reporter: Qi Feng
>  Labels: mesosphere
>
> I built mesos-1.0.0-rc2 with the network isolator on CentOS 7.2 and tried 
> setting max_executors_per_agent=10 to test whether Docker tasks would be 
> limited to 10 on every Mesos agent. In my case, launching 40 tasks (0.1 core, 
> 0.1M mem each) across three different agent machines, each agent launched 
> more than 10 tasks.
> I found that the Mesos master holds executor data in a hashmap, and the key 
> is the ExecutorID.
> https://github.com/apache/mesos/blob/1.0.x/src/master/master.hpp#L306
> https://github.com/apache/mesos/blob/1.0.x/src/master/master.cpp#L5747
> I then fetched state.json from the Mesos master to look for executor 
> information and found that executor_id is an empty string in the TaskInfo 
> JSON. Is there any relation between the empty executor_id and the 
> max_executors_per_agent issue?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5885) max_executors_per_agent does not take effect on mesos docker executor

2016-07-22 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-5885:
--
Labels: mesosphere  (was: )

> max_executors_per_agent does not take effect on mesos docker executor
> -
>
> Key: MESOS-5885
> URL: https://issues.apache.org/jira/browse/MESOS-5885
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.27.3, 0.28.2, 1.0.0
> Environment: centos 7.2
>Reporter: Qi Feng
>  Labels: mesosphere
>
> I built mesos-1.0.0-rc2 with the network isolator on CentOS 7.2 and tried 
> setting max_executors_per_agent=10 to test whether Docker tasks would be 
> limited to 10 on every Mesos agent. In my case, launching 40 tasks (0.1 core, 
> 0.1M mem each) across three different agent machines, each agent launched 
> more than 10 tasks.
> I found that the Mesos master holds executor data in a hashmap, and the key 
> is the ExecutorID.
> https://github.com/apache/mesos/blob/1.0.x/src/master/master.hpp#L306
> https://github.com/apache/mesos/blob/1.0.x/src/master/master.cpp#L5747
> I then fetched state.json from the Mesos master to look for executor 
> information and found that executor_id is an empty string in the TaskInfo 
> JSON. Is there any relation between the empty executor_id and the 
> max_executors_per_agent issue?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5885) max_executors_per_agent does not take effect on mesos docker executor

2016-07-22 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390341#comment-15390341
 ] 

Jie Yu commented on MESOS-5885:
---

This is unfortunately a bug in Mesos and tech debt that we need to fix. 
Currently, the master only keeps track of those executors of an agent that 
have an ExecutorInfo (i.e., custom executors). For command tasks, the master 
does not track them in `slave->executors`. I think the true fix is to keep 
track of the ExecutorInfo for command tasks as well, so that the whole code 
base is consistent about the hierarchy (slave -> frameworks -> executors -> 
tasks), especially when we want to introduce a Pod-like concept in Mesos.

A short-term fix might be to loop through `slave->tasks` as well and account 
for command tasks.
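
A rough sketch of that short-term accounting idea, using simplified stand-in 
types rather than the actual master data structures:

{code}
#include <map>
#include <set>
#include <string>

// Rough stand-ins, not the actual master data structures: `executors`
// models `slave->executors` (executors that have an ExecutorInfo), while
// `taskExecutorIds` models the executor IDs referenced by `slave->tasks`,
// which also covers command tasks.
size_t countExecutorsOnAgent(
    const std::map<std::string, std::string>& executors,   // id -> info
    const std::set<std::string>& taskExecutorIds)
{
  std::set<std::string> all;

  for (const auto& entry : executors) {
    all.insert(entry.first);
  }

  // The short-term fix: also walk the tasks so command-task executors,
  // absent from `slave->executors`, are counted against the limit.
  all.insert(taskExecutorIds.begin(), taskExecutorIds.end());

  return all.size();
}
{code}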

> max_executors_per_agent does not take effect on mesos docker executor
> -
>
> Key: MESOS-5885
> URL: https://issues.apache.org/jira/browse/MESOS-5885
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.27.3, 0.28.2, 1.0.0
> Environment: centos 7.2
>Reporter: Qi Feng
>  Labels: mesosphere
>
> I built mesos-1.0.0-rc2 with the network isolator on CentOS 7.2 and tried 
> setting max_executors_per_agent=10 to test whether Docker tasks would be 
> limited to 10 on every Mesos agent. In my case, launching 40 tasks (0.1 core, 
> 0.1M mem each) across three different agent machines, each agent launched 
> more than 10 tasks.
> I found that the Mesos master holds executor data in a hashmap, and the key 
> is the ExecutorID.
> https://github.com/apache/mesos/blob/1.0.x/src/master/master.hpp#L306
> https://github.com/apache/mesos/blob/1.0.x/src/master/master.cpp#L5747
> I then fetched state.json from the Mesos master to look for executor 
> information and found that executor_id is an empty string in the TaskInfo 
> JSON. Is there any relation between the empty executor_id and the 
> max_executors_per_agent issue?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5891) /help endpoint does not set Content-Type to HTML

2016-07-22 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390332#comment-15390332
 ] 

Joseph Wu commented on MESOS-5891:
--

{code}
commit 105eca66a63457ee509c58f8415a1a4d626b352a
Author: Joseph Wu 
Date:   Fri Jul 22 15:11:36 2016 -0700

Added an appropriate content type for the /help endpoints.

The `Content-Type` header was set to "text/plain" by default in all
responses here:
https://reviews.apache.org/r/46725/

This had the adverse consequence of changing the `/help` endpoints
into plain text.  Previously, the browser would see some ``
tags and assume the content was HTML.

Review: https://reviews.apache.org/r/50362
{code}
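
In essence, the fix sets an HTML content type on the /help response. A minimal 
sketch using libprocess HTTP types (the actual patch in 
https://reviews.apache.org/r/50362 may construct the response differently):

{code}
#include <string>

#include <process/http.hpp>

// Minimal sketch using libprocess HTTP types; the actual patch may
// construct the response differently.
process::http::Response helpResponse(const std::string& renderedHelp)
{
  process::http::OK response(renderedHelp);

  // Without this, the default "text/plain" makes browsers render the
  // generated help markup as plain text instead of HTML.
  response.headers["Content-Type"] = "text/html";

  return response;
}
{code}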

> /help endpoint does not set Content-Type to HTML
> 
>
> Key: MESOS-5891
> URL: https://issues.apache.org/jira/browse/MESOS-5891
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Joseph Wu
>Priority: Critical
>  Labels: mesosphere
> Fix For: 1.0.0, 1.1.0
>
>
> This change added a default {{Content-Type}} to all responses:
> https://github.com/apache/mesos/commit/b2c5d91addbae609af3791f128c53fb3a26c7d53
> Unfortunately, this changed the {{/help}} endpoint from no {{Content-Type}} 
> to {{text/plain}}.  For a browser to render this page correctly, we need an 
> HTML content type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5891) /help endpoint does not set Content-Type to HTML

2016-07-22 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-5891:


 Summary: /help endpoint does not set Content-Type to HTML
 Key: MESOS-5891
 URL: https://issues.apache.org/jira/browse/MESOS-5891
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Joseph Wu
Priority: Critical
 Fix For: 1.0.0, 1.1.0


This change added a default {{Content-Type}} to all responses:
https://github.com/apache/mesos/commit/b2c5d91addbae609af3791f128c53fb3a26c7d53

Unfortunately, this changed the {{/help}} endpoint from no {{Content-Type}} to 
{{text/plain}}.  For a browser to render this page correctly, we need an HTML 
content type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3341) Introduce Resource Resolution

2016-07-22 Thread Erik Weathers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390317#comment-15390317
 ] 

Erik Weathers commented on MESOS-3341:
--

Just spoke with [~jessicalhartog] about this; we believe that MESOS-4687 has 
fixed this issue, so I'm resolving this one.

> Introduce Resource Resolution
> -
>
> Key: MESOS-3341
> URL: https://issues.apache.org/jira/browse/MESOS-3341
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Jessica Hartog
>Priority: Minor
>
> After MESOS-1807, Mesos containers require >= 0.01 CPU resources and >= 32MB 
> Memory resources. In order to simplify accounting, Mesos should introduce 
> resource resolution.
> For example, it is possible to launch a task with 1.1 CPU (as it 
> exceeds the minimum number of CPUs and is therefore considered valid). The 
> fractional component of this task does not benefit the running process, and 
> can introduce floating point errors when Mesos is accounting its offers 
> (which we have already seen happening in MESOS-1867 and MESOS-2635). A 
> solution to this could be disallowing tasks with finer granularity than the 
> required resolution.
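
The floating-point concern is easy to reproduce outside of Mesos. A standalone 
illustration (not Mesos code) of the drift, and of fixed-point style accounting 
(similar in spirit to the fix referenced above) that avoids it:

{code}
#include <cstdint>
#include <iostream>

int main()
{
  // Standalone illustration (not Mesos code): accounting fractional CPUs
  // with doubles drifts, because 0.1 has no exact binary representation.
  double cpus = 0.0;
  for (int i = 0; i < 10; i++) {
    cpus += 0.1;  // ten tasks of 0.1 CPU each
  }
  std::cout << (cpus == 1.0) << std::endl;  // prints 0: sum is not exactly 1.0

  // Fixed-point accounting at a resolution of 0.001 stays exact.
  int64_t millicpus = 0;
  for (int i = 0; i < 10; i++) {
    millicpus += 100;
  }
  std::cout << (millicpus == 1000) << std::endl;  // prints 1
  return 0;
}
{code}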



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4436) Propose design doc for fixed-point scalar resources

2016-07-22 Thread Erik Weathers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390316#comment-15390316
 ] 

Erik Weathers commented on MESOS-4436:
--

Isn't this a duplicate of MESOS-4545?  Rather than "won't fix"?

> Propose design doc for fixed-point scalar resources
> ---
>
> Key: MESOS-4436
> URL: https://issues.apache.org/jira/browse/MESOS-4436
> Project: Mesos
>  Issue Type: Task
>  Components: general
>Reporter: Neil Conway
>  Labels: mesosphere, resources
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4778) Add appc/runtime isolator for runtime isolation for appc images.

2016-07-22 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390311#comment-15390311
 ] 

Jie Yu commented on MESOS-4778:
---

commit b9c01fc36452ea9e4375e622ab9ac94000eef4b0
Author: Srinivas Brahmaroutu 
Date:   Fri Jul 22 15:31:45 2016 -0700

Added implementation to Appc Runtime Isolator.

Review: https://reviews.apache.org/r/49348/

> Add appc/runtime isolator for runtime isolation for appc images.
> 
>
> Key: MESOS-4778
> URL: https://issues.apache.org/jira/browse/MESOS-4778
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Jie Yu
>Assignee: Srinivas
>  Labels: containerizer, isolator
>
> Appc image also contains runtime information like 'exec', 'env', 
> 'workingDirectory' etc.
> https://github.com/appc/spec/blob/master/spec/aci.md
> Similar to docker images, we need to support a subset of them (mainly 'exec', 
> 'env' and 'workingDirectory').



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4778) Add appc/runtime isolator for runtime isolation for appc images.

2016-07-22 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390298#comment-15390298
 ] 

Jie Yu commented on MESOS-4778:
---

commit 0eec82f3b769ce26ccc31fbfb2777e3991864a9f
Author: Srinivas Brahmaroutu 
Date:   Fri Jul 22 15:23:09 2016 -0700

Added appcManifest to ImageInfo and ProvisionInfo.

Review: https://reviews.apache.org/r/49232/

> Add appc/runtime isolator for runtime isolation for appc images.
> 
>
> Key: MESOS-4778
> URL: https://issues.apache.org/jira/browse/MESOS-4778
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Jie Yu
>Assignee: Srinivas
>  Labels: containerizer, isolator
>
> Appc image also contains runtime information like 'exec', 'env', 
> 'workingDirectory' etc.
> https://github.com/appc/spec/blob/master/spec/aci.md
> Similar to docker images, we need to support a subset of them (mainly 'exec', 
> 'env' and 'workingDirectory').



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5855) Create a 'Disk (not) full' example framework

2016-07-22 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390210#comment-15390210
 ] 

Joseph Wu commented on MESOS-5855:
--

|| Review || Summary ||
| https://reviews.apache.org/r/46626/ | Framework |
| https://reviews.apache.org/r/50217/ | Examples tests |

> Create a 'Disk (not) full' example framework
> 
>
> Key: MESOS-5855
> URL: https://issues.apache.org/jira/browse/MESOS-5855
> Project: Mesos
>  Issue Type: Task
>Reporter: Artem Harutyunyan
>Assignee: Artem Harutyunyan
>Priority: Minor
>  Labels: mesosphere
>
> We need example frameworks for verifying the correct behavior of the 
> posix/disk isolator when disk quota enforcement is in place: one framework for 
> verifying that disk quota enforcement is working and that the container gets 
> terminated when it goes beyond its disk quota, and another for verifying that 
> the container does not get killed if it stays within its disk quota bounds. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5882) `os::cloexec` does not exist on Windows

2016-07-22 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390182#comment-15390182
 ] 

Joseph Wu commented on MESOS-5882:
--

Here's a list of callsites for {{os::cloexec}} and a summary of why 
{{os::cloexec}} matters in the callsite:
|| Location || Reason ||
| 
https://github.com/apache/mesos/blob/17a1e58d3f48d866ac5132cc28b2f33c2e287aac/3rdparty/libprocess/src/subprocess.cpp#L140-L144
 | When we open a PIPE to a subprocess, those pipes are kept open in the 
parent.  We need CLOEXEC to prevent future subprocesses from inheriting these 
incorrectly. |
| 
https://github.com/apache/mesos/blob/17a1e58d3f48d866ac5132cc28b2f33c2e287aac/3rdparty/libprocess/src/io.cpp#L264
  
https://github.com/apache/mesos/blob/17a1e58d3f48d866ac5132cc28b2f33c2e287aac/3rdparty/libprocess/src/io.cpp#L419
  
https://github.com/apache/mesos/blob/17a1e58d3f48d866ac5132cc28b2f33c2e287aac/3rdparty/libprocess/src/io.cpp#L476
  
https://github.com/apache/mesos/blob/17a1e58d3f48d866ac5132cc28b2f33c2e287aac/3rdparty/libprocess/src/io.cpp#L535
 | {{io::peek}}, {{io::read}}, {{io::write}}, and {{io::redirect}} duplicate an 
FD to control the FD lifecycle asynchronously.  We CLOEXEC to prevent FD leaks. 
|
| 
https://github.com/apache/mesos/blob/17a1e58d3f48d866ac5132cc28b2f33c2e287aac/3rdparty/libprocess/src/socket.cpp#L62
 | Sockets will CLOEXEC on create to prevent leaks. |
| 
https://github.com/apache/mesos/blob/17a1e58d3f48d866ac5132cc28b2f33c2e287aac/3rdparty/libprocess/src/libevent_ssl_socket.cpp#L490
  
https://github.com/apache/mesos/blob/17a1e58d3f48d866ac5132cc28b2f33c2e287aac/3rdparty/libprocess/src/libevent_ssl_socket.cpp#L827
 | SSL sockets CLOEXEC to prevent leaks.  This happens on socket create 
(because we duplicate the FD to control lifetime separately) and accept. |
| 
https://github.com/apache/mesos/blob/17a1e58d3f48d866ac5132cc28b2f33c2e287aac/3rdparty/libprocess/src/poll_socket.cpp#L69
 | Poll sockets CLOEXEC to prevent leaks.  This happens on accept. |
| 
https://github.com/apache/mesos/blob/17a1e58d3f48d866ac5132cc28b2f33c2e287aac/3rdparty/stout/include/stout/os/posix/write.hpp#L66-L69
 | A {{os::write}} helper CLOEXECs FDs to prevent leaks. |
| 
https://github.com/apache/mesos/blob/17a1e58d3f48d866ac5132cc28b2f33c2e287aac/3rdparty/stout/include/stout/os/open.hpp#L63-L81
 | {{os::open}} has some CLOEXEC logic, but doesn't CLOEXEC by default. |
| All calls to {{os::open}} except in:
  
https://github.com/apache/mesos/blob/17a1e58d3f48d866ac5132cc28b2f33c2e287aac/3rdparty/stout/include/stout/os/sunos.hpp
  
https://github.com/apache/mesos/blob/17a1e58d3f48d866ac5132cc28b2f33c2e287aac/3rdparty/stout/include/stout/os/touch.hpp#L34
  A couple tests | Except for some Solaris code (possibly not supported 
anymore) and {{os::touch}} (which opens and immediately closes an FD; this 
appears to be a possible leak), we CLOEXEC everywhere outside of tests to 
prevent leaks. |
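
For reference, on POSIX the operation all of these callsites rely on is just an 
fcntl flag; a condensed sketch (close in spirit to the stout implementation) is 
below, and it is this flag that has no direct OS-level equivalent on Windows.

{code}
#include <fcntl.h>

// Condensed sketch of the POSIX operation the callsites above rely on:
// mark an fd close-on-exec so exec'd children do not inherit it.
// Returns false on error.
bool cloexec(int fd)
{
  int flags = ::fcntl(fd, F_GETFD);
  if (flags == -1) {
    return false;
  }
  return ::fcntl(fd, F_SETFD, flags | FD_CLOEXEC) != -1;
}
{code}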

> `os::cloexec` does not exist on Windows
> ---
>
> Key: MESOS-5882
> URL: https://issues.apache.org/jira/browse/MESOS-5882
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Alex Clemmer
>Assignee: Alex Clemmer
>  Labels: mesosphere, stout
>
> `os::cloexec` does not work on Windows. It will never work at the OS level. 
> Because of this, there are likely many important and hard-to-detect bugs 
> hanging around the agent.
> This is extremely important to fix. Some possible solutions to investigate 
> (some of which are _extremely_ risky):
> * Abstract out file descriptors into a class, implement cloexec in that class 
> on Windows (since we can't rely on the OS to do it).
> * Refactor all the code that relies on `os::cloexec` to not rely on it.
> Of the two, the first seems less risky in the short term, because the cloexec 
> code only affects Windows. Depending on the semantics of the implementation 
> of the `FileDescriptor` class, it is possible that this is riskier for 
> Windows in the longer term, as the semantics of `cloexec` may have subtle 
> differences between Linux and Windows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5890) Support setting resources limits for containers.

2016-07-22 Thread Jie Yu (JIRA)
Jie Yu created MESOS-5890:
-

 Summary: Support setting resources limits for containers.
 Key: MESOS-5890
 URL: https://issues.apache.org/jira/browse/MESOS-5890
 Project: Mesos
  Issue Type: Epic
Reporter: Jie Yu


On Linux, resource limits can be set using setrlimit:
http://man7.org/linux/man-pages/man2/setrlimit.2.html

Setting resource limits is recommended for many big data frameworks like 
Cassandra and Kafka.

It would be nice if Mesos could expose an API allowing frameworks to choose 
the resource limits for their containers.
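
A plain POSIX example (not a Mesos API) of setting such a limit, here the 
open-file limit that frameworks like Cassandra and Kafka typically tune:

{code}
#include <sys/resource.h>

#include <cstdio>

int main()
{
  // Plain POSIX example (not a Mesos API): raise the soft limit on open
  // file descriptors for the current process.
  struct rlimit limit;

  if (getrlimit(RLIMIT_NOFILE, &limit) != 0) {
    perror("getrlimit");
    return 1;
  }

  limit.rlim_cur = limit.rlim_max;  // soft limit up to the hard limit

  if (setrlimit(RLIMIT_NOFILE, &limit) != 0) {
    perror("setrlimit");
    return 1;
  }

  printf("RLIMIT_NOFILE soft limit is now %llu\n",
         (unsigned long long) limit.rlim_cur);
  return 0;
}
{code}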



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3910) Libprocess: Implement cleanup of the SocketManager in process::finalize

2016-07-22 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-3910:
-
Sprint: Mesosphere Sprint 40

> Libprocess: Implement cleanup of the SocketManager in process::finalize
> ---
>
> Key: MESOS-3910
> URL: https://issues.apache.org/jira/browse/MESOS-3910
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess, test
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> The {{socket_manager}} and {{process_manager}} are intricately tied together. 
>  Currently, only the {{process_manager}} is cleaned up by 
> {{process::finalize}}.
> To clean up the {{socket_manager}}, we must close all sockets and deallocate 
> any existing {{HttpProxy}} or {{Encoder}} objects.  And we should prevent 
> further objects from being created/tracked by the {{socket_manager}}.
> *Proposal*
> # Clean up all processes other than {{gc}}.  This will clear all links and 
> delete all {{HttpProxy}} s while {{socket_manager}} still exists.
> # Close all sockets via {{SocketManager::close}}.  All of {{socket_manager}} 
> 's state is cleaned up via {{SocketManager::close}}, including termination of 
> {{HttpProxy}} (termination is idempotent, meaning that killing {{HttpProxy}} 
> s via {{process_manager}} is safe).
> # At this point, {{socket_manager}} should be empty and only the {{gc}} 
> process should be running.  (Since we're finalizing, assume there are no 
> threads trying to spawn processes.)  {{socket_manager}} can be deleted.
> # {{gc}} can be deleted.  This is currently a leaked pointer, so we'll also 
> need to track and delete that.
> # {{process_manager}} should be devoid of processes, so we can proceed with 
> cleanup (join threads, stop the {{EventLoop}}, etc).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3934) Libprocess: Unify the initialization of the MetricsProcess and ReaperProcess

2016-07-22 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-3934:
-
Sprint: Mesosphere Sprint 40

> Libprocess: Unify the initialization of the MetricsProcess and ReaperProcess
> 
>
> Key: MESOS-3934
> URL: https://issues.apache.org/jira/browse/MESOS-3934
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess, test
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> Related to this 
> [TODO|https://github.com/apache/mesos/blob/aa0cd7ed4edf1184cbc592b5caa2429a8373e813/3rdparty/libprocess/src/process.cpp#L949-L950].
> The {{MetricsProcess}} and {{ReaperProcess}} are global processes 
> (singletons) which are initialized upon first use.  The two processes could 
> be initialized alongside the {{gc}}, {{help}}, {{logging}}, {{profiler}}, and 
> {{system}} (statistics) processes inside {{process::initialize}}.
> This is also necessary for libprocess re-initialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5855) Create a 'Disk (not) full' example framework

2016-07-22 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5855:
-
Shepherd: Joseph Wu
  Sprint: Mesosphere Sprint 40
Story Points: 3
  Labels: mesosphere  (was: )

> Create a 'Disk (not) full' example framework
> 
>
> Key: MESOS-5855
> URL: https://issues.apache.org/jira/browse/MESOS-5855
> Project: Mesos
>  Issue Type: Task
>Reporter: Artem Harutyunyan
>Assignee: Artem Harutyunyan
>Priority: Minor
>  Labels: mesosphere
>
> We need example frameworks for verifying the correct behavior of the 
> posix/disk isolator when disk quota enforcement is in place: one framework for 
> verifying that disk quota enforcement is working and that the container gets 
> terminated when it goes beyond its disk quota, and another for verifying that 
> the container does not get killed if it stays within its disk quota bounds. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5792) Add mesos tests to CMake (make check)

2016-07-22 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5792:
-
  Sprint: Mesosphere Sprint 40
Story Points: 8

> Add mesos tests to CMake (make check)
> -
>
> Key: MESOS-5792
> URL: https://issues.apache.org/jira/browse/MESOS-5792
> Project: Mesos
>  Issue Type: Improvement
>  Components: build
>Reporter: Srinivas
>Assignee: Srinivas
>  Labels: build, mesosphere
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Provide CMakeLists.txt and configuration files to build mesos tests using 
> CMake.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1718) Command executor can overcommit the slave.

2016-07-22 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390029#comment-15390029
 ] 

Vinod Kone commented on MESOS-1718:
---

The overcommit is on the agent and is not surfaced to the master or schedulers. 
In other words it shouldn't affect scheduling by frameworks.

> Command executor can overcommit the slave.
> --
>
> Key: MESOS-1718
> URL: https://issues.apache.org/jira/browse/MESOS-1718
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Benjamin Mahler
>
> Currently we give a small amount of resources to the command executor, in 
> addition to resources used by the command task:
> https://github.com/apache/mesos/blob/0.20.0-rc1/src/slave/slave.cpp#L2448
> {code: title=}
> ExecutorInfo Slave::getExecutorInfo(
> const FrameworkID& frameworkId,
> const TaskInfo& task)
> {
>   ...
> // Add an allowance for the command executor. This does lead to a
> // small overcommit of resources.
> executor.mutable_resources()->MergeFrom(
> Resources::parse(
>   "cpus:" + stringify(DEFAULT_EXECUTOR_CPUS) + ";" +
>   "mem:" + stringify(DEFAULT_EXECUTOR_MEM.megabytes())).get());
>   ...
> }
> {code}
> This leads to an overcommit of the slave. Ideally, for command tasks we can 
> "transfer" all of the task resources to the executor at the slave / isolation 
> level.
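
A tiny sketch of that "transfer" idea, using simplified stand-in types rather 
than the actual implementation:

{code}
// Simplified stand-in types, not the actual implementation: for a command
// task, the executor simply carries the task's resources, so nothing is
// added on top and the agent is never overcommitted.
struct Resources
{
  double cpus;
  double memMb;
};

Resources commandExecutorResources(const Resources& taskResources)
{
  // The "transfer" idea from the description: no DEFAULT_EXECUTOR_*
  // allowance is merged in; the task's resources become the executor's.
  return taskResources;
}
{code}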



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2602) Provide a way to "push" cluster state updates to a registered service.

2016-07-22 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-2602:
--
Summary: Provide a way to "push" cluster state updates to a registered 
service.  (was: Provide a way to "push" cluster state updates to a registered 
service. )

> Provide a way to "push" cluster state updates to a registered service.
> --
>
> Key: MESOS-2602
> URL: https://issues.apache.org/jira/browse/MESOS-2602
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Christos Kozyrakis
>Assignee: Zhitao Li
> Fix For: 1.0.0
>
>
> At the moment, service discovery systems like Mesos-DNS must periodically 
> pull the cluster state through state.json. This is extremely wasteful and 
> will not scale to large clusters. If the polling interval is low, the master 
> load will grow significantly. If the polling interval is high, there will be 
> added latency to service discovery. Moreover, the way state.json works right 
> now, one keeps reading the same information over and over again, including 
> info about tasks no longer running. 
> We can design an endpoint that allows a "push" approach for state 
> information. Here is one of the possible ways to set it up:
> - a service can hit the end point at (re)start to get information for all 
> currently running tasks. 
> - a service can also register itself to receive updates to task state 
> beyond that (i.e., notifications of tasks starting/ending/etc.). We may want to 
> add some qualifiers here, since service discovery systems care only about 
> certain types of updates.  
> This can be implemented through direct messaging, through a message queue, by 
> putting messages in etcd/zookeeper, etc. We should pick the way that is most 
> scalable. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-5871) KVM and Docker containerized mesos-slave, state.json always timeout

2016-07-22 Thread Deshi Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389803#comment-15389803
 ] 

Deshi Xiao edited comment on MESOS-5871 at 7/22/16 4:40 PM:


Finally we resolved it. We found that the machines could not reach each other 
by hostname in the KVM environment. Once we added hostnames to the machines, 
the Mesos slave worked like a charm.

But I don't know why. [~haosdent]


was (Author: xds2000):
finally we resolve it.  we found the all machine's hostname  can't reach each 
other in KVM environment. so when we add hostname to machine, then the mesos 
slave work like a charm.

but i don't know why?  [~haosdent[~haosdent]

> KVM and Docker containerized mesos-slave, state.json always timeout
> ---
>
> Key: MESOS-5871
> URL: https://issues.apache.org/jira/browse/MESOS-5871
> Project: Mesos
>  Issue Type: Bug
>Reporter: Deshi Xiao
>Assignee: haosdent
> Attachments: 34.pic_hd.jpg
>
>
> Please see the state.json timeout and the perf log from the affected machine.
> strace log:
> ```
> futex(0x14b5a48, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x1495f88, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource 
> temporarily unavailable)
> futex(0x1495f88, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x14999bc, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x140, 47666) 
> = 8
> futex(0x140, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x14b4290, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x1495f88, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource 
> temporarily unavailable)
> futex(0x1495f88, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x14999bc, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x140, 47696) 
> = 8
> futex(0x140, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x14b39f0, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x1495f88, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x14999bc, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x140, 47726) 
> = 8
> futex(0x140, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x14b3518, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x1495f88, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x14999bc, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x140, 47756) 
> = 8
> futex(0x140, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x14aa4d8, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x1495f88, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource 
> temporarily unavailable)
> futex(0x1495f88, FUTEX_WAKE_PRIVATE, 1) = 0
> exit_group(0)   = ?
> +++ exited with 0 +++
> ```
> ```
> CentOS Linux release 7.0.1406 (Core)
> 3.10.0-327.22.2.el7.x86_64
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5871) KVM and Docker containerized mesos-slave, state.json always timeout

2016-07-22 Thread Deshi Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389803#comment-15389803
 ] 

Deshi Xiao commented on MESOS-5871:
---

Finally we resolved it. We found that the machines could not reach each other 
by hostname in the KVM environment. Once we added hostnames to the machines, 
the Mesos slave worked like a charm.

But I don't know why. [~haosdent]

> KVM and Docker containerized mesos-slave, state.json always timeout
> ---
>
> Key: MESOS-5871
> URL: https://issues.apache.org/jira/browse/MESOS-5871
> Project: Mesos
>  Issue Type: Bug
>Reporter: Deshi Xiao
>Assignee: haosdent
> Attachments: 34.pic_hd.jpg
>
>
> Please see the state.json timeout and the perf log from the affected machine.
> strace log:
> ```
> futex(0x14b5a48, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x1495f88, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource 
> temporarily unavailable)
> futex(0x1495f88, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x14999bc, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x140, 47666) 
> = 8
> futex(0x140, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x14b4290, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x1495f88, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource 
> temporarily unavailable)
> futex(0x1495f88, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x14999bc, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x140, 47696) 
> = 8
> futex(0x140, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x14b39f0, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x1495f88, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x14999bc, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x140, 47726) 
> = 8
> futex(0x140, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x14b3518, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x1495f88, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x14999bc, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x140, 47756) 
> = 8
> futex(0x140, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x14aa4d8, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x1495f88, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource 
> temporarily unavailable)
> futex(0x1495f88, FUTEX_WAKE_PRIVATE, 1) = 0
> exit_group(0)   = ?
> +++ exited with 0 +++
> ```
> ```
> CentOS Linux release 7.0.1406 (Core)
> 3.10.0-327.22.2.el7.x86_64
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5871) KVM and Docker containerized mesos-slave, state.json always timeout

2016-07-22 Thread Deshi Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389789#comment-15389789
 ] 

Deshi Xiao commented on MESOS-5871:
---

"Not working" means that state.json always takes a long time to query.

> KVM and Docker containerized mesos-slave, state.json always timeout
> ---
>
> Key: MESOS-5871
> URL: https://issues.apache.org/jira/browse/MESOS-5871
> Project: Mesos
>  Issue Type: Bug
>Reporter: Deshi Xiao
>Assignee: haosdent
> Attachments: 34.pic_hd.jpg
>
>
> Please see the state.json timeout and the perf log from the affected machine.
> strace log:
> ```
> futex(0x14b5a48, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x1495f88, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource 
> temporarily unavailable)
> futex(0x1495f88, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x14999bc, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x140, 47666) 
> = 8
> futex(0x140, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x14b4290, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x1495f88, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource 
> temporarily unavailable)
> futex(0x1495f88, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x14999bc, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x140, 47696) 
> = 8
> futex(0x140, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x14b39f0, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x1495f88, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x14999bc, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x140, 47726) 
> = 8
> futex(0x140, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x14b3518, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x1495f88, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x14999bc, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x140, 47756) 
> = 8
> futex(0x140, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x14aa4d8, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x1495f88, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource 
> temporarily unavailable)
> futex(0x1495f88, FUTEX_WAKE_PRIVATE, 1) = 0
> exit_group(0)   = ?
> +++ exited with 0 +++
> ```
> ```
> CentOS Linux release 7.0.1406 (Core)
> 3.10.0-327.22.2.el7.x86_64
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4659) Avoid leaving orphan task after framework failure + master failover

2016-07-22 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-4659:
---
Description: 
If a framework becomes disconnected from the master, its tasks are killed after 
waiting for {{failover_timeout}}.

However, if a master failover occurs but a framework never reconnects to the 
new master, we never kill any of the tasks associated with that framework. 
These tasks remain orphaned and presumably would need to be manually removed by 
the operator. Similarly, if a framework gets torn down or disconnects while it 
has running tasks on a partitioned agent, those tasks are not shutdown when the 
agent reregisters.

We should consider whether to kill such orphaned tasks automatically, likely 
after waiting for some (framework-configurable?) timeout.

  was:
If a framework becomes disconnected from the master, its tasks are killed after 
waiting for {{failover_timeout}}.

However, if a master failover occurs but a framework never reconnects to the 
new master, we never kill any of the tasks associated with that framework. 
These tasks remain orphaned and presumably would need to be manually removed by 
the operator.

We should consider whether to kill such orphaned tasks automatically, likely 
after waiting for some (framework-configurable?) timeout.


> Avoid leaving orphan task after framework failure + master failover
> ---
>
> Key: MESOS-4659
> URL: https://issues.apache.org/jira/browse/MESOS-4659
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Neil Conway
>  Labels: failover, mesosphere
>
> If a framework becomes disconnected from the master, its tasks are killed 
> after waiting for {{failover_timeout}}.
> However, if a master failover occurs but a framework never reconnects to the 
> new master, we never kill any of the tasks associated with that framework. 
> These tasks remain orphaned and presumably would need to be manually removed 
> by the operator. Similarly, if a framework gets torn down or disconnects 
> while it has running tasks on a partitioned agent, those tasks are not 
> shutdown when the agent reregisters.
> We should consider whether to kill such orphaned tasks automatically, likely 
> after waiting for some (framework-configurable?) timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5303) Add capabilities support for mesos execute cli.

2016-07-22 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5303:
-
Sprint: Mesosphere Sprint 34, Mesosphere Sprint 35, Mesosphere Sprint 37, 
Mesosphere Sprint 38, Mesosphere Sprint 39, Mesosphere Sprint 40  (was: 
Mesosphere Sprint 34, Mesosphere Sprint 35, Mesosphere Sprint 37, Mesosphere 
Sprint 38, Mesosphere Sprint 39)

> Add capabilities support for mesos execute cli.
> ---
>
> Key: MESOS-5303
> URL: https://issues.apache.org/jira/browse/MESOS-5303
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Jojy Varghese
>Assignee: Benjamin Bannier
>  Labels: mesosphere
>
> Add support for `user` and `capabilities` to execute cli. This will help in 
> testing the `capabilities` feature for unified containerizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4233) Logging is too verbose for sysadmins / syslog

2016-07-22 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-4233:
---
Sprint: Mesosphere Sprint 26, Mesosphere Sprint 27, Mesosphere Sprint 28, 
Mesosphere Sprint 29, Mesosphere Sprint 30, Mesosphere Sprint 31, Mesosphere 
Sprint 32, Mesosphere Sprint 33, Mesosphere Sprint 34, Mesosphere Sprint 35, 
Mesosphere Sprint 36, Mesosphere Sprint 37, Mesosphere Sprint 38, Mesosphere 
Sprint 39  (was: Mesosphere Sprint 26, Mesosphere Sprint 27, Mesosphere Sprint 
28, Mesosphere Sprint 29, Mesosphere Sprint 30, Mesosphere Sprint 31, 
Mesosphere Sprint 32, Mesosphere Sprint 33, Mesosphere Sprint 34, Mesosphere 
Sprint 35, Mesosphere Sprint 36, Mesosphere Sprint 37, Mesosphere Sprint 38, 
Mesosphere Sprint 39, Mesosphere Sprint 40)

> Logging is too verbose for sysadmins / syslog
> -
>
> Key: MESOS-4233
> URL: https://issues.apache.org/jira/browse/MESOS-4233
> Project: Mesos
>  Issue Type: Epic
>Reporter: Cody Maloney
>Assignee: Kapil Arya
>  Labels: mesosphere
> Attachments: giant_port_range_logging
>
>
> Currently mesos logs a lot. When launching a thousand tasks in the space of 
> 10 seconds it will print tens of thousands of log lines, overwhelming syslog 
> (there is a max rate at which a process can send stuff over a unix socket) 
> and not giving useful information to a sysadmin who cares about just the 
> high-level activity and when something goes wrong.
> Note mesos also blocks writing to its log locations, so when writing a lot of 
> log messages, it can fill up the write buffer in the kernel, and be suspended 
> until the syslog agent catches up reading from the socket (GLOG does a 
> blocking fwrite to stderr). GLOG also has a big mutex around logging so only 
> one thing logs at a time.
> While for "internal debugging" it is useful to see things like "message went 
> from internal component x to internal component y", from a sysadmin 
> perspective I only care about the high level actions taken (launched task for 
> framework x, sent offer to framework y, got task failed from host z). Note 
> those are what I'd expect at the "INFO" level. At the "WARNING" level I'd 
> expect very little to be logged / almost nothing in normal operation. Just 
> things like "WARN: Replicated log write took longer than expected". WARN 
> would also get things like backtraces on crashes and abnormal exits / abort.
> When trying to launch 3k+ tasks inside a second, mesos logging currently 
> overwhelms syslog with 100k+ messages, many of which are thousands of bytes. 
> Sysadmins expect to be able to use syslog to monitor basic events in their 
> system. This is too much.
> We can keep logging the messages to files, but the logging to stderr needs to 
> be reduced significantly (stderr gets picked up and forwarded to syslog / 
> central aggregation).
> What I would like is if I can set the stderr logging level to be different / 
> independent from the file logging level (Syslog giving the "sysadmin" 
> aggregated overview, files useful for debugging in depth what happened in a 
> cluster). A lot of what mesos currently logs at info is really debugging info 
> / should show up as debug log level.
> Some samples of mesos logging a lot more than a sysadmin would want / expect 
> are attached, and some are below:
>  - Every task gets printed multiple times for a basic launch:
> {noformat}
> Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: 
> I1215 22:58:29.382644  1315 master.cpp:3248] Launching task 
> envy.5b19a713-a37f-11e5-8b3e-0251692d6109 of framework 
> 5178f46d-71d6-422f-922c-5bbe82dff9cc- (marathon)
> Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: 
> I1215 22:58:29.382925  1315 master.hpp:176] Adding task 
> envy.5b1958f2-a37f-11e5-8b3e-0251692d6109 with resources cpus(*):0.0001; 
> mem(*):16; ports(*):[14047-14047]
> {noformat}
>  - Every task status update prints many log lines, successful ones are part 
> of normal operation and maybe should be logged at info / debug levels, but 
> not to a sysadmin (Just show when things fail, and maybe aggregate counters 
> to tell of the volume of working)
>  - No log messages should be really big / more than 1k characters (Would 
> prevent the giant port list attached, make that easily discoverable / bug 
> fileable / fixable) 
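
For what it's worth, glog already exposes independent knobs for the file and stderr
destinations, roughly as sketched below. This is plain, standard glog usage and
deliberately ignores how Mesos currently wires up --logging_level and stderr logging;
treat it as an assumption-laden illustration rather than a proposed patch.

{code}
#include <glog/logging.h>

int main(int argc, char** argv)
{
  google::InitGoogleLogging(argv[0]);

  // Everything from INFO upwards is still written to the log files...
  FLAGS_minloglevel = 0;      // 0 = INFO.

  // ...but only ERROR and above is copied to stderr (and hence to syslog
  // wherever stderr is forwarded there).
  FLAGS_stderrthreshold = 2;  // 2 = ERROR.

  LOG(INFO) << "Detailed bookkeeping: log files only.";
  LOG(ERROR) << "Operator-visible problem: also written to stderr.";

  return 0;
}
{code}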



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3815) Executors cannot register with SSL-required agent

2016-07-22 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-3815:
-
Sprint: Mesosphere Sprint 39, Mesosphere Sprint 40  (was: Mesosphere Sprint 
39)

> Executors cannot register with SSL-required agent
> -
>
> Key: MESOS-3815
> URL: https://issues.apache.org/jira/browse/MESOS-3815
> Project: Mesos
>  Issue Type: Bug
>Reporter: haosdent
>Assignee: Till Toenshoff
>  Labels: docker, encryption, mesosphere, security, ssl
>
> Because the docker executor does not pass SSL-related environment variables, 
> mesos-docker-executor cannot work normally when SSL is enabled. More details 
> can be found in http://search-hadoop.com/m/0Vlr6DsslDSvVs72



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5051) Create helpers for manipulating Linux capabilities.

2016-07-22 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5051:
-
Sprint: Mesosphere Sprint 32, Mesosphere Sprint 33, Mesosphere Sprint 34, 
Mesosphere Sprint 35, Mesosphere Sprint 37, Mesosphere Sprint 38, Mesosphere 
Sprint 39, Mesosphere Sprint 40  (was: Mesosphere Sprint 32, Mesosphere Sprint 
33, Mesosphere Sprint 34, Mesosphere Sprint 35, Mesosphere Sprint 37, 
Mesosphere Sprint 38, Mesosphere Sprint 39)

> Create helpers for manipulating Linux capabilities.
> ---
>
> Key: MESOS-5051
> URL: https://issues.apache.org/jira/browse/MESOS-5051
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Benjamin Bannier
>  Labels: mesosphere
>
> These helpers can either based on some existing library (e.g. libcap), or use 
> system calls directly.
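
As a rough sketch of the libcap route (these are plain libcap calls, offered only to
make the comparison with raw system calls concrete, not the helper API this ticket
ends up defining; link with -lcap):

{code}
#include <sys/capability.h>  // libcap.

#include <cstdio>

int main()
{
  // Read the capability sets of the current process.
  cap_t caps = cap_get_proc();
  if (caps == NULL) {
    perror("cap_get_proc");
    return 1;
  }

  // Render them in the textual form also used by tools like getpcaps.
  char* text = cap_to_text(caps, NULL);
  if (text != NULL) {
    printf("%s\n", text);
  }

  cap_free(text);
  cap_free(caps);

  return 0;
}
{code}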



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5228) Add tests for Capability API.

2016-07-22 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5228:
-
Sprint: Mesosphere Sprint 33, Mesosphere Sprint 34, Mesosphere Sprint 35, 
Mesosphere Sprint 37, Mesosphere Sprint 38, Mesosphere Sprint 39, Mesosphere 
Sprint 40  (was: Mesosphere Sprint 33, Mesosphere Sprint 34, Mesosphere Sprint 
35, Mesosphere Sprint 37, Mesosphere Sprint 38, Mesosphere Sprint 39)

> Add tests for Capability API.
> -
>
> Key: MESOS-5228
> URL: https://issues.apache.org/jira/browse/MESOS-5228
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Jojy Varghese
>Assignee: Benjamin Bannier
>  Labels: mesosphere, unified-containerizer-mvp
>
> Add basic tests for the capability API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5825) Support mounting image volume in mesos containerizer.

2016-07-22 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5825:
-
Sprint: Mesosphere Sprint 39, Mesosphere Sprint 40  (was: Mesosphere Sprint 
39)

> Support mounting image volume in mesos containerizer.
> -
>
> Key: MESOS-5825
> URL: https://issues.apache.org/jira/browse/MESOS-5825
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>  Labels: containerizer, filesystem, isolator, mesosphere
>
> The Mesos containerizer should be able to support mounting the image volume type. 
> Specifically, both the image rootfs and the default manifest should be reachable 
> inside the container's mount namespace.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5582) Create a `cgroups/devices` isolator.

2016-07-22 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5582:
-
Sprint: Mesosphere Sprint 36, Mesosphere Sprint 37, Mesosphere Sprint 38, 
Mesosphere Sprint 39, Mesosphere Sprint 40  (was: Mesosphere Sprint 36, 
Mesosphere Sprint 37, Mesosphere Sprint 38, Mesosphere Sprint 39)

> Create a `cgroups/devices` isolator.
> 
>
> Key: MESOS-5582
> URL: https://issues.apache.org/jira/browse/MESOS-5582
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: gpu, isolator, mesosphere
>
> Currently, all the logic for the `cgroups/devices` isolator is bundled into 
> the Nvidia GPU Isolator. We should abstract it out into its own component 
> and remove the redundant logic from the Nvidia GPU Isolator. Assuming the 
> guaranteed ordering between isolators from MESOS-5581, we can be sure that 
> the dependency order between the `cgroups/devices` and `gpu/nvidia` isolators 
> is met.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4690) Reorganize 3rdparty directory

2016-07-22 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-4690:
-
Sprint: Mesosphere Sprint 33, Mesosphere Sprint 34, Mesosphere Sprint 35, 
Mesosphere Sprint 36, Mesosphere Sprint 37, Mesosphere Sprint 38, Mesosphere 
Sprint 39, Mesosphere Sprint 40  (was: Mesosphere Sprint 33, Mesosphere Sprint 
34, Mesosphere Sprint 35, Mesosphere Sprint 36, Mesosphere Sprint 37, 
Mesosphere Sprint 38, Mesosphere Sprint 39)

> Reorganize 3rdparty directory
> -
>
> Key: MESOS-4690
> URL: https://issues.apache.org/jira/browse/MESOS-4690
> Project: Mesos
>  Issue Type: Epic
>  Components: build, libprocess, stout
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere
>
> This issue is currently being discussed in the dev mailing list:
> http://www.mail-archive.com/dev@mesos.apache.org/msg34349.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5802) SlaveAuthorizerTest/0.ViewFlags is flaky.

2016-07-22 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5802:
-
Sprint: Mesosphere Sprint 39, Mesosphere Sprint 40  (was: Mesosphere Sprint 
39)

> SlaveAuthorizerTest/0.ViewFlags is flaky.
> -
>
> Key: MESOS-5802
> URL: https://issues.apache.org/jira/browse/MESOS-5802
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Jie Yu
>Assignee: Alexander Rojas
>  Labels: mesosphere, race-condition, slave
>
> {noformat}
> [15:24:47] :   [Step 10/10] [ RUN  ] SlaveAuthorizerTest/0.ViewFlags
> [15:24:47]W:   [Step 10/10] I0707 15:24:47.025609 25322 
> containerizer.cpp:196] Using isolation: 
> posix/cpu,posix/mem,filesystem/posix,network/cni
> [15:24:47]W:   [Step 10/10] I0707 15:24:47.030421 25322 
> linux_launcher.cpp:101] Using /sys/fs/cgroup/freezer as the freezer hierarchy 
> for the Linux launcher
> [15:24:47]W:   [Step 10/10] I0707 15:24:47.032060 25339 slave.cpp:205] Agent 
> started on 335)@172.30.2.7:43076
> [15:24:47]W:   [Step 10/10] I0707 15:24:47.032078 25339 slave.cpp:206] Flags 
> at startup: --acls="" --appc_simple_discovery_uri_prefix="http://; 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http="true" 
> --authenticatee="crammd5" --authentication_backoff_factor="1secs" 
> --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" 
> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" 
> --cgroups_limit_swap="false" --cgroups_root="mesos" 
> --container_disk_watch_interval="15secs" --containerizers="mesos" 
> --credential="/mnt/teamcity/temp/buildTmp/SlaveAuthorizerTest_0_ViewFlags_OsJb5C/credential"
>  --default_role="*" --disk_watch_interval="1mins" --docker="docker" 
> --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io; 
> --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" 
> --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" 
> --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_registration_timeout="1mins" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/mnt/teamcity/temp/buildTmp/SlaveAuthorizerTest_0_ViewFlags_OsJb5C/fetch"
>  --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" 
> --gc_disk_headroom="0.1" --hadoop_home="" --help="true" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_command_executor="false" 
> --http_credentials="/mnt/teamcity/temp/buildTmp/SlaveAuthorizerTest_0_ViewFlags_OsJb5C/http_credentials"
>  --image_provisioner_backend="copy" --initialize_driver_logging="true" 
> --isolation="posix/cpu,posix/mem" 
> --launcher_dir="/mnt/teamcity/work/4240ba9ddd0997c3/build/src" 
> --logbufsecs="0" --logging_level="INFO" 
> --oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
> --perf_interval="1mins" --qos_correction_interval_min="0ns" --quiet="false" 
> --recover="reconnect" --recovery_timeout="15mins" 
> --registration_backoff_factor="10ms" 
> --resources="cpus:2;gpus:0;mem:1024;disk:1024;ports:[31000-32000]" 
> --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" 
> --strict="true" --switch_user="true" --systemd_enable_support="true" 
> --systemd_runtime_directory="/run/systemd/system" --version="false" 
> --work_dir="/mnt/teamcity/temp/buildTmp/SlaveAuthorizerTest_0_ViewFlags_OsJb5C"
>  --xfs_project_range="[5000-1]"
> [15:24:47]W:   [Step 10/10] I0707 15:24:47.032306 25339 credentials.hpp:86] 
> Loading credential for authentication from 
> '/mnt/teamcity/temp/buildTmp/SlaveAuthorizerTest_0_ViewFlags_OsJb5C/credential'
> [15:24:47]W:   [Step 10/10] I0707 15:24:47.032424 25339 slave.cpp:343] Agent 
> using credential for: test-principal
> [15:24:47]W:   [Step 10/10] I0707 15:24:47.032441 25339 credentials.hpp:37] 
> Loading credentials for authentication from 
> '/mnt/teamcity/temp/buildTmp/SlaveAuthorizerTest_0_ViewFlags_OsJb5C/http_credentials'
> [15:24:47]W:   [Step 10/10] I0707 15:24:47.032528 25339 slave.cpp:395] Using 
> default 'basic' HTTP authenticator
> [15:24:47]W:   [Step 10/10] I0707 15:24:47.032754 25339 resources.cpp:572] 
> Parsing resources as JSON failed: 
> cpus:2;gpus:0;mem:1024;disk:1024;ports:[31000-32000]
> [15:24:47]W:   [Step 10/10] Trying semicolon-delimited string format instead
> [15:24:47]W:   [Step 10/10] I0707 15:24:47.032838 25339 resources.cpp:572] 
> Parsing resources as JSON failed: 
> cpus:2;gpus:0;mem:1024;disk:1024;ports:[31000-32000]
> [15:24:47]W:   [Step 10/10] Trying semicolon-delimited string format instead
> [15:24:47]W:   [Step 10/10] I0707 15:24:47.032968 25339 slave.cpp:594] Agent 
> resources: cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000]
> 

[jira] [Updated] (MESOS-5788) Consider adding a Java Scheduler Shim/Adapter for the new/old API.

2016-07-22 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5788:
-
Sprint: Mesosphere Sprint 39, Mesosphere Sprint 40  (was: Mesosphere Sprint 
39)

> Consider adding a Java Scheduler Shim/Adapter for the new/old API.
> --
>
> Key: MESOS-5788
> URL: https://issues.apache.org/jira/browse/MESOS-5788
> Project: Mesos
>  Issue Type: Task
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> Currently, for existing Java-based frameworks, moving to try out the new API 
> can be cumbersome. This change intends to introduce a shim/adapter interface 
> that makes this easier by allowing a toggle between the old/new API 
> (driver/new scheduler library) implementations via an environment variable. 
> This would allow framework developers to transition their older frameworks to 
> the new API rather seamlessly.
> This would look similar to the work done for the executor shim for C++ 
> (command/docker executor). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5824) Include disk source information in stringification

2016-07-22 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5824:
-
Sprint: Mesosphere Sprint 39, Mesosphere Sprint 40  (was: Mesosphere Sprint 
39)

> Include disk source information in stringification
> --
>
> Key: MESOS-5824
> URL: https://issues.apache.org/jira/browse/MESOS-5824
> Project: Mesos
>  Issue Type: Improvement
>  Components: stout
>Affects Versions: 0.28.2
>Reporter: Tim Harper
>Assignee: Tim Harper
>Priority: Minor
>  Labels: mesosphere
> Fix For: 1.1.0
>
>
> Some frameworks (like kafka_mesos) ignore the Source field when trying to 
> reserve an offered mount or path persistent volume; the resulting error 
> message is bewildering:
> {code:none}
> Task uses more resources
> cpus(*):4; mem(*):4096; ports(*):[31000-31000]; disk(kafka, 
> kafka)[kafka_0:data]:960679
> than available
> cpus(*):32; mem(*):256819;  ports(*):[31000-32000]; disk(kafka, 
> kafka)[kafka_0:data]:960679;   disk(*):240169;
> {code}
> The stringification of disk resources should include source information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4766) Improve allocator performance.

2016-07-22 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-4766:
-
Sprint: Mesosphere Sprint 32, Mesosphere Sprint 33, Mesosphere Sprint 34, 
Mesosphere Sprint 35, Mesosphere Sprint 36, Mesosphere Sprint 37, Mesosphere 
Sprint 38, Mesosphere Sprint 39, Mesosphere Sprint 40  (was: Mesosphere Sprint 
32, Mesosphere Sprint 33, Mesosphere Sprint 34, Mesosphere Sprint 35, 
Mesosphere Sprint 36, Mesosphere Sprint 37, Mesosphere Sprint 38, Mesosphere 
Sprint 39)

> Improve allocator performance.
> --
>
> Key: MESOS-4766
> URL: https://issues.apache.org/jira/browse/MESOS-4766
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Michael Park
>Priority: Critical
>
> This is an epic to track the various tickets around improving the performance 
> of the allocator, including the following:
> * Preventing un-necessary backup of the allocator.
> * Reducing the cost of allocations and allocator state updates.
> * Improving performance of the DRF sorter.
> * More benchmarking to simulate scenarios with performance issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5779) Allow Docker v1 ImageManifests to be parsed from the output of `docker inspect`

2016-07-22 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5779:
-
Sprint: Mesosphere Sprint 38, Mesosphere Sprint 39, Mesosphere Sprint 40  
(was: Mesosphere Sprint 38, Mesosphere Sprint 39)

> Allow Docker v1 ImageManifests to be parsed from the output of `docker 
> inspect`
> ---
>
> Key: MESOS-5779
> URL: https://issues.apache.org/jira/browse/MESOS-5779
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>
> The `docker::spec::v1::ImageManifest` protobuf implements the
> official v1 image manifest specification found at:
> 
> https://github.com/docker/docker/blob/master/image/spec/v1.md
> 
> The field names in this spec are all written in snake_case as are the
> field names of the JSON representing the image manifest when reading
> it from disk (for example after performing a `docker save`). As such,
> the protobuf for ImageManifest also provides these fields in
> snake_case. Unfortunately, the `docker inspect` command also provides
> a method of retrieving the JSON for an image manifest, with one major
> caveat -- it represents all of its top level keys in CamelCase.
> 
> To allow both representations to be parsed in the same way, we
> should intercept the incoming JSON from either source (disk or `docker
> inspect`) and convert it to a canonical snake_case representation.
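
A minimal, self-contained sketch of the kind of key conversion described above. The
{{camelToSnake}} helper below is a hypothetical illustration rather than the actual
Mesos code, and it deliberately ignores corner cases such as acronym-only keys:

{code}
#include <cctype>
#include <iostream>
#include <string>

// Converts a top-level key as emitted by `docker inspect` (e.g. "DockerVersion")
// into the snake_case form used by the on-disk v1 spec (e.g. "docker_version").
std::string camelToSnake(const std::string& key)
{
  std::string result;

  for (size_t i = 0; i < key.size(); ++i) {
    if (std::isupper(static_cast<unsigned char>(key[i]))) {
      if (i > 0) {
        result += '_';
      }
      result += static_cast<char>(std::tolower(static_cast<unsigned char>(key[i])));
    } else {
      result += key[i];
    }
  }

  return result;
}

int main()
{
  std::cout << camelToSnake("DockerVersion") << std::endl;    // docker_version
  std::cout << camelToSnake("ContainerConfig") << std::endl;  // container_config
  return 0;
}
{code}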



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5275) Add capabilities support for unified containerizer.

2016-07-22 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5275:
-
Sprint: Mesosphere Sprint 34, Mesosphere Sprint 35, Mesosphere Sprint 37, 
Mesosphere Sprint 38, Mesosphere Sprint 39, Mesosphere Sprint 40  (was: 
Mesosphere Sprint 34, Mesosphere Sprint 35, Mesosphere Sprint 37, Mesosphere 
Sprint 38, Mesosphere Sprint 39)

> Add capabilities support for unified containerizer.
> ---
>
> Key: MESOS-5275
> URL: https://issues.apache.org/jira/browse/MESOS-5275
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Jojy Varghese
>Assignee: Benjamin Bannier
>  Labels: mesosphere
>
> Add capabilities support for unified containerizer. 
> Requirements:
> 1. Use the mesos capabilities API.
> 2. Frameworks should be able to add capability requests for containers.
> 3. Agents should be able to set the maximum allowed capabilities for all containers 
> launched.
> Design document: 
> https://docs.google.com/document/d/1YiTift8TQla2vq3upQr7K-riQ_pQ-FKOCOsysQJROGc/edit#heading=h.rgfwelqrskmd



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5570) Improve CHANGELOG and upgrades.md

2016-07-22 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5570:
-
Sprint: Mesosphere Sprint 37, Mesosphere Sprint 38, Mesosphere Sprint 39, 
Mesosphere Sprint 40  (was: Mesosphere Sprint 37, Mesosphere Sprint 38, 
Mesosphere Sprint 39)

> Improve CHANGELOG and upgrades.md
> -
>
> Key: MESOS-5570
> URL: https://issues.apache.org/jira/browse/MESOS-5570
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>
> Currently we have a lot of data duplication between the CHANGELOG and 
> upgrades.md. We should try to improve this and potentially make the CHANGELOG 
> a markdown file as well. For inspiration see the Hadoop changelog: 
> https://github.com/apache/hadoop/blob/2e1d0ff4e901b8313c8d71869735b94ed8bc40a0/hadoop-common-project/hadoop-common/src/site/markdown/release/1.2.0/CHANGES.1.2.0.md



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3753) Test the HTTP Scheduler library with SSL enabled

2016-07-22 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-3753:
-
Sprint: Mesosphere Sprint 39, Mesosphere Sprint 40  (was: Mesosphere Sprint 
39)

> Test the HTTP Scheduler library with SSL enabled
> 
>
> Key: MESOS-3753
> URL: https://issues.apache.org/jira/browse/MESOS-3753
> Project: Mesos
>  Issue Type: Story
>  Components: framework, HTTP API, test
>Reporter: Joseph Wu
>Assignee: Greg Mann
>  Labels: mesosphere, security
>
> Currently, the HTTP Scheduler library does not support SSL-enabled Mesos.  
> (You can manually test this by spinning up an SSL-enabled master and attempting 
> to run the event-call framework example against it.)
> We need to add tests that check the HTTP Scheduler library against 
> SSL-enabled Mesos:
> * with downgrade support,
> * with required framework/client-side certifications,
> * with/without verification of certificates (master-side),
> * with/without verification of certificates (framework-side),
> * with a custom certificate authority (CA)
> These options should be controlled by the same environment variables found on 
> the [SSL user doc|http://mesos.apache.org/documentation/latest/ssl/].
> Note: This issue will be broken down into smaller sub-issues as bugs/problems 
> are discovered.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5822) Add a build script for the Windows CI

2016-07-22 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5822:
-
Sprint: Mesosphere Sprint 39, Mesosphere Sprint 40  (was: Mesosphere Sprint 
39)

> Add a build script for the Windows CI
> -
>
> Key: MESOS-5822
> URL: https://issues.apache.org/jira/browse/MESOS-5822
> Project: Mesos
>  Issue Type: Improvement
>  Components: build
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere, microsoft, windows
>
> The ASF CI for Mesos runs a script that lives inside the Mesos codebase:
> https://github.com/apache/mesos/blob/1cbfdc3c1e4b8498a67f8531ab264003c8c19fb1/support/docker_build.sh
> ASF Infrastructure have set up a machine that we can use for building Mesos 
> on Windows.  Considering the environment, we will need a separate script to 
> build here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4233) Logging is too verbose for sysadmins / syslog

2016-07-22 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-4233:
-
Sprint: Mesosphere Sprint 26, Mesosphere Sprint 27, Mesosphere Sprint 28, 
Mesosphere Sprint 29, Mesosphere Sprint 30, Mesosphere Sprint 31, Mesosphere 
Sprint 32, Mesosphere Sprint 33, Mesosphere Sprint 34, Mesosphere Sprint 35, 
Mesosphere Sprint 36, Mesosphere Sprint 37, Mesosphere Sprint 38, Mesosphere 
Sprint 39, Mesosphere Sprint 40  (was: Mesosphere Sprint 26, Mesosphere Sprint 
27, Mesosphere Sprint 28, Mesosphere Sprint 29, Mesosphere Sprint 30, 
Mesosphere Sprint 31, Mesosphere Sprint 32, Mesosphere Sprint 33, Mesosphere 
Sprint 34, Mesosphere Sprint 35, Mesosphere Sprint 36, Mesosphere Sprint 37, 
Mesosphere Sprint 38, Mesosphere Sprint 39)

> Logging is too verbose for sysadmins / syslog
> -
>
> Key: MESOS-4233
> URL: https://issues.apache.org/jira/browse/MESOS-4233
> Project: Mesos
>  Issue Type: Epic
>Reporter: Cody Maloney
>Assignee: Kapil Arya
>  Labels: mesosphere
> Attachments: giant_port_range_logging
>
>
> Currently mesos logs a lot. When launching a thousand tasks in the space of 
> 10 seconds it will print tens of thousands of log lines, overwhelming syslog 
> (there is a max rate at which a process can send stuff over a unix socket) 
> and not giving useful information to a sysadmin who cares about just the 
> high-level activity and when something goes wrong.
> Note mesos also blocks writing to its log locations, so when writing a lot of 
> log messages, it can fill up the write buffer in the kernel, and be suspended 
> until the syslog agent catches up reading from the socket (GLOG does a 
> blocking fwrite to stderr). GLOG also has a big mutex around logging so only 
> one thing logs at a time.
> While for "internal debugging" it is useful to see things like "message went 
> from internal component x to internal component y", from a sysadmin 
> perspective I only care about the high level actions taken (launched task for 
> framework x, sent offer to framework y, got task failed from host z). Note 
> those are what I'd expect at the "INFO" level. At the "WARNING" level I'd 
> expect very little to be logged / almost nothing in normal operation. Just 
> things like "WARN: Replicated log write took longer than expected". WARN 
> would also get things like backtraces on crashes and abnormal exits / abort.
> When trying to launch 3k+ tasks inside a second, mesos logging currently 
> overwhelms syslog with 100k+ messages, many of which are thousands of bytes. 
> Sysadmins expect to be able to use syslog to monitor basic events in their 
> system. This is too much.
> We can keep logging the messages to files, but the logging to stderr needs to 
> be reduced significantly (stderr gets picked up and forwarded to syslog / 
> central aggregation).
> What I would like is if I can set the stderr logging level to be different / 
> independent from the file logging level (Syslog giving the "sysadmin" 
> aggregated overview, files useful for debugging in depth what happened in a 
> cluster). A lot of what mesos currently logs at info is really debugging info 
> / should show up as debug log level.
> Some samples of mesos logging a lot more than a sysadmin would want / expect 
> are attached, and some are below:
>  - Every task gets printed multiple times for a basic launch:
> {noformat}
> Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: 
> I1215 22:58:29.382644  1315 master.cpp:3248] Launching task 
> envy.5b19a713-a37f-11e5-8b3e-0251692d6109 of framework 
> 5178f46d-71d6-422f-922c-5bbe82dff9cc- (marathon)
> Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: 
> I1215 22:58:29.382925  1315 master.hpp:176] Adding task 
> envy.5b1958f2-a37f-11e5-8b3e-0251692d6109 with resources cpus(*):0.0001; 
> mem(*):16; ports(*):[14047-14047]
> {noformat}
>  - Every task status update prints many log lines, successful ones are part 
> of normal operation and maybe should be logged at info / debug levels, but 
> not to a sysadmin (Just show when things fail, and maybe aggregate counters 
> to tell of the volume of working)
>  - No log messages should be really big / more than 1k characters (Would 
> prevent the giant port list attached, make that easily discoverable / bug 
> fileable / fixable) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5888) SlaveAuthorizerTest/ViewFlags is flaky

2016-07-22 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-5888:
---
Shepherd: Vinod Kone

> SlaveAuthorizerTest/ViewFlags is flaky
> --
>
> Key: MESOS-5888
> URL: https://issues.apache.org/jira/browse/MESOS-5888
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Neil Conway
>Assignee: Neil Conway
>Priority: Minor
>  Labels: flaky-test, mesosphere
>
> Observed on internal CI:
> {noformat}
> [09:52:45] :   [Step 10/10] [ RUN  ] SlaveAuthorizerTest/1.ViewFlags
> [09:52:45]W:   [Step 10/10] I0722 09:52:45.797574 22980 
> containerizer.cpp:196] Using isolation: 
> posix/cpu,posix/mem,filesystem/posix,network/cni
> [09:52:45]W:   [Step 10/10] I0722 09:52:45.800644 22980 
> linux_launcher.cpp:101] Using /sys/fs/cgroup/freezer as the freezer hierarchy 
> for the Linux launcher
> [09:52:45]W:   [Step 10/10] I0722 09:52:45.801910 22996 slave.cpp:198] Agent 
> started on 338)@172.30.2.80:35421
> [09:52:45]W:   [Step 10/10] I0722 09:52:45.801923 22996 slave.cpp:199] Flags 
> at startup: --acls="" --appc_simple_discovery_uri_prefix="http://; 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos" 
> --credential="/mnt/teamcity/temp/buildTmp/SlaveAuthorizerTest_1_ViewFlags_ISUu44/credential"
>  --default_role="*" --disk_watch_interval="1mins" --docker="docker" 
> --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io; 
> --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" 
> --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" 
> --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_registration_timeout="1mins" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/mnt/teamcity/temp/buildTmp/SlaveAuthorizerTest_1_ViewFlags_ISUu44/fetch"
>  --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" 
> --gc_disk_headroom="0.1" --hadoop_home="" --help="true" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_command_executor="false" 
> --http_credentials="/mnt/teamcity/temp/buildTmp/SlaveAuthorizerTest_1_ViewFlags_ISUu44/http_credentials"
>  --image_provisioner_backend="copy" --initialize_driver_logging="true" 
> --isolation="posix/cpu,posix/mem" 
> --launcher_dir="/mnt/teamcity/work/4240ba9ddd0997c3/build/src" 
> --logbufsecs="0" --logging_level="INFO" 
> --oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
> --perf_interval="1mins" --qos_correction_interval_min="0ns" --quiet="false" 
> --recover="reconnect" --recovery_timeout="15mins" 
> --registration_backoff_factor="10ms" 
> --resources="cpus:2;gpus:0;mem:1024;disk:1024;ports:[31000-32000]" 
> --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" 
> --strict="true" --switch_user="true" --systemd_enable_support="true" 
> --systemd_runtime_directory="/run/systemd/system" --version="false" 
> --work_dir="/mnt/teamcity/temp/buildTmp/SlaveAuthorizerTest_1_ViewFlags_ISUu44"
> [09:52:45]W:   [Step 10/10] I0722 09:52:45.802104 22996 credentials.hpp:86] 
> Loading credential for authentication from 
> '/mnt/teamcity/temp/buildTmp/SlaveAuthorizerTest_1_ViewFlags_ISUu44/credential'
> [09:52:45]W:   [Step 10/10] I0722 09:52:45.802160 22996 slave.cpp:336] Agent 
> using credential for: test-principal
> [09:52:45]W:   [Step 10/10] I0722 09:52:45.802170 22996 credentials.hpp:37] 
> Loading credentials for authentication from 
> '/mnt/teamcity/temp/buildTmp/SlaveAuthorizerTest_1_ViewFlags_ISUu44/http_credentials'
> [09:52:45]W:   [Step 10/10] I0722 09:52:45.802223 22996 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-agent-readonly'
> [09:52:45]W:   [Step 10/10] I0722 09:52:45.802264 22996 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-agent-readwrite'
> [09:52:45]W:   [Step 10/10] I0722 09:52:45.802472 22996 slave.cpp:519] Agent 
> resources: cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000]
> [09:52:45]W:   [Step 10/10] I0722 09:52:45.802495 22996 slave.cpp:527] Agent 
> attributes: [  ]
> [09:52:45]W:   [Step 10/10] I0722 09:52:45.802500 22996 slave.cpp:532] Agent 
> hostname: ip-172-30-2-80.mesosphere.io
> [09:52:45]W:   [Step 10/10] I0722 09:52:45.802726 22996 process.cpp:3341] 
> Handling HTTP event for process 'slave(338)' 

[jira] [Updated] (MESOS-5889) Flakiness in SlaveRecoveryTest

2016-07-22 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-5889:
---
Attachment: slave_recovery_recover_unregistered_http_executor.log
slave_recovery_recover_terminated_executor.log
slave_recovery_cleanup_http_executor.log

> Flakiness in SlaveRecoveryTest
> --
>
> Key: MESOS-5889
> URL: https://issues.apache.org/jira/browse/MESOS-5889
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Neil Conway
>  Labels: mesosphere
> Attachments: slave_recovery_cleanup_http_executor.log, 
> slave_recovery_recover_terminated_executor.log, 
> slave_recovery_recover_unregistered_http_executor.log
>
>
> Observed on internal CI. Seems like it is related to cgroups? Observed 
> similar failures in the following tests, and probably more related tests:
> SlaveRecoveryTest/0.CleanupHTTPExecutor
> SlaveRecoveryTest/0.RecoverUnregisteredHTTPExecutor
> SlaveRecoveryTest/0.RecoverTerminatedExecutor
> Log files attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5889) Flakiness in SlaveRecoveryTest

2016-07-22 Thread Neil Conway (JIRA)
Neil Conway created MESOS-5889:
--

 Summary: Flakiness in SlaveRecoveryTest
 Key: MESOS-5889
 URL: https://issues.apache.org/jira/browse/MESOS-5889
 Project: Mesos
  Issue Type: Bug
  Components: tests
Reporter: Neil Conway


Observed on internal CI. Seems like it is related to cgroups? Observed similar 
failures in the following tests, and probably more related tests:

SlaveRecoveryTest/0.CleanupHTTPExecutor
SlaveRecoveryTest/0.RecoverUnregisteredHTTPExecutor
SlaveRecoveryTest/0.RecoverTerminatedExecutor

Log files attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5888) SlaveAuthorizerTest/ViewFlags is flaky

2016-07-22 Thread Neil Conway (JIRA)
Neil Conway created MESOS-5888:
--

 Summary: SlaveAuthorizerTest/ViewFlags is flaky
 Key: MESOS-5888
 URL: https://issues.apache.org/jira/browse/MESOS-5888
 Project: Mesos
  Issue Type: Bug
  Components: tests
Reporter: Neil Conway
Assignee: Neil Conway
Priority: Minor


Observed on internal CI:

{noformat}
[09:52:45] : [Step 10/10] [ RUN  ] SlaveAuthorizerTest/1.ViewFlags
[09:52:45]W: [Step 10/10] I0722 09:52:45.797574 22980 
containerizer.cpp:196] Using isolation: 
posix/cpu,posix/mem,filesystem/posix,network/cni
[09:52:45]W: [Step 10/10] I0722 09:52:45.800644 22980 
linux_launcher.cpp:101] Using /sys/fs/cgroup/freezer as the freezer hierarchy 
for the Linux launcher
[09:52:45]W: [Step 10/10] I0722 09:52:45.801910 22996 slave.cpp:198] Agent 
started on 338)@172.30.2.80:35421
[09:52:45]W: [Step 10/10] I0722 09:52:45.801923 22996 slave.cpp:199] Flags 
at startup: --acls="" --appc_simple_discovery_uri_prefix="http://; 
--appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticatee="crammd5" 
--authentication_backoff_factor="1secs" --authorizer="local" 
--cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
--cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
--cgroups_root="mesos" --container_disk_watch_interval="15secs" 
--containerizers="mesos" 
--credential="/mnt/teamcity/temp/buildTmp/SlaveAuthorizerTest_1_ViewFlags_ISUu44/credential"
 --default_role="*" --disk_watch_interval="1mins" --docker="docker" 
--docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io; 
--docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" 
--docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" 
--docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
--enforce_container_disk_quota="false" --executor_registration_timeout="1mins" 
--executor_shutdown_grace_period="5secs" 
--fetcher_cache_dir="/mnt/teamcity/temp/buildTmp/SlaveAuthorizerTest_1_ViewFlags_ISUu44/fetch"
 --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" 
--gc_disk_headroom="0.1" --hadoop_home="" --help="true" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_command_executor="false" 
--http_credentials="/mnt/teamcity/temp/buildTmp/SlaveAuthorizerTest_1_ViewFlags_ISUu44/http_credentials"
 --image_provisioner_backend="copy" --initialize_driver_logging="true" 
--isolation="posix/cpu,posix/mem" 
--launcher_dir="/mnt/teamcity/work/4240ba9ddd0997c3/build/src" --logbufsecs="0" 
--logging_level="INFO" --oversubscribed_resources_interval="15secs" 
--perf_duration="10secs" --perf_interval="1mins" 
--qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" 
--recovery_timeout="15mins" --registration_backoff_factor="10ms" 
--resources="cpus:2;gpus:0;mem:1024;disk:1024;ports:[31000-32000]" 
--revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" 
--strict="true" --switch_user="true" --systemd_enable_support="true" 
--systemd_runtime_directory="/run/systemd/system" --version="false" 
--work_dir="/mnt/teamcity/temp/buildTmp/SlaveAuthorizerTest_1_ViewFlags_ISUu44"
[09:52:45]W: [Step 10/10] I0722 09:52:45.802104 22996 credentials.hpp:86] 
Loading credential for authentication from 
'/mnt/teamcity/temp/buildTmp/SlaveAuthorizerTest_1_ViewFlags_ISUu44/credential'
[09:52:45]W: [Step 10/10] I0722 09:52:45.802160 22996 slave.cpp:336] Agent 
using credential for: test-principal
[09:52:45]W: [Step 10/10] I0722 09:52:45.802170 22996 credentials.hpp:37] 
Loading credentials for authentication from 
'/mnt/teamcity/temp/buildTmp/SlaveAuthorizerTest_1_ViewFlags_ISUu44/http_credentials'
[09:52:45]W: [Step 10/10] I0722 09:52:45.802223 22996 http.cpp:883] Using 
default 'basic' HTTP authenticator for realm 'mesos-agent-readonly'
[09:52:45]W: [Step 10/10] I0722 09:52:45.802264 22996 http.cpp:883] Using 
default 'basic' HTTP authenticator for realm 'mesos-agent-readwrite'
[09:52:45]W: [Step 10/10] I0722 09:52:45.802472 22996 slave.cpp:519] Agent 
resources: cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000]
[09:52:45]W: [Step 10/10] I0722 09:52:45.802495 22996 slave.cpp:527] Agent 
attributes: [  ]
[09:52:45]W: [Step 10/10] I0722 09:52:45.802500 22996 slave.cpp:532] Agent 
hostname: ip-172-30-2-80.mesosphere.io
[09:52:45]W: [Step 10/10] I0722 09:52:45.802726 22996 process.cpp:3341] 
Handling HTTP event for process 'slave(338)' with path: '/slave(338)/flags'
[09:52:45]W: [Step 10/10] I0722 09:52:45.802738 22999 state.cpp:57] 
Recovering state from 
'/mnt/teamcity/temp/buildTmp/SlaveAuthorizerTest_1_ViewFlags_ISUu44/meta'
[09:52:45]W: [Step 10/10] I0722 09:52:45.802819 23000 
status_update_manager.cpp:200] Recovering status update manager
[09:52:45]W:  

[jira] [Commented] (MESOS-5887) Enhance DispatchEvent to include demangled method name.

2016-07-22 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389334#comment-15389334
 ] 

Alexander Rukletsov commented on MESOS-5887:


I think a member function pointer is enough to uniquely identify the method that 
is being called. What I'm not sure about is whether it is always possible to 
convert it to a human-readable form.
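
For reference, on a GCC/Clang toolchain (an assumption; {{abi::__cxa_demangle}} is not
portable) the following sketch shows what is recoverable from the pointer alone: the
demangled type of the member-function pointer, not the name of the method itself,
which is why a human-readable method name would probably have to be captured at
dispatch time.

{code}
#include <cxxabi.h>  // abi::__cxa_demangle (GCC/Clang only).

#include <cstdlib>
#include <iostream>
#include <typeinfo>

struct SomeProcess
{
  void initialize() {}
};

int main()
{
  void (SomeProcess::*method)() = &SomeProcess::initialize;

  int status = 0;
  char* readable = abi::__cxa_demangle(
      typeid(method).name(), NULL, NULL, &status);

  // Prints something like "void (SomeProcess::*)()": the pointer's type is
  // recoverable, but the member's own name ("initialize") is not.
  std::cout << (status == 0 ? readable : typeid(method).name()) << std::endl;

  std::free(readable);

  return 0;
}
{code}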

> Enhance DispatchEvent to include demangled method name.
> ---
>
> Key: MESOS-5887
> URL: https://issues.apache.org/jira/browse/MESOS-5887
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Alexander Rukletsov
>
> Currently, 
> [{{DispatchEvent}}|https://github.com/apache/mesos/blob/e8ebbe5fe4189ef7ab046da2276a6abee41deeb2/3rdparty/libprocess/include/process/event.hpp#L148]
>  does not include any user-friendly information about the actual method being 
> dispatched. Including that information would help simplify triaging and debugging, 
> e.g., when using the {{\_\_processes\_\_}} endpoint. Currently we print the [event type 
> only|https://github.com/apache/mesos/blob/e8ebbe5fe4189ef7ab046da2276a6abee41deeb2/3rdparty/libprocess/src/process.cpp#L3198-L3203].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5886) FUTURE_DISPATCH may react on irrelevant dispatch.

2016-07-22 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389299#comment-15389299
 ] 

Benjamin Bannier commented on MESOS-5886:
-

It seems it might be enough to store just a pointer to the member function in the 
{{DispatchEvent}} and then do the matching with that. If we ever need to 
examine the {{type_info}} of the stored member, it should be possible to get it 
from that pointer as well (currently we do not really use it except for the fuzzy 
match you reported here).
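
A small standalone illustration of both the problem and the proposed fix, in plain C++
outside of libprocess: the {{type_info}} of two member-function pointers with the same
signature is identical, while the pointer values themselves still distinguish the
methods.

{code}
#include <iostream>
#include <typeinfo>

struct Example
{
  bool func1(bool b) { return b; }
  bool func1_same_but_different(bool b) { return !b; }
};

int main()
{
  bool (Example::*a)(bool) = &Example::func1;
  bool (Example::*b)(bool) = &Example::func1_same_but_different;

  // typeid only sees the pointer type, which is the same for both methods,
  // so matching on type_info alone cannot tell them apart.
  std::cout << std::boolalpha << (typeid(a) == typeid(b)) << std::endl;  // true

  // Comparing the member-function pointers themselves does distinguish them.
  std::cout << (a == b) << std::endl;  // false

  return 0;
}
{code}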

> FUTURE_DISPATCH may react on irrelevant dispatch.
> -
>
> Key: MESOS-5886
> URL: https://issues.apache.org/jira/browse/MESOS-5886
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>  Labels: mesosphere, tech-debt, tech-debt-test
>
> [{{FUTURE_DISPATCH}}|https://github.com/apache/mesos/blob/e8ebbe5fe4189ef7ab046da2276a6abee41deeb2/3rdparty/libprocess/include/process/gmock.hpp#L50]
>  uses 
> [{{DispatchMatcher}}|https://github.com/apache/mesos/blob/e8ebbe5fe4189ef7ab046da2276a6abee41deeb2/3rdparty/libprocess/include/process/gmock.hpp#L350]
>  to figure out whether a processed {{DispatchEvent}} is the same one the user is 
> waiting for. However, comparing {{std::type_info}} of function pointers is 
> not enough: different class methods with the same signature will be matched. 
> Here is the test that proves this:
> {noformat}
> class DispatchProcess : public Process<DispatchProcess>
> {
> public:
>   MOCK_METHOD0(func0, void());
>   MOCK_METHOD1(func1, bool(bool));
>   MOCK_METHOD1(func1_same_but_different, bool(bool));
>   MOCK_METHOD1(func2, Future<bool>(bool));
>   MOCK_METHOD1(func3, int(int));
>   MOCK_METHOD2(func4, Future<bool>(bool, int));
> };
> {noformat}
> {noformat}
> TEST(ProcessTest, DispatchMatch)
> {
>   DispatchProcess process;
>   PID<DispatchProcess> pid = spawn(&process);
>   Future<Nothing> future = FUTURE_DISPATCH(
>       pid,
>       &DispatchProcess::func1_same_but_different);
>   EXPECT_CALL(process, func1(_))
>     .WillOnce(ReturnArg<0>());
>   dispatch(pid, &DispatchProcess::func1, true);
>   AWAIT_READY(future);
>   terminate(pid);
>   wait(pid);
> }
> {noformat}
> The test passes:
> {noformat}
> [ RUN  ] ProcessTest.DispatchMatch
> [   OK ] ProcessTest.DispatchMatch (1 ms)
> {noformat}
> This change was introduced in https://reviews.apache.org/r/28052/.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-1718) Command executor can overcommit the slave.

2016-07-22 Thread Christopher Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389230#comment-15389230
 ] 

Christopher Hunt edited comment on MESOS-1718 at 7/22/16 10:27 AM:
---

I realise that this is an old comment, but I'm curious to learn whether always 
adding an executor's resources can be avoided.

I have a situation where my executor requires 1.0 cpu. This means that every 
task it runs will require at least 1.0 cpu, and we'll quickly use up the cpu 
count for a given node. On a large EC2 instance of 4 cpus, I can probably run 
just two tasks with my executor.

...or does the addition play no part in scheduling...


was (Author: huntc):
I realise that this is an old comment, but I'm curious to learn whether always 
adding an executor's resources can be avoided.

I have a situation where my executor requires 1.0 cpu. This means that every 
task it runs will require at least 1.0 cpu, and we'll quickly use up the cpu 
count for a given node. On a large EC2 instance of 4 cpus, I can probably run 
just two tasks with my executor.

> Command executor can overcommit the slave.
> --
>
> Key: MESOS-1718
> URL: https://issues.apache.org/jira/browse/MESOS-1718
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Benjamin Mahler
>
> Currently we give a small amount of resources to the command executor, in 
> addition to resources used by the command task:
> https://github.com/apache/mesos/blob/0.20.0-rc1/src/slave/slave.cpp#L2448
> {code: title=}
> ExecutorInfo Slave::getExecutorInfo(
> const FrameworkID& frameworkId,
> const TaskInfo& task)
> {
>   ...
> // Add an allowance for the command executor. This does lead to a
> // small overcommit of resources.
> executor.mutable_resources()->MergeFrom(
> Resources::parse(
>   "cpus:" + stringify(DEFAULT_EXECUTOR_CPUS) + ";" +
>   "mem:" + stringify(DEFAULT_EXECUTOR_MEM.megabytes())).get());
>   ...
> }
> {code}
> This leads to an overcommit of the slave. Ideally, for command tasks we can 
> "transfer" all of the task resources to the executor at the slave / isolation 
> level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5887) Enhance DispatchEvent to include demangled method name.

2016-07-22 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-5887:
--

 Summary: Enhance DispatchEvent to include demangled method name.
 Key: MESOS-5887
 URL: https://issues.apache.org/jira/browse/MESOS-5887
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: Alexander Rukletsov


Currently, 
[{{DispatchEvent}}|https://github.com/apache/mesos/blob/e8ebbe5fe4189ef7ab046da2276a6abee41deeb2/3rdparty/libprocess/include/process/event.hpp#L148]
 does not include any user-friendly information about the actual method being 
dispatched. Including that information would help simplify triaging and debugging, 
e.g., when using the {{\_\_processes\_\_}} endpoint. Currently we print the [event type 
only|https://github.com/apache/mesos/blob/e8ebbe5fe4189ef7ab046da2276a6abee41deeb2/3rdparty/libprocess/src/process.cpp#L3198-L3203].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5886) FUTURE_DISPATCH may react on irrelevant dispatch.

2016-07-22 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5886:
---
Summary: FUTURE_DISPATCH may react on irrelevant dispatch.  (was: 
FUTURE_DISPATCH may react on wrong dispatch.)

> FUTURE_DISPATCH may react on irrelevant dispatch.
> -
>
> Key: MESOS-5886
> URL: https://issues.apache.org/jira/browse/MESOS-5886
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>  Labels: mesosphere, tech-debt, tech-debt-test
>
> [{{FUTURE_DISPATCH}}|https://github.com/apache/mesos/blob/e8ebbe5fe4189ef7ab046da2276a6abee41deeb2/3rdparty/libprocess/include/process/gmock.hpp#L50]
>  uses 
> [{{DispatchMatcher}}|https://github.com/apache/mesos/blob/e8ebbe5fe4189ef7ab046da2276a6abee41deeb2/3rdparty/libprocess/include/process/gmock.hpp#L350]
>  to figure out whether a processed {{DispatchEvent}} is the same one the user is 
> waiting for. However, comparing {{std::type_info}} of function pointers is 
> not enough: different class methods with the same signature will be matched. 
> Here is the test that proves this:
> {noformat}
> class DispatchProcess : public Process<DispatchProcess>
> {
> public:
>   MOCK_METHOD0(func0, void());
>   MOCK_METHOD1(func1, bool(bool));
>   MOCK_METHOD1(func1_same_but_different, bool(bool));
>   MOCK_METHOD1(func2, Future<bool>(bool));
>   MOCK_METHOD1(func3, int(int));
>   MOCK_METHOD2(func4, Future<bool>(bool, int));
> };
> {noformat}
> {noformat}
> TEST(ProcessTest, DispatchMatch)
> {
>   DispatchProcess process;
>   PID<DispatchProcess> pid = spawn(&process);
>   Future<Nothing> future = FUTURE_DISPATCH(
>       pid,
>       &DispatchProcess::func1_same_but_different);
>   EXPECT_CALL(process, func1(_))
>     .WillOnce(ReturnArg<0>());
>   dispatch(pid, &DispatchProcess::func1, true);
>   AWAIT_READY(future);
>   terminate(pid);
>   wait(pid);
> }
> {noformat}
> The test passes:
> {noformat}
> [ RUN  ] ProcessTest.DispatchMatch
> [   OK ] ProcessTest.DispatchMatch (1 ms)
> {noformat}
> This change was introduced in https://reviews.apache.org/r/28052/.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5886) FUTURE_DISPATCH may react on wrong dispatch.

2016-07-22 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-5886:
--

 Summary: FUTURE_DISPATCH may react on wrong dispatch.
 Key: MESOS-5886
 URL: https://issues.apache.org/jira/browse/MESOS-5886
 Project: Mesos
  Issue Type: Bug
Reporter: Alexander Rukletsov


[{{FUTURE_DISPATCH}}|https://github.com/apache/mesos/blob/e8ebbe5fe4189ef7ab046da2276a6abee41deeb2/3rdparty/libprocess/include/process/gmock.hpp#L50]
 uses 
[{{DispatchMatcher}}|https://github.com/apache/mesos/blob/e8ebbe5fe4189ef7ab046da2276a6abee41deeb2/3rdparty/libprocess/include/process/gmock.hpp#L350]
 to figure out whether a processed {{DispatchEvent}} is the same one the user is 
waiting for. However, comparing {{std::type_info}} of function pointers is not 
enough: different class methods with the same signature will be matched. Here is 
the test that proves this:
{noformat}
class DispatchProcess : public Process<DispatchProcess>
{
public:
  MOCK_METHOD0(func0, void());
  MOCK_METHOD1(func1, bool(bool));
  MOCK_METHOD1(func1_same_but_different, bool(bool));
  MOCK_METHOD1(func2, Future<bool>(bool));
  MOCK_METHOD1(func3, int(int));
  MOCK_METHOD2(func4, Future<bool>(bool, int));
};
{noformat}
{noformat}
TEST(ProcessTest, DispatchMatch)
{
  DispatchProcess process;

  PID<DispatchProcess> pid = spawn(&process);

  Future<Nothing> future = FUTURE_DISPATCH(
      pid,
      &DispatchProcess::func1_same_but_different);

  EXPECT_CALL(process, func1(_))
    .WillOnce(ReturnArg<0>());

  dispatch(pid, &DispatchProcess::func1, true);

  AWAIT_READY(future);

  terminate(pid);
  wait(pid);
}
{noformat}
The test passes:
{noformat}
[ RUN  ] ProcessTest.DispatchMatch
[   OK ] ProcessTest.DispatchMatch (1 ms)
{noformat}

This change was introduced in https://reviews.apache.org/r/28052/.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1718) Command executor can overcommit the slave.

2016-07-22 Thread Christopher Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389230#comment-15389230
 ] 

Christopher Hunt commented on MESOS-1718:
-

I realise that this is an old comment, but I'm curious to learn whether always 
adding an executor's resources can be avoided.

I have a situation where my executor requires 1.0 cpu. This means that every 
task it runs will require at least 1.0 cpu, and we'll quickly exhaust the cpus 
on a given node. On a large EC2 instance with 4 cpus, I can probably run just 
two tasks with my executor.
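
For concreteness, here is a rough sketch of that arithmetic (assuming one 
executor instance per task and that each task itself also needs about 1.0 cpu; 
these numbers describe my setup, not Mesos defaults):
{noformat}
// Hypothetical illustration only; not Mesos code.
#include <iostream>

int main()
{
  const double nodeCpus = 4.0;      // e.g. a 4-cpu EC2 instance
  const double executorCpus = 1.0;  // what the custom executor itself needs
  const double taskCpus = 1.0;      // assumed per-task requirement

  // One executor per task, so each task effectively costs 2.0 cpus.
  const int tasks = static_cast<int>(nodeCpus / (executorCpus + taskCpus));

  std::cout << tasks << " tasks fit on the node" << std::endl;  // prints: 2

  return 0;
}
{noformat}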

> Command executor can overcommit the slave.
> --
>
> Key: MESOS-1718
> URL: https://issues.apache.org/jira/browse/MESOS-1718
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Benjamin Mahler
>
> Currently we give a small amount of resources to the command executor, in 
> addition to resources used by the command task:
> https://github.com/apache/mesos/blob/0.20.0-rc1/src/slave/slave.cpp#L2448
> {code}
> ExecutorInfo Slave::getExecutorInfo(
> const FrameworkID& frameworkId,
> const TaskInfo& task)
> {
>   ...
> // Add an allowance for the command executor. This does lead to a
> // small overcommit of resources.
> executor.mutable_resources()->MergeFrom(
> Resources::parse(
>   "cpus:" + stringify(DEFAULT_EXECUTOR_CPUS) + ";" +
>   "mem:" + stringify(DEFAULT_EXECUTOR_MEM.megabytes())).get());
>   ...
> }
> {code}
> This leads to an overcommit of the slave. Ideally, for command tasks we can 
> "transfer" all of the task resources to the executor at the slave / isolation 
> level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5851) Create mechanism to control authentication between different HTTP endpoints

2016-07-22 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389196#comment-15389196
 ] 

Adam B commented on MESOS-5851:
---

I left out the endpoint help doc changes, as those require a little more 
thought, and can wait until after Mesos 1.0.

commit e8ebbe5fe4189ef7ab046da2276a6abee41deeb2
Author: Greg Mann 
Date:   Fri Jul 22 01:53:23 2016 -0700

Updated CHANGELOG for new HTTP authentication flags.

Review: https://reviews.apache.org/r/50332/

commit 70af2b04f038becb71108896f1c354477d55cb07
Author: Greg Mann 
Date:   Fri Jul 22 01:51:01 2016 -0700

Updated upgrades.md for new HTTP authentication flags.

Review: https://reviews.apache.org/r/50333/

commit 52ae4a97b5581e74841feeccaba1b6c7d8ec311f
Author: Greg Mann 
Date:   Fri Jul 22 01:42:38 2016 -0700

Added readonly/readwrite auth flags to the docs.

Review: https://reviews.apache.org/r/50322/

commit 6da4d2c90f25497eab0f3fdfb6cf039b50304fe1
Author: Zhitao Li 
Date:   Fri Jul 22 01:19:51 2016 -0700

Refactored common HTTP authenticator initialize into helper function.

Review: https://reviews.apache.org/r/50320/

commit f6fea54bd744ca7fc698449b2879b03ae1cb0ed4
Author: Zhitao Li 
Date:   Thu Jul 21 23:43:34 2016 -0700

Separated AuthN for readonly and readwrite endpoints in Mesos.

Changes included:
- separate flags for readonly and readwrite endpoints;
- helper function for registering http authenticator;
- fixing existing tests.

Review: https://reviews.apache.org/r/50223/

commit 3a52c3d65f311a9582de48ff58f721d047dd12fd
Author: Zhitao Li 
Date:   Thu Jul 21 22:39:02 2016 -0700

Separated readonly and readwrite realms in libprocess.

Review: https://reviews.apache.org/r/50277/
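
As a rough sketch of what the split looks like at the flag level (the names 
below follow the commit messages above, but this is not the actual Mesos flag 
definition and the real code may differ):
{noformat}
// Hypothetical sketch using stout's flags library; assumed flag names.
#include <stout/flags.hpp>

struct ExampleFlags : public virtual flags::FlagsBase
{
  ExampleFlags()
  {
    add(&ExampleFlags::authenticate_http_readonly,
        "authenticate_http_readonly",
        "Require HTTP authentication for read-only endpoints (e.g. /state).",
        false);

    add(&ExampleFlags::authenticate_http_readwrite,
        "authenticate_http_readwrite",
        "Require HTTP authentication for read-write endpoints (e.g. /reserve).",
        false);
  }

  bool authenticate_http_readonly;
  bool authenticate_http_readwrite;
};
{noformat}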



> Create mechanism to control authentication between different HTTP endpoints
> ---
>
> Key: MESOS-5851
> URL: https://issues.apache.org/jira/browse/MESOS-5851
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.0.0
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>  Labels: mesosphere, security
> Fix For: 1.0.0
>
>
> Authentication for all endpoints is controlled by a single flag. We need this 
> flag to be on so that `/reserve` and `/unreserve` can get a principal.
> However, after 1.0, we cannot access important read-only endpoints 
> `/master/state` and `/metrics/snapshot` anymore without a password. The 
> latter is detrimental to usability because many users don't yet have the 
> supporting infrastructure to distribute such metrics to every 
> metrics-collecting process.
> I'm looking for a mechanism that at least allows unauthenticated access to 
> selected whitelisted endpoints while keeping endpoints that require 
> AuthN/AuthZ protected.
> Quoting Joseph Wu, "we want a `--authenticate_http=true, but don't check` 
> option"
> Proposed endpoint to realm grouping by [~zhitao]
> {quote}
> /
> // Common realms shared by both master and agent
> ​
> FLAGS
> - /flags
> ​
> FILES
> - /files/browse
> - /files/browse.json
> - /files/debug
> - /files/debug.json
> - /files/download
> - /files/download.json
> - /files/read
> - /files/read.json
> ​
> LOGGING
> - /logging/toggle
> ​
> METRICS
> - /metrics/snapshot
> ​
> PROFILER
> - /profiler/start
> - /profiler/stop
> ​
> SYSTEMS
> - /system/stats.json
> ​
> VERSIONS
> - /version
> ​
> /
> // Additional master only realms
> ​
> MAINTENANCE
> - /machine/down
> - /machine/up
> - /maintenance/schedule
> - /maintenance/status
> ​
> OPERATORS
> - /api/v1
> ​
> SCHEDULERS
> - /api/v1/scheduler
> ​
> REGISTRARS
> - /registrar(id)/registry
> ​
> RESERVATIONS
> - /reserve
> - /unreserve
> - /quota
> - /weights
> ​
> TEARDOWN
> - /teardown
> ​
> VIEWS
> - /frameworks
> - /roles
> - /roles.json
> - /slaves
> - /state
> - /state-summary
> - /state.json
> - /tasks
> - /tasks.json
> ​
> VOLUMES
> - /create-volumes
> - /destroy-volumes
> ​
> UNAUTHENTICATED
> - /health
> - /redirect
> ​
> 
> // Additional agent realms
> 
> ​
> OPERATORS
> - /api/v1
> ​
> VIEWS
> - /containers
> - /monitor/statistics
> - /monitor/statistics.json
> - /state
> - /state.json
> ​
> UNAUTHENTICATED
> - /api/v1/executor
> - /health
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5851) Create mechanism to control authentication between different HTTP endpoints

2016-07-22 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389134#comment-15389134
 ] 

Greg Mann commented on MESOS-5851:
--

Here's a simple patch for documentation updates to configuration.md and 
authentication.md:
https://reviews.apache.org/r/50322/

And updates to the CHANGELOG and upgrades.md:
https://reviews.apache.org/r/50332/
https://reviews.apache.org/r/50333/

And here are a few patches that alter the endpoint help strings to mention 
whether an endpoint is read-only or read-write. These are bigger changes, so 
I'm not sure whether we want to try to merge them at the moment:
https://reviews.apache.org/r/50329/
https://reviews.apache.org/r/50330/
https://reviews.apache.org/r/50331/

> Create mechanism to control authentication between different HTTP endpoints
> ---
>
> Key: MESOS-5851
> URL: https://issues.apache.org/jira/browse/MESOS-5851
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.0.0
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>  Labels: mesosphere, security
> Fix For: 1.0.0
>
>
> Authentication for all endpoints is controlled by a single flag. We need this 
> flag to be on so that `/reserve` and `/unreserve` can get a principal.
> However, after 1.0, we cannot access important read-only endpoints 
> `/master/state` and `/metrics/snapshot` anymore without a password. The 
> latter is detrimental to usability because many users don't yet have the 
> supporting infrastructure to distribute such metrics to every 
> metrics-collecting process.
> I'm looking for a mechanism that at least allows unauthenticated access to 
> selected whitelisted endpoints while keeping endpoints that require 
> AuthN/AuthZ protected.
> Quoting Joseph Wu, "we want a `--authenticate_http=true, but don't check` 
> option"
> Proposed endpoint to realm grouping by [~zhitao]
> {quote}
> /
> // Common realms shared by both master and agent
> ​
> FLAGS
> - /flags
> ​
> FILES
> - /files/browse
> - /files/browse.json
> - /files/debug
> - /files/debug.json
> - /files/download
> - /files/download.json
> - /files/read
> - /files/read.json
> ​
> LOGGING
> - /logging/toggle
> ​
> METRICS
> - /metrics/snapshot
> ​
> PROFILER
> - /profiler/start
> - /profiler/stop
> ​
> SYSTEMS
> - /system/stats.json
> ​
> VERSIONS
> - /version
> ​
> /
> // Additional master only realms
> ​
> MAINTENANCE
> - /machine/down
> - /machine/up
> - /maintenance/schedule
> - /maintenance/status
> ​
> OPERATORS
> - /api/v1
> ​
> SCHEDULERS
> - /api/v1/scheduler
> ​
> REGISTRARS
> - /registrar(id)/registry
> ​
> RESERVATIONS
> - /reserve
> - /unreserve
> - /quota
> - /weights
> ​
> TEARDOWN
> - /teardown
> ​
> VIEWS
> - /frameworks
> - /roles
> - /roles.json
> - /slaves
> - /state
> - /state-summary
> - /state.json
> - /tasks
> - /tasks.json
> ​
> VOLUMES
> - /create-volumes
> - /destroy-volumes
> ​
> UNAUTHENTICATED
> - /health
> - /redirect
> ​
> 
> // Additional agent realms
> 
> ​
> OPERATORS
> - /api/v1
> ​
> VIEWS
> - /containers
> - /monitor/statistics
> - /monitor/statistics.json
> - /state
> - /state.json
> ​
> UNAUTHENTICATED
> - /api/v1/executor
> - /health
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5106) Improve test_http_framework so it can load master detector from modules

2016-07-22 Thread zhou xing (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389097#comment-15389097
 ] 

zhou xing commented on MESOS-5106:
--

Per discussion with Kapil, the framework will use the HTTP interface in the 
future instead, so I'm closing this ticket. Feel free to re-open it if 
necessary.

> Improve test_http_framework so it can load master detector from modules
> ---
>
> Key: MESOS-5106
> URL: https://issues.apache.org/jira/browse/MESOS-5106
> Project: Mesos
>  Issue Type: Task
>Reporter: Shuai Lin
>Assignee: zhou xing
>
> I'm planning to restart the work on [MESOS-1806] (etcd contender/detector) 
> based on [MESOS-4610]. One thing I need to address first is that, when writing 
> a script test, I need a framework that can use a master detector loaded from a 
> module. The best way to do this seems to be adding {{\-\-modules}} and 
> {{\-\-master_detector}} flags to {{test_http_framework.cpp}} so we can reuse 
> it in tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)