[jira] [Commented] (MESOS-4065) slave FD for ZK tcp connection leaked to executor process

2015-12-08 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046570#comment-15046570
 ] 

Till Toenshoff commented on MESOS-4065:
---

From your results, this conclusion seems sensible to me.

We should file a bug report in the ZooKeeper JIRA so it can be properly 
handled upstream 
(https://issues.apache.org/jira/browse/ZOOKEEPER/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
 
Could you please take care of that [~jdef]? 
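
For reference, the usual fix on the client side is to mark the ZooKeeper socket 
close-on-exec so that forked children (such as executors) do not inherit it. A 
minimal sketch in C++, assuming {{fd}} is the ZooKeeper client's TCP socket 
(the helper name is made up for illustration):

{code}
#include <fcntl.h>

#include <cerrno>
#include <cstring>
#include <iostream>

// Mark an already-open descriptor close-on-exec so children created via
// fork()+exec() (e.g. executors) do not inherit it.
bool setCloexec(int fd)
{
  int flags = fcntl(fd, F_GETFD);
  if (flags == -1) {
    std::cerr << "fcntl(F_GETFD): " << strerror(errno) << std::endl;
    return false;
  }

  if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1) {
    std::cerr << "fcntl(F_SETFD): " << strerror(errno) << std::endl;
    return false;
  }

  return true;
}
{code}

Note that this still leaves a window between socket() and fcntl() in which a 
fork can happen; creating the socket with SOCK_CLOEXEC (where available) closes 
that race.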

> slave FD for ZK tcp connection leaked to executor process
> -
>
> Key: MESOS-4065
> URL: https://issues.apache.org/jira/browse/MESOS-4065
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.24.1, 0.25.0
>Reporter: James DeFelice
>  Labels: mesosphere, security
>
> {code}
> core@ip-10-0-0-45 ~ $ ps auxwww|grep -e etcd
> root  1432 99.3  0.0 202420 12928 ?Rsl  21:32  13:51 
> ./etcd-mesos-executor -log_dir=./
> root  1450  0.4  0.1  38332 28752 ?Sl   21:32   0:03 ./etcd 
> --data-dir=etcd_data --name=etcd-1449178273 
> --listen-peer-urls=http://10.0.0.45:1025 
> --initial-advertise-peer-urls=http://10.0.0.45:1025 
> --listen-client-urls=http://10.0.0.45:1026 
> --advertise-client-urls=http://10.0.0.45:1026 
> --initial-cluster=etcd-1449178273=http://10.0.0.45:1025,etcd-1449178271=http://10.0.2.95:1025,etcd-1449178272=http://10.0.2.216:1025
>  --initial-cluster-state=existing
> core  1651  0.0  0.0   6740   928 pts/0S+   21:46   0:00 grep 
> --colour=auto -e etcd
> core@ip-10-0-0-45 ~ $ sudo lsof -p 1432|grep -e 2181
> etcd-meso 1432 root   10u IPv4  21973  0t0TCP 
> ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181
>  (ESTABLISHED)
> core@ip-10-0-0-45 ~ $ ps auxwww|grep -e slave
> root  1124  0.2  0.1 900496 25736 ?Ssl  21:11   0:04 
> /opt/mesosphere/packages/mesos--52cbecde74638029c3ba0ac5e5ab81df8debf0fa/sbin/mesos-slave
> core  1658  0.0  0.0   6740   832 pts/0S+   21:46   0:00 grep 
> --colour=auto -e slave
> core@ip-10-0-0-45 ~ $ sudo lsof -p 1124|grep -e 2181
> mesos-sla 1124 root   10u IPv4  21973  0t0TCP 
> ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181
>  (ESTABLISHED)
> {code}
> I only tested against mesos 0.24.1 and 0.25.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4065) slave FD for ZK tcp connection leaked to executor process

2015-12-08 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046588#comment-15046588
 ] 

Till Toenshoff commented on MESOS-4065:
---

A tool that has been rather useful for debugging such issues within Mesos: 
https://github.com/tillt/mesos/commit/d6982ece26121c599426e6b5c573e8d8afeff837


> slave FD for ZK tcp connection leaked to executor process
> -
>
> Key: MESOS-4065
> URL: https://issues.apache.org/jira/browse/MESOS-4065
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.24.1, 0.25.0
>Reporter: James DeFelice
>  Labels: mesosphere, security
>
> {code}
> core@ip-10-0-0-45 ~ $ ps auxwww|grep -e etcd
> root  1432 99.3  0.0 202420 12928 ?Rsl  21:32  13:51 
> ./etcd-mesos-executor -log_dir=./
> root  1450  0.4  0.1  38332 28752 ?Sl   21:32   0:03 ./etcd 
> --data-dir=etcd_data --name=etcd-1449178273 
> --listen-peer-urls=http://10.0.0.45:1025 
> --initial-advertise-peer-urls=http://10.0.0.45:1025 
> --listen-client-urls=http://10.0.0.45:1026 
> --advertise-client-urls=http://10.0.0.45:1026 
> --initial-cluster=etcd-1449178273=http://10.0.0.45:1025,etcd-1449178271=http://10.0.2.95:1025,etcd-1449178272=http://10.0.2.216:1025
>  --initial-cluster-state=existing
> core  1651  0.0  0.0   6740   928 pts/0S+   21:46   0:00 grep 
> --colour=auto -e etcd
> core@ip-10-0-0-45 ~ $ sudo lsof -p 1432|grep -e 2181
> etcd-meso 1432 root   10u IPv4  21973  0t0TCP 
> ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181
>  (ESTABLISHED)
> core@ip-10-0-0-45 ~ $ ps auxwww|grep -e slave
> root  1124  0.2  0.1 900496 25736 ?Ssl  21:11   0:04 
> /opt/mesosphere/packages/mesos--52cbecde74638029c3ba0ac5e5ab81df8debf0fa/sbin/mesos-slave
> core  1658  0.0  0.0   6740   832 pts/0S+   21:46   0:00 grep 
> --colour=auto -e slave
> core@ip-10-0-0-45 ~ $ sudo lsof -p 1124|grep -e 2181
> mesos-sla 1124 root   10u IPv4  21973  0t0TCP 
> ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181
>  (ESTABLISHED)
> {code}
> I only tested against mesos 0.24.1 and 0.25.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4075) Continue test suite execution across crashing tests.

2015-12-08 Thread Klaus Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15046890#comment-15046890
 ] 

Klaus Ma commented on MESOS-4075:
-

+1, maybe we can intercept CHECK failures to avoid crashes when testing.
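
One way to get that behavior without touching the tests is a small driver that 
runs each test in its own process and rates a crash like any other failure. A 
hypothetical sketch ({{--gtest_list_tests}} and {{--gtest_filter}} are standard 
gtest flags; the driver itself is not part of mesos-tests.sh):

{code}
#include <sys/wait.h>
#include <unistd.h>

#include <cstdio>
#include <iostream>
#include <string>
#include <vector>

int main(int argc, char** argv)
{
  const std::string binary = argc > 1 ? argv[1] : "./mesos-tests";

  // Collect test names from the binary's --gtest_list_tests output:
  // suite lines are unindented and end with '.', test lines are indented.
  std::vector<std::string> tests;
  FILE* pipe = popen((binary + " --gtest_list_tests").c_str(), "r");
  if (pipe == nullptr) {
    perror("popen");
    return 1;
  }

  char line[1024];
  std::string suite;
  while (fgets(line, sizeof(line), pipe) != nullptr) {
    std::string s(line);
    size_t comment = s.find('#'); // Parameterized tests append "# GetParam()...".
    if (comment != std::string::npos) s.resize(comment);
    while (!s.empty() && (s.back() == '\n' || s.back() == ' ')) s.pop_back();
    if (s.empty()) continue;
    if (s[0] != ' ') suite = s;                 // "SuiteName."
    else tests.push_back(suite + s.substr(2));  // "  TestName"
  }
  pclose(pipe);

  int failures = 0;
  for (const std::string& test : tests) {
    pid_t pid = fork();
    if (pid == 0) {
      execlp(binary.c_str(), binary.c_str(),
             ("--gtest_filter=" + test).c_str(), (char*) nullptr);
      _exit(127); // exec failed.
    }

    int status = 0;
    waitpid(pid, &status, 0);

    // A crash (signal) is rated as failed, just like a nonzero exit.
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
      std::cerr << "FAILED: " << test << std::endl;
      failures++;
    }
  }

  std::cout << failures << " failing test(s)" << std::endl;
  return failures == 0 ? 0 : 1;
}
{code}

Running one process per test is slower than a single run, but it isolates 
crashes and keeps the rest of the suite going.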

> Continue test suite execution across crashing tests.
> 
>
> Key: MESOS-4075
> URL: https://issues.apache.org/jira/browse/MESOS-4075
> Project: Mesos
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 0.26.0
>Reporter: Bernd Mathiske
>Assignee: Bernd Mathiske
>  Labels: mesosphere
>
> Currently, mesos-tests.sh exits when a test crashes. This is inconvenient 
> when trying to find all the tests that fail. 
> mesos-tests.sh should rate a test that crashes as failed and continue the 
> same way as if the test merely returned with a failure result and exited 
> properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4065) slave FD for ZK tcp connection leaked to executor process

2015-12-08 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047064#comment-15047064
 ] 

James DeFelice commented on MESOS-4065:
---

https://issues.apache.org/jira/browse/ZOOKEEPER-2338

> slave FD for ZK tcp connection leaked to executor process
> -
>
> Key: MESOS-4065
> URL: https://issues.apache.org/jira/browse/MESOS-4065
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.24.1, 0.25.0
>Reporter: James DeFelice
>  Labels: mesosphere, security
>
> {code}
> core@ip-10-0-0-45 ~ $ ps auxwww|grep -e etcd
> root  1432 99.3  0.0 202420 12928 ?Rsl  21:32  13:51 
> ./etcd-mesos-executor -log_dir=./
> root  1450  0.4  0.1  38332 28752 ?Sl   21:32   0:03 ./etcd 
> --data-dir=etcd_data --name=etcd-1449178273 
> --listen-peer-urls=http://10.0.0.45:1025 
> --initial-advertise-peer-urls=http://10.0.0.45:1025 
> --listen-client-urls=http://10.0.0.45:1026 
> --advertise-client-urls=http://10.0.0.45:1026 
> --initial-cluster=etcd-1449178273=http://10.0.0.45:1025,etcd-1449178271=http://10.0.2.95:1025,etcd-1449178272=http://10.0.2.216:1025
>  --initial-cluster-state=existing
> core  1651  0.0  0.0   6740   928 pts/0S+   21:46   0:00 grep 
> --colour=auto -e etcd
> core@ip-10-0-0-45 ~ $ sudo lsof -p 1432|grep -e 2181
> etcd-meso 1432 root   10u IPv4  21973  0t0TCP 
> ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181
>  (ESTABLISHED)
> core@ip-10-0-0-45 ~ $ ps auxwww|grep -e slave
> root  1124  0.2  0.1 900496 25736 ?Ssl  21:11   0:04 
> /opt/mesosphere/packages/mesos--52cbecde74638029c3ba0ac5e5ab81df8debf0fa/sbin/mesos-slave
> core  1658  0.0  0.0   6740   832 pts/0S+   21:46   0:00 grep 
> --colour=auto -e slave
> core@ip-10-0-0-45 ~ $ sudo lsof -p 1124|grep -e 2181
> mesos-sla 1124 root   10u IPv4  21973  0t0TCP 
> ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181
>  (ESTABLISHED)
> {code}
> I only tested against mesos 0.24.1 and 0.25.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3925) Add HDFS based URI fetcher plugin.

2015-12-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-3925:
--
Sprint: Mesosphere Sprint 24

> Add HDFS based URI fetcher plugin.
> --
>
> Key: MESOS-3925
> URL: https://issues.apache.org/jira/browse/MESOS-3925
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>  Labels: twitter
>
> This plugin uses the HDFS client to fetch artifacts. It can support schemes 
> like hdfs/hftp/s3/s3n.
> It'll shell out to the hadoop command to do the actual fetching.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3951) Make HDFS tool wrappers asynchronous.

2015-12-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-3951:
--
Sprint: Mesosphere Sprint 24

> Make HDFS tool wrappers asynchronous.
> -
>
> Key: MESOS-3951
> URL: https://issues.apache.org/jira/browse/MESOS-3951
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Jie Yu
>
> The existing HDFS tool wrappers (src/hdfs/hdfs.hpp) are synchronous. They use 
> os::shell to shell out the 'hadoop' commands. This makes them very hard to 
> reuse at other locations in the code base.
> The URI fetcher HDFS plugin will try to re-use the existing HDFS tool 
> wrappers. In order to do that, we need to make them asynchronous first.
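
For illustration, here is a rough sketch of what an asynchronous wrapper could 
look like on top of libprocess, chaining a Future off the child's exit status. 
The function name and the exact process::subprocess overload are assumptions 
for the sake of the example, not the actual hdfs.hpp interface:

{code}
#include <string>

#include <process/future.hpp>
#include <process/subprocess.hpp>

#include <stout/nothing.hpp>
#include <stout/option.hpp>
#include <stout/try.hpp>

// Hypothetical async wrapper: launch 'hadoop' without blocking the caller.
process::Future<Nothing> copyToLocal(
    const std::string& uri,
    const std::string& path)
{
  Try<process::Subprocess> s = process::subprocess(
      "hadoop fs -copyToLocal '" + uri + "' '" + path + "'");

  if (s.isError()) {
    return process::Failure("Failed to launch hadoop: " + s.error());
  }

  // status() yields the wait status once the child exits; no thread blocks.
  return s.get().status()
    .then([](const Option<int>& status) -> process::Future<Nothing> {
      if (status.isNone() || status.get() != 0) {
        return process::Failure("hadoop command failed");
      }
      return Nothing();
    });
}
{code}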



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4084) mesos-slave assigned marathon task wrongly to chronos framework after task failure

2015-12-08 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047202#comment-15047202
 ] 

Vinod Kone commented on MESOS-4084:
---

Hmm. This is really bizarre. The framework id for a status update is encoded in 
the status update message itself, which is sent by the executor (driver). Can 
you paste the complete slave log lines between 8:58 and 9:03? Log lines from 
the executor's stdout/stderr would also be useful.

> mesos-slave assigned marathon task wrongly to chronos framework after task 
> failure
> --
>
> Key: MESOS-4084
> URL: https://issues.apache.org/jira/browse/MESOS-4084
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.22.2
> Environment: Ubuntu 14.04.2 LTS
> Mesos 0.22.2
> Marathon 0.11.0
> Chronos 2.4.0
>Reporter: Erhan Kesken
>Priority: Minor
>
> I don't know how to reproduce the problem; the only thing I can do is share 
> my logs:
> https://gist.github.com/ekesken/f2edfd65cca8638b0136
> These are highlights from my logs:
> mesos-slave logs:
> {noformat}
> Dec  7 08:58:27 mesos-slave-node-012 mesos-slave[56099]: I1207 
> 08:58:27.089156 56130 slave.cpp:2531] Handling status update TASK_FAILED 
> (UUID: 5b335fab-1722-4270-83a6-b4ec843be47f) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  of framework 20151113-112010-100670892-5050-7957-0001 from 
> executor(1)@172.29.1.12:1651
> 08:58:27 mesos-slave-node-012 mesos-slave[56099]: E1207 08:58:27.089874 56128 
> slave.cpp:2662] Failed to update resources for container 
> ed5f4f67-464d-4786-9628-cd48732de6b7 of executor 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  running task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  on status update for terminal task, destroying container: Failed to 
> determine cgroup for the 'cpu' subsystem: Failed to read /proc/34074/cgroup: 
> Failed to open file '/proc/34074/cgroup': No such file or directory
> {noformat}
> Notice the framework id above; 5 minutes later we got the following logs:
> {noformat}
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 
> 09:03:27.653187 56130 slave.cpp:2531] Handling status update TASK_RUNNING 
> (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001 from executor(1)@172.29.1.12:1651
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: W1207 
> 09:03:27.653282 56130 slave.cpp:2568] Could not find the executor for status 
> update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 
> 09:03:27.653390 56130 status_update_manager.cpp:317] Received status update 
> TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 
> 09:03:27.653543 56130 slave.cpp:2776] Forwarding the update TASK_RUNNING 
> (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001 to master@172.29.0.5:5050
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 
> 09:03:27.653688 56130 slave.cpp:2709] Sending acknowledgement for status 
> update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001 to executor(1)@172.29.1.12:1651
> Dec  7 09:03:37 mesos-slave-node-012 mesos-slave[56099]: W1207 
> 09:03:37.654337 56134 status_update_manager.cpp:472] Resending status update 
> TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001
> {noformat} 
> This caused immediate deactivation of Chronos, as seen in the mesos-master log:
> {noformat}
> Dec  7 09:03:27 mesos-master-node-001 mesos-master[40898]: I1207 
> 09:03:27.654770 40948 master.cpp:1964] Deactivating framework 
> 

[jira] [Commented] (MESOS-3828) Strategy for Utilizing Docker 1.9 Multihost Networking

2015-12-08 Thread Spike Curtis (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047127#comment-15047127
 ] 

Spike Curtis commented on MESOS-3828:
-

We're already well on the road to getting IP-per-container networking into the 
MesosContainerizer.  I would strongly advocate for a unified networking layer 
that handles both Docker and Mesos containers.  This work is centered at:

https://github.com/mesosphere/net-modules

We can straightforwardly enhance this work to extend multihost networking to 
Docker containers as well, without the need for Docker's libnetwork. That 
gives us the advantage of all tasks getting a consistent, unified, and 
easy-to-understand network function, rather than attempting to mate Docker's 
opinionated network model with Mesos container networking.

> Strategy for Utilizing Docker 1.9 Multihost Networking
> --
>
> Key: MESOS-3828
> URL: https://issues.apache.org/jira/browse/MESOS-3828
> Project: Mesos
>  Issue Type: Story
>  Components: isolation
>Affects Versions: 0.26.0
>Reporter: John Omernik
>Assignee: Timothy Chen
>  Labels: Docker, isolation, mesosphere, network, plugins
>
> This is a user story to discuss the strategy for Mesos to use the new Docker 
> 1.9 feature: Multihost Networking. 
> http://blog.docker.com/2015/11/docker-multi-host-networking-ga/
> Basically we should determine if this is something we want to work with from 
> a standpoint of container isolation and, going forward, how we can best 
> integrate. 
> The space for networking in Mesos is growing fast, with IP-per-container and 
> other networking modules being worked on. Projects like Project Calico offer 
> services from outside the Mesos community that plug nicely, or will plug 
> nicely, into Mesos.
> So how about Multihost Networking? Is it an option to work with? With Docker 
> being a first-class citizen of Mesos, this is something we should be 
> considering.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3951) Make HDFS tool wrappers asynchronous.

2015-12-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-3951:
--
Story Points: 5

> Make HDFS tool wrappers asynchronous.
> -
>
> Key: MESOS-3951
> URL: https://issues.apache.org/jira/browse/MESOS-3951
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Jie Yu
>
> The existing HDFS tool wrappers (src/hdfs/hdfs.hpp) are synchronous. They use 
> os::shell to shell out the 'hadoop' commands. This makes them very hard to 
> reuse at other locations in the code base.
> The URI fetcher HDFS plugin will try to re-use the existing HDFS tool 
> wrappers. In order to do that, we need to make them asynchronous first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3925) Add HDFS based URI fetcher plugin.

2015-12-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-3925:
--
Story Points: 3

> Add HDFS based URI fetcher plugin.
> --
>
> Key: MESOS-3925
> URL: https://issues.apache.org/jira/browse/MESOS-3925
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>  Labels: twitter
>
> This plugin uses the HDFS client to fetch artifacts. It can support schemes 
> like hdfs/hftp/s3/s3n.
> It'll shell out to the hadoop command to do the actual fetching.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3925) Add HDFS based URI fetcher plugin.

2015-12-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-3925:
--
Labels: mesosphere twitter  (was: twitter)

> Add HDFS based URI fetcher plugin.
> --
>
> Key: MESOS-3925
> URL: https://issues.apache.org/jira/browse/MESOS-3925
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>  Labels: mesosphere, twitter
>
> This plugin uses the HDFS client to fetch artifacts. It can support schemes 
> like hdfs/hftp/s3/s3n.
> It'll shell out to the hadoop command to do the actual fetching.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3951) Make HDFS tool wrappers asynchronous.

2015-12-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-3951:
--
Labels: mesosphere twitter  (was: )

> Make HDFS tool wrappers asynchronous.
> -
>
> Key: MESOS-3951
> URL: https://issues.apache.org/jira/browse/MESOS-3951
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Jie Yu
>  Labels: mesosphere, twitter
>
> The existing HDFS tool wrappers (src/hdfs/hdfs.hpp) are synchronous. They use 
> os::shell to shell out the 'hadoop' commands. This makes them very hard to 
> reuse at other locations in the code base.
> The URI fetcher HDFS plugin will try to re-use the existing HDFS tool 
> wrappers. In order to do that, we need to make them asynchronous first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-3925) Add HDFS based URI fetcher plugin.

2015-12-08 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-3925:
-

Assignee: Jie Yu

> Add HDFS based URI fetcher plugin.
> --
>
> Key: MESOS-3925
> URL: https://issues.apache.org/jira/browse/MESOS-3925
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Jie Yu
>  Labels: mesosphere, twitter
>
> This plugin uses the HDFS client to fetch artifacts. It can support schemes 
> like hdfs/hftp/s3/s3n.
> It'll shell out to the hadoop command to do the actual fetching.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4084) mesos-slave assigned marathon task wrongly to chronos framework after task failure

2015-12-08 Thread Erhan Kesken (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047311#comment-15047311
 ] 

Erhan Kesken commented on MESOS-4084:
-

I shared my complete slave log here: 
https://gist.github.com/ekesken/bed20adfba0995117d74 . Unfortunately, 
stdout/stderr files are not available.

> mesos-slave assigned marathon task wrongly to chronos framework after task 
> failure
> --
>
> Key: MESOS-4084
> URL: https://issues.apache.org/jira/browse/MESOS-4084
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.22.2
> Environment: Ubuntu 14.04.2 LTS
> Mesos 0.22.2
> Marathon 0.11.0
> Chronos 2.4.0
>Reporter: Erhan Kesken
>Priority: Minor
>
> I don't know how to reproduce the problem; the only thing I can do is share 
> my logs:
> https://gist.github.com/ekesken/f2edfd65cca8638b0136
> These are highlights from my logs:
> mesos-slave logs:
> {noformat}
> Dec  7 08:58:27 mesos-slave-node-012 mesos-slave[56099]: I1207 
> 08:58:27.089156 56130 slave.cpp:2531] Handling status update TASK_FAILED 
> (UUID: 5b335fab-1722-4270-83a6-b4ec843be47f) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  of framework 20151113-112010-100670892-5050-7957-0001 from 
> executor(1)@172.29.1.12:1651
> 08:58:27 mesos-slave-node-012 mesos-slave[56099]: E1207 08:58:27.089874 56128 
> slave.cpp:2662] Failed to update resources for container 
> ed5f4f67-464d-4786-9628-cd48732de6b7 of executor 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  running task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  on status update for terminal task, destroying container: Failed to 
> determine cgroup for the 'cpu' subsystem: Failed to read /proc/34074/cgroup: 
> Failed to open file '/proc/34074/cgroup': No such file or directory
> {noformat}
> Notice the framework id above; 5 minutes later we got the following logs:
> {noformat}
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 
> 09:03:27.653187 56130 slave.cpp:2531] Handling status update TASK_RUNNING 
> (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001 from executor(1)@172.29.1.12:1651
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: W1207 
> 09:03:27.653282 56130 slave.cpp:2568] Could not find the executor for status 
> update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 
> 09:03:27.653390 56130 status_update_manager.cpp:317] Received status update 
> TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 
> 09:03:27.653543 56130 slave.cpp:2776] Forwarding the update TASK_RUNNING 
> (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001 to master@172.29.0.5:5050
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 
> 09:03:27.653688 56130 slave.cpp:2709] Sending acknowledgement for status 
> update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001 to executor(1)@172.29.1.12:1651
> Dec  7 09:03:37 mesos-slave-node-012 mesos-slave[56099]: W1207 
> 09:03:37.654337 56134 status_update_manager.cpp:472] Resending status update 
> TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001
> {noformat} 
> This caused immediate deactivation of Chronos, as seen in the mesos-master log:
> {noformat}
> Dec  7 09:03:27 mesos-master-node-001 mesos-master[40898]: I1207 
> 09:03:27.654770 40948 master.cpp:1964] Deactivating framework 
> 20150624-210230-117448108-5050-3678-0001 (chronos-2.4.0) at 
> scheduler-7a4396f7-1f68-4f41-901e-805db5de0432@172.29.0.6:11893
> Dec  7 09:03:27 mesos-master-node-001 

[jira] [Created] (MESOS-4096) stout tests fail to build with external protobuf version

2015-12-08 Thread James Peach (JIRA)
James Peach created MESOS-4096:
--

 Summary: stout tests fail to build with external protobuf version
 Key: MESOS-4096
 URL: https://issues.apache.org/jira/browse/MESOS-4096
 Project: Mesos
  Issue Type: Bug
  Components: build, stout
Reporter: James Peach


Using the following configure options:
{code}
prefix/configure \
--disable-java \
--disable-python \
--enable-silent-rules \
--enable-debug \
--with-apr=$(apr-1-config --prefix) \
--with-protobuf=$(pkg-config --variable=prefix protobuf-lite)
{code}

The stout tests fail to build because code generated with a different protobuf 
version is checked in and not regenerated:

{code}
  CXX  stout_tests-protobuf_tests.pb.o
In file included from 
/Users/jpeach/src/mesos.git/3rdparty/libprocess/3rdparty/stout/tests/protobuf_tests.pb.cc:5:
/Users/jpeach/src/mesos.git/3rdparty/libprocess/3rdparty/stout/tests/protobuf_tests.pb.h:17:2:
 error: This
  file was generated by an older version of protoc which is
#error This file was generated by an older version of protoc which is
 ^
/Users/jpeach/src/mesos.git/3rdparty/libprocess/3rdparty/stout/tests/protobuf_tests.pb.h:18:2:
 error:
  incompatible with your Protocol Buffer headers. Please
#error incompatible with your Protocol Buffer headers.  Please
 ^
/Users/jpeach/src/mesos.git/3rdparty/libprocess/3rdparty/stout/tests/protobuf_tests.pb.h:19:2:
 error:
  regenerate this file with a newer version of protoc.
#error regenerate this file with a newer version of protoc.
 ^
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4048) Consider unifying slave timeout behavior between steady state and master failover

2015-12-08 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047385#comment-15047385
 ] 

Benjamin Mahler edited comment on MESOS-4048 at 12/8/15 8:06 PM:
-

This ticket is independent from MESOS-4049 in that it is discussing the current 
inconsistent approaches to agent partition detection (case 1 and 2 above).

When we were implementing master recovery, we wanted to use health checking to 
determine when an agent is unhealthy, but there were some implementation 
difficulties that led to the addition of {{\-\-slave_reregistration_timer}} 
instead. This approach is a bit scary because we may remove healthy agents that 
for some reason (e.g. ZK connectivity issues) could not re-register with the 
master after master failover. This was why we put in place some safety nets 
({{\-\-recovery_slave_removal_limit}} and we were able to re-use used the 
removal rate limiting).

The point of this ticket is to look into removing 
{{\-\-slave_reregistration_timer}} entirely and have the master perform the 
same health check based partition detection that it does in the steady state.

So, MESOS-4049 is about what we do *when* an agent is unhealthy. This ticket is 
about *how* we determine that an agent is unhealthy. Specifically, we want to 
determine it in a consistent way rather than having one approach in steady 
state and a different approach after master failover.

Make sense?


was (Author: bmahler):
This ticket is independent from MESOS-4049 in that it is discussing the current 
inconsistent approaches to agent partition handling (case 1 and 2 above).

When we were implementing master recovery, we wanted to use health checking to 
determine when an agent should be removed, but there were some implementation 
difficulties that led to the addition of {{--slave_reregistration_timer}} 
instead. This approach is a bit scary because we may remove healthy agents that 
for some reason (e.g. ZK connectivity issues) could not re-register with the 
master after master failover. This was why we put in place some safety nets 
({{--recovery_slave_removal_limit}} and we were able to re-use used the removal 
rate limiting).

The point of this ticket is to look into removing 
{{--slave_reregistration_timer}} entirely and have the master perform the same 
health check based partition detection that it does in the steady state.

So, MESOS-4049 is about what we do *when* an agent is unhealthy (e.g. 
partitioned). This ticket is about *how* we determine that an agent is 
unhealthy (e.g. partitioned). Specifically, we want to determine it in a 
consistent way rather than having one approach in steady state and a different 
approach after master failover.

Make sense?

> Consider unifying slave timeout behavior between steady state and master 
> failover
> -
>
> Key: MESOS-4048
> URL: https://issues.apache.org/jira/browse/MESOS-4048
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, slave
>Reporter: Neil Conway
>Assignee: Anindya Sinha
>Priority: Minor
>  Labels: mesosphere
>
> Currently, there are two timeouts that control what happens when an agent is 
> partitioned from the master:
> 1. {{max_slave_ping_timeouts}} + {{slave_ping_timeout}} controls how long the 
> master waits before declaring a slave to be dead in the "steady state"
> 2. {{slave_reregister_timeout}} controls how long the master waits for a 
> slave to reregister after master failover.
> It is unclear whether these two cases really merit being treated differently 
> -- it might be simpler for operators to configure a single timeout that 
> controls how long the master waits before declaring that a slave is dead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4097) Change /roles endpoint to include quotas, weights, reserved resources?

2015-12-08 Thread Neil Conway (JIRA)
Neil Conway created MESOS-4097:
--

 Summary: Change /roles endpoint to include quotas, weights, 
reserved resources?
 Key: MESOS-4097
 URL: https://issues.apache.org/jira/browse/MESOS-4097
 Project: Mesos
  Issue Type: Improvement
Reporter: Neil Conway


MESOS-4085 changes the behavior of the {{/roles}} endpoint: rather than listing 
all the explicitly defined roles, we will now only list those roles that have 
one or more registered frameworks.

As suggested by [~alexr] in code review, this could be improved -- an operator 
might reasonably expect to see all the roles that have
* non-default weight
* non-default quota
* non-default ACLs?
* any static or dynamically reserved resources



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4099) parallel make tests does not build all test targets

2015-12-08 Thread Joris Van Remoortere (JIRA)
Joris Van Remoortere created MESOS-4099:
---

 Summary: parallel make tests does not build all test targets
 Key: MESOS-4099
 URL: https://issues.apache.org/jira/browse/MESOS-4099
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Affects Versions: 0.26.0
 Environment: Ubuntu 15.04
clang-3.6 as well as gcc-4.9
Reporter: Joris Van Remoortere
Assignee: Kapil Arya


When inside 3rdparty/libprocess:
Running {{make -j8 tests}} from a clean build does not yield the 
{{libprocess-tests}} binary.
Running it a subsequent time triggers more compilation and ends up yielding the 
{{libprocess-tests}} binary.
This suggests the {{tests}} target is not being built correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4098) Allow interactive terminal for mesos containerizer

2015-12-08 Thread Jojy Varghese (JIRA)
Jojy Varghese created MESOS-4098:


 Summary: Allow interactive terminal for mesos containerizer
 Key: MESOS-4098
 URL: https://issues.apache.org/jira/browse/MESOS-4098
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
 Environment: linux
Reporter: Jojy Varghese
Assignee: Jojy Varghese


Today mesos containerizer does not have a way to run tasks that require 
interactive sessions. An example use case is running a task that requires a 
manual password entry from an operator. Another use case could be debugging 
(gdb). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4071) Master crash during framework teardown ( Check failed: total.resources.contains(slaveId))

2015-12-08 Thread Joris Van Remoortere (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047464#comment-15047464
 ] 

Joris Van Remoortere commented on MESOS-4071:
-

My main fear here is that this wouldn't catch scenarios where the delta 
gradually gets larger as operations are performed.
[~jamespeach] Would you be up for writing a simple test case where we apply the 
arithmetic resource operations (e.g. add, then subtract) iteratively to see if 
there are conditions under which the delta grows?

If the delta can grow then an `almostEquals` approach will just make the 
problem rarer, and not solve it. In this case we need to fix the math itself.

I want to make sure that we do not "push the problem down the road", especially 
if there are logical branches dependent on this math. There are likely even 
more of these in the schedulers that we communicate with, beyond the ones 
pointed out in the mesos code base.
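
For what it's worth, a toy version of that experiment with plain doubles (not 
the actual Resources type, which may wrap its arithmetic differently) shows the 
kind of drift in question:

{code}
#include <iostream>

int main()
{
  const double cpus = 0.1; // A fractional cpu share.
  double total = 0.0;

  // Apply the same arithmetic operations iteratively: add, then subtract.
  for (int i = 0; i < 1000; i++) total += cpus;
  for (int i = 0; i < 1000; i++) total -= cpus;

  std::cout.precision(17);

  // With naive doubles this typically prints a small nonzero delta
  // (on the order of 1e-12) rather than exactly 0.
  std::cout << "delta = " << total << std::endl;
  return 0;
}
{code}

If the Resources arithmetic reduces to this, the delta can indeed grow with the 
number of operations, which argues for fixing the math itself (e.g. a 
fixed-point representation) rather than widening an equality tolerance.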

> Master crash during framework teardown ( Check failed: 
> total.resources.contains(slaveId))
> -
>
> Key: MESOS-4071
> URL: https://issues.apache.org/jira/browse/MESOS-4071
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.25.0
>Reporter: Mandeep Chadha
>
> Stack Trace :
> NOTE : Replaced IP address with XX.XX.XX.XX 
> {code}
> I1204 10:31:03.391127 2588810 master.cpp:5564] Processing TEARDOWN call for 
> framework 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
> (mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST) at 
> scheduler-c8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
> I1204 10:31:03.391177 2588810 master.cpp:5576] Removing framework 
> 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
> (mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST)) at 
> schedulerc8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
> I1204 10:31:03.391337 2588805 hierarchical.hpp:605] Deactivated framework 
> 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014
> F1204 10:31:03.395500 2588810 sorter.cpp:233] Check failed: 
> total.resources.contains(slaveId)
> *** Check failure stack trace: ***
> @ 0x7f2b3dda53d8  google::LogMessage::Fail()
> @ 0x7f2b3dda5327  google::LogMessage::SendToLog()
> @ 0x7f2b3dda4d38  google::LogMessage::Flush()
> @ 0x7f2b3dda7a6c  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f2b3d3351a1  
> mesos::internal::master::allocator::DRFSorter::remove()
> @ 0x7f2b3d0b8c29  
> mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
> @ 0x7f2b3d0ca823 
> _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_11FrameworkIDES6_EEvRKNS_3PIDIT_EEMSA_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESJ_
> @ 0x7f2b3d0dc8dc  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_11FrameworkIDESA_EEvRKNS0_3PIDIT_EEMSE_FvT0_ET1_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f2b3dd2cc35  std::function<>::operator()()
> @ 0x7f2b3dd15ae5  process::ProcessBase::visit()
> @ 0x7f2b3dd188e2  process::DispatchEvent::visit()
> @   0x472366  process::ProcessBase::serve()
> @ 0x7f2b3dd1203f  process::ProcessManager::resume()
> @ 0x7f2b3dd061b2  process::internal::schedule()
> @ 0x7f2b3dd63efd  
> _ZNSt12_Bind_simpleIFPFvvEvEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x7f2b3dd63e4d  std::_Bind_simple<>::operator()()
> @ 0x7f2b3dd63de6  std::thread::_Impl<>::_M_run()
> @   0x318c2b6470  (unknown)
> @   0x318b2079d1  (unknown)
> @   0x318aae8b5d  (unknown)
> @  (nil)  (unknown)
> Aborted (core dumped)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3738) Mesos health check is invoked incorrectly when Mesos slave is within the docker container

2015-12-08 Thread Mark Hindess (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048240#comment-15048240
 ] 

Mark Hindess commented on MESOS-3738:
-

Has this fix been backported to a 0.23.x release? I'm using the latest 0.23.1 
Debian package and it is still broken.

In case it helps anyone else upgrade smoothly to a working release, I am using 
a workaround of creating a mesos-health-check wrapper that execs the real 
mesos-health-check. That is:

{code}
bash$ cat <<'EOF' >mesos-health-check
> #!/bin/sh
> exec /usr/libexec/mesos/mesos-health-check "$@"
> EOF
bash$ chmod 0755 mesos-health-check
bash$ fakeroot sh -c "chown root:root mesos-health-check; \
    tar cf - mesos-health-check | gzip -9 >mesos-health-check.tar.gz"
bash$ tar tvzf mesos-health-check.tar.gz
-rwxr-xr-x root/root        56 2015-12-09 07:44 mesos-health-check
bash$ # deploy mesos-health-check.tar.gz to your mesos-slaves (I used ansible)
bash$ # if using docker, restart your slaves with mesos-health-check.tar.gz
bash$ # mounted as a volume into your mesos-slave container
bash$ # add file:///path/to/mesos-health-check.tar.gz to uris in app json
{code}

> Mesos health check is invoked incorrectly when Mesos slave is within the 
> docker container
> -
>
> Key: MESOS-3738
> URL: https://issues.apache.org/jira/browse/MESOS-3738
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
> Environment: Docker 1.8.0:
> Client:
>  Version:  1.8.0
>  API version:  1.20
>  Go version:   go1.4.2
>  Git commit:   0d03096
>  Built:Tue Aug 11 16:48:39 UTC 2015
>  OS/Arch:  linux/amd64
> Server:
>  Version:  1.8.0
>  API version:  1.20
>  Go version:   go1.4.2
>  Git commit:   0d03096
>  Built:Tue Aug 11 16:48:39 UTC 2015
>  OS/Arch:  linux/amd64
> Host: Ubuntu 14.04
> Container: Debian 8.1 + Java-7
>Reporter: Yong Tang
>Assignee: haosdent
> Fix For: 0.26.0
>
> Attachments: MESOS-3738-0_23_1.patch, MESOS-3738-0_24_1.patch, 
> MESOS-3738-0_25_0.patch
>
>
> When Mesos slave is within the container, the COMMAND health check from 
> Marathon is invoked incorrectly.
> In such a scenario, the sandbox directory (instead of the 
> launcher/health-check directory) is used. This results in an error with the 
> container.
> Command to invoke the Mesos slave container:
> {noformat}
> sudo docker run -d -v /sys:/sys -v /usr/bin/docker:/usr/bin/docker:ro -v 
> /usr/lib/x86_64-linux-gnu/libapparmor.so.1:/usr/lib/x86_64-linux-gnu/libapparmor.so.1:ro
>  -v /var/run/docker.sock:/var/run/docker.sock -v /tmp/mesos:/tmp/mesos mesos 
> mesos slave --master=zk://10.2.1.2:2181/mesos --containerizers=docker,mesos 
> --executor_registration_timeout=5mins --docker_stop_timeout=10secs 
> --launcher=posix
> {noformat}
> Marathon JSON file:
> {code}
> {
>   "id": "ubuntu",
>   "container":
>   {
> "type": "DOCKER",
> "docker":
> {
>   "image": "ubuntu",
>   "network": "BRIDGE",
>   "parameters": []
> }
>   },
>   "args": [ "bash", "-c", "while true; do echo 1; sleep 5; done" ],
>   "uris": [],
>   "healthChecks":
>   [
> {
>   "protocol": "COMMAND",
>   "command": { "value": "echo Success" },
>   "gracePeriodSeconds": 3000,
>   "intervalSeconds": 5,
>   "timeoutSeconds": 5,
>   "maxConsecutiveFailures": 300
> }
>   ],
>   "instances": 1
> }
> {code}
> {noformat}
> STDOUT:
> root@cea2be47d64f:/mnt/mesos/sandbox# cat stdout 
> --container="mesos-e20f8959-cd9f-40ae-987d-809401309361-S0.815cc886-1cd1-4f13-8f9b-54af1f127c3f"
>  --docker="docker" --docker_socket="/var/run/docker.sock" --help="false" 
> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO" 
> --mapped_directory="/mnt/mesos/sandbox" --quiet="false" 
> --sandbox_directory="/tmp/mesos/slaves/e20f8959-cd9f-40ae-987d-809401309361-S0/frameworks/e20f8959-cd9f-40ae-987d-809401309361-/executors/ubuntu.86bca10f-72c9-11e5-b36d-02420a020106/runs/815cc886-1cd1-4f13-8f9b-54af1f127c3f"
>  --stop_timeout="10secs"
> --container="mesos-e20f8959-cd9f-40ae-987d-809401309361-S0.815cc886-1cd1-4f13-8f9b-54af1f127c3f"
>  --docker="docker" --docker_socket="/var/run/docker.sock" --help="false" 
> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO" 
> --mapped_directory="/mnt/mesos/sandbox" --quiet="false" 
> --sandbox_directory="/tmp/mesos/slaves/e20f8959-cd9f-40ae-987d-809401309361-S0/frameworks/e20f8959-cd9f-40ae-987d-809401309361-/executors/ubuntu.86bca10f-72c9-11e5-b36d-02420a020106/runs/815cc886-1cd1-4f13-8f9b-54af1f127c3f"
>  --stop_timeout="10secs"
> Registered docker executor on b01e2e75afcb
> Starting task 

[jira] [Commented] (MESOS-3818) Line wrapping for "--help" output

2015-12-08 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048223#comment-15048223
 ] 

Shuai Lin commented on MESOS-3818:
--

Sorry for the confusion.

I mean 80 columns might be too small for the output, since the flag name and 
the empty space already took 43 columns in the example I pasted. So I wonder: 
should we stick to 80 columns, or use a larger value like 100 columns?


> Line wrapping for "--help" output
> -
>
> Key: MESOS-3818
> URL: https://issues.apache.org/jira/browse/MESOS-3818
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Neil Conway
>Assignee: Shuai Lin
>Priority: Trivial
>  Labels: mesosphere, newbie
>
> The output of `mesos-slave --help`, `mesos-master --help`, and perhaps other 
> programs has very inconsistent line wrapping: different help text fragments 
> are wrapped at very different column numbers, which harms readability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3909) isolator module headers depend on picojson headers

2015-12-08 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047595#comment-15047595
 ] 

James Peach commented on MESOS-3909:


There are a number of solutions to this that I can see:

1. Move the picojson dependencies into a .cpp file in stout

stout is supposed to be a header-only library, and this would undo that. I'm 
not sure of the history of why stout needs to be header-only, but maybe this 
restriction can be loosened.

2. Copy the picojson dependencies into a .cpp in libmesos

This works since picojson is just an internal dependency of Mesos. It needs a 
little ifdef hackery and it might be tricky to avoid copying the relevant 
picojson code, so maintainability is a question. 

3. Install picojson.h

We would need to install picojson.h as {{<picojson.h>}} and adjust Mesos 
include paths appropriately. Using an unbundled picojson would no longer work 
(though it probably doesn't work right today).

4. Do nothing

You can't build a Mesos isolator without fishing in the Mesos code for 
picojson.h. This seems undesirable since it makes release engineering of 
isolator modules harder.

> isolator module headers depend on picojson headers
> --
>
> Key: MESOS-3909
> URL: https://issues.apache.org/jira/browse/MESOS-3909
> Project: Mesos
>  Issue Type: Bug
>  Components: c++ api, modules
>Reporter: James Peach
>Assignee: James Peach
>
> When trying to build an isolator module, stout headers end up depending on 
> {{picojson.hpp}} which is not installed.
> {code}
> In file included from /opt/mesos/include/mesos/module/isolator.hpp:25:
> In file included from /opt/mesos/include/mesos/slave/isolator.hpp:30:
> In file included from /opt/mesos/include/process/dispatch.hpp:22:
> In file included from /opt/mesos/include/process/process.hpp:26:
> In file included from /opt/mesos/include/process/event.hpp:21:
> In file included from /opt/mesos/include/process/http.hpp:39:
> /opt/mesos/include/stout/json.hpp:23:10: fatal error: 'picojson.h' file not 
> found
> #include <picojson.h>
>  ^
> 8 warnings and 1 error generated.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4084) mesos-slave assigned marathon task wrongly to chronos framework after task failure

2015-12-08 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047845#comment-15047845
 ] 

Vinod Kone commented on MESOS-4084:
---

I haven't dug deeply, but it looks like the second status update (TASK_RUNNING) 
was being sent by the health check process. [~tnachen] any idea why a health 
check process launched inside the docker executor outlives the container and 
sends status updates?

> mesos-slave assigned marathon task wrongly to chronos framework after task 
> failure
> --
>
> Key: MESOS-4084
> URL: https://issues.apache.org/jira/browse/MESOS-4084
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.22.2
> Environment: Ubuntu 14.04.2 LTS
> Mesos 0.22.2
> Marathon 0.11.0
> Chronos 2.4.0
>Reporter: Erhan Kesken
>Priority: Minor
>
> I don't know how to reproduce the problem; the only thing I can do is share 
> my logs:
> https://gist.github.com/ekesken/f2edfd65cca8638b0136
> These are highlights from my logs:
> mesos-slave logs:
> {noformat}
> Dec  7 08:58:27 mesos-slave-node-012 mesos-slave[56099]: I1207 
> 08:58:27.089156 56130 slave.cpp:2531] Handling status update TASK_FAILED 
> (UUID: 5b335fab-1722-4270-83a6-b4ec843be47f) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  of framework 20151113-112010-100670892-5050-7957-0001 from 
> executor(1)@172.29.1.12:1651
> 08:58:27 mesos-slave-node-012 mesos-slave[56099]: E1207 08:58:27.089874 56128 
> slave.cpp:2662] Failed to update resources for container 
> ed5f4f67-464d-4786-9628-cd48732de6b7 of executor 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  running task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  on status update for terminal task, destroying container: Failed to 
> determine cgroup for the 'cpu' subsystem: Failed to read /proc/34074/cgroup: 
> Failed to open file '/proc/34074/cgroup': No such file or directory
> {noformat}
> Notice the framework id above; 5 minutes later we got the following logs:
> {noformat}
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 
> 09:03:27.653187 56130 slave.cpp:2531] Handling status update TASK_RUNNING 
> (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001 from executor(1)@172.29.1.12:1651
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: W1207 
> 09:03:27.653282 56130 slave.cpp:2568] Could not find the executor for status 
> update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 
> 09:03:27.653390 56130 status_update_manager.cpp:317] Received status update 
> TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 
> 09:03:27.653543 56130 slave.cpp:2776] Forwarding the update TASK_RUNNING 
> (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001 to master@172.29.0.5:5050
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 
> 09:03:27.653688 56130 slave.cpp:2709] Sending acknowledgement for status 
> update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001 to executor(1)@172.29.1.12:1651
> Dec  7 09:03:37 mesos-slave-node-012 mesos-slave[56099]: W1207 
> 09:03:37.654337 56134 status_update_manager.cpp:472] Resending status update 
> TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001
> {noformat} 
> This caused immediate deactivation of Chronos, as seen in the mesos-master log:
> {noformat}
> Dec  7 09:03:27 mesos-master-node-001 mesos-master[40898]: I1207 
> 09:03:27.654770 40948 master.cpp:1964] Deactivating framework 
> 20150624-210230-117448108-5050-3678-0001 (chronos-2.4.0) at 
> 

[jira] [Created] (MESOS-4102) Quota doesn't allocate resources on slave joining

2015-12-08 Thread Neil Conway (JIRA)
Neil Conway created MESOS-4102:
--

 Summary: Quota doesn't allocate resources on slave joining
 Key: MESOS-4102
 URL: https://issues.apache.org/jira/browse/MESOS-4102
 Project: Mesos
  Issue Type: Bug
  Components: allocation
Reporter: Neil Conway


See attached patch. {{framework1}} is not allocated any resources, despite the 
fact that the resources on {{agent2}} can safely be allocated to it without 
risk of violating {{quota1}}. If I understand the intended quota behavior 
correctly, this doesn't seem intended.

Note that if the framework is added _after_ the slaves are added, the resources 
on {{agent2}} are allocated to {{framework1}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3962) Add labels to the message Port

2015-12-08 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047850#comment-15047850
 ] 

Avinash Sridharan commented on MESOS-3962:
--

MESOS-3401 was incorrectly linked to this issue due to confusion about what 
this issue meant by the message "Port". We wrongly assumed that it was related 
to port resources offered by the slave. The message "Port" being referred to 
here is the protobuf used in DiscoveryInfo.

> Add labels to the message Port
> --
>
> Key: MESOS-3962
> URL: https://issues.apache.org/jira/browse/MESOS-3962
> Project: Mesos
>  Issue Type: Wish
>Reporter: Sargun Dhillon
>Assignee: Avinash Sridharan
>Priority: Minor
>  Labels: mesosphere
>
> I want to add arbitrary labels to the message "Port". I have a few use cases 
> for this:
> 1) I want to use it to drive isolators to install firewall rules associated 
> with the port
> 2) I want to use it to drive third party components to be able to specify 
> advertising information
> 3) I want to be able to able to use this to associate a deterministic virtual 
> hostname with a given port
> Ideally, once the task is launched, these labels would be immutable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3962) Add labels to the message Port

2015-12-08 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047883#comment-15047883
 ] 

Avinash Sridharan commented on MESOS-3962:
--

Had a discussion with Adam and Sargun. From the Mesos perspective, a labels 
field needs to be introduced as an optional field in the message "Port" in 
include/mesos/mesos.proto. We will also need to update the JSON model object 
for TaskInfo to reflect these fields in state.json.
However, making these changes in itself is not enough, since Marathon is not 
currently populating the DiscoveryInfo field in TaskInfo. This implies that for 
service discovery to consume this field, there are changes that need to be made 
in Marathon as well.
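
To make the proposal concrete, here is a sketch of how a consumer (say, an 
isolator) might read such labels. It assumes the labels field proposed in this 
ticket lands on the message "Port" (it does not exist in mesos.proto yet), and 
the "firewall" key is made up for illustration:

{code}
#include <iostream>

#include <mesos/mesos.pb.h>

// Assumes the proposed 'labels' field on Port (this ticket).
void scanPortLabels(const mesos::TaskInfo& task)
{
  if (!task.has_discovery()) {
    return;
  }

  const mesos::Ports& ports = task.discovery().ports();
  for (int i = 0; i < ports.ports_size(); i++) {
    const mesos::Port& port = ports.ports(i);

    // Walk the port's labels, e.g. to install a firewall rule.
    for (int j = 0; j < port.labels().labels_size(); j++) {
      const mesos::Label& label = port.labels().labels(j);
      if (label.key() == "firewall") {
        std::cout << "port " << port.number()
                  << " firewall rule: " << label.value() << std::endl;
      }
    }
  }
}
{code}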

> Add labels to the message Port
> --
>
> Key: MESOS-3962
> URL: https://issues.apache.org/jira/browse/MESOS-3962
> Project: Mesos
>  Issue Type: Wish
>Reporter: Sargun Dhillon
>Assignee: Avinash Sridharan
>Priority: Minor
>  Labels: mesosphere
>
> I want to add arbitrary labels to the message "Port". I have a few use cases 
> for this:
> 1) I want to use it to drive isolators to install firewall rules associated 
> with the port
> 2) I want to use it to drive third party components to be able to specify 
> advertising information
> 3) I want to be able to able to use this to associate a deterministic virtual 
> hostname with a given port
> Ideally, once the task is launched, these labels would be immutable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3962) Add labels to the message Port

2015-12-08 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan updated MESOS-3962:
-
  Shepherd: Adam B
External issue URL: https://github.com/mesosphere/marathon/issues/1866

> Add labels to the message Port
> --
>
> Key: MESOS-3962
> URL: https://issues.apache.org/jira/browse/MESOS-3962
> Project: Mesos
>  Issue Type: Wish
>Reporter: Sargun Dhillon
>Assignee: Avinash Sridharan
>Priority: Minor
>  Labels: mesosphere
>
> I want to add arbitrary labels to the message "Port". I have a few use cases 
> for this:
> 1) I want to use it to drive isolators to install firewall rules associated 
> with the port
> 2) I want to use it to drive third party components to be able to specify 
> advertising information
> 3) I want to be able to able to use this to associate a deterministic virtual 
> hostname with a given port
> Ideally, once the task is launched, these labels would be immutable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4102) Quota doesn't allocate resources on slave joining

2015-12-08 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-4102:
---
Attachment: quota_absent_framework_test-1.patch

> Quota doesn't allocate resources on slave joining
> -
>
> Key: MESOS-4102
> URL: https://issues.apache.org/jira/browse/MESOS-4102
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Neil Conway
>  Labels: mesosphere, quota
> Attachments: quota_absent_framework_test-1.patch
>
>
> See attached patch. {{framework1}} is not allocated any resources, despite 
> the fact that the resources on {{agent2}} can safely be allocated to it 
> without risk of violating {{quota1}}. If I understand the intended quota 
> behavior correctly, this doesn't seem intended.
> Note that if the framework is added _after_ the slaves are added, the 
> resources on {{agent2}} are allocated to {{framework1}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4103) Show disk usage and allocation in WebUI

2015-12-08 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-4103:
-

 Summary: Show disk usage and allocation in WebUI
 Key: MESOS-4103
 URL: https://issues.apache.org/jira/browse/MESOS-4103
 Project: Mesos
  Issue Type: Improvement
Reporter: Vinod Kone
Assignee: Vinod Kone


Several places in the WebUI do not show disk utilization data (they only show 
cpu and mem). The max share shown in the WebUI also doesn't account for disk! 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4048) Consider unifying slave timeout behavior between steady state and master failover

2015-12-08 Thread Klaus Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047709#comment-15047709
 ] 

Klaus Ma commented on MESOS-4048:
-

Got it, makes sense to me :).

> Consider unifying slave timeout behavior between steady state and master 
> failover
> -
>
> Key: MESOS-4048
> URL: https://issues.apache.org/jira/browse/MESOS-4048
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, slave
>Reporter: Neil Conway
>Assignee: Anindya Sinha
>Priority: Minor
>  Labels: mesosphere
>
> Currently, there are two timeouts that control what happens when an agent is 
> partitioned from the master:
> 1. {{max_slave_ping_timeouts}} + {{slave_ping_timeout}} controls how long the 
> master waits before declaring a slave to be dead in the "steady state"
> 2. {{slave_reregister_timeout}} controls how long the master waits for a 
> slave to reregister after master failover.
> It is unclear whether these two cases really merit being treated differently 
> -- it might be simpler for operators to configure a single timeout that 
> controls how long the master waits before declaring that a slave is dead.
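
For a sense of scale, assuming the current default flag values: in the steady 
state an agent is declared dead after {{max_slave_ping_timeouts}} * 
{{slave_ping_timeout}} = 5 * 15secs = 75secs, whereas 
{{slave_reregister_timeout}} defaults to 10mins, so a partitioned agent gets 
roughly eight times longer to reregister after a master failover.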



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3909) isolator module headers depend on picojson headers

2015-12-08 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047596#comment-15047596
 ] 

James Peach commented on MESOS-3909:


I tried (2) and it was pretty ugly. I think that (3) is the best bet.

> isolator module headers depend on picojson headers
> --
>
> Key: MESOS-3909
> URL: https://issues.apache.org/jira/browse/MESOS-3909
> Project: Mesos
>  Issue Type: Bug
>  Components: c++ api, modules
>Reporter: James Peach
>Assignee: James Peach
>
> When trying to build an isolator module, stout headers end up depending on 
> {{picojson.hpp}} which is not installed.
> {code}
> In file included from /opt/mesos/include/mesos/module/isolator.hpp:25:
> In file included from /opt/mesos/include/mesos/slave/isolator.hpp:30:
> In file included from /opt/mesos/include/process/dispatch.hpp:22:
> In file included from /opt/mesos/include/process/process.hpp:26:
> In file included from /opt/mesos/include/process/event.hpp:21:
> In file included from /opt/mesos/include/process/http.hpp:39:
> /opt/mesos/include/stout/json.hpp:23:10: fatal error: 'picojson.h' file not 
> found
> #include <picojson.h>
>          ^
> 8 warnings and 1 error generated.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4087) Introduce a module for logging executor/task output

2015-12-08 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15045985#comment-15045985
 ] 

Joseph Wu edited comment on MESOS-4087 at 12/8/15 11:23 PM:


Reviews:
|| Review || Summary ||
| https://reviews.apache.org/r/41055/ 
  https://reviews.apache.org/r/41057/ | Refactoring |
| https://reviews.apache.org/r/41002/ | Module interface |
| https://reviews.apache.org/r/41003/ | Default module implementation |
| https://reviews.apache.org/r/41004/ | Modularification |
| https://reviews.apache.org/r/41061/ | New agent flags |
| https://reviews.apache.org/r/4/ | Regression test |


was (Author: kaysoky):
Reviews (WIP):
https://reviews.apache.org/r/41055/
https://reviews.apache.org/r/41057/
https://reviews.apache.org/r/41002/
https://reviews.apache.org/r/41003/
https://reviews.apache.org/r/41004/

> Introduce a module for logging executor/task output
> ---
>
> Key: MESOS-4087
> URL: https://issues.apache.org/jira/browse/MESOS-4087
> Project: Mesos
>  Issue Type: Task
>  Components: containerization, modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
>
> Existing executor/task logs are logged to files in their sandbox directory, 
> with some nuances based on which containerizer is used (see background 
> section in linked document).
> A logger for executor/task logs has the following requirements:
> * The logger is given a command to run and must handle the stdout/stderr of 
> the command.
> * The handling of stdout/stderr must be resilient across agent failover.  
> Logging should not stop if the agent fails.
> * Logs should be readable, presumably via the web UI, or via some other 
> module-specific UI.
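
As a rough illustration of the shape such a module might take (names and 
signatures here are hypothetical, not the final Mesos interface):

{code}
#include <iostream>
#include <string>

// Placeholder for the descriptor the containerizer would hand the module.
struct SubprocessIO
{
  std::string stdoutPath;  // where the command's stdout should go
  std::string stderrPath;  // where the command's stderr should go
};

class ContainerLogger
{
public:
  virtual ~ContainerLogger() {}

  // Called before the executor command is launched: the module decides
  // where stdout/stderr are written. Handing back paths (rather than the
  // agent holding pipes) is what lets logging survive agent failover.
  virtual SubprocessIO prepare(const std::string& sandboxDirectory) = 0;
};

// Default behavior: plain files in the sandbox, which the web UI can
// already read.
class SandboxFileLogger : public ContainerLogger
{
public:
  SubprocessIO prepare(const std::string& sandboxDirectory) override
  {
    return SubprocessIO{
        sandboxDirectory + "/stdout", sandboxDirectory + "/stderr"};
  }
};

int main()
{
  SandboxFileLogger logger;
  SubprocessIO io = logger.prepare("/tmp/mesos/sandbox");
  std::cout << io.stdoutPath << std::endl << io.stderrPath << std::endl;
  return 0;
}
{code}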



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3760) Remove fragile sleep() from ProcessManager::settle()

2015-12-08 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-3760:
---
Component/s: test

> Remove fragile sleep() from ProcessManager::settle()
> 
>
> Key: MESOS-3760
> URL: https://issues.apache.org/jira/browse/MESOS-3760
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
>Reporter: Neil Conway
>Priority: Minor
>  Labels: mesosphere, tech-debt, testing
>
> From {{ProcessManager::settle()}}:
> {code}
> // While refactoring in order to isolate libev behind abstractions
> // it became evident that this os::sleep is vital for tests to
> // pass. In particular, there are certain tests that assume too
> // much before they attempt to do a settle. One such example is
> // tests doing http::get followed by Clock::settle, where they
> // expect the http::get will have properly enqueued a process on
> // the run queue but http::get is just sending bytes on a
> // socket. Without sleeping at the beginning of this function we
> // can get unlucky and appear settled when in actuality the
> // kernel just hasn't copied the bytes to a socket or we haven't
> // yet read the bytes and enqueued an event on a process (and the
> // process on the run queue).
> os::sleep(Milliseconds(10));
> {code}
> Sleeping for 10 milliseconds doesn't guarantee that the kernel has done 
> anything at all; any test cases that depend on this behavior should be fixed 
> to actually perform the necessary synchronization.
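
For instance, the {{http::get}} case described in the comment could 
synchronize on the response itself before settling, instead of relying on the 
sleep. A sketch using the libprocess test helpers (assumes the usual includes: 
{{process/clock.hpp}}, {{process/gtest.hpp}}, {{process/http.hpp}}; the 
endpoint is hypothetical):

{code}
// Await the actual response, so the bytes have been read and the resulting
// event enqueued, before asking the clock to settle.
Future<http::Response> response = http::get(upid, "some/endpoint");
AWAIT_READY(response);  // the real synchronization point

Clock::pause();
Clock::settle();        // settling can no longer race the socket read
{code}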



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4101) Consider running most/all tests with the clock paused

2015-12-08 Thread Neil Conway (JIRA)
Neil Conway created MESOS-4101:
--

 Summary: Consider running most/all tests with the clock paused
 Key: MESOS-4101
 URL: https://issues.apache.org/jira/browse/MESOS-4101
 Project: Mesos
  Issue Type: Improvement
  Components: test
Reporter: Neil Conway


Presently, some tests pause the clock before they do timing-sensitive 
operations, explicitly calling {{Clock::advance()}} to help ensure that 
dependencies on the clock don't cause the test to be non-deterministic. (Using 
{{Clock::advance()}} is typically also faster than waiting for the equivalent 
amount of physical time to elapse.)

However, most tests do not pause the clock, which contributes to the ongoing 
flakiness witnessed in many tests. We should investigate whether it is feasible 
to pause the clock in all/most tests (e.g., have the clock paused by default), 
and only enable the clock when the test cannot be implemented with 
{{Clock::advance()}}, {{Clock::settle()}}, and similar functions.
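
A minimal sketch of the paused-clock pattern, using the libprocess Clock API 
(test scaffolding elided):

{code}
#include <process/clock.hpp>

#include <stout/duration.hpp>

using process::Clock;

void exampleTimerTest()
{
  Clock::pause();             // stop real time from driving timers

  // ... trigger something that schedules, say, a 5 second timeout ...

  Clock::advance(Seconds(5)); // fire the timeout deterministically
  Clock::settle();            // wait until all resulting events are processed

  // ... assert on the post-timeout state ...

  Clock::resume();            // restore real time for subsequent tests
}
{code}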



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3818) Line wrapping for "--help" output

2015-12-08 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048100#comment-15048100
 ] 

Neil Conway commented on MESOS-3818:


Hi [~lins05], I'm not quite sure what you mean. Shouldn't we pick a column 
number to wrap the text at (say 80), and then use the same line length for all 
of the help output text?
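
For concreteness, a toy fixed-column wrapper of the sort being suggested 
(illustrative only, not code from the Mesos flags library):

{code}
#include <iostream>
#include <sstream>
#include <string>

// Greedily wraps `text` at `width` columns, breaking on whitespace.
std::string wrap(const std::string& text, size_t width = 80)
{
  std::istringstream words(text);
  std::string word, line, out;
  while (words >> word) {
    if (!line.empty() && line.size() + 1 + word.size() > width) {
      out += line + "\n";
      line.clear();
    }
    line += (line.empty() ? "" : " ") + word;
  }
  return out + line;
}

int main()
{
  std::cout << wrap("Help text fragments should all break at the same column "
                    "so that --help output stays readable.", 40)
            << std::endl;
  return 0;
}
{code}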

> Line wrapping for "--help" output
> -
>
> Key: MESOS-3818
> URL: https://issues.apache.org/jira/browse/MESOS-3818
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Neil Conway
>Assignee: Shuai Lin
>Priority: Trivial
>  Labels: mesosphere, newbie
>
> The output of `mesos-slave --help`, `mesos-master --help`, and perhaps other 
> programs has very inconsistent line wrapping: different help text fragments 
> are wrapped at very different column numbers, which harms readability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4071) Master crash during framework teardown ( Check failed: total.resources.contains(slaveId))

2015-12-08 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048023#comment-15048023
 ] 

James Peach commented on MESOS-4071:


[~jvanremoortere] it looks like Jie added a simple test in 
2e40c67ecf68bd818b789d5dd17baf5e00c43e2b. Is that something like what you were 
thinking of?

> Master crash during framework teardown ( Check failed: 
> total.resources.contains(slaveId))
> -
>
> Key: MESOS-4071
> URL: https://issues.apache.org/jira/browse/MESOS-4071
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.25.0
>Reporter: Mandeep Chadha
>
> Stack Trace :
> NOTE : Replaced IP address with XX.XX.XX.XX 
> {code}
> I1204 10:31:03.391127 2588810 master.cpp:5564] Processing TEARDOWN call for 
> framework 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
> (mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST) at 
> scheduler-c8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
> I1204 10:31:03.391177 2588810 master.cpp:5576] Removing framework 
> 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
> (mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST)) at 
> schedulerc8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
> I1204 10:31:03.391337 2588805 hierarchical.hpp:605] Deactivated framework 
> 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014
> F1204 10:31:03.395500 2588810 sorter.cpp:233] Check failed: 
> total.resources.contains(slaveId)
> *** Check failure stack trace: ***
> @ 0x7f2b3dda53d8  google::LogMessage::Fail()
> @ 0x7f2b3dda5327  google::LogMessage::SendToLog()
> @ 0x7f2b3dda4d38  google::LogMessage::Flush()
> @ 0x7f2b3dda7a6c  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f2b3d3351a1  
> mesos::internal::master::allocator::DRFSorter::remove()
> @ 0x7f2b3d0b8c29  
> mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
> @ 0x7f2b3d0ca823 
> _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_11FrameworkIDES6_EEvRKNS_3PIDIT_EEMSA_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESJ_
> @ 0x7f2b3d0dc8dc  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_11FrameworkIDESA_EEvRKNS0_3PIDIT_EEMSE_FvT0_ET1_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f2b3dd2cc35  std::function<>::operator()()
> @ 0x7f2b3dd15ae5  process::ProcessBase::visit()
> @ 0x7f2b3dd188e2  process::DispatchEvent::visit()
> @   0x472366  process::ProcessBase::serve()
> @ 0x7f2b3dd1203f  process::ProcessManager::resume()
> @ 0x7f2b3dd061b2  process::internal::schedule()
> @ 0x7f2b3dd63efd  
> _ZNSt12_Bind_simpleIFPFvvEvEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x7f2b3dd63e4d  std::_Bind_simple<>::operator()()
> @ 0x7f2b3dd63de6  std::thread::_Impl<>::_M_run()
> @   0x318c2b6470  (unknown)
> @   0x318b2079d1  (unknown)
> @   0x318aae8b5d  (unknown)
> @  (nil)  (unknown)
> Aborted (core dumped)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4067) ReservationTest.ACLMultipleOperations is flaky

2015-12-08 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-4067:

Shepherd: Michael Park

> ReservationTest.ACLMultipleOperations is flaky
> --
>
> Key: MESOS-4067
> URL: https://issues.apache.org/jira/browse/MESOS-4067
> Project: Mesos
>  Issue Type: Bug
>Reporter: Michael Park
>Assignee: Greg Mann
>  Labels: flaky, mesosphere
> Fix For: 0.27.0
>
>
> Observed from the CI: 
> https://builds.apache.org/job/Mesos/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=ubuntu%3A14.04,label_exp=docker%7C%7CHadoop/1319/changes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4100) Include ContainerID in certain Hook interface calls?

2015-12-08 Thread Nicholas Parker (JIRA)
Nicholas Parker created MESOS-4100:
--

 Summary: Include ContainerID in certain Hook interface calls?
 Key: MESOS-4100
 URL: https://issues.apache.org/jira/browse/MESOS-4100
 Project: Mesos
  Issue Type: Improvement
  Components: c++ api
Affects Versions: 0.25.0
Reporter: Nicholas Parker
Priority: Minor


I'm building an agent module which uses both the Isolator interface[1] to track 
containers over their lifespan, and the Hook interface[2] to inject environment 
variables into those containers.

Nearly all of the Isolator interface calls include the ContainerID, sometimes 
as the sole identifier. Meanwhile the Hook.slaveExecutorEnvironmentDecorator 
call is only given an ExecutorInfo, and doesn't have a ContainerID at all.

At the moment I'm working around the lack of ContainerID in the Hook call by 
storing a temporary ExecutorInfo->ContainerID mapping when Isolator.prepare() 
is called, then reading/clearing that mapping when 
Hook.slaveExecutorEnvironmentDecorator() is called. While this workaround 
appears to work for now, I worry that it will be brittle in the future, since 
it depends on Isolator.prepare() consistently being called before 
Hook.slaveExecutorEnvironmentDecorator().

The immediate issue is specific to including a ContainerID parameter within 
Hook.slaveExecutorEnvironmentDecorator(), but it may be worth determining 
whether other Hook calls should get similar updates.

[1] 
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob;f=include/mesos/slave/isolator.hpp
[2] 
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob;f=include/mesos/hook.hpp
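
For reference, a self-contained sketch of the workaround described above: 
remember the ContainerID seen in Isolator.prepare() so the decorator hook can 
look it up by executor ID (types and call points simplified):

{code}
#include <map>
#include <mutex>
#include <string>

class ContainerIdTracker
{
public:
  // Called from the module's Isolator::prepare() path.
  void remember(const std::string& executorId, const std::string& containerId)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    mapping_[executorId] = containerId;
  }

  // Called from Hook::slaveExecutorEnvironmentDecorator(). Erases the entry
  // so stale mappings don't accumulate. Returns "" when prepare() has not
  // run yet, which is exactly the ordering hazard described above.
  std::string take(const std::string& executorId)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    std::map<std::string, std::string>::iterator it =
      mapping_.find(executorId);
    if (it == mapping_.end()) {
      return "";
    }
    std::string containerId = it->second;
    mapping_.erase(it);
    return containerId;
  }

private:
  std::mutex mutex_;
  std::map<std::string, std::string> mapping_;
};
{code}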



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3909) isolator module headers depend on picojson headers

2015-12-08 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047499#comment-15047499
 ] 

James Peach commented on MESOS-3909:


Looking at {{stout/json.hpp}}, the dependency on {{picojson}} could easily be 
eliminated by moving some of the inline functions in the header to a {{.cpp}} 
file.
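
A minimal sketch of that split, with stand-in names (the real candidates are 
the picojson-backed functions in {{stout/json.hpp}}):

{code}
#include <string>

// --- json.hpp (installed header): declaration only, no picojson include ---
namespace JSON {
std::string stringify(int value);  // body lives in the .cpp below
}

// --- json.cpp (compiled into the library, never installed) ---
// #include <picojson.h>  // the third-party dependency stays private here
namespace JSON {
std::string stringify(int value)
{
  // Stand-in for the real picojson-backed implementation.
  return std::to_string(value);
}
}
{code}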

> isolator module headers depend on picojson headers
> --
>
> Key: MESOS-3909
> URL: https://issues.apache.org/jira/browse/MESOS-3909
> Project: Mesos
>  Issue Type: Bug
>  Components: c++ api, modules
>Reporter: James Peach
>Assignee: James Peach
>
> When trying to build an isolator module, stout headers end up depending on 
> {{picojson.hpp}} which is not installed.
> {code}
> In file included from /opt/mesos/include/mesos/module/isolator.hpp:25:
> In file included from /opt/mesos/include/mesos/slave/isolator.hpp:30:
> In file included from /opt/mesos/include/process/dispatch.hpp:22:
> In file included from /opt/mesos/include/process/process.hpp:26:
> In file included from /opt/mesos/include/process/event.hpp:21:
> In file included from /opt/mesos/include/process/http.hpp:39:
> /opt/mesos/include/stout/json.hpp:23:10: fatal error: 'picojson.h' file not 
> found
> #include <picojson.h>
>          ^
> 8 warnings and 1 error generated.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2782) Document the sandbox

2015-12-08 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047500#comment-15047500
 ] 

Joseph Wu commented on MESOS-2782:
--

The tests for sandbox expectations already exist:
* {{PathsTest.Executor}}
* {{GarbageCollectorIntegrationTest.ExitedExecutor}}
* {{GarbageCollectorIntegrationTest.DiskUsage}}
* {{SlaveRecoveryTest.GCExecutor}}
* Indirectly tested by {{FilesTest.*}} and {{FetcherTest.*}}

> Document the sandbox
> 
>
> Key: MESOS-2782
> URL: https://issues.apache.org/jira/browse/MESOS-2782
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Aaron Bell
>Assignee: Joseph Wu
>  Labels: documentation, mesosphere
>
> The sandbox is the arena of debugging for most Mesos users. Application 
> and framework developers need to know:
> - What it is
> - Where it is
> - How to use it, and how NOT to use it
> - What Mesos writes here (fetcher etc.)
> - Storage limits
> - Lifecycle and garbage collection
> This needs to be documented to help users get over the hump of learning to 
> work with Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-2782) Document the sandbox

2015-12-08 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047500#comment-15047500
 ] 

Joseph Wu edited comment on MESOS-2782 at 12/8/15 9:25 PM:
---

The tests for sandbox expectations already exist:
* {{PathsTest.Executor}}
* {{GarbageCollectorIntegrationTest.ExitedExecutor}}
* {{GarbageCollectorIntegrationTest.DiskUsage}}
* {{SlaveRecoveryTest.GCExecutor}}
* Indirectly tested by {{FilesTest.\*}} and {{FetcherTest.\*}}


was (Author: kaysoky):
The tests for sandbox expectations already exist:
* {{PathsTest.Executor}}
* {{GarbageCollectorIntegrationTest.ExitedExecutor}}
* {{GarbageCollectorIntegrationTest.DiskUsage}}
* {{SlaveRecoveryTest.GCExecutor}}
* Indirectly tested by {{FilesTest.*}} and {{FetcherTest.*}}

> Document the sandbox
> 
>
> Key: MESOS-2782
> URL: https://issues.apache.org/jira/browse/MESOS-2782
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Aaron Bell
>Assignee: Joseph Wu
>  Labels: documentation, mesosphere
>
> The sandbox is the arena of debugging for most Mesos users. Application 
> and framework developers need to know:
> - What it is
> - Where it is
> - How to use it, and how NOT to use it
> - What Mesos writes here (fetcher etc.)
> - Storage limits
> - Lifecycle and garbage collection
> This needs to be documented to help users get over the hump of learning to 
> work with Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)