[jira] [Updated] (MESOS-2903) Network isolator should not fail when target state already exists

2015-06-19 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-2903:
--
Target Version/s: 0.23.0

 Network isolator should not fail when target state already exists
 -

 Key: MESOS-2903
 URL: https://issues.apache.org/jira/browse/MESOS-2903
 Project: Mesos
  Issue Type: Bug
  Components: isolation
Affects Versions: 0.23.0
Reporter: Paul Brett
Priority: Critical

 Network isolator has multiple instances of the following pattern:
 {noformat}
   Try<bool> something = ::create();
   if (something.isError()) {
     ++metrics.something_errors;
     return Failure("Failed to create something ...");
   } else if (!icmpVethToEth0.get()) {
     ++metrics.adding_veth_icmp_filters_already_exist;
     return Failure("Something already exists");
   }
 {noformat}
 These failures have occurred in operation due to a failure to recover or
 delete an orphan, leaving the slave online but unable to create new
 resources. We should convert the second failure message in this pattern to
 an informational message, since the final state of the system is the state
 that we requested.
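 A minimal sketch of the proposed change, keeping the pattern above but
 downgrading the "already exists" branch to an informational log
 (createSomething() is a hypothetical stand-in for the real call):
 {code}
   Try<bool> something = createSomething();
   if (something.isError()) {
     ++metrics.something_errors;
     return Failure("Failed to create something: " + something.error());
   } else if (!something.get()) {
     // The target state already exists (e.g., left by an orphan that could
     // not be recovered or deleted). The system is in the requested state,
     // so count the event and continue instead of failing.
     ++metrics.adding_veth_icmp_filters_already_exist;
     LOG(INFO) << "Something already exists; continuing";
   }
 {code}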



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2903) Network isolator should not fail when target state already exists

2015-06-19 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-2903:
--
Priority: Critical  (was: Major)

 Network isolator should not fail when target state already exists
 -

 Key: MESOS-2903
 URL: https://issues.apache.org/jira/browse/MESOS-2903
 Project: Mesos
  Issue Type: Bug
  Components: isolation
Reporter: Paul Brett
Priority: Critical

 Network isolator has multiple instances of the following pattern:
 {noformat}
   Try<bool> something = ::create();
   if (something.isError()) {
     ++metrics.something_errors;
     return Failure("Failed to create something ...");
   } else if (!icmpVethToEth0.get()) {
     ++metrics.adding_veth_icmp_filters_already_exist;
     return Failure("Something already exists");
   }
 {noformat}
 These failures have occurred in operation due to a failure to recover or
 delete an orphan, leaving the slave online but unable to create new
 resources. We should convert the second failure message in this pattern to
 an informational message, since the final state of the system is the state
 that we requested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2903) Network isolator should not fail when target state already exists

2015-06-19 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-2903:
--
Affects Version/s: 0.23.0

 Network isolator should not fail when target state already exists
 -

 Key: MESOS-2903
 URL: https://issues.apache.org/jira/browse/MESOS-2903
 Project: Mesos
  Issue Type: Bug
  Components: isolation
Affects Versions: 0.23.0
Reporter: Paul Brett
Priority: Critical

 Network isolator has multiple instances of the following pattern:
 {noformat}
   Try<bool> something = ::create();
   if (something.isError()) {
     ++metrics.something_errors;
     return Failure("Failed to create something ...");
   } else if (!icmpVethToEth0.get()) {
     ++metrics.adding_veth_icmp_filters_already_exist;
     return Failure("Something already exists");
   }
 {noformat}
 These failures have occurred in operation due to a failure to recover or
 delete an orphan, leaving the slave online but unable to create new
 resources. We should convert the second failure message in this pattern to
 an informational message, since the final state of the system is the state
 that we requested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2419) Slave recovery not recovering tasks when using systemd

2015-06-19 Thread Chris Fortier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593976#comment-14593976
 ] 

Chris Fortier commented on MESOS-2419:
--

Brenden - Thank you so much for taking a look at this. I also just found out 
about https://issues.apache.org/jira/browse/MESOS-2115 

Apparently Mesos in a container is something Mesosphere is working on and 
should be included in the next release. :)

 Slave recovery not recovering tasks when using systemd
 --

 Key: MESOS-2419
 URL: https://issues.apache.org/jira/browse/MESOS-2419
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Brenden Matthews
Assignee: Joerg Schad
 Attachments: mesos-chronos.log.gz, mesos.log.gz


 {color:red}
 Note: the resolution to this issue is described in the following comment:
 https://issues.apache.org/jira/browse/MESOS-2419?focusedCommentId=14357028&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14357028
 {color}
 In a recent build from master (updated yesterday), slave recovery appears to
 be broken.
 I'll attach the slave log (with GLOG_v=1) showing a task called 
 `long-running-job` which is a Chronos job that just does `sleep 1h`. After 
 restarting the slave, the task shows as `TASK_FAILED`.
 Here's another case, which is for a docker task:
 {noformat}
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.247159 10022 docker.cpp:421] Recovering Docker containers
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.247207 10022 docker.cpp:468] Recovering container 
 'f2001064-e076-4978-b764-ed12a5244e78' for executor 
 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 
 20150226-230228-2931198986-5050-717-
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.254791 10022 docker.cpp:1333] Executor for container 
 'f2001064-e076-4978-b764-ed12a5244e78' has exited
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.254812 10022 docker.cpp:1159] Destroying container 
 'f2001064-e076-4978-b764-ed12a5244e78'
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.254844 10022 docker.cpp:1248] Running docker stop on container 
 'f2001064-e076-4978-b764-ed12a5244e78'
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.262481 10027 containerizer.cpp:310] Recovering containerizer
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.262565 10027 containerizer.cpp:353] Recovering container 
 'f2001064-e076-4978-b764-ed12a5244e78' for executor 
 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 
 20150226-230228-2931198986-5050-717-
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.263675 10027 linux_launcher.cpp:162] Couldn't find freezer cgroup 
 for container f2001064-e076-4978-b764-ed12a5244e78, assuming already destroyed
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 
 00:09:49.265467 10020 cpushare.cpp:199] Couldn't find cgroup for container 
 f2001064-e076-4978-b764-ed12a5244e78
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.266448 10022 containerizer.cpp:1147] Executor for container 
 'f2001064-e076-4978-b764-ed12a5244e78' has exited
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.266466 10022 containerizer.cpp:938] Destroying container 
 'f2001064-e076-4978-b764-ed12a5244e78'
 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:50.593585 10021 slave.cpp:3735] Sending reconnect request to executor 
 chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
 20150226-230228-2931198986-5050-717- at executor(1)@10.81.189.232:43130
 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 
 00:09:50.597843 10024 slave.cpp:3175] Termination of executor 
 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 
 '20150226-230228-2931198986-5050-717-' failed: Container 
 'f2001064-e076-4978-b764-ed12a5244e78' not found
 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 
 00:09:50.597949 10025 slave.cpp:3429] Failed to unmonitor container for 
 executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
 20150226-230228-2931198986-5050-717-: Not monitored
 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:50.598785 10024 slave.cpp:2508] Handling status update TASK_FAILED 
 (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task 
 chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
 20150226-230228-2931198986-5050-717- from @0.0.0.0:0
 Feb 

[jira] [Commented] (MESOS-2637) Consolidate 'foo', 'bar', ... string constants in test and example code

2015-06-19 Thread Colin Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593988#comment-14593988
 ] 

Colin Williams commented on MESOS-2637:
---

Sorry about the delay; I'll have some cycles on Sunday and will try to wrap it
up then.



 Consolidate 'foo', 'bar', ... string constants in test and example code
 ---

 Key: MESOS-2637
 URL: https://issues.apache.org/jira/browse/MESOS-2637
 Project: Mesos
  Issue Type: Bug
  Components: technical debt
Reporter: Niklas Quarfot Nielsen
Assignee: Colin Williams

 We are using 'foo', 'bar', ... string constants and pairs in
 src/tests/master_tests.cpp, src/tests/slave_tests.cpp,
 src/tests/hook_tests.cpp and src/examples/test_hook_module.cpp for the label
 and hook tests. We should consolidate them so that it is harder to forget to
 update all call sites.
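 One possible consolidation, sketched with hypothetical names (the header
 path and constant names are assumptions, not the actual change):
 {code}
 // tests/constants.hpp (hypothetical): one definition for the label and
 // hook test fixtures instead of scattered "foo"/"bar" literals.
 #ifndef __TESTS_CONSTANTS_HPP__
 #define __TESTS_CONSTANTS_HPP__

 namespace mesos {
 namespace internal {
 namespace tests {

 constexpr char TEST_LABEL_KEY[] = "foo";
 constexpr char TEST_LABEL_VALUE[] = "bar";

 } // namespace tests {
 } // namespace internal {
 } // namespace mesos {

 #endif // __TESTS_CONSTANTS_HPP__
 {code}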



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2903) Network isolator should not fail when target state already exists

2015-06-19 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594152#comment-14594152
 ] 

Jie Yu commented on MESOS-2903:
---

I think this is related to the recent change in slave recovery semantics
(MESOS-2367). Previously, the slave wouldn't finish recovery if some orphan
containers could not be destroyed, so the port mapping isolator could simply
assume that it knew about all the filters on the host. That is no longer true
after MESOS-2367 was committed, so the isolator code needs to adapt to the new
semantics.

 Network isolator should not fail when target state already exists
 -

 Key: MESOS-2903
 URL: https://issues.apache.org/jira/browse/MESOS-2903
 Project: Mesos
  Issue Type: Bug
  Components: isolation
Affects Versions: 0.23.0
Reporter: Paul Brett
Priority: Critical

 Network isolator has multiple instances of the following pattern:
 {noformat}
   Try<bool> something = ::create();
   if (something.isError()) {
     ++metrics.something_errors;
     return Failure("Failed to create something ...");
   } else if (!icmpVethToEth0.get()) {
     ++metrics.adding_veth_icmp_filters_already_exist;
     return Failure("Something already exists");
   }
 {noformat}
 These failures have occurred in operation due to a failure to recover or
 delete an orphan, leaving the slave online but unable to create new
 resources. We should convert the second failure message in this pattern to
 an informational message, since the final state of the system is the state
 that we requested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2419) Slave recovery not recovering tasks when using systemd

2015-06-19 Thread Brenden Matthews (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593946#comment-14593946
 ] 

Brenden Matthews commented on MESOS-2419:
-

I would not suggest running the Mesos processes inside Docker containers. In 
fact, it's an anti-pattern. It will indeed break recovery if you try to do that.

 Slave recovery not recovering tasks when using systemd
 --

 Key: MESOS-2419
 URL: https://issues.apache.org/jira/browse/MESOS-2419
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Brenden Matthews
Assignee: Joerg Schad
 Attachments: mesos-chronos.log.gz, mesos.log.gz


 {color:red}
 Note: the resolution to this issue is described in the following comment:
 https://issues.apache.org/jira/browse/MESOS-2419?focusedCommentId=14357028&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14357028
 {color}
 In a recent build from master (updated yesterday), slave recovery appears to
 be broken.
 I'll attach the slave log (with GLOG_v=1) showing a task called 
 `long-running-job` which is a Chronos job that just does `sleep 1h`. After 
 restarting the slave, the task shows as `TASK_FAILED`.
 Here's another case, which is for a docker task:
 {noformat}
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.247159 10022 docker.cpp:421] Recovering Docker containers
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.247207 10022 docker.cpp:468] Recovering container 
 'f2001064-e076-4978-b764-ed12a5244e78' for executor 
 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 
 20150226-230228-2931198986-5050-717-
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.254791 10022 docker.cpp:1333] Executor for container 
 'f2001064-e076-4978-b764-ed12a5244e78' has exited
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.254812 10022 docker.cpp:1159] Destroying container 
 'f2001064-e076-4978-b764-ed12a5244e78'
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.254844 10022 docker.cpp:1248] Running docker stop on container 
 'f2001064-e076-4978-b764-ed12a5244e78'
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.262481 10027 containerizer.cpp:310] Recovering containerizer
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.262565 10027 containerizer.cpp:353] Recovering container 
 'f2001064-e076-4978-b764-ed12a5244e78' for executor 
 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 
 20150226-230228-2931198986-5050-717-
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.263675 10027 linux_launcher.cpp:162] Couldn't find freezer cgroup 
 for container f2001064-e076-4978-b764-ed12a5244e78, assuming already destroyed
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 
 00:09:49.265467 10020 cpushare.cpp:199] Couldn't find cgroup for container 
 f2001064-e076-4978-b764-ed12a5244e78
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.266448 10022 containerizer.cpp:1147] Executor for container 
 'f2001064-e076-4978-b764-ed12a5244e78' has exited
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.266466 10022 containerizer.cpp:938] Destroying container 
 'f2001064-e076-4978-b764-ed12a5244e78'
 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:50.593585 10021 slave.cpp:3735] Sending reconnect request to executor 
 chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
 20150226-230228-2931198986-5050-717- at executor(1)@10.81.189.232:43130
 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 
 00:09:50.597843 10024 slave.cpp:3175] Termination of executor 
 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 
 '20150226-230228-2931198986-5050-717-' failed: Container 
 'f2001064-e076-4978-b764-ed12a5244e78' not found
 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 
 00:09:50.597949 10025 slave.cpp:3429] Failed to unmonitor container for 
 executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
 20150226-230228-2931198986-5050-717-: Not monitored
 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:50.598785 10024 slave.cpp:2508] Handling status update TASK_FAILED 
 (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task 
 chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
 20150226-230228-2931198986-5050-717- from @0.0.0.0:0
 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 
 00:09:50.599093 

[jira] [Created] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME

2015-06-19 Thread Cody Maloney (JIRA)
Cody Maloney created MESOS-2902:
---

 Summary: Enable Mesos to use arbitrary script / module to figure 
out IP, HOSTNAME
 Key: MESOS-2902
 URL: https://issues.apache.org/jira/browse/MESOS-2902
 Project: Mesos
  Issue Type: Improvement
  Components: master, modules, slave
Reporter: Cody Maloney
Priority: Minor


Currently Mesos tries to guess the IP and HOSTNAME by doing a reverse DNS
lookup. This doesn't work on a lot of clouds: we may want public IPs (which
the default DNS doesn't return), there may be no FQDNs (Azure), or the correct
way to find the address may be to call some cloud-specific endpoint.

If Mesos / libprocess could load a mesos-module (or run a script) provided per
cloud, we could determine the IP / hostname reliably for the given
environment. It would also mean we could ship one identical set of files to
all hosts in a given provider that doesn't have the DNS scheme and hostnames
that libprocess/Mesos expect. Currently we have to generate host-specific
config files which Mesos uses to guess.

The host-specific files fall apart if machines change IP / hostname without
being reinstalled.
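
A self-contained sketch of the script approach (not Mesos code; the script
path is hypothetical): run a per-cloud detection program and use its first
line of stdout as the address.
{code}
#include <cstdio>
#include <iostream>
#include <string>

// Runs `command` and returns its first line of output, or "" on error.
std::string detect(const std::string& command)
{
  FILE* pipe = popen(command.c_str(), "r");
  if (pipe == nullptr) {
    return "";
  }

  char buffer[256];
  std::string result;
  if (fgets(buffer, sizeof(buffer), pipe) != nullptr) {
    result = buffer;
    if (!result.empty() && result.back() == '\n') {
      result.pop_back();  // Trim the trailing newline.
    }
  }

  pclose(pipe);
  return result;
}

int main()
{
  // On EC2 the per-cloud script might just curl the metadata endpoint;
  // the path below is hypothetical.
  std::cout << detect("/usr/bin/detect_mesos_ip") << std::endl;
  return 0;
}
{code}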



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MESOS-2891) Performance regression in hierarchical allocator.

2015-06-19 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu resolved MESOS-2891.
---
   Resolution: Fixed
Fix Version/s: 0.23.0

 Performance regression in hierarchical allocator.
 -

 Key: MESOS-2891
 URL: https://issues.apache.org/jira/browse/MESOS-2891
 Project: Mesos
  Issue Type: Bug
  Components: allocation, master
Reporter: Benjamin Mahler
Assignee: Jie Yu
Priority: Blocker
  Labels: twitter
 Fix For: 0.23.0

 Attachments: Screen Shot 2015-06-18 at 5.02.26 PM.png, perf-kernel.svg


 For large clusters, the 0.23.0 allocator cannot keep up with the volume of 
 slaves. After the following slave was re-registered, it took the allocator a 
 long time to work through the backlog of slaves to add:
 {noformat:title=45 minute delay}
 I0618 18:55:40.738399 10172 master.cpp:3419] Re-registered slave 
 20150422-211121-2148346890-5050-3253-S4695
 I0618 19:40:14.960636 10164 hierarchical.hpp:496] Added slave 
 20150422-211121-2148346890-5050-3253-S4695
 {noformat}
 Empirically, 
 [addSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L462]
  and 
 [updateSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L533]
  have become expensive.
 Some timings from a production cluster reveal that the allocator is spending
 low tens of milliseconds on each call to {{addSlave}} and {{updateSlave}};
 with tens of thousands of slaves, this amounts to the large delay seen above.
 We also saw a slow, steady increase in memory consumption, hinting further at
 a queue backup in the allocator.
 A synthetic benchmark like the one we did for the registrar would be prudent
 here, along with visibility into the allocator's queue size.
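 For a rough sense of scale (illustrative arithmetic, not measurements from
 the cluster): at 25 ms per call, 50,000 {{addSlave}} calls alone take
 50,000 × 0.025 s = 1,250 s, about 21 minutes; with {{updateSlave}} in the
 same range, a backlog on the order of the 45-minute gap above follows
 directly.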



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2891) Performance regression in hierarchical allocator.

2015-06-19 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594082#comment-14594082
 ] 

Jie Yu commented on MESOS-2891:
---

commit 68505cd0a478a96393ca988e74f99460333f5e45
Author: Jie Yu yujie@gmail.com
Date:   Fri Jun 19 15:20:07 2015 -0700

Fixed a bug in test filter that prevent some tests from being launched.

Review: https://reviews.apache.org/r/35671

commit ce1c6e2aad748d9f999c09b9bb4897e19fc18175
Author: Jie Yu yujie@gmail.com
Date:   Fri Jun 19 12:38:24 2015 -0700

Improved the performance of DRF sorter by caching the scalars.

Review: https://reviews.apache.org/r/35664

commit 114d2aa568284eba98dad60f8265c573112bad49
Author: Jie Yu yujie@gmail.com
Date:   Fri Jun 19 12:37:27 2015 -0700

Added a helper in Resources to get all scalar resources.

Review: https://reviews.apache.org/r/35663
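
A sketch of the caching idea behind the sorter commit above (shapes and names
are assumptions, not the actual patch): keep each client's scalar totals
updated once per allocation change, so the DRF share computation no longer
filters the full Resources object on every comparison.
{code}
#include <algorithm>
#include <map>
#include <string>

// Stand-in for the scalar subset of Resources (cpus, mem, disk, ...).
struct Scalars
{
  std::map<std::string, double> values;
};

class Sorter
{
public:
  // Called when a client's allocation changes: update the cache once.
  void update(const std::string& client, const Scalars& allocation)
  {
    allocations[client] = allocation;
  }

  // Dominant share: max over resources of allocated / total, computed
  // from the cached scalars.
  double share(const std::string& client, const Scalars& total) const
  {
    auto it = allocations.find(client);
    if (it == allocations.end()) {
      return 0.0;
    }

    double result = 0.0;
    for (const auto& entry : it->second.values) {
      auto t = total.values.find(entry.first);
      if (t != total.values.end() && t->second > 0.0) {
        result = std::max(result, entry.second / t->second);
      }
    }
    return result;
  }

private:
  std::map<std::string, Scalars> allocations;
};
{code}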

 Performance regression in hierarchical allocator.
 -

 Key: MESOS-2891
 URL: https://issues.apache.org/jira/browse/MESOS-2891
 Project: Mesos
  Issue Type: Bug
  Components: allocation, master
Reporter: Benjamin Mahler
Assignee: Jie Yu
Priority: Blocker
  Labels: twitter
 Fix For: 0.23.0

 Attachments: Screen Shot 2015-06-18 at 5.02.26 PM.png, perf-kernel.svg


 For large clusters, the 0.23.0 allocator cannot keep up with the volume of 
 slaves. After the following slave was re-registered, it took the allocator a 
 long time to work through the backlog of slaves to add:
 {noformat:title=45 minute delay}
 I0618 18:55:40.738399 10172 master.cpp:3419] Re-registered slave 
 20150422-211121-2148346890-5050-3253-S4695
 I0618 19:40:14.960636 10164 hierarchical.hpp:496] Added slave 
 20150422-211121-2148346890-5050-3253-S4695
 {noformat}
 Empirically, 
 [addSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L462]
  and 
 [updateSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L533]
  have become expensive.
 Some timings from a production cluster reveal that the allocator is spending
 low tens of milliseconds on each call to {{addSlave}} and {{updateSlave}};
 with tens of thousands of slaves, this amounts to the large delay seen above.
 We also saw a slow, steady increase in memory consumption, hinting further at
 a queue backup in the allocator.
 A synthetic benchmark like the one we did for the registrar would be prudent
 here, along with visibility into the allocator's queue size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2664) Modernize the codebase to C++11

2015-06-19 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594083#comment-14594083
 ] 

Benjamin Mahler commented on MESOS-2664:


Not relying on {{default}} seems like a nice simplification we could make;
happy to move the run-time error to a compile-time one :)

Curious whether it will bleed into our 3rd-party dependencies?
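
For illustration, a self-contained example of that point (not Mesos code):
with every enumerator handled and no {{default}}, GCC and Clang's -Wswitch
(or -Werror=switch) turn a missed case into a compile-time diagnostic instead
of a run-time error path.
{code}
enum class State { RUNNING, FINISHED };

const char* name(State state)
{
  switch (state) {
    case State::RUNNING:  return "RUNNING";
    case State::FINISHED: return "FINISHED";
    // No default: adding State::KILLED later makes this switch warn
    // under -Wswitch (an error under -Werror) until it is handled.
  }
  return "UNKNOWN";  // Unreachable today; silences -Wreturn-type.
}
{code}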

 Modernize the codebase to C++11
 ---

 Key: MESOS-2664
 URL: https://issues.apache.org/jira/browse/MESOS-2664
 Project: Mesos
  Issue Type: Epic
  Components: technical debt
Reporter: Michael Park
Assignee: Michael Park
  Labels: mesosphere

 Since [this 
 commit|https://github.com/apache/mesos/commit/0f5c78fad3423181f7227027eb42d162811514e7],
  we officially require GCC-4.8+ and Clang-3.5+. This means that we now have
 full C++11 support and can therefore start to modernize our codebase to be
 more readable, safer, and more efficient!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2903) Network isolator should not fail when target state already exists

2015-06-19 Thread Paul Brett (JIRA)
Paul Brett created MESOS-2903:
-

 Summary: Network isolator should not fail when target state 
already exists
 Key: MESOS-2903
 URL: https://issues.apache.org/jira/browse/MESOS-2903
 Project: Mesos
  Issue Type: Bug
  Components: isolation
Reporter: Paul Brett


Network isolator has multiple instances of the following pattern:

{noformat}
  Try<bool> something = ::create();
  if (something.isError()) {
    ++metrics.something_errors;
    return Failure("Failed to create something ...");
  } else if (!icmpVethToEth0.get()) {
    ++metrics.adding_veth_icmp_filters_already_exist;
    return Failure("Something already exists");
  }
{noformat}

These failures have occurred in operation due to a failure to recover or
delete an orphan, leaving the slave online but unable to create new
resources. We should convert the second failure message in this pattern to an
informational message, since the final state of the system is the state that
we requested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2891) Performance regression in hierarchical allocator.

2015-06-19 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-2891:
---
  Sprint: Twitter Mesos Q2 Sprint 5
Shepherd: Benjamin Mahler
Assignee: Jie Yu
Story Points: 3

 Performance regression in hierarchical allocator.
 -

 Key: MESOS-2891
 URL: https://issues.apache.org/jira/browse/MESOS-2891
 Project: Mesos
  Issue Type: Bug
  Components: allocation, master
Reporter: Benjamin Mahler
Assignee: Jie Yu
Priority: Blocker
  Labels: twitter
 Attachments: Screen Shot 2015-06-18 at 5.02.26 PM.png, perf-kernel.svg


 For large clusters, the 0.23.0 allocator cannot keep up with the volume of 
 slaves. After the following slave was re-registered, it took the allocator a 
 long time to work through the backlog of slaves to add:
 {noformat:title=45 minute delay}
 I0618 18:55:40.738399 10172 master.cpp:3419] Re-registered slave 
 20150422-211121-2148346890-5050-3253-S4695
 I0618 19:40:14.960636 10164 hierarchical.hpp:496] Added slave 
 20150422-211121-2148346890-5050-3253-S4695
 {noformat}
 Empirically, 
 [addSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L462]
  and 
 [updateSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L533]
  have become expensive.
 Some timings from a production cluster reveal that the allocator is spending
 low tens of milliseconds on each call to {{addSlave}} and {{updateSlave}};
 with tens of thousands of slaves, this amounts to the large delay seen above.
 We also saw a slow, steady increase in memory consumption, hinting further at
 a queue backup in the allocator.
 A synthetic benchmark like the one we did for the registrar would be prudent
 here, along with visibility into the allocator's queue size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2903) Network isolator should not fail when target state already exists

2015-06-19 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594094#comment-14594094
 ] 

Jie Yu commented on MESOS-2903:
---

Can you paste the relevant logging to show why this is necessary?

 Network isolator should not fail when target state already exists
 -

 Key: MESOS-2903
 URL: https://issues.apache.org/jira/browse/MESOS-2903
 Project: Mesos
  Issue Type: Bug
  Components: isolation
Reporter: Paul Brett

 Network isolator has multiple instances of the following pattern:
 {noformat}
   Try<bool> something = ::create();
   if (something.isError()) {
     ++metrics.something_errors;
     return Failure("Failed to create something ...");
   } else if (!icmpVethToEth0.get()) {
     ++metrics.adding_veth_icmp_filters_already_exist;
     return Failure("Something already exists");
   }
 {noformat}
 These failures have occurred in operation due to a failure to recover or
 delete an orphan, leaving the slave online but unable to create new
 resources. We should convert the second failure message in this pattern to
 an informational message, since the final state of the system is the state
 that we requested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-2891) Performance regression in hierarchical allocator.

2015-06-19 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594146#comment-14594146
 ] 

Vinod Kone edited comment on MESOS-2891 at 6/20/15 12:12 AM:
-

Reopening because updateSlave (and likely updateAllocation) also need to be 
addressed.

Some numbers from a benchmark test.

{code}
[ RUN  ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.UpdateSlave/0
Added 1000 slaves in 766.99568ms
Updated 1000 slaves in 6.807111421secs
[   OK ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.UpdateSlave/0 
(7751 ms)
[ RUN  ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.UpdateSlave/1
Added 5000 slaves in 3.886493374secs
Updated 5000 slaves in 4.07753897601667mins
[   OK ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.UpdateSlave/1 
(249472 ms)
[ RUN  ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.UpdateSlave/2
Added 10000 slaves in 7.720996758secs
Updated 10000 slaves in 16.4897123807167mins
[   OK ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.UpdateSlave/2 
(999001 ms)
{code}



was (Author: vinodkone):
Reopening because updateSlave (and likely updateAllocation) also need to be 
addressed.

Some numbers from a benchmark test.

{code}
[ RUN  ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.UpdateSlave/0
Added 1000 slaves in 766.99568ms
Updated 1000 slaves in 6.807111421secs
[   OK ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.UpdateSlave/0 
(7751 ms)
[ RUN  ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.UpdateSlave/1
Added 5000 slaves in 3.886493374secs
Updated 5000 slaves in 4.07753897601667mins
[   OK ]
{code}


 Performance regression in hierarchical allocator.
 -

 Key: MESOS-2891
 URL: https://issues.apache.org/jira/browse/MESOS-2891
 Project: Mesos
  Issue Type: Bug
  Components: allocation, master
Reporter: Benjamin Mahler
Assignee: Jie Yu
Priority: Blocker
  Labels: twitter
 Fix For: 0.23.0

 Attachments: Screen Shot 2015-06-18 at 5.02.26 PM.png, perf-kernel.svg


 For large clusters, the 0.23.0 allocator cannot keep up with the volume of 
 slaves. After the following slave was re-registered, it took the allocator a 
 long time to work through the backlog of slaves to add:
 {noformat:title=45 minute delay}
 I0618 18:55:40.738399 10172 master.cpp:3419] Re-registered slave 
 20150422-211121-2148346890-5050-3253-S4695
 I0618 19:40:14.960636 10164 hierarchical.hpp:496] Added slave 
 20150422-211121-2148346890-5050-3253-S4695
 {noformat}
 Empirically, 
 [addSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L462]
  and 
 [updateSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L533]
  have become expensive.
 Some timings from a production cluster reveal that the allocator is spending
 low tens of milliseconds on each call to {{addSlave}} and {{updateSlave}};
 with tens of thousands of slaves, this amounts to the large delay seen above.
 We also saw a slow, steady increase in memory consumption, hinting further at
 a queue backup in the allocator.
 A synthetic benchmark like the one we did for the registrar would be prudent
 here, along with visibility into the allocator's queue size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2419) Slave recovery not recovering tasks when using systemd

2015-06-19 Thread Chris Fortier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593904#comment-14593904
 ] 

Chris Fortier commented on MESOS-2419:
--

that would be fantastic!

Here's the systemd unit and the associated logs:

Systemd unit:
```
[Unit]
Description=MesosSlave
After=docker.service dockercfg.service
Requires=docker.service dockercfg.service

[Service]
Environment=MESOS_IMAGE=mesosphere/mesos-slave:0.22.1-1.0.ubuntu1404
Environment=ZOOKEEPER=internal-portfolio-Internal-1GZCG4XPMN2WS-969465822.us-west-2.elb.amazonaws.com:2181

User=core
KillMode=process
Restart=on-failure
RestartSec=20
TimeoutStartSec=0
ExecStartPre=-/usr/bin/docker kill mesos_slave
ExecStartPre=-/usr/bin/docker rm mesos_slave
ExecStartPre=/usr/bin/docker pull ${MESOS_IMAGE}
ExecStart=/usr/bin/sh -c "sudo /usr/bin/docker run \
--name=mesos_slave \
--net=host \
--privileged \
-v /home/core/.dockercfg:/root/.dockercfg:ro \
-v /sys:/sys \
-v /usr/bin/docker:/usr/bin/docker:ro \
-v /var/run/docker.sock:/var/run/docker.sock \
-v /lib64/libdevmapper.so.1.02:/lib/libdevmapper.so.1.02:ro \
-v /var/lib/mesos/slave:/var/lib/mesos/slave \
${MESOS_IMAGE} \
--ip=$(/usr/bin/ip -o -4 addr list eth0 | grep global | awk \'{print $4}\' 
| cut -d/ -f1) \
--attributes=zone:$(curl -s 
http://169.254.169.254/latest/meta-data/placement/availability-zone)\;os:coreos 
\
--containerizers=docker \
--executor_registration_timeout=10mins \
--hostname=`curl -s 
http://169.254.169.254/latest/meta-data/public-hostname` \
--isolation=cgroups/cpu,cgroups/mem \
--log_dir=/var/log/mesos \
--master=zk://${ZOOKEEPER}/mesos \
--work_dir=/var/lib/mesos/slave"
ExecStop=/usr/bin/docker stop mesos_slave
ExecStartPost=/usr/bin/docker pull behance/utility:latest
ExecStartPost=/usr/bin/docker pull ubuntu:14.04
ExecStartPost=/usr/bin/docker pull debian:jessie

[Install]
WantedBy=multi-user.target

[X-Fleet]
Global=true
MachineMetadata=role=worker
```


Logs:
```
fortier@ip-10-43-3-126 ~ $ docker logs 7cd21326a98c
I0619 17:57:22.075104 15406 logging.cpp:172] INFO level logging started!
I0619 17:57:22.075305 15406 main.cpp:156] Build: 2015-05-05 06:15:50 by root
I0619 17:57:22.075314 15406 main.cpp:158] Version: 0.22.1
I0619 17:57:22.075319 15406 main.cpp:161] Git tag: 0.22.1
I0619 17:57:22.075322 15406 main.cpp:165] Git SHA: 
d6309f92a7f9af3ab61a878403e3d9c284ea87e0
2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@712: Client 
environment:zookeeper.version=zookeeper C client 3.4.5
2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@716: Client 
environment:host.name=ip-10-43-3-126.us-west-2.compute.internal
2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@723: Client 
environment:os.name=Linux
2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@724: Client 
environment:os.arch=4.0.5
2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@725: Client 
environment:os.version=#2 SMP Thu Jun 18 08:53:45 UTC 2015
I0619 17:57:22.177387 15406 main.cpp:200] Starting Mesos slave
2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@733: Client 
environment:user.name=(null)
I0619 17:57:22.178097 15406 slave.cpp:174] Slave started on 1)@10.43.3.126:5051
2015-06-19 17:57:22,178:15406(0x7f918ec5d700):ZOO_INFO@log_env@741: Client 
environment:user.home=/root
2015-06-19 17:57:22,178:15406(0x7f918ec5d700):ZOO_INFO@log_env@753: Client 
environment:user.dir=/
2015-06-19 17:57:22,178:15406(0x7f918ec5d700):ZOO_INFO@zookeeper_init@786: 
Initiating client connection, 
host=internal-portfolio-Internal-1GZCG4XPMN2WS-969465822.us-west-2.elb.amazonaws.com:2181
 sessionTimeout=1 watcher=0x7f91936bfa60 sessionId=0 sessionPasswd=<null> 
context=0x7f9178001010 flags=0
I0619 17:57:22.178235 15406 slave.cpp:322] Slave resources: cpus(*):8; 
mem(*):14019; disk(*):42121; ports(*):[31000-32000]
I0619 17:57:22.178401 15406 slave.cpp:351] Slave hostname: 
ec2-52-24-66-221.us-west-2.compute.amazonaws.com
I0619 17:57:22.178427 15406 slave.cpp:352] Slave checkpoint: true
I0619 17:57:22.179797 15415 state.cpp:35] Recovering state from 
'/var/lib/mesos/slave/meta'
I0619 17:57:22.180737 15417 slave.cpp:3890] Recovering framework 
20150612-153240-4144114442-5050-1-
I0619 17:57:22.180768 15417 slave.cpp:4319] Recovering executor 
'portfolio-reynard-behance--pro2-reynard---e93109da6e527fe95c203885298d73d40ed9a5aa.7dd074ca-16ab-11e5-9ca3-7a8174cf00fe'
 of framework 20150612-153240-4144114442-5050-1-
I0619 17:57:22.181002 15412 status_update_manager.cpp:197] Recovering status 
update manager
I0619 17:57:22.181032 15412 status_update_manager.cpp:205] Recovering executor 
'portfolio-reynard-behance--pro2-reynard---e93109da6e527fe95c203885298d73d40ed9a5aa.7dd074ca-16ab-11e5-9ca3-7a8174cf00fe'
 of framework 20150612-153240-4144114442-5050-1-
I0619 17:57:22.181354 15414 

[jira] [Deleted] (MESOS-2901) test - please ignore

2015-06-19 Thread Marco Massenzio (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Massenzio deleted MESOS-2901:
---


 test - please ignore
 

 Key: MESOS-2901
 URL: https://issues.apache.org/jira/browse/MESOS-2901
 Project: Mesos
  Issue Type: Wish
Reporter: Marco Massenzio
  Labels: label-foo, mesosphere, random





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2901) test - please ignore

2015-06-19 Thread Marco Massenzio (JIRA)
Marco Massenzio created MESOS-2901:
--

 Summary: test - please ignore
 Key: MESOS-2901
 URL: https://issues.apache.org/jira/browse/MESOS-2901
 Project: Mesos
  Issue Type: Wish
Reporter: Marco Massenzio






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME

2015-06-19 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594013#comment-14594013
 ] 

Benjamin Mahler commented on MESOS-2902:


Any reason that Mesos itself has to execute the script? It seems like the
start script for Mesos should run the arbitrary code / external programs
necessary to compute flags. Taking this to its extreme, should we add scripts
for all of the flags (e.g. {{--resources}}, {{--quiet}}, etc.)?

 Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME
 

 Key: MESOS-2902
 URL: https://issues.apache.org/jira/browse/MESOS-2902
 Project: Mesos
  Issue Type: Improvement
  Components: master, modules, slave
Reporter: Cody Maloney
Priority: Minor
  Labels: mesosphere

 Currently Mesos tries to guess the IP and HOSTNAME by doing a reverse DNS
 lookup. This doesn't work on a lot of clouds: we may want public IPs (which
 the default DNS doesn't return), there may be no FQDNs (Azure), or the
 correct way to find the address may be to call some cloud-specific endpoint.
 If Mesos / libprocess could load a mesos-module (or run a script) provided
 per cloud, we could determine the IP / hostname reliably for the given
 environment. It would also mean we could ship one identical set of files to
 all hosts in a given provider that doesn't have the DNS scheme and hostnames
 that libprocess/Mesos expect. Currently we have to generate host-specific
 config files which Mesos uses to guess.
 The host-specific files fall apart if machines change IP / hostname without
 being reinstalled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2798) Export statistics on unevictable memory

2015-06-19 Thread Chi Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chi Zhang updated MESOS-2798:
-
Sprint: Twitter Q2 Sprint 3, Twitter Mesos Q2 Sprint 5  (was: Twitter Q2 
Sprint 3)

 Export statistics on unevictable memory
 -

 Key: MESOS-2798
 URL: https://issues.apache.org/jira/browse/MESOS-2798
 Project: Mesos
  Issue Type: Improvement
Reporter: Chi Zhang
Assignee: Chi Zhang
  Labels: twitter





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2798) Export statistics on unevictable memory

2015-06-19 Thread Chi Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593956#comment-14593956
 ] 

Chi Zhang commented on MESOS-2798:
--

https://reviews.apache.org/r/35668/
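
For context, a sketch of where such a statistic could come from, assuming
cgroup v1 accounting (the cgroup path below is hypothetical; memory.stat
exposes an unevictable byte count):
{code}
#include <fstream>
#include <iostream>
#include <string>

// Returns the "unevictable" bytes from a memory.stat file, or -1.
long long unevictable(const std::string& path)
{
  std::ifstream file(path);
  std::string key;
  long long value;
  while (file >> key >> value) {
    if (key == "unevictable") {
      return value;
    }
  }
  return -1;
}

int main()
{
  std::cout
    << unevictable("/sys/fs/cgroup/memory/mesos/memory.stat")
    << std::endl;
  return 0;
}
{code}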

 Export statistics on unevictable memory
 -

 Key: MESOS-2798
 URL: https://issues.apache.org/jira/browse/MESOS-2798
 Project: Mesos
  Issue Type: Improvement
Reporter: Chi Zhang
Assignee: Chi Zhang
  Labels: twitter





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME

2015-06-19 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594068#comment-14594068
 ] 

Cody Maloney edited comment on MESOS-2902 at 6/19/15 10:58 PM:
---

I can't drop it into a systemd unit file that runs a command before Mesos and
pass the data along without making a temp file, which is an odd way to do the
config generation.

I could make a new mesos-init-fetch-ip script which I run instead of Mesos,
and that script then execs Mesos. This confuses init system tracking of
processes somewhat, and obscures what the underlying commands being run are.

It also adds a lot of error scenarios. For example, the wrapper script is
updated and the change contains a typo (it sets LIBPROCES_IP instead of
LIBPROCESS_IP), so libprocess silently ignores the misspelled environment
variable. In the environment I'm in, libprocess' internal logic guesses an IP
that works, so the slightly incorrect change gets ingrained as it rolls out
across the cluster.

Currently one of the biggest pain points in initially setting up a Mesos
cluster is getting the right IPs + hostnames set up. If the Mesos master and
slave had a required flag, {{\-\-ip\-detection=reverse_dns}} or
{{--ip-detection=/usr/bin/detect_mesos_ip}}, users would see what Mesos is
doing and could make an informed decision, rather than running Mesos and
having things break with really bad error messages (wrong hostname/IP on your
scheduler? No logging of things breaking happens...).

As far as generalizing it further: note I'm saying IP and HOSTNAME are
host-specific, which is why this sort of capability makes sense. It is
impossible for me to know, when I'm installing static config files to a host,
VM, or Docker container, what the IP and hostname are going to be. That is
not the case for {{\-\-resources}}, {{\-\-quiet}} and the like, which can be
pre-determined for a host. IP and hostname are runtime parameters of a
machine (when you attach your machine to a network, they are assigned
dynamically).
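
A sketch of how such a required flag could dispatch (all names here are
hypothetical, not an existing Mesos API):
{code}
#include <string>

// Hypothetical helpers, declared only for the sketch.
std::string reverseDnsLookup();
std::string runDetectionScript(const std::string& path);

std::string detectIP(const std::string& ipDetection)
{
  if (ipDetection == "reverse_dns") {
    return reverseDnsLookup();  // Today's behavior, made explicit.
  } else if (!ipDetection.empty() && ipDetection[0] == '/') {
    return runDetectionScript(ipDetection);  // Per-cloud executable.
  }
  return "";  // Unknown method: fail loudly instead of guessing.
}
{code}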


was (Author: cmaloney):
I can't drop it in a systemd unit file which runs a command before mesos and 
pass the data without making a temp file which is an odd way to do the config 
generation.

I could make a new mesos-init-fetch-ip script which I run instead of mesos, and 
that script then execs mesos. This confuses init system tracking of processes 
somewhat, and obfuscates what the underlying commands being run are.

It also adds a lot of error scenarios. For example, the wrapper script is 
updated and the change contains a typo, so it sets LIBPROCES_IP instead of 
LIBPROCESS_IP), Libprocess silently ignores the wrong environment variable. The 
environment I'm in Libprocess' internal logic guesses an IP that works. It gets 
engrained slightly incorrect as it rolls out across the cluster.

Currently one of the biggest pain points in initially setting up a Mesos 
cluster is getting the right IPs + Hostnames setup. If Mesos Master and Mesos 
Slave had a flag which was required, {{ \-\-ip\-detection=reverse_dns}} or 
{{--ip-detection=,/usr/bin/detect_mesos_ip} }}. It would make it so that users 
see what mesos is doing and make an informed decision, rather than running 
Mesos, having things break with really bad error messages (Wrong hostname/IP on 
your Scheduler? No logging of things breaking happens...).

As far as generalizing it further. Note I'm saying IP, HOSTNAME are 
host-specific, which is why this sort of capability makes sense. It is 
impossible for me to know when I'm installing static config files to a Host, 
VM, Docker what the IP and Hostname are going to be. That is not the case for 
{{\-\-resources}}, {{\-quiet}} and the like. They are able to be pre-determined 
for a host. IP and Hostname are Runtime parameters of a machine (When you 
attach your machine to a network, they are assigned dynamically).

 Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME
 

 Key: MESOS-2902
 URL: https://issues.apache.org/jira/browse/MESOS-2902
 Project: Mesos
  Issue Type: Improvement
  Components: master, modules, slave
Reporter: Cody Maloney
Priority: Minor
  Labels: mesosphere

 Currently Mesos tries to guess the IP and HOSTNAME by doing a reverse DNS
 lookup. This doesn't work on a lot of clouds: we may want public IPs (which
 the default DNS doesn't return), there may be no FQDNs (Azure), or the
 correct way to find the address may be to call some cloud-specific endpoint.
 If Mesos / libprocess could load a mesos-module (or run a script) provided
 per cloud, we could determine the IP / hostname reliably for the given
 environment. It would also mean we could ship one identical set of files to
 all hosts in a given provider which 

[jira] [Created] (MESOS-2904) Add slave metric to count container launch failures

2015-06-19 Thread Paul Brett (JIRA)
Paul Brett created MESOS-2904:
-

 Summary: Add slave metric to count container launch failures
 Key: MESOS-2904
 URL: https://issues.apache.org/jira/browse/MESOS-2904
 Project: Mesos
  Issue Type: Bug
  Components: slave, statistics
Reporter: Paul Brett
Assignee: Paul Brett


We have seen circumstances where a machine has been consistently unable to
launch containers due to an inconsistent state (for example, unexpected
network configuration). Adding a metric to track container launch failures
will allow us to detect and alert on slaves in such a state.
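
A sketch of the metric wiring, following the ++metrics.something_errors
pattern quoted in MESOS-2903 (the metric name and registration details here
are assumptions, not the final patch):
{code}
#include <process/metrics/counter.hpp>
#include <process/metrics/metrics.hpp>

struct Metrics
{
  Metrics()
    : container_launch_failures("slave/container_launch_failures")
  {
    process::metrics::add(container_launch_failures);
  }

  ~Metrics()
  {
    process::metrics::remove(container_launch_failures);
  }

  process::metrics::Counter container_launch_failures;
};

// At the failure site in the container launch path:
//   ++metrics.container_launch_failures;
{code}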



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME

2015-06-19 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594068#comment-14594068
 ] 

Cody Maloney commented on MESOS-2902:
-

I can't drop it into a systemd unit file that runs a command before Mesos and
pass the data along without making a temp file, which is an odd way to do the
config generation.

I could make a new mesos-init-fetch-ip script which I run instead of Mesos,
and that script then execs Mesos. This confuses init system tracking of
processes somewhat, and obscures what the underlying commands being run are.

It also adds a lot of error scenarios. For example, the wrapper script is
updated and the change contains a typo (it sets LIBPROCES_IP instead of
LIBPROCESS_IP), so libprocess silently ignores the misspelled environment
variable. In the environment I'm in, libprocess' internal logic guesses an IP
that works, so the slightly incorrect change gets ingrained as it rolls out
across the cluster.

Currently one of the biggest pain points in initially setting up a Mesos
cluster is getting the right IPs + hostnames set up. If the Mesos master and
slave had a required flag, {{\-\-ip\-detection=reverse_dns}} or
{{--ip-detection=/usr/bin/detect_mesos_ip}}, users would see what Mesos is
doing and could make an informed decision, rather than running Mesos and
having things break with really bad error messages (wrong hostname/IP on your
scheduler? No logging of things breaking happens...).

As far as generalizing it further: note I'm saying IP and HOSTNAME are
host-specific, which is why this sort of capability makes sense. It is
impossible for me to know, when I'm installing static config files to a host,
VM, or Docker container, what the IP and hostname are going to be. That is
not the case for {{\-\-resources}}, {{\-\-quiet}} and the like, which can be
pre-determined for a host. IP and hostname are runtime parameters of a
machine (when you attach your machine to a network, they are assigned
dynamically).

 Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME
 

 Key: MESOS-2902
 URL: https://issues.apache.org/jira/browse/MESOS-2902
 Project: Mesos
  Issue Type: Improvement
  Components: master, modules, slave
Reporter: Cody Maloney
Priority: Minor
  Labels: mesosphere

 Currently Mesos tries to guess the IP and HOSTNAME by doing a reverse DNS
 lookup. This doesn't work on a lot of clouds: we may want public IPs (which
 the default DNS doesn't return), there may be no FQDNs (Azure), or the
 correct way to find the address may be to call some cloud-specific endpoint.
 If Mesos / libprocess could load a mesos-module (or run a script) provided
 per cloud, we could determine the IP / hostname reliably for the given
 environment. It would also mean we could ship one identical set of files to
 all hosts in a given provider that doesn't have the DNS scheme and hostnames
 that libprocess/Mesos expect. Currently we have to generate host-specific
 config files which Mesos uses to guess.
 The host-specific files fall apart if machines change IP / hostname without
 being reinstalled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2830) Add an endpoint to slaves to allow launching system administration tasks

2015-06-19 Thread Marco Massenzio (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594204#comment-14594204
 ] 

Marco Massenzio commented on MESOS-2830:


As I stated:
{quote]
I'll start looking into this and probably draft a design doc.
{quote}

design is actual work :)

 Add an endpoint to slaves to allow launching system administration tasks
 

 Key: MESOS-2830
 URL: https://issues.apache.org/jira/browse/MESOS-2830
 Project: Mesos
  Issue Type: Wish
  Components: slave
Reporter: Cody Maloney
Assignee: Marco Massenzio
Priority: Minor
  Labels: mesosphere

 As a system administrator, I often need to run an organization-mandated task
 on every machine in the cluster. Ideally I could do this within the
 framework of Mesos resources if it is a cleanup or auditing task, but
 sometimes I just have to run something, and run it now, regardless of
 whether a machine has unaccounted resources (e.g. adding/removing a user).
 Currently, to do this I have to completely bypass Mesos and SSH to the box.
 Ideally I could tell a Mesos slave (with proper authentication) to run a
 container with the limited special permissions needed to get the task done.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2891) Performance regression in hierarchical allocator.

2015-06-19 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594238#comment-14594238
 ] 

Vinod Kone commented on MESOS-2891:
---

Review: https://reviews.apache.org/r/35679
Review: https://reviews.apache.org/r/35680
Review: https://reviews.apache.org/r/35682


 Performance regression in hierarchical allocator.
 -

 Key: MESOS-2891
 URL: https://issues.apache.org/jira/browse/MESOS-2891
 Project: Mesos
  Issue Type: Bug
  Components: allocation, master
Reporter: Benjamin Mahler
Assignee: Jie Yu
Priority: Blocker
  Labels: twitter
 Fix For: 0.23.0

 Attachments: Screen Shot 2015-06-18 at 5.02.26 PM.png, perf-kernel.svg


 For large clusters, the 0.23.0 allocator cannot keep up with the volume of 
 slaves. After the following slave was re-registered, it took the allocator a 
 long time to work through the backlog of slaves to add:
 {noformat:title=45 minute delay}
 I0618 18:55:40.738399 10172 master.cpp:3419] Re-registered slave 
 20150422-211121-2148346890-5050-3253-S4695
 I0618 19:40:14.960636 10164 hierarchical.hpp:496] Added slave 
 20150422-211121-2148346890-5050-3253-S4695
 {noformat}
 Empirically, 
 [addSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L462]
  and 
 [updateSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L533]
  have become expensive.
 Some timings from a production cluster reveal that the allocator is spending
 low tens of milliseconds on each call to {{addSlave}} and {{updateSlave}};
 with tens of thousands of slaves, this amounts to the large delay seen above.
 We also saw a slow, steady increase in memory consumption, hinting further at
 a queue backup in the allocator.
 A synthetic benchmark like the one we did for the registrar would be prudent
 here, along with visibility into the allocator's queue size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2903) Network isolator should not fail when target state already exists

2015-06-19 Thread Paul Brett (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594171#comment-14594171
 ] 

Paul Brett commented on MESOS-2903:
---

The new logic will be:

{noformat}
  Try<bool> something = ::create();
  if (something.isError()) {
    ++metrics.something_errors;
    return Failure("Failed to create something ...");
  } else if (!icmpVethToEth0.get()) {
    // already exists
    Try<bool> something = ::update();
    if (something.isError()) {
      ++metrics.something_errors;
      return Failure("Failed to update something ...");
    }
  }
{noformat}
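
The same fallback written out compilably, with hypothetical stand-ins
(createFilter/updateFilter) for the real operations and distinct variable
names to avoid shadowing:
{code}
  Try<bool> created = createFilter();
  if (created.isError()) {
    ++metrics.something_errors;
    return Failure("Failed to create something: " + created.error());
  } else if (!created.get()) {
    // Already exists: converge by updating to the requested state.
    Try<bool> updated = updateFilter();
    if (updated.isError()) {
      ++metrics.something_errors;
      return Failure("Failed to update something: " + updated.error());
    }
  }
{code}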

 Network isolator should not fail when target state already exists
 -

 Key: MESOS-2903
 URL: https://issues.apache.org/jira/browse/MESOS-2903
 Project: Mesos
  Issue Type: Bug
  Components: isolation
Affects Versions: 0.23.0
Reporter: Paul Brett
Priority: Critical

 Network isolator has multiple instances of the following pattern:
 {noformat}
   Try<bool> something = ::create();
   if (something.isError()) {
     ++metrics.something_errors;
     return Failure("Failed to create something ...");
   } else if (!icmpVethToEth0.get()) {
     ++metrics.adding_veth_icmp_filters_already_exist;
     return Failure("Something already exists");
   }
 {noformat}
 These failures have occurred in operation due to a failure to recover or
 delete an orphan, leaving the slave online but unable to create new
 resources. We should convert the second failure message in this pattern to
 an informational message, since the final state of the system is the state
 that we requested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-2903) Network isolator should not fail when target state already exists

2015-06-19 Thread Paul Brett (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Brett reassigned MESOS-2903:
-

Assignee: Paul Brett

 Network isolator should not fail when target state already exists
 -

 Key: MESOS-2903
 URL: https://issues.apache.org/jira/browse/MESOS-2903
 Project: Mesos
  Issue Type: Bug
  Components: isolation
Affects Versions: 0.23.0
Reporter: Paul Brett
Assignee: Paul Brett
Priority: Critical

 Network isolator has multiple instances of the following pattern:
 {noformat}
   Try<bool> something = ::create();
   if (something.isError()) {
     ++metrics.something_errors;
     return Failure("Failed to create something ...");
   } else if (!icmpVethToEth0.get()) {
     ++metrics.adding_veth_icmp_filters_already_exist;
     return Failure("Something already exists");
   }
 {noformat}
 These failures have occurred in operation due to a failure to recover or
 delete an orphan, leaving the slave online but unable to create new
 resources. We should convert the second failure message in this pattern to
 an informational message, since the final state of the system is the state
 that we requested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2903) Network isolator should not fail when target state already exists

2015-06-19 Thread Paul Brett (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Brett updated MESOS-2903:
--
Story Points: 2

 Network isolator should not fail when target state already exists
 -

 Key: MESOS-2903
 URL: https://issues.apache.org/jira/browse/MESOS-2903
 Project: Mesos
  Issue Type: Bug
  Components: isolation
Affects Versions: 0.23.0
Reporter: Paul Brett
Assignee: Paul Brett
Priority: Critical

 Network isolator has multiple instances of the following pattern:
 {noformat}
   Try<bool> something = ::create();
   if (something.isError()) {
     ++metrics.something_errors;
     return Failure("Failed to create something ...");
   } else if (!icmpVethToEth0.get()) {
     ++metrics.adding_veth_icmp_filters_already_exist;
     return Failure("Something already exists");
   }
 {noformat}
 These failures have occurred in operation due to the failure to recover or 
 delete an orphan, causing the slave to remain online but unable to create 
 new resources. We should convert the second failure message in this pattern 
 to an informational message, since the final state of the system is the 
 state that we requested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-2830) Add an endpoint to slaves to allow launching system administration tasks

2015-06-19 Thread Marco Massenzio (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594204#comment-14594204
 ] 

Marco Massenzio edited comment on MESOS-2830 at 6/20/15 12:41 AM:
--

As I stated:
{quote}
I'll start looking into this and probably draft a design doc.
{quote}

design is actual work :)


was (Author: marco-mesos):
As I stated:
{quote}
I'll start looking into this and probably draft a design doc.
{quote}

design is actual work :)

 Add an endpoint to slaves to allow launching system administration tasks
 

 Key: MESOS-2830
 URL: https://issues.apache.org/jira/browse/MESOS-2830
 Project: Mesos
  Issue Type: Wish
  Components: slave
Reporter: Cody Maloney
Assignee: Marco Massenzio
Priority: Minor
  Labels: mesosphere

 As a System Administrator often times I need to run a organization-mandated 
 task on every machine in the cluster. Ideally I could do this within the 
 framework of mesos resources if it is a cleanup or auditing task, but 
 sometimes I just have to run something, and run it now, regardless if a 
 machine has un-accounted resources  (Ex: Adding/removing a user).
 Currently to do this I have to completely bypass Mesos and SSH to the box. 
 Ideally I could tell a mesos slave (With proper authentication) to run a 
 container with the limited special permissions needed to get the task done.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2830) Add an endpoint to slaves to allow launching system administration tasks

2015-06-19 Thread Marco Massenzio (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Massenzio updated MESOS-2830:
---
  Sprint: Mesosphere Sprint 13
Target Version/s: 0.24.0
Shepherd: Benjamin Hindman
Story Points: 8

 Add an endpoint to slaves to allow launching system administration tasks
 

 Key: MESOS-2830
 URL: https://issues.apache.org/jira/browse/MESOS-2830
 Project: Mesos
  Issue Type: Wish
  Components: slave
Reporter: Cody Maloney
Assignee: Marco Massenzio
Priority: Minor
  Labels: mesosphere

 As a System Administrator often times I need to run a organization-mandated 
 task on every machine in the cluster. Ideally I could do this within the 
 framework of mesos resources if it is a cleanup or auditing task, but 
 sometimes I just have to run something, and run it now, regardless if a 
 machine has un-accounted resources  (Ex: Adding/removing a user).
 Currently to do this I have to completely bypass Mesos and SSH to the box. 
 Ideally I could tell a mesos slave (With proper authentication) to run a 
 container with the limited special permissions needed to get the task done.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2853) Report per-container metrics from host egress filter

2015-06-19 Thread Paul Brett (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594175#comment-14594175
 ] 

Paul Brett commented on MESOS-2853:
---

Container metrics are not tracked by fq_codel on a per-filter basis, so this 
information is not available. We will wait to see how fq_codel on the host's 
eth0 interacts with real workloads before deciding whether further work is 
required.
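
For reference, the aggregate statistics that fq_codel does expose at the qdisc level can be inspected with {{tc}}; a sketch of the kind of output involved (values illustrative, not captured from a real host):

{noformat}
$ tc -s qdisc show dev eth0
qdisc fq_codel 0: root refcnt 2 limit 10240p flows 1024 quantum 1514
 Sent 1048576 bytes 1024 pkt (dropped 0, overlimits 0 requeues 0)
 maxpacket 1514 drop_overlimit 0 new_flow_count 12 ecn_mark 0
{noformat}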

 Report per-container metrics from host egress filter
 

 Key: MESOS-2853
 URL: https://issues.apache.org/jira/browse/MESOS-2853
 Project: Mesos
  Issue Type: Improvement
  Components: isolation
Reporter: Paul Brett
Assignee: Paul Brett
  Labels: twitter

 Export in statistics.json the fq_codel flow statistics for each container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2664) Modernize the codebase to C++11

2015-06-19 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594181#comment-14594181
 ] 

Michael Park commented on MESOS-2664:
-

The only 3rdparty that it affected was {{stout/protobuf.hpp}}, which is also 
where spelling out the default would be the most prevalent due to the number 
of members in the {{FieldDescriptor}} enum.

With approach (1) we have:
{code}
Try<Nothing> operator()(const JSON::Object& object) const
{
  switch (field->type()) {
    case google::protobuf::FieldDescriptor::TYPE_MESSAGE:
      /* ... */
      break;
    case google::protobuf::FieldDescriptor::TYPE_DOUBLE:
    case google::protobuf::FieldDescriptor::TYPE_FLOAT:
    case google::protobuf::FieldDescriptor::TYPE_INT32:
    case google::protobuf::FieldDescriptor::TYPE_INT64:
    case google::protobuf::FieldDescriptor::TYPE_UINT32:
    case google::protobuf::FieldDescriptor::TYPE_UINT64:
    case google::protobuf::FieldDescriptor::TYPE_FIXED32:
    case google::protobuf::FieldDescriptor::TYPE_FIXED64:
    case google::protobuf::FieldDescriptor::TYPE_BOOL:
    case google::protobuf::FieldDescriptor::TYPE_STRING:
    case google::protobuf::FieldDescriptor::TYPE_GROUP:
    case google::protobuf::FieldDescriptor::TYPE_BYTES:
    case google::protobuf::FieldDescriptor::TYPE_ENUM:
    case google::protobuf::FieldDescriptor::TYPE_SFIXED32:
    case google::protobuf::FieldDescriptor::TYPE_SFIXED64:
    case google::protobuf::FieldDescriptor::TYPE_SINT32:
    case google::protobuf::FieldDescriptor::TYPE_SINT64:
      return Error(
          "Not expecting a JSON object for field '" + field->name() + "'");
  }
  return Nothing();
}
{code}

With (2) we have:

{code}
Try<Nothing> operator()(const JSON::Object& object) const
{
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wswitch-enum"
  switch (field->type()) {
    case google::protobuf::FieldDescriptor::TYPE_MESSAGE:
      /* ... */
      break;
    default:
      return Error(
          "Not expecting a JSON object for field '" + field->name() + "'");
  }
#pragma GCC diagnostic pop
  return Nothing();
}
{code}

(2) is nice in the sense that it makes clear that {{TYPE_MESSAGE}} is really 
the only type we'll ever handle. However, considering this is the worst-case 
scenario (having to state 17 cases in place of a {{default}}) and doesn't 
occur often, I'm in favor of adopting (1).
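
A standalone illustration of the trade-off (hypothetical enum, not the protobuf one): under {{-Wswitch-enum}}, a switch over an enum warns about any omitted enumerator even when a {{default:}} label is present, which is why approach (1) spells out every case and approach (2) suppresses the warning instead:

{code}
enum class Type { MESSAGE, DOUBLE, FLOAT };

int handle(Type t)
{
  switch (t) {
    case Type::MESSAGE:
      return 1;
    case Type::DOUBLE:  // Listing the remaining enumerators keeps
    case Type::FLOAT:   // -Wswitch-enum quiet without a default:.
      return 0;
  }
  return -1;  // Unreachable for valid enumerators.
}

int main()
{
  return handle(Type::DOUBLE);
}
{code}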

 Modernize the codebase to C++11
 ---

 Key: MESOS-2664
 URL: https://issues.apache.org/jira/browse/MESOS-2664
 Project: Mesos
  Issue Type: Epic
  Components: technical debt
Reporter: Michael Park
Assignee: Michael Park
  Labels: mesosphere

 Since [this 
 commit|https://github.com/apache/mesos/commit/0f5c78fad3423181f7227027eb42d162811514e7],
  we officially require GCC-4.8+ and Clang-3.5+. This means that we now have 
 full C++11 support and therefore can start to modernize our codebase to be 
 more readable, safer and efficient!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-2664) Modernize the codebase to C++11

2015-06-19 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594181#comment-14594181
 ] 

Michael Park edited comment on MESOS-2664 at 6/20/15 12:24 AM:
---

The only 3rdparty that it affected was {{stout/protobuf.hpp}}, which is also 
where spelling out the default would be the most prevalent due to the number 
of members in the {{FieldDescriptor}} enum.

With approach (1) we have:
{code}
Try<Nothing> operator()(const JSON::Object& object) const
{
  switch (field->type()) {
    case google::protobuf::FieldDescriptor::TYPE_MESSAGE:
      /* ... */
      break;
    case google::protobuf::FieldDescriptor::TYPE_DOUBLE:
    case google::protobuf::FieldDescriptor::TYPE_FLOAT:
    case google::protobuf::FieldDescriptor::TYPE_INT32:
    case google::protobuf::FieldDescriptor::TYPE_INT64:
    case google::protobuf::FieldDescriptor::TYPE_UINT32:
    case google::protobuf::FieldDescriptor::TYPE_UINT64:
    case google::protobuf::FieldDescriptor::TYPE_FIXED32:
    case google::protobuf::FieldDescriptor::TYPE_FIXED64:
    case google::protobuf::FieldDescriptor::TYPE_BOOL:
    case google::protobuf::FieldDescriptor::TYPE_STRING:
    case google::protobuf::FieldDescriptor::TYPE_GROUP:
    case google::protobuf::FieldDescriptor::TYPE_BYTES:
    case google::protobuf::FieldDescriptor::TYPE_ENUM:
    case google::protobuf::FieldDescriptor::TYPE_SFIXED32:
    case google::protobuf::FieldDescriptor::TYPE_SFIXED64:
    case google::protobuf::FieldDescriptor::TYPE_SINT32:
    case google::protobuf::FieldDescriptor::TYPE_SINT64:
      return Error(
          "Not expecting a JSON object for field '" + field->name() + "'");
  }
  return Nothing();
}
{code}

With (2) we have:

{code}
Try<Nothing> operator()(const JSON::Object& object) const
{
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wswitch-enum"
  switch (field->type()) {
    case google::protobuf::FieldDescriptor::TYPE_MESSAGE:
      /* ... */
      break;
    default:
      return Error(
          "Not expecting a JSON object for field '" + field->name() + "'");
  }
#pragma GCC diagnostic pop
  return Nothing();
}
{code}

(2) is nice in the sense that it makes clear that {{TYPE_MESSAGE}} is really 
the only type we'll ever handle. However, considering this is the worst-case 
scenario (having to state 17 cases in place of a {{default}}) and doesn't 
occur often, I'm also in favor of adopting (1).


was (Author: mcypark):
The only 3rdparty that it affected was {{stout/protobuf.hpp}}, which is also 
where spelling out the default would be the most prevalent due to the number 
of members in the {{FieldDescriptor}} enum.

With approach (1) we have:
{code}
Try<Nothing> operator()(const JSON::Object& object) const
{
  switch (field->type()) {
    case google::protobuf::FieldDescriptor::TYPE_MESSAGE:
      /* ... */
      break;
    case google::protobuf::FieldDescriptor::TYPE_DOUBLE:
    case google::protobuf::FieldDescriptor::TYPE_FLOAT:
    case google::protobuf::FieldDescriptor::TYPE_INT32:
    case google::protobuf::FieldDescriptor::TYPE_INT64:
    case google::protobuf::FieldDescriptor::TYPE_UINT32:
    case google::protobuf::FieldDescriptor::TYPE_UINT64:
    case google::protobuf::FieldDescriptor::TYPE_FIXED32:
    case google::protobuf::FieldDescriptor::TYPE_FIXED64:
    case google::protobuf::FieldDescriptor::TYPE_BOOL:
    case google::protobuf::FieldDescriptor::TYPE_STRING:
    case google::protobuf::FieldDescriptor::TYPE_GROUP:
    case google::protobuf::FieldDescriptor::TYPE_BYTES:
    case google::protobuf::FieldDescriptor::TYPE_ENUM:
    case google::protobuf::FieldDescriptor::TYPE_SFIXED32:
    case google::protobuf::FieldDescriptor::TYPE_SFIXED64:
    case google::protobuf::FieldDescriptor::TYPE_SINT32:
    case google::protobuf::FieldDescriptor::TYPE_SINT64:
      return Error(
          "Not expecting a JSON object for field '" + field->name() + "'");
  }
  return Nothing();
}
{code}

With (2) we have:

{code}
Try<Nothing> operator()(const JSON::Object& object) const
{
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wswitch-enum"
  switch (field->type()) {
    case google::protobuf::FieldDescriptor::TYPE_MESSAGE:
      /* ... */
      break;
    default:
      return Error(
          "Not expecting a JSON object for field '" + field->name() + "'");
  }
#pragma GCC diagnostic pop
  return Nothing();
}
{code}

(2) is nice in the sense that it makes clear that {{TYPE_MESSAGE}} is really 
the only type we'll ever handle. However, considering this is the worst-case 
scenario (having to state 17 cases in place of a {{default}}) and doesn't 
occur often, I'm in favor of adopting (1).

[jira] [Resolved] (MESOS-2843) Update the design doc to include updating capabilities field

2015-06-19 Thread Aditi Dixit (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditi Dixit resolved MESOS-2843.

Resolution: Fixed

 Update the design doc to include updating capabilities field
 

 Key: MESOS-2843
 URL: https://issues.apache.org/jira/browse/MESOS-2843
 Project: Mesos
  Issue Type: Bug
Reporter: Vinod Kone
Assignee: Aditi Dixit

 We recently added a new field capabilities to FrameworkInfo. The design doc 
 needs to be updated to reflect its update semantics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2898) Write tests for new JSON (ZooKeeper) functionality

2015-06-19 Thread Marco Massenzio (JIRA)
Marco Massenzio created MESOS-2898:
--

 Summary: Write tests for new JSON (ZooKeeper) functionality
 Key: MESOS-2898
 URL: https://issues.apache.org/jira/browse/MESOS-2898
 Project: Mesos
  Issue Type: Task
Reporter: Marco Massenzio
Assignee: Marco Massenzio


Follow-up from MESOS-2340: we need to ensure this does not break the 
ZooKeeper discovery functionality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2900) The capabilities field was added recently to FrameworkInfo. It should be displayed in the state.json endpoint

2015-06-19 Thread Aditi Dixit (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditi Dixit updated MESOS-2900:
---
Description: The capabilities field was added to FrameworkInfo recently. It 
should be displayed in the state.json HTTP endpoint.
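
For illustration only, a sketch of how a frameworks entry in {{state.json}} might look once the field is surfaced (all values hypothetical; {{REVOCABLE_RESOURCES}} is used merely as an example capability):

{code}
{
  "frameworks": [
    {
      "id": "20150619-000000-16777343-5050-1234-0000",
      "name": "example-framework",
      "capabilities": ["REVOCABLE_RESOURCES"]
    }
  ]
}
{code}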

 The capabilities field was added recently to FrameworkInfo. It should be 
 displayed in the state.json endpoint
 -

 Key: MESOS-2900
 URL: https://issues.apache.org/jira/browse/MESOS-2900
 Project: Mesos
  Issue Type: Bug
Reporter: Aditi Dixit
Assignee: Aditi Dixit
Priority: Minor

 The capabilities field was added to FrameworkInfo recently. It should be 
 displayed in the state.json HTTP endpoint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2899) The capabilities field was added recently to FrameworkInfo. It should be displayed in the state.json endpoint

2015-06-19 Thread Aditi Dixit (JIRA)
Aditi Dixit created MESOS-2899:
--

 Summary: The capabilities field was added recently to 
FrameworkInfo. It should be displayed in the state.json endpoint
 Key: MESOS-2899
 URL: https://issues.apache.org/jira/browse/MESOS-2899
 Project: Mesos
  Issue Type: Bug
Reporter: Aditi Dixit
Assignee: Aditi Dixit
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1733) Change the stout path utility to declare a single, variadic 'join' function instead of several separate declarations of various discrete arities

2015-06-19 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-1733:
--
Labels: mesosphere  (was: )

 Change the stout path utility to declare a single, variadic 'join' function 
 instead of several separate declarations of various discrete arities
 

 Key: MESOS-1733
 URL: https://issues.apache.org/jira/browse/MESOS-1733
 Project: Mesos
  Issue Type: Improvement
  Components: build, stout
Reporter: Patrick Reilly
Assignee: Anand Mazumdar
  Labels: mesosphere





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2900) The capabilities field was added recently to FrameworkInfo. It should be displayed in the state.json endpoint

2015-06-19 Thread Aditi Dixit (JIRA)
Aditi Dixit created MESOS-2900:
--

 Summary: The capabilities field was added recently to 
FrameworkInfo. It should be displayed in the state.json endpoint
 Key: MESOS-2900
 URL: https://issues.apache.org/jira/browse/MESOS-2900
 Project: Mesos
  Issue Type: Bug
Reporter: Aditi Dixit
Assignee: Aditi Dixit
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1733) Change the stout path utility to declare a single, variadic 'join' function instead of several separate declarations of various discrete arities

2015-06-19 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-1733:
--
  Sprint: Mesosphere Sprint 12
Story Points: 5

 Change the stout path utility to declare a single, variadic 'join' function 
 instead of several separate declarations of various discrete arities
 

 Key: MESOS-1733
 URL: https://issues.apache.org/jira/browse/MESOS-1733
 Project: Mesos
  Issue Type: Improvement
  Components: build, stout
Reporter: Patrick Reilly
Assignee: Anand Mazumdar
  Labels: mesosphere





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2900) Display capabilities in state.json

2015-06-19 Thread Marco Massenzio (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593735#comment-14593735
 ] 

Marco Massenzio commented on MESOS-2900:


As noted in MESOS-2899, {{state.json}} is a heavy endpoint, so before adding 
more data it may be worth assessing whether this is really necessary.

 Display capabilities in state.json
 --

 Key: MESOS-2900
 URL: https://issues.apache.org/jira/browse/MESOS-2900
 Project: Mesos
  Issue Type: Bug
Reporter: Aditi Dixit
Assignee: Aditi Dixit
Priority: Minor

 The capabilities field was added to FrameworkInfo recently. It should be 
 displayed in the state.json HTTP endpoint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2394) Create styleguide for documentation

2015-06-19 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2394:
--
Shepherd: Bernd Mathiske  (was: Niklas Quarfot Nielsen)

 Create styleguide for documentation
 ---

 Key: MESOS-2394
 URL: https://issues.apache.org/jira/browse/MESOS-2394
 Project: Mesos
  Issue Type: Documentation
Reporter: Joerg Schad
Assignee: Joerg Schad
Priority: Minor

 As of right now, different pages in our documentation use quite different 
 styles. Consider, for example, the different emphasis for NOTE:
 * {noformat} NOTE: 
 http://mesos.apache.org/documentation/latest/slave-recovery/{noformat}
 *  {noformat}*NOTE*: http://mesos.apache.org/documentation/latest/upgrades/ 
 {noformat} 
 It would be great to establish a common style for the documentation!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2712) When trying to install mesos 0.22.0 version on Redhat Enterprise Linux 6.0 , i am getting error configure error cannot find libsvn_subr-1 headers . I tried with ./confi

2015-06-19 Thread Marco Massenzio (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Massenzio updated MESOS-2712:
---
Priority: Minor  (was: Blocker)

 When trying to install mesos 0.22.0 version on Redhat Enterprise Linux 6.0 , 
 i am getting error configure error cannot find libsvn_subr-1 headers . I 
 tried with ./configure --with-svn option also but still the same. 
 --

 Key: MESOS-2712
 URL: https://issues.apache.org/jira/browse/MESOS-2712
 Project: Mesos
  Issue Type: Bug
  Components: general
Affects Versions: 0.22.0
Reporter: Sujit
Priority: Minor

 When trying to install Mesos 0.22.0 on Red Hat Enterprise Linux 6.0, I get 
 the error configure: error: cannot find libsvn_subr-1 headers. I tried 
 the ./configure --with-svn option as well, but still see the same error. 
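
One possible remedy, sketched under the assumption of a RHEL/CentOS system with the Subversion development headers available via yum (package names can differ per repository):

{noformat}
$ sudo yum install subversion-devel apr-devel
$ ./configure --with-svn=/usr
{noformat}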



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2899) The capabilities field was added recently to FrameworkInfo. It should be displayed in the state.json endpoint

2015-06-19 Thread Marco Massenzio (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593731#comment-14593731
 ] 

Marco Massenzio commented on MESOS-2899:


While this may be useful, unfortunately the {{state.json}} endpoint is 
already very heavy and its response takes quite some time to be returned to 
the client.

I'm not sure how much more data (and time) this would add to it.

 The capabilities field was added recently to FrameworkInfo. It should be 
 displayed in the state.json endpoint
 -

 Key: MESOS-2899
 URL: https://issues.apache.org/jira/browse/MESOS-2899
 Project: Mesos
  Issue Type: Bug
Reporter: Aditi Dixit
Assignee: Aditi Dixit
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2891) Performance regression in hierarchical allocator.

2015-06-19 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593660#comment-14593660
 ] 

Jie Yu commented on MESOS-2891:
---

According to the flame graph [~bmahler] attached, it seems that the most 
expensive calculation is here:
https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L286

In a cluster with tens of thousands of slaves, summing the entire hashmap 
returned from 'allocation[name]' is definitely expensive.

We could also try to optimize Resources::operator+=. Currently, if the 
right-hand side of the operator is a single 'resource', we call 'validate' on 
the resource before performing the operation. That validation is expensive 
and usually unnecessary, since the 'resource' has typically already been 
validated.
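
A minimal sketch (not the Mesos sorter itself) of the kind of optimization implied above: maintain a running total alongside the per-slave hashmap so the total need not be recomputed by summing every entry. The names ({{Resources}}, {{Allocation}}) are simplified stand-ins:

{code}
#include <string>
#include <unordered_map>

// Simplified stand-in for mesos::Resources.
struct Resources
{
  Resources() : cpus(0) {}
  explicit Resources(double c) : cpus(c) {}

  Resources& operator+=(const Resources& that) { cpus += that.cpus; return *this; }
  Resources& operator-=(const Resources& that) { cpus -= that.cpus; return *this; }

  double cpus;
};

class Allocation
{
public:
  void add(const std::string& slaveId, const Resources& r)
  {
    allocations[slaveId] += r;
    total += r;  // O(1) update instead of re-summing the whole hashmap.
  }

  void subtract(const std::string& slaveId, const Resources& r)
  {
    allocations[slaveId] -= r;
    total -= r;
  }

  // Previously an O(#slaves) sum over the hashmap on every call.
  const Resources& sum() const { return total; }

private:
  std::unordered_map<std::string, Resources> allocations;
  Resources total;
};

int main()
{
  Allocation allocation;
  allocation.add("slave-1", Resources(4));
  allocation.add("slave-2", Resources(8));
  return allocation.sum().cpus == 12 ? 0 : 1;
}
{code}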

 Performance regression in hierarchical allocator.
 -

 Key: MESOS-2891
 URL: https://issues.apache.org/jira/browse/MESOS-2891
 Project: Mesos
  Issue Type: Bug
  Components: allocation, master
Reporter: Benjamin Mahler
Priority: Blocker
  Labels: twitter
 Attachments: Screen Shot 2015-06-18 at 5.02.26 PM.png, perf-kernel.svg


 For large clusters, the 0.23.0 allocator cannot keep up with the volume of 
 slaves. After the following slave was re-registered, it took the allocator a 
 long time to work through the backlog of slaves to add:
 {noformat:title=45 minute delay}
 I0618 18:55:40.738399 10172 master.cpp:3419] Re-registered slave 
 20150422-211121-2148346890-5050-3253-S4695
 I0618 19:40:14.960636 10164 hierarchical.hpp:496] Added slave 
 20150422-211121-2148346890-5050-3253-S4695
 {noformat}
 Empirically, 
 [addSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L462]
  and 
 [updateSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L533]
  have become expensive.
 Some timings from a production cluster reveal that the allocator spends in 
 the low tens of milliseconds per call to {{addSlave}} and {{updateSlave}}; 
 when there are tens of thousands of slaves, this amounts to the large delay 
 seen above.
 We also saw a slow steady increase in memory consumption, hinting further at 
 a queue backup in the allocator.
 A synthetic benchmark like we did for the registrar would be prudent here, 
 along with visibility into the allocator's queue size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2900) Display capabilities in state.json

2015-06-19 Thread Aditi Dixit (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditi Dixit updated MESOS-2900:
---
Summary: Display capabilities in state.json  (was: The capabilities field 
was added recently to FrameworkInfo. It should be displayed in the state.json 
endpoint)

 Display capabilities in state.json
 --

 Key: MESOS-2900
 URL: https://issues.apache.org/jira/browse/MESOS-2900
 Project: Mesos
  Issue Type: Bug
Reporter: Aditi Dixit
Assignee: Aditi Dixit
Priority: Minor

 The capabilities field was added to FrameworkInfo recently. It should be 
 displayed in the state.json HTTP endpoint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2768) SIGPIPE in process::run_in_event_loop()

2015-06-19 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593724#comment-14593724
 ] 

Vinod Kone commented on MESOS-2768:
---

[~jvanremoortere] [~benjaminhindman] Any update on this? We are still seeing 
this in production.

 SIGPIPE in process::run_in_event_loop()
 ---

 Key: MESOS-2768
 URL: https://issues.apache.org/jira/browse/MESOS-2768
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.23.0
Reporter: Yan Xu
Priority: Critical

 Observed in production.
 {noformat:title=slave log}
 I0526 12:17:48.027257 51633 slave.cpp:4077] Received a new estimation of the 
 oversubscribable resources 
 W0526 12:17:48.027257 51636 logging.cpp:91] RAW: Received signal SIGPIPE; 
 escalating to SIGABRT
 *** Aborted at 1432642668 (unix time) try "date -d @1432642668" if you are 
 using GNU date ***
 PC: @ 0x7fa58c23eb6d raise
 *** SIGABRT (@0xc9a5) received by PID 51621 (TID 0x7fa58224c940) from PID 
 51621; stack trace: ***
 @ 0x7fa58c23eca0 (unknown)
 @ 0x7fa58c23eb6d raise
 @ 0x7fa58cc19ba7 mesos::internal::logging::handler()
 @ 0x7fa58c23eca0 (unknown)
 @ 0x7fa58c23da2b __libc_write
 @ 0x7fa58cb57b6f evpipe_write.part.5
 @ 0x7fa58d245070 process::run_in_event_loop()
 @ 0x7fa58d2441ba process::EventLoop::delay()
 @ 0x7fa58d1c3c9c process::clock::scheduleTick()
 @ 0x7fa58d1c65b1 process::Clock::timer()
 @ 0x7fa58d23915a process::delay()
 @ 0x7fa58d23a740 process::ReaperProcess::wait()
 @ 0x7fa58d21261a process::ProcessManager::resume()
 @ 0x7fa58d2128dc process::schedule()
 @ 0x7fa58c23683d start_thread
 @ 0x7fa58ba28fcd clone
 {noformat}
 {noformat:title=gdb}
 (gdb) bt
 #0  0x7fa58c23eb6d in raise () from /lib64/libpthread.so.0
 #1  0x7fa58cc19ba7 in mesos::internal::logging::handler (signal=Unhandled 
 dwarf expression opcode 0xf3
 ) at logging/logging.cpp:92
 #2  signal handler called
 #3  0x7fa58c23da2b in write () from /lib64/libpthread.so.0
 #4  0x7fa58cb57b6f in evpipe_write (loop=0x7fa58e1e79c0, flag=Unhandled 
 dwarf expression opcode 0xfa
 ) at ev.c:2172
 #5  0x7fa58d245070 in process::run_in_event_loop<Nothing>(const 
 std::function<process::Future<Nothing>()>&) (f=Unhandled dwarf expression 
 opcode 0xf3
 ) at src/libev.hpp:80
 #6  0x7fa58d2441ba in process::EventLoop::delay(const Duration&, const 
 std::function<void()>&) (duration=Unhandled dwarf expression opcode 0xf3
 ) at src/libev.cpp:106
 #7  0x7fa58d1c3c9c in process::clock::scheduleTick (timers=Unhandled 
 dwarf expression opcode 0xf3
 ) at src/clock.cpp:119
 #8  0x7fa58d1c65b1 in process::Clock::timer(const Duration&, const 
 std::function<void()>&) (duration=Unhandled dwarf expression opcode 0xf3
 ) at src/clock.cpp:254
 #9  0x7fa58d23915a in process::delay<process::ReaperProcess> 
 (duration=..., pid=Unhandled dwarf expression opcode 0xf3
 ) at ./include/process/delay.hpp:25
 #10 0x7fa58d23a740 in process::ReaperProcess::wait (this=0x2056920) at 
 src/reap.cpp:93
 #11 0x7fa58d21261a in process::ProcessManager::resume (this=0x1db8d20, 
 process=0x2056958) at src/process.cpp:2172
 #12 0x7fa58d2128dc in process::schedule (arg=Unhandled dwarf expression 
 opcode 0xf3
 ) at src/process.cpp:602
 #13 0x7fa58c23683d in start_thread () from /lib64/libpthread.so.0
 #14 0x7fa58ba28fcd in clone () from /lib64/libc.so.6
 {noformat}
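
A common general mitigation for this class of crash, sketched here without claiming it is the fix adopted for this issue: ignore SIGPIPE process-wide so that a {{write()}} to a closed pipe or socket fails with {{EPIPE}} instead of delivering a fatal signal.

{code}
#include <csignal>
#include <cstdio>
#include <unistd.h>

int main()
{
  // Ignore SIGPIPE process-wide: write() to a closed pipe then returns -1
  // with errno == EPIPE instead of killing the process.
  ::signal(SIGPIPE, SIG_IGN);

  int fds[2];
  if (::pipe(fds) != 0) {
    return 1;
  }
  ::close(fds[0]);  // Close the read end so the write below hits EPIPE.

  ssize_t n = ::write(fds[1], "x", 1);
  std::printf("write returned %zd (EPIPE instead of SIGPIPE)\n", n);

  ::close(fds[1]);
  return 0;
}
{code}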



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2807) As a developer I need an easy way to convert MasterInfo protobuf to/from JSON

2015-06-19 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2807:
--
Assignee: Marco Massenzio

 As a developer I need an easy way to convert MasterInfo protobuf to/from JSON
 -

 Key: MESOS-2807
 URL: https://issues.apache.org/jira/browse/MESOS-2807
 Project: Mesos
  Issue Type: Task
  Components: leader election
Reporter: Marco Massenzio
Assignee: Marco Massenzio
 Fix For: 0.23.0


 As a preliminary to MESOS-2340, this requires the implementation of a simple 
 (de)serialization mechanism to JSON from/to {{MasterInfo}} protobuf.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2637) Consolidate 'foo', 'bar', ... string constants in test and example code

2015-06-19 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593749#comment-14593749
 ] 

Niklas Quarfot Nielsen commented on MESOS-2637:
---

@colin: Do you have cycles to work on this? If not, I will remove myself as 
shepherd for now.

 Consolidate 'foo', 'bar', ... string constants in test and example code
 ---

 Key: MESOS-2637
 URL: https://issues.apache.org/jira/browse/MESOS-2637
 Project: Mesos
  Issue Type: Bug
  Components: technical debt
Reporter: Niklas Quarfot Nielsen
Assignee: Colin Williams

 We are using 'foo', 'bar', ... string constants and pairs in 
 src/tests/master_tests.cpp, src/tests/slave_tests.cpp, 
 src/tests/hook_tests.cpp and src/examples/test_hook_module.cpp for label and 
 hooks tests. We should consolidate them so that a change is made in one 
 place, rather than relying on every call site being updated by hand.
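
A minimal sketch of the consolidation being proposed (file location and constant names hypothetical): define the shared test strings once and include them from every test that needs them.

{code}
// tests/constants.hpp (hypothetical): shared string constants for label
// and hook tests, so updating them happens in exactly one place.
#ifndef __TESTS_CONSTANTS_HPP__
#define __TESTS_CONSTANTS_HPP__

namespace mesos {
namespace internal {
namespace tests {

constexpr const char* DEFAULT_LABEL_KEY = "foo";
constexpr const char* DEFAULT_LABEL_VALUE = "bar";

} // namespace tests
} // namespace internal
} // namespace mesos

#endif // __TESTS_CONSTANTS_HPP__
{code}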



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2873) style hook prevents valid markdown files from getting committed

2015-06-19 Thread Marco Massenzio (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Massenzio updated MESOS-2873:
---
Sprint: Mesosphere Sprint 12

 style hook prevents valid markdown files from getting committed
 

 Key: MESOS-2873
 URL: https://issues.apache.org/jira/browse/MESOS-2873
 Project: Mesos
  Issue Type: Bug
Reporter: Alexander Rojas
Priority: Trivial

 According to the original [markdown 
 specification|http://daringfireball.net/projects/markdown/syntax#p] and to 
 the most [recent 
 standardization|http://spec.commonmark.org/0.20/#hard-line-breaks] effort, 
 two spaces at the end of a line create a hard line break (it breaks the line 
 without starting a new paragraph), similar to the HTML tag {{<br/>}}. 
 However, there's a hook in Mesos which prevents files with trailing 
 whitespace from being committed.
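
For illustration, the markdown behavior in question, with {{·}} standing in for the trailing spaces since they are otherwise invisible:

{noformat}
This line ends with two spaces··
so this text renders as a new line inside the same paragraph.
{noformat}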



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2419) Slave recovery not recovering tasks when using systemd

2015-06-19 Thread Chris Fortier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593844#comment-14593844
 ] 

Chris Fortier commented on MESOS-2419:
--

Brenden -

would you be able to share the systemd file you are using to start the mesos 
slave?

I'm having the same issue, but KillMode=process doesn't seem to help. We are 
also running Mesos as a container on CoreOS.
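
For context, a sketch of the general shape of unit file under discussion; paths and flags are illustrative, not the poster's actual configuration:

{noformat}
# /etc/systemd/system/mesos-slave.service (sketch)
[Unit]
Description=Mesos Slave
After=network.target

[Service]
ExecStart=/usr/sbin/mesos-slave --master=zk://127.0.0.1:2181/mesos
KillMode=process
Restart=always

[Install]
WantedBy=multi-user.target
{noformat}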


 Slave recovery not recovering tasks when using systemd
 --

 Key: MESOS-2419
 URL: https://issues.apache.org/jira/browse/MESOS-2419
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Brenden Matthews
Assignee: Joerg Schad
 Attachments: mesos-chronos.log.gz, mesos.log.gz


 {color:red}
 Note: the resolution to this issue is described in the following comment 
 below:
 https://issues.apache.org/jira/browse/MESOS-2419?focusedCommentId=14357028page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14357028
 {color}
 In a recent build from master (updated yesterday), slave recovery appears to 
 have broken.
 I'll attach the slave log (with GLOG_v=1) showing a task called 
 `long-running-job` which is a Chronos job that just does `sleep 1h`. After 
 restarting the slave, the task shows as `TASK_FAILED`.
 Here's another case, which is for a docker task:
 {noformat}
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.247159 10022 docker.cpp:421] Recovering Docker containers
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.247207 10022 docker.cpp:468] Recovering container 
 'f2001064-e076-4978-b764-ed12a5244e78' for executor 
 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 
 20150226-230228-2931198986-5050-717-
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.254791 10022 docker.cpp:1333] Executor for container 
 'f2001064-e076-4978-b764-ed12a5244e78' has exited
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.254812 10022 docker.cpp:1159] Destroying container 
 'f2001064-e076-4978-b764-ed12a5244e78'
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.254844 10022 docker.cpp:1248] Running docker stop on container 
 'f2001064-e076-4978-b764-ed12a5244e78'
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.262481 10027 containerizer.cpp:310] Recovering containerizer
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.262565 10027 containerizer.cpp:353] Recovering container 
 'f2001064-e076-4978-b764-ed12a5244e78' for executor 
 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 
 20150226-230228-2931198986-5050-717-
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.263675 10027 linux_launcher.cpp:162] Couldn't find freezer cgroup 
 for container f2001064-e076-4978-b764-ed12a5244e78, assuming already destroyed
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 
 00:09:49.265467 10020 cpushare.cpp:199] Couldn't find cgroup for container 
 f2001064-e076-4978-b764-ed12a5244e78
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.266448 10022 containerizer.cpp:1147] Executor for container 
 'f2001064-e076-4978-b764-ed12a5244e78' has exited
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.266466 10022 containerizer.cpp:938] Destroying container 
 'f2001064-e076-4978-b764-ed12a5244e78'
 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:50.593585 10021 slave.cpp:3735] Sending reconnect request to executor 
 chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
 20150226-230228-2931198986-5050-717- at executor(1)@10.81.189.232:43130
 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 
 00:09:50.597843 10024 slave.cpp:3175] Termination of executor 
 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 
 '20150226-230228-2931198986-5050-717-' failed: Container 
 'f2001064-e076-4978-b764-ed12a5244e78' not found
 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 
 00:09:50.597949 10025 slave.cpp:3429] Failed to unmonitor container for 
 executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
 20150226-230228-2931198986-5050-717-: Not monitored
 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:50.598785 10024 slave.cpp:2508] Handling status update TASK_FAILED 
 (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task 
 chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
 20150226-230228-2931198986-5050-717- from @0.0.0.0:0
 Feb 27 00:09:50 

[jira] [Commented] (MESOS-2419) Slave recovery not recovering tasks when using systemd

2015-06-19 Thread Brenden Matthews (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593849#comment-14593849
 ] 

Brenden Matthews commented on MESOS-2419:
-

I'm afraid I don't have it anymore. I'd be happy to review yours, however.

 Slave recovery not recovering tasks when using systemd
 --

 Key: MESOS-2419
 URL: https://issues.apache.org/jira/browse/MESOS-2419
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Brenden Matthews
Assignee: Joerg Schad
 Attachments: mesos-chronos.log.gz, mesos.log.gz


 {color:red}
 Note: the resolution to this issue is described in the following comment 
 below:
 https://issues.apache.org/jira/browse/MESOS-2419?focusedCommentId=14357028page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14357028
 {color}
 In a recent build from master (updated yesterday), slave recovery appears to 
 have broken.
 I'll attach the slave log (with GLOG_v=1) showing a task called 
 `long-running-job` which is a Chronos job that just does `sleep 1h`. After 
 restarting the slave, the task shows as `TASK_FAILED`.
 Here's another case, which is for a docker task:
 {noformat}
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.247159 10022 docker.cpp:421] Recovering Docker containers
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.247207 10022 docker.cpp:468] Recovering container 
 'f2001064-e076-4978-b764-ed12a5244e78' for executor 
 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 
 20150226-230228-2931198986-5050-717-
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.254791 10022 docker.cpp:1333] Executor for container 
 'f2001064-e076-4978-b764-ed12a5244e78' has exited
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.254812 10022 docker.cpp:1159] Destroying container 
 'f2001064-e076-4978-b764-ed12a5244e78'
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.254844 10022 docker.cpp:1248] Running docker stop on container 
 'f2001064-e076-4978-b764-ed12a5244e78'
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.262481 10027 containerizer.cpp:310] Recovering containerizer
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.262565 10027 containerizer.cpp:353] Recovering container 
 'f2001064-e076-4978-b764-ed12a5244e78' for executor 
 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 
 20150226-230228-2931198986-5050-717-
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.263675 10027 linux_launcher.cpp:162] Couldn't find freezer cgroup 
 for container f2001064-e076-4978-b764-ed12a5244e78, assuming already destroyed
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 
 00:09:49.265467 10020 cpushare.cpp:199] Couldn't find cgroup for container 
 f2001064-e076-4978-b764-ed12a5244e78
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.266448 10022 containerizer.cpp:1147] Executor for container 
 'f2001064-e076-4978-b764-ed12a5244e78' has exited
 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:49.266466 10022 containerizer.cpp:938] Destroying container 
 'f2001064-e076-4978-b764-ed12a5244e78'
 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:50.593585 10021 slave.cpp:3735] Sending reconnect request to executor 
 chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
 20150226-230228-2931198986-5050-717- at executor(1)@10.81.189.232:43130
 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 
 00:09:50.597843 10024 slave.cpp:3175] Termination of executor 
 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 
 '20150226-230228-2931198986-5050-717-' failed: Container 
 'f2001064-e076-4978-b764-ed12a5244e78' not found
 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 
 00:09:50.597949 10025 slave.cpp:3429] Failed to unmonitor container for 
 executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
 20150226-230228-2931198986-5050-717-: Not monitored
 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
 00:09:50.598785 10024 slave.cpp:2508] Handling status update TASK_FAILED 
 (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task 
 chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
 20150226-230228-2931198986-5050-717- from @0.0.0.0:0
 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 
 00:09:50.599093 10023 slave.cpp:2637] Failed to update resources for 
 container 

[jira] [Closed] (MESOS-2899) The capabilities field was added recently to FrameworkInfo. It should be displayed in the state.json endpoint

2015-06-19 Thread Marco Massenzio (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Massenzio closed MESOS-2899.
--
Resolution: Won't Fix

Following from [~aditidixit]'s email.

 The capabilities field was added recently to FrameworkInfo. It should be 
 displayed in the state.json endpoint
 -

 Key: MESOS-2899
 URL: https://issues.apache.org/jira/browse/MESOS-2899
 Project: Mesos
  Issue Type: Bug
Reporter: Aditi Dixit
Assignee: Aditi Dixit
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)