[jira] [Updated] (MESOS-2903) Network isolator should not fail when target state already exists
[ https://issues.apache.org/jira/browse/MESOS-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Yu updated MESOS-2903:
--------------------------
    Target Version/s: 0.23.0

Network isolator should not fail when target state already exists
-----------------------------------------------------------------

Key: MESOS-2903
URL: https://issues.apache.org/jira/browse/MESOS-2903
Project: Mesos
Issue Type: Bug
Components: isolation
Affects Versions: 0.23.0
Reporter: Paul Brett
Priority: Critical

Network isolator has multiple instances of the following pattern:

{noformat}
Try<bool> something = ::create();

if (something.isError()) {
  ++metrics.something_errors;
  return Failure("Failed to create something ...");
} else if (!icmpVethToEth0.get()) {
  ++metrics.adding_veth_icmp_filters_already_exist;
  return Failure("Something already exists");
}
{noformat}

These failures have occurred in operation due to the failure to recover or delete an orphan, causing the slave to remain online but unable to create new resources. We should convert the second failure in this pattern to an informational message, since the final state of the system is the state that we requested.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
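A minimal sketch of the proposed change, using hypothetical stand-in types rather than the real Mesos Try/Failure/metrics code: the "already exists" branch logs and succeeds instead of returning a Failure, because the target state has been reached.

```cpp
#include <cassert>
#include <iostream>
#include <string>

// Hypothetical stand-ins for Mesos' Try<bool> and Failure types,
// simplified for illustration only.
struct TryBool {
  bool error;  // true if the create operation itself failed
  bool value;  // false means the target object already existed
  bool isError() const { return error; }
  bool get() const { return value; }
};

struct Result {
  bool failed;
  std::string message;
};

// Proposed pattern: a hard error still fails, but "already exists" is
// logged and treated as success, since the final state of the system is
// exactly the state that was requested.
Result addFilter(const TryBool& icmpVethToEth0) {
  if (icmpVethToEth0.isError()) {
    return Result{true, "Failed to create filter"};
  } else if (!icmpVethToEth0.get()) {
    // Previously: return Failure("Something already exists");
    std::cout << "Filter already exists; continuing\n";
    return Result{false, ""};
  }
  return Result{false, ""};
}
```

Under this sketch, only a genuine creation error surfaces as a failure; an orphaned filter left over from an earlier run no longer wedges the slave.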
[jira] [Updated] (MESOS-2903) Network isolator should not fail when target state already exists
[ https://issues.apache.org/jira/browse/MESOS-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Yu updated MESOS-2903:
--------------------------
    Priority: Critical (was: Major)

Network isolator should not fail when target state already exists
-----------------------------------------------------------------

Key: MESOS-2903
URL: https://issues.apache.org/jira/browse/MESOS-2903
Project: Mesos
Issue Type: Bug
Components: isolation
Reporter: Paul Brett
Priority: Critical

Network isolator has multiple instances of the following pattern:

{noformat}
Try<bool> something = ::create();

if (something.isError()) {
  ++metrics.something_errors;
  return Failure("Failed to create something ...");
} else if (!icmpVethToEth0.get()) {
  ++metrics.adding_veth_icmp_filters_already_exist;
  return Failure("Something already exists");
}
{noformat}

These failures have occurred in operation due to the failure to recover or delete an orphan, causing the slave to remain online but unable to create new resources. We should convert the second failure in this pattern to an informational message, since the final state of the system is the state that we requested.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MESOS-2903) Network isolator should not fail when target state already exists
[ https://issues.apache.org/jira/browse/MESOS-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Yu updated MESOS-2903:
--------------------------
    Affects Version/s: 0.23.0

Network isolator should not fail when target state already exists
-----------------------------------------------------------------

Key: MESOS-2903
URL: https://issues.apache.org/jira/browse/MESOS-2903
Project: Mesos
Issue Type: Bug
Components: isolation
Affects Versions: 0.23.0
Reporter: Paul Brett
Priority: Critical

Network isolator has multiple instances of the following pattern:

{noformat}
Try<bool> something = ::create();

if (something.isError()) {
  ++metrics.something_errors;
  return Failure("Failed to create something ...");
} else if (!icmpVethToEth0.get()) {
  ++metrics.adding_veth_icmp_filters_already_exist;
  return Failure("Something already exists");
}
{noformat}

These failures have occurred in operation due to the failure to recover or delete an orphan, causing the slave to remain online but unable to create new resources. We should convert the second failure in this pattern to an informational message, since the final state of the system is the state that we requested.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-2419) Slave recovery not recovering tasks when using systemd
[ https://issues.apache.org/jira/browse/MESOS-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593976#comment-14593976 ]

Chris Fortier commented on MESOS-2419:
--------------------------------------

Brenden - Thank you so much for taking a look at this. I also just found out about https://issues.apache.org/jira/browse/MESOS-2115. Apparently Mesos in a container is something Mesosphere is working on, and it should be included in the next release. :)

Slave recovery not recovering tasks when using systemd
------------------------------------------------------

Key: MESOS-2419
URL: https://issues.apache.org/jira/browse/MESOS-2419
Project: Mesos
Issue Type: Bug
Components: slave
Reporter: Brenden Matthews
Assignee: Joerg Schad
Attachments: mesos-chronos.log.gz, mesos.log.gz

{color:red}
Note: the resolution to this issue is described in the following comment below:
https://issues.apache.org/jira/browse/MESOS-2419?focusedCommentId=14357028&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14357028
{color}

In a recent build from master (updated yesterday), slave recovery appears to have broken. I'll attach the slave log (with GLOG_v=1) showing a task called `long-running-job`, which is a Chronos job that just does `sleep 1h`. After restarting the slave, the task shows as `TASK_FAILED`.
Here's another case, which is for a docker task:

{noformat}
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.247159 10022 docker.cpp:421] Recovering Docker containers
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.247207 10022 docker.cpp:468] Recovering container 'f2001064-e076-4978-b764-ed12a5244e78' for executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 20150226-230228-2931198986-5050-717-
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254791 10022 docker.cpp:1333] Executor for container 'f2001064-e076-4978-b764-ed12a5244e78' has exited
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254812 10022 docker.cpp:1159] Destroying container 'f2001064-e076-4978-b764-ed12a5244e78'
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254844 10022 docker.cpp:1248] Running docker stop on container 'f2001064-e076-4978-b764-ed12a5244e78'
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.262481 10027 containerizer.cpp:310] Recovering containerizer
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.262565 10027 containerizer.cpp:353] Recovering container 'f2001064-e076-4978-b764-ed12a5244e78' for executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 20150226-230228-2931198986-5050-717-
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.263675 10027 linux_launcher.cpp:162] Couldn't find freezer cgroup for container f2001064-e076-4978-b764-ed12a5244e78, assuming already destroyed
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 00:09:49.265467 10020 cpushare.cpp:199] Couldn't find cgroup for container f2001064-e076-4978-b764-ed12a5244e78
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.266448 10022 containerizer.cpp:1147] Executor for container 'f2001064-e076-4978-b764-ed12a5244e78' has exited
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.266466 10022 containerizer.cpp:938] Destroying container 'f2001064-e076-4978-b764-ed12a5244e78'
Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.593585 10021 slave.cpp:3735] Sending reconnect request to executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717- at executor(1)@10.81.189.232:43130
Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.597843 10024 slave.cpp:3175] Termination of executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework '20150226-230228-2931198986-5050-717-' failed: Container 'f2001064-e076-4978-b764-ed12a5244e78' not found
Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.597949 10025 slave.cpp:3429] Failed to unmonitor container for executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717-: Not monitored
Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.598785 10024 slave.cpp:2508] Handling status update TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717- from @0.0.0.0:0
Feb
[jira] [Commented] (MESOS-2637) Consolidate 'foo', 'bar', ... string constants in test and example code
[ https://issues.apache.org/jira/browse/MESOS-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593988#comment-14593988 ]

Colin Williams commented on MESOS-2637:
---------------------------------------

Sorry about the delay. I'll have some cycles on Sunday; I'll try to wrap it up then.

On Jun 19, 2015 2:29 PM, Niklas Quarfot Nielsen (JIRA) j...@apache.org

Consolidate 'foo', 'bar', ... string constants in test and example code
-----------------------------------------------------------------------

Key: MESOS-2637
URL: https://issues.apache.org/jira/browse/MESOS-2637
Project: Mesos
Issue Type: Bug
Components: technical debt
Reporter: Niklas Quarfot Nielsen
Assignee: Colin Williams

We are using 'foo', 'bar', ... string constants and pairs in src/tests/master_tests.cpp, src/tests/slave_tests.cpp, src/tests/hook_tests.cpp and src/examples/test_hook_module.cpp for label and hook tests. We should consolidate them so that it is harder to forget to update all call sites.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-2903) Network isolator should not fail when target state already exists
[ https://issues.apache.org/jira/browse/MESOS-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594152#comment-14594152 ]

Jie Yu commented on MESOS-2903:
-------------------------------

I think this is related to the recent change to the slave recovery semantics (MESOS-2367). Previously, the slave wouldn't finish recovery if some orphan containers could not be destroyed, so the port mapping isolator could simply assume that it knew about all the filters on the host. That is no longer true now that MESOS-2367 has been committed, so the isolator code needs to adapt to the new semantics.

Network isolator should not fail when target state already exists
-----------------------------------------------------------------

Key: MESOS-2903
URL: https://issues.apache.org/jira/browse/MESOS-2903
Project: Mesos
Issue Type: Bug
Components: isolation
Affects Versions: 0.23.0
Reporter: Paul Brett
Priority: Critical

Network isolator has multiple instances of the following pattern:

{noformat}
Try<bool> something = ::create();

if (something.isError()) {
  ++metrics.something_errors;
  return Failure("Failed to create something ...");
} else if (!icmpVethToEth0.get()) {
  ++metrics.adding_veth_icmp_filters_already_exist;
  return Failure("Something already exists");
}
{noformat}

These failures have occurred in operation due to the failure to recover or delete an orphan, causing the slave to remain online but unable to create new resources. We should convert the second failure in this pattern to an informational message, since the final state of the system is the state that we requested.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-2419) Slave recovery not recovering tasks when using systemd
[ https://issues.apache.org/jira/browse/MESOS-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593946#comment-14593946 ]

Brenden Matthews commented on MESOS-2419:
-----------------------------------------

I would not suggest running the Mesos processes inside Docker containers. In fact, it's an anti-pattern, and it will indeed break recovery if you try to do that.

Slave recovery not recovering tasks when using systemd
------------------------------------------------------

Key: MESOS-2419
URL: https://issues.apache.org/jira/browse/MESOS-2419
Project: Mesos
Issue Type: Bug
Components: slave
Reporter: Brenden Matthews
Assignee: Joerg Schad
Attachments: mesos-chronos.log.gz, mesos.log.gz

{color:red}
Note: the resolution to this issue is described in the following comment below:
https://issues.apache.org/jira/browse/MESOS-2419?focusedCommentId=14357028&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14357028
{color}

In a recent build from master (updated yesterday), slave recovery appears to have broken. I'll attach the slave log (with GLOG_v=1) showing a task called `long-running-job`, which is a Chronos job that just does `sleep 1h`. After restarting the slave, the task shows as `TASK_FAILED`.
Here's another case, which is for a docker task:

{noformat}
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.247159 10022 docker.cpp:421] Recovering Docker containers
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.247207 10022 docker.cpp:468] Recovering container 'f2001064-e076-4978-b764-ed12a5244e78' for executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 20150226-230228-2931198986-5050-717-
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254791 10022 docker.cpp:1333] Executor for container 'f2001064-e076-4978-b764-ed12a5244e78' has exited
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254812 10022 docker.cpp:1159] Destroying container 'f2001064-e076-4978-b764-ed12a5244e78'
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254844 10022 docker.cpp:1248] Running docker stop on container 'f2001064-e076-4978-b764-ed12a5244e78'
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.262481 10027 containerizer.cpp:310] Recovering containerizer
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.262565 10027 containerizer.cpp:353] Recovering container 'f2001064-e076-4978-b764-ed12a5244e78' for executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 20150226-230228-2931198986-5050-717-
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.263675 10027 linux_launcher.cpp:162] Couldn't find freezer cgroup for container f2001064-e076-4978-b764-ed12a5244e78, assuming already destroyed
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 00:09:49.265467 10020 cpushare.cpp:199] Couldn't find cgroup for container f2001064-e076-4978-b764-ed12a5244e78
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.266448 10022 containerizer.cpp:1147] Executor for container 'f2001064-e076-4978-b764-ed12a5244e78' has exited
Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.266466 10022 containerizer.cpp:938] Destroying container 'f2001064-e076-4978-b764-ed12a5244e78'
Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.593585 10021 slave.cpp:3735] Sending reconnect request to executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717- at executor(1)@10.81.189.232:43130
Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.597843 10024 slave.cpp:3175] Termination of executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework '20150226-230228-2931198986-5050-717-' failed: Container 'f2001064-e076-4978-b764-ed12a5244e78' not found
Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.597949 10025 slave.cpp:3429] Failed to unmonitor container for executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717-: Not monitored
Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.598785 10024 slave.cpp:2508] Handling status update TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717- from @0.0.0.0:0
Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.599093
[jira] [Created] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME
Cody Maloney created MESOS-2902:
--------------------------------

Summary: Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME
Key: MESOS-2902
URL: https://issues.apache.org/jira/browse/MESOS-2902
Project: Mesos
Issue Type: Improvement
Components: master, modules, slave
Reporter: Cody Maloney
Priority: Minor

Currently Mesos tries to guess the IP and HOSTNAME by doing a reverse DNS lookup. This doesn't work on a lot of clouds: we want things like public IPs (which aren't in the default DNS), there are no FQDNs (Azure), or the correct way to figure it out is to call some cloud-specific endpoint. If Mesos / libprocess could load a Mesos module (or run a script) provided per cloud, we could determine the exact IP / hostname for the given environment. It also means we could ship one identical set of files to all hosts in a given provider that doesn't happen to have the DNS scheme and hostnames that libprocess/Mesos expect. Currently we have to generate host-specific config files which Mesos uses to guess, and those host-specific files break if machines change IP / hostname without being reinstalled.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
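A sketch of one possible mechanism for this proposal: shell out to an operator-supplied discovery command (for example, one that queries a cloud metadata endpoint) and use its trimmed stdout as the advertised IP or hostname. The function name and wiring here are assumptions for illustration, not an existing Mesos API.

```cpp
#include <cstdio>
#include <string>

// Hypothetical sketch: run an operator-supplied command and return its
// trimmed stdout, to be used as the advertised IP/hostname instead of
// guessing via a reverse DNS lookup. POSIX popen() is used for brevity.
std::string runDiscoveryCommand(const std::string& command) {
  std::string output;
  FILE* pipe = popen(command.c_str(), "r");
  if (pipe == nullptr) {
    return output;  // Empty on failure; real code would surface an error.
  }
  char buffer[256];
  while (fgets(buffer, sizeof(buffer), pipe) != nullptr) {
    output += buffer;
  }
  pclose(pipe);
  // Trim the trailing newline that most commands emit.
  while (!output.empty() && (output.back() == '\n' || output.back() == '\r')) {
    output.pop_back();
  }
  return output;
}
```

With something like this, the same binary and config could ship to every host; only the discovery command would differ per cloud provider.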
[jira] [Resolved] (MESOS-2891) Performance regression in hierarchical allocator.
[ https://issues.apache.org/jira/browse/MESOS-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Yu resolved MESOS-2891.
---------------------------
    Resolution: Fixed
    Fix Version/s: 0.23.0

Performance regression in hierarchical allocator.
-------------------------------------------------

Key: MESOS-2891
URL: https://issues.apache.org/jira/browse/MESOS-2891
Project: Mesos
Issue Type: Bug
Components: allocation, master
Reporter: Benjamin Mahler
Assignee: Jie Yu
Priority: Blocker
Labels: twitter
Fix For: 0.23.0
Attachments: Screen Shot 2015-06-18 at 5.02.26 PM.png, perf-kernel.svg

For large clusters, the 0.23.0 allocator cannot keep up with the volume of slaves. After the following slave was re-registered, it took the allocator a long time to work through the backlog of slaves to add:

{noformat:title=45 minute delay}
I0618 18:55:40.738399 10172 master.cpp:3419] Re-registered slave 20150422-211121-2148346890-5050-3253-S4695
I0618 19:40:14.960636 10164 hierarchical.hpp:496] Added slave 20150422-211121-2148346890-5050-3253-S4695
{noformat}

Empirically, [addSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L462] and [updateSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L533] have become expensive. Some timings from a production cluster reveal that the allocator is spending in the low tens of milliseconds on each call to {{addSlave}} and {{updateSlave}}; when there are tens of thousands of slaves, this amounts to the large delay seen above. We also saw a slow, steady increase in memory consumption, hinting further at a queue backup in the allocator. A synthetic benchmark like we did for the registrar would be prudent here, along with visibility into the allocator's queue size.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-2891) Performance regression in hierarchical allocator.
[ https://issues.apache.org/jira/browse/MESOS-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594082#comment-14594082 ]

Jie Yu commented on MESOS-2891:
-------------------------------

commit 68505cd0a478a96393ca988e74f99460333f5e45
Author: Jie Yu yujie@gmail.com
Date: Fri Jun 19 15:20:07 2015 -0700

    Fixed a bug in test filter that prevent some tests from being launched.

    Review: https://reviews.apache.org/r/35671

commit ce1c6e2aad748d9f999c09b9bb4897e19fc18175
Author: Jie Yu yujie@gmail.com
Date: Fri Jun 19 12:38:24 2015 -0700

    Improved the performance of DRF sorter by caching the scalars.

    Review: https://reviews.apache.org/r/35664

commit 114d2aa568284eba98dad60f8265c573112bad49
Author: Jie Yu yujie@gmail.com
Date: Fri Jun 19 12:37:27 2015 -0700

    Added a helper in Resources to get all scalar resources.

    Review: https://reviews.apache.org/r/35663

Performance regression in hierarchical allocator.
-------------------------------------------------

Key: MESOS-2891
URL: https://issues.apache.org/jira/browse/MESOS-2891
Project: Mesos
Issue Type: Bug
Components: allocation, master
Reporter: Benjamin Mahler
Assignee: Jie Yu
Priority: Blocker
Labels: twitter
Fix For: 0.23.0
Attachments: Screen Shot 2015-06-18 at 5.02.26 PM.png, perf-kernel.svg

For large clusters, the 0.23.0 allocator cannot keep up with the volume of slaves. After the following slave was re-registered, it took the allocator a long time to work through the backlog of slaves to add:

{noformat:title=45 minute delay}
I0618 18:55:40.738399 10172 master.cpp:3419] Re-registered slave 20150422-211121-2148346890-5050-3253-S4695
I0618 19:40:14.960636 10164 hierarchical.hpp:496] Added slave 20150422-211121-2148346890-5050-3253-S4695
{noformat}

Empirically, [addSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L462] and [updateSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L533] have become expensive. Some timings from a production cluster reveal that the allocator is spending in the low tens of milliseconds on each call to {{addSlave}} and {{updateSlave}}; when there are tens of thousands of slaves, this amounts to the large delay seen above. We also saw a slow, steady increase in memory consumption, hinting further at a queue backup in the allocator. A synthetic benchmark like we did for the registrar would be prudent here, along with visibility into the allocator's queue size.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
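One of the commits above caches scalar resources in the DRF sorter. A rough sketch of that caching idea follows; the class and method names here are hypothetical, not the actual Mesos sorter code.

```cpp
#include <cassert>
#include <map>
#include <string>

// Rough sketch of the optimization: instead of re-summing every scalar
// resource on each share calculation (a scan per call), maintain running
// totals that are updated incrementally as resources are added or removed.
class ScalarTotals {
public:
  void add(const std::string& name, double value) {
    totals_[name] += value;
  }

  void subtract(const std::string& name, double value) {
    totals_[name] -= value;
  }

  // Cheap map lookup instead of a scan over all resources.
  double get(const std::string& name) const {
    auto it = totals_.find(name);
    return it == totals_.end() ? 0.0 : it->second;
  }

private:
  std::map<std::string, double> totals_;  // resource name -> cached sum
};
```

The design trade-off is the usual one for caches: add/subtract must be invoked on every mutation so the totals never drift from the underlying resources, in exchange for constant-cost reads on the hot share-calculation path.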
[jira] [Commented] (MESOS-2664) Modernize the codebase to C++11
[ https://issues.apache.org/jira/browse/MESOS-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594083#comment-14594083 ]

Benjamin Mahler commented on MESOS-2664:
----------------------------------------

Not relying on default seems like a nice simplification we could make; happy to move the run-time error to a compile-time one :) Curious if it will bleed into our 3rd party dependencies?

Modernize the codebase to C++11
-------------------------------

Key: MESOS-2664
URL: https://issues.apache.org/jira/browse/MESOS-2664
Project: Mesos
Issue Type: Epic
Components: technical debt
Reporter: Michael Park
Assignee: Michael Park
Labels: mesosphere

Since [this commit|https://github.com/apache/mesos/commit/0f5c78fad3423181f7227027eb42d162811514e7], we officially require GCC-4.8+ and Clang-3.5+. This means that we now have full C++11 support and can therefore start to modernize our codebase to be more readable, safer, and more efficient!

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (MESOS-2903) Network isolator should not fail when target state already exists
Paul Brett created MESOS-2903:
------------------------------

Summary: Network isolator should not fail when target state already exists
Key: MESOS-2903
URL: https://issues.apache.org/jira/browse/MESOS-2903
Project: Mesos
Issue Type: Bug
Components: isolation
Reporter: Paul Brett

Network isolator has multiple instances of the following pattern:

{noformat}
Try<bool> something = ::create();

if (something.isError()) {
  ++metrics.something_errors;
  return Failure("Failed to create something ...");
} else if (!icmpVethToEth0.get()) {
  ++metrics.adding_veth_icmp_filters_already_exist;
  return Failure("Something already exists");
}
{noformat}

These failures have occurred in operation due to the failure to recover or delete an orphan, causing the slave to remain online but unable to create new resources. We should convert the second failure in this pattern to an informational message, since the final state of the system is the state that we requested.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MESOS-2891) Performance regression in hierarchical allocator.
[ https://issues.apache.org/jira/browse/MESOS-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-2891:
-----------------------------------
    Sprint: Twitter Mesos Q2 Sprint 5
    Shepherd: Benjamin Mahler
    Assignee: Jie Yu
    Story Points: 3

Performance regression in hierarchical allocator.
-------------------------------------------------

Key: MESOS-2891
URL: https://issues.apache.org/jira/browse/MESOS-2891
Project: Mesos
Issue Type: Bug
Components: allocation, master
Reporter: Benjamin Mahler
Assignee: Jie Yu
Priority: Blocker
Labels: twitter
Attachments: Screen Shot 2015-06-18 at 5.02.26 PM.png, perf-kernel.svg

For large clusters, the 0.23.0 allocator cannot keep up with the volume of slaves. After the following slave was re-registered, it took the allocator a long time to work through the backlog of slaves to add:

{noformat:title=45 minute delay}
I0618 18:55:40.738399 10172 master.cpp:3419] Re-registered slave 20150422-211121-2148346890-5050-3253-S4695
I0618 19:40:14.960636 10164 hierarchical.hpp:496] Added slave 20150422-211121-2148346890-5050-3253-S4695
{noformat}

Empirically, [addSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L462] and [updateSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L533] have become expensive. Some timings from a production cluster reveal that the allocator is spending in the low tens of milliseconds on each call to {{addSlave}} and {{updateSlave}}; when there are tens of thousands of slaves, this amounts to the large delay seen above. We also saw a slow, steady increase in memory consumption, hinting further at a queue backup in the allocator. A synthetic benchmark like we did for the registrar would be prudent here, along with visibility into the allocator's queue size.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-2903) Network isolator should not fail when target state already exists
[ https://issues.apache.org/jira/browse/MESOS-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594094#comment-14594094 ]

Jie Yu commented on MESOS-2903:
-------------------------------

Can you paste the relevant logging to show why this is necessary?

Network isolator should not fail when target state already exists
-----------------------------------------------------------------

Key: MESOS-2903
URL: https://issues.apache.org/jira/browse/MESOS-2903
Project: Mesos
Issue Type: Bug
Components: isolation
Reporter: Paul Brett

Network isolator has multiple instances of the following pattern:

{noformat}
Try<bool> something = ::create();

if (something.isError()) {
  ++metrics.something_errors;
  return Failure("Failed to create something ...");
} else if (!icmpVethToEth0.get()) {
  ++metrics.adding_veth_icmp_filters_already_exist;
  return Failure("Something already exists");
}
{noformat}

These failures have occurred in operation due to the failure to recover or delete an orphan, causing the slave to remain online but unable to create new resources. We should convert the second failure in this pattern to an informational message, since the final state of the system is the state that we requested.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2891) Performance regression in hierarchical allocator.
[ https://issues.apache.org/jira/browse/MESOS-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594146#comment-14594146 ]

Vinod Kone edited comment on MESOS-2891 at 6/20/15 12:12 AM:
-------------------------------------------------------------

Reopening because updateSlave (and likely updateAllocation) also need to be addressed. Some numbers from a benchmark test.

{code}
[ RUN ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.UpdateSlave/0
Added 1000 slaves in 766.99568ms
Updated 1000 slaves in 6.807111421secs
[ OK ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.UpdateSlave/0 (7751 ms)
[ RUN ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.UpdateSlave/1
Added 5000 slaves in 3.886493374secs
Updated 5000 slaves in 4.07753897601667mins
[ OK ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.UpdateSlave/1 (249472 ms)
[ RUN ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.UpdateSlave/2
Added 1 slaves in 7.720996758secs
Updated 1 slaves in 16.4897123807167mins
[ OK ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.UpdateSlave/2 (999001 ms)
{code}

was (Author: vinodkone):
Reopening because updateSlave (and likely updateAllocation) also need to be addressed. Some numbers from a benchmark test.

{code}
[ RUN ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.UpdateSlave/0
Added 1000 slaves in 766.99568ms
Updated 1000 slaves in 6.807111421secs
[ OK ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.UpdateSlave/0 (7751 ms)
[ RUN ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.UpdateSlave/1
Added 5000 slaves in 3.886493374secs
Updated 5000 slaves in 4.07753897601667mins
[ OK ]
{code}

Performance regression in hierarchical allocator.
-------------------------------------------------

Key: MESOS-2891
URL: https://issues.apache.org/jira/browse/MESOS-2891
Project: Mesos
Issue Type: Bug
Components: allocation, master
Reporter: Benjamin Mahler
Assignee: Jie Yu
Priority: Blocker
Labels: twitter
Fix For: 0.23.0
Attachments: Screen Shot 2015-06-18 at 5.02.26 PM.png, perf-kernel.svg

For large clusters, the 0.23.0 allocator cannot keep up with the volume of slaves. After the following slave was re-registered, it took the allocator a long time to work through the backlog of slaves to add:

{noformat:title=45 minute delay}
I0618 18:55:40.738399 10172 master.cpp:3419] Re-registered slave 20150422-211121-2148346890-5050-3253-S4695
I0618 19:40:14.960636 10164 hierarchical.hpp:496] Added slave 20150422-211121-2148346890-5050-3253-S4695
{noformat}

Empirically, [addSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L462] and [updateSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L533] have become expensive. Some timings from a production cluster reveal that the allocator is spending in the low tens of milliseconds on each call to {{addSlave}} and {{updateSlave}}; when there are tens of thousands of slaves, this amounts to the large delay seen above. We also saw a slow, steady increase in memory consumption, hinting further at a queue backup in the allocator. A synthetic benchmark like we did for the registrar would be prudent here, along with visibility into the allocator's queue size.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-2419) Slave recovery not recovering tasks when using systemd
[ https://issues.apache.org/jira/browse/MESOS-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593904#comment-14593904 ]

Chris Fortier commented on MESOS-2419:
--------------------------------------

that would be fantastic! Here's the systemd unit and the associated logs:

Systemd unit:
```
[Unit]
Description=MesosSlave
After=docker.service dockercfg.service
Requires=docker.service dockercfg.service

[Service]
Environment=MESOS_IMAGE=mesosphere/mesos-slave:0.22.1-1.0.ubuntu1404
Environment=ZOOKEEPER=internal-portfolio-Internal-1GZCG4XPMN2WS-969465822.us-west-2.elb.amazonaws.com:2181
User=core
KillMode=process
Restart=on-failure
RestartSec=20
TimeoutStartSec=0
ExecStartPre=-/usr/bin/docker kill mesos_slave
ExecStartPre=-/usr/bin/docker rm mesos_slave
ExecStartPre=/usr/bin/docker pull ${MESOS_IMAGE}
ExecStart=/usr/bin/sh -c sudo /usr/bin/docker run \
  --name=mesos_slave \
  --net=host \
  --privileged \
  -v /home/core/.dockercfg:/root/.dockercfg:ro \
  -v /sys:/sys \
  -v /usr/bin/docker:/usr/bin/docker:ro \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /lib64/libdevmapper.so.1.02:/lib/libdevmapper.so.1.02:ro \
  -v /var/lib/mesos/slave:/var/lib/mesos/slave \
  ${MESOS_IMAGE} \
  --ip=$(/usr/bin/ip -o -4 addr list eth0 | grep global | awk \'{print $4}\' | cut -d/ -f1) \
  --attributes=zone:$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)\;os:coreos \
  --containerizers=docker \
  --executor_registration_timeout=10mins \
  --hostname=`curl -s http://169.254.169.254/latest/meta-data/public-hostname` \
  --isolation=cgroups/cpu,cgroups/mem \
  --log_dir=/var/log/mesos \
  --master=zk://${ZOOKEEPER}/mesos \
  --work_dir=/var/lib/mesos/slave
ExecStop=/usr/bin/docker stop mesos_slave
ExecStartPost=/usr/bin/docker pull behance/utility:latest
ExecStartPost=/usr/bin/docker pull ubuntu:14.04
ExecStartPost=/usr/bin/docker pull debian:jessie

[Install]
WantedBy=multi-user.target

[X-Fleet]
Global=true
MachineMetadata=role=worker
```

Logs:
```
fortier@ip-10-43-3-126 ~ $
docker logs 7cd21326a98c I0619 17:57:22.075104 15406 logging.cpp:172] INFO level logging started! I0619 17:57:22.075305 15406 main.cpp:156] Build: 2015-05-05 06:15:50 by root I0619 17:57:22.075314 15406 main.cpp:158] Version: 0.22.1 I0619 17:57:22.075319 15406 main.cpp:161] Git tag: 0.22.1 I0619 17:57:22.075322 15406 main.cpp:165] Git SHA: d6309f92a7f9af3ab61a878403e3d9c284ea87e0 2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5 2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@716: Client environment:host.name=ip-10-43-3-126.us-west-2.compute.internal 2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@723: Client environment:os.name=Linux 2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@724: Client environment:os.arch=4.0.5 2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@725: Client environment:os.version=#2 SMP Thu Jun 18 08:53:45 UTC 2015 I0619 17:57:22.177387 15406 main.cpp:200] Starting Mesos slave 2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@733: Client environment:user.name=(null) I0619 17:57:22.178097 15406 slave.cpp:174] Slave started on 1)@10.43.3.126:5051 2015-06-19 17:57:22,178:15406(0x7f918ec5d700):ZOO_INFO@log_env@741: Client environment:user.home=/root 2015-06-19 17:57:22,178:15406(0x7f918ec5d700):ZOO_INFO@log_env@753: Client environment:user.dir=/ 2015-06-19 17:57:22,178:15406(0x7f918ec5d700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=internal-portfolio-Internal-1GZCG4XPMN2WS-969465822.us-west-2.elb.amazonaws.com:2181 sessionTimeout=1 watcher=0x7f91936bfa60 sessionId=0 sessionPasswd=null context=0x7f9178001010 flags=0 I0619 17:57:22.178235 15406 slave.cpp:322] Slave resources: cpus(*):8; mem(*):14019; disk(*):42121; ports(*):[31000-32000] I0619 17:57:22.178401 15406 slave.cpp:351] Slave hostname: ec2-52-24-66-221.us-west-2.compute.amazonaws.com I0619 17:57:22.178427 15406 
slave.cpp:352] Slave checkpoint: true I0619 17:57:22.179797 15415 state.cpp:35] Recovering state from '/var/lib/mesos/slave/meta' I0619 17:57:22.180737 15417 slave.cpp:3890] Recovering framework 20150612-153240-4144114442-5050-1- I0619 17:57:22.180768 15417 slave.cpp:4319] Recovering executor 'portfolio-reynard-behance--pro2-reynard---e93109da6e527fe95c203885298d73d40ed9a5aa.7dd074ca-16ab-11e5-9ca3-7a8174cf00fe' of framework 20150612-153240-4144114442-5050-1- I0619 17:57:22.181002 15412 status_update_manager.cpp:197] Recovering status update manager I0619 17:57:22.181032 15412 status_update_manager.cpp:205] Recovering executor 'portfolio-reynard-behance--pro2-reynard---e93109da6e527fe95c203885298d73d40ed9a5aa.7dd074ca-16ab-11e5-9ca3-7a8174cf00fe' of framework 20150612-153240-4144114442-5050-1- I0619 17:57:22.181354 15414
[jira] [Deleted] (MESOS-2901) test - please ignore
[ https://issues.apache.org/jira/browse/MESOS-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio deleted MESOS-2901: --- test - please ignore Key: MESOS-2901 URL: https://issues.apache.org/jira/browse/MESOS-2901 Project: Mesos Issue Type: Wish Reporter: Marco Massenzio Labels: label-foo, mesosphere, random -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2901) test - please ignore
Marco Massenzio created MESOS-2901: -- Summary: test - please ignore Key: MESOS-2901 URL: https://issues.apache.org/jira/browse/MESOS-2901 Project: Mesos Issue Type: Wish Reporter: Marco Massenzio -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME
[ https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594013#comment-14594013 ] Benjamin Mahler commented on MESOS-2902: Any reason that Mesos itself has to execute the script? It seems like the start script for mesos should run the arbitrary code / external programs necessary to compute flags. Taking this to its extreme, should we add scripts for all of the flags (e.g. {{--resources}}, {{--quiet}}, etc)? Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME Key: MESOS-2902 URL: https://issues.apache.org/jira/browse/MESOS-2902 Project: Mesos Issue Type: Improvement Components: master, modules, slave Reporter: Cody Maloney Priority: Minor Labels: mesosphere Currently Mesos tries to guess the IP, HOSTNAME by doing a reverse DNS lookup. This doesn't work on a lot of clouds as we want things like public IPs (which aren't the default DNS), there aren't FQDN names (Azure), or the correct way to figure it out is to call some cloud-specific endpoint. If Mesos / Libprocess could load a mesos-module (Or run a script) which is provided per-cloud, we can figure out perfectly the IP / Hostname for the given environment. It also means we can ship one identical set of files to all hosts in a given provider which doesn't happen to have the DNS scheme + hostnames that libprocess/Mesos expects. Currently we have to generate host-specific config files which Mesos uses to guess. The host-specific files break / fall apart if machines change IP / hostname without being reinstalled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2798) Export statistics on unevictable memory
[ https://issues.apache.org/jira/browse/MESOS-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chi Zhang updated MESOS-2798: - Sprint: Twitter Q2 Sprint 3, Twitter Mesos Q2 Sprint 5 (was: Twitter Q2 Sprint 3) Export statistics on unevictable memory - Key: MESOS-2798 URL: https://issues.apache.org/jira/browse/MESOS-2798 Project: Mesos Issue Type: Improvement Reporter: Chi Zhang Assignee: Chi Zhang Labels: twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2798) Export statistics on unevictable memory
[ https://issues.apache.org/jira/browse/MESOS-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593956#comment-14593956 ] Chi Zhang commented on MESOS-2798: -- https://reviews.apache.org/r/35668/ Export statistics on unevictable memory - Key: MESOS-2798 URL: https://issues.apache.org/jira/browse/MESOS-2798 Project: Mesos Issue Type: Improvement Reporter: Chi Zhang Assignee: Chi Zhang Labels: twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME
[ https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594068#comment-14594068 ] Cody Maloney edited comment on MESOS-2902 at 6/19/15 10:58 PM: --- I can't drop it in a systemd unit file which runs a command before mesos and pass the data without making a temp file which is an odd way to do the config generation. I could make a new mesos-init-fetch-ip script which I run instead of mesos, and that script then execs mesos. This confuses init system tracking of processes somewhat, and obfuscates what the underlying commands being run are. It also adds a lot of error scenarios. For example, the wrapper script is updated and the change contains a typo, so it sets LIBPROCES_IP instead of LIBPROCESS_IP), Libprocess silently ignores the wrong environment variable. The environment I'm in Libprocess' internal logic guesses an IP that works. It gets engrained slightly incorrect as it rolls out across the cluster. Currently one of the biggest pain points in initially setting up a Mesos cluster is getting the right IPs + Hostnames setup. If Mesos Master and Mesos Slave had a flag which was required, {{\-\-ip\-detection=reverse_dns}} or {{--ip-detection=/usr/bin/detect_mesos_ip}}. It would make it so that users see what mesos is doing and make an informed decision, rather than running Mesos, having things break with really bad error messages (Wrong hostname/IP on your Scheduler? No logging of things breaking happens...). As far as generalizing it further. Note I'm saying IP, HOSTNAME are host-specific, which is why this sort of capability makes sense. It is impossible for me to know when I'm installing static config files to a Host, VM, Docker what the IP and Hostname are going to be. That is not the case for {{\-\-resources}}, {{\-quiet}} and the like. They are able to be pre-determined for a host. 
IP and Hostname are Runtime parameters of a machine (When you attach your machine to a network, they are assigned dynamically). was (Author: cmaloney): I can't drop it in a systemd unit file which runs a command before mesos and pass the data without making a temp file which is an odd way to do the config generation. I could make a new mesos-init-fetch-ip script which I run instead of mesos, and that script then execs mesos. This confuses init system tracking of processes somewhat, and obfuscates what the underlying commands being run are. It also adds a lot of error scenarios. For example, the wrapper script is updated and the change contains a typo, so it sets LIBPROCES_IP instead of LIBPROCESS_IP), Libprocess silently ignores the wrong environment variable. The environment I'm in Libprocess' internal logic guesses an IP that works. It gets engrained slightly incorrect as it rolls out across the cluster. Currently one of the biggest pain points in initially setting up a Mesos cluster is getting the right IPs + Hostnames setup. If Mesos Master and Mesos Slave had a flag which was required, {{ \-\-ip\-detection=reverse_dns}} or {{--ip-detection=,/usr/bin/detect_mesos_ip} }}. It would make it so that users see what mesos is doing and make an informed decision, rather than running Mesos, having things break with really bad error messages (Wrong hostname/IP on your Scheduler? No logging of things breaking happens...). As far as generalizing it further. Note I'm saying IP, HOSTNAME are host-specific, which is why this sort of capability makes sense. It is impossible for me to know when I'm installing static config files to a Host, VM, Docker what the IP and Hostname are going to be. That is not the case for {{\-\-resources}}, {{\-quiet}} and the like. They are able to be pre-determined for a host. IP and Hostname are Runtime parameters of a machine (When you attach your machine to a network, they are assigned dynamically). 
Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME Key: MESOS-2902 URL: https://issues.apache.org/jira/browse/MESOS-2902 Project: Mesos Issue Type: Improvement Components: master, modules, slave Reporter: Cody Maloney Priority: Minor Labels: mesosphere Currently Mesos tries to guess the IP, HOSTNAME by doing a reverse DNS lookup. This doesn't work on a lot of clouds as we want things like public IPs (which aren't the default DNS), there aren't FQDN names (Azure), or the correct way to figure it out is to call some cloud-specific endpoint. If Mesos / Libprocess could load a mesos-module (Or run a script) which is provided per-cloud, we can figure out perfectly the IP / Hostname for the given environment. It also means we can ship one identical set of files to all hosts in a given provider which
[jira] [Created] (MESOS-2904) Add slave metric to count container launch failures
Paul Brett created MESOS-2904: - Summary: Add slave metric to count container launch failures Key: MESOS-2904 URL: https://issues.apache.org/jira/browse/MESOS-2904 Project: Mesos Issue Type: Bug Components: slave, statistics Reporter: Paul Brett Assignee: Paul Brett We have seen circumstances where a machine has been consistently unable to launch containers due to an inconsistent state (for example, unexpected network configuration). Adding a metric to track container launch failures will allow us to detect and alert on slaves in such a state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2902) Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME
[ https://issues.apache.org/jira/browse/MESOS-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594068#comment-14594068 ] Cody Maloney commented on MESOS-2902: - I can't drop it in a systemd unit file which runs a command before mesos and pass the data without making a temp file which is an odd way to do the config generation. I could make a new mesos-init-fetch-ip script which I run instead of mesos, and that script then execs mesos. This confuses init system tracking of processes somewhat, and obfuscates what the underlying commands being run are. It also adds a lot of error scenarios. For example, the wrapper script is updated and the change contains a typo, so it sets LIBPROCES_IP instead of LIBPROCESS_IP), Libprocess silently ignores the wrong environment variable. The environment I'm in Libprocess' internal logic guesses an IP that works. It gets engrained slightly incorrect as it rolls out across the cluster. Currently one of the biggest pain points in initially setting up a Mesos cluster is getting the right IPs + Hostnames setup. If Mesos Master and Mesos Slave had a flag which was required, {{ \-\-ip\-detection=reverse_dns}} or {{--ip-detection=,/usr/bin/detect_mesos_ip} }}. It would make it so that users see what mesos is doing and make an informed decision, rather than running Mesos, having things break with really bad error messages (Wrong hostname/IP on your Scheduler? No logging of things breaking happens...). As far as generalizing it further. Note I'm saying IP, HOSTNAME are host-specific, which is why this sort of capability makes sense. It is impossible for me to know when I'm installing static config files to a Host, VM, Docker what the IP and Hostname are going to be. That is not the case for {{\-\-resources}}, {{\-quiet}} and the like. They are able to be pre-determined for a host. 
IP and Hostname are Runtime parameters of a machine (When you attach your machine to a network, they are assigned dynamically). Enable Mesos to use arbitrary script / module to figure out IP, HOSTNAME Key: MESOS-2902 URL: https://issues.apache.org/jira/browse/MESOS-2902 Project: Mesos Issue Type: Improvement Components: master, modules, slave Reporter: Cody Maloney Priority: Minor Labels: mesosphere Currently Mesos tries to guess the IP, HOSTNAME by doing a reverse DNS lookup. This doesn't work on a lot of clouds as we want things like public IPs (which aren't the default DNS), there aren't FQDN names (Azure), or the correct way to figure it out is to call some cloud-specific endpoint. If Mesos / Libprocess could load a mesos-module (Or run a script) which is provided per-cloud, we can figure out perfectly the IP / Hostname for the given environment. It also means we can ship one identical set of files to all hosts in a given provider which doesn't happen to have the DNS scheme + hostnames that libprocess/Mesos expects. Currently we have to generate host-specific config files which Mesos uses to guess. The host-specific files break / fall apart if machines change IP / hostname without being reinstalled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
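One way the script hook proposed in this thread could behave is sketched below: run an operator-supplied executable and advertise whatever it prints to stdout. The `--ip-detection` flag and the `detectIP` helper are hypothetical (no such flag exists in Mesos at this point); this only illustrates the "trust the script's stdout" idea, under the assumption that a POSIX shell is available.

```cpp
#include <array>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical helper backing a flag like --ip-detection=/usr/bin/detect_mesos_ip:
// run the operator-supplied command and use its stdout as the advertised IP.
std::string detectIP(const std::string& command)
{
  std::array<char, 256> buffer{};
  std::string output;

  FILE* pipe = popen(command.c_str(), "r");  // POSIX: runs via /bin/sh
  if (pipe == nullptr) {
    throw std::runtime_error("Failed to run IP detection command");
  }

  while (fgets(buffer.data(), buffer.size(), pipe) != nullptr) {
    output += buffer.data();
  }
  pclose(pipe);

  // Trim the trailing newline that most shell tools print.
  while (!output.empty() && (output.back() == '\n' || output.back() == '\r')) {
    output.pop_back();
  }
  return output;
}
```

A cloud-specific detection script would then be a one-liner such as `curl -s http://169.254.169.254/latest/meta-data/public-ipv4`, matching the per-cloud deployment model described in the ticket.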
[jira] [Commented] (MESOS-2830) Add an endpoint to slaves to allow launching system administration tasks
[ https://issues.apache.org/jira/browse/MESOS-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594204#comment-14594204 ] Marco Massenzio commented on MESOS-2830: As I stated: {quote] I'll start looking into this and probably draft a design doc. {quote} design is actual work :) Add an endpoint to slaves to allow launching system administration tasks Key: MESOS-2830 URL: https://issues.apache.org/jira/browse/MESOS-2830 Project: Mesos Issue Type: Wish Components: slave Reporter: Cody Maloney Assignee: Marco Massenzio Priority: Minor Labels: mesosphere As a System Administrator often times I need to run a organization-mandated task on every machine in the cluster. Ideally I could do this within the framework of mesos resources if it is a cleanup or auditing task, but sometimes I just have to run something, and run it now, regardless if a machine has un-accounted resources (Ex: Adding/removing a user). Currently to do this I have to completely bypass Mesos and SSH to the box. Ideally I could tell a mesos slave (With proper authentication) to run a container with the limited special permissions needed to get the task done. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2891) Performance regression in hierarchical allocator.
[ https://issues.apache.org/jira/browse/MESOS-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594238#comment-14594238 ] Vinod Kone commented on MESOS-2891: --- Review: https://reviews.apache.org/r/35679 Review: https://reviews.apache.org/r/35680 Review: https://reviews.apache.org/r/35682 Performance regression in hierarchical allocator. - Key: MESOS-2891 URL: https://issues.apache.org/jira/browse/MESOS-2891 Project: Mesos Issue Type: Bug Components: allocation, master Reporter: Benjamin Mahler Assignee: Jie Yu Priority: Blocker Labels: twitter Fix For: 0.23.0 Attachments: Screen Shot 2015-06-18 at 5.02.26 PM.png, perf-kernel.svg For large clusters, the 0.23.0 allocator cannot keep up with the volume of slaves. After the following slave was re-registered, it took the allocator a long time to work through the backlog of slaves to add: {noformat:title=45 minute delay} I0618 18:55:40.738399 10172 master.cpp:3419] Re-registered slave 20150422-211121-2148346890-5050-3253-S4695 I0618 19:40:14.960636 10164 hierarchical.hpp:496] Added slave 20150422-211121-2148346890-5050-3253-S4695 {noformat} Empirically, [addSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L462] and [updateSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L533] have become expensive. Some timings from a production cluster reveal that the allocator is spending in the low tens of milliseconds for each call to {{addSlave}} and {{updateSlave}}; when there are tens of thousands of slaves, this amounts to the large delay seen above. We also saw a slow steady increase in memory consumption, hinting further at a queue backup in the allocator. A synthetic benchmark like we did for the registrar would be prudent here, along with visibility into the allocator's queue size. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
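The synthetic benchmark suggested above could start from a harness as simple as timing the call under test in a loop. This sketch is illustrative only (it is not the Mesos benchmark code, and `meanMicros` is a made-up name); a real benchmark would drive `addSlave`/`updateSlave` with tens of thousands of synthetic slaves.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>

// Time `op` over `iterations` calls and return the mean cost in microseconds.
// A per-call mean in the low tens of milliseconds, multiplied by tens of
// thousands of slaves, reproduces the multi-minute backlog described above.
double meanMicros(std::size_t iterations, const std::function<void()>& op)
{
  using clock = std::chrono::steady_clock;

  const auto start = clock::now();
  for (std::size_t i = 0; i < iterations; ++i) {
    op();
  }
  const auto elapsed = clock::now() - start;

  return std::chrono::duration_cast<
      std::chrono::duration<double, std::micro>>(elapsed).count() /
    static_cast<double>(iterations);
}
```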
[jira] [Commented] (MESOS-2903) Network isolator should not fail when target state already exists
[ https://issues.apache.org/jira/browse/MESOS-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594171#comment-14594171 ] Paul Brett commented on MESOS-2903: --- The new logic will be: {noformat} Try<bool> something = ::create(); if (something.isError()) { ++metrics.something_errors; return Failure("Failed to create something ..."); } else if (!icmpVethToEth0.get()) { // already exists Try<bool> something = ::update(); if (something.isError()) { ++metrics.something_errors; return Failure("Failed to update something ..."); } } {noformat} Network isolator should not fail when target state already exists - Key: MESOS-2903 URL: https://issues.apache.org/jira/browse/MESOS-2903 Project: Mesos Issue Type: Bug Components: isolation Affects Versions: 0.23.0 Reporter: Paul Brett Priority: Critical Network isolator has multiple instances of the following pattern: {noformat} Try<bool> something = ::create(); if (something.isError()) { ++metrics.something_errors; return Failure("Failed to create something ..."); } else if (!icmpVethToEth0.get()) { ++metrics.adding_veth_icmp_filters_already_exist; return Failure("Something already exists"); } {noformat} These failures have occurred in operation due to the failure to recover or delete an orphan, causing the slave to remain online but unable to create new resources. We should convert the second failure message in this pattern to an information message since the final state of the system is the state that we requested. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
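The create-then-update convergence described in this comment can be sketched as a small self-contained routine. The callables and error strings below are stand-ins for the stout `Try`/libprocess `Failure` machinery and the actual filter calls, so the names and signatures are illustrative only:

```cpp
#include <functional>
#include <string>

// Sketch of the proposed pattern: `create` returns true if it made the
// resource, false if one was already present; both callables report an error
// by filling `*err`. "Already exists" is treated as a request to converge via
// update, not as a failure -- the final state matches the requested state.
std::string ensure(
    const std::function<bool(std::string* err)>& create,
    const std::function<bool(std::string* err)>& update)
{
  std::string err;
  bool created = create(&err);
  if (!err.empty()) {
    return "Failed to create: " + err;  // a genuine error is still a failure
  }

  if (!created) {
    // Target state already exists: update it in place instead of failing.
    update(&err);
    if (!err.empty()) {
      return "Failed to update: " + err;
    }
  }
  return "";  // empty string == success
}
```

This keeps the metrics bump and `Failure` return for real errors while making the operation idempotent with respect to leftover orphan state.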
[jira] [Assigned] (MESOS-2903) Network isolator should not fail when target state already exists
[ https://issues.apache.org/jira/browse/MESOS-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Brett reassigned MESOS-2903: - Assignee: Paul Brett Network isolator should not fail when target state already exists - Key: MESOS-2903 URL: https://issues.apache.org/jira/browse/MESOS-2903 Project: Mesos Issue Type: Bug Components: isolation Affects Versions: 0.23.0 Reporter: Paul Brett Assignee: Paul Brett Priority: Critical Network isolator has multiple instances of the following pattern: {noformat} Try<bool> something = ::create(); if (something.isError()) { ++metrics.something_errors; return Failure("Failed to create something ..."); } else if (!icmpVethToEth0.get()) { ++metrics.adding_veth_icmp_filters_already_exist; return Failure("Something already exists"); } {noformat} These failures have occurred in operation due to the failure to recover or delete an orphan, causing the slave to remain online but unable to create new resources. We should convert the second failure message in this pattern to an information message since the final state of the system is the state that we requested. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2903) Network isolator should not fail when target state already exists
[ https://issues.apache.org/jira/browse/MESOS-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Brett updated MESOS-2903: -- Story Points: 2 Network isolator should not fail when target state already exists - Key: MESOS-2903 URL: https://issues.apache.org/jira/browse/MESOS-2903 Project: Mesos Issue Type: Bug Components: isolation Affects Versions: 0.23.0 Reporter: Paul Brett Assignee: Paul Brett Priority: Critical Network isolator has multiple instances of the following pattern: {noformat} Try<bool> something = ::create(); if (something.isError()) { ++metrics.something_errors; return Failure("Failed to create something ..."); } else if (!icmpVethToEth0.get()) { ++metrics.adding_veth_icmp_filters_already_exist; return Failure("Something already exists"); } {noformat} These failures have occurred in operation due to the failure to recover or delete an orphan, causing the slave to remain online but unable to create new resources. We should convert the second failure message in this pattern to an information message since the final state of the system is the state that we requested. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2830) Add an endpoint to slaves to allow launching system administration tasks
[ https://issues.apache.org/jira/browse/MESOS-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594204#comment-14594204 ] Marco Massenzio edited comment on MESOS-2830 at 6/20/15 12:41 AM: -- As I stated: {quote} I'll start looking into this and probably draft a design doc. {quote} design is actual work :) was (Author: marco-mesos): As I stated: {quote] I'll start looking into this and probably draft a design doc. {quote} design is actual work :) Add an endpoint to slaves to allow launching system administration tasks Key: MESOS-2830 URL: https://issues.apache.org/jira/browse/MESOS-2830 Project: Mesos Issue Type: Wish Components: slave Reporter: Cody Maloney Assignee: Marco Massenzio Priority: Minor Labels: mesosphere As a System Administrator often times I need to run a organization-mandated task on every machine in the cluster. Ideally I could do this within the framework of mesos resources if it is a cleanup or auditing task, but sometimes I just have to run something, and run it now, regardless if a machine has un-accounted resources (Ex: Adding/removing a user). Currently to do this I have to completely bypass Mesos and SSH to the box. Ideally I could tell a mesos slave (With proper authentication) to run a container with the limited special permissions needed to get the task done. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2830) Add an endpoint to slaves to allow launching system administration tasks
[ https://issues.apache.org/jira/browse/MESOS-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2830: --- Sprint: Mesosphere Sprint 13 Target Version/s: 0.24.0 Shepherd: Benjamin Hindman Story Points: 8 Add an endpoint to slaves to allow launching system administration tasks Key: MESOS-2830 URL: https://issues.apache.org/jira/browse/MESOS-2830 Project: Mesos Issue Type: Wish Components: slave Reporter: Cody Maloney Assignee: Marco Massenzio Priority: Minor Labels: mesosphere As a System Administrator often times I need to run a organization-mandated task on every machine in the cluster. Ideally I could do this within the framework of mesos resources if it is a cleanup or auditing task, but sometimes I just have to run something, and run it now, regardless if a machine has un-accounted resources (Ex: Adding/removing a user). Currently to do this I have to completely bypass Mesos and SSH to the box. Ideally I could tell a mesos slave (With proper authentication) to run a container with the limited special permissions needed to get the task done. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2853) Report per-container metrics from host egress filter
[ https://issues.apache.org/jira/browse/MESOS-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594175#comment-14594175 ] Paul Brett commented on MESOS-2853: --- Container metrics are not tracked by fq_codel on a per-filter basis, hence this information is not available. Will wait to see the interaction of fq_codel on host eth0 with real workloads before deciding if further work is required. Report per-container metrics from host egress filter Key: MESOS-2853 URL: https://issues.apache.org/jira/browse/MESOS-2853 Project: Mesos Issue Type: Improvement Components: isolation Reporter: Paul Brett Assignee: Paul Brett Labels: twitter Export in statistics.json the fq_codel flow statistics for each container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2664) Modernize the codebase to C++11
[ https://issues.apache.org/jira/browse/MESOS-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594181#comment-14594181 ] Michael Park commented on MESOS-2664: - The only 3rdparty that it affected was {{stout/protobuf.hpp}}, which is also where spelling out the default would be the most prevalent due to the number of members in the {{FieldDescriptor}} enum. With approach (1) we have: {code} Try<Nothing> operator () (const JSON::Object& object) const { switch (field->type()) { case google::protobuf::FieldDescriptor::TYPE_MESSAGE: /* ... */ break; case google::protobuf::FieldDescriptor::TYPE_DOUBLE: case google::protobuf::FieldDescriptor::TYPE_FLOAT: case google::protobuf::FieldDescriptor::TYPE_INT32: case google::protobuf::FieldDescriptor::TYPE_INT64: case google::protobuf::FieldDescriptor::TYPE_UINT32: case google::protobuf::FieldDescriptor::TYPE_UINT64: case google::protobuf::FieldDescriptor::TYPE_FIXED32: case google::protobuf::FieldDescriptor::TYPE_FIXED64: case google::protobuf::FieldDescriptor::TYPE_BOOL: case google::protobuf::FieldDescriptor::TYPE_STRING: case google::protobuf::FieldDescriptor::TYPE_GROUP: case google::protobuf::FieldDescriptor::TYPE_BYTES: case google::protobuf::FieldDescriptor::TYPE_ENUM: case google::protobuf::FieldDescriptor::TYPE_SFIXED32: case google::protobuf::FieldDescriptor::TYPE_SFIXED64: case google::protobuf::FieldDescriptor::TYPE_SINT32: case google::protobuf::FieldDescriptor::TYPE_SINT64: return Error("Not expecting a JSON object for field '" + field->name() + "'"); } return Nothing(); } {code} With (2) we have: {code} Try<Nothing> operator () (const JSON::Object& object) const { #pragma GCC diagnostic push #pragma GCC diagnostic ignored "-Wswitch-enum" switch (field->type()) { case google::protobuf::FieldDescriptor::TYPE_MESSAGE: /* ... 
*/ break; default: return Error("Not expecting a JSON object for field '" + field->name() + "'"); } #pragma GCC diagnostic pop return Nothing(); } {code} (2) is nice in the sense that it's clear {{TYPE_MESSAGE}} is really the only type we'll ever handle. However, considering this is the worst case scenario (having to state 17 cases in place for a {{default}}) that doesn't occur often, I'm in favor of adopting (1). Modernize the codebase to C++11 --- Key: MESOS-2664 URL: https://issues.apache.org/jira/browse/MESOS-2664 Project: Mesos Issue Type: Epic Components: technical debt Reporter: Michael Park Assignee: Michael Park Labels: mesosphere Since [this commit|https://github.com/apache/mesos/commit/0f5c78fad3423181f7227027eb42d162811514e7], we officially require GCC-4.8+ and Clang-3.5+. This means that we now have full C++11 support and therefore can start to modernize our codebase to be more readable, safer and efficient! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
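Approach (2) can be shown in isolation with a compilable miniature; the toy `FieldType` enum below stands in for protobuf's `FieldDescriptor::Type` so only the pragma technique itself is demonstrated:

```cpp
#include <string>

// Toy stand-in for google::protobuf::FieldDescriptor::Type.
enum class FieldType { MESSAGE, DOUBLE, INT32, STRING };

std::string describe(FieldType type)
{
// Temporarily silence -Wswitch-enum around a switch that deliberately
// handles one enumerator and funnels the other 17 through `default`.
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wswitch-enum"
  switch (type) {
    case FieldType::MESSAGE:
      return "message";
    default:
      return "not expecting a JSON object for this field";
  }
#pragma GCC diagnostic pop
}
```

The push/pop pair keeps the suppression scoped to this one switch, so `-Wswitch-enum` still fires elsewhere in the translation unit (this works on both GCC and Clang, which honor `#pragma GCC diagnostic`).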
[jira] [Comment Edited] (MESOS-2664) Modernize the codebase to C++11
[ https://issues.apache.org/jira/browse/MESOS-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594181#comment-14594181 ] Michael Park edited comment on MESOS-2664 at 6/20/15 12:24 AM: --- The only 3rdparty that it affected was {{stout/protobuf.hpp}}, which is also where spelling out the default would be the most prevalent due to the number of members in the {{FieldDescriptor}} enum. With approach (1) we have: {code} Try<Nothing> operator () (const JSON::Object& object) const { switch (field->type()) { case google::protobuf::FieldDescriptor::TYPE_MESSAGE: /* ... */ break; case google::protobuf::FieldDescriptor::TYPE_DOUBLE: case google::protobuf::FieldDescriptor::TYPE_FLOAT: case google::protobuf::FieldDescriptor::TYPE_INT32: case google::protobuf::FieldDescriptor::TYPE_INT64: case google::protobuf::FieldDescriptor::TYPE_UINT32: case google::protobuf::FieldDescriptor::TYPE_UINT64: case google::protobuf::FieldDescriptor::TYPE_FIXED32: case google::protobuf::FieldDescriptor::TYPE_FIXED64: case google::protobuf::FieldDescriptor::TYPE_BOOL: case google::protobuf::FieldDescriptor::TYPE_STRING: case google::protobuf::FieldDescriptor::TYPE_GROUP: case google::protobuf::FieldDescriptor::TYPE_BYTES: case google::protobuf::FieldDescriptor::TYPE_ENUM: case google::protobuf::FieldDescriptor::TYPE_SFIXED32: case google::protobuf::FieldDescriptor::TYPE_SFIXED64: case google::protobuf::FieldDescriptor::TYPE_SINT32: case google::protobuf::FieldDescriptor::TYPE_SINT64: return Error("Not expecting a JSON object for field '" + field->name() + "'"); } return Nothing(); } {code} With (2) we have: {code} Try<Nothing> operator () (const JSON::Object& object) const { #pragma GCC diagnostic push #pragma GCC diagnostic ignored "-Wswitch-enum" switch (field->type()) { case google::protobuf::FieldDescriptor::TYPE_MESSAGE: /* ... 
*/ break; default: return Error("Not expecting a JSON object for field '" + field->name() + "'"); } #pragma GCC diagnostic pop return Nothing(); } {code} (2) is nice in the sense that it's clear {{TYPE_MESSAGE}} is really the only type we'll ever handle. However, considering this is the worst case scenario (having to state 17 cases in place for a {{default}}) that doesn't occur often, I'm also in favor of adopting (1). was (Author: mcypark): The only 3rdparty that it affected was {{stout/protobuf.hpp}}, which is also where spelling out the default would be the most prevalent due to the number of members in the {{FieldDescriptor}} enum. With approach (1) we have: {code} Try<Nothing> operator () (const JSON::Object& object) const { switch (field->type()) { case google::protobuf::FieldDescriptor::TYPE_MESSAGE: /* ... */ break; case google::protobuf::FieldDescriptor::TYPE_DOUBLE: case google::protobuf::FieldDescriptor::TYPE_FLOAT: case google::protobuf::FieldDescriptor::TYPE_INT32: case google::protobuf::FieldDescriptor::TYPE_INT64: case google::protobuf::FieldDescriptor::TYPE_UINT32: case google::protobuf::FieldDescriptor::TYPE_UINT64: case google::protobuf::FieldDescriptor::TYPE_FIXED32: case google::protobuf::FieldDescriptor::TYPE_FIXED64: case google::protobuf::FieldDescriptor::TYPE_BOOL: case google::protobuf::FieldDescriptor::TYPE_STRING: case google::protobuf::FieldDescriptor::TYPE_GROUP: case google::protobuf::FieldDescriptor::TYPE_BYTES: case google::protobuf::FieldDescriptor::TYPE_ENUM: case google::protobuf::FieldDescriptor::TYPE_SFIXED32: case google::protobuf::FieldDescriptor::TYPE_SFIXED64: case google::protobuf::FieldDescriptor::TYPE_SINT32: case google::protobuf::FieldDescriptor::TYPE_SINT64: return Error("Not expecting a JSON object for field '" + field->name() + "'"); } return Nothing(); } {code} With (2) we have: {code} Try<Nothing> operator () (const JSON::Object& object) const { #pragma GCC diagnostic push #pragma GCC diagnostic ignored "-Wswitch-enum" switch 
(field->type()) { case google::protobuf::FieldDescriptor::TYPE_MESSAGE: /* ... */ break; default: return Error("Not expecting a JSON object for field '" + field->name() + "'"); } #pragma GCC diagnostic pop return Nothing(); } {code} (2) is nice in the sense that it's clear {{TYPE_MESSAGE}} is really the only type we'll ever handle. However, considering this is the worst case scenario (having to state 17 cases in place for a {{default}}) that doesn't occur often, I'm in favor of
[jira] [Resolved] (MESOS-2843) Update the design doc to include updating capabilities field
[ https://issues.apache.org/jira/browse/MESOS-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aditi Dixit resolved MESOS-2843. Resolution: Fixed Update the design doc to include updating capabilities field Key: MESOS-2843 URL: https://issues.apache.org/jira/browse/MESOS-2843 Project: Mesos Issue Type: Bug Reporter: Vinod Kone Assignee: Aditi Dixit We recently added a new field capabilities to FrameworkInfo. The design doc needs to be updated to reflect its update semantics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2898) Write tests for new JSON (ZooKeeper) functionality
Marco Massenzio created MESOS-2898: -- Summary: Write tests for new JSON (ZooKeeper) functionality Key: MESOS-2898 URL: https://issues.apache.org/jira/browse/MESOS-2898 Project: Mesos Issue Type: Task Reporter: Marco Massenzio Assignee: Marco Massenzio Follow up from MESOS-2340, need to ensure this does not break the ZooKeeper discovery functionality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2900) The capabilities field was added recently to FrameworkInfo. It should be displayed in the state.json endpoint
[ https://issues.apache.org/jira/browse/MESOS-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aditi Dixit updated MESOS-2900: --- Description: The capabilities field was added to FrameworkInfo recently. It should be displayed in the state.json HTTP endpoint. The capabilities field was added recently to FrameworkInfo. It should be displayed in the state.json endpoint - Key: MESOS-2900 URL: https://issues.apache.org/jira/browse/MESOS-2900 Project: Mesos Issue Type: Bug Reporter: Aditi Dixit Assignee: Aditi Dixit Priority: Minor The capabilities field was added to FrameworkInfo recently. It should be displayed in the state.json HTTP endpoint. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2899) The capabilities field was added recently to FrameworkInfo. It should be displayed in the state.json endpoint
Aditi Dixit created MESOS-2899: -- Summary: The capabilities field was added recently to FrameworkInfo. It should be displayed in the state.json endpoint Key: MESOS-2899 URL: https://issues.apache.org/jira/browse/MESOS-2899 Project: Mesos Issue Type: Bug Reporter: Aditi Dixit Assignee: Aditi Dixit Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1733) Change the stout path utility to declare a single, variadic 'join' function instead of several separate declarations of various discrete arities
[ https://issues.apache.org/jira/browse/MESOS-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-1733: -- Labels: mesosphere (was: ) Change the stout path utility to declare a single, variadic 'join' function instead of several separate declarations of various discrete arities Key: MESOS-1733 URL: https://issues.apache.org/jira/browse/MESOS-1733 Project: Mesos Issue Type: Improvement Components: build, stout Reporter: Patrick Reilly Assignee: Anand Mazumdar Labels: mesosphere -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2900) The capabilities field was added recently to FrameworkInfo. It should be displayed in the state.json endpoint
Aditi Dixit created MESOS-2900: -- Summary: The capabilities field was added recently to FrameworkInfo. It should be displayed in the state.json endpoint Key: MESOS-2900 URL: https://issues.apache.org/jira/browse/MESOS-2900 Project: Mesos Issue Type: Bug Reporter: Aditi Dixit Assignee: Aditi Dixit Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1733) Change the stout path utility to declare a single, variadic 'join' function instead of several separate declarations of various discrete arities
[ https://issues.apache.org/jira/browse/MESOS-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-1733: -- Sprint: Mesosphere Sprint 12 Story Points: 5 Change the stout path utility to declare a single, variadic 'join' function instead of several separate declarations of various discrete arities Key: MESOS-1733 URL: https://issues.apache.org/jira/browse/MESOS-1733 Project: Mesos Issue Type: Improvement Components: build, stout Reporter: Patrick Reilly Assignee: Anand Mazumdar Labels: mesosphere -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2900) Display capabilities in state.json
[ https://issues.apache.org/jira/browse/MESOS-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593735#comment-14593735 ] Marco Massenzio commented on MESOS-2900: As noted in MESOS-2899, {{state.json}} is a heavy endpoint, so before adding more stuff, it may be worth assessing whether it's really necessary. Display capabilities in state.json -- Key: MESOS-2900 URL: https://issues.apache.org/jira/browse/MESOS-2900 Project: Mesos Issue Type: Bug Reporter: Aditi Dixit Assignee: Aditi Dixit Priority: Minor The capabilities field was added to FrameworkInfo recently. It should be displayed in the state.json HTTP endpoint. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2394) Create styleguide for documentation
[ https://issues.apache.org/jira/browse/MESOS-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2394: -- Shepherd: Bernd Mathiske (was: Niklas Quarfot Nielsen) Create styleguide for documentation --- Key: MESOS-2394 URL: https://issues.apache.org/jira/browse/MESOS-2394 Project: Mesos Issue Type: Documentation Reporter: Joerg Schad Assignee: Joerg Schad Priority: Minor As of right now different pages in our documentation use quite different styles. Consider for example the different emphasis for NOTE: * {noformat} NOTE: http://mesos.apache.org/documentation/latest/slave-recovery/{noformat} * {noformat}*NOTE*: http://mesos.apache.org/documentation/latest/upgrades/ {noformat} Would be great to establish a common style for the documentation! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2712) When trying to install mesos 0.22.0 version on Redhat Enterprise Linux 6.0 , i am getting error configure error cannot find libsvn_subr-1 headers . I tried with ./confi
[ https://issues.apache.org/jira/browse/MESOS-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2712: --- Priority: Minor (was: Blocker) When trying to install mesos 0.22.0 version on Redhat Enterprise Linux 6.0 , i am getting error configure error cannot find libsvn_subr-1 headers . I tried with ./configure --with-svn option also but still the same. -- Key: MESOS-2712 URL: https://issues.apache.org/jira/browse/MESOS-2712 Project: Mesos Issue Type: Bug Components: general Affects Versions: 0.22.0 Reporter: Sujit Priority: Minor When trying to install mesos 0.22.0 version on Redhat Enterprise Linux 6.0 , i am getting error configure error cannot find libsvn_subr-1 headers . I tried with ./configure --with-svn option also but still the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2899) The capabilities field was added recently to FrameworkInfo. It should be displayed in the state.json endpoint
[ https://issues.apache.org/jira/browse/MESOS-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593731#comment-14593731 ] Marco Massenzio commented on MESOS-2899: While this may be useful, unfortunately, the {{state.json}} endpoint is already very heavy and the response takes quite some time to be returned to the client. I'm not sure how much more data (and time) this would add to it. The capabilities field was added recently to FrameworkInfo. It should be displayed in the state.json endpoint - Key: MESOS-2899 URL: https://issues.apache.org/jira/browse/MESOS-2899 Project: Mesos Issue Type: Bug Reporter: Aditi Dixit Assignee: Aditi Dixit Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2891) Performance regression in hierarchical allocator.
[ https://issues.apache.org/jira/browse/MESOS-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593660#comment-14593660 ] Jie Yu commented on MESOS-2891: --- According to the flame graph [~bmahler] attached, it seems that the most expensive calculation is here: https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L286 In a cluster with tens of thousands of slaves, summing the entire hashmap returned from 'allocation[name]' is definitely expensive. We could also try to optimize Resources::operator +=. Currently, if the right-hand side of the operator is a single 'resource', we'll call 'validate' on the resource before performing the operation. The validation is expensive, and most of the time it is unnecessary since the 'resource' has typically already been verified. Performance regression in hierarchical allocator. - Key: MESOS-2891 URL: https://issues.apache.org/jira/browse/MESOS-2891 Project: Mesos Issue Type: Bug Components: allocation, master Reporter: Benjamin Mahler Priority: Blocker Labels: twitter Attachments: Screen Shot 2015-06-18 at 5.02.26 PM.png, perf-kernel.svg For large clusters, the 0.23.0 allocator cannot keep up with the volume of slaves. After the following slave was re-registered, it took the allocator a long time to work through the backlog of slaves to add: {noformat:title=45 minute delay} I0618 18:55:40.738399 10172 master.cpp:3419] Re-registered slave 20150422-211121-2148346890-5050-3253-S4695 I0618 19:40:14.960636 10164 hierarchical.hpp:496] Added slave 20150422-211121-2148346890-5050-3253-S4695 {noformat} Empirically, [addSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L462] and [updateSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L533] have become expensive. 
Some timings from a production cluster reveal that the allocator is spending in the low tens of milliseconds on each call to {{addSlave}} and {{updateSlave}}; when there are tens of thousands of slaves, this amounts to the large delay seen above. We also saw a slow, steady increase in memory consumption, hinting further at a queue backup in the allocator. A synthetic benchmark like we did for the registrar would be prudent here, along with visibility into the allocator's queue size. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2900) Display capabilities in state.json
[ https://issues.apache.org/jira/browse/MESOS-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aditi Dixit updated MESOS-2900: --- Summary: Display capabilities in state.json (was: The capabilities field was added recently to FrameworkInfo. It should be displayed in the state.json endpoint) Display capabilities in state.json -- Key: MESOS-2900 URL: https://issues.apache.org/jira/browse/MESOS-2900 Project: Mesos Issue Type: Bug Reporter: Aditi Dixit Assignee: Aditi Dixit Priority: Minor The capabilities field was added to FrameworkInfo recently. It should be displayed in the state.json HTTP endpoint. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2768) SIGPIPE in process::run_in_event_loop()
[ https://issues.apache.org/jira/browse/MESOS-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593724#comment-14593724 ] Vinod Kone commented on MESOS-2768: --- [~jvanremoortere] [~benjaminhindman] any update on this? we are still seeing this in production. SIGPIPE in process::run_in_event_loop() --- Key: MESOS-2768 URL: https://issues.apache.org/jira/browse/MESOS-2768 Project: Mesos Issue Type: Bug Affects Versions: 0.23.0 Reporter: Yan Xu Priority: Critical Observed in production. {noformat:title=slave log} I0526 12:17:48.027257 51633 slave.cpp:4077] Received a new estimation of the oversubscribable resources W0526 12:17:48.027257 51636 logging.cpp:91] RAW: Received signal SIGPIPE; escalating to SIGABRT *** Aborted at 1432642668 (unix time) try date -d @1432642668 if you are using GNU date *** PC: @ 0x7fa58c23eb6d raise *** SIGABRT (@0xc9a5) received by PID 51621 (TID 0x7fa58224c940) from PID 51621; stack trace: *** @ 0x7fa58c23eca0 (unknown) @ 0x7fa58c23eb6d raise @ 0x7fa58cc19ba7 mesos::internal::logging::handler() @ 0x7fa58c23eca0 (unknown) @ 0x7fa58c23da2b __libc_write @ 0x7fa58cb57b6f evpipe_write.part.5 @ 0x7fa58d245070 process::run_in_event_loop() @ 0x7fa58d2441ba process::EventLoop::delay() @ 0x7fa58d1c3c9c process::clock::scheduleTick() @ 0x7fa58d1c65b1 process::Clock::timer() @ 0x7fa58d23915a process::delay() @ 0x7fa58d23a740 process::ReaperProcess::wait() @ 0x7fa58d21261a process::ProcessManager::resume() @ 0x7fa58d2128dc process::schedule() @ 0x7fa58c23683d start_thread @ 0x7fa58ba28fcd clone {noformat} {noformat:title=gdb} (gdb) bt #0 0x7fa58c23eb6d in raise () from /lib64/libpthread.so.0 #1 0x7fa58cc19ba7 in mesos::internal::logging::handler (signal=Unhandled dwarf expression opcode 0xf3 ) at logging/logging.cpp:92 #2 signal handler called #3 0x7fa58c23da2b in write () from /lib64/libpthread.so.0 #4 0x7fa58cb57b6f in evpipe_write (loop=0x7fa58e1e79c0, flag=Unhandled dwarf expression opcode 0xfa ) at 
ev.c:2172 #5 0x7fa58d245070 in process::run_in_event_loop<Nothing>(const std::function<process::Future<Nothing>()>&) (f=Unhandled dwarf expression opcode 0xf3 ) at src/libev.hpp:80 #6 0x7fa58d2441ba in process::EventLoop::delay(const Duration&, const std::function<void()>&) (duration=Unhandled dwarf expression opcode 0xf3 ) at src/libev.cpp:106 #7 0x7fa58d1c3c9c in process::clock::scheduleTick (timers=Unhandled dwarf expression opcode 0xf3 ) at src/clock.cpp:119 #8 0x7fa58d1c65b1 in process::Clock::timer(const Duration&, const std::function<void()>&) (duration=Unhandled dwarf expression opcode 0xf3 ) at src/clock.cpp:254 #9 0x7fa58d23915a in process::delay<process::ReaperProcess> (duration=..., pid=Unhandled dwarf expression opcode 0xf3 ) at ./include/process/delay.hpp:25 #10 0x7fa58d23a740 in process::ReaperProcess::wait (this=0x2056920) at src/reap.cpp:93 #11 0x7fa58d21261a in process::ProcessManager::resume (this=0x1db8d20, process=0x2056958) at src/process.cpp:2172 #12 0x7fa58d2128dc in process::schedule (arg=Unhandled dwarf expression opcode 0xf3 ) at src/process.cpp:602 #13 0x7fa58c23683d in start_thread () from /lib64/libpthread.so.0 #14 0x7fa58ba28fcd in clone () from /lib64/libc.so.6 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2807) As a developer I need an easy way to convert MasterInfo protobuf to/from JSON
[ https://issues.apache.org/jira/browse/MESOS-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2807: -- Assignee: Marco Massenzio As a developer I need an easy way to convert MasterInfo protobuf to/from JSON - Key: MESOS-2807 URL: https://issues.apache.org/jira/browse/MESOS-2807 Project: Mesos Issue Type: Task Components: leader election Reporter: Marco Massenzio Assignee: Marco Massenzio Fix For: 0.23.0 As a preliminary to MESOS-2340, this requires the implementation of a simple (de)serialization mechanism to JSON from/to {{MasterInfo}} protobuf. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2637) Consolidate 'foo', 'bar', ... string constants in test and example code
[ https://issues.apache.org/jira/browse/MESOS-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593749#comment-14593749 ] Niklas Quarfot Nielsen commented on MESOS-2637: --- @colin: Do you have cycles to work on this? If not, I will remove myself as shepherd for now Consolidate 'foo', 'bar', ... string constants in test and example code --- Key: MESOS-2637 URL: https://issues.apache.org/jira/browse/MESOS-2637 Project: Mesos Issue Type: Bug Components: technical debt Reporter: Niklas Quarfot Nielsen Assignee: Colin Williams We are using 'foo', 'bar', ... string constants and pairs in src/tests/master_tests.cpp, src/tests/slave_tests.cpp, src/tests/hook_tests.cpp and src/examples/test_hook_module.cpp for label and hooks tests. We should consolidate them to make the call sites less prone to forgetting to update all call sites. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2873) style hook prevents valid markdown files from getting committed
[ https://issues.apache.org/jira/browse/MESOS-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2873: --- Sprint: Mesosphere Sprint 12 style hook prevents valid markdown files from getting committed Key: MESOS-2873 URL: https://issues.apache.org/jira/browse/MESOS-2873 Project: Mesos Issue Type: Bug Reporter: Alexander Rojas Priority: Trivial According to the original [markdown specification|http://daringfireball.net/projects/markdown/syntax#p] and to the most [recent standardization|http://spec.commonmark.org/0.20/#hard-line-breaks] effort, two spaces at the end of a line create a hard line break (it breaks the line without starting a new paragraph), similar to the HTML tag {{<br/>}}. However, there's a hook in mesos which prevents files with trailing whitespace from being committed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2419) Slave recovery not recovering tasks when using systemd
[ https://issues.apache.org/jira/browse/MESOS-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593844#comment-14593844 ] Chris Fortier commented on MESOS-2419: -- Brenden - would you be able to share the systemd file you are using to start the mesos slave? I'm having the same issue but KillMode=process doesn't seem to help. We are also running mesos as a container on CoreOS. Slave recovery not recovering tasks when using systemd -- Key: MESOS-2419 URL: https://issues.apache.org/jira/browse/MESOS-2419 Project: Mesos Issue Type: Bug Components: slave Reporter: Brenden Matthews Assignee: Joerg Schad Attachments: mesos-chronos.log.gz, mesos.log.gz {color:red} Note: the resolution to this issue is described in the following comment below: https://issues.apache.org/jira/browse/MESOS-2419?focusedCommentId=14357028page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14357028 {color:red} In a recent build from master (updated yesterday), slave recovery appears to have broken. I'll attach the slave log (with GLOG_v=1) showing a task called `long-running-job` which is a Chronos job that just does `sleep 1h`. After restarting the slave, the task shows as `TASK_FAILED`. 
Here's another case, which is for a docker task: {noformat} Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.247159 10022 docker.cpp:421] Recovering Docker containers Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.247207 10022 docker.cpp:468] Recovering container 'f2001064-e076-4978-b764-ed12a5244e78' for executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 20150226-230228-2931198986-5050-717- Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254791 10022 docker.cpp:1333] Executor for container 'f2001064-e076-4978-b764-ed12a5244e78' has exited Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254812 10022 docker.cpp:1159] Destroying container 'f2001064-e076-4978-b764-ed12a5244e78' Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254844 10022 docker.cpp:1248] Running docker stop on container 'f2001064-e076-4978-b764-ed12a5244e78' Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.262481 10027 containerizer.cpp:310] Recovering containerizer Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.262565 10027 containerizer.cpp:353] Recovering container 'f2001064-e076-4978-b764-ed12a5244e78' for executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 20150226-230228-2931198986-5050-717- Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.263675 10027 linux_launcher.cpp:162] Couldn't find freezer cgroup for container f2001064-e076-4978-b764-ed12a5244e78, assuming already destroyed Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 00:09:49.265467 10020 cpushare.cpp:199] Couldn't find cgroup for container f2001064-e076-4978-b764-ed12a5244e78 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.266448 10022 containerizer.cpp:1147] Executor for container 
'f2001064-e076-4978-b764-ed12a5244e78' has exited Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.266466 10022 containerizer.cpp:938] Destroying container 'f2001064-e076-4978-b764-ed12a5244e78' Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.593585 10021 slave.cpp:3735] Sending reconnect request to executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717- at executor(1)@10.81.189.232:43130 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.597843 10024 slave.cpp:3175] Termination of executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework '20150226-230228-2931198986-5050-717-' failed: Container 'f2001064-e076-4978-b764-ed12a5244e78' not found Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.597949 10025 slave.cpp:3429] Failed to unmonitor container for executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717-: Not monitored Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.598785 10024 slave.cpp:2508] Handling status update TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717- from @0.0.0.0:0 Feb 27 00:09:50
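The {{KillMode=process}} workaround discussed in this thread amounts to a unit file along these lines. This is a sketch only; the paths, master address, and flags are illustrative assumptions, not taken from this thread.

```ini
[Unit]
Description=Mesos Slave
After=network.target

[Service]
ExecStart=/usr/sbin/mesos-slave --master=zk://localhost:2181/mesos --work_dir=/var/lib/mesos
Restart=always
# systemd's default KillMode=control-group kills every process in the
# service's cgroup on restart, including the executors, so their tasks
# fail instead of being recovered. KillMode=process kills only the slave
# process itself, leaving executors alive for checkpoint-based recovery.
KillMode=process

[Install]
WantedBy=multi-user.target
```

Whether this suffices depends on the environment; as the comments note, running the slave as a container (e.g. on CoreOS) adds another layer where the executors may still be torn down with the container.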
[jira] [Commented] (MESOS-2419) Slave recovery not recovering tasks when using systemd
[ https://issues.apache.org/jira/browse/MESOS-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593849#comment-14593849 ] Brenden Matthews commented on MESOS-2419: - I'm afraid I don't have it anymore. I'd be happy to review yours, however. Slave recovery not recovering tasks when using systemd -- Key: MESOS-2419 URL: https://issues.apache.org/jira/browse/MESOS-2419 Project: Mesos Issue Type: Bug Components: slave Reporter: Brenden Matthews Assignee: Joerg Schad Attachments: mesos-chronos.log.gz, mesos.log.gz {color:red} Note: the resolution to this issue is described in the following comment below: https://issues.apache.org/jira/browse/MESOS-2419?focusedCommentId=14357028page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14357028 {color:red} In a recent build from master (updated yesterday), slave recovery appears to have broken. I'll attach the slave log (with GLOG_v=1) showing a task called `long-running-job` which is a Chronos job that just does `sleep 1h`. After restarting the slave, the task shows as `TASK_FAILED`. 
Here's another case, which is for a docker task: {noformat} Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.247159 10022 docker.cpp:421] Recovering Docker containers Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.247207 10022 docker.cpp:468] Recovering container 'f2001064-e076-4978-b764-ed12a5244e78' for executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 20150226-230228-2931198986-5050-717- Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254791 10022 docker.cpp:1333] Executor for container 'f2001064-e076-4978-b764-ed12a5244e78' has exited Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254812 10022 docker.cpp:1159] Destroying container 'f2001064-e076-4978-b764-ed12a5244e78' Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254844 10022 docker.cpp:1248] Running docker stop on container 'f2001064-e076-4978-b764-ed12a5244e78' Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.262481 10027 containerizer.cpp:310] Recovering containerizer Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.262565 10027 containerizer.cpp:353] Recovering container 'f2001064-e076-4978-b764-ed12a5244e78' for executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 20150226-230228-2931198986-5050-717- Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.263675 10027 linux_launcher.cpp:162] Couldn't find freezer cgroup for container f2001064-e076-4978-b764-ed12a5244e78, assuming already destroyed Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 00:09:49.265467 10020 cpushare.cpp:199] Couldn't find cgroup for container f2001064-e076-4978-b764-ed12a5244e78 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.266448 10022 containerizer.cpp:1147] Executor for container 
'f2001064-e076-4978-b764-ed12a5244e78' has exited Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.266466 10022 containerizer.cpp:938] Destroying container 'f2001064-e076-4978-b764-ed12a5244e78' Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.593585 10021 slave.cpp:3735] Sending reconnect request to executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717- at executor(1)@10.81.189.232:43130 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.597843 10024 slave.cpp:3175] Termination of executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework '20150226-230228-2931198986-5050-717-' failed: Container 'f2001064-e076-4978-b764-ed12a5244e78' not found Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.597949 10025 slave.cpp:3429] Failed to unmonitor container for executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717-: Not monitored Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.598785 10024 slave.cpp:2508] Handling status update TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717- from @0.0.0.0:0 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.599093 10023 slave.cpp:2637] Failed to update resources for container
[jira] [Closed] (MESOS-2899) The capabilities field was added recently to FrameworkInfo. It should be displayed in the state.json endpoint
[ https://issues.apache.org/jira/browse/MESOS-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio closed MESOS-2899. -- Resolution: Won't Fix Following from [~aditidixit]'s email. The capabilities field was added recently to FrameworkInfo. It should be displayed in the state.json endpoint - Key: MESOS-2899 URL: https://issues.apache.org/jira/browse/MESOS-2899 Project: Mesos Issue Type: Bug Reporter: Aditi Dixit Assignee: Aditi Dixit Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)