[jira] [Commented] (MESOS-2246) Improve slave health-checking
[ https://issues.apache.org/jira/browse/MESOS-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588361#comment-14588361 ]

Joe Smith commented on MESOS-2246:
----------------------------------

[~vinodkone] [~jieyu] given the tickets in this epic are completed, can this be resolved?

Improve slave health-checking
-----------------------------

Key: MESOS-2246
URL: https://issues.apache.org/jira/browse/MESOS-2246
Project: Mesos
Issue Type: Epic
Components: master, slave
Reporter: Dominic Hamon

In the event of a network partition, or other systemic issues, we may see widespread slave removal. There are several approaches we can take to mitigate this issue including, but not limited to:
* rate limit the slave removal
* change how we do health checking to not rely on a single point of view
* work with frameworks to determine SLA of running services before removing the slave
* manual control to allow operator intervention

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (MESOS-2755) Slave should perform a diff in logs when incompatible SlaveInfo is detected
Joe Smith created MESOS-2755:
-----------------------------

Summary: Slave should perform a diff in logs when incompatible SlaveInfo is detected
Key: MESOS-2755
URL: https://issues.apache.org/jira/browse/MESOS-2755
Project: Mesos
Issue Type: Story
Components: slave
Reporter: Joe Smith

When diagnosing slaves with attribute changes, it'd be super helpful for the logs to contain a diff which displays the changes in the {{SlaveInfo}} instead of only printing out the old and new contents of {{SlaveInfo}}.
[jira] [Commented] (MESOS-1739) Allow slave reconfiguration on restart
[ https://issues.apache.org/jira/browse/MESOS-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536242#comment-14536242 ]

Joe Smith commented on MESOS-1739:
----------------------------------

Howdy all,

What's the status of this? This change would greatly increase flexibility for us operators!

Thanks,
Joe

Allow slave reconfiguration on restart
--------------------------------------

Key: MESOS-1739
URL: https://issues.apache.org/jira/browse/MESOS-1739
Project: Mesos
Issue Type: Epic
Reporter: Patrick Reilly
Assignee: Cody Maloney

Make it so that, either via a slave restart or an out-of-process reconfigure ping, the attributes and resources of a slave can be updated to be a superset of what they used to be.
[jira] [Commented] (MESOS-2676) slave recovery always fails when resources change
[ https://issues.apache.org/jira/browse/MESOS-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522498#comment-14522498 ]

Joe Smith commented on MESOS-2676:
----------------------------------

[~vinodkone] made [a compelling argument previously|http://mail-archives.apache.org/mod_mbox/mesos-user/201406.mbox/%3ccaakwvazphphzfhnbr46amfxwg3bynppbn+jl4wc8qpor9cs...@mail.gmail.com%3E], but I don't think we should let this go. What if the resources increase by (for example) 1.0 CPU? Surely that's a safe change, and allowing changes of some nature would greatly increase the flexibility of operators.

slave recovery always fails when resources change
-------------------------------------------------

Key: MESOS-2676
URL: https://issues.apache.org/jira/browse/MESOS-2676
Project: Mesos
Issue Type: Improvement
Reporter: David Robinson

Slave recovery fails whenever --resources is changed. Ideally recovery would only fail if --resources has changed _and_ the still-executing tasks no longer fit within the new --resources range. Increasing resources should always be allowed. For example, if a slave was started with --resources=cpus:15 and then restarted with --resources=cpus:16, the slave should start successfully. Same for mem, ports, disk and ephemeral_ports.
[jira] [Created] (MESOS-2367) Improve slave resiliency in the face of orphan containers
Joe Smith created MESOS-2367:
-----------------------------

Summary: Improve slave resiliency in the face of orphan containers
Key: MESOS-2367
URL: https://issues.apache.org/jira/browse/MESOS-2367
Project: Mesos
Issue Type: Bug
Components: slave
Reporter: Joe Smith

Right now there's a case where a misbehaving executor can cause a slave process to flap:

{panel:title=Quote From [~jieyu]}
{quote}
1) User tries to kill an instance
2) Slave sends {{KillTaskMessage}} to executor
3) Executor sends kill signals to task processes
4) Executor sends {{TASK_KILLED}} to slave
5) Slave updates container cpu limit to be 0.01 cpus
6) A user-process is still processing the kill signal
7) The task process cannot exit since it has too little cpu share and is throttled
8) Executor itself terminates
9) Slave tries to destroy the container, but cannot because the user-process is stuck in the exit path
10) Slave restarts, and is constantly flapping because it cannot kill orphan containers
{quote}
{panel}

The slave's orphan container handling should be improved to deal with this case despite ill-behaved users (framework writers).
[jira] [Commented] (MESOS-1474) Provide cluster maintenance primitives for operators.
[ https://issues.apache.org/jira/browse/MESOS-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127701#comment-14127701 ]

Joe Smith commented on MESOS-1474:
----------------------------------

The doc looks great! Thanks for all the hard work and rigor!

Provide cluster maintenance primitives for operators.
-----------------------------------------------------

Key: MESOS-1474
URL: https://issues.apache.org/jira/browse/MESOS-1474
Project: Mesos
Issue Type: Epic
Components: framework, master, slave
Reporter: Benjamin Mahler

Normally cluster upgrades can be done seamlessly using the built-in slave recovery feature. However, there are situations where operators want to be able to perform destructive maintenance operations on machines:
* Non-recoverable slave upgrades.
* Machine reboots.
* Kernel upgrades.
* Machine decommissioning.
* etc.

In these situations, best practice is to perform rolling maintenance in large batches of machines. This can be problematic for frameworks when many related tasks are located within a batch of machines going down for maintenance. There are a few primitives of interest here:
* Provide a way for operators to fully shut down a slave (killing all tasks underneath it).
* Provide a way for operators to mark specific slaves as undergoing maintenance. This means that no more offers are sent for these slaves, and no new tasks will launch on them.
* Provide a way for frameworks to be notified when resources are requested to be relinquished. This gives the framework a chance to proactively move a task before it is forcibly killed. It also allows the automation of operations like: please drain these slaves within 1 hour.
[jira] [Commented] (MESOS-1758) Freezer failure leads to lost task during container destruction.
[ https://issues.apache.org/jira/browse/MESOS-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124977#comment-14124977 ]

Joe Smith commented on MESOS-1758:
----------------------------------

Can we make sure this gets into 0.21.0? This is continuing to hit us with LOST tasks, so I just want to make sure it gets included. Thanks!

Freezer failure leads to lost task during container destruction.
----------------------------------------------------------------

Key: MESOS-1758
URL: https://issues.apache.org/jira/browse/MESOS-1758
Project: Mesos
Issue Type: Bug
Components: containerization
Reporter: Benjamin Mahler

In the past we've seen numerous issues around the freezer. Lately, on the 2.6.44 kernel, we've seen issues where we're unable to freeze the cgroup:

(1) An oom occurs.
(2) No indication of oom in the kernel logs.
(3) The slave is unable to freeze the cgroup.
(4) The task is marked as lost.

{noformat}
I0903 16:46:24.956040 25469 mem.cpp:575] Memory limit exceeded: Requested: 15488MB Maximum Used: 15488MB
MEMORY STATISTICS:
cache 7958691840
rss 8281653248
mapped_file 9474048
pgpgin 4487861
pgpgout 522933
pgfault 2533780
pgmajfault 11
inactive_anon 0
active_anon 8281653248
inactive_file 7631708160
active_file 326852608
unevictable 0
hierarchical_memory_limit 16240345088
total_cache 7958691840
total_rss 8281653248
total_mapped_file 9474048
total_pgpgin 4487861
total_pgpgout 522933
total_pgfault 2533780
total_pgmajfault 11
total_inactive_anon 0
total_active_anon 8281653248
total_inactive_file 7631728640
total_active_file 326852608
total_unevictable 0
I0903 16:46:24.956848 25469 containerizer.cpp:1041] Container bbb9732a-d600-4c1b-b326-846338c608c3 has reached its limit for resource mem(*):1.62403e+10 and will be terminated
I0903 16:46:24.957427 25469 containerizer.cpp:909] Destroying container 'bbb9732a-d600-4c1b-b326-846338c608c3'
I0903 16:46:24.958664 25481 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:34.959529 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:34.962070 25482 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.710848ms
I0903 16:46:34.962658 25479 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:44.963349 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:44.965631 25472 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.588224ms
I0903 16:46:44.966356 25472 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:54.967254 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:56.008447 25475 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 2.15296ms
I0903 16:46:56.009071 25466 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:06.010329 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:06.012538 25467 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.643008ms
I0903 16:47:06.013216 25467 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:12.516348 25480 slave.cpp:3030] Current usage 9.57%. Max allowed age: 5.630238827780799days
I0903 16:47:16.015192 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:16.017043 25486 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.511168ms
I0903 16:47:16.017555 25480 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:19.862746 25483 http.cpp:245] HTTP request for '/slave(1)/stats.json'
E0903 16:47:24.960055 25472 slave.cpp:2557] Termination of executor 'E' of framework '201104070004-002563-' failed: Failed to destroy container: discarded future
I0903 16:47:24.962054 25472 slave.cpp:2087] Handling status update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 201104070004-002563- from @0.0.0.0:0
I0903 16:47:24.963470 25469 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' to 128MB for container bbb9732a-d600-4c1b-b326-846338c608c3
{noformat}
[jira] [Commented] (MESOS-1765) Use PID namespace to avoid freezing cgroup
[ https://issues.apache.org/jira/browse/MESOS-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124979#comment-14124979 ]

Joe Smith commented on MESOS-1765:
----------------------------------

[~wangcong] can you share a link to the kernel bug? (Or a pointer to more discussion?) Sounds like we should keep tabs on fixing that as well.

Use PID namespace to avoid freezing cgroup
------------------------------------------

Key: MESOS-1765
URL: https://issues.apache.org/jira/browse/MESOS-1765
Project: Mesos
Issue Type: Story
Components: containerization
Reporter: Cong Wang

There is a known kernel issue when we freeze the whole cgroup upon OOM. Mesos could probably just use a PID namespace, so that we would only need to kill the init of the PID namespace instead of freezing all the processes and killing them one by one. But I am not quite sure whether this would break the existing code.
[jira] [Created] (MESOS-1678) Any framework with credentials can kill any other framework via http
Joe Smith created MESOS-1678:
-----------------------------

Summary: Any framework with credentials can kill any other framework via http
Key: MESOS-1678
URL: https://issues.apache.org/jira/browse/MESOS-1678
Project: Mesos
Issue Type: Story
Reporter: Joe Smith

Looking through [the review introducing the /shutdown http endpoint|https://reviews.apache.org/r/22832], it appears that any framework's credentials can be used to shut down any other framework. Around line 650:

{code}
foreach (const Credential& credential, master->credentials.get().http()) {
  if (credential.principal() == username &&
      (!credential.has_secret() || credential.secret() == password)) {
    // TODO(ijimenez) make removeFramework asynchronously
    master->removeFramework(framework);
    return OK();
  }
}
{code}

Thanks to [~adam-mesos], I looked into the [authorization doc|https://github.com/apache/mesos/blob/master/docs/authorization.md]; however, I don't see where the ACL-checking is happening within that code.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Closed] (MESOS-1678) Any framework with credentials can kill any other framework via http
[ https://issues.apache.org/jira/browse/MESOS-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joe Smith closed MESOS-1678.
----------------------------

Resolution: Duplicate

Any framework with credentials can kill any other framework via http
--------------------------------------------------------------------

Key: MESOS-1678
URL: https://issues.apache.org/jira/browse/MESOS-1678
Project: Mesos
Issue Type: Story
Reporter: Joe Smith
[jira] [Commented] (MESOS-1343) Authorize HTTP endpoints through ACLs
[ https://issues.apache.org/jira/browse/MESOS-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088411#comment-14088411 ]

Joe Smith commented on MESOS-1343:
----------------------------------

For 0.20.0, we either need to enable the ACLs or remove the /shutdown HTTP endpoint; as is, any framework's credentials can be used to shut down other frameworks.

Authorize HTTP endpoints through ACLs
-------------------------------------

Key: MESOS-1343
URL: https://issues.apache.org/jira/browse/MESOS-1343
Project: Mesos
Issue Type: Story
Reporter: Vinod Kone
[jira] [Commented] (MESOS-1343) Authorize HTTP endpoints through ACLs
[ https://issues.apache.org/jira/browse/MESOS-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088432#comment-14088432 ]

Joe Smith commented on MESOS-1343:
----------------------------------

Clarified with [~benjaminhindman]: these are gated via specific HTTP principals only, so a framework will not be able to use its principal to send requests via HTTP. However, these HTTP endpoints still need to have an ACL applied to them.

Authorize HTTP endpoints through ACLs
-------------------------------------

Key: MESOS-1343
URL: https://issues.apache.org/jira/browse/MESOS-1343
Project: Mesos
Issue Type: Story
Reporter: Vinod Kone