[jira] [Commented] (MESOS-701) Improve webui performance for large clusters.
    [ https://issues.apache.org/jira/browse/MESOS-701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16457171#comment-16457171 ]

Qui Nguyen commented on MESOS-701:
----------------------------------

We recently ran into an issue with this, where keeping the UI open for a large cluster slowed down the master. Perhaps reducing/eliminating the automatic refresh and/or caching state could help, too?

> Improve webui performance for large clusters.
> ---------------------------------------------
>
> Key: MESOS-701
> URL: https://issues.apache.org/jira/browse/MESOS-701
> Project: Mesos
> Issue Type: Improvement
> Reporter: Benjamin Mahler
> Priority: Major
> Labels: scalability
>
> For large clusters with tens of thousands of slaves, the webui is unusably
> slow.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (MESOS-8853) Quota limits should be both backward as well as forward compatible.
Meng Zhu created MESOS-8853:
----------------------------

    Summary: Quota limits should be both backward as well as forward compatible.
    Key: MESOS-8853
    URL: https://issues.apache.org/jira/browse/MESOS-8853
    Project: Mesos
    Issue Type: Improvement
    Reporter: Meng Zhu

Introducing quota limits should maintain both backward and forward compatibility. When upgrading from an old master that does not support quota limits to a new master that does, the relevant system behavior should stay the same. While this is not possible in the downgrade case, we should try to minimize the impact.
[jira] [Commented] (MESOS-8594) Mesos master crash (under load)
    [ https://issues.apache.org/jira/browse/MESOS-8594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16457134#comment-16457134 ]

Benjamin Mahler commented on MESOS-8594:
----------------------------------------

[~chhsia0] and I noticed MESOS-8852. A process::loop fix will likely fix this in practice, but it is still technically possible to hit a stack overflow if futures always complete within a specific window. Once MESOS-8852 is also fixed, the process::loop fix will be completely effective.

> Mesos master crash (under load)
> -------------------------------
>
> Key: MESOS-8594
> URL: https://issues.apache.org/jira/browse/MESOS-8594
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.5.0, 1.6.0
> Reporter: A. Dukhovniy
> Assignee: Benjamin Mahler
> Priority: Blocker
> Labels: reliability
> Attachments: lldb-bt.txt, lldb-di-f.txt, lldb-image-section.txt, lldb-regiser-read.txt
>
>
> Mesos master crashes under load. Attached is some output from `lldb`:
> {code:java}
> Process 41933 resuming
> Process 41933 stopped
> * thread #10, stop reason = EXC_BAD_ACCESS (code=2, address=0x789ecff8)
>     frame #0: 0x00010c30ddb6 libmesos-1.6.0.dylib`::_Some() at some.hpp:35
>    32   template <typename T>
>    33   struct _Some
>    34   {
> -> 35     _Some(T _t) : t(std::move(_t)) {}
>    36
>    37     T t;
>    38   };
> Target 0: (mesos-master) stopped.
> (lldb)
> {code}
> To quote [~abudnik]
> {quote}it’s the stack overflow bug in libprocess due to the way
> `internal::send()` and `internal::_send()` are implemented in `process.cpp`
> {quote}
[jira] [Created] (MESOS-8852) process::loop does not guarantee stack overflow prevention.
Benjamin Mahler created MESOS-8852:
-----------------------------------

    Summary: process::loop does not guarantee stack overflow prevention.
    Key: MESOS-8852
    URL: https://issues.apache.org/jira/browse/MESOS-8852
    Project: Mesos
    Issue Type: Bug
    Components: libprocess
    Reporter: Benjamin Mahler

One of the goals of process::loop is to prevent stack overflows in the case that the callbacks complete synchronously. However, process::loop can still overflow the stack if the body and iterate futures transition between the check for readiness and the setting of the continuation callbacks. If the futures continuously transition in these windows, the stack will overflow.

One way to fix this would be to provide an atomic set-callbacks-if-pending function on Future (e.g. {{bool setIfPending(...)}}) that allows the caller to avoid accidentally invoking callbacks synchronously when setting the callbacks.
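The set-callbacks-if-pending idea can be sketched with a toy future. This is only an illustration under stated assumptions: `ToyFuture`, its `set()` and `get()` methods, and `runReadyIterations()` are hypothetical names invented here and are not libprocess's `process::Future` API. Only the name `setIfPending` comes from the issue text, and its signature below is a guess. The point of the sketch is that the readiness check and the callback registration happen under one lock, so there is no window in which a callback can end up being invoked synchronously by the registering caller.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical toy future for illustration; NOT process::Future.
class ToyFuture
{
public:
  // Atomically registers `callback` only if the future is still pending.
  // Returns true if it was registered (it will run later, from `set()`).
  // Returns false if the future is already ready; the callback is NOT
  // invoked, so the caller can keep iterating without growing the stack.
  bool setIfPending(std::function<void(int)> callback)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    if (ready_) {
      return false;
    }
    callback_ = std::move(callback);
    return true;
  }

  // Completes the future and runs the registered callback, if any.
  void set(int value)
  {
    std::function<void(int)> callback;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (ready_) {
        return;
      }
      ready_ = true;
      value_ = value;
      callback.swap(callback_);
    }
    if (callback) {
      callback(value); // Invoked outside the lock.
    }
  }

  // Returns the value; only valid once the future is ready.
  int get() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    assert(ready_);
    return value_;
  }

private:
  mutable std::mutex mutex_;
  bool ready_ = false;
  int value_ = 0;
  std::function<void(int)> callback_;
};

// Simulates a loop whose futures are all ready before the callbacks are
// set. Each step is handled in this flat `for` loop rather than through a
// nested callback invocation. Returns the number of steps handled inline.
int runReadyIterations(int iterations)
{
  int handledInline = 0;
  for (int i = 0; i < iterations; ++i) {
    ToyFuture future;
    future.set(i); // Completes before the callback is registered.

    if (future.setIfPending([](int) { /* Would resume the loop later. */ })) {
      break; // A real loop would return here and resume from the callback.
    }
    ++handledInline;
  }
  return handledInline;
}
```

With this shape, the stack depth of the loop stays constant no matter how many futures complete early, because a ready future is never answered by invoking the continuation from inside the registration call.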
[jira] [Assigned] (MESOS-8594) Mesos master crash (under load)
    [ https://issues.apache.org/jira/browse/MESOS-8594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler reassigned MESOS-8594:
-------------------------------------

    Assignee: Benjamin Mahler

Will look into a fix without waiting for the http::Server patches.
[jira] [Commented] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs
    [ https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16456573#comment-16456573 ]

Vishant Singh commented on MESOS-8574:
--------------------------------------

[~abudnik] I am not completely sure of the reason for the Docker hang, but it seems that Docker has stale information about running containers. The container gets killed as part of a task kill request from Marathon. Because the Docker task kill involves SIGTERM first and then SIGKILL after a timeout, the SIGKILL terminates the task but dockerd is not updated with this state. This might be because SIGKILL cannot be caught by a signal handler that could otherwise update the state information in Docker. After this, when a new task is launched on this host, `docker inspect` or `docker ps` becomes unresponsive.

At this point I have monitoring for Docker hangs, and the idea is to restart Docker if it is in a hung state.

> Docker executor makes no progress when 'docker inspect' hangs
> -------------------------------------------------------------
>
> Key: MESOS-8574
> URL: https://issues.apache.org/jira/browse/MESOS-8574
> Project: Mesos
> Issue Type: Improvement
> Components: docker, executor
> Affects Versions: 1.5.0
> Reporter: Greg Mann
> Assignee: Andrei Budnik
> Priority: Major
> Labels: mesosphere
> Fix For: 1.3.3, 1.4.2, 1.5.1, 1.6.0
>
>
> In the Docker executor, many calls later in the executor's lifecycle are
> gated on an initial {{docker inspect}} call returning:
> https://github.com/apache/mesos/blob/bc6b61bca37752689cffa40a14c53ad89f24e8fc/src/docker/executor.cpp#L223
> If that first call to {{docker inspect}} never returns, the executor becomes
> stuck in a state where it makes no progress and cannot be killed.
> It's tempting for the executor to simply commit suicide after a timeout, but
> we must be careful of the case in which the executor's Docker container is
> actually running successfully, but the Docker daemon is unresponsive. In such
> a case, we do not want to send TASK_FAILED or TASK_KILLED if the task's
> container is running successfully.
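The distinction drawn in the description, between a command that failed and a command that never answered, can be sketched as follows. This is a minimal POSIX illustration and not the Docker executor's code: `CommandResult` and `runWithTimeout()` are hypothetical names introduced here, and the polling interval is an arbitrary choice. A caller would treat `TIMED_OUT` as "state unknown" rather than as grounds for sending TASK_FAILED or TASK_KILLED.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <vector>

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical result type: a timeout is reported separately from a
// failure, because a hung command says nothing about the container.
enum class CommandResult { SUCCEEDED, FAILED, TIMED_OUT };

// Runs `argv` as a child process and waits up to `timeout` for it to
// exit. On timeout, the child is killed with SIGKILL and reaped.
CommandResult runWithTimeout(
    const std::vector<std::string>& argv,
    std::chrono::milliseconds timeout)
{
  assert(!argv.empty());

  // Build the null-terminated argument array before forking.
  std::vector<char*> args;
  for (const std::string& arg : argv) {
    args.push_back(const_cast<char*>(arg.c_str()));
  }
  args.push_back(nullptr);

  pid_t pid = fork();
  if (pid < 0) {
    return CommandResult::FAILED;
  }

  if (pid == 0) {
    execvp(args[0], args.data());
    _exit(127); // Only reached if exec failed.
  }

  const auto deadline = std::chrono::steady_clock::now() + timeout;
  int status = 0;

  // Poll for exit without blocking so the deadline can be enforced.
  while (true) {
    pid_t done = waitpid(pid, &status, WNOHANG);
    if (done == pid) {
      return (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        ? CommandResult::SUCCEEDED
        : CommandResult::FAILED;
    }
    if (done < 0) {
      return CommandResult::FAILED;
    }
    if (std::chrono::steady_clock::now() >= deadline) {
      break;
    }
    usleep(10 * 1000); // 10ms between polls.
  }

  // SIGKILL cannot be caught, so the child gets no chance to clean up.
  kill(pid, SIGKILL);
  waitpid(pid, &status, 0); // Reap the child to avoid a zombie.
  return CommandResult::TIMED_OUT;
}
```

For example, `runWithTimeout({"docker", "inspect", name}, std::chrono::seconds(30))` would return `TIMED_OUT` when the daemon is unresponsive, where `name` and the 30 second limit are placeholders chosen for illustration.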
[jira] [Commented] (MESOS-8739) Implement a test to check that a launched container can be killed.
    [ https://issues.apache.org/jira/browse/MESOS-8739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16456526#comment-16456526 ]

Andrei Budnik commented on MESOS-8739:
--------------------------------------

Already covered by `SlaveTest.*`

> Implement a test to check that a launched container can be killed.
> ------------------------------------------------------------------
>
> Key: MESOS-8739
> URL: https://issues.apache.org/jira/browse/MESOS-8739
> Project: Mesos
> Issue Type: Task
> Reporter: Andrei Budnik
> Priority: Major
> Labels: mesosphere, test
>
> This test launches a long-running task, then successively calls `wait()` and
> `destroy()` methods of the composing containerizer. Both termination statuses
> must be equal.
[jira] [Commented] (MESOS-8738) Implement a test to check that a recovered container can be killed.
    [ https://issues.apache.org/jira/browse/MESOS-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16456518#comment-16456518 ]

Andrei Budnik commented on MESOS-8738:
--------------------------------------

This test case is already covered by `SlaveRecoveryTest.KillTask`.

> Implement a test to check that a recovered container can be killed.
> -------------------------------------------------------------------
>
> Key: MESOS-8738
> URL: https://issues.apache.org/jira/browse/MESOS-8738
> Project: Mesos
> Issue Type: Task
> Reporter: Andrei Budnik
> Priority: Major
> Labels: mesosphere, test
>
> This test verifies that a recovered container can be killed via `destroy()`
> method of composing containerizer.
[jira] [Commented] (MESOS-8809) Add functions for manipulating POSIX ACLs into stout
    [ https://issues.apache.org/jira/browse/MESOS-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16456453#comment-16456453 ]

Qian Zhang commented on MESOS-8809:
-----------------------------------

commit 617d55e24a3bef7305b75c8fc6cbd1d1f14d7f6a
Author: Qian Zhang
Date: Fri Apr 27 09:30:48 2018 +0800

    Added `libacl` into a few Dockerfiles.

    This commit adds `libacl` into Dockerfiles for the images:
    1. mesos/mesos-build
    2. mesos/mesos-tidy
    3. mesos/mesos-mini

    Review: https://reviews.apache.org/r/66840

> Add functions for manipulating POSIX ACLs into stout
> ----------------------------------------------------
>
> Key: MESOS-8809
> URL: https://issues.apache.org/jira/browse/MESOS-8809
> Project: Mesos
> Issue Type: Task
> Components: stout
> Reporter: Qian Zhang
> Assignee: Qian Zhang
> Priority: Major
>
> We need to add functions for setting/getting POSIX ACLs into stout so that we
> can leverage these functions to grant volume permissions to the specific task
> user.
> This will introduce a new dependency {{libacl-devel}} when building Mesos.