[jira] [Commented] (MESOS-701) Improve webui performance for large clusters.

2018-04-27 Thread Qui Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16457171#comment-16457171
 ] 

Qui Nguyen commented on MESOS-701:
--

We recently ran into an issue with this, where keeping the UI open for a large 
cluster slowed down the master. Perhaps reducing/eliminating the automatic 
refresh and/or caching state could help, too?

> Improve webui performance for large clusters.
> -
>
> Key: MESOS-701
> URL: https://issues.apache.org/jira/browse/MESOS-701
> Project: Mesos
> Issue Type: Improvement
> Reporter: Benjamin Mahler
> Priority: Major
> Labels: scalability
>
> For large clusters with tens of thousands of slaves, the webui is unusably 
> slow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8853) Quota limits should be both backward as well as forward compatible.

2018-04-27 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-8853:
---

Summary: Quota limits should be both backward as well as forward compatible.
Key: MESOS-8853
URL: https://issues.apache.org/jira/browse/MESOS-8853
Project: Mesos
Issue Type: Improvement
Reporter: Meng Zhu


Introducing quota limits should maintain both backward and forward 
compatibility. When upgrading from an old master that does not support quota 
limits to a new master that does, relevant system behavior should stay the 
same. While this is not possible for the downgrade case, we should try to 
minimize the impact.





[jira] [Commented] (MESOS-8594) Mesos master crash (under load)

2018-04-27 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16457134#comment-16457134
 ] 

Benjamin Mahler commented on MESOS-8594:


[~chhsia0] and I noticed MESOS-8852. A process::loop fix will likely fix this 
in practice, but it's still technically possible to hit a stack overflow if 
futures always complete within a specific window. Once MESOS-8852 is also 
fixed, the process::loop fix will become completely effective.

> Mesos master crash (under load)
> ---
>
> Key: MESOS-8594
> URL: https://issues.apache.org/jira/browse/MESOS-8594
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.5.0, 1.6.0
> Reporter: A. Dukhovniy
> Assignee: Benjamin Mahler
> Priority: Blocker
> Labels: reliability
> Attachments: lldb-bt.txt, lldb-di-f.txt, lldb-image-section.txt, 
> lldb-regiser-read.txt
>
>
> Mesos master crashes under load. Attached is some output from `lldb`:
> {code:java}
> Process 41933 resuming
> Process 41933 stopped
> * thread #10, stop reason = EXC_BAD_ACCESS (code=2, address=0x789ecff8)
> frame #0: 0x00010c30ddb6 libmesos-1.6.0.dylib`::_Some() at some.hpp:35
> 32 template <typename T>
> 33 struct _Some
> 34 {
> -> 35 _Some(T _t) : t(std::move(_t)) {}
> 36
> 37 T t;
> 38 };
> Target 0: (mesos-master) stopped.
> (lldb)
> {code}
> To quote [~abudnik]
> {quote}it’s the stack overflow bug in libprocess due to the way 
> `internal::send()` and `internal::_send()` are implemented in `process.cpp`
> {quote}





[jira] [Created] (MESOS-8852) process::loop does not guarantee stack overflow prevention.

2018-04-27 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-8852:
--

Summary: process::loop does not guarantee stack overflow prevention.
Key: MESOS-8852
URL: https://issues.apache.org/jira/browse/MESOS-8852
Project: Mesos
Issue Type: Bug
Components: libprocess
Reporter: Benjamin Mahler


One of the goals of process::loop is to prevent stack overflows in the case 
that the callbacks are completing synchronously. However, it's still possible 
for process::loop to stack overflow if the body and iterate futures transition 
between the checking of them being ready and the setting of the continuation 
callbacks. If the futures continuously transition in these windows, the stack 
will overflow.

One way to fix this would be to provide an atomic set-callbacks-if-pending 
function on Future (e.g. {{bool setIfPending(...)}}) that allows the caller to 
avoid accidentally invoking callbacks synchronously when setting the callbacks.
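
A minimal sketch of that idea follows, assuming a toy future type. `ToyFuture`, its members, and `runLoop` are hypothetical names used only for illustration; this is not the libprocess `Future` API and not a proposed patch. The point it tries to show is that the readiness check and the callback registration happen under one lock, so a callback is never invoked synchronously from the registration call.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Toy stand-in for a future (hypothetical; NOT libprocess's process::Future).
template <typename T>
class ToyFuture
{
public:
  // Atomically registers `callback` only if the future is still pending.
  // Returns false, without ever invoking `callback`, if it is already ready.
  bool setIfPending(std::function<void(const T&)> callback)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    if (ready_) {
      return false;
    }
    callbacks_.push_back(std::move(callback));
    return true;
  }

  // Completes the future and runs the callbacks registered while pending.
  void set(T value)
  {
    std::vector<std::function<void(const T&)>> callbacks;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (ready_) {
        return;
      }
      value_ = std::move(value);
      ready_ = true;
      callbacks.swap(callbacks_);
    }
    // Invoke outside the lock; `value_` no longer changes once ready.
    for (auto& callback : callbacks) {
      callback(value_);
    }
  }

private:
  std::mutex mutex_;
  bool ready_ = false;
  T value_{};
  std::vector<std::function<void(const T&)>> callbacks_;
};

// Runs `steps` iterations in which every future completes synchronously.
int runLoop(int steps)
{
  int completed = 0;
  for (int i = 0; i < steps; ++i) {
    ToyFuture<int> next;
    next.set(i);  // Simulates a future that is ready before we look at it.

    // Already ready: nothing is registered or invoked, so the loop simply
    // continues in the current stack frame instead of recursing.
    if (next.setIfPending([](const int&) {})) {
      break;  // Pending: a real loop would return and resume from the callback.
    }
    ++completed;
  }
  return completed;
}
```

With a primitive like this, a caller that gets `false` back handles the ready value in its own loop iteration, so stack depth stays constant no matter how many futures complete synchronously.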





[jira] [Assigned] (MESOS-8594) Mesos master crash (under load)

2018-04-27 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-8594:
--

Assignee: Benjamin Mahler

Will look into a fix without waiting for the http::Server patches.






[jira] [Commented] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs

2018-04-27 Thread Vishant Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16456573#comment-16456573
 ] 

Vishant Singh commented on MESOS-8574:
--

[~abudnik]

I'm not completely sure of the reason for the Docker hang, but it seems that 
Docker has stale information about running containers.

The container gets killed as part of a task kill request from Marathon. Since 
the Docker task kill involves SIGTERM first and then SIGKILL after a timeout, 
the SIGKILL terminates the task but dockerd is not updated with this state. 
This might be because SIGKILL cannot be caught by a signal handler, which 
could otherwise update the state information in Docker.
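
The SIGTERM-then-SIGKILL escalation described above can be sketched as follows. This is only an illustration of the general pattern, assuming the target is a direct child process of the caller; `terminateWithEscalation` is a hypothetical helper, not Docker's or Mesos's implementation.

```cpp
#include <chrono>

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: sends SIGTERM, waits up to `grace`, then sends SIGKILL.
// Returns the signal that terminated `pid`, 0 if it exited normally, or -1 on
// error. `pid` must be a child of the calling process so it can be reaped.
int terminateWithEscalation(pid_t pid, std::chrono::milliseconds grace)
{
  if (kill(pid, SIGTERM) != 0) {
    return -1;
  }

  const auto deadline = std::chrono::steady_clock::now() + grace;
  int status = 0;

  // Poll for exit until the grace period expires.
  while (std::chrono::steady_clock::now() < deadline) {
    const pid_t reaped = waitpid(pid, &status, WNOHANG);
    if (reaped == pid) {
      return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
    }
    if (reaped < 0) {
      return -1;
    }
    usleep(10 * 1000);  // 10 ms between polls.
  }

  // Grace period expired: SIGKILL cannot be caught or ignored, so the
  // target gets no chance to run cleanup code of its own.
  if (kill(pid, SIGKILL) != 0) {
    return -1;
  }
  if (waitpid(pid, &status, 0) != pid) {
    return -1;
  }
  return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

A return value of `SIGKILL` tells the caller that the process never acted on SIGTERM, which is the case where any state the process would normally report on shutdown is lost.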

After this, when a new task is launched on this host, docker inspect or 
docker ps becomes unresponsive.

At this point I have monitoring in place for Docker hangs, and the idea is to 
restart Docker if it is in a hung state.

> Docker executor makes no progress when 'docker inspect' hangs
> -
>
> Key: MESOS-8574
> URL: https://issues.apache.org/jira/browse/MESOS-8574
> Project: Mesos
> Issue Type: Improvement
> Components: docker, executor
> Affects Versions: 1.5.0
> Reporter: Greg Mann
> Assignee: Andrei Budnik
> Priority: Major
> Labels: mesosphere
> Fix For: 1.3.3, 1.4.2, 1.5.1, 1.6.0
>
>
> In the Docker executor, many calls later in the executor's lifecycle are 
> gated on an initial {{docker inspect}} call returning: 
> https://github.com/apache/mesos/blob/bc6b61bca37752689cffa40a14c53ad89f24e8fc/src/docker/executor.cpp#L223
> If that first call to {{docker inspect}} never returns, the executor becomes 
> stuck in a state where it makes no progress and cannot be killed.
> It's tempting for the executor to simply commit suicide after a timeout, but 
> we must be careful of the case in which the executor's Docker container is 
> actually running successfully, but the Docker daemon is unresponsive. In such 
> a case, we do not want to send TASK_FAILED or TASK_KILLED if the task's 
> container is running successfully.





[jira] [Commented] (MESOS-8739) Implement a test to check that a launched container can be killed.

2018-04-27 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16456526#comment-16456526
 ] 

Andrei Budnik commented on MESOS-8739:
--

Already covered by `SlaveTest.*`

> Implement a test to check that a launched container can be killed.
> --
>
> Key: MESOS-8739
> URL: https://issues.apache.org/jira/browse/MESOS-8739
> Project: Mesos
> Issue Type: Task
> Reporter: Andrei Budnik
> Priority: Major
> Labels: mesosphere, test
>
> This test launches a long-running task, then successively calls the `wait()` 
> and `destroy()` methods of the composing containerizer. Both termination 
> statuses must be equal.





[jira] [Commented] (MESOS-8738) Implement a test to check that a recovered container can be killed.

2018-04-27 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16456518#comment-16456518
 ] 

Andrei Budnik commented on MESOS-8738:
--

This test case is already covered by `SlaveRecoveryTest.KillTask`.

> Implement a test to check that a recovered container can be killed.
> ---
>
> Key: MESOS-8738
> URL: https://issues.apache.org/jira/browse/MESOS-8738
> Project: Mesos
> Issue Type: Task
> Reporter: Andrei Budnik
> Priority: Major
> Labels: mesosphere, test
>
> This test verifies that a recovered container can be killed via the 
> `destroy()` method of the composing containerizer.





[jira] [Commented] (MESOS-8809) Add functions for manipulating POSIX ACLs into stout

2018-04-27 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16456453#comment-16456453
 ] 

Qian Zhang commented on MESOS-8809:
---

commit 617d55e24a3bef7305b75c8fc6cbd1d1f14d7f6a
Author: Qian Zhang 
Date: Fri Apr 27 09:30:48 2018 +0800

Added `libacl` into a few Dockerfiles.
 
 This commit adds `libacl` into Dockerfiles for the images:
 1. mesos/mesos-build
 2. mesos/mesos-tidy
 3. mesos/mesos-mini
 
 Review: https://reviews.apache.org/r/66840

> Add functions for manipulating POSIX ACLs into stout
> 
>
> Key: MESOS-8809
> URL: https://issues.apache.org/jira/browse/MESOS-8809
> Project: Mesos
> Issue Type: Task
> Components: stout
> Reporter: Qian Zhang
> Assignee: Qian Zhang
> Priority: Major
>
> We need to add functions for setting/getting POSIX ACLs into stout so that we 
> can leverage these functions to grant volume permissions to the specific task 
> user.
> This will introduce a new dependency {{libacl-devel}} when building Mesos.


