[jira] [Commented] (MESOS-8618) ReconciliationTest.ReconcileStatusUpdateTaskState is flaky.

2018-04-30 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459442#comment-16459442
 ] 

Yan Xu commented on MESOS-8618:
---

{noformat}
commit 1c6d9e5e6d7439444c77d6c91b18642f69557dfe
Author: Jiang Yan Xu 
Date:   Mon Apr 23 14:59:44 2018 -0700

Fixed flaky ReconciliationTest.ReconcileStatusUpdateTaskState.

To simulate a master failover we need to use `replicated_log` as the
registry otherwise the master loses persisted info about the agents.

Review: https://reviews.apache.org/r/66769
{noformat}
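
For context, the registry switch in a test looks roughly like this (a minimal 
sketch using the standard Mesos test helpers such as {{CreateMasterFlags()}}; 
not the full patch from r/66769):

{code}
// With the default in-memory registry, a restarted master forgets which
// agents were registered, so a master failover cannot be simulated
// faithfully.
master::Flags masterFlags = CreateMasterFlags();
masterFlags.registry = "replicated_log";

Try<Owned<cluster::Master>> master = StartMaster(masterFlags);
ASSERT_SOME(master);

// ... later, stop and restart the master with the same flags so that the
// agent info persisted in the registry survives the simulated failover ...
{code}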

> ReconciliationTest.ReconcileStatusUpdateTaskState is flaky.
> ---
>
> Key: MESOS-8618
> URL: https://issues.apache.org/jira/browse/MESOS-8618
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: ec Debian 9 with SSL
>Reporter: Alexander Rukletsov
>Assignee: Yan Xu
>Priority: Major
>  Labels: flaky-test
> Fix For: 1.6.0
>
> Attachments: 
> ReconciliationTest.ReconcileStatusUpdateTaskState-badrun.txt
>
>
> {noformat}
> ../../src/tests/reconciliation_tests.cpp:1129
>   Expected: TASK_RUNNING
> To be equal to: update->state()
>   Which is: TASK_FINISHED
> {noformat}
> {noformat}
> ../../src/tests/reconciliation_tests.cpp:1130: Failure
>   Expected: TaskStatus::REASON_RECONCILIATION
>   Which is: 9
> To be equal to: update->reason()
>   Which is: 32
> {noformat}
> Full log attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8594) Mesos master crash (under load)

2018-04-30 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459348#comment-16459348
 ] 

Benjamin Mahler commented on MESOS-8594:


{noformat}
commit 8a639ca63bd8071245e270aecdda574aec6f8d3e
Author: Benjamin Mahler 
Date:   Sat Apr 28 18:28:39 2018 -0700

Reduced likelihood of a stack overflow in libprocess socket send path.

Currently, the socket send path is implemented using an asynchronous
loop with callbacks. Without using `process::loop`, this pattern is
prone to a stack overflow in the case that all asynchronous calls
complete synchronously. This is possible with sockets if the socket
is always ready for writing. Users have reported the crash in both
MESOS-8594 and MESOS-8834, so the stack overflow is encountered in
practice.

This patch updates the send path to leverage `process::loop`, which
is supposed to prevent stack overflows in asynchronous loops. However,
it is still possible for `process::loop` to stack overflow due to
MESOS-8852. In practice, I expect that even without MESOS-8852 fixed,
users won't see any stack overflows in the send path.

Review: https://reviews.apache.org/r/66863
{noformat}
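
For reference, the `process::loop` shape the patch moves to looks roughly like 
this (an illustrative sketch, not the code from r/66863; the message-queue 
plumbing and partial-send handling are elided):

{code}
#include <memory>
#include <queue>
#include <string>

#include <process/loop.hpp>
#include <process/socket.hpp>

using namespace process;

// A send implemented as a callback that re-invokes itself from its own
// `.then()` continuation grows the stack whenever the send completes
// synchronously. `loop` instead trampolines: each iteration returns
// control to the loop driver, so the stack does not grow.
Future<Nothing> send(
    network::inet::Socket socket,
    std::shared_ptr<std::queue<std::string>> messages)
{
  return loop(
      [=]() {
        // The front element is only popped in the body below, so the
        // buffer stays alive until this send completes.
        const std::string& data = messages->front();
        return socket.send(data.data(), data.size());
      },
      [=](size_t /*sent*/) -> Future<ControlFlow<Nothing>> {
        messages->pop();
        if (messages->empty()) {
          return Break();    // Everything flushed; complete the future.
        }
        return Continue();   // Next send without deepening the stack.
      });
}
{code}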

> Mesos master crash (under load)
> ---
>
> Key: MESOS-8594
> URL: https://issues.apache.org/jira/browse/MESOS-8594
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.3.2, 1.4.1, 1.5.0, 1.6.0
>Reporter: A. Dukhovniy
>Assignee: Benjamin Mahler
>Priority: Blocker
>  Labels: reliability
> Fix For: 1.6.0
>
> Attachments: lldb-bt.txt, lldb-di-f.txt, lldb-image-section.txt, 
> lldb-regiser-read.txt
>
>
> Mesos master crashes under load. Attached is some info from `lldb`:
> {code:java}
> Process 41933 resuming
> Process 41933 stopped
> * thread #10, stop reason = EXC_BAD_ACCESS (code=2, address=0x789ecff8)
> frame #0: 0x00010c30ddb6 libmesos-1.6.0.dylib`::_Some() at some.hpp:35
> 32 template <typename T>
> 33 struct _Some
> 34 {
> -> 35 _Some(T _t) : t(std::move(_t)) {}
> 36
> 37 T t;
> 38 };
> Target 0: (mesos-master) stopped.
> (lldb)
> {code}
> To quote [~abudnik]
> {quote}it’s the stack overflow bug in libprocess due to the way 
> `internal::send()` and `internal::_send()` are implemented in `process.cpp`
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8840) `cpu.cfs_quota_us` may be accidentally set for command task using docker during agent recovery.

2018-04-30 Thread Meng Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459062#comment-16459062
 ] 

Meng Zhu commented on MESOS-8840:
-

We probably also want to audit `create` and `update` to see if there are any 
other discrepancies between the two that would lead to accidental changes in 
cgroup settings.

> `cpu.cfs_quota_us` may be accidentally set for command task using docker 
> during agent recovery.
> ---
>
> Key: MESOS-8840
> URL: https://issues.apache.org/jira/browse/MESOS-8840
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.2.3
>Reporter: Meng Zhu
>Priority: Critical
>
> Prior to Mesos 1.3, the docker containerizer did not honor the 
> `--cgroups_enable_cfs` flag for command tasks when creating the container; a 
> patch ported this flag to the docker command executor only starting with 1.3 
> (MESOS-6134). However, the docker containerizer honors the flag when 
> updating containers:
> https://github.com/apache/mesos/blob/7559c9352c78912526820f6222ed2b17ad3b19cf/src/slave/containerizer/docker.cpp#L1726
> For non-command tasks, the docker containerizer always calls `update` on the 
> resources during launch:
> https://github.com/apache/mesos/blob/7559c9352c78912526820f6222ed2b17ad3b19cf/src/slave/containerizer/docker.cpp#L1325-L1330
> For command tasks, this is not the case:
> https://github.com/apache/mesos/blob/7559c9352c78912526820f6222ed2b17ad3b19cf/src/slave/containerizer/docker.cpp#L1271-L1277
> However, when recovering the executor, `update` is called for both command 
> and non-command tasks.
> This means that, for command tasks, the cpu cgroup CFS settings change when 
> a command executor is recovered. Specifically, recovered command executors 
> will have CFS set while all other command executors will not. This may lead 
> to a drastic change in container resource usage depending on the system 
> load.
> To maintain backward compatibility, we probably want to avoid setting the 
> `cpu.cfs_quota_us` field in `update` if the field is not already set.
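
A minimal sketch of that guard (illustrative only; this assumes the 
surrounding context of the docker containerizer's `update()` path and is not 
a reviewed patch):

{code}
// Read the current quota; the kernel reports "-1" when none was ever set.
Try<std::string> current =
  cgroups::read(hierarchy, cgroup, "cpu.cfs_quota_us");

if (current.isError()) {
  return Failure("Failed to read 'cpu.cfs_quota_us': " + current.error());
}

// Only rewrite the quota if one was set at launch; otherwise a recovered
// command executor would gain CFS throttling it never had before recovery.
if (strings::trim(current.get()) != "-1") {
  Try<Nothing> write = cgroups::cpu::cfs_quota_us(hierarchy, cgroup, quota);

  if (write.isError()) {
    return Failure("Failed to update 'cpu.cfs_quota_us': " + write.error());
  }
}
{code}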



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota

2018-04-30 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459045#comment-16459045
 ] 

James Peach commented on MESOS-6575:


| [r/66173|https://reviews.apache.org/r/66173/] | Added test for `disk/xfs` 
container limitation. |
| [r/66001|https://reviews.apache.org/r/66001/] | Added soft limit and kill to 
`disk/xfs`. |

> Change `disk/xfs` isolator to terminate executor when it exceeds quota
> --
>
> Key: MESOS-6575
> URL: https://issues.apache.org/jira/browse/MESOS-6575
> Project: Mesos
>  Issue Type: Task
>  Components: agent, containerization
>Reporter: Santhosh Kumar Shanmugham
>Assignee: James Peach
>Priority: Major
> Fix For: 1.6.0
>
>
> Unlike the {{disk/du}} isolator, which sends a {{ContainerLimitation}} 
> protobuf when the executor exceeds its quota, the {{disk/xfs}} isolator 
> relies on XFS's internal quota enforcement: the {{write}} operation that 
> would cause the quota limit to be exceeded silently fails, without 
> surfacing any information about the quota breach.
> This task is to change the {{disk/xfs}} isolator so that a 
> {{ContainerLimitation}} message is triggered when the quota is exceeded. 
> This feature will rely on the underlying filesystem being mounted with 
> {{pqnoenforce}} (accounting-only mode), so that XFS does not raise an 
> {{EDQUOT}} error on writes that cause the quota to be exceeded. The isolator 
> can then track disk usage via {{xfs_quota}} every 
> {{container_disk_watch_interval}}, much like {{disk/du}} uses {{du}}, and 
> surface the quota breach via a {{ContainerLimitation}} protobuf, causing the 
> executor to be terminated. This feature can then be turned on/off via the 
> existing {{enforce_container_disk_quota}} option.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-701) Improve webui performance for large clusters.

2018-04-30 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458948#comment-16458948
 ] 

Benjamin Mahler commented on MESOS-701:
---

[~qui] it sounds like you're referring to master scalability in the face of 
state polling, which I think we should treat as a separate issue from whether 
the UI can stay responsive for large clusters. For master scalability, the 
plan is to focus on MESOS-8345. For UI responsiveness, I think there are 
multiple angles of attack to look into (e.g. API filtering, pagination, a 
streaming API, or some combination of these).

> Improve webui performance for large clusters.
> -
>
> Key: MESOS-701
> URL: https://issues.apache.org/jira/browse/MESOS-701
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>Priority: Major
>  Labels: scalability
>
> For large clusters with tens of thousands of slaves, the webui is unusably 
> slow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8854) Building on JDK 9+ systems fails

2018-04-30 Thread Srdjan Grubor (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458712#comment-16458712
 ] 

Srdjan Grubor edited comment on MESOS-8854 at 4/30/18 7:33 PM:
---

-Just a sidenote that this doesn't address any issues (if any - I'm finding 
this out right now) that occur later in the process in those JDKs.-

With both patches, I am able to build Mesos on OpenJDK 11.


was (Author: sgnn7):
Just a sidenote that this doesn't address any issues (if any - I'm finding this 
out right now) that occur later in the process in those JDKs.

> Building on JDK 9+ systems fails
> 
>
> Key: MESOS-8854
> URL: https://issues.apache.org/jira/browse/MESOS-8854
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.5.0
> Environment: Ubuntu 18.04 LTS Docker image
>Reporter: Srdjan Grubor
>Priority: Minor
> Attachments: 0001-Added-support-for-OpenJDK9-in-configure.ac.patch, 
> 0001-Fix-building-on-latest-JDKs-by-converting-javah-to-j.patch
>
>
> JDK paths have changed in v9+ (there is no longer a `jre/` directory, nor an 
> arch directory, on the path to `libjvm.so`), so on those machines Mesos 
> fails to configure. Also, `javah` has been completely removed from JDK 10+.
>  
> Ubuntu 18.04 image with OpenJDK11 and OpenJDK8 installed:
> {code:java}
> /usr/lib/jvm# find . -name 'libjvm*'
> ./java-11-openjdk-amd64/lib/server/libjvm.so
> ./java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so{code}
>  
> I've attached a patch that seems to work in this setup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8855) Change TaskStatus.Reason's default value to something

2018-04-30 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8855:
-

 Summary: Change TaskStatus.Reason's default value to something 
 Key: MESOS-8855
 URL: https://issues.apache.org/jira/browse/MESOS-8855
 Project: Mesos
  Issue Type: Bug
Reporter: Yan Xu


We are constantly adding new task status reasons, and clients that don't 
recognize a new reason will see the default enum value instead. Right now the 
default value (the first value) is {{REASON_COMMAND_EXECUTOR_FAILED}}, and we 
should change it to something more legit. 

Also, [~jieyu] left this TODO:

{code}
enum Reason {
  // TODO(jieyu): The default value when a caller doesn't check for
  // presence is 0 and so ideally the 0 reason is not a valid one.
  // Since this is not used anywhere, consider removing this reason.
  REASON_COMMAND_EXECUTOR_FAILED = 0;
}
{code}

Note that Mesos already defines an unused {{REASON_TASK_UNKNOWN}}, and the 
fact that there's a task state {{TASK_UNKNOWN}} may influence the naming of 
the default enum field.
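
For illustration, one possible shape of the change under proto2 semantics 
(the sentinel name and tag number below are hypothetical, not from this 
ticket):

{code}
enum Reason {
  // In proto2 the default for an unset (or unrecognized) enum field is
  // the *first listed* value, so a sentinel can be listed first under a
  // fresh tag number without renumbering any existing reason.
  REASON_UNSET = 99;  // Hypothetical name and number.

  REASON_COMMAND_EXECUTOR_FAILED = 0;
  // ... existing reasons keep their tags ...
}
{code}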



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6285) Agents may OOM during recovery if there are too many tasks or executors

2018-04-30 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458823#comment-16458823
 ] 

Andrei Budnik commented on MESOS-6285:
--

Introducing a limit on the number of stored tasks per executor and/or 
framework in the garbage collector could solve the issue.
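
A minimal sketch of that idea (the types and the capacity source are 
assumptions, not existing agent code):

{code}
#include <boost/circular_buffer.hpp>

// Bounding the per-executor history means agent recovery replays at most
// `capacity` completed tasks instead of every task ever checkpointed.
struct CompletedTaskHistory
{
  // `capacity` would come from a new agent flag, e.g. a hypothetical
  // `max_completed_tasks_per_executor`.
  explicit CompletedTaskHistory(size_t capacity)
    : completedTasks(capacity) {}

  void record(const Task& task)
  {
    // Once at capacity, push_back evicts the oldest entry automatically.
    completedTasks.push_back(task);
  }

  boost::circular_buffer<Task> completedTasks;
};
{code}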

> Agents may OOM during recovery if there are too many tasks or executors
> ---
>
> Key: MESOS-6285
> URL: https://issues.apache.org/jira/browse/MESOS-6285
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
>Reporter: Joseph Wu
>Priority: Major
>  Labels: mesosphere
>
> On a test cluster, we encountered a degenerate case where running the 
> example {{long-lived-framework}} for over a week would render the agent 
> unrecoverable.  
> The {{long-lived-framework}} creates one custom {{long-lived-executor}} and 
> launches a single task on that executor every time it receives an offer from 
> that agent.  Over a week's worth of time, the framework manages to launch 
> some 400k tasks (short sleeps) on one executor.  During runtime, this is not 
> problematic, as each completed task is quickly rotated out of the agent's 
> memory (and checkpointed to disk).
> During recovery, however, the agent reads every single task into memory, 
> which leads to slow recovery; and often results in the agent being OOM-killed 
> before it finishes recovering.
> To repro this condition quickly:
> 1) Apply this patch to the {{long-lived-framework}}:
> {code}
> diff --git a/src/examples/long_lived_framework.cpp 
> b/src/examples/long_lived_framework.cpp
> index 7c57eb5..1263d82 100644
> --- a/src/examples/long_lived_framework.cpp
> +++ b/src/examples/long_lived_framework.cpp
> @@ -358,16 +358,6 @@ private:
>// Helper to launch a task using an offer.
>void launch(const Offer& offer)
>{
> -int taskId = tasksLaunched++;
> -++metrics.tasks_launched;
> -
> -TaskInfo task;
> -task.set_name("Task " + stringify(taskId));
> -task.mutable_task_id()->set_value(stringify(taskId));
> -task.mutable_agent_id()->MergeFrom(offer.agent_id());
> -task.mutable_resources()->CopyFrom(taskResources);
> -task.mutable_executor()->CopyFrom(executor);
> -
>  Call call;
>  call.set_type(Call::ACCEPT);
>  
> @@ -380,7 +370,23 @@ private:
>  Offer::Operation* operation = accept->add_operations();
>  operation->set_type(Offer::Operation::LAUNCH);
>  
> -operation->mutable_launch()->add_task_infos()->CopyFrom(task);
> +// Launch as many tasks as possible in the given offer.
> +Resources remaining = Resources(offer.resources()).flatten();
> +while (remaining.contains(taskResources)) {
> +  int taskId = tasksLaunched++;
> +  ++metrics.tasks_launched;
> +
> +  TaskInfo task;
> +  task.set_name("Task " + stringify(taskId));
> +  task.mutable_task_id()->set_value(stringify(taskId));
> +  task.mutable_agent_id()->MergeFrom(offer.agent_id());
> +  task.mutable_resources()->CopyFrom(taskResources);
> +  task.mutable_executor()->CopyFrom(executor);
> +
> +  operation->mutable_launch()->add_task_infos()->CopyFrom(task);
> +
> +  remaining -= taskResources;
> +}
>  
>  mesos->send(call);
>}
> {code}
> 2) Run a master, agent, and {{long-lived-framework}}.  On a 1 CPU, 1 GB agent 
> + this patch, it should take about 10 minutes to build up sufficient task 
> launches.
> 3) Restart the agent and watch it flail during recovery.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8854) Building on JDK 9+ systems fails in configure step

2018-04-30 Thread Srdjan Grubor (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458712#comment-16458712
 ] 

Srdjan Grubor commented on MESOS-8854:
--

Just a sidenote that this doesn't address any issues (if any - I'm finding this 
out right now) that occur later in the process in those JDKs.

> Building on JDK 9+ systems fails in configure step
> --
>
> Key: MESOS-8854
> URL: https://issues.apache.org/jira/browse/MESOS-8854
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.5.0
> Environment: Ubuntu 18.04 LTS Docker image
>Reporter: Srdjan Grubor
>Priority: Minor
> Attachments: 0001-Added-support-for-OpenJDK9-in-configure.ac.patch
>
>
> JDK paths have changed in v9+ (there is no longer a `jre/` directory, nor an 
> arch directory, on the path to `libjvm.so`), so on those machines Mesos 
> fails to configure.
>  
> Ubuntu 18.04 image with OpenJDK11 and OpenJDK8 installed:
> {code:java}
> /usr/lib/jvm# find . -name 'libjvm*'
> ./java-11-openjdk-amd64/lib/server/libjvm.so
> ./java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so{code}
>  
> I've attached a patch that seems to work in this setup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8854) Building on OpenJDK 9+ systems fails in configure step

2018-04-30 Thread Srdjan Grubor (JIRA)
Srdjan Grubor created MESOS-8854:


 Summary: Building on OpenJDK 9+ systems fails in configure step
 Key: MESOS-8854
 URL: https://issues.apache.org/jira/browse/MESOS-8854
 Project: Mesos
  Issue Type: Bug
  Components: build
Affects Versions: 1.5.0
 Environment: Ubuntu 18.04 LTS Docker image
Reporter: Srdjan Grubor
 Attachments: 0001-Added-support-for-OpenJDK9-in-configure.ac.patch

OpenJDK paths have changed in v9+ (there is no longer a `jre/` directory, nor 
an arch directory, on the path to `libjvm.so`), so on those machines Mesos 
fails to configure.

 

Ubuntu 18.04 image with OpenJDK11 and OpenJDK8 installed:
{code:java}
/usr/lib/jvm# find . -name 'libjvm*'
./java-11-openjdk-amd64/lib/server/libjvm.so
./java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so{code}
 

I've attached a patch that seems to work in this setup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)