[jira] [Commented] (MESOS-8618) ReconciliationTest.ReconcileStatusUpdateTaskState is flaky.
[ https://issues.apache.org/jira/browse/MESOS-8618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459442#comment-16459442 ] Yan Xu commented on MESOS-8618: --- {noformat:title=} commit 1c6d9e5e6d7439444c77d6c91b18642f69557dfe Author: Jiang Yan Xu Date: Mon Apr 23 14:59:44 2018 -0700 Fixed flaky ReconciliationTest.ReconcileStatusUpdateTaskState. To simulate a master failover we need to use `replicated_log` as the registry otherwise the master loses persisted info about the agents. Review: https://reviews.apache.org/r/66769 {noformat} > ReconciliationTest.ReconcileStatusUpdateTaskState is flaky. > --- > > Key: MESOS-8618 > URL: https://issues.apache.org/jira/browse/MESOS-8618 > Project: Mesos > Issue Type: Bug > Components: test > Environment: ec Debian 9 with SSL >Reporter: Alexander Rukletsov >Assignee: Yan Xu >Priority: Major > Labels: flaky-test > Fix For: 1.6.0 > > Attachments: > ReconciliationTest.ReconcileStatusUpdateTaskState-badrun.txt > > > {noformat} > ../../src/tests/reconciliation_tests.cpp:1129 > Expected: TASK_RUNNING > To be equal to: update->state() > Which is: TASK_FINISHED > {noformat} > {noformat} > ../../src/tests/reconciliation_tests.cpp:1130: Failure > Expected: TaskStatus::REASON_RECONCILIATION > Which is: 9 > To be equal to: update->reason() > Which is: 32 > {noformat} > Full log attached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8594) Mesos master crash (under load)
[ https://issues.apache.org/jira/browse/MESOS-8594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459348#comment-16459348 ] Benjamin Mahler commented on MESOS-8594: {noformat} commit 8a639ca63bd8071245e270aecdda574aec6f8d3e Author: Benjamin Mahler Date: Sat Apr 28 18:28:39 2018 -0700 Reduced likelihood of a stack overflow in libprocess socket send path. Currently, the socket send path is implemented using an asynchronous loop with callbacks. Without using `process::loop`, this pattern is prone to a stack overflow in the case that all asynchronous calls complete synchronously. This is possible with sockets if the socket is always ready for writing. Users have reported the crash in both MESOS-8594 and MESOS-8834, so the stack overflow is encountered in practice. This patch updates the send path to leverage `process::loop`, which is supposed to prevent stack overflows in asynchronous loops. However, it is still possible for `process::loop` to stack overflow due to MESOS-8852. In practice, I expect that even without MESOS-8852 fixed, users won't see any stack overflows in the send path. Review: https://reviews.apache.org/r/66863 {noformat} > Mesos master crash (under load) > --- > > Key: MESOS-8594 > URL: https://issues.apache.org/jira/browse/MESOS-8594 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.3.2, 1.4.1, 1.5.0, 1.6.0 >Reporter: A. Dukhovniy >Assignee: Benjamin Mahler >Priority: Blocker > Labels: reliability > Fix For: 1.6.0 > > Attachments: lldb-bt.txt, lldb-di-f.txt, lldb-image-section.txt, > lldb-regiser-read.txt > > > Mesos master crashes under load. 
Attached is some info from `lldb`: > {code:java} > Process 41933 resuming > Process 41933 stopped > * thread #10, stop reason = EXC_BAD_ACCESS (code=2, address=0x789ecff8) > frame #0: 0x00010c30ddb6 libmesos-1.6.0.dylib`::_Some() at some.hpp:35 > 32 template > 33 struct _Some > 34 { > -> 35 _Some(T _t) : t(std::move(_t)) {} > 36 > 37 T t; > 38 }; > Target 0: (mesos-master) stopped. > (lldb) > {code} > To quote [~abudnik] > {quote}it’s the stack overflow bug in libprocess due to the way > `internal::send()` and `internal::_send()` are implemented in `process.cpp` > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
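The callback-style loop and the `process::loop` style described in the commit above can be contrasted with a minimal, self-contained sketch. This is generic C++, not the actual libprocess `internal::send()` code; `Step`, `sendChunk`, and `run` are illustrative names. In the callback style, a "send" that completes synchronously invokes the next send directly, nesting one stack frame per chunk; the loop style instead returns the next step to a driver that runs it iteratively, keeping the stack flat.

```cpp
#include <functional>
#include <optional>

// One iteration of an asynchronous loop either finishes with a result or
// hands back a continuation to run next. A driver that executes
// continuations iteratively (the `process::loop` idea) keeps constant
// stack depth even when every step "completes synchronously".
struct Step {
  std::optional<long> done;    // set when the loop has finished
  std::function<Step()> next;  // set when another iteration follows
};

// One "send" iteration: pretend the socket is always ready, so every
// write succeeds immediately (the pathological case from the bug report).
Step sendChunk(long remaining, long sent) {
  if (remaining == 0) {
    return Step{sent, nullptr};
  }
  return Step{std::nullopt,
              [=]() { return sendChunk(remaining - 1, sent + 1); }};
}

// The driver: each continuation returns to this loop instead of calling
// the next step directly, so the stack never grows.
long run(Step step) {
  while (!step.done) {
    step = step.next();
  }
  return *step.done;
}
```

Had `sendChunk` invoked itself directly in the synchronous-completion case, a few hundred thousand ready-to-write iterations would overflow the stack; the driver handles them at constant depth.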
[jira] [Commented] (MESOS-8840) `cpu.cfs_quota_us` may be accidentally set for command task using docker during agent recovery.
[ https://issues.apache.org/jira/browse/MESOS-8840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459062#comment-16459062 ] Meng Zhu commented on MESOS-8840: - We probably also want to audit `create` and `update` to see if there are any other discrepancies between the two that would lead to accidental changes in cgroup settings. > `cpu.cfs_quota_us` may be accidentally set for command task using docker > during agent recovery. > --- > > Key: MESOS-8840 > URL: https://issues.apache.org/jira/browse/MESOS-8840 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.2.3 >Reporter: Meng Zhu >Priority: Critical > > Prior to Mesos 1.3, the docker containerizer did not honor the flag > `--cgroups_enable_cfs` for command tasks when creating the container; a patch > (MESOS-6134) ported this flag to the docker command executor, but only up to 1.3. > However, the docker containerizer honors the flag when updating containers: > https://github.com/apache/mesos/blob/7559c9352c78912526820f6222ed2b17ad3b19cf/src/slave/containerizer/docker.cpp#L1726 > For non-command tasks, the docker containerizer always calls `update` on the resources > during launch: > https://github.com/apache/mesos/blob/7559c9352c78912526820f6222ed2b17ad3b19cf/src/slave/containerizer/docker.cpp#L1325-L1330 > For command tasks, this is not the case: > https://github.com/apache/mesos/blob/7559c9352c78912526820f6222ed2b17ad3b19cf/src/slave/containerizer/docker.cpp#L1271-L1277 > However, when recovering the executor, `update` is called for both command > and non-command tasks. > This means that for a command task, the cpu cgroup CFS settings change when a > command executor is recovered. Specifically, recovered command executors will > have CFS set while all other command executors will not. This may lead to a > drastic change in container resource usage depending on the system load.
> To maintain backward compatibility, we probably want to avoid setting the > `cpu.cfs_quota_us` field in `update` if the field is not already set. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
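A minimal sketch of the backward-compatible guard proposed above. The types and names here are illustrative, not the docker containerizer's actual API: the point is simply that `update` leaves `cpu.cfs_quota_us` alone unless it was already set when the container was created, so a recovered command executor keeps its original (CFS-off) behavior.

```cpp
#include <optional>

// Hypothetical model of a container's cpu cgroup state.
struct CpuCgroup {
  std::optional<long long> cfsQuotaUs;  // nullopt == cpu.cfs_quota_us unset
};

// Proposed guard: skip the quota write during `update` when the field was
// never set, even if the CFS flag is honored on this code path. This is
// the backward-compatibility behavior suggested in the issue.
void update(CpuCgroup& cgroup, long long newQuotaUs) {
  if (!cgroup.cfsQuotaUs.has_value()) {
    return;  // created without CFS: recovery must not enable it
  }
  cgroup.cfsQuotaUs = newQuotaUs;
}
```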
[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459045#comment-16459045 ] James Peach commented on MESOS-6575: | [/r/66173|https://reviews.apache.org/r/66173/] | Added test for `disk/xfs` container limitation. | | [/r/66001|https://reviews.apache.org/r/66001/] | Added soft limit and kill to `disk/xfs`. | > Change `disk/xfs` isolator to terminate executor when it exceeds quota > -- > > Key: MESOS-6575 > URL: https://issues.apache.org/jira/browse/MESOS-6575 > Project: Mesos > Issue Type: Task > Components: agent, containerization >Reporter: Santhosh Kumar Shanmugham >Assignee: James Peach >Priority: Major > Fix For: 1.6.0 > > > Unlike the {{disk/du}} isolator, which sends a {{ContainerLimitation}} protobuf > when the executor exceeds the quota, the {{disk/xfs}} isolator, which relies on > XFS's internal quota enforcement, silently fails the {{write}} operation > that causes the quota limit to be exceeded, without surfacing the quota > breach information. > This task is to change the `disk/xfs` isolator so that a > {{ContainerLimitation}} message is triggered when the quota is exceeded. > This feature will rely on the underlying filesystem being mounted with > {{pqnoenforce}} (accounting-only mode), so that XFS does not silently cause > an {{EDQUOT}} error on writes that cause the quota to be exceeded. The > isolator can then track the disk quota via {{xfs_quota}}, very much like > {{disk/du}} uses {{du}}, every {{container_disk_watch_interval}}, and surface > the disk-quota-exceeded event via a {{ContainerLimitation}} protobuf, > causing the executor to be terminated. This feature can then be turned on/off > via the existing {{enforce_container_disk_quota}} option.
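The polling model described in the issue above can be sketched as follows (`Limitation` and `checkQuota` are hypothetical names, not the isolator's real interface): with quotas mounted accounting-only via {{pqnoenforce}}, the isolator compares accounted usage to the quota each {{container_disk_watch_interval}} and surfaces a limitation itself, instead of letting XFS fail the write with {{EDQUOT}}.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Stand-in for the ContainerLimitation message the isolator would send.
struct Limitation {
  std::string message;
};

// One polling pass: usage comes from the accounting-only XFS quota
// (e.g. via xfs_quota), much like disk/du derives it from du.
std::optional<Limitation> checkQuota(uint64_t usedBytes, uint64_t quotaBytes) {
  if (usedBytes > quotaBytes) {
    return Limitation{
        "disk usage " + std::to_string(usedBytes) +
        " bytes exceeded quota of " + std::to_string(quotaBytes) + " bytes"};
  }
  return std::nullopt;  // under quota: nothing to surface this interval
}
```

Returning a limitation (rather than enforcing at write time) is what lets the agent terminate the executor with a clear quota-breach reason.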
[jira] [Commented] (MESOS-701) Improve webui performance for large clusters.
[ https://issues.apache.org/jira/browse/MESOS-701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458948#comment-16458948 ] Benjamin Mahler commented on MESOS-701: --- [~qui] it sounds like you're referring to the master scalability in the face of state polling, which I think we should treat as a separate issue from whether the ui can stay responsive for large clusters. For master scalability, the plan is to focus on MESOS-8345. For ui responsiveness, I think there are multiple angles of attack to look into (e.g. api filtering, pagination, streaming api, or some combinations of these, etc). > Improve webui performance for large clusters. > - > > Key: MESOS-701 > URL: https://issues.apache.org/jira/browse/MESOS-701 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Mahler >Priority: Major > Labels: scalability > > For large clusters with tens of thousands of slaves, the webui is unusably > slow. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-8854) Building on JDK 9+ systems fails
[ https://issues.apache.org/jira/browse/MESOS-8854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458712#comment-16458712 ] Srdjan Grubor edited comment on MESOS-8854 at 4/30/18 7:33 PM: --- -Just a sidenote that this doesn't address any issues (if any - I'm finding this out right now) that occur later in the process on those JDKs.- With both patches, I am able to build Mesos on OpenJDK11. was (Author: sgnn7): Just a sidenote that this doesn't address any issues (if any - I'm finding this out right now) that occur later in the process on those JDKs. > Building on JDK 9+ systems fails > > > Key: MESOS-8854 > URL: https://issues.apache.org/jira/browse/MESOS-8854 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.5.0 > Environment: Ubuntu 18.04 LTS Docker image >Reporter: Srdjan Grubor >Priority: Minor > Attachments: 0001-Added-support-for-OpenJDK9-in-configure.ac.patch, > 0001-Fix-building-on-latest-JDKs-by-converting-javah-to-j.patch > > > JDK paths have changed in v9+ (there is no longer a `jre/` directory or an arch > directory leading to `libjvm.so`), so on those machines Mesos fails to configure. > Also, `javah` has been completely removed from JDK 10+. > > Ubuntu 18.04 image with OpenJDK11 and OpenJDK8 installed: > {code:java} > /usr/lib/jvm# find . -name 'libjvm*' > ./java-11-openjdk-amd64/lib/server/libjvm.so > ./java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so{code} > > I've attached a patch that seems to work in this setup. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
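The layout change described above can be illustrated with a small probe. `findLibjvm` is a hypothetical helper, not what the attached configure.ac patch actually does: it checks the JDK 9+ `lib/server` location first, then falls back to the pre-9 `jre/lib/<arch>/server` location (amd64 assumed here).

```cpp
#include <filesystem>
#include <optional>
#include <string>
#include <vector>

// Probe both JDK layouts for libjvm.so under a given JAVA_HOME.
std::optional<std::string> findLibjvm(const std::string& javaHome) {
  const std::vector<std::string> candidates = {
      javaHome + "/lib/server/libjvm.so",           // JDK 9 and later
      javaHome + "/jre/lib/amd64/server/libjvm.so"  // JDK 8 and earlier
  };
  for (const std::string& candidate : candidates) {
    if (std::filesystem::exists(candidate)) {
      return candidate;
    }
  }
  return std::nullopt;  // neither layout found: configure would fail here
}
```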
[jira] [Created] (MESOS-8855) Change TaskStatus.Reason's default value to something
Yan Xu created MESOS-8855: - Summary: Change TaskStatus.Reason's default value to something Key: MESOS-8855 URL: https://issues.apache.org/jira/browse/MESOS-8855 Project: Mesos Issue Type: Bug Reporter: Yan Xu We are constantly adding new task reasons, and they'll decode to the default enum value on clients that don't recognize them. Right now the default value (the first value) is {{REASON_COMMAND_EXECUTOR_FAILED}}, and we should change it to something more legitimate. Also, [~jieyu] has this TODO: {code:title=} enum Reason { // TODO(jieyu): The default value when a caller doesn't check for // presence is 0 and so ideally the 0 reason is not a valid one. // Since this is not used anywhere, consider removing this reason. REASON_COMMAND_EXECUTOR_FAILED = 0; } {code} Note that Mesos already defines a used {{REASON_TASK_UNKNOWN}}, and the fact that there's a task state {{TASK_UNKNOWN}} may influence the naming of the default enum field. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
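Why the first enum value matters can be shown with a toy decoder. This models the client behavior described above, not protobuf's actual implementation, and the enum values besides {{REASON_COMMAND_EXECUTOR_FAILED}} and {{REASON_RECONCILIATION}} are illustrative: a client built against an older schema maps any unrecognized reason to the default value, so the default should be a neutral placeholder rather than a real failure reason.

```cpp
// Toy model of an old client's view of the Reason enum.
enum class Reason {
  REASON_DEFAULT = 0,  // whatever occupies value 0 becomes the fallback
  REASON_COMMAND_EXECUTOR_FAILED = 1,
  REASON_RECONCILIATION = 9,
};

// An old client only knows the values above; any newer reason added to the
// schema after this client was built falls through to the default.
Reason decode(int wireValue) {
  switch (wireValue) {
    case 1: return Reason::REASON_COMMAND_EXECUTOR_FAILED;
    case 9: return Reason::REASON_RECONCILIATION;
    default: return Reason::REASON_DEFAULT;  // unrecognized new reason
  }
}
```

If value 0 were a real reason like {{REASON_COMMAND_EXECUTOR_FAILED}}, every unrecognized new reason would masquerade as an executor failure on old clients.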
[jira] [Commented] (MESOS-6285) Agents may OOM during recovery if there are too many tasks or executors
[ https://issues.apache.org/jira/browse/MESOS-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458823#comment-16458823 ] Andrei Budnik commented on MESOS-6285: -- Introducing a limit for the number of stored tasks per executor and/or framework in the garbage collector can solve the issue. > Agents may OOM during recovery if there are too many tasks or executors > --- > > Key: MESOS-6285 > URL: https://issues.apache.org/jira/browse/MESOS-6285 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.1 >Reporter: Joseph Wu >Priority: Major > Labels: mesosphere > > On a test cluster, we encountered a degenerate case where running the > example {{long-lived-framework}} for over a week would render the agent > unrecoverable. > The {{long-lived-framework}} creates one custom {{long-lived-executor}} and > launches a single task on that executor every time it receives an offer from > that agent. Over a week's worth of time, the framework manages to launch > some 400k tasks (short sleeps) on one executor. During runtime, this is not > problematic, as each completed task is quickly rotated out of the agent's > memory (and checkpointed to disk). > During recovery, however, the agent reads every single task into memory, > which leads to slow recovery and often results in the agent being OOM-killed > before it finishes recovering. > To repro this condition quickly: > 1) Apply this patch to the {{long-lived-framework}}: > {code} > diff --git a/src/examples/long_lived_framework.cpp > b/src/examples/long_lived_framework.cpp > index 7c57eb5..1263d82 100644 > --- a/src/examples/long_lived_framework.cpp > +++ b/src/examples/long_lived_framework.cpp > @@ -358,16 +358,6 @@ private: >// Helper to launch a task using an offer.
>void launch(const Offer& offer) >{ > -int taskId = tasksLaunched++; > -++metrics.tasks_launched; > - > -TaskInfo task; > -task.set_name("Task " + stringify(taskId)); > -task.mutable_task_id()->set_value(stringify(taskId)); > -task.mutable_agent_id()->MergeFrom(offer.agent_id()); > -task.mutable_resources()->CopyFrom(taskResources); > -task.mutable_executor()->CopyFrom(executor); > - > Call call; > call.set_type(Call::ACCEPT); > > @@ -380,7 +370,23 @@ private: > Offer::Operation* operation = accept->add_operations(); > operation->set_type(Offer::Operation::LAUNCH); > > -operation->mutable_launch()->add_task_infos()->CopyFrom(task); > +// Launch as many tasks as possible in the given offer. > +Resources remaining = Resources(offer.resources()).flatten(); > +while (remaining.contains(taskResources)) { > + int taskId = tasksLaunched++; > + ++metrics.tasks_launched; > + > + TaskInfo task; > + task.set_name("Task " + stringify(taskId)); > + task.mutable_task_id()->set_value(stringify(taskId)); > + task.mutable_agent_id()->MergeFrom(offer.agent_id()); > + task.mutable_resources()->CopyFrom(taskResources); > + task.mutable_executor()->CopyFrom(executor); > + > + operation->mutable_launch()->add_task_infos()->CopyFrom(task); > + > + remaining -= taskResources; > +} > > mesos->send(call); >} > {code} > 2) Run a master, agent, and {{long-lived-framework}}. On a 1 CPU, 1 GB agent > + this patch, it should take about 10 minutes to build up sufficient task > launches. > 3) Restart the agent and watch it flail during recovery. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
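The mitigation suggested in the comment above can be sketched as a bounded per-executor history (illustrative names, not the agent's actual data structures): cap how many completed tasks are retained, evicting the oldest, so recovery never has to load an unbounded history into memory.

```cpp
#include <cstddef>
#include <deque>
#include <utility>

// Keeps at most `capacity` completed tasks, dropping the oldest first.
// Assumes capacity > 0. With this bound in place, even 400k launches on
// one executor leave only a fixed amount of state to recover.
template <typename Task>
class BoundedTaskHistory {
 public:
  explicit BoundedTaskHistory(size_t capacity) : capacity_(capacity) {}

  void add(Task task) {
    if (tasks_.size() == capacity_) {
      tasks_.pop_front();  // rotate the oldest completed task out
    }
    tasks_.push_back(std::move(task));
  }

  size_t size() const { return tasks_.size(); }

 private:
  size_t capacity_;
  std::deque<Task> tasks_;
};
```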
[jira] [Commented] (MESOS-8854) Building on JDK 9+ systems fails in configure step
[ https://issues.apache.org/jira/browse/MESOS-8854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458712#comment-16458712 ] Srdjan Grubor commented on MESOS-8854: -- Just a sidenote that this doesn't address any issues (if any - I'm finding this out right now) that occur later in the process on those JDKs. > Building on JDK 9+ systems fails in configure step > -- > > Key: MESOS-8854 > URL: https://issues.apache.org/jira/browse/MESOS-8854 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.5.0 > Environment: Ubuntu 18.04 LTS Docker image >Reporter: Srdjan Grubor >Priority: Minor > Attachments: 0001-Added-support-for-OpenJDK9-in-configure.ac.patch > > > JDK paths have changed in v9+ (there is no longer a `jre/` directory or an arch > directory leading to `libjvm.so`), so on those machines Mesos fails to configure. > > Ubuntu 18.04 image with OpenJDK11 and OpenJDK8 installed: > {code:java} > /usr/lib/jvm# find . -name 'libjvm*' > ./java-11-openjdk-amd64/lib/server/libjvm.so > ./java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so{code} > > I've attached a patch that seems to work in this setup. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8854) Building on OpenJDK 9+ systems fails in configure step
Srdjan Grubor created MESOS-8854: Summary: Building on OpenJDK 9+ systems fails in configure step Key: MESOS-8854 URL: https://issues.apache.org/jira/browse/MESOS-8854 Project: Mesos Issue Type: Bug Components: build Affects Versions: 1.5.0 Environment: Ubuntu 18.04 LTS Docker image Reporter: Srdjan Grubor Attachments: 0001-Added-support-for-OpenJDK9-in-configure.ac.patch OpenJDK paths have changed in v9+ (there is no longer a `jre/` directory or an arch directory leading to `libjvm.so`), so on those machines Mesos fails to configure. Ubuntu 18.04 image with OpenJDK11 and OpenJDK8 installed: {code:java} /usr/lib/jvm# find . -name 'libjvm*' ./java-11-openjdk-amd64/lib/server/libjvm.so ./java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so{code} I've attached a patch that seems to work in this setup. -- This message was sent by Atlassian JIRA (v7.6.3#76005)