[jira] [Commented] (MESOS-7813) when lxc run after a period of time, the file(/proc/pid/cgroup) is modified, devices,blkio,memory,cpuacct is changed. why?
[ https://issues.apache.org/jira/browse/MESOS-7813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100056#comment-16100056 ] Joris Van Remoortere commented on MESOS-7813: - [~y123456yz] take a look at this comment and the surrounding code in the systemd cgroup code base: https://github.com/systemd/systemd/blob/52b1478414067eb9381b413408f920da7f162c6f/src/core/cgroup.c#L1345-L1348 > when lxc run after a period of time, the file(/proc/pid/cgroup) is modified, > devices,blkio,memory,cpuacct is changed. why? > -- > > Key: MESOS-7813 > URL: https://issues.apache.org/jira/browse/MESOS-7813 > Project: Mesos > Issue Type: Bug > Components: agent, cgroups, executor, framework > Environment: 1 SMP Wed Apr 12 15:04:24 UTC 2017 x86_64 x86_64 x86_64 > GNU/Linux >Reporter: y123456yz > > when lxc run after a period of time, the file(/proc/pid/cgroup) is modified, > devices,blkio,memory,cpuacct is changed. why? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-3009) Reproduce systemd cgroup behavior
[ https://issues.apache.org/jira/browse/MESOS-3009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100050#comment-16100050 ] Joris Van Remoortere commented on MESOS-3009: - from: [http://man7.org/linux/man-pages/man5/systemd.resource-control.5.html] {quote}Turns on delegation of further resource control partitioning to processes of the unit. For unprivileged services (i.e. those using the User= setting), this allows processes to create a subhierarchy beneath its control group path. For privileged services and scopes, this ensures the processes will have all control group controllers enabled.{quote} Systemd has started implementing the Linux kernel goal of making the cgroup file hierarchy read-only. Sometimes it rebalances the cgroup hierarchy. If there are settings (files) in there that it did not create, it may delete them when a rebalancing event occurs. One way to prevent this is to notify systemd that you want to control the subhierarchy for your specific systemd unit. > Reproduce systemd cgroup behavior > -- > > Key: MESOS-3009 > URL: https://issues.apache.org/jira/browse/MESOS-3009 > Project: Mesos > Issue Type: Task >Reporter: Artem Harutyunyan >Assignee: Joris Van Remoortere > Labels: mesosphere > > It has been noticed before that systemd reorganizes the cgroup hierarchy created > by mesos slave. Because of this mesos is no longer able to find the cgroup, > and there is also a chance of undoing the isolation that mesos slave puts in > place. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
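The mitigation described in the comment above can be sketched as a systemd service unit. A minimal example, assuming the agent runs as a systemd service (the unit name, binary path, and flags below are illustrative placeholders, not the DC/OS configuration):

```ini
# /etc/systemd/system/mesos-slave.service -- illustrative unit
[Unit]
Description=Mesos Agent

[Service]
# Path and flags below are placeholders for the real agent invocation.
ExecStart=/usr/sbin/mesos-slave --master=zk://localhost:2181/mesos
# Delegate=yes tells systemd that this unit manages its own cgroup
# subhierarchy, so a rebalancing event will not remove cgroups the
# agent created beneath its control group path.
Delegate=yes
# Kill only the main process on stop, rather than the whole control
# group (which would take down executors in the subhierarchy).
KillMode=process

[Install]
WantedBy=multi-user.target
```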
[jira] [Commented] (MESOS-7813) when lxc run after a period of time, the file(/proc/pid/cgroup) is modified, devices,blkio,memory,cpuacct is changed. why?
[ https://issues.apache.org/jira/browse/MESOS-7813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094733#comment-16094733 ] Joris Van Remoortere commented on MESOS-7813: - [~y123456yz] here is an example of the systemd configuration in DC/OS https://github.com/dcos/dcos/blob/18c76a2b4b24aab0c4107bae9c7191a68e6de174/packages/mesos/extra/dcos-mesos-slave.service > when lxc run after a period of time, the file(/proc/pid/cgroup) is modified, > devices,blkio,memory,cpuacct is changed. why? > -- > > Key: MESOS-7813 > URL: https://issues.apache.org/jira/browse/MESOS-7813 > Project: Mesos > Issue Type: Bug > Components: agent, cgroups, executor, framework > Environment: 1 SMP Wed Apr 12 15:04:24 UTC 2017 x86_64 x86_64 x86_64 > GNU/Linux >Reporter: y123456yz > > when lxc run after a period of time, the file(/proc/pid/cgroup) is modified, > devices,blkio,memory,cpuacct is changed. why? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7813) when lxc run after a period of time, the file(/proc/pid/cgroup) is modified, devices,blkio,memory,cpuacct is changed. why?
[ https://issues.apache.org/jira/browse/MESOS-7813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094144#comment-16094144 ] Joris Van Remoortere commented on MESOS-7813: - [~y123456yz] Check out the {{Delegate}} flag in systemd. Here is an explanation of the problem: https://issues.apache.org/jira/browse/MESOS-3425 > when lxc run after a period of time, the file(/proc/pid/cgroup) is modified, > devices,blkio,memory,cpuacct is changed. why? > -- > > Key: MESOS-7813 > URL: https://issues.apache.org/jira/browse/MESOS-7813 > Project: Mesos > Issue Type: Bug > Components: agent, cgroups, executor, framework > Environment: 1 SMP Wed Apr 12 15:04:24 UTC 2017 x86_64 x86_64 x86_64 > GNU/Linux >Reporter: y123456yz > > when lxc run after a period of time, the file(/proc/pid/cgroup) is modified, > devices,blkio,memory,cpuacct is changed. why? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-6828) Consider ways for frameworks to ignore offers with an Unavailability
[ https://issues.apache.org/jira/browse/MESOS-6828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949915#comment-15949915 ] Joris Van Remoortere commented on MESOS-6828: - Based on some offline discussion I want to suggest that the least dangerous solution (in my opinion) is to have frameworks prefer offers with the longest availability by default. Aurora is a good example of a framework that collects offers and has the ability to express a preference while iterating the offers to match a task to launch. Preferring offers with no (or longest in the future) unavailability will naturally tend new tasks away from machines that will be entering maintenance. A benefit of this approach is that the agents in the schedule will still be used if there is demand pressure for resources by the framework. > Consider ways for frameworks to ignore offers with an Unavailability > > > Key: MESOS-6828 > URL: https://issues.apache.org/jira/browse/MESOS-6828 > Project: Mesos > Issue Type: Improvement >Reporter: Joris Van Remoortere >Assignee: Artem Harutyunyan > Labels: maintenance > > Due to the opt-in nature of maintenance primitives in Mesos, there is a > deficiency for cluster administrators when frameworks have not opted in. > An example case: > - Cluster with reasonable churn (tasks terminate naturally) > - Operator specifies maintenance schedule > Ideally *even* in a world where none of the frameworks had opted in to > maintenance primitives the operator would have some way of preventing > frameworks from scheduling further work on agents in the schedule. The > natural termination of the tasks in the cluster would allow the nodes to > drain gracefully and the operator to then perform maintenance. > 2 options that have been discussed so far: > # Provide a capability for frameworks to automatically filter offers with an > {{Unavailability}} set. > #* Pro: Finer grained control. Allows other frameworks to keep scheduling > short lived tasks that can complete before the Unavailability. > #* Con: All frameworks have to be updated. Consider making this an > environment variable to the scheduler driver for legacy frameworks. > # Provide a flag on the master to filter all offers with an > {{Unavailability}} set. > #* Pro: Immediately actionable / usable. > #* Con: Coarse grained. Some frameworks may suffer efficiency. > #* Con: *Dangerous*: planning out a multi-day maintenance schedule for an > entire cluster will prevent any frameworks from scheduling further work, > potentially stalling the cluster. > Action Items: Provide further context for each option and consider others. We > need to ensure we have something immediately consumable by users to fill the > gap until maintenance primitives are the norm. We also need to ensure we > prevent dangerous scenarios like the Con listed for option #2. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-6484) Memory leak in `Future::after()`
[ https://issues.apache.org/jira/browse/MESOS-6484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-6484: Shepherd: Joris Van Remoortere Sprint: Mesosphere Sprint 48 > Memory leak in `Future::after()` > --- > > Key: MESOS-6484 > URL: https://issues.apache.org/jira/browse/MESOS-6484 > Project: Mesos > Issue Type: Bug > Components: libprocess >Affects Versions: 1.1.0 >Reporter: Alexander Rojas >Assignee: Alexander Rojas > Labels: libprocess, mesosphere > Fix For: 1.2.0 > > > The problem arises when one tries to associate an {{after()}} call with copied > futures. The following test case is enough to reproduce the issue: > {code} > TEST(FutureTest, After3) > { > auto policy = std::make_shared<int>(0); > { > auto generator = []() { > return Future<Nothing>(); > }; > Future<Nothing> future = generator() > .after(Milliseconds(1), > [policy](const Future<Nothing>&) { > return Nothing(); > }); > AWAIT_READY(future); > } > EXPECT_EQ(1, policy.use_count()); > } > {code} > In the test, one would expect that there is only one active reference to > {{policy}}, therefore the expectation {{EXPECT_EQ(1, policy.use_count())}}. > However, if {{after()}} is triggered more than once, each extra call adds one > undeleted reference to {{policy}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6828) Consider ways for frameworks to ignore offers with an Unavailability
[ https://issues.apache.org/jira/browse/MESOS-6828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15771179#comment-15771179 ] Joris Van Remoortere commented on MESOS-6828: - An updated proposal to improve flexibility while still being easily consumable: # Allow operators to specify a separate start time for when offers should stop being sent prior to the actual maintenance window. # Add an opt-in capability for frameworks to be able to see offers during the period described in point #1. By controlling the time period during which offers are not sent out, we are able to stagger them based on the maintenance schedule and prevent the stalling scenario described in the ticket description. > Consider ways for frameworks to ignore offers with an Unavailability > > > Key: MESOS-6828 > URL: https://issues.apache.org/jira/browse/MESOS-6828 > Project: Mesos > Issue Type: Improvement >Reporter: Joris Van Remoortere >Assignee: Artem Harutyunyan > Labels: maintenance > > Due to the opt-in nature of maintenance primitives in Mesos, there is a > deficiency for cluster administrators when frameworks have not opted in. > An example case: > - Cluster with reasonable churn (tasks terminate naturally) > - Operator specifies maintenance schedule > Ideally *even* in a world where none of the frameworks had opted in to > maintenance primitives the operator would have some way of preventing > frameworks from scheduling further work on agents in the schedule. The > natural termination of the tasks in the cluster would allow the nodes to > drain gracefully and the operator to then perform maintenance. > 2 options that have been discussed so far: > # Provide a capability for frameworks to automatically filter offers with an > {{Unavailability}} set. > #* Pro: Finer grained control. Allows other frameworks to keep scheduling > short lived tasks that can complete before the Unavailability. > #* Con: All frameworks have to be updated. Consider making this an > environment variable to the scheduler driver for legacy frameworks. > # Provide a flag on the master to filter all offers with an > {{Unavailability}} set. > #* Pro: Immediately actionable / usable. > #* Con: Coarse grained. Some frameworks may suffer efficiency. > #* Con: *Dangerous*: planning out a multi-day maintenance schedule for an > entire cluster will prevent any frameworks from scheduling further work, > potentially stalling the cluster. > Action Items: Provide further context for each option and consider others. We > need to ensure we have something immediately consumable by users to fill the > gap until maintenance primitives are the norm. We also need to ensure we > prevent dangerous scenarios like the Con listed for option #2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6828) Consider ways for frameworks to ignore offers with an Unavailability
Joris Van Remoortere created MESOS-6828: --- Summary: Consider ways for frameworks to ignore offers with an Unavailability Key: MESOS-6828 URL: https://issues.apache.org/jira/browse/MESOS-6828 Project: Mesos Issue Type: Improvement Reporter: Joris Van Remoortere Assignee: Artem Harutyunyan Due to the opt-in nature of maintenance primitives in Mesos, there is a deficiency for cluster administrators when frameworks have not opted in. An example case: - Cluster with reasonable churn (tasks terminate naturally) - Operator specifies maintenance schedule Ideally *even* in a world where none of the frameworks had opted in to maintenance primitives the operator would have some way of preventing frameworks from scheduling further work on agents in the schedule. The natural termination of the tasks in the cluster would allow the nodes to drain gracefully and the operator to then perform maintenance. 2 options that have been discussed so far: # Provide a capability for frameworks to automatically filter offers with an {{Unavailability}} set. #* Pro: Finer grained control. Allows other frameworks to keep scheduling short lived tasks that can complete before the Unavailability. #* Con: All frameworks have to be updated. Consider making this an environment variable to the scheduler driver for legacy frameworks. # Provide a flag on the master to filter all offers with an {{Unavailability}} set. #* Pro: Immediately actionable / usable. #* Con: Coarse grained. Some frameworks may suffer efficiency. #* Con: *Dangerous*: planning out a multi-day maintenance schedule for an entire cluster will prevent any frameworks from scheduling further work, potentially stalling the cluster. Action Items: Provide further context for each option and consider others. We need to ensure we have something immediately consumable by users to fill the gap until maintenance primitives are the norm. We also need to ensure we prevent dangerous scenarios like the Con listed for option #2. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6815) Enable glog stack traces when we call things like `ABORT` on Windows
[ https://issues.apache.org/jira/browse/MESOS-6815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-6815: Priority: Critical (was: Major) > Enable glog stack traces when we call things like `ABORT` on Windows > > > Key: MESOS-6815 > URL: https://issues.apache.org/jira/browse/MESOS-6815 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Alex Clemmer >Assignee: Alex Clemmer >Priority: Critical > Labels: microsoft, windows-mvp > > Currently in the Windows builds, if we call `ABORT` (etc.) we will simply > bail out, with no stack traces. > This is highly undesirable. Stack traces are important for operating clusters > in production. We should work to enable this behavior, including possibly > working with glog to add this support if they currently they do not natively > support it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4638) versioning preprocessor macros
[ https://issues.apache.org/jira/browse/MESOS-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4638: Fix Version/s: 0.28.3 > versioning preprocessor macros > -- > > Key: MESOS-4638 > URL: https://issues.apache.org/jira/browse/MESOS-4638 > Project: Mesos > Issue Type: Bug > Components: c++ api >Reporter: James Peach >Assignee: Zhitao Li > Fix For: 0.28.3, 1.0.2, 1.1.0 > > > The macros in {{version.hpp}} cannot be used for conditional build because > they are strings not integers. It would be helpful to have integer versions > of these for conditionally building code against different versions of the > Mesos API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6502) _version uses incorrect MESOS_{MAJOR,MINOR,PATCH}_VERSION in libmesos java binding.
[ https://issues.apache.org/jira/browse/MESOS-6502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-6502: Fix Version/s: 0.28.3 > _version uses incorrect MESOS_{MAJOR,MINOR,PATCH}_VERSION in libmesos java > binding. > --- > > Key: MESOS-6502 > URL: https://issues.apache.org/jira/browse/MESOS-6502 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Joris Van Remoortere >Assignee: Joris Van Remoortere >Priority: Blocker > Labels: mesosphere > Fix For: 0.28.3, 1.0.2, 1.1.0 > > > When the macros were re-assigned they were not flushed fully through the > codebase: > https://github.com/apache/mesos/commit/6bc6a40a54491cfd733263cd3962e490b0b4bdbb -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6502) _version uses incorrect MESOS_{MAJOR,MINOR,PATCH}_VERSION in libmesos java binding.
[ https://issues.apache.org/jira/browse/MESOS-6502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15645209#comment-15645209 ] Joris Van Remoortere commented on MESOS-6502: - {{0.28.3}} {code} commit b0dd63ea35b4338dc365da7db6c79eb9731e8e8b Author: Joris Van Remoortere Date: Fri Oct 28 15:50:10 2016 -0400 Fixed MesosNativeLibrary to use '_NUM' MESOS_VERSION macros. Review: https://reviews.apache.org/r/53270 {code} > _version uses incorrect MESOS_{MAJOR,MINOR,PATCH}_VERSION in libmesos java > binding. > --- > > Key: MESOS-6502 > URL: https://issues.apache.org/jira/browse/MESOS-6502 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Joris Van Remoortere >Assignee: Joris Van Remoortere >Priority: Blocker > Labels: mesosphere > Fix For: 0.28.3, 1.0.2, 1.1.0 > > > When the macros were re-assigned they were not flushed fully through the > codebase: > https://github.com/apache/mesos/commit/6bc6a40a54491cfd733263cd3962e490b0b4bdbb -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4638) versioning preprocessor macros
[ https://issues.apache.org/jira/browse/MESOS-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15645204#comment-15645204 ] Joris Van Remoortere commented on MESOS-4638: - {{0.28.3}} {code} commit 6408c54e0327ab864d4e193814ee69bcd24985df Author: Zhitao Li Date: Wed Aug 17 09:34:27 2016 -0700 Introduce MESOS_{MAJOR|MINOR|PATCH}_VERSION_NUM macros. This makes version based conditional compiling much easier for module writers. Review: https://reviews.apache.org/r/50992/ {code} > versioning preprocessor macros > -- > > Key: MESOS-4638 > URL: https://issues.apache.org/jira/browse/MESOS-4638 > Project: Mesos > Issue Type: Bug > Components: c++ api >Reporter: James Peach >Assignee: Zhitao Li > Fix For: 0.28.3, 1.0.2, 1.1.0 > > > The macros in {{version.hpp}} cannot be used for conditional build because > they are strings not integers. It would be helpful to have integer versions > of these for conditionally building code against different versions of the > Mesos API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6457) Tasks shouldn't transition from TASK_KILLING to TASK_RUNNING.
[ https://issues.apache.org/jira/browse/MESOS-6457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-6457: Target Version/s: 1.0.2, 1.1.0 (was: 1.1.0) > Tasks shouldn't transition from TASK_KILLING to TASK_RUNNING. > - > > Key: MESOS-6457 > URL: https://issues.apache.org/jira/browse/MESOS-6457 > Project: Mesos > Issue Type: Bug >Reporter: Gastón Kleiman >Assignee: Gastón Kleiman >Priority: Blocker > > A task can currently transition from {{TASK_KILLING}} to {{TASK_RUNNING}}, if > for example it starts/stops passing a health check once it got into the > {{TASK_KILLING}} state. > I think that this behaviour is counterintuitive. It also makes the life of > framework/tools developers harder, since they have to keep track of the > complete task status history in order to know if a task is being killed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6457) Tasks shouldn't transition from TASK_KILLING to TASK_RUNNING.
[ https://issues.apache.org/jira/browse/MESOS-6457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-6457: Target Version/s: 1.1.0 (was: 1.2.0) > Tasks shouldn't transition from TASK_KILLING to TASK_RUNNING. > - > > Key: MESOS-6457 > URL: https://issues.apache.org/jira/browse/MESOS-6457 > Project: Mesos > Issue Type: Bug >Reporter: Gastón Kleiman >Assignee: Gastón Kleiman >Priority: Blocker > > A task can currently transition from {{TASK_KILLING}} to {{TASK_RUNNING}}, if > for example it starts/stops passing a health check once it got into the > {{TASK_KILLING}} state. > I think that this behaviour is counterintuitive. It also makes the life of > framework/tools developers harder, since they have to keep track of the > complete task status history in order to know if a task is being killed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-6502) _version uses incorrect MESOS_{MAJOR,MINOR,PATCH}_VERSION in libmesos java binding.
[ https://issues.apache.org/jira/browse/MESOS-6502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616903#comment-15616903 ] Joris Van Remoortere edited comment on MESOS-6502 at 10/28/16 11:16 PM: {{1.1.x}} {code} commit e105363a52e219a565acc91144788600eb0b9aeb Author: Joris Van Remoortere Date: Fri Oct 28 15:50:10 2016 -0400 Fixed MesosNativeLibrary to use '_NUM' MESOS_VERSION macros. Review: https://reviews.apache.org/r/53270 {code} {{1.0.2}} {code} commit 9b8c54282c5337e28d99bc0025661131bde2246f Author: Joris Van Remoortere Date: Fri Oct 28 15:50:10 2016 -0400 Fixed MesosNativeLibrary to use '_NUM' MESOS_VERSION macros. Review: https://reviews.apache.org/r/53270 {code} was (Author: jvanremoortere): {{1.1.x}} {code} commit e105363a52e219a565acc91144788600eb0b9aeb Author: Joris Van Remoortere Date: Fri Oct 28 15:50:10 2016 -0400 Fixed MesosNativeLibrary to use '_NUM' MESOS_VERSION macros. Review: https://reviews.apache.org/r/53270 {code} > _version uses incorrect MESOS_{MAJOR,MINOR,PATCH}_VERSION in libmesos java > binding. > --- > > Key: MESOS-6502 > URL: https://issues.apache.org/jira/browse/MESOS-6502 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Joris Van Remoortere >Assignee: Joris Van Remoortere >Priority: Blocker > Labels: mesosphere > Fix For: 1.0.2, 1.1.0 > > > When the macros were re-assigned they were not flushed fully through the > codebase: > https://github.com/apache/mesos/commit/6bc6a40a54491cfd733263cd3962e490b0b4bdbb -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6502) _version uses incorrect MESOS_{MAJOR,MINOR,PATCH}_VERSION in libmesos java binding.
[ https://issues.apache.org/jira/browse/MESOS-6502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616903#comment-15616903 ] Joris Van Remoortere commented on MESOS-6502: - {{1.1.x}} {code} commit e105363a52e219a565acc91144788600eb0b9aeb Author: Joris Van Remoortere Date: Fri Oct 28 15:50:10 2016 -0400 Fixed MesosNativeLibrary to use '_NUM' MESOS_VERSION macros. Review: https://reviews.apache.org/r/53270 {code} > _version uses incorrect MESOS_{MAJOR,MINOR,PATCH}_VERSION in libmesos java > binding. > --- > > Key: MESOS-6502 > URL: https://issues.apache.org/jira/browse/MESOS-6502 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Joris Van Remoortere >Assignee: Joris Van Remoortere >Priority: Blocker > Labels: mesosphere > Fix For: 1.0.2, 1.1.0 > > > When the macros were re-assigned they were not flushed fully through the > codebase: > https://github.com/apache/mesos/commit/6bc6a40a54491cfd733263cd3962e490b0b4bdbb -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4638) versioning preprocessor macros
[ https://issues.apache.org/jira/browse/MESOS-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616900#comment-15616900 ] Joris Van Remoortere commented on MESOS-4638: - {{1.0.2}}: {code} commit 5668d4ff2655f120ca3d66c509efa40e24d5faf3 Author: Zhitao Li Date: Wed Aug 17 09:34:27 2016 -0700 Introduce MESOS_{MAJOR|MINOR|PATCH}_VERSION_NUM macros. This makes version based conditional compiling much easier for module writers. Review: https://reviews.apache.org/r/50992/ {code} > versioning preprocessor macros > -- > > Key: MESOS-4638 > URL: https://issues.apache.org/jira/browse/MESOS-4638 > Project: Mesos > Issue Type: Bug > Components: c++ api >Reporter: James Peach >Assignee: Zhitao Li > Fix For: 1.0.2, 1.1.0 > > > The macros in {{version.hpp}} cannot be used for conditional build because > they are strings not integers. It would be helpful to have integer versions > of these for conditionally building code against different versions of the > Mesos API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4638) versioning preprocessor macros
[ https://issues.apache.org/jira/browse/MESOS-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4638: Fix Version/s: 1.0.2 > versioning preprocessor macros > -- > > Key: MESOS-4638 > URL: https://issues.apache.org/jira/browse/MESOS-4638 > Project: Mesos > Issue Type: Bug > Components: c++ api >Reporter: James Peach >Assignee: Zhitao Li > Fix For: 1.0.2, 1.1.0 > > > The macros in {{version.hpp}} cannot be used for conditional build because > they are strings not integers. It would be helpful to have integer versions > of these for conditionally building code against different versions of the > Mesos API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6502) _version uses incorrect MESOS_{MAJOR,MINOR,PATCH}_VERSION in libmesos java binding.
[ https://issues.apache.org/jira/browse/MESOS-6502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-6502: Summary: _version uses incorrect MESOS_{MAJOR,MINOR,PATCH}_VERSION in libmesos java binding. (was: MESOS_{MAJOR,MINOR,PATCH}_VERSION incorrect in libmesos java binding) > _version uses incorrect MESOS_{MAJOR,MINOR,PATCH}_VERSION in libmesos java > binding. > --- > > Key: MESOS-6502 > URL: https://issues.apache.org/jira/browse/MESOS-6502 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Joris Van Remoortere >Assignee: Joris Van Remoortere >Priority: Blocker > Labels: mesosphere > Fix For: 1.1.0 > > > When the macros were re-assigned they were not flushed fully through the > codebase: > https://github.com/apache/mesos/commit/6bc6a40a54491cfd733263cd3962e490b0b4bdbb -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6407) Move DEFAULT_v1_xxx macros to the v1 namespace.
[ https://issues.apache.org/jira/browse/MESOS-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592835#comment-15592835 ] Joris Van Remoortere commented on MESOS-6407: - {code} commit e9da9b3bc41aa81c25d36901e52ff1e941fa09e6 Author: Joris Van Remoortere Date: Mon Oct 17 23:15:21 2016 -0700 Split mesos test helpers into 'internal' and 'v1' namespaces. Review: https://reviews.apache.org/r/52976 commit 2373819dc3e3f8b251526db962eecde23de1545b Author: Joris Van Remoortere Date: Tue Oct 18 20:54:41 2016 -0700 Removed unused tests helper macro 'DEFAULT_CONTAINER_ID'. Review: https://reviews.apache.org/r/53014 commit 78d4ec406f7bee61eb5097bca91bf143d2f43f82 Author: Joris Van Remoortere Date: Tue Oct 18 15:33:09 2016 -0700 Removed extra 'evolve' implementation from 'api_tests.cpp'. Review: https://reviews.apache.org/r/53013 commit 7831f1fbace2ae868dd7dc80f4ddca459b9ffe19 Author: Joris Van Remoortere Date: Tue Oct 18 16:18:25 2016 -0700 Fixed usage of 'evolve' in master http endpoints. Review: https://reviews.apache.org/r/53012 {code} > Move DEFAULT_v1_xxx macros to the v1 namespace. > --- > > Key: MESOS-6407 > URL: https://issues.apache.org/jira/browse/MESOS-6407 > Project: Mesos > Issue Type: Improvement >Reporter: Anand Mazumdar >Assignee: Joris Van Remoortere > Labels: mesosphere > Fix For: 1.2.0 > > > We should clean up the existing {{DEFAULT_v1_*}} macros and bring it under > the {{v1}} namespace e.g., {{v1::DEFAULT_FRAMEWORK_INFO}}. This is necessary > for doing a larger cleanup i.e., we would like to introduce {{createXXX}} for > the {{v1}} API and would not like to add {{createV1XXX}} functions eventually. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6343) Documentation Error: Default Executor does not implicitly construct resources
Joris Van Remoortere created MESOS-6343: --- Summary: Documentation Error: Default Executor does not implicitly construct resources Key: MESOS-6343 URL: https://issues.apache.org/jira/browse/MESOS-6343 Project: Mesos Issue Type: Documentation Reporter: Joris Van Remoortere Priority: Blocker https://github.com/apache/mesos/blob/d16f53d5a9e15d1d9533739a8c052bc546ec3262/include/mesos/v1/mesos.proto#L544-L546 This probably got carried forward from early design discussions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6315) `killtree` can accidentally kill containerizer / executor
Joris Van Remoortere created MESOS-6315: --- Summary: `killtree` can accidentally kill containerizer / executor Key: MESOS-6315 URL: https://issues.apache.org/jira/browse/MESOS-6315 Project: Mesos Issue Type: Bug Affects Versions: 1.0.0 Reporter: Joris Van Remoortere The implementation of killtree is buggy. [~jieyu] has some ideas. ltrace of mesos-local: {code} [pid 19501] [0x7f89d77a61ab] libmesos-1.1.0.so->kill(29985, SIGKILL) = 0 [pid 19501] [0x7f89d77a61ab] libmesos-1.1.0.so->kill(31349, SIGKILL [pid 31359] [0x] +++ killed by SIGKILL +++ [pid 31358] [0x] +++ killed by SIGKILL +++ [pid 31357] [0x] +++ killed by SIGKILL +++ [pid 31356] [0x] +++ killed by SIGKILL +++ [pid 31354] [0x] +++ killed by SIGKILL +++ [pid 31353] [0x] +++ killed by SIGKILL +++ [pid 31351] [0x] +++ killed by SIGKILL +++ [pid 31350] [0x] +++ killed by SIGKILL +++ [pid 19501] [0x7f89d77a61ab] <... kill resumed> ) = 0 [pid 19501] [0x7f89d77a61dd] libmesos-1.1.0.so->kill(29985, SIGCONT [pid 29985] [0x] +++ killed by SIGKILL +++ [pid 19493] [0x7f89d64ceda0] --- SIGCHLD (Child exited) --- [pid 31352] [0x] +++ killed by SIGKILL +++ [pid 31349] [0x] +++ killed by SIGKILL +++ [pid 19501] [0x7f89d77a61dd] <... kill resumed> ) = 0 [pid 19501] [0x7f89d77a61dd] libmesos-1.1.0.so->kill(31349, SIGCONT) = -1 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6315) `killtree` can accidentally kill containerizer / executor
[ https://issues.apache.org/jira/browse/MESOS-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15549602#comment-15549602 ] Joris Van Remoortere commented on MESOS-6315: - Since {{killtree}} is only used in the posix containerizer this is not a blocker. > `killtree` can accidentally kill containerizer / executor > - > > Key: MESOS-6315 > URL: https://issues.apache.org/jira/browse/MESOS-6315 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Joris Van Remoortere > > The implementation of killtree is buggy. [~jieyu] has some ideas. > ltrace of mesos-local: > {code} > [pid 19501] [0x7f89d77a61ab] libmesos-1.1.0.so->kill(29985, SIGKILL) > = 0 > [pid 19501] [0x7f89d77a61ab] libmesos-1.1.0.so->kill(31349, SIGKILL return ...> > [pid 31359] [0x] +++ killed by SIGKILL +++ > [pid 31358] [0x] +++ killed by SIGKILL +++ > [pid 31357] [0x] +++ killed by SIGKILL +++ > [pid 31356] [0x] +++ killed by SIGKILL +++ > [pid 31354] [0x] +++ killed by SIGKILL +++ > [pid 31353] [0x] +++ killed by SIGKILL +++ > [pid 31351] [0x] +++ killed by SIGKILL +++ > [pid 31350] [0x] +++ killed by SIGKILL +++ > [pid 19501] [0x7f89d77a61ab] <... kill resumed> ) > = 0 > [pid 19501] [0x7f89d77a61dd] libmesos-1.1.0.so->kill(29985, SIGCONT return ...> > [pid 29985] [0x] +++ killed by SIGKILL +++ > [pid 19493] [0x7f89d64ceda0] --- SIGCHLD (Child exited) --- > [pid 31352] [0x] +++ killed by SIGKILL +++ > [pid 31349] [0x] +++ killed by SIGKILL +++ > [pid 19501] [0x7f89d77a61dd] <... kill resumed> ) > = 0 > [pid 19501] [0x7f89d77a61dd] libmesos-1.1.0.so->kill(31349, SIGCONT) > = -1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-6264) Investigate the high memory usage of the default executor.
[ https://issues.apache.org/jira/browse/MESOS-6264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549111#comment-15549111 ] Joris Van Remoortere edited comment on MESOS-6264 at 10/5/16 3:45 PM: -- cc [~vinodkone][~jieyu] The bulk of this comes from loading in {{libmesos.so}}. We do this because the autoconf build treats libmesos as a dynamic dependency. Since we load libmesos dynamically, there is no chance for the linker to strip unused code. This means that all of the code in libmesos, regardless of use, gets loaded into resident memory. In contrast, the cmake build generates a static library for {{libmesos.a}}. This is then used to build the {{mesos-executor}} binary without a dynamic dependency on libmesos. The benefit of this approach is that the linker is able to strip out all unused code. In an optimized build this is {{~10MB}}. Some approaches for the quick win are: # Consider using the cmake build. This only needs to be modified slightly to strip symbols from the final executor binary with {{-s}}. # Modify the autoconf build to build a {{libmesos.a}} so that we can statically link it into the {{mesos-executor}} binary and allow the linker to strip unused code. Regardless of the above approach, {{libmesos}} would still be by far the largest contributor of the {{RSS}}. This is for 2 reasons: # Much of our code is structured such that the linker can't determine if it is unused. We would need to adjust our patterns such that the unused code analyzer can do a better job. # Much of our code is {{inlined}} or written such that it can't be optimized. 2 examples are: ## https://github.com/apache/mesos/blob/9beb8eae6408249cdb3e2f16ba68b31a00d3452c/3rdparty/libprocess/include/process/mime.hpp#L35-L154 This code could be moved to a {{.cpp}} file and should be a {{static const std::unordered_map}} that we {{insert(begin(), end())}} into {{types}}. This would reduce the size of libmesos by {{~20KB}}! 
## https://github.com/apache/mesos/blob/9beb8eae6408249cdb3e2f16ba68b31a00d3452c/3rdparty/libprocess/include/process/http.hpp#L453-L517 This code and its sibling {{struct Request}} have auto-generated {{inlined}} destructors. These are very expensive. Just declaring, and then defining in the {{.cpp}}, the default destructor can remove another {{~20KB}} each from libmesos. There are plenty of other opportunities like this scattered through the codebase. It's work to find them and the returns are small for each, but they add up to much of the {{9MB}} left over. > Investigate the high memory usage of the default
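The mime-table suggestion above can be sketched as follows. This is a hedged sketch of the pattern, not the actual libprocess code: the names {{mime::types}} and the entries shown are illustrative. The point is that the table's initializer lives in exactly one translation unit, while the header only declares it.

```cpp
#include <cassert>
#include <map>
#include <string>

// Sketch of the suggested pattern: the table is defined once in a .cpp, so
// its initializer is not duplicated (and kept resident) in every translation
// unit that includes the header.
//
// mime.hpp would then contain only the declaration:
//   namespace mime { extern const std::map<std::string, std::string> types; }
namespace mime {

const std::map<std::string, std::string> types = {
  {".html", "text/html"},
  {".json", "application/json"},
  {".txt",  "text/plain"},
};

} // namespace mime
```

The trade-off is one extra symbol lookup across translation units in exchange for a single definition the linker can place (and dead-strip) once.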
[jira] [Commented] (MESOS-6264) Investigate the high memory usage of the default executor.
[ https://issues.apache.org/jira/browse/MESOS-6264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549111#comment-15549111 ] Joris Van Remoortere commented on MESOS-6264: - cc [~vinodkone][~jieyu] The bulk of this comes from loading in {{libmesos.so}}. We do this because the autoconf build treats libmesos as a dynamic dependency. Since we load libmesos dynamically, there is no chance for the linker to strip unused code. This means that all of the code in libmesos, regardless of use, gets loaded into resident memory. In contrast, the cmake build generates a static library for {{libmesos.a}}. This is then used to build the {{mesos-executor}} binary without a dynamic dependency on libmesos. The benefit of this approach is that the linker is able to strip out all unused code. In an optimized build this is {{~10MB}}. Some approaches for the quick win are: # Consider using the cmake build. This only needs to be modified slightly to strip symbols from the final executor binary with {{-s}}. # Modify the autoconf build to build a {{libmesos.a}} so that we can statically link it into the {{mesos-executor}} binary and allow the linker to strip unused code. Regardless of the above approach, {{libmesos}} would still be by far the largest contributor of the {{RSS}}. This is for 2 reasons: # Much of our code is structured such that the linker can't determine if it is unused. We would need to adjust our patterns such that the unused code analyzer can do a better job. # Much of our code is {{inlined}} or written such that it can't be optimized. 2 examples are: ## https://github.com/apache/mesos/blob/9beb8eae6408249cdb3e2f16ba68b31a00d3452c/3rdparty/libprocess/include/process/mime.hpp#L35-L154 This code could be moved to a {{.cpp}} file and should be a {{static const std::unordered_map}} that we {{insert(begin(), end())}} into {{types}}. This would reduce the size of libmesos by {{~20KB}}! 
## https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/http.hpp#L453-L517 This code and its sibling {{struct Request}} have auto-generated {{inlined}} destructors. These are very expensive. Just declaring, and then defining in the {{.cpp}}, the default destructor can remove another {{~20KB}} each from libmesos. There are plenty of other opportunities like this scattered through the codebase. It's work to find them and the returns are small for each, but they add up to much of the {{9MB}} left over. > Investigate the high memory usage of the default executor. > -- > > Key: MESOS-6264 > URL: https://issues.apache.org/jira/browse/MESOS-6264 > Project: Mesos > Issue Type: Bug >Reporter: Anand Mazumdar > Labels: mesosphere > Fix For: 1.1.0 > > Attachments: pmap_output_for_the_default_executor.txt > > > It seems that a default executor with two sleep tasks is using ~32 mb on > average and can sometimes lead to it being killed for some tests like > {{SlaveRecoveryTest/0.ROOT_CGROUPS_ReconnectDefaultExecutor}} on our internal > CI. Attached the {{pmap}} output for the default executor. Please note that > the command executor memory usage is also pretty high (~26 mb). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
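The out-of-line destructor suggestion can be sketched with a stand-in type (the real code is libprocess's {{http::Request}}/{{http::Response}}; the {{Response}} below is only illustrative). Declaring the destructor in the header and defaulting it in exactly one {{.cpp}} keeps the compiler from emitting an inline destructor body in every translation unit that uses the type.

```cpp
#include <cassert>
#include <string>
#include <vector>

// --- header (sketch) ---
struct Response
{
  Response();
  ~Response(); // Declared here, so no implicit inline destructor is generated.

  std::string body;
  std::vector<std::string> headers;
};

// --- .cpp (sketch): still defaulted, but the destructor body is emitted
// only once, in this translation unit. ---
Response::Response() = default;
Response::~Response() = default;
```

Behavior is unchanged; only where the (member-destroying) destructor code is emitted changes, which is exactly the code-size win the comment describes.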
[jira] [Commented] (MESOS-6247) Enable Framework to set weight
[ https://issues.apache.org/jira/browse/MESOS-6247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548652#comment-15548652 ] Joris Van Remoortere commented on MESOS-6247: - [~klaus1982] Do you mean they cannot share reserved resources with each other? If they are in the same role they are supposed to be co-operative. At that point, why does the weight matter? They should both be yielding all unavailable resources to each other. If we add support for weights now it will make it *even* harder to move people into the hierarchical role world described by benm. It seems like the frameworks co-operating (as they should per the contract of sharing a role) is the right temporary solution for you. > Enable Framework to set weight > -- > > Key: MESOS-6247 > URL: https://issues.apache.org/jira/browse/MESOS-6247 > Project: Mesos > Issue Type: Bug > Components: allocation > Environment: all >Reporter: Klaus Ma >Priority: Critical > > We'd like to enable a framework's weight when it registers, so the framework can > share resources based on weight within the same role. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
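What per-framework weights would mean under fair sharing can be illustrated with a tiny sketch. This is not the Mesos allocator API; {{effectiveShare}} is a hypothetical helper showing the usual weighted-DRF idea: dividing a framework's dominant share by its weight makes higher-weight frameworks sort as "less allocated", so they are offered resources first.

```cpp
#include <cassert>

// Illustrative only: in weighted DRF-style sorting, frameworks are ordered
// by (dominant share / weight), ascending. A weight of 2 means the framework
// must hold twice the share before it sorts behind a weight-1 framework.
double effectiveShare(double dominantShare, double weight)
{
  return dominantShare / weight;
}
```

For example, two frameworks each holding half the cluster: the one with weight 2 still sorts first and keeps receiving offers.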
[jira] [Commented] (MESOS-6249) On Mesos master failover the reregistered callback is not triggered
[ https://issues.apache.org/jira/browse/MESOS-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548632#comment-15548632 ] Joris Van Remoortere commented on MESOS-6249: - [~markusjura] It seems like you are hitting some logic around https://issues.apache.org/jira/browse/MESOS-786 You can see the comment here. https://github.com/apache/mesos/blob/b70a22bad22e5e8668f9af62c575902dec7b0125/src/master/master.cpp#L2813-L2820 pinging [~bmahler] who wrote the comment, and [~anandmazumdar] for reference. > On Mesos master failover the reregistered callback is not triggered > --- > > Key: MESOS-6249 > URL: https://issues.apache.org/jira/browse/MESOS-6249 > Project: Mesos > Issue Type: Bug > Components: java api >Affects Versions: 0.28.0, 0.28.1, 1.0.1 > Environment: OS X 10.11.6 >Reporter: Markus Jura > > On a Mesos master failover the reregistered callback of the Java API is not > triggered. Only the registration callback is triggered which makes it hard > for a framework to distinguish between these scenarios. > This behaviour has been tested with the ConductR framework, both with the > Java API version 0.28.0, 0.28.1 and 1.0.1. Below you find the logs from the > master that got re-elected and from the ConductR framework. > *Log: Mesos master on a master re-election* > {code:bash} > I0926 11:44:20.008306 3747840 zookeeper.cpp:259] A new leading master > (UPID=master@127.0.0.1:5050) is detected > I0926 11:44:20.008458 3747840 master.cpp:1847] The newly elected leader is > master@127.0.0.1:5050 with id ca5b9713-1eec-43e1-9d27-9ebc5c0f95b1 > I0926 11:44:20.008484 3747840 master.cpp:1860] Elected as the leading master! 
> I0926 11:44:20.008498 3747840 master.cpp:1547] Recovering from registrar > I0926 11:44:20.008607 3747840 registrar.cpp:332] Recovering registrar > I0926 11:44:20.016340 4284416 registrar.cpp:365] Successfully fetched the > registry (0B) in 7.702016ms > I0926 11:44:20.016393 4284416 registrar.cpp:464] Applied 1 operations in > 12us; attempting to update the 'registry' > I0926 11:44:20.021428 4284416 registrar.cpp:509] Successfully updated the > 'registry' in 5.019904ms > I0926 11:44:20.021481 4284416 registrar.cpp:395] Successfully recovered > registrar > I0926 11:44:20.021611 528384 master.cpp:1655] Recovered 0 agents from the > Registry (118B) ; allowing 10mins for agents to re-register > I0926 11:44:20.536859 3747840 master.cpp:2424] Received SUBSCRIBE call for > framework 'conductr' at > scheduler-3f8b9645-7a17-4e9f-8ad5-077fe8c23b39@192.168.2.106:57164 > I0926 11:44:20.536969 3747840 master.cpp:2500] Subscribing framework conductr > with checkpointing disabled and capabilities [ ] > I0926 11:44:20.537401 3211264 hierarchical.cpp:271] Added framework conductr > I0926 11:44:20.807895 528384 master.cpp:4787] Re-registering agent > b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at slave(1)@127.0.0.1:5051 (127.0.0.1) > I0926 11:44:20.808145 1601536 registrar.cpp:464] Applied 1 operations in > 38us; attempting to update the 'registry' > I0926 11:44:20.815757 1601536 registrar.cpp:509] Successfully updated the > 'registry' in 7.568896ms > I0926 11:44:20.815992 3747840 master.cpp:7447] Adding task > 6abce9bb-895f-4f6f-be5b-25f6bd09f548 with resources mem(*):0 on agent > b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 (127.0.0.1) > I0926 11:44:20.816339 3747840 master.cpp:4872] Re-registered agent > b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at slave(1)@127.0.0.1:5051 > (127.0.0.1) with cpus(*):8; mem(*):15360; disk(*):470832; > ports(*):[31000-32000] > I0926 11:44:20.816385 1601536 hierarchical.cpp:478] Added agent > b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 (127.0.0.1) with cpus(*):8; 
> mem(*):15360; disk(*):470832; ports(*):[31000-32000] (allocated: cpus(*):0.9; > mem(*):402.653; disk(*):1000; ports(*):[31000-31000, 31001-31500]) > I0926 11:44:20.816437 3747840 master.cpp:4940] Sending updated checkpointed > resources to agent b99256c3-6905-44d3-bcc9-0d9e00d20fbe-S0 at > slave(1)@127.0.0.1:5051 (127.0.0.1) > I0926 11:44:20.816787 4284416 master.cpp:5725] Sending 1 offers to framework > conductr (conductr) at > scheduler-3f8b9645-7a17-4e9f-8ad5-077fe8c23b39@192.168.2.106:57164 > {code} > *Log: ConductR framework* > {code:bash} > I0926 11:44:20.007189 66441216 detector.cpp:152] Detected a new leader: > (id='87') > I0926 11:44:20.007524 64294912 group.cpp:706] Trying to get > '/mesos/json.info_87' in ZooKeeper > I0926 11:44:20.008625 63758336 zookeeper.cpp:259] A new leading master > (UPID=master@127.0.0.1:5050) is detected > I0926 11:44:20.008965 63758336 sched.cpp:330] New master detected at > master@127.0.0.1:5050 > 2016-09-26T09:44:20Z MacBook-Pro-6.local INFO MesosSchedulerClient > [sourceThread=conductr-akka.actor.default-dispatcher-2, >
[jira] [Commented] (MESOS-6311) Consider supporting implicit reconciliation per agent
[ https://issues.apache.org/jira/browse/MESOS-6311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546209#comment-15546209 ] Joris Van Remoortere commented on MESOS-6311: - cc [~anandmazumdar] [~neilconway] [~vinodkone] > Consider supporting implicit reconciliation per agent > - > > Key: MESOS-6311 > URL: https://issues.apache.org/jira/browse/MESOS-6311 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Joris Van Remoortere > > Currently mesos only supports: > - total implicit reconciliation > - explicit reconciliation per task > Since agents can slowly rejoin the master after a master failover, it is hard > to have a low time bound on implicit reconciliation for tasks. > Performing the current implicit reconciliation is expensive on big clusters > so it should not be done every N seconds. > If we could perform implicit reconciliation for a particular agent, then it > would be cheap enough to perform after we notice that particular agent rejoining the > cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6311) Consider supporting implicit reconciliation per agent
Joris Van Remoortere created MESOS-6311: --- Summary: Consider supporting implicit reconciliation per agent Key: MESOS-6311 URL: https://issues.apache.org/jira/browse/MESOS-6311 Project: Mesos Issue Type: Improvement Components: master Reporter: Joris Van Remoortere Currently mesos only supports: - total implicit reconciliation - explicit reconciliation per task Since agents can slowly rejoin the master after a master failover, it is hard to have a low time bound on implicit reconciliation for tasks. Performing the current implicit reconciliation is expensive on big clusters so it should not be done every N seconds. If we could perform implicit reconciliation for a particular agent, then it would be cheap enough to perform after we notice that particular agent rejoining the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
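The proposal can be sketched conceptually. The types below are illustrative stand-ins, not the Mesos protobufs: implicit reconciliation answers with every task the master knows about, while the proposed per-agent variant filters to one agent, so the response size tracks the agent that just re-registered instead of the whole cluster.

```cpp
#include <map>
#include <string>
#include <vector>

using TaskID = std::string;
using AgentID = std::string;

// Sketch: reconcile over the master's task -> agent map.
// agent == nullptr models today's total implicit reconciliation;
// a non-null agent models the proposed per-agent variant.
std::vector<TaskID> reconcile(
    const std::map<TaskID, AgentID>& tasks,
    const AgentID* agent)
{
  std::vector<TaskID> result;
  for (const auto& entry : tasks) {
    if (agent == nullptr || entry.second == *agent) {
      result.push_back(entry.first);
    }
  }
  return result;
}
```

The cost difference is the whole point: the filtered call touches only one agent's tasks, cheap enough to issue on every agent re-registration.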
[jira] [Commented] (MESOS-4948) Move maintenance tests to use the new scheduler library interface.
[ https://issues.apache.org/jira/browse/MESOS-4948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536930#comment-15536930 ] Joris Van Remoortere commented on MESOS-4948: - [~ipronin] [~anandmazumdar] will shepherd for you as he introduced the abstraction. > Move maintenance tests to use the new scheduler library interface. > -- > > Key: MESOS-4948 > URL: https://issues.apache.org/jira/browse/MESOS-4948 > Project: Mesos > Issue Type: Bug > Components: tests > Environment: Ubuntu 14.04, using gcc, with libevent and SSL enabled > (on ASF CI) >Reporter: Greg Mann >Assignee: Ilya Pronin > Labels: flaky-test, maintenance, mesosphere, newbie > > We need to move the existing maintenance tests to use the new scheduler > interface. We have already moved 1 test > {{MasterMaintenanceTest.PendingUnavailabilityTest}} to use the new interface. > It would be good to move the other 2 remaining tests to the new interface > since it can lead to failures around the stack object being referenced after > it has already been destroyed. Detailed log from an ASF CI build failure. 
> {code} > [ RUN ] MasterMaintenanceTest.InverseOffers > I0315 04:16:50.786032 2681 leveldb.cpp:174] Opened db in 125.361171ms > I0315 04:16:50.836374 2681 leveldb.cpp:181] Compacted db in 50.254411ms > I0315 04:16:50.836470 2681 leveldb.cpp:196] Created db iterator in 25917ns > I0315 04:16:50.836488 2681 leveldb.cpp:202] Seeked to beginning of db in > 3291ns > I0315 04:16:50.836498 2681 leveldb.cpp:271] Iterated through 0 keys in the > db in 253ns > I0315 04:16:50.836549 2681 replica.cpp:779] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0315 04:16:50.837474 2702 recover.cpp:447] Starting replica recovery > I0315 04:16:50.837565 2681 cluster.cpp:183] Creating default 'local' > authorizer > I0315 04:16:50.838191 2702 recover.cpp:473] Replica is in EMPTY status > I0315 04:16:50.839532 2704 replica.cpp:673] Replica in EMPTY status received > a broadcasted recover request from (4784)@172.17.0.4:39845 > I0315 04:16:50.839754 2705 recover.cpp:193] Received a recover response from > a replica in EMPTY status > I0315 04:16:50.841893 2704 recover.cpp:564] Updating replica status to > STARTING > I0315 04:16:50.842566 2703 master.cpp:376] Master > c326bc68-2581-48d4-9dc4-0d6f270bdda1 (01fcd642f65f) started on > 172.17.0.4:39845 > I0315 04:16:50.842644 2703 master.cpp:378] Flags at startup: --acls="" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="false" --authenticate_http="true" > --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/DE2Uaw/credentials" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --http_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" > --quiet="false" --recovery_slave_removal_limit="100%" > --registry="replicated_log" 
--registry_fetch_timeout="1mins" > --registry_store_timeout="100secs" --registry_strict="true" > --root_submissions="true" --slave_ping_timeout="15secs" > --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" > --webui_dir="/mesos/mesos-0.29.0/_inst/share/mesos/webui" > --work_dir="/tmp/DE2Uaw/master" --zk_session_timeout="10secs" > I0315 04:16:50.843168 2703 master.cpp:425] Master allowing unauthenticated > frameworks to register > I0315 04:16:50.843227 2703 master.cpp:428] Master only allowing > authenticated slaves to register > I0315 04:16:50.843302 2703 credentials.hpp:35] Loading credentials for > authentication from '/tmp/DE2Uaw/credentials' > I0315 04:16:50.843737 2703 master.cpp:468] Using default 'crammd5' > authenticator > I0315 04:16:50.843969 2703 master.cpp:537] Using default 'basic' HTTP > authenticator > I0315 04:16:50.844177 2703 master.cpp:571] Authorization enabled > I0315 04:16:50.844360 2708 hierarchical.cpp:144] Initialized hierarchical > allocator process > I0315 04:16:50.844430 2708 whitelist_watcher.cpp:77] No whitelist given > I0315 04:16:50.848227 2703 master.cpp:1806] The newly elected leader is > master@172.17.0.4:39845 with id c326bc68-2581-48d4-9dc4-0d6f270bdda1 > I0315 04:16:50.848269 2703 master.cpp:1819] Elected as the leading master! > I0315 04:16:50.848292 2703 master.cpp:1508] Recovering from registrar > I0315 04:16:50.848563 2703 registrar.cpp:307] Recovering registrar > I0315 04:16:50.876277 2711 leveldb.cpp:304] Persisting metadata (8 bytes) to > leveldb took 34.178445ms > I0315 04:16:50.876365 2711 replica.cpp:320] Persisted replica status to
[jira] [Updated] (MESOS-4948) Move maintenance tests to use the new scheduler library interface.
[ https://issues.apache.org/jira/browse/MESOS-4948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4948: Shepherd: Anand Mazumdar > Move maintenance tests to use the new scheduler library interface. > -- > > Key: MESOS-4948 > URL: https://issues.apache.org/jira/browse/MESOS-4948 > Project: Mesos > Issue Type: Bug > Components: tests > Environment: Ubuntu 14.04, using gcc, with libevent and SSL enabled > (on ASF CI) >Reporter: Greg Mann >Assignee: Ilya Pronin > Labels: flaky-test, maintenance, mesosphere, newbie > > We need to move the existing maintenance tests to use the new scheduler > interface. We have already moved 1 test > {{MasterMaintenanceTest.PendingUnavailabilityTest}} to use the new interface. > It would be good to move the other 2 remaining tests to the new interface > since it can lead to failures around the stack object being referenced after > it has already been destroyed. Detailed log from an ASF CI build failure. > {code} > [ RUN ] MasterMaintenanceTest.InverseOffers > I0315 04:16:50.786032 2681 leveldb.cpp:174] Opened db in 125.361171ms > I0315 04:16:50.836374 2681 leveldb.cpp:181] Compacted db in 50.254411ms > I0315 04:16:50.836470 2681 leveldb.cpp:196] Created db iterator in 25917ns > I0315 04:16:50.836488 2681 leveldb.cpp:202] Seeked to beginning of db in > 3291ns > I0315 04:16:50.836498 2681 leveldb.cpp:271] Iterated through 0 keys in the > db in 253ns > I0315 04:16:50.836549 2681 replica.cpp:779] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0315 04:16:50.837474 2702 recover.cpp:447] Starting replica recovery > I0315 04:16:50.837565 2681 cluster.cpp:183] Creating default 'local' > authorizer > I0315 04:16:50.838191 2702 recover.cpp:473] Replica is in EMPTY status > I0315 04:16:50.839532 2704 replica.cpp:673] Replica in EMPTY status received > a broadcasted recover request from (4784)@172.17.0.4:39845 > I0315 04:16:50.839754 2705 recover.cpp:193] Received a 
recover response from > a replica in EMPTY status > I0315 04:16:50.841893 2704 recover.cpp:564] Updating replica status to > STARTING > I0315 04:16:50.842566 2703 master.cpp:376] Master > c326bc68-2581-48d4-9dc4-0d6f270bdda1 (01fcd642f65f) started on > 172.17.0.4:39845 > I0315 04:16:50.842644 2703 master.cpp:378] Flags at startup: --acls="" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="false" --authenticate_http="true" > --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/DE2Uaw/credentials" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --http_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" > --quiet="false" --recovery_slave_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_store_timeout="100secs" --registry_strict="true" > --root_submissions="true" --slave_ping_timeout="15secs" > --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" > --webui_dir="/mesos/mesos-0.29.0/_inst/share/mesos/webui" > --work_dir="/tmp/DE2Uaw/master" --zk_session_timeout="10secs" > I0315 04:16:50.843168 2703 master.cpp:425] Master allowing unauthenticated > frameworks to register > I0315 04:16:50.843227 2703 master.cpp:428] Master only allowing > authenticated slaves to register > I0315 04:16:50.843302 2703 credentials.hpp:35] Loading credentials for > authentication from '/tmp/DE2Uaw/credentials' > I0315 04:16:50.843737 2703 master.cpp:468] Using default 'crammd5' > authenticator > I0315 04:16:50.843969 2703 master.cpp:537] Using default 'basic' HTTP > authenticator > I0315 04:16:50.844177 2703 master.cpp:571] Authorization enabled > I0315 04:16:50.844360 2708 hierarchical.cpp:144] Initialized hierarchical > allocator process > I0315 
04:16:50.844430 2708 whitelist_watcher.cpp:77] No whitelist given > I0315 04:16:50.848227 2703 master.cpp:1806] The newly elected leader is > master@172.17.0.4:39845 with id c326bc68-2581-48d4-9dc4-0d6f270bdda1 > I0315 04:16:50.848269 2703 master.cpp:1819] Elected as the leading master! > I0315 04:16:50.848292 2703 master.cpp:1508] Recovering from registrar > I0315 04:16:50.848563 2703 registrar.cpp:307] Recovering registrar > I0315 04:16:50.876277 2711 leveldb.cpp:304] Persisting metadata (8 bytes) to > leveldb took 34.178445ms > I0315 04:16:50.876365 2711 replica.cpp:320] Persisted replica status to > STARTING > I0315 04:16:50.876776 2711 recover.cpp:473] Replica is in STARTING status > I0315
[jira] [Updated] (MESOS-6237) Slave Sandbox inaccessible when using IPv6 address in patch from https://github.com/lava/mesos/tree/bennoe/ipv6
[ https://issues.apache.org/jira/browse/MESOS-6237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-6237: Summary: Slave Sandbox inaccessible when using IPv6 address in patch from https://github.com/lava/mesos/tree/bennoe/ipv6 (was: Agent Sandbox inaccessible when using IPv6 address in patch from https://github.com/lava/mesos/tree/bennoe/ipv6) > Slave Sandbox inaccessible when using IPv6 address in patch from > https://github.com/lava/mesos/tree/bennoe/ipv6 > --- > > Key: MESOS-6237 > URL: https://issues.apache.org/jira/browse/MESOS-6237 > Project: Mesos > Issue Type: Bug >Reporter: Lukas Loesche > > Affects https://github.com/lava/mesos/tree/bennoe/ipv6 at commit > 2199a24c0b7a782a0381aad8cceacbc95ec3d5c9 > When using IPs instead of hostnames the Agent Sandbox is inaccessible. The > problem seems to be that there are no brackets around the IP, so it tries to > access e.g. http://2001:41d0:1000:ab9:::5051 instead of > http://[2001:41d0:1000:ab9::]:5051 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
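The fix the ticket implies can be sketched in a few lines ({{hostPort}} is a hypothetical helper, not the patched Mesos code): a literal IPv6 address contains ':', which collides with the host:port separator, so RFC 3986 requires wrapping it in brackets when composing a URL authority.

```cpp
#include <string>

// Sketch: wrap literal IPv6 addresses in brackets when building host:port.
// Heuristic: any ':' in the host means it is an IPv6 literal, since
// hostnames and IPv4 literals never contain one.
std::string hostPort(const std::string& host, int port)
{
  const bool ipv6 = host.find(':') != std::string::npos;
  return (ipv6 ? "[" + host + "]" : host) + ":" + std::to_string(port);
}
```

With this, the sandbox URL from the report becomes http://[2001:41d0:1000:ab9::]:5051 instead of the ambiguous unbracketed form.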
[jira] [Commented] (MESOS-6122) Mesos slave throws systemd errors even when passed a flag to disable systemd
[ https://issues.apache.org/jira/browse/MESOS-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15467888#comment-15467888 ] Joris Van Remoortere commented on MESOS-6122: - These are just logging statements. The rest of the mesos code will execute just the same. All this patch will do is remove those logging lines. I appreciate that may be all you want, just want to be sure there are no other issues :-) > Mesos slave throws systemd errors even when passed a flag to disable systemd > > > Key: MESOS-6122 > URL: https://issues.apache.org/jira/browse/MESOS-6122 > Project: Mesos > Issue Type: Bug > Components: slave >Affects Versions: 1.0.1 >Reporter: Gennady Feldman >Assignee: Jie Yu > Fix For: 1.1.0, 1.0.2 > > > Seems like the code in slave/main.cpp is logically in the wrong order: > #ifdef __linux__ > // Initialize systemd if it exists. > if (systemd::exists() && flags.systemd_enable_support) { > Lines 339-341: > https://github.com/apache/mesos/blob/master/src/slave/main.cpp#L341 > The flags should come first before the systemd::exists() check runs.Currently > the systemd.exists() always runs and there's no way to disable that check > from running in mesos-slave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6122) Mesos slave throws systemd errors even when passed a flag to disable systemd
[ https://issues.apache.org/jira/browse/MESOS-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15467807#comment-15467807 ] Joris Van Remoortere commented on MESOS-6122: - The point of the code is to check if systemd exists. It should never error out, just return {{true}} / {{false}}. Can you please provide the error that you are encountering? > Mesos slave throws systemd errors even when passed a flag to disable systemd > > > Key: MESOS-6122 > URL: https://issues.apache.org/jira/browse/MESOS-6122 > Project: Mesos > Issue Type: Bug > Components: slave >Affects Versions: 1.0.1 >Reporter: Gennady Feldman >Assignee: Jie Yu > Fix For: 1.1.0, 1.0.2 > > > Seems like the code in slave/main.cpp is logically in the wrong order: > #ifdef __linux__ > // Initialize systemd if it exists. > if (systemd::exists() && flags.systemd_enable_support) { > Lines 339-341: > https://github.com/apache/mesos/blob/master/src/slave/main.cpp#L341 > The flags should come first before the systemd::exists() check runs.Currently > the systemd.exists() always runs and there's no way to disable that check > from running in mesos-slave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6122) Mesos slave throws systemd errors even when passed a flag to disable systemd
[ https://issues.apache.org/jira/browse/MESOS-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15467768#comment-15467768 ] Joris Van Remoortere commented on MESOS-6122: - [~jieyu] This change looks ok. [~gena01] Can you please provide logs for the errors you ran into? I don't understand how the logical order evaluation here is a {{bug}} unless you are running into an error during the {{exists}} check. If so can you please augment this ticket with that information? At this point all we are doing is masking that problem. Otherwise this is purely an optimization. > Mesos slave throws systemd errors even when passed a flag to disable systemd > > > Key: MESOS-6122 > URL: https://issues.apache.org/jira/browse/MESOS-6122 > Project: Mesos > Issue Type: Bug > Components: slave >Affects Versions: 1.0.1 >Reporter: Gennady Feldman >Assignee: Jie Yu > Fix For: 1.1.0, 1.0.2 > > > Seems like the code in slave/main.cpp is logically in the wrong order: > #ifdef __linux__ > // Initialize systemd if it exists. > if (systemd::exists() && flags.systemd_enable_support) { > Lines 339-341: > https://github.com/apache/mesos/blob/master/src/slave/main.cpp#L341 > The flags should come first before the systemd::exists() check runs.Currently > the systemd.exists() always runs and there's no way to disable that check > from running in mesos-slave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
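The reordering the reporter asks for boils down to short-circuit evaluation. A minimal sketch (names mirror the ticket, not the exact Mesos API; {{systemdExists}} stands in for {{systemd::exists()}}, which is what emits the log lines):

```cpp
static int existsCalls = 0;

// Stand-in for systemd::exists(), which logs while probing the system.
bool systemdExists()
{
  ++existsCalls;
  return true;
}

// Flag first: && short-circuits, so the probe (and its log output) never
// runs when systemd support is disabled.
bool shouldInitializeSystemd(bool systemdEnableSupport)
{
  return systemdEnableSupport && systemdExists();
}
```

This matches the comments above: the behavior is identical either way; only whether the probe (and its logging) executes changes.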
[jira] [Commented] (MESOS-1474) Provide cluster maintenance primitives for operators.
[ https://issues.apache.org/jira/browse/MESOS-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435880#comment-15435880 ] Joris Van Remoortere commented on MESOS-1474: - To help clarify: The new offers have an explicit unavailability in them that indicates how long the agent will still be up. New tasks scheduled there should be able to complete prior to that time point. > Provide cluster maintenance primitives for operators. > - > > Key: MESOS-1474 > URL: https://issues.apache.org/jira/browse/MESOS-1474 > Project: Mesos > Issue Type: Epic > Components: framework, master, slave >Reporter: Benjamin Mahler >Assignee: Artem Harutyunyan > Labels: mesosphere, twitter > > Sometimes operators need to perform maintenance on a mesos cluster; we define > maintenance here as anything that requires the tasks to be drained on the > slave(s). Most mesos upgrades can be done without affecting running tasks, > but there are situations where maintenance is task-affecting: > * Host maintenance (e.g. hardware repair, kernel upgrades). > * Non-recoverable slave upgrades (e.g. adjusting slave attributes). > * etc > In order to ensure operators don’t violate frameworks’ SLAs, schedulers need > to be aware of planned unavailability events. > Maintenance awareness allows schedulers to avoid churn for long running tasks > by placing them on machines not undergoing maintenance. If all resources are > planned for maintenance, then the scheduler will prefer machines scheduled > for maintenance least imminently. > Maintenance awareness is also crucial when a scheduler uses [persistent > disk|https://issues.apache.org/jira/browse/MESOS-1554] resources, to ensure > that the scheduler is aware of the expected duration of unavailability for a > persistent disk resource (e.g. using 3 1TB replicas, don’t need to replicate > 1TB over the network when only 1 of the 3 replicas is going to be unavailable > for a reboot (< 1 hour)). 
> There are a few primitives of interest here: > * Provide a way for operators to [fully shutdown a > slave|https://issues.apache.org/jira/browse/MESOS-1475] (killing all tasks > underneath it). Colloquially known as a "hard drain". > * Provide a way for operators to mark specific slaves as scheduled for > maintenance. This will inform the scheduler about the scheduled > unavailability of the resources. > * Provide a way for frameworks to be notified when resources are requested to > be relinquished. This gives the framework the opportunity to proactively move a task before > it may be forcibly killed by an operator. It also allows the automation of > operations like: "please drain these slaves within 1 hour." > See the [design > doc|https://docs.google.com/a/twitter.com/document/d/16k0lVwpSGVOyxPSyXKmGC-gbNmRlisNEe4p-fAUSojk/edit#] > for the latest details. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
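The "new offers have an explicit unavailability" behavior described above can be sketched as a scheduler-side filter: drop offers whose maintenance window would cut a task short, and otherwise prefer agents whose maintenance is least imminent. This is a minimal Python illustration, not Mesos's actual C++ allocator or scheduler API; the `Offer` shape and `prefer_least_imminent` helper are hypothetical names invented for the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Offer:
    agent: str
    # Seconds-since-epoch when the agent becomes unavailable;
    # None means no maintenance is scheduled (hypothetical field).
    unavailability_start: Optional[float]

def prefer_least_imminent(offers: List[Offer], now: float,
                          task_duration: float) -> List[Offer]:
    """Keep offers where the task can finish before the maintenance
    window starts; rank agents without scheduled maintenance first,
    then by least imminent unavailability."""
    viable = [o for o in offers
              if o.unavailability_start is None
              or o.unavailability_start - now >= task_duration]
    # False (no maintenance) sorts before True; among agents with
    # maintenance, a later start (larger value) sorts earlier.
    return sorted(viable,
                  key=lambda o: (o.unavailability_start is not None,
                                 -(o.unavailability_start or 0)))

offers = [Offer("a", 10.0), Offer("b", None), Offer("c", 100.0)]
ranked = prefer_least_imminent(offers, now=0.0, task_duration=20.0)
```

Here agent "a" is filtered out (its window starts before a 20-second task could finish), and "b" (no maintenance) ranks ahead of "c".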
[jira] [Commented] (MESOS-4694) DRFAllocator takes very long to allocate resources with a large number of frameworks
[ https://issues.apache.org/jira/browse/MESOS-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15408562#comment-15408562 ] Joris Van Remoortere commented on MESOS-4694: - {code} commit e859d3ae8d8ff7349327b9e6a89edd6f98d2b7a1 Author: Dario Rexin Date: Thu Aug 4 17:12:10 2016 -0400 Removed frameworks with suppressed offers from DRFSorter. This patch removes frameworks with suppressed offers from the sorter to reduce time spent in sorting. The allocations will remain in the sorter, so no data is lost and the numbers are still correct. When a framework revives offers, it will be re-added to the sorter. Review: https://reviews.apache.org/r/43666/ {code} > DRFAllocator takes very long to allocate resources with a large number of > frameworks > > > Key: MESOS-4694 > URL: https://issues.apache.org/jira/browse/MESOS-4694 > Project: Mesos > Issue Type: Improvement > Components: allocation >Affects Versions: 0.26.0, 0.27.0, 0.27.1, 0.27.2, 0.28.0, 0.28.1 >Reporter: Dario Rexin >Assignee: Dario Rexin > > With a growing number of connected frameworks, the allocation time grows to > very high numbers. The addition of quota in 0.27 had an additional impact on > these numbers. Running `mesos-tests.sh --benchmark > --gtest_filter=HierarchicalAllocator_BENCHMARK_Test.DeclineOffers` gives us > the following numbers: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 200 frameworks > round 0 allocate took 2.921202secs to make 200 offers > round 1 allocate took 2.85045secs to make 200 offers > round 2 allocate took 2.823768secs to make 200 offers > {noformat} > Increasing the number of frameworks to 2000: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. 
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 2000 frameworks > round 0 allocate took 28.209454secs to make 2000 offers > round 1 allocate took 28.469419secs to make 2000 offers > round 2 allocate took 28.138086secs to make 2000 offers > {noformat} > I was able to reduce this time by a substantial amount. After applying the > patches: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 200 frameworks > round 0 allocate took 1.016226secs to make 2000 offers > round 1 allocate took 1.102729secs to make 2000 offers > round 2 allocate took 1.102624secs to make 2000 offers > {noformat} > And with 2000 frameworks: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 2000 frameworks > round 0 allocate took 12.563203secs to make 2000 offers > round 1 allocate took 12.437517secs to make 2000 offers > round 2 allocate took 12.470708secs to make 2000 offers > {noformat} > The patches do 3 things to improve the performance of the allocator. 
> 1) The total values in the DRFSorter will be pre-calculated per resource type > 2) In the allocate method, when no resources are available to allocate, we > break out of the innermost loop to prevent looping over a large number of > frameworks when we have nothing to allocate > 3) When a framework suppresses offers, we remove it from the sorter instead > of just calling continue in the allocation loop - this greatly improves > performance in the sorter and prevents looping over frameworks that don't > need resources > Assuming that most of the frameworks behave nicely and suppress offers when > they have nothing to schedule, it is fair to assume that point 3) has the > biggest impact on the performance. If we suppress offers for 90% of the > frameworks in the benchmark test, we see the following numbers: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 200 slaves and 2000 frameworks > round 0 allocate took 11626us to make 200 offers > round 1 allocate took 22890us to make 200 offers > round 2 allocate took 21346us to make 200 offers > {noformat} > And for
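The key idea of point 3) above, and of the committed patch (allocations stay in the sorter so no data is lost, only the sort set shrinks), can be sketched with a toy sorter. This is an illustrative Python model, not Mesos's C++ `DRFSorter`; the class and method names are invented for the sketch.

```python
class ToySorter:
    """Toy model of the MESOS-4694 optimization: suppressed frameworks
    drop out of the set the allocation loop iterates over, while their
    allocation bookkeeping is preserved for when they revive."""

    def __init__(self):
        self.allocations = {}  # framework -> allocated units (always kept)
        self.active = set()    # frameworks currently visited by allocate()

    def add(self, framework):
        self.allocations.setdefault(framework, 0)
        self.active.add(framework)

    def allocated(self, framework, amount):
        # Bookkeeping continues even for suppressed frameworks.
        self.allocations[framework] += amount

    def suppress(self, framework):
        # Remove from the sort set only; allocations survive, so the
        # numbers are still correct when the framework comes back.
        self.active.discard(framework)

    def revive(self, framework):
        self.active.add(framework)

    def sort(self):
        # DRF-style: least-allocated active framework first. Suppressed
        # frameworks cost nothing here, which is the whole speedup.
        return sorted(self.active, key=lambda f: self.allocations[f])
```

With 90% of frameworks suppressed, `sort()` touches only the remaining 10%, which mirrors the benchmark improvement quoted above.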
[jira] [Comment Edited] (MESOS-5983) Number of libprocess worker threads is not configurable for log-rotation module.
[ https://issues.apache.org/jira/browse/MESOS-5983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406594#comment-15406594 ] Joris Van Remoortere edited comment on MESOS-5983 at 8/3/16 8:55 PM: - https://reviews.apache.org/r/50766/ was (Author: jvanremoortere): https://github.com/dcos/dcos/pull/483 depends on https://reviews.apache.org/r/50766/ > Number of libprocess worker threads is not configurable for log-rotation > module. > > > Key: MESOS-5983 > URL: https://issues.apache.org/jira/browse/MESOS-5983 > Project: Mesos > Issue Type: Improvement > Components: containerization >Affects Versions: 1.0.0 >Reporter: Joris Van Remoortere >Assignee: Joris Van Remoortere > Labels: mesosphere > Fix For: 1.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5983) Number of libprocess worker threads is not configurable for log-rotation module.
[ https://issues.apache.org/jira/browse/MESOS-5983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406594#comment-15406594 ] Joris Van Remoortere commented on MESOS-5983: - https://github.com/dcos/dcos/pull/483 depends on https://reviews.apache.org/r/50766/ > Number of libprocess worker threads is not configurable for log-rotation > module. > > > Key: MESOS-5983 > URL: https://issues.apache.org/jira/browse/MESOS-5983 > Project: Mesos > Issue Type: Improvement > Components: containerization >Affects Versions: 1.0.0 >Reporter: Joris Van Remoortere >Assignee: Joris Van Remoortere > Labels: mesosphere > Fix For: 1.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5983) Number of libprocess worker threads is not configurable for log-rotation module.
[ https://issues.apache.org/jira/browse/MESOS-5983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-5983: Description: (was: https://github.com/dcos/dcos/pull/483) > Number of libprocess worker threads is not configurable for log-rotation > module. > > > Key: MESOS-5983 > URL: https://issues.apache.org/jira/browse/MESOS-5983 > Project: Mesos > Issue Type: Improvement > Components: containerization >Affects Versions: 1.0.0 >Reporter: Joris Van Remoortere >Assignee: Joris Van Remoortere > Labels: mesosphere > Fix For: 1.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5983) Number of libprocess worker threads is not configurable for log-rotation module.
Joris Van Remoortere created MESOS-5983: --- Summary: Number of libprocess worker threads is not configurable for log-rotation module. Key: MESOS-5983 URL: https://issues.apache.org/jira/browse/MESOS-5983 Project: Mesos Issue Type: Improvement Components: containerization Affects Versions: 1.0.0 Reporter: Joris Van Remoortere Assignee: Joris Van Remoortere Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-5943) Incremental http parsing of URLs leads to decoder error
[ https://issues.apache.org/jira/browse/MESOS-5943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404647#comment-15404647 ] Joris Van Remoortere edited comment on MESOS-5943 at 8/2/16 9:05 PM: - {code} commit 2776a09cbcd836080241a5ad8c1e003984e5a146 Author: Joris Van Remoortere Date: Sat Jul 30 12:58:28 2016 -0700 Libprocess: Fixed decoder to support incremental URL parsing. Review: https://reviews.apache.org/r/50634 {code} was (Author: jvanremoortere): {code} commit f291d5023e9f2e471c11d4f20590901d9bfc1de4 Author: Joris Van Remoortere Date: Mon Aug 1 17:14:37 2016 -0700 Libprocess: Removed old http_parser code. We remove the code that supported the `HTTP_PARSER_VERSION_MAJOR` < 2 path. Review: https://reviews.apache.org/r/50683 commit 2776a09cbcd836080241a5ad8c1e003984e5a146 Author: Joris Van Remoortere Date: Sat Jul 30 12:58:28 2016 -0700 Libprocess: Fixed decoder to support incremental URL parsing. Review: https://reviews.apache.org/r/50634 {code} > Incremental http parsing of URLs leads to decoder error > --- > > Key: MESOS-5943 > URL: https://issues.apache.org/jira/browse/MESOS-5943 > Project: Mesos > Issue Type: Bug > Components: libprocess, scheduler driver >Affects Versions: 1.0.0 >Reporter: Joris Van Remoortere >Assignee: Joris Van Remoortere >Priority: Blocker > Labels: mesosphere > Fix For: 0.28.3, 1.0.1, 0.27.4 > > > When requests arrive to the decoder in pieces (e.g. {{mes}} followed by a > separate chunk of {{os.apache.org}}) the http parser is not able to handle > this case if the split is within the URL component. > This causes the decoder to error out, and can lead to connection invalidation. > The scheduler driver is susceptible to this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5970) Remove HTTP_PARSER_VERSION_MAJOR < 2 code in decoder.
Joris Van Remoortere created MESOS-5970: --- Summary: Remove HTTP_PARSER_VERSION_MAJOR < 2 code in decoder. Key: MESOS-5970 URL: https://issues.apache.org/jira/browse/MESOS-5970 Project: Mesos Issue Type: Task Components: libprocess Reporter: Joris Van Remoortere Assignee: Joris Van Remoortere Fix For: 1.0.1, 1.1.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-5943) Incremental http parsing of URLs leads to decoder error
[ https://issues.apache.org/jira/browse/MESOS-5943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404647#comment-15404647 ] Joris Van Remoortere edited comment on MESOS-5943 at 8/2/16 9:03 PM: - {code} commit f291d5023e9f2e471c11d4f20590901d9bfc1de4 Author: Joris Van Remoortere Date: Mon Aug 1 17:14:37 2016 -0700 Libprocess: Removed old http_parser code. We remove the code that supported the `HTTP_PARSER_VERSION_MAJOR` < 2 path. Review: https://reviews.apache.org/r/50683 commit 2776a09cbcd836080241a5ad8c1e003984e5a146 Author: Joris Van Remoortere Date: Sat Jul 30 12:58:28 2016 -0700 Libprocess: Fixed decoder to support incremental URL parsing. Review: https://reviews.apache.org/r/50634 {code} was (Author: jvanremoortere): {code} commit f291d5023e9f2e471c11d4f20590901d9bfc1de4 Author: Joris Van Remoortere Date: Mon Aug 1 17:14:37 2016 -0700 Libprocess: Removed old http_parser code. We remove the code that supported the `HTTP_PARSER_VERSION_MAJOR` < 2 path. Review: https://reviews.apache.org/r/50683 commit 2776a09cbcd836080241a5ad8c1e003984e5a146 Author: Joris Van Remoortere Date: Sat Jul 30 12:58:28 2016 -0700 Libprocess: Fixed decoder to support incremental URL parsing. Review: https://reviews.apache.org/r/50634 {code} > Incremental http parsing of URLs leads to decoder error > --- > > Key: MESOS-5943 > URL: https://issues.apache.org/jira/browse/MESOS-5943 > Project: Mesos > Issue Type: Bug > Components: libprocess, scheduler driver >Affects Versions: 1.0.0 >Reporter: Joris Van Remoortere >Assignee: Joris Van Remoortere >Priority: Blocker > Labels: mesosphere > Fix For: 0.28.3, 1.0.1, 0.27.4 > > > When requests arrive to the decoder in pieces (e.g. {{mes}} followed by a > separate chunk of {{os.apache.org}}) the http parser is not able to handle > this case if the split is within the URL component. > This causes the decoder to error out, and can lead to connection invalidation. > The scheduler driver is susceptible to this. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5943) Incremental http parsing of URLs leads to decoder error
[ https://issues.apache.org/jira/browse/MESOS-5943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-5943: Fix Version/s: 0.27.4 0.28.3 > Incremental http parsing of URLs leads to decoder error > --- > > Key: MESOS-5943 > URL: https://issues.apache.org/jira/browse/MESOS-5943 > Project: Mesos > Issue Type: Bug > Components: libprocess, scheduler driver >Affects Versions: 1.0.0 >Reporter: Joris Van Remoortere >Assignee: Joris Van Remoortere >Priority: Blocker > Labels: mesosphere > Fix For: 0.28.3, 1.0.1, 0.27.4 > > > When requests arrive to the decoder in pieces (e.g. {{mes}} followed by a > separate chunk of {{os.apache.org}}) the http parser is not able to handle > this case if the split is within the URL component. > This causes the decoder to error out, and can lead to connection invalidation. > The scheduler driver is susceptible to this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5944) Remove `O_SYNC` from StatusUpdateManager logs
Joris Van Remoortere created MESOS-5944: --- Summary: Remove `O_SYNC` from StatusUpdateManager logs Key: MESOS-5944 URL: https://issues.apache.org/jira/browse/MESOS-5944 Project: Mesos Issue Type: Improvement Components: slave Reporter: Joris Van Remoortere Assignee: Joris Van Remoortere Fix For: 1.1.0 Currently the {{StatusUpdateManager}} uses {{O_SYNC}} to flush status updates to disk. We don't need to use {{O_SYNC}} because we only read this file if the host did not crash. {{os::write}} success implies the kernel will have flushed our data to the page cache. This is sufficient for the recovery scenarios we use this data for. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
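The distinction MESOS-5944 relies on, that a successful `write()` without `O_SYNC` leaves data in the kernel page cache (surviving a process crash but not a host crash), can be shown with a small sketch. This is a generic POSIX illustration in Python, not the `StatusUpdateManager` code itself.

```python
import os
import tempfile

path = tempfile.mkstemp()[1]

# With O_SYNC, every write would block until the data reaches stable
# storage (hypothetically: os.open(path, os.O_WRONLY | os.O_SYNC)).
# Without it, a successful write() only guarantees the data is in the
# kernel page cache -- which is enough for recovery paths that are only
# exercised when the process, not the host, crashed.
fd = os.open(path, os.O_WRONLY)
payload = b"status update"
written = os.write(fd, payload)  # success => data is in the page cache
os.close(fd)

# Another process (e.g. a recovering agent) can read it back immediately,
# served from the page cache, with no fsync/O_SYNC cost on the write path.
with open(path, "rb") as f:
    data = f.read()
os.remove(path)
```

The trade-off stated in the ticket is exactly this: the data is lost only if the whole host goes down before writeback, a case the recovery logic does not depend on.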
[jira] [Created] (MESOS-5943) Incremental http parsing of URLs leads to decoder error
Joris Van Remoortere created MESOS-5943: --- Summary: Incremental http parsing of URLs leads to decoder error Key: MESOS-5943 URL: https://issues.apache.org/jira/browse/MESOS-5943 Project: Mesos Issue Type: Bug Components: libprocess, scheduler driver Affects Versions: 1.0.0 Reporter: Joris Van Remoortere Assignee: Joris Van Remoortere Priority: Blocker Fix For: 1.0.1 When requests arrive to the decoder in pieces (e.g. {{mes}} followed by a separate chunk of {{os.apache.org}}) the http parser is not able to handle this case if the split is within the URL component. This causes the decoder to error out, and can lead to connection invalidation. The scheduler driver is susceptible to this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
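The failure mode described in MESOS-5943 (a URL split across TCP chunks, e.g. {{mes}} then {{os.apache.org}}) is a classic incremental-parsing problem: the fix in r/50634 made the libprocess decoder accumulate URL fragments across `feed` calls rather than assume the URL arrives whole. The toy decoder below illustrates that idea only; it is not the libprocess decoder, and the class and method names are invented.

```python
class ToyRequestDecoder:
    """Toy incremental decoder: buffers bytes until a complete request
    line is available, so a URL split across chunks parses correctly."""

    def __init__(self):
        self.buffer = b""
        self.url = None

    def feed(self, chunk: bytes):
        """Consume one chunk; return the parsed URL once complete,
        or None while more bytes are still needed."""
        self.buffer += chunk
        if self.url is None and b"\r\n" in self.buffer:
            line, _, _ = self.buffer.partition(b"\r\n")
            # Request line: METHOD SP URL SP VERSION
            method, url, version = line.split(b" ")
            self.url = url.decode()
        return self.url
```

A decoder that instead parsed each chunk in isolation would error out on the first partial chunk, which is the connection-invalidating behavior the ticket reports.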
[jira] [Updated] (MESOS-5425) Consider using IntervalSet for Port range resource math
[ https://issues.apache.org/jira/browse/MESOS-5425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-5425: Assignee: Yanyan Hu > Consider using IntervalSet for Port range resource math > --- > > Key: MESOS-5425 > URL: https://issues.apache.org/jira/browse/MESOS-5425 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Joseph Wu >Assignee: Yanyan Hu > Labels: mesosphere > Attachments: graycol.gif > > > Follow-up JIRA for comments raised in MESOS-3051 (see comments there). > We should consider utilizing > [{{IntervalSet}}|https://github.com/apache/mesos/blob/a0b798d2fac39445ce0545cfaf05a682cd393abe/3rdparty/stout/include/stout/interval.hpp] > in [Port range resource > math|https://github.com/apache/mesos/blob/a0b798d2fac39445ce0545cfaf05a682cd393abe/src/common/values.cpp#L143]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
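The core of the interval-set math proposed in MESOS-5425 is coalescing port ranges into a canonical sorted, non-overlapping form, so that range addition and comparison stay cheap. A minimal Python sketch of that coalescing step (not stout's C++ `IntervalSet`, and `merge_ranges` is a name invented here):

```python
def merge_ranges(ranges):
    """Coalesce closed integer ranges [lo, hi] into sorted,
    non-overlapping form, merging overlapping or adjacent ranges,
    e.g. [31000-31005] and [31003-31010] become [31000-31010]."""
    merged = []
    for lo, hi in sorted(ranges):
        # Adjacent counts too: [1-2] + [3-4] is the contiguous [1-4].
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged
```

Keeping port resources in this canonical form is what makes the "resource math" (union, subtraction, containment checks) linear in the number of ranges instead of quadratic in ad-hoc comparisons.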
[jira] [Commented] (MESOS-5545) Add rack awareness support for Mesos resources
[ https://issues.apache.org/jira/browse/MESOS-5545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15320739#comment-15320739 ] Joris Van Remoortere commented on MESOS-5545: - [~fan.du] I would like to; however, this is currently not high enough on my priority list. I'm passionate about this subject, which is why I've brought it up before :-) We should see in the community meeting if there is some consensus on a timeline. If the automation aspect is what is most important to you, then I would focus on a good interface between Mesos and the modules / tools you want to build to source the information. We likely won't get much traction dragging specific strategies into the Mesos project. Rather, we should take the approach of ensuring the interfaces / primitives work well for a variety of strategies and tools. > Add rack awareness support for Mesos resources > -- > > Key: MESOS-5545 > URL: https://issues.apache.org/jira/browse/MESOS-5545 > Project: Mesos > Issue Type: Story > Components: hadoop, master >Reporter: Fan Du > Attachments: RackAwarenessforMesos-Lite.pdf > > > Resources managed by Mesos master have no topology information of the > cluster, for example, rack topology. While lots of data center applications > have rack awareness feature to provide data locality, fault tolerance and > intelligent task placement. This ticket tries to investigate how to add rack > awareness for Mesos resources topology. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-5545) Add rack awareness support for Mesos resources
[ https://issues.apache.org/jira/browse/MESOS-5545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318745#comment-15318745 ] Joris Van Remoortere edited comment on MESOS-5545 at 6/7/16 3:58 PM: - Hi [~fan.du]. Thanks for raising this topic and working on a design doc. This topic has been discussed a few times before, although mostly during casual conversation. It's great that you've captured and documented some ideas. I would suggest that the next steps involve: 1. Raising this at the community sync to: - Get a sense of timeline. - Find a shepherd. 2. Iterate on the design with the shepherd and a working group. 3. Validate the design with a large user base. This is critical for a component change like this. 4. Then we can get to the patches. The immediate feedback I can give is: - Although a very fun and interesting project, we haven't gotten enough interest to follow through as of yet. I would focus the most on getting this prioritized on the roadmap. - Mesos is about primitives. Your design doc mixes primitives (great) with some implementation / configuration bias (LLDP). I would work on partitioning general fault domain awareness (Mesos) from assigning of the attributes (Operator / automation). - Take a step back and consider what other information we may want to associate with fault domains in the future. Is there a structure that is more resilient to augmentation in the future than an {{optional rack_id}}? - How should schedulers use this information, and what actions may they take based upon it. Have we thought out all the actions, and whether they would require changes to Mesos? - You should clarify whether these attributes are expected to change over the life-time of an agent. For example, currently we don't allow resources or IPs to change. If this were also true for fault domain attributes, it would simplify the implementation. 
If you feel that dynamic attributes are necessary, then I would urge you to make that a phase 2 project and first work with the community to agree on a common pattern for updating any attributes on the agent, and how to surface consequential changes to both tasks and frameworks. (You may see why I suggest static to begin with ;-) ) was (Author: jvanremoortere): Hi [~fan.du]. Thanks for raising this topic and working on a design doc. This topic has been discussed a few times before, although mostly during casual conversation. It's great that you've captured and documented some ideas. I would suggest that the next steps involve: 1. Raising this at the community sync to: A. Get a sense of timeline. B. Find a shepherd. 2. Iterate on the design with the shepherd and a working group. 3. Validate the design with a large user base. This is critical for a component change like this. 4. Then we can get to the patches. The immediate feedback I can give is: - Although a very fun and interesting project, we haven't gotten enough interest to follow through as of yet. I would focus the most on getting this prioritized on the roadmap. - Mesos is about primitives. Your design doc mixes primitives (great) with some implementation / configuration bias (LLDP). I would work on partitioning general fault domain awareness (Mesos) from assigning of the attributes (Operator / automation). - Take a step back and consider what other information we may want to associate with fault domains in the future. Is there a structure that is more resilient to augmentation in the future than an {{optional rack_id}}? - How should schedulers use this information, and what actions may they take based upon it. Have we thought out all the actions, and whether they would require changes to Mesos? - You should clarify whether these attributes are expected to change over the life-time of an agent. For example, currently we don't allow resources or IPs to change. 
If this were also true for fault domain attributes, it would simplify the implementation. If you feel that dynamic attributes are necessary, then I would urge you to make that a phase 2 project and first work with the community to agree on a common pattern for updating any attributes on the agent, and how to surface consequential changes to both tasks and frameworks. (You may see why I suggest static to begin with ;-) ) > Add rack awareness support for Mesos resources > -- > > Key: MESOS-5545 > URL: https://issues.apache.org/jira/browse/MESOS-5545 > Project: Mesos > Issue Type: Story > Components: hadoop, master >Reporter: Fan Du > Attachments: RackAwarenessforMesos-Lite.pdf > > > Resources managed by Mesos master have no topology information of the > cluster, for example, rack topology. While lots of data center applications > have rack awareness feature to provide data locality,
[jira] [Commented] (MESOS-5545) Add rack awareness support for Mesos resources
[ https://issues.apache.org/jira/browse/MESOS-5545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318745#comment-15318745 ] Joris Van Remoortere commented on MESOS-5545: - Hi [~fan.du]. Thanks for raising this topic and working on a design doc. This topic has been discussed a few times before, although mostly during casual conversation. It's great that you've captured and documented some ideas. I would suggest that the next steps involve: 1. Raising this at the community sync to: A. Get a sense of timeline. B. Find a shepherd. 2. Iterate on the design with the shepherd and a working group. 3. Validate the design with a large user base. This is critical for a component change like this. 4. Then we can get to the patches. The immediate feedback I can give is: - Although a very fun and interesting project, we haven't gotten enough interest to follow through as of yet. I would focus the most on getting this prioritized on the roadmap. - Mesos is about primitives. Your design doc mixes primitives (great) with some implementation / configuration bias (LLDP). I would work on partitioning general fault domain awareness (Mesos) from assigning of the attributes (Operator / automation). - Take a step back and consider what other information we may want to associate with fault domains in the future. Is there a structure that is more resilient to augmentation in the future than an {{optional rack_id}}? - How should schedulers use this information, and what actions may they take based upon it. Have we thought out all the actions, and whether they would require changes to Mesos? - You should clarify whether these attributes are expected to change over the life-time of an agent. For example, currently we don't allow resources or IPs to change. If this were also true for fault domain attributes, it would simplify the implementation. 
If you feel that dynamic attributes are necessary, then I would urge you to make that a phase 2 project and first work with the community to agree on a common pattern for updating any attributes on the agent, and how to surface consequential changes to both tasks and frameworks. (You may see why I suggest static to begin with ;-) ) > Add rack awareness support for Mesos resources > -- > > Key: MESOS-5545 > URL: https://issues.apache.org/jira/browse/MESOS-5545 > Project: Mesos > Issue Type: Story > Components: hadoop, master >Reporter: Fan Du > Attachments: RackAwarenessforMesos-Lite.pdf > > > Resources managed by Mesos master have no topology information of the > cluster, for example, rack topology. While lots of data center applications > have rack awareness feature to provide data locality, fault tolerance and > intelligent task placement. This ticket tries to investigate how to add rack > awareness for Mesos resources topology. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-5545) Add rack awareness support for Mesos resources
[ https://issues.apache.org/jira/browse/MESOS-5545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318745#comment-15318745 ] Joris Van Remoortere edited comment on MESOS-5545 at 6/7/16 3:58 PM: - Hi [~fan.du]. Thanks for raising this topic and working on a design doc. This topic has been discussed a few times before, although mostly during casual conversation. It's great that you've captured and documented some ideas. I would suggest that the next steps involve: 1. Raising this at the community sync to: - Get a sense of timeline. - Find a shepherd. 2. Iterate on the design with the shepherd and a working group. 3. Validate the design with a large user base. This is critical for a component change like this. 4. Then we can get to the patches. The immediate feedback I can give is: - Although a very fun and interesting project, we haven't gotten enough interest to follow through as of yet. I would focus the most on getting this prioritized on the roadmap. - Mesos is about primitives. Your design doc mixes primitives (great) with some implementation / configuration bias (LLDP). I would work on partitioning general fault domain awareness (Mesos) from assigning of the attributes (Operator / automation). - Take a step back and consider what other information we may want to associate with fault domains in the future. Is there a structure that is more resilient to augmentation in the future than an {{optional rack_id}}? - How should schedulers use this information, and what actions may they take based upon it. Have we thought out all the actions, and whether they would require changes to Mesos? - You should clarify whether these attributes are expected to change over the life-time of an agent. For example, currently we don't allow resources or IPs to change. If this were also true for fault domain attributes, it would simplify the implementation. 
If you feel that dynamic attributes are necessary, then I would urge you to make that a phase 2 project and first work with the community to agree on a common pattern for updating any attributes on the agent, and how to surface consequential changes to both tasks and frameworks. (You may see why I suggest static to begin with ;-) ) was (Author: jvanremoortere): Hi [~fan.du]. Thanks for raising this topic and working on a design doc. This topic has been discussed a few times before, although mostly during casual conversation. It's great that you've captured and documented some ideas. I would suggest that the next steps involve: 1. Raising this at the community sync to: - Get a sense of timeline. - Find a shepherd. 2. Iterate on the design with the shepherd and a working group. 3. Validate the design with a large user base. This is critical for a component change like this. 4. Then we can get to the patches. The immediate feedback I can give is: - Although a very fun and interesting project, we haven't gotten enough interest to follow through as of yet. I would focus the most on getting this prioritized on the roadmap. - Mesos is about primitives. Your design doc mixes primitives (great) with some implementation / configuration bias (LLDP). I would work on partitioning general fault domain awareness (Mesos) from assigning of the attributes (Operator / automation). - Take a step back and consider what other information we may want to associate with fault domains in the future. Is there a structure that is more resilient to augmentation in the future than an {{optional rack_id}}? - How should schedulers use this information, and what actions may they take based upon it. Have we thought out all the actions, and whether they would require changes to Mesos? - You should clarify whether these attributes are expected to change over the life-time of an agent. For example, currently we don't allow resources or IPs to change. 
If this were also true for fault domain attributes, it would simplify the implementation. If you feel that dynamic attributes are necessary, then I would urge you to make that a phase 2 project and first work with the community to agree on a common pattern for updating any attributes on the agent, and how to surface consequential changes to both tasks and frameworks. (You may see why I suggest static to begin with ;-) ) > Add rack awareness support for Mesos resources > -- > > Key: MESOS-5545 > URL: https://issues.apache.org/jira/browse/MESOS-5545 > Project: Mesos > Issue Type: Story > Components: hadoop, master >Reporter: Fan Du > Attachments: RackAwarenessforMesos-Lite.pdf > > > Resources managed by Mesos master have no topology information of the > cluster, for example, rack topology. While lots of data center applications > have rack awareness feature to provide data locality, fault
[jira] [Commented] (MESOS-5445) Allow libprocess/stout to build without first doing `make` in 3rdparty.
[ https://issues.apache.org/jira/browse/MESOS-5445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316681#comment-15316681 ] Joris Van Remoortere commented on MESOS-5445: - [~tillt] Great! Go for it :-) > Allow libprocess/stout to build without first doing `make` in 3rdparty. > --- > > Key: MESOS-5445 > URL: https://issues.apache.org/jira/browse/MESOS-5445 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: Kapil Arya >Assignee: Kapil Arya > Labels: mesosphere > Fix For: 1.0.0 > > > After the 3rdparty reorg, libprocess/stout are unable to build their > dependencies and so one has to do `make` in 3rdparty/ before building > libprocess/stout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5420) Implement os::exists for processes
[ https://issues.apache.org/jira/browse/MESOS-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-5420: Sprint: Mesosphere Sprint 36 > Implement os::exists for processes > -- > > Key: MESOS-5420 > URL: https://issues.apache.org/jira/browse/MESOS-5420 > Project: Mesos > Issue Type: Improvement > Environment: Windows >Reporter: Daniel Pravat >Assignee: Daniel Pravat > Labels: mesosphere > Fix For: 1.0.0 > > > os::exists returns true if the process identified by the parameter is still > running or was running and we are able to get information about it, such us > the exit code. In Windows after obtaining a handle to the process it is > possible perform those operations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3624) Port slave/containerizer/mesos/launch.cpp to Windows
[ https://issues.apache.org/jira/browse/MESOS-3624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-3624: Sprint: Mesosphere Sprint 36 > Port slave/containerizer/mesos/launch.cpp to Windows > > > Key: MESOS-3624 > URL: https://issues.apache.org/jira/browse/MESOS-3624 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, windows > Fix For: 1.0.0 > > > Important subset of the dependency tree follows: > slave/containerizer/mesos/launch.cpp: os, protobuf, launch > launch: subcommand > subcommand: flags > flags.hpp: os.hpp, path.hpp, fetch.hpp -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3639) Implement stout/os/windows/killtree.hpp
[ https://issues.apache.org/jira/browse/MESOS-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15307067#comment-15307067 ] Joris Van Remoortere commented on MESOS-3639: - {code} commit 563c9ff5b539dc2d4ce1ba987dec925045cef5b8 Author: Daniel Pravat Date: Mon May 30 18:02:24 2016 -0700 Windows: Enabled `JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE` on job objects. Review: https://reviews.apache.org/r/47442/ {code} > Implement stout/os/windows/killtree.hpp > --- > > Key: MESOS-3639 > URL: https://issues.apache.org/jira/browse/MESOS-3639 > Project: Mesos > Issue Type: Task > Components: stout >Reporter: Alex Clemmer >Assignee: Daniel Pravat > Labels: mesosphere, windows > Fix For: 0.29.0 > > > killtree() is implemented using Windows Job Objects. The processes created by > the executor are associated with a job object using `create_job`. killtree() > simply terminates the job object. > Helper functions: > The `create_job` function creates a job object whose name is derived from the > `pid` and associates the `pid` process with the job object. Every process > started by a process that is part of the job object itself becomes part of > the job object. The job name should match the name used in `kill_job`. The > jobs should be created with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE and allow the > caller to decide how to handle the returned handle. > The `kill_job` function assumes the process identified by `pid` is associated > with a job object whose name is derived from it. Every process started by a > process that is part of the job object itself becomes part of the job object. > Destroying the job object will close all such processes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5417) define WSTRINGIFY behaviour on Windows
[ https://issues.apache.org/jira/browse/MESOS-5417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15307063#comment-15307063 ] Joris Van Remoortere commented on MESOS-5417: - {code} commit ad3e161ac19ac32f5493e8b31bdef7b579c87177 Author: Daniel Pravat Date: Mon May 30 17:48:47 2016 -0700 Windows: Added logging for `WSTRINGIFY` calls. The return codes in Windows are not standardized. The function returns an empty string and logs a warning. Review: https://reviews.apache.org/r/47473/ {code} > define WSTRINGIFY behaviour on Windows > -- > > Key: MESOS-5417 > URL: https://issues.apache.org/jira/browse/MESOS-5417 > Project: Mesos > Issue Type: Improvement >Reporter: Daniel Pravat >Assignee: Daniel Pravat >Priority: Minor > Labels: windows > > Identify the proper behaviour of WSTRINGIFY to improve the logging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5375) Implement stout/os/windows/kill.hpp
[ https://issues.apache.org/jira/browse/MESOS-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-5375: Story Points: 5 > Implement stout/os/windows/kill.hpp > --- > > Key: MESOS-5375 > URL: https://issues.apache.org/jira/browse/MESOS-5375 > Project: Mesos > Issue Type: Improvement >Reporter: Daniel Pravat >Assignee: Daniel Pravat > Labels: mesosphere, windows > Fix For: 0.29.0 > > > Implement equivalent functionality on Windows -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-3639) Implement stout/os/windows/killtree.hpp
[ https://issues.apache.org/jira/browse/MESOS-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284805#comment-15284805 ] Joris Van Remoortere edited comment on MESOS-3639 at 5/16/16 4:38 PM: -- https://reviews.apache.org/r/47169/ was (Author: jvanremoortere): {code} commit 769701ce36f639224a4b6763e234d153d58b297e Author: Daniel Pravat Date: Mon May 16 12:20:37 2016 -0400 Windows: Stout: Implemented `killtree` using NT job objects. Review: https://reviews.apache.org/r/47169/ {code} > Implement stout/os/windows/killtree.hpp > --- > > Key: MESOS-3639 > URL: https://issues.apache.org/jira/browse/MESOS-3639 > Project: Mesos > Issue Type: Task > Components: stout >Reporter: Alex Clemmer >Assignee: Daniel Pravat > Labels: mesosphere, windows > Fix For: 0.29.0 > > > killtree() is implemented using Windows Job Objects. The processes created by > the executor are associated with a job object using `create_job`. killtree() > simply terminates the job object. > Helper functions: > The `create_job` function creates a job object whose name is derived from the > `pid` and associates the `pid` process with the job object. Every process > started by a process that is part of the job object itself becomes part of > the job object. The job name should match the name used in `kill_job`. > The `kill_job` function assumes the process identified by `pid` is associated > with a job object whose name is derived from it. Every process started by a > process that is part of the job object itself becomes part of the job object. > Destroying the job object will close all such processes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5371) Implement `fcntl.hpp`
[ https://issues.apache.org/jira/browse/MESOS-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282908#comment-15282908 ] Joris Van Remoortere commented on MESOS-5371: - {code} commit 4c6162d5e3535f4611e869e143c91454033dca2d Author: Alex Clemmer Date: Fri May 13 13:25:57 2016 -0400 Windows: Added stub implementations of `fcntl.hpp` functions. This commit introduces temporary versions of 2 important functions: `os::nonblock` and `os::cloexec`. We put them here in a placeholder commit so that reviewers can make progress on subprocess. In the immediate term, the plan is to figure out on a callsite-by-callsite basis how to work around the functionality of `os::cloexec`. When we collect more data, we will be in a better position to offer a way forward here. Review: https://reviews.apache.org/r/46392/ {code} > Implement `fcntl.hpp` > - > > Key: MESOS-5371 > URL: https://issues.apache.org/jira/browse/MESOS-5371 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, stout, windows-mvp > > `fcntl.hpp` has a bunch of functions that will never work on Windows. We will > need to work around them, either by working around specific call sites of > functions like `os::cloexec`, or by implementing something that keeps track > of which file descriptors are cloexec, and which aren't. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5379) Authentication documentation for libprocess endpoints can be misleading.
[ https://issues.apache.org/jira/browse/MESOS-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-5379: Assignee: (was: Joris Van Remoortere) > Authentication documentation for libprocess endpoints can be misleading. > > > Key: MESOS-5379 > URL: https://issues.apache.org/jira/browse/MESOS-5379 > Project: Mesos > Issue Type: Bug > Components: documentation, libprocess >Affects Versions: 0.29.0 >Reporter: Benjamin Bannier >Priority: Blocker > Labels: mesosphere, tech-debt > Fix For: 0.29.0 > > > Libprocess exposes a number of endpoints (at least: {{/logging}}, > {{/metrics}}, and {{/profiler}}). If libprocess was initialized with some > realm these endpoints require authentication, and don't otherwise. > To generate endpoint help we currently use the {{AUTHENTICATION}} helper > function, which injects the following into the help string, > {code} > This endpoints requires authentication iff HTTP authentication is enabled. > {code} > with {{iff}} asserting a stronger coupling between required authentication > and enabled authentication than might be true for the above libprocess > endpoints -- it is, e.g., true when these endpoints are exposed through mesos > masters/agents, but possibly not if exposed through other executables. > For libprocess endpoints a weaker formulation, e.g., > {code} > This endpoints supports authentication. If HTTP authentication is enabled, > this endpoint may require authentication. > {code} > might make the generated help strings more reusable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-5379) Authentication documentation for libprocess endpoints can be misleading.
[ https://issues.apache.org/jira/browse/MESOS-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere reassigned MESOS-5379: --- Assignee: Joris Van Remoortere > Authentication documentation for libprocess endpoints can be misleading. > > > Key: MESOS-5379 > URL: https://issues.apache.org/jira/browse/MESOS-5379 > Project: Mesos > Issue Type: Bug > Components: documentation, libprocess >Affects Versions: 0.29.0 >Reporter: Benjamin Bannier >Assignee: Joris Van Remoortere >Priority: Blocker > Labels: mesosphere, tech-debt > Fix For: 0.29.0 > > > Libprocess exposes a number of endpoints (at least: {{/logging}}, > {{/metrics}}, and {{/profiler}}). If libprocess was initialized with some > realm these endpoints require authentication, and don't otherwise. > To generate endpoint help we currently use the {{AUTHENTICATION}} helper > function, which injects the following into the help string, > {code} > This endpoints requires authentication iff HTTP authentication is enabled. > {code} > with {{iff}} asserting a stronger coupling between required authentication > and enabled authentication than might be true for the above libprocess > endpoints -- it is, e.g., true when these endpoints are exposed through mesos > masters/agents, but possibly not if exposed through other executables. > For libprocess endpoints a weaker formulation, e.g., > {code} > This endpoints supports authentication. If HTTP authentication is enabled, > this endpoint may require authentication. > {code} > might make the generated help strings more reusable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5356) Add Windows support for StopWatch
Joris Van Remoortere created MESOS-5356: --- Summary: Add Windows support for StopWatch Key: MESOS-5356 URL: https://issues.apache.org/jira/browse/MESOS-5356 Project: Mesos Issue Type: Improvement Reporter: Joris Van Remoortere Assignee: Alex Clemmer Fix For: 0.29.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3643) Implement stout/os/windows/shell.hpp
[ https://issues.apache.org/jira/browse/MESOS-3643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275744#comment-15275744 ] Joris Van Remoortere commented on MESOS-3643: - {code} commit fc4f9d25f75dc0ca87732c8b0ee868a5713f1d0f Author: Alex Clemmer Date: Sun May 8 17:00:05 2016 -0400 Windows: Fixed shell constants, marked `os::shell` as deleted. Review: https://reviews.apache.org/r/46393/ {code} > Implement stout/os/windows/shell.hpp > > > Key: MESOS-3643 > URL: https://issues.apache.org/jira/browse/MESOS-3643 > Project: Mesos > Issue Type: Task > Components: stout >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, windows, windows-mvp > Fix For: 0.28.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3656) Port process/socket.hpp to Windows
[ https://issues.apache.org/jira/browse/MESOS-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275742#comment-15275742 ] Joris Van Remoortere commented on MESOS-3656: - {code} commit cd879244d42ade1f63d228694e5681ea254a9902 Author: Alex Clemmer Date: Sun May 8 13:32:09 2016 -0700 Windows: Libprocess: Winsock class to handle WSAStartup/WSACleanup. Review: https://reviews.apache.org/r/46344/ {code} > Port process/socket.hpp to Windows > -- > > Key: MESOS-3656 > URL: https://issues.apache.org/jira/browse/MESOS-3656 > Project: Mesos > Issue Type: Task > Components: libprocess >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, windows > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5296) Split Resource and Inverse offer protobufs for V1 API
Joris Van Remoortere created MESOS-5296: --- Summary: Split Resource and Inverse offer protobufs for V1 API Key: MESOS-5296 URL: https://issues.apache.org/jira/browse/MESOS-5296 Project: Mesos Issue Type: Improvement Reporter: Joris Van Remoortere Assignee: Joris Van Remoortere Fix For: 0.29.0 The protobufs for the V1 API regarding inverse offers initially re-used the existing offer / rescind / accept / decline messages for regular offers. We should split these out to be more explicit, and provide the ability to augment the messages with particulars to either resource or inverse offers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-5044) Temporary directories created by environment->mkdtemp cleanup can be problematic.
[ https://issues.apache.org/jira/browse/MESOS-5044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere reassigned MESOS-5044: --- Assignee: Joris Van Remoortere > Temporary directories created by environment->mkdtemp cleanup can be > problematic. > - > > Key: MESOS-5044 > URL: https://issues.apache.org/jira/browse/MESOS-5044 > Project: Mesos > Issue Type: Improvement > Components: test >Reporter: Gilbert Song >Assignee: Joris Van Remoortere > Labels: mesosphere > Fix For: 0.29.0 > > > Currently in the Mesos tests, the temporary directories created by > `environment->mkdtemp()` are not cleaned up until the end of the test suite, > which can be problematic. For instance, if we have many tests in a test > suite, each of which performs large disk reads/writes in its temp dir, this > may lead to out-of-disk issues on some resource-limited machines. > We should have these temp dirs created by `environment->mkdtemp` cleaned up > during each test teardown. Currently we only clean up the sandbox for each > test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5044) Temporary directories created by environment->mkdtemp cleanup can be problematic.
[ https://issues.apache.org/jira/browse/MESOS-5044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-5044: Sprint: Mesosphere Sprint 32 Story Points: 1 Labels: mesosphere (was: ) Fix Version/s: 0.29.0 > Temporary directories created by environment->mkdtemp cleanup can be > problematic. > - > > Key: MESOS-5044 > URL: https://issues.apache.org/jira/browse/MESOS-5044 > Project: Mesos > Issue Type: Improvement > Components: test >Reporter: Gilbert Song >Assignee: Joris Van Remoortere > Labels: mesosphere > Fix For: 0.29.0 > > > Currently in the Mesos tests, the temporary directories created by > `environment->mkdtemp()` are not cleaned up until the end of the test suite, > which can be problematic. For instance, if we have many tests in a test > suite, each of which performs large disk reads/writes in its temp dir, this > may lead to out-of-disk issues on some resource-limited machines. > We should have these temp dirs created by `environment->mkdtemp` cleaned up > during each test teardown. Currently we only clean up the sandbox for each > test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4353) Limit the number of processes created by libprocess
[ https://issues.apache.org/jira/browse/MESOS-4353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4353: Assignee: Maged Michael (was: Qian Zhang) > Limit the number of processes created by libprocess > --- > > Key: MESOS-4353 > URL: https://issues.apache.org/jira/browse/MESOS-4353 > Project: Mesos > Issue Type: Improvement > Components: libprocess >Reporter: Qian Zhang >Assignee: Maged Michael > Labels: libprocess, mesosphere > Fix For: 0.29.0 > > > Currently libprocess will create {{max(8, number of CPU cores)}} worker > threads during initialization, see > https://github.com/apache/mesos/blob/0.26.0/3rdparty/libprocess/src/process.cpp#L2146 > for details. This should be OK for a normal machine which does not have many > cores (e.g., 16, 32), but for a powerful machine with a large number of > cores (e.g., an IBM Power machine may have 192 cores), this will create many > more worker threads than necessary. > And since libprocess is widely used in Mesos (master, agent, scheduler, > executor), it may also cause performance issues. For example, when a user > creates a Docker container via Mesos on a Mesos agent running on a powerful > machine with 192 cores, the DockerContainerizer in the Mesos agent will > create a dedicated executor for the container, and there will be 192 worker > threads in that executor. And if a user creates 1000 Docker containers on > that machine, then there will be 1000 executors, i.e., 1000 * 192 worker > threads, which is a large number and may thrash the OS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4576) Introduce a stout helper for "which"
[ https://issues.apache.org/jira/browse/MESOS-4576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4576: Sprint: Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Sprint 29, Mesosphere Sprint 30, Mesosphere Sprint 31) > Introduce a stout helper for "which" > > > Key: MESOS-4576 > URL: https://issues.apache.org/jira/browse/MESOS-4576 > Project: Mesos > Issue Type: Improvement > Components: stout >Reporter: Joseph Wu >Assignee: Disha Singh > Labels: mesosphere > > We may want to add a helper to {{stout/os.hpp}} that will natively emulate > the functionality of the Linux utility {{which}}. i.e. > {code} > Option<string> which(const string& command) > { > Option<string> path = os::getenv("PATH"); > // Loop through path and return the first one which os::exists(...). > return None(); > } > {code} > This helper may be useful: > * for test filters in {{src/tests/environment.cpp}} > * a few tests in {{src/tests/containerizer/port_mapping_tests.cpp}} > * the {{sha512}} utility in {{src/common/command_utils.cpp}} > * as runtime checks in the {{LogrotateContainerLogger}} > * etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4694) DRFAllocator takes very long to allocate resources with a large number of frameworks
[ https://issues.apache.org/jira/browse/MESOS-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210399#comment-15210399 ] Joris Van Remoortere commented on MESOS-4694: - {code} commit 6a8738f89b01ac3ddd70c418c49f350e17fa Author: Dario RexinDate: Thu Mar 24 14:10:31 2016 +0100 Allocator Performance: Exited early to avoid needless computation. Review: https://reviews.apache.org/r/43668/ {code} > DRFAllocator takes very long to allocate resources with a large number of > frameworks > > > Key: MESOS-4694 > URL: https://issues.apache.org/jira/browse/MESOS-4694 > Project: Mesos > Issue Type: Improvement > Components: allocation >Affects Versions: 0.26.0, 0.27.0, 0.27.1, 0.28.0, 0.27.2, 0.28.1 >Reporter: Dario Rexin >Assignee: Dario Rexin > > With a growing number of connected frameworks, the allocation time grows to > very high numbers. The addition of quota in 0.27 had an additional impact on > these numbers. Running `mesos-tests.sh --benchmark > --gtest_filter=HierarchicalAllocator_BENCHMARK_Test.DeclineOffers` gives us > the following numbers: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 200 frameworks > round 0 allocate took 2.921202secs to make 200 offers > round 1 allocate took 2.85045secs to make 200 offers > round 2 allocate took 2.823768secs to make 200 offers > {noformat} > Increasing the number of frameworks to 2000: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. 
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 2000 frameworks > round 0 allocate took 28.209454secs to make 2000 offers > round 1 allocate took 28.469419secs to make 2000 offers > round 2 allocate took 28.138086secs to make 2000 offers > {noformat} > I was able to reduce this time by a substantial amount. After applying the > patches: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 200 frameworks > round 0 allocate took 1.016226secs to make 2000 offers > round 1 allocate took 1.102729secs to make 2000 offers > round 2 allocate took 1.102624secs to make 2000 offers > {noformat} > And with 2000 frameworks: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 2000 frameworks > round 0 allocate took 12.563203secs to make 2000 offers > round 1 allocate took 12.437517secs to make 2000 offers > round 2 allocate took 12.470708secs to make 2000 offers > {noformat} > The patches do 3 things to improve the performance of the allocator. 
> 1) The total values in the DRFSorter will be pre calculated per resource type > 2) In the allocate method, when no resources are available to allocate, we > break out of the innermost loop to prevent looping over a large number of > frameworks when we have nothing to allocate > 3) when a framework suppresses offers, we remove it from the sorter instead > of just calling continue in the allocation loop - this greatly improves > performance in the sorter and prevents looping over frameworks that don't > need resources > Assuming that most of the frameworks behave nicely and suppress offers when > they have nothing to schedule, it is fair to assume, that point 3) has the > biggest impact on the performance. If we suppress offers for 90% of the > frameworks in the benchmark test, we see following numbers: > {noformat} > ==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 200 slaves and 2000 frameworks > round 0 allocate took 11626us to make 200 offers > round 1 allocate took 22890us to make 200 offers > round 2 allocate took 21346us to make 200 offers > {noformat} > And for 200 frameworks: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and
[jira] [Updated] (MESOS-4694) DRFAllocator takes very long to allocate resources with a large number of frameworks
[ https://issues.apache.org/jira/browse/MESOS-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4694: Affects Version/s: 0.28.1 0.28.0 0.27.2 > DRFAllocator takes very long to allocate resources with a large number of > frameworks > > > Key: MESOS-4694 > URL: https://issues.apache.org/jira/browse/MESOS-4694 > Project: Mesos > Issue Type: Improvement > Components: allocation >Affects Versions: 0.26.0, 0.27.0, 0.27.1, 0.28.0, 0.27.2, 0.28.1 >Reporter: Dario Rexin >Assignee: Dario Rexin > > With a growing number of connected frameworks, the allocation time grows to > very high numbers. The addition of quota in 0.27 had an additional impact on > these numbers. Running `mesos-tests.sh --benchmark > --gtest_filter=HierarchicalAllocator_BENCHMARK_Test.DeclineOffers` gives us > the following numbers: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 200 frameworks > round 0 allocate took 2.921202secs to make 200 offers > round 1 allocate took 2.85045secs to make 200 offers > round 2 allocate took 2.823768secs to make 200 offers > {noformat} > Increasing the number of frameworks to 2000: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 2000 frameworks > round 0 allocate took 28.209454secs to make 2000 offers > round 1 allocate took 28.469419secs to make 2000 offers > round 2 allocate took 28.138086secs to make 2000 offers > {noformat} > I was able to reduce this time by a substantial amount. After applying the > patches: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. 
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 200 frameworks > round 0 allocate took 1.016226secs to make 2000 offers > round 1 allocate took 1.102729secs to make 2000 offers > round 2 allocate took 1.102624secs to make 2000 offers > {noformat} > And with 2000 frameworks: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 2000 frameworks > round 0 allocate took 12.563203secs to make 2000 offers > round 1 allocate took 12.437517secs to make 2000 offers > round 2 allocate took 12.470708secs to make 2000 offers > {noformat} > The patches do 3 things to improve the performance of the allocator. > 1) The total values in the DRFSorter will be pre calculated per resource type > 2) In the allocate method, when no resources are available to allocate, we > break out of the innermost loop to prevent looping over a large number of > frameworks when we have nothing to allocate > 3) when a framework suppresses offers, we remove it from the sorter instead > of just calling continue in the allocation loop - this greatly improves > performance in the sorter and prevents looping over frameworks that don't > need resources > Assuming that most of the frameworks behave nicely and suppress offers when > they have nothing to schedule, it is fair to assume, that point 3) has the > biggest impact on the performance. If we suppress offers for 90% of the > frameworks in the benchmark test, we see following numbers: > {noformat} > ==] Running 1 test from 1 test case. > [--] Global test environment set-up. 
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 200 slaves and 2000 frameworks > round 0 allocate took 11626us to make 200 offers > round 1 allocate took 22890us to make 200 offers > round 2 allocate took 21346us to make 200 offers > {noformat} > And for 200 frameworks: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 2000 frameworks > round 0 allocate took 1.11178secs to make 2000 offers > round 1 allocate took 1.062649secs to make 2000 offers > round 2 allocate took 1.080181secs to make 2000 offers > {noformat} > Review requests: >
[jira] [Commented] (MESOS-3656) Port process/socket.hpp to Windows
[ https://issues.apache.org/jira/browse/MESOS-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210024#comment-15210024 ] Joris Van Remoortere commented on MESOS-3656: - {code} commit 4e19c3e6f09eaa2793f4717e414429e0e6335e0f Author: Daniel Pravat Date: Thu Mar 24 09:33:05 2016 +0100 Windows: [2/2] Lifted socket API into Stout. Review: https://reviews.apache.org/r/44139/ commit 6f8544cf5e2748a58ac979e6d12336b2dccbf1fb Author: Daniel Pravat Date: Thu Mar 24 09:32:57 2016 +0100 Windows: [1/2] Lifted socket API into Stout. Review: https://reviews.apache.org/r/44138/ {code} > Port process/socket.hpp to Windows > -- > > Key: MESOS-3656 > URL: https://issues.apache.org/jira/browse/MESOS-3656 > Project: Mesos > Issue Type: Task > Components: libprocess >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, windows > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4827) Destroy Docker container crashes Mesos slave
[ https://issues.apache.org/jira/browse/MESOS-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4827: Affects Version/s: 0.26.0 0.27.0 0.28.0 > Destroy Docker container crashes Mesos slave > > > Key: MESOS-4827 > URL: https://issues.apache.org/jira/browse/MESOS-4827 > Project: Mesos > Issue Type: Bug > Components: docker, framework, slave >Affects Versions: 0.25.0, 0.26.0, 0.27.0, 0.28.0 >Reporter: Zhenzhong Shi >Priority: Blocker > Fix For: 0.29.0 > > > The details of this issue were originally [posted on > StackOverflow|http://stackoverflow.com/questions/35713985/destroy-docker-container-from-marathon-kills-mesos-slave]. > In short, the problem is that when we destroy/re-deploy a docker-containerized > task, the mesos-slave gets killed from time to time. It happened on our > production environment and I can't reproduce it. > Please refer to the post on StackOverflow for the error message I got and > details of the environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4809) Allow parallel execution of tests
[ https://issues.apache.org/jira/browse/MESOS-4809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4809: Assignee: Benjamin Bannier > Allow parallel execution of tests > - > > Key: MESOS-4809 > URL: https://issues.apache.org/jira/browse/MESOS-4809 > Project: Mesos > Issue Type: Epic >Reporter: Benjamin Bannier >Assignee: Benjamin Bannier >Priority: Minor > > We should allow parallel execution of tests. There are two flavors to this: > (a) tests are run in parallel in the same process, or > (b) tests are run in parallel in separate processes (e.g., with > gtest-parallel). > While (a) likely has better overall performance, it depends on tests being > independent of global state (e.g., the current directory, and others). On the > other hand, (b) already improves execution time, and has much weaker > requirements. > This epic tracks efforts to fix tests to allow scenario (b) above. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4807) IOTest.BufferedRead writes to the current directory
[ https://issues.apache.org/jira/browse/MESOS-4807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4807: Labels: mesosphere newbie parallel-tests (was: newbie parallel-tests) > IOTest.BufferedRead writes to the current directory > --- > > Key: MESOS-4807 > URL: https://issues.apache.org/jira/browse/MESOS-4807 > Project: Mesos > Issue Type: Bug > Components: libprocess, test >Reporter: Benjamin Bannier >Assignee: Yong Tang >Priority: Minor > Labels: mesosphere, newbie, parallel-tests > Fix For: 0.29.0 > > > libprocess's {{IOTest.BufferedRead}} writes to the current directory. This is > bad for a number of reasons, e.g., > * should the test fail data might be leaked to random locations, > * the test cannot be executed from a write-only directory, or > * executing the same test in parallel would race on the existence of the > created file, and show bogus behavior. > The test should probably be executed from a temporary directory, e.g., via > stout's {{TemporaryDirectoryTest}} fixture. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4831) Master sometimes sends two inverse offers after the agent goes into maintenance.
[ https://issues.apache.org/jira/browse/MESOS-4831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4831: Sprint: Mesosphere Sprint 30 > Master sometimes sends two inverse offers after the agent goes into > maintenance. > > > Key: MESOS-4831 > URL: https://issues.apache.org/jira/browse/MESOS-4831 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.27.0 >Reporter: Anand Mazumdar >Assignee: Guangya Liu >Priority: Blocker > Labels: maintenance, mesosphere > Fix For: 0.28.0 > > > Showed up on ASF CI for {{MasterMaintenanceTest.PendingUnavailabilityTest}} > https://builds.apache.org/job/Mesos/1748/COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)/consoleFull > {code} > I0229 11:08:57.027559 668 hierarchical.cpp:1437] No resources available to > allocate! > I0229 11:08:57.027745 668 hierarchical.cpp:1150] Performed allocation for > slave fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-S0 in 272747ns > I0229 11:08:57.027757 675 master.cpp:5369] Sending 1 offers to framework > fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) > I0229 11:08:57.028586 675 master.cpp:5459] Sending 1 inverse offers to > framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) > I0229 11:08:57.029039 675 master.cpp:5459] Sending 1 inverse offers to > framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) > {code} > The ideal expected workflow for this test is something like: > - The framework receives offers from master. > - The framework updates its maintenance schedule. > - The current offer is rescinded. > - A new offer is received from the master with unavailability set. > - After the agent goes for maintenance, an inverse offer is sent. > For some reason, in the logs we see that the master is sending 2 inverse > offers. The test seems to pass as we just check for the initial inverse offer > being present. 
This can also be reproduced by a modified version of the
> original test.
> {code}
> // Test ensures that an offer will have an `unavailability` set if the
> // slave is scheduled to go down for maintenance.
> TEST_F(MasterMaintenanceTest, PendingUnavailabilityTest)
> {
>   Try<PID<Master>> master = StartMaster();
>   ASSERT_SOME(master);
>   MockExecutor exec(DEFAULT_EXECUTOR_ID);
>   Try<PID<Slave>> slave = StartSlave();
>   ASSERT_SOME(slave);
>   auto scheduler = std::make_shared<MockV1HTTPScheduler>();
>   EXPECT_CALL(*scheduler, heartbeat(_))
>     .WillRepeatedly(Return()); // Ignore heartbeats.
>   Future<Nothing> connected;
>   EXPECT_CALL(*scheduler, connected(_))
>     .WillOnce(FutureSatisfy(&connected))
>     .WillRepeatedly(Return()); // Ignore future invocations.
>   scheduler::TestV1Mesos mesos(master.get(), ContentType::PROTOBUF, scheduler);
>   AWAIT_READY(connected);
>   Future<Event::Subscribed> subscribed;
>   EXPECT_CALL(*scheduler, subscribed(_, _))
>     .WillOnce(FutureArg<1>(&subscribed));
>   Future<Event::Offers> normalOffers;
>   Future<Event::Offers> unavailabilityOffers;
>   Future<Event::Offers> inverseOffers;
>   EXPECT_CALL(*scheduler, offers(_, _))
>     .WillOnce(FutureArg<1>(&normalOffers))
>     .WillOnce(FutureArg<1>(&unavailabilityOffers))
>     .WillOnce(FutureArg<1>(&inverseOffers));
>   // The original offers should be rescinded when the unavailability is changed.
>   Future<Nothing> offerRescinded;
>   EXPECT_CALL(*scheduler, rescind(_, _))
>     .WillOnce(FutureSatisfy(&offerRescinded));
>   {
>     Call call;
>     call.set_type(Call::SUBSCRIBE);
>     Call::Subscribe* subscribe = call.mutable_subscribe();
>     subscribe->mutable_framework_info()->CopyFrom(DEFAULT_V1_FRAMEWORK_INFO);
>     mesos.send(call);
>   }
>   AWAIT_READY(subscribed);
>   v1::FrameworkID frameworkId(subscribed->framework_id());
>   AWAIT_READY(normalOffers);
>   EXPECT_NE(0, normalOffers->offers().size());
>   // Regular offers shouldn't have unavailability.
>   foreach (const v1::Offer& offer, normalOffers->offers()) {
>     EXPECT_FALSE(offer.has_unavailability());
>   }
>   // Schedule this slave for maintenance.
>   MachineID machine;
>   machine.set_hostname(maintenanceHostname);
>   machine.set_ip(stringify(slave.get().address.ip));
>   const Time start = Clock::now() + Seconds(60);
>   const Duration duration = Seconds(120);
>   const Unavailability unavailability = createUnavailability(start, duration);
>   // Post a valid schedule with one machine.
>   maintenance::Schedule schedule = createSchedule(
>       {createWindow({machine}, unavailability)});
>   // We have a few seconds between the first set of offers and the
>   // next allocation of offers. This should be enough time to perform
>   // a maintenance schedule update. This update will also trigger the
>   // rescinding of offers from the scheduled slave.
>   Future<Response> response = process::http::post(
>       master.get(),
>
[jira] [Commented] (MESOS-4827) Destroy Docker container crashes Mesos slave
[ https://issues.apache.org/jira/browse/MESOS-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15188564#comment-15188564 ] Joris Van Remoortere commented on MESOS-4827: - No. That is why it is marked as a blocker. It does seem like #1 and #3 may be separate issues, though; #3 is what is causing the widespread task failure. > Destroy Docker container crashes Mesos slave > > > Key: MESOS-4827 > URL: https://issues.apache.org/jira/browse/MESOS-4827 > Project: Mesos > Issue Type: Bug > Components: docker, framework, slave >Affects Versions: 0.25.0 >Reporter: Zhenzhong Shi >Priority: Blocker > Fix For: 0.29.0 > > > The details of this issue were originally [posted on > StackOverflow|http://stackoverflow.com/questions/35713985/destroy-docker-container-from-marathon-kills-mesos-slave]. > > In short, when we destroy/re-deploy a docker-containerized > task, the mesos-slave sometimes gets killed. This happened in our > production environment and I can't reproduce it. > Please refer to the StackOverflow post for the error message and > environment details. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4827) Destroy Docker container from Marathon kills Mesos slave
[ https://issues.apache.org/jira/browse/MESOS-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4827: Priority: Blocker (was: Major) > Destroy Docker container from Marathon kills Mesos slave > > > Key: MESOS-4827 > URL: https://issues.apache.org/jira/browse/MESOS-4827 > Project: Mesos > Issue Type: Bug > Components: docker, framework, slave >Affects Versions: 0.25.0 >Reporter: Zhenzhong Shi >Priority: Blocker > Fix For: 0.29.0 > > > The details of this issue originally [posted on > StackOverflow|http://stackoverflow.com/questions/35713985/destroy-docker-container-from-marathon-kills-mesos-slave]. > > To be short, the problem is when we destroy/re-deploy a docker-containerized > task, the mesos-slave got killed from time to time. It happened on our > production environment and I cann't re-produce it. > Please refer to the post on StackOverflow about the error message I got and > details of environment info. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4827) Destroy Docker container crashes Mesos slave
[ https://issues.apache.org/jira/browse/MESOS-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15185017#comment-15185017 ] Joris Van Remoortere commented on MESOS-4827: - At first glance this looks like it is happening because the directory structure in which we want to write the sentinel file is not fully constructed. We need to: - Investigate (and fix) how this can happen. > Destroy Docker container crashes Mesos slave > > > Key: MESOS-4827 > URL: https://issues.apache.org/jira/browse/MESOS-4827 > Project: Mesos > Issue Type: Bug > Components: docker, framework, slave >Affects Versions: 0.25.0 >Reporter: Zhenzhong Shi >Priority: Blocker > Fix For: 0.29.0 > > > The details of this issue originally [posted on > StackOverflow|http://stackoverflow.com/questions/35713985/destroy-docker-container-from-marathon-kills-mesos-slave]. > > To be short, the problem is when we destroy/re-deploy a docker-containerized > task, the mesos-slave got killed from time to time. It happened on our > production environment and I cann't re-produce it. > Please refer to the post on StackOverflow about the error message I got and > details of environment info. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4827) Destroy Docker container crashes Mesos slave
[ https://issues.apache.org/jira/browse/MESOS-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4827: Summary: Destroy Docker container crashes Mesos slave (was: Destroy Docker container from Marathon kills Mesos slave) > Destroy Docker container crashes Mesos slave > > > Key: MESOS-4827 > URL: https://issues.apache.org/jira/browse/MESOS-4827 > Project: Mesos > Issue Type: Bug > Components: docker, framework, slave >Affects Versions: 0.25.0 >Reporter: Zhenzhong Shi >Priority: Blocker > Fix For: 0.29.0 > > > The details of this issue originally [posted on > StackOverflow|http://stackoverflow.com/questions/35713985/destroy-docker-container-from-marathon-kills-mesos-slave]. > > To be short, the problem is when we destroy/re-deploy a docker-containerized > task, the mesos-slave got killed from time to time. It happened on our > production environment and I cann't re-produce it. > Please refer to the post on StackOverflow about the error message I got and > details of environment info. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4827) Destroy Docker container from Marathon kills Mesos slave
[ https://issues.apache.org/jira/browse/MESOS-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4827: Fix Version/s: 0.29.0 > Destroy Docker container from Marathon kills Mesos slave > > > Key: MESOS-4827 > URL: https://issues.apache.org/jira/browse/MESOS-4827 > Project: Mesos > Issue Type: Bug > Components: docker, framework, slave >Affects Versions: 0.25.0 >Reporter: Zhenzhong Shi > Fix For: 0.29.0 > > > The details of this issue originally [posted on > StackOverflow|http://stackoverflow.com/questions/35713985/destroy-docker-container-from-marathon-kills-mesos-slave]. > > To be short, the problem is when we destroy/re-deploy a docker-containerized > task, the mesos-slave got killed from time to time. It happened on our > production environment and I cann't re-produce it. > Please refer to the post on StackOverflow about the error message I got and > details of environment info. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4838) Update unavailable in batch to avoid several allocate(slaveId) call
[ https://issues.apache.org/jira/browse/MESOS-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180507#comment-15180507 ] Joris Van Remoortere commented on MESOS-4838: - [~klaus1982] I'm not sure why we need to do this. 1. Are you seeing performance issues with the {{allocate(slaveId)}} calls generated by the maintenance schedule? 2. If this is the case, why wouldn't the general batching proposal for the allocator cover this case? Why do we need to implement batching in specific API entry points? 3. If this is being suggested because a maintenance schedule tends to update many agents simultaneously, then would it make more sense to consider calling the batch {{allocate()}} function in the allocator after updating all the agent availabilities? If you are interested in considering some improvements around maintenance, let's set up a working group. I know others are also interested in this feature, and I know [~kaysoky] would love to help guide these discussions. We should discuss these kinds of larger changes and ideas in terms of their operational and development consequences before posting patches. (Though if you just want to try it out to understand the performance implications or what code would need to be touched that's totally fine; we just may decide to go in a very different direction). > Update unavailable in batch to avoid several allocate(slaveId) call > --- > > Key: MESOS-4838 > URL: https://issues.apache.org/jira/browse/MESOS-4838 > Project: Mesos > Issue Type: Bug >Reporter: Klaus Ma >Assignee: Klaus Ma > > In "/machine/schedule", all machines in master will trigger a > {{allocate(slaveId)}} which will increase the workload of master. The > proposal of this JIRA is to update unavailable in batch to avoid several > {{allocate(slaveId)}} call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
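The batching idea under discussion can be sketched independently of the allocator API. The names below are illustrative assumptions, not Mesos code:

```cpp
#include <set>
#include <string>
#include <vector>

// Illustrative sketch: instead of triggering a full allocation per agent,
// record which agents changed and run one batched allocation pass. Updating
// N machine availabilities then costs one allocation cycle instead of N.
struct BatchingAllocator {
  std::set<std::string> dirty;   // agents whose availability changed
  int passes = 0;                // how many allocation cycles actually ran

  void updateUnavailability(const std::string& slaveId) {
    dirty.insert(slaveId);       // record the change; no allocation yet
  }

  // One allocation pass over every dirty agent, then clear the set.
  std::vector<std::string> allocate() {
    std::vector<std::string> batch(dirty.begin(), dirty.end());
    dirty.clear();
    ++passes;
    return batch;
  }
};
```

Whether this coalescing belongs in the maintenance-specific entry point or in a general allocator batching mechanism is exactly the open question in the comment above.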
[jira] [Updated] (MESOS-4831) Master sometimes sends two inverse offers after the agent goes into maintenance.
[ https://issues.apache.org/jira/browse/MESOS-4831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4831: Shepherd: Joris Van Remoortere > Master sometimes sends two inverse offers after the agent goes into > maintenance. > > > Key: MESOS-4831 > URL: https://issues.apache.org/jira/browse/MESOS-4831 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.27.0 >Reporter: Anand Mazumdar >Assignee: Guangya Liu > Labels: maintenance, mesosphere > > Showed up on ASF CI for {{MasterMaintenanceTest.PendingUnavailabilityTest}} > https://builds.apache.org/job/Mesos/1748/COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)/consoleFull > {code} > I0229 11:08:57.027559 668 hierarchical.cpp:1437] No resources available to > allocate! > I0229 11:08:57.027745 668 hierarchical.cpp:1150] Performed allocation for > slave fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-S0 in 272747ns > I0229 11:08:57.027757 675 master.cpp:5369] Sending 1 offers to framework > fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) > I0229 11:08:57.028586 675 master.cpp:5459] Sending 1 inverse offers to > framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) > I0229 11:08:57.029039 675 master.cpp:5459] Sending 1 inverse offers to > framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) > {code} > The ideal expected workflow for this test is something like: > - The framework receives offers from master. > - The framework updates its maintenance schedule. > - The current offer is rescinded. > - A new offer is received from the master with unavailability set. > - After the agent goes for maintenance, an inverse offer is sent. > For some reason, in the logs we see that the master is sending 2 inverse > offers. The test seems to pass as we just check for the initial inverse offer > being present. This can also be reproduced by a modified version of the > original test. 
[jira] [Commented] (MESOS-4831) Master sometimes sends two inverse offers after the agent goes into maintenance.
[ https://issues.apache.org/jira/browse/MESOS-4831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175259#comment-15175259 ] Joris Van Remoortere commented on MESOS-4831: - Yep! > Master sometimes sends two inverse offers after the agent goes into > maintenance. > > > Key: MESOS-4831 > URL: https://issues.apache.org/jira/browse/MESOS-4831 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.27.0 >Reporter: Anand Mazumdar >Assignee: Guangya Liu > Labels: maintenance, mesosphere > > Showed up on ASF CI for {{MasterMaintenanceTest.PendingUnavailabilityTest}} > https://builds.apache.org/job/Mesos/1748/COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)/consoleFull > {code} > I0229 11:08:57.027559 668 hierarchical.cpp:1437] No resources available to > allocate! > I0229 11:08:57.027745 668 hierarchical.cpp:1150] Performed allocation for > slave fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-S0 in 272747ns > I0229 11:08:57.027757 675 master.cpp:5369] Sending 1 offers to framework > fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) > I0229 11:08:57.028586 675 master.cpp:5459] Sending 1 inverse offers to > framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) > I0229 11:08:57.029039 675 master.cpp:5459] Sending 1 inverse offers to > framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) > {code} > The ideal expected workflow for this test is something like: > - The framework receives offers from master. > - The framework updates its maintenance schedule. > - The current offer is rescinded. > - A new offer is received from the master with unavailability set. > - After the agent goes for maintenance, an inverse offer is sent. > For some reason, in the logs we see that the master is sending 2 inverse > offers. The test seems to pass as we just check for the initial inverse offer > being present. This can also be reproduced by a modified version of the > original test. 
[jira] [Updated] (MESOS-4691) Add a HierarchicalAllocator benchmark with reservation labels.
[ https://issues.apache.org/jira/browse/MESOS-4691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4691: Shepherd: Joris Van Remoortere (was: Michael Park) > Add a HierarchicalAllocator benchmark with reservation labels. > -- > > Key: MESOS-4691 > URL: https://issues.apache.org/jira/browse/MESOS-4691 > Project: Mesos > Issue Type: Task >Reporter: Michael Park >Assignee: Neil Conway > Labels: mesosphere > Fix For: 0.28.0 > > > With {{Labels}} being part of the {{ReservationInfo}}, we should ensure that > we don't observe a significant performance degradation in the allocator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4415) Implement stout/os/windows/rmdir.hpp
[ https://issues.apache.org/jira/browse/MESOS-4415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174573#comment-15174573 ] Joris Van Remoortere commented on MESOS-4415: - https://reviews.apache.org/r/43907/ https://reviews.apache.org/r/43908/ > Implement stout/os/windows/rmdir.hpp > > > Key: MESOS-4415 > URL: https://issues.apache.org/jira/browse/MESOS-4415 > Project: Mesos > Issue Type: Task > Components: stout >Reporter: Joris Van Remoortere >Assignee: Alex Clemmer > Labels: mesosphere, windows > Fix For: 0.27.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4780) Remove `user` and `rootfs` flags in Windows launcher.
[ https://issues.apache.org/jira/browse/MESOS-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174440#comment-15174440 ] Joris Van Remoortere edited comment on MESOS-4780 at 3/1/16 10:42 PM: -- https://reviews.apache.org/r/43904/ https://reviews.apache.org/r/43905/ https://reviews.apache.org/r/40938/ https://reviews.apache.org/r/40939/ was (Author: jvanremoortere): https://reviews.apache.org/r/43904/ https://reviews.apache.org/r/43905/ > Remove `user` and `rootfs` flags in Windows launcher. > - > > Key: MESOS-4780 > URL: https://issues.apache.org/jira/browse/MESOS-4780 > Project: Mesos > Issue Type: Task >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, windows-mvp > Fix For: 0.28.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4780) Remove `user` and `rootfs` flags in Windows launcher.
[ https://issues.apache.org/jira/browse/MESOS-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174457#comment-15174457 ] Joris Van Remoortere edited comment on MESOS-4780 at 3/1/16 10:42 PM: -- {code} commit 9f1b115a67a1625a4807c2a7d4e1a41bca1af2a6 Author: Daniel PravatDate: Tue Mar 1 14:18:41 2016 -0800 Stout: Marked `os::su` as deleted on Windows. Review: https://reviews.apache.org/r/40939/ commit a1f731746657b1cbcf136ddb2bf154ca3da271fc Author: Daniel Pravat Date: Tue Mar 1 14:16:08 2016 -0800 Stout: Marked `os::chroot` as deleted on Windows. Review: https://reviews.apache.org/r/40938/ commit a1a9cd5939d25f82214a5c533bde96a3493f81f3 Author: Alex Clemmer Date: Tue Mar 1 13:35:13 2016 -0800 Windows: Stout: Removed user based functions. Review: https://reviews.apache.org/r/43905/ commit b9de8c6a06f0d0246ea38ab5586de1d0b2478c38 Author: Alex Clemmer Date: Tue Mar 1 13:33:37 2016 -0800 Windows: Removed `user` launcher flag, preventing `su`. `su` does not exist on Windows. Unfortunately, the launcher also depends on it. In this commit, we remove Windows support for the launcher flag `user`, which controls whether we use `su` in the launcher. This allows us to divest ourselves of `su` altogether on Windows. Review: https://reviews.apache.org/r/43905/ {code} was (Author: jvanremoortere): {code} commit a1a9cd5939d25f82214a5c533bde96a3493f81f3 Author: Alex Clemmer Date: Tue Mar 1 13:35:13 2016 -0800 Windows: Stout: Removed user based functions. Review: https://reviews.apache.org/r/43905/ commit b9de8c6a06f0d0246ea38ab5586de1d0b2478c38 Author: Alex Clemmer Date: Tue Mar 1 13:33:37 2016 -0800 Windows: Removed `user` launcher flag, preventing `su`. `su` does not exist on Windows. Unfortunately, the launcher also depends on it. In this commit, we remove Windows support for the launcher flag `user`, which controls whether we use `su` in the launcher. This allows us to divest ourselves of `su` altogether on Windows. 
Review: https://reviews.apache.org/r/43905/ {code} > Remove `user` and `rootfs` flags in Windows launcher. > - > > Key: MESOS-4780 > URL: https://issues.apache.org/jira/browse/MESOS-4780 > Project: Mesos > Issue Type: Task >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, windows-mvp > Fix For: 0.28.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4833) Poor allocator performance with labeled resources and/or persistent volumes
[ https://issues.apache.org/jira/browse/MESOS-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4833: Priority: Blocker (was: Critical) > Poor allocator performance with labeled resources and/or persistent volumes > --- > > Key: MESOS-4833 > URL: https://issues.apache.org/jira/browse/MESOS-4833 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Neil Conway >Assignee: Neil Conway >Priority: Blocker > Labels: mesosphere, resources > Fix For: 0.28.0 > > > Modifying the {{HierarchicalAllocator_BENCHMARK_Test.ResourceLabels}} > benchmark from https://reviews.apache.org/r/43686/ to use distinct labels > between different slaves, performance regresses from ~2 seconds to ~3 > minutes. The culprit seems to be the way in which the allocator merges > together resources; reserved resource labels (or persistent volume IDs) > inhibit merging, which causes performance to be much worse. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
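A minimal model of the merging behavior (illustrative; not the {{mesos::Resources}} implementation): two entries fold together only if every piece of metadata matches, so per-agent-unique reservation labels leave one entry per agent and the aggregate grows linearly instead of staying a handful of entries.

```cpp
#include <string>
#include <vector>

// Illustrative model of resource aggregation: an entry is mergeable with an
// existing one only when the name, role, and reservation labels all match.
// Distinct labels defeat merging, so every add scans a growing list.
struct Entry {
  std::string name;
  std::string role;
  std::string labels;   // flattened reservation labels
  double scalar;
};

void add(std::vector<Entry>& total, const Entry& e) {
  for (Entry& existing : total) {
    if (existing.name == e.name && existing.role == e.role &&
        existing.labels == e.labels) {
      existing.scalar += e.scalar;   // mergeable: fold into one entry
      return;
    }
  }
  total.push_back(e);                // metadata differs: keep separate
}
```

With 100 agents sharing one label the aggregate stays a single entry; with 100 distinct labels it holds 100 entries, and every subsequent add scans all of them, which is the kind of degradation the benchmark exposes.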
[jira] [Comment Edited] (MESOS-4780) Remove `user` and `rootfs` flags in Windows launcher.
[ https://issues.apache.org/jira/browse/MESOS-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174440#comment-15174440 ] Joris Van Remoortere edited comment on MESOS-4780 at 3/1/16 9:31 PM: - https://reviews.apache.org/r/43904/ https://reviews.apache.org/r/43905/ was (Author: jvanremoortere): https://reviews.apache.org/r/43904/ > Remove `user` and `rootfs` flags in Windows launcher. > - > > Key: MESOS-4780 > URL: https://issues.apache.org/jira/browse/MESOS-4780 > Project: Mesos > Issue Type: Task >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, windows-mvp > Fix For: 0.28.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3525) Figure out how to enforce 64-bit builds on Windows.
[ https://issues.apache.org/jira/browse/MESOS-3525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174156#comment-15174156 ] Joris Van Remoortere commented on MESOS-3525: - https://reviews.apache.org/r/43692/ https://reviews.apache.org/r/43693/ https://reviews.apache.org/r/43694/ https://reviews.apache.org/r/43695/ https://reviews.apache.org/r/43689/ > Figure out how to enforce 64-bit builds on Windows. > --- > > Key: MESOS-3525 > URL: https://issues.apache.org/jira/browse/MESOS-3525 > Project: Mesos > Issue Type: Task > Components: build >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: build, cmake, mesosphere > Fix For: 0.28.0 > > > We need to make sure people don't try to compile Mesos on 32-bit > architectures. We don't want a Windows repeat of something like this: > https://issues.apache.org/jira/browse/MESOS-267 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4825) Master's slave reregister logic does not update version field
[ https://issues.apache.org/jira/browse/MESOS-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173239#comment-15173239 ] Joris Van Remoortere commented on MESOS-4825: - I can shepherd this. I don't think we should reject if there is a version mismatch. That would prevent us from doing rolling upgrades. We just want to update the version to the current one the agent is running, so that the {{/slaves}} endpoint reports it correctly, and any logic that is dependent on the slave's version works correctly. > Master's slave reregister logic does not update version field > - > > Key: MESOS-4825 > URL: https://issues.apache.org/jira/browse/MESOS-4825 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Joris Van Remoortere >Priority: Blocker > Fix For: 0.28.0 > > > The master's logic for reregistering a slave does not update the version > field if the slave re-registers with a new version. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
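The intended fix can be sketched as follows; the structure and names are illustrative assumptions, not the actual master code. The point is to overwrite the stored version on re-registration rather than reject a mismatch, since rejection would break rolling upgrades:

```cpp
#include <map>
#include <string>

// Illustrative sketch: on re-registration, update the stored agent version
// to whatever the agent reports now, so endpoints like /slaves and any
// version-dependent logic see the running version.
struct MasterState {
  std::map<std::string, std::string> slaveVersions;

  void reregisterSlave(const std::string& slaveId,
                       const std::string& version) {
    // Insert or update: an existing entry gets the new version too,
    // rather than keeping the stale one from the first registration.
    slaveVersions[slaveId] = version;
  }
};
```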
[jira] [Comment Edited] (MESOS-4825) Master's slave reregister logic does not update version field
[ https://issues.apache.org/jira/browse/MESOS-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173208#comment-15173208 ] Joris Van Remoortere edited comment on MESOS-4825 at 3/1/16 4:18 AM: - [~klaus1982] Not all re-register paths construct a {{new Slave()}}: https://github.com/apache/mesos/blob/0fd95ccc54e4d144c3eb60e98bf77d53b6bdab63/src/master/master.cpp#L4405-L4467 was (Author: jvanremoortere): [~klaus1982]Not all re-register paths construct a {{new Slave()}}: https://github.com/apache/mesos/blob/0fd95ccc54e4d144c3eb60e98bf77d53b6bdab63/src/master/master.cpp#L4405-L4467 > Master's slave reregister logic does not update version field > - > > Key: MESOS-4825 > URL: https://issues.apache.org/jira/browse/MESOS-4825 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Joris Van Remoortere >Priority: Blocker > Fix For: 0.28.0 > > > The master's logic for reregistering a slave does not update the version > field if the slave re-registers with a new version. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4825) Master's slave reregister logic does not update version field
[ https://issues.apache.org/jira/browse/MESOS-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173208#comment-15173208 ] Joris Van Remoortere commented on MESOS-4825: - [~klaus1982]Not all re-register paths construct a {{new Slave()}}: https://github.com/apache/mesos/blob/0fd95ccc54e4d144c3eb60e98bf77d53b6bdab63/src/master/master.cpp#L4405-L4467 > Master's slave reregister logic does not update version field > - > > Key: MESOS-4825 > URL: https://issues.apache.org/jira/browse/MESOS-4825 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Joris Van Remoortere >Priority: Blocker > Fix For: 0.28.0 > > > The master's logic for reregistering a slave does not update the version > field if the slave re-registers with a new version. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4825) Master's slave reregister logic does not update version field
Joris Van Remoortere created MESOS-4825: --- Summary: Master's slave reregister logic does not update version field Key: MESOS-4825 URL: https://issues.apache.org/jira/browse/MESOS-4825 Project: Mesos Issue Type: Bug Components: master Reporter: Joris Van Remoortere Priority: Blocker Fix For: 0.28.0 The master's logic for reregistering a slave does not update the version field if the slave re-registers with a new version. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3271) SlaveRecoveryTest/0.NonCheckpointingFramework is flaky.
[ https://issues.apache.org/jira/browse/MESOS-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170238#comment-15170238 ] Joris Van Remoortere commented on MESOS-3271: - {code} commit 16aa038949741f4dc6bf43423dc0340f869605ce Author: Alexander Rojas Date: Fri Feb 26 17:17:50 2016 -0800 Removed race condition from libevent based poll implementation. Under certain circumstances, the future returned by poll is discarded right after the event is triggered; this causes the event callback to be called before the discard callback, which results in an abort signal being raised by the libevent library. Review: https://reviews.apache.org/r/43799/ {code} > SlaveRecoveryTest/0.NonCheckpointingFramework is flaky. > --- > > Key: MESOS-3271 > URL: https://issues.apache.org/jira/browse/MESOS-3271 > Project: Mesos > Issue Type: Bug > Components: slave >Reporter: Paul Brett > Attachments: build.txt > > > Test failure on Ubuntu 14 configured with {{--disable-java --disable-python > --enable-ssl --enable-libevent --enable-optimize --enable-network-isolation}} > Commit: {{9b78b301469667b5a44f0a351de5f3a71edae499}} > {code} > [ RUN ] SlaveRecoveryTest/0.NonCheckpointingFramework > I0815 06:41:47.413602 17091 exec.cpp:133] Version: 0.24.0 > I0815 06:41:47.416780 17111 exec.cpp:207] Executor registered on slave > 20150815-064146-544909504-51064-12195-S0 > Registered executor on slave1-ubuntu12 > Starting task 044bd49e-2f38-4671-802a-ac6524d61a85 > Forked command at 17114 > sh -c 'sleep 1000' > [err] event_active called on a non-initialized event 0x7f6b740232d0 (events: > 0x2, fd: 21, flags: 0x80) > *** Aborted at 1439646107 (unix time) try "date -d @1439646107" if you are > using GNU date *** > PC: @ 0x7f6ba512d0d5 (unknown) > *** SIGABRT (@0x2fa3) received by PID 12195 (TID 0x7f6b9d613700) from PID > 12195; stack trace: *** > @ 0x7f6ba54c4cb0 (unknown) > @ 0x7f6ba512d0d5 (unknown) > @ 0x7f6ba513083b (unknown) > @ 0x7f6ba448e1ba (unknown) > @ 0x7f6ba448e52b (unknown) > @ 0x7f6ba447dcc9 (unknown) > @ 0x4c4033 process::internal::run<>() > @ 0x7f6ba72642ab process::Future<>::discard() > @ 0x7f6ba72643be process::internal::discard<>() > @ 0x7f6ba7262298 > _ZNSt17_Function_handlerIFvvEZNK7process6FutureImE9onDiscardISt5_BindIFPFvNS1_10WeakFutureIsEEES7_RKS3_OT_EUlvE_E9_M_invokeERKSt9_Any_data > @ 0x4c4033 process::internal::run<>() > @ 0x6fa0cb process::Future<>::discard() > @ 0x7f6ba6fb5736 cgroups::event::Listener::finalize() > @ 0x7f6ba728fb11 process::ProcessManager::resume() > @ 0x7f6ba728fe0f process::internal::schedule() > @ 0x7f6ba5c9d490 (unknown) > @ 0x7f6ba54bce9a start_thread > @ 0x7f6ba51ea38d (unknown) > + /bin/true > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4711) Race condition in libevent poll implementation causes crash
[ https://issues.apache.org/jira/browse/MESOS-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170239#comment-15170239 ] Joris Van Remoortere commented on MESOS-4711: - {code} commit 16aa038949741f4dc6bf43423dc0340f869605ce Author: Alexander Rojas Date: Fri Feb 26 17:17:50 2016 -0800 Removed race condition from libevent based poll implementation. Under certain circumstances, the future returned by poll is discarded right after the event is triggered; this causes the event callback to be called before the discard callback, which results in an abort signal being raised by the libevent library. Review: https://reviews.apache.org/r/43799/ {code} > Race condition in libevent poll implementation causes crash > --- > > Key: MESOS-4711 > URL: https://issues.apache.org/jira/browse/MESOS-4711 > Project: Mesos > Issue Type: Bug > Components: libprocess >Affects Versions: 0.28.0 > Environment: CentOS 6.7 running in VirtualBox >Reporter: Alexander Rojas >Assignee: Alexander Rojas > Labels: mesosphere > Fix For: 0.28.0, 0.27.2 > > > The issue first arose in MESOS-3271, but can be reproduced every time by > using the mentioned environment and running: > {noformat} > sudo ./bin/mesos-tests.sh > --gtest_filter="MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery" > --gtest_repeat=1000 > {noformat} > The problem can be traced back to > [{{libevent_poll.cpp}}|https://github.com/apache/mesos/blob/3539b7a0e15b594148308319bf052d28b1429b98/3rdparty/libprocess/src/libevent_poll.cpp]. > If the event is triggered and the future associated with the event is > discarded, the situation arises in which > [{{pollCallback()}}|https://github.com/apache/mesos/blob/3539b7a0e15b594148308319bf052d28b1429b98/3rdparty/libprocess/src/libevent_poll.cpp#L33] > starts executing just early enough to finish before > [{{pollDiscard()}}|https://github.com/apache/mesos/blob/3539b7a0e15b594148308319bf052d28b1429b98/3rdparty/libprocess/src/libevent_poll.cpp#L53] > executes. If that happens, {{pollCallback()}} deletes the poll object and > {{pollDiscard()}} is left with a dangling pointer, which crashes when it > executes the line {{event_active(ev, EV_READ, 0);}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (MESOS-3271) SlaveRecoveryTest/0.NonCheckpointingFramework is flaky.
[ https://issues.apache.org/jira/browse/MESOS-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-3271: Comment: was deleted (was: {code} commit 2297a3cf8db2b88860bc839cf934894b1d09dbbc Author: Alexander Rojas Date: Fri Feb 26 14:38:05 2016 -0800 Removed race condition from libevent based poll implementation. Under certain circumstances, the future returned by poll is discarded right after the event is triggered; this causes the event callback to be called before the discard callback, which results in an abort signal being raised by the libevent library. Review: https://reviews.apache.org/r/43799/ {code}) > SlaveRecoveryTest/0.NonCheckpointingFramework is flaky. > --- > > Key: MESOS-3271 > URL: https://issues.apache.org/jira/browse/MESOS-3271 > Project: Mesos > Issue Type: Bug > Components: slave >Reporter: Paul Brett > Attachments: build.txt > > > Test failure on Ubuntu 14 configured with {{--disable-java --disable-python > --enable-ssl --enable-libevent --enable-optimize --enable-network-isolation}} > Commit: {{9b78b301469667b5a44f0a351de5f3a71edae499}} > {code} > [ RUN ] SlaveRecoveryTest/0.NonCheckpointingFramework > I0815 06:41:47.413602 17091 exec.cpp:133] Version: 0.24.0 > I0815 06:41:47.416780 17111 exec.cpp:207] Executor registered on slave > 20150815-064146-544909504-51064-12195-S0 > Registered executor on slave1-ubuntu12 > Starting task 044bd49e-2f38-4671-802a-ac6524d61a85 > Forked command at 17114 > sh -c 'sleep 1000' > [err] event_active called on a non-initialized event 0x7f6b740232d0 (events: > 0x2, fd: 21, flags: 0x80) > *** Aborted at 1439646107 (unix time) try "date -d @1439646107" if you are > using GNU date *** > PC: @ 0x7f6ba512d0d5 (unknown) > *** SIGABRT (@0x2fa3) received by PID 12195 (TID 0x7f6b9d613700) from PID > 12195; stack trace: *** > @ 0x7f6ba54c4cb0 (unknown) > @ 0x7f6ba512d0d5 (unknown) > @ 0x7f6ba513083b (unknown) > @ 0x7f6ba448e1ba (unknown) > @ 0x7f6ba448e52b (unknown) > @ 0x7f6ba447dcc9 (unknown) > @ 0x4c4033 process::internal::run<>() > @ 0x7f6ba72642ab process::Future<>::discard() > @ 0x7f6ba72643be process::internal::discard<>() > @ 0x7f6ba7262298 > _ZNSt17_Function_handlerIFvvEZNK7process6FutureImE9onDiscardISt5_BindIFPFvNS1_10WeakFutureIsEEES7_RKS3_OT_EUlvE_E9_M_invokeERKSt9_Any_data > @ 0x4c4033 process::internal::run<>() > @ 0x6fa0cb process::Future<>::discard() > @ 0x7f6ba6fb5736 cgroups::event::Listener::finalize() > @ 0x7f6ba728fb11 process::ProcessManager::resume() > @ 0x7f6ba728fe0f process::internal::schedule() > @ 0x7f6ba5c9d490 (unknown) > @ 0x7f6ba54bce9a start_thread > @ 0x7f6ba51ea38d (unknown) > + /bin/true > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)