[jira] [Commented] (MESOS-8308) CommandExecutorCheckTest.CommandCheckTimeout is flaky on Windows
[ https://issues.apache.org/jira/browse/MESOS-8308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415007#comment-16415007 ]

Armand Grillet commented on MESOS-8308:
---

+1, failing consistently on my updated/new review requests.

> CommandExecutorCheckTest.CommandCheckTimeout is flaky on Windows
>
> Key: MESOS-8308
> URL: https://issues.apache.org/jira/browse/MESOS-8308
> Project: Mesos
> Issue Type: Bug
> Environment: Windows 10
> Reporter: Andrew Schwartzmeyer
> Assignee: Eric Mumau
> Priority: Major
> Labels: executor, windows
>
> The test {{CommandExecutorCheckTest.CommandCheckTimeout}} can be flaky on
> Windows. If the system is under heavy load, the PowerShell command can fail
> poorly.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-6340) Set HOME for Mesos tasks
[ https://issues.apache.org/jira/browse/MESOS-6340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414880#comment-16414880 ]

Qian Zhang commented on MESOS-6340:
---

I found that Docker (actually {{runc}} internally) always sets the {{HOME}} env var when launching a container:
https://github.com/opencontainers/runc/blob/master/libcontainer/init_linux.go#L319:L324

Its logic is: if the user does not set {{HOME}} explicitly, get the user's home directory from /etc/passwd and set {{HOME}} to it. So a user may see different behaviors when launching a container from the same Docker image via the Mesos containerizer and via the Docker containerizer: the former will not have {{HOME}} set but the latter will. That could be a problem, and it might be a burden for customers moving from the Docker containerizer to UCR.

> Set HOME for Mesos tasks
>
> Key: MESOS-6340
> URL: https://issues.apache.org/jira/browse/MESOS-6340
> Project: Mesos
> Issue Type: Bug
> Components: agent, containerization
> Reporter: Cody Maloney
> Priority: Major
> Labels: containerizer
>
> Quite a few programs assume {{$HOME}} points to a user-editable data file
> directory.
> One example is Python, which tries to look up $HOME to find user-installed
> packages, and if that fails it tries to look up the user in the passwd
> database, which often goes badly (the container is running under the `nobody`
> user):
> {code}
> if i == 1:
>     if 'HOME' not in os.environ:
>         import pwd
>         userhome = pwd.getpwuid(os.getuid()).pw_dir
>     else:
>         userhome = os.environ['HOME']
> {code}
> Just setting HOME by default to WORK_DIR would enable more software to work
> correctly out of the box. Software which needs to specialize / change it (or
> schedulers with specific preferences) should still be able to set it
> arbitrarily, and anything a scheduler explicitly sets should overwrite the
> default value of $WORK_DIR.
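
The default proposed in the issue description (HOME falls back to WORK_DIR unless the scheduler sets it) could be sketched roughly as follows. This is only an illustration in Python, not Mesos code: the function name `resolve_task_environment` and its parameters are hypothetical.

```python
def resolve_task_environment(scheduler_env, work_dir):
    """Return the task environment with HOME defaulted to the work directory.

    `scheduler_env` (hypothetical) is whatever the scheduler set explicitly;
    `work_dir` (hypothetical) stands in for the task's WORK_DIR / sandbox path.
    """
    env = dict(scheduler_env)  # copy, so the caller's mapping is not mutated
    # An explicit HOME from the scheduler wins; otherwise use the work dir.
    env.setdefault("HOME", work_dir)
    return env
```

With this rule, a task launched without HOME gets the sandbox path, while a scheduler that sets HOME keeps its own value.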
[jira] [Commented] (MESOS-8729) Libprocess: deadlock in process::finalize
[ https://issues.apache.org/jira/browse/MESOS-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414876#comment-16414876 ]

Benjamin Mahler commented on MESOS-8729:
---

Looking at the last stack:

...
#8 0x7f09d2ac1aac in synchronize () at ../../3rdparty/stout/include/stout/synchronized.hpp:58
#9 0x7f09d492c37b in process::ProcessManager::use () at ../../../3rdparty/libprocess/src/process.cpp:2520
#10 0x7f09d492e955 in process::ProcessManager::deliver () at ../../../3rdparty/libprocess/src/process.cpp:2775 // Trying to get a reference but blocked on the lock.
...
#66 0x7f09d492e988 in process::ProcessManager::deliver () at [../../../3rdparty/libprocess/src/process.cpp:2776|https://github.com/apache/mesos/blob/2e2e38628c1b580a231ddac5270f9848ea4af7af/3rdparty/libprocess/src/process.cpp?utf8=%E2%9C%93#L2776] // XXX Holds a reference!
...

This thread is doing a deliver (while holding a reference) and synchronously calls back into deliver and blocks on the lock while holding a reference. The first thread is therefore stuck spinning under the lock and the reference will never be released.

I understand the issue now but haven't thought through a fix.

> Libprocess: deadlock in process::finalize
>
> Key: MESOS-8729
> URL: https://issues.apache.org/jira/browse/MESOS-8729
> Project: Mesos
> Issue Type: Bug
> Components: libprocess
> Affects Versions: 1.6.0
> Environment: The issue has been reproduced on Ubuntu 16.04, master
> branch, commit `42848653b2`.
> Reporter: Andrei Budnik
> Priority: Major
> Labels: deadlock, libprocess
> Attachments: deadlock.txt
>
> Since we are calling
> [`libprocess::finalize()`|https://github.com/apache/mesos/blob/02ebf9986ab5ce883a71df72e9e3392a3e37e40e/src/slave/containerizer/mesos/io/switchboard_main.cpp#L157]
> before returning from the IOSwitchboard's main function, we expect that all
> http responses are going to be sent back to clients before IOSwitchboard
> terminates. However, after [adding|https://reviews.apache.org/r/66147/]
> `libprocess::finalize()` we have seen that IOSwitchboard might get stuck in
> `libprocess::finalize()`. See attached stacktrace.
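
The interaction described in the comment (one thread spinning under the lock until references drain, another holding a reference while blocked on that lock) can be modelled with a small Python sketch. This is a toy model, not libprocess code: `ProcessManagerModel`, `use`, `cleanup`, and `simulate` are hypothetical names, and timeouts are used so the sketch terminates instead of actually deadlocking.

```python
import threading
import time


class ProcessManagerModel:
    """Toy model: references are handed out under a lock (hypothetical)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.refs = 0

    def use(self, timeout):
        # Taking a reference needs the lock; give up after `timeout` seconds.
        if not self.lock.acquire(timeout=timeout):
            return False
        self.refs += 1
        self.lock.release()
        return True

    def cleanup(self, started, deadline):
        # Hold the lock and spin until every reference has been released.
        with self.lock:
            started.set()
            while self.refs > 0:
                if time.monotonic() > deadline:
                    return False  # would spin forever without the deadline
                time.sleep(0.01)
            return True


def simulate():
    manager = ProcessManagerModel()
    started = threading.Event()
    result = {}

    # Outer deliver: takes a reference and keeps holding it.
    assert manager.use(timeout=1.0)

    def run_cleanup():
        result["drained"] = manager.cleanup(started, time.monotonic() + 0.5)

    thread = threading.Thread(target=run_cleanup)
    thread.start()
    started.wait()

    # Nested deliver: needs the lock that the cleanup thread is holding.
    nested_acquired = manager.use(timeout=0.1)
    thread.join()
    return nested_acquired, result["drained"]
```

In this model `simulate()` reports that the nested call could not take the lock and that the cleanup never saw the reference count reach zero, which is the cycle the comment describes.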
[jira] [Commented] (MESOS-8725) Support max_duration for tasks
[ https://issues.apache.org/jira/browse/MESOS-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414617#comment-16414617 ]

Zhitao Li commented on MESOS-8725:
---

One minor decision I'm making is to require all tasks in the same group to have the same `max_duration` (either all absent, or all carrying the same value). Recording this here.

> Support max_duration for tasks
>
> Key: MESOS-8725
> URL: https://issues.apache.org/jira/browse/MESOS-8725
> Project: Mesos
> Issue Type: Improvement
> Reporter: Zhitao Li
> Assignee: Zhitao Li
> Priority: Major
>
> In our environment, we run a lot of batch jobs, some of which have tight
> timelines. If any task in the job runs longer than x hours, it does not make
> sense to run it anymore.
>
> For instance, a team would submit a job which builds a weekly index and
> repeats every Monday. If the job does not finish before next Monday for
> whatever reason, there is no point in keeping any task running.
>
> We believe that implementing deadline tracking distributed across our cluster
> makes more sense, as it makes the system more scalable and also makes our
> centralized state machine simpler.
>
> One idea I have right now is to add an *optional* *TimeInfo deadline* field
> to TaskInfo, and all default executors in Mesos can simply terminate the
> task and send a proper *StatusUpdate.*
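
The task-group rule from the comment (either every task omits `max_duration`, or all carry the same value) could be checked along these lines. This is a sketch under assumptions: tasks are modelled as plain dicts, and `validate_task_group` is a hypothetical helper, not a Mesos function.

```python
from typing import Optional


def validate_task_group(tasks) -> Optional[str]:
    """Return an error message, or None if the group is consistent.

    Each task is a dict (an assumption of this sketch); a missing
    "max_duration" key means the field is absent.
    """
    durations = {task.get("max_duration") for task in tasks}
    # More than one distinct value (None counts as "absent") is a mismatch.
    if len(durations) > 1:
        return "All tasks in a task group must have the same max_duration"
    return None
```

A group where one task sets `max_duration` and another leaves it out is rejected, since "absent" and "set" count as different values here.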
[jira] [Created] (MESOS-8730) Provide explicit feedback when a resource becomes unavailable
Benjamin Bannier created MESOS-8730:
---

Summary: Provide explicit feedback when a resource becomes unavailable
Key: MESOS-8730
URL: https://issues.apache.org/jira/browse/MESOS-8730
Project: Mesos
Issue Type: Bug
Components: agent, master, storage
Reporter: Benjamin Bannier

With local resource providers we allowed for dynamic changes to agent resources. This opens up a number of new scenarios:

# disappeared resources have been offered to a framework and are used in tasks, or
# disappeared resources have been reserved, but are currently neither offered nor used.

In the first case, we cannot in general assume that tasks using disappeared resources will terminate unexpectedly, and we should handle this explicitly, e.g., by providing explicit feedback to the framework so it can migrate tasks off these resources, or by implementing explicit kill policies for such tasks.

In the second case, it is not immediately clear to whom such changes should be reported, since multiple frameworks can share a role. The information should already be exposed to any watchers of, e.g., {{GET_RESOURCE_PROVIDERS}} calls, but we might want to think about a unified feedback channel handling both scenarios.
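
The two scenarios in the ticket can be told apart with a simple classification. The sketch below is purely illustrative: the dict keys (`used_by_tasks`, `reserved`, `offered`) and the names `Scenario` and `classify` are hypothetical and do not correspond to Mesos data structures.

```python
from enum import Enum


class Scenario(Enum):
    IN_USE = "offered to a framework and used in tasks"
    RESERVED_IDLE = "reserved, but neither offered nor used"
    OTHER = "not covered by the two scenarios"


def classify(resource):
    """Classify a disappeared resource (modelled as a dict, an assumption)."""
    if resource.get("used_by_tasks"):
        # First case: the owning framework is known and can be told directly.
        return Scenario.IN_USE
    if resource.get("reserved") and not resource.get("offered"):
        # Second case: several frameworks may share the role, so there is
        # no single obvious recipient for the feedback.
        return Scenario.RESERVED_IDLE
    return Scenario.OTHER
```

Only the first branch has an unambiguous framework to notify, which is why the ticket suggests a unified feedback channel for both.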
[jira] [Issue Comment Deleted] (MESOS-8319) Support for LXC containers (LXD?)
[ https://issues.apache.org/jira/browse/MESOS-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antonis Danezis updated MESOS-8319:
---

Comment: was deleted

(was: I seem to be unable to assign this to myself. Do I need further permissions?)

> Support for LXC containers (LXD?)
>
> Key: MESOS-8319
> URL: https://issues.apache.org/jira/browse/MESOS-8319
> Project: Mesos
> Issue Type: Task
> Components: containerization
> Reporter: Antonis Danezis
> Priority: Minor
> Labels: containerizer, mesosphere
[jira] [Comment Edited] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.
[ https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405510#comment-16405510 ]

Alexander Rukletsov edited comment on MESOS-8545 at 3/26/18 1:04 PM:
---

{noformat}
commit 02ebf9986ab5ce883a71df72e9e3392a3e37e40e
Author: Andrei Budnik
AuthorDate: Mon Mar 19 22:48:31 2018 +0100
Commit: Alexander Rukletsov
CommitDate: Mon Mar 19 22:48:31 2018 +0100

    Fixed disconnection for ATTACH_CONTAINER_INPUT call in IOSwitchboard.

    Previously, an http response for the `ATTACH_CONTAINER_INPUT` call
    could be lost due to immediate termination of the IOSwitchboard
    process after the termination of the IOSwitchboard actor. Since the
    IOSwitchboard process didn't wait for completion of sending all
    responses back to the agent, the agent received disconnection error.
    To fix the issue, this patch adds explicit finalization of libprocess
    before returning from the IOSwitchboard's main function.

    Review: https://reviews.apache.org/r/66147/
{noformat}
{noformat}
commit 1ed3eae3ca09c8fdeac349d78e568d2a91be306b
Author: Andrei Budnik
AuthorDate: Mon Mar 26 15:03:30 2018 +0200
Commit: Alexander Rukletsov
CommitDate: Mon Mar 26 15:03:30 2018 +0200

    Ensured correct termination order in IOSwitchboard's main function.

    This patch terminates `IOSwitchboardServer` actor before calling
    `process::finalize()`. This patch is an addition to commit 02ebf9986a.

    Review: https://reviews.apache.org/r/66278/
{noformat}

was (Author: alexr):
{noformat}
commit 02ebf9986ab5ce883a71df72e9e3392a3e37e40e
Author: Andrei Budnik
AuthorDate: Mon Mar 19 22:48:31 2018 +0100
Commit: Alexander Rukletsov
CommitDate: Mon Mar 19 22:48:31 2018 +0100

    Fixed disconnection for ATTACH_CONTAINER_INPUT call in IOSwitchboard.

    Previously, an http response for the `ATTACH_CONTAINER_INPUT` call
    could be lost due to immediate termination of the IOSwitchboard
    process after the termination of the IOSwitchboard actor. Since the
    IOSwitchboard process didn't wait for completion of sending all
    responses back to the agent, the agent received disconnection error.
    To fix the issue, this patch adds explicit finalization of libprocess
    before returning from the IOSwitchboard's main function.

    Review: https://reviews.apache.org/r/66147/
{noformat}

> AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.
>
> Key: MESOS-8545
> URL: https://issues.apache.org/jira/browse/MESOS-8545
> Project: Mesos
> Issue Type: Bug
> Components: agent
> Affects Versions: 1.5.0
> Reporter: Andrei Budnik
> Assignee: Andrei Budnik
> Priority: Major
> Labels: Mesosphere, flaky-test
> Fix For: 1.6.0
>
> Attachments:
> AgentAPIStreamingTest.AttachInputToNestedContainerSession-badrun.txt,
> AgentAPIStreamingTest.AttachInputToNestedContainerSession-badrun2.txt
>
> {code:java}
> I0205 17:11:01.091872 4898 http_proxy.cpp:132] Returning '500 Internal Server Error' for '/slave(974)/api/v1' (Disconnected)
> /home/centos/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-centos-7/mesos/src/tests/api_tests.cpp:6596: Failure
> Value of: (response).get().status
> Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {code}
[jira] [Assigned] (MESOS-3858) Draft quota limits design document
[ https://issues.apache.org/jira/browse/MESOS-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Schlicht reassigned MESOS-3858:
---

Assignee: (was: Jan Schlicht)

> Draft quota limits design document
>
> Key: MESOS-3858
> URL: https://issues.apache.org/jira/browse/MESOS-3858
> Project: Mesos
> Issue Type: Task
> Reporter: Jan Schlicht
> Priority: Major
> Labels: mesosphere, quota
>
> In the design documents for Quota
> (https://docs.google.com/document/d/16iRNmziasEjVOblYp5bbkeBZ7pnjNlaIzPQqMTHQ-9I/edit#)
> the proposed MVP does not include quota limits. Quota limits represent an
> upper bound of resources that a role is allowed to use. The task of this
> ticket is to outline a design document on how to implement quota limits when
> the quota MVP is implemented.