[jira] [Commented] (MESOS-8308) CommandExecutorCheckTest.CommandCheckTimeout is flaky on Windows

2018-03-26 Thread Armand Grillet (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415007#comment-16415007
 ] 

Armand Grillet commented on MESOS-8308:
---

+1, failing consistently on my updated/new review requests.

> CommandExecutorCheckTest.CommandCheckTimeout is flaky on Windows
> 
>
> Key: MESOS-8308
> URL: https://issues.apache.org/jira/browse/MESOS-8308
> Project: Mesos
>  Issue Type: Bug
> Environment: Windows 10
>Reporter: Andrew Schwartzmeyer
>Assignee: Eric Mumau
>Priority: Major
>  Labels: executor, windows
>
> The test {{CommandExecutorCheckTest.CommandCheckTimeout}} can be flaky on 
> Windows. If the system is under heavy load, the PowerShell command can fail 
> poorly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6340) Set HOME for Mesos tasks

2018-03-26 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414880#comment-16414880
 ] 

Qian Zhang commented on MESOS-6340:
---

I found that Docker (actually {{runc}} internally) always sets the {{HOME}} 
env var when launching a container:
https://github.com/opencontainers/runc/blob/master/libcontainer/init_linux.go#L319:L324
Its logic is: if the user does not set {{HOME}} explicitly, get the user's home 
directory from /etc/passwd and set {{HOME}} to it.

So users may see different behavior when launching a container from the same 
Docker image via the Mesos containerizer and via the Docker containerizer: the 
former will not have {{HOME}} set but the latter will. That could be a problem, 
and it might make it harder for customers to move from the Docker containerizer 
to UCR.
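
The fallback described above can be sketched in a few lines of Python. This only illustrates the behavior as described in this comment; it is not runc's actual code (which is written in Go). The names {{resolve_home}} and {{lookup_home}} are hypothetical, and the default lookup assumes a Unix system where the {{pwd}} module is available.

```python
def _passwd_home(uid):
    """Return the home directory recorded for uid in the passwd database."""
    import pwd  # Unix-only; imported lazily so the sketch loads elsewhere too
    return pwd.getpwuid(uid).pw_dir


def resolve_home(env, uid, lookup_home=_passwd_home):
    """Return a copy of env in which HOME is always set.

    An explicitly provided HOME is kept; otherwise HOME falls back to the
    home directory that lookup_home reports for uid.
    """
    result = dict(env)
    if "HOME" not in result:
        result["HOME"] = lookup_home(uid)
    return result
```

Passing a different {{lookup_home}} makes it easy to try the function without touching the real passwd database, e.g. {{resolve_home({}, 0, lambda uid: "/root")}}.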

> Set HOME for Mesos tasks
> 
>
> Key: MESOS-6340
> URL: https://issues.apache.org/jira/browse/MESOS-6340
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Reporter: Cody Maloney
>Priority: Major
>  Labels: containerizer
>
> Quite a few programs assume {{$HOME}} points to a user-editable data file 
> directory.
> One example is Python, which tries to look up $HOME to find user-installed 
> packages, and if that fails it tries to look up the user in the passwd 
> database, which often goes badly (the container is running under the `nobody` 
> user):
> {code}
> if i == 1:
>     if 'HOME' not in os.environ:
>         import pwd
>         userhome = pwd.getpwuid(os.getuid()).pw_dir
>     else:
>         userhome = os.environ['HOME']
> {code}
> Just setting HOME by default to WORK_DIR would enable more software to work 
> correctly out of the box. Software that needs to specialize or change it (or 
> schedulers with specific preferences) should still be able to set it 
> arbitrarily, and anything a scheduler explicitly sets should override the 
> default value of $WORK_DIR.
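
A minimal sketch of the proposed default, assuming the environment is modeled as a plain dictionary. The function name {{task_environment}} and its parameters are hypothetical and do not correspond to any Mesos API.

```python
def task_environment(work_dir, scheduler_env=None):
    """Build a task environment in which HOME defaults to the work directory.

    Anything the scheduler sets explicitly, including HOME, overrides the
    default.
    """
    env = {"HOME": work_dir}         # proposed default
    env.update(scheduler_env or {})  # explicit settings win
    return env
```

With this shape, {{task_environment("/sandbox")}} yields a HOME of {{/sandbox}}, while a scheduler that passes its own HOME keeps its value.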





[jira] [Commented] (MESOS-8729) Libprocess: deadlock in process::finalize

2018-03-26 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414876#comment-16414876
 ] 

Benjamin Mahler commented on MESOS-8729:


Looking at the last stack:
 
...
#8 0x7f09d2ac1aac in synchronize () at ../../3rdparty/stout/include/stout/synchronized.hpp:58
#9 0x7f09d492c37b in process::ProcessManager::use () at ../../../3rdparty/libprocess/src/process.cpp:2520
#10 0x7f09d492e955 in process::ProcessManager::deliver () at ../../../3rdparty/libprocess/src/process.cpp:2775 // Trying to get a reference but blocked on the lock.
...
#66 0x7f09d492e988 in process::ProcessManager::deliver () at [../../../3rdparty/libprocess/src/process.cpp:2776|https://github.com/apache/mesos/blob/2e2e38628c1b580a231ddac5270f9848ea4af7af/3rdparty/libprocess/src/process.cpp?utf8=%E2%9C%93#L2776] // XXX Holds a reference!
...
 
This thread is doing a deliver (while holding a reference) and synchronously 
calls back into deliver and blocks on the lock while holding a reference. The 
first thread is therefore stuck spinning under the lock and the reference will 
never be released.
 
I understand the issue now but haven't thought through a fix.
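
The cycle described above can be reproduced with a toy model: one thread spins under a lock waiting for a reference to be released, while the thread holding that reference is blocked on the same lock. This is not libprocess code; every name below is hypothetical, and the lock attempt is bounded by a timeout purely so that the demonstration terminates (the real threads would wait forever).

```python
import threading
import time


def simulate_deadlock(timeout=0.1):
    """Toy model of the lock/reference cycle; returns what each thread saw."""
    lock = threading.Lock()
    refs = {"count": 1}  # the delivering thread already holds one reference
    lock_taken = threading.Event()
    gave_up = threading.Event()
    outcome = {}

    def cleanup():
        with lock:
            lock_taken.set()
            # Spin under the lock until the reference is released, or until
            # the other thread gives up (only there to end the demo).
            while refs["count"] > 0 and not gave_up.is_set():
                time.sleep(0.001)
            outcome["cleanup_saw_release"] = refs["count"] == 0

    def nested_deliver():
        lock_taken.wait()
        # The nested call needs the lock while the outer reference is held.
        got_lock = lock.acquire(timeout=timeout)
        outcome["deliver_got_lock"] = got_lock
        if got_lock:
            lock.release()
            refs["count"] -= 1  # only reachable if the lock was obtained
        gave_up.set()

    threads = [
        threading.Thread(target=cleanup),
        threading.Thread(target=nested_deliver),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return outcome
```

In this model neither side makes progress: the lock is never obtained by the delivering thread, and the spinning thread never observes the reference count reaching zero.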

> Libprocess: deadlock in process::finalize
> -
>
> Key: MESOS-8729
> URL: https://issues.apache.org/jira/browse/MESOS-8729
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.6.0
> Environment: The issue has been reproduced on Ubuntu 16.04, master 
> branch, commit `42848653b2`. 
>Reporter: Andrei Budnik
>Priority: Major
>  Labels: deadlock, libprocess
> Attachments: deadlock.txt
>
>
> Since we are calling 
> [`libprocess::finalize()`|https://github.com/apache/mesos/blob/02ebf9986ab5ce883a71df72e9e3392a3e37e40e/src/slave/containerizer/mesos/io/switchboard_main.cpp#L157]
>  before returning from the IOSwitchboard's main function, we expect that all 
> http responses are going to be sent back to clients before IOSwitchboard 
> terminates. However, after [adding|https://reviews.apache.org/r/66147/] 
> `libprocess::finalize()` we have seen that IOSwitchboard might get stuck in 
> `libprocess::finalize()`. See attached stacktrace.





[jira] [Commented] (MESOS-8725) Support max_duration for tasks

2018-03-26 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414617#comment-16414617
 ] 

Zhitao Li commented on MESOS-8725:
--

One minor decision I'm making is to require all tasks in the same group to have 
the same `max_duration` (either all absent, or all carrying the same value).

Recording this decision here.
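
A sketch of what such a check could look like, assuming tasks are represented as dictionaries with an optional `max_duration` entry in seconds. The function name and the data shape are hypothetical and are not the actual Mesos validation code.

```python
def validate_group_max_duration(tasks):
    """Return None if the task group is valid, otherwise an error message.

    A group is valid when `max_duration` is absent from every task or has
    the same value in every task. A value of None is treated as absent.
    """
    durations = {task.get("max_duration") for task in tasks}
    if len(durations) > 1:
        return "All tasks in a task group must have the same max_duration"
    return None
```

Collecting the values into a set keeps the rule in one place: a mix of absent and present values, or two different values, both produce more than one distinct entry.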

> Support max_duration for tasks
> --
>
> Key: MESOS-8725
> URL: https://issues.apache.org/jira/browse/MESOS-8725
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Major
>
> In our environment, we run a lot of batch jobs, some of which have tight 
> timelines. If any task in the job runs longer than x hours, it does not make 
> sense to run it anymore. 
>  
> For instance, a team would submit a job which builds a weekly index and 
> repeats every Monday. If the job does not finish before the next Monday for 
> whatever reason, there is no point in keeping any task running.
>  
> We believe that implementing deadline tracking distributed across our cluster 
> makes more sense as it makes the system more scalable and also makes our 
> centralized state machine simpler.
>  
> One idea I have right now is to add an *optional* *TimeInfo deadline* field 
> to TaskInfo, and all default executors in Mesos can simply terminate the 
> task and send a proper *StatusUpdate*.
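
The executor-side check could be as small as the following sketch, which assumes deadlines are stored as epoch seconds keyed by task ID. The function name and the data shape are hypothetical; terminating the task and sending the status update are left out.

```python
def expired_tasks(deadlines, now):
    """Return the IDs of tasks whose deadline has passed, in sorted order.

    deadlines maps task ID to a deadline in epoch seconds, or to None for
    tasks that have no deadline.
    """
    return sorted(
        task_id
        for task_id, deadline in deadlines.items()
        if deadline is not None and now >= deadline
    )
```

An executor could call this periodically with the current time and kill whatever it returns; tasks without a deadline are never selected.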





[jira] [Created] (MESOS-8730) Provide explicit feedback when a resource becomes unavailable

2018-03-26 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-8730:
---

 Summary: Provide explicit feedback when a resource becomes 
unavailable
 Key: MESOS-8730
 URL: https://issues.apache.org/jira/browse/MESOS-8730
 Project: Mesos
  Issue Type: Bug
  Components: agent, master, storage
Reporter: Benjamin Bannier


With local resource providers we allowed for dynamic changes to agent 
resources. This opens up a number of new scenarios:
 # disappeared resources have been offered to a framework and are used in 
tasks, or
 # disappeared resources have been reserved, but are currently neither offered 
nor used.

In the first case, we cannot in general assume that tasks using disappeared 
resources will terminate unexpectedly, and we should handle this explicitly, 
e.g., by providing explicit feedback to the framework so it can migrate tasks 
off these resources, or by implementing explicit kill policies for such tasks.

In the second case, it is not immediately clear to whom such changes should be 
reported since multiple frameworks can share a role. The information should 
already be exposed to any watchers of e.g., {{GET_RESOURCE_PROVIDERS}} calls, 
but we might want to think about a unified feedback channel handling both 
scenarios.





[jira] [Issue Comment Deleted] (MESOS-8319) Support for LXC containers (LXD?)

2018-03-26 Thread Antonis Danezis (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antonis Danezis updated MESOS-8319:
---
Comment: was deleted

(was: I seem to be unable to assign this to myself. Do I need further 
permissions?)

> Support for LXC containers (LXD?)
> -
>
> Key: MESOS-8319
> URL: https://issues.apache.org/jira/browse/MESOS-8319
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Antonis Danezis
>Priority: Minor
>  Labels: containerizer, mesosphere
>






[jira] [Comment Edited] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.

2018-03-26 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405510#comment-16405510
 ] 

Alexander Rukletsov edited comment on MESOS-8545 at 3/26/18 1:04 PM:
-

{noformat}
commit 02ebf9986ab5ce883a71df72e9e3392a3e37e40e
Author: Andrei Budnik 
AuthorDate: Mon Mar 19 22:48:31 2018 +0100
Commit: Alexander Rukletsov 
CommitDate: Mon Mar 19 22:48:31 2018 +0100

Fixed disconnection for ATTACH_CONTAINER_INPUT call in IOSwitchboard.

Previously, an http response for the `ATTACH_CONTAINER_INPUT` call
could be lost due to immediate termination of the IOSwitchboard
process after the termination of the IOSwitchboard actor. Since the
IOSwitchboard process didn't wait for completion of sending all
responses back to the agent, the agent received disconnection error.
To fix the issue, this patch adds explicit finalization of libprocess
before returning from the IOSwitchboard's main function.

Review: https://reviews.apache.org/r/66147/
{noformat}
{noformat}
commit 1ed3eae3ca09c8fdeac349d78e568d2a91be306b
Author: Andrei Budnik 
AuthorDate: Mon Mar 26 15:03:30 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Mon Mar 26 15:03:30 2018 +0200

Ensured correct termination order in IOSwitchboard's main function.

This patch terminates `IOSwitchboardServer` actor before calling
`process::finalize()`. This patch is an addition to commit 02ebf9986a.

Review: https://reviews.apache.org/r/66278/
{noformat}


> AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.
> ---
>
> Key: MESOS-8545
> URL: https://issues.apache.org/jira/browse/MESOS-8545
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.5.0
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: Mesosphere, flaky-test
> Fix For: 1.6.0
>
> Attachments: 
> AgentAPIStreamingTest.AttachInputToNestedContainerSession-badrun.txt, 
> AgentAPIStreamingTest.AttachInputToNestedContainerSession-badrun2.txt
>
>
> {code:java}
> I0205 17:11:01.091872 4898 http_proxy.cpp:132] Returning '500 Internal Server 
> Error' for '/slave(974)/api/v1' (Disconnected)
> /home/centos/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-centos-7/mesos/src/tests/api_tests.cpp:6596:
>  Failure
> Value of: (response).get().status
> Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {code}





[jira] [Assigned] (MESOS-3858) Draft quota limits design document

2018-03-26 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht reassigned MESOS-3858:
---

Assignee: (was: Jan Schlicht)

> Draft quota limits design document
> --
>
> Key: MESOS-3858
> URL: https://issues.apache.org/jira/browse/MESOS-3858
> Project: Mesos
>  Issue Type: Task
>Reporter: Jan Schlicht
>Priority: Major
>  Labels: mesosphere, quota
>
> In the design documents for Quota 
> (https://docs.google.com/document/d/16iRNmziasEjVOblYp5bbkeBZ7pnjNlaIzPQqMTHQ-9I/edit#)
>  the proposed MVP does not include quota limits. Quota limits represent an 
> upper bound of resources that a role is allowed to use. The task of this 
> ticket is to outline a design document on how to implement quota limits when 
> the quota MVP is implemented.


